Let me explain where we are in the whole course. Over the next two and a half or three weeks we're going to build up from the bottom to the top, and it will finish with you actually implementing your own solver, an LP solver or something like that; we haven't quite figured it out, but it'll finish with you implementing a solver. That's mostly, by the way, just to demystify it: so that when someone looks at you sometime in the future and says, "Oh my God, you can't write your own LP solver, that's really complicated," you can say, "Au contraire, I did it." That's the main reason, but this is also interesting stuff. Right now we're at the bottom level, which is linear algebra, and in some ways it's actually by far the most important level. My own feeling is that everybody, certainly anybody in this class, needs to know this material. Do you need a whole ten-week course, the kind you would take on computational linear algebra? Maybe, actually. But you certainly need to be familiar with the ideas. That's unlike, say, computer architecture: you don't have to know about computer architecture, even though everything we do is thanks to the people who design and build computers. This, though, you actually should know.

Okay, so back to numerical linear algebra. Today we're going to cover material that some of you may know and the rest of you probably don't, and I'd list it in the top ten things you need to know from this course, even though it's not technically part of the course. Okay, so here it is.

Last time we talked about how you solve a set of equations Ax = b. There are several ways. There are ones we're not going to talk about: iterative methods, which are used for super-large-scale problems, where you've got, say, a billion variables, or where the system represents a PDE you're solving. We're not going to talk about those. We're going to talk about what are called direct methods; I don't know where that name came from, but they're called direct. The way they work is they take the matrix A and first factor it into a product of matrices, each of which is, quote, "easy to invert," unquote. That's the first step. You've seen some of these factorizations, like the SVD and the eigendecomposition, and you've probably seen a couple of the others as well. "Easy to invert" is actually just dialect for "it's easy to solve equations with that coefficient matrix." That's what it really means, because no one is inverting anything. It is unbelievably rare in good numerical work for anybody to actually form the inverse of a matrix. There are cases where people do, but it's very rare.

Okay, so what you do, once you've factored your matrix into a product of things which are, quote, "easy to invert," unquote (you have to interpret that correctly), is you write out the inverse: the inverse of the product is the product of the inverses in reverse order. Then you literally just evaluate this, by first evaluating A₁⁻¹b. Of course, when people write A₁⁻¹b, what they really mean is that you call solve(A₁, b); that's what it really means. So you solve that, then you pass the solution to the next thing, which is A₂⁻¹ times x₁, so you solve A₂x₂ = x₁, and that gives you the second one, and you keep going until you get to the end. So again, it's completely straightforward; like, okay, so what? Well, last time we talked about something very interesting: if you want to solve a set of equations with the same coefficient matrix but different right-hand sides, what you can do is factor once, cache the factorization, and then simply call the solve method on your multiple right-hand sides. The first step is called factor, and the second is called solve. You get to some surprising conclusions in some cases; I'll get to that in a minute.
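As a concrete illustration of that factor/solve split, here is a minimal sketch using SciPy's LAPACK wrappers (the function names are SciPy's, not anything from the lecture; the random matrix is just for the example):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n))   # a random (almost surely nonsingular) matrix

# Factor once: the O(n^3) step. This computes and caches the PLU factorization.
lu, piv = lu_factor(A)

# Solve many times: each solve reuses the cached factors and is only O(n^2).
for _ in range(10):
    b = rng.standard_normal(n)
    x = lu_solve((lu, piv), b)
    assert np.allclose(A @ x, b)
```

The point is that `lu_factor` carries essentially all the cost, so the ten solves together take about as long as one.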

This is the one you probably saw a variation of in high school, or wherever you first saw linear equations. It's called Gaussian elimination; it's called lots of things. Here's what it is: if you have a nonsingular matrix A, you can factor it as A = PLU, a permutation times a lower triangular matrix times an upper triangular matrix, where P is a permutation, L is lower triangular, and U is upper triangular. They're all invertible, of course; actually, if any one of them weren't, then A would not be either. The cost of that factorization, which we're not going to go into (it's actually not that difficult), is order n³ flops. So that's what that is.

Here's what it looks like to solve linear equations. You factor A, which gives you P, L, and U; that's the factor step, and the remaining three steps are the solve. Again, all you do is solve them in reverse order. Solving with the permutation is silly: you just multiply by Pᵀ, which, roughly speaking, is the inverse permutation; you just un-permute the entries, so that's easy. The second one, L, is lower triangular, so for that you use forward substitution, which we saw is an order-n² algorithm. Backward substitution is what you use for U, because U is upper triangular: you start, if you'll recall, by first solving for the last entry, then the second-to-last, then the third-to-last, and so on. That's backward substitution, and it's also n² flops. If you add these up, you find out something kind of weird: the factor is order n³ and the solve is order n². I think last time I finished by asking you a question: how long does it take to solve, on my laptop, a set of linear equations with, say, a thousand-by-thousand matrix? I could make that sound pretty impressive: I could say my problem has more than a million coefficients in it, which is true, a thousand-by-thousand matrix does, and then you could say the factorization is order n³, that's like a billion flops. Of course, that's a joke now. The answer is I can do it in milliseconds, and so can you; my phone can do it in something on that order of magnitude. Which is actually kind of stunning.

Then I asked the following question: if it takes, say, 100 milliseconds to solve one set of thousand-by-thousand equations (a thousand equations, a thousand unknowns), how long does it take to solve 10 of them, or for that matter even 100? If it takes a tenth of a second to do one, the answer is surprising: it's a tenth of a second. Everybody got that? Not that many people appreciate or understand this, and it's used throughout a lot of numerical work; you use it all the time, you just don't know it. All the solvers you use when you call CVXPY, almost all of them, use things like this. The most typical methods to solve SOCPs, SDPs, these things, or for that matter even LPs, involve two solves, done in the same way: basically, the factorization dominates, and that's the point there.

Actually, I'm just curious: how many people have seen this before? How many people knew that you could solve 10 sets of equations with a thousand variables and unknowns, same coefficient matrix, different right-hand sides, in the amount of time it takes to solve one? You knew? Come on, other people. Is no one else impressed? Shouldn't you be? Maybe it's too early, but people should be going, "That's awesome!" I'm not getting that feeling, but that's okay, that's fine. Okay, good.

Right. Now, sparse LU factorization. This is super interesting; this is something everybody should know about. Since about the 1960s, people have been developing stuff for sparse matrices: first of all, what data structures do you use to store them, and then there's a whole parallel universe of doing, for sparse matrices, all the stuff you would do with so-called dense matrices (which are just stored as arrays): factorizations, back solves, all these kinds of things, everything you can do, eigendecompositions, singular value decompositions. There's a whole field where people have made this efficient for sparse matrices, and it's something everybody needs to know about, like, everyone, because if you do any kind of computation you need to know about this: again, there will be surprising statements we'll make about what you can solve and not solve. Okay. If A is sparse, the typical factorization is going to have four factors: A = P₁LUP₂, where P₁ is a permutation in front, P₂ is a permutation in back, and L and U are lower and upper triangular. The idea here is that mathematically you don't need P₂. You do need P₁: it is false that any nonsingular matrix can be factored as LU, so you do need that initial permutation, but you don't need the second one. The second one is there (and, by the way, P₁ is also chosen this way) so that when you carry out the LU factorization, L and U will also be sparse. Because if you're not careful about doing this, your matrix A will be sparse but L and U will be dense. When that happens, it's referred to as fill-in: you have lots of entries in L and U that weren't there in A.

This is very important for everybody to know. Here's an example. It's extremely common to work all the time with, say, a million-by-million sparse matrix. It is simply not a problem; I can do stuff like that on my laptop. The number of nonzeros is not going to be 10¹², which is what a dense one would have; it's going to be something reasonable, like 20, 30, 100 nonzeros per row. Even that is not that big a deal: if there are 100 nonzeros per row, that's 100 million coefficients, and you multiply that by about 10 bytes, so even in doubles it's like a gig, which is a reasonably small amount of memory.

So what happens is, you have your million-by-million matrix and you call a solver like this, and if you choose P₁ and P₂ poorly, then L will be something like a dense lower triangular matrix. You can't even store it, because it's a million by million: that's on the order of 10¹² divided by two entries. You can't; let's just leave it that way. Making sense? So P₁ and P₂ are chosen heuristically, and people have been doing this since the '60s; it's actually very cool. You can ask questions like: find me the P₁ and P₂ that minimize the number of nonzeros in L and U. That's a perfectly nice problem to come up with, and it is NP-hard. But who cares; it's silly, because there are unbelievably good heuristics for doing this, heuristics that all of us benefit from. So this is great.
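To make this concrete, here is a minimal sketch using SciPy's sparse LU, `scipy.sparse.linalg.splu`, which applies fill-reducing permutations for you (the tridiagonal example matrix is my own, not one from the lecture):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# A large sparse system with about 3 nonzeros per row.
n = 100_000
main = 4.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csc")

lu = splu(A)          # sparse LU; fill-reducing column permutation chosen heuristically
b = np.ones(n)
x = lu.solve(b)       # forward/backward substitution with the sparse factors
assert np.allclose(A @ x, b)

# The factors stay sparse: nnz(L) + nnz(U) is a small multiple of nnz(A).
print(A.nnz, lu.L.nnz + lu.U.nnz)
```

Because the factors stay sparse, both the factorization and each subsequent solve are fast even at sizes where a dense factorization would be hopeless.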

Okay. So, with sparse matrices, the factor and solve times, whatever they are, are less than the (2/3)n³ flops of the dense case. I'm going to give you a very important special case right now, and then we'll encounter others. Here's a super important special case: the so-called banded matrix. A banded matrix looks like this: there's a band along the diagonal, maybe K entries wide, that's nonzero, and everything outside the band is zero. That's a banded matrix. I can store a banded matrix on my laptop easily if it's a million by million and the bandwidth is, say, 10 or 20. So everybody, that's a banded matrix. No? That's fine.

It turns out (this is a case where you don't need P₁ and P₂) that if you simply do the LU factorization here, the L and the U are going to be basically the same: they're also going to be banded. I'll just write down L: L is lower triangular, so it simply looks like a band of width roughly K/2 below the diagonal; that's what L looks like, and U looks similar. By the way, sometimes a banded matrix comes to you looking like that, but sometimes it's been permuted by whoever formed the problem: they decide, "I'm going to list all my states first, then all my controls," in an optimal control problem, say, and it comes out with some crazy sparsity pattern. Any of the methods used to heuristically choose P₁ and P₂ will permute it back to look banded. So, what happens in a banded factorization: first of all, the factorization time is order nK², and then the back solve is order nK. These are stunningly smaller than the dense n³ and n². So what this tells you is that you can solve a million-by-million set of equations, if the bandwidth is, say, 10, super easily, and it's going to be so fast it's not even funny. Everybody following this? Which is kind of shocking, because if someone said, "What are you doing?" and you say, "I'm solving a set of a million equations in a million variables, where each equation involves 10 of the variables," you'd go, "Whoa, that sounds like it's hard." The answer is it's unbelievable: this is a millisecond-scale operation. Everyone follow? It's just completely insane. But these are things you absolutely need to know, because this is the trick to many, many things.
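As an illustration of how cheap this is, here's a minimal sketch using SciPy's banded solver, `scipy.linalg.solve_banded`; the pentadiagonal system is made up for the example:

```python
import numpy as np
import scipy.sparse as sp
from scipy.linalg import solve_banded

# A pentadiagonal system: bandwidth 2 above and 2 below the diagonal.
n = 1_000_000
ab = np.zeros((5, n))            # diagonal-ordered form: row i holds diagonal (2 - i)
ab[0, 2:] = -1.0                 # second superdiagonal
ab[1, 1:] = -1.0                 # first superdiagonal
ab[2, :] = 8.0                   # main diagonal (diagonally dominant, so nonsingular)
ab[3, :-1] = -1.0                # first subdiagonal
ab[4, :-2] = -1.0                # second subdiagonal

b = np.ones(n)
x = solve_banded((2, 2), ab, b)  # O(n K^2) factorization plus O(n K) solve

# Check the residual against the same matrix assembled in sparse form.
A = sp.diags([-np.ones(n - 2), -np.ones(n - 1), 8.0 * np.ones(n),
              -np.ones(n - 1), -np.ones(n - 2)], offsets=[2, 1, 0, -1, -2])
assert np.allclose(A @ x, b)
```

The solver never forms the full matrix; storage and work both scale like n times the bandwidth, which is why a million-variable banded system like this solves almost instantly.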

Okay. By the way, what's interesting here is that the factor-to-solve ratio is not that high anymore: now it's K, where it used to be n. So that says, if I ask how long it takes to solve a million-by-million banded system with a bandwidth of 10, it's some number which is shockingly small; then you say, how about 10 of them, how about two of them, and the answer is it's only a little tiny bit more. Everybody following this? Okay. So, by the way, what you'll be doing in the next week or two is you will start to recognize matrix structure, like sparsity; you're going to see some others later today, like block arrow and weird other ones, and you will start recognizing them. They're super important because of two things. Number one, you'll start recognizing structures that you can exploit to solve extremely fast. But number two, on the other end, you're going to start recognizing practical problems that will result in matrices with that structure. I'm going to mention one right now: all of control, all of signal processing, just looks like that, every last bit of it. So that means you've solved little control problems and things like that, but what this says is, if someone says, "I have to solve a control problem with 20,000 time steps and a state dimension of 20 and 10 actuators," and someone says, "Wow, that sounds like it's going to be complicated," you just laugh and say, "Are you kidding? I can do that on my phone." And of course that's the key to actually putting these things into a lot of control systems; a lot of them actually do this. Okay, that's the big picture. This is just super cool to know about.

Okay, this brings us to a very famous factorization; it's about 100 years old. It's the Cholesky factorization. If you have a positive definite matrix, you can write it as just A = LLᵀ, and if you insist that the diagonal entries of L are positive, then it's unique. There's no choice: there's just one factorization, period; that's it, there's one way to write A as LLᵀ. And how do you solve Ax = b where A is positive definite (that's called a positive definite system of equations)? You do a Cholesky factorization, then you do a forward and a backward substitution with L and Lᵀ. Okay, so that's it.
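In code, the Cholesky factor-then-substitute recipe might look like this (a minimal sketch with SciPy's `cho_factor`/`cho_solve`; the random positive definite matrix is just for illustration):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n = 500
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)      # symmetric and positive definite by construction

c, low = cho_factor(A)           # Cholesky factorization: roughly (1/3) n^3 flops
b = rng.standard_normal(n)
x = cho_solve((c, low), b)       # forward + backward substitution, O(n^2)
assert np.allclose(A @ x, b)
```

As with LU, `cho_factor` can be called once and `cho_solve` reused for many right-hand sides.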

Actually, it's (1/3)n³ flops, as opposed to (2/3)n³ for general matrices, and that kind of makes sense. First, number one, no one cares about a factor of two in a flop count estimate, because these estimates are so crude. What really matters is what the people who actually wrote the microcode and all that kind of stuff did: how it's blocked, how many cores it's using, all sorts of crazy internal stuff that most of us (all of us, I'm guessing) should not and do not have to worry about. But we should all be very grateful that there are people who do. So, it's a factor of two, which is not surprising: the matrix is symmetric, so roughly speaking there's a redundancy of two. Or you can even look at this and say, hey, that's an LU factorization with U = Lᵀ, and why? Because the matrix is symmetric, that's why. So anyway, not too surprising.

We will see very shortly where positive definite systems of equations arise. Actually, they arise in a lot of applications, but specifically in convex optimization we'll see they're at the heart of the stuff we do. Okay, now we get to the sparse Cholesky factorization, and here I'll tell you something super interesting, I think. It turns out that factoring A = PLLᵀPᵀ is basically the same as calling a Cholesky factorization on PᵀAP: take that equation and just multiply on the left by Pᵀ and on the right by P. So what it basically says is: permute your positive definite matrix, and then do a Cholesky factorization, where there is no creativity; there's just a unique Cholesky factorization. Everybody got this? When you do that, you will find out some stunning things. You will find out that if you choose a good permutation, the L is going to be super sparse; if you choose a bad permutation, it's going to be dense.

I'll give you a sort of famous example. It would be something like this: suppose I gave you a matrix where (I'll just draw it this way) it's got a dense first row and column, and a diagonal. By the way, that's called an arrow matrix, because if you blur your eyes it looks like an arrow or something; this one is an upper arrow matrix. Everybody got it? It's super sparse: the number of nonzeros per row and column, except for the first row and column, is like two. So that's an example of a structure you should never ever see again and not have an emotional response to, because this is super important. Okay.

Now, if I do a Cholesky factorization on that, here's the sparsity pattern of L, ready? It's completely dense. Completely. This, of course, ruins everything: for one thing, you can't even store that much, because if you started with a million-by-million upper arrow matrix and called Cholesky on it, at some point your OS would say, "No, I am not going to allocate that amount of memory." Everybody got this? Okay. Now, it turns out all you need to do is permute the first and last rows and columns, and then you end up with a matrix with the arrow pointing the other way; that's also called an arrow structure (we'll get to others, but that's one). If I do a Cholesky factorization on this one, the sparsity pattern of L is just the lower half of the arrow, and that's all: no fill-in. So what it means is that the permutation that permuted this into this was super important, at least from a sparsity point of view. Everybody got it? So what this means is, if you had that set of equations to solve, and you just called a sparse solver, you wouldn't even know it, but on your behalf the permutation would change it to this form and factor that, and then your factor time is going to be order n, and your solve time too. Everybody following this? So that's it.
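You can see this fill-in effect directly. Here is a small sketch (my own construction, with n = 8 so the counts are easy to read) comparing the Cholesky factor of an upper arrow matrix before and after the permutation:

```python
import numpy as np

# An "upper arrow" positive definite matrix: dense first row/column, then diagonal.
n = 8
A = float(n) * np.eye(n)
A[0, :] = 1.0
A[:, 0] = 1.0
A[0, 0] = float(n)               # strictly diagonally dominant, hence positive definite

# Cholesky in the given order: the dense first column fills in everything below it.
L_bad = np.linalg.cholesky(A)

# Permute the arrow to point the other way: move row/column 0 to the end.
p = np.r_[1:n, 0]
Ap = A[np.ix_(p, p)]
L_good = np.linalg.cholesky(Ap)  # no fill-in: as sparse as the lower half of Ap

print(np.count_nonzero(L_bad), np.count_nonzero(L_good))
```

Without the permutation, L has all n(n+1)/2 lower-triangular entries nonzero; with the arrow permuted to point the other way, L has only 2n − 1 nonzeros, the same sparsity as the lower half of the matrix itself.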

Fortunately, for a lot of these things you don't have to worry about it; it's all baked inside. All of you have benefited from this probably a million times just so far in this class, because every time CVXPY calls a solver, this is the kind of stuff that's happening on your behalf, many, many times, every time you call solve on something. It doesn't matter, I mean, but now you know. Okay, does this make sense? Yeah? "How much time does it take for the solver to find that P?" That's a perfect question; I should have said it.

The question is how long it takes to find the permutation. Yeah: if I say, "Oh man, are you kidding? I can do this for you, not a problem; I'm going to find a permutation like that, and then I can solve this in a tenth of a second," and you go, "Cool, how long did it take you to find the permutation?" and I go, "A couple of days," that's just not useful. So the general rule of thumb is that people use heuristics for finding P that are, they hope, on the order of magnitude of the amount of time it takes to actually do the factorization. That's what they do. Now, there are applications where people will take more time to work out a better permutation, because they're going to use it a gazillion times. This would be the case for embedded systems: if you're doing control and you're going to put this into, say, a rocket or something like that, and it's going to run at 100 Hz for the entire mission or whatever, that's a lot of run time; it's worth it. You could definitely take an evening to come up with a better permutation. But this is not what most people doing numerics, who work on the solution of sparse linear equations, do; they just stick with heuristics that are fast. And by the way, that portion is sometimes called the symbolic solve, or the permutation phase; there are lots of names for it. And you can tell: if you actually look at what it's doing, you will find that it spends a whole bunch of time doing graph calculations; there are no floating-point numbers involved. Then, when floating-point operations start up, that means it's actually doing the factorization. That's a good question.

Okay. Now, I realize that for many of you this is the first time you've seen this; that's fine, you should just get the high-level picture of it, and you can read a little bit more in the appendix in the book; there's an appendix on this. This is just something you should know about. I don't know, you could poke around in SciPy or something. Once you know about this, you can't unknow it, and you will have, like, incredible powers. Yes?

Well, on the heuristics: some of them are kind of greedy. I'd have to tell you how you actually do the factorization to actually explain them, but a lot of them are constructed one step at a time. What they do is they say: P is going to permute something to the top row, so I'm going to find which entry I should permute to the top, then who gets permuted to the second position, then third, then fourth, things like that. So it's kind of greedy, but those are the kinds of things people use; they have lots of names, and it's worth knowing a little bit about them. Yeah? The question is: so it's NP-hard to find the best one, but are there approximation guarantees? No, there are no approximation guarantees whatsoever; no one has any. It's one of those things where, of course, we can ask; these are all graph calculations, these are just questions about graphs. Oh, there is one exception; there's one piece of theory, and it's this: if the graph represented by the sparsity pattern of A is chordal, then it's guaranteed that there exists a permutation which has zero fill-in, meaning the number of nonzeros in L will be identical to the number of nonzeros in A divided by two. So if you know what a chordal graph is, that's one theorem. It's actually kind of useless in practice, frankly, because the real question is not whether there exists such a permutation; the question is whether our heuristics find it, and the answer is that all decent heuristics will find that one.

Yeah? The question is that each of these factorizations has required knowing the matrix is nonsingular, or that it's positive definite. No; it turns out that, in this one, to find the permutation for the Cholesky factorization, you actually don't have to know the entries of A; you simply have to know their locations. That's actually a super important, more advanced point, but you brought it up. What this means is that we can do this ahead of time. For example, if we're solving an optimal control problem or something like that, we can do the permutation once and for all, offline, and then when the thing is actually running a jet engine or something like that, we don't have to do it. If we're doing finance and we want to do an insanely fast backtest, we'd first do the factorization, and then after that every solve is going to be really fast. So that's how that works. Yeah, so you don't have to know: if A is not positive definite and you call a Cholesky factorization on it, it will fail at some point; at some point it's going to attempt to take the square root of a negative number, and then it'll throw an exception of some kind. But that's the only time you would ever know.

28:57when when uh does the solver no stop

29:00like what kind of Criterion oh s all of

29:02these are finite algorithms right and an

29:05infinite Precision they just solve right

29:08and when I say it's order n cubed it

29:09means I you know I will I can tell you

29:12for this you're going to do you know 1

29:14billion you know blah blah blah blah

29:15blah multiplies know 1.3 billion ads

29:19when it's done if okay if you did every

29:22that everything in infinite Precision

29:23you would have the answer end of story

29:25there's no stopping Criterion there's

29:27nothing like that okay now of course

29:29you're doing this in finite

29:31precision, right? So, since you ask, there

29:35are methods — this is called iterative

29:39refinement — and what it does is

29:41you compute a solution that's

29:44approximate and then you can actually do

29:46you can iterate to get a better solution

29:49and people know about that but that's a

29:50little bit beyond what we're talking

29:52about here but it's good to know about
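[Editor's note: the two ideas just mentioned — factor once and reuse the factorization for every later solve, plus one step of iterative refinement — can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration with made-up data, not anything from the lecture's slides:]

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Made-up positive definite system; in practice A comes from your application.
rng = np.random.default_rng(0)
n = 200
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)          # positive definite by construction
B = rng.standard_normal((n, 5))      # five right-hand sides

# Factor once (O(n^3)); every later solve reuses the factor (O(n^2)).
factor = cho_factor(A)
X = cho_solve(factor, B)

# One step of iterative refinement: solve for the residual, then correct.
R = B - A @ X                        # residual, nonzero in finite precision
X = X + cho_solve(factor, R)

residual = np.linalg.norm(B - A @ X)
```

[In an online setting — the jet engine, or the back test — only the `cho_solve` calls happen in the loop; the factorization was done once, offline.]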

29:56um, okay. So, the LDL transpose

30:03factorization: this works for any

30:06nonsingular symmetric matrix. Right — by

30:09the way we'll see examples of where

30:10these come up like maybe next week in

30:12fact. So here it's like

30:17an LDL transpose factorization, but D

30:20is diagonal with either 1 x 1 or 2

30:22x 2 blocks — so that's just a

30:23mathematical fact that you might need to

30:25have these 2 by two blocks uh but the

30:27point is those are all, you know,

30:29easily inverted matrices: permutations,

30:31yes; lower triangular, yes; a block

30:34diagonal matrix with 1 x 1 and 2 x

30:362 blocks — I think we know how to invert

30:38that that one we actually literally know

30:40how to invert um but we can certainly

30:42solve equations like that right so okay
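[Editor's note: an aside not from the lecture — SciPy exposes exactly this factorization as `scipy.linalg.ldl`, and it succeeds on symmetric indefinite matrices where Cholesky fails. A quick sketch:]

```python
import numpy as np
from scipy.linalg import ldl

# A symmetric but indefinite (and nonsingular) matrix: its trace is zero,
# so it has both positive and negative eigenvalues.
A = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 3.0],
              [2.0, 3.0, 0.0]])

# Cholesky fails, since A is not positive definite.
try:
    np.linalg.cholesky(A)
    cholesky_ok = True
except np.linalg.LinAlgError:
    cholesky_ok = False

# LDL^T with 1x1 / 2x2 pivot blocks goes through fine.
L, D, perm = ldl(A)
reconstruction_error = np.linalg.norm(L @ D @ L.T - A)
```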

30:46um so this is this is the LDL transpose

30:49factorization um and now we're going to

30:53talk about something this is something

30:54probably many of you have seen in other

30:56contexts it comes up in pretty much all

31:00fields. So, it's something called the

31:02Schur complement — that's the

31:05name of it — and I think you've

31:07already seen it in you know early on in

31:09the class or something like that but I

31:11can't think of a legitimate field where

31:13it doesn't come up right so in

31:15statistics it's conditioning a gaussian

31:17on some entries right in mechanics it's

31:20actually what happens when you impose

31:22the boundary conditions on something and

31:24ask where everything else is in

31:25electrical engineering you take a net

31:27and I I put a termination on some of the

31:29ports on the network and I ask what

31:31what's happening with the rest of them

31:32right so this I all Fields have this um

31:36okay and the idea is actually

31:38embarrassingly simple so it goes like

31:40this. Um, we're going to write Ax

31:45equals b; we're going to block A into

31:48blocks, and I'm going to

31:49have A11 and A22 be square. Right? Um,

31:54and I'll also conformally

31:57block x and b, so they're going to be

31:59called x1 stacked on x2 and b1 stacked

32:01on b2. Right, okay. Now — but

32:06nothing has happened: I

32:08just decided to name it this way, I

32:10just divided it in some weird way. Okay,

32:13uh what happens now is you do actually I

32:16mean you do kind of what you learned in

32:19maybe High School except those were

32:20numbers and not matrices um and so what

32:23you do is you just we just work out we

32:25can actually just solve this in a weird

32:27way so the first equation block equation

32:29is A11 x1 + A12 x2 = b1, right? And

32:36we're going to we're going to assume

32:37here we're going to make an assumption

32:38that A11 is invertible. Okay, then what

32:43that says is, well, I can take b1 minus

32:49A12 x2 — that's that, right —

32:53and then I'll just put an A11 inverse

32:57in front: x1 = A11 inverse (b1 - A12 x2). Okay. Um,

33:03yeah by the way already when you see

33:05this this is hinting that this is going

33:06to be interesting when a11 is easy to

33:09invert so that that's the that's going

33:10to be the interesting part okay but

33:12we'll get there okay I plug this into

33:15the second equation and the second

33:17equation is A21 x1 + A22 x2 = b2,

33:26and I plug this this expression in here

33:29and I'm going to get something like you

know, A21 A11 inverse (b1 - A12 x2) plus

33:39A22 x2 = b2. Right — makes sense. Okay,

33:46and then I just I rearrange that to look

33:50like this equation here right I just put

33:53and and that's another equation in in uh

33:56x2. Um, this matrix here is famous:

33:58it's called the Schur complement of

34:01the original matrix — I can't remember if

34:03it's with respect to the one-one or the

34:05two-two block, but it's one of those two and

34:07I don't remember which. It doesn't

34:08matter — it's a Schur complement. Okay. Um,

34:11and you've probably seen it before or

34:13something in some yeah anyway that's it

34:17okay okay um okay so uh you could use

34:22this these calculations to solve the

34:24original thing and that generic

34:27algorithm looks like this it says first

34:29you should form A11 inverse A12 — I mean,

34:31because that's this thing, right — and A11

34:34inverse b1. Right. Oh, by the way, um, so

34:38let's talk about how we would do that so

34:41the way you would do that is you'd factor

34:44A11; okay, and to compute this you would do a

34:49solve right because you factored it and

34:51now you do a solve to calculate this you

34:54do a solve on each column of A12 —

34:57everybody got that right so this could

35:00be — that could end up being

35:02quite efficient,

35:04right so so the way you know this if I

35:08if I were to double click on that and

35:09expand it it would look like this Factor

35:11a11 then solve with B1 and solve with

35:16each column of a12 right that gives me

35:21matrices okay uh next step is I form s

35:24that's the sure complement which is this

35:27um, and I form this b-tilde thing over here,

35:31which is the right-hand side of this

35:33equation. Right — so this equation up here

35:34is S x2 = b-tilde, and then what I

35:38do is find x2: what I do here

35:41is a solve — I do a factor on S, then a

35:45back solve — and then I can reassemble:

35:47this gives me X2 directly and I can get

35:49X1 from like this equation here and by

35:52the way, when I solve this equation, I

35:54already factored A11, so this is

35:56just a solve. Okay, right.

36:01So that's the idea.
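[Editor's note: the generic block-elimination algorithm just described can be written down directly. Here is a small NumPy/SciPy sketch with made-up blocks; the `lu_factor`/`lu_solve` pair plays the role of "factor once, then solve":]

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Made-up blocked system A x = b with A11 invertible.
rng = np.random.default_rng(1)
n1, n2 = 6, 3
n = n1 + n2
A = rng.standard_normal((n, n)) + n * np.eye(n)   # keeps A11 well conditioned
b = rng.standard_normal(n)
A11, A12 = A[:n1, :n1], A[:n1, n1:]
A21, A22 = A[n1:, :n1], A[n1:, n1:]
b1, b2 = b[:n1], b[n1:]

# Step 1: factor A11 once; solve with b1 and with each column of A12.
f11 = lu_factor(A11)
A11inv_b1 = lu_solve(f11, b1)
A11inv_A12 = lu_solve(f11, A12)

# Step 2: form the Schur complement S and the reduced right-hand side.
S = A22 - A21 @ A11inv_A12
b_tilde = b2 - A21 @ A11inv_b1

# Step 3: solve the small system S x2 = b_tilde, then back out x1.
x2 = np.linalg.solve(S, b_tilde)
x1 = A11inv_b1 - A11inv_A12 @ x2
x = np.concatenate([x1, x2])
```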

36:03Okay — now, drum roll, to find out: is this

36:06faster, is it better? So:

36:13in step one, f is going to

36:17be the factor cost for A11 and s is the

36:20solve cost. So in the first case you

36:24factor A11 once and then you solve for

36:27all the columns of A12, and so that's n2

36:31times s — so that's the cost of step one.

36:34Step two — um, here, that's actually

36:37just forming the

36:41Schur complement matrix, and what we're

36:42going to do is we're

36:43going to imagine that

36:45we're not going to exploit any structure

36:47in the Schur complement — although one could,

36:49but we're not going to; we're just going

36:50to treat it as dense. Um, and then step

36:52three is going to be order n2 cubed, because

36:55that's just to factor

36:57and solve it. Okay — and so the total is

37:00this number up here: it's f plus n2 s plus

37:02blah blah okay all right so let's just

37:06throw in a general A11 and see what

37:08happens. Um, when you do — I mean, hardly

37:10surprising — but actually, down to the flop,

37:13it's identical. Congratulations:

37:15you've done a bunch of

37:17algebra and analysis and it did

37:19absolutely nothing. Okay, so —

37:22everybody got that? Oh, by the way, it does

37:25nothing in terms of flop count; in

37:27reality it does do things. Right — so in

37:29fact, people use this sort of

37:31block elimination method — they actually use

37:33it for dense matrices — and the reason

37:36is you would choose n1 and n2 so that

37:38one of these solves could be done in

37:42cache, right, and therefore done quickly.

37:45so so so that actually happens but again

37:47all of us are blissfully unaware of

37:49those details but so okay um now it gets

37:53interesting it's interesting precisely

37:55when f is small compared to n1 cubed.

37:59So here's a great

38:02example: A11 diagonal. The factor

38:06cost for a diagonal matrix is zero — I mean,

38:08whatever, or n if you want, it

38:10doesn't matter — there, L is going

38:13to be the matrix itself and U will be

38:15the identity, so it's zero. Okay.

38:18Um, so here it's actually really

38:22interesting: you get 2 n2^2 n1

38:26plus (2/3) n2^3. What's amazing about that is

38:28that's linear in n1. So now I'm going to

38:30show you some

38:34patterns. And here it is — one, actually;

38:37I was just showing you one I had one

38:39right there uh here's one remember this

38:46one — that's this downward arrow pattern.

38:49Um, someone suggest a blocking of it: like,

38:53what would be A11? Do you see a leading

38:55matrix here that is easy to

38:58invert? Diagonal — yeah. So here's —

39:02here's A11; it looks like this,

39:04right — it just looks like that.

39:07So this is A11: diagonal. You know,

39:12actually, I was going to say matrices

39:13don't come easier to invert than

39:15diagonal — that's false; here's one that's

39:17easier: a multiple of the

39:19identity. That's pretty

39:22straightforward — actually, I think in

39:23terms of flops it's the same anyway, so

39:25it doesn't even matter. But anyway, here

39:26it is um okay so here's one where

39:30instead of this, n1 is — I don't know,

39:33let's suppose the original thing was n —

39:34n minus one, and n2 is one, because

39:37this thing is one by one,

39:39right down here. Um, and so in this case

39:43it turns out you're going to get a

39:45stunning reduction, right, because

39:47you can

39:49invert this fast, and in that case you

39:51would get exactly this number over here:

39:53you would get, you know, 2 n2^2 n1. Right —

39:57so okay that's that's one example but

40:00I'll show you I'll show you some others

40:01Right. Um, what it means is: what

40:05if you saw a sparsity

40:08pattern that looked like

40:10this, right? This is banded, and then

40:16over here these are dense —

40:18these are called dense rows and columns.

40:20Everybody got that? So actually it's just

40:22a block arrow matrix, but with a

40:26thicker weight on the things, because

40:28they're wide, right? That's banded,

40:30and then this is a whole

40:32chunk of stuff over here. Right — everybody

40:34got that? So what this says is that

40:38this block elimination method here —

40:41well, it's actually insane: it's

40:44actually going to be linear in the

40:46dimension. If this width

40:49doesn't grow this way, and if the

40:52bandwidth doesn't change, this gets

40:54bigger and bigger and we can still solve it,

40:56right? So, anyway — it's pretty cool.

41:00so uh and that means if you see that you

41:04should smile — we'll put it

41:06that way. Or — you're probably not going to see

41:08that, because someone would

41:11already have to know. But the way it's

41:13really going to work is: someone's going

41:14to be talking about a problem it's going

41:17to be some weird, complicated problem — who

41:18knows what field; it doesn't matter — they're

41:20going to be talking about some

41:20complicated thing, and while they're talking,

41:23you will be forming in your mind what the

41:25sparsity pattern of the associated

41:27matrices looks like. Okay? When that image

41:30comes to mind you will smile and you

41:33will say: I can solve that problem in

41:37half a second. And they'll say: I can't even

41:40load it into memory. And you'd say: just

41:44chill, and I'll show you

41:46what we're going to do. Right? Everybody

41:48following this? So, I mean, these are

41:50incredibly useful things right so and

41:52you know, we can think about what this is:

41:55this is basically a set of equations

41:57where banded means that equation, you

42:01know, 50 only references up to, let's say,

42:0760. Okay — and that would be here. Then

42:11you say: but there are some special

42:12variables here which affect

42:14everybody. Okay, so that's

42:17that's what this is so you're going to

42:18want to be making those uh yeah so yeah

42:22to be fully developed I would say here's

42:24what you want you want to be able to get

42:27a feel for when applications involve

42:30matrices that look like

42:32that then you're going to be happy

42:34because you're going to know something

42:35other people in the room don't because

42:37you're going to know that you can solve

42:38that in linear time right so that's why

42:42people would say, like: well, I have to

42:44find this whole trajectory — it's,

42:45you know, 100,000 time

42:47points — and you're like: so? I

42:50mean, do you have a phone, or whatever?

42:52It's just not going to be that

42:53complicated, right — again, if you know what

42:54you're doing. Okay. So, okay.
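[Editor's note: the payoff for the block-arrow pattern just described is easy to demonstrate. Here is a hedged sketch of the simplest case — A11 diagonal rather than banded, with made-up data — where every "solve" with A11 is just an elementwise division, so the whole computation is linear in n1:]

```python
import numpy as np

# Arrow system: diagonal A11 (n1 x n1) bordered by n2 dense rows/columns.
rng = np.random.default_rng(2)
n1, n2 = 100_000, 5
d = 1.0 + rng.random(n1)                  # diagonal of A11
A12 = rng.standard_normal((n1, n2))
A21 = rng.standard_normal((n2, n1))
A22 = rng.standard_normal((n2, n2))
b1 = rng.standard_normal(n1)
b2 = rng.standard_normal(n2)

# Block elimination: "solves" with A11 are just divisions, so this is O(n1).
A11inv_b1 = b1 / d
A11inv_A12 = A12 / d[:, None]
S = A22 - A21 @ A11inv_A12                # small n2 x n2 Schur complement
x2 = np.linalg.solve(S, b2 - A21 @ A11inv_b1)
x1 = A11inv_b1 - A11inv_A12 @ x2
```

[Storing the full matrix densely would take roughly 80 GB at this size; the pieces above are a few megabytes, and the solve runs in a fraction of a second.]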

43:00Now this leads us to — and this,

43:04probably, people have seen; it's got lots

43:06of names, depending on where you're from.

43:08I think it's called, like, Sherman

43:11Morrison Woodbury; sometimes it's called

43:13the Woodbury thing; sometimes it's called

43:14the matrix inversion

43:16lemma. And I have a friend who can

43:19actually tell where someone learned

43:21their linear algebra because they use

43:24different terms and they they would they

43:26even use different terms I believe at

43:27Oxford and Cambridge just just so you

43:29know so this is like someone can just by

43:31asking someone what they call this and a

43:33couple of other questions you can figure

43:35out that they got their PhD at Oxford or

43:37Cambridge or something or Harvard or

43:38whatever anyway or MIT or something okay

43:41So here it is: you start with (A plus

43:45B C) times x = b. Right — so that's your

43:50system. In fact, some people

43:52call this a low-rank perturbation

43:56of A — I mean, that's not fair, because no

43:58one said that B and C are small, right?

44:01But if B had

44:04just a handful of columns and C had a

44:05handful of rows, this would indeed be

44:08what people would call a

44:09low-rank update to A. Okay — so we want to

44:13solve a system that looks like that. So,

44:15in fact, an extreme case would be things

44:17like — here's one you hear all the

44:19time: diagonal plus low rank. There, A is

44:24diagonal and B and C are small, right? So

44:27B has got, you know, a small number of

44:29columns, C has got a small number of rows —

44:32that's super common. That's also

44:34something you will I would hope never

44:36ever forget diagonal plus low rank and

44:39it's it's all related um here's how it's

44:43related um yeah actually let's just talk

44:45about this in the case where A is

44:47diagonal and B and C, you

44:49know, have a modest number of columns,

44:51right. So, in fact —

44:56basically all of quantitative finance

44:59uses a risk model where the covariance

45:01matrix is diagonal plus low rank, and

45:04it's called a factor model, right — with C equal to B

45:08transposed. And if this were, like, a

45:12Gaussian or something — if this is

45:14the covariance of a Gaussian — if you

45:15understand what I'm saying that's fine

45:16if you don't that's also cool but this

45:18is just completely Universal right so

45:20What it would mean — it's

45:22even got a Bayesian model — is it basically

45:24says that the returns of all

45:27these assets are driven by, let's say,

45:31B and C. Typically B is about a

45:33width of 100 — that would be very typical;

45:36C would have a height of — well, C is B

45:38transposed, right. So you'd say: look, I

45:41don't know what these

45:438,000 returns are doing but they're

45:46mostly driven by these 100 factors right

45:51which are, actually —

45:53that's the underlying thing, that's

45:55what's actually driving their returns.

45:57And then the A is diagonal, and

45:59they have a beautiful name for it: it's

46:00called idiosyncratic. So that's the

46:02idiosyncratic risk — it's the amount by

46:04which, you know, Google, or

46:08Coca-Cola, or

46:10Chevron, goes up and down all by itself. In fact, the way they would say it —

46:13this is dialect — is: that is the

46:16part of the variance not explained by

46:18the factors. Everyone got that? So that's

46:20it. So — the factors are actually kind

46:22of interesting; sorry, this is just a

46:23weird aside, but since we're here:

46:26the factors are interesting. Some are

46:28handcrafted, so they're literally

46:31matrices with zeros and ones and weird

46:33stuff like that — like: what

46:35region are you

46:38in, right? Are you a

46:41pharmaceuticals company? Do you

46:43do defense? Do you do that

46:45kind of thing? Others are actually

46:47crafted by people using

46:48numerical algorithms. So, okay, that was

46:50just a small story about this okay um oh

46:55If I have something that

46:56looks like that — A is diagonal and B and

46:58C are small — I can store that, no

47:00problem, right? That's true even for

47:02very big ones. Like, say you

47:05want to run a fund that is looking at a

47:07universe — I'm using dialect — of, let's say,

47:10100,000 assets, right, across the whole

47:13world; I know people who do that.

47:15So you do that. Um — so the covariance

47:18matrix is a 100,000 by 100,000 matrix, or

47:20something like that. You might ask someone:

47:22well, what's your risk model? And

47:24they would say: oh, I have a 200-factor

47:28model. And that means their covariance

47:31matrix is expressed as a diagonal matrix —

47:34which is, whatever it is, 100,000 by

47:38100,000 — and then they have 200 factors,

47:42and all of that is completely

47:43storable and stuff like that. And you'll

47:45see — in fact, not only can you store it, you

47:46can actually solve problems super fast

47:48and things like that. Okay — right, that was

47:50all a weird aside, but let's keep going.

47:52So now, what you do here

47:54is something that is actually a little

47:56bit unintuitive: you

48:01un-eliminate. We did this before, in duality,

48:04and it turned out well there. To

48:06un-eliminate means to introduce new

48:08variables, right? To eliminate

48:10variables means to, you know, solve for

48:13them, get rid of them, and then substitute

48:14their values back into the rest — that's

48:17elimination, right? That sounds good,

48:20especially to an amateur, or somebody

48:21first seeing this material, because, you

48:24know — someone asks: what are you doing? —

48:25I'm eliminating some variables. Why? Because I

48:27like to solve smaller sets of equations.

48:29I mean, to everyone that

48:31makes perfect sense, right? It's

48:33completely and totally wrong — that's not

48:36how you should be doing linear algebra —

48:38but it does sound good, and people can

48:39fall for it. Right? Okay: un-elimination

48:42is the opposite you say thank you for

48:44your set of equations if you don't mind

48:46I'm going to introduce new variables and

48:47equations and make it a bigger one and

48:49someone's like what why would you do

48:52that everyone see what I'm saying you'd

48:54say thank you for your you know thousand

48:56by thousand set of equations if you

48:57don't mind I'm going to expand it to a

48:5910,000 by 10,000 set of equations, if it's okay

49:01with you right and they would say like

49:04why anyway and the answer is you'll

49:07you'll see why in a minute right so we

49:09write this this way: we just introduce a

49:10new variable y, which is C x,

49:13right — and I write that second

49:15equation as this bottom row, which is

49:18C x minus y = 0. Right — so the

49:22second system of equations is bigger and

49:26uh completely equivalent to the first

49:27one right okay now let's talk about this
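[Editor's note: the un-elimination step can be sanity-checked numerically — build the bigger bordered system and confirm it gives the same x as the original one. A minimal sketch with made-up small dimensions:]

```python
import numpy as np

# Original system: (A + B C) x = b.  Un-eliminate by introducing y = C x:
#   [ A   B ] [x]   [b]
#   [ C  -I ] [y] = [0]
rng = np.random.default_rng(3)
n, p = 8, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)
B = rng.standard_normal((n, p))
C = rng.standard_normal((p, n))
b = rng.standard_normal(n)

M = np.block([[A, B],
              [C, -np.eye(p)]])
z = np.linalg.solve(M, np.concatenate([b, np.zeros(p)]))
x, y = z[:n], z[n:]

x_direct = np.linalg.solve(A + B @ C, b)   # same answer, smaller system
```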

49:32um when you see a matrix like that it

49:35you just encountered all this so far

49:38just a few minutes ago so I don't expect

49:40you to have this response now but

49:42hopefully within a week or two you

49:45should have the following response you

49:47will see a block there — a minus I — and

49:50your eye will go to it, because it's

49:53really easily inverted and you're going

49:55to see a block, and you're going to —

49:57well, you're going to have an urge to

49:59do block elimination on this, right?

50:02And so what you would do is

50:04you'd say: oh, I know what I'm

50:05going to do — I'm going to permute the

50:07minus I up to the top, and then

50:09I'm going to use block elimination from

50:10the previous stuff. Right? Everybody got

50:13that? And then you eliminate it, and you

50:15will end up forming the Schur

50:18complement equations, and it will be (A

50:20plus B C) x = b — so you'll be right back where you

50:23started. Okay. So here's what you do

50:26instead: you resist the urge here to

50:31eliminate the minus I — even though it is

50:33a very attractive target for elimination,

50:35because it's easily inverted, right? You

50:38resist that urge, and instead you eliminate

50:41using the A block, and you will end up with a Schur

50:43complement equation that looks like

50:46this. Okay — um, and this allows you to get

50:50y: here you solve for y, then

50:53you plug it back in and get x,

50:55and you end up with a formula that's

50:57called the matrix inversion lemma — that's

50:58the neutral version. It says that (A

51:01plus B C) inverse is, you know, this

51:03horrible formula over here. Yeah — so

51:06that's it. Um, this gets interesting if,

51:09say, A is diagonal or something like

51:12that — anyway, a lot of things

51:15come out of that. So:

51:17how many people have seen like Matrix

51:19inversion Lemma or something before or

51:21or Sherman-Morrison-Woodbury — it's

51:25okay uh any other names for it I uh I

51:29used to know at least one or two others

51:31but anyway anybody else know another

51:34name for that it's a thing in linear

51:37algebra, and what you call it depends on

51:39where you came from so

51:42okay okay so I think actually I'll uh oh

51:45Yeah — well, here's an example: A

51:47diagonal, and B and C are dense. Um —

51:50you basically — you're

51:55going to form D = A plus B C. This is

51:59if you do it the dumb way, right: you just

52:01write it out as a dense matrix and

52:03solve. Um — in this case you get

52:06something that's order n cubed,

52:09and if you do it the right way you get

52:10something that's linear in n. Right? And

52:13these make a big difference.
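[Editor's note: the "right way" for diagonal plus low rank is exactly the matrix-inversion-lemma / un-elimination recipe — all the work is a p x p solve plus O(n p) arithmetic, so it is linear in n. A hedged NumPy sketch with made-up data, using the factor-model structure mentioned above (C equal to B transposed):]

```python
import numpy as np

# Solve (diag(d) + B C) x = b without ever forming the n x n matrix.
rng = np.random.default_rng(4)
n, p = 50_000, 10
d = 1.0 + rng.random(n)              # diagonal part (e.g. idiosyncratic risk)
B = rng.standard_normal((n, p))      # e.g. factor loadings
C = B.T                              # factor-model case: C is B transposed
b = rng.standard_normal(n)

# Matrix inversion lemma: eliminate x from the bordered system, leaving a
# small p x p Schur system in y = C x.
Dinv_b = b / d
Dinv_B = B / d[:, None]
y = np.linalg.solve(np.eye(p) + C @ Dinv_B, C @ Dinv_b)
x = Dinv_b - Dinv_B @ y
```

[The dumb way at this size would mean forming and factoring a 50,000-square dense matrix; here the only dense solve is 10 x 10.]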

52:16So yeah — by the way, a lot of this

52:19stuff subsumes things that were taught

52:21in the '60s, '70s, and '80s, and sometimes

52:25by people who learned it from people who

52:27learned it in the '60s and '70s. Okay — so,

52:30for example, if you teach a course on

52:32control — I've done this;

52:35I haven't done it for a long time, and I

52:36won't do it again — you go through a

52:38long derivation of what, like, the Kalman

52:40filter is, or how to solve an optimal

52:42control problem, right? And it's very

52:44clever, and you have little parts and

52:45this Bellman equation and blah blah blah,

52:47and you end up with

52:49stuff that actually works — it's great. Um —

52:52a very funny thing happens if you simply take

52:55the problem represented by an optimal

52:57control problem or a Kalman filter and

53:00just throw it at a sparse solver: guess

53:02what — it'll do it;

53:06it'll just do it. So basically all of those

53:07lectures were useless, is what I'm saying,

53:09as a practical matter — which is kind of

53:11embarrassing, but whatever. So I don't

53:13teach that anymore — I don't teach it

53:14that way anymore; I just say: you get a

53:17giant set of linear equations and you

53:21solve it using sparse methods. Done.

53:24That is the correct way to do

53:26it, and it gives you a lot of stuff. Right — so

53:29this happens in a lot of fields. Like — I

53:32think, I'm almost certain, in

53:33statistics there'd be many, many papers

53:35written on: if you had a factor model — so

53:38the covariance matrix is diagonal

53:40plus low rank — there would be all sorts

53:43of intensely clever

53:44ways to condition. Suppose I give

53:46you 80% of the entries of a vector

53:49and I said: please compute the

53:50conditional means of the

53:51remaining ones. Everybody follow this?

53:53That's like a conditional mean,

53:55right? So I'm sure there are whole papers

53:58written on this, and clever

54:00methods for doing it, and blah. Anyway —

54:02the answer is: you form the covariance

54:05matrix, literally write down the

54:07equations you want to solve, and then you

54:10do nothing but pass it to a sparse

54:12solver, and it'll be

54:15just as fancy — probably more general — than

54:18all those papers. So, okay — it's

54:20embarrassing but true. So, okay. Um — all

54:23right. Um — so I'm not going to

54:27go into this; there's a whole lot

54:28more I haven't told you about linear

54:30algebra, but these

54:31are the main points. So, okay. So, let me — um,

54:34yeah. Right now this

54:37is just sort of hanging here now you

54:38just know about linear algebra but I

54:39promise in the next week we're going to

54:42connect it to problems and then that to

54:46optimization problems and that will have

54:48a connection to applications and then

54:50you will start being able to listen to a

54:53problem a practical problem and say I

54:56can solve that like I can solve that

54:58super fast right so everybody uh getting

55:01this? Right. So — there are people

55:04for whom this is all they do, right? If you

55:05do scientific computation, this is kind

55:07of a lot of the stuff you do. So, okay — so

55:10now we're going to start

55:13minimization. Um — and, like I say,

55:16we're building slowly up. At the

55:18lowest level — well, at the lowest level

55:20are, you know, computers,

55:23compilers, microcode, things like that,

55:25which we're not going to look at. The next

55:27level up — the bottom floor — is what we

55:29just covered: linear

55:30algebra. Now the next floor is: how do you

55:33solve, you know, smooth unconstrained

55:36problems? So we'll look at

55:39um and we'll build slowly up uh then

55:43we'll get equality constraints then

55:44we'll pull in inequality constraints and

55:46then we're there right so okay so here's

55:49what it looks like. We're going to

55:50minimize a function f here that we're

55:54going to assume is smooth —

55:56twice continuously differentiable, right?

55:59Um — and we'll assume that, you

56:01know, the optimal

56:04value is achieved, right — so

56:05there's at least one optimal point. Right —

56:07now, by the way, a lot of problems like

56:09that, that we want to solve in this class —

56:10problems you have already solved — don't

56:13satisfy this. Right? Like, if you did an l1

56:16fitting problem, you would minimize the

56:18l1 norm of Ax minus b, and that doesn't count —

56:21that's not covered here, because that's not

56:22differentiable. Okay — um, okay, so this is

56:26what we're going to do. Um — in general,

56:29except for a very important set of

56:31problems we will see shortly, there

56:33are no analytical solutions to this.

56:36Right? So instead we're going to

56:40have iterative methods. An iterative

56:41method is going to generate, you

56:43know, an x; it's going to query something

56:46about f and then generate a new x, and so

56:49on — a new x and a new x — and what's going

56:51to happen is those x's are going to

56:52converge to — hopefully — a

56:55solution of this problem. Right? Um — and

57:00what we'll do — you know, the typical

57:02thing you're going to get is something

57:04like this: the sequence is going to have the

57:06function values converge to the optimal

57:08value; and another way to say it is

57:10that you're

57:13going to have a sequence of points where

57:14the gradient gets smaller and smaller.

57:16Okay — and then when you deem it to be

57:18small enough — um, you asked earlier about

57:20stopping criteria: here you have stopping

57:22criteria. For the direct methods you

57:24don't have stopping criteria — well, it

57:27just stops; it's like, after n cubed over

57:29three flops you've got the Cholesky factor —

57:32there is no stopping criterion there.

57:36Okay. So — well, we'll make some

57:39assumptions. I don't want to get too

57:41technical about it, because it

57:42quickly gets kind of boring and stuff

57:44like that. Um — and by the way, there are

57:46entire classes that cover

57:48nothing but this material in hideous,

57:50full mathematical detail. So, I mean,

57:52they're interesting, but —

57:55I don't know, I think some

57:58people grow out of it. Anyway —

58:00I'll stop there. Okay:

58:06Um — so, yeah, we're going to assume that

58:08the sublevel sets are closed, and

58:11one condition for that that's pretty

58:13common is that the function f is

58:15closed. Like, if you do convex analysis —

58:18which is the mathematics of convex

58:19optimization and convexity — this is a

58:22central concept; it just means that the epigraph is closed,

58:25right. Um — and one example would be —

58:29well, here are

58:32some examples of closed functions,

58:33right: log-sum-exp — this is because the

58:36domain of this is everything, right, so

58:38this would count. Um — here's

58:40another one. This is weird, because you

58:42might say: but there's a constraint — the

58:44constraint is that x lies in the

58:46interior of this polyhedron, because

58:48otherwise — I mean, there's an

58:50embedded constraint here, and you might

58:51say: no, I don't like that. But this one is

58:53perfectly good: all the sublevel

58:55sets are closed, and the reason is

58:57that the function value goes to infinity

58:59as you approach the boundary. So we'll

59:01get to that. Okay? Is that okay? So —

59:04these are technical conditions — I

59:06think not super duper important.

59:10Um — people sometimes assume

59:14something like a minimum

59:16curvature, right? So what that says is, for

59:19example: the minimum eigenvalue of the

59:21Hessian is bigger than some number m,

59:23which is positive, right? That gives

59:25you a minimum curvature of the function,

59:28right? That'll allow you to say all

59:30sorts of things. Um — yeah, now you could

59:35ask questions like: um, how would you know

59:38what the minimum curvature is? It's

59:40a good question, right? And the answer

59:42is: in most cases you would have

59:43absolutely no idea. There are cases where

59:45you do. Um — oh, here's one:

59:48let's say I do multinomial logistic

59:50regression and I add quadratic

59:53regularization to it.

59:55Then I don't even count the

59:57curvature that comes from the

59:59multinomial logistic regression loss;

01:00:02the quadratic

01:00:03regularization alone gives me a minimum

01:00:05curvature. Everybody follow? So, I mean,

01:00:08these are kind of rare, right — but that

01:00:09is an example where you actually

01:00:11would know a candidate value for m, right —

01:00:14which is literally — it's

01:00:17going to be the regularization, the ridge

01:00:19regression, you know, parameter. Um — okay, so

01:00:23if this is the case then this says that

01:00:25f minus m over 2 times the norm of x squared is

01:00:34convex and if you then

01:00:36apply the basic convexity formula to

01:00:39that you'll find out really cool you get

01:00:41a strengthened inequality I guess

01:00:43it's not quite Jensen uh you know

01:00:45this first part up to here that's

01:00:48true this is true for any convex

01:00:51function this part right that's the

01:00:53basic inequality what this says is oh

01:00:55yeah well it applies with a little bit

01:00:57more right it's actually the

01:00:59minimum curvature constraint right so it

01:01:02strengthens that basic inequality we saw

01:01:04six weeks ago or whenever it was seven

01:01:06weeks ago six right um okay and this

01:01:11means all sorts of cool things because

01:01:12for example I could minimize both

01:01:16sides of this inequality over y on the left I get

01:01:21P star because it's just the minimum of

01:01:22f of Y on the right right this is a

01:01:25quadratic function in y I set the

01:01:27gradient equal to zero I minimize it and

01:01:30you get a number and

01:01:32anyway you end up with

01:01:34something that looks like this you end

01:01:35up with this inequality right and so

01:01:38this is kind of cool because this

01:01:40says that if you wanted a stopping

01:01:43Criterion you would run the algorithm

01:01:45you'd evaluate the gradient and when the

01:01:46gradient gets small enough you quit and

01:01:48someone says why are you quitting um and

01:01:51you'd say well the gradient is less than

01:01:52something and they go yeah but how far

01:01:54am I from optimal and then you would say

01:01:56oh good question I can actually bound it

01:01:58I can say that here's how far you are

01:01:59from the optimal value no more than 1

01:02:01over 2m times this of course then in

01:02:04general someone would say how do you

01:02:05know m and the answer is you

01:02:06don't and blah blah blah so but okay
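A quick numerical check of that bound (on a hypothetical strongly convex quadratic, chosen because there m and p-star are computable exactly): f(x) − p* ≤ ‖∇f(x)‖²/(2m) at every point:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
Q = rng.standard_normal((d, d))
A = Q.T @ Q + np.eye(d)            # positive definite: f below is strongly convex
b = rng.standard_normal(d)
m = np.linalg.eigvalsh(A).min()    # minimum curvature (rarely known in practice)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
p_star = f(np.linalg.solve(A, b))  # exact optimal value for a quadratic

# the suboptimality is bounded by the squared gradient norm over 2m
for _ in range(100):
    x = rng.standard_normal(d)
    assert f(x) - p_star <= grad(x) @ grad(x) / (2 * m) + 1e-9
```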

01:02:09everybody this is the

01:02:12justification you would have for

01:02:14stopping when the gradient is small

01:02:18um so we'll say a little bit about

01:02:21descent methods these are

01:02:24you know super widely used um and

01:02:28they all have a generic

01:02:30form and it looks like this um in each

01:02:33step of a descent method

01:02:36you're going to find something called

01:02:38Delta X and that's called the step or

01:02:40the search Direction here um and so you

01:02:45know basically what happens is at

01:02:46iteration K you have XK you make some

01:02:49queries to F about like what's the

01:02:52gradient or something like that and then

01:02:54you form the search Direction so you

01:02:56stop and you ask for directions and it

01:02:58says well you're here you want to

01:03:00minimize you know you should kind of go

01:03:01in that direction that's delta x this

01:03:03direction then what you do is you're

01:03:05going to go in that direction

01:03:07with an amount that you're going to then

01:03:09choose and that amount is scaled by this

01:03:11Factor T which is positive and this is

01:03:15uh that's sometimes called the

01:03:17step size or the step length even though

01:03:19it need not be it's not really the

01:03:21length unless the norm of the search direction

01:03:23is equal to one right but still people

01:03:25call it the step length or the step size

01:03:26or something like that so okay um that's

01:03:30what you do and it turns out that if

01:03:35the direction you move in has

01:03:38a negative inner product with the

01:03:40gradient

01:03:44and if you choose the step size

01:03:45correctly the new point will actually

01:03:48have a function value less than the

01:03:49original Point that's good that's called

01:03:50a descent method because you're

01:03:53kind of descending right and

01:03:54you end up with a function value

01:03:56that's lower okay so here's like a

01:03:58generic version of it uh to make this a

01:04:02real algorithm we have to fill in things

01:04:05we have to figure out how you find the

01:04:07direction and then we have to figure out

01:04:10how do you choose the step length right

01:04:12and that's the whole thing right and the

01:04:14stopping Criterion could be some Norm of

01:04:15the gradient usually it's not usually

01:04:18it's something that is computed as a

01:04:20byproduct of these two steps in here so

01:04:22so this is going to be like hundreds of

01:04:24different methods with different names

01:04:26are going to have exactly this form

01:04:29right and they differ in what's the

01:04:31search Direction and they differ in what

01:04:33is the line search okay so that's

01:04:36the

01:04:39idea so for line searches um there's a

01:04:44bunch I mean here's one an exact

01:04:46line search that says once you've

01:04:48decided that you're here and you're

01:04:50going in that direction you actually

01:04:53look at the function restricted to that

01:04:55Ray and then you look for the point

01:04:57where the where along that Ray you have

01:04:59the smallest values that's called an

01:05:01exact line search and you could do this

01:05:03in some cases right uh where you can

01:05:06actually work this out analytically or

01:05:07you can compute it cheaply or something

01:05:09like that right so that's that's one one

01:05:12place where you you might do this right

01:05:14um another one which sounds much cruder

01:05:18is called a backtracking line search and

01:05:20this has many many names again depending

01:05:21on where you came from it's called an

01:05:23Armijo line search it's got all

01:05:26sorts of names for it the Armijo condition

01:05:28there's at least one of those I forget

01:05:29what it is in Russian but they have a

01:05:31different name for it so it's all this

01:05:34the lots of different names for this um

01:05:36so this has been rediscovered like I

01:05:37don't know a couple hundred times here's

01:05:39the way it works it's

01:05:42called backtracking and it goes like

01:05:44this you start with T equals 1 so you

01:05:47try x plus delta

01:05:50X then what you're going to do is you're

01:05:52going to check whether f has gone down

01:05:55enough and this is critical by the

01:05:57way if it went up that's

01:06:00basically

01:06:02really bad right in fact you can even be

01:06:04out of domain so that that would be like

01:06:06a very you know that that would be it

01:06:07went up a lot because it went up to plus

01:06:09infinity right so okay so here's what

01:06:12this looks like here's t here um

01:06:16here is the function restricted to

01:06:20t this would be the exact line

01:06:24that's the exact step

01:06:26length you would take to

01:06:27minimize this right um so the way this

01:06:30works is you look at this basic

01:06:33inequality here that basic inequality is

01:06:35this line right here right this dash

01:06:37line that's what and what it says is uh

01:06:41if I put Alpha equals 1 in that equation

01:06:43I can reverse that inequality because

01:06:46basically when I continue this

01:06:48linear thing I guarantee this is a lower

01:06:50bound on this right so the

01:06:53actual function value could never be

01:06:55below this if if the function's convex

01:06:58right so here's what you do we put an

01:07:00alpha in there uh which is a number

01:07:03that's less than one uh yeah actually in

01:07:05fact it's less than a half for weird

01:07:07reasons though it turns out

01:07:09it's not going to matter what it is too

01:07:10much um and then we degrade the slope by

01:07:13multiplying by Alpha so here Alpha is

01:07:15maybe a third I just made it up and that

01:07:19line so if you take a really small step

01:07:22length you're definitely below this

01:07:24line because if you take a really small

01:07:25step length here this is actually a

01:07:27really good approximation of how much

01:07:29the function is going to go down right

01:07:31um and then the rule

01:07:33is you have to be below this dashed line

01:07:38this degraded predicted line and so the

01:07:40way that works is you know you might

01:07:42start here and query the function and

01:07:44find this value and you're above this in

01:07:47fact the most dramatic one is you'd

01:07:48start way out here you're

01:07:49over here and actually when you do x

01:07:52plus delta x in that case you

01:07:54actually have a higher value than f of x

01:07:57everybody see that so so that would be a

01:07:59particularly poor choice of Step length

01:08:01right so what you do then is you

01:08:05multiply this by some Factor beta beta

01:08:09is typically like a half though people

01:08:11also use 0.1 so you simply say

01:08:14okay that was too much so you go

01:08:16back by half and you land here you try

01:08:18this one you get this value it's less

01:08:20than this thing and you accept it so you

01:08:22take the first one you accept that's

01:08:24backtracking now the story

01:08:28of a lot of this

01:08:30unconstrained

01:08:32minimization material is that there'll be

01:08:35several times when something you would

01:08:37think is right is just wrong

01:08:39completely wrong so now I'll tell you

01:08:41the first one you would look at this and

01:08:43you'd say it's obviously clearly

01:08:46it's better to do an exact line search

01:08:49that's totally obvious right because

01:08:51you're actually minimizing the function

01:08:52along this line right this makes sense

01:08:55and this crude thing that stops when

01:08:57you've gotten like some acceptable by

01:08:59the way people use numbers like alpha

01:09:02equals 0.01 right meaning if you get even just

01:09:05the slightest decrease like good fine

01:09:07thanks we'll take it so everybody see

01:09:09what I'm saying so looks very

01:09:11unsophisticated right um anyway all

01:09:15right now I'll tell you the punch

01:09:16line the punch line is it doesn't make

01:09:19any difference uh that in fact exact

01:09:21line search is not only not better in

01:09:23practice it's often

01:09:25worse and the simple dumb thing actually

01:09:28works just fine thank you this is not

01:09:31obvious right so um
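Putting the pieces together, a minimal sketch of the generic descent loop with a backtracking (Armijo) line search — the quadratic test function and the alpha = 0.3, beta = 0.5 values here are just illustrative choices:

```python
import numpy as np

def backtracking(f, g, x, dx, alpha=0.3, beta=0.5):
    # start at t = 1; while f(x + t*dx) sits above the degraded linear
    # prediction f(x) + alpha*t*(g @ dx), shrink t by the factor beta
    t = 1.0
    while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
        t *= beta
    return t

def descent(f, grad, x0, tol=1e-8, max_iter=1000):
    # generic descent method: pick a direction, pick a step size, update, repeat
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # simple stopping criterion
            break
        dx = -g                        # steepest-descent direction, so g @ dx < 0
        x = x + backtracking(f, g, x, dx) * dx
    return x

# toy problem: f(x) = 0.5*(x1^2 + 10*x2^2), minimized at the origin
A = np.diag([1.0, 10.0])
x_min = descent(lambda x: 0.5 * x @ A @ x, lambda x: A @ x, [1.0, 1.0])
```

Swapping in a different `dx` or a different line search gives the other named methods of this family; the skeleton stays the same.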

01:09:35okay now we get to a gradient descent

01:09:39method this is the simplest one remember

01:09:41we had to choose a Direction I mean I

01:09:45would hope it

01:09:47would be the first thing you

01:09:49would think of someone says make me an

01:09:51iterative method I'm at x and

01:09:54I evaluate F at that point and I

01:09:56evaluate a gradient now gradient is

01:09:58pointing in the direction of fastest

01:10:01increase of f if I move in that

01:10:03direction F goes up as fast as possible

01:10:05if I move in the negative gradient

01:10:07direction f goes down as quickly as

01:10:10possible and not only would you come

01:10:13up with this um but you'd also ask

01:10:17yourself why would you do anything else

01:10:21I mean think about it if

01:10:22you didn't do the gradient

01:10:25method what happens is you arrive here

01:10:27and someone says where's the negative

01:10:28gradient you go that direction you say

01:10:31what does that mean you go if you go in

01:10:32that direction you're going to go down

01:10:33as quickly as possible and you go cool

01:10:36which way do you recommend I go and you

01:10:38go um that way you're like like what

01:10:41like why would you recommend that anyway

01:10:44you will shortly see why but is this

01:10:46kind of non-intuitive right uh so but

01:10:48we'll we'll get to it anyway so gradient

01:10:50method is the most intuitive one it

01:10:51looks like this uh use any it

01:10:54doesn't matter what line search you use

01:10:57and here it is and you can you know

01:10:59prove convergence of this and all that

01:11:00kind of stuff that's fine um it turns

01:11:02out this is often unbelievably slow so

01:11:07uh so in fact we'll see

01:11:09what other search directions can

01:11:11actually dramatically improve this but

01:11:13this is slow very weird um and we'll

01:11:15look at a picture to understand it so I

01:11:18just take a quadratic looks like this

01:11:20right and here are some level curves of

01:11:21the quadratic down here and Gamma is a

01:11:25number uh oh yeah let's see so gamma

01:11:28allows me to change the eccentricity of

01:11:32the level sets right so if gamma equals

01:11:351 this is one half of x1 squared plus x2 squared oh yeah by the

01:11:39way what's the gradient in that case if

01:11:41the level sets are just

01:11:44here what does the gradient

01:11:52look like in that case this is

01:11:55actually I'll draw this it's

01:12:00here here's your function uh here the

01:12:03level sets here's zero right these are

01:12:06Level sets you're here the gradient

01:12:08points directly out like that and the

01:12:11negative gradient points directly

01:12:12towards zero which is the minimizer so

01:12:16if gamma was one if I ask you to

01:12:18minimize x1 squared plus x2 squared using the gradient

01:12:21method with an exact line search tell me

01:12:23about how well that algorithm is going to

01:12:28work yeah very well like in

01:12:31one step right it's going to say which

01:12:33way should I go you say steepest descent

01:12:35which just happens in this case to aim at the

01:12:39solution right you do an exact line

01:12:41search and it's game over everybody got

01:12:43that okay so but now that makes sense

01:12:47and in fact someone who was proposing

01:12:48like you know a gradient method would

01:12:50say yeah look look how well that works

01:12:52that's great like why would you not do

01:12:54that why would you go in a

01:12:57crazy Direction like that right um but

01:13:01we can actually control the eccentricity

01:13:03by making gamma either really big or

01:13:06small um and in this case you can

01:13:09actually work out analytically

01:13:12what the iterates are right and you get

01:13:14something that looks like this is

01:13:15actually a very good picture to have in

01:13:16your mind of course it's totally silly

01:13:18because we don't care about

01:13:20two-dimensional problems we care about

01:13:22you know problems with a dimension

01:13:23of 200 2,000 20,000 2 million but

01:13:26nevertheless this gives you the

01:13:28picture so here this is when it's just

01:13:31slightly eccentric and here these are

01:13:33the level curves and so the negative

01:13:35gradient is simply

01:13:38orthogonal to the level

01:13:40curves and it says go in that

01:13:42direction this is the minimum

01:13:45because we're doing exact line search

01:13:47and then it says which way should I go

01:13:48and you go like that

01:13:50everybody so this has got uh people

01:13:52refer to this as zigzagging

01:13:55we can make it arbitrarily bad

01:13:57right we can take gamma equals a thousand

01:14:01and you'll find out that

01:14:03this thing's basically going like this

01:14:04and actually I think

01:14:06there's even a name for it in

01:14:07Western optimization

01:14:09but the

01:14:12name I like is it's just called

01:14:14zigzagging everybody got this so by the

01:14:17way you know when you see this picture

01:14:19you can kind of figure out some things

01:14:22immediately that would be pretty good

01:14:24right so if you're in EE or some

01:14:26other field you would say you're kind of

01:14:29going in the right direction on

01:14:32average right I mean the problem is

01:14:34you're just doing this weird oscillating

01:14:35thing up and down like this whereas you

01:14:37want to really kind of be going that way

01:14:39and so you'd say you know what you

01:14:41should run those search

01:14:42directions through a low pass filter or

01:14:44something I mean again I'm speaking

01:14:46dialect um or if you do physics you'd

01:14:48say yeah you know add some momentum to

01:14:50it or something like that or viscosity

01:14:52add a viscosity term right so that you

01:14:55would kind of do something that

01:14:57would go like that and head

01:14:59towards the solution right so and those

01:15:01are all methods we're not going to look

01:15:02at those but that's kind of the

01:15:04idea okay so even this baby example

01:15:07tells us that actually we get

01:15:10some really good intuition here

01:15:14the gradient method works really well

01:15:16when the sublevel sets of your

01:15:19function have low condition number

01:15:25because if they're round the

01:15:28negative gradient points directly to the

01:15:30solution and you're going to go there

01:15:31quickly if it has high condition

01:15:34number or if the

01:15:36sublevel sets are weirdly squished out like

01:15:38this that kind of thing is going to be bad

01:15:41everybody got that so that's kind

01:15:44of the picture here and

01:15:47that's actually super important

01:15:48observation because we're going to use

01:15:50that's actually the segue to

01:15:52basically Thursday's

01:15:53topic right so okay
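The zigzagging experiment is easy to reproduce, because exact line search on a quadratic has a closed form: along the ray, t = gᵀg / (gᵀAg). A sketch on f(x) = ½(x₁² + γx₂²), starting from (γ, 1) (the classic starting point for this example), counting iterations until the suboptimality drops below a tolerance:

```python
import numpy as np

def iterations_to_converge(gamma, tol=1e-6, max_iter=100_000):
    # f(x) = 0.5*(x1^2 + gamma*x2^2); gradient method with exact line search,
    # which for a quadratic minimizes along the ray with t = (g@g)/(g@A@g)
    A = np.diag([1.0, gamma])
    x = np.array([gamma, 1.0])       # classic starting point for this example
    k = 0
    while 0.5 * x @ A @ x > tol and k < max_iter:
        g = A @ x
        x = x - ((g @ g) / (g @ A @ g)) * g
        k += 1
    return k

# round level sets: done in one step; eccentric level sets: zigzagging
assert iterations_to_converge(1.0) == 1
assert iterations_to_converge(100.0) > 10 * iterations_to_converge(2.0)
```

The iteration count blows up with the eccentricity γ, which is exactly the condition-number story above.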

01:15:58just some examples um this is a

01:16:01silly function that's

01:16:04nonquadratic um and this is what happens if you

01:16:06use a backtracking

01:16:09line search and here's an exact line

01:16:10search and you'd look at these two and

01:16:11you'd go like well I told you exact line

01:16:13search was way better but this just

01:16:15shows you kind of the idea um but it

01:16:17shows you that in real nonquadratic

01:16:20examples you're actually getting this

01:16:21zigzagging effect right and

01:16:24then we can look at that over here and

01:16:26see what that looks like so um this

01:16:30looks like this right um and let me

01:16:34explain what this is these are the

01:16:36iterations right this is a log scale

01:16:39telling you actually your

01:16:41suboptimality right um and what you can

01:16:44see is all sorts

01:16:46of weird things here right like the

01:16:48first thing is that actually between

01:16:49zero and like 100

01:16:51iterations for whatever reason the

01:16:53backtracking line search was doing

01:16:55better like I don't know why but it was

01:16:57and this is kind of common right so this

01:16:59is what it looks like

01:17:01the other thing you see is that this

01:17:02kind of gets this gets kind of linear

01:17:04and that's referred to as linear

01:17:06convergence but that's a weird use of

01:17:08linear because it's linear on a log or a

01:17:13semi log plot right so it means because

01:17:17this is a log thing what linear

01:17:19convergence here means is that each step

01:17:22you do each iteration reduces the error

01:17:25by some factor it's pretty small frankly

01:17:28but we could work out what it is right

01:17:31let's say

01:17:34that we hit

01:17:36here so say to go from one to 10 to

01:17:39the minus 4 took you 170 iterations

01:17:42right then you take

01:17:4510 to the minus 4 and you take that

01:17:47to the one over 170th power and that tells you

01:17:51by what factor your error gets smaller each step

01:17:54right so that tells you roughly

01:17:56what it is right okay so that's

01:17:59called linear convergence um okay
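Concretely, using those rough numbers read off the plot (1 down to 10⁻⁴ in 170 iterations), the implied per-iteration error-reduction factor is:

```python
# going from error 1 to 1e-4 in 170 iterations means the error shrinks by a
# constant factor c each step, with c**170 = 1e-4
c = (1e-4) ** (1 / 170)
assert abs(c ** 170 - 1e-4) < 1e-12
assert 0.94 < c < 0.95   # each iteration keeps about 95% of the error
```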

01:18:03I think maybe I

01:18:05won't even start the next

01:18:07section but this is a good place to

01:18:10stop and we'll