Let me explain where we are in the whole course. Over the next two and a half or three weeks we're going to build up from the bottom to the top, and it will finish with you actually implementing your own solver, an LP solver or something like that; we haven't quite figured it out, but it'll finish with you implementing a solver. That's mostly, by the way, just to demystify it: so that when someone looks at you sometime in the future and says, "Oh my God, you can't write your own LP solver, that's really complicated," you can say, "Au contraire, I did it." That's the main reason, but this is also interesting stuff. Right now we're at the bottom level, which is linear algebra, and in some ways it's actually by far the most important level. My own feeling is that everybody, certainly anybody in this class, needs to know this material. Do you need a whole ten-week course, the kind you would take on computational linear algebra? Maybe, actually. But you certainly need to be familiar with the ideas. That's unlike, say, computer architecture: you don't have to know about computer architecture, even though everything we do is thanks to the people who design and build computers. This, though, you actually should know.

Okay, so back to numerical linear algebra. Today we're going to cover material that some of you may know and the rest of you probably don't, and I'd list it in the top ten things you need to know from this course, even though it's not technically part of the course. Okay, so here it is.

Last time we talked about how you solve a set of equations Ax = b. There are several ways. There are ones we're not going to talk about: iterative methods, which are used for super-large-scale problems, where you've got, say, a billion variables, or where the system represents a PDE you're solving. We're not going to talk about those. We're going to talk about what are called direct methods; I don't know where that name came from, but they're called direct. The way they work is they take the matrix A and first factor it into a product of matrices, each of which is, quote, "easy to invert," unquote. That's the first step. You've seen some of these factorizations, like the SVD and the eigendecomposition, and you've probably seen a couple of the others as well. "Easy to invert" is actually just dialect for "it's easy to solve equations with that coefficient matrix." That's what it really means, because no one is inverting anything. It is unbelievably rare in good numerical work for anybody to actually form the inverse of a matrix. There are cases where people do, but it's very rare.

Okay, so what you do, once you've factored your matrix into a product of things which are, quote, "easy to invert," unquote (you have to interpret that correctly), is you write out the inverse: the inverse of the product is the product of the inverses in reverse order. Then you literally just evaluate this, by first evaluating A₁⁻¹b. Of course, when people write A₁⁻¹b, what they really mean is that you call solve(A₁, b); that's what it really means. So you solve that, then you pass the solution to the next thing, which is A₂⁻¹ times x₁, so you solve A₂x₂ = x₁, and that gives you the second one, and you keep going until you get to the end. So again, it's completely straightforward; like, okay, so what? Well, last time we talked about something very interesting: if you want to solve a set of equations with the same coefficient matrix but different right-hand sides, what you can do is factor once, cache the factorization, and then simply call the solve method on your multiple right-hand sides. The first step is called factor, and the second is called solve. You get to some surprising conclusions in some cases; I'll get to that in a minute.
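As a concrete illustration of that factor/solve split, here is a minimal sketch using SciPy's LAPACK wrappers (the function names are SciPy's, not anything from the lecture; the random matrix is just for the example):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n))   # a random (almost surely nonsingular) matrix

# Factor once: the O(n^3) step. This computes and caches the PLU factorization.
lu, piv = lu_factor(A)

# Solve many times: each solve reuses the cached factors and is only O(n^2).
for _ in range(10):
    b = rng.standard_normal(n)
    x = lu_solve((lu, piv), b)
    assert np.allclose(A @ x, b)
```

The point is that `lu_factor` carries essentially all the cost, so the ten solves together take about as long as one.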

This is the one you probably saw a variation of in high school, or wherever you first saw linear equations. It's called Gaussian elimination; it's called lots of things. Here's what it is: if you have a nonsingular matrix A, you can factor it as A = PLU, a permutation times a lower triangular matrix times an upper triangular matrix, where P is a permutation, L is lower triangular, and U is upper triangular. They're all invertible, of course; actually, if any one of them weren't, then A would not be either. The cost of that factorization, which we're not going to go into (it's actually not that difficult), is order n³ flops. So that's what that is.

Here's what it looks like to solve linear equations. You factor A, which gives you P, L, and U; that's the factor step, and the remaining three steps are the solve. Again, all you do is solve them in reverse order. Solving with the permutation is silly: you just multiply by Pᵀ, which, roughly speaking, is the inverse permutation; you just un-permute the entries, so that's easy. The second one, L, is lower triangular, so for that you use forward substitution, which we saw is an order-n² algorithm. Backward substitution is what you use for U, because U is upper triangular: you start, if you'll recall, by first solving for the last entry, then the second-to-last, then the third-to-last, and so on. That's backward substitution, and it's also n² flops. If you add these up, you find out something kind of weird: the factor is order n³ and the solve is order n². I think last time I finished by asking you a question: how long does it take to solve, on my laptop, a set of linear equations with, say, a thousand-by-thousand matrix? I could make that sound pretty impressive: I could say my problem has more than a million coefficients in it, which is true, a thousand-by-thousand matrix does, and then you could say the factorization is order n³, that's like a billion flops. Of course, that's a joke now. The answer is I can do it in milliseconds, and so can you; my phone can do it in something on that order of magnitude. Which is actually kind of stunning.

Then I asked the following question: if it takes, say, 100 milliseconds to solve one set of thousand-by-thousand equations (a thousand equations, a thousand unknowns), how long does it take to solve 10 of them, or for that matter even 100? If it takes a tenth of a second to do one, the answer is surprising: it's a tenth of a second. Everybody got that? Not that many people appreciate or understand this, and it's used throughout a lot of numerical work; you use it all the time, you just don't know it. All the solvers you use when you call CVXPY, almost all of them, use things like this. The most typical methods to solve SOCPs, SDPs, these things, or for that matter even LPs, involve two solves, done in the same way: basically, the factorization dominates, and that's the point there.

Actually, I'm just curious: how many people have seen this before? How many people knew that you could solve 10 sets of equations with a thousand variables and unknowns, same coefficient matrix, different right-hand sides, in the amount of time it takes to solve one? You knew? Come on, other people. Is no one else impressed? Shouldn't you be? Maybe it's too early, but people should be going, "That's awesome!" I'm not getting that feeling, but that's okay, that's fine. Okay, good.

Right. Now, sparse LU factorization. This is super interesting; this is something everybody should know about. Since about the 1960s, people have been developing stuff for sparse matrices: first of all, what data structures do you use to store them, and then there's a whole parallel universe of doing, for sparse matrices, all the stuff you would do with so-called dense matrices (which are just stored as arrays): factorizations, back solves, all these kinds of things, everything you can do, eigendecompositions, singular value decompositions. There's a whole field where people have made this efficient for sparse matrices, and it's something everybody needs to know about, like, everyone, because if you do any kind of computation you need to know about this: again, there will be surprising statements we'll make about what you can solve and not solve. Okay. If A is sparse, the typical factorization is going to have four factors: A = P₁LUP₂, where P₁ is a permutation in front, P₂ is a permutation in back, and L and U are lower and upper triangular. The idea here is that mathematically you don't need P₂. You do need P₁: it is false that any nonsingular matrix can be factored as LU, so you do need that initial permutation, but you don't need the second one. The second one is there (and, by the way, P₁ is also chosen this way) so that when you carry out the LU factorization, L and U will also be sparse. Because if you're not careful about doing this, your matrix A will be sparse but L and U will be dense. When that happens, it's referred to as fill-in: you have lots of entries in L and U that weren't there in A.

This is very important for everybody to know. Here's an example. It's extremely common to work all the time with, say, a million-by-million sparse matrix. It is simply not a problem; I can do stuff like that on my laptop. The number of nonzeros is not going to be 10¹², which is what a dense one would have; it's going to be something reasonable, like 20, 30, 100 nonzeros per row. Even that is not that big a deal: if there are 100 nonzeros per row, that's 100 million coefficients, and you multiply that by about 10 bytes, so even in doubles it's like a gig, which is a reasonably small amount of memory.

So what happens is, you have your million-by-million matrix and you call a solver like this, and if you choose P₁ and P₂ poorly, then L will be something like a dense lower triangular matrix. You can't even store it, because it's a million by million: that's on the order of 10¹² divided by two entries. You can't; let's just leave it that way. Making sense? So P₁ and P₂ are chosen heuristically, and people have been doing this since the '60s; it's actually very cool. You can ask questions like: find me the P₁ and P₂ that minimize the number of nonzeros in L and U. That's a perfectly nice problem to come up with, and it is NP-hard. But who cares; it's silly, because there are unbelievably good heuristics for doing this, heuristics that all of us benefit from. So this is great.
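To make this concrete, here is a minimal sketch using SciPy's sparse LU, `scipy.sparse.linalg.splu`, which applies fill-reducing permutations for you (the tridiagonal example matrix is my own, not one from the lecture):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# A large sparse system with about 3 nonzeros per row.
n = 100_000
main = 4.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csc")

lu = splu(A)          # sparse LU; fill-reducing column permutation chosen heuristically
b = np.ones(n)
x = lu.solve(b)       # forward/backward substitution with the sparse factors
assert np.allclose(A @ x, b)

# The factors stay sparse: nnz(L) + nnz(U) is a small multiple of nnz(A).
print(A.nnz, lu.L.nnz + lu.U.nnz)
```

Because the factors stay sparse, both the factorization and each subsequent solve are fast even at sizes where a dense factorization would be hopeless.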

Okay. So, with sparse matrices, the factor and solve times, whatever they are, are less than the (2/3)n³ flops of the dense case. I'm going to give you a very important special case right now, and then we'll encounter others. Here's a super important special case: the so-called banded matrix. A banded matrix looks like this: there's a band along the diagonal, maybe K entries wide, that's nonzero, and everything outside the band is zero. That's a banded matrix. I can store a banded matrix on my laptop easily if it's a million by million and the bandwidth is, say, 10 or 20. So everybody, that's a banded matrix. No? That's fine.

It turns out (this is a case where you don't need P₁ and P₂) that if you simply do the LU factorization here, the L and the U are going to be basically the same: they're also going to be banded. I'll just write down L: L is lower triangular, so it simply looks like a band of width roughly K/2 below the diagonal; that's what L looks like, and U looks similar. By the way, sometimes a banded matrix comes to you looking like that, but sometimes it's been permuted by whoever formed the problem: they decide, "I'm going to list all my states first, then all my controls," in an optimal control problem, say, and it comes out with some crazy sparsity pattern. Any of the methods used to heuristically choose P₁ and P₂ will permute it back to look banded. So, what happens in a banded factorization: first of all, the factorization time is order nK², and then the back solve is order nK. These are stunningly smaller than the dense n³ and n². So what this tells you is that you can solve a million-by-million set of equations, if the bandwidth is, say, 10, super easily, and it's going to be so fast it's not even funny. Everybody following this? Which is kind of shocking, because if someone said, "What are you doing?" and you say, "I'm solving a set of a million equations in a million variables, where each equation involves 10 of the variables," you'd go, "Whoa, that sounds like it's hard." The answer is it's unbelievable: this is a millisecond-scale operation. Everyone follow? It's just completely insane. But these are things you absolutely need to know, because this is the trick to many, many things.
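As an illustration of how cheap this is, here's a minimal sketch using SciPy's banded solver, `scipy.linalg.solve_banded`; the pentadiagonal system is made up for the example:

```python
import numpy as np
import scipy.sparse as sp
from scipy.linalg import solve_banded

# A pentadiagonal system: bandwidth 2 above and 2 below the diagonal.
n = 1_000_000
ab = np.zeros((5, n))            # diagonal-ordered form: row i holds diagonal (2 - i)
ab[0, 2:] = -1.0                 # second superdiagonal
ab[1, 1:] = -1.0                 # first superdiagonal
ab[2, :] = 8.0                   # main diagonal (diagonally dominant, so nonsingular)
ab[3, :-1] = -1.0                # first subdiagonal
ab[4, :-2] = -1.0                # second subdiagonal

b = np.ones(n)
x = solve_banded((2, 2), ab, b)  # O(n K^2) factorization plus O(n K) solve

# Check the residual against the same matrix assembled in sparse form.
A = sp.diags([-np.ones(n - 2), -np.ones(n - 1), 8.0 * np.ones(n),
              -np.ones(n - 1), -np.ones(n - 2)], offsets=[2, 1, 0, -1, -2])
assert np.allclose(A @ x, b)
```

The solver never forms the full matrix; storage and work both scale like n times the bandwidth, which is why a million-variable banded system like this solves almost instantly.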

Okay. By the way, what's interesting here is that the factor-to-solve ratio is not that high anymore: now it's K, where it used to be n. So that says, if I ask how long it takes to solve a million-by-million banded system with a bandwidth of 10, it's some number which is shockingly small; then you say, how about 10 of them, how about two of them, and the answer is it's only a little tiny bit more. Everybody following this? Okay. So, by the way, what you'll be doing in the next week or two is you will start to recognize matrix structure, like sparsity; you're going to see some others later today, like block arrow and weird other ones, and you will start recognizing them. They're super important because of two things. Number one, you'll start recognizing structures that you can exploit to solve extremely fast. But number two, on the other end, you're going to start recognizing practical problems that will result in matrices with that structure. I'm going to mention one right now: all of control, all of signal processing, just looks like that, every last bit of it. So that means you've solved little control problems and things like that, but what this says is, if someone says, "I have to solve a control problem with 20,000 time steps and a state dimension of 20 and 10 actuators," and someone says, "Wow, that sounds like it's going to be complicated," you just laugh and say, "Are you kidding? I can do that on my phone." And of course that's the key to actually putting these things into a lot of control systems; a lot of them actually do this. Okay, that's the big picture. This is just super cool to know about.

Okay, this brings us to a very famous factorization; it's about 100 years old. It's the Cholesky factorization. If you have a positive definite matrix, you can write it as just A = LLᵀ, and if you insist that the diagonal entries of L are positive, then it's unique. There's no choice: there's just one factorization, period; that's it, there's one way to write A as LLᵀ. And how do you solve Ax = b where A is positive definite (that's called a positive definite system of equations)? You do a Cholesky factorization, then you do a forward and a backward substitution with L and Lᵀ. Okay, so that's it.
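In code, the Cholesky factor-then-substitute recipe might look like this (a minimal sketch with SciPy's `cho_factor`/`cho_solve`; the random positive definite matrix is just for illustration):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n = 500
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)      # symmetric and positive definite by construction

c, low = cho_factor(A)           # Cholesky factorization: roughly (1/3) n^3 flops
b = rng.standard_normal(n)
x = cho_solve((c, low), b)       # forward + backward substitution, O(n^2)
assert np.allclose(A @ x, b)
```

As with LU, `cho_factor` can be called once and `cho_solve` reused for many right-hand sides.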

Actually, it's (1/3)n³ flops, as opposed to (2/3)n³ for general matrices, and that kind of makes sense. First, number one, no one cares about a factor of two in a flop count estimate, because these estimates are so crude. What really matters is what the people who actually wrote the microcode and all that kind of stuff did: how it's blocked, how many cores it's using, all sorts of crazy internal stuff that most of us (all of us, I'm guessing) should not and do not have to worry about. But we should all be very grateful that there are people who do. So, it's a factor of two, which is not surprising: the matrix is symmetric, so roughly speaking there's a redundancy of two. Or you can even look at this and say, hey, that's an LU factorization with U = Lᵀ, and why? Because the matrix is symmetric, that's why. So anyway, not too surprising.

We will see very shortly where positive definite systems of equations arise. Actually, they arise in a lot of applications, but specifically in convex optimization we'll see they're at the heart of the stuff we do. Okay, now we get to the sparse Cholesky factorization, and here I'll tell you something super interesting, I think. It turns out that factoring A = PLLᵀPᵀ is basically the same as calling a Cholesky factorization on PᵀAP: take that equation and just multiply on the left by Pᵀ and on the right by P. So what it basically says is: permute your positive definite matrix, and then do a Cholesky factorization, where there is no creativity; there's just a unique Cholesky factorization. Everybody got this? When you do that, you will find out some stunning things. You will find out that if you choose a good permutation, the L is going to be super sparse; if you choose a bad permutation, it's going to be dense.

I'll give you a sort of famous example. It would be something like this: suppose I gave you a matrix where (I'll just draw it this way) it's got a dense first row and column, and a diagonal. By the way, that's called an arrow matrix, because if you blur your eyes it looks like an arrow or something; this one is an upper arrow matrix. Everybody got it? It's super sparse: the number of nonzeros per row and column, except for the first row and column, is like two. So that's an example of a structure you should never ever see again and not have an emotional response to, because this is super important. Okay.

Now, if I do a Cholesky factorization on that, here's the sparsity pattern of L, ready? It's completely dense. Completely. This, of course, ruins everything: for one thing, you can't even store that much, because if you started with a million-by-million upper arrow matrix and called Cholesky on it, at some point your OS would say, "No, I am not going to allocate that amount of memory." Everybody got this? Okay. Now, it turns out all you need to do is permute the first and last rows and columns, and then you end up with a matrix with the arrow pointing the other way; that's also called an arrow structure (we'll get to others, but that's one). If I do a Cholesky factorization on this one, the sparsity pattern of L is just the lower half of the arrow, and that's all: no fill-in. So what it means is that the permutation that permuted this into this was super important, at least from a sparsity point of view. Everybody got it? So what this means is, if you had that set of equations to solve, and you just called a sparse solver, you wouldn't even know it, but on your behalf the permutation would change it to this form and factor that, and then your factor time is going to be order n, and your solve time too. Everybody following this? So that's it.
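You can see this fill-in effect directly. Here is a small sketch (my own construction, with n = 8 so the counts are easy to read) comparing the Cholesky factor of an upper arrow matrix before and after the permutation:

```python
import numpy as np

# An "upper arrow" positive definite matrix: dense first row/column, then diagonal.
n = 8
A = float(n) * np.eye(n)
A[0, :] = 1.0
A[:, 0] = 1.0
A[0, 0] = float(n)               # strictly diagonally dominant, hence positive definite

# Cholesky in the given order: the dense first column fills in everything below it.
L_bad = np.linalg.cholesky(A)

# Permute the arrow to point the other way: move row/column 0 to the end.
p = np.r_[1:n, 0]
Ap = A[np.ix_(p, p)]
L_good = np.linalg.cholesky(Ap)  # no fill-in: as sparse as the lower half of Ap

print(np.count_nonzero(L_bad), np.count_nonzero(L_good))
```

Without the permutation, L has all n(n+1)/2 lower-triangular entries nonzero; with the arrow permuted to point the other way, L has only 2n − 1 nonzeros, the same sparsity as the lower half of the matrix itself.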

Fortunately, for a lot of these things you don't have to worry about it; it's all baked inside. All of you have benefited from this probably a million times just so far in this class, because every time CVXPY calls a solver, this is the kind of stuff that's happening on your behalf, many, many times, every time you call solve on something. It doesn't matter, I mean, but now you know. Okay, does this make sense? Yeah? "How much time does it take for the solver to find that P?" That's a perfect question; I should have said it.

The question is how long it takes to find the permutation. Yeah: if I say, "Oh man, are you kidding? I can do this for you, not a problem; I'm going to find a permutation like that, and then I can solve this in a tenth of a second," and you go, "Cool, how long did it take you to find the permutation?" and I go, "A couple of days," that's just not useful. So the general rule of thumb is that people use heuristics for finding P that are, they hope, on the order of magnitude of the amount of time it takes to actually do the factorization. That's what they do. Now, there are applications where people will take more time to work out a better permutation, because they're going to use it a gazillion times. This would be the case for embedded systems: if you're doing control and you're going to put this into, say, a rocket or something like that, and it's going to run at 100 Hz for the entire mission or whatever, that's a lot of run time; it's worth it. You could definitely take an evening to come up with a better permutation. But this is not what most people doing numerics, who work on the solution of sparse linear equations, do; they just stick with heuristics that are fast. And by the way, that portion is sometimes called the symbolic solve, or the permutation phase; there are lots of names for it. And you can tell: if you actually look at what it's doing, you will find that it spends a whole bunch of time doing graph calculations; there are no floating-point numbers involved. Then, when floating-point operations start up, that means it's actually doing the factorization. That's a good question.

Okay. Now, I realize that for many of you this is the first time you've seen this; that's fine, you should just get the high-level picture of it, and you can read a little bit more in the appendix in the book; there's an appendix on this. This is just something you should know about. I don't know, you could poke around in SciPy or something. Once you know about this, you can't unknow it, and you will have, like, incredible powers. Yes?

Well, on the heuristics: some of them are kind of greedy. I'd have to tell you how you actually do the factorization to actually explain them, but a lot of them are constructed one step at a time. What they do is they say: P is going to permute something to the top row, so I'm going to find which entry I should permute to the top, then who gets permuted to the second position, then third, then fourth, things like that. So it's kind of greedy, but those are the kinds of things people use; they have lots of names, and it's worth knowing a little bit about them. Yeah? The question is: so it's NP-hard to find the best one, but are there approximation guarantees? No, there are no approximation guarantees whatsoever; no one has any. It's one of those things where, of course, we can ask; these are all graph calculations, these are just questions about graphs. Oh, there is one exception; there's one piece of theory, and it's this: if the graph represented by the sparsity pattern of A is chordal, then it's guaranteed that there exists a permutation which has zero fill-in, meaning the number of nonzeros in L will be identical to the number of nonzeros in A divided by two. So if you know what a chordal graph is, that's one theorem. It's actually kind of useless in practice, frankly, because the real question is not whether there exists such a permutation; the question is whether our heuristics find it, and the answer is that all decent heuristics will find that one.

Yeah? The question is that each of these factorizations has required knowing the matrix is nonsingular, or that it's positive definite. No; it turns out that, in this one, to find the permutation for the Cholesky factorization, you actually don't have to know the entries of A; you simply have to know their locations. That's actually a super important, more advanced point, but you brought it up. What this means is that we can do this ahead of time. For example, if we're solving an optimal control problem or something like that, we can do the permutation once and for all, offline, and then when the thing is actually running a jet engine or something like that, we don't have to do it. If we're doing finance and we want to do an insanely fast backtest, we'd first do the factorization, and then after that every solve is going to be really fast. So that's how that works. Yeah, so you don't have to know: if A is not positive definite and you call a Cholesky factorization on it, it will fail at some point; at some point it's going to attempt to take the square root of a negative number, and then it'll throw an exception of some kind. But that's the only time you would ever know.

28:57when when uh does the solver no stop

29:00like what kind of Criterion oh s all of

29:02these are finite algorithms right and an

29:05infinite Precision they just solve right

29:08and when I say it's order n cubed it

29:09means I you know I will I can tell you

29:12for this you're going to do you know 1

29:14billion you know blah blah blah blah

29:15blah multiplies know 1.3 billion ads

29:19when it's done if okay if you did every

29:22that everything in infinite Precision

29:23you would have the answer end of story

29:25there's no stopping Criterion there's

29:27nothing like that okay now of course

29:29you're doing this in finite

29:31precision, right? So, since you ask, there

29:35are methods — this is called iterative

29:39refinement — and what it does is

29:41you compute a solution that's

29:44approximate and then you can actually do

29:46you can iterate to get a better solution

29:49and people know about that but that's a

29:50little bit beyond what we're talking

29:52about here but it's good to know about
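[Editor's note: the two ideas just mentioned — factor once and reuse the factorization for every later solve, plus one step of iterative refinement — can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration with made-up data, not anything from the lecture's slides:]

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Made-up positive definite system; in practice A comes from your application.
rng = np.random.default_rng(0)
n = 200
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)          # positive definite by construction
B = rng.standard_normal((n, 5))      # five right-hand sides

# Factor once (O(n^3)); every later solve reuses the factor (O(n^2)).
factor = cho_factor(A)
X = cho_solve(factor, B)

# One step of iterative refinement: solve for the residual, then correct.
R = B - A @ X                        # residual, nonzero in finite precision
X = X + cho_solve(factor, R)

residual = np.linalg.norm(B - A @ X)
```

[In an online setting — the jet engine, or the back test — only the `cho_solve` calls happen in the loop; the factorization was done once, offline.]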

29:56um, okay. So, the LDL transpose

30:03factorization: this works for any

30:06nonsingular symmetric matrix. Right — by

30:09the way we'll see examples of where

30:10these come up like maybe next week in

30:12fact. So here it's like

30:17an LDL transpose factorization, but D

30:20is diagonal with either 1 x 1 or 2

30:22x 2 blocks — so that's just a

30:23mathematical fact that you might need to

30:25have these 2 by two blocks uh but the

30:27point is those are all, you know,

30:29easily inverted matrices: permutations,

30:31yes; lower triangular, yes; a block

30:34diagonal matrix with 1 x 1 and 2 x

30:362 blocks — I think we know how to invert

30:38that that one we actually literally know

30:40how to invert um but we can certainly

30:42solve equations like that right so okay
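[Editor's note: an aside not from the lecture — SciPy exposes exactly this factorization as `scipy.linalg.ldl`, and it succeeds on symmetric indefinite matrices where Cholesky fails. A quick sketch:]

```python
import numpy as np
from scipy.linalg import ldl

# A symmetric but indefinite (and nonsingular) matrix: its trace is zero,
# so it has both positive and negative eigenvalues.
A = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 3.0],
              [2.0, 3.0, 0.0]])

# Cholesky fails, since A is not positive definite.
try:
    np.linalg.cholesky(A)
    cholesky_ok = True
except np.linalg.LinAlgError:
    cholesky_ok = False

# LDL^T with 1x1 / 2x2 pivot blocks goes through fine.
L, D, perm = ldl(A)
reconstruction_error = np.linalg.norm(L @ D @ L.T - A)
```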

30:46um so this is this is the LDL transpose

30:49factorization um and now we're going to

30:53talk about something this is something

30:54probably many of you have seen in other

30:56contexts it comes up in pretty much all

31:00fields. So, it's something called the

31:02Schur complement — that's the

31:05name of it — and I think you've

31:07already seen it in you know early on in

31:09the class or something like that but I

31:11can't think of a legitimate field where

31:13it doesn't come up right so in

31:15statistics it's conditioning a gaussian

31:17on some entries right in mechanics it's

31:20actually what happens when you impose

31:22the boundary conditions on something and

31:24ask where everything else is in

31:25electrical engineering you take a net

31:27and I I put a termination on some of the

31:29ports on the network and I ask what

31:31what's happening with the rest of them

31:32right so this I all Fields have this um

31:36okay and the idea is actually

31:38embarrassingly simple so it goes like

31:40this. Um, we're going to write Ax

31:45equals b; we're going to block A into

31:48blocks, and I'm going to

31:49have A11 and A22 be square. Right? Um,

31:54and I'll also conformally

31:57block x and b, so they're going to be

31:59called x1 stacked on x2 and b1 stacked

32:01on b2. Right, okay. Now — but

32:06nothing has happened: I

32:08just decided to name it this way, I

32:10just divided it in some weird way. Okay,

32:13uh what happens now is you do actually I

32:16mean you do kind of what you learned in

32:19maybe High School except those were

32:20numbers and not matrices um and so what

32:23you do is you just we just work out we

32:25can actually just solve this in a weird

32:27way so the first equation block equation

32:29is A11 x1 + A12 x2 = b1, right? And

32:36we're going to we're going to assume

32:37here we're going to make an assumption

32:38that A11 is invertible. Okay, then what

32:43that says is, well, I can take b1 minus

32:49A12 x2 — that's that, right —

32:53and then I'll just put an A11 inverse

32:57in front: x1 = A11 inverse (b1 - A12 x2). Okay. Um,

33:03yeah by the way already when you see

33:05this this is hinting that this is going

33:06to be interesting when a11 is easy to

33:09invert so that that's the that's going

33:10to be the interesting part okay but

33:12we'll get there okay I plug this into

33:15the second equation and the second

33:17equation is A21 x1 + A22 x2 = b2,

33:26and I plug this this expression in here

33:29and I'm going to get something like you

know, A21 A11 inverse (b1 - A12 x2) plus

33:39A22 x2 = b2. Right — makes sense. Okay,

33:46and then I just I rearrange that to look

33:50like this equation here right I just put

33:53and and that's another equation in in uh

33:56x2. Um, this matrix here is famous:

33:58it's called the Schur complement of

34:01the original matrix — I can't remember if

34:03it's with respect to the one-one or the

34:05two-two block, but it's one of those two and

34:07I don't remember which. It doesn't

34:08matter — it's a Schur complement. Okay. Um,

34:11and you've probably seen it before or

34:13something in some yeah anyway that's it

34:17okay okay um okay so uh you could use

34:22this these calculations to solve the

34:24original thing and that generic

34:27algorithm looks like this it says first

34:29you should form A11 inverse A12 — I mean,

34:31because that's this thing, right — and A11

34:34inverse b1. Right. Oh, by the way, um, so

34:38let's talk about how we would do that so

34:41the way you would do that is you'd factor

34:44A11; okay, and to compute this you would do a

34:49solve right because you factored it and

34:51now you do a solve to calculate this you

34:54do a solve on each column of A12 —

34:57everybody got that right so this could

35:00be — that could end up being

35:02quite efficient,

35:04right so so the way you know this if I

35:08if I were to double click on that and

35:09expand it it would look like this Factor

35:11a11 then solve with B1 and solve with

35:16each column of a12 right that gives me

35:21matrices okay uh next step is I form s

35:24that's the sure complement which is this

35:27um, and I form this b-tilde thing over here,

35:31which is the right-hand side of this

35:33equation. Right — so this equation up here

35:34is S x2 = b-tilde, and then what I

35:38do is find x2: what I do here

35:41is a solve — I do a factor on S, then a

35:45back solve — and then I can reassemble:

35:47this gives me X2 directly and I can get

35:49X1 from like this equation here and by

35:52the way, when I solve this equation, I

35:54already factored A11, so this is

35:56just a solve. Okay, right.

36:01So that's the idea.
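[Editor's note: the generic block-elimination algorithm just described can be written down directly. Here is a small NumPy/SciPy sketch with made-up blocks; the `lu_factor`/`lu_solve` pair plays the role of "factor once, then solve":]

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Made-up blocked system A x = b with A11 invertible.
rng = np.random.default_rng(1)
n1, n2 = 6, 3
n = n1 + n2
A = rng.standard_normal((n, n)) + n * np.eye(n)   # keeps A11 well conditioned
b = rng.standard_normal(n)
A11, A12 = A[:n1, :n1], A[:n1, n1:]
A21, A22 = A[n1:, :n1], A[n1:, n1:]
b1, b2 = b[:n1], b[n1:]

# Step 1: factor A11 once; solve with b1 and with each column of A12.
f11 = lu_factor(A11)
A11inv_b1 = lu_solve(f11, b1)
A11inv_A12 = lu_solve(f11, A12)

# Step 2: form the Schur complement S and the reduced right-hand side.
S = A22 - A21 @ A11inv_A12
b_tilde = b2 - A21 @ A11inv_b1

# Step 3: solve the small system S x2 = b_tilde, then back out x1.
x2 = np.linalg.solve(S, b_tilde)
x1 = A11inv_b1 - A11inv_A12 @ x2
x = np.concatenate([x1, x2])
```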

36:03Okay — now, drum roll, to find out: is this

36:06faster, is it better? So:

36:13in step one, f is going to

36:17be the factor cost for A11 and s is the

36:20solve cost. So in the first case you

36:24factor A11 once and then you solve for

36:27all the columns of A12, and so that's n2

36:31times s — so that's the cost of step one.

36:34Step two — um, here, that's actually

36:37just forming the

36:41Schur complement matrix, and what we're

36:42going to do is we're

36:43going to imagine that

36:45we're not going to exploit any structure

36:47in the Schur complement — although one could,

36:49but we're not going to; we're just going

36:50to treat it as dense. Um, and then step

36:52three is going to be order n2 cubed, because

36:55that's just to factor

36:57and solve it. Okay — and so the total is

37:00this number up here: it's f plus n2 s plus

37:02blah blah okay all right so let's just

37:06throw in a general A11 and see what

37:08happens. Um, when you do — I mean, hardly

37:10surprising — but actually, down to the flop,

37:13it's identical. Congratulations:

37:15you've done a bunch of

37:17algebra and analysis and it did

37:19absolutely nothing. Okay, so —

37:22everybody got that? Oh, by the way, it does

37:25nothing in terms of flop count; in

37:27reality it does do things. Right — so in

37:29fact, people use this sort of

37:31block elimination method — they actually use

37:33it for dense matrices — and the reason

37:36is you would choose n1 and n2 so that

37:38one of these solves could be done in

37:42cache, right, and therefore done quickly.

37:45so so so that actually happens but again

37:47all of us are blissfully unaware of

37:49those details but so okay um now it gets

37:53interesting it's interesting precisely

37:55when f is small compared to n1 cubed.

37:59So here's a great

38:02example: A11 diagonal. The factor

38:06cost for a diagonal matrix is zero — I mean,

38:08whatever, or n if you want, it

38:10doesn't matter — there, L is going

38:13to be the matrix itself and U will be

38:15the identity, so it's zero. Okay.

38:18Um, so here it's actually really

38:22interesting: you get 2 n2^2 n1

38:26plus (2/3) n2^3. What's amazing about that is

38:28that's linear in n1. So now I'm going to

38:30show you some

38:34patterns. And here it is — one, actually;

38:37I was just showing you one I had one

38:39right there uh here's one remember this

38:46one — that's this downward arrow pattern.

38:49Um, someone suggest a blocking of it: like,

38:53what would be A11? Do you see a leading

38:55matrix here that is easy to

38:58invert? Diagonal — yeah. So here's —

39:02here's A11; it looks like this,

39:04right — it just looks like that.

39:07So this is A11: diagonal. You know,

39:12actually, I was going to say matrices

39:13don't come easier to invert than

39:15diagonal — that's false; here's one that's

39:17easier: a multiple of the

39:19identity. That's pretty

39:22straightforward — actually, I think in

39:23terms of flops it's the same anyway, so

39:25it doesn't even matter. But anyway, here

39:26it is um okay so here's one where

39:30instead of this, n1 is — I don't know,

39:33let's suppose the original thing was n —

39:34n minus one, and n2 is one, because

39:37this thing is one by one,

39:39right down here. Um, and so in this case

39:43it turns out you're going to get a

39:45stunning reduction, right, because

39:47you can

39:49invert this fast, and in that case you

39:51would get exactly this number over here:

39:53you would get, you know, 2 n2^2 n1. Right —

39:57so okay that's that's one example but

40:00I'll show you I'll show you some others

40:01Right. Um, what it means is: what

40:05if you saw a sparsity

40:08pattern that looked like

40:10this, right? This is banded, and then

40:16over here these are dense —

40:18these are called dense rows and columns.

40:20Everybody got that? So actually it's just

40:22a block arrow matrix, but with a

40:26thicker weight on the things, because

40:28they're wide, right? That's banded,

40:30and then this is a whole

40:32chunk of stuff over here. Right — everybody

40:34got that? So what this says is that

40:38this block elimination method here —

40:41well, it's actually insane: it's

40:44actually going to be linear in the

40:46dimension. If this width

40:49doesn't grow this way, and if the

40:52bandwidth doesn't change, this gets

40:54bigger and bigger and we can still solve it,

40:56right? So, anyway — it's pretty cool.

41:00so uh and that means if you see that you

41:04should smile — we'll put it

41:06that way. Or — you're probably not going to see

41:08that, because someone would

41:11already have to know. But the way it's

41:13really going to work is: someone's going

41:14to be talking about a problem it's going

41:17to be some weird, complicated problem — who

41:18knows what field; it doesn't matter — they're

41:20going to be talking about some

41:20complicated thing, and while they're talking,

41:23you will be forming in your mind what the

41:25sparsity pattern of the associated

41:27matrices looks like. Okay? When that image

41:30comes to mind you will smile and you

41:33will say: I can solve that problem in

41:37half a second. And they'll say: I can't even

41:40load it into memory. And you'd say: just

41:44chill, and I'll show you

41:46what we're going to do. Right? Everybody

41:48following this? So, I mean, these are

41:50incredibly useful things right so and

41:52you know, we can think about what this is:

41:55this is basically a set of equations

41:57where banded means that equation, you

42:01know, 50 only references up to, let's say,

42:0760. Okay — and that would be here. Then

42:11you say: but there are some special

42:12variables here which affect

42:14everybody. Okay, so that's

42:17that's what this is so you're going to

42:18want to be making those uh yeah so yeah

42:22to be fully developed I would say here's

42:24what you want you want to be able to get

42:27a feel for when applications involve

42:30matrices that look like

42:32that then you're going to be happy

42:34because you're going to know something

42:35other people in the room don't because

42:37you're going to know that you can solve

42:38that in linear time right so that's why

42:42people would say, like: well, I have to

42:44find this whole trajectory — it's,

42:45you know, 100,000 time

42:47points — and you're like: so? I

42:50mean, do you have a phone, or whatever?

42:52It's just not going to be that

42:53complicated, right — again, if you know what

42:54you're doing. Okay. So, okay.
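[Editor's note: the payoff for the block-arrow pattern just described is easy to demonstrate. Here is a hedged sketch of the simplest case — A11 diagonal rather than banded, with made-up data — where every "solve" with A11 is just an elementwise division, so the whole computation is linear in n1:]

```python
import numpy as np

# Arrow system: diagonal A11 (n1 x n1) bordered by n2 dense rows/columns.
rng = np.random.default_rng(2)
n1, n2 = 100_000, 5
d = 1.0 + rng.random(n1)                  # diagonal of A11
A12 = rng.standard_normal((n1, n2))
A21 = rng.standard_normal((n2, n1))
A22 = rng.standard_normal((n2, n2))
b1 = rng.standard_normal(n1)
b2 = rng.standard_normal(n2)

# Block elimination: "solves" with A11 are just divisions, so this is O(n1).
A11inv_b1 = b1 / d
A11inv_A12 = A12 / d[:, None]
S = A22 - A21 @ A11inv_A12                # small n2 x n2 Schur complement
x2 = np.linalg.solve(S, b2 - A21 @ A11inv_b1)
x1 = A11inv_b1 - A11inv_A12 @ x2
```

[Storing the full matrix densely would take roughly 80 GB at this size; the pieces above are a few megabytes, and the solve runs in a fraction of a second.]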

43:00Now this leads us to — and this,

43:04probably, people have seen; it's got lots

43:06of names, depending on where you're from.

43:08I think it's called, like, Sherman

43:11Morrison Woodbury; sometimes it's called

43:13the Woodbury thing; sometimes it's called

43:14the matrix inversion

43:16lemma. And I have a friend who can

43:19actually tell where someone learned

43:21their linear algebra because they use

43:24different terms and they they would they

43:26even use different terms I believe at

43:27Oxford and Cambridge just just so you

43:29know so this is like someone can just by

43:31asking someone what they call this and a

43:33couple of other questions you can figure

43:35out that they got their PhD at Oxford or

43:37Cambridge or something or Harvard or

43:38whatever anyway or MIT or something okay

43:41So here it is: you start with (A plus

43:45B C) times x = b. Right — so that's your

43:50system. In fact, some people

43:52call this a low-rank perturbation

43:56of A — I mean, that's not fair, because no

43:58one said that B and C are small, right?

44:01But if B had

44:04just a handful of columns and C had a

44:05handful of rows, this would indeed be

44:08what people would call a

44:09low-rank update to A. Okay — so we want to

44:13solve a system that looks like that. So,

44:15in fact, an extreme case would be things

44:17like — here's one you hear all the

44:19time: diagonal plus low rank. There, A is

44:24diagonal and B and C are small, right? So

44:27B has got, you know, a small number of

44:29columns, C has got a small number of rows —

44:32that's super common. That's also

44:34something you will I would hope never

44:36ever forget diagonal plus low rank and

44:39it's it's all related um here's how it's

44:43related um yeah actually let's just talk

44:45about this in the case where A is

44:47diagonal and B and C, you

44:49know, have a modest number of columns,

44:51right. So, in fact —

44:56basically all of quantitative finance

44:59uses a risk model where the covariance

45:01matrix is diagonal plus low rank, and

45:04it's called a factor model, right — with C equal to B

45:08transposed. And if this were, like, a

45:12Gaussian or something — if this is

45:14the covariance of a Gaussian — if you

45:15understand what I'm saying that's fine

45:16if you don't that's also cool but this

45:18is just completely Universal right so

45:20What it would mean — it's

45:22even got a Bayesian model — is it basically

45:24says that the returns of all

45:27these assets are driven by, let's say,

45:31B and C. Typically B is about a

45:33width of 100 — that would be very typical;

45:36C would have a height of — well, C is B

45:38transposed, right. So you'd say: look, I

45:41don't know what these

45:438,000 returns are doing but they're

45:46mostly driven by these 100 factors right

45:51which are, actually —

45:53that's the underlying thing, that's

45:55what's actually driving their returns.

45:57And then the A is diagonal, and

45:59they have a beautiful name for it: it's

46:00called idiosyncratic. So that's the

46:02idiosyncratic risk — it's the amount by

46:04which, you know, Google, or

46:08Coca-Cola, or

46:10Chevron, goes up and down all by itself. In fact, the way they would say it —

46:13this is dialect — is: that is the

46:16part of the variance not explained by

46:18the factors. Everyone got that? So that's

46:20it. So — the factors are actually kind

46:22of interesting; sorry, this is just a

46:23weird aside, but since we're here:

46:26the factors are interesting. Some are

46:28handcrafted, so they're literally

46:31matrices with zeros and ones and weird

46:33stuff like that — like: what

46:35region are you

46:38in, right? Are you a

46:41pharmaceuticals company? Do you

46:43do defense? Do you do that

46:45kind of thing? Others are actually

46:47crafted by people using

46:48numerical algorithms. So, okay, that was

46:50just a small story about this okay um oh

46:55If I have something that

46:56looks like that — A is diagonal and B and

46:58C are small — I can store that, no

47:00problem, right? That's true even for

47:02very big ones. Like, say you

47:05want to run a fund that is looking at a

47:07universe — I'm using dialect — of, let's say,

47:10100,000 assets, right, across the whole

47:13world; I know people who do that.

47:15So you do that. Um — so the covariance

47:18matrix is a 100,000 by 100,000 matrix, or

47:20something like that. You might ask someone:

47:22well, what's your risk model? And

47:24they would say: oh, I have a 200-factor

47:28model. And that means their covariance

47:31matrix is expressed as a diagonal matrix —

47:34which is, whatever it is, 100,000 by

47:38100,000 — and then they have 200 factors,

47:42and all of that is completely

47:43storable and stuff like that. And you'll

47:45see — in fact, not only can you store it, you

47:46can actually solve problems super fast

47:48and things like that. Okay — right, that was

47:50all a weird aside, but let's keep going.

47:52So now, what you do here

47:54is something that is actually a little

47:56bit unintuitive: you

48:01un-eliminate. We did this before, in duality,

48:04and it turned out well there. To

48:06un-eliminate means to introduce new

48:08variables, right? To eliminate

48:10variables means to, you know, solve for

48:13them, get rid of them, and then substitute

48:14their values back into the rest — that's

48:17elimination, right? That sounds good,

48:20especially to an amateur, or somebody

48:21first seeing this material, because, you

48:24know — someone asks: what are you doing? —

48:25I'm eliminating some variables. Why? Because I

48:27like to solve smaller sets of equations.

48:29I mean, to everyone that

48:31makes perfect sense, right? It's

48:33completely and totally wrong — that's not

48:36how you should be doing linear algebra —

48:38but it does sound good, and people can

48:39fall for it. Right? Okay: un-elimination

48:42is the opposite you say thank you for

48:44your set of equations if you don't mind

48:46I'm going to introduce new variables and

48:47equations and make it a bigger one and

48:49someone's like what why would you do

48:52that everyone see what I'm saying you'd

48:54say thank you for your you know thousand

48:56by thousand set of equations if you

48:57don't mind I'm going to expand it to a

48:5910,000 by 10,000 set of equations, if it's okay

49:01with you right and they would say like

49:04why anyway and the answer is you'll

49:07you'll see why in a minute right so we

49:09write this this way: we just introduce a

49:10new variable y, which is C x,

49:13right — and I write that second

49:15equation as this bottom row, which is

49:18C x minus y = 0. Right — so the

49:22second system of equations is bigger and

49:26uh completely equivalent to the first

49:27one right okay now let's talk about this
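[Editor's note: the un-elimination step can be sanity-checked numerically — build the bigger bordered system and confirm it gives the same x as the original one. A minimal sketch with made-up small dimensions:]

```python
import numpy as np

# Original system: (A + B C) x = b.  Un-eliminate by introducing y = C x:
#   [ A   B ] [x]   [b]
#   [ C  -I ] [y] = [0]
rng = np.random.default_rng(3)
n, p = 8, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)
B = rng.standard_normal((n, p))
C = rng.standard_normal((p, n))
b = rng.standard_normal(n)

M = np.block([[A, B],
              [C, -np.eye(p)]])
z = np.linalg.solve(M, np.concatenate([b, np.zeros(p)]))
x, y = z[:n], z[n:]

x_direct = np.linalg.solve(A + B @ C, b)   # same answer, smaller system
```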

49:32um when you see a matrix like that it

49:35you just encountered all this so far

49:38just a few minutes ago so I don't expect

49:40you to have this response now but

49:42hopefully within a week or two you

49:45should have the following response you

49:47will see a block there — a minus I — and

49:50your eye will go to it, because it's

49:53really easily inverted and you're going

49:55to see a block, and you're going to —

49:57well, you're going to have an urge to

49:59do block elimination on this, right?

50:02And so what you would do is

50:04you'd say: oh, I know what I'm

50:05going to do — I'm going to permute the

50:07minus I up to the top, and then

50:09I'm going to use block elimination from

50:10the previous stuff. Right? Everybody got

50:13that? And then you eliminate it, and you

50:15will end up forming the Schur

50:18complement equations, and it will be (A

50:20plus B C) x = b — so you'll be right back where you

50:23started. Okay. So here's what you do

50:26instead: you resist the urge here to

50:31eliminate the minus I — even though it is

50:33a very attractive target for elimination,

50:35because it's easily inverted, right? You

50:38resist that urge, and instead you eliminate

50:41using the A block, and you will end up with a Schur

50:43complement equation that looks like

50:46this. Okay — um, and this allows you to get

50:50y: here you solve for y, then

50:53you plug it back in and get x,

50:55and you end up with a formula that's

50:57called the matrix inversion lemma — that's

50:58the neutral version. It says that (A

51:01plus B C) inverse is, you know, this

51:03horrible formula over here. Yeah — so

51:06that's it. Um, this gets interesting if,

51:09say, A is diagonal or something like

51:12that — anyway, a lot of things

51:15come out of that. So:

51:17how many people have seen like Matrix

51:19inversion Lemma or something before or

51:21or Sherman-Morrison-Woodbury — it's

51:25okay uh any other names for it I uh I

51:29used to know at least one or two others

51:31but anyway anybody else know another

51:34name for that it's a thing in linear

51:37algebra, and what you call it depends on

51:39where you came from so

51:42okay okay so I think actually I'll uh oh

51:45Yeah — well, here's an example: A

51:47diagonal, and B and C are dense. Um —

51:50you basically — you're

51:55going to form D = A plus B C. This is

51:59if you do it the dumb way, right: you just

52:01write it out as a dense matrix and

52:03solve. Um — in this case you get

52:06something that's order n cubed,

52:09and if you do it the right way you get

52:10something that's linear in n. Right? And

52:13these make a big difference.
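[Editor's note: the "right way" for diagonal plus low rank is exactly the matrix-inversion-lemma / un-elimination recipe — all the work is a p x p solve plus O(n p) arithmetic, so it is linear in n. A hedged NumPy sketch with made-up data, using the factor-model structure mentioned above (C equal to B transposed):]

```python
import numpy as np

# Solve (diag(d) + B C) x = b without ever forming the n x n matrix.
rng = np.random.default_rng(4)
n, p = 50_000, 10
d = 1.0 + rng.random(n)              # diagonal part (e.g. idiosyncratic risk)
B = rng.standard_normal((n, p))      # e.g. factor loadings
C = B.T                              # factor-model case: C is B transposed
b = rng.standard_normal(n)

# Matrix inversion lemma: eliminate x from the bordered system, leaving a
# small p x p Schur system in y = C x.
Dinv_b = b / d
Dinv_B = B / d[:, None]
y = np.linalg.solve(np.eye(p) + C @ Dinv_B, C @ Dinv_b)
x = Dinv_b - Dinv_B @ y
```

[The dumb way at this size would mean forming and factoring a 50,000-square dense matrix; here the only dense solve is 10 x 10.]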

52:16So yeah — by the way, a lot of this

52:19stuff subsumes things that were taught

52:21in the '60s, '70s, and '80s, and sometimes

52:25by people who learned it from people who

52:27learned it in the '60s and '70s. Okay — so,

52:30for example, if you teach a course on

52:32control — I've done this;

52:35I haven't done it for a long time, and I

52:36won't do it again — you go through a

52:38long derivation of what, like, the Kalman

52:40filter is, or how to solve an optimal

52:42control problem, right? And it's very

52:44clever, and you have little parts and

52:45this Bellman equation and blah blah blah,

52:47and you end up with

52:49stuff that actually works — it's great. Um —

52:52a very funny thing happens if you simply take

52:55the problem represented by an optimal

52:57control problem or a Kalman filter and

53:00just throw it at a sparse solver: guess

53:02what — it'll do it;

53:06it'll just do it. So basically all of those

53:07lectures were useless, is what I'm saying,

53:09as a practical matter — which is kind of

53:11embarrassing, but whatever. So I don't

53:13teach that anymore — I don't teach it

53:14that way anymore; I just say: you get a

53:17giant set of linear equations and you

53:21solve it using sparse methods. Done.

53:24That is the correct way to do

53:26it, and it gives you a lot of stuff. Right — so

53:29this happens in a lot of fields. Like — I

53:32think, I'm almost certain, in

53:33statistics there'd be many, many papers

53:35written on: if you had a factor model — so

53:38the covariance matrix is diagonal

53:40plus low rank — there would be all sorts

53:43of intensely clever

53:44ways to condition. Suppose I give

53:46you 80% of the entries of a vector

53:49and I said: please compute the

53:50conditional means of the

53:51remaining ones. Everybody follow this?

53:53That's like a conditional mean,

53:55right? So I'm sure there are whole papers

53:58written on this, and clever

54:00methods for doing it, and blah. Anyway —

54:02the answer is: you form the covariance

54:05matrix, literally write down the

54:07equations you want to solve, and then you

54:10do nothing but pass it to a sparse

54:12solver, and it'll be

54:15just as fancy — probably more general — than

54:18all those papers. So, okay — it's

54:20embarrassing but true. So, okay. Um — all

54:23right. Um — so I'm not going to

54:27go into this; there's a whole lot

54:28more I haven't told you about linear

54:30algebra, but these

54:31are the main points. So, okay. So, let me — um,

54:34yeah. Right now this

54:37is just sort of hanging here now you

54:38just know about linear algebra but I

54:39promise in the next week we're going to

54:42connect it to problems and then that to

54:46optimization problems and that will have

54:48a connection to applications and then

54:50you will start being able to listen to a

54:53problem a practical problem and say I

54:56can solve that like I can solve that

54:58super fast right so everybody uh getting

55:01this? Right. So — there are people

55:04for whom this is all they do, right? If you

55:05do scientific computation, this is kind

55:07of a lot of the stuff you do. So, okay — so

55:10now we're going to start

55:13minimization. Um — and, like I say,

55:16we're building slowly up. At the

55:18lowest level — well, at the lowest level

55:20are, you know, computers,

55:23compilers, microcode, things like that,

55:25which we're not going to look at. The next

55:27level up — the bottom floor — is what we

55:29just covered: linear

55:30algebra. Now the next floor is: how do you

55:33solve, you know, smooth unconstrained

55:36problems? So we'll look at

55:39um and we'll build slowly up uh then

55:43we'll get equality constraints then

55:44we'll pull in inequality constraints and

55:46then we're there right so okay so here's

55:49what it looks like. We're going to

55:50minimize a function f here that we're

55:54going to assume is smooth —

55:56twice continuously differentiable, right?

55:59Um — and we'll assume that, you

56:01know, the optimal

56:04value is achieved, right — so

56:05there's at least one optimal point. Right —

56:07now, by the way, a lot of problems like

56:09that, that we want to solve in this class —

56:10problems you have already solved — don't

56:13satisfy this. Right? Like, if you did an l1

56:16fitting problem, you would minimize the

56:18l1 norm of Ax minus b, and that doesn't count —

56:21that's not covered here, because that's not

56:22differentiable. Okay — um, okay, so this is

56:26what we're going to do. Um — in general,

56:29except for a very important set of

56:31problems we will see shortly, there

56:33are no analytical solutions to this.

56:36Right? So instead we're going to

56:40have iterative methods. An iterative

56:41method is going to generate, you

56:43know, an x; it's going to query something

56:46about f and then generate a new x, and so

56:49on — a new x and a new x — and what's going

56:51to happen is those x's are going to

56:52converge to — hopefully — a

56:55solution of this problem. Right? Um — and

57:00what we'll do — you know, the typical

57:02thing you're going to get is something

57:04like this: the sequence is going to have the

57:06function values converge to the optimal

57:08value; and another way to say it is

57:10that you're

57:13going to have a sequence of points where

57:14the gradient gets smaller and smaller.

57:16Okay — and then when you deem it to be

57:18small enough — um, you asked earlier about

57:20stopping criteria: here you have stopping

57:22criteria. For the direct methods you

57:24don't have stopping criteria — well, it

57:27just stops; it's like, after n cubed over

57:29three flops you've got the Cholesky factor —

57:32there is no stopping criterion there.

57:36Okay. So — well, we'll make some

57:39assumptions. I don't want to get too

57:41technical about it, because it

57:42quickly gets kind of boring and stuff

57:44like that. Um — and by the way, there are

57:46entire classes that cover

57:48nothing but this material in hideous,

57:50full mathematical detail. So, I mean,

57:52they're interesting, but —

57:55I don't know, I think some

57:58people grow out of it. Anyway —

58:00I'll stop there. Okay:

58:06Um — so, yeah, we're going to assume that

58:08the sublevel sets are closed, and

58:11one condition for that that's pretty

58:13common is that the function f is

58:15closed. Like, if you do convex analysis —

58:18which is the mathematics of convex

58:19optimization and convexity — this is a

58:22central concept; it just means that the epigraph is closed,

58:25right. Um — and one example would be —

58:29well, here are

58:32some examples of closed functions,

58:33right: log-sum-exp — this is because the

58:36domain of this is everything, right, so

58:38this would count. Um — here's

58:40another one. This is weird, because you

58:42might say: but there's a constraint — the

58:44constraint is that x lies in the

58:46interior of this polyhedron, because

58:48otherwise — I mean, there's an

58:50embedded constraint here, and you might

58:51say: no, I don't like that. But this one is

58:53perfectly good: all the sublevel

58:55sets are closed, and the reason is

58:57that the function value goes to infinity

58:59as you approach the boundary. So we'll

59:01get to that. Okay? Is that okay? So —

59:04these are technical conditions — I

59:06think not super duper important.

59:10Um — people sometimes assume

59:14something like a minimum

59:16curvature, right? So what that says is, for

59:19example: the minimum eigenvalue of the

59:21Hessian is bigger than some number m,

59:23which is positive, right? That gives

59:25you a minimum curvature of the function,

59:28right? That'll allow you to say all

59:30sorts of things. Um — yeah, now you could

59:35ask questions like: um, how would you know

59:38what the minimum curvature is? It's

59:40a good question, right? And the answer

59:42is: in most cases you would have

59:43absolutely no idea. There are cases where

59:45you do. Um — oh, here's one:

59:48let's say I do multinomial logistic

59:50regression and I add quadratic

59:53regularization to it.

59:55Then I don't even count the

59:57curvature that comes from the

59:59multinomial logistic regression loss;

01:00:02the quadratic

01:00:03regularization alone gives me a minimum

01:00:05curvature. Everybody follow? So, I mean,

01:00:08these are kind of rare, right — but that

01:00:09is an example where you actually

01:00:11would know a candidate value for m, right —

01:00:14which is literally — it's

01:00:17going to be the regularization, the ridge

01:00:19regression, you know, parameter. Um — okay, so

01:00:23if this is the case then this says that

01:00:25f minus m over 2 times the norm of x squared is

01:00:34convex and if you then

01:00:36apply the basic convexity formula to

01:00:39that you'll find out really cool you get

01:00:41a strengthened inequality I guess

01:00:43it's not quite Jensen uh you know

01:00:45this first part up to here that's

01:00:48true this is true for any convex

01:00:51function this part right that's the

01:00:53basic inequality what this says is oh

01:00:55yeah well it applies with a little bit

01:00:57more right it's actually the

01:00:59minimum curvature constraint right so it

01:01:02strengthens that basic inequality we saw

01:01:04six weeks ago or whenever it was seven

01:01:06weeks ago six right um okay and this

01:01:11means all sorts of cool things because

01:01:12for example I could minimize both

01:01:16sides of this inequality over y on the left I get

01:01:21P star because it's just the minimum of

01:01:22f of Y on the right right this is a

01:01:25quadratic function in y I set the

01:01:27gradient equal to zero I minimize it and

01:01:30you get a number and

01:01:32anyway you end up with

01:01:34something that looks like this you end

01:01:35up with this inequality right and so

01:01:38this is kind of cool because this

01:01:40says that if you wanted a stopping

01:01:43Criterion you would run the algorithm

01:01:45you'd evaluate the gradient and when the

01:01:46gradient gets small enough you quit and

01:01:48someone says why are you quitting um and

01:01:51you'd say well the gradient is less than

01:01:52something and they go yeah but how far

01:01:54am I from optimal and then you would say

01:01:56oh good question I can actually bound it

01:01:58I can say that here's how far you are

01:01:59from the optimal value no more than 1

01:02:01over 2m times this of course then in

01:02:04general someone would say how do you

01:02:05know m and the answer is you

01:02:06don't and blah blah blah so but okay
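A quick numerical check of that bound (on a hypothetical strongly convex quadratic, chosen because there m and p-star are computable exactly): f(x) − p* ≤ ‖∇f(x)‖²/(2m) at every point:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
Q = rng.standard_normal((d, d))
A = Q.T @ Q + np.eye(d)            # positive definite: f below is strongly convex
b = rng.standard_normal(d)
m = np.linalg.eigvalsh(A).min()    # minimum curvature (rarely known in practice)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
p_star = f(np.linalg.solve(A, b))  # exact optimal value for a quadratic

# the suboptimality is bounded by the squared gradient norm over 2m
for _ in range(100):
    x = rng.standard_normal(d)
    assert f(x) - p_star <= grad(x) @ grad(x) / (2 * m) + 1e-9
```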

01:02:09everybody this is the

01:02:12justification you would have for

01:02:14stopping when the gradient is small

01:02:18um so we'll say a little bit about

01:02:21descent methods these are

01:02:24you know super widely used um and

01:02:28they all have a generic

01:02:30form and it looks like this um in each

01:02:33step of a descent method

01:02:36you're going to find something called

01:02:38Delta X and that's called the step or

01:02:40the search Direction here um and so you

01:02:45know basically what happens is at

01:02:46iteration K you have XK you make some

01:02:49queries to F about like what's the

01:02:52gradient or something like that and then

01:02:54you form the search Direction so you

01:02:56stop and you ask for directions and it

01:02:58says well you're here you want to

01:03:00minimize you know you should kind of go

01:03:01in that direction that's delta x this

01:03:03direction then what you do is you're

01:03:05going to go in that direction

01:03:07with an amount that you're going to then

01:03:09choose and that amount is scaled by this

01:03:11Factor T which is positive and this is

01:03:15uh that's sometimes called the

01:03:17step size or the step length even though

01:03:19it need not be it's not really the

01:03:21length unless the norm of the search direction

01:03:23is equal to one right but still people

01:03:25call it the step length or the step size

01:03:26or something like that so okay um that's

01:03:30what you do and it turns out that if

01:03:35the direction you move in has

01:03:38a negative inner product with the

01:03:40gradient

01:03:44and if you choose the step size

01:03:45correctly the new point will actually

01:03:48have a function value less than the

01:03:49original Point that's good that's called

01:03:50a descent method because you're

01:03:53kind of descending right and

01:03:54you end up with a function value

01:03:56that's lower okay so here's like a

01:03:58generic version of it uh to make this a

01:04:02real algorithm we have to fill in things

01:04:05we have to figure out how you find the

01:04:07direction and then we have to figure out

01:04:10how do you choose the step length right

01:04:12and that's the whole thing right and the

01:04:14stopping Criterion could be some Norm of

01:04:15the gradient usually it's not usually

01:04:18it's something that is computed as a

01:04:20byproduct of these two steps in here so

01:04:22so this is going to be like hundreds of

01:04:24different methods with different names

01:04:26are going to have exactly this form

01:04:29right and they differ in what's the

01:04:31search Direction and they differ in what

01:04:33is the line search okay so that's

01:04:36the

01:04:39idea so for line searches um there's a

01:04:44bunch I mean here's one an exact

01:04:46line search that says once you've

01:04:48decided that you're here and you're

01:04:50going in that direction you actually

01:04:53look at the function restricted to that

01:04:55Ray and then you look for the point

01:04:57where the where along that Ray you have

01:04:59the smallest values that's called an

01:05:01exact line search and you could do this

01:05:03in some cases right uh where you can

01:05:06actually work this out analytically or

01:05:07you can compute it cheaply or something

01:05:09like that right so that's that's one one

01:05:12place where you you might do this right

01:05:14um another one which sounds much cruder

01:05:18is called a backtracking line search and

01:05:20this has many many names again depending

01:05:21on where you came from it's called an

01:05:23Armijo line search it's got all

01:05:26sorts of names for it the Armijo condition

01:05:28there's at least one of those I forget

01:05:29what it is in Russian but they have a

01:05:31different name for it so it's all this

01:05:34the lots of different names for this um

01:05:36so this has been rediscovered like I

01:05:37don't know a couple hundred times here's

01:05:39the way it works it's

01:05:42called backtracking and it goes like

01:05:44this you start with T equals 1 so you

01:05:47try x plus delta

01:05:50X then what you're going to do is you're

01:05:52going to check whether f has gone down

01:05:55enough and this is critical by the

01:05:57way if it went up that's

01:06:00basically

01:06:02really bad right in fact you can even be

01:06:04out of domain so that that would be like

01:06:06a very you know that that would be it

01:06:07went up a lot because it went up to plus

01:06:09infinity right so okay so here's what

01:06:12this looks like here's t here um

01:06:16here is the function restricted to

01:06:20t this would be the exact line

01:06:24that's the exact step

01:06:26length you would take to

01:06:27minimize this right um so the way this

01:06:30works is you look at this basic

01:06:33inequality here that basic inequality is

01:06:35this line right here right this dash

01:06:37line that's what and what it says is uh

01:06:41if I put Alpha equals 1 in that equation

01:06:43I can reverse that inequality because

01:06:46basically when I continue this

01:06:48linear thing I guarantee this is a lower

01:06:50bound on this right so the

01:06:53actual function value could never be

01:06:55below this if if the function's convex

01:06:58right so here's what you do we put an

01:07:00alpha in there uh which is a number

01:07:03that's less than one uh yeah actually in

01:07:05fact it's less than a half for weird

01:07:07reasons though it turns out

01:07:09it's not going to matter what it is too

01:07:10much um and then we degrade the slope by

01:07:13multiplying by Alpha so here Alpha is

01:07:15maybe a third I just made it up and that

01:07:19line so if you take a really small step

01:07:22length you're definitely below this

01:07:24line because if you take a really small

01:07:25step length here this is actually a

01:07:27really good approximation of how much

01:07:29the function is going to go down right

01:07:31um and then the rule

01:07:33is you have to be below this dashed line

01:07:38this degraded predicted line and so the

01:07:40way that works is you know you might

01:07:42start here and query the function and

01:07:44find this value and you're above this in

01:07:47fact the most dramatic one is you'd

01:07:48start way out here you're

01:07:49over here and actually when you do x

01:07:52plus delta x in that case you

01:07:54actually have a higher value than f of x

01:07:57everybody see that so so that would be a

01:07:59particularly poor choice of Step length

01:08:01right so what you do then is you

01:08:05multiply this by some Factor beta beta

01:08:09is typically like a half though people

01:08:11also use 0.1 so you simply say

01:08:14okay that was too much so you go

01:08:16back by half and you land here you try

01:08:18this one you get this value it's less

01:08:20than this thing and you accept it so you

01:08:22take the first one you accept that's

01:08:24backtracking now the story

01:08:28of a lot of this

01:08:30unconstrained

01:08:32minimization material is that there'll be

01:08:35several times when something you would

01:08:37think is right is just wrong

01:08:39completely wrong so now I'll tell you

01:08:41the first one you would look at this and

01:08:43you'd say it's obviously clearly

01:08:46it's better to do an exact line search

01:08:49that's totally obvious right because

01:08:51you're actually minimizing the function

01:08:52along this line right this makes sense

01:08:55and this crude thing that stops when

01:08:57you've gotten like some acceptable by

01:08:59the way people use numbers like alpha

01:09:02equals 0.01 right meaning if you get even just

01:09:05the slightest decrease like good fine

01:09:07thanks we'll take it so everybody see

01:09:09what I'm saying so looks very

01:09:11unsophisticated right um anyway all

01:09:15right now I'll tell you the punch

01:09:16line the punch line is it doesn't make

01:09:19any difference uh that in fact exact

01:09:21line search is not only not better in

01:09:23practice it's often

01:09:25worse and the simple dumb thing actually

01:09:28works just fine thank you this is not

01:09:31obvious right so um
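Putting the pieces together, a minimal sketch of the generic descent loop with a backtracking (Armijo) line search — the quadratic test function and the alpha = 0.3, beta = 0.5 values here are just illustrative choices:

```python
import numpy as np

def backtracking(f, g, x, dx, alpha=0.3, beta=0.5):
    # start at t = 1; while f(x + t*dx) sits above the degraded linear
    # prediction f(x) + alpha*t*(g @ dx), shrink t by the factor beta
    t = 1.0
    while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
        t *= beta
    return t

def descent(f, grad, x0, tol=1e-8, max_iter=1000):
    # generic descent method: pick a direction, pick a step size, update, repeat
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # simple stopping criterion
            break
        dx = -g                        # steepest-descent direction, so g @ dx < 0
        x = x + backtracking(f, g, x, dx) * dx
    return x

# toy problem: f(x) = 0.5*(x1^2 + 10*x2^2), minimized at the origin
A = np.diag([1.0, 10.0])
x_min = descent(lambda x: 0.5 * x @ A @ x, lambda x: A @ x, [1.0, 1.0])
```

Swapping in a different `dx` or a different line search gives the other named methods of this family; the skeleton stays the same.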

01:09:35okay now we get to a gradient descent

01:09:39method this is the simplest one remember

01:09:41we had to choose a Direction I mean I

01:09:45would hope it

01:09:47would be the first thing you

01:09:49would think of someone says make me an

01:09:51iterative method I'm at x and

01:09:54I evaluate F at that point and I

01:09:56evaluate a gradient now gradient is

01:09:58pointing in the direction of fastest

01:10:01increase of f if I move in that

01:10:03direction F goes up as fast as possible

01:10:05if I move in the negative gradient

01:10:07direction f goes down as quickly as

01:10:10possible and not only would you come

01:10:13up with this um but you'd also ask

01:10:17yourself why would you do anything else

01:10:21I mean think about it if

01:10:22you didn't do the gradient

01:10:25method what happens is you arrive here

01:10:27and someone says where's the negative

01:10:28gradient you go that direction you say

01:10:31what does that mean you go if you go in

01:10:32that direction you're going to go down

01:10:33as quickly as possible and you go cool

01:10:36which way do you recommend I go and you

01:10:38go um that way you're like like what

01:10:41like why would you recommend that anyway

01:10:44you will shortly see why but is this

01:10:46kind of non-intuitive right uh so but

01:10:48we'll we'll get to it anyway so gradient

01:10:50method is the most intuitive one it

01:10:51looks like this uh use any it

01:10:54doesn't matter what line search you use

01:10:57and here it is and you can you know

01:10:59prove convergence of this and all that

01:11:00kind of stuff that's fine um it turns

01:11:02out this is often unbelievably slow so

01:11:07uh so in fact we'll see

01:11:09what other search directions can

01:11:11actually dramatically improve this but

01:11:13this is slow very weird um and we'll

01:11:15look at a picture to understand it so I

01:11:18just take a quadratic looks like this

01:11:20right and here are some level curves of

01:11:21the quadratic down here and Gamma is a

01:11:25number uh oh yeah let's see so gamma

01:11:28allows me to change the eccentricity of

01:11:32the level sets right so if gamma equals

01:11:351 this is one half of x1 squared plus x2 squared oh yeah by the

01:11:39way what's the gradient in that case if

01:11:41the level sets are just

01:11:44here what does the gradient

01:11:52look like in that case this is

01:11:55actually I'll draw this it's

01:12:00here here's your function uh here the

01:12:03level sets here's zero right these are

01:12:06Level sets you're here the gradient

01:12:08points directly out like that and the

01:12:11negative gradient points directly

01:12:12towards zero which is the minimizer so

01:12:16if gamma was one if I ask you to

01:12:18minimize x1 squared plus x2 squared using the gradient

01:12:21method with an exact line search tell me

01:12:23about how well that algorithm is going to

01:12:28work yeah very well like in

01:12:31one step right it's going to say which

01:12:33way should I go you say steepest descent

01:12:35which just happens in this case to aim at the

01:12:39solution right you do an exact line

01:12:41search and it's game over everybody got

01:12:43that okay so but now that makes sense

01:12:47and in fact someone who was proposing

01:12:48like you know a gradient method would

01:12:50say yeah look look how well that works

01:12:52that's great like why would you not do

01:12:54that why would you go in a

01:12:57crazy Direction like that right um but

01:13:01we can actually control the eccentricity

01:13:03by making gamma either really big or

01:13:06small um and in this case you can

01:13:09actually work out analytically

01:13:12what the iterates are right and you get

01:13:14something that looks like this is

01:13:15actually a very good picture to have in

01:13:16your mind of course it's totally silly

01:13:18because we don't care about

01:13:20two-dimensional problems we care about

01:13:22you know problems with a dimension

01:13:23of 200 2,000 20,000 2 million but

01:13:26nevertheless this gives you the

01:13:28picture so here this is when it's just

01:13:31slightly eccentric and here these are

01:13:33the level curves and so the negative

01:13:35gradient is simply

01:13:38orthogonal to the level

01:13:40curves and it says go in that

01:13:42direction this is the minimum

01:13:45because we're doing exact line search

01:13:47and then it says which way should I go

01:13:48and you go like that

01:13:50everybody so this has got uh people

01:13:52refer to this as zigzagging

01:13:55we can make it arbitrarily bad

01:13:57right we can take gamma equals a thousand

01:14:01and you'll find out that

01:14:03this thing's basically going like this

01:14:04and actually I think

01:14:06there's even a name for it in

01:14:07Western optimization

01:14:09but the

01:14:12name I like is it's just called

01:14:14zigzagging everybody got this so by the

01:14:17way you know when you see this picture

01:14:19you can kind of figure out some things

01:14:22immediately that would be pretty good

01:14:24right so if you're in EE or some

01:14:26other field you would say you're kind of

01:14:29going in the right direction on

01:14:32average right I mean the problem is

01:14:34you're just doing this weird oscillating

01:14:35thing up and down like this whereas you

01:14:37want to really kind of be going that way

01:14:39and so you'd say you know what you

01:14:41should run those search

01:14:42directions through a low pass filter or

01:14:44something I mean again I'm speaking

01:14:46dialect um or if you do physics you'd

01:14:48say yeah you know add some momentum to

01:14:50it or something like that or viscosity

01:14:52add a viscosity term right so that you

01:14:55would kind of do something that

01:14:57would go like that and head

01:14:59towards the solution right so and those

01:15:01are all methods we're not going to look

01:15:02at those but that's kind of the

01:15:04idea okay so even this baby example

01:15:07tells us that actually we get

01:15:10some really good intuition here

01:15:14the gradient method works really well

01:15:16when the sublevel sets of your

01:15:19function have low condition number

01:15:25because if they're round the

01:15:28negative gradient points directly to the

01:15:30solution and you're going to go there

01:15:31quickly if it has high condition

01:15:34number or if the

01:15:36sublevel sets are weirdly squished out like

01:15:38this that kind of thing is going to be bad

01:15:41everybody got that so that's kind

01:15:44of the picture here and

01:15:47that's actually super important

01:15:48observation because we're going to use

01:15:50that's actually the segue to

01:15:52basically Thursday's

01:15:53topic right so okay
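The zigzagging experiment is easy to reproduce, because exact line search on a quadratic has a closed form: along the ray, t = gᵀg / (gᵀAg). A sketch on f(x) = ½(x₁² + γx₂²), starting from (γ, 1) (the classic starting point for this example), counting iterations until the suboptimality drops below a tolerance:

```python
import numpy as np

def iterations_to_converge(gamma, tol=1e-6, max_iter=100_000):
    # f(x) = 0.5*(x1^2 + gamma*x2^2); gradient method with exact line search,
    # which for a quadratic minimizes along the ray with t = (g@g)/(g@A@g)
    A = np.diag([1.0, gamma])
    x = np.array([gamma, 1.0])       # classic starting point for this example
    k = 0
    while 0.5 * x @ A @ x > tol and k < max_iter:
        g = A @ x
        x = x - ((g @ g) / (g @ A @ g)) * g
        k += 1
    return k

# round level sets: done in one step; eccentric level sets: zigzagging
assert iterations_to_converge(1.0) == 1
assert iterations_to_converge(100.0) > 10 * iterations_to_converge(2.0)
```

The iteration count blows up with the eccentricity γ, which is exactly the condition-number story above.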

01:15:58just some examples um this is a

01:16:01silly function that's

01:16:04nonquadratic um and this is what happens if you

01:16:06use a backtracking

01:16:09line search and here's an exact line

01:16:10search and you'd look at these two and

01:16:11you'd go like well I told you exact line

01:16:13search was way better but this just

01:16:15shows you kind of the idea um but it

01:16:17shows you that in real nonquadratic

01:16:20examples you're actually getting this

01:16:21zigzagging effect right and

01:16:24then we can look at that over here and

01:16:26see what that looks like so um this

01:16:30looks like this right um and let me

01:16:34explain what this is these are the

01:16:36iterations right this is a log scale

01:16:39telling you actually your

01:16:41suboptimality right um and what you can

01:16:44see is all sorts

01:16:46of weird things here right like the

01:16:48first thing is that actually between

01:16:49zero and like 100

01:16:51iterations for whatever reason the

01:16:53backtracking line search was doing

01:16:55better like I don't know why but it was

01:16:57and this is kind of common right so this

01:16:59is what it looks like

01:17:01the other thing you see is that this

01:17:02kind of gets this gets kind of linear

01:17:04and that's referred to as linear

01:17:06convergence but that's a weird use of

01:17:08linear because it's linear on a log or a

01:17:13semi log plot right so it means because

01:17:17this is a log thing what linear

01:17:19convergence here means is that each step

01:17:22you do each iteration reduces the error

01:17:25by some factor it's pretty small frankly

01:17:28but we could work out what it is right

01:17:31let's say

01:17:34that we hit

01:17:36here so say to go from one to 10 to

01:17:39the minus 4 took you 170 iterations

01:17:42right then you take

01:17:4510 to the minus 4 and you take that

01:17:47to the one over 170th power and that tells you

01:17:51by what factor your error gets smaller each step

01:17:54right so that tells you roughly

01:17:56what it is right okay so that's

01:17:59called linear convergence um okay
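Concretely, using those rough numbers read off the plot (1 down to 10⁻⁴ in 170 iterations), the implied per-iteration error-reduction factor is:

```python
# going from error 1 to 1e-4 in 170 iterations means the error shrinks by a
# constant factor c each step, with c**170 = 1e-4
c = (1e-4) ** (1 / 170)
assert abs(c ** 170 - 1e-4) < 1e-12
assert 0.94 < c < 0.95   # each iteration keeps about 95% of the error
```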

01:18:03I think maybe I

01:18:05won't even start the next

01:18:07section but this is a good place to

01:18:10stop and we'll