Let me explain where we are in the whole course. Over the next two and a half or three weeks we're going to build up from the bottom to the top, and it will finish with you actually implementing your own solver, an LP solver or something like that; we haven't quite figured it out, but it'll finish with you implementing a solver. That's mostly, by the way, just to demystify it: so that when someone looks at you sometime in the future and says, "Oh my God, you can't write your own LP solver, that's really complicated," you can say, "Au contraire, I did it." That's the main reason, but this is also interesting stuff. Right now we're at the bottom level, which is linear algebra, and in some ways it's actually by far the most important level. My own feeling is that everybody, certainly anybody in this class, needs to know this material. Do you need a whole ten-week course, the kind you would take on computational linear algebra? Maybe, actually. But you certainly need to be familiar with the ideas. That's unlike, say, computer architecture: you don't have to know about computer architecture, even though everything we do is thanks to the people who design and build computers. This, though, you actually should know.

Okay, so back to numerical linear algebra. Today we're going to cover material that some of you may know and the rest of you probably don't, and I'd list it in the top ten things you need to know from this course, even though it's not technically part of the course. Okay, so here it is.

Last time we talked about how you solve a set of equations Ax = b. There are several ways. There are ones we're not going to talk about: iterative methods, which are used for super-large-scale problems, where you've got, say, a billion variables, or where the system represents a PDE you're solving. We're not going to talk about those. We're going to talk about what are called direct methods; I don't know where that name came from, but they're called direct. The way they work is they take the matrix A and first factor it into a product of matrices, each of which is, quote, "easy to invert," unquote. That's the first step. You've seen some of these factorizations, like the SVD and the eigendecomposition, and you've probably seen a couple of the others as well. "Easy to invert" is actually just dialect for "it's easy to solve equations with that coefficient matrix." That's what it really means, because no one is inverting anything. It is unbelievably rare in good numerical work for anybody to actually form the inverse of a matrix. There are cases where people do, but it's very rare.

Okay, so what you do, once you've factored your matrix into a product of things which are, quote, "easy to invert," unquote (you have to interpret that correctly), is you write out the inverse: the inverse of the product is the product of the inverses in reverse order. Then you literally just evaluate this, by first evaluating A₁⁻¹b. Of course, when people write A₁⁻¹b, what they really mean is that you call solve(A₁, b); that's what it really means. So you solve that, then you pass the solution to the next thing, which is A₂⁻¹ times x₁, so you solve A₂x₂ = x₁, and that gives you the second one, and you keep going until you get to the end. So again, it's completely straightforward; like, okay, so what? Well, last time we talked about something very interesting: if you want to solve a set of equations with the same coefficient matrix but different right-hand sides, what you can do is factor once, cache the factorization, and then simply call the solve method on your multiple right-hand sides. The first step is called factor, and the second is called solve. You get to some surprising conclusions in some cases; I'll get to that in a minute.
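As a concrete illustration of that factor/solve split, here is a minimal sketch using SciPy's LAPACK wrappers (the function names are SciPy's, not anything from the lecture; the random matrix is just for the example):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n))   # a random (almost surely nonsingular) matrix

# Factor once: the O(n^3) step. This computes and caches the PLU factorization.
lu, piv = lu_factor(A)

# Solve many times: each solve reuses the cached factors and is only O(n^2).
for _ in range(10):
    b = rng.standard_normal(n)
    x = lu_solve((lu, piv), b)
    assert np.allclose(A @ x, b)
```

The point is that `lu_factor` carries essentially all the cost, so the ten solves together take about as long as one.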

This is the one you probably saw a variation of in high school, or wherever you first saw linear equations. It's called Gaussian elimination; it's called lots of things. Here's what it is: if you have a nonsingular matrix A, you can factor it as A = PLU, a permutation times a lower triangular matrix times an upper triangular matrix, where P is a permutation, L is lower triangular, and U is upper triangular. They're all invertible, of course; actually, if any one of them weren't, then A would not be either. The cost of that factorization, which we're not going to go into (it's actually not that difficult), is order n³ flops. So that's what that is.

Here's what it looks like to solve linear equations. You factor A, which gives you P, L, and U; that's the factor step, and the remaining three steps are the solve. Again, all you do is solve them in reverse order. Solving with the permutation is silly: you just multiply by Pᵀ, which, roughly speaking, is the inverse permutation; you just un-permute the entries, so that's easy. The second one, L, is lower triangular, so for that you use forward substitution, which we saw is an order-n² algorithm. Backward substitution is what you use for U, because U is upper triangular: you start, if you'll recall, by first solving for the last entry, then the second-to-last, then the third-to-last, and so on. That's backward substitution, and it's also n² flops. If you add these up, you find out something kind of weird: the factor is order n³ and the solve is order n². I think last time I finished by asking you a question: how long does it take to solve, on my laptop, a set of linear equations with, say, a thousand-by-thousand matrix? I could make that sound pretty impressive: I could say my problem has more than a million coefficients in it, which is true, a thousand-by-thousand matrix does, and then you could say the factorization is order n³, that's like a billion flops. Of course, that's a joke now. The answer is I can do it in milliseconds, and so can you; my phone can do it in something on that order of magnitude. Which is actually kind of stunning.

Then I asked the following question: if it takes, say, 100 milliseconds to solve one set of thousand-by-thousand equations (a thousand equations, a thousand unknowns), how long does it take to solve 10 of them, or for that matter even 100? If it takes a tenth of a second to do one, the answer is surprising: it's a tenth of a second. Everybody got that? Not that many people appreciate or understand this, and it's used throughout a lot of numerical work; you use it all the time, you just don't know it. All the solvers you use when you call CVXPY, almost all of them, use things like this. The most typical methods to solve SOCPs, SDPs, these things, or for that matter even LPs, involve two solves, done in the same way: basically, the factorization dominates, and that's the point there.

Actually, I'm just curious: how many people have seen this before? How many people knew that you could solve 10 sets of equations with a thousand variables and unknowns, same coefficient matrix, different right-hand sides, in the amount of time it takes to solve one? You knew? Come on, other people. Is no one else impressed? Shouldn't you be? Maybe it's too early, but people should be going, "That's awesome!" I'm not getting that feeling, but that's okay, that's fine. Okay, good.

Right. Now, sparse LU factorization. This is super interesting; this is something everybody should know about. Since about the 1960s, people have been developing stuff for sparse matrices: first of all, what data structures do you use to store them, and then there's a whole parallel universe of doing, for sparse matrices, all the stuff you would do with so-called dense matrices (which are just stored as arrays): factorizations, back solves, all these kinds of things, everything you can do, eigendecompositions, singular value decompositions. There's a whole field where people have made this efficient for sparse matrices, and it's something everybody needs to know about, like, everyone, because if you do any kind of computation you need to know about this: again, there will be surprising statements we'll make about what you can solve and not solve. Okay. If A is sparse, the typical factorization is going to have four factors: A = P₁LUP₂, where P₁ is a permutation in front, P₂ is a permutation in back, and L and U are lower and upper triangular. The idea here is that mathematically you don't need P₂. You do need P₁: it is false that any nonsingular matrix can be factored as LU, so you do need that initial permutation, but you don't need the second one. The second one is there (and, by the way, P₁ is also chosen this way) so that when you carry out the LU factorization, L and U will also be sparse. Because if you're not careful about doing this, your matrix A will be sparse but L and U will be dense. When that happens, it's referred to as fill-in: you have lots of entries in L and U that weren't there in A.

This is very important for everybody to know. Here's an example. It's extremely common to work all the time with, say, a million-by-million sparse matrix. It is simply not a problem; I can do stuff like that on my laptop. The number of nonzeros is not going to be 10¹², which is what a dense one would have; it's going to be something reasonable, like 20, 30, 100 nonzeros per row. Even that is not that big a deal: if there are 100 nonzeros per row, that's 100 million coefficients, and you multiply that by about 10 bytes, so even in doubles it's like a gig, which is a reasonably small amount of memory.

So what happens is, you have your million-by-million matrix and you call a solver like this, and if you choose P₁ and P₂ poorly, then L will be something like a dense lower triangular matrix. You can't even store it, because it's a million by million: that's on the order of 10¹² divided by two entries. You can't; let's just leave it that way. Making sense? So P₁ and P₂ are chosen heuristically, and people have been doing this since the '60s; it's actually very cool. You can ask questions like: find me the P₁ and P₂ that minimize the number of nonzeros in L and U. That's a perfectly nice problem to come up with, and it is NP-hard. But who cares; it's silly, because there are unbelievably good heuristics for doing this, heuristics that all of us benefit from. So this is great.
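To make this concrete, here is a minimal sketch using SciPy's sparse LU, `scipy.sparse.linalg.splu`, which applies fill-reducing permutations for you (the tridiagonal example matrix is my own, not one from the lecture):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# A large sparse system with about 3 nonzeros per row.
n = 100_000
main = 4.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csc")

lu = splu(A)          # sparse LU; fill-reducing column permutation chosen heuristically
b = np.ones(n)
x = lu.solve(b)       # forward/backward substitution with the sparse factors
assert np.allclose(A @ x, b)

# The factors stay sparse: nnz(L) + nnz(U) is a small multiple of nnz(A).
print(A.nnz, lu.L.nnz + lu.U.nnz)
```

Because the factors stay sparse, both the factorization and each subsequent solve are fast even at sizes where a dense factorization would be hopeless.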

Okay. So, with sparse matrices, the factor and solve times, whatever they are, are less than the (2/3)n³ flops of the dense case. I'm going to give you a very important special case right now, and then we'll encounter others. Here's a super important special case: the so-called banded matrix. A banded matrix looks like this: there's a band along the diagonal, maybe K entries wide, that's nonzero, and everything outside the band is zero. That's a banded matrix. I can store a banded matrix on my laptop easily if it's a million by million and the bandwidth is, say, 10 or 20. So everybody, that's a banded matrix. No? That's fine.

It turns out (this is a case where you don't need P₁ and P₂) that if you simply do the LU factorization here, the L and the U are going to be basically the same: they're also going to be banded. I'll just write down L: L is lower triangular, so it simply looks like a band of width roughly K/2 below the diagonal; that's what L looks like, and U looks similar. By the way, sometimes a banded matrix comes to you looking like that, but sometimes it's been permuted by whoever formed the problem: they decide, "I'm going to list all my states first, then all my controls," in an optimal control problem, say, and it comes out with some crazy sparsity pattern. Any of the methods used to heuristically choose P₁ and P₂ will permute it back to look banded. So, what happens in a banded factorization: first of all, the factorization time is order nK², and then the back solve is order nK. These are stunningly smaller than the dense n³ and n². So what this tells you is that you can solve a million-by-million set of equations, if the bandwidth is, say, 10, super easily, and it's going to be so fast it's not even funny. Everybody following this? Which is kind of shocking, because if someone said, "What are you doing?" and you say, "I'm solving a set of a million equations in a million variables, where each equation involves 10 of the variables," you'd go, "Whoa, that sounds like it's hard." The answer is it's unbelievable: this is a millisecond-scale operation. Everyone follow? It's just completely insane. But these are things you absolutely need to know, because this is the trick to many, many things.
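As an illustration of how cheap this is, here's a minimal sketch using SciPy's banded solver, `scipy.linalg.solve_banded`; the pentadiagonal system is made up for the example:

```python
import numpy as np
import scipy.sparse as sp
from scipy.linalg import solve_banded

# A pentadiagonal system: bandwidth 2 above and 2 below the diagonal.
n = 1_000_000
ab = np.zeros((5, n))            # diagonal-ordered form: row i holds diagonal (2 - i)
ab[0, 2:] = -1.0                 # second superdiagonal
ab[1, 1:] = -1.0                 # first superdiagonal
ab[2, :] = 8.0                   # main diagonal (diagonally dominant, so nonsingular)
ab[3, :-1] = -1.0                # first subdiagonal
ab[4, :-2] = -1.0                # second subdiagonal

b = np.ones(n)
x = solve_banded((2, 2), ab, b)  # O(n K^2) factorization plus O(n K) solve

# Check the residual against the same matrix assembled in sparse form.
A = sp.diags([-np.ones(n - 2), -np.ones(n - 1), 8.0 * np.ones(n),
              -np.ones(n - 1), -np.ones(n - 2)], offsets=[2, 1, 0, -1, -2])
assert np.allclose(A @ x, b)
```

The solver never forms the full matrix; storage and work both scale like n times the bandwidth, which is why a million-variable banded system like this solves almost instantly.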

Okay. By the way, what's interesting here is that the factor-to-solve ratio is not that high anymore: now it's K, where it used to be n. So that says, if I ask how long it takes to solve a million-by-million banded system with a bandwidth of 10, it's some number which is shockingly small; then you say, how about 10 of them, how about two of them, and the answer is it's only a little tiny bit more. Everybody following this? Okay. So, by the way, what you'll be doing in the next week or two is you will start to recognize matrix structure, like sparsity; you're going to see some others later today, like block arrow and weird other ones, and you will start recognizing them. They're super important because of two things. Number one, you'll start recognizing structures that you can exploit to solve extremely fast. But number two, on the other end, you're going to start recognizing practical problems that will result in matrices with that structure. I'm going to mention one right now: all of control, all of signal processing, just looks like that, every last bit of it. So that means you've solved little control problems and things like that, but what this says is, if someone says, "I have to solve a control problem with 20,000 time steps and a state dimension of 20 and 10 actuators," and someone says, "Wow, that sounds like it's going to be complicated," you just laugh and say, "Are you kidding? I can do that on my phone." And of course that's the key to actually putting these things into a lot of control systems; a lot of them actually do this. Okay, that's the big picture. This is just super cool to know about.

Okay, this brings us to a very famous factorization; it's about 100 years old. It's the Cholesky factorization. If you have a positive definite matrix, you can write it as just A = LLᵀ, and if you insist that the diagonal entries of L are positive, then it's unique. There's no choice: there's just one factorization, period; that's it, there's one way to write A as LLᵀ. And how do you solve Ax = b where A is positive definite (that's called a positive definite system of equations)? You do a Cholesky factorization, then you do a forward and a backward substitution with L and Lᵀ. Okay, so that's it.
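In code, the Cholesky factor-then-substitute recipe might look like this (a minimal sketch with SciPy's `cho_factor`/`cho_solve`; the random positive definite matrix is just for illustration):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n = 500
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)      # symmetric and positive definite by construction

c, low = cho_factor(A)           # Cholesky factorization: roughly (1/3) n^3 flops
b = rng.standard_normal(n)
x = cho_solve((c, low), b)       # forward + backward substitution, O(n^2)
assert np.allclose(A @ x, b)
```

As with LU, `cho_factor` can be called once and `cho_solve` reused for many right-hand sides.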

Actually, it's (1/3)n³ flops, as opposed to (2/3)n³ for general matrices, and that kind of makes sense. First, number one, no one cares about a factor of two in a flop count estimate, because these estimates are so crude. What really matters is what the people who actually wrote the microcode and all that kind of stuff did: how it's blocked, how many cores it's using, all sorts of crazy internal stuff that most of us (all of us, I'm guessing) should not and do not have to worry about. But we should all be very grateful that there are people who do. So, it's a factor of two, which is not surprising: the matrix is symmetric, so roughly speaking there's a redundancy of two. Or you can even look at this and say, hey, that's an LU factorization with U = Lᵀ, and why? Because the matrix is symmetric, that's why. So anyway, not too surprising.

We will see very shortly where positive definite systems of equations arise. Actually, they arise in a lot of applications, but specifically in convex optimization we'll see they're at the heart of the stuff we do. Okay, now we get to the sparse Cholesky factorization, and here I'll tell you something super interesting, I think. It turns out that factoring A = PLLᵀPᵀ is basically the same as calling a Cholesky factorization on PᵀAP: take that equation and just multiply on the left by Pᵀ and on the right by P. So what it basically says is: permute your positive definite matrix, and then do a Cholesky factorization, where there is no creativity; there's just a unique Cholesky factorization. Everybody got this? When you do that, you will find out some stunning things. You will find out that if you choose a good permutation, the L is going to be super sparse; if you choose a bad permutation, it's going to be dense.

I'll give you a sort of famous example. It would be something like this: suppose I gave you a matrix where (I'll just draw it this way) it's got a dense first row and column, and a diagonal. By the way, that's called an arrow matrix, because if you blur your eyes it looks like an arrow or something; this one is an upper arrow matrix. Everybody got it? It's super sparse: the number of nonzeros per row and column, except for the first row and column, is like two. So that's an example of a structure you should never ever see again and not have an emotional response to, because this is super important. Okay.

Now, if I do a Cholesky factorization on that, here's the sparsity pattern of L, ready? It's completely dense. Completely. This, of course, ruins everything: for one thing, you can't even store that much, because if you started with a million-by-million upper arrow matrix and called Cholesky on it, at some point your OS would say, "No, I am not going to allocate that amount of memory." Everybody got this? Okay. Now, it turns out all you need to do is permute the first and last rows and columns, and then you end up with a matrix with the arrow pointing the other way; that's also called an arrow structure (we'll get to others, but that's one). If I do a Cholesky factorization on this one, the sparsity pattern of L is just the lower half of the arrow, and that's all: no fill-in. So what it means is that the permutation that permuted this into this was super important, at least from a sparsity point of view. Everybody got it? So what this means is, if you had that set of equations to solve, and you just called a sparse solver, you wouldn't even know it, but on your behalf the permutation would change it to this form and factor that, and then your factor time is going to be order n, and your solve time too. Everybody following this? So that's it.
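You can see this fill-in effect directly. Here is a small sketch (my own construction, with n = 8 so the counts are easy to read) comparing the Cholesky factor of an upper arrow matrix before and after the permutation:

```python
import numpy as np

# An "upper arrow" positive definite matrix: dense first row/column, then diagonal.
n = 8
A = float(n) * np.eye(n)
A[0, :] = 1.0
A[:, 0] = 1.0
A[0, 0] = float(n)               # strictly diagonally dominant, hence positive definite

# Cholesky in the given order: the dense first column fills in everything below it.
L_bad = np.linalg.cholesky(A)

# Permute the arrow to point the other way: move row/column 0 to the end.
p = np.r_[1:n, 0]
Ap = A[np.ix_(p, p)]
L_good = np.linalg.cholesky(Ap)  # no fill-in: as sparse as the lower half of Ap

print(np.count_nonzero(L_bad), np.count_nonzero(L_good))
```

Without the permutation, L has all n(n+1)/2 lower-triangular entries nonzero; with the arrow permuted to point the other way, L has only 2n − 1 nonzeros, the same sparsity as the lower half of the matrix itself.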

Fortunately, for a lot of these things you don't have to worry about it; it's all baked inside. All of you have benefited from this probably a million times just so far in this class, because every time CVXPY calls a solver, this is the kind of stuff that's happening on your behalf, many, many times, every time you call solve on something. It doesn't matter, I mean, but now you know. Okay, does this make sense? Yeah? "How much time does it take for the solver to find that P?" That's a perfect question; I should have said it.

The question is how long it takes to find the permutation. Yeah: if I say, "Oh man, are you kidding? I can do this for you, not a problem; I'm going to find a permutation like that, and then I can solve this in a tenth of a second," and you go, "Cool, how long did it take you to find the permutation?" and I go, "A couple of days," that's just not useful. So the general rule of thumb is that people use heuristics for finding P that are, they hope, on the order of magnitude of the amount of time it takes to actually do the factorization. That's what they do. Now, there are applications where people will take more time to work out a better permutation, because they're going to use it a gazillion times. This would be the case for embedded systems: if you're doing control and you're going to put this into, say, a rocket or something like that, and it's going to run at 100 Hz for the entire mission or whatever, that's a lot of run time; it's worth it. You could definitely take an evening to come up with a better permutation. But this is not what most people doing numerics, who work on the solution of sparse linear equations, do; they just stick with heuristics that are fast. And by the way, that portion is sometimes called the symbolic solve, or the permutation phase; there are lots of names for it. And you can tell: if you actually look at what it's doing, you will find that it spends a whole bunch of time doing graph calculations; there are no floating-point numbers involved. Then, when floating-point operations start up, that means it's actually doing the factorization. That's a good question.

Okay. Now, I realize that for many of you this is the first time you've seen this; that's fine, you should just get the high-level picture of it, and you can read a little bit more in the appendix in the book; there's an appendix on this. This is just something you should know about. I don't know, you could poke around in SciPy or something. Once you know about this, you can't unknow it, and you will have, like, incredible powers. Yes?

Well, on the heuristics: some of them are kind of greedy. I'd have to tell you how you actually do the factorization to actually explain them, but a lot of them are constructed one step at a time. What they do is they say: P is going to permute something to the top row, so I'm going to find which entry I should permute to the top, then who gets permuted to the second position, then third, then fourth, things like that. So it's kind of greedy, but those are the kinds of things people use; they have lots of names, and it's worth knowing a little bit about them. Yeah? The question is: so it's NP-hard to find the best one, but are there approximation guarantees? No, there are no approximation guarantees whatsoever; no one has any. It's one of those things where, of course, we can ask; these are all graph calculations, these are just questions about graphs. Oh, there is one exception; there's one piece of theory, and it's this: if the graph represented by the sparsity pattern of A is chordal, then it's guaranteed that there exists a permutation which has zero fill-in, meaning the number of nonzeros in L will be identical to the number of nonzeros in A divided by two. So if you know what a chordal graph is, that's one theorem. It's actually kind of useless in practice, frankly, because the real question is not whether there exists such a permutation; the question is whether our heuristics find it, and the answer is that all decent heuristics will find that one.

Yeah? The question is that each of these factorizations has required knowing the matrix is nonsingular, or that it's positive definite. No; it turns out that, in this one, to find the permutation for the Cholesky factorization, you actually don't have to know the entries of A; you simply have to know their locations. That's actually a super important, more advanced point, but you brought it up. What this means is that we can do this ahead of time. For example, if we're solving an optimal control problem or something like that, we can do the permutation once and for all, offline, and then when the thing is actually running a jet engine or something like that, we don't have to do it. If we're doing finance and we want to do an insanely fast backtest, we'd first do the factorization, and then after that every solve is going to be really fast. So that's how that works. Yeah, so you don't have to know: if A is not positive definite and you call a Cholesky factorization on it, it will fail at some point; at some point it's going to attempt to take the square root of a negative number, and then it'll throw an exception of some kind. But that's the only time you would ever know.

28:57when when uh does the solver no stop

29:00like what kind of Criterion oh s all of

29:02these are finite algorithms right and an

29:05infinite Precision they just solve right

29:08and when I say it's order n cubed it

29:09means I you know I will I can tell you

29:12for this you're going to do you know 1

29:14billion you know blah blah blah blah

29:15blah multiplies know 1.3 billion ads

29:19when it's done if okay if you did every

29:22that everything in infinite Precision

29:23you would have the answer end of story

29:25there's no stopping Criterion there's

29:27nothing like that okay now of course

29:29you're doing this in finite

29:31precision, right? So, since you ask, there

29:35are methods — this is called iterative

29:39refinement — and what it does is

29:41you compute a solution that's

29:44approximate and then you can actually do

29:46you can iterate to get a better solution

29:49and people know about that but that's a

29:50little bit beyond what we're talking

29:52about here but it's good to know about
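[Editor's note: the two ideas just mentioned — factor once and reuse the factorization for every later solve, plus one step of iterative refinement — can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration with made-up data, not anything from the lecture's slides:]

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Made-up positive definite system; in practice A comes from your application.
rng = np.random.default_rng(0)
n = 200
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)          # positive definite by construction
B = rng.standard_normal((n, 5))      # five right-hand sides

# Factor once (O(n^3)); every later solve reuses the factor (O(n^2)).
factor = cho_factor(A)
X = cho_solve(factor, B)

# One step of iterative refinement: solve for the residual, then correct.
R = B - A @ X                        # residual, nonzero in finite precision
X = X + cho_solve(factor, R)

residual = np.linalg.norm(B - A @ X)
```

[In an online setting — the jet engine, or the back test — only the `cho_solve` calls happen in the loop; the factorization was done once, offline.]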

29:56um, okay. So, the LDL transpose

30:03factorization: this works for any

30:06nonsingular symmetric matrix. Right — by

30:09the way we'll see examples of where

30:10these come up like maybe next week in

30:12fact. So here it's like

30:17an LDL transpose factorization, but D

30:20is diagonal with either 1 x 1 or 2

30:22x 2 blocks — so that's just a

30:23mathematical fact that you might need to

30:25have these 2 by two blocks uh but the

30:27point is those are all, you know,

30:29easily inverted matrices: permutations,

30:31yes; lower triangular, yes; a block

30:34diagonal matrix with 1 x 1 and 2 x

30:362 blocks — I think we know how to invert

30:38that that one we actually literally know

30:40how to invert um but we can certainly

30:42solve equations like that right so okay
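[Editor's note: an aside not from the lecture — SciPy exposes exactly this factorization as `scipy.linalg.ldl`, and it succeeds on symmetric indefinite matrices where Cholesky fails. A quick sketch:]

```python
import numpy as np
from scipy.linalg import ldl

# A symmetric but indefinite (and nonsingular) matrix: its trace is zero,
# so it has both positive and negative eigenvalues.
A = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 3.0],
              [2.0, 3.0, 0.0]])

# Cholesky fails, since A is not positive definite.
try:
    np.linalg.cholesky(A)
    cholesky_ok = True
except np.linalg.LinAlgError:
    cholesky_ok = False

# LDL^T with 1x1 / 2x2 pivot blocks goes through fine.
L, D, perm = ldl(A)
reconstruction_error = np.linalg.norm(L @ D @ L.T - A)
```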

30:46um so this is this is the LDL transpose

30:49factorization um and now we're going to

30:53talk about something this is something

30:54probably many of you have seen in other

30:56contexts it comes up in pretty much all

31:00fields. So, it's something called the

31:02Schur complement — that's the

31:05name of it — and I think you've

31:07already seen it in you know early on in

31:09the class or something like that but I

31:11can't think of a legitimate field where

31:13it doesn't come up right so in

31:15statistics it's conditioning a gaussian

31:17on some entries right in mechanics it's

31:20actually what happens when you impose

31:22the boundary conditions on something and

31:24ask where everything else is in

31:25electrical engineering you take a net

31:27and I I put a termination on some of the

31:29ports on the network and I ask what

31:31what's happening with the rest of them

31:32right so this I all Fields have this um

31:36okay and the idea is actually

31:38embarrassingly simple so it goes like

31:40this. Um, we're going to write Ax

31:45equals b; we're going to block A into

31:48blocks, and I'm going to

31:49have A11 and A22 be square. Right? Um,

31:54and I'll also conformally

31:57block x and b, so they're going to be

31:59called x1 stacked on x2 and b1 stacked

32:01on b2. Right, okay. Now — but

32:06nothing has happened: I

32:08just decided to name it this way, I

32:10just divided it in some weird way. Okay,

32:13uh what happens now is you do actually I

32:16mean you do kind of what you learned in

32:19maybe High School except those were

32:20numbers and not matrices um and so what

32:23you do is you just we just work out we

32:25can actually just solve this in a weird

32:27way so the first equation block equation

32:29is A11 x1 + A12 x2 = b1, right? And

32:36we're going to we're going to assume

32:37here we're going to make an assumption

32:38that A11 is invertible. Okay, then what

32:43that says is, well, I can take b1 minus

32:49A12 x2 — that's that, right —

32:53and then I'll just put an A11 inverse

32:57in front: x1 = A11 inverse (b1 - A12 x2). Okay. Um,

33:03yeah by the way already when you see

33:05this this is hinting that this is going

33:06to be interesting when a11 is easy to

33:09invert so that that's the that's going

33:10to be the interesting part okay but

33:12we'll get there okay I plug this into

33:15the second equation and the second

33:17equation is A21 x1 + A22 x2 = b2,

33:26and I plug this this expression in here

33:29and I'm going to get something like you

know, A21 A11 inverse (b1 - A12 x2) plus

33:39A22 x2 = b2. Right — makes sense. Okay,

33:46and then I just I rearrange that to look

33:50like this equation here right I just put

33:53and and that's another equation in in uh

33:56x2. Um, this matrix here is famous:

33:58it's called the Schur complement of

34:01the original matrix — I can't remember if

34:03it's with respect to the one-one or the

34:05two-two block, but it's one of those two and

34:07I don't remember which. It doesn't

34:08matter — it's a Schur complement. Okay. Um,

34:11and you've probably seen it before or

34:13something in some yeah anyway that's it

34:17okay okay um okay so uh you could use

34:22this these calculations to solve the

34:24original thing and that generic

34:27algorithm looks like this it says first

34:29you should form A11 inverse A12 — I mean,

34:31because that's this thing, right — and A11

34:34inverse b1. Right. Oh, by the way, um, so

34:38let's talk about how we would do that so

34:41the way you would do that is you'd factor

34:44A11; okay, and to compute this you would do a

34:49solve right because you factored it and

34:51now you do a solve to calculate this you

34:54do a solve on each column of A12 —

34:57everybody got that right so this could

35:00be — that could end up being

35:02quite efficient,

35:04right so so the way you know this if I

35:08if I were to double click on that and

35:09expand it it would look like this Factor

35:11a11 then solve with B1 and solve with

35:16each column of a12 right that gives me

35:21matrices okay uh next step is I form s

35:24that's the sure complement which is this

35:27um, and I form this b-tilde thing over here,

35:31which is the right-hand side of this

35:33equation. Right — so this equation up here

35:34is S x2 = b-tilde, and then what I

35:38do is find x2: what I do here

35:41is a solve — I do a factor on S, then a

35:45back solve — and then I can reassemble:

35:47this gives me X2 directly and I can get

35:49X1 from like this equation here and by

35:52the way, when I solve this equation, I

35:54already factored A11, so this is

35:56just a solve. Okay, right.

36:01So that's the idea.
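[Editor's note: the generic block-elimination algorithm just described can be written down directly. Here is a small NumPy/SciPy sketch with made-up blocks; the `lu_factor`/`lu_solve` pair plays the role of "factor once, then solve":]

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Made-up blocked system A x = b with A11 invertible.
rng = np.random.default_rng(1)
n1, n2 = 6, 3
n = n1 + n2
A = rng.standard_normal((n, n)) + n * np.eye(n)   # keeps A11 well conditioned
b = rng.standard_normal(n)
A11, A12 = A[:n1, :n1], A[:n1, n1:]
A21, A22 = A[n1:, :n1], A[n1:, n1:]
b1, b2 = b[:n1], b[n1:]

# Step 1: factor A11 once; solve with b1 and with each column of A12.
f11 = lu_factor(A11)
A11inv_b1 = lu_solve(f11, b1)
A11inv_A12 = lu_solve(f11, A12)

# Step 2: form the Schur complement S and the reduced right-hand side.
S = A22 - A21 @ A11inv_A12
b_tilde = b2 - A21 @ A11inv_b1

# Step 3: solve the small system S x2 = b_tilde, then back out x1.
x2 = np.linalg.solve(S, b_tilde)
x1 = A11inv_b1 - A11inv_A12 @ x2
x = np.concatenate([x1, x2])
```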

36:03Okay — now, drum roll, to find out: is this

36:06faster, is it better? So:

36:13in step one, f is going to

36:17be the factor cost for A11 and s is the

36:20solve cost. So in the first case you

36:24factor A11 once and then you solve for

36:27all the columns of A12, and so that's n2

36:31times s — so that's the cost of step one.

36:34Step two — um, here, that's actually

36:37just forming the

36:41Schur complement matrix, and what we're

36:42going to do is we're

36:43going to imagine that

36:45we're not going to exploit any structure

36:47in the Schur complement — although one could,

36:49but we're not going to; we're just going

36:50to treat it as dense. Um, and then step

36:52three is going to be order n2 cubed, because

36:55that's just to factor

36:57and solve it. Okay — and so the total is

37:00this number up here: it's f plus n2 s plus

37:02blah blah okay all right so let's just

37:06throw in a general A11 and see what

37:08happens. Um, when you do — I mean, hardly

37:10surprising — but actually, down to the flop,

37:13it's identical. Congratulations:

37:15you've done a bunch of

37:17algebra and analysis and it did

37:19absolutely nothing. Okay, so —

37:22everybody got that? Oh, by the way, it does

37:25nothing in terms of flop count; in

37:27reality it does do things. Right — so in

37:29fact, people use this sort of

37:31block elimination method — they actually use

37:33it for dense matrices — and the reason

37:36is you would choose n1 and n2 so that

37:38one of these solves could be done in

37:42cache, right, and therefore done quickly.

37:45so so so that actually happens but again

37:47all of us are blissfully unaware of

37:49those details but so okay um now it gets

37:53interesting it's interesting precisely

37:55when f is small compared to n1 cubed.

37:59So here's a great

38:02example: A11 diagonal. The factor

38:06cost for a diagonal matrix is zero — I mean,

38:08whatever, or n if you want, it

38:10doesn't matter — there, L is going

38:13to be the matrix itself and U will be

38:15the identity, so it's zero. Okay.

38:18Um, so here it's actually really

38:22interesting: you get 2 n2^2 n1

38:26plus (2/3) n2^3. What's amazing about that is

38:28that's linear in n1. So now I'm going to

38:30show you some

38:34patterns. And here it is — one, actually;

38:37I was just showing you one I had one

38:39right there uh here's one remember this

38:46one — that's this downward arrow pattern.

38:49Um, someone suggest a blocking of it: like,

38:53what would be A11? Do you see a leading

38:55matrix here that is easy to

38:58invert? Diagonal — yeah. So here's —

39:02here's A11; it looks like this,

39:04right — it just looks like that.

39:07So this is A11: diagonal. You know,

39:12actually, I was going to say matrices

39:13don't come easier to invert than

39:15diagonal — that's false; here's one that's

39:17easier: a multiple of the

39:19identity. That's pretty

39:22straightforward — actually, I think in

39:23terms of flops it's the same anyway, so

39:25it doesn't even matter. But anyway, here

39:26it is um okay so here's one where

39:30instead of this, n1 is — I don't know,

39:33let's suppose the original thing was n —

39:34n minus one, and n2 is one, because

39:37this thing is one by one,

39:39right down here. Um, and so in this case

39:43it turns out you're going to get a

39:45stunning reduction, right, because

39:47you can

39:49invert this fast, and in that case you

39:51would get exactly this number over here:

39:53you would get, you know, 2 n2^2 n1. Right —

39:57so okay that's that's one example but

40:00I'll show you I'll show you some others

40:01Right. Um, what it means is: what

40:05if you saw a sparsity

40:08pattern that looked like

40:10this, right? This is banded, and then

40:16over here these are dense —

40:18these are called dense rows and columns.

40:20Everybody got that? So actually it's just

40:22a block arrow matrix, but with a

40:26thicker weight on the things, because

40:28they're wide, right? That's banded,

40:30and then this is a whole

40:32chunk of stuff over here. Right — everybody

40:34got that? So what this says is that

40:38this block elimination method here —

40:41well, it's actually insane: it's

40:44actually going to be linear in the

40:46dimension. If this width

40:49doesn't grow this way, and if the

40:52bandwidth doesn't change, this gets

40:54bigger and bigger and we can still solve it,

40:56right? So, anyway — it's pretty cool.

41:00so uh and that means if you see that you

41:04should smile — we'll put it

41:06that way. Or — you're probably not going to see

41:08that, because someone would

41:11already have to know. But the way it's

41:13really going to work is: someone's going

41:14to be talking about a problem it's going

41:17to be some weird, complicated problem — who

41:18knows what field; it doesn't matter — they're

41:20going to be talking about some

41:20complicated thing, and while they're talking,

41:23you will be forming in your mind what the

41:25sparsity pattern of the associated

41:27matrices looks like. Okay? When that image

41:30comes to mind you will smile and you

41:33will say: I can solve that problem in

41:37half a second. And they'll say: I can't even

41:40load it into memory. And you'd say: just

41:44chill, and I'll show you

41:46what we're going to do. Right? Everybody

41:48following this? So, I mean, these are

41:50incredibly useful things right so and

41:52you know, we can think about what this is:

41:55this is basically a set of equations

41:57where banded means that equation, you

42:01know, 50 only references up to, let's say,

42:0760. Okay — and that would be here. Then

42:11you say: but there are some special

42:12variables here which affect

42:14everybody. Okay, so that's

42:17that's what this is so you're going to

42:18want to be making those uh yeah so yeah

42:22to be fully developed I would say here's

42:24what you want you want to be able to get

42:27a feel for when applications involve

42:30matrices that look like

42:32that then you're going to be happy

42:34because you're going to know something

42:35other people in the room don't because

42:37you're going to know that you can solve

42:38that in linear time right so that's why

42:42people would say, like: well, I have to

42:44find this whole trajectory — it's,

42:45you know, 100,000 time

42:47points — and you're like: so? I

42:50mean, do you have a phone, or whatever?

42:52It's just not going to be that

42:53complicated, right — again, if you know what

42:54you're doing. Okay. So, okay.
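[Editor's note: the payoff for the block-arrow pattern just described is easy to demonstrate. Here is a hedged sketch of the simplest case — A11 diagonal rather than banded, with made-up data — where every "solve" with A11 is just an elementwise division, so the whole computation is linear in n1:]

```python
import numpy as np

# Arrow system: diagonal A11 (n1 x n1) bordered by n2 dense rows/columns.
rng = np.random.default_rng(2)
n1, n2 = 100_000, 5
d = 1.0 + rng.random(n1)                  # diagonal of A11
A12 = rng.standard_normal((n1, n2))
A21 = rng.standard_normal((n2, n1))
A22 = rng.standard_normal((n2, n2))
b1 = rng.standard_normal(n1)
b2 = rng.standard_normal(n2)

# Block elimination: "solves" with A11 are just divisions, so this is O(n1).
A11inv_b1 = b1 / d
A11inv_A12 = A12 / d[:, None]
S = A22 - A21 @ A11inv_A12                # small n2 x n2 Schur complement
x2 = np.linalg.solve(S, b2 - A21 @ A11inv_b1)
x1 = A11inv_b1 - A11inv_A12 @ x2
```

[Storing the full matrix densely would take roughly 80 GB at this size; the pieces above are a few megabytes, and the solve runs in a fraction of a second.]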

43:00Now this leads us to — and this,

43:04probably, people have seen; it's got lots

43:06of names, depending on where you're from.

43:08I think it's called, like, Sherman

43:11Morrison Woodbury; sometimes it's called

43:13the Woodbury thing; sometimes it's called

43:14the matrix inversion

43:16lemma. And I have a friend who can

43:19actually tell where someone learned

43:21their linear algebra because they use

43:24different terms and they they would they

43:26even use different terms I believe at

43:27Oxford and Cambridge just just so you

43:29know so this is like someone can just by

43:31asking someone what they call this and a

43:33couple of other questions you can figure

43:35out that they got their PhD at Oxford or

43:37Cambridge or something or Harvard or

43:38whatever anyway or MIT or something okay

43:41So here it is: you start with (A plus

43:45B C) times x = b. Right — so that's your

43:50system. In fact, some people

43:52call this a low-rank perturbation

43:56of A — I mean, that's not fair, because no

43:58one said that B and C are small, right?

44:01But if B had

44:04just a handful of columns and C had a

44:05handful of rows, this would indeed be

44:08what people would call a

44:09low-rank update to A. Okay — so we want to

44:13solve a system that looks like that. So,

44:15in fact, an extreme case would be things

44:17like — here's one you hear all the

44:19time: diagonal plus low rank. There, A is

44:24diagonal and B and C are small, right? So

44:27B has got, you know, a small number of

44:29columns, C has got a small number of rows —

44:32that's super common. That's also

44:34something you will I would hope never

44:36ever forget diagonal plus low rank and

44:39it's it's all related um here's how it's

44:43related um yeah actually let's just talk

44:45about this in the case where A is

44:47diagonal and B and C, you

44:49know, have a modest number of columns,

44:51right. So, in fact —

44:56basically all of quantitative finance

44:59uses a risk model where the covariance

45:01matrix is diagonal plus low rank, and

45:04it's called a factor model, right — with C equal to B

45:08transposed. And if this were, like, a

45:12Gaussian or something — if this is

45:14the covariance of a Gaussian — if you

45:15understand what I'm saying that's fine

45:16if you don't that's also cool but this

45:18is just completely Universal right so

45:20What it would mean — it's

45:22even got a Bayesian model — is it basically

45:24says that the returns of all

45:27these assets are driven by, let's say,

45:31B and C. Typically B is about a

45:33width of 100 — that would be very typical;

45:36C would have a height of — well, C is B

45:38transposed, right. So you'd say: look, I

45:41don't know what these

45:438,000 returns are doing but they're

45:46mostly driven by these 100 factors right

45:51which are, actually —

45:53that's the underlying thing, that's

45:55what's actually driving their returns.

45:57And then the A is diagonal, and

45:59they have a beautiful name for it: it's

46:00called idiosyncratic. So that's the

46:02idiosyncratic risk — it's the amount by

46:04which, you know, Google, or

46:08Coca-Cola, or

46:10Chevron, goes up and down all by itself. In fact, the way they would say it —

46:13this is dialect — is: that is the

46:16part of the variance not explained by

46:18the factors. Everyone got that? So that's

46:20it. So — the factors are actually kind

46:22of interesting; sorry, this is just a

46:23weird aside, but since we're here:

46:26the factors are interesting. Some are

46:28handcrafted, so they're literally

46:31matrices with zeros and ones and weird

46:33stuff like that — like: what

46:35region are you

46:38in, right? Are you a

46:41pharmaceuticals company? Do you

46:43do defense? Do you do that

46:45kind of thing? Others are actually

46:47crafted by people using

46:48numerical algorithms. So, okay, that was

46:50just a small story about this okay um oh

46:55If I have something that

46:56looks like that — A is diagonal and B and

46:58C are small — I can store that, no

47:00problem, right? That's true even for

47:02very big ones. Like, say you

47:05want to run a fund that is looking at a

47:07universe — I'm using dialect — of, let's say,

47:10100,000 assets, right, across the whole

47:13world; I know people who do that.

47:15So you do that. Um — so the covariance

47:18matrix is a 100,000 by 100,000 matrix, or

47:20something like that. You might ask someone:

47:22well, what's your risk model? And

47:24they would say: oh, I have a 200-factor

47:28model. And that means their covariance

47:31matrix is expressed as a diagonal matrix —

47:34which is, whatever it is, 100,000 by

47:38100,000 — and then they have 200 factors,

47:42and all of that is completely

47:43storable and stuff like that. And you'll

47:45see — in fact, not only can you store it, you

47:46can actually solve problems super fast

47:48and things like that. Okay — right, that was

47:50all a weird aside, but let's keep going.

47:52So now, what you do here

47:54is something that is actually a little

47:56bit unintuitive: you

48:01un-eliminate. We did this before, in duality,

48:04and it turned out well there. To

48:06un-eliminate means to introduce new

48:08variables, right? To eliminate

48:10variables means to, you know, solve for

48:13them, get rid of them, and then substitute

48:14their values back into the rest — that's

48:17elimination, right? That sounds good,

48:20especially to an amateur, or somebody

48:21first seeing this material, because, you

48:24know — someone asks: what are you doing? —

48:25I'm eliminating some variables. Why? Because I

48:27like to solve smaller sets of equations.

48:29I mean, to everyone that

48:31makes perfect sense, right? It's

48:33completely and totally wrong — that's not

48:36how you should be doing linear algebra —

48:38but it does sound good, and people can

48:39fall for it. Right? Okay: un-elimination

48:42is the opposite you say thank you for

48:44your set of equations if you don't mind

48:46I'm going to introduce new variables and

48:47equations and make it a bigger one and

48:49someone's like what why would you do

48:52that everyone see what I'm saying you'd

48:54say thank you for your you know thousand

48:56by thousand set of equations if you

48:57don't mind I'm going to expand it to a

48:5910,000 by 10,000 set of equations, if it's okay

49:01with you right and they would say like

49:04why anyway and the answer is you'll

49:07you'll see why in a minute right so we

49:09write this this way: we just introduce a

49:10new variable y, which is C x,

49:13right — and I write that second

49:15equation as this bottom row, which is

49:18C x minus y = 0. Right — so the

49:22second system of equations is bigger and

49:26uh completely equivalent to the first

49:27one right okay now let's talk about this
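[Editor's note: the un-elimination step can be sanity-checked numerically — build the bigger bordered system and confirm it gives the same x as the original one. A minimal sketch with made-up small dimensions:]

```python
import numpy as np

# Original system: (A + B C) x = b.  Un-eliminate by introducing y = C x:
#   [ A   B ] [x]   [b]
#   [ C  -I ] [y] = [0]
rng = np.random.default_rng(3)
n, p = 8, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)
B = rng.standard_normal((n, p))
C = rng.standard_normal((p, n))
b = rng.standard_normal(n)

M = np.block([[A, B],
              [C, -np.eye(p)]])
z = np.linalg.solve(M, np.concatenate([b, np.zeros(p)]))
x, y = z[:n], z[n:]

x_direct = np.linalg.solve(A + B @ C, b)   # same answer, smaller system
```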

49:32um when you see a matrix like that it

49:35you just encountered all this so far

49:38just a few minutes ago so I don't expect

49:40you to have this response now but

49:42hopefully within a week or two you

49:45should have the following response you

49:47will see a block there — a minus I — and

49:50your eye will go to it, because it's

49:53really easily inverted and you're going

49:55to see a block, and you're going to —

49:57well, you're going to have an urge to

49:59do block elimination on this, right?

50:02And so what you would do is

50:04you'd say: oh, I know what I'm

50:05going to do — I'm going to permute the

50:07minus I up to the top, and then

50:09I'm going to use block elimination from

50:10the previous stuff. Right? Everybody got

50:13that? And then you eliminate it, and you

50:15will end up forming the Schur

50:18complement equations, and it will be (A

50:20plus B C) x = b — so you'll be right back where you

50:23started. Okay. So here's what you do

50:26instead: you resist the urge here to

50:31eliminate the minus I — even though it is

50:33a very attractive target for elimination,

50:35because it's easily inverted, right? You

50:38resist that urge, and instead you eliminate

50:41using the A block, and you will end up with a Schur

50:43complement equation that looks like

50:46this. Okay — um, and this allows you to get

50:50y: here you solve for y, then

50:53you plug it back in and get x,

50:55and you end up with a formula that's

50:57called the matrix inversion lemma — that's

50:58the neutral version. It says that (A

51:01plus B C) inverse is, you know, this

51:03horrible formula over here. Yeah — so

51:06that's it. Um, this gets interesting if,

51:09say, A is diagonal or something like

51:12that — anyway, a lot of things

51:15come out of that. So:

51:17how many people have seen like Matrix

51:19inversion Lemma or something before or

51:21or Sherman-Morrison-Woodbury — it's

51:25okay uh any other names for it I uh I

51:29used to know at least one or two others

51:31but anyway anybody else know another

51:34name for that it's a thing in linear

51:37algebra, and what you call it depends on

51:39where you came from so

51:42okay okay so I think actually I'll uh oh

51:45Yeah — well, here's an example: A

51:47diagonal, and B and C are dense. Um —

51:50you basically — you're

51:55going to form D = A plus B C. This is

51:59if you do it the dumb way, right: you just

52:01write it out as a dense matrix and

52:03solve. Um — in this case you get

52:06something that's order n cubed,

52:09and if you do it the right way you get

52:10something that's linear in n. Right? And

52:13these make a big difference.
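[Editor's note: the "right way" for diagonal plus low rank is exactly the matrix-inversion-lemma / un-elimination recipe — all the work is a p x p solve plus O(n p) arithmetic, so it is linear in n. A hedged NumPy sketch with made-up data, using the factor-model structure mentioned above (C equal to B transposed):]

```python
import numpy as np

# Solve (diag(d) + B C) x = b without ever forming the n x n matrix.
rng = np.random.default_rng(4)
n, p = 50_000, 10
d = 1.0 + rng.random(n)              # diagonal part (e.g. idiosyncratic risk)
B = rng.standard_normal((n, p))      # e.g. factor loadings
C = B.T                              # factor-model case: C is B transposed
b = rng.standard_normal(n)

# Matrix inversion lemma: eliminate x from the bordered system, leaving a
# small p x p Schur system in y = C x.
Dinv_b = b / d
Dinv_B = B / d[:, None]
y = np.linalg.solve(np.eye(p) + C @ Dinv_B, C @ Dinv_b)
x = Dinv_b - Dinv_B @ y
```

[The dumb way at this size would mean forming and factoring a 50,000-square dense matrix; here the only dense solve is 10 x 10.]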

52:16So yeah — by the way, a lot of this

52:19stuff subsumes things that were taught

52:21in the '60s, '70s, and '80s, and sometimes

52:25by people who learned it from people who

52:27learned it in the '60s and '70s. Okay — so,

52:30for example, if you teach a course on

52:32control — I've done this;

52:35I haven't done it for a long time, and I

52:36won't do it again — you go through a

52:38long derivation of what, like, the Kalman

52:40filter is, or how to solve an optimal

52:42control problem, right? And it's very

52:44clever, and you have little parts and

52:45this Bellman equation and blah blah blah,

52:47and you end up with

52:49stuff that actually works — it's great. Um —

52:52a very funny thing happens if you simply take

52:55the problem represented by an optimal

52:57control problem or a Kalman filter and

53:00just throw it at a sparse solver: guess

53:02what — it'll do it;

53:06it'll just do it. So basically all of those

53:07lectures were useless, is what I'm saying,

53:09as a practical matter — which is kind of

53:11embarrassing, but whatever. So I don't

53:13teach that anymore — I don't teach it

53:14that way anymore; I just say: you get a

53:17giant set of linear equations and you

53:21solve it using sparse methods. Done.

53:24That is the correct way to do

53:26it, and it gives you a lot of stuff. Right — so

53:29this happens in a lot of fields. Like — I

53:32think, I'm almost certain, in

53:33statistics there'd be many, many papers

53:35written on: if you had a factor model — so

53:38the covariance matrix is diagonal

53:40plus low rank — there would be all sorts

53:43of intensely clever

53:44ways to condition. Suppose I give

53:46you 80% of the entries of a vector

53:49and I said: please compute the

53:50conditional means of the

53:51remaining ones. Everybody follow this?

53:53That's like a conditional mean,

53:55right? So I'm sure there are whole papers

53:58written on this, and clever

54:00methods for doing it, and blah. Anyway —

54:02the answer is: you form the covariance

54:05matrix, literally write down the

54:07equations you want to solve, and then you

54:10do nothing but pass it to a sparse

54:12solver, and it'll be

54:15just as fancy — probably more general — than

54:18all those papers. So, okay — it's

54:20embarrassing but true. So, okay. Um — all

54:23right. Um — so I'm not going to

54:27go into this; there's a whole lot

54:28more I haven't told you about linear

54:30algebra, but these

54:31are the main points. So, okay. So, let me — um,

54:34yeah. Right now this

54:37is just sort of hanging here now you

54:38just know about linear algebra but I

54:39promise in the next week we're going to

54:42connect it to problems and then that to

54:46optimization problems and that will have

54:48a connection to applications and then

54:50you will start being able to listen to a

54:53problem a practical problem and say I

54:56can solve that like I can solve that

54:58super fast right so everybody uh getting

55:01this? Right. So — there are people

55:04for whom this is all they do, right? If you

55:05do scientific computation, this is kind

55:07of a lot of the stuff you do. So, okay — so

55:10now we're going to start

55:13minimization. Um — and, like I say,

55:16we're building slowly up. At the

55:18lowest level — well, at the lowest level

55:20are, you know, computers,

55:23compilers, microcode, things like that,

55:25which we're not going to look at. The next

55:27level up — the bottom floor — is what we

55:29just covered: linear

55:30algebra. Now the next floor is: how do you

55:33solve, you know, smooth unconstrained

55:36problems? So we'll look at

55:39um and we'll build slowly up uh then

55:43we'll get equality constraints then

55:44we'll pull in inequality constraints and

55:46then we're there right so okay so here's

55:49what it looks like. We're going to

55:50minimize a function f here that we're

55:54going to assume is smooth —

55:56twice continuously differentiable, right?

55:59Um — and we'll assume that, you

56:01know, the optimal

56:04value is achieved, right — so

56:05there's at least one optimal point. Right —

56:07now, by the way, a lot of problems like

56:09that, that we want to solve in this class —

56:10problems you have already solved — don't

56:13satisfy this. Right? Like, if you did an l1

56:16fitting problem, you would minimize the

56:18l1 norm of Ax minus b, and that doesn't count —

56:21that's not covered here, because that's not

56:22differentiable. Okay — um, okay, so this is

56:26what we're going to do. Um — in general,

56:29except for a very important set of

56:31problems we will see shortly, there

56:33are no analytical solutions to this.

56:36Right? So instead we're going to

56:40have iterative methods. An iterative

56:41method is going to generate, you

56:43know, an x; it's going to query something

56:46about f and then generate a new x, and so

56:49on — a new x and a new x — and what's going

56:51to happen is those x's are going to

56:52converge to — hopefully — a

56:55solution of this problem. Right? Um — and

57:00what we'll do — you know, the typical

57:02thing you're going to get is something

57:04like this: the sequence is going to have the

57:06function values converge to the optimal

57:08value; and another way to say it is

57:10that you're

57:13going to have a sequence of points where

57:14the gradient gets smaller and smaller.

57:16Okay — and then when you deem it to be

57:18small enough — um, you asked earlier about

57:20stopping criteria: here you have stopping

57:22criteria. For the direct methods you

57:24don't have stopping criteria — well, it

57:27just stops; it's like, after n cubed over

57:29three flops you've got the Cholesky factor —

57:32there is no stopping criterion there.

57:36Okay. So — well, we'll make some

57:39assumptions. I don't want to get too

57:41technical about it, because it

57:42quickly gets kind of boring and stuff

57:44like that. Um — and by the way, there are

57:46entire classes that cover

57:48nothing but this material in hideous,

57:50full mathematical detail. So, I mean,

57:52they're interesting, but —

57:55I don't know, I think some

57:58people grow out of it. Anyway —

58:00I'll stop there. Okay:

58:06Um — so, yeah, we're going to assume that

58:08the sublevel sets are closed, and

58:11one condition for that that's pretty

58:13common is that the function f is

58:15closed. Like, if you do convex analysis —

58:18which is the mathematics of convex

58:19optimization and convexity — this is a

58:22central concept; it just means that the epigraph is closed,

58:25right. Um — and one example would be —

58:29well, here are

58:32some examples of closed functions,

58:33right: log-sum-exp — this is because the

58:36domain of this is everything, right, so

58:38this would count. Um — here's

58:40another one. This is weird, because you

58:42might say: but there's a constraint — the

58:44constraint is that x lies in the

58:46interior of this polyhedron, because

58:48otherwise — I mean, there's an

58:50embedded constraint here, and you might

58:51say: no, I don't like that. But this one is

58:53perfectly good: all the sublevel

58:55sets are closed, and the reason is

58:57that the function value goes to infinity

58:59as you approach the boundary. So we'll

59:01get to that. Okay? Is that okay? So —

59:04these are technical conditions — I

59:06think not super duper important.

59:10Um — people sometimes assume

59:14something like a minimum

59:16curvature, right? So what that says is, for

59:19example: the minimum eigenvalue of the

59:21Hessian is bigger than some number m,

59:23which is positive, right? That gives

59:25you a minimum curvature of the function,

59:28right? That'll allow you to say all

59:30sorts of things. Um — yeah, now you could

59:35ask questions like: um, how would you know

59:38what the minimum curvature is? It's

59:40a good question, right? And the answer

59:42is: in most cases you would have

59:43absolutely no idea. There are cases where

59:45you do. Um — oh, here's one:

59:48let's say I do multinomial logistic

59:50regression and I add quadratic

59:53regularization to it.

59:55Then I don't even count the

59:57curvature that comes from the

59:59multinomial logistic regression loss;

01:00:02the quadratic

01:00:03regularization alone gives me a minimum

01:00:05curvature. Everybody follow? So, I mean,

01:00:08these are kind of rare, right — but that

01:00:09is an example where you actually

01:00:11would know a candidate value for m, right —

01:00:14which is literally — it's

01:00:17going to be the regularization, the ridge

01:00:19regression, you know, parameter. Um — okay, so

01:00:23if this is the case then this says that

01:00:25f minus m over 2 times the norm of x squared is

01:00:34convex and if you then

01:00:36apply the basic convexity formula to

01:00:39that you'll find out really cool you get

01:00:41a strengthened inequality I guess

01:00:43it's not quite Jensen uh you know

01:00:45this first part up to here that's

01:00:48true this is true for any convex

01:00:51function this part right that's the

01:00:53basic inequality what this says is oh

01:00:55yeah well it applies with a little bit

01:00:57more right it's actually the

01:00:59minimum curvature constraint right so it

01:01:02strengthens that basic inequality we saw

01:01:04six weeks ago or whenever it was seven

01:01:06weeks ago six right um okay and this

01:01:11means all sorts of cool things because

01:01:12for example I could minimize both

01:01:16sides of this inequality over y on the left I get

01:01:21P star because it's just the minimum of

01:01:22f of Y on the right right this is a

01:01:25quadratic function in y I set the

01:01:27gradient equal to zero I minimize it and

01:01:30you get a number and

01:01:32anyway you end up with

01:01:34something that looks like this you end

01:01:35up with this inequality right and so

01:01:38this is kind of cool because this

01:01:40says that if you wanted a stopping

01:01:43Criterion you would run the algorithm

01:01:45you'd evaluate the gradient and when the

01:01:46gradient gets small enough you quit and

01:01:48someone says why are you quitting um and

01:01:51you'd say well the gradient is less than

01:01:52something and they go yeah but how far

01:01:54am I from optimal and then you would say

01:01:56oh good question I can actually bound it

01:01:58I can say that here's how far you are

01:01:59from the optimal value no more than 1

01:02:01over 2m times this of course then in

01:02:04general someone would say how do you

01:02:05know m and the answer is you

01:02:06don't and blah blah blah so but okay
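A quick numerical check of that bound (on a hypothetical strongly convex quadratic, chosen because there m and p-star are computable exactly): f(x) − p* ≤ ‖∇f(x)‖²/(2m) at every point:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
Q = rng.standard_normal((d, d))
A = Q.T @ Q + np.eye(d)            # positive definite: f below is strongly convex
b = rng.standard_normal(d)
m = np.linalg.eigvalsh(A).min()    # minimum curvature (rarely known in practice)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
p_star = f(np.linalg.solve(A, b))  # exact optimal value for a quadratic

# the suboptimality is bounded by the squared gradient norm over 2m
for _ in range(100):
    x = rng.standard_normal(d)
    assert f(x) - p_star <= grad(x) @ grad(x) / (2 * m) + 1e-9
```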

01:02:09everybody this is the

01:02:12justification you would have for

01:02:14stopping when the gradient is small

01:02:18um so we'll say a little bit about

01:02:21descent methods these are

01:02:24you know super widely used um and

01:02:28they all have a generic

01:02:30form and it looks like this um in each

01:02:33step of a descent method

01:02:36you're going to find something called

01:02:38Delta X and that's called the step or

01:02:40the search Direction here um and so you

01:02:45know basically what happens is at

01:02:46iteration K you have XK you make some

01:02:49queries to F about like what's the

01:02:52gradient or something like that and then

01:02:54you form the search Direction so you

01:02:56stop and you ask for directions and it

01:02:58says well you're here you want to

01:03:00minimize you know you should kind of go

01:03:01in that direction that's delta x this

01:03:03direction then what you do is you're

01:03:05going to go in that direction

01:03:07with an amount that you're going to then

01:03:09choose and that amount is scaled by this

01:03:11Factor T which is positive and this is

01:03:15uh that's sometimes called the

01:03:17step size or the step length even though

01:03:19it need not be it's not really the

01:03:21length unless the norm of the search direction

01:03:23is equal to one right but still people

01:03:25call it the step length or the step size

01:03:26or something like that so okay um that's

01:03:30what you do and it turns out that if

01:03:35the direction you move in has

01:03:38a negative inner product with the

01:03:40gradient

01:03:44and if you choose the step size

01:03:45correctly the new point will actually

01:03:48have a function value less than the

01:03:49original Point that's good that's called

01:03:50a descent method because you're

01:03:53kind of descending right and

01:03:54you end up with a function value

01:03:56that's lower okay so here's like a

01:03:58generic version of it uh to make this a

01:04:02real algorithm we have to fill in things

01:04:05we have to figure out how you find the

01:04:07direction and then we have to figure out

01:04:10how do you choose the step length right

01:04:12and that's the whole thing right and the

01:04:14stopping Criterion could be some Norm of

01:04:15the gradient usually it's not usually

01:04:18it's something that is computed as a

01:04:20byproduct of these two steps in here so

01:04:22so this is going to be like hundreds of

01:04:24different methods with different names

01:04:26are going to have exactly this form

01:04:29right and they differ in what's the

01:04:31search Direction and they differ in what

01:04:33is the line search okay so that's

01:04:36the

01:04:39idea so for line searches um there's a

01:04:44bunch I mean here's one an exact

01:04:46line search that says once you've

01:04:48decided that you're here and you're

01:04:50going in that direction you actually

01:04:53look at the function restricted to that

01:04:55Ray and then you look for the point

01:04:57where the where along that Ray you have

01:04:59the smallest values that's called an

01:05:01exact line search and you could do this

01:05:03in some cases right uh where you can

01:05:06actually work this out analytically or

01:05:07you can compute it cheaply or something

01:05:09like that right so that's that's one one

01:05:12place where you you might do this right

01:05:14um another one which sounds much cruder

01:05:18is called a backtracking line search and

01:05:20this has many many names again depending

01:05:21on where you came from it's called an

01:05:23Armijo line search it's got all

01:05:26sorts of names for it the Armijo condition

01:05:28there's at least one of those I forget

01:05:29what it is in Russian but they have a

01:05:31different name for it so it's all this

01:05:34the lots of different names for this um

01:05:36so this has been rediscovered like I

01:05:37don't know a couple hundred times here's

01:05:39the way it works it's

01:05:42called backtracking and it goes like

01:05:44this you start with T equals 1 so you

01:05:47try x plus delta

01:05:50X then what you're going to do is you're

01:05:52going to check whether f has gone down

01:05:55enough and this is critical by the

01:05:57way if it went up that's

01:06:00basically

01:06:02really bad right in fact you can even be

01:06:04out of domain so that that would be like

01:06:06a very you know that that would be it

01:06:07went up a lot because it went up to plus

01:06:09infinity right so okay so here's what

01:06:12this looks like here's t here um

01:06:16here is the function restricted to

01:06:20t this would be the exact line

01:06:24that's the exact step

01:06:26length you would take to

01:06:27minimize this right um so the way this

01:06:30works is you look at this basic

01:06:33inequality here that basic inequality is

01:06:35this line right here right this dash

01:06:37line that's what and what it says is uh

01:06:41if I put Alpha equals 1 in that equation

01:06:43I can reverse that inequality because

01:06:46basically when I continue this

01:06:48linear thing I guarantee this is a lower

01:06:50bound on this right so the

01:06:53actual function value could never be

01:06:55below this if if the function's convex

01:06:58right so here's what you do we put an

01:07:00alpha in there uh which is a number

01:07:03that's less than one uh yeah actually in

01:07:05fact it's less than a half for weird

01:07:07reasons though it turns out

01:07:09it's not going to matter what it is too

01:07:10much um and then we degrade the slope by

01:07:13multiplying by Alpha so here Alpha is

01:07:15maybe a third I just made it up and that

01:07:19line so if you take a really small step

01:07:22length you're definitely below this

01:07:24line because if you take a really small

01:07:25step length here this is actually a

01:07:27really good approximation of how much

01:07:29the function is going to go down right

01:07:31um and then the rule

01:07:33is you have to be below this dashed line

01:07:38this degraded predicted line and so the

01:07:40way that works is you know you might

01:07:42start here and query the function and

01:07:44find this value and you're above this in

01:07:47fact the most dramatic one is you'd

01:07:48start way out here you're

01:07:49over here and actually when you do x

01:07:52plus delta x in that case you

01:07:54actually have a higher value than f of x

01:07:57everybody see that so so that would be a

01:07:59particularly poor choice of Step length

01:08:01right so what you do then is you

01:08:05multiply this by some Factor beta beta

01:08:09is typically like a half though people

01:08:11also use 0.1 so you simply say

01:08:14okay that was too much so you go

01:08:16back by half and you land here you try

01:08:18this one you get this value it's less

01:08:20than this thing and you accept it so you

01:08:22take the first one you accept that's

01:08:24backtracking now the story

01:08:28of a lot of this

01:08:30unconstrained

01:08:32minimization material is that there'll be

01:08:35several times when something you would

01:08:37think is right is just wrong

01:08:39completely wrong so now I'll tell you

01:08:41the first one you would look at this and

01:08:43you'd say it's obviously clearly

01:08:46it's better to do an exact line search

01:08:49that's totally obvious right because

01:08:51you're actually minimizing the function

01:08:52along this line right this makes sense

01:08:55and this crude thing that stops when

01:08:57you've gotten like some acceptable by

01:08:59the way people use numbers like alpha

01:09:02equals 0.01 right meaning if you get even just

01:09:05the slightest decrease like good fine

01:09:07thanks we'll take it so everybody see

01:09:09what I'm saying so looks very

01:09:11unsophisticated right um anyway all

01:09:15right now I'll tell you the punch

01:09:16line the punch line is it doesn't make

01:09:19any difference uh that in fact exact

01:09:21line search is not only not better in

01:09:23practice it's often

01:09:25worse and the simple dumb thing actually

01:09:28works just fine thank you this is not

01:09:31obvious right so um
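Putting the pieces together, a minimal sketch of the generic descent loop with a backtracking (Armijo) line search — the quadratic test function and the alpha = 0.3, beta = 0.5 values here are just illustrative choices:

```python
import numpy as np

def backtracking(f, g, x, dx, alpha=0.3, beta=0.5):
    # start at t = 1; while f(x + t*dx) sits above the degraded linear
    # prediction f(x) + alpha*t*(g @ dx), shrink t by the factor beta
    t = 1.0
    while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
        t *= beta
    return t

def descent(f, grad, x0, tol=1e-8, max_iter=1000):
    # generic descent method: pick a direction, pick a step size, update, repeat
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # simple stopping criterion
            break
        dx = -g                        # steepest-descent direction, so g @ dx < 0
        x = x + backtracking(f, g, x, dx) * dx
    return x

# toy problem: f(x) = 0.5*(x1^2 + 10*x2^2), minimized at the origin
A = np.diag([1.0, 10.0])
x_min = descent(lambda x: 0.5 * x @ A @ x, lambda x: A @ x, [1.0, 1.0])
```

Swapping in a different `dx` or a different line search gives the other named methods of this family; the skeleton stays the same.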

01:09:35okay now we get to a gradient descent

01:09:39method this is the simplest one remember

01:09:41we had to choose a Direction I mean I

01:09:45would hope it

01:09:47would be the first thing you

01:09:49would think of someone says make me an

01:09:51iterative method I'm at x and

01:09:54I evaluate F at that point and I

01:09:56evaluate a gradient now gradient is

01:09:58pointing in the direction of fastest

01:10:01increase of f if I move in that

01:10:03direction F goes up as fast as possible

01:10:05if I move in the negative gradient

01:10:07direction f goes down as quickly as

01:10:10possible and not only would you come

01:10:13up with this um but you'd also ask

01:10:17yourself why would you do anything else

01:10:21I mean think about it if

01:10:22you didn't do the gradient

01:10:25method what happens is you arrive here

01:10:27and someone says where's the negative

01:10:28gradient you go that direction you say

01:10:31what does that mean you go if you go in

01:10:32that direction you're going to go down

01:10:33as quickly as possible and you go cool

01:10:36which way do you recommend I go and you

01:10:38go um that way you're like like what

01:10:41like why would you recommend that anyway

01:10:44you will shortly see why but is this

01:10:46kind of non-intuitive right uh so but

01:10:48we'll we'll get to it anyway so gradient

01:10:50method is the most intuitive one it

01:10:51looks like this uh use any it

01:10:54doesn't matter what line search you use

01:10:57and here it is and you can you know

01:10:59prove convergence of this and all that

01:11:00kind of stuff that's fine um it turns

01:11:02out this is often unbelievably slow so

01:11:07uh so in fact we'll see

01:11:09what other search directions can

01:11:11actually dramatically improve this but

01:11:13this is slow very weird um and we'll

01:11:15look at a picture to understand it so I

01:11:18just take a quadratic looks like this

01:11:20right and here are some level curves of

01:11:21the quadratic down here and Gamma is a

01:11:25number uh oh yeah let's see so gamma

01:11:28allows me to change the eccentricity of

01:11:32the level sets right so if gamma equals

01:11:351 this is one half of x1 squared plus x2 squared oh yeah by the

01:11:39way what's the gradient in that case if

01:11:41the level sets are just

01:11:44here what does the gradient

01:11:52look like in that case this is

01:11:55actually I'll draw this it's

01:12:00here here's your function uh here the

01:12:03level sets here's zero right these are

01:12:06Level sets you're here the gradient

01:12:08points directly out like that and the

01:12:11negative gradient points directly

01:12:12towards zero which is the minimizer so

01:12:16if gamma was one if I ask you to

01:12:18minimize x1 squared plus x2 squared using the gradient

01:12:21method with an exact line search tell me

01:12:23about how well that algorithm is going to

01:12:28work yeah very well like in

01:12:31one step right it's going to say which

01:12:33way should I go you say steepest descent

01:12:35which just happens in this case to aim at the

01:12:39solution right you do an exact line

01:12:41search and it's game over everybody got

01:12:43that okay so but now that makes sense

01:12:47and in fact someone who was proposing

01:12:48like you know a gradient method would

01:12:50say yeah look look how well that works

01:12:52that's great like why would you not do

01:12:54that why would you go in a

01:12:57crazy Direction like that right um but

01:13:01we can actually control the eccentricity

01:13:03by making gamma either really big or

01:13:06small um and in this case you can

01:13:09actually work out analytically

01:13:12what the iterates are right and you get

01:13:14something that looks like this is

01:13:15actually a very good picture to have in

01:13:16your mind of course it's totally silly

01:13:18because we don't care about

01:13:20two-dimensional problems we care about

01:13:22you know problems with a dimension

01:13:23of 200 2,000 20,000 2 million but

01:13:26nevertheless this gives you the

01:13:28picture so here this is when it's just

01:13:31slightly eccentric and here these are

01:13:33the level curves and so the negative

01:13:35gradient is simply

01:13:38orthogonal to the level

01:13:40curves and it says go in that

01:13:42direction this is the minimum

01:13:45because we're doing exact line search

01:13:47and then it says which way should I go

01:13:48and you go like that

01:13:50everybody so this has got uh people

01:13:52refer to this as zigzagging

01:13:55we can make it arbitrarily bad

01:13:57right we can take gamma equals a thousand

01:14:01and you'll find out that

01:14:03this thing's basically going like this

01:14:04and actually I think

01:14:06there's even a name for it in

01:14:07Western optimization

01:14:09but the

01:14:12name I like is it's just called

01:14:14zigzagging everybody got this so by the

01:14:17way you know when you see this picture

01:14:19you can kind of figure out some things

01:14:22immediately that would be pretty good

01:14:24right so if you're in EE or some

01:14:26other field you would say you're kind of

01:14:29going in the right direction on

01:14:32average right I mean the problem is

01:14:34you're just doing this weird oscillating

01:14:35thing up and down like this whereas you

01:14:37want to really kind of be going that way

01:14:39and so you'd say you know what you

01:14:41should run those search

01:14:42directions through a low pass filter or

01:14:44something I mean again I'm speaking

01:14:46dialect um or if you do physics you'd

01:14:48say yeah you know add some momentum to

01:14:50it or something like that or viscosity

01:14:52add a viscosity term right so that you

01:14:55would kind of do something that

01:14:57would go like that and head

01:14:59towards the solution right so and those

01:15:01are all methods we're not going to look

01:15:02at those but that's kind of the

01:15:04idea okay so even this baby example

01:15:07tells us that actually we get

01:15:10some really good intuition here

01:15:14the gradient method works really well

01:15:16when the sublevel sets of your

01:15:19function have low condition number

01:15:25because if they're round the

01:15:28negative gradient points directly to the

01:15:30solution and you're going to go there

01:15:31quickly if it has high condition

01:15:34number or if the

01:15:36sublevel sets are weirdly squished out like

01:15:38this that kind of thing is going to be bad

01:15:41everybody got that so that's kind

01:15:44of the picture here and

01:15:47that's actually super important

01:15:48observation because we're going to use

01:15:50that's actually the segue to

01:15:52basically Thursday's

01:15:53topic right so okay
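The zigzagging experiment is easy to reproduce, because exact line search on a quadratic has a closed form: along the ray, t = gᵀg / (gᵀAg). A sketch on f(x) = ½(x₁² + γx₂²), starting from (γ, 1) (the classic starting point for this example), counting iterations until the suboptimality drops below a tolerance:

```python
import numpy as np

def iterations_to_converge(gamma, tol=1e-6, max_iter=100_000):
    # f(x) = 0.5*(x1^2 + gamma*x2^2); gradient method with exact line search,
    # which for a quadratic minimizes along the ray with t = (g@g)/(g@A@g)
    A = np.diag([1.0, gamma])
    x = np.array([gamma, 1.0])       # classic starting point for this example
    k = 0
    while 0.5 * x @ A @ x > tol and k < max_iter:
        g = A @ x
        x = x - ((g @ g) / (g @ A @ g)) * g
        k += 1
    return k

# round level sets: done in one step; eccentric level sets: zigzagging
assert iterations_to_converge(1.0) == 1
assert iterations_to_converge(100.0) > 10 * iterations_to_converge(2.0)
```

The iteration count blows up with the eccentricity γ, which is exactly the condition-number story above.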

01:15:58just some examples um this is a

01:16:01silly function that's

01:16:04nonquadratic um and this is what happens if you

01:16:06use a backtracking

01:16:09line search and here's an exact line

01:16:10search and you'd look at these two and

01:16:11you'd go like well I told you exact line

01:16:13search was way better but this just

01:16:15shows you kind of the idea um but it

01:16:17shows you that in real nonquadratic

01:16:20examples you're actually getting this

01:16:21zigzagging effect right and

01:16:24then we can look at that over here and

01:16:26see what that looks like so um this

01:16:30looks like this right um and let me

01:16:34explain what this is these are the

01:16:36iterations right this is a log scale

01:16:39telling you actually your

01:16:41suboptimality right um and what you can

01:16:44see is all sorts

01:16:46of weird things here right like the

01:16:48first thing is that actually between

01:16:49zero and like 100

01:16:51iterations for whatever reason the

01:16:53backtracking line search was doing

01:16:55better like I don't know why but it was

01:16:57and this is kind of common right so this

01:16:59is what it looks like

01:17:01the other thing you see is that this

01:17:02kind of gets this gets kind of linear

01:17:04and that's referred to as linear

01:17:06convergence but that's a weird use of

01:17:08linear because it's linear on a log or a

01:17:13semi log plot right so it means because

01:17:17this is a log thing what linear

01:17:19convergence here means is that each step

01:17:22you do each iteration reduces the error

01:17:25by some factor it's pretty small frankly

01:17:28but we could work out what it is right

01:17:31let's say

01:17:34that we hit

01:17:36here so say to go from one to 10 to

01:17:39the minus 4 took you 170 iterations

01:17:42right then you take

01:17:4510 to the minus 4 and you take that

01:17:47to the one over 170th power and that tells you

01:17:51by what factor your error gets smaller each step

01:17:54right so that tells you roughly

01:17:56what it is right okay so that's

01:17:59called linear convergence um okay
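Concretely, using those rough numbers read off the plot (1 down to 10⁻⁴ in 170 iterations), the implied per-iteration error-reduction factor is:

```python
# going from error 1 to 1e-4 in 170 iterations means the error shrinks by a
# constant factor c each step, with c**170 = 1e-4
c = (1e-4) ** (1 / 170)
assert abs(c ** 170 - 1e-4) < 1e-12
assert 0.94 < c < 0.95   # each iteration keeps about 95% of the error
```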

01:18:03I think maybe I

01:18:05won't even start the next

01:18:07section but this is a good place to

01:18:10stop and we'll