00:09 good afternoon everyone thank you all

00:11 for joining today my name is Alexander

00:13 amini and I'll be one of your course

00:15 organizers this year along with Ava and

00:19 together we're super excited to

00:21 introduce you all to introduction to

00:23 deep learning now MIT intro to deep

00:26 learning is a really really fun exciting

00:28 and fast-paced program here at MIT and

00:32 let me start by just first of all giving

00:33 you a little bit of background into what

00:36 we do and what you're going to learn

00:37 about this year so this week of intro to

00:41 deep learning we're going to cover a ton

00:42 of material in just one week you'll

00:45 learn the foundations of this really

00:46 really fascinating and exciting field of

00:49 deep learning and artificial intelligence

00:51 and more importantly you're going to get

00:53 hands-on experience actually reinforcing

00:56 what you learn in the lectures as part

00:59 of Hands-On software labs

01:01 now over the past decade Ai and deep

01:03 learning have really had a huge

01:05 Resurgence and had incredible successes

01:08 and a lot of problems that even just a

01:10 decade ago we thought were not really

01:12 even solvable in the near future now

01:14 we're solving with deep learning with

01:16 incredible ease now this past year in

01:19 particular of 2022 has been an

01:22 incredible year for deep learning

01:23 progress and I'd like to say that

01:25 actually this past year in particular

01:27 has been the year of generative deep

01:29 learning using deep learning to generate

01:31 brand new types of data that have never

01:34 been seen before and never existed in

01:37 reality in fact I want to start this

01:39 class by actually showing you how we

01:40 started this class several years ago

01:42 which was by playing this video that

01:45 I'll play in a second now this video

01:47 actually was an introductory video for

01:49 the class and it kind of

01:52 exemplifies this idea that I'm talking

01:54 about so let me just stop there and play

01:56 this video first of all

02:01 hi everybody and welcome

02:06 to the official introductory course on deep

02:10 learning here at MIT

02:13 deep learning is revolutionizing so many

02:17 fields from robotics to medicine and

02:21 everything in between

02:23 you'll learn the fundamentals of this

02:26 field and how you can build some of

02:29 these incredible algorithms

02:32 in fact this entire speech and video

02:36 are not real and were created using deep

02:41 learning and artificial intelligence

02:45 and in this class you'll learn how

02:48 it has been an honor to speak with you

02:51 today and I hope you enjoy the course

02:56 so in case you couldn't tell this video

02:59 and its entire audio was actually not

03:02 real it was synthetically generated by a

03:04 deep learning algorithm and when we

03:06 introduced this class A few years ago

03:09 this video was created several years ago

03:11 right but even several years ago when we

03:14 introduced this and put it on YouTube it

03:16 went somewhat viral right people really

03:18 loved this video they were intrigued by

03:20 how real the video and audio felt and

03:23 looked uh entirely generated by an

03:26 algorithm by a computer and people were

03:28 shocked with the power and the realism

03:30 of these types of approaches and this

03:32 was a few years ago now fast forward to

03:35 today and look at the state of deep learning we

03:40 have seen deep learning accelerating at

03:43 a rate faster than we've ever seen

03:45 before in fact we can use deep learning

03:47 now to generate not just images of faces

03:51 but generate full synthetic environments

03:53 where we can train autonomous vehicles

03:55 entirely in simulation and deploy them

03:58 on full-scale vehicles in the real world

04:00 seamlessly the videos here you see are

04:02 actually from a data driven simulator

04:04 generated by neural networks called

04:06 Vista that we actually built here at MIT

04:09 and have open sourced to the public so

04:11 all of you can actually train and build

04:13 the future of autonomy and self-driving

04:15 cars and of course it goes far beyond

04:17 this as well deep learning can be used

04:19 to generate content directly from how we

04:23 speak and the language that we convey to

04:25 it from prompts that we say deep

04:27 learning can reason about the prompts in

04:29 natural language and English for example

04:31 and then guide and control what is

04:34 generated according to what we specify

04:38 we've seen examples of where we can

04:40 generate for example things that again

04:43 have never existed in reality we can ask

04:45 a neural network to generate a photo of

04:47 an astronaut riding a horse and it

04:50 actually can imagine hallucinate what

04:53 this might look like even though of

04:54 course this photo not only this photo

04:57 has never occurred before but I don't

04:58 think any photo of an astronaut riding a

05:00 horse has ever occurred before so

05:02 there's not really even training data

05:03 that you could go off of in this case and

05:05 my personal favorite is actually how we

05:07 can not only build software that can

05:09 generate images and videos but build

05:11 software that can generate software as

05:14 well we can also have algorithms that

05:16 can take language prompts for example a

05:19 prompt like this write code in

05:20 tensorflow to generate or to train a

05:23 neural network and not only will it

05:25 write the code and create that neural

05:27 network but it will have the ability to

05:30 reason about the code that it's

05:32 generated and walk you through step by

05:33 step explaining the process and

05:35 procedure all the way from the ground up

05:37 to you so that you can actually learn

05:38 how to do this process as well

05:42 I think some of these examples really

05:44 just highlight how far deep learning and

05:47 these methods have come in the past six

05:49 years since we started this course and

05:50 you saw that example just a few years

05:52 ago from that introductory video but now

05:55 we're seeing such incredible advances

05:56 and the most amazing part of this course

05:58 in my opinion is actually that within

06:01 this one week we're going to take you

06:03 through from the ground up starting from

06:04 today all of the foundational building

06:07 blocks that will allow you to understand

06:09 and make all of these amazing advances

06:13 so with that hopefully now you're all

06:15 super excited about what this class will

06:18 teach and I want to basically now just

06:20 start by taking a step back and

06:22 introducing some of these terminologies

06:24 that I've kind of been throwing around

06:26 so far deep learning artificial

06:28 intelligence what do these things mean

06:30 so first of all I want to maybe just

06:35 speak a little bit about intelligence

06:37 and what intelligence means at its core

06:39 so to me intelligence is simply the

06:42 ability to process information such that

06:45 we can use it to inform some future

06:46 decision or action that we take

06:49 now the field of artificial intelligence

06:52 is simply the ability for us to build

06:54 algorithms artificial algorithms that

06:56 can do exactly this process information

06:58 to inform some future decision

07:01 now machine learning is simply a subset

07:03 of AI which focuses specifically on how

07:06 we can build a machine to or teach a

07:10 machine how to do this from some

07:12 experiences or data for example now deep

07:15 learning goes One Step Beyond this and

07:17 is a subset of machine learning which

07:19 focuses explicitly on what are called

07:20 neural networks and how we can build

07:22 neural networks that can extract

07:24 features in the data these are basically

07:25 what you can think of as patterns that

07:27 occur within the data so that it can

07:29 learn to complete these tasks as well

07:32 now that's exactly what this class is

07:35 really all about at its core we're going

07:37 to try and teach you and give you the

07:38 foundational understanding and how we

07:40 can build and teach computers to learn

07:43 tasks many different type of tasks

07:45 directly from raw data and that's really

07:48 what this class boils down to at its

07:50 most simple form and we'll provide

07:52 a very solid foundation for you both on

07:55 the technical side through the lectures

07:57 which will happen in two parts

07:59 throughout the class the first lecture

08:01 and the second lecture each one about

08:02 one hour long followed by a software lab

08:05 which will immediately follow the

08:06 lectures which will try to reinforce a

08:08 lot of what we cover in

08:10 the technical part of the class and you

08:13 know give you hands-on experience

08:14 implementing those ideas

08:16 so this program is split between these

08:19 two pieces the technical lectures and

08:20 the software Labs we have several new

08:22 updates this year especially

08:24 in many of the later lectures the first

08:27 lecture will cover the foundations of

08:29 deep learning which is going to be right now

08:32 and finally we'll conclude the course

08:34 with some very exciting guest lecturers

08:36 from both Academia and Industry who are

08:38 really leading and driving forward the

08:41 state of AI and deep learning and of

08:43 course we have many awesome prizes that

08:46 go with all of the software labs and the

08:49 project competition at the end of the

08:51 course so maybe quickly to go through

08:53 these each day like I said we'll have

08:54 dedicated software Labs that couple with

08:58 starting today with lab one you'll

09:00 actually build a neural network keeping

09:02 with this theme of generative AI you'll

09:04 build a neural network that can

09:05 listen to a lot of music and actually

09:08 learn how to generate brand new songs in

09:10 that genre of music

09:12 then at the end of the

09:14 class on Friday we'll host a project

09:16 pitch competition where either you

09:18 individually or as part of a group can

09:21 participate and present an idea a novel

09:24 deep learning idea to all of us it'll be

09:27 roughly three minutes in length and we

09:31 will focus not as much because this is a

09:33 one week program we're not going to

09:35 focus so much on the results of your

09:37 pitch but rather The Innovation and the

09:39 idea and the novelty of what you're proposing

09:42 the prizes here are quite significant

09:44 already where first prize is going to

09:46 get an Nvidia GPU which is really a key

09:48 piece of Hardware that is instrumental

09:50 if you want to actually build a deep

09:52 learning project and train these neural

09:54 networks which can be very large and

09:55 require a lot of compute these prizes

09:57 will give you the compute to do so and

09:59 finally this year we'll be awarding a

10:01 grand prize for labs two and three

10:03 combined which will occur on Tuesday and

10:06 Wednesday focused on what I believe is

10:08 actually solving some of the most

10:09 exciting problems in this field of deep

10:11 learning and how specifically how we can

10:13 build models that can be robust not only

10:17 accurate but robust and trustworthy and

10:19 safe when they're deployed as well and

10:21 you'll actually get experience

10:23 developing those types of solutions that

10:25 can actually advance the state of the art

10:28 now all of these Labs that I mentioned

10:30 and competitions here are going to be

10:33 due on Thursday night at 11 PM right

10:36 before the last day of class and we'll

10:38 be helping you all along the way this

10:40 this Prize or this competition in

10:42 particular has very significant prizes

10:45 so I encourage all of you to really

10:46 enter this prize and try to get a

10:50 chance to win the prize

10:52 and of course like I said we're going to

10:54 be helping you all along the way there are

10:55 many available resources throughout this

10:57 class to help you achieve this please

11:00 post to Piazza if you have any questions

11:02 and of course this program has an

11:04 incredible team that you can reach out

11:06 to at any point in case you have any

11:08 issues or questions on the materials

11:10 myself and Ava will be your two main

11:13 lecturers for the first part of the class

11:14 we'll also be hearing like I said in the

11:17 later part of the class from some guest

11:19 lecturers who will share some really

11:21 cutting edge state-of-the-art

11:22 developments in deep learning and of

11:24 course I want to give a huge shout out

11:26 and thanks to all of our sponsors who

11:28 without their support this program

11:29 wouldn't have been possible

11:30 yet again for another year so thank you

11:34 okay so now with that let's really dive

11:37 into the really fun stuff of today's

11:39 lecture which is you know the

11:41 technical part and I think I want to

11:43 start this part by asking all of you

11:48 to ask yourselves this question

11:50 of you know why are all of you here

11:53 first of all why do you care about this

11:54 topic in the first place now

11:57 I think to answer this question we have

11:59 to take a step back and think about you

12:01 know the history of machine learning and

12:03 what machine learning is and what deep

12:05 learning brings to the table on top of

12:07 machine learning now traditional machine

12:09 learning algorithms typically Define

12:11 what are called a set of features in

12:14 the data you can think of these as

12:15 certain patterns in the data and then

12:17 usually these features are hand

12:19 engineered so probably a human will come

12:20 into the data set and with a lot of

12:22 domain knowledge and experience can try

12:25 to uncover what these features might be

12:27 now the key idea of deep learning and

12:29 this is really Central to this class is

12:31 that instead of having a human Define

12:33 these features what if we could have a

12:35 machine look at all of this data and

12:38 actually try to extract and uncover what

12:40 are the core patterns in the data so

12:42 that it can use those when it sees new

12:44 data to make some decisions so for

12:46 example if we wanted to detect faces in

12:48 an image a deep neural network algorithm

12:51 might actually learn that in order to

12:53 detect a face it first has to detect

12:55 things like edges in the image lines and

12:57 edges and when you combine those lines

12:59 and edges you can actually create

13:00 compositions of features like corners

13:03 and curves which when you create those

13:05 when you combine those you can create

13:07 more high level features for example

13:08 eyes and noses and ears and then those

13:11 are the features that allow you to

13:13 ultimately detect what you care about

13:15 detecting which is the face but all of

13:16 these come from what are called kind of

13:18 a hierarchical learning of features and

13:20 you can actually see some examples of

13:22 these these are real features learned by

13:24 a neural network and how they're

13:25 combined defines this progression of features

13:28 but in fact what I just described this

13:31 underlying and fundamental building

13:33 block of neural networks and deep

13:34 learning has actually existed for

13:37 decades now why are we studying all of

13:40 this now and today in this class with

13:42 all of this great enthusiasm to learn

13:44 this right well for one there have been

13:46 several key advances that have occurred

13:48 in the past decade number one is that

13:51 data is so much more pervasive than it

13:54 has ever been before in our lifetimes

13:56 these models are hungry for more data

13:59 and we're living in the age of Big Data

14:02 more data is available to these models

14:04 than ever before and they Thrive off of

14:07 that secondly these algorithms are

14:10 massively parallelizable they require a

14:12 lot of compute and we're also at a

14:14 unique time in history where we have the

14:17 ability to train these extremely

14:19 large-scale algorithms and techniques

14:21 that have existed for a very long time

14:23 but we can now train them due to the

14:24 hardware advances that have been made

14:26 and finally due to open source toolboxes

14:28 and software platforms like

14:30 tensorflow for example which all of you

14:32 will get a lot of experience on in this

14:34 class training and building the code for

14:37 these neural networks has never been

14:39 easier so that from the software point

14:40 of view as well there have been

14:42 incredible advances to open source you

14:44 know the underlying fundamentals of

14:46 what you're going to learn

14:48 so let me start now with just building

14:50 up from the ground up the fundamental

14:52 building block of every single neural

14:55 network that you're going to learn in

14:56 this class and that's going to be just a

14:58 single neuron right and in neural

15:01 network language a single neuron is

15:02 called a perceptron

15:05 so what is the perceptron a perceptron

15:07 is like I said a single neuron and it's

15:11 actually I'm going to say it's very very

15:13 simple idea so I want to make sure that

15:15 everyone in the audience understands

15:16 exactly what a perceptron is and how it works

15:19 so let's start by first defining a

15:21 perceptron as taking as input a set

15:24 of inputs right so on the left hand side

15:26 you can see this perceptron takes M

15:29 different inputs x1 to xm right these are

15:32 the blue circles we're denoting these

15:36 each of these numbers each of these

15:39 inputs is then multiplied by a

15:41 corresponding weight which we can call W

15:43 right so X1 will be multiplied by W1

15:46 and we'll add the result of all of these

15:49 multiplications together

15:51 now we take that single number after the

15:53 addition and we pass it through this

15:55 non-linear what we call a non-linear

15:57 activation function and that produces

15:59 our final output of the perceptron

16:05 now this is actually not an entirely accurate

16:07 picture of a perceptron there's

16:09 one step that I forgot to mention here

16:11 so in addition to multiplying all of

16:14 these inputs with their corresponding

16:15 weights we're also now going to add

16:17 what's called a bias term here denoted

16:19 as this w0 which is just a scalar weight

16:22 and you can think of it coming with a

16:24 input of just one so that's going to

16:26 allow the network to basically shift its

16:29 nonlinear activation function

16:32 as it sees its inputs

16:36 now on the right hand side you can see

16:39 this same diagram mathematically formulated as a

16:42 single equation we can now rewrite this

16:44 linear equation with linear

16:46 algebra in terms of vectors and dot

16:49 products right so for example we can

16:50 define our entire inputs x1 to xm as a

16:57 large vector x that large vector x can be

17:00 dot multiplied excuse me matrix

17:02 multiplied with our weights w this again

17:06 another vector of our weights w1 to wm

17:10 taking their dot product not only

17:13 multiplies them but it also adds the

17:15 resulting terms together adding a bias

17:17 like we said before and applying this non-linearity
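
A minimal sketch of the perceptron equation just described (dot product of inputs and weights, add the bias, pass through a nonlinearity); the function and variable names here are illustrative, not taken from the lecture slides:

```python
import tensorflow as tf

def perceptron_output(x, w, w0):
    # x: input vector of m numbers, w: weight vector of m numbers, w0: scalar bias
    z = w0 + tf.reduce_sum(x * w)      # weighted sum of the inputs plus the bias
    return tf.math.sigmoid(z)          # pass through the nonlinear activation g

# illustrative values only
y = perceptron_output(tf.constant([1.0, 2.0]),
                      tf.constant([0.5, -0.5]),
                      1.0)
```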

17:22 now you might be wondering what is this

17:24 non-linear function I've mentioned it a

17:26 few times already well I said it is a

17:28 function right that we

17:31 pass the outputs of the neural network

17:33 through before we return it you know to

17:35 the next neuron in the in the pipeline

17:38 right so one common example of a

17:40 nonlinear function that's very popular

17:41 in deep neural networks is called the

17:43 sigmoid function you can think of this

17:45 as kind of a continuous version of a

17:46 threshold function right it goes from

17:48 zero to one and it can take

17:51 as input any real number on the real number line

17:54 and you can see an example of it

17:56 illustrated on the bottom right hand side now

17:58 in fact there are many types of

18:00 nonlinear activation functions that are

18:02 popular in deep neural networks and here

18:03 are some common ones and throughout this

18:05 presentation you'll actually see some

18:07 examples of these code snippets on the

18:09 bottom of the slides where we'll try and

18:11 actually tie in some of what you're

18:13 learning in the lectures to actual

18:15 software and how you can Implement these

18:16 pieces which will help you a lot for

18:18 your software Labs explicitly so the

18:20 sigmoid activation on the left is very

18:22 popular since it's a function that

18:24 outputs you know between zero and one so

18:26 especially when you want to deal with

18:27 probability distributions for example

18:29 this is very important because

18:30 probabilities live between 0 and 1.

18:33 in modern deep neural networks though

18:35 the relu function which you can see on

18:37 the far right hand side is a very popular

18:38 activation function because it's

18:40 piecewise linear it's extremely

18:41 efficient to compute especially when

18:43 Computing its derivatives right its

18:45 derivatives are constants except for one

18:47 nonlinearity at zero
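
For reference, the activation functions mentioned here are available directly in TensorFlow; a small sketch using the standard function names:

```python
import tensorflow as tf

z = tf.constant([-2.0, 0.0, 2.0])

tf.math.sigmoid(z)   # squashes any real number into (0, 1); useful for probabilities
tf.math.tanh(z)      # squashes values into (-1, 1)
tf.nn.relu(z)        # piecewise linear max(0, z), cheap to compute and differentiate
```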

18:51 now I hope actually all of you are

18:53 probably asking this question to

18:54 yourself of why do we even need this

18:56 nonlinear activation function it seems

18:57 like it kind of just complicates this

18:59 whole picture when we didn't really need

19:00 it in the first place and I want to just

19:03 spend a moment on answering this because

19:05 the point of a nonlinear activation

19:07 function is of course number one is to

19:09 introduce non-linearities to our data

19:12 right if we think about our data almost

19:15 all data that we care about all real

19:17 world data is highly non-linear now this

19:19 is important because if we want to be

19:21 able to deal with those types of data

19:23 sets we need models that are also

19:24 nonlinear so they can capture those same

19:26 types of patterns so imagine I told you

19:28 to separate for example this

19:29 data set of red points from green points and

19:31 I ask you to try and separate those two

19:33 types of data points now you might think

19:36 that this is easy but what if I

19:37 told you that you could only

19:39 use a single line to do so well now it

19:42 becomes a very complicated problem in

19:43 fact you can't really solve it

19:44 effectively with a single line

19:47 and in fact if you introduce nonlinear

19:50 activation functions to your Solution

19:52 that's exactly what allows you to you

19:54 know deal with these types of problems

19:56 nonlinear activation functions allow you

19:58 to deal with non-linear types of data

20:01 now and that's what exactly makes neural

20:03 networks so powerful at their core

20:06 so let's understand this maybe with a

20:08 very simple example walking through this

20:10 diagram of a perceptron one more time

20:12 imagine I give you this trained neural

20:14 network with weights now not W1 W2 I'm

20:17 going to actually give you numbers at

20:19 these locations right so the trained

20:21 w0 will be 1 and W will be a vector of 3

20:27 so this neural network has two inputs

20:29 like we said before it has input X1 it

20:31 has input X2 if we want to get the

20:33 output of it this is also the main thing

20:36 I want all of you to take away from this

20:37 lecture today is that to get the output

20:39 of a perceptron there are three steps we

20:41 need to take right from this stage we

20:43 first compute the multiplication of our

20:45 inputs with our weights

20:48 sorry yeah multiply them together add

20:52 their result and compute a non-linearity

20:54 it's these three steps that Define the

20:57 forward propagation of information

20:58 through a perceptron

21:00 so let's take a look at how that exactly

21:03 works right so if we plug in these

21:05 numbers into those equations we can

21:08 see that everything inside of our

21:10 non-linearity here the nonlinearity is G

21:12 right that function G which could be a

21:15 sigmoid like we saw on a previous slide

21:17 that component inside of our

21:20 nonlinearity is in fact just a

21:22 two-dimensional line it has two inputs

21:24 and if we consider the space of all of

21:27 the possible inputs that this neural

21:29 network could see we can actually plot

21:31 this on a decision boundary right we can

21:33 plot this two-dimensional line as a

21:37 decision boundary as a plane separating

21:40 these two components of our space

21:42 in fact not only is it a single plane

21:45 there's a directionality component

21:46 depending on which side of the plane

21:48 that we live on if we see an input for

21:50 example here negative one two we

21:53 actually know that it lives on one side

21:55 of the plane and it will have a certain

21:57 type of output in this case that output

21:59 is going to be positive right because in

22:02 this case when we plug those components

22:04 into our equation we'll get a positive

22:06 number that passes through the

22:09 nonlinear component and that gets

22:11 propagated through as well of course if

22:13 you're on the other side of the space

22:14 you're going to have the opposite result

22:17 right and that thresholding function is

22:19 going to essentially live at this

22:21 decision boundary so depending on which

22:22 side of the space you live on that

22:24 thresholding function that sigmoid

22:26 function is going to then control how

22:29 you move to one side or the other

22:32 now in this particular example this is

22:35 very convenient right because we can

22:36 actually visualize and I can draw this

22:38 exact full space for you on this slide

22:41 it's only a two-dimensional space so

22:42 it's very easy for us to visualize but

22:44 of course for almost all problems that

22:47 we care about our data points are not

22:49 going to be two-dimensional right if you

22:51 think about an image the dimensionality

22:53 of an image is going to be the number of

22:54 pixels that you have in the image right

22:56 so these are going to be thousands of

22:58 Dimensions millions of Dimensions or

23:00 even more and then drawing these types

23:03 of plots like you see here is simply not

23:05 feasible right so we can't always do

23:07 this but hopefully this gives you some

23:08 intuition to understand kind of as we

23:11 build up into more complex models

23:14 so now that we have an idea of the

23:16 perceptron let's see how we can actually

23:18 take this single neuron and start to

23:20 build it up into something more

23:21 complicated a full neural network and

23:23 build a model from that

23:24 so let's revisit again this previous

23:27 diagram of the perceptron if again just

23:30 to reiterate one more time this core

23:32 piece of information that I want all of

23:34 you to take away from this class is how

23:36 a perceptron works and how it propagates

23:38 information to its decision there are

23:41 three steps first is the dot product

23:43 second is the bias and third is the

23:45 non-linearity and you keep repeating

23:47 this process for every single perceptron

23:49 in your neural network

23:51 let's simplify the diagram a little bit

23:53 I'll get rid of the weights and you can

23:56 assume that every line here now

23:57 basically has an associated weight

23:59 scalar every

24:01 line corresponds to the

24:03 input that's coming in and it carries a

24:05 weight

24:07 on the line itself and I've also removed

24:10 the bias just for a sake of Simplicity

24:12 but it's still there

24:14 so now the result is z let's

24:18 call that the result of our dot product

24:20 plus the bias and that's what

24:22 we pass into our non-linear function

24:24 that piece is going to be passed through

24:27 that activation function now the final

24:29 output here is simply going to be G

24:33 which is our activation function of Z

24:36 right Z is going to be basically what

24:38 you can think of as the state of this

24:39 neuron it's the result of that dot product

24:43 now if we want to Define and build up a

24:46 multi-layered output neural network if

24:49 we want two outputs to this function for

24:51 example it's a very simple procedure we

24:52 just have now two neurons two

24:54 perceptrons each perceptron will control

24:56 the output for its Associated piece

25:00 so now we have two outputs each one is a

25:02 normal perceptron it takes all of the

25:03 inputs so they both take the same inputs

25:06 but amazingly now with this mathematical

25:08 understanding we can start to build our

25:11 first neural network entirely from

25:13 scratch so what does that look like so

25:15 we can start by firstly initializing

25:17 these two components the first component

25:19 that we saw was the weight Matrix excuse

25:22 me the weight Vector it's a vector of

25:24 Weights in this case

25:26 and the second component is the bias

25:29 vector that we're going to add to

25:31 the dot product of all of our inputs with

25:33 our weights right so

25:36 the only remaining step now after we've

25:38 defined these parameters of our layer is

25:40 to now define you know how forward

25:43 propagation of information works and

25:45 that's exactly those three main

25:46 components that I've been stressing to you

25:49 so we can create this call function to

25:51 do exactly that to Define this forward

25:53 propagation of information

25:55 and the story here is exactly the same

25:57 as we've been seeing it right Matrix

25:58 multiply our inputs with our weights

26:04 and then apply a non-linearity and

26:06 return the result and that literally

26:09 this code will run this will Define a

26:11 full net a full neural network layer

26:13 that you can then take like this
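
As a rough sketch of the kind of layer just described (initialize a weight matrix and a bias, then a call function that matrix-multiplies the inputs with the weights, adds the bias, and applies a nonlinearity); this is an illustration, not the exact code from the slide:

```python
import tensorflow as tf

class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, input_dim, output_dim):
        super(MyDenseLayer, self).__init__()
        # initialize the weights and the bias of this layer
        self.W = self.add_weight(shape=(input_dim, output_dim), initializer="random_normal")
        self.b = self.add_weight(shape=(1, output_dim), initializer="zeros")

    def call(self, inputs):
        # forward propagation: matrix multiply, add bias, apply the nonlinearity
        z = tf.matmul(inputs, self.W) + self.b
        return tf.math.sigmoid(z)
```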

26:16 and of course actually luckily for all

26:18 of you all of that code which wasn't

26:20 much code that's been abstracted away by

26:22 these libraries like tensorflow you can

26:24 simply call functions like this which

26:26 will actually you know replicate exactly that behavior

26:30 so you don't need to necessarily copy

26:32 all of that code down you can just call it
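
In other words, something along these lines, using the standard Keras Dense layer, gives you the same fully connected layer without writing it yourself:

```python
import tensorflow as tf

# a fully connected ("dense") layer with 3 output units and a sigmoid nonlinearity
layer = tf.keras.layers.Dense(units=3, activation="sigmoid")
```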

26:36 and with that understanding you know we

26:39 just saw how you could build a single

26:40 layer but of course now you can actually

26:43 start to think about how you can stack

26:45 these layers as well so since we now

26:48 have this transformation essentially

26:50 from our inputs to a hidden output you

26:53 can think of this as basically how we

26:56 Define some way of transforming those

27:00 inputs right into some new dimensional

27:03 space right perhaps closer to the value

27:06 that we want to predict and that

27:07 transformation is going to be eventually

27:09 learned to know how to transform those

27:11 inputs into our desired outputs and

27:13 we'll get to that later but for now the

27:16 piece that I want to really focus on is

27:17 if we have these more complex neural

27:19 networks I want to really distill down

27:21 that this is nothing more complex than

27:23 what we've already seen if we focus on

27:24 just one neuron in this diagram

27:28 take this one here for example z2 right z2 is

27:31 this neuron that's highlighted in the diagram

27:34 it's just the same perceptron that we've

27:36 been seeing so far in this class

27:39 its output is obtained by taking a dot

27:41 product adding a bias and then applying

27:43 that non-linearity to all of its inputs

27:46 if we look at a different node for

27:47 example Z3 which is the one right below

27:49 it it's the exact same story again it

27:51 sees all of the same inputs but it has a

27:53 different set of weights that it's

27:55 going to apply to those inputs so we'll

27:57 have a different output but the

27:58 mathematical equations are exactly the same

28:01 so from now on I'm just going to kind of

28:03 simplify all of these lines and diagrams

28:05 just to show these icons in the middle

28:07 just to demonstrate that this means

28:09 everything is going to be fully connected

28:10 to everything and defined by those

28:12 mathematical equations that we've been

28:14 covering but there's no extra complexity

28:16 in these models from what you've already seen

28:19 now if you want to Stack these types of

28:21 Solutions on top of each other these

28:24 layers on top of each other you can not

28:26 only Define one layer very easily but

28:28 you can actually create what are called

28:29 sequential models these sequential

28:31 models you can Define one layer after

28:33 another and they define basically the

28:36 forward propagation of information not

28:37 just from the neuron level but now from

28:40 the layer level every layer will be

28:42 fully connected to the next layer and

28:44 the inputs of the second layer will

28:46 be all of the outputs of the prior layer
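
A minimal sketch of such a sequential model in TensorFlow; the layer sizes here are arbitrary placeholders, not values from the lecture:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=32, activation="relu"),   # hidden layer
    tf.keras.layers.Dense(units=32, activation="relu"),   # another hidden layer
    tf.keras.layers.Dense(units=2),                        # output layer with two outputs
])
```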

28:50 now of course if you want to create a

28:51 very deep neural network all a deep

28:53 neural network is is that we just keep

28:55 stacking these layers on top of each

28:56 other there's nothing else to this story

28:58 it's really as simple as that so

29:02 these layers are basically all they are

29:03 is just layers where the final output is

29:06 computed by going deeper and deeper into

29:09 this progression of different layers

29:10 right and you just keep stacking them

29:12 until you get to the last layer which is

29:14 your output layer it's your final

29:15 prediction that you want to Output

29:18 right we can create a deep neural

29:19 network to do all of this by stacking

29:21 these layers and creating these more

29:23 hierarchical models like we saw very

29:25 early in the beginning of today's

29:27 lecture one where the final output is

29:29 really computed by you know just going

29:31 deeper and deeper into this system

29:34 okay so that's awesome so we've now seen

29:37 how we can go from a single neuron to a

29:40 layer to all the way to a deep neural

29:42 network right building off of these

29:44 foundational principles

29:46 let's take a look at how exactly we can

29:48 use these uh you know principles that

29:52 we've just discussed to solve a very

29:54 real problem that I think all of you are

29:56 probably very concerned about uh this

29:59 morning when you when you woke up so

30:01 that problem is how we can build a

30:04 neural network to answer this question

30:05 which is will I pass this

30:07 class or will I not

30:10 so to answer this question let's see if

30:12 we can train a neural network to solve this problem

30:16 so to do this let's start with a very

30:18 simple neural network right we'll train

30:20 this model with two inputs just two

30:22 inputs one input is going to be the

30:23 number of lectures that you attend over

30:25 the course of this one week and the

30:27 second input is going to be how many

30:30 hours that you spend on your final

30:31 project or your competition

30:34 okay so what we're going to do is

30:36 firstly go out and collect a lot of data

30:38 from all of the past years that we've

30:39 taught this course and we can plot all

30:41 of this data because it's only a two-input

30:43 space we can plot this data on a

30:44 two-dimensional feature space right we

30:47 can actually look at all of the students

30:49 before you that have passed the class

30:51 and failed the class and see where they

30:53 lived in this space for the amount of

30:55 hours that they've spent the number of

30:56 lectures that they've attended and so on

30:58 green points are the people who have

31:00 passed red are those who have failed now

31:03 and here's you right you're right here

31:05 four five is your coordinate in this space

31:07 you fall right there and you've attended

31:10 four lectures you've spent five hours on

31:11 your final project we want to build a

31:13 neural network to answer the question of

31:15 will you pass the class

31:18 so let's do it we have two inputs one is

31:20 four one is five these are two numbers

31:22 we can feed them through a neural

31:24 network like the one we've just seen how to build

31:27 and we feed that into a single hidden layer with

31:30 three hidden units in this example but

31:32 we could make it larger if we wanted to

31:33 be more expressive and more powerful
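
As a hedged sketch (not the slide's exact code), a two-input model with three hidden units and a single probability output might be written like this; the single sigmoid output for the pass probability is an assumption:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),                       # two inputs: lectures attended, hours spent
    tf.keras.layers.Dense(3, activation="relu"),      # hidden layer with 3 units
    tf.keras.layers.Dense(1, activation="sigmoid"),   # output: probability of passing the class
])

# feed in the example input: 4 lectures attended, 5 hours on the final project
prediction = model(tf.constant([[4.0, 5.0]]))
```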

31:36 and we see here that the probability of

31:37 you passing this class is 0.1

31:40 that's pretty low so why would this be

31:42 the case right what did we do wrong

31:44 because I don't think it's correct right

31:46 when we looked at the space it looked

31:49 like actually you were a good candidate

31:50 to pass the class but why is the neural

31:52 network saying that there's only a 10 percent

31:54 likelihood that you should pass does

31:56 anyone have any ideas

32:05 this neural network is just uh like it

32:08 was just born right it has no

32:09 information about the the world or this

32:12 class it doesn't know what four and five

32:14 mean or what the notion of passing or

32:16 failing means right so

32:19 exactly right this neural network has

32:21 not been trained you can think of it

32:22 kind of as a baby it hasn't learned

32:25 anything yet so our job firstly is to

32:28 train it and part of that understanding

32:29 is we first need to tell the neural

32:31 network when it makes mistakes right so

32:33 mathematically we should now think about

32:35 how we can answer this question which is

32:37 does did my neural network make a

32:39 mistake and if it made a mistake how can

32:42 I tell it how big of a mistake it was so

32:43 that the next time it sees this data

32:46 point it can do better and minimize that mistake

32:49 so in neural network language those

32:52 mistakes are called losses right and

32:54 specifically you want to Define what's

32:56 called a loss function which is going to

32:58 take as input your prediction

33:01 and the true prediction right and how

33:03 far away your prediction is from the

33:05 true prediction tells you how big of a

33:07 loss there is right so for example

33:11 let's say we want to build a neural

33:14 network to do classification of

33:16 or sorry actually even before that I

33:19 want to maybe give you some terminology

33:20 so there are multiple different ways of

33:23 saying the same thing in neural networks

33:25 and deep learning so what I just

33:27 described as a loss function is also

33:29 commonly referred to as an objective

33:30 function empirical risk a cost function

33:33 these are all exactly the same thing

33:35 they're all a way for us to train the

33:37 neural network to teach the neural

33:38 network when it makes mistakes

33:40 and what we really ultimately want to do

33:43 is over the course of an entire data set

33:45 not just one data point

33:48 but over the entire data set we

33:50 want to minimize all of the mistakes on

33:53 average that this neural network makes

33:56 so if we look at the problem like I said

33:58 of binary classification will I pass

34:00 this class or will I not there's a yes

34:02 or no answer that means binary classification

34:05 now we can use what's called a loss

34:08 function called the softmax cross entropy

34:10 loss and for those of you who aren't

34:12 familiar this notion of cross entropy was

34:14 actually developed here at MIT by

34:21 Claude Shannon who was a visionary he did

34:25 his Masters here over 50 years ago he

34:28 introduced this notion of cross-entropy

34:30 and that was you know pivotal in in the

34:33 ability for us to train these types of

34:35 neural networks even now into the future
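
The lecture names the softmax cross entropy loss; as a sketch of the same idea for the binary pass/fail case, a closely related binary cross entropy call in TensorFlow looks like this (the numbers are illustrative):

```python
import tensorflow as tf

y_true = tf.constant([[1.0]])   # the true label: 1 means the student passed
y_pred = tf.constant([[0.1]])   # the network's predicted probability of passing

# cross entropy between the true label and the predicted probability
loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
```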

34:38 so let's start by instead of predicting

34:42 a binary output what if we

34:44 wanted to predict a final grade of your

34:47 class score for example that's no longer

34:49 a binary output yes or no it's actually

34:52 a continuous variable right it's the

34:53 grade let's say out of 100 points what

34:56 is the value of your score in the class

34:58 project right for this type of loss we

35:01 can use what's called a mean squared

35:02 error loss you can think of this

35:03 literally as just subtracting your

35:05 predicted grade from the true grade and

35:08 minimizing that distance apart
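
A sketch of the mean squared error idea for the continuous-grade case: subtract the predicted grade from the true grade, square it, and average; the grade values below are made up for illustration:

```python
import tensorflow as tf

y_true = tf.constant([85.0, 92.0])   # true grades (illustrative numbers)
y_pred = tf.constant([80.0, 95.0])   # predicted grades

# mean squared error: average of the squared differences
loss = tf.reduce_mean(tf.square(y_true - y_pred))
```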

35:11 so I think now we're ready to

35:14 really put all of this information

35:15 together and Tackle this problem of

35:18 training a neural network right to not only determine

35:23 how erroneous it is how large its loss

35:26 is but more importantly minimize that

35:28 loss as a function of seeing all of this

35:30 training data that it observes

35:33 so we know that we want to find this

35:35 neural network like we mentioned before

35:37 that minimizes this empirical risk or

35:40 this empirical loss averaged across our

35:43 entire data set now this means that we

35:45 want to find mathematically these W's

35:48 right that minimize J of W J of W is our

35:52 loss function averaged over our entire

35:54 data set and W are our weights so we want

35:56 to find the set of Weights that on

35:59 average is going to give us the minimum

36:01 the smallest loss as possible
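
In symbols, the objective described here is to find the weights that minimize the loss averaged over the n training examples (generic notation, not copied from the slide):

```latex
W^{*} = \arg\min_{W} J(W)
      = \arg\min_{W} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big(f(x^{(i)}; W),\, y^{(i)}\big)
```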

36:05 now remember that W here is just a list

36:08 basically it's just a group of all of

36:10 the weights in our neural network you

36:11 may have hundreds of weights in a very

36:14 very small neural network or in today's

36:16 neural networks you may have billions or

36:18 trillions of weights and you want to

36:19 find what is the value of every single

36:21 one of these weights that's going to

36:23 result in the smallest loss as possible

36:26 now how can you do this remember that

36:28 our loss function J of w is just a

36:31 function of our weights right so for any

36:34 instantiation of our weights we can

36:36 compute a scalar value of you know

36:39 how erroneous our neural network would

36:41 be for this instantiation of our weights

36:44 so let's try and visualize for example

36:47 in a very simple example of a

36:49 two-dimensional space where we have only

36:51 two weights extremely simple neural

36:54 network here very small two weight

36:56 neural network and we want to find what

36:58 are the optimal weights that would train

37:00 this neural network we can plot

37:04 how erroneous the neural network is for

37:07 every single instantiation of these two

37:10 weights right this is a huge space it's

37:12 an infinite space but still we

37:14 can have a function that evaluates

37:17 at every point in this space

37:19 now what we ultimately want to do is

37:21 again we want to find which set of W's

37:25 will give us the smallest

37:28 loss possible that means basically the

37:30 lowest point on this landscape that you

37:32 can see here where are the W's that bring

37:35 us to that lowest point

37:38 the way that we do this is actually just

37:41 by firstly starting at a random place we

37:43 have no idea where to start so pick a

37:45 random place to start in this space and

37:47 let's start there at this location let's

37:50 evaluate our neural network we can

37:51 compute the loss at this specific

37:53 location and on top of that we can see

37:57 how the loss is changing we can compute

37:59 the gradient of the loss because our

38:01 loss function is a continuous function

38:04 right so we can actually compute

38:05 derivatives of our function across the

38:08 space of our weights and the gradient

38:10 tells us the direction of the highest

38:14 point right so from where we stand the

38:16 gradient tells us where we should go to increase our loss

38:20 now of course we don't want to increase

38:21 our loss we want to decrease our loss so

38:23 we negate our gradient and we take a

38:25 step in the opposite direction of the

38:27 gradient that brings us one step closer

38:29 to the bottom of the landscape and we

38:32 just keep repeating this process right

38:34 over and over again we evaluate the

38:36 neural network at this new location

38:38 compute its gradient and step in that

38:40 new direction we keep traversing this

38:43 landscape until we converge to the minimum

38:46 we can really summarize this algorithm

38:49 which is known formally as gradient

38:51 descent right so gradient descent simply

38:53 can be written like this we initialize

38:56 all of our weights right this can be two

38:58 weights like you saw in the previous

38:59 example it can be billions of Weights

39:01 like in real neural networks we compute

39:05 this gradient the partial derivative

39:08 of our loss with respect to the

39:10 weights and then we can update our

39:11 weights in the opposite direction of that gradient

39:15 so essentially we just take this small

39:18 amount a small step you can think of it

39:20 which here is denoted as eta and we

39:24 refer to this small step right this is

39:27 commonly referred to as what's known as

39:29 the learning rate it's like how much we

39:30 want to trust that gradient and step in

39:33 the direction of that gradient we'll

39:34 talk more about this later

39:36 but just to give you some sense of code

39:39 this algorithm is very well

39:41 translatable to real code as well for

39:43 every line on the pseudocode you can see

39:44 on the left you can see corresponding

39:46 real code on the right that is runnable

39:48 and directly implementable by all of you
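
A minimal sketch of that gradient descent loop using TensorFlow's automatic differentiation; the placeholder loss, the variable shapes, and the fixed learning rate are illustrative, not the slide's exact code:

```python
import tensorflow as tf

weights = tf.Variable(tf.random.normal(shape=(2, 1)))   # initialize weights randomly
lr = 0.01                                                # learning rate (eta)

def compute_loss(w):
    # placeholder loss for illustration; a real model would compare predictions to labels
    return tf.reduce_sum(tf.square(w))

for step in range(1000):
    with tf.GradientTape() as tape:
        loss = compute_loss(weights)
    grad = tape.gradient(loss, weights)   # dJ/dW, computed by backpropagation
    weights.assign_sub(lr * grad)         # step in the opposite direction of the gradient
```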

39:51 but now let's take a look specifically

39:53 at this term here this is the gradient

39:55 we touched very briefly on this in the

39:57 visual example this explains like I said

40:00 how the loss is changing as a function

40:02 of the weights right so as the weights

40:04 move around will my loss increase or

40:06 decrease and that will tell the neural

40:08 network if it needs to move the weights

40:10 in a certain direction or not but I

40:13 never actually told you how to compute

40:14 this right and I think that's an

40:16 extremely important part because if you

40:17 don't know that then you can't uh well

40:20 you can't train your neural network

40:21 right this is a critical part of

40:23 training neural networks and that

40:25 process of computing this line This

40:27 gradient line is known as back

40:29 propagation so let's do a very quick

40:32 intro to back propagation and how it works

40:36 so again let's start with the simplest

40:38 neural network in existence this neural

40:40 network has one input one output and

40:42 only one neuron right this is as simple

40:44 as it gets we want to compute the

40:46 gradient of our loss with respect to our

40:49 weight in this case let's compute it

40:51 with respect to W2 the second weight

40:54 so this derivative is going to tell us

40:56 how much a small change in this weight

40:59 will affect our loss if we make a small

41:02 change if we change our weight a little

41:03 bit in one direction will it increase our

41:05 loss or decrease our loss

41:07 so to compute that we can write out this

41:09 derivative we can start with applying

41:11 the chain rule backwards from the loss

41:14 function through the output specifically

41:17 what we can do is we can actually just

41:18 decompose this derivative into two

41:21 components the first component is the

41:23 derivative of our loss with respect to

41:25 our output multiplied by the derivative

41:27 of our output with respect to W2 right

41:30 this is just a standard

41:33 instantiation of the chain rule with

41:36 this original derivative that we had now

41:39 let's suppose we wanted to compute the

41:41 gradients of the weight before that

41:42 which in this case is

41:45 not W2 but W1

41:48 well all we do is replace W2 with W1 and

41:51 that chain Rule still holds right that

41:53 same equation holds but now you can see

41:56 on the red component that last component

41:58 of the chain rule we have to once again

42:00 recursively apply one more chain rule

42:02 because that's again another derivative

42:05 that we can't directly evaluate we can

42:06 expand that once more with another

42:08 instantiation of the chain Rule and now

42:10 all of these components we can directly

42:13 propagate these gradients through the

42:15 hidden units right in our neural network

42:17 all the way back to the weight that

42:19 we're interested in in this example

42:21 right so we first computed the

42:22 derivative with respect to W2 then we

42:25 can back propagate that and use that

42:26 information also with W1 that's why we

42:28 really call it back propagation because

42:30 this process occurs from the output all

42:32 the way back to the input
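
Written out, the chain-rule decomposition being described is the following, with J the loss, y-hat the network output, and z1 the hidden unit; the notation is generic:

```latex
\frac{\partial J(W)}{\partial w_2} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2},
\qquad
\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}
```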

42:34 now we repeat this process essentially

42:36 many many times over the course of

42:39 training by propagating these gradients

42:41 over and over again through the network

42:43 all the way from the output to the

42:45 inputs to determine for every single

42:46 weight answering this question which is

42:49 how much does a small change in these

42:51 weights affect our loss function if it

42:54 increases it or decreases and how we can

42:55 use that to improve the loss ultimately

42:58 because that's our final goal in this class

43:01 so that's the back propagation

43:04 algorithm that's the core of

43:06 training neural networks in theory it's

43:09 very simple it's really just an

43:11 instantiation of the chain rule

43:14 but let's touch on some insights that

43:16 make training neural networks actually

43:18 extremely complicated in practice even

43:20 though the algorithm of back propagation

43:22 is simple and you know many decades old

43:26 in practice though optimization of

43:29 neural networks looks something like

43:31 this it looks nothing like that picture

43:32 that I showed you before there are ways

43:34 that we can visualize very large deep

43:36 neural networks and you can think of the

43:39 landscape of these models looking like

43:40 something like this this is an

43:42 illustration from a paper that came out

43:43 several years ago where they tried to

43:45 actually visualize the landscape of very

43:47 deep neural networks and that's

43:49 what this landscape actually looks like

43:51 that's what you're trying to deal with

43:52 and find the minimum in this space and

43:53 you can imagine the challenges that come

43:56 so to cover the challenges let's first

43:59 think of and recall that update equation

44:02 defined in gradient descent right so I

44:05 didn't talk too much about this

44:07 parameter eta but now let's spend a bit

44:09 of time thinking about this this is

44:10 called The Learning rate like we saw

44:12 before it determines basically how big

44:15 of a step we need to take in the

44:17 direction of our gradient on every

44:18 single iteration of back propagation

44:21 in practice even setting the learning

44:24 rate can be very challenging you as you

44:25 as the designer of the neural network

44:27 have to set this value this learning

44:29 rate and how do you pick this value

44:31 right so that can actually be quite

44:32 difficult it has really uh large

44:34 consequences when building a neural

44:36 network so for example

44:38 if we set the learning rate too low then

44:42 we learn very slowly so let's assume we

44:44 start on the right hand side here at

44:46 that initial guess if our learning rate

44:48 is not large enough not only do we

44:50 converge slowly we actually don't even

44:52 converge to the global minimum right

44:54 because we kind of get stuck in a local minimum

44:57 now what if we set our learning rate too

44:59 high right what can actually happen is

45:01 we overshoot and we can actually start

45:03 to diverge from the solution the

45:05 gradients can actually explode very bad

45:07 things happen and then the neural

45:09 network doesn't train so that's also not

45:11 good in reality there's a very happy

45:14 medium between setting it too small

45:16 setting it too large where you set it

45:18 just large enough to kind of overshoot

45:20 some of these local minima and

45:22 put you into a reasonable part of the

45:24 search space where then you can actually

45:26 converge on the solutions that you care about

45:29 but actually how do you set these

45:31 learning rates in practice right how do

45:32 you pick what is the ideal learning rate

45:34 one option and this is actually a very

45:37 common option in practice is to simply

45:39 try out a bunch of learning rates and

45:41 see what works the best right so try out

45:43 let's say a whole grid of different

45:44 learning rates and you know train all of

45:47 these neural networks and see which one works best

45:49 but I think we can do something a lot

45:51 smarter right so what are some more

45:53 intelligent ways that we could do this

45:55 instead of exhaustively trying out a

45:56 whole bunch of different learning rates

45:58 can we design a learning rate algorithm

46:00 that actually adapts to our neural

46:03 network and adapts to its landscape so

46:05 that it's a bit more intelligent than that

46:09 so this really ultimately means that

46:12 the learning rate the speed at which the

46:15 algorithm is trusting the gradients that

46:17 it sees is going to depend on how large

46:20 the gradient is in that location and how

46:23 fast we're learning and many other

46:25 options

46:28 that we might have as part of

46:31 training neural networks right so

46:32 it's not only how quickly we're learning

46:34 you may judge it on many different

46:35 factors in the learning landscape

46:39 in fact these different

46:42 algorithms that I'm talking about these

46:44 adaptive learning rate algorithms have

46:46 been very widely studied in practice

46:48 there is a very thriving community in

46:51 the Deep learning research community

46:52 that focuses on developing and designing

46:55 new algorithms for learning rate

46:58 adaptation and faster optimization of

47:00 large neural networks like these

47:02 and during your Labs you'll actually get

47:04 the opportunity to not only try out a

47:07 lot of these different adaptive

47:09 algorithms which you can see here but

47:11 also try to uncover what are kind of the

47:12 patterns and benefits of One Versus the

47:15 other and that's going to be something

47:16 that I think you'll you'll find very

47:18 insightful as part of your labs
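
For reference, several of these adaptive optimizers are available off the shelf in TensorFlow; a sketch of how you would pick one (the learning rates shown are arbitrary):

```python
import tensorflow as tf

# plain (stochastic) gradient descent with a fixed learning rate
sgd = tf.keras.optimizers.SGD(learning_rate=0.01)

# adaptive learning rate algorithms
adam    = tf.keras.optimizers.Adam(learning_rate=0.001)
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)
```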

47:21 so another key component of your Labs

47:23 that you'll see is how you can actually

47:24 put all of this information that we've

47:26 covered today into a single picture that

47:29 looks roughly something like this which

47:30 defines your model first at the

47:33 top here that's where you define your

47:34 model we talked about this in the

47:36 beginning part of the lecture

47:38 for every piece in your model you're now

47:40 going to need to Define this Optimizer

47:42 which we've just talked about this

47:44 Optimizer is defined together with a

47:46 learning rate right how quickly you want

47:48 to optimize your loss landscape and over

47:50 many Loops you're going to pass over all

47:52 of the examples in your data set

47:54 and observe essentially how to improve

47:57 your network that's the gradient and

47:59 then actually improve the network in

48:00 those directions and keep doing that

48:01 over and over and over again until

48:03 eventually your neural network converges

48:05 to some sort of solution
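
A hedged sketch of that overall picture, combining a model, an optimizer with a learning rate, and a loop over the data; the dataset, model sizes, and loss here are placeholders rather than the lab's actual code:

```python
import tensorflow as tf

# placeholder data: 100 examples with 2 features each, and a scalar target
xs = tf.random.normal((100, 2))
ys = tf.random.normal((100, 1))
dataset = tf.data.Dataset.from_tensor_slices((xs, ys)).batch(16)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

for epoch in range(5):                               # many loops over the data set
    for x_batch, y_batch in dataset:
        with tf.GradientTape() as tape:
            loss = loss_fn(y_batch, model(x_batch))  # how erroneous the network currently is
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))  # improve the network
```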

48:09 so I want to very quickly briefly in the

48:12 remaining time that we have continue to

48:14 talk about tips for training these

48:16 neural networks in practice and focus on

48:19 this very powerful idea of batching your

48:22 data into well what are called mini

48:24 batches of smaller pieces of data

48:28 to do this let's revisit that gradient

48:30 descent algorithm right so here this

48:32 gradient that we talked about before is

48:35 actually extraordinarily computationally

48:37 expensive to compute because it's

48:39 computed as a summation across all of

48:42 the pieces in your data set

48:44 right and in most real life or real

48:47 world problems you know it's simply not

48:49 feasible to compute a gradient over your

48:51 entire data set data sets are just too

48:53 large these days so you know there

48:57 are some Alternatives right what are the

48:58 Alternatives instead of computing the

49:00 derivative or the gradients across your

49:03 entire data set what if you instead

49:06 computed the gradient over just a single

49:09 example in your data set just one

49:10 example well of course this this

49:13 estimate of your gradient is going to be

49:15 exactly that it's an estimate it's going

49:17 to be very noisy it may roughly reflect

49:19 the trends of your entire data set but

49:21 because it's only one

49:23 example from your entire data set

49:25 it may be very noisy right

49:29 well the advantage of this though is

49:31 that it's much faster to compute

49:33 obviously the gradient over a single

49:36 example because it's one example so

49:38 computationally this has huge advantages

49:40 but the downside is that it's extremely

49:42 stochastic right that's the reason why

49:44 this algorithm is not called gradient

49:46 descent it's called stochastic gradient descent

49:49 now what's the middle ground right

49:51 instead of computing it with respect to

49:52 one example in your data set what if we

49:54 computed what's called a mini batch of

49:56 examples a small batch of examples that

49:59 we can compute the gradients over and

50:01 when we take these gradients they're

50:04 still computationally efficient to

50:06 compute because it's a mini batch it's

50:08 not too large maybe we're talking on the

50:10 order of tens or hundreds of examples in your mini-batch

50:15 more importantly because we've expanded

50:18 from a single example to maybe 100

50:19 examples the stochasticity is

50:21 significantly reduced and the accuracy

50:23 of our gradient is much improved

50:26 so normally we're thinking of batch

50:28 sizes mini-batch sizes roughly on the

50:30 order of 100 data points tens or

50:33 hundreds of data points this is much

50:35 faster obviously to compute than

50:36 gradient descent and much more accurate

50:38 to compute compared to stochastic

50:40 gradient descent which is that single

50:42 point example
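
To make that contrast concrete, here is a small self-contained NumPy sketch of mini-batch gradient descent on a toy linear model (nothing from the lecture or labs); the only thing separating full-batch, stochastic, and mini-batch versions is how many examples feed each gradient estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))                                  # toy data set
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=10_000)

w = np.zeros(5)                                                   # weights of a simple linear model
lr, batch_size = 0.1, 64

def gradient(w, X_b, y_b):
    """Gradient of the mean squared error over whichever batch we are given."""
    return 2.0 * X_b.T @ (X_b @ w - y_b) / len(y_b)

for step in range(1_000):
    idx = rng.choice(len(y), size=batch_size, replace=False)      # sample a mini-batch
    w -= lr * gradient(w, X[idx], y[idx])
# batch_size = 1 would recover stochastic gradient descent,
# while using every example at once would recover full (batch) gradient descent
```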

50:44 so this increase in gradient accuracy

50:48 allows us to essentially converge to our

50:50 solution much quicker than would have

50:53 been possible in practice with

50:54 full gradient descent it also

50:56 means that we can increase our learning

50:58 rate because we can trust each of those

51:00 gradients much more right

51:03 we're now averaging over a batch it's

51:05 going to be much more accurate than the

51:07 stochastic version so we can increase

51:08 that learning rate and actually learn faster

51:12 this allows us to also massively

51:14 parallelize this entire algorithm in

51:16 computation right we can split up

51:18 batches onto separate workers and

51:21 Achieve even more significant speed UPS

51:23 of this entire problem using gpus the

51:26 last topic that I very very briefly want

51:29 to cover in today's lecture is this

51:32 topic of overfitting right when we're

51:34 optimizing a neural network with

51:37 stochastic gradient descent we have this

51:40 challenge of what's called overfitting

51:42 overfitting looks roughly like this

51:45 right so on the left hand side

51:47 we want to build a neural network or

51:49 let's say in general we want to build a

51:51 machine learning model that can

51:52 accurately describe some patterns in our

51:55 data but remember ultimately we

51:58 don't just want to describe the patterns in

51:59 our training data ideally we want to

52:01 capture the patterns in our test data of

52:04 course we don't observe test data we

52:06 only observe training data so we have

52:08 this challenge of extracting patterns

52:09 from training data and hoping that they

52:11 generalize to our test data so said

52:14 another way we want to build

52:16 models that can learn representations

52:18 from our training data that can still

52:20 generalize even when we show them brand

52:22 new unseen pieces of test data so assume

52:26 that you want to build a line that can

52:28 describe or find the patterns in these

52:31 points that you can see on the slide

52:32 right if you have a very simple neural

52:35 network which is just a single line

52:38 you can describe this data sub-optimally

52:42 right because the data here is

52:44 non-linear you're not going to

52:45 accurately capture all of the nuances

52:47 and subtleties in this data set that's

52:50 on the left hand side if you move to the

52:51 right hand side you can see a much more

52:53 complicated model but here you're

52:55 actually being too

52:57 expressive and you're capturing

52:59 the spurious nuances in your

53:03 training data that are actually not

53:05 representative of your test data

53:07 ideally you want to end up with the

53:09 model in the middle which is basically

53:11 the middle ground right it's not too

53:12 complex and it's not too simple it still

53:15 gives you what you want it performs well

53:17 even when you give it brand new data
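
One self-contained way to see that middle ground (a toy sketch, unrelated to the figure on the slide) is to fit polynomials of increasing degree to a few noisy points and score them on held-out points:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.shape)     # noisy non-linear data

train, test = np.arange(0, 40, 2), np.arange(1, 40, 2)      # hold out every other point as "unseen" data

for degree in (1, 3, 12):                                   # too simple / middle ground / very flexible
    coeffs = np.polyfit(x[train], y[train], degree)
    test_error = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    print(f"degree {degree:2d}: held-out error {test_error:.3f}")
# the straight line underfits, the high-degree polynomial tends to chase the noise
# in the training points, and the middle model usually does best on the held-out points
```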

53:19 so to address this problem let's briefly

53:22 talk about what's called regularization

53:24 regularization is a technique that you

53:27 can introduce to your training pipeline

53:29 to discourage complex models from being learned

53:33 now as we've seen before this is really

53:35 critical because neural networks are

53:37 extremely large models they are

53:39 extremely prone to overfitting right so

53:42 regularization and having techniques for

53:44 regularization has extreme implications

53:46 towards the success of neural networks

53:49 and having them generalize Beyond

53:50 training data into our testing data

53:53 the most popular technique for

53:55 regularization in deep learning is

53:57 called Dropout and the idea of Dropout

53:59 is actually very simple let's

54:02 revisit it by drawing this picture of

54:04 deep neural networks that we saw earlier

54:05 in today's lecture in Dropout during

54:07 training we essentially randomly select

54:10 some subset of the neurons in this

54:12 neural network and we try to prune them

54:16 out with some random probabilities so

54:18 for example we can select this subset of

54:20 neurons we can randomly select

54:22 them with a probability of 50 percent

54:24 and with that probability we randomly

54:27 turn them off or on on different

54:29 iterations of our training

54:32 so this is essentially forcing the

54:35 neural network to learn you can think of

54:37 an ensemble of different models on every

54:39 iteration it's going to be exposed to

54:42 kind of a different model internally

54:44 than the one it had on the last

54:45 iteration so it has to learn how to

54:47 build internal Pathways to process the

54:50 same information and it can't rely on

54:52 information that it learned on previous

54:54 iterations right so it forces it to kind

54:56 of capture some deeper meaning within

54:58 the pathways of the neural network and

55:00 this can be extremely powerful because

55:02 number one it lowers the capacity of the

55:04 neural network significantly right

55:06 you're lowering it by roughly 50 percent

55:09 but also because it makes them easier to

55:12 train because the number of Weights that

55:14 have gradients in this case is also

55:16 reduced so it's actually much faster to train

55:20 now like I mentioned on every iteration

55:22 we randomly drop out a different set of

55:25 neurons right and that helps the network generalize
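
In frameworks with a Keras-style API (an assumption; the lecture doesn't show code for this), dropout is usually just another layer, and it is only active while training:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # randomly zeroes ~50% of these activations on each training step
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1),
])
# a different random subset of units is dropped on every training iteration;
# at test time dropout is switched off and the full network is used
```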

55:29 and the second regularization technique

55:31 which is actually a very broad

55:32 regularization technique far beyond

55:34 neural networks is simply called early

55:37 stopping now we know the definition

55:41 of overfitting is simply when our model

55:44 starts to represent basically the

55:46 training data more than the testing data

55:48 that's really what overfitting comes

55:50 down to at its core

55:51 if we set aside some of the training

55:53 data to use separately data that we don't

55:55 train on we can use it as kind of a

55:58 synthetic testing data set and

56:02 we can monitor how our network is

56:04 learning on this unseen portion of data

56:07 so for example we can over the course of

56:09 training we can basically plot the

56:11 performance of our Network on both the

56:13 training set as well as our held out

56:15 test set and as the network is trained

56:18 we're going to see that first of all

56:19 these both decrease but there's going to

56:21 be a point where the test loss plateaus

56:25 and starts to increase even though the

56:27 training loss keeps decreasing this is

56:28 exactly the point where you start to

56:30 overfit right because now the model is

56:34 fitting the training data more closely

56:37 than it fits the test data this

56:40 pattern basically continues for the rest

56:41 of training and this is the point that I

56:44 want you to focus on right this Middle

56:46 Point is where we need to stop training

56:48 because after this point assuming that

56:50 this test set is a valid representation

56:52 of the true test set this is the place

56:55 where the accuracy of the model will

56:57 only get worse right so this is where we

56:59 would want to early stop our model and

57:00 regularize the performance

57:02 and we can see that stopping anytime

57:04 before this point is also not good we're

57:07 going to produce an underfit model where

57:09 we could have had a better model on the

57:10 test data but it's this trade-off right

57:13 you can't stop too late and you can't

57:14 stop too early either
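
As a rough sketch of early stopping with a Keras-style callback (the model, toy data, and the patience value below are illustrative assumptions, not prescribed by the lecture):

```python
import numpy as np
import tensorflow as tf

x = np.random.randn(512, 8).astype("float32")     # toy stand-in data
y = np.random.randn(512, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="relu"),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the held-out loss, not the training loss
    patience=5,                  # tolerate 5 epochs without improvement
    restore_best_weights=True,   # roll back to the weights from the best epoch
)

model.fit(x[:400], y[:400],
          validation_data=(x[400:], y[400:]),     # the set-aside data we never train on
          epochs=200,                             # an upper bound; training usually stops earlier
          callbacks=[early_stop], verbose=0)
```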

57:17 so I'll conclude this lecture by just

57:20 summarizing these three key points that

57:21 we've covered in today's lecture so far

57:24 so we've first covered these fundamental

57:26 building blocks of all neural networks

57:28 which is the single neuron the

57:29 perceptron we've built these up into

57:31 larger neural layers and then from there

57:34 neural networks and deep neural networks

57:36 we've learned how we can train these

57:38 apply them to data sets back propagate

57:41 through them and we've seen some

57:43 tips and tricks for optimizing these networks in practice

57:48 in the next lecture we'll hear from Ava

57:50 on deep sequence modeling using RNNs and

57:53 specifically this very exciting new type

57:56 of model called the Transformer

57:58 architecture and attention mechanisms so

58:01 maybe let's resume the class in about

58:03 five minutes after we have a chance to

58:04 swap speakers and thank you so much for

58:07 all of your attention