00:09 good afternoon everyone thank you all

00:11 for joining today my name is Alexander

00:13 amini and I'll be one of your course

00:15 organizers this year along with Ava and

00:19 together we're super excited to

00:21 introduce you all to introduction to

00:23 deep learning now MIT intro to deep

00:26 learning is a really really fun exciting

00:28 and fast-paced program here at MIT and

00:32 let me start by just first of all giving

00:33 you a little bit of background into what

00:36 we do and what you're going to learn

00:37 about this year so this week of intro to

00:41 deep learning we're going to cover a ton

00:42 of material in just one week you'll

00:45 learn the foundations of this really

00:46 really fascinating and exciting field of

00:49 deep learning and artificial intelligence

00:51 and more importantly you're going to get

00:53 hands-on experience actually reinforcing

00:56 what you learn in the lectures as part

00:59 of Hands-On software labs

01:01 now over the past decade Ai and deep

01:03 learning have really had a huge

01:05 Resurgence and had incredible successes

01:08 and a lot of problems that even just a

01:10 decade ago we thought were not really

01:12 even solvable in the near future now

01:14 we're solving with deep learning with

01:16 incredible ease now this past year in

01:19 particular of 2022 has been an

01:22 incredible year for deep learning

01:23 progress and I'd like to say that

01:25 actually this past year in particular

01:27 has been the year of generative deep

01:29 learning using deep learning to generate

01:31 brand new types of data that have never

01:34 been seen before and never existed in

01:37 reality in fact I want to start this

01:39 class by actually showing you how we

01:40 started this class several years ago

01:42 which was by playing this video that

01:45 I'll play in a second now this video

01:47 actually was an introductory video for

01:49 the class and it kind of

01:52 exemplifies this idea that I'm talking

01:54 about so let me just stop there and play

01:56 this video first of all

02:01 hi everybody and welcome

02:06 to the official introductory course on deep

02:10 learning here at MIT

02:13 deep learning is revolutionizing so many

02:17 fields from robotics to medicine and

02:21 everything in between

02:23 you'll learn the fundamentals of this

02:26 field and how you can build some of

02:29 these incredible algorithms

02:32 in fact this entire speech and video

02:36 are not real and were created using deep

02:41 learning and artificial intelligence

02:45 and in this class you'll learn how

02:48 it has been an honor to speak with you

02:51 today and I hope you enjoy the course

02:56 so in case you couldn't tell this video

02:59 and its entire audio was actually not

03:02 real it was synthetically generated by a

03:04 deep learning algorithm and when we

03:06 introduced this class A few years ago

03:09 this video was created several years ago

03:11 right but even several years ago when we

03:14 introduced this and put it on YouTube it

03:16 went somewhat viral right people really

03:18 loved this video they were intrigued by

03:20 how real the video and audio felt and

03:23 looked uh entirely generated by an

03:26 algorithm by a computer and people were

03:28 shocked with the power and the realism

03:30 of these types of approaches and this

03:32 was a few years ago now fast forward to

03:35 today and look at the state of deep learning we

03:40 have seen deep learning accelerating at

03:43 a rate faster than we've ever seen

03:45 before in fact we can use deep learning

03:47 now to generate not just images of faces

03:51 but generate full synthetic environments

03:53 where we can train autonomous vehicles

03:55 entirely in simulation and deploy them

03:58 on full-scale vehicles in the real world

04:00 seamlessly the videos here you see are

04:02 actually from a data driven simulator

04:04 generated by neural networks called

04:06 Vista that we actually built here at MIT

04:09 and have open sourced to the public so

04:11 all of you can actually train and build

04:13 the future of autonomy and self-driving

04:15 cars and of course it goes far beyond

04:17 this as well deep learning can be used

04:19 to generate content directly from how we

04:23 speak and the language that we convey to

04:25 it from prompts that we say deep

04:27 learning can reason about the prompts in

04:29 natural language and English for example

04:31 and then guide and control what is

04:34 generated according to what we specify

04:38 we've seen examples of where we can

04:40 generate for example things that again

04:43 have never existed in reality we can ask

04:45 a neural network to generate a photo of

04:47 an astronaut riding a horse and it

04:50 actually can imagine hallucinate what

04:53 this might look like even though of

04:54 course this photo not only this photo

04:57 has never occurred before but I don't

04:58 think any photo of an astronaut riding a

05:00 horse has ever occurred before so

05:02 there's not really even training data

05:03 that you could go off of in this case and

05:05 my personal favorite is actually how we

05:07 can not only build software that can

05:09 generate images and videos but build

05:11 software that can generate software as

05:14 well we can also have algorithms that

05:16 can take language prompts for example a

05:19 prompt like this write code in

05:20 tensorflow to generate or to train a

05:23 neural network and not only will it

05:25 write the code and create that neural

05:27 network but it will have the ability to

05:30 reason about the code that it's

05:32 generated and walk you through step by

05:33 step explaining the process and

05:35 procedure all the way from the ground up

05:37 to you so that you can actually learn

05:38 how to do this process as well

05:42 I think some of these examples really

05:44 just highlight how far deep learning and

05:47 these methods have come in the past six

05:49 years since we started this course and

05:50 you saw that example just a few years

05:52 ago from that introductory video but now

05:55 we're seeing such incredible advances

05:56 and the most amazing part of this course

05:58 in my opinion is actually that within

06:01 this one week we're going to take you

06:03 through from the ground up starting from

06:04 today all of the foundational building

06:07 blocks that will allow you to understand

06:09 and make all of these amazing advances

06:13 so with that hopefully now you're all

06:15 super excited about what this class will

06:18 teach and I want to basically now just

06:20 start by taking a step back and

06:22 introducing some of these terminologies

06:24 that I've kind of been throwing around

06:26 so far deep learning artificial

06:28 intelligence what do these things mean

06:30 so first of all I want to maybe just

06:35 speak a little bit about intelligence

06:37 and what intelligence means at its core

06:39 so to me intelligence is simply the

06:42 ability to process information such that

06:45 we can use it to inform some future

06:46 decision or action that we take

06:49 now the field of artificial intelligence

06:52 is simply the ability for us to build

06:54 algorithms artificial algorithms that

06:56 can do exactly this process information

06:58 to inform some future decision

07:01 now machine learning is simply a subset

07:03 of AI which focuses specifically on how

07:06 we can build a machine to or teach a

07:10 machine how to do this from some

07:12 experiences or data for example now deep

07:15 learning goes One Step Beyond this and

07:17 is a subset of machine learning which

07:19 focuses explicitly on what are called

07:20 neural networks and how we can build

07:22 neural networks that can extract

07:24 features in the data these are basically

07:25 what you can think of as patterns that

07:27 occur within the data so that it can

07:29 learn to complete these tasks as well

07:32 now that's exactly what this class is

07:35 really all about at its core we're going

07:37 to try and teach you and give you the

07:38 foundational understanding and how we

07:40 can build and teach computers to learn

07:43 tasks many different type of tasks

07:45 directly from raw data and that's really

07:48 what this class boils down to at its

07:50 most simple form and we'll provide

07:52 a very solid foundation for you both on

07:55 the technical side through the lectures

07:57 which will happen in two parts

07:59 throughout the class the first lecture

08:01 and the second lecture each one about

08:02 one hour long followed by a software lab

08:05 which will immediately follow the

08:06 lectures which will try to reinforce a

08:08 lot of what we cover in

08:10 the technical part of the class and you

08:13 know give you hands-on experience

08:14 implementing those ideas

08:16 so this program is split between these

08:19 two pieces the technical lectures and

08:20 the software Labs we have several new

08:22 updates this year especially

08:24 in many of the later lectures the first

08:27 lecture will cover the foundations of

08:29 deep learning which is going to be right now

08:32 and finally we'll conclude the course

08:34 with some very exciting guest lecturers

08:36 from both Academia and Industry who are

08:38 really leading and driving forward the

08:41 state of AI and deep learning and of

08:43 course we have many awesome prizes that

08:46 go with all of the software labs and the

08:49 project competition at the end of the

08:51 course so maybe quickly to go through

08:53 these each day like I said we'll have

08:54 dedicated software Labs that couple with

08:58 starting today with lab one you'll

09:00 actually build a neural network keeping

09:02 with this theme of generative AI you'll

09:04 build a neural network that can

09:05 listen to a lot of music and actually

09:08 learn how to generate brand new songs in

09:10 that genre of music

09:12 then at the end of the

09:14 class on Friday we'll host a project

09:16 pitch competition where either you

09:18 individually or as part of a group can

09:21 participate and present an idea a novel

09:24 deep learning idea to all of us it'll be

09:27 roughly three minutes in length and we

09:31 will focus not as much because this is a

09:33 one week program we're not going to

09:35 focus so much on the results of your

09:37 pitch but rather The Innovation and the

09:39 idea and the novelty of what you're proposing

09:42 the prizes here are quite significant

09:44 already where first prize is going to

09:46 get an Nvidia GPU which is really a key

09:48 piece of Hardware that is instrumental

09:50 if you want to actually build a deep

09:52 learning project and train these neural

09:54 networks which can be very large and

09:55 require a lot of compute these prizes

09:57 will give you the compute to do so and

09:59 finally this year we'll be awarding a

10:01 grand prize for labs two and three

10:03 combined which will occur on Tuesday and

10:06 Wednesday focused on what I believe is

10:08 actually solving some of the most

10:09 exciting problems in this field of deep

10:11 learning and how specifically how we can

10:13 build models that can be robust not only

10:17 accurate but robust and trustworthy and

10:19 safe when they're deployed as well and

10:21 you'll actually get experience

10:23 developing those types of solutions that

10:25 can actually advance the state of the art

10:28 now all of these Labs that I mentioned

10:30 and competitions here are going to be

10:33 due on Thursday night at 11 PM right

10:36 before the last day of class and we'll

10:38 be helping you all along the way this

10:40 this Prize or this competition in

10:42 particular has very significant prizes

10:45 so I encourage all of you to really

10:46 enter this prize and try to get a

10:50 chance to win the prize

10:52 and of course like I said we're going to

10:54 be helping you all along the way there are

10:55 many available resources throughout this

10:57 class to help you achieve this please

11:00 post to Piazza if you have any questions

11:02 and of course this program has an

11:04 incredible team that you can reach out

11:06 to at any point in case you have any

11:08 issues or questions on the materials

11:10 myself and Ava will be your two main

11:13 lecturers for the first part of the class

11:14 we'll also be hearing like I said in the

11:17 later part of the class from some guest

11:19 lecturers who will share some really

11:21 cutting edge state-of-the-art

11:22 developments in deep learning and of

11:24 course I want to give a huge shout out

11:26 and thanks to all of our sponsors who

11:28 without their support this program

11:29 wouldn't have been possible

11:30 yet again for another year so thank you

11:34 okay so now with that let's really dive

11:37 into the really fun stuff of today's

11:39 lecture which is you know the

11:41 technical part and I think I want to

11:43 start this part by asking all of you

11:48 to ask yourselves this question

11:50 of you know why are all of you here

11:53 first of all why do you care about this

11:54 topic in the first place now

11:57 I think to answer this question we have

11:59 to take a step back and think about you

12:01 know the history of machine learning and

12:03 what machine learning is and what deep

12:05 learning brings to the table on top of

12:07 machine learning now traditional machine

12:09 learning algorithms typically Define

12:11 what are called a set of features in

12:14 the data you can think of these as

12:15 certain patterns in the data and then

12:17 usually these features are hand

12:19 engineered so probably a human will come

12:20 into the data set and with a lot of

12:22 domain knowledge and experience can try

12:25 to uncover what these features might be

12:27 now the key idea of deep learning and

12:29 this is really Central to this class is

12:31 that instead of having a human Define

12:33 these features what if we could have a

12:35 machine look at all of this data and

12:38 actually try to extract and uncover what

12:40 are the core patterns in the data so

12:42 that it can use those when it sees new

12:44 data to make some decisions so for

12:46 example if we wanted to detect faces in

12:48 an image a deep neural network algorithm

12:51 might actually learn that in order to

12:53 detect a face it first has to detect

12:55 things like edges in the image lines and

12:57 edges and when you combine those lines

12:59 and edges you can actually create

13:00 compositions of features like corners

13:03 and curves which when you create those

13:05 when you combine those you can create

13:07 more high level features for example

13:08 eyes and noses and ears and then those

13:11 are the features that allow you to

13:13 ultimately detect what you care about

13:15 detecting which is the face but all of

13:16 these come from what are called kind of

13:18 a hierarchical learning of features and

13:20 you can actually see some examples of

13:22 these these are real features learned by

13:24 a neural network and how they're

13:25 combined defines this progression of features

13:28 but in fact what I just described this

13:31 underlying and fundamental building

13:33 block of neural networks and deep

13:34 learning has actually existed for

13:37 decades now why are we studying all of

13:40 this now and today in this class with

13:42 all of this great enthusiasm to learn

13:44 this right well for one there have been

13:46 several key advances that have occurred

13:48 in the past decade number one is that

13:51 data is so much more pervasive than it

13:54 has ever been before in our lifetimes

13:56 these models are hungry for more data

13:59 and we're living in the age of Big Data

14:02 more data is available to these models

14:04 than ever before and they Thrive off of

14:07 that secondly these algorithms are

14:10 massively parallelizable they require a

14:12 lot of compute and we're also at a

14:14 unique time in history where we have the

14:17 ability to train these extremely

14:19 large-scale algorithms and techniques

14:21 that have existed for a very long time

14:23 but we can now train them due to the

14:24 hardware advances that have been made

14:26 and finally due to open source toolboxes

14:28 and software platforms like

14:30 tensorflow for example which all of you

14:32 will get a lot of experience on in this

14:34 class training and building the code for

14:37 these neural networks has never been

14:39 easier so that from the software point

14:40 of view as well there have been

14:42 incredible advances to open source you

14:44 know the underlying fundamentals of

14:46 what you're going to learn

14:48 so let me start now with just building

14:50 up from the ground up the fundamental

14:52 building block of every single neural

14:55 network that you're going to learn in

14:56 this class and that's going to be just a

14:58 single neuron right and in neural

15:01 network language a single neuron is

15:02 called a perceptron

15:05 so what is the perceptron a perceptron

15:07 is like I said a single neuron and it's

15:11 actually I'm going to say it's very very

15:13 simple idea so I want to make sure that

15:15 everyone in the audience understands

15:16 exactly what a perceptron is and how it works

15:19 so let's start by first defining a

15:21 perceptron as taking as input a set

15:24 of inputs right so on the left hand side

15:26 you can see this perceptron takes M

15:29 different inputs x1 to xm right these are

15:32 the blue circles we're denoting these

15:36 each of these numbers each of these

15:39 inputs is then multiplied by a

15:41 corresponding weight which we can call W

15:43 right so X1 will be multiplied by W1

15:46 and we'll add the result of all of these

15:49 multiplications together

15:51 now we take that single number after the

15:53 addition and we pass it through this

15:55 non-linear what we call a non-linear

15:57 activation function and that produces

15:59 our final output of the perceptron

16:05 now this is actually not an entirely accurate

16:07 picture of a perceptron there's

16:09 one step that I forgot to mention here

16:11 so in addition to multiplying all of

16:14 these inputs with their corresponding

16:15 weights we're also now going to add

16:17 what's called a bias term here denoted

16:19 as this w0 which is just a scalar weight

16:22 and you can think of it coming with a

16:24 input of just one so that's going to

16:26 allow the network to basically shift its

16:29 nonlinear activation function

16:32 as it sees its inputs

16:36 now on the right hand side you can see

16:39 this same diagram mathematically formulated as a

16:42 single equation we can now rewrite this

16:44 linear equation with linear

16:46 algebra in terms of vectors and dot

16:49 products right so for example we can

16:50 define our entire inputs x1 to xm as a

16:57 large vector x that large vector x can be

17:00 dot multiplied excuse me matrix

17:02 multiplied with our weights w this again

17:06 another vector of our weights w1 to wm

17:10 taking their dot product not only

17:13 multiplies them but it also adds the

17:15 resulting terms together adding a bias

17:17 like we said before and applying this non-linearity
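
A minimal sketch of the perceptron equation just described (dot product of inputs and weights, add the bias, pass through a nonlinearity); the function and variable names here are illustrative, not taken from the lecture slides:

```python
import tensorflow as tf

def perceptron_output(x, w, w0):
    # x: input vector of m numbers, w: weight vector of m numbers, w0: scalar bias
    z = w0 + tf.reduce_sum(x * w)      # weighted sum of the inputs plus the bias
    return tf.math.sigmoid(z)          # pass through the nonlinear activation g

# illustrative values only
y = perceptron_output(tf.constant([1.0, 2.0]),
                      tf.constant([0.5, -0.5]),
                      1.0)
```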

17:22 now you might be wondering what is this

17:24 non-linear function I've mentioned it a

17:26 few times already well I said it is a

17:28 function right that we

17:31 pass the outputs of the neural network

17:33 through before we return it you know to

17:35 the next neuron in the in the pipeline

17:38 right so one common example of a

17:40 nonlinear function that's very popular

17:41 in deep neural networks is called the

17:43 sigmoid function you can think of this

17:45 as kind of a continuous version of a

17:46 threshold function right it goes from

17:48 zero to one and it can take

17:51 as input any real number on the real number line

17:54 and you can see an example of it

17:56 illustrated on the bottom right hand side now

17:58 in fact there are many types of

18:00 nonlinear activation functions that are

18:02 popular in deep neural networks and here

18:03 are some common ones and throughout this

18:05 presentation you'll actually see some

18:07 examples of these code snippets on the

18:09 bottom of the slides where we'll try and

18:11 actually tie in some of what you're

18:13 learning in the lectures to actual

18:15 software and how you can Implement these

18:16 pieces which will help you a lot for

18:18 your software Labs explicitly so the

18:20 sigmoid activation on the left is very

18:22 popular since it's a function that

18:24 outputs you know between zero and one so

18:26 especially when you want to deal with

18:27 probability distributions for example

18:29 this is very important because

18:30 probabilities live between 0 and 1.

18:33 in modern deep neural networks though

18:35 the relu function which you can see on

18:37 the far right hand side is a very popular

18:38 activation function because it's

18:40 piecewise linear it's extremely

18:41 efficient to compute especially when

18:43 Computing its derivatives right its

18:45 derivatives are constants except for one

18:47 nonlinearity at zero
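
For reference, the activation functions mentioned here are available directly in TensorFlow; a small sketch using the standard function names:

```python
import tensorflow as tf

z = tf.constant([-2.0, 0.0, 2.0])

tf.math.sigmoid(z)   # squashes any real number into (0, 1); useful for probabilities
tf.math.tanh(z)      # squashes values into (-1, 1)
tf.nn.relu(z)        # piecewise linear max(0, z), cheap to compute and differentiate
```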

18:51 now I hope actually all of you are

18:53 probably asking this question to

18:54 yourself of why do we even need this

18:56 nonlinear activation function it seems

18:57 like it kind of just complicates this

18:59 whole picture when we didn't really need

19:00 it in the first place and I want to just

19:03 spend a moment on answering this because

19:05 the point of a nonlinear activation

19:07 function is of course number one is to

19:09 introduce non-linearities to our data

19:12 right if we think about our data almost

19:15 all data that we care about all real

19:17 world data is highly non-linear now this

19:19 is important because if we want to be

19:21 able to deal with those types of data

19:23 sets we need models that are also

19:24 nonlinear so they can capture those same

19:26 types of patterns so imagine I told you

19:28 to separate for example this

19:29 data set of red points from green points and

19:31 I ask you to try and separate those two

19:33 types of data points now you might think

19:36 that this is easy but what if I

19:37 told you that you could only

19:39 use a single line to do so well now it

19:42 becomes a very complicated problem in

19:43 fact you can't really solve it

19:44 effectively with a single line

19:47 and in fact if you introduce nonlinear

19:50 activation functions to your Solution

19:52 that's exactly what allows you to you

19:54 know deal with these types of problems

19:56 nonlinear activation functions allow you

19:58 to deal with non-linear types of data

20:01 now and that's what exactly makes neural

20:03 networks so powerful at their core

20:06 so let's understand this maybe with a

20:08 very simple example walking through this

20:10 diagram of a perceptron one more time

20:12 imagine I give you this trained neural

20:14 network with weights now not W1 W2 I'm

20:17 going to actually give you numbers at

20:19 these locations right so the trained

20:21 w0 will be 1 and W will be a vector of 3

20:27 so this neural network has two inputs

20:29 like we said before it has input X1 it

20:31 has input X2 if we want to get the

20:33 output of it this is also the main thing

20:36 I want all of you to take away from this

20:37 lecture today is that to get the output

20:39 of a perceptron there are three steps we

20:41 need to take right from this stage we

20:43 first compute the multiplication of our

20:45 inputs with our weights

20:48 sorry yeah multiply them together add

20:52 their result and compute a non-linearity

20:54 it's these three steps that Define the

20:57 forward propagation of information

20:58 through a perceptron

21:00 so let's take a look at how that exactly

21:03 works right so if we plug in these

21:05 numbers into those equations we can

21:08 see that everything inside of our

21:10 non-linearity here the nonlinearity is G

21:12 right that function G which could be a

21:15 sigmoid like we saw on a previous slide

21:17 that component inside of our

21:20 nonlinearity is in fact just a

21:22 two-dimensional line it has two inputs

21:24 and if we consider the space of all of

21:27 the possible inputs that this neural

21:29 network could see we can actually plot

21:31 this on a decision boundary right we can

21:33 plot this two-dimensional line as a

21:37 decision boundary as a plane separating

21:40 these two components of our space

21:42 in fact not only is it a single plane

21:45 there's a directionality component

21:46 depending on which side of the plane

21:48 that we live on if we see an input for

21:50 example here negative one two we

21:53 actually know that it lives on one side

21:55 of the plane and it will have a certain

21:57 type of output in this case that output

21:59 is going to be positive right because in

22:02 this case when we plug those components

22:04 into our equation we'll get a positive

22:06 number that passes through the

22:09 nonlinear component and that gets

22:11 propagated through as well of course if

22:13 you're on the other side of the space

22:14 you're going to have the opposite result

22:17 right and that thresholding function is

22:19 going to essentially live at this

22:21 decision boundary so depending on which

22:22 side of the space you live on that

22:24 thresholding function that sigmoid

22:26 function is going to then control how

22:29 you move to one side or the other

22:32 now in this particular example this is

22:35 very convenient right because we can

22:36 actually visualize and I can draw this

22:38 exact full space for you on this slide

22:41 it's only a two-dimensional space so

22:42 it's very easy for us to visualize but

22:44 of course for almost all problems that

22:47 we care about our data points are not

22:49 going to be two-dimensional right if you

22:51 think about an image the dimensionality

22:53 of an image is going to be the number of

22:54 pixels that you have in the image right

22:56 so these are going to be thousands of

22:58 Dimensions millions of Dimensions or

23:00 even more and then drawing these types

23:03 of plots like you see here is simply not

23:05 feasible right so we can't always do

23:07 this but hopefully this gives you some

23:08 intuition to understand kind of as we

23:11 build up into more complex models

23:14 so now that we have an idea of the

23:16 perceptron let's see how we can actually

23:18 take this single neuron and start to

23:20 build it up into something more

23:21 complicated a full neural network and

23:23 build a model from that

23:24 so let's revisit again this previous

23:27 diagram of the perceptron if again just

23:30 to reiterate one more time this core

23:32 piece of information that I want all of

23:34 you to take away from this class is how

23:36 a perceptron works and how it propagates

23:38 information to its decision there are

23:41 three steps first is the dot product

23:43 second is the bias and third is the

23:45 non-linearity and you keep repeating

23:47 this process for every single perceptron

23:49 in your neural network

23:51 let's simplify the diagram a little bit

23:53 I'll get rid of the weights and you can

23:56 assume that every line here now

23:57 basically has an associated weight

23:59 scalar every

24:01 line corresponds to the

24:03 input that's coming in and it carries a

24:05 weight

24:07 on the line itself and I've also removed

24:10 the bias just for a sake of Simplicity

24:12 but it's still there

24:14 so now the result is z let's

24:18 call that the result of our dot product

24:20 plus the bias and that's what

24:22 we pass into our non-linear function

24:24 that piece is going to be passed through

24:27 that activation function now the final

24:29 output here is simply going to be G

24:33 which is our activation function of Z

24:36 right Z is going to be basically what

24:38 you can think of as the state of this

24:39 neuron it's the result of that dot product

24:43 now if we want to Define and build up a

24:46 multi-layered output neural network if

24:49 we want two outputs to this function for

24:51 example it's a very simple procedure we

24:52 just have now two neurons two

24:54 perceptrons each perceptron will control

24:56 the output for its Associated piece

25:00 so now we have two outputs each one is a

25:02 normal perceptron it takes all of the

25:03 inputs so they both take the same inputs

25:06 but amazingly now with this mathematical

25:08 understanding we can start to build our

25:11 first neural network entirely from

25:13 scratch so what does that look like so

25:15 we can start by firstly initializing

25:17 these two components the first component

25:19 that we saw was the weight Matrix excuse

25:22 me the weight Vector it's a vector of

25:24 Weights in this case

25:26 and the second component is the bias

25:29 vector that we're going to add to

25:31 the dot product of all of our inputs with

25:33 our weights right so

25:36 the only remaining step now after we've

25:38 defined these parameters of our layer is

25:40 to now define you know how forward

25:43 propagation of information works and

25:45 that's exactly those three main

25:46 components that I've been stressing to you

25:49 so we can create this call function to

25:51 do exactly that to Define this forward

25:53 propagation of information

25:55 and the story here is exactly the same

25:57 as we've been seeing it right Matrix

25:58 multiply our inputs with our weights

26:04 and then apply a non-linearity and

26:06 return the result and that literally

26:09 this code will run this will Define a

26:11 full net a full neural network layer

26:13 that you can then take like this
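
As a rough sketch of the kind of layer just described (initialize a weight matrix and a bias, then a call function that matrix-multiplies the inputs with the weights, adds the bias, and applies a nonlinearity); this is an illustration, not the exact code from the slide:

```python
import tensorflow as tf

class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, input_dim, output_dim):
        super(MyDenseLayer, self).__init__()
        # initialize the weights and the bias of this layer
        self.W = self.add_weight(shape=(input_dim, output_dim), initializer="random_normal")
        self.b = self.add_weight(shape=(1, output_dim), initializer="zeros")

    def call(self, inputs):
        # forward propagation: matrix multiply, add bias, apply the nonlinearity
        z = tf.matmul(inputs, self.W) + self.b
        return tf.math.sigmoid(z)
```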

26:16 and of course actually luckily for all

26:18 of you all of that code which wasn't

26:20 much code that's been abstracted away by

26:22 these libraries like tensorflow you can

26:24 simply call functions like this which

26:26 will actually you know replicate exactly that behavior

26:30 so you don't need to necessarily copy

26:32 all of that code down you can just call it
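
In other words, something along these lines, using the standard Keras Dense layer, gives you the same fully connected layer without writing it yourself:

```python
import tensorflow as tf

# a fully connected ("dense") layer with 3 output units and a sigmoid nonlinearity
layer = tf.keras.layers.Dense(units=3, activation="sigmoid")
```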

26:36 and with that understanding you know we

26:39 just saw how you could build a single

26:40 layer but of course now you can actually

26:43 start to think about how you can stack

26:45 these layers as well so since we now

26:48 have this transformation essentially

26:50 from our inputs to a hidden output you

26:53 can think of this as basically how we

26:56 Define some way of transforming those

27:00 inputs right into some new dimensional

27:03 space right perhaps closer to the value

27:06 that we want to predict and that

27:07 transformation is going to be eventually

27:09 learned to know how to transform those

27:11 inputs into our desired outputs and

27:13 we'll get to that later but for now the

27:16 piece that I want to really focus on is

27:17 if we have these more complex neural

27:19 networks I want to really distill down

27:21 that this is nothing more complex than

27:23 what we've already seen if we focus on

27:24 just one neuron in this diagram

27:28 take this one here for example z2 right z2 is

27:31 this neuron that's highlighted in the diagram

27:34 it's just the same perceptron that we've

27:36 been seeing so far in this class

27:39 its output is obtained by taking a dot

27:41 product adding a bias and then applying

27:43 that non-linearity to all of its inputs

27:46 if we look at a different node for

27:47 example Z3 which is the one right below

27:49 it it's the exact same story again it

27:51 sees all of the same inputs but it has a

27:53 different set of weights that it's

27:55 going to apply to those inputs so we'll

27:57 have a different output but the

27:58 mathematical equations are exactly the same

28:01 so from now on I'm just going to kind of

28:03 simplify all of these lines and diagrams

28:05 just to show these icons in the middle

28:07 just to demonstrate that this means

28:09 everything is going to be fully connected

28:10 to everything and defined by those

28:12 mathematical equations that we've been

28:14 covering but there's no extra complexity

28:16 in these models from what you've already seen

28:19 now if you want to Stack these types of

28:21 Solutions on top of each other these

28:24 layers on top of each other you can not

28:26 only Define one layer very easily but

28:28 you can actually create what are called

28:29 sequential models these sequential

28:31 models you can Define one layer after

28:33 another and they define basically the

28:36 forward propagation of information not

28:37 just from the neuron level but now from

28:40 the layer level every layer will be

28:42 fully connected to the next layer and

28:44 the inputs of the second layer will

28:46 be all of the outputs of the prior layer
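
A minimal sketch of such a sequential model in TensorFlow; the layer sizes here are arbitrary placeholders, not values from the lecture:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=32, activation="relu"),   # hidden layer
    tf.keras.layers.Dense(units=32, activation="relu"),   # another hidden layer
    tf.keras.layers.Dense(units=2),                        # output layer with two outputs
])
```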

28:50 now of course if you want to create a

28:51 very deep neural network all a deep

28:53 neural network is is that we just keep

28:55 stacking these layers on top of each

28:56 other there's nothing else to this story

28:58 it's really as simple as that so

29:02 these layers are basically all they are

29:03 is just layers where the final output is

29:06 computed by going deeper and deeper into

29:09 this progression of different layers

29:10 right and you just keep stacking them

29:12 until you get to the last layer which is

29:14 your output layer it's your final

29:15 prediction that you want to Output

29:18 right we can create a deep neural

29:19 network to do all of this by stacking

29:21 these layers and creating these more

29:23 hierarchical models like we saw very

29:25 early in the beginning of today's

29:27 lecture one where the final output is

29:29 really computed by you know just going

29:31 deeper and deeper into this system

29:34 okay so that's awesome so we've now seen

29:37 how we can go from a single neuron to a

29:40 layer to all the way to a deep neural

29:42 network right building off of these

29:44 foundational principles

29:46 let's take a look at how exactly we can

29:48 use these uh you know principles that

29:52 we've just discussed to solve a very

29:54 real problem that I think all of you are

29:56 probably very concerned about uh this

29:59 morning when you when you woke up so

30:01 that problem is how we can build a

30:04 neural network to answer this question

30:05 which is will I pass this

30:07 class or will I not

30:10 so to answer this question let's see if

30:12 we can train a neural network to solve this problem

30:16 so to do this let's start with a very

30:18 simple neural network right we'll train

30:20 this model with two inputs just two

30:22 inputs one input is going to be the

30:23 number of lectures that you attend over

30:25 the course of this one week and the

30:27 second input is going to be how many

30:30 hours that you spend on your final

30:31 project or your competition

30:34 okay so what we're going to do is

30:36 firstly go out and collect a lot of data

30:38 from all of the past years that we've

30:39 taught this course and we can plot all

30:41 of this data because it's only a two-input

30:43 space we can plot this data on a

30:44 two-dimensional feature space right we

30:47 can actually look at all of the students

30:49 before you that have passed the class

30:51 and failed the class and see where they

30:53 lived in this space for the amount of

30:55 hours that they've spent the number of

30:56 lectures that they've attended and so on

30:58 green points are the people who have

31:00 passed red are those who have failed now

31:03 and here's you right you're right here

31:05 four five is your coordinate in this space

31:07 you fall right there and you've attended

31:10 four lectures you've spent five hours on

31:11 your final project we want to build a

31:13 neural network to answer the question of

31:15 will you pass the class

31:18 so let's do it we have two inputs one is

31:20 four one is five these are two numbers

31:22 we can feed them through a neural

31:24 network like the one we've just seen how to build

31:27 and we feed that into a single hidden layer with

31:30 three hidden units in this example but

31:32 we could make it larger if we wanted to

31:33 be more expressive and more powerful
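
As a hedged sketch (not the slide's exact code), a two-input model with three hidden units and a single probability output might be written like this; the single sigmoid output for the pass probability is an assumption:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),                       # two inputs: lectures attended, hours spent
    tf.keras.layers.Dense(3, activation="relu"),      # hidden layer with 3 units
    tf.keras.layers.Dense(1, activation="sigmoid"),   # output: probability of passing the class
])

# feed in the example input: 4 lectures attended, 5 hours on the final project
prediction = model(tf.constant([[4.0, 5.0]]))
```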

31:36 and we see here that the probability of

31:37 you passing this class is 0.1

31:40 that's pretty low so why would this be

31:42 the case right what did we do wrong

31:44 because I don't think it's correct right

31:46 when we looked at the space it looked

31:49 like actually you were a good candidate

31:50 to pass the class but why is the neural

31:52 network saying that there's only a 10 percent

31:54 likelihood that you should pass does

31:56 anyone have any ideas

32:05 this neural network is just uh like it

32:08 was just born right it has no

32:09 information about the the world or this

32:12 class it doesn't know what four and five

32:14 mean or what the notion of passing or

32:16 failing means right so

32:19 exactly right this neural network has

32:21 not been trained you can think of it

32:22 kind of as a baby it hasn't learned

32:25 anything yet so our job firstly is to

32:28 train it and part of that understanding

32:29 is we first need to tell the neural

32:31 network when it makes mistakes right so

32:33 mathematically we should now think about

32:35 how we can answer this question which is

32:37 does did my neural network make a

32:39 mistake and if it made a mistake how can

32:42 I tell it how big of a mistake it was so

32:43 that the next time it sees this data

32:46 point it can do better and minimize that mistake

32:49 so in neural network language those

32:52 mistakes are called losses right and

32:54 specifically you want to Define what's

32:56 called a loss function which is going to

32:58 take as input your prediction

33:01 and the true prediction right and how

33:03 far away your prediction is from the

33:05 true prediction tells you how big of a

33:07 loss there is right so for example

33:11 let's say we want to build a neural

33:14 network to do classification of

33:16 or sorry actually even before that I

33:19 want to maybe give you some terminology

33:20 so there are multiple different ways of

33:23 saying the same thing in neural networks

33:25 and deep learning so what I just

33:27 described as a loss function is also

33:29 commonly referred to as an objective

33:30 function empirical risk a cost function

33:33 these are all exactly the same thing

33:35 they're all a way for us to train the

33:37 neural network to teach the neural

33:38 network when it makes mistakes

33:40 and what we really ultimately want to do

33:43 is over the course of an entire data set

33:45 not just one data point

33:48 but over the entire data set we

33:50 want to minimize all of the mistakes on

33:53 average that this neural network makes

33:56 so if we look at the problem like I said

33:58 of binary classification will I pass

34:00 this class or will I not there's a yes

34:02 or no answer that means binary classification

34:05 now we can use what's called a loss

34:08 function called the softmax cross entropy

34:10 loss and for those of you who aren't

34:12 familiar this notion of cross entropy was

34:14 actually developed here at MIT by

34:21 Claude Shannon who was a visionary he did

34:25 his Masters here over 50 years ago he

34:28 introduced this notion of cross-entropy

34:30 and that was you know pivotal in in the

34:33 ability for us to train these types of

34:35 neural networks even now into the future
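
The lecture names the softmax cross entropy loss; as a sketch of the same idea for the binary pass/fail case, a closely related binary cross entropy call in TensorFlow looks like this (the numbers are illustrative):

```python
import tensorflow as tf

y_true = tf.constant([[1.0]])   # the true label: 1 means the student passed
y_pred = tf.constant([[0.1]])   # the network's predicted probability of passing

# cross entropy between the true label and the predicted probability
loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
```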

34:38 so let's start by instead of predicting

34:42 a binary output what if we

34:44 wanted to predict a final grade of your

34:47 class score for example that's no longer

34:49 a binary output yes or no it's actually

34:52 a continuous variable right it's the

34:53 grade let's say out of 100 points what

34:56 is the value of your score in the class

34:58 project right for this type of loss we

35:01 can use what's called a mean squared

35:02 error loss you can think of this

35:03 literally as just subtracting your

35:05 predicted grade from the true grade and

35:08 minimizing that distance apart
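
A sketch of the mean squared error idea for the continuous-grade case: subtract the predicted grade from the true grade, square it, and average; the grade values below are made up for illustration:

```python
import tensorflow as tf

y_true = tf.constant([85.0, 92.0])   # true grades (illustrative numbers)
y_pred = tf.constant([80.0, 95.0])   # predicted grades

# mean squared error: average of the squared differences
loss = tf.reduce_mean(tf.square(y_true - y_pred))
```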

35:11 so I think now we're ready to

35:14 really put all of this information

35:15 together and Tackle this problem of

35:18 training a neural network right to not only determine

35:23 how erroneous it is how large its loss

35:26 is but more importantly minimize that

35:28 loss as a function of seeing all of this

35:30 training data that it observes

35:33 so we know that we want to find this

35:35 neural network like we mentioned before

35:37 that minimizes this empirical risk or

35:40 this empirical loss averaged across our

35:43 entire data set now this means that we

35:45 want to find mathematically these W's

35:48 right that minimize J of W J of W is our

35:52 loss function averaged over our entire

35:54 data set and W are our weights so we want

35:56 to find the set of Weights that on

35:59 average is going to give us the minimum

36:01 the smallest loss as possible
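
In symbols, the objective described here is to find the weights that minimize the loss averaged over the n training examples (generic notation, not copied from the slide):

```latex
W^{*} = \arg\min_{W} J(W)
      = \arg\min_{W} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big(f(x^{(i)}; W),\, y^{(i)}\big)
```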

36:05 now remember that W here is just a list

36:08 basically it's just a group of all of

36:10 the weights in our neural network you

36:11 may have hundreds of weights in a very

36:14 very small neural network or in today's

36:16 neural networks you may have billions or

36:18 trillions of weights and you want to

36:19 find what is the value of every single

36:21 one of these weights that's going to

36:23 result in the smallest loss as possible

36:26 now how can you do this remember that

36:28 our loss function J of w is just a

36:31 function of our weights right so for any

36:34 instantiation of our weights we can

36:36 compute a scalar value of you know

36:39 how erroneous our neural network would

36:41 be for this instantiation of our weights

36:44 so let's try and visualize for example

36:47 in a very simple example of a

36:49 two-dimensional space where we have only

36:51 two weights extremely simple neural

36:54 network here very small two weight

36:56 neural network and we want to find what

36:58 are the optimal weights that would train

37:00 this neural network we can plot

37:04 how erroneous the neural network is for

37:07 every single instantiation of these two

37:10 weights right this is a huge space it's

37:12 an infinite space but still we

37:14 can have a function that evaluates

37:17 at every point in this space

37:19 now what we ultimately want to do is

37:21 again we want to find which set of W's

37:25 will give us the smallest

37:28 loss possible that means basically the

37:30 lowest point on this landscape that you

37:32 can see here where are the W's that bring

37:35 us to that lowest point

37:38 the way that we do this is actually just

37:41 by firstly starting at a random place we

37:43 have no idea where to start so pick a

37:45 random place to start in this space and

37:47 let's start there at this location let's

37:50 evaluate our neural network we can

37:51 compute the loss at this specific

37:53 location and on top of that we can see

37:57 how the loss is changing we can compute

37:59 the gradient of the loss because our

38:01 loss function is a continuous function

38:04 right so we can actually compute

38:05 derivatives of our function across the

38:08 space of our weights and the gradient

38:10 tells us the direction of the highest

38:14 point right so from where we stand the

38:16 gradient tells us where we should go to increase our loss

38:20 now of course we don't want to increase

38:21 our loss we want to decrease our loss so

38:23 we negate our gradient and we take a

38:25 step in the opposite direction of the

38:27 gradient that brings us one step closer

38:29 to the bottom of the landscape and we

38:32 just keep repeating this process right

38:34 over and over again we evaluate the

38:36 neural network at this new location

38:38 compute its gradient and step in that

38:40 new direction we keep traversing this

38:43 landscape until we converge to the minimum

38:46 we can really summarize this algorithm

38:49 which is known formally as gradient

38:51 descent right so gradient descent simply

38:53 can be written like this we initialize

38:56 all of our weights right this can be two

38:58 weights like you saw in the previous

38:59 example it can be billions of Weights

39:01 like in real neural networks we compute

39:05 this gradient the partial derivative

39:08 of our loss with respect to the

39:10 weights and then we can update our

39:11 weights in the opposite direction of that gradient

39:15 so essentially we just take this small

39:18 amount a small step you can think of it

39:20 which here is denoted as eta and we

39:24 refer to this small step right this is

39:27 commonly referred to as what's known as

39:29 the learning rate it's like how much we

39:30 want to trust that gradient and step in

39:33 the direction of that gradient we'll

39:34 talk more about this later

39:36 but just to give you some sense of code

39:39 this algorithm is very well

39:41 translatable to real code as well for

39:43 every line on the pseudocode you can see

39:44 on the left you can see corresponding

39:46 real code on the right that is runnable

39:48 and directly implementable by all of you
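
A minimal sketch of that gradient descent loop using TensorFlow's automatic differentiation; the placeholder loss, the variable shapes, and the fixed learning rate are illustrative, not the slide's exact code:

```python
import tensorflow as tf

weights = tf.Variable(tf.random.normal(shape=(2, 1)))   # initialize weights randomly
lr = 0.01                                                # learning rate (eta)

def compute_loss(w):
    # placeholder loss for illustration; a real model would compare predictions to labels
    return tf.reduce_sum(tf.square(w))

for step in range(1000):
    with tf.GradientTape() as tape:
        loss = compute_loss(weights)
    grad = tape.gradient(loss, weights)   # dJ/dW, computed by backpropagation
    weights.assign_sub(lr * grad)         # step in the opposite direction of the gradient
```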

39:51 but now let's take a look specifically

39:53 at this term here this is the gradient

39:55 we touched very briefly on this in the

39:57 visual example this explains like I said

40:00 how the loss is changing as a function

40:02 of the weights right so as the weights

40:04 move around will my loss increase or

40:06 decrease and that will tell the neural

40:08 network if it needs to move the weights

40:10 in a certain direction or not but I

40:13 never actually told you how to compute

40:14 this right and I think that's an

40:16 extremely important part because if you

40:17 don't know that then you can't uh well

40:20 you can't train your neural network

40:21 right this is a critical part of

40:23 training neural networks and that

40:25 process of computing this line This

40:27 gradient line is known as back

40:29 propagation so let's do a very quick

40:32 intro to back propagation and how it works

40:36 so again let's start with the simplest

40:38 neural network in existence this neural

40:40 network has one input one output and

40:42 only one neuron right this is as simple

40:44 as it gets we want to compute the

40:46 gradient of our loss with respect to our

40:49 weight in this case let's compute it

40:51 with respect to W2 the second weight

40:54 so this derivative is going to tell us

40:56 how much a small change in this weight

40:59 will affect our loss if we make a small

41:02 change if we change our weight a little

41:03 bit in one direction will it increase our

41:05 loss or decrease our loss

41:07 so to compute that we can write out this

41:09 derivative we can start with applying

41:11 the chain rule backwards from the loss

41:14 function through the output specifically

41:17 what we can do is we can actually just

41:18 decompose this derivative into two

41:21 components the first component is the

41:23 derivative of our loss with respect to

41:25 our output multiplied by the derivative

41:27 of our output with respect to W2 right

41:30 this is just a standard

41:33 instantiation of the chain rule with

41:36 this original derivative that we had now

41:39 let's suppose we wanted to compute the

41:41 gradients of the weight before that

41:42 which in this case is

41:45 not W2 but W1

41:48 well all we do is replace W2 with W1 and

41:51 that chain Rule still holds right that

41:53 same equation holds but now you can see

41:56 on the red component that last component

41:58 of the chain rule we have to once again

42:00 recursively apply one more chain rule

42:02 because that's again another derivative

42:05 that we can't directly evaluate we can

42:06 expand that once more with another

42:08 instantiation of the chain Rule and now

42:10 all of these components we can directly

42:13 propagate these gradients through the

42:15 hidden units right in our neural network

42:17 all the way back to the weight that

42:19 we're interested in in this example

42:21 right so we first computed the

42:22 derivative with respect to W2 then we

42:25 can back propagate that and use that

42:26 information also with W1 that's why we

42:28 really call it back propagation because

42:30 this process occurs from the output all

42:32 the way back to the input
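
Written out, the chain-rule decomposition being described is the following, with J the loss, y-hat the network output, and z1 the hidden unit; the notation is generic:

```latex
\frac{\partial J(W)}{\partial w_2} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2},
\qquad
\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}
```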

42:34 now we repeat this process essentially

42:36 many many times over the course of

42:39 training by propagating these gradients

42:41 over and over again through the network

42:43 all the way from the output to the

42:45 inputs to determine for every single

42:46 weight answering this question which is

42:49 how much does a small change in these

42:51 weights affect our loss function if it

42:54 increases it or decreases and how we can

42:55 use that to improve the loss ultimately

42:58 because that's our final goal in this class

43:01 so that's the back propagation

43:04 algorithm that's the core of

43:06 training neural networks in theory it's

43:09 very simple it's really just an

43:11 instantiation of the chain rule

43:14 but let's touch on some insights that

43:16 make training neural networks actually

43:18 extremely complicated in practice even

43:20 though the algorithm of back propagation

43:22 is simple and you know many decades old

43:26 in practice though optimization of

43:29 neural networks looks something like

43:31 this it looks nothing like that picture

43:32 that I showed you before there are ways

43:34 that we can visualize very large deep

43:36 neural networks and you can think of the

43:39 landscape of these models looking like

43:40 something like this this is an

43:42 illustration from a paper that came out

43:43 several years ago where they tried to

43:45 actually visualize the landscape of very

43:47 deep neural networks and that's

43:49 what this landscape actually looks like

43:51 that's what you're trying to deal with

43:52 and find the minimum in this space and

43:53 you can imagine the challenges that come

43:56 so to cover the challenges let's first

43:59 think of and recall that update equation

44:02 defined in gradient descent right so I

44:05 didn't talk too much about this

44:07 parameter eta but now let's spend a bit

44:09 of time thinking about this this is

44:10 called The Learning rate like we saw

44:12 before it determines basically how big

44:15 of a step we need to take in the

44:17 direction of our gradient on every

44:18 single iteration of back propagation

44:21 in practice even setting the learning

44:24 rate can be very challenging you as you

44:25 as the designer of the neural network

44:27 have to set this value this learning

44:29 rate and how do you pick this value

44:31 right so that can actually be quite

44:32 difficult it has really uh large

44:34 consequences when building a neural

44:36 network so for example

44:38 if we set the learning rate too low then

44:42 we learn very slowly so let's assume we

44:44 start on the right hand side here at

44:46 that initial guess if our learning rate

44:48 is not large enough not only do we

44:50 converge slowly we actually don't even

44:52 converge to the global minimum right

44:54 because we kind of get stuck in a local minimum

44:57 now what if we set our learning rate too

44:59 high right what can actually happen is

45:01 we overshoot and we can actually start

45:03 to diverge from the solution the

45:05 gradients can actually explode very bad

45:07 things happen and then the neural

45:09 network doesn't train so that's also not

45:11 good in reality there's a very happy

45:14 medium between setting it too small

45:16 setting it too large where you set it

45:18 just large enough to kind of overshoot

45:20 some of these local minima and

45:22 put you into a reasonable part of the

45:24 search space where then you can actually

45:26 converge on the solutions that you care about

45:29 but actually how do you set these

45:31 learning rates in practice right how do

45:32 you pick what is the ideal learning rate

45:34 one option and this is actually a very

45:37 common option in practice is to simply

45:39 try out a bunch of learning rates and

45:41 see what works the best right so try out

45:43 let's say a whole grid of different

45:44 learning rates and you know train all of

45:47 these neural networks and see which one works best

45:49 but I think we can do something a lot

45:51 smarter right so what are some more

45:53 intelligent ways that we could do this

45:55 instead of exhaustively trying out a

45:56 whole bunch of different learning rates

45:58 can we design a learning rate algorithm

46:00 that actually adapts to our neural

46:03 network and adapts to its landscape so

46:05 that it's a bit more intelligent than that

46:09 so this really ultimately means that

46:12 the learning rate the speed at which the

46:15 algorithm is trusting the gradients that

46:17 it sees is going to depend on how large

46:20 the gradient is in that location and how

46:23 fast we're learning and many other

46:25 options

46:28 that we might have as part of

46:31 training neural networks right so

46:32 it's not only how quickly we're learning

46:34 you may judge it on many different

46:35 factors in the learning landscape

46:39 in fact these different

46:42 algorithms that I'm talking about these

46:44 adaptive learning rate algorithms have

46:46 been very widely studied in practice

46:48 there is a very thriving community in

46:51 the Deep learning research community

46:52 that focuses on developing and designing

46:55 new algorithms for learning rate

46:58 adaptation and faster optimization of

47:00 large neural networks like these

47:02 and during your Labs you'll actually get

47:04 the opportunity to not only try out a

47:07 lot of these different adaptive

47:09 algorithms which you can see here but

47:11 also try to uncover what are kind of the

47:12 patterns and benefits of One Versus the

47:15 other and that's going to be something

47:16 that I think you'll you'll find very

47:18 insightful as part of your labs
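
For reference, several of these adaptive optimizers are available off the shelf in TensorFlow; a sketch of how you would pick one (the learning rates shown are arbitrary):

```python
import tensorflow as tf

# plain (stochastic) gradient descent with a fixed learning rate
sgd = tf.keras.optimizers.SGD(learning_rate=0.01)

# adaptive learning rate algorithms
adam    = tf.keras.optimizers.Adam(learning_rate=0.001)
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)
```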

47:21 so another key component of your Labs

47:23 that you'll see is how you can actually

47:24 put all of this information that we've

47:26 covered today into a single picture that

47:29 looks roughly something like this which

47:30 defines your model first at the

47:33 top here that's where you define your

47:34 model we talked about this in the

47:36 beginning part of the lecture

47:38 for every piece in your model you're now

47:40 going to need to Define this Optimizer

47:42 which we've just talked about this

47:44 Optimizer is defined together with a

47:46 learning rate right how quickly you want

47:48 to optimize your loss landscape and over

47:50 many Loops you're going to pass over all

47:52 of the examples in your data set

47:54 and observe essentially how to improve

47:57 your network that's the gradient and

47:59 then actually improve the network in

48:00 those directions and keep doing that

48:01 over and over and over again until

48:03 eventually your neural network converges

48:05 to some sort of solution
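
A hedged sketch of that overall picture, combining a model, an optimizer with a learning rate, and a loop over the data; the dataset, model sizes, and loss here are placeholders rather than the lab's actual code:

```python
import tensorflow as tf

# placeholder data: 100 examples with 2 features each, and a scalar target
xs = tf.random.normal((100, 2))
ys = tf.random.normal((100, 1))
dataset = tf.data.Dataset.from_tensor_slices((xs, ys)).batch(16)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

for epoch in range(5):                               # many loops over the data set
    for x_batch, y_batch in dataset:
        with tf.GradientTape() as tape:
            loss = loss_fn(y_batch, model(x_batch))  # how erroneous the network currently is
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))  # improve the network
```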

48:09 so I want to very quickly briefly in the

48:12 remaining time that we have continue to

48:14 talk about tips for training these

48:16 neural networks in practice and focus on

48:19 this very powerful idea of batching your

48:22 data into well what are called mini

48:24 batches of smaller pieces of data

48:28 to do this let's revisit that gradient

48:30 descent algorithm right so here this

48:32 gradient that we talked about before is

48:35 actually extraordinarily computationally

48:37 expensive to compute because it's

48:39 computed as a summation across all of

48:42 the pieces in your data set

48:44 right and in most real life or real

48:47 world problems you know it's simply not

48:49 feasible to compute a gradient over your

48:51 entire data set data sets are just too

48:53 large these days so you know there

48:57 are some Alternatives right what are the

48:58 Alternatives instead of computing the

49:00 derivative or the gradients across your

49:03 entire data set what if you instead

49:06 computed the gradient over just a single

49:09 example in your data set just one

49:10 example well of course this this

49:13 estimate of your gradient is going to be

49:15 exactly that it's an estimate it's going

49:17 to be very noisy it may roughly reflect

49:19 the trends of your entire data set but

49:21 because it's only one

49:23 example from your entire data set

49:25 it may be very noisy right

49:29 well the advantage of this though is

49:31 that it's much faster to compute

49:33 obviously the gradient over a single

49:36 example because it's one example so

49:38 computationally this has huge advantages

49:40 but the downside is that it's extremely

49:42 stochastic right that's the reason why

49:44 this algorithm is not called gradient

49:46 descent it's called stochastic gradient descent

49:49 now what's the middle ground right

49:51 instead of computing it with respect to

49:52 one example in your data set what if we

49:54 computed what's called a mini batch of

49:56 examples a small batch of examples that

49:59 we can compute the gradients over and

50:01 when we take these gradients they're

50:04 still computationally efficient to

50:06 compute because it's a mini batch it's

50:08 not too large maybe we're talking on the

50:10 order of tens or hundreds of examples in your mini-batch

50:15 more importantly because we've expanded

50:18 from a single example to maybe 100

50:19 examples the stochasticity is

50:21 significantly reduced and the accuracy

50:23 of our gradient is much improved

50:26 so normally we're thinking of batch

50:28 sizes mini-batch sizes roughly on the

50:30 order of 100 data points tens or

50:33 hundreds of data points this is much

50:35 faster obviously to compute than

50:36 gradient descent and much more accurate

50:38 to compute compared to stochastic

50:40 gradient descent which is that single

50:42 point example
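
To make that contrast concrete, here is a small self-contained NumPy sketch of mini-batch gradient descent on a toy linear model (nothing from the lecture or labs); the only thing separating full-batch, stochastic, and mini-batch versions is how many examples feed each gradient estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))                                  # toy data set
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=10_000)

w = np.zeros(5)                                                   # weights of a simple linear model
lr, batch_size = 0.1, 64

def gradient(w, X_b, y_b):
    """Gradient of the mean squared error over whichever batch we are given."""
    return 2.0 * X_b.T @ (X_b @ w - y_b) / len(y_b)

for step in range(1_000):
    idx = rng.choice(len(y), size=batch_size, replace=False)      # sample a mini-batch
    w -= lr * gradient(w, X[idx], y[idx])
# batch_size = 1 would recover stochastic gradient descent,
# while using every example at once would recover full (batch) gradient descent
```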

50:44 so this increase in gradient accuracy

50:48 allows us to essentially converge to our

50:50 solution much quicker than would have

50:53 been possible in practice with

50:54 full gradient descent it also

50:56 means that we can increase our learning

50:58 rate because we can trust each of those

51:00 gradients much more right

51:03 we're now averaging over a batch it's

51:05 going to be much more accurate than the

51:07 stochastic version so we can increase

51:08 that learning rate and actually learn faster

51:12 this allows us to also massively

51:14 parallelize this entire algorithm in

51:16 computation right we can split up

51:18 batches onto separate workers and

51:21 Achieve even more significant speed UPS

51:23 of this entire problem using gpus the

51:26 last topic that I very very briefly want

51:29 to cover in today's lecture is this

51:32 topic of overfitting right when we're

51:34 optimizing a neural network with

51:37 stochastic gradient descent we have this

51:40 challenge of what's called overfitting

51:42 overfitting looks roughly like this

51:45 right so on the left hand side

51:47 we want to build a neural network or

51:49 let's say in general we want to build a

51:51 machine learning model that can

51:52 accurately describe some patterns in our

51:55 data but remember ultimately we

51:58 don't just want to describe the patterns in

51:59 our training data ideally we want to

52:01 capture the patterns in our test data of

52:04 course we don't observe test data we

52:06 only observe training data so we have

52:08 this challenge of extracting patterns

52:09 from training data and hoping that they

52:11 generalize to our test data so said

52:14 another way we want to build

52:16 models that can learn representations

52:18 from our training data that can still

52:20 generalize even when we show them brand

52:22 new unseen pieces of test data so assume

52:26 that you want to build a line that can

52:28 describe or find the patterns in these

52:31 points that you can see on the slide

52:32 right if you have a very simple neural

52:35 network which is just a single line

52:38 you can describe this data sub-optimally

52:42 right because the data here is

52:44 non-linear you're not going to

52:45 accurately capture all of the nuances

52:47 and subtleties in this data set that's

52:50 on the left hand side if you move to the

52:51 right hand side you can see a much more

52:53 complicated model but here you're

52:55 actually being too

52:57 expressive and you're capturing

52:59 the spurious nuances in your

53:03 training data that are actually not

53:05 representative of your test data

53:07 ideally you want to end up with the

53:09 model in the middle which is basically

53:11 the middle ground right it's not too

53:12 complex and it's not too simple it still

53:15 gives you what you want it performs well

53:17 even when you give it brand new data
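
One self-contained way to see that middle ground (a toy sketch, unrelated to the figure on the slide) is to fit polynomials of increasing degree to a few noisy points and score them on held-out points:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.shape)     # noisy non-linear data

train, test = np.arange(0, 40, 2), np.arange(1, 40, 2)      # hold out every other point as "unseen" data

for degree in (1, 3, 12):                                   # too simple / middle ground / very flexible
    coeffs = np.polyfit(x[train], y[train], degree)
    test_error = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    print(f"degree {degree:2d}: held-out error {test_error:.3f}")
# the straight line underfits, the high-degree polynomial tends to chase the noise
# in the training points, and the middle model usually does best on the held-out points
```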

53:19 so to address this problem let's briefly

53:22 talk about what's called regularization

53:24 regularization is a technique that you

53:27 can introduce to your training pipeline

53:29 to discourage complex models from being learned

53:33 now as we've seen before this is really

53:35 critical because neural networks are

53:37 extremely large models they are

53:39 extremely prone to overfitting right so

53:42 regularization and having techniques for

53:44 regularization has extreme implications

53:46 towards the success of neural networks

53:49 and having them generalize Beyond

53:50 training data into our testing data

53:53 the most popular technique for

53:55 regularization in deep learning is

53:57 called Dropout and the idea of Dropout

53:59 is actually very simple let's

54:02 revisit it by drawing this picture of

54:04 deep neural networks that we saw earlier

54:05 in today's lecture in Dropout during

54:07 training we essentially randomly select

54:10 some subset of the neurons in this

54:12 neural network and we try to prune them

54:16 out with some random probabilities so

54:18 for example we can select this subset of

54:20 neurons we can randomly select

54:22 them with a probability of 50 percent

54:24 and with that probability we randomly

54:27 turn them off or on on different

54:29 iterations of our training

54:32 so this is essentially forcing the

54:35 neural network to learn you can think of

54:37 an ensemble of different models on every

54:39 iteration it's going to be exposed to

54:42 kind of a different model internally

54:44 than the one it had on the last

54:45 iteration so it has to learn how to

54:47 build internal Pathways to process the

54:50 same information and it can't rely on

54:52 information that it learned on previous

54:54 iterations right so it forces it to kind

54:56 of capture some deeper meaning within

54:58 the pathways of the neural network and

55:00 this can be extremely powerful because

55:02 number one it lowers the capacity of the

55:04 neural network significantly right

55:06 you're lowering it by roughly 50 percent

55:09 but also because it makes them easier to

55:12 train because the number of Weights that

55:14 have gradients in this case is also

55:16 reduced so it's actually much faster to train

55:20 now like I mentioned on every iteration

55:22 we randomly drop out a different set of

55:25 neurons right and that helps the network generalize
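
In frameworks with a Keras-style API (an assumption; the lecture doesn't show code for this), dropout is usually just another layer, and it is only active while training:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # randomly zeroes ~50% of these activations on each training step
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1),
])
# a different random subset of units is dropped on every training iteration;
# at test time dropout is switched off and the full network is used
```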

55:29 and the second regularization technique

55:31 which is actually a very broad

55:32 regularization technique far beyond

55:34 neural networks is simply called early

55:37 stopping now we know the definition

55:41 of overfitting is simply when our model

55:44 starts to represent basically the

55:46 training data more than the testing data

55:48 that's really what overfitting comes

55:50 down to at its core

55:51 if we set aside some of the training

55:53 data to use separately data that we don't

55:55 train on we can use it as kind of a

55:58 synthetic testing data set and

56:02 we can monitor how our network is

56:04 learning on this unseen portion of data

56:07 so for example we can over the course of

56:09 training we can basically plot the

56:11 performance of our Network on both the

56:13 training set as well as our held out

56:15 test set and as the network is trained

56:18 we're going to see that first of all

56:19 these both decrease but there's going to

56:21 be a point where the test loss plateaus

56:25 and starts to increase even though the

56:27 training loss keeps decreasing this is

56:28 exactly the point where you start to

56:30 overfit right because now the model is

56:34 fitting the training data more closely

56:37 than it fits the test data this

56:40 pattern basically continues for the rest

56:41 of training and this is the point that I

56:44 want you to focus on right this Middle

56:46 Point is where we need to stop training

56:48 because after this point assuming that

56:50 this test set is a valid representation

56:52 of the true test set this is the place

56:55 where the accuracy of the model will

56:57 only get worse right so this is where we

56:59 would want to early stop our model and

57:00 regularize the performance

57:02 and we can see that stopping anytime

57:04 before this point is also not good we're

57:07 going to produce an underfit model where

57:09 we could have had a better model on the

57:10 test data but it's this trade-off right

57:13 you can't stop too late and you can't

57:14 stop too early either
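
As a rough sketch of early stopping with a Keras-style callback (the model, toy data, and the patience value below are illustrative assumptions, not prescribed by the lecture):

```python
import numpy as np
import tensorflow as tf

x = np.random.randn(512, 8).astype("float32")     # toy stand-in data
y = np.random.randn(512, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="relu"),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the held-out loss, not the training loss
    patience=5,                  # tolerate 5 epochs without improvement
    restore_best_weights=True,   # roll back to the weights from the best epoch
)

model.fit(x[:400], y[:400],
          validation_data=(x[400:], y[400:]),     # the set-aside data we never train on
          epochs=200,                             # an upper bound; training usually stops earlier
          callbacks=[early_stop], verbose=0)
```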

57:17 so I'll conclude this lecture by just

57:20 summarizing these three key points that

57:21 we've covered in today's lecture so far

57:24 so we've first covered these fundamental

57:26 building blocks of all neural networks

57:28 which is the single neuron the

57:29 perceptron we've built these up into

57:31 larger neural layers and then from there

57:34 neural networks and deep neural networks

57:36 we've learned how we can train these

57:38 apply them to data sets back propagate

57:41 through them and we've seen some

57:43 tips and tricks for optimizing these networks in practice

57:48 in the next lecture we'll hear from Ava

57:50 on deep sequence modeling using RNNs and

57:53 specifically this very exciting new type

57:56 of model called the Transformer

57:58 architecture and attention mechanisms so

58:01 maybe let's resume the class in about

58:03 five minutes after we have a chance to

58:04 swap speakers and thank you so much for

58:07 all of your attention