00:09 hello everyone and I hope you enjoyed

00:12 Alexander's first lecture I'm Ava and in

00:15 this second lecture lecture two we're

00:17 going to focus on this question of

00:19 sequence modeling how we can build

00:21 neural networks that can handle and

00:23 learn from sequential data

00:25 so in Alexander's first lecture he

00:27 introduced the essentials of neural

00:29 networks starting with perceptrons

00:31 building up to feed forward models and

00:33 how you can actually train these models

00:35 and start to think about deploying them

00:38 now we're going to turn our attention to

00:41 specific types of problems that involve

00:43 sequential processing of data and we'll

00:46 realize why these types of problems

00:48 require a different way of implementing

00:50 and building neural networks from what we've seen so far

00:55 and I think some of the components in

00:57 this lecture traditionally can be a bit

00:59 confusing or daunting at first but what

01:01 I really really want to do is to build

01:03 this understanding up from the

01:05 foundations walking through step by step

01:07 developing intuition all the way to

01:09 understanding the math and the

01:11 operations behind how these networks work

01:15 okay so let's get started

01:22 to begin I first want to motivate what

01:24 exactly we mean when we talk about

01:26 sequential data or sequential modeling

01:28 so we're going to begin with a really

01:30 simple intuitive example

01:32 let's say we have this picture of a ball

01:34 and your task is to predict where this

01:37 ball is going to travel to next

01:40 now if you don't have any prior

01:42 information about the trajectory of the

01:44 ball, its motion, its history, any guess

01:46 or prediction about its next position is

01:49 going to be exactly that a random guess

01:53 if however in addition to the current

01:55 location of the ball I gave you some

01:57 information about where it was moving in

01:59 the past now the problem becomes much

02:01 easier and I think hopefully we can all

02:04 agree that the most likely

02:07 next prediction is that this ball is

02:09 going to move forward to the right

02:13 so this is a really reduced

02:15 down bare-bones intuitive example but

02:18 the truth is that beyond this, sequential

02:20 data is really all around us right as

02:23 I'm speaking the words coming out of my

02:25 mouth form a sequence of sound waves

02:28 that Define audio which we can split up

02:30 to think about in this sequential manner

02:33 similarly text language can be split up

02:36 into a sequence of characters

02:38 or a sequence of words

02:41 and there are many many more examples in

02:43 which sequential processing sequential

02:45 data is present right from medical

02:48 signals like EKGs to financial markets

02:51 and projecting stock prices to

02:53 biological sequences encoded in DNA to

02:56 patterns in the climate to patterns of

02:58 motion and many more

03:00 and so already hopefully you're getting

03:03 a sense of what these types of questions

03:04 and problems may look like and where

03:06 they are relevant in the real world

03:10 when we consider applications of

03:12 sequential modeling in the real world we

03:14 can think about a number of different

03:15 kind of problem definitions that we can

03:17 have in our Arsenal and work with

03:20 in the first lecture Alexander

03:22 introduced the Notions of classification

03:24 and the notion of regression where he

03:27 talked about and we learned about feed

03:29 forward models that can operate one to

03:31 one in this fixed and static setting

03:33 right given a single input predict a

03:37 single output, like the binary classification example of

03:39 will you succeed or pass this class

03:42 here there's no notion of

03:44 sequence there's no notion of time

03:46 now if we introduce this idea of a

03:49 sequential component we can handle

03:51 inputs that may be defined temporally

03:54 and potentially also produce a

03:56 sequential or temporal output so for

04:00 one example we can consider text

04:02 language and maybe we want to generate

04:04 one prediction given a sequence of text

04:07 classifying whether a message is a

04:10 positive sentiment or a negative one

04:13 conversely we could have a single input

04:16 let's say an image and our goal may be

04:19 now to generate text or a sequential

04:22 description of this image right given

04:24 this image of a baseball player throwing

04:26 a ball can we build a neural network

04:28 that generates a language caption describing it

04:32 finally we can also consider

04:34 applications and problems where we have

04:36 sequence in, sequence out, for example if we

04:39 want to translate between two languages

04:41 and indeed this type of thinking in this

04:44 type of Architecture is what powers the

04:47 task of machine translation in your

04:49 phones in Google Translate and many other applications

04:54 so hopefully right this has given you a

04:57 picture of what sequential data looks

04:58 like what these types of problem

05:00 definitions may look like and from this

05:03 we're going to start and build up our

05:04 understanding of what neural networks we

05:07 can build and train for these types of problems

05:11 so first we're going to begin with the

05:14 notion of recurrence and build up from

05:16 that to Define recurrent neural networks

05:19 and in the last portion of the lecture

05:21 we'll talk about the

05:22 mechanisms underlying the Transformer

05:25 architectures that are very

05:27 powerful in terms of handling sequential

05:29 data but as I said at the beginning

05:31 right the theme of this lecture is

05:34 building up that understanding step by

05:36 step starting with the fundamentals and

05:39 so to do that we're going to go back

05:41 revisit the perceptron and move forward

05:45 right so as Alexander introduced when

05:48 we studied the perceptron in lecture one

05:51 the perceptron is defined by this single

05:54 neural operation where we have some set

05:57 of inputs let's say X1 through XM and

06:00 each of these inputs is multiplied by

06:03 a corresponding weight and passed through a

06:05 non-linear activation function that then

06:08 generates a predicted output y hat

06:11 here we can have multiple inputs coming

06:13 in to generate our output but still

06:16 these inputs are not thought of as

06:18 points in a sequence or time steps in a sequence

06:22 even if we scale this perceptron and

06:24 start to stack multiple perceptrons

06:27 together to Define these feed forward

06:28 neural networks we still don't have this

06:31 notion of temporal processing or

06:33 sequential information even though we

06:36 are able to translate and convert

06:38 multiple inputs apply these weight

06:40 operations apply this non-linearity to

06:43 then Define multiple predicted outputs

06:47 so taking a look at this diagram right

06:49 on the left in blue you have inputs on

06:52 the right in purple you have these

06:53 outputs and the green defines the

06:56 single neural network layer that's

06:58 transforming these inputs to the outputs

07:01 Next Step I'm going to just simplify

07:02 this diagram I'm going to collapse down

07:05 those stacked perceptrons together and

07:08 depict this with this green block

07:11 still it's the same operation going on

07:13 right we have an input vector

07:16 being transformed to predict this output

07:20 now what I've introduced here which you

07:23 may notice is this new variable T right

07:26 which I'm using to denote a single time step

07:29 we are considering an input at a single

07:31 time step and using our neural network

07:34 to generate a single output

07:36 corresponding to that input

07:38 how could we start to extend and build

07:40 off this to now think about multiple

07:43 time steps and how we could potentially

07:45 process a sequence of information

07:48 well what if we took this diagram all

07:51 I've done is just rotated it 90 degrees

07:54 where we still have this input vector

07:56 being fed in producing an output

07:59 vector and what if we can make a copy of

08:02 this network right and just do this

08:05 operation multiple times to try to

08:07 handle inputs that are fed in

08:09 corresponding to different times right

08:11 we have an individual time step starting from the first one

08:16 and we can do the same thing the same

08:18 operation for the next time step again

08:21 treating that as an isolated instance

08:24 and keep doing this repeatedly

08:27 and what you'll notice hopefully is all

08:29 these models are simply copies of each

08:31 other just with different inputs at each

08:33 of these different time steps

08:36 and we can make this concrete right in

08:38 terms of what this functional

08:39 transformation is doing

08:41 the predicted output at a particular

08:43 time step y hat of T is a function of

08:48 the input at that time step X of T and

08:51 that function is what is learned and

08:53 defined by our neural network weights

08:56 okay so I've told you that our goal here

08:59 is trying to understand sequential

09:01 data do sequential modeling but what

09:04 could be the issue with what this

09:06 diagram is showing and what I've shown

09:15 that's exactly right so the

09:17 student's answer was that X1 or it could

09:20 be related to X naught and you have this

09:22 temporal dependence but these isolated

09:24 replicas don't capture that at all and that

09:29 answers the question perfectly right

09:32 here a predicted output at a later time

09:37 step can depend precisely on inputs at previous time

09:39 steps if this is truly a sequential

09:41 problem with this temporal dependence

09:45 so how could we start to reason about

09:47 this how could we Define a relation that

09:50 links the Network's computations at a

09:52 particular time step to Prior history

09:55 and memory from previous time steps

09:58 well what if we did exactly that right

10:01 what if we simply linked the computation

10:04 and the information understood by the

10:07 network to these other replicas via what

10:11 we call a recurrence relation

10:13 what this means is that something about

10:15 what the network is Computing at a

10:17 particular time is passed on to those

10:20 later time steps and we Define that

10:22 according to this variable H which we

10:25 call this internal state or you can

10:27 think of it as a memory term

10:29 that's maintained by the neurons and the

10:31 network and it's this state that's being

10:34 passed time step to time step as we read

10:37 in and process this sequential data

10:42 what this means is that the Network's

10:45 output its predictions its computations

10:47 is not only a function of the input data

10:51 but also we have this other variable H

10:54 which captures this notion of state,

10:56 captures this notion of memory

10:59 that's being computed by the network and

11:02 passed on over time

11:04 specifically right to walk through this

11:06 our predicted output y hat of T depends

11:10 not only on the input at that time step but also

11:12 this past memory this past state

11:15 and it is this linkage of temporal

11:19 dependence and recurrence that defines

11:21 this idea of a recurrent neural unit

11:24 what I've shown is this connection

11:27 that's being unrolled over time but we

11:30 can also depict this relationship

11:32 according to a loop

11:34 this computation to this internal State

11:37 variable h of T is being iteratively

11:39 updated over time and that's fed back

11:42 into the neuron the neurons computation

11:45 in this recurrence relation

11:49 this is how we Define these recurrent

11:51 cells that comprise recurrent neural networks

11:56 and the key here is that we have

11:59 this idea of a recurrence relation

12:00 that captures the cyclic temporal dependency

12:06 and indeed it's this idea that is really

12:08 the intuitive Foundation behind

12:10 recurrent neural networks or rnns and so

12:13 let's continue to build up our

12:15 understanding from here and move forward

12:17 into how we can actually Define the RNN

12:20 operations mathematically and in code

12:23 so all we're going to do is formalize

12:25 this relationship a little bit more

12:27 the key idea here is that the RNN is

12:30 maintaining the state and it's updating

12:32 the state at each of these time steps as

12:35 the sequence is processed

12:38 we Define this by applying this

12:40 recurrence relation

12:41 and what the recurrence relation

12:43 captures is how we're actually updating

12:45 that internal State h of t

12:48 specifically that state update is

12:51 exactly like any other neural network

12:52 operator operation that we've introduced

12:55 so far where again we're learning a function

12:59 defined by a set of weights W

13:01 we're using that function to update the internal state h of t

13:06 and the additional component the new piece

13:09 here is that that function depends both

13:11 on the input and the prior time step's state h of t minus 1

13:16 and what you'll note is that this

13:19 function f sub W is defined by a set of

13:22 weights and it's the same set of Weights

13:24 the same set of parameters that are used

13:27 time step to time step as the recurrent

13:30 neural network processes this temporal

13:32 information the sequential data
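
Written out as an equation (this is just the relation described above, not new math), the state update applied at every time step is

```latex
h_t = f_W\left(x_t,\ h_{t-1}\right)
```

where f_W is one and the same function, defined by the shared weights W, applied at every time step, and the prediction y-hat of t is computed from the state h of t.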

13:35 okay so the key idea here hopefully is

13:38 coming through is that this RNN

13:41 state update operation takes this state

13:44 and updates it at each time step as a sequence is processed

13:48 we can also translate this to how we can

13:52 think about implementing rnns in Python

13:56 code or rather pseudocode hopefully

13:59 getting a better understanding and

14:00 intuition behind how these networks work

14:03 so what we do is we just start by initializing our RNN

14:07 for now this is abstracted away

14:10 and we initialize its hidden

14:12 State and we have some sentence right

14:15 let's say this is our input of Interest

14:16 where we're interested in predicting

14:19 maybe the next word that's occurring in this sentence

14:22 what we can do is loop through these

14:25 individual words in the sentence that

14:27 Define our temporal input and at each

14:30 step as we're looping, each word

14:32 in that sentence is fed into the RNN

14:36 model along with the previous hidden state

14:40 and this is what generates a prediction

14:42 for the next word and updates the RNN's hidden state

14:47 finally our prediction for the final

14:49 word in the sentence the word that we're

14:51 missing is simply the RNN's output after

14:54 all the prior words have been fed in
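
The loop just described can be sketched as runnable Python. Note that `ToyRNN` here is a made-up stand-in (a deterministic toy update, not a trained network and not the lab's model); only the control flow mirrors the pseudocode: each word is fed in together with the previous hidden state, and the prediction for the missing word is whatever the model outputs after the last prior word.

```python
# A toy stand-in for an RNN, illustrating the word-by-word loop.
# The state update is an arbitrary deterministic function, not a
# trained network; only the control flow mirrors the lecture.

class ToyRNN:
    def __init__(self, state_size=4):
        self.state_size = state_size

    def initial_state(self):
        return [0.0] * self.state_size

    def __call__(self, word, hidden_state):
        # Fold the word into the state (placeholder for the real
        # update W_xh x_t + W_hh h_{t-1} followed by a non-linearity).
        new_state = [
            (h + len(word) * (i + 1)) % 7.0
            for i, h in enumerate(hidden_state)
        ]
        # The "prediction" is a summary of the state (placeholder for W_hy h_t).
        prediction = sum(new_state)
        return prediction, new_state


my_rnn = ToyRNN()
hidden_state = my_rnn.initial_state()
sentence = ["I", "love", "recurrent", "neural"]

# Feed each word in along with the previous hidden state.
for word in sentence:
    prediction, hidden_state = my_rnn(word, hidden_state)

# After the loop, `prediction` is the output for the missing next word,
# produced only after all the prior words have updated the hidden state.
next_word_prediction = prediction
```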

14:59 so this is really breaking down how the

15:01 RNN Works how it's processing the

15:03 sequential information

15:05 and what you've noticed is that the RNN

15:08 computation includes both this update to

15:10 the hidden State as well as generating

15:12 some predicted output at the end that is

15:15 our ultimate goal that we're interested in

15:17 and so to walk through this how we're

15:20 actually generating the output

15:23 what the RNN computes is, given some input x of t,

15:27 it then performs this update to the hidden state

15:31 and this update to the hidden state is

15:34 just a standard neural network operation

15:36 just like we saw in the first lecture

15:38 where it consists of taking a weight

15:41 Matrix multiplying that by the previous

15:44 hidden State taking another weight

15:46 Matrix multiplying that by the input at

15:49 a time step and applying a non-linearity

15:52 and in this case right because we have

15:54 these two input streams the input data X

15:57 of T and the previous state H we have

16:01 these two separate weight matrices that

16:03 the network is learning over the course of training

16:06 that comes together we apply the

16:09 non-linearity and then we can generate

16:12 an output at a given time step by just

16:14 modifying the hidden state

16:17 using a separate weight matrix to transform

16:20 this value and then generate a predicted output
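
In equation form (matching the description above; tanh is one common choice of non-linearity, and W_hh, W_xh, W_hy name the three weight matrices):

```latex
h_t = \tanh\!\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right),
\qquad
\hat{y}_t = W_{hy}\, h_t
```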

16:24 and that's what there is to it right

16:26 that's how the RNN in its single

16:29 operation updates both the hidden State

16:32 and also generates a predicted output

16:36 okay so now this gives you the internal

16:39 working of how the RNN computation

16:41 occurs at a particular time step let's

16:45 next think about how this looks like

16:46 over time and Define the computational

16:50 graph of the RNN as being unrolled or

16:53 expanded across time

16:56 so far the dominant way I've been

16:58 showing the RNNs is according to this

17:01 loop-like diagram on the left, right,

17:03 feeding back in on itself

17:05 another way we can visualize and think

17:07 about rnns is as kind of unrolling this

17:11 recurrence over time over the individual

17:14 time steps in our sequence

17:16 what this means is that we can take the

17:19 network at our first time step

17:21 and continue to iteratively unroll it

17:24 across the time steps

17:26 going on forward all the way until we

17:29 process all the time steps in our input

17:32 now we can formalize this diagram a

17:35 little bit more by defining the weight

17:37 matrices that connect the inputs to the

17:41 hidden State update

17:43 and the weight matrices that are used to

17:46 update the internal State across time

17:49 and finally the weight matrices that

17:51 define the update to generate a predicted output

17:56 now recall that in all these cases right

18:00 for all these three weight matrices at

18:02 all these time steps we are simply

18:04 reusing the same weight matrices right

18:07 so it's one set of parameters one set of

18:09 weight matrices that just process this

18:12 information sequentially

18:14 now you may be thinking okay so how do

18:17 we actually start to be thinking about

18:19 how to train the RNN how to define the

18:22 loss given that we have this temporal

18:24 processing in this temporal dependence

18:27 well a prediction at an individual time

18:30 step will simply amount to a computed

18:33 loss at that particular time step

18:36 so now we can compare those predictions

18:38 time step by time step to the true label

18:41 and generate a loss value for those

18:43 time steps and finally we can get our

18:46 total loss by taking all these

18:48 individual loss terms together and

18:51 summing them defining the total loss for

18:54 a particular input to the RNN
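
As an equation: if L_t denotes the loss comparing the prediction y-hat of t against the true label y of t at step t, the total loss over a sequence of length T is simply the sum

```latex
L = \sum_{t=1}^{T} L_t\!\left(\hat{y}_t,\ y_t\right)
```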

18:58 now we can walk through an example of how

19:00 we implement this RNN in tensorflow

19:02 starting from scratch

19:04 the RNN can be defined as a layer

19:07 operation and a layer class that

19:09 Alexander introduced in the first

19:11 lecture and so we can Define it

19:13 according to an initialization of weight

19:16 matrices initialization of a hidden

19:19 state which commonly amounts to

19:21 initializing these two to zero

19:25 next we can Define how we can actually

19:27 pass forward through the RNN Network to

19:31 process a given input X

19:33 and what you'll notice is in this

19:35 forward operation the computations are

19:38 exactly like we just walked through we

19:40 first update the hidden state

19:42 according to that equation we introduced

19:46 and then generate a predicted output

19:48 that is a transformed version of that

19:52 and finally at each time step we return

19:54 both the output and the updated

19:57 hidden State as this is what is

19:59 necessary to be stored to continue this

20:02 RNN operation over time
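
The lecture builds this as a TensorFlow layer; as a framework-free sketch of the same forward pass (my own minimal version with made-up dimensions, not the lab's code), the weight matrices, zero-initialized hidden state, and call method returning both output and state look like this:

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def vadd(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]

class MyRNNCell:
    """Minimal RNN cell mirroring the update from the lecture:
    h_t = tanh(W_hh h_{t-1} + W_xh x_t),  y_t = W_hy h_t."""

    def __init__(self, input_dim, hidden_dim, output_dim, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            # Small random weights stand in for a real initializer.
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.W_hh = init(hidden_dim, hidden_dim)  # state-to-state weights
        self.W_xh = init(hidden_dim, input_dim)   # input-to-state weights
        self.W_hy = init(output_dim, hidden_dim)  # state-to-output weights
        self.h = [0.0] * hidden_dim               # hidden state, zeros

    def call(self, x):
        # Update the hidden state, then compute the output from it.
        self.h = [math.tanh(a) for a in vadd(matvec(self.W_hh, self.h),
                                             matvec(self.W_xh, x))]
        y = matvec(self.W_hy, self.h)
        # Return both: both are needed to continue the RNN over time.
        return y, self.h

cell = MyRNNCell(input_dim=3, hidden_dim=5, output_dim=2)
for x_t in [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]:
    y_t, h_t = cell.call(x_t)
```

TensorFlow's built-in `tf.keras.layers.SimpleRNN(rnn_units)` implements essentially this cell, with initialization and batching handled for you.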

20:04 what is very convenient is that although

20:08 you could define your RNN network and

20:09 your RNN layer completely from scratch

20:11 TensorFlow abstracts this

20:14 operation away for you so you can simply

20:16 define a SimpleRNN according to this

20:20 call that you're seeing here

20:23 which makes all the

20:25 computations very efficient and very easy to use

20:28 and you'll actually get practice

20:30 implementing and working with RNNs

20:33 in today's software lab

20:36 okay so that gives us the understanding

20:39 of RNNs and going back to what I

20:43 described as kind of the problem setups

20:45 or the problem definitions at the

20:46 beginning of this lecture

20:48 I just want to remind you of the types

20:50 of sequence modeling problems on which

20:52 we can apply rnns right we can think

20:56 about taking a sequence of inputs

20:57 producing one predicted output at the

21:00 end of the sequence

21:02 we can think about taking a static

21:04 single input and trying to generate text

21:06 according to that single input

21:11 and finally we can think about taking a

21:13 sequence of inputs producing a

21:15 prediction at every time step in that sequence

21:18 and then doing this sequence to sequence

21:21 type of prediction and translation

21:28 yeah so this will be

21:32 the software lab today which will focus

21:35 on this problem of of many to many

21:38 processing and many to many sequential

21:40 modeling taking a sequence going to a sequence

21:44 what is common and what is universal

21:46 across all these types of problems and

21:49 tasks that we may want to consider with

21:51 RNNs is what I like to think about as what

21:54 type of design criteria we need to build

21:57 a robust and reliable Network for

22:00 processing these sequential modeling problems

22:03 what I mean by that is what are the

22:05 characteristics what are the design

22:08 requirements that the RNN needs to

22:10 fulfill in order to be able to handle

22:13 sequential data effectively

22:16 the first is that sequences can be of

22:18 different lengths right they may be

22:21 short they may be long we want our RNN

22:23 model or our neural network model in

22:25 general to be able to handle sequences

22:27 of variable lengths

22:29 secondly and really importantly is as we

22:33 were discussing earlier that the whole

22:35 point of thinking about things through

22:36 the lens of sequence is to try to track

22:39 and learn dependencies in the data that

22:41 are related over time

22:43 so our model really needs to be able to

22:45 handle those different dependencies

22:47 which may occur at times that are very

22:50 very distant from each other

22:53 next right sequence is all about order

22:56 right there's some notion of how current

22:59 inputs depend on prior inputs and the

23:02 specific order of the observations we

23:04 see makes a big effect on what

23:08 prediction we may want to generate at a given time

23:11 and finally in order to be able to

23:14 process this information

23:16 effectively our Network needs to be able

23:19 to do what we call parameter sharing

23:21 meaning that given one set of Weights

23:24 that set of weights should be able to

23:26 apply to different time steps in the

23:27 sequence and still result in a

23:30 meaningful prediction

23:31 and so today we're going to focus on how

23:34 recurrent neural networks meet these

23:36 design criteria and how these design

23:38 criteria motivate the need for even more

23:41 powerful architectures that can

23:43 outperform rnns in sequence modeling

23:46 so to understand these criteria very

23:49 concretely we're going to consider a

23:52 sequence modeling problem where given

23:54 some series of words our task is just to

23:57 predict the next word in that sentence

24:00 so let's say we have this sentence this

24:03 morning I took my cat for a walk

24:05 and our task is to predict the last word

24:09 in the sentence given the prior words

24:10 this morning I took my cat for a blank

24:15 our goal is to take our RNN, define it, and

24:20 put it to the test on this task

24:22 what is our first step to doing this

24:25 well the very very first step before we

24:28 even think about defining the RNN is how

24:31 we can actually represent this

24:33 information to the network in a way that

24:35 it can process and understand

24:39 if we have a model that is processing

24:42 this data processing this text-based information

24:45 and wanting to generate text as the output

24:48 a problem can arise in that the neural

24:51 network itself is not equipped to handle

24:54 language explicitly right

24:56 remember that neural networks are simply

24:58 functional operators they're just

25:00 mathematical operations and so we can't

25:03 expect it right it doesn't have an

25:04 understanding from the start of what a

25:07 word is or what language means which

25:09 means that we need a way to represent

25:11 language numerically so that it can be

25:14 passed in to the network to process

25:19 so what we do is that we need to define

25:21 a way to translate this text, this

25:24 language information into a numerical

25:27 encoding a vector an array of numbers

25:30 that can then be fed in to our neural

25:33 network and generating a vector of

25:36 numbers as its output

25:39 so now right this raises the question of

25:41 how do we actually Define this

25:43 transformation how can we transform

25:45 language into this numerical encoding

25:48 the key solution and the key way that a

25:50 lot of these networks work is this

25:53 notion and concept of embedding

25:55 what that means is it's some

25:58 transformation that takes

26:00 indices or something that can be

26:03 represented as an index into a numerical

26:07 Vector of a given size

26:10 so if we think about how this idea of

26:12 embedding works for language data let's

26:14 consider a vocabulary of words that we

26:17 can possibly have in our language

26:19 and our goal is to be able to map these

26:22 individual words in our vocabulary to a

26:26 numerical Vector of fixed size

26:29 one way we could do this is by defining

26:31 all the possible words that could occur

26:35 and then indexing them assigning an index

26:38 label to each of these distinct words

26:41 a corresponds to index one cat corresponds

26:44 to index two so on and so forth and this

26:47 indexing Maps these individual words to

26:50 numbers unique indices

26:53 what these indices can then Define is

26:55 what we call a embedding vector

26:58 which is a fixed length encoding where

27:00 we've simply indicated a one value

27:04 at the index for that word

27:08 and this is called a one-hot embedding

27:10 where we have this fixed length Vector

27:13 of the size of our vocabulary and each

27:16 instance of the vocabulary corresponds

27:18 to a one at the corresponding index
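
As a concrete sketch with a tiny made-up vocabulary (indices here are zero-based, unlike the one-based counting in the example above), the word-to-index-to-one-hot pipeline looks like this:

```python
# Map each word in a (toy) vocabulary to a unique index, then to a
# one-hot vector: all zeros except a single 1 at that word's index.
vocab = ["a", "cat", "dog", "walk"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)          # fixed length = vocabulary size
    vec[word_to_index[word]] = 1    # single 1 at the word's index
    return vec

cat_vector = one_hot("cat")  # [0, 1, 0, 0]
```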

27:24 this is a very sparse way to do this and

27:28 it's simply based on

27:30 purely on the count index

27:32 there's no notion of semantic

27:35 information or meaning captured in

27:38 this vector-based encoding

27:40 alternatively what is very commonly done

27:42 is to actually use a neural network to

27:45 learn an encoding, to learn an embedding

27:47 and the goal here is that we can learn a

27:50 neural network that then captures some

27:52 inherent meaning or inherent semantics

27:54 in our input data and Maps related words

27:57 or related inputs closer together in

28:00 this embedding space meaning that

28:02 they'll have numerical vectors that are

28:04 more similar to each other
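
Mechanically, a learned embedding is just a lookup into a trainable matrix: a word's index selects a row, and training nudges the rows for related words closer together. A minimal illustration with hand-picked (not learned) numbers:

```python
# Toy embedding table: each word maps to a dense vector. In a real model
# these are rows of a learned weight matrix selected by word index; the
# numbers below are hand-picked for illustration only.
embedding = {
    "cat":  [0.9, 0.1],
    "dog":  [0.8, 0.2],
    "walk": [-0.7, 0.9],
}

def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Semantically related words sit closer together in the embedding space.
cat_dog = distance(embedding["cat"], embedding["dog"])
cat_walk = distance(embedding["cat"], embedding["walk"])
```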

28:07 this concept is really really

28:09 foundational to how these

28:11 sequence modeling networks work and how

28:15 neural networks work in general

28:17 okay so with that in hand we can go back

28:20 to our design criteria

28:21 thinking about the capabilities that we

28:23 desire first we need to be able to

28:26 handle variable length sequences

28:29 if we again want to predict the next

28:31 word in the sequence we can have short

28:32 sequences we can have long sequences we

28:34 can have even longer sentences and our

28:37 key task is that we want to be able to

28:39 track dependencies across all these

28:43 and what we need what we mean by

28:45 dependencies is that there could be

28:46 information very early on in a sequence

28:50 but that may not be relevant or come

28:53 up until very much later in the

28:55 sequence and we need to be able to track

28:58 these dependencies and maintain this

29:00 information in our Network

29:03 dependencies relate to order and

29:05 sequences are defined by their order and

29:08 we know that same words in a completely

29:11 different order have completely

29:12 different meanings right so our model

29:15 needs to be able to handle these

29:17 differences in order and the differences

29:19 in length that could result in different predictions

29:25 okay so hopefully that example going

29:28 through the example in text

29:30 motivates how we can think about

29:32 transforming input data into a numerical

29:35 encoding that can be passed into the RNN

29:38 and also what are the key criteria that

29:41 we want to meet in handling these sequences

29:46 so far we've painted the picture of

29:49 RNNs how they work the intuition their

29:51 mathematical operations and what are the

29:54 key criteria that they need to meet the

29:57 final piece to this is how we actually

29:59 train and learn the weights in the RNN

30:02 and that's done through the backpropagation

30:04 algorithm with a bit of a twist to just

30:07 handle sequential information

30:10 if we go back and think about how we

30:13 train feed forward neural network models

30:16 the steps break down in thinking through

30:19 starting with an input where we first

30:21 take this input and make a forward pass

30:24 through the network going from input to output

30:28 the key to back propagation that

30:29 Alexander introduced was this idea of

30:32 taking the prediction and back

30:33 propagating gradients back through the network

30:37 and using this operation to then Define

30:40 and update the loss with respect to each

30:43 of the parameters in the network in

30:45 order to gradually adjust the parameters

30:48 the weights of the network in order to

30:50 minimize the overall loss

30:53 now with rnns as we walked through

30:55 earlier we have this temporal unrolling

30:58 which means that we have these

30:59 individual losses across the individual

31:01 steps in our sequence that sum together

31:04 to comprise the overall loss

31:07 what this means is that when we do back

31:11 we have to now instead of back

31:14 propagating errors through a single feedforward pass

31:17 back propagate the loss through each of

31:19 these individual time steps

31:21 and after we back propagate loss through

31:24 each of the individual time steps we

31:26 then do that across all time steps all

31:30 the way from our current time time T

31:32 back to the beginning of the sequence

31:35 and this is why this

31:38 algorithm is called back propagation

31:40 Through Time right because as you can

31:42 see the data and the predictions and

31:45 the resulting errors are fed back in

31:47 time all the way from where we are

31:49 currently to the very beginning of the

31:51 input data sequence

31:55 so backpropagation through time is

31:58 actually a very tricky algorithm to implement

32:02 and the reason for this is if we take a

32:04 close look at how gradients flow

32:07 across the RNN what this algorithm

32:10 involves many repeated

32:12 computations and multiplications of

32:15 these weight matrices against each other

32:18 in order to compute the gradient with

32:20 respect to the very first time step we

32:24 have to make many of these

32:25 multiplicative repeats of the weight

32:29 why might this be problematic well if

32:33 this weight Matrix W is very very big

32:37 what this can result in is

32:39 what we call the exploding gradient

32:41 problem where our gradients that we're

32:43 trying to use to optimize our Network do

32:46 exactly that they blow up they explode

32:49 and they get really big making it

32:51 infeasible to train the network

32:56 what we do to mitigate this is a pretty

32:59 simple solution called gradient clipping

33:01 which effectively scales back these very

33:03 big gradients to try to constrain them

33:05 in a more restricted range
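A minimal sketch of gradient clipping by norm. The threshold value is an arbitrary illustrative choice and `clip_by_norm` is a hypothetical helper written out by hand here; libraries such as TensorFlow ship equivalent utilities:

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its norm never exceeds max_norm (direction kept)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

exploded = np.array([300.0, -400.0])           # norm 500: a blown-up gradient
clipped = clip_by_norm(exploded, max_norm=5.0)
print(clipped)                                 # [3. -4.], norm exactly 5
```

The gradient's direction is unchanged; only its magnitude is scaled back, which is why this simple fix is usually enough for the exploding case.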

33:10 conversely we can have the instance

33:13 where the weight matrices are very very

33:15 small and if these weight matrices are small

33:19 we end up with a very very small value

33:21 at the end as a result of these repeated

33:23 weight Matrix computations and these

33:28 multiplications and this is a very real

33:31 problem in rnns in particular where we

33:34 can run into this problem called the

33:36 Vanishing gradient where now your

33:38 gradient has just dropped down close to

33:40 zero and again you can't train the network
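To make the repeated-multiplication intuition concrete, here is a toy numerical illustration (the scalar weights are an illustrative stand-in for the magnitude of the weight matrix, an assumption for this sketch): gradients chained over T time steps scale roughly like w to the power T.

```python
# Repeated multiplication over T time steps scales gradients like w**T:
# it explodes when |w| > 1 and vanishes when |w| < 1.
T = 50
big, small = 1.5, 0.5          # illustrative stand-ins for weight magnitude
print(big ** T)                # roughly 6.4e8: exploding gradient
print(small ** T)              # roughly 8.9e-16: vanishing gradient
```

Even modest deviations from 1 blow up or wash out entirely over a 50-step sequence, which is why long sequences are where RNN training breaks down.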

33:43 now there are particular tools that we

33:46 can implement to try to

33:48 mitigate the vanishing

33:50 gradient problem and we'll touch on

33:52 each of these three solutions briefly

33:55 first being how we can Define the

33:58 activation function in our Network how

34:01 we can initialize the weights and how we can change the network

34:02 architecture itself to try to better

34:04 handle this Vanishing gradient problem

34:10 I want to take just one step back to

34:12 give you a little more intuition about

34:14 why Vanishing gradients can be a real

34:17 issue for recurrent neural networks

34:20 a point I've kept trying to reiterate is

34:22 this notion of dependency in the

34:24 sequential data and what it means to

34:26 track those dependencies

34:28 well if the dependencies are very

34:30 constrained in a small space not

34:32 separated out that much by time

34:34 this repeated gradient computation and

34:37 the repeated weight matrix

34:38 multiplication is not so much of a problem

34:41 if we have a very short sequence where

34:44 the words are very closely related to

34:46 each other and it's pretty obvious what

34:49 our next output is going to be

34:52 the RNN can use the immediately passed

34:55 information to make a prediction

34:57 and so there is not going to be

34:59 that much of a requirement to

35:02 learn effective weights if the related

35:05 information is close to each other

35:09 conversely now if we have a sentence

35:12 where we have a more long-term dependency

35:16 what this means is that we need

35:17 information from way further back in the

35:19 sequence to make our prediction at the

35:22 end and that gap between what's relevant

35:25 and where we are at currently becomes

35:27 exceedingly large and therefore the

35:29 vanishing gradient problem is

35:31 increasingly exacerbated meaning that

35:37 the RNN becomes unable to connect the

35:39 dots and establish this long-term

35:41 dependency all because of this Vanishing gradient problem

35:45 so there are ways

35:48 and modifications that we can make to

35:49 our Network to try to alleviate this problem

35:54 the first is that we can simply change

35:57 the activation functions in each of our

35:59 neural network layers to be such that

36:02 they can effectively try to

36:05 safeguard the gradients

36:08 from shrinking in

36:10 instances where the input is greater than

36:13 zero and this is in particular true for

36:16 the relu activation function and the

36:19 reason is that in all instances where X

36:22 is greater than zero with the relu

36:24 function the derivative is one and so

36:28 that is not less than one and therefore

36:31 it helps in mitigating The Vanishing gradient problem
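A small numerical check of this point (the tanh comparison is added here for contrast and is not from the lecture; the sample activations are made up):

```python
import numpy as np

def relu_grad(x):
    return (x > 0).astype(float)      # ReLU derivative: exactly 1 wherever x > 0

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2      # tanh derivative: below 1 away from 0

x = np.array([0.5, 1.0, 2.0])         # illustrative positive activations
print(np.prod(relu_grad(x)))          # chaining 3 ReLU steps keeps 1.0
print(np.prod(tanh_grad(x)))          # chaining 3 tanh steps shrinks it
```

With saturating activations every backward step multiplies in a factor below 1, while ReLU's unit derivative leaves the gradient's magnitude alone for positive inputs.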

36:37 another trick is how we initialize the

36:40 parameters in the network itself to

36:42 prevent them from shrinking to zero too quickly

36:45 and there are mathematical

36:48 ways that we can do this namely by

36:50 initializing our weights to Identity

36:53 matrices and this effectively helps in

36:56 practice to prevent the weight updates

36:59 from shrinking too rapidly to zero
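A toy demonstration of why identity initialization helps (the sizes and the small comparison weight are illustrative assumptions):

```python
import numpy as np

n, steps = 4, 20
h_id = np.ones(n)
h_small = np.ones(n)
for _ in range(steps):                     # repeated recurrent multiplications
    h_id = np.eye(n) @ h_id                # identity init: activations preserved
    h_small = (0.1 * np.eye(n)) @ h_small  # small init: activations collapse
print(h_id)                                # still all ones
print(h_small.max())                       # 0.1**20 = 1e-20, effectively gone
```

At the start of training the identity-initialized recurrence simply carries information forward unchanged, giving gradient descent something to work with before the weights drift.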

37:02 however the most robust solution to the

37:04 vanishing gradient problem is by

37:07 introducing a slightly more complicated

37:09 uh version of the recurrent neural unit

37:13 to be able to more effectively track and

37:16 handle long-term dependencies in the data

37:19 and this is the idea of gating and

37:22 the idea is by controlling

37:25 selectively the flow of information into

37:28 the neural unit to be able to filter out

37:31 what's not important while maintaining what is important

37:35 and the key and the most popular type of

37:38 recurrent unit that achieves this gated

37:40 computation is called the lstm or long

37:44 short term memory Network

37:46 today we're not going to go into detail

37:49 on lstms their mathematical details

37:52 their operations and so on but I just

37:55 want to convey the key idea and

37:57 intuitive idea about why these lstms are

38:00 effective at tracking long-term

38:04 the core is that the lstm is able to

38:08 control the flow of information through

38:10 these gates to be able to more

38:12 effectively filter out the unimportant

38:15 things and store the important things

38:19 what you can do is implement

38:22 lstms in tensorflow just as you would in

38:26 but the core concept that I want you to

38:28 take away when thinking about the lstm

38:30 is this idea of controlled information flow

38:35 very briefly the way that lstm operates

38:38 is by maintaining a cell state in addition

38:41 to the hidden state of a standard RNN and that cell state is

38:44 independent from what is directly outputted

38:47 the way the cell state is updated is

38:50 according to these Gates that control

38:52 the flow of information

38:54 forgetting and eliminating what is irrelevant

38:57 storing the information that is relevant

39:01 updating the cell state in turn and then

39:04 filtering this this updated cell state

39:07 to produce the predicted output just

39:09 like the standard RNN
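That four-part gated update (forget, store, update, filter) can be sketched as a single LSTM cell step. The weight shapes and random values below are illustrative assumptions, and in practice you would reach for a library layer such as `tf.keras.layers.LSTM` rather than writing this by hand:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                      # illustrative sizes
# one weight matrix per gate, each acting on [x, h_prev] concatenated
Wf, Wi, Wo, Wc = (rng.normal(scale=0.1, size=(n_hid, n_in + n_hid))
                  for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z)                 # forget gate: erase what's irrelevant
    i = sigmoid(Wi @ z)                 # input gate: decide what to store
    c_tilde = np.tanh(Wc @ z)           # candidate new information
    c = f * c_prev + i * c_tilde        # update the cell state
    o = sigmoid(Wo @ z)                 # output gate: filter the cell state
    h = o * np.tanh(c)                  # produce the output
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
for _ in range(5):                      # run over a short random sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c)
print(h.shape, c.shape)
```

Note that the cell state `c` is carried forward separately from the output `h`, which is what gives the gradients the uninterrupted path mentioned next.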

39:12 we can train the lstm using the back

39:15 propagation Through Time algorithm but

39:17 the mathematics of how the lstm is

39:19 defined allows for a completely

39:21 uninterrupted flow of the gradients

39:24 which largely

39:27 eliminates the Vanishing

39:29 gradient problem that I introduced earlier

39:33 again if you're

39:36 interested in learning more about the

39:37 mathematics and the details of lstms

39:40 please come and discuss with us after

39:42 the lectures but again just emphasizing

39:45 the core concept and the intuition

39:46 behind how the lstm operates

39:51 okay so so far

39:54 we've covered a lot of ground

39:57 we've gone through the fundamental

39:58 workings of rnns the architecture the

40:01 training the type of problems that

40:03 they've been applied to and I'd like to

40:05 close this part by considering some

40:08 concrete examples of how you're going to

40:10 use rnns in your software lab

40:14 and that is going to be in the task of

40:16 Music generation where you're going to

40:19 work to build an RNN that can predict

40:22 the next musical note in a sequence and

40:25 use it to generate brand new musical

40:27 sequences that have never been realized

40:31 so to give you an example of just the

40:33 quality and and type of output that you

40:36 can try to aim towards a few years ago

40:39 there was a work that

40:41 trained an RNN on a corpus of classical

40:44 music data and famously there's this

40:47 composer Schubert who uh wrote a famous

40:51 unfinished Symphony that consisted of

40:53 two movements but he was unable to

40:55 finish his Symphony before he

40:59 died leaving the

41:01 third movement unfinished so a few years

41:04 ago a group trained an RNN-based model to

41:08 actually try to generate the third

41:10 movement to Schubert's famous unfinished

41:13 Symphony given the prior two movements so

41:16 I'm going to play the result

41:37 okay I I paused it I interrupted it

41:39 quite abruptly there but if there are

41:42 any classical music aficionados out

41:44 there hopefully you get an appreciation

41:46 for kind of the quality that was

41:48 generated in terms of the music

41:50 quality and this was already from a few

41:53 years ago and as we'll see in the next

41:55 lectures and continuing with this

41:57 theme of generative AI the power of

41:59 these algorithms has advanced

42:01 tremendously since we first played this

42:05 um particularly in you know a whole

42:07 range of domains which I'm excited to

42:10 talk about but not for now okay so

42:13 you'll tackle this problem head on in

42:15 today's lab RNN music generation

42:21 we can think about the simple

42:23 example of input sequence to a single

42:26 output with sentiment classification

42:29 where we can think about for example

42:30 text like tweets and assigning positive

42:34 or negative labels to these text

42:36 examples based on the content that

42:40 is learned by the network

42:43 okay so this kind of concludes the

42:46 portion on rnns and I think it's quite

42:50 remarkable that using all the

42:51 foundational Concepts and operations

42:53 that we've talked about so far we've

42:56 been able to try to build up networks

42:59 that handle this complex problem of

43:00 sequential modeling

43:03 like any technology right an RNN is not

43:06 without limitations so what are some of

43:09 those limitations and what are some

43:10 potential issues that can arise with

43:13 using rnns or even lstms

43:17 the first is this idea of encoding and

43:20 dependency in terms of the

43:24 temporal separation of data that we're processing

43:28 what rnns require is that the

43:30 sequential information is fed in and

43:32 processed time step by time step

43:35 what that imposes is what we call an

43:37 encoding bottleneck right where

43:40 we're trying to encode a lot of content

43:42 for example a very large body of text

43:44 many different words into a single

43:47 output that may be just at the very last

43:50 time step how do we ensure that all that

43:52 information leading up to that time step

43:54 was properly maintained and encoded and

43:57 learned by the network in practice this

43:59 is very very challenging and a lot of

44:01 information can be lost

44:03 another limitation is that by doing this

44:06 time step by time step processing rnns

44:09 can be quite slow there is not really an

44:11 easy way to parallelize that computation

44:15 and finally together these components of

44:18 the encoding bottleneck the requirement

44:20 to process this data step by step

44:23 imposes the biggest problem which is

44:25 when we talk about long memory

44:28 the capacity of the RNN and the lstm is

44:31 really not that long we can't really

44:34 handle data of tens of thousands or

44:37 hundreds of thousands or even Beyond

44:39 sequential information that effectively

44:41 to learn the complete amount of

44:43 information and patterns that are

44:45 present within such a rich data source

44:48 and so because of this very recently

44:51 there's been a lot of attention in how

44:53 we can move Beyond this notion of

44:56 step-by-step recurrent processing to

44:58 build even more powerful architectures

45:00 for processing sequential data

45:03 to understand how we do how we can start

45:06 to do this let's take a big step back

45:08 right think about the high level goal of

45:10 sequence modeling that I introduced at

45:13 given some input a sequence of data we

45:17 want to build a feature encoding and use

45:20 our neural network to learn that and

45:22 then transform that feature encoding

45:24 into a predicted output

45:27 what we saw is that rnns use this notion

45:30 of recurrence to maintain order

45:32 information processing information time

45:37 but as I just mentioned we had these

45:39 three key bottlenecks to rnns

45:42 what we really want to achieve is to go

45:44 beyond these bottlenecks and Achieve

45:47 even higher capabilities in terms of the

45:49 power of these models rather than having

45:52 an encoding bottleneck ideally we want

45:54 to process information continuously as a

45:57 continuous stream of information

45:59 rather than being slow we want to be

46:01 able to parallelize computations to

46:04 speed up processing and finally of

46:06 course our main goal is to really try to

46:09 establish long memory that can build

46:11 nuanced and Rich understanding of sequential data

46:16 the limitation of rnns that's linked to

46:18 all these problems and issues in our

46:21 inability to achieve these capabilities

46:23 is that they require this time step by

46:26 time step processing

46:28 so what if we could move beyond that

46:30 what if we could eliminate this need for

46:32 recurrence entirely and not have to

46:34 process the data time step by time step

46:37 well a first and naive approach would be

46:40 to just squash all the data all the time

46:44 steps together to create a vector that's

46:48 effectively concatenated right the time

46:50 steps are eliminated there's just one

46:54 where we have now one vector input with

46:57 the data from all time points that's

46:59 then fed into the model

47:01 it calculates some feature vector and

47:03 then generates some output which

47:05 hopefully makes sense

47:07 and because we've squashed all these

47:09 time steps together we could simply

47:11 think about maybe building a feed

47:13 forward Network that could do

47:17 well with that we'd eliminate the need

47:20 for recurrence but we still have the

47:22 issues that it's not scalable because

47:25 the dense feed forward Network would

47:28 have to be immensely large defined by

47:30 many many different connections

47:32 and critically we've completely lost our

47:34 order information by just squashing

47:37 everything together blindly there's no

47:39 temporal dependence and we're then stuck

47:42 in our ability to try to establish

47:48 so what if instead we could still think

47:51 about bringing these time steps together

47:53 but be a bit more clever about how we

47:56 try to extract information from this

47:59 the key idea is this idea of being able

48:03 to identify and attend to what is

48:06 important in a potentially sequential

48:09 stream of information

48:11 and this is the notion of attention or self-attention

48:14 which is an extremely extremely powerful

48:17 Concept in modern deep learning and AI I

48:20 cannot overstate

48:21 I cannot

48:23 emphasize enough how powerful this concept is

48:27 attention is the foundational mechanism

48:29 of the Transformer architecture which

48:32 many of you may have heard about

48:34 and the notion of a

48:38 transformer can often be very daunting

48:40 because sometimes they're presented with

48:41 these really complex diagrams or

48:44 deployed in complex applications and you

48:47 may think okay how do I even start to

48:50 at its core though attention the key

48:52 operation is a very intuitive idea and

48:55 we're going to in the last portion of

48:57 this lecture break it down step by step

48:59 to see why it's so powerful and how we

49:02 can use that as part of a larger neural

49:04 network like a Transformer

49:07 specifically we're going to be talking

49:09 and focusing on this idea of

49:12 attending to the most important parts of

49:17 so let's consider an image I think it's

49:20 most intuitive to consider an image

49:22 first this is a picture of Iron Man and

49:25 if our goal is to try to extract

49:27 information from this image of what's important

49:30 what we could do maybe is using our eyes

49:32 naively scan over this image pixel by

49:35 pixel right just going across the image

49:39 however our brains maybe

49:42 internally they're doing some type of

49:43 computation like this but you and I we

49:45 can simply look at this image and be

49:48 able to attend to the important parts

49:50 we can see that it's Iron Man coming at

49:52 you right in the image and then we can

49:55 focus in a little further and say okay

49:56 what are the details about Iron Man that

50:00 what is key what you're doing is your

50:02 brain is identifying which parts

50:04 to attend to and then

50:07 extracting those features that deserve

50:10 the highest attention

50:13 the first part of this problem is really

50:15 the most interesting and challenging one

50:18 and it's very similar to the concept of

50:21 search effectively that's what search is

50:23 doing taking some larger body of

50:26 information and trying to extract and

50:28 identify the important parts

50:30 so let's go there next how does search work

50:33 you're thinking you're in this class how

50:35 can I learn more about neural networks

50:37 well in this day and age one thing you

50:39 may do besides coming here and joining

50:41 us is going to the internet having all

50:44 the videos out there trying to find

50:45 something that matches doing a search

50:49 so you have a giant database like

50:51 YouTube you want to find a video

50:53 you enter in your query deep learning

50:57 and what comes out are some possible results

51:02 for every video in the database there is

51:04 going to be some key information related

51:06 to that video

51:08 let's say the title

51:11 now to do the search

51:13 what the task is to find the overlaps

51:17 between your query and each of these

51:20 titles right the keys in the database

51:23 what we want to compute is a metric of

51:25 similarity and relevance between the

51:27 query and these keys

51:30 how similar are they to our desired

51:33 and we can do this step by step let's

51:36 say this first option of a video about

51:39 the elegant giant sea turtles not that

51:41 similar to our query about deep learning

51:46 the second option introduction to deep learning the first

51:48 introductory lecture on this class yes

51:52 the third option a video about the late

51:54 and great Kobe Bryant not that relevant

51:57 the key operation here is that there is

52:00 this similarity computation bringing the

52:02 query and the key together

52:04 the final step is now that we've

52:07 identified what key is relevant

52:09 extracting the relevant information what

52:11 we want to pay attention to and that's

52:13 the video itself we call this the value

52:16 and because the search is implemented

52:19 well right we've successfully identified

52:21 the relevant video on deep learning that

52:23 you are going to want to pay attention

52:26 and it's this idea this intuition

52:28 of given a query trying to find

52:31 similarity trying to extract the related

52:33 values that form the basis of attention

52:37 and how it works in neural networks like Transformers

52:40 so to go concretely into this right

52:43 let's go back now to our text our

52:49 our goal is to identify and attend to

52:52 features in this input that are relevant

52:54 to the semantic meaning of the sentence

52:58 now first step we have sequence we have

53:02 order we've eliminated recurrence right

53:04 we're feeding in all the time steps all

53:07 at once we still need a way to encode

53:09 and capture this information about order

53:12 and this positional dependence

53:14 how this is done is this idea of

53:17 positional encoding which

53:19 captures some inherent order information

53:22 present in the sequence I'm just going

53:25 to touch on this very briefly but the

53:27 idea is related to this idea of

53:29 embeddings which I introduced earlier

53:32 what is done is a neural network layer

53:34 is used to encode positional information

53:37 that captures the relative relationships

53:40 in terms of order within this sequence

53:45 that's the high level concept right

53:47 we're still being able to process these

53:50 time steps all at once there is no

53:52 notion of time step rather the data is

53:54 singular but still we learned this

53:56 encoding that captures the positional

54:01 now our next step is to take this

54:03 encoding and figure out what to attend

54:05 to exactly like that search operation

54:07 that I introduced with the YouTube

54:09 example extracting a query extracting a

54:12 key extracting a value and relating them

54:15 so we use neural network layers to do

54:19 given this positional encoding what

54:22 attention does is apply a neural

54:26 network layer transforming that first generating the query

54:30 we do this again using a separate neural

54:33 network layer and this is a different

54:35 set of Weights a different set of

54:36 parameters that then transform that

54:38 positional embedding in a different way

54:40 generating a second output the key

54:45 and finally this repeat this operation

54:47 is repeated with a third layer a third

54:50 set of Weights generating the value
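Sketched concretely, those three projections look like this (the shapes and random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
x = rng.normal(size=(seq_len, d_model))    # positionally encoded input

# three different sets of weights, each applied to the same input
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = x @ W_q, x @ W_k, x @ W_v        # query, key, value matrices
print(Q.shape, K.shape, V.shape)           # each (6, 8)
```

One input, three different learned views of it: that asymmetry is what lets the network compare the sequence against itself in the next step.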

54:53 now with these three in hand

54:56 the query the key and the value we can

54:58 compare them to each other to try to

55:00 figure out where in that self-input the

55:04 network should attend to what is important

55:07 and that's the key idea behind this

55:09 similarity metric or what you can think

55:11 of as an attention score what we're

55:14 doing is we're Computing a similarity

55:16 score between a query and the key

55:19 and remember that these query and

55:22 key values are just arrays of numbers we

55:25 can Define them as arrays of numbers

55:27 which you can think of as vectors in

55:31 the query Vector the query values are

55:34 some Vector and

55:36 the key values are some other vector and

55:39 mathematically the way that we can

55:41 compare these two vectors to understand

55:42 how similar they are is by taking the

55:45 dot product and scaling it this captures how

55:49 similar these vectors are whether or

55:52 not they're pointing in the same direction

55:55 this is the similarity metric and if you

55:58 are familiar with a little bit of linear

56:00 algebra this is also known as the cosine

56:04 similarity and this operation functions exactly the same way

56:07 for matrices if we apply this dot

56:10 product operation to our query and key

56:12 matrices we get this

56:15 similarity metric out
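As a tiny worked example of that similarity computation (the vectors here are made up for illustration):

```python
import numpy as np

def cosine(a, b):
    # normalized dot product: 1 when aligned, 0 when orthogonal
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 2.0, 0.5])              # a query vector
k_similar = np.array([2.0, 4.0, 1.0])      # a key pointing the same way
k_unrelated = np.array([-2.0, 1.0, 0.0])   # a key orthogonal to the query
print(cosine(q, k_similar))                # 1.0
print(cosine(q, k_unrelated))              # 0.0
```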

56:19 this is very very key in defining our

56:22 next step Computing the attention

56:24 weighting in terms of what the network

56:26 should actually attend to within this input

56:30 this operation gives us a score which

56:36 tells us how the components of the input data are

56:39 related to each other

56:41 so given a sentence right when we

56:43 compute this similarity score metric we

56:47 can then begin to think of Weights that

56:50 Define the relationship between

56:52 the components of the

56:54 sequential data to each other

56:56 so for example in this example with

56:59 a text sentence he tossed the tennis

57:04 the goal with the score is that words in

57:07 the sequence that are related to each

57:08 other should have high attention weights

57:10 ball related to toss related to tennis

57:15 and this metric itself is our attention

57:18 weighting what we have done is passed that

57:21 similarity score through a softmax

57:24 function which all it does is it

57:27 constrains those values to be between 0

57:29 and 1. and so you can think of these as

57:31 relative scores of relative attention
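A quick sketch of that softmax step (the raw scores are invented for illustration; in the sentence example they would come from the query-key dot products):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())         # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([4.0, 3.5, 0.2])  # raw similarity scores for three words
weights = softmax(scores)
print(weights)                      # each weight in (0, 1)
print(weights.sum())                # weights sum to 1: relative attention
```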

57:35 finally now that we have this metric

57:38 that captures this notion of

57:41 similarity and these internal

57:43 self-relationships we can finally use

57:46 this metric to extract features that are

57:49 deserving of high attention

57:52 and that's the exact final step in this

57:55 self-attention mechanism in that we take

57:58 that attention weighting Matrix multiply

58:01 it by the value and get a transformed version

58:07 of the initial data as our output which

58:10 in turn reflects the features that

58:12 correspond to high attention

58:15 all right let's take a breath let's

58:18 recap what we have just covered so far

58:21 the goal with this idea of

58:23 self-attention the backbone of

58:24 Transformers is to eliminate recurrence and

58:27 attend to the most important features

58:31 in the input in an architecture how this is actually

58:33 deployed is first we take our input data

58:37 we compute these positional encodings

58:40 the neural network layers are applied

58:43 three-fold to transform the positional

58:46 encoding into each of the query key and

58:53 value then we compute the self-attention weight score

58:56 according to the dot product

58:58 operation that we went through prior and

59:01 then self-attend to this

59:05 information to extract features that

59:08 deserve High attention
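Putting the recap together, one self-attention head can be sketched end to end as follows (the shapes, the scaling by the square root of the model dimension, and the random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))        # positionally encoded input

W_q, W_k, W_v = (rng.normal(scale=0.3, size=(d_model, d_model))
                 for _ in range(3))

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v        # three learned projections
    scores = Q @ K.T / np.sqrt(d_model)        # scaled dot-product similarity
    A = softmax_rows(scores)                   # attention weighting matrix
    return A @ V, A                            # extract high-attention features

out, A = self_attention(x)
print(out.shape)                               # (5, 8): transformed output
print(A.sum(axis=1))                           # each row of weights sums to 1
```

A multi-head version would simply run several of these with their own weight sets and concatenate the outputs, which is the "multiple self-attention heads" idea described next.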

59:11 what is so powerful about this approach

59:14 in taking this attention weighting

59:16 putting it together with the value to

59:19 extract High attention features is that

59:21 this operation the scheme that I'm

59:24 showing on the right defines a single

59:26 self-attention head

59:28 and multiple of these self-attention

59:30 heads can be linked together to form

59:32 larger Network architectures where you

59:35 can think about these different heads

59:36 trying to extract different information

59:38 different relevant parts of the input to

59:41 now put together a very very rich

59:43 encoding and representation of the data

59:46 that we're working with

59:48 intuitively back to our Ironman example

59:51 what this idea of multiple

59:52 self-attention heads can amount to is

59:55 that different Salient features and

59:58 Salient information in the data is

01:00:00 extracted first maybe you consider Iron

01:00:03 Man as attention head one and you may have

01:00:06 additional attention heads that are

01:00:08 picking out other relevant parts of the

01:00:10 data which maybe we did not realize

01:00:12 before for example the building or the

01:00:15 spaceship in the background that's

01:00:19 this is a key building block of many

01:00:22 many powerful architectures

01:00:25 that are out there today I again

01:00:27 cannot emphasize enough how powerful

01:00:30 this mechanism is

01:00:32 and indeed this this backbone idea of

01:00:35 self-attention that you just built up

01:00:37 understanding of is the key operation of

01:00:40 some of the most powerful neural

01:00:42 networks and deep learning models out

01:00:44 there today ranging from the very

01:00:46 powerful language models like GPT-3 which

01:00:50 are capable of synthesizing natural

01:00:53 language in a very human-like fashion

01:00:55 digesting large bodies of text

01:00:57 information to understand relationships

01:01:01 to models that are being deployed for

01:01:04 extremely impactful applications in

01:01:07 biology and Medicine such as AlphaFold

01:01:10 2 which uses this notion of

01:01:12 self-attention to look at data of

01:01:15 protein sequences and be able to predict

01:01:17 the three-dimensional structure of a

01:01:19 protein just given sequence information

01:01:23 and all the way even now to computer

01:01:25 vision which will be the topic of our

01:01:27 next lecture tomorrow where the same

01:01:30 idea of attention that was initially

01:01:32 developed in sequential data

01:01:34 applications has now transformed the

01:01:36 field of computer vision and again using

01:01:39 this key concept of attending to the

01:01:42 important features in an input to build

01:01:44 these very rich representations of

01:01:46 complex High dimensional data

01:01:49 okay so that concludes lectures for

01:01:53 today I know we have covered a lot of

01:01:55 territory in a pretty short amount of

01:01:57 time but that is what this boot camp

01:01:59 program is all about so hopefully today

01:02:02 you've gotten a sense of the foundations

01:02:04 of neural networks in the lecture with

01:02:06 Alexander we talked about rnns how

01:02:09 they're well suited for sequential data

01:02:11 how we can train them using back propagation Through Time

01:02:14 how we can deploy them for different

01:02:15 applications and finally how we can move

01:02:18 Beyond recurrence to build this idea of

01:02:20 self-attention for building increasingly

01:02:23 powerful models for deep learning in

01:02:25 sequence modeling

01:02:27 all right hopefully you enjoyed we have

01:02:31 um about 45 minutes left for the for the

01:02:34 lab portion and open Office hours in

01:02:36 which we welcome you to ask us questions

01:02:38 of us and the TAs and to start work

01:02:41 on the labs the information for the labs

01:02:44 is up there thank you so much for