00:09 hello everyone and I hope you enjoyed

00:12 Alexander's first lecture I'm Ava and in

00:15 this second lecture lecture two we're

00:17 going to focus on this question of

00:19 sequence modeling how we can build

00:21 neural networks that can handle and

00:23 learn from sequential data

00:25 so in Alexander's first lecture he

00:27 introduced the essentials of neural

00:29 networks starting with perceptrons

00:31 building up to feed forward models and

00:33 how you can actually train these models

00:35 and start to think about deploying them

00:38 now we're going to turn our attention to

00:41 specific types of problems that involve

00:43 sequential processing of data and we'll

00:46 realize why these types of problems

00:48 require a different way of implementing

00:50 and building neural networks from what we've seen so far

00:55 and I think some of the components in

00:57 this lecture traditionally can be a bit

00:59 confusing or daunting at first but what

01:01 I really really want to do is to build

01:03 this understanding up from the

01:05 foundations walking through step by step

01:07 developing intuition all the way to

01:09 understanding the math and the

01:11 operations behind how these networks work

01:15 okay so let's get started

01:22 to begin I first want to motivate what

01:24 exactly we mean when we talk about

01:26 sequential data or sequential modeling

01:28 so we're going to begin with a really

01:30 simple intuitive example

01:32 let's say we have this picture of a ball

01:34 and your task is to predict where this

01:37 ball is going to travel to next

01:40 now if you don't have any prior

01:42 information about the trajectory of the

01:44 ball, its motion, its history, any guess

01:46 or prediction about its next position is

01:49 going to be exactly that a random guess

01:53 if however in addition to the current

01:55 location of the ball I gave you some

01:57 information about where it was moving in

01:59 the past now the problem becomes much

02:01 easier and I think hopefully we can all

02:04 agree that the most likely

02:07 next prediction is that this ball is

02:09 going to move forward to the right

02:13 so this is a really reduced

02:15 down bare-bones intuitive example but

02:18 the truth is that beyond this, sequential

02:20 data is really all around us right as

02:23 I'm speaking the words coming out of my

02:25 mouth form a sequence of sound waves

02:28 that Define audio which we can split up

02:30 to think about in this sequential manner

02:33 similarly text language can be split up

02:36 into a sequence of characters

02:38 or a sequence of words

02:41 and there are many many more examples in

02:43 which sequential processing sequential

02:45 data is present right from medical

02:48 signals like EKGs to financial markets

02:51 and projecting stock prices to

02:53 biological sequences encoded in DNA to

02:56 patterns in the climate to patterns of

02:58 motion and many more

03:00 and so already hopefully you're getting

03:03 a sense of what these types of questions

03:04 and problems may look like and where

03:06 they are relevant in the real world

03:10 when we consider applications of

03:12 sequential modeling in the real world we

03:14 can think about a number of different

03:15 kind of problem definitions that we can

03:17 have in our Arsenal and work with

03:20 in the first lecture Alexander

03:22 introduced the Notions of classification

03:24 and the notion of regression where he

03:27 talked about and we learned about feed

03:29 forward models that can operate one to

03:31 one in this fixed and static setting

03:33 right given a single input predict a

03:37 single output, like the binary classification example of

03:39 will you succeed or pass this class

03:42 here there's no notion of

03:44 sequence there's no notion of time

03:46 now if we introduce this idea of a

03:49 sequential component we can handle

03:51 inputs that may be defined temporally

03:54 and potentially also produce a

03:56 sequential or temporal output so for

04:00 one example we can consider text

04:02 language and maybe we want to generate

04:04 one prediction given a sequence of text

04:07 classifying whether a message is a

04:10 positive sentiment or a negative one

04:13 conversely we could have a single input

04:16 let's say an image and our goal may be

04:19 now to generate text or a sequential

04:22 description of this image right given

04:24 this image of a baseball player throwing

04:26 a ball can we build a neural network

04:28 that generates a language caption describing it

04:32 finally we can also consider

04:34 applications and problems where we have

04:36 sequence in, sequence out, for example if we

04:39 want to translate between two languages

04:41 and indeed this type of thinking in this

04:44 type of Architecture is what powers the

04:47 task of machine translation in your

04:49 phones in Google Translate and many other applications

04:54 so hopefully right this has given you a

04:57 picture of what sequential data looks

04:58 like what these types of problem

05:00 definitions may look like and from this

05:03 we're going to start and build up our

05:04 understanding of what neural networks we

05:07 can build and train for these types of problems

05:11 so first we're going to begin with the

05:14 notion of recurrence and build up from

05:16 that to Define recurrent neural networks

05:19 and in the last portion of the lecture

05:21 we'll talk about the

05:22 mechanisms underlying the Transformer

05:25 architectures that are very

05:27 powerful in terms of handling sequential

05:29 data but as I said at the beginning

05:31 right the theme of this lecture is

05:34 building up that understanding step by

05:36 step starting with the fundamentals and

05:39 so to do that we're going to go back

05:41 revisit the perceptron and move forward

05:45 right so as Alexander introduced when

05:48 we studied the perceptron in lecture one

05:51 the perceptron is defined by this single

05:54 neural operation where we have some set

05:57 of inputs let's say X1 through XM and

06:00 each of these inputs is multiplied by

06:03 a corresponding weight and passed through a

06:05 non-linear activation function that then

06:08 generates a predicted output y hat

06:11 here we can have multiple inputs coming

06:13 in to generate our output but still

06:16 these inputs are not thought of as

06:18 points in a sequence or time steps in a sequence

06:22 even if we scale this perceptron and

06:24 start to stack multiple perceptrons

06:27 together to Define these feed forward

06:28 neural networks we still don't have this

06:31 notion of temporal processing or

06:33 sequential information even though we

06:36 are able to translate and convert

06:38 multiple inputs apply these weight

06:40 operations apply this non-linearity to

06:43 then Define multiple predicted outputs

06:47 so taking a look at this diagram right

06:49 on the left in blue you have inputs on

06:52 the right in purple you have these

06:53 outputs and the green defines the

06:56 single neural network layer that's

06:58 transforming these inputs to the outputs

07:01 Next Step I'm going to just simplify

07:02 this diagram I'm going to collapse down

07:05 those stacked perceptrons together and

07:08 depict this with this green block

07:11 still it's the same operation going on

07:13 right we have an input vector

07:16 being transformed to predict this output

07:20 now what I've introduced here which you

07:23 may notice is this new variable T right

07:26 which I'm using to denote a single time step

07:29 we are considering an input at a single

07:31 time step and using our neural network

07:34 to generate a single output

07:36 corresponding to that input

07:38 how could we start to extend and build

07:40 off this to now think about multiple

07:43 time steps and how we could potentially

07:45 process a sequence of information

07:48 well what if we took this diagram all

07:51 I've done is just rotated it 90 degrees

07:54 where we still have this input vector

07:56 being fed in producing an output

07:59 vector and what if we can make a copy of

08:02 this network right and just do this

08:05 operation multiple times to try to

08:07 handle inputs that are fed in

08:09 corresponding to different times right

08:11 we have an individual time step starting from the first one

08:16 and we can do the same thing the same

08:18 operation for the next time step again

08:21 treating that as an isolated instance

08:24 and keep doing this repeatedly

08:27 and what you'll notice hopefully is all

08:29 these models are simply copies of each

08:31 other just with different inputs at each

08:33 of these different time steps

08:36 and we can make this concrete right in

08:38 terms of what this functional

08:39 transformation is doing

08:41 the predicted output at a particular

08:43 time step y hat of T is a function of

08:48 the input at that time step X of T and

08:51 that function is what is learned and

08:53 defined by our neural network weights

08:56 okay so I've told you that our goal here

08:59 is trying to understand sequential

09:01 data do sequential modeling but what

09:04 could be the issue with what this

09:06 diagram is showing and what I've shown

09:15 that's exactly right so the

09:17 student's answer was that X1 or it could

09:20 be related to X naught and you have this

09:22 temporal dependence but these isolated

09:24 replicas don't capture that at all and that

09:29 answers the question perfectly right

09:32 here a predicted output at a later time

09:37 step can depend precisely on inputs at previous time

09:39 steps if this is truly a sequential

09:41 problem with this temporal dependence

09:45 so how could we start to reason about

09:47 this how could we Define a relation that

09:50 links the Network's computations at a

09:52 particular time step to Prior history

09:55 and memory from previous time steps

09:58 well what if we did exactly that right

10:01 what if we simply linked the computation

10:04 and the information understood by the

10:07 network to these other replicas via what

10:11 we call a recurrence relation

10:13 what this means is that something about

10:15 what the network is Computing at a

10:17 particular time is passed on to those

10:20 later time steps and we Define that

10:22 according to this variable H which we

10:25 call this internal state or you can

10:27 think of it as a memory term

10:29 that's maintained by the neurons and the

10:31 network and it's this state that's being

10:34 passed time step to time step as we read

10:37 in and process this sequential data

10:42 what this means is that the Network's

10:45 output its predictions its computations

10:47 is not only a function of the input data

10:51 but also we have this other variable H

10:54 which captures this notion of state,

10:56 captures this notion of memory

10:59 that's being computed by the network and

11:02 passed on over time

11:04 specifically right to walk through this

11:06 our predicted output y hat of T depends

11:10 not only on the input at that time step but also

11:12 this past memory this past state

11:15 and it is this linkage of temporal

11:19 dependence and recurrence that defines

11:21 this idea of a recurrent neural unit

11:24 what I've shown is this connection

11:27 that's being unrolled over time but we

11:30 can also depict this relationship

11:32 according to a loop

11:34 this computation to this internal State

11:37 variable h of T is being iteratively

11:39 updated over time and that's fed back

11:42 into the neuron the neurons computation

11:45 in this recurrence relation

11:49 this is how we Define these recurrent

11:51 cells that comprise recurrent neural networks

11:56 and the key here is that we have

11:59 this idea of a recurrence relation

12:00 that captures the cyclic temporal dependency

12:06 and indeed it's this idea that is really

12:08 the intuitive Foundation behind

12:10 recurrent neural networks or rnns and so

12:13 let's continue to build up our

12:15 understanding from here and move forward

12:17 into how we can actually Define the RNN

12:20 operations mathematically and in code

12:23 so all we're going to do is formalize

12:25 this relationship a little bit more

12:27 the key idea here is that the RNN is

12:30 maintaining the state and it's updating

12:32 the state at each of these time steps as

12:35 the sequence is processed

12:38 we Define this by applying this

12:40 recurrence relation

12:41 and what the recurrence relation

12:43 captures is how we're actually updating

12:45 that internal State h of t

12:48 specifically that state update is

12:51 exactly like any other neural network

12:52 operator operation that we've introduced

12:55 so far where again we're learning a function

12:59 defined by a set of weights W

13:01 we're using that function to update the internal state h of t

13:06 and the additional component the new piece

13:09 here is that that function depends both

13:11 on the input and the prior time step's state h of t minus 1

13:16 and what you'll note is that this

13:19 function f sub W is defined by a set of

13:22 weights and it's the same set of Weights

13:24 the same set of parameters that are used

13:27 time step to time step as the recurrent

13:30 neural network processes this temporal

13:32 information the sequential data
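
Written out as an equation (this is just the relation described above, not new math), the state update applied at every time step is

```latex
h_t = f_W\left(x_t,\ h_{t-1}\right)
```

where f_W is one and the same function, defined by the shared weights W, applied at every time step, and the prediction y-hat of t is computed from the state h of t.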

13:35 okay so the key idea here hopefully is

13:38 coming through is that this RNN

13:41 state update operation takes this state

13:44 and updates it at each time step as a sequence is processed

13:48 we can also translate this to how we can

13:52 think about implementing rnns in Python

13:56 code or rather pseudocode hopefully

13:59 getting a better understanding and

14:00 intuition behind how these networks work

14:03 so what we do is we just start by initializing our RNN

14:07 for now this is abstracted away

14:10 and we initialize its hidden

14:12 State and we have some sentence right

14:15 let's say this is our input of Interest

14:16 where we're interested in predicting

14:19 maybe the next word that's occurring in this sentence

14:22 what we can do is loop through these

14:25 individual words in the sentence that

14:27 Define our temporal input and at each

14:30 step as we're looping, each word

14:32 in that sentence is fed into the RNN

14:36 model along with the previous hidden state

14:40 and this is what generates a prediction

14:42 for the next word and updates the RNN's hidden state

14:47 finally our prediction for the final

14:49 word in the sentence the word that we're

14:51 missing is simply the RNN's output after

14:54 all the prior words have been fed in
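
The loop just described can be sketched as runnable Python. Note that `ToyRNN` here is a made-up stand-in (a deterministic toy update, not a trained network and not the lab's model); only the control flow mirrors the pseudocode: each word is fed in together with the previous hidden state, and the prediction for the missing word is whatever the model outputs after the last prior word.

```python
# A toy stand-in for an RNN, illustrating the word-by-word loop.
# The state update is an arbitrary deterministic function, not a
# trained network; only the control flow mirrors the lecture.

class ToyRNN:
    def __init__(self, state_size=4):
        self.state_size = state_size

    def initial_state(self):
        return [0.0] * self.state_size

    def __call__(self, word, hidden_state):
        # Fold the word into the state (placeholder for the real
        # update W_xh x_t + W_hh h_{t-1} followed by a non-linearity).
        new_state = [
            (h + len(word) * (i + 1)) % 7.0
            for i, h in enumerate(hidden_state)
        ]
        # The "prediction" is a summary of the state (placeholder for W_hy h_t).
        prediction = sum(new_state)
        return prediction, new_state


my_rnn = ToyRNN()
hidden_state = my_rnn.initial_state()
sentence = ["I", "love", "recurrent", "neural"]

# Feed each word in along with the previous hidden state.
for word in sentence:
    prediction, hidden_state = my_rnn(word, hidden_state)

# After the loop, `prediction` is the output for the missing next word,
# produced only after all the prior words have updated the hidden state.
next_word_prediction = prediction
```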

14:59 so this is really breaking down how the

15:01 RNN Works how it's processing the

15:03 sequential information

15:05 and what you've noticed is that the RNN

15:08 computation includes both this update to

15:10 the hidden State as well as generating

15:12 some predicted output at the end that is

15:15 our ultimate goal that we're interested in

15:17 and so to walk through this how we're

15:20 actually generating the output

15:23 what the RNN computes is, given some input x of t,

15:27 it then performs this update to the hidden state

15:31 and this update to the hidden state is

15:34 just a standard neural network operation

15:36 just like we saw in the first lecture

15:38 where it consists of taking a weight

15:41 Matrix multiplying that by the previous

15:44 hidden State taking another weight

15:46 Matrix multiplying that by the input at

15:49 a time step and applying a non-linearity

15:52 and in this case right because we have

15:54 these two input streams the input data X

15:57 of T and the previous state H we have

16:01 these two separate weight matrices that

16:03 the network is learning over the course of training

16:06 that comes together we apply the

16:09 non-linearity and then we can generate

16:12 an output at a given time step by just

16:14 modifying the hidden state

16:17 using a separate weight matrix to transform

16:20 this value and then generate a predicted output
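
In equation form (matching the description above; tanh is one common choice of non-linearity, and W_hh, W_xh, W_hy name the three weight matrices):

```latex
h_t = \tanh\!\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right),
\qquad
\hat{y}_t = W_{hy}\, h_t
```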

16:24 and that's what there is to it right

16:26 that's how the RNN in its single

16:29 operation updates both the hidden State

16:32 and also generates a predicted output

16:36 okay so now this gives you the internal

16:39 working of how the RNN computation

16:41 occurs at a particular time step let's

16:45 next think about how this looks like

16:46 over time and Define the computational

16:50 graph of the RNN as being unrolled or

16:53 expanded across time

16:56 so far the dominant way I've been

16:58 showing the RNNs is according to this

17:01 loop-like diagram on the left, right,

17:03 feeding back in on itself

17:05 another way we can visualize and think

17:07 about rnns is as kind of unrolling this

17:11 recurrence over time over the individual

17:14 time steps in our sequence

17:16 what this means is that we can take the

17:19 network at our first time step

17:21 and continue to iteratively unroll it

17:24 across the time steps

17:26 going on forward all the way until we

17:29 process all the time steps in our input

17:32 now we can formalize this diagram a

17:35 little bit more by defining the weight

17:37 matrices that connect the inputs to the

17:41 hidden State update

17:43 and the weight matrices that are used to

17:46 update the internal State across time

17:49 and finally the weight matrices that

17:51 define the update to generate a predicted output

17:56 now recall that in all these cases right

18:00 for all these three weight matrices at

18:02 all these time steps we are simply

18:04 reusing the same weight matrices right

18:07 so it's one set of parameters one set of

18:09 weight matrices that just process this

18:12 information sequentially

18:14 now you may be thinking okay so how do

18:17 we actually start to be thinking about

18:19 how to train the RNN how to define the

18:22 loss given that we have this temporal

18:24 processing in this temporal dependence

18:27 well a prediction at an individual time

18:30 step will simply amount to a computed

18:33 loss at that particular time step

18:36 so now we can compare those predictions

18:38 time step by time step to the true label

18:41 and generate a loss value for those

18:43 time steps and finally we can get our

18:46 total loss by taking all these

18:48 individual loss terms together and

18:51 summing them defining the total loss for

18:54 a particular input to the RNN
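
As an equation: if L_t denotes the loss comparing the prediction y-hat of t against the true label y of t at step t, the total loss over a sequence of length T is simply the sum

```latex
L = \sum_{t=1}^{T} L_t\!\left(\hat{y}_t,\ y_t\right)
```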

18:58 now we can walk through an example of how

19:00 we implement this RNN in tensorflow

19:02 starting from scratch

19:04 the RNN can be defined as a layer

19:07 operation and a layer class that

19:09 Alexander introduced in the first

19:11 lecture and so we can Define it

19:13 according to an initialization of weight

19:16 matrices initialization of a hidden

19:19 state which commonly amounts to

19:21 initializing these two to zero

19:25 next we can Define how we can actually

19:27 pass forward through the RNN Network to

19:31 process a given input X

19:33 and what you'll notice is in this

19:35 forward operation the computations are

19:38 exactly like we just walked through we

19:40 first update the hidden state

19:42 according to that equation we introduced

19:46 and then generate a predicted output

19:48 that is a transformed version of that

19:52 and finally at each time step we return

19:54 both the output and the updated

19:57 hidden State as this is what is

19:59 necessary to be stored to continue this

20:02 RNN operation over time
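
The lecture builds this as a TensorFlow layer; as a framework-free sketch of the same forward pass (my own minimal version with made-up dimensions, not the lab's code), the weight matrices, zero-initialized hidden state, and call method returning both output and state look like this:

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def vadd(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]

class MyRNNCell:
    """Minimal RNN cell mirroring the update from the lecture:
    h_t = tanh(W_hh h_{t-1} + W_xh x_t),  y_t = W_hy h_t."""

    def __init__(self, input_dim, hidden_dim, output_dim, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            # Small random weights stand in for a real initializer.
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.W_hh = init(hidden_dim, hidden_dim)  # state-to-state weights
        self.W_xh = init(hidden_dim, input_dim)   # input-to-state weights
        self.W_hy = init(output_dim, hidden_dim)  # state-to-output weights
        self.h = [0.0] * hidden_dim               # hidden state, zeros

    def call(self, x):
        # Update the hidden state, then compute the output from it.
        self.h = [math.tanh(a) for a in vadd(matvec(self.W_hh, self.h),
                                             matvec(self.W_xh, x))]
        y = matvec(self.W_hy, self.h)
        # Return both: both are needed to continue the RNN over time.
        return y, self.h

cell = MyRNNCell(input_dim=3, hidden_dim=5, output_dim=2)
for x_t in [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]:
    y_t, h_t = cell.call(x_t)
```

TensorFlow's built-in `tf.keras.layers.SimpleRNN(rnn_units)` implements essentially this cell, with initialization and batching handled for you.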

20:04 what is very convenient is that although

20:08 you could define your RNN network and

20:09 your RNN layer completely from scratch

20:11 TensorFlow abstracts this

20:14 operation away for you so you can simply

20:16 define a SimpleRNN according to this

20:20 call that you're seeing here

20:23 which makes all the

20:25 computations very efficient and very easy to use

20:28 and you'll actually get practice

20:30 implementing and working with RNNs

20:33 in today's software lab

20:36 okay so that gives us the understanding

20:39 of RNNs and going back to what I

20:43 described as kind of the problem setups

20:45 or the problem definitions at the

20:46 beginning of this lecture

20:48 I just want to remind you of the types

20:50 of sequence modeling problems on which

20:52 we can apply rnns right we can think

20:56 about taking a sequence of inputs

20:57 producing one predicted output at the

21:00 end of the sequence

21:02 we can think about taking a static

21:04 single input and trying to generate text

21:06 according to that single input

21:11 and finally we can think about taking a

21:13 sequence of inputs producing a

21:15 prediction at every time step in that sequence

21:18 and then doing this sequence to sequence

21:21 type of prediction and translation

21:28 yeah so this will be

21:32 the software lab today which will focus

21:35 on this problem of of many to many

21:38 processing and many to many sequential

21:40 modeling taking a sequence going to a sequence

21:44 what is common and what is universal

21:46 across all these types of problems and

21:49 tasks that we may want to consider with

21:51 RNNs is what I like to think about as what

21:54 type of design criteria we need to build

21:57 a robust and reliable Network for

22:00 processing these sequential modeling problems

22:03 what I mean by that is what are the

22:05 characteristics what are the design

22:08 requirements that the RNN needs to

22:10 fulfill in order to be able to handle

22:13 sequential data effectively

22:16 the first is that sequences can be of

22:18 different lengths right they may be

22:21 short they may be long we want our RNN

22:23 model or our neural network model in

22:25 general to be able to handle sequences

22:27 of variable lengths

22:29 secondly and really importantly is as we

22:33 were discussing earlier that the whole

22:35 point of thinking about things through

22:36 the lens of sequence is to try to track

22:39 and learn dependencies in the data that

22:41 are related over time

22:43 so our model really needs to be able to

22:45 handle those different dependencies

22:47 which may occur at times that are very

22:50 very distant from each other

22:53 next right sequence is all about order

22:56 right there's some notion of how current

22:59 inputs depend on prior inputs and the

23:02 specific order of the observations we

23:04 see makes a big effect on what

23:08 prediction we may want to generate at a given time

23:11 and finally in order to be able to

23:14 process this information

23:16 effectively our Network needs to be able

23:19 to do what we call parameter sharing

23:21 meaning that given one set of Weights

23:24 that set of weights should be able to

23:26 apply to different time steps in the

23:27 sequence and still result in a

23:30 meaningful prediction

23:31 and so today we're going to focus on how

23:34 recurrent neural networks meet these

23:36 design criteria and how these design

23:38 criteria motivate the need for even more

23:41 powerful architectures that can

23:43 outperform rnns in sequence modeling

23:46 so to understand these criteria very

23:49 concretely we're going to consider a

23:52 sequence modeling problem where given

23:54 some series of words our task is just to

23:57 predict the next word in that sentence

24:00 so let's say we have this sentence this

24:03 morning I took my cat for a walk

24:05 and our task is to predict the last word

24:09 in the sentence given the prior words

24:10 this morning I took my cat for a blank

24:15 our goal is to take our RNN, define it, and

24:20 put it to the test on this task

24:22 what is our first step to doing this

24:25 well the very very first step before we

24:28 even think about defining the RNN is how

24:31 we can actually represent this

24:33 information to the network in a way that

24:35 it can process and understand

24:39 if we have a model that is processing

24:42 this data processing this text-based information

24:45 and wanting to generate text as the output

24:48 a problem can arise in that the neural

24:51 network itself is not equipped to handle

24:54 language explicitly right

24:56 remember that neural networks are simply

24:58 functional operators they're just

25:00 mathematical operations and so we can't

25:03 expect it right it doesn't have an

25:04 understanding from the start of what a

25:07 word is or what language means which

25:09 means that we need a way to represent

25:11 language numerically so that it can be

25:14 passed in to the network to process

25:19 so what we do is that we need to define

25:21 a way to translate this text, this

25:24 language information into a numerical

25:27 encoding a vector an array of numbers

25:30 that can then be fed in to our neural

25:33 network and generating a vector of

25:36 numbers as its output

25:39 so now right this raises the question of

25:41 how do we actually Define this

25:43 transformation how can we transform

25:45 language into this numerical encoding

25:48 the key solution and the key way that a

25:50 lot of these networks work is this

25:53 notion and concept of embedding

25:55 what that means is it's some

25:58 transformation that takes

26:00 indices or something that can be

26:03 represented as an index into a numerical

26:07 Vector of a given size

26:10 so if we think about how this idea of

26:12 embedding works for language data let's

26:14 consider a vocabulary of words that we

26:17 can possibly have in our language

26:19 and our goal is to be able to map these

26:22 individual words in our vocabulary to a

26:26 numerical Vector of fixed size

26:29 one way we could do this is by defining

26:31 all the possible words that could occur

26:35 and then indexing them assigning an index

26:38 label to each of these distinct words

26:41 a corresponds to index one cat corresponds

26:44 to index two so on and so forth and this

26:47 indexing Maps these individual words to

26:50 numbers unique indices

26:53 what these indices can then Define is

26:55 what we call a embedding vector

26:58 which is a fixed length encoding where

27:00 we've simply indicated a one value

27:04 at the index for that word

27:08 and this is called a one-hot embedding

27:10 where we have this fixed length Vector

27:13 of the size of our vocabulary and each

27:16 instance of the vocabulary corresponds

27:18 to a one at the corresponding index
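
As a concrete sketch with a tiny made-up vocabulary (indices here are zero-based, unlike the one-based counting in the example above), the word-to-index-to-one-hot pipeline looks like this:

```python
# Map each word in a (toy) vocabulary to a unique index, then to a
# one-hot vector: all zeros except a single 1 at that word's index.
vocab = ["a", "cat", "dog", "walk"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)          # fixed length = vocabulary size
    vec[word_to_index[word]] = 1    # single 1 at the word's index
    return vec

cat_vector = one_hot("cat")  # [0, 1, 0, 0]
```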

27:24 this is a very sparse way to do this and

27:28 it's simply based on

27:30 purely on the count index

27:32 there's no notion of semantic

27:35 information or meaning captured in

27:38 this vector-based encoding

27:40 alternatively what is very commonly done

27:42 is to actually use a neural network to

27:45 learn an encoding, to learn an embedding

27:47 and the goal here is that we can learn a

27:50 neural network that then captures some

27:52 inherent meaning or inherent semantics

27:54 in our input data and Maps related words

27:57 or related inputs closer together in

28:00 this embedding space meaning that

28:02 they'll have numerical vectors that are

28:04 more similar to each other
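
Mechanically, a learned embedding is just a lookup into a trainable matrix: a word's index selects a row, and training nudges the rows for related words closer together. A minimal illustration with hand-picked (not learned) numbers:

```python
# Toy embedding table: each word maps to a dense vector. In a real model
# these are rows of a learned weight matrix selected by word index; the
# numbers below are hand-picked for illustration only.
embedding = {
    "cat":  [0.9, 0.1],
    "dog":  [0.8, 0.2],
    "walk": [-0.7, 0.9],
}

def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Semantically related words sit closer together in the embedding space.
cat_dog = distance(embedding["cat"], embedding["dog"])
cat_walk = distance(embedding["cat"], embedding["walk"])
```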

28:07 this concept is really really

28:09 foundational to how these

28:11 sequence modeling networks work and how

28:15 neural networks work in general

28:17 okay so with that in hand we can go back

28:20 to our design criteria

28:21 thinking about the capabilities that we

28:23 desire first we need to be able to

28:26 handle variable length sequences

28:29 if we again want to predict the next

28:31 word in the sequence we can have short

28:32 sequences we can have long sequences we

28:34 can have even longer sentences and our

28:37 key task is that we want to be able to

28:39 track dependencies across all these

28:43 and what we need what we mean by

28:45 dependencies is that there could be

28:46 information very early on in a sequence

28:50 but that may not be relevant or come

28:53 up until very much later in the

28:55 sequence and we need to be able to track

28:58 these dependencies and maintain this

29:00 information in our Network

29:03 dependencies relate to order and

29:05 sequences are defined by their order and

29:08 we know that same words in a completely

29:11 different order have completely

29:12 different meanings right so our model

29:15 needs to be able to handle these

29:17 differences in order and the differences

29:19 in length that could result in different predictions

29:25 okay so hopefully that example going

29:28 through the example in text

29:30 motivates how we can think about

29:32 transforming input data into a numerical

29:35 encoding that can be passed into the RNN

29:38 and also what are the key criteria that

29:41 we want to meet in handling these sequences

29:46 so far we've painted the picture of

29:49 RNNs how they work the intuition their

29:51 mathematical operations and what are the

29:54 key criteria that they need to meet the

29:57 final piece to this is how we actually

29:59 train and learn the weights in the RNN

30:02 and that's done through the backpropagation

30:04 algorithm with a bit of a twist to just

30:07 handle sequential information

30:10 if we go back and think about how we

30:13 train feed forward neural network models

30:16 the steps break down in thinking through

30:19 starting with an input where we first

30:21 take this input and make a forward pass

30:24 through the network going from input to output

30:28 the key to back propagation that

30:29 Alexander introduced was this idea of

30:32 taking the prediction and back

30:33 propagating gradients back through the network

30:37 and using this operation to then Define

30:40 and update the loss with respect to each

30:43 of the parameters in the network in

30:45 order to gradually adjust the parameters

30:48 the weights of the network in order to

30:50 minimize the overall loss

30:53 now with rnns as we walked through

30:55 earlier we have this temporal unrolling

30:58 which means that we have these

30:59 individual losses across the individual

31:01 steps in our sequence that sum together

31:04 to comprise the overall loss

31:07 what this means is that when we do back

31:11 we have to now instead of back

31:14 propagating errors through a single feedforward pass

31:17 back propagate the loss through each of

31:19 these individual time steps

31:21 and after we back propagate loss through

31:24 each of the individual time steps we

31:26 then do that across all time steps all

31:30 the way from our current time time T

31:32 back to the beginning of the sequence

31:35 and this is why this

31:38 algorithm is called back propagation

31:40 Through Time right because as you can

31:42 see the data and the predictions and

31:45 the resulting errors are fed back in

31:47 time all the way from where we are

31:49 currently to the very beginning of the

31:51 input data sequence

31:55 so backpropagation through time is

31:58 actually a very tricky algorithm to implement

32:02 and the reason for this is if we take a

32:04 close look at how gradients flow

32:07 across the RNN what this algorithm

32:10 involves many repeated

32:12 computations and multiplications of

32:15 these weight matrices against each other

32:18 in order to compute the gradient with

32:20 respect to the very first time step we

32:24 have to make many of these

32:25 multiplicative repeats of the weight

32:29 why might this be problematic well if

32:33 this weight Matrix W is very very big

32:37 what this can result in is

32:39 what we call the exploding gradient

32:41 problem where our gradients that we're

32:43 trying to use to optimize our Network do

32:46 exactly that they blow up they explode

32:49 and they get really big making it

32:51 infeasible to train the network

32:56 what we do to mitigate this is a pretty

32:59 simple solution called gradient clipping

33:01 which effectively scales back these very

33:03 big gradients to try to constrain them

33:05 in a more restricted range
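A minimal sketch of gradient clipping by norm. The threshold value is an arbitrary illustrative choice and `clip_by_norm` is a hypothetical helper written out by hand here; libraries such as TensorFlow ship equivalent utilities:

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its norm never exceeds max_norm (direction kept)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

exploded = np.array([300.0, -400.0])           # norm 500: a blown-up gradient
clipped = clip_by_norm(exploded, max_norm=5.0)
print(clipped)                                 # [3. -4.], norm exactly 5
```

The gradient's direction is unchanged; only its magnitude is scaled back, which is why this simple fix is usually enough for the exploding case.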

33:10 conversely we can have the instance

33:13 where the weight matrices are very very

33:15 small and if these weight matrices are small

33:19 we end up with a very very small value

33:21 at the end as a result of these repeated

33:23 weight Matrix computations and these

33:28 multiplications and this is a very real

33:31 problem in rnns in particular where we

33:34 can run into this problem called the

33:36 Vanishing gradient where now your

33:38 gradient has just dropped down close to

33:40 zero and again you can't train the network
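To make the repeated-multiplication intuition concrete, here is a toy numerical illustration (the scalar weights are an illustrative stand-in for the magnitude of the weight matrix, an assumption for this sketch): gradients chained over T time steps scale roughly like w to the power T.

```python
# Repeated multiplication over T time steps scales gradients like w**T:
# it explodes when |w| > 1 and vanishes when |w| < 1.
T = 50
big, small = 1.5, 0.5          # illustrative stand-ins for weight magnitude
print(big ** T)                # roughly 6.4e8: exploding gradient
print(small ** T)              # roughly 8.9e-16: vanishing gradient
```

Even modest deviations from 1 blow up or wash out entirely over a 50-step sequence, which is why long sequences are where RNN training breaks down.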

33:43 now there are particular tools that we

33:46 can implement to try to

33:48 mitigate the vanishing

33:50 gradient problem and we'll touch on

33:52 each of these three solutions briefly

33:55 first being how we can Define the

33:58 activation function in our Network how

34:01 we can initialize the weights and how we can change the network

34:02 architecture itself to try to better

34:04 handle this Vanishing gradient problem

34:10 I want to take just one step back to

34:12 give you a little more intuition about

34:14 why Vanishing gradients can be a real

34:17 issue for recurrent neural networks

34:20 a point I've kept trying to reiterate is

34:22 this notion of dependency in the

34:24 sequential data and what it means to

34:26 track those dependencies

34:28 well if the dependencies are very

34:30 constrained in a small space not

34:32 separated out that much by time

34:34 this repeated gradient computation and

34:37 the repeated weight matrix

34:38 multiplication is not so much of a problem

34:41 if we have a very short sequence where

34:44 the words are very closely related to

34:46 each other and it's pretty obvious what

34:49 our next output is going to be

34:52 the RNN can use the immediately passed

34:55 information to make a prediction

34:57 and so there is not going to be

34:59 that much of a requirement to

35:02 learn effective weights if the related

35:05 information is close to each other

35:09 conversely now if we have a sentence

35:12 where we have a more long-term dependency

35:16 what this means is that we need

35:17 information from way further back in the

35:19 sequence to make our prediction at the

35:22 end and that gap between what's relevant

35:25 and where we are at currently becomes

35:27 exceedingly large and therefore the

35:29 vanishing gradient problem is

35:31 increasingly exacerbated meaning that

35:37 the RNN becomes unable to connect the

35:39 dots and establish this long-term

35:41 dependency all because of this Vanishing gradient problem

35:45 so there are ways

35:48 and modifications that we can make to

35:49 our Network to try to alleviate this problem

35:54 the first is that we can simply change

35:57 the activation functions in each of our

35:59 neural network layers to be such that

36:02 they can effectively try to

36:05 safeguard the gradients

36:08 from shrinking in

36:10 instances where the input is greater than

36:13 zero and this is in particular true for

36:16 the relu activation function and the

36:19 reason is that in all instances where X

36:22 is greater than zero with the relu

36:24 function the derivative is one and so

36:28 that is not less than one and therefore

36:31 it helps in mitigating The Vanishing gradient problem
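A small numerical check of this point (the tanh comparison is added here for contrast and is not from the lecture; the sample activations are made up):

```python
import numpy as np

def relu_grad(x):
    return (x > 0).astype(float)      # ReLU derivative: exactly 1 wherever x > 0

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2      # tanh derivative: below 1 away from 0

x = np.array([0.5, 1.0, 2.0])         # illustrative positive activations
print(np.prod(relu_grad(x)))          # chaining 3 ReLU steps keeps 1.0
print(np.prod(tanh_grad(x)))          # chaining 3 tanh steps shrinks it
```

With saturating activations every backward step multiplies in a factor below 1, while ReLU's unit derivative leaves the gradient's magnitude alone for positive inputs.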

36:37 another trick is how we initialize the

36:40 parameters in the network itself to

36:42 prevent them from shrinking to zero too quickly

36:45 and there are mathematical

36:48 ways that we can do this namely by

36:50 initializing our weights to Identity

36:53 matrices and this effectively helps in

36:56 practice to prevent the weight updates

36:59 from shrinking too rapidly to zero
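A toy demonstration of why identity initialization helps (the sizes and the small comparison weight are illustrative assumptions):

```python
import numpy as np

n, steps = 4, 20
h_id = np.ones(n)
h_small = np.ones(n)
for _ in range(steps):                     # repeated recurrent multiplications
    h_id = np.eye(n) @ h_id                # identity init: activations preserved
    h_small = (0.1 * np.eye(n)) @ h_small  # small init: activations collapse
print(h_id)                                # still all ones
print(h_small.max())                       # 0.1**20 = 1e-20, effectively gone
```

At the start of training the identity-initialized recurrence simply carries information forward unchanged, giving gradient descent something to work with before the weights drift.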

37:02 however the most robust solution to the

37:04 vanishing gradient problem is by

37:07 introducing a slightly more complicated

37:09 uh version of the recurrent neural unit

37:13 to be able to more effectively track and

37:16 handle long-term dependencies in the data

37:19 and this is the idea of gating and

37:22 the idea is by controlling

37:25 selectively the flow of information into

37:28 the neural unit to be able to filter out

37:31 what's not important while maintaining what is important

37:35 and the key and the most popular type of

37:38 recurrent unit that achieves this gated

37:40 computation is called the lstm or long

37:44 short term memory Network

37:46 today we're not going to go into detail

37:49 on lstms their mathematical details

37:52 their operations and so on but I just

37:55 want to convey the key idea and

37:57 intuitive idea about why these lstms are

38:00 effective at tracking long-term

38:04 the core is that the lstm is able to

38:08 control the flow of information through

38:10 these gates to be able to more

38:12 effectively filter out the unimportant

38:15 things and store the important things

38:19 what you can do is implement

38:22 lstms in tensorflow just as you would in

38:26 but the core concept that I want you to

38:28 take away when thinking about the lstm

38:30 is this idea of controlled information flow

38:35 very briefly the way that lstm operates

38:38 is by maintaining a cell state in addition

38:41 to the hidden state of a standard RNN and that cell state is

38:44 independent from what is directly outputted

38:47 the way the cell state is updated is

38:50 according to these Gates that control

38:52 the flow of information

38:54 forgetting and eliminating what is irrelevant

38:57 storing the information that is relevant

39:01 updating the cell state in turn and then

39:04 filtering this this updated cell state

39:07 to produce the predicted output just

39:09 like the standard RNN
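That four-part gated update (forget, store, update, filter) can be sketched as a single LSTM cell step. The weight shapes and random values below are illustrative assumptions, and in practice you would reach for a library layer such as `tf.keras.layers.LSTM` rather than writing this by hand:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                      # illustrative sizes
# one weight matrix per gate, each acting on [x, h_prev] concatenated
Wf, Wi, Wo, Wc = (rng.normal(scale=0.1, size=(n_hid, n_in + n_hid))
                  for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z)                 # forget gate: erase what's irrelevant
    i = sigmoid(Wi @ z)                 # input gate: decide what to store
    c_tilde = np.tanh(Wc @ z)           # candidate new information
    c = f * c_prev + i * c_tilde        # update the cell state
    o = sigmoid(Wo @ z)                 # output gate: filter the cell state
    h = o * np.tanh(c)                  # produce the output
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
for _ in range(5):                      # run over a short random sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c)
print(h.shape, c.shape)
```

Note that the cell state `c` is carried forward separately from the output `h`, which is what gives the gradients the uninterrupted path mentioned next.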

39:12 we can train the lstm using the back

39:15 propagation Through Time algorithm but

39:17 the mathematics of how the lstm is

39:19 defined allows for a completely

39:21 uninterrupted flow of the gradients

39:24 which largely

39:27 eliminates the Vanishing

39:29 gradient problem that I introduced earlier

39:33 again if you're

39:36 interested in learning more about the

39:37 mathematics and the details of lstms

39:40 please come and discuss with us after

39:42 the lectures but again just emphasizing

39:45 the core concept and the intuition

39:46 behind how the lstm operates

39:51 okay so so far

39:54 we've covered a lot of ground

39:57 we've gone through the fundamental

39:58 workings of rnns the architecture the

40:01 training the type of problems that

40:03 they've been applied to and I'd like to

40:05 close this part by considering some

40:08 concrete examples of how you're going to

40:10 use rnns in your software lab

40:14 and that is going to be in the task of

40:16 Music generation where you're going to

40:19 work to build an RNN that can predict

40:22 the next musical note in a sequence and

40:25 use it to generate brand new musical

40:27 sequences that have never been realized

40:31 so to give you an example of just the

40:33 quality and and type of output that you

40:36 can try to aim towards a few years ago

40:39 there was a work that

40:41 trained an RNN on a corpus of classical

40:44 music data and famously there's this

40:47 composer Schubert who uh wrote a famous

40:51 unfinished Symphony that consisted of

40:53 two movements but he was unable to

40:55 finish his Symphony before he

40:59 died leaving the

41:01 third movement unfinished so a few years

41:04 ago a group trained an RNN-based model to

41:08 actually try to generate the third

41:10 movement to Schubert's famous unfinished

41:13 Symphony given the prior two movements so

41:16 I'm going to play the result

41:37 okay I I paused it I interrupted it

41:39 quite abruptly there but if there are

41:42 any classical music aficionados out

41:44 there hopefully you get an appreciation

41:46 for kind of the quality that was

41:48 generated in terms of the music

41:50 quality and this was already from a few

41:53 years ago and as we'll see in the next

41:55 lectures and continuing with this

41:57 theme of generative AI the power of

41:59 these algorithms has advanced

42:01 tremendously since we first played this

42:05 um particularly in you know a whole

42:07 range of domains which I'm excited to

42:10 talk about but not for now okay so

42:13 you'll tackle this problem head on in

42:15 today's lab RNN music generation

42:21 we can think about the simple

42:23 example of input sequence to a single

42:26 output with sentiment classification

42:29 where we can think about for example

42:30 text like tweets and assigning positive

42:34 or negative labels to these text

42:36 examples based on the content that

42:40 is learned by the network

42:43 okay so this kind of concludes the

42:46 portion on rnns and I think it's quite

42:50 remarkable that using all the

42:51 foundational Concepts and operations

42:53 that we've talked about so far we've

42:56 been able to try to build up networks

42:59 that handle this complex problem of

43:00 sequential modeling

43:03 like any technology right an RNN is not

43:06 without limitations so what are some of

43:09 those limitations and what are some

43:10 potential issues that can arise with

43:13 using rnns or even lstms

43:17 the first is this idea of encoding and

43:20 dependency in terms of the

43:24 temporal separation of data that we're processing

43:28 what rnns require is that the

43:30 sequential information is fed in and

43:32 processed time step by time step

43:35 what that imposes is what we call an

43:37 encoding bottleneck right where

43:40 we're trying to encode a lot of content

43:42 for example a very large body of text

43:44 many different words into a single

43:47 output that may be just at the very last

43:50 time step how do we ensure that all that

43:52 information leading up to that time step

43:54 was properly maintained and encoded and

43:57 learned by the network in practice this

43:59 is very very challenging and a lot of

44:01 information can be lost

44:03 another limitation is that by doing this

44:06 time step by time step processing rnns

44:09 can be quite slow there is not really an

44:11 easy way to parallelize that computation

44:15 and finally together these components of

44:18 the encoding bottleneck the requirement

44:20 to process this data step by step

44:23 imposes the biggest problem which is

44:25 when we talk about long memory

44:28 the capacity of the RNN and the lstm is

44:31 really not that long we can't really

44:34 handle data of tens of thousands or

44:37 hundreds of thousands or even Beyond

44:39 sequential information that effectively

44:41 to learn the complete amount of

44:43 information and patterns that are

44:45 present within such a rich data source

44:48 and so because of this very recently

44:51 there's been a lot of attention in how

44:53 we can move Beyond this notion of

44:56 step-by-step recurrent processing to

44:58 build even more powerful architectures

45:00 for processing sequential data

45:03 to understand how we do how we can start

45:06 to do this let's take a big step back

45:08 right think about the high level goal of

45:10 sequence modeling that I introduced at

45:13 given some input a sequence of data we

45:17 want to build a feature encoding and use

45:20 our neural network to learn that and

45:22 then transform that feature encoding

45:24 into a predicted output

45:27 what we saw is that rnns use this notion

45:30 of recurrence to maintain order

45:32 information processing information time

45:37 but as I just mentioned we had these

45:39 three key bottlenecks to rnns

45:42 what we really want to achieve is to go

45:44 beyond these bottlenecks and Achieve

45:47 even higher capabilities in terms of the

45:49 power of these models rather than having

45:52 an encoding bottleneck ideally we want

45:54 to process information continuously as a

45:57 continuous stream of information

45:59 rather than being slow we want to be

46:01 able to parallelize computations to

46:04 speed up processing and finally of

46:06 course our main goal is to really try to

46:09 establish long memory that can build

46:11 nuanced and Rich understanding of sequential data

46:16 the limitation of rnns that's linked to

46:18 all these problems and issues in our

46:21 inability to achieve these capabilities

46:23 is that they require this time step by

46:26 time step processing

46:28 so what if we could move beyond that

46:30 what if we could eliminate this need for

46:32 recurrence entirely and not have to

46:34 process the data time step by time step

46:37 well a first and naive approach would be

46:40 to just squash all the data all the time

46:44 steps together to create a vector that's

46:48 effectively concatenated right the time

46:50 steps are eliminated there's just one

46:54 where we have now one vector input with

46:57 the data from all time points that's

46:59 then fed into the model

47:01 it calculates some feature vector and

47:03 then generates some output which

47:05 hopefully makes sense

47:07 and because we've squashed all these

47:09 time steps together we could simply

47:11 think about maybe building a feed

47:13 forward Network that could do

47:17 well with that we'd eliminate the need

47:20 for recurrence but we still have the

47:22 issues that it's not scalable because

47:25 the dense feed forward Network would

47:28 have to be immensely large defined by

47:30 many many different connections

47:32 and critically we've completely lost our

47:34 order information by just squashing

47:37 everything together blindly there's no

47:39 temporal dependence and we're then stuck

47:42 in our ability to try to establish

47:48 so what if instead we could still think

47:51 about bringing these time steps together

47:53 but be a bit more clever about how we

47:56 try to extract information from this

47:59 the key idea is this idea of being able

48:03 to identify and attend to what is

48:06 important in a potentially sequential

48:09 stream of information

48:11 and this is the notion of attention or self-attention

48:14 which is an extremely extremely powerful

48:17 Concept in modern deep learning and AI I

48:20 cannot overstate

48:21 I cannot

48:23 emphasize enough how powerful this concept is

48:27 attention is the foundational mechanism

48:29 of the Transformer architecture which

48:32 many of you may have heard about

48:34 and the notion of a

48:38 transformer can often be very daunting

48:40 because sometimes they're presented with

48:41 these really complex diagrams or

48:44 deployed in complex applications and you

48:47 may think okay how do I even start to

48:50 at its core though attention the key

48:52 operation is a very intuitive idea and

48:55 we're going to in the last portion of

48:57 this lecture break it down step by step

48:59 to see why it's so powerful and how we

49:02 can use that as part of a larger neural

49:04 network like a Transformer

49:07 specifically we're going to be talking

49:09 and focusing on this idea of

49:12 attending to the most important parts of

49:17 so let's consider an image I think it's

49:20 most intuitive to consider an image

49:22 first this is a picture of Iron Man and

49:25 if our goal is to try to extract

49:27 information from this image of what's important

49:30 what we could do maybe is using our eyes

49:32 naively scan over this image pixel by

49:35 pixel right just going across the image

49:39 however our brains maybe

49:42 internally they're doing some type of

49:43 computation like this but you and I we

49:45 can simply look at this image and be

49:48 able to attend to the important parts

49:50 we can see that it's Iron Man coming at

49:52 you right in the image and then we can

49:55 focus in a little further and say okay

49:56 what are the details about Iron Man that

50:00 what is key what you're doing is your

50:02 brain is identifying which parts

50:04 to attend to and then

50:07 extracting those features that deserve

50:10 the highest attention

50:13 the first part of this problem is really

50:15 the most interesting and challenging one

50:18 and it's very similar to the concept of

50:21 search effectively that's what search is

50:23 doing taking some larger body of

50:26 information and trying to extract and

50:28 identify the important parts

50:30 so let's go there next how does search work

50:33 you're thinking you're in this class how

50:35 can I learn more about neural networks

50:37 well in this day and age one thing you

50:39 may do besides coming here and joining

50:41 us is going to the internet having all

50:44 the videos out there trying to find

50:45 something that matches doing a search

50:49 so you have a giant database like

50:51 YouTube you want to find a video

50:53 you enter in your query deep learning

50:57 and what comes out are some possible results

51:02 for every video in the database there is

51:04 going to be some key information related

51:06 to that video

51:08 let's say the title

51:11 now to do the search

51:13 what the task is to find the overlaps

51:17 between your query and each of these

51:20 titles right the keys in the database

51:23 what we want to compute is a metric of

51:25 similarity and relevance between the

51:27 query and these keys

51:30 how similar are they to our desired

51:33 and we can do this step by step let's

51:36 say this first option of a video about

51:39 the elegant giant sea turtles not that

51:41 similar to our query about deep learning

51:46 the second option introduction to deep learning the first

51:48 introductory lecture on this class yes

51:52 the third option a video about the late

51:54 and great Kobe Bryant not that relevant

51:57 the key operation here is that there is

52:00 this similarity computation bringing the

52:02 query and the key together

52:04 the final step is now that we've

52:07 identified what key is relevant

52:09 extracting the relevant information what

52:11 we want to pay attention to and that's

52:13 the video itself we call this the value

52:16 and because the search is implemented

52:19 well right we've successfully identified

52:21 the relevant video on deep learning that

52:23 you are going to want to pay attention

52:26 and it's this idea this intuition

52:28 of given a query trying to find

52:31 similarity trying to extract the related

52:33 values that form the basis of attention

52:37 and how it works in neural networks like Transformers

52:40 so to go concretely into this right

52:43 let's go back now to our text our

52:49 our goal is to identify and attend to

52:52 features in this input that are relevant

52:54 to the semantic meaning of the sentence

52:58 now first step we have sequence we have

53:02 order we've eliminated recurrence right

53:04 we're feeding in all the time steps all

53:07 at once we still need a way to encode

53:09 and capture this information about order

53:12 and this positional dependence

53:14 how this is done is this idea of

53:17 positional encoding which

53:19 captures some inherent order information

53:22 present in the sequence I'm just going

53:25 to touch on this very briefly but the

53:27 idea is related to this idea of

53:29 embeddings which I introduced earlier

53:32 what is done is a neural network layer

53:34 is used to encode positional information

53:37 that captures the relative relationships

53:40 in terms of order within this sequence

53:45 that's the high level concept right

53:47 we're still being able to process these

53:50 time steps all at once there is no

53:52 notion of time step rather the data is

53:54 singular but still we learned this

53:56 encoding that captures the positional

54:01 now our next step is to take this

54:03 encoding and figure out what to attend

54:05 to exactly like that search operation

54:07 that I introduced with the YouTube

54:09 example extracting a query extracting a

54:12 key extracting a value and relating them

54:15 so we use neural network layers to do

54:19 given this positional encoding what

54:22 attention does is apply a neural

54:26 network layer transforming that first generating the query

54:30 we do this again using a separate neural

54:33 network layer and this is a different

54:35 set of Weights a different set of

54:36 parameters that then transform that

54:38 positional embedding in a different way

54:40 generating a second output the key

54:45 and finally this repeat this operation

54:47 is repeated with a third layer a third

54:50 set of Weights generating the value
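Sketched concretely, those three projections look like this (the shapes and random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
x = rng.normal(size=(seq_len, d_model))    # positionally encoded input

# three different sets of weights, each applied to the same input
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = x @ W_q, x @ W_k, x @ W_v        # query, key, value matrices
print(Q.shape, K.shape, V.shape)           # each (6, 8)
```

One input, three different learned views of it: that asymmetry is what lets the network compare the sequence against itself in the next step.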

54:53 now with these three in hand

54:56 the query the key and the value we can

54:58 compare them to each other to try to

55:00 figure out where in that self-input the

55:04 network should attend to what is important

55:07 and that's the key idea behind this

55:09 similarity metric or what you can think

55:11 of as an attention score what we're

55:14 doing is we're Computing a similarity

55:16 score between a query and the key

55:19 and remember that these query and

55:22 key values are just arrays of numbers we

55:25 can Define them as arrays of numbers

55:27 which you can think of as vectors in

55:31 the query Vector the query values are

55:34 some Vector and

55:36 the key values are some other vector and

55:39 mathematically the way that we can

55:41 compare these two vectors to understand

55:42 how similar they are is by taking the

55:45 dot product and scaling it this captures how

55:49 similar these vectors are whether or

55:52 not they're pointing in the same direction

55:55 this is the similarity metric and if you

55:58 are familiar with a little bit of linear

56:00 algebra this is also known as the cosine

56:04 similarity and this operation functions exactly the same way

56:07 for matrices if we apply this dot

56:10 product operation to our query and key

56:12 matrices we get this

56:15 similarity metric out
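As a tiny worked example of that similarity computation (the vectors here are made up for illustration):

```python
import numpy as np

def cosine(a, b):
    # normalized dot product: 1 when aligned, 0 when orthogonal
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 2.0, 0.5])              # a query vector
k_similar = np.array([2.0, 4.0, 1.0])      # a key pointing the same way
k_unrelated = np.array([-2.0, 1.0, 0.0])   # a key orthogonal to the query
print(cosine(q, k_similar))                # 1.0
print(cosine(q, k_unrelated))              # 0.0
```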

56:19 this is very very key in defining our

56:22 next step Computing the attention

56:24 weighting in terms of what the network

56:26 should actually attend to within this input

56:30 this operation gives us a score which

56:36 tells us how the components of the input data are

56:39 related to each other

56:41 so given a sentence right when we

56:43 compute this similarity score metric we

56:47 can then begin to think of Weights that

56:50 Define the relationship between

56:52 the components of the

56:54 sequential data to each other

56:56 so for example in this example with

56:59 a text sentence he tossed the tennis

57:04 the goal with the score is that words in

57:07 the sequence that are related to each

57:08 other should have high attention weights

57:10 ball related to toss related to tennis

57:15 and this metric itself is our attention

57:18 weighting what we have done is passed that

57:21 similarity score through a softmax

57:24 function which all it does is it

57:27 constrains those values to be between 0

57:29 and 1. and so you can think of these as

57:31 relative scores of relative attention
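A quick sketch of that softmax step (the raw scores are invented for illustration; in the sentence example they would come from the query-key dot products):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())         # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([4.0, 3.5, 0.2])  # raw similarity scores for three words
weights = softmax(scores)
print(weights)                      # each weight in (0, 1)
print(weights.sum())                # weights sum to 1: relative attention
```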

57:35 finally now that we have this metric

57:38 that captures this notion of

57:41 similarity and these internal

57:43 self-relationships we can finally use

57:46 this metric to extract features that are

57:49 deserving of high attention

57:52 and that's the exact final step in this

57:55 self-attention mechanism in that we take

57:58 that attention weighting Matrix multiply

58:01 it by the value and get a transformed version

58:07 of the initial data as our output which

58:10 in turn reflects the features that

58:12 correspond to high attention

58:15 all right let's take a breath let's

58:18 recap what we have just covered so far

58:21 the goal with this idea of

58:23 self-attention the backbone of

58:24 Transformers is to eliminate recurrence and

58:27 attend to the most important features

58:31 in the input in an architecture how this is actually

58:33 deployed is first we take our input data

58:37 we compute these positional encodings

58:40 the neural network layers are applied

58:43 three-fold to transform the positional

58:46 encoding into each of the query key and

58:53 value then we compute the self-attention weight score

58:56 according to the dot product

58:58 operation that we went through prior and

59:01 then self-attend to this

59:05 information to extract features that

59:08 deserve High attention
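Putting the recap together, one self-attention head can be sketched end to end as follows (the shapes, the scaling by the square root of the model dimension, and the random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))        # positionally encoded input

W_q, W_k, W_v = (rng.normal(scale=0.3, size=(d_model, d_model))
                 for _ in range(3))

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v        # three learned projections
    scores = Q @ K.T / np.sqrt(d_model)        # scaled dot-product similarity
    A = softmax_rows(scores)                   # attention weighting matrix
    return A @ V, A                            # extract high-attention features

out, A = self_attention(x)
print(out.shape)                               # (5, 8): transformed output
print(A.sum(axis=1))                           # each row of weights sums to 1
```

A multi-head version would simply run several of these with their own weight sets and concatenate the outputs, which is the "multiple self-attention heads" idea described next.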

59:11 what is so powerful about this approach

59:14 in taking this attention weighting

59:16 putting it together with the value to

59:19 extract High attention features is that

59:21 this operation the scheme that I'm

59:24 showing on the right defines a single

59:26 self-attention head

59:28 and multiple of these self-attention

59:30 heads can be linked together to form

59:32 larger Network architectures where you

59:35 can think about these different heads

59:36 trying to extract different information

59:38 different relevant parts of the input to

59:41 now put together a very very rich

59:43 encoding and representation of the data

59:46 that we're working with

59:48 intuitively back to our Ironman example

59:51 what this idea of multiple

59:52 self-attention heads can amount to is

59:55 that different Salient features and

59:58 Salient information in the data is

01:00:00 extracted first maybe you consider Iron

01:00:03 Man as attention head one and you may have

01:00:06 additional attention heads that are

01:00:08 picking out other relevant parts of the

01:00:10 data which maybe we did not realize

01:00:12 before for example the building or the

01:00:15 spaceship in the background that's

01:00:19 this is a key building block of many

01:00:22 many powerful architectures

01:00:25 that are out there today I again

01:00:27 cannot emphasize enough how powerful

01:00:30 this mechanism is

01:00:32 and indeed this this backbone idea of

01:00:35 self-attention that you just built up

01:00:37 understanding of is the key operation of

01:00:40 some of the most powerful neural

01:00:42 networks and deep learning models out

01:00:44 there today ranging from the very

01:00:46 powerful language models like GPT-3 which

01:00:50 are capable of synthesizing natural

01:00:53 language in a very human-like fashion

01:00:55 digesting large bodies of text

01:00:57 information to understand relationships

01:01:01 to models that are being deployed for

01:01:04 extremely impactful applications in

01:01:07 biology and Medicine such as AlphaFold

01:01:10 2 which uses this notion of

01:01:12 self-attention to look at data of

01:01:15 protein sequences and be able to predict

01:01:17 the three-dimensional structure of a

01:01:19 protein just given sequence information

01:01:23 and all the way even now to computer

01:01:25 vision which will be the topic of our

01:01:27 next lecture tomorrow where the same

01:01:30 idea of attention that was initially

01:01:32 developed in sequential data

01:01:34 applications has now transformed the

01:01:36 field of computer vision and again using

01:01:39 this key concept of attending to the

01:01:42 important features in an input to build

01:01:44 these very rich representations of

01:01:46 complex High dimensional data

01:01:49 okay so that concludes lectures for

01:01:53 today I know we have covered a lot of

01:01:55 territory in a pretty short amount of

01:01:57 time but that is what this boot camp

01:01:59 program is all about so hopefully today

01:02:02 you've gotten a sense of the foundations

01:02:04 of neural networks in the lecture with

01:02:06 Alexander we talked about rnns how

01:02:09 they're well suited for sequential data

01:02:11 how we can train them using back propagation Through Time

01:02:14 how we can deploy them for different

01:02:15 applications and finally how we can move

01:02:18 Beyond recurrence to build this idea of

01:02:20 self-attention for building increasingly

01:02:23 powerful models for deep learning in

01:02:25 sequence modeling

01:02:27 all right hopefully you enjoyed we have

01:02:31 um about 45 minutes left for the for the

01:02:34 lab portion and open Office hours in

01:02:36 which we welcome you to ask us questions

01:02:38 of us and the TAs and to start work

01:02:41 on the labs the information for the labs

01:02:44 is up there thank you so much for