00:09 hello everyone and I hope you enjoyed

00:12 Alexander's first lecture I'm Ava and in

00:15 this second lecture lecture two we're

00:17 going to focus on this question of

00:19 sequence modeling how we can build

00:21 neural networks that can handle and

00:23 learn from sequential data

00:25 so in Alexander's first lecture he

00:27 introduced the essentials of neural

00:29 networks starting with perceptrons

00:31 building up to feed forward models and

00:33 how you can actually train these models

00:35 and start to think about deploying them

00:38 now we're going to turn our attention to

00:41 specific types of problems that involve

00:43 sequential processing of data and we'll

00:46 realize why these types of problems

00:48 require a different way of implementing

00:50 and building neural networks from what we've seen so far

00:55 and I think some of the components in

00:57 this lecture traditionally can be a bit

00:59 confusing or daunting at first but what

01:01 I really really want to do is to build

01:03 this understanding up from the

01:05 foundations walking through step by step

01:07 developing intuition all the way to

01:09 understanding the math and the

01:11 operations behind how these networks work

01:15 okay so let's get started

01:22 to begin I first want to motivate what

01:24 exactly we mean when we talk about

01:26 sequential data or sequential modeling

01:28 so we're going to begin with a really

01:30 simple intuitive example

01:32 let's say we have this picture of a ball

01:34 and your task is to predict where this

01:37 ball is going to travel to next

01:40 now if you don't have any prior

01:42 information about the trajectory of the

01:44 ball, its motion, its history, any guess

01:46 or prediction about its next position is

01:49 going to be exactly that a random guess

01:53 if however in addition to the current

01:55 location of the ball I gave you some

01:57 information about where it was moving in

01:59 the past now the problem becomes much

02:01 easier and I think hopefully we can all

02:04 agree that the most likely

02:07 next prediction is that this ball is

02:09 going to move forward to the right

02:13 so this is a really reduced

02:15 down bare-bones intuitive example but

02:18 the truth is that beyond this, sequential

02:20 data is really all around us right as

02:23 I'm speaking the words coming out of my

02:25 mouth form a sequence of sound waves

02:28 that Define audio which we can split up

02:30 to think about in this sequential manner

02:33 similarly text language can be split up

02:36 into a sequence of characters

02:38 or a sequence of words

02:41 and there are many many more examples in

02:43 which sequential processing sequential

02:45 data is present right from medical

02:48 signals like EKGs to financial markets

02:51 and projecting stock prices to

02:53 biological sequences encoded in DNA to

02:56 patterns in the climate to patterns of

02:58 motion and many more

03:00 and so already hopefully you're getting

03:03 a sense of what these types of questions

03:04 and problems may look like and where

03:06 they are relevant in the real world

03:10 when we consider applications of

03:12 sequential modeling in the real world we

03:14 can think about a number of different

03:15 kind of problem definitions that we can

03:17 have in our Arsenal and work with

03:20 in the first lecture Alexander

03:22 introduced the Notions of classification

03:24 and the notion of regression where he

03:27 talked about and we learned about feed

03:29 forward models that can operate one to

03:31 one in this fixed and static setting

03:33 right given a single input predict a

03:37 single output, like the binary classification example of

03:39 will you succeed or pass this class

03:42 here there's no notion of

03:44 sequence there's no notion of time

03:46 now if we introduce this idea of a

03:49 sequential component we can handle

03:51 inputs that may be defined temporally

03:54 and potentially also produce a

03:56 sequential or temporal output so for

04:00 one example we can consider text

04:02 language and maybe we want to generate

04:04 one prediction given a sequence of text

04:07 classifying whether a message is a

04:10 positive sentiment or a negative one

04:13 conversely we could have a single input

04:16 let's say an image and our goal may be

04:19 now to generate text or a sequential

04:22 description of this image right given

04:24 this image of a baseball player throwing

04:26 a ball can we build a neural network

04:28 that generates a language caption describing it

04:32 finally we can also consider

04:34 applications and problems where we have

04:36 sequence in, sequence out, for example if we

04:39 want to translate between two languages

04:41 and indeed this type of thinking in this

04:44 type of Architecture is what powers the

04:47 task of machine translation in your

04:49 phones in Google Translate and many other applications

04:54 so hopefully right this has given you a

04:57 picture of what sequential data looks

04:58 like what these types of problem

05:00 definitions may look like and from this

05:03 we're going to start and build up our

05:04 understanding of what neural networks we

05:07 can build and train for these types of problems

05:11 so first we're going to begin with the

05:14 notion of recurrence and build up from

05:16 that to Define recurrent neural networks

05:19 and in the last portion of the lecture

05:21 we'll talk about the

05:22 mechanisms underlying the Transformer

05:25 architectures that are very

05:27 powerful in terms of handling sequential

05:29 data but as I said at the beginning

05:31 right the theme of this lecture is

05:34 building up that understanding step by

05:36 step starting with the fundamentals and

05:39 so to do that we're going to go back

05:41 revisit the perceptron and move forward

05:45 right so as Alexander introduced when

05:48 we studied the perceptron in lecture one

05:51 the perceptron is defined by this single

05:54 neural operation where we have some set

05:57 of inputs let's say X1 through XM and

06:00 each of these inputs is multiplied by

06:03 a corresponding weight and passed through a

06:05 non-linear activation function that then

06:08 generates a predicted output y hat

06:11 here we can have multiple inputs coming

06:13 in to generate our output but still

06:16 these inputs are not thought of as

06:18 points in a sequence or time steps in a sequence

06:22 even if we scale this perceptron and

06:24 start to stack multiple perceptrons

06:27 together to Define these feed forward

06:28 neural networks we still don't have this

06:31 notion of temporal processing or

06:33 sequential information even though we

06:36 are able to translate and convert

06:38 multiple inputs apply these weight

06:40 operations apply this non-linearity to

06:43 then Define multiple predicted outputs

06:47 so taking a look at this diagram right

06:49 on the left in blue you have inputs on

06:52 the right in purple you have these

06:53 outputs and the green defines the

06:56 single neural network layer that's

06:58 transforming these inputs to the outputs

07:01 Next Step I'm going to just simplify

07:02 this diagram I'm going to collapse down

07:05 those stacked perceptrons together and

07:08 depict this with this green block

07:11 still it's the same operation going on

07:13 right we have an input vector

07:16 being transformed to predict this output

07:20 now what I've introduced here which you

07:23 may notice is this new variable T right

07:26 which I'm using to denote a single time step

07:29 we are considering an input at a single

07:31 time step and using our neural network

07:34 to generate a single output

07:36 corresponding to that input

07:38 how could we start to extend and build

07:40 off this to now think about multiple

07:43 time steps and how we could potentially

07:45 process a sequence of information

07:48 well what if we took this diagram all

07:51 I've done is just rotated it 90 degrees

07:54 where we still have this input vector

07:56 being fed in producing an output

07:59 vector and what if we can make a copy of

08:02 this network right and just do this

08:05 operation multiple times to try to

08:07 handle inputs that are fed in

08:09 corresponding to different times right

08:11 we have an individual time step starting from the first one

08:16 and we can do the same thing the same

08:18 operation for the next time step again

08:21 treating that as an isolated instance

08:24 and keep doing this repeatedly

08:27 and what you'll notice hopefully is all

08:29 these models are simply copies of each

08:31 other just with different inputs at each

08:33 of these different time steps

08:36 and we can make this concrete right in

08:38 terms of what this functional

08:39 transformation is doing

08:41 the predicted output at a particular

08:43 time step y hat of T is a function of

08:48 the input at that time step X of T and

08:51 that function is what is learned and

08:53 defined by our neural network weights

08:56 okay so I've told you that our goal here

08:59 is trying to understand sequential

09:01 data do sequential modeling but what

09:04 could be the issue with what this

09:06 diagram is showing and what I've shown

09:15 that's exactly right so the

09:17 student's answer was that X1 or it could

09:20 be related to X naught and you have this

09:22 temporal dependence but these isolated

09:24 replicas don't capture that at all and that

09:29 answers the question perfectly right

09:32 here a predicted output at a later time

09:37 step can depend precisely on inputs at previous time

09:39 steps if this is truly a sequential

09:41 problem with this temporal dependence

09:45 so how could we start to reason about

09:47 this how could we Define a relation that

09:50 links the Network's computations at a

09:52 particular time step to Prior history

09:55 and memory from previous time steps

09:58 well what if we did exactly that right

10:01 what if we simply linked the computation

10:04 and the information understood by the

10:07 network to these other replicas via what

10:11 we call a recurrence relation

10:13 what this means is that something about

10:15 what the network is Computing at a

10:17 particular time is passed on to those

10:20 later time steps and we Define that

10:22 according to this variable H which we

10:25 call this internal state or you can

10:27 think of it as a memory term

10:29 that's maintained by the neurons and the

10:31 network and it's this state that's being

10:34 passed time step to time step as we read

10:37 in and process this sequential data

10:42 what this means is that the Network's

10:45 output its predictions its computations

10:47 is not only a function of the input data

10:51 but also we have this other variable H

10:54 which captures this notion of state,

10:56 captures this notion of memory

10:59 that's being computed by the network and

11:02 passed on over time

11:04 specifically right to walk through this

11:06 our predicted output y hat of T depends

11:10 not only on the input at that time step but also

11:12 this past memory this past state

11:15 and it is this linkage of temporal

11:19 dependence and recurrence that defines

11:21 this idea of a recurrent neural unit

11:24 what I've shown is this connection

11:27 that's being unrolled over time but we

11:30 can also depict this relationship

11:32 according to a loop

11:34 this computation to this internal State

11:37 variable h of T is being iteratively

11:39 updated over time and that's fed back

11:42 into the neuron the neurons computation

11:45 in this recurrence relation

11:49 this is how we Define these recurrent

11:51 cells that comprise recurrent neural networks

11:56 and the key here is that we have

11:59 this idea of a recurrence relation

12:00 that captures the cyclic temporal dependency

12:06 and indeed it's this idea that is really

12:08 the intuitive Foundation behind

12:10 recurrent neural networks or rnns and so

12:13 let's continue to build up our

12:15 understanding from here and move forward

12:17 into how we can actually Define the RNN

12:20 operations mathematically and in code

12:23 so all we're going to do is formalize

12:25 this relationship a little bit more

12:27 the key idea here is that the RNN is

12:30 maintaining the state and it's updating

12:32 the state at each of these time steps as

12:35 the sequence is processed

12:38 we Define this by applying this

12:40 recurrence relation

12:41 and what the recurrence relation

12:43 captures is how we're actually updating

12:45 that internal State h of t

12:48 specifically that state update is

12:51 exactly like any other neural network

12:52 operator operation that we've introduced

12:55 so far where again we're learning a function

12:59 defined by a set of weights W

13:01 we're using that function to update the internal state h of t

13:06 and the additional component the new piece

13:09 here is that that function depends both

13:11 on the input and the prior time step's state h of t minus 1

13:16 and what you'll note is that this

13:19 function f sub W is defined by a set of

13:22 weights and it's the same set of Weights

13:24 the same set of parameters that are used

13:27 time step to time step as the recurrent

13:30 neural network processes this temporal

13:32 information the sequential data
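
Written out as an equation (this is just the relation described above, not new math), the state update applied at every time step is

```latex
h_t = f_W\left(x_t,\ h_{t-1}\right)
```

where f_W is one and the same function, defined by the shared weights W, applied at every time step, and the prediction y-hat of t is computed from the state h of t.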

13:35 okay so the key idea here hopefully is

13:38 coming through is that this RNN

13:41 state update operation takes this state

13:44 and updates it at each time step as a sequence is processed

13:48 we can also translate this to how we can

13:52 think about implementing rnns in Python

13:56 code or rather pseudocode hopefully

13:59 getting a better understanding and

14:00 intuition behind how these networks work

14:03 so what we do is we just start by initializing our RNN

14:07 for now this is abstracted away

14:10 and we initialize its hidden

14:12 State and we have some sentence right

14:15 let's say this is our input of Interest

14:16 where we're interested in predicting

14:19 maybe the next word that's occurring in this sentence

14:22 what we can do is loop through these

14:25 individual words in the sentence that

14:27 Define our temporal input and at each

14:30 step as we're looping, each word

14:32 in that sentence is fed into the RNN

14:36 model along with the previous hidden state

14:40 and this is what generates a prediction

14:42 for the next word and updates the RNN's hidden state

14:47 finally our prediction for the final

14:49 word in the sentence the word that we're

14:51 missing is simply the RNN's output after

14:54 all the prior words have been fed in
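
The loop just described can be sketched as runnable Python. Note that `ToyRNN` here is a made-up stand-in (a deterministic toy update, not a trained network and not the lab's model); only the control flow mirrors the pseudocode: each word is fed in together with the previous hidden state, and the prediction for the missing word is whatever the model outputs after the last prior word.

```python
# A toy stand-in for an RNN, illustrating the word-by-word loop.
# The state update is an arbitrary deterministic function, not a
# trained network; only the control flow mirrors the lecture.

class ToyRNN:
    def __init__(self, state_size=4):
        self.state_size = state_size

    def initial_state(self):
        return [0.0] * self.state_size

    def __call__(self, word, hidden_state):
        # Fold the word into the state (placeholder for the real
        # update W_xh x_t + W_hh h_{t-1} followed by a non-linearity).
        new_state = [
            (h + len(word) * (i + 1)) % 7.0
            for i, h in enumerate(hidden_state)
        ]
        # The "prediction" is a summary of the state (placeholder for W_hy h_t).
        prediction = sum(new_state)
        return prediction, new_state


my_rnn = ToyRNN()
hidden_state = my_rnn.initial_state()
sentence = ["I", "love", "recurrent", "neural"]

# Feed each word in along with the previous hidden state.
for word in sentence:
    prediction, hidden_state = my_rnn(word, hidden_state)

# After the loop, `prediction` is the output for the missing next word,
# produced only after all the prior words have updated the hidden state.
next_word_prediction = prediction
```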

14:59 so this is really breaking down how the

15:01 RNN Works how it's processing the

15:03 sequential information

15:05 and what you've noticed is that the RNN

15:08 computation includes both this update to

15:10 the hidden State as well as generating

15:12 some predicted output at the end that is

15:15 our ultimate goal that we're interested in

15:17 and so to walk through this how we're

15:20 actually generating the output

15:23 what the RNN computes is, given some input x of t,

15:27 it then performs this update to the hidden state

15:31 and this update to the hidden state is

15:34 just a standard neural network operation

15:36 just like we saw in the first lecture

15:38 where it consists of taking a weight

15:41 Matrix multiplying that by the previous

15:44 hidden State taking another weight

15:46 Matrix multiplying that by the input at

15:49 a time step and applying a non-linearity

15:52 and in this case right because we have

15:54 these two input streams the input data X

15:57 of T and the previous state H we have

16:01 these two separate weight matrices that

16:03 the network is learning over the course of training

16:06 that comes together we apply the

16:09 non-linearity and then we can generate

16:12 an output at a given time step by just

16:14 modifying the hidden state

16:17 using a separate weight matrix to transform

16:20 this value and then generate a predicted output
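
In equation form (matching the description above; tanh is one common choice of non-linearity, and W_hh, W_xh, W_hy name the three weight matrices):

```latex
h_t = \tanh\!\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right),
\qquad
\hat{y}_t = W_{hy}\, h_t
```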

16:24 and that's what there is to it right

16:26 that's how the RNN in its single

16:29 operation updates both the hidden State

16:32 and also generates a predicted output

16:36 okay so now this gives you the internal

16:39 working of how the RNN computation

16:41 occurs at a particular time step let's

16:45 next think about how this looks like

16:46 over time and Define the computational

16:50 graph of the RNN as being unrolled or

16:53 expanded across time

16:56 so far the dominant way I've been

16:58 showing the RNNs is according to this

17:01 loop-like diagram on the left, right,

17:03 feeding back in on itself

17:05 another way we can visualize and think

17:07 about rnns is as kind of unrolling this

17:11 recurrence over time over the individual

17:14 time steps in our sequence

17:16 what this means is that we can take the

17:19 network at our first time step

17:21 and continue to iteratively unroll it

17:24 across the time steps

17:26 going on forward all the way until we

17:29 process all the time steps in our input

17:32 now we can formalize this diagram a

17:35 little bit more by defining the weight

17:37 matrices that connect the inputs to the

17:41 hidden State update

17:43 and the weight matrices that are used to

17:46 update the internal State across time

17:49 and finally the weight matrices that

17:51 define the update to generate a predicted output

17:56 now recall that in all these cases right

18:00 for all these three weight matrices at

18:02 all these time steps we are simply

18:04 reusing the same weight matrices right

18:07 so it's one set of parameters one set of

18:09 weight matrices that just process this

18:12 information sequentially

18:14 now you may be thinking okay so how do

18:17 we actually start to be thinking about

18:19 how to train the RNN how to define the

18:22 loss given that we have this temporal

18:24 processing in this temporal dependence

18:27 well a prediction at an individual time

18:30 step will simply amount to a computed

18:33 loss at that particular time step

18:36 so now we can compare those predictions

18:38 time step by time step to the true label

18:41 and generate a loss value for those

18:43 time steps and finally we can get our

18:46 total loss by taking all these

18:48 individual loss terms together and

18:51 summing them defining the total loss for

18:54 a particular input to the RNN
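
As an equation: if L_t denotes the loss comparing the prediction y-hat of t against the true label y of t at step t, the total loss over a sequence of length T is simply the sum

```latex
L = \sum_{t=1}^{T} L_t\!\left(\hat{y}_t,\ y_t\right)
```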

18:58 now we can walk through an example of how

19:00 we implement this RNN in tensorflow

19:02 starting from scratch

19:04 the RNN can be defined as a layer

19:07 operation and a layer class that

19:09 Alexander introduced in the first

19:11 lecture and so we can Define it

19:13 according to an initialization of weight

19:16 matrices initialization of a hidden

19:19 state which commonly amounts to

19:21 initializing these two to zero

19:25 next we can Define how we can actually

19:27 pass forward through the RNN Network to

19:31 process a given input X

19:33 and what you'll notice is in this

19:35 forward operation the computations are

19:38 exactly like we just walked through we

19:40 first update the hidden state

19:42 according to that equation we introduced

19:46 and then generate a predicted output

19:48 that is a transformed version of that

19:52 and finally at each time step we return

19:54 both the output and the updated

19:57 hidden State as this is what is

19:59 necessary to be stored to continue this

20:02 RNN operation over time
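
The lecture builds this as a TensorFlow layer; as a framework-free sketch of the same forward pass (my own minimal version with made-up dimensions, not the lab's code), the weight matrices, zero-initialized hidden state, and call method returning both output and state look like this:

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def vadd(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]

class MyRNNCell:
    """Minimal RNN cell mirroring the update from the lecture:
    h_t = tanh(W_hh h_{t-1} + W_xh x_t),  y_t = W_hy h_t."""

    def __init__(self, input_dim, hidden_dim, output_dim, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            # Small random weights stand in for a real initializer.
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.W_hh = init(hidden_dim, hidden_dim)  # state-to-state weights
        self.W_xh = init(hidden_dim, input_dim)   # input-to-state weights
        self.W_hy = init(output_dim, hidden_dim)  # state-to-output weights
        self.h = [0.0] * hidden_dim               # hidden state, zeros

    def call(self, x):
        # Update the hidden state, then compute the output from it.
        self.h = [math.tanh(a) for a in vadd(matvec(self.W_hh, self.h),
                                             matvec(self.W_xh, x))]
        y = matvec(self.W_hy, self.h)
        # Return both: both are needed to continue the RNN over time.
        return y, self.h

cell = MyRNNCell(input_dim=3, hidden_dim=5, output_dim=2)
for x_t in [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]:
    y_t, h_t = cell.call(x_t)
```

TensorFlow's built-in `tf.keras.layers.SimpleRNN(rnn_units)` implements essentially this cell, with initialization and batching handled for you.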

20:04 what is very convenient is that although

20:08 you could define your RNN network and

20:09 your RNN layer completely from scratch

20:11 TensorFlow abstracts this

20:14 operation away for you so you can simply

20:16 define a SimpleRNN according to this

20:20 call that you're seeing here

20:23 which makes all the

20:25 computations very efficient and very easy to use

20:28 and you'll actually get practice

20:30 implementing and working with RNNs

20:33 in today's software lab

20:36 okay so that gives us the understanding

20:39 of RNNs and going back to what I

20:43 described as kind of the problem setups

20:45 or the problem definitions at the

20:46 beginning of this lecture

20:48 I just want to remind you of the types

20:50 of sequence modeling problems on which

20:52 we can apply rnns right we can think

20:56 about taking a sequence of inputs

20:57 producing one predicted output at the

21:00 end of the sequence

21:02 we can think about taking a static

21:04 single input and trying to generate text

21:06 according to that single input

21:11 and finally we can think about taking a

21:13 sequence of inputs producing a

21:15 prediction at every time step in that sequence

21:18 and then doing this sequence to sequence

21:21 type of prediction and translation

21:28 yeah so this will be

21:32 the software lab today which will focus

21:35 on this problem of of many to many

21:38 processing and many to many sequential

21:40 modeling taking a sequence going to a sequence

21:44 what is common and what is universal

21:46 across all these types of problems and

21:49 tasks that we may want to consider with

21:51 RNNs is what I like to think about as what

21:54 type of design criteria we need to build

21:57 a robust and reliable Network for

22:00 processing these sequential modeling problems

22:03 what I mean by that is what are the

22:05 characteristics what are the design

22:08 requirements that the RNN needs to

22:10 fulfill in order to be able to handle

22:13 sequential data effectively

22:16 the first is that sequences can be of

22:18 different lengths right they may be

22:21 short they may be long we want our RNN

22:23 model or our neural network model in

22:25 general to be able to handle sequences

22:27 of variable lengths

22:29 secondly and really importantly is as we

22:33 were discussing earlier that the whole

22:35 point of thinking about things through

22:36 the lens of sequence is to try to track

22:39 and learn dependencies in the data that

22:41 are related over time

22:43 so our model really needs to be able to

22:45 handle those different dependencies

22:47 which may occur at times that are very

22:50 very distant from each other

22:53 next right sequence is all about order

22:56 right there's some notion of how current

22:59 inputs depend on prior inputs and the

23:02 specific order of the observations we

23:04 see makes a big effect on what

23:08 prediction we may want to generate at a given time

23:11 and finally in order to be able to

23:14 process this information

23:16 effectively our Network needs to be able

23:19 to do what we call parameter sharing

23:21 meaning that given one set of Weights

23:24 that set of weights should be able to

23:26 apply to different time steps in the

23:27 sequence and still result in a

23:30 meaningful prediction

23:31 and so today we're going to focus on how

23:34 recurrent neural networks meet these

23:36 design criteria and how these design

23:38 criteria motivate the need for even more

23:41 powerful architectures that can

23:43 outperform rnns in sequence modeling

23:46 so to understand these criteria very

23:49 concretely we're going to consider a

23:52 sequence modeling problem where given

23:54 some series of words our task is just to

23:57 predict the next word in that sentence

24:00 so let's say we have this sentence this

24:03 morning I took my cat for a walk

24:05 and our task is to predict the last word

24:09 in the sentence given the prior words

24:10 this morning I took my cat for a blank

24:15 our goal is to take our RNN, define it, and

24:20 put it to the test on this task

24:22 what is our first step to doing this

24:25 well the very very first step before we

24:28 even think about defining the RNN is how

24:31 we can actually represent this

24:33 information to the network in a way that

24:35 it can process and understand

24:39 if we have a model that is processing

24:42 this data processing this text-based information

24:45 and wanting to generate text as the output

24:48 a problem can arise in that the neural

24:51 network itself is not equipped to handle

24:54 language explicitly right

24:56 remember that neural networks are simply

24:58 functional operators they're just

25:00 mathematical operations and so we can't

25:03 expect it right it doesn't have an

25:04 understanding from the start of what a

25:07 word is or what language means which

25:09 means that we need a way to represent

25:11 language numerically so that it can be

25:14 passed in to the network to process

25:19 so what we do is that we need to define

25:21 a way to translate this text, this

25:24 language information into a numerical

25:27 encoding a vector an array of numbers

25:30 that can then be fed in to our neural

25:33 network and generating a vector of

25:36 numbers as its output

25:39 so now right this raises the question of

25:41 how do we actually Define this

25:43 transformation how can we transform

25:45 language into this numerical encoding

25:48 the key solution and the key way that a

25:50 lot of these networks work is this

25:53 notion and concept of embedding

25:55 what that means is it's some

25:58 transformation that takes

26:00 indices or something that can be

26:03 represented as an index into a numerical

26:07 Vector of a given size

26:10 so if we think about how this idea of

26:12 embedding works for language data let's

26:14 consider a vocabulary of words that we

26:17 can possibly have in our language

26:19 and our goal is to be able to map these

26:22 individual words in our vocabulary to a

26:26 numerical Vector of fixed size

26:29 one way we could do this is by defining

26:31 all the possible words that could occur

26:35 and then indexing them assigning an index

26:38 label to each of these distinct words

26:41 a corresponds to index one cat corresponds

26:44 to index two so on and so forth and this

26:47 indexing Maps these individual words to

26:50 numbers unique indices

26:53 what these indices can then Define is

26:55 what we call a embedding vector

26:58 which is a fixed length encoding where

27:00 we've simply indicated a one value

27:04 at the index for that word

27:08 and this is called a one-hot embedding

27:10 where we have this fixed length Vector

27:13 of the size of our vocabulary and each

27:16 instance of the vocabulary corresponds

27:18 to a one at the corresponding index
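
As a concrete sketch with a tiny made-up vocabulary (indices here are zero-based, unlike the one-based counting in the example above), the word-to-index-to-one-hot pipeline looks like this:

```python
# Map each word in a (toy) vocabulary to a unique index, then to a
# one-hot vector: all zeros except a single 1 at that word's index.
vocab = ["a", "cat", "dog", "walk"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)          # fixed length = vocabulary size
    vec[word_to_index[word]] = 1    # single 1 at the word's index
    return vec

cat_vector = one_hot("cat")  # [0, 1, 0, 0]
```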

27:24 this is a very sparse way to do this and

27:28 it's simply based on

27:30 purely on the count index

27:32 there's no notion of semantic

27:35 information or meaning captured in

27:38 this vector-based encoding

27:40 alternatively what is very commonly done

27:42 is to actually use a neural network to

27:45 learn an encoding, to learn an embedding

27:47 and the goal here is that we can learn a

27:50 neural network that then captures some

27:52 inherent meaning or inherent semantics

27:54 in our input data and Maps related words

27:57 or related inputs closer together in

28:00 this embedding space meaning that

28:02 they'll have numerical vectors that are

28:04 more similar to each other
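
Mechanically, a learned embedding is just a lookup into a trainable matrix: a word's index selects a row, and training nudges the rows for related words closer together. A minimal illustration with hand-picked (not learned) numbers:

```python
# Toy embedding table: each word maps to a dense vector. In a real model
# these are rows of a learned weight matrix selected by word index; the
# numbers below are hand-picked for illustration only.
embedding = {
    "cat":  [0.9, 0.1],
    "dog":  [0.8, 0.2],
    "walk": [-0.7, 0.9],
}

def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Semantically related words sit closer together in the embedding space.
cat_dog = distance(embedding["cat"], embedding["dog"])
cat_walk = distance(embedding["cat"], embedding["walk"])
```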

28:07 this concept is really really

28:09 foundational to how these

28:11 sequence modeling networks work and how

28:15 neural networks work in general

28:17 okay so with that in hand we can go back

28:20 to our design criteria

28:21 thinking about the capabilities that we

28:23 desire first we need to be able to

28:26 handle variable length sequences

28:29 if we again want to predict the next

28:31 word in the sequence we can have short

28:32 sequences we can have long sequences we

28:34 can have even longer sentences and our

28:37 key task is that we want to be able to

28:39 track dependencies across all these

28:43 and what we need what we mean by

28:45 dependencies is that there could be

28:46 information very early on in a sequence

28:50 but that may not be relevant or come

28:53 up until very much later in the

28:55 sequence and we need to be able to track

28:58 these dependencies and maintain this

29:00 information in our Network

29:03 dependencies relate to order and

29:05 sequences are defined by their order and

29:08 we know that same words in a completely

29:11 different order have completely

29:12 different meanings right so our model

29:15 needs to be able to handle these

29:17 differences in order and the differences

29:19 in length that could result in different predictions

29:25 okay so hopefully that example going

29:28 through the example in text

29:30 motivates how we can think about

29:32 transforming input data into a numerical

29:35 encoding that can be passed into the RNN

29:38 and also what are the key criteria that

29:41 we want to meet in handling these sequences

29:46 so far we've painted the picture of

29:49 RNNs how they work the intuition their

29:51 mathematical operations and what are the

29:54 key criteria that they need to meet the

29:57 final piece to this is how we actually

29:59 train and learn the weights in the RNN

30:02 and that's done through the backpropagation

30:04 algorithm with a bit of a twist to just

30:07 handle sequential information

30:10 if we go back and think about how we

30:13 train feed forward neural network models

30:16 the steps break down in thinking through

30:19 starting with an input where we first

30:21 take this input and make a forward pass

30:24 through the network going from input to output

30:28 the key to back propagation that

30:29 Alexander introduced was this idea of

30:32 taking the prediction and back

30:33 propagating gradients back through the network

30:37 and using this operation to then Define

30:40 and update the loss with respect to each

30:43 of the parameters in the network in

30:45 order to gradually adjust the parameters

30:48 the weights of the network in order to

30:50 minimize the overall loss

30:53 now with rnns as we walked through

30:55 earlier we have this temporal unrolling

30:58 which means that we have these

30:59 individual losses across the individual

31:01 steps in our sequence that sum together

31:04 to comprise the overall loss

31:07 what this means is that when we do back

31:11 we have to now instead of back

31:14 propagating errors through a single feedforward pass

31:17 back propagate the loss through each of

31:19 these individual time steps

31:21 and after we back propagate loss through

31:24 each of the individual time steps we

31:26 then do that across all time steps all

31:30 the way from our current time time T

31:32 back to the beginning of the sequence

31:35 and this is why this

31:38 algorithm is called back propagation

31:40 Through Time right because as you can

31:42 see the data and the predictions and

31:45 the resulting errors are fed back in

31:47 time all the way from where we are

31:49 currently to the very beginning of the

31:51 input data sequence

31:55 so backpropagation through time is

31:58 actually a very tricky algorithm to implement

32:02 and the reason for this is if we take a

32:04 close look at how gradients flow

32:07 across the RNN what this algorithm

32:10 involves many repeated

32:12 computations and multiplications of

32:15 these weight matrices against each other

32:18 in order to compute the gradient with

32:20 respect to the very first time step we

32:24 have to make many of these

32:25 multiplicative repeats of the weight

32:29 why might this be problematic well if

32:33 this weight Matrix W is very very big

32:37 what this can result in is

32:39 what we call the exploding gradient

32:41 problem where our gradients that we're

32:43 trying to use to optimize our Network do

32:46 exactly that they blow up they explode

32:49 and they get really big making it

32:51 infeasible to train the network

32:56 what we do to mitigate this is a pretty

32:59 simple solution called gradient clipping

33:01 which effectively scales back these very

33:03 big gradients to try to constrain them

33:05 in a more restricted range
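A minimal sketch of gradient clipping by norm. The threshold value is an arbitrary illustrative choice and `clip_by_norm` is a hypothetical helper written out by hand here; libraries such as TensorFlow ship equivalent utilities:

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its norm never exceeds max_norm (direction kept)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

exploded = np.array([300.0, -400.0])           # norm 500: a blown-up gradient
clipped = clip_by_norm(exploded, max_norm=5.0)
print(clipped)                                 # [3. -4.], norm exactly 5
```

The gradient's direction is unchanged; only its magnitude is scaled back, which is why this simple fix is usually enough for the exploding case.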

33:10 conversely we can have the instance

33:13 where the weight matrices are very very

33:15 small and if these weight matrices are small

33:19 we end up with a very very small value

33:21 at the end as a result of these repeated

33:23 weight Matrix computations and these

33:28 multiplications and this is a very real

33:31 problem in rnns in particular where we

33:34 can run into this problem called the

33:36 Vanishing gradient where now your

33:38 gradient has just dropped down close to

33:40 zero and again you can't train the network
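To make the repeated-multiplication intuition concrete, here is a toy numerical illustration (the scalar weights are an illustrative stand-in for the magnitude of the weight matrix, an assumption for this sketch): gradients chained over T time steps scale roughly like w to the power T.

```python
# Repeated multiplication over T time steps scales gradients like w**T:
# it explodes when |w| > 1 and vanishes when |w| < 1.
T = 50
big, small = 1.5, 0.5          # illustrative stand-ins for weight magnitude
print(big ** T)                # roughly 6.4e8: exploding gradient
print(small ** T)              # roughly 8.9e-16: vanishing gradient
```

Even modest deviations from 1 blow up or wash out entirely over a 50-step sequence, which is why long sequences are where RNN training breaks down.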

33:43 now there are particular tools that we

33:46 can implement to try to

33:48 mitigate the vanishing

33:50 gradient problem and we'll touch on

33:52 each of these three solutions briefly

33:55 first being how we can Define the

33:58 activation function in our Network how

34:01 we can initialize the weights and how we can change the network

34:02 architecture itself to try to better

34:04 handle this Vanishing gradient problem

34:10 I want to take just one step back to

34:12 give you a little more intuition about

34:14 why Vanishing gradients can be a real

34:17 issue for recurrent neural networks

34:20 a point I've kept trying to reiterate is

34:22 this notion of dependency in the

34:24 sequential data and what it means to

34:26 track those dependencies

34:28 well if the dependencies are very

34:30 constrained in a small space not

34:32 separated out that much by time

34:34 this repeated gradient computation and

34:37 the repeated weight matrix

34:38 multiplication is not so much of a problem

34:41 if we have a very short sequence where

34:44 the words are very closely related to

34:46 each other and it's pretty obvious what

34:49 our next output is going to be

34:52 the RNN can use the immediately passed

34:55 information to make a prediction

34:57 and so there is not going to be

34:59 that much of a requirement to

35:02 learn effective weights if the related

35:05 information is close to each other

35:09 conversely now if we have a sentence

35:12 where we have a more long-term dependency

35:16 what this means is that we need

35:17 information from way further back in the

35:19 sequence to make our prediction at the

35:22 end and that gap between what's relevant

35:25 and where we are at currently becomes

35:27 exceedingly large and therefore the

35:29 vanishing gradient problem is

35:31 increasingly exacerbated meaning that

35:37 the RNN becomes unable to connect the

35:39 dots and establish this long-term

35:41 dependency all because of this Vanishing gradient problem

35:45 so there are ways

35:48 and modifications that we can make to

35:49 our Network to try to alleviate this problem

35:54 the first is that we can simply change

35:57 the activation functions in each of our

35:59 neural network layers to be such that

36:02 they can effectively try to

36:05 safeguard the gradients

36:08 from shrinking in

36:10 instances where the input is greater than

36:13 zero and this is in particular true for

36:16 the relu activation function and the

36:19 reason is that in all instances where X

36:22 is greater than zero with the relu

36:24 function the derivative is one and so

36:28 that is not less than one and therefore

36:31 it helps in mitigating The Vanishing gradient problem
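A small numerical check of this point (the tanh comparison is added here for contrast and is not from the lecture; the sample activations are made up):

```python
import numpy as np

def relu_grad(x):
    return (x > 0).astype(float)      # ReLU derivative: exactly 1 wherever x > 0

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2      # tanh derivative: below 1 away from 0

x = np.array([0.5, 1.0, 2.0])         # illustrative positive activations
print(np.prod(relu_grad(x)))          # chaining 3 ReLU steps keeps 1.0
print(np.prod(tanh_grad(x)))          # chaining 3 tanh steps shrinks it
```

With saturating activations every backward step multiplies in a factor below 1, while ReLU's unit derivative leaves the gradient's magnitude alone for positive inputs.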

36:37 another trick is how we initialize the

36:40 parameters in the network itself to

36:42 prevent them from shrinking to zero too quickly

36:45 and there are mathematical

36:48 ways that we can do this namely by

36:50 initializing our weights to Identity

36:53 matrices and this effectively helps in

36:56 practice to prevent the weight updates

36:59 from shrinking too rapidly to zero
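A toy demonstration of why identity initialization helps (the sizes and the small comparison weight are illustrative assumptions):

```python
import numpy as np

n, steps = 4, 20
h_id = np.ones(n)
h_small = np.ones(n)
for _ in range(steps):                     # repeated recurrent multiplications
    h_id = np.eye(n) @ h_id                # identity init: activations preserved
    h_small = (0.1 * np.eye(n)) @ h_small  # small init: activations collapse
print(h_id)                                # still all ones
print(h_small.max())                       # 0.1**20 = 1e-20, effectively gone
```

At the start of training the identity-initialized recurrence simply carries information forward unchanged, giving gradient descent something to work with before the weights drift.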

37:02 however the most robust solution to the

37:04 vanishing gradient problem is by

37:07 introducing a slightly more complicated

37:09 uh version of the recurrent neural unit

37:13 to be able to more effectively track and

37:16 handle long-term dependencies in the data

37:19 and this is the idea of gating and

37:22 the idea is by controlling

37:25 selectively the flow of information into

37:28 the neural unit to be able to filter out

37:31 what's not important while maintaining what is important

37:35 and the key and the most popular type of

37:38 recurrent unit that achieves this gated

37:40 computation is called the lstm or long

37:44 short term memory Network

37:46 today we're not going to go into detail

37:49 on lstms their mathematical details

37:52 their operations and so on but I just

37:55 want to convey the key idea and

37:57 intuitive idea about why these lstms are

38:00 effective at tracking long-term

38:04 the core is that the lstm is able to

38:08 control the flow of information through

38:10 these gates to be able to more

38:12 effectively filter out the unimportant

38:15 things and store the important things

38:19 what you can do is implement

38:22 lstms in tensorflow just as you would in

38:26 but the core concept that I want you to

38:28 take away when thinking about the lstm

38:30 is this idea of controlled information flow

38:35 very briefly the way that lstm operates

38:38 is by maintaining a cell state in addition

38:41 to the hidden state of a standard RNN and that cell state is

38:44 independent from what is directly outputted

38:47 the way the cell state is updated is

38:50 according to these Gates that control

38:52 the flow of information

38:54 forgetting and eliminating what is irrelevant

38:57 storing the information that is relevant

39:01 updating the cell state in turn and then

39:04 filtering this this updated cell state

39:07 to produce the predicted output just

39:09 like the standard RNN
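That four-part gated update (forget, store, update, filter) can be sketched as a single LSTM cell step. The weight shapes and random values below are illustrative assumptions, and in practice you would reach for a library layer such as `tf.keras.layers.LSTM` rather than writing this by hand:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                      # illustrative sizes
# one weight matrix per gate, each acting on [x, h_prev] concatenated
Wf, Wi, Wo, Wc = (rng.normal(scale=0.1, size=(n_hid, n_in + n_hid))
                  for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z)                 # forget gate: erase what's irrelevant
    i = sigmoid(Wi @ z)                 # input gate: decide what to store
    c_tilde = np.tanh(Wc @ z)           # candidate new information
    c = f * c_prev + i * c_tilde        # update the cell state
    o = sigmoid(Wo @ z)                 # output gate: filter the cell state
    h = o * np.tanh(c)                  # produce the output
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
for _ in range(5):                      # run over a short random sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c)
print(h.shape, c.shape)
```

Note that the cell state `c` is carried forward separately from the output `h`, which is what gives the gradients the uninterrupted path mentioned next.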

39:12 we can train the lstm using the back

39:15 propagation Through Time algorithm but

39:17 the mathematics of how the lstm is

39:19 defined allows for a completely

39:21 uninterrupted flow of the gradients

39:24 which largely

39:27 eliminates the Vanishing

39:29 gradient problem that I introduced earlier

39:33 again if you're

39:36 interested in learning more about the

39:37 mathematics and the details of lstms

39:40 please come and discuss with us after

39:42 the lectures but again just emphasizing

39:45 the core concept and the intuition

39:46 behind how the lstm operates

39:51 okay so so far

39:54 we've covered a lot of ground

39:57 we've gone through the fundamental

39:58 workings of rnns the architecture the

40:01 training the type of problems that

40:03 they've been applied to and I'd like to

40:05 close this part by considering some

40:08 concrete examples of how you're going to

40:10 use rnns in your software lab

40:14 and that is going to be in the task of

40:16 Music generation where you're going to

40:19 work to build an RNN that can predict

40:22 the next musical note in a sequence and

40:25 use it to generate brand new musical

40:27 sequences that have never been realized

40:31 so to give you an example of just the

40:33 quality and and type of output that you

40:36 can try to aim towards a few years ago

40:39 there was a work that

40:41 trained an RNN on a corpus of classical

40:44 music data and famously there's this

40:47 composer Schubert who uh wrote a famous

40:51 unfinished Symphony that consisted of

40:53 two movements but he was unable to

40:55 finish his Symphony before he

40:59 died leaving the

41:01 third movement unfinished so a few years

41:04 ago a group trained an RNN-based model to

41:08 actually try to generate the third

41:10 movement to Schubert's famous unfinished

41:13 Symphony given the prior two movements so

41:16 I'm going to play the result

41:37 okay I I paused it I interrupted it

41:39 quite abruptly there but if there are

41:42 any classical music aficionados out

41:44 there hopefully you get an appreciation

41:46 for kind of the quality that was

41:48 generated in terms of the music

41:50 quality and this was already from a few

41:53 years ago and as we'll see in the next

41:55 lectures and continuing with this

41:57 theme of generative AI the power of

41:59 these algorithms has advanced

42:01 tremendously since we first played this

42:05 um particularly in you know a whole

42:07 range of domains which I'm excited to

42:10 talk about but not for now okay so

42:13 you'll tackle this problem head on in

42:15 today's lab RNN music generation

42:21 we can think about the simple

42:23 example of input sequence to a single

42:26 output with sentiment classification

42:29 where we can think about for example

42:30 text like tweets and assigning positive

42:34 or negative labels to these text

42:36 examples based on the content that

42:40 is learned by the network

42:43 okay so this kind of concludes the

42:46 portion on rnns and I think it's quite

42:50 remarkable that using all the

42:51 foundational Concepts and operations

42:53 that we've talked about so far we've

42:56 been able to try to build up networks

42:59 that handle this complex problem of

43:00 sequential modeling

43:03 like any technology right an RNN is not

43:06 without limitations so what are some of

43:09 those limitations and what are some

43:10 potential issues that can arise with

43:13 using rnns or even lstms

43:17 the first is this idea of encoding and

43:20 dependency in terms of the

43:24 temporal separation of data that we're processing

43:28 what rnns require is that the

43:30 sequential information is fed in and

43:32 processed time step by time step

43:35 what that imposes is what we call an

43:37 encoding bottleneck right where

43:40 we're trying to encode a lot of content

43:42 for example a very large body of text

43:44 many different words into a single

43:47 output that may be just at the very last

43:50 time step how do we ensure that all that

43:52 information leading up to that time step

43:54 was properly maintained and encoded and

43:57 learned by the network in practice this

43:59 is very very challenging and a lot of

44:01 information can be lost

44:03 another limitation is that by doing this

44:06 time step by time step processing rnns

44:09 can be quite slow there is not really an

44:11 easy way to parallelize that computation

44:15 and finally together these components of

44:18 the encoding bottleneck the requirement

44:20 to process this data step by step

44:23 imposes the biggest problem which is

44:25 when we talk about long memory

44:28 the capacity of the RNN and the lstm is

44:31 really not that long we can't really

44:34 handle data of tens of thousands or

44:37 hundreds of thousands or even Beyond

44:39 sequential information that effectively

44:41 to learn the complete amount of

44:43 information and patterns that are

44:45 present within such a rich data source

44:48 and so because of this very recently

44:51 there's been a lot of attention in how

44:53 we can move Beyond this notion of

44:56 step-by-step recurrent processing to

44:58 build even more powerful architectures

45:00 for processing sequential data

45:03 to understand how we do how we can start

45:06 to do this let's take a big step back

45:08 right think about the high level goal of

45:10 sequence modeling that I introduced at

45:13 given some input a sequence of data we

45:17 want to build a feature encoding and use

45:20 our neural network to learn that and

45:22 then transform that feature encoding

45:24 into a predicted output

45:27 what we saw is that rnns use this notion

45:30 of recurrence to maintain order

45:32 information processing information time

45:37 but as I just mentioned we had these

45:39 three key bottlenecks to rnns

45:42 what we really want to achieve is to go

45:44 beyond these bottlenecks and Achieve

45:47 even higher capabilities in terms of the

45:49 power of these models rather than having

45:52 an encoding bottleneck ideally we want

45:54 to process information continuously as a

45:57 continuous stream of information

45:59 rather than being slow we want to be

46:01 able to parallelize computations to

46:04 speed up processing and finally of

46:06 course our main goal is to really try to

46:09 establish long memory that can build

46:11 nuanced and Rich understanding of sequential data

46:16 the limitation of rnns that's linked to

46:18 all these problems and issues in our

46:21 inability to achieve these capabilities

46:23 is that they require this time step by

46:26 time step processing

46:28 so what if we could move beyond that

46:30 what if we could eliminate this need for

46:32 recurrence entirely and not have to

46:34 process the data time step by time step

46:37 well a first and naive approach would be

46:40 to just squash all the data all the time

46:44 steps together to create a vector that's

46:48 effectively concatenated right the time

46:50 steps are eliminated there's just one

46:54 where we have now one vector input with

46:57 the data from all time points that's

46:59 then fed into the model

47:01 it calculates some feature vector and

47:03 then generates some output which

47:05 hopefully makes sense

47:07 and because we've squashed all these

47:09 time steps together we could simply

47:11 think about maybe building a feed

47:13 forward Network that could do

47:17 well with that we'd eliminate the need

47:20 for recurrence but we still have the

47:22 issues that it's not scalable because

47:25 the dense feed forward Network would

47:28 have to be immensely large defined by

47:30 many many different connections

47:32 and critically we've completely lost our

47:34 order information by just squashing

47:37 everything together blindly there's no

47:39 temporal dependence and we're then stuck

47:42 in our ability to try to establish

47:48 so what if instead we could still think

47:51 about bringing these time steps together

47:53 but be a bit more clever about how we

47:56 try to extract information from this

47:59 the key idea is this idea of being able

48:03 to identify and attend to what is

48:06 important in a potentially sequential

48:09 stream of information

48:11 and this is the notion of attention or self-attention

48:14 which is an extremely extremely powerful

48:17 Concept in modern deep learning and AI I

48:20 cannot overstate

48:21 I cannot

48:23 emphasize enough how powerful this concept is

48:27 attention is the foundational mechanism

48:29 of the Transformer architecture which

48:32 many of you may have heard about

48:34 and the notion of a

48:38 transformer can often be very daunting

48:40 because sometimes they're presented with

48:41 these really complex diagrams or

48:44 deployed in complex applications and you

48:47 may think okay how do I even start to

48:50 at its core though attention the key

48:52 operation is a very intuitive idea and

48:55 we're going to in the last portion of

48:57 this lecture break it down step by step

48:59 to see why it's so powerful and how we

49:02 can use that as part of a larger neural

49:04 network like a Transformer

49:07 specifically we're going to be talking

49:09 and focusing on this idea of

49:12 attending to the most important parts of

49:17 so let's consider an image I think it's

49:20 most intuitive to consider an image

49:22 first this is a picture of Iron Man and

49:25 if our goal is to try to extract

49:27 information from this image of what's important

49:30 what we could do maybe is using our eyes

49:32 naively scan over this image pixel by

49:35 pixel right just going across the image

49:39 however our brains maybe

49:42 internally they're doing some type of

49:43 computation like this but you and I we

49:45 can simply look at this image and be

49:48 able to attend to the important parts

49:50 we can see that it's Iron Man coming at

49:52 you right in the image and then we can

49:55 focus in a little further and say okay

49:56 what are the details about Iron Man that

50:00 what is key what you're doing is your

50:02 brain is identifying which parts

50:04 to attend to and then

50:07 extracting those features that deserve

50:10 the highest attention

50:13 the first part of this problem is really

50:15 the most interesting and challenging one

50:18 and it's very similar to the concept of

50:21 search effectively that's what search is

50:23 doing taking some larger body of

50:26 information and trying to extract and

50:28 identify the important parts

50:30 so let's go there next how does search work

50:33 you're thinking you're in this class how

50:35 can I learn more about neural networks

50:37 well in this day and age one thing you

50:39 may do besides coming here and joining

50:41 us is going to the internet having all

50:44 the videos out there trying to find

50:45 something that matches doing a search

50:49 so you have a giant database like

50:51 YouTube you want to find a video

50:53 you enter in your query deep learning

50:57 and what comes out are some possible results

51:02 for every video in the database there is

51:04 going to be some key information related

51:06 to that video

51:08 let's say the title

51:11 now to do the search

51:13 what the task is to find the overlaps

51:17 between your query and each of these

51:20 titles right the keys in the database

51:23 what we want to compute is a metric of

51:25 similarity and relevance between the

51:27 query and these keys

51:30 how similar are they to our desired

51:33 and we can do this step by step let's

51:36 say this first option of a video about

51:39 the elegant giant sea turtles not that

51:41 similar to our query about deep learning

51:46 the second option introduction to deep learning the first

51:48 introductory lecture on this class yes

51:52 the third option a video about the late

51:54 and great Kobe Bryant not that relevant

51:57 the key operation here is that there is

52:00 this similarity computation bringing the

52:02 query and the key together

52:04 the final step is now that we've

52:07 identified what key is relevant

52:09 extracting the relevant information what

52:11 we want to pay attention to and that's

52:13 the video itself we call this the value

52:16 and because the search is implemented

52:19 well right we've successfully identified

52:21 the relevant video on deep learning that

52:23 you are going to want to pay attention

52:26 and it's this idea this intuition

52:28 of given a query trying to find

52:31 similarity trying to extract the related

52:33 values that form the basis of attention

52:37 and how it works in neural networks like Transformers

52:40 so to go concretely into this right

52:43 let's go back now to our text our

52:49 our goal is to identify and attend to

52:52 features in this input that are relevant

52:54 to the semantic meaning of the sentence

52:58 now first step we have sequence we have

53:02 order we've eliminated recurrence right

53:04 we're feeding in all the time steps all

53:07 at once we still need a way to encode

53:09 and capture this information about order

53:12 and this positional dependence

53:14 how this is done is this idea of

53:17 positional encoding which

53:19 captures some inherent order information

53:22 present in the sequence I'm just going

53:25 to touch on this very briefly but the

53:27 idea is related to this idea of

53:29 embeddings which I introduced earlier

53:32 what is done is a neural network layer

53:34 is used to encode positional information

53:37 that captures the relative relationships

53:40 in terms of order within this sequence

53:45 that's the high level concept right

53:47 we're still being able to process these

53:50 time steps all at once there is no

53:52 notion of time step rather the data is

53:54 singular but still we learned this

53:56 encoding that captures the positional

54:01 now our next step is to take this

54:03 encoding and figure out what to attend

54:05 to exactly like that search operation

54:07 that I introduced with the YouTube

54:09 example extracting a query extracting a

54:12 key extracting a value and relating them

54:15 so we use neural network layers to do

54:19 given this positional encoding what

54:22 attention does is apply a neural

54:26 network layer transforming that first generating the query

54:30 we do this again using a separate neural

54:33 network layer and this is a different

54:35 set of Weights a different set of

54:36 parameters that then transform that

54:38 positional embedding in a different way

54:40 generating a second output the key

54:45 and finally this repeat this operation

54:47 is repeated with a third layer a third

54:50 set of Weights generating the value
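Sketched concretely, those three projections look like this (the shapes and random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
x = rng.normal(size=(seq_len, d_model))    # positionally encoded input

# three different sets of weights, each applied to the same input
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = x @ W_q, x @ W_k, x @ W_v        # query, key, value matrices
print(Q.shape, K.shape, V.shape)           # each (6, 8)
```

One input, three different learned views of it: that asymmetry is what lets the network compare the sequence against itself in the next step.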

54:53 now with these three in hand

54:56 the query the key and the value we can

54:58 compare them to each other to try to

55:00 figure out where in that self-input the

55:04 network should attend to what is important

55:07 and that's the key idea behind this

55:09 similarity metric or what you can think

55:11 of as an attention score what we're

55:14 doing is we're Computing a similarity

55:16 score between a query and the key

55:19 and remember that these query and

55:22 key values are just arrays of numbers we

55:25 can Define them as arrays of numbers

55:27 which you can think of as vectors in

55:31 the query Vector the query values are

55:34 some Vector and

55:36 the key values are some other vector and

55:39 mathematically the way that we can

55:41 compare these two vectors to understand

55:42 how similar they are is by taking the

55:45 dot product and scaling it this captures how

55:49 similar these vectors are whether or

55:52 not they're pointing in the same direction

55:55 this is the similarity metric and if you

55:58 are familiar with a little bit of linear

56:00 algebra this is also known as the cosine

56:04 similarity and this operation functions exactly the same way

56:07 for matrices if we apply this dot

56:10 product operation to our query and key

56:12 matrices we get this

56:15 similarity metric out
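As a tiny worked example of that similarity computation (the vectors here are made up for illustration):

```python
import numpy as np

def cosine(a, b):
    # normalized dot product: 1 when aligned, 0 when orthogonal
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 2.0, 0.5])              # a query vector
k_similar = np.array([2.0, 4.0, 1.0])      # a key pointing the same way
k_unrelated = np.array([-2.0, 1.0, 0.0])   # a key orthogonal to the query
print(cosine(q, k_similar))                # 1.0
print(cosine(q, k_unrelated))              # 0.0
```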

56:19 this is very very key in defining our

56:22 next step Computing the attention

56:24 weighting in terms of what the network

56:26 should actually attend to within this input

56:30 this operation gives us a score which

56:36 tells us how the components of the input data are

56:39 related to each other

56:41 so given a sentence right when we

56:43 compute this similarity score metric we

56:47 can then begin to think of Weights that

56:50 Define the relationship between

56:52 the components of the

56:54 sequential data to each other

56:56 so for example in this example with

56:59 a text sentence he tossed the tennis

57:04 the goal with the score is that words in

57:07 the sequence that are related to each

57:08 other should have high attention weights

57:10 ball related to toss related to tennis

57:15 and this metric itself is our attention

57:18 weighting what we have done is passed that

57:21 similarity score through a softmax

57:24 function which all it does is it

57:27 constrains those values to be between 0

57:29 and 1. and so you can think of these as

57:31 relative scores of relative attention
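A quick sketch of that softmax step (the raw scores are invented for illustration; in the sentence example they would come from the query-key dot products):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())         # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([4.0, 3.5, 0.2])  # raw similarity scores for three words
weights = softmax(scores)
print(weights)                      # each weight in (0, 1)
print(weights.sum())                # weights sum to 1: relative attention
```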

57:35 finally now that we have this metric

57:38 that captures this notion of

57:41 similarity and these internal

57:43 self-relationships we can finally use

57:46 this metric to extract features that are

57:49 deserving of high attention

57:52 and that's the exact final step in this

57:55 self-attention mechanism in that we take

57:58 that attention weighting Matrix multiply

58:01 it by the value and get a transformed version

58:07 of the initial data as our output which

58:10 in turn reflects the features that

58:12 correspond to high attention

58:15 all right let's take a breath let's

58:18 recap what we have just covered so far

58:21 the goal with this idea of

58:23 self-attention the backbone of

58:24 Transformers is to eliminate recurrence and

58:27 attend to the most important features

58:31 in the input in an architecture how this is actually

58:33 deployed is first we take our input data

58:37 we compute these positional encodings

58:40 the neural network layers are applied

58:43 three-fold to transform the positional

58:46 encoding into each of the query key and

58:53 value then we compute the self-attention weight score

58:56 according to the dot product

58:58 operation that we went through prior and

59:01 then self-attend to this

59:05 information to extract features that

59:08 deserve High attention
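Putting the recap together, one self-attention head can be sketched end to end as follows (the shapes, the scaling by the square root of the model dimension, and the random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))        # positionally encoded input

W_q, W_k, W_v = (rng.normal(scale=0.3, size=(d_model, d_model))
                 for _ in range(3))

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v        # three learned projections
    scores = Q @ K.T / np.sqrt(d_model)        # scaled dot-product similarity
    A = softmax_rows(scores)                   # attention weighting matrix
    return A @ V, A                            # extract high-attention features

out, A = self_attention(x)
print(out.shape)                               # (5, 8): transformed output
print(A.sum(axis=1))                           # each row of weights sums to 1
```

A multi-head version would simply run several of these with their own weight sets and concatenate the outputs, which is the "multiple self-attention heads" idea described next.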

59:11 what is so powerful about this approach

59:14 in taking this attention weighting

59:16 putting it together with the value to

59:19 extract High attention features is that

59:21 this operation the scheme that I'm

59:24 showing on the right defines a single

59:26 self-attention head

59:28 and multiple of these self-attention

59:30 heads can be linked together to form

59:32 larger Network architectures where you

59:35 can think about these different heads

59:36 trying to extract different information

59:38 different relevant parts of the input to

59:41 now put together a very very rich

59:43 encoding and representation of the data

59:46 that we're working with

59:48 intuitively back to our Ironman example

59:51 what this idea of multiple

59:52 self-attention heads can amount to is

59:55 that different Salient features and

59:58 Salient information in the data is

01:00:00 extracted first maybe you consider Iron

01:00:03 Man as attention head one and you may have

01:00:06 additional attention heads that are

01:00:08 picking out other relevant parts of the

01:00:10 data which maybe we did not realize

01:00:12 before for example the building or the

01:00:15 spaceship in the background that's

01:00:19 this is a key building block of many

01:00:22 many powerful architectures

01:00:25 that are out there today I again

01:00:27 cannot emphasize enough how powerful

01:00:30 this mechanism is

01:00:32 and indeed this this backbone idea of

01:00:35 self-attention that you just built up

01:00:37 understanding of is the key operation of

01:00:40 some of the most powerful neural

01:00:42 networks and deep learning models out

01:00:44 there today ranging from the very

01:00:46 powerful language models like GPT-3 which

01:00:50 are capable of synthesizing natural

01:00:53 language in a very human-like fashion

01:00:55 digesting large bodies of text

01:00:57 information to understand relationships

01:01:01 to models that are being deployed for

01:01:04 extremely impactful applications in

01:01:07 biology and Medicine such as AlphaFold

01:01:10 2 which uses this notion of

01:01:12 self-attention to look at data of

01:01:15 protein sequences and be able to predict

01:01:17 the three-dimensional structure of a

01:01:19 protein just given sequence information

01:01:23 and all the way even now to computer

01:01:25 vision which will be the topic of our

01:01:27 next lecture tomorrow where the same

01:01:30 idea of attention that was initially

01:01:32 developed in sequential data

01:01:34 applications has now transformed the

01:01:36 field of computer vision and again using

01:01:39 this key concept of attending to the

01:01:42 important features in an input to build

01:01:44 these very rich representations of

01:01:46 complex High dimensional data

01:01:49 okay so that concludes lectures for

01:01:53 today I know we have covered a lot of

01:01:55 territory in a pretty short amount of

01:01:57 time but that is what this boot camp

01:01:59 program is all about so hopefully today

01:02:02 you've gotten a sense of the foundations

01:02:04 of neural networks in the lecture with

01:02:06 Alexander we talked about rnns how

01:02:09 they're well suited for sequential data

01:02:11 how we can train them using back propagation Through Time

01:02:14 how we can deploy them for different

01:02:15 applications and finally how we can move

01:02:18 Beyond recurrence to build this idea of

01:02:20 self-attention for building increasingly

01:02:23 powerful models for deep learning in

01:02:25 sequence modeling

01:02:27 all right hopefully you enjoyed we have

01:02:31 um about 45 minutes left for the for the

01:02:34 lab portion and open Office hours in

01:02:36 which we welcome you to ask us questions

01:02:38 of us and the TAs and to start work

01:02:41 on the labs the information for the labs

01:02:44 is up there thank you so much for