# The math behind Attention: Keys, Queries, and Values matrices

Serrano.Academy · 2023-08-31


💫 Short Summary

In this video, Louis Serrano explains the math behind attention mechanisms in large language models, focusing on key concepts such as similarity between words, dot product, cosine similarity, key, query, and value matrices, and their role in the attention mechanism. He also discusses multi-headed attention and the training of key, query, and value matrices in a Transformer model.

✨ Highlights

📊 Transcript

✦

The video provides a detailed explanation of the math behind attention mechanisms in large language models, with a focus on the concept of similarity between words.

00:02Attention mechanisms are crucial in making Transformers work effectively.

The video will cover the concept of similarity between words using dot product and cosine similarity.

Key query and value matrices play a role in the attention mechanism within the Transformer model.

✦

The concept of similarity is explained using examples of dot product and cosine similarity.

02:17Dot product is a measure of similarity that is high when the words are similar or close in the embedding, and low when the words are far away.

Cosine similarity uses the angle between vectors to determine similarity, ranging from 1 for very similar words down to 0 or negative values for very different words.

Scaled dot product is another variation of dot product that is used in attention, where the result is divided by the square root of the vectors' dimension to prevent large numbers in high-dimensional spaces.

✦

The video explains how the key and query matrices are used to transform embeddings for the attention mechanism.

04:45Keys and queries matrices modify the embeddings to improve their suitability for the attention mechanism, allowing for better understanding of context and similarity between words.

The linear transformation performed by the keys and queries matrices helps in finding the best embeddings for the attention task.

The values matrix is also mentioned, which transforms the embeddings to ones that are optimal for the next word prediction in a sentence.

✦

The video demonstrates how the attention step moves words around in the embeddings to better understand context and similarity.

08:22The attention step calculates similarities between words using the modified embeddings from the keys and queries matrices.

The values matrix then transforms the embeddings to ones that are better suited for the next word prediction in a sentence.

The process involves scaling the embeddings based on their effectiveness for the task.

✦

The video introduces multi-headed attention, which uses multiple sets of key, query, and value matrices to find the best embeddings for the attention task.

18:39Multi-headed attention involves using multiple sets of key, query, and value matrices to find the best embeddings for the attention task.

The outputs of the attention heads are concatenated and then transformed using a linear step to reduce the dimensionality.

The linear step scales the embeddings to emphasize the most effective ones for the attention task.

The goal is to end up with a high-quality embedding that is optimal for the attention mechanism.
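To make the multi-head attention highlights concrete, here is a minimal numpy sketch; the matrix names (W_K, W_Q, W_V, W_O), the random weights, and the dimensions are illustrative assumptions, not values from the video:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads=2, d_head=3):
    """Sketch: each head has its own key/query/value matrices; the head
    outputs are concatenated, and a final linear step brings the
    dimensionality back down to the embedding size."""
    n_words, d_model = X.shape
    head_outputs = []
    for _ in range(n_heads):
        W_K = rng.normal(size=(d_model, d_head))  # keys matrix (random stand-in)
        W_Q = rng.normal(size=(d_model, d_head))  # queries matrix
        W_V = rng.normal(size=(d_model, d_head))  # values matrix
        K, Q, V = X @ W_K, X @ W_Q, X @ W_V
        weights = softmax(Q @ K.T / np.sqrt(d_head))  # scaled dot-product attention
        head_outputs.append(weights @ V)
    concat = np.concatenate(head_outputs, axis=1)       # (n_words, n_heads * d_head)
    W_O = rng.normal(size=(n_heads * d_head, d_model))  # final linear step
    return concat @ W_O                                 # back to (n_words, d_model)

X = rng.normal(size=(4, 4))  # 4 words with 4-dimensional embeddings
out = multi_head_attention(X)
```

Note the design choice this sketch encodes: it is the head outputs, not the key and query embeddings themselves, that get concatenated before the final linear step reduces the dimensionality.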

00:00[Music]

00:02 hello my name is Louis Serrano and in

00:05 this video you're going to learn about

00:07 the math behind attention mechanisms in

00:09 large language models attention

00:11 mechanisms are really important in large

00:13 language models as a matter of fact

00:14 they're one of the key steps that make

00:16 Transformers work really well now in a

00:19 previous video I showed you how

00:21 attention works at a very high level with

00:23 words flying towards each other and

00:25 gravitating towards other words in order

00:26 for the model to understand context in

00:29 this video we're going to do an example

00:31 in much more detail with all the math

00:33 involved now as I mentioned before the

00:35 concept of a transformer and the

00:37 attention mechanism were introduced in

00:39 this groundbreaking paper called

00:40 attention is all you need now this is a

00:43 series of three videos in the first

00:44 video I showed you what attention

00:46 mechanisms are at a high level in this

00:48 video I'm going to do it with math and

00:50 in the third video that is upcoming I

00:52 will put them all together and show you

00:55 how a Transformer model works so in this

00:58 video in particular you're going to

00:59 learn about some concept similarity

01:01 between words and pieces of text is one

01:03 of the concepts one way to do this is

01:05 with DOT product and another one with

01:07 cosine similarity so we learn both and

01:09 next you're going to learn what the key

01:11 query and value matrices are as linear

01:13 Transformations and how they have an

01:16 involvement in the attention mechanism

01:19[Music]

01:22 so let's do a quick review of the first

01:24 video first we have embeddings and

01:27 embeddings are a way to put words or

01:30 longer pieces of text in this case it's

01:32 the plane but in reality you put it in a

01:34 high dimensional space in such a way

01:37 that words that are similar get sent to

01:40 points that are close so for example

01:42 these are fruits there's a strawberry an

01:45 orange a banana and a cherry and they

01:48 are all in the top corner of the image

01:51 because they're similar words so they

01:53 get sent to similar points and then over

01:55 here we have a bunch of Brands we have

01:58 Microsoft we have Android then we also

02:00 have a laptop and a phone so it's the

02:03 technology corner and then the question

02:05 we had in the previous video is where

02:06 would you put the word apple and that's

02:07 complicated because it's both a

02:10 technology brand and also a fruit so we

02:12 wouldn't know in particular let's take a

02:15 look at this the orange is on the top

02:17 right and the phone is in the bottom

02:19 left where would you put an apple well

02:21 then you need to look at context so if

02:23 you have a sentence like please buy an

02:24 apple and an orange then you know you're

02:26 talking about the fruit if you have a

02:28 sentence like apple unveiled a new phone

02:29 then you know you're talking about the

02:31 technology brand so therefore this word

02:34 needs to be given context and the way

02:36 it's given context is by the neighboring

02:38 words in particular the word orange is

02:39 the one that helps us here so what we do

02:41 is that we look at where orange is and

02:44 then move the Apple in that direction

02:46 and then we're gonna use those new

02:47 coordinates instead of the old ones for

02:49 the apple so then now the apple is closer

02:52 to the fruits so it knows more about its

02:54 context given the other words the word

02:57 orange in the sentence now for the

02:59 second sentence the word that gives us

03:01 the clue that we're talking about a

03:02 technology brand is the word phone

03:04 therefore what we do is we move towards

03:08 the word phone and then we use those

03:10 coordinates in the embedding so that

03:12 apple in the second sentence knows that

03:15 is more of a technology word because

03:17 it's closer to the word phone now another

03:19 thing we saw in the previous video is

03:20 that not just the word orange is going

03:22 to pull the Apple but all of the other

03:24 words are going to pull the apple and

03:27 how does this happen well this happens

03:29 with gravity or actually something very

03:32 similar to gravity words that are close

03:34 like apple and orange have a strong

03:35 gravitational pull so they move towards

03:38 each other on the other hand the other

03:40 words don't have a strong gravitational

03:41 pull because they're far away actually

03:43 they're not far away but they're

03:45 dissimilar so we can think of distance

03:47 as a metric but in a minute I will tell

03:49 you exactly what I'm talking about here

03:51 but we can think of the words as being

03:54 far away and as I said words that are

03:58 close get pulled together and words that are

04:00 far away get pulled but not very much and

04:03 so after one gravitational step then the

04:05 word apple and orange are much closer

04:07 and the other words in the sentence well

04:09 they may move closer but not that much

04:11 and what happens is that context pulls

04:13 so if I've been talking about fruits for

04:15 a while and I said banana strawberry

04:16 lemon blueberry and orange and then I

04:18 say the word Apple then you would

04:20 imagine that I'm talking about a fruit

04:22 and what happens here in space is that

04:24 we have a galaxy of fruit words

04:27 somewhere and they have a strong pull so

04:29 when the word Apple comes in it gets

04:31 pulled by this Galaxy and therefore now

04:34 the Apple knows that it's a fruit and

04:36 not a technology brand now remember that

04:38 I told you that words are far away but

04:41 that's not really true in reality what

04:43 we need to use is the concept of

04:45 similarity

04:46[Music]

04:49 so what is similarity well as humans we

04:53 have an idea of words being similar to

04:56 each other or dissimilar and that's

04:58 exactly what similarity measures so

05:01 before we saw for example that the words

05:02 cherry and orange are similar and they're

05:04 different from the word phone

05:06 and we kind of had the impression that

05:09 there's a measure of distance like

05:10 cherry and orange are close so they

05:12 have a small distance in between and

05:14 Cherry and phone are far so they have a

05:16 large distance but as I mentioned what

05:19 we want is the opposite a measure of

05:21 similarity which is high when the words

05:24 are similar and low when the words are

05:27 different so next I'm going to show you

05:29 three ways to measure similarity that

05:32 are actually very similar at the end the

05:34 first one is called dot product so

05:35 imagine that you have these three words

05:38 over here Cherry orange and phone and as

05:41 we saw before the axis in the embedding

05:44 actually mean something it could be

05:45 something tangible for humans or maybe

05:48 something that the computer just knows

05:49 and we don't but let's say that the axes

05:52 here measure tech for the horizontal

05:54 axis and fruitness for the vertical axis

05:57 so the cherry and the orange have

05:59 high fruitness and low Tech that's why

06:02 they're located in the top left and the

06:04 phone has high tech and low fruitness

06:05 that's why it's located in the bottom

06:08 right now we need a measure that is high

06:11 for these two words cherry and orange

06:14 so the measure of similarity is going to

06:16 be the following we look at their

06:17 coordinates 1 4 for cherry and zero

06:20 three for orange and remember that one

06:23 of them is the amount of tech and the

06:24 other one is the amount of fruitness now

06:26 if these words are similar we would

06:28 imagine that they have similar amounts

06:30 of tech and similar amounts of fruitness

06:32 in particular they have both have low

06:33 Tech so therefore if we multiply these

06:36 two numbers it should be a low number

06:38 that's one times zero but they both have

06:41 high fruitness so if we multiply those

06:43 two numbers four times three we get a

06:44 high number and when we add them

06:46 together that's the product of the tech

06:47 and the product of the fruitness we get

06:50 a high number which is 12. now let's do

06:52 the same thing for cherry and phone so

06:55 the similarity should be a small number

06:56 let's see between 1 4 and 3 0 what is

07:00 the dot product well it's one times

07:02 three the products of the tech values

07:04 plus four times zero the product of

07:07 fruitness values and that's one times

07:08 three plus four times zero which is a

07:10 small number which is three notice that

07:12 the reason is because if one of the

07:14 words has low Tech the other one has

07:16 high tech and one of them has low

07:18 fruitness and the other has high fruitness so

07:20 we're not going to get very high by

07:21 multiplying them and adding and the

07:24 extreme case is orange and phone for orange

07:27 and phone the coordinates are zero three and

07:29 three zero so when we multiply 0 times 3

07:31 we get zero plus three times zero equals

07:33 zero and so we get zero notice that

07:36 these two words are actually

07:37 perpendicular with respect to the origin

07:39 and when two words are perpendicular

07:41 they're always gonna have dot product

07:43 equals zero so dot product is our first

07:46 measure of similarity it's High when the

07:49 words are similar or close in the

07:50 embedding and low when the words are far

07:53 away and notice that it could be

07:54 negative as well the second measure of

07:56 similarity is called cosine similarity and

07:58 it looks very different from dot

07:59 product but they're actually very

08:01 similar what we do here is we use an

08:04 angle so let's calculate the cosine

08:05 similarity between orange and Cherry we

08:07 look at the angle that the two vectors

08:09 made when they are traced from the

08:12 origin this angle is 14 degrees if you're

08:15 wondering how it's calculated it's the arctangent

08:17 of one quarter and here it is so that

08:21 number is 0.97 that means that the

08:23 similarity between Cherry and orange the

08:26 cosine similarity is 0.97 now let's

08:29 calculate the one between Cherry and

08:32 phone that angle is 76 because it's the

08:35 arctangent of 4 divided by one and that

08:38 number is

08:39 0.24 so the similarity between Cherry

08:43 and phone is 0.24 which is much lower

08:45 than 0.97 and finally guess what that

08:49 third one is going to be the similarity

08:51 between orange and phone is going to be

08:53 the cosine of this angle which is 90 and

08:55 that is again zero so the similarity is

08:58 zero now cosine similarity since it's a

09:00 cosine it is between 1 and -1 so one is

09:05 for words that are very similar and then

09:07 zero and negative numbers are for words

09:09 that are very different now I told you

09:11 that dot product and cosine similarity are very

09:13 similar but they don't look that similar

09:15 why are they so similar well because

09:18 they're the same thing if the vectors

09:20 have length one more specifically if I

09:23 were to draw a unit circle around the

09:26 origin and I take every point and I draw

09:28 a line from the center to the point and

09:31 put the word where the line meets the

09:33 circle that means I scale everything

09:37 down so that all the vectors have length

09:40 one then cosine similarity and the dot

09:44 product are the exact same thing so

09:47 basically if all my vectors have norm 1

09:50 then cosine similarity and Dot product

09:52 are the same thing so at the end of the

09:54 day what we are saying is that dot

09:57 product and cosine similarity are the

09:59 same up to a scalar and the scalar is

10:03 the product of the two lengths of the

10:05 vector so if I take the dot product and

10:07 divide it by the product of the lengths of

10:09 the vectors then I get the cosine

10:12 similarity now there is a third one

10:15 called scaled dot product which as you

10:17 can imagine it's another multiple of the

10:19 dot product and that's actually the one

10:22 that gets used in attention so let me

10:24 show you quickly what it is it is the

10:26 dot product just like before so here we

10:28 get 12 except that now we're divided by

10:30 the square root of the length of the

10:32 vector and the length of the vector is 2

10:34 because these vectors have two

10:35 coordinates and so we get 8.49 for the

10:38 first one for the second one we had a 3

10:40 divided by root 2 is 2.12 and for the

10:43 third one well we had a zero and that

10:45 divided by root 2 is also 0 now the

10:47 question is why are we dividing by this

10:49 square root of 2 well that is because

10:53 when you have very long vectors for

10:55 example with a thousand coordinates you

10:57 get really really large dot products and

11:00 you don't want that you want to manage

11:02 these numbers you want these numbers to

11:03 be small so that way you divide by the

11:06 square root of the length of the vectors

11:09[Music]

11:13 now as I mentioned before the one we use

11:17 for attention is the scaled dot product

11:20 but for this example just to have nice

11:22 numbers we're going to do it with cosine

11:24 similarity but at the end of the day

11:26 remember that everything gets scaled by

11:29 the same number so let's look at an

11:32 example we have the sentences an apple

11:34 and an orange and an Apple phone and let's

11:36 calculate some similarity so first let's

11:39 look at some coordinates let's say that

11:41 orange is in position 0 3

11:43 phone is in position four zero and this

11:46 ambiguous apple is in position two two

11:48 now embeddings don't only have two

11:50 Dimensions they have many many so to

11:52 make it more realistic let's say we have

11:54 three dimensions so there's an other

11:56 dimension here but all the words we have

11:58 are at zero over that Dimension so

12:01 they're in that flat plane by the way

12:04 however the sentences have more words

12:06 the words and and an so let's say that

12:08 and and an are over here at coordinates 0 0

12:11 2 and 0 0 3 now let's calculate all the

12:15 similarities between all these words

12:17 that's going to be this table here the

12:19 first easy step to notice is that the

12:21 cosine similarity between every word and

12:24 itself is one why well because every

12:28 angle between every word and itself is

12:30 zero and the cosine of zero is one so

12:32 all of them are one now let's calculate

12:35 the similarity between orange and phone

12:37 we already saw that this is zero because

12:39 the angle is 90. now let's do between

12:42 orange and apple angle is 45 same thing

12:45 between Apple and phone and the cosine

12:47 of 45 degrees is 0.71 finally let's look

12:51 at phone and and or actually any word

12:53 between orange apple and phone makes an

12:56 angle of 90 degrees with and and an so all

12:59 these numbers are actually zero and

13:01 finally between and and an the angle is

13:03 zero so therefore the cosine is one

13:08 so this is our entire table of

13:10 similarities between the words and we're

13:11 going to use this table similarity to

13:13 move the words around that's the

13:15 attention step so let's look at this

13:19 table but only for the words in the

13:20 sentence an apple and an orange we have

13:23 the words orange apple and and an and

13:25 we're going to move them around so we're

13:27 going to take them and change their

13:29 coordinates slightly each of these words

13:30 would be sent to a combination of itself

13:34 and all the other words and the

13:35 combination is given by the rows of this

13:37 table so more specifically let's look at

13:39 Orange Orange would go to one times

13:41 orange which is this coordinate over

13:44 here plus 0.71 times Apple which is this

13:47 coordinate right here plus 0 times n

13:49 plus zero times n which means nothing

13:52 else now let's look at where Apple goes

13:55 Apple goes to 0.71 times orange plus one

13:59 times apple plus 0 times n plus zero

14:01 times n now let's look at n and n they

14:04 go to zero times orange plus zero times

14:06 Apple

14:07 plus 1 times and plus one times n and

14:11 the same thing happens with the word and

14:12 goes to zero times orange plus zero

14:14 times apple plus 1 times n plus one

14:17 times n

14:18 so basically what we did is we took each

14:22 of the words and sent it to a

14:25 combination of the other words so now

14:27 Orange has a little bit of Apple in it

14:29 and apple has a little bit of orange in

14:31 it etc etc so we're just moving words

14:34 around and later I'm going to show you

14:36 graphically what this means now let's

14:38 also do it for the other sentence an

14:40 Apple phone this one I'm going to do it

14:42 a little faster phone goes to one times

14:44 phone plus 0.71 times apple plus 0 times

14:48 n Apple goes to 0.71 Times phone plus

14:51 one times apple plus 0 times n and n

14:54 goes to 1 times n so itself however

14:58 there are some technicalities I need to

15:00 tell you first of all let's look at the

15:03 word orange it goes to one times orange

15:04 plus 0.71 times Apple but these are Big

15:09 Numbers imagine doing this many times I

15:11 end up being sent to 500 times orange

15:13 plus 400 times Apple I don't want to

15:16 have these big numbers I want to be

15:18 scaling and everything down so in

15:20 particular I want these coefficients to

15:22 always add to one so that no matter how

15:24 many Transformations I make I'm gonna

15:26 end up with some percentage of orange

15:28 some percentage of apple and maybe

15:30 percentages of other words but I don't

15:33 want these to blow up so in order for

15:35 the coefficients to add to one I would

15:37 divide by their sum 1 plus 0.71 so I get

15:40 0.58 orange plus 0.42 times Apple so

15:45 that process is called normalization

15:47 however there's a small problem can you

15:50 see it well what happens is this let's

15:54 say that I have orange goes to 1 times

15:57 orange minus one times motorcycle

15:58 because remember that cosine distance

16:00 can be a negative number

16:03 so if I want these coefficients to add

16:05 to 1 I would divide by their sum which

16:07 is 1 minus one and dividing by zero is a

16:10 terrible thing to do never ever ever

16:12 divide by zero so how do I solve this

16:14 problem well I would like these

16:16 coefficients to always be positive find

16:18 a way to take these coefficients and

16:21 turn them into something positive I'm

16:23 good however I still want to respect the

16:26 order one is a lot bigger than minus one

16:29 so I want the coefficient that 1 becomes

16:32 to be still a lot bigger than the

16:33 coefficient that minus one becomes so

16:35 what is the solution well a common

16:37 solution here is instead of taking a

16:39 coefficient X take the coefficient e to

16:42 the X so raise e to the all the numbers

16:45 you see here and what do we get well if

16:48 I take every number and turn it into e

16:51 to that number then I have e to the 1

16:54 times orange plus e to the 0.71 times

16:56 Apple divided by e to the 1 plus 0.71

16:59 the numbers change slightly now there's

17:01 0.57 and 0.43

17:04 but what happens in the bottom one well

17:07 one becomes e to the one negative 1

17:09 becomes e to the minus 1 and now I add

17:12 them and the bottom becomes e to the one

17:14 plus e to the minus 1 and that becomes

17:17 0.88 orange plus 0.12 motorcycle so we

17:20 effectively turned the numbers into

17:23 positive ones respecting their order so

17:25 we do this step for the coefficients to

17:28 always add to one

17:29 this step is called softmax and it's a

17:32 very very popular function in machine

17:34 learning so now when we go to the tables

17:38 that showed how these new

17:41 words were created we can change the numbers to

17:44 the softmax numbers and get these

17:47 actual new numbers
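The softmax step just described, exponentiate each coefficient and then divide by the sum so the weights are positive and add to one, can be sketched and checked against the video's numbers:

```python
import numpy as np

def softmax(x):
    e = np.exp(x)          # make every coefficient positive, preserving order
    return e / e.sum()     # normalize so the coefficients add to one

# orange's coefficients in "an apple and an orange": [orange, apple]
print(softmax(np.array([1.0, 0.71])).round(2))   # [0.57 0.43]

# the problematic case 1 and -1: plain normalization would divide by
# 1 + (-1) = 0, but softmax gives positive weights that respect the order
print(softmax(np.array([1.0, -1.0])).round(2))   # [0.88 0.12]
```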

17:50 so this is what's going to tell us how

17:52 the words are going to move around but

17:53 actually before I show you this

17:55 geometrically I have to admit that I've

17:57 been lying to you because softmax

18:00 doesn't turn a zero into a zero in fact

18:03 these four numbers get sent to e to the

18:06 one e to the 0.71 e to the 0 and e to

18:09 the zero and e to the zero is one so

18:11 this combination of words when you

18:14 normalize it it actually becomes 0.4

18:17 orange plus 0.3 apple plus 0.15 and plus

18:20 0.15 and

18:22 so those and and an are not that easy

18:25 to get rid of but as you may imagine in

18:28 real life they will have such small

18:30 coefficients that they're pretty much

18:33 negligible but at the end of the day you

18:35 have to consider all the numbers when

18:37 you do softmax but let's go back to

18:39 pretending that and and an are not

18:42 important in other words let's go back

18:44 to the original equations Apple goes to

18:47 0.43 Orange plus 0.57 of apple and apple

18:51 goes to 0.43 phone plus 0.57 of Apple so

18:57 if we forget about the words and the end

18:59 let's actually go back to the plane here

19:02 where the three words are nicely fit in

19:04 the plane so let's look at the equations

19:06 again Apple going to 43 percent of orange plus

19:09 57 percent of Apple really means that we're

19:13 taking 43 percent of the apple and turning it

19:16 into orange geometrically this means

19:18 that we're taking the line from Apple to

19:21 Orange and moving the word Apple 43 percent

19:24 along the way that gives us the new

19:26 coordinates 1.14 and

19:29 2.43 for the second sentence we do the

19:33 same thing we take 43 percent away from the

19:36 apple and turn it into the word phone

19:39 that means we trace this line from Apple

19:41 to phone and locate the new Apple 43 percent

19:45 along the way that means it's in the

19:48 coordinates 2.86 and

19:50 1.14 so what does this mean that means

19:55 that when we're gonna talk about the

19:57 first sentence we're not going to use

19:59 the coordinates 2 2 for Apple we're

20:01 going to use the coordinates 1.14 and

20:04 2.43 and we're in the second sentence

20:06 we're not going to use the

20:07 coordinates 2 2 we're going to use the

20:09 coordinates 2.86 and 1.14 so now we have

20:13 better coordinates because these new

20:15 coordinates are closer to the

20:17 coordinates of either orange or phone

20:20 depending on which sentence the word

20:22 Apple appears and therefore we have a

20:25 better version of the word Apple now

20:27 this is not much but imagine doing this

20:29 many times in a Transformer the

20:30 attention step is applied many times so

20:32 if you apply it many times at the end

20:34 the words are gonna end up much much

20:36 closer to what is dictating the context

20:40 in that piece of text and in a nutshell

20:44 that is what attention is doing
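The whole attention step above can be reproduced numerically with the video's coordinates (orange = (0, 3), phone = (4, 0), apple = (2, 2)), ignoring "an" and "and" as the video does; with the video's rounded weights (0.57 / 0.43) the answers are (1.14, 2.43) and (2.86, 1.14), and the unrounded weights below land within a couple of hundredths of those:

```python
import numpy as np

orange, phone, apple = np.array([0., 3.]), np.array([4., 0.]), np.array([2., 2.])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def softmax(x):
    e = np.exp(x)
    return e / e.sum()

def attend(word, context):
    """Move `word` toward the words in `context` (which includes itself),
    weighting each context word by softmax of the cosine similarities."""
    sims = np.array([cosine(word, c) for c in context])
    weights = softmax(sims)
    return weights @ np.stack(context)   # weighted combination of coordinates

# sentence "an apple and an orange": apple moves toward the fruit corner
print(attend(apple, [orange, apple]))   # ≈ (1.14, 2.43)

# sentence "an apple phone": apple moves toward the technology corner
print(attend(apple, [phone, apple]))    # ≈ (2.86, 1.14)
```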

20:46[Music]

20:51 so now we're ready to learn the keys

20:52 queries and values matrices if you

20:55 look at the original diagrams for

20:56 scaled dot-product attention on the left and

20:58 multi-headed attention on the right they

21:01 contain these KQ and V those are the key

21:05 query and value matrices that we're

21:07 going to denote by Keys queries and

21:09 values actually let's first learn keys

21:12 and queries and we're going to learn

21:13 values later in this video so let me

21:16 show you how I like to see keys and

21:18 queries matrices now recall from

21:20 previously in this video that when you

21:22 want to do the attention step you take

21:24 the embedding and then you move this

21:26 ambiguous Apple towards the phone or

21:28 towards the orange depending on the

21:30 context of the sentence now in the

21:31 previous video we learned what a linear

21:33 transformation is it's basically a matrix

21:35 that you multiply all the vectors by and

21:37 you can get something like this another

21:39 embedding or maybe something like this a

21:42 good way to imagine linear

21:43 Transformations is to send the square to

21:46 any parallelogram and then the plane

21:48 follows because the square tessellates

21:50 the plane so therefore you just continue

21:53 tessellating the plane with these

21:54 parallelograms and you get a

21:56 transformation from the plane to the

21:58 plane so these two examples over here

22:00 are linear transformations of the

22:02 original embedding now let me ask you a

22:05 question out of these three which one is

22:08 the best one for applying the attention

22:10 step and which one is the worst one and

22:12 which one is so so feel free to pause

22:15 the video and think about it and I'll

22:17 tell you the answer well the first one

22:19 is so-so because when you apply

22:21 attention it kind of separates the fruit

22:23 apple from the technology Apple but not

22:25 so much

22:26 the second one is awful because you

22:29 apply the attention step and it doesn't

22:30 really separate the two words so this

22:32 one's really bad it doesn't really add

22:34 much information and the third one is

22:36 great because it's really spaces out the

22:39 phone and the orange and therefore it

22:41 separates the technology apple and the

22:44 fruit apple very well so this one's the

22:46 best one and the point of the keys and

22:47 queries Matrix is going to help us find

22:50 really good embeddings where we can do

22:52 attention and get a lot of good

22:55 information now how do they do it well

22:58 through linear Transformations but let

23:00 me be more specific

23:01 remember that attention was done by

23:03 calculating the similarity now let's

23:05 look at how we did it let's say you have

23:07 the vector for orange and the vector for

23:10 phone in this example they have three

23:11 coordinates but they could have as many

23:12 as we want and we want to find the

23:15 similarity so the similarity is the dot

23:18 product it actually was the scaled dot

23:19 product or the cosine similarity but at

23:22 the end of the day it's the same up to a

23:24 scalar so we're just going to take them

23:26 all similarly and the dot product can be

23:28 seen as the product of the first one

23:30 times the transpose of the second one is

23:32 a matrix product and as I said before if

23:36 we don't care so much about scaling we

23:37 can think of it as the cosine similarity

23:40 in that particular embedding now how do

23:44 we get a new embedding well this is

23:46 where the keys and queries matrices come

23:48 in when we look at the keys and queries

23:50 Matrix what they do is they modify the

23:52 embeddings so instead of taking the

23:54 vector for orange we take the vector for

23:56 orange times the keys matrix and

23:58 instead of taking the vector for phone

24:00 we take the vector phone times the

24:03 queries Matrix and we get new embeddings

24:05 and when we want to calculate the

24:07 similarity then it's the same thing it's

24:09 a product of the first one times the

24:12 transpose of the second one which is the

24:14 same as orange times keys times the transpose of queries times

24:16 the transpose of phone and this over

24:19 here is a matrix that defines a linear

24:21 transformation and that's the linear

24:22 transformation that takes this embedding

24:25 into this one over here so the keys and

24:28 queries Matrix work together to create a

24:31 linear transformation that will improve

24:34 our embedding to be able to do attention

24:36 better therefore what we're doing is

24:39 that we're modifying the similarity in

24:41 one embedding and taking the similarity

24:43 in a different embedding and we're gonna

24:46 make sure that this is a better one

24:48 actually we're going to calculate lots

24:49 of them and find the best ones but

24:51 that's coming a little later but imagine

24:53 keys and queries matrices as a way to

24:56 transform our embedding into one that is

24:59 better suited for this attention problem
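The keys-and-queries idea above can be sketched directly: similarity is computed between transformed embeddings, orange·K on one side and phone·Q on the other, so that K·Qᵀ acts as a single learned linear transformation of the embedding. The matrices below are random stand-ins for trained ones; dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3                             # embedding dimension (illustrative)
K = rng.normal(size=(d, d))       # keys matrix (random stand-in for a trained one)
Q = rng.normal(size=(d, d))       # queries matrix (random stand-in)

orange = np.array([0., 3., 0.])
phone  = np.array([4., 0., 0.])

# similarity in the original embedding: plain dot product
plain_sim = orange @ phone

# similarity in the transformed embedding:
# (orange K) · (phone Q) = orange (K Qᵀ) phoneᵀ
transformed_sim = (orange @ K) @ (phone @ Q)
same_thing      = orange @ (K @ Q.T) @ phone

print(plain_sim)                                 # 0.0, perpendicular originally
print(np.isclose(transformed_sim, same_thing))   # True: K Qᵀ is one linear map
```

Training would adjust K and Q so that this transformed similarity separates the words better than the original one, which is exactly the embedding-improvement role described above.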

25:05 and now that you've learned what the

25:07 keys and queries matrices are let me

25:09 show you what the values Matrix is

25:12 recall that the keys and queries Matrix

25:14 actually turn the embedding into one

25:17 that is best for calculating

25:19 similarities

25:21 however here's the thing that embedding

25:23 on the left is not where you want to

25:25 move the words you only want it for

25:27 calculating similarities

25:29 let's say that there's an ideal

25:31 embedding for moving the words and it's

25:33 the one over here so what do we do well

25:36 using the similarities we found on the

25:39 left embedding we're gonna move the

25:42 words on the right embedding and why is

25:46 that the case well what happens is that

25:48 the embedding on the left is actually

25:50 optimized for finding similarities

25:52 whereas the embedding on the right is

25:54 optimized for finding the next word in a

25:57 sentence why is this because a

25:59 Transformer what it does and we're going

26:02 to learn this in the next video but a

26:04 Transformer finds the next word in a

26:06 sentence and it continues finding the

26:08 next word until it builds long pieces of

26:10 text

26:11 so the embedding where you want to move the

26:13 words around is one that's optimized for

26:16 finding the next word and recall that

26:18 the embedding on the left is found by

26:19 the keys and queries matrices and the

26:21 embedding on the right is found by the

26:23 values matrix and what does the values

26:25 matrix do well it's the one that takes

26:28 the embedding on the left and multiplies

26:31 every vector to get the embedding on the

26:33 right so when you take the embedding on

26:36 the left multiplied by the matrix V you

26:39 get another transformation because you

26:40 can compose these linear

26:42 transformations and you get the linear

26:45 transformation that corresponds to the

26:46 embedding on the right now why is it

26:50 that the embedding on the left is the

26:52 best one for finding similarities well

26:54 this one is one that knows the features

26:57 of the words for example it would be

26:59 able to pick up the color of a fruit or

27:02 the size the amount of fruitiness the

27:04 flavor the technology that is in the

27:08 phone

27:09 etc etc it's the one that captures

27:10 features of the words whereas the

27:13 embedding for finding the next word is

27:16 one that knows when two words could

27:18 appear in the same context so for

27:19 example if the sentence is I want to buy a

27:22 blank the next word could be car could

27:24 be apple could be phone in the embedding

27:26 on the right all those words are close

27:28 by because the embedding on the right is

27:31 good for finding the next word in a

27:33 sentence so recall that the keys and

27:36 queries matrices can capture the high

27:39 level and the low level granular

27:41 features of the words the embedding on the

27:44 right doesn't have that it's optimized

27:47 for the next word in a sentence and

27:49 that's how the key query and values

27:51 matrices actually give you the best

27:53 embeddings to apply attention in now

27:57 just to show you a little bit of the

27:58 math that happens in the values matrix

28:00 imagine that these are your similarities

28:02 that you found after the softmax

28:04 function then that means apple goes to

28:08 0.3 orange plus 0.4 apple plus 0.15 and

28:11 plus 0.15 and that's given by the

28:13 second row of the table and when you

28:16 multiply this matrix by the values matrix

28:17 then you get some other embedding like

28:20 that now everything here is of length 4

28:21 but it could be a completely different length

28:23 because the values matrix doesn't need to

28:25 be square it can be rectangular and

28:27 then the second row tells us that

28:29 instead apple should go to v21 times

28:31 orange plus v22 times apple plus v23

28:34 times and plus v24 times the last word

28:36 so that's how the values matrix comes in

28:39 and transforms the first embedding into another one
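As a sketch of this values step (the similarity scores and the matrix V below are invented for illustration), each row of softmax weights turns a word into a weighted average of the rows of V, and V can be rectangular so the new embedding has a different length:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Made-up similarity scores between 4 words (one row per word)
scores = np.array([[4.0, 2.0, 0.5, 0.5],
                   [2.0, 4.0, 0.5, 0.5],
                   [0.5, 0.5, 4.0, 1.0],
                   [0.5, 1.0, 0.5, 4.0]])

A = softmax(scores)  # every row sums to 1, like the 0.3 / 0.4 / 0.15 / 0.15 above

# Values matrix: 4 rows in, 3 columns out -- rectangular, not square,
# so the output embedding has length 3 instead of 4
V = np.arange(12, dtype=float).reshape(4, 3)

new_embedding = A @ V   # each word becomes a weighted average of rows of V
print(new_embedding.shape)  # (4, 3)
```

The second row of `A` plays exactly the role described in the transcript: it mixes the rows of V with weights v21, v22, and so on to produce the new position of the second word.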

28:45 well that was a lot of stuff so let's do

28:47 a little summary on the left you have

28:49 the diagram for scaled dot-product

28:50 attention and the formula so let's break

28:52 it down step by step first you have this

28:55 step over here where you multiply K and

28:58 Q transpose what is that well that is

29:00 the dot product and you're dividing by

29:03 the square root of dk where dk is the length

29:05 of each of the vectors remember that

29:07 this is called scaled dot product so what

29:10 we're doing here is finding the

29:11 similarities between the words I'm going

29:14 to denote the similarities by angles

29:15 with cosine distance but you know that

29:17 I'm talking about scaled dot product

29:19 instead so now that we found the

29:21 similarities we move to the step over

29:23 here with the softmax which is the one

29:26 where we figure out where to move the

29:28 words in particular the technology Apple

29:30 moves towards the phone and the fruit

29:32 apple moves towards the orange but we're

29:35 not going to make the movements on this

29:36 embedding because this embedding is not

29:38 optimal for that this embedding is

29:39 optimal for finding similarities

29:41 so we're gonna use the values matrix to

29:44 turn this embedding into a better one V

29:46 is acting as a linear transformation

29:48 that transforms the left embedding into

29:51 the embedding on the right and the

29:53 embedding on the right is where we move

29:55 the words because this embedding over

29:57 here is optimized for the function of

30:01 the Transformer which is finding the

30:03 next word in a sentence so that is

30:06 self-attention now what's multi-head

30:07 attention well it's very similar except

30:10 you use many heads and by many heads I

30:13 mean many key and query and value

30:17 matrices here we're going to show three

30:19 but you could use eight you could use 12

30:21 you could use many more and the more you

30:24 use the better obviously the more you

30:25 use the more computing power you need but

30:27 the more you use the more likely you'll

30:29 be finding some pretty good ones so these

30:31 three K and Q matrices as we saw before

30:34 form three embeddings where you can

30:38 apply attention now K and Q are the ones

30:42 that help us find the embeddings where

30:43 you find the similarities between the

30:45 words

30:46 now we also have a bunch of value

30:48 matrices as many as key and query matrices

30:51 it's always the same number and just like

30:52 before these value matrices transform

30:55 these embeddings where we find

30:56 similarities into embeddings where we

31:00 can move the words around now here is

31:03 the magic step how do we know which ones

31:06 are good and which ones are bad well

31:09 right now we don't

31:11 we concatenate them first what does

31:14 concatenating mean well if I have a

31:15 table of two columns and another table

31:17 of two columns and another table of two

31:18 columns when I concatenate them I get a

31:20 table of six columns geometrically that

31:22 means if I have let's say an embedding

31:24 of two dimensions and another one and

31:25 another one I concatenate them then I

31:27 get an embedding of six dimensions now I

31:29 can't draw in six dimensions but imagine

31:31 that this thing over here is a really

31:34 high dimensional embedding of six

31:36 dimensions so something with six axes in

31:39 real life if you have a lot of big

31:40 embeddings you end up with a very high

31:42 dimensional embedding which is not

31:45 optimal so that's why we have this

31:46 linear step over here the linear step

31:49 over here is a rectangular matrix that

31:51 is going to transform this into a lower

31:54 dimensional embedding that we can manage

31:56 but there's more this matrix over here

31:59 learns which embeddings are good and

32:02 which embeddings are bad so for example

32:04 the best embedding for finding the

32:06 similarities was the third one so this

32:08 one gets scaled up and the worst

32:11 embedding was the one in the middle so

32:12 this one gets scaled down so this matrix

32:15 over here this linear step actually does

32:18 a lot and now if we know what matrices

32:21 are better than others and the linear

32:23 step actually scales those up and

32:26 scales the bad ones by a small amount

32:29 then we end up with a really good

32:32 embedding and so we end up doing

32:34 attention in a pretty optimal embedding

32:37 which is exactly what we want
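The whole multi-head pipeline just summarized — several heads of scaled dot-product attention, concatenation, then the final linear step — can be sketched in numpy like this; all sizes and matrices are made-up stand-ins for the trained ones.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, W_K, W_Q, W_V):
    """One head: scaled dot-product attention in its own embedding."""
    K, Q, V = X @ W_K, X @ W_Q, X @ W_V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # scaled dot product + softmax
    return weights @ V                         # move words in the values embedding

# 5 words, model dimension 6; 3 heads, each of size 2 (invented numbers)
n_words, d_model, n_heads, d_head = 5, 6, 3, 2
X = rng.normal(size=(n_words, d_model))

heads = []
for _ in range(n_heads):
    W_K = rng.normal(size=(d_model, d_head))
    W_Q = rng.normal(size=(d_model, d_head))
    W_V = rng.normal(size=(d_model, d_head))
    heads.append(attention_head(X, W_K, W_Q, W_V))

# Concatenate: three 2-column tables become one 6-column table
concat = np.concatenate(heads, axis=-1)        # shape (5, 6)

# The final linear step: a rectangular matrix that projects back down to a
# manageable size and, once trained, scales up good heads and down bad ones
W_O = rng.normal(size=(n_heads * d_head, d_model))
output = concat @ W_O
print(output.shape)  # (5, 6)
```

With 8 or 12 heads the concatenated table just gets wider, and the same final matrix `W_O` brings it back down to the model dimension.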

32:39 now I've done a lot of magic here

32:41 because I haven't really told you how to

32:44 find these key query and value matrices I

32:46 mean they seem to do great jobs but

32:49 finding them is probably not easy well

32:51 that's something we're gonna see more in

32:53 the next video but the idea is that this

32:56 key query and value matrices get trained

32:58 with the Transformer model here's a

33:00 Transformer model and you can see that

33:02 multi-head attention appears several

33:04 times

33:05 I like to simplify this diagram into

33:08 this diagram over here where you have

33:10 several steps tokenization embedding

33:12 positional encoding then a feed forward

33:15 and an attention part that repeats

33:18 several times and each of the blocks has

33:20 an attention block so in other words

33:22 imagine training a humongous neural

33:24 network and the neural network has inside

33:27 a bunch of key query and value matrices

33:30 that get trained as the neural network

33:32 gets trained to guess the next word but

33:35 I'm getting into the next video this is

33:37 what we're gonna learn on the third

33:39 video of this series so again this was

33:42 the second one attention mechanisms math

33:43 the first one had a high level idea of

33:46 attention and the third one is going to

33:47 be on Transformer models so stay tuned

33:51 when that is out I will put a link in

33:54 the comments so That's all folks

33:56 congratulations for getting to the

33:58 end this was a bit of a complicated

34:01 video but I hope that the pictorial

34:04 examples were helpful now time for some

34:07 acknowledgments I would not have been

34:09 able to make this video if not for my

34:12 friend and colleague who is a genius and

34:15 knows a lot about attention and

34:17 Transformers and he actually helped me

34:18 go over these examples and helped me form

34:21 these images so thank you Joel and some

34:24 more acknowledgments Jay Alammar was also

34:26 tremendously helpful in me understanding

34:29 Transformers and attention we had long

34:31 conversations where he explained this to

34:33 me several times and my friend Omar as

34:36 well Omar Flores was very helpful too I

34:39 actually have a podcast where I ask him

34:41 questions about Transformers for about

34:43 an hour and I learned a lot from this

34:46 podcast it's in Spanish but if you do

34:48 speak Spanish check it out the link is

34:50 in the comments and it's also in my

34:52 Spanish YouTube channel serrano.academy

34:54 and if you like this material definitely

34:56 check out llm.university it's a course

34:59 I've been doing at cohere with my very

35:01 knowledgeable colleagues Meor Amer and

35:03 Jay Alammar the same Jay as before

35:06 this is a very comprehensive course and

35:08 it's taught in a very simple language it

35:11 talks about all the stuff in this video

35:13 including embedding similarity

35:15 Transformers attention also it has a lot

35:18 of labs where you can do semantic search

35:20 you can do prompt engineering many other

35:24 topics and so it's very hands-on and it

35:27 also teaches you how to deploy models

35:29 basically it's a zero to 100 course on

35:32 LLMs and I recommend you check out

35:35 llm.university and finally if you want

35:37 to follow me well please subscribe to

35:39 the channel and hit like or put a

35:42 comment I love reading the comments it's

35:44 serrano.academy

35:46 you can also tweet me at

35:47 serrano.academy or check out my page

35:49 where I have blog posts and a lot of

35:51 other stuff the page is also

35:53 serrano.academy and I have a book called

35:55 rocking machine learning in which I

35:57 explain machine learning in this way in

36:00 a simple and pictorial way with Labs

36:03 that are on GitHub so check it out

36:05 there's a 40% discount code Serrano YT if

36:08 you want to take a look the link and

36:10 information is on the comments so thank

36:12 you very much for your attention and see

36:14 you in the next video


ðŸ’« FAQs about This YouTube Video

### 1. What are attention mechanisms in large language models?

Attention mechanisms are a key component of large language models, such as Transformers, and are essential for the model to understand the context of words and sentences. They allow the model to focus on relevant parts of the input.

### 2. How do attention mechanisms work in language models?

Attention mechanisms in language models work by allowing the model to assign different weights to different parts of the input, focusing on the most relevant information for the task at hand. This helps the model understand the context and relationships between words and sentences.

### 3. What are the key components of attention mechanisms in large language models?

The key components of attention mechanisms in large language models include the ability to focus on relevant parts of the input, assign different weights to input elements, and understand the context and relationships between words and sentences.

### 4. Why are attention mechanisms important in large language models like Transformers?

Attention mechanisms are important in large language models like Transformers because they enable the model to effectively understand and process the input by focusing on relevant information and capturing the relationships between words and sentences.

### 5. How do attention mechanisms contribute to the success of large language models?

Attention mechanisms contribute to the success of large language models by allowing them to effectively capture the complex relationships and dependencies within the input data, leading to improved performance in understanding and generating natural language.

ðŸŽ¥ Related Videos

### What vaccinating vampire bats can teach us about pandemics | Daniel Streicker

### a16z Podcast | Things Come Together -- Truths about Tech in Africa

### 2024 TSCRS Applications of anterior segments diagnostic instruments in cataract surgery

### a16z Podcast | The Infrastructure of Total Health

### The Robot Lawyer Resistance with Joshua Browder of DoNotPay

### NES Controllers Explained

ðŸ”¥ Recently Summarized Examples