00:00 Hello guys, welcome to my video about the Transformer. This is actually version 2.0 of my series on the Transformer: I had a previous video in which I talked about the Transformer, but the audio quality was not good, and since the video had a huge success, the viewers suggested that I improve the audio quality, so this is why I am making this new version. You don't have to watch the previous video, because I will be covering basically the same things, but with some improvements, compensating for some mistakes I made and for some improvements I could add. After watching this video, I suggest watching my other video about how to code a Transformer model from scratch: how to code the model itself, how to train it on data and how to run inference with it. Stick with me, because it's going to be a little long journey. But now, before we talk about the Transformer, I want to first talk about recurrent neural networks, the networks that were used before the Transformer was introduced for most sequence-to-sequence tasks. So let's review them.
01:11 Recurrent neural networks existed a long time before the Transformer, and they allowed us to map one sequence of inputs into another sequence of outputs. In this case our input is X and we want an output sequence Y. What we did before is that we split the sequence into single items: we gave the recurrent neural network the first item X1 as input, along with an initial state, usually made up of only zeros, and the recurrent neural network produced an output, let's call it Y1. This happened at the first time step. Then we took the hidden state of the network from the previous time step (this is called the hidden state of the network) along with the next input token X2, and the network had to produce the second output token Y2. Then we did the same procedure at the third time step, in which we took the hidden state of the previous time step along with the input token at time step 3, and the network had to produce the next output token, which is Y3. If you have n tokens, you need n time steps to map an n-token input sequence into an n-token output sequence.
02:33 this worked fine for a lot of tasks but
02:36 had some problems let's review them
02:40 The problems with recurrent neural networks are, first of all, that they are slow for long sequences. Think of the process we did before: we have kind of a for loop in which we do the same operation for every token in the input, so the longer the sequence, the longer this computation, and this made the network not easy to train for long sequences. The second problem was the vanishing or exploding gradients. Now, you may have heard these expressions on the Internet or from other videos, but I will try to give you a brief insight into what they mean on a practical level. So, as you know,
03:24 frameworks like PyTorch convert our networks into a computation graph. So suppose we have a computation graph; this is not a neural network, I will be making a computational graph that is very simple and has nothing to do with neural networks, but it will show you the problems that we have. Imagine we have two inputs, X and another input, let's call it Y. Our computational graph first, let's say, multiplies these two numbers, so we have a first function f(x, y) = x * y, and the result z is given to another function, let's say g(z) = z^2. What PyTorch does, for example, is the following: usually we have a loss function, and PyTorch calculates the derivative of the loss function with respect to each weight. In this case we just calculate the derivative of the output function g with respect to all of its inputs, so the derivative of g with respect to x is equal to the derivative of g with respect to f, multiplied by the derivative of f with respect to x, and these two intermediate terms kind of cancel out. This is called the chain rule. Now, as you can see,
05:09 the longer the chain of computation (so if we have many nodes one after another), the longer this multiplication chain. Here we have two factors, because the distance from this node to this one is two, but imagine you have 100 or 1000. Now imagine this number is 0.5 and this number is also 0.5: the result, when they are multiplied together, is a number that is smaller than the two initial numbers, it's 0.25, because it's one half multiplied by one half. So if we have two numbers that are smaller than one and we multiply them together, they will produce an even smaller number, and if we have two numbers that are bigger than one and we multiply them together, they will produce a number that is bigger than both of them. So if we have a very long chain of computation, it eventually will either become a very big number or a very small number,
06:11 and this is not desirable. First of all, because our CPU or our GPU can only represent numbers up to a certain precision, let's say 32-bit or 64-bit, and if the number becomes too small, the contribution of this number to the output will become very small, so when PyTorch, or whatever framework we use for automatic differentiation, calculates how to adjust the weights, the weights will move very, very slowly, because the contribution of this product will be a very small number. This means that the gradient is vanishing, or, in the other case, it can explode and become a very big number.
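To make this concrete, here is a minimal sketch of my own (not shown in the video) using PyTorch autograd: it builds the f and g functions from the example above, checks the chain rule, and then shows how a long chain of multiplications by factors smaller than one makes the gradient vanish.

```python
import torch

# The toy computation graph from the example: f(x, y) = x * y, g(z) = z^2
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = x * y                      # f(x, y)
g = z ** 2                     # g(z)
g.backward()
# Chain rule: dg/dx = dg/dz * dz/dx = 2z * y = 2 * 6 * 2 = 24
print(x.grad)                  # tensor(24.)

# A long chain of multiplications by factors smaller than one: the gradient vanishes
w = torch.tensor(1.0, requires_grad=True)
out = w
for _ in range(100):
    out = out * 0.5
out.backward()
print(w.grad)                  # roughly 7.9e-31, practically zero
```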
06:53 And this is a problem. The next problem is difficulty in accessing information. What does it mean? It means that, as you remember from the previous slide, we saw that the first input token is given to the recurrent neural network along with the first state. Now, we need to think of the recurrent neural network as a long graph of computation: it will produce a new hidden state, then we will use the new hidden state along with the next token to produce the next output. If we have a long input sequence, the last token will have a hidden state whose contribution from the first token has nearly gone, because of this long chain of multiplications. So actually the last token will not depend much on the first token, and this is also not good, because, for example, we know as humans that in a quite long text the context that we saw, let's say, 200 words before is still relevant to the context of the current words, and this is something that the RNN could not capture.
08:02 And this is why we have the Transformer. The Transformer solves these problems of the recurrent neural networks. We can divide the structure of the Transformer into two macro blocks: the first macro block is called the encoder and it's this first part here; the second macro block is called the decoder and it's the second part here. The third part here, which you see on the top, is just a linear layer, and we will see why it's there and what its function is. The two blocks, so the encoder and the decoder, are connected by this connection you can see here, in which some output of the encoder is sent as input to the decoder, and we will also see how. Let's start, first of all,
08:51 with some notations that I will be using
08:55 during my explanation and you should be
08:58 familiar with this notation also to
09:01 review some math. The first thing we should be familiar with is matrix multiplication. Imagine we have a matrix which is a sequence of, let's say, words, so it is sequence by d_model, and we will see why it's called sequence by d_model. Imagine we have a matrix that is 6 by 512, in which each row is a word, and this word is not made of characters but of 512 numbers, so each word is 512 numbers: imagine you have 512 of them along this row, 512 along this other row, etc. The first word we will call A, the second B, then C, D, E and F. If we multiply this matrix by another matrix, let's say the transpose of this matrix, so a matrix where the rows become the columns, then this word A will be here, then B, C, D, E and F, with 512 numbers along each column, because before we had them on the rows and now they will be on the columns; so here we have the 512 numbers, and this is a matrix that is 512 by 6. If we multiply them, we will get a new matrix in which we cancel the inner dimensions and keep the outer dimensions, so it will become 6 by 6, that is, 6 rows by 6 columns. So let's see how we calculate the values of this output matrix, which is 6 by 6.
11:08 This value is the dot product of the first row with the first column, so it is A dot A; the second value is the dot product of the first row with the second column; the third value is the first row with the third column, and so on until the last column, so A dot F, etc. What is the dot product? Basically, you take the first number of the first row (here we have 512 numbers, and here we have 512 numbers), so you take the first number of the first row and the first number of the first column and you multiply them together, then the second value of the first row and the second value of the first column and you multiply them together, and then you add all these products together. So it will be this number multiplied by this, plus this number multiplied by this, plus this number multiplied by this, and so on, and you sum all these numbers together: this is the dot product A dot A. We should be familiar with this notation, because I will be using it a lot in the next slides.
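As a quick sanity check of these shapes, here is a small sketch of my own (not from the video) in PyTorch: a (6, 512) matrix multiplied by its transpose gives a (6, 6) matrix, and each entry is the dot product of two rows.

```python
import torch

seq_len, d_model = 6, 512
A = torch.randn(seq_len, d_model)     # 6 words, each made of 512 numbers

scores = A @ A.T                       # (6, 512) x (512, 6) -> (6, 6)
print(scores.shape)                    # torch.Size([6, 6])

# Entry (0, 1) is the dot product of word 0 (row A) with word 1 (row B)
print(torch.allclose(scores[0, 1], A[0] @ A[1]))   # True
```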
12:21 Let's start our journey through the Transformer by looking at the encoder. The encoder starts with the input embeddings, so what is an input embedding?
12:33 First of all, let's start with our sentence: we have a sentence of, in this case, six words. What we do is we tokenize it, we transform the sentence into tokens. What does it mean to tokenize? We split it into single words. It is not necessary to always split the sentence into single words: we can even split the sentence into parts that are smaller than a single word, so we could even split this sentence into, let's say, 20 tokens by splitting each word into multiple sub-word parts. This is usually done in Transformer models, but we will not be doing it, otherwise it's really difficult to visualize. So let's suppose we have this input sentence and we split it into tokens, and each token is a single word.
13:27 The next step we do is we map these tokens into numbers, and these numbers represent the position of these words in our vocabulary. Imagine we have a vocabulary of all the possible words that appear in our training set: each word will occupy a position in this vocabulary, so for example one word will occupy position 105 and the word "cat" will occupy position 6500, and as you can see this "cat" here has the same number as this "cat" here, because they occupy the same position in the vocabulary. We take these numbers, which are called input IDs, and we map each of them into a vector: this vector is a vector made of 512 numbers, and we always map the same word to the same embedding. However, these numbers are not fixed: they are parameters of our model, so our model will learn to change these numbers in such a way that they represent the meaning of the word. So the input IDs never change, because our vocabulary is fixed, but the embedding will change along with the training process of the model; the embedding numbers will change according to the needs of the loss function. So the input embedding is basically mapping each single word into an embedding of size 512, and we call this quantity 512 d_model, because it's the same name that is also used in the paper "Attention Is All You Need".
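As a rough sketch of this step (my own illustration, with a made-up vocabulary size and made-up IDs except for the 105 and 6500 mentioned above), the input IDs are mapped to learned vectors of size d_model, for example with torch.nn.Embedding:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512           # vocabulary size here is just an example
input_embedding = nn.Embedding(vocab_size, d_model)

# "your cat is a lovely cat" -> hypothetical input IDs; note that "cat" (6500) repeats
input_ids = torch.tensor([[105, 6500, 5, 2, 6201, 6500]])
x = input_embedding(input_ids)             # (1, 6, 512), learned during training
print(x.shape)
```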
15:04 let's look at the next layer of the
15:07 encoder which is the positional encoding
15:10 so what is positional encoding
15:13 What we want is that each word should carry some information about its position in the sentence, because right now we have built a matrix of word embeddings, but they don't convey any information about where that particular word is inside the sentence, and this is the job of the positional encoding. We want the model to treat words that appear close to each other as close and words that are distant as distant; we want the model to see the spatial information that we see with our eyes. For example, when we see the sentence "what is positional encoding", we know that the word "what" is farther from the word "encoding" than it is from the word "is", because we have this spatial information given by our eyes, but the model cannot see this, so we need to give the model some information about how the words are spatially distributed inside the sentence,
16:15 and we want the positional encoding to
16:18 represent a pattern that the model can
16:20 learn and we will see how
16:24 Imagine we have our original sentence "your cat is a lovely cat". What we do is we first convert it into embeddings using the previous layer, so the input embeddings, and these are embeddings of size 512. Then we create some special vectors, called the positional encoding vectors, that we add to these embeddings. This vector we see here in red is a vector of size 512 which is not learned: it's computed once and not learned along with the training process, it's fixed, and this vector represents the position of the word inside the sentence. This should give us an output that is again a vector of size 512, because we are summing this number with this number, this number with this number, so the first dimension with the first dimension, the second dimension with the second dimension, and we will get a new vector of the same size as the input vectors. Now, how are these positional encodings calculated?
17:30 Imagine we have a smaller sentence, let's say "your cat is", and you may have seen the following expressions from the paper: PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model)). What we do is we create a vector of size d_model, so 512, for each word, and for each position in this vector we calculate the value using these two expressions, with these arguments: the first argument, pos, indicates the position of the word inside the sentence, so the word "your" occupies position zero, and for the even dimensions of the vector (so dimension 0, 2, 4, up to 510) we use the first expression, the sine, while for the odd dimensions of this vector we use the second expression, the cosine. And we do this for all the words inside the sentence. So this particular value is calculated as PE(1, 0), where the 1 represents the argument pos and the 0 represents the dimension index 2i, and PE(1, 1) means the same word at dimension one, so we will use the second expression, the cosine, giving it position 1, with 2i + 1 equal to 1. And we do the same for the third word, etc.
19:03 If we have another sentence, we will not have different positional encodings: we will have the same vectors even for different sentences, because the positional encodings are computed once and reused for every sentence that our model sees during inference or training. So we only compute the positional encodings once, when we create the model; we save them and then we reuse them, and we don't need to compute them every time we feed a sentence to the model.
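Here is a minimal sketch of how these fixed vectors can be precomputed (my own code, following the sine/cosine formulas from the paper, not necessarily the video's exact implementation):

```python
import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)              # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)     # even dimensions use the sine
    pe[:, 1::2] = torch.cos(pos * div)     # odd dimensions use the cosine
    return pe                              # computed once, fixed, not learned

pe = positional_encoding(seq_len=6, d_model=512)
print(pe.shape)                            # torch.Size([6, 512])
```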
19:36 So why did the authors choose the sine and cosine functions to represent positional encodings? Because if we look at the plot of these two functions, you can see the plot is by position, so the position of the word inside the sentence, and by depth, the dimension along the vector, so the 2i that you saw before in the previous expressions. If we plot them, we can see, as humans, a pattern here, and we hope that the model can also see this pattern. Okay, the next layer of the encoder is the multi-head attention.
20:11 We will not go inside the multi-head attention right away: we will first visualize the single-head attention, so the self-attention with a single head. So what is self-attention? Self-attention is a mechanism that existed before the Transformer was introduced; the authors of the Transformer just changed it into a multi-head attention. So how did self-attention work?
20:40 Self-attention allows the model to relate words to each other. We had the input embeddings that capture the meaning of each word, then we have the positional encoding that gives information about the position of each word inside the sentence, and now we want this self-attention to relate words to each other. Imagine we have an input sequence of six words with a d_model of 512, which can be represented as a matrix that we will call Q, K and V. Our Q, K and V are the same matrix representing the input, so the input of six words, each represented by a vector of size 512. We basically apply the formula we saw from the paper, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, to calculate the attention, the self-attention in this case. Why self-attention? Because each word in the sentence is related to other words in the same sentence, so it's self-attention.
21:51 So we start with our Q matrix, which is the input sentence. Let's visualize it: we have six rows, and on the columns we have 512 columns (they are really difficult to draw, but let's say we have 512 columns, and here we have six rows). Now, according to this formula, we multiply it by the same sentence but transposed, so the transpose of K, which is again the same input sequence; we divide by the square root of 512 and then we apply the softmax.
22:28 The output of this, as we saw before in the initial matrix notation: when we multiply a matrix that is 6 by 512 with another matrix that is 512 by 6, we obtain a new matrix that is 6 by 6, and each value in this matrix represents a dot product; this one is the dot product of the first row with the first column, this one is the dot product of the first row with the second column, and so on. The values here are actually randomly generated, so don't concentrate on the values. What you should notice is that the softmax transforms these values in such a way that each row sums up to one: this row here, for example, sums up to one, this other row also sums up to one, etc. The value we see here is the dot product of the embedding of the first word with itself, this value here is the dot product of the embedding of the word "your" with the embedding of the word "cat", and this value here is the dot product of the embedding of the word "your" with the embedding of the word "is". Each value represents, somehow, a score of how intense the relationship is between one word and another. Let's go ahead with the formula: for now we just multiplied Q by K transposed, divided by the square root of d_k and applied the softmax, but we didn't multiply by V.
24:04 So let's go forward: we multiply this matrix by V and we obtain a new matrix which is 6 by 512, because if we multiply a matrix that is 6 by 6 with another that is 6 by 512, we get a new matrix that is 6 by 512. One thing you should notice is that the dimension of this matrix is exactly the dimension of the initial matrix from which we started. What does this mean? We obtain a new matrix with six rows, one per word, in which each word has an embedding of dimension 512. So now this embedding here represents not only the meaning of the word, which was given by the input embedding, not only the position of the word, which was added by the positional encoding, but somehow it is a special embedding: these values also capture the relationship of this particular word with all the other words. And this particular embedding of this word here also captures not only its meaning, not only its position inside the sentence, but also the relationship of this word with all the other words.
25:27 I want to remind you that this is not the multi-head attention: we are just looking at the self-attention, so one head. We will see later how this becomes the multi-head attention.
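Here is a minimal sketch of this single-head self-attention, written by me as an illustration in PyTorch (no learned parameters, exactly as described so far):

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor) -> torch.Tensor:
    # x: (seq_len, d_model); Q, K and V are all the same input matrix
    q, k, v = x, x, x
    d_k = q.size(-1)
    scores = q @ k.T / math.sqrt(d_k)       # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)     # each row sums up to one
    return weights @ v                      # (seq_len, d_model), same shape as the input

x = torch.randn(6, 512)                     # "your cat is a lovely cat"
print(self_attention(x).shape)              # torch.Size([6, 512])
```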
25:41 Self-attention has some properties. First of all, it's permutation invariant. What does it mean to be permutation invariant? It means that if we have a matrix of words (before we had six words, in this case let's say just four words), and suppose that by applying the formula from before we produce this particular matrix, in which there is a new special embedding for the word A, a new special embedding for the word B, and likewise for the words C and D, so let's call them A', B', C' and D': if we change the position of these two rows in the input, the values will not change, only the position of the rows in the output will change accordingly. So the values of B' will not change, it will just change position, and C' will also change position, but the values in each vector will not change, and this is a desirable property.
26:40 Self-attention, as presented so far, requires no parameters; I mean, I didn't introduce any parameter that is learned by the model. I just took the initial sentence of, in this case, six words, multiplied it by itself, divided it by a fixed quantity, which is the square root of 512, and then applied the softmax, which does not introduce any parameters. So for now the self-attention didn't require any parameters, except for the embeddings of the words. This will change later, when we introduce the multi-head attention.
27:14 Also, because each value in the softmax matrix is a dot product of a word embedding with itself and with the other words, we expect the values along the diagonal to be the maximum, because each of them is the dot product of a word with itself.
27:36 There is another property of this matrix: before we apply the softmax, we can replace values in it. Suppose we don't want the words "your" and "cat" to interact with each other, or we don't want the words "is" and "lovely" to interact with each other. What we can do is, before we apply the softmax, replace this value with minus infinity and also this value with minus infinity, and when we apply the softmax, the softmax will replace minus infinity with zero, because, as you remember, the softmax uses e to the power of x, and if x goes to minus infinity, e to the power of minus infinity becomes very, very close to zero, so basically zero. This is a desirable property that we will use in the decoder of the Transformer. Now let's have a look at
28:31 what the multi-head attention is. What we just saw was the self-attention, and we want to convert it into a multi-head attention. You may have seen these expressions from the paper, but don't worry, I will explain them one by one. Imagine we have our encoder, so we are on the encoder side of the Transformer, and we have our input sentence, which is, let's say, 6 by 512, so six words by 512, the size of the embedding of each word. In this case I call it seq by d_model: seq is the sequence length, as you can see in the legend in the bottom left of the slide, and d_model is the size of the embedding vector, which is 512. What we do is we take this input and we make four copies of it: one will be sent along this connection we can see here, and three will be sent to the multi-head attention, with three respective names. So it's the same input that becomes three matrices that are equal to the input: one is called query, one is called key and one is called value. Basically, we are taking this input and making three copies of it, which we call Q, K and V; they have, of course, the same dimensions.
29:55 What does the multi-head attention do? First of all, it multiplies these three matrices by three parameter matrices called W^Q, W^K and W^V. These matrices have dimension d_model by d_model, so if we multiply a matrix that is seq by d_model with another one that is d_model by d_model, we get a new matrix as output that is seq by d_model, so basically the same dimension as the starting matrix, and we will call them Q', K' and V'.
30:30 Our next step is to split these matrices into smaller matrices; let's see how. We can split the matrix Q' along the sequence dimension or along the d_model dimension. In the multi-head attention we always split along the d_model dimension, so every head will see the full sentence but a smaller part of the embedding of each word. So if we have an embedding of, let's say, 512, it will become smaller embeddings of 512 divided by 4, and we call this quantity d_k: d_k is d_model divided by h, where h is the number of heads; in our case we have h equal to 4.
31:17 We can then calculate the attention between these smaller matrices, so Q1, K1 and V1, using the expression taken from the paper, and this will result in smaller matrices called head 1, head 2, head 3 and head 4. The dimension of head 1 up to head 4 is seq by d_v. What is d_v? It's basically equal to d_k; it's just called d_v because the last multiplication is done by V, and in the paper they call it d_v, so I am also sticking to the same names.
31:55 Our next step is to combine these matrices, these small heads, by concatenating them along the d_v dimension, just like the paper says. So we concatenate all these heads together and we get a new matrix that is seq by (h multiplied by d_v), and since, as we know, d_v is equal to d_k, h multiplied by d_v is equal to d_model, so we get back the initial shape: it's seq by d_model. The next step is to multiply the result of this concatenation by W^O, and W^O is a matrix whose dimensions are (h multiplied by d_v) by d_model, so d_model by d_model, and the result of this is a new matrix that is the result of the multi-head attention, which is seq by d_model.
32:56 So the multi-head attention, instead of calculating the attention between these matrices here, so Q', K' and V', splits them along the d_model dimension into smaller matrices and calculates the attention between these smaller matrices, so each head is watching the full sentence but a different aspect of the embedding of each word. Why do we want this? Because we want each head to watch different aspects of the same word. For example, in the Chinese language, but also in other languages, one word may be a noun in some cases, a verb in some other cases, or an adverb in other cases, depending on the context. So what we want is that one head maybe learns to relate that word as a noun, another head maybe learns to relate that word as a verb, and another head learns to relate that word as an adjective or adverb.
33:58 So this is why we want a multi-head attention.
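Here is a rough sketch of the split/concatenate mechanics just described (my own simplified code, unbatched and without masking or dropout, so not the exact implementation from my coding video):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 4):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)   # W^Q
        self.w_k = nn.Linear(d_model, d_model)   # W^K
        self.w_v = nn.Linear(d_model, d_model)   # W^V
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, q, k, v):
        seq_len, d_model = q.shape
        # (seq, d_model) -> (h, seq, d_k): split along the embedding dimension
        def split(x):
            return x.view(seq_len, self.h, self.d_k).transpose(0, 1)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)   # (h, seq, seq)
        heads = F.softmax(scores, dim=-1) @ v                    # (h, seq, d_k)
        # concatenate the heads back into (seq, h * d_k) = (seq, d_model)
        concat = heads.transpose(0, 1).contiguous().view(seq_len, d_model)
        return self.w_o(concat)                                  # (seq, d_model)

x = torch.randn(6, 512)
mha = MultiHeadAttention()
print(mha(x, x, x).shape)   # torch.Size([6, 512])
```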
34:02 Now, you may also have seen online that the attention can be visualized, and I will show you how. When we calculate the attention between the Q and the K matrices, so when we do the softmax of Q multiplied by K transposed, divided by the square root of d_k, we get a new matrix, just like we saw before, which is seq by seq, and it represents a score of the intensity of the relationship between two words. We can visualize this, and it will produce a visualization similar to this one, which I took from the paper, in which we see how all the heads work. For example, if we concentrate on this word "making" here, we can see that "making" is related to the word "difficult", this word here, by different heads, so the blue head, the red head and the green head; but the violet head is not relating these two words together, so "making" and "difficult" are not related by the violet or the pink head. The violet head and the pink head are relating the word "making" to other words, for example to this word "2009". Why is this the case? Because maybe this pink head could see a part of the embedding that the other heads could not see, and that made this interaction between these two words possible.
35:40 You may also be wondering why these three matrices are called query, keys and values. The terms come from database terminology, or from Python-like dictionaries, but I would also like to give an interpretation of my own by making a very simple example; I think it's quite easy to understand. Imagine we have a Python-like dictionary, or a database, in which we have keys and values: the keys are the categories of movies and the values are the movies belonging to that category (in my case I just put one movie per category). So we have the romantic category, which includes Titanic, we have action movies, which include The Dark Knight, etc. Imagine we also have a user that makes a query, and the query is "love". Because we are in the Transformer world, all these words are actually represented by embeddings of size 512, so what our Transformer will do is convert this word "love" into an embedding of size 512 (all the keys and values are already embeddings of size 512), and it will calculate the dot product between the query and all the keys, just like the formula: as you remember, the formula is the softmax of the query multiplied by the transpose of the keys, divided by the square root of d_k. So we are doing the dot product of the query with all the keys, in this case the word "love" with all the categories. This will result in a score that will amplify some values and not amplify others. In this case, our embeddings may be such that the words "love" and "romantic" are related to each other, the words "love" and "comedy" are also related to each other, but not as intensely as "love" and "romantic", so it's a less strong relationship, and maybe the words "horror" and "love" are not related at all, so maybe their softmax score will be very close to zero.
37:58 The next layer in the encoder is the Add & Norm, and to introduce the Add & Norm we need layer normalization, so let's see what layer normalization is.
38:09 Layer normalization is a layer that, okay, let's make a practical example. Imagine we have a batch of n items, in this case n is equal to three: item 1, item 2, item 3. Each of these items will have some features; it could be an embedding, so for example a vector of 512 features, but it could also be a very big matrix of thousands of features, it doesn't matter. What we do is we calculate the mean and the variance of each of these items independently from the others, and we replace each value with another value that is given by this expression: basically we are normalizing each item so that its new values have zero mean and unit variance. Actually, we also multiply this new value by a parameter called gamma, and then we add another parameter called beta, and this gamma and this beta are learnable parameters. The model should learn to multiply and add these parameters so as to amplify the values that it wants amplified and not amplify the values that it doesn't want amplified. So we don't just normalize, we actually also introduce some learnable parameters.
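A minimal sketch of this computation (my own illustration; in practice you would normally use nn.LayerNorm, which does the same thing):

```python
import torch
import torch.nn as nn

class LayerNormalization(nn.Module):
    def __init__(self, features: int = 512, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(features))   # learnable multiplier
        self.beta = nn.Parameter(torch.zeros(features))   # learnable addition
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # mean and std are computed per item, over its features (not over the batch)
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

x = torch.randn(3, 512)                   # a batch of 3 items
print(LayerNormalization()(x).shape)      # torch.Size([3, 512])
```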
39:30 I found a really nice visualization on paperswithcode.com in which we see the difference between batch norm and layer norm. As you can see, in layer normalization, if N is the batch dimension, we are calculating the statistics over all the values belonging to one item in the batch, while in batch norm we are calculating the same feature over all the items in the batch, so we are mixing, let's say, values from different items of the batch. In layer normalization we treat each item in the batch independently, and each item will have its own mean and its own variance.
40:14 Let's look at the decoder now. In the encoder we saw the input embeddings; in this case they are called output embeddings, but the underlying mechanism is the same. Here we also have the positional encoding, and it is also the same as on the encoder side. The next layer is the masked multi-head attention, and we will see it now. We also have a multi-head attention here, and here we should see that there is the encoder, which produces an output that is sent to the decoder as keys and values, while the query, so this connection here, is the query coming from the decoder. So this multi-head attention is not a self-attention anymore, it's a cross-attention, because we are taking two sentences: one is sent from the encoder side, so we provide the output of the encoder and we use it as keys and values, while the output of the masked multi-head attention is used as the query in this multi-head attention.
41:34 The masked multi-head attention is the self-attention of the input sentence of the decoder: we take the input sentence of the decoder, we transform it into embeddings, we add the positional encoding, we give it to this multi-head attention, in which the query, keys and values are the same input sequence, and we do the Add & Norm. Then we send this as the queries of the multi-head attention, while the keys and the values come from the encoder, and then we do the Add & Norm. I will not be showing the feed forward, which is just a fully connected layer. We then send the output of the feed forward to the Add & Norm, and finally to the linear layer, which we will see later. So let's have a look at the masked multi-head attention and how it differs from a normal multi-head attention.
42:25 Our goal is to make the model causal: it means that the output at a certain position can only depend on the words at the previous positions, so the model must not be able to see future words. How can we achieve this? As you saw, the output of the softmax in the attention calculation formula is this matrix, which is seq by seq. If we want to hide the interaction of some words with other words, we delete this value and replace it with minus infinity before we apply the softmax, so that the softmax will replace this value with zero, and we do this for all the interactions that we don't want. So we don't want "your" to watch future words, so we don't want "your" to watch "cat is a lovely cat", and we don't want the word "cat" to watch future words, but only the words that come before it, or the word itself; so we don't want this, this, this, this, and the same for the other rows. You can see that we are replacing all the values that are above this diagonal here, the principal diagonal of the matrix: we want all the values that are above this diagonal to be replaced with minus infinity, so that the softmax will replace them with zero.
43:55 Let's see at which stage of the multi-head attention this mechanism is introduced. When we calculate the attention between the smaller matrices, so Q1, K1 and V1, before we apply the softmax we replace these values, so this one, this one, this one, etc., with minus infinity; then we apply the softmax, and the softmax will take care of transforming these values into zeros. So basically we don't want these words to interact with each other.
44:30 And if we don't want this interaction, the model will learn not to make them interact, because the model will not get any information from this interaction; it's as if these words cannot interact.
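To illustrate the masking step (a sketch of my own, not the exact code of the video), this is how the values above the principal diagonal can be set to minus infinity before the softmax:

```python
import math
import torch
import torch.nn.functional as F

seq_len, d_k = 6, 512
q = k = v = torch.randn(seq_len, d_k)     # decoder self-attention: Q = K = V

scores = q @ k.T / math.sqrt(d_k)                                     # (6, 6)
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))               # hide future words

weights = F.softmax(scores, dim=-1)       # the -inf entries become zero
print(weights[0])                         # the first word only attends to itself: [1, 0, 0, 0, 0, 0]
output = weights @ v
```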
44:41 let's look at how the inference and
44:43 training works for a Transformer model
44:46 as I saw said previously we are dealing
44:50 with it we will be dealing with the
44:52 translation tasks so because it's easy
44:54 to visualize and it's easy to understand
44:56 all the steps let's start with the
44:59 training of the model we will go from an
45:02 English sentence I love you very much
45:04 into an Italian sentence it's a very
45:07 simple sentence it's easy to describe
45:12 We start with a description of the structure of the Transformer model, and we start with our English sentence, which is sent to the encoder. To our English sentence we prepend and append two special tokens: one is called start of sentence and one is called end of sentence. These two tokens are taken from the vocabulary, so they are special tokens in our vocabulary that tell the model what the start of a sentence is and what the end of a sentence is. We will see later why we need them; for now, just think that we take our sentence, we prepend a special token and we append a special token.
45:55 Then, as you can see from the picture, we take our input, we transform it into input embeddings, we add the positional encoding, and then we send it to the encoder. So this is our encoder input, seq by d_model; we send it to the encoder and it will produce an output, which is again seq by d_model, called the encoder output. As we saw previously, the output of the encoder is another matrix that has the same dimension as the input matrix, and we can see it as a sequence of embeddings in which each embedding is special, because it captures not only the meaning of the word, which was given by the input embedding, and not only the position, which was given by the positional encoding, but also the interaction of every word with every other word in the same sentence. Because this is the encoder, we are talking about self-attention, so it's the interaction of each word in the sentence with all the other words in the same sentence.
47:02 We want to convert this sentence into Italian, so we prepare the input of the decoder, which starts with the start of sentence token. As you can see from the picture of the Transformer, the outputs here are "shifted right": what does it mean to shift right? Basically, it means we prepend a special token called SOS, start of sentence. You should also notice that these two sequences, when we actually code the Transformer (if you watch my other video on how to code a Transformer, you will see it), are made of fixed length, so that whether we have a sentence like "ti amo molto" or a very long sequence, when we feed them to the Transformer they all become the same length. How do we do this? We add padding tokens to reach the desired length. So if our model can support, let's say, a sequence length of 1000, and in this case we have four tokens, we add 996 tokens of padding to make this sentence long enough to reach the sequence length. Of course, I'm not doing it here, because it's not easy to visualize otherwise.
48:18 Okay, we prepared this input for the decoder; we transform it into embeddings, we add the positional encoding, and then we send it first to the masked multi-head attention, along with the mask. Then we take the output of the encoder and we send it to the decoder as keys and values, while the queries come from the masked multi-head attention, so the queries come from this layer and the keys and the values are the output of the encoder. The output of all this big block here will be a matrix that is seq by d_model, just like for the encoder.
49:04 However, we can see that this is still an embedding, because it's d_model, a vector of size 512. How can we map this embedding back into our vocabulary? How can we understand which word in our vocabulary this is? That's why we need a linear layer that maps seq by d_model into seq by vocabulary size, so for every embedding that it sees, it will tell us the position of that word in our vocabulary, so that we can understand what the actual token output by the model is.
49:46 After that we apply the softmax, and then we have our label, what we expect the model to output given this English sentence: we expect the model to output "ti amo molto" followed by end of sentence, and this is called the label, or the target. When we have the output of the model and the corresponding label, we calculate the loss, in this case the cross entropy loss, and then we backpropagate the loss to all the weights.
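A rough sketch of this loss computation and backpropagation step (my own toy illustration: the model below is just an embedding plus a linear layer standing in for the full Transformer, and all sizes and token IDs are made up):

```python
import torch
import torch.nn as nn

vocab_size, seq_len, d_model, pad_id = 100, 6, 32, 0   # toy sizes, not the real ones
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss(ignore_index=pad_id)    # padding tokens don't contribute

decoder_input = torch.randint(1, vocab_size, (1, seq_len))   # "SOS ti amo molto" + padding
label = torch.randint(1, vocab_size, (1, seq_len))           # "ti amo molto EOS" + padding

logits = model(decoder_input)                                # (1, seq_len, vocab_size)
loss = criterion(logits.view(-1, vocab_size), label.view(-1))
loss.backward()                     # backpropagate the loss to all the weights
optimizer.step()
```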
50:21 Now let's understand why we have these special tokens called SOS and EOS. Basically, you can see that here the sequence length is 4 (actually it is 1000, because we have the padding, but let's say we don't have any padding), so it's four tokens: start of sentence, "ti", "amo", "molto", and what we want is "ti", "amo", "molto", end of sentence. So when our model sees the start of sentence token, it will output the first token, "ti"; when it sees "ti", it will output "amo"; when it sees "amo", it will output "molto"; and when it sees "molto", it will output end of sentence, which indicates that the translation is done. We will see this mechanism in the inference step.
51:12 And all of this happens in one time step. Just like I promised at the beginning of the video: I said that with recurrent neural networks we need n time steps to map an n-token input sequence into an output sequence, and that this problem would be solved with the Transformer. Yes, it has been solved, because as you can see here we didn't do any for loop, we just did it all in one pass: we give an input sequence to the encoder and an input sequence to the decoder, we produce some outputs, we calculate the cross entropy loss with the label, and that's it; it all happens in one time step. This is the power of the Transformer, because it made it very easy and very fast to train on very long sequences, with the very nice performance that you can see in ChatGPT, GPT, BERT, etc.
52:10 Let's have a look at how inference works. Again, we have our English sentence "I love you very much" and we want to map it into an Italian sentence. We have our usual Transformer; we prepare the input for the encoder, which is: start of sentence, "I love you very much", end of sentence. We convert it into input embeddings, we add the positional encoding, and we send it to the encoder. The encoder will produce an output which is seq by d_model, and we saw before that it's a sequence of special embeddings that capture the meaning, the position, but also the interaction of all the words with each other.
52:52 What we do for the decoder is give it just the start of sentence token (and of course we add enough padding tokens to reach our sequence length). For this single token, we convert it into embeddings, we add the positional encoding, and we send it to the decoder as the decoder input. The decoder will use its input as the query, and the keys and the values coming from the encoder, and it will produce an output which is seq by d_model. Again, we want the linear layer to project it back to our vocabulary, and this projection is called logits. Then we apply the softmax, which, given the logits, will tell us the position in the vocabulary of the word with the maximum score: this is how we know which word to select from the vocabulary. And this hopefully should produce the first output token, which is "ti", if the model has been trained correctly.
54:05 This, however, happens at time step one. When we train the Transformer model, everything happens in one pass: we have one input sequence and one output sequence, we give them to the model, we do one time step, and the model learns from it. When we run inference, however, we need to do it token by token, and we will also see why this is the case.
54:27 At time step 2 we don't need to recompute the encoder output again, because our English sentence didn't change, so the encoder should produce the same output for it. What we do is take the output of the previous step, so the token "ti", append it to the input of the decoder, and then feed it to the decoder again, together with the output of the encoder from the previous step. This will produce an output sequence from the decoder side, which we again project back into our vocabulary, and we get the next token, which is "amo". So, as I said before, we are not recalculating the output of the encoder at every time step, because our English sentence didn't change at all; what is changing is the input of the decoder, because at every time step we are appending the output of the previous step to the input of the decoder. We do the same for time step 3, and we do the same for time step 4, and hopefully we will stop when we see the end of sentence token, because that's how the model tells us to stop.
55:51 And this is how inference works, and why we needed four time steps.
55:57 When we run inference with a model like this translation model, there are many strategies for inferencing. What we used is called the greedy strategy: at every step we take the word with the maximum softmax value. This strategy usually works reasonably well, but there are better strategies.
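Here is a rough sketch of this greedy decoding loop (my own code; the model object and its encode, decode and project methods, as well as sos_id and eos_id, are hypothetical names, not taken from the video):

```python
import torch

def greedy_decode(model, encoder_input, sos_id: int, eos_id: int, max_len: int = 1000):
    # The encoder output is computed once and reused at every time step
    encoder_output = model.encode(encoder_input)

    decoder_input = torch.tensor([[sos_id]])     # start with only the SOS token
    for _ in range(max_len):
        out = model.decode(decoder_input, encoder_output)    # (1, cur_len, d_model)
        logits = model.project(out[:, -1])                   # logits of the last position
        next_token = logits.argmax(dim=-1, keepdim=True)     # greedy: take the maximum
        decoder_input = torch.cat([decoder_input, next_token], dim=1)
        if next_token.item() == eos_id:                      # the model tells us to stop
            break
    return decoder_input
```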
56:22 One of them is called beam search. In beam search, instead of always greedily taking the maximum softmax value (that's why the other one is called greedy), we take the top B values, and then for each of these choices we inference what the next possible tokens are; at every step we keep only the B most probable sequences and we delete the others. This is called beam search, and it generally performs better.
57:00 So thank you guys for watching. I know it was a long video, but it was really worth it to go through each aspect of the Transformer. I hope you enjoyed this journey with me, so please subscribe to the channel, and don't forget to watch my other video on how to code a Transformer model from scratch, in which I not only describe the structure of the Transformer model again while coding it, but I also show you how to train it on a dataset of your choice and how to run inference with it; I also provide the code on GitHub and the Colab notebook to train the model directly on Colab. Please subscribe to the channel and let me know what you didn't understand, so that I can give more explanation, and please tell me what the problems are with this kind of video, or with this particular video, that I can improve for the next videos. Thank you very much and have a great rest of the day.