00:05 hey there and welcome to this video in
00:08 the following I'll be doing a paper
00:09 explanation about cross attention if you
00:12 haven't heard about cross attention
00:13 don't be discouraged because it's the
00:15 technique that makes so many deep
00:17 learning models perform so well and
00:19 enables so many impressive results you
00:22 definitely have heard about stable
00:23 diffusion, Imagen or Muse, they all use
00:25 cross attention and for what cross
00:28 attention is a way to condition a model
00:30 with some extra information the
00:32 aforementioned models all use it to
00:34 condition their models on text and as of
00:36 now cross attention seems to be the best
00:38 method for injecting text conditioning
00:40 into a model and not only that this also
00:43 enables so many cool follow-up
00:44 techniques as you might have seen with
00:46 stable diffusion and my intention with
00:48 this video is to give you some intuitive
00:50 and visual explanation about cross
00:52 attention because I believe while many
00:54 people have heard about it a lot
00:56 especially through the hype of stable
00:58 diffusion many don't really understand
01:00 how it works and just treat it as a
01:02 black box and the same was true for me
01:04 before I started making this video so
01:06 let's not lose any time and get into the
01:09 explanation so before going into cross
01:11 attention itself I want to take some
01:13 time to talk about the standard
01:15 attention also called self-attention you
01:17 probably have all seen this formula over
01:20 and over we have some input and
01:22 construct three different matrices
01:24 called Q, K and V then we multiply Q by K
01:28 and divide by a constant and then take a
01:30 softmax and after that we multiply by V
01:33 and voila, attention
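To make that formula concrete, here is a minimal NumPy sketch of scaled dot-product attention; the function names and layout are my own placeholders, not code from any particular paper or library:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability, then normalize
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V -- the formula described above
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted average of the value rows
```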
01:36 but what does it actually do? for that I want to take an
01:38 example about how to use self-attention
01:40 on an image this is also done by stable
01:43 diffusion for example but note that this
01:45 doesn't have anything to do with
01:47 conditioning the model on some extra
01:48 information self-attention is a way to
01:51 globally route information across the
01:53 entire image instead of just locally as
01:55 with convolutions, you will see what that means
01:59 for Simplicity let's say our image has
02:01 only 2x2 pixels but let's also assume
02:04 the image is RGB and thus has three channels
02:08 to apply self-attention now we first
02:10 have to flatten our image like this
02:12 which gives us a shape of four by three
02:14 and remember it's just four pixels with
02:17 three channels for RGB now we want to
02:19 get our q k and V matrices which means
02:22 we need three different ways to project
02:25 our input pixels into the latent space
02:27 where we'll apply attention in order to
02:30 make things really simple and easy to
02:32 visualize I'm choosing to project the
02:34 three-dimensional input into two
02:36 Dimensions but in practice you would go
02:38 much higher in order to do that each
02:40 weight Matrix will have a shape of three
02:42 by two remember that to project some
02:45 Vector from a dimension of n into a
02:47 dimension of m you construct a matrix
02:50 of shape n by m in our case we want
02:52 to go from three dimensions to two
02:55 Dimensions which is why we construct the
02:57 three by two Matrix easy right first
03:00 let's make up some arbitrary values for
03:02 our image pixels which we can calculate with.
03:04 projecting the input to Q, K and V
03:08 leaves us with all matrices of shape four by two
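As a sketch of this setup, with random numbers standing in for the arbitrary pixel values and weights used in the video:

```python
import numpy as np
rng = np.random.default_rng(0)

image = rng.random((2, 2, 3))   # toy 2x2 RGB image with made-up values
X = image.reshape(4, 3)         # flatten: 4 pixels, 3 channels each

W_q = rng.random((3, 2))        # three separate 3x2 projection matrices
W_k = rng.random((3, 2))
W_v = rng.random((3, 2))

Q = X @ W_q                     # shape (4, 2)
K = X @ W_k                     # shape (4, 2)
V = X @ W_v                     # shape (4, 2)
```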
03:13 now let's take a look at the first step
03:15 which is to calculate Q by K this
03:18 operation basically multiplies each
03:20 Vector in Q which stands for one pixel
03:23 by all the vectors of the pixels in K
03:25 since each of these is a DOT product the
03:28 outcome will be a single number
03:30 indicating the similarity between the
03:33 vectors and by doing this for all pixels
03:35 we get a similarity Matrix of each pixel
03:38 to every other pixel so multiplying Q by
03:41 K can be interpreted as a similarity lookup
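In code this similarity lookup is a single matrix product; the numbers below are again just placeholders:

```python
import numpy as np
rng = np.random.default_rng(0)

Q = rng.random((4, 2))   # one 2-d query per pixel
K = rng.random((4, 2))   # one 2-d key per pixel

S = Q @ K.T              # shape (4, 4): dot product of every pixel with every other pixel
print(S.shape)
```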
03:45 we can better see this if we visualize q
03:48 and K in a two-dimensional grid if the
03:51 model thinks that two pixels should
03:53 attend to each other or in other words
03:55 should share some information it can use
03:58 the weight matrices for Q and K to
04:01 project them close together as done with
04:03 these two vectors for example by
04:05 applying the dot product between the two
04:07 we get a high number for the similarity
04:09 whereas for example these two vectors
04:11 are further away from each other and
04:14 thus the similarity will be small and
04:16 they will not attend to each other as
04:18 much now in the formula you see that we
04:20 divide by this Factor but for
04:22 demonstration purposes I will omit it
04:24 since it only scales the similarity
04:26 Matrix this is mainly done for having a
04:29 more stable training next we apply the
04:31 softmax function in our case we apply it
04:34 over the rows because we want to
04:36 normalize the values for how much one
04:38 individual pixel attends to all the others
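A row-wise softmax in NumPy could look like this; the similarity values are made up just to show that every row ends up summing to one:

```python
import numpy as np

S = np.array([[ 0.2, 0.6, 0.4, -0.1],   # made-up similarity scores, one row per pixel
              [ 0.3, 0.1, 0.5,  0.2],
              [ 0.0, 0.4, 0.2,  0.3],
              [ 0.5, 0.2, 0.1,  0.0]])

A = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)   # softmax over each row
print(A.sum(axis=1))                                   # every row sums to 1.0
```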
04:41 you can interpret these values now as to
04:44 how much a pixel will attend to some other pixel
04:47 for example this pixel here will attend
04:50 23 percent to itself, 33 percent to the second
04:53 pixel, 27 percent to the third and another 17
04:57 percent to the fourth pixel and here you
05:00 can already see how attention enables
05:02 us to Route information between all
05:04 pixels in a single layer and the last
05:06 step is now to multiply our similarity
05:08 matrix by the V Matrix and here you'll
05:11 see that this basically just acts as a
05:13 weighted average you can see that best
05:15 If instead of doing the normal matrix
05:17 multiplication we take each row in the
05:19 similarity Matrix and multiply each
05:22 value by the corresponding row embedding
05:24 for each pixel in V let's look at the
05:28 first row of a similarity Matrix we take
05:30 the first value which stands for how
05:32 much the first pixel should attend to
05:34 the first pixel in that case 0.23 or 23
05:38 percent and multiplied by the first row
05:40 of the V Matrix then we take the second
05:43 value 0.33 or 33 percent which stands
05:47 for how much the first pixel should
05:49 attend to the second pixel and multiply
05:51 it by the second row and we do the same
05:53 for the third and fourth value and at
05:56 the end we sum up all vectors and you
05:58 can see it's the same outcome as with
06:00 the normal matrix multiplication just a
06:02 bit better to understand in my opinion
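You can check that equivalence with a few lines of NumPy; the attention weights below are the ones from the example, while the value vectors are made up:

```python
import numpy as np

A = np.array([[0.23, 0.33, 0.27, 0.17]])   # attention weights of the first pixel
V = np.array([[1.0, 0.0],                  # made-up 2-d value vectors, one per pixel
              [0.5, 0.5],
              [0.2, 0.8],
              [0.0, 1.0]])

weighted = sum(A[0, i] * V[i] for i in range(4))   # row-by-row weighted average
print(weighted)      # same numbers as...
print(A @ V)         # ...the plain matrix multiplication
```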
06:04 and that's how we do it for every row of
06:06 the similarity Matrix we end up having a
06:09 four by two Matrix where each row is the
06:12 weighted average of the embeddings of
06:14 all pixels and I really want to
06:17 emphasize the last point we started off
06:19 with each pixel having its own embedding
06:22 solely influenced by itself and we
06:25 end up with a mix of all pixel
06:27 embeddings and this mix and its
06:29 proportions can be determined by the
06:32 model itself it's like you being a cook
06:34 and you can determine how you want to
06:36 mix ingredients together to get the best
06:39 possible outcome in the form of a meal
06:41 and that was the core idea of how
06:43 attention works in the form of
06:44 self-attention pretty easy and intuitive
06:47 right the last thing that we still have
06:49 to do is to project a 4x2 Matrix back
06:52 into the original space which had three
06:54 dimensions for RGB and we do that with a
06:57 simple linear layer of shape two by
06:59 three and now at the end we add the
07:02 attention output to the input
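The back-projection and the skip connection might look like this, with random stand-ins for the attention output and the learned 2x3 layer:

```python
import numpy as np
rng = np.random.default_rng(0)

X        = rng.random((4, 3))   # flattened input: 4 pixels, 3 channels
attn_out = rng.random((4, 2))   # stand-in for the attention output in the 2-d space

W_out = rng.random((2, 3))      # linear layer projecting back to 3 channels
Y = X + attn_out @ W_out        # skip connection: input plus projected attention output
print(Y.shape)                  # (4, 3), same shape as the input
```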
07:04 the reason we do the skip connection with the input
07:07 is not only because the gradients can
07:09 flow better but it's also because of the
07:11 fact that attention can fully
07:13 concentrate to attend to whatever it
07:15 wants without the skip connection each
07:17 pixel would also need to pay a lot of
07:19 attention to itself in order not to lose
07:21 information about itself with the skip
07:24 connection the original input stays
07:27 fully preserved and can be modified by
07:29 the attention output in whatever way the
07:31 model thinks it's useful I hope it makes
07:33 sense if not just ask in the comments
07:35 and we can think of a nice example to
07:37 explain it
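Putting the whole self-attention walkthrough into one place, a possible end-to-end sketch could look like this; everything here (names, random weights, the tiny 2x2 image) is a placeholder I chose, not code from any real model:

```python
import numpy as np
rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(image, W_q, W_k, W_v, W_out):
    h, w, c = image.shape
    X = image.reshape(h * w, c)                   # flatten: one row per pixel
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # project pixels into the attention space
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # pixel-to-pixel attention weights
    out = (A @ V) @ W_out                         # weighted average of values, projected back
    return (X + out).reshape(h, w, c)             # skip connection, reshape back to the image

image = rng.random((2, 2, 3))
W_q, W_k, W_v = rng.random((3, 2)), rng.random((3, 2)), rng.random((3, 2))
W_out = rng.random((2, 3))
print(self_attention(image, W_q, W_k, W_v, W_out).shape)   # (2, 2, 3)
```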
07:40 okay now with that knowledge in mind understanding cross-attention
07:42 will be super easy in cross attention we
07:45 do one significant change to Q, K and V
07:48 instead of all matrices being
07:50 projections of the input features in our
07:52 case the images only Q will be a
07:55 projection of the input features and K
07:57 and V will be projections of the
07:59 conditional information in the example
08:01 that we'll be doing we'll assume the
08:04 condition will be text but basically
08:06 anything else would work too let's first
08:08 talk about how text is represented
08:10 usually right now most text to image
08:12 models employ some large pre-trained
08:14 Transformers and usually they only use
08:16 the encoder the first step is to take
08:18 your caption and tokenize it then each
08:21 token will be embedded and fed through
08:23 the Transformer layers of the encoder
08:25 and eventually we'll get an output with
08:27 a shape that sort of looks like this the
08:30 first number is just the batch size the
08:32 second number is the number of tokens
08:33 and the third is the encoding Dimension
08:36 after all Transformer layers in our
08:38 simple example we'll ignore the batch
08:40 size and take as an example I love
08:42 mountains which I do and let's say that
08:44 we tokenize it and apply the Transformer
08:47 and get as output a matrix of shape 3
08:49 by 6: three tokens, I love mountains,
08:53 which all have an embedding of size six
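In a real pipeline this 3-by-6 matrix would come out of the pre-trained text encoder (Stable Diffusion, for instance, uses CLIP's text encoder); for the toy example we can just fake it with random numbers of the right shape:

```python
import numpy as np
rng = np.random.default_rng(0)

tokens = ["I", "love", "mountains"]
text_emb = rng.random((len(tokens), 6))   # stand-in for the encoder output, shape (3, 6)
print(text_emb.shape)
```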
08:55 now let's construct all weight matrices
08:57 we again will project into a
08:59 two-dimensional space and we said that
09:01 this time only Q comes from the image
09:03 and K and V from the conditioning for
09:07 our 2x2 image we'll take the same
09:07 projection of three by two and for K and
09:10 V we'll create projections of 6 by 2 to
09:13 project the six-dimensional embeddings
09:15 down to two so now we have Q which has a
09:18 shape of 4x2 and K and V which have a shape
09:21 of three by two: four pixels and three words
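As a sketch of this setup, again with random placeholder values:

```python
import numpy as np
rng = np.random.default_rng(0)

X        = rng.random((4, 3))   # the 4 flattened image pixels
text_emb = rng.random((3, 6))   # the 3 token embeddings of size 6

W_q = rng.random((3, 2))        # image -> 2-d queries
W_k = rng.random((6, 2))        # text  -> 2-d keys
W_v = rng.random((6, 2))        # text  -> 2-d values

Q = X @ W_q          # shape (4, 2): one query per pixel
K = text_emb @ W_k   # shape (3, 2): one key per word
V = text_emb @ W_v   # shape (3, 2): one value per word
```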
09:24 and now we do the exact same thing as
09:26 before we first multiply Q by K and to
09:30 look at it in more detail every
09:31 embedding of each pixel will be
09:33 multiplied by each embedding of each
09:35 word so embeddings that are similar
09:37 will have a high value and therefore a
09:39 given pixel will attend more to a
09:42 specific word in our example the pixels
09:44 containing the mountains might pay a lot
09:47 of attention to the word mountains after
09:49 creating the similarity matrix now we
09:51 again normalize the values with the softmax
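The pixel-to-word similarity matrix and its normalization, sketched with placeholder numbers:

```python
import numpy as np
rng = np.random.default_rng(0)

Q = rng.random((4, 2))   # queries from the 4 pixels
K = rng.random((3, 2))   # keys from the 3 words

S = Q @ K.T                                            # shape (4, 3): pixel-to-word similarities
A = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)   # softmax over the words for each pixel
print(A.shape, A.sum(axis=1))                          # (4, 3), every row sums to 1
```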
09:56 the final step is to multiply
09:56 the similarity matrix by the value
09:58 Matrix and we can apply the same trick
10:00 as before here to understand better
10:02 what's going on let's focus on the first
10:04 row for the first pixel this pixel wants
10:06 to attend 47 percent to the first word, 22
10:10 percent to the second word and 31
10:12 percent to the last word and that's why
10:14 we multiply the first row of the value
10:16 Matrix which remember comes from the
10:18 text conditioning by 0.47 then add the
10:22 second row times 0.22 and then add the
10:25 third row times 0.31 to it this will
10:29 make up the weighted embedding Vector of
10:30 all words for the first pixel and now we
10:33 do the same for the rest of the pixels
10:35 and all the words and by the end each
10:37 pixel will have acquired information from
10:39 the entire text conditioning and decided
10:42 for itself what it wants to pay
10:43 attention to the most beautiful right
10:47 and now we have the final projection
10:48 layer which goes back to the original
10:50 space of three dimensions for RGB since
10:54 we're in a two-dimensional space we will
10:56 again construct this projection layer to
10:57 have a weight shape of two by three
10:59 inputting the attention output of shape
11:02 4x2 transforms it into four by three and
11:05 we can happily add it back to our input
11:07 with the skip connection and there you
11:10 have it, that is cross attention
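And for completeness, the whole cross-attention block in one sketch; as before, all names, shapes and random weights here are placeholders of mine and not the exact code of Stable Diffusion or any other model:

```python
import numpy as np
rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(X, cond, W_q, W_k, W_v, W_out):
    # X:    (num_pixels, channels)  -- the flattened image
    # cond: (num_tokens, emb_dim)   -- the text conditioning
    Q = X @ W_q                      # queries come from the image
    K, V = cond @ W_k, cond @ W_v    # keys and values come from the text
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # pixel-to-word attention weights
    return X + (A @ V) @ W_out       # project back and add the skip connection

X    = rng.random((4, 3))            # 2x2 RGB image, flattened
cond = rng.random((3, 6))            # "I love mountains": 3 token embeddings of size 6
out  = cross_attention(X, cond,
                       rng.random((3, 2)), rng.random((6, 2)),
                       rng.random((6, 2)), rng.random((2, 3)))
print(out.shape)                     # (4, 3)
```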
11:12 and the cool thing is that you can use cross
11:14 attention for so many other things as
11:16 also shown by stable diffusion you just
11:18 need to find a good way to bring your
11:20 conditional data into a not too big
11:22 dimensional space such that the
11:24 computations don't explode and you're
11:26 good to go to conclude this video let's
11:29 do a quick summarization in
11:31 self-attention we create three different
11:33 projections from our inputs q k and V
11:36 the goal of attention is to enable our
11:38 model to Route information together in a
11:40 global manner of what it thinks is
11:42 useful we first multiply Q by K which
11:45 serves as a similarity lookup and gives
11:47 us a measure of how much each element
11:49 wants to attend to every other element
11:51 after normalizing the output we then
11:53 multiply the similarity matrix by V to
11:56 Route the essential information together
11:58 we apply a final linear layer to project
12:00 back into the original space and add the
12:03 skip connection to modify the input in a
12:05 useful way and the same logic applies to
12:07 cross attention where now K and V come
12:09 from a conditional input we create our
12:12 projection matrices to embed q k and V
12:15 into the same space and do the same as
12:17 before this time each element from our
12:19 main input has the possibility to attend
12:22 to each element from the conditional
12:24 input in any way so you see it's really
12:26 about giving a model the most possible
12:28 freedom and avoiding human handcrafted
12:31 features and hard-coded rules this to me
12:34 seems to be one of the biggest lessons
12:35 from deep learning's past, and I guess
12:37 it's really valuable to keep in mind
12:39 when tackling new problems I hope most
12:42 of the things were clear but if anything
12:43 was unclear just feel free to post it in
12:46 the comments and if you like this
12:47 kind of content consider subscribing and
12:49 sharing it with a friend and with that
12:51 being said I wish you a nice day