00:00 Transformer models rely on positional
00:03 embeddings to help them preserve the
00:06 sequential nature of the data when they
00:11 process tokens. The original Transformer model
00:13 introduced sinusoidal positional
00:15 embeddings, where each hidden dimension
00:18 is modeled via a sinusoidal curve.
00:22 since then, different solutions have come
00:26 along. In this modern era of LLMs, most models
00:31 have found one type of embedding to be
00:33 performant and generalizable: that is,
00:37 RoPE embeddings. RoPE stands for rotary
00:41 positional embeddings, and as the name
00:44 suggests, its primary functionality is
00:47 rotating query and key vectors based on
00:51 their position in the sequence.
00:54 as we go through the video, we will see
00:56 why this is important, and things will
01:01 become clearer. So, with no pun intended, let's get the ball rolling.
01:06 the flagship computation of a Transformer
01:09 model is the self-attention layer
01:12 computation which revolves around three
01:15 ingredients queries keys and values out
01:20 of these queries and keys are used to
01:23 compute the attention Matrix if you'd
01:26 like to learn about the end-to-end
01:28 computation in Transformer models check
01:31 out my video on the Transformer model
01:35 each position has a query and a key. In
01:39 our case, let's assume a dimensionality
01:43 of two, so each query or key is a 2D vector. When
01:48 computing the attention matrix, we want
01:51 to encode two characteristics:
01:54 tokens that have similar token
01:56 embeddings should have a higher score
01:59 which is a no-brainer for example if the
02:02 two tokens are cat and dog it makes
02:06 sense to give them a higher score as
02:08 they appear in very similar contexts
02:12 next the score should be higher for
02:15 words that are close together
02:17 again quite sensible
02:20 the further apart two words are, the less
02:23 likely they are to be related. You can
02:25 see the effect of this positional behavior
02:28 in the bottom attention matrix, where
02:31 the intensity is highest along the
02:33 diagonal. The diagonal is where the query
02:36 and the key share the same position, and
02:40 the score declines as you go in the
02:42 direction orthogonal to the diagonal.
02:44 each entry in this attention matrix is
02:48 the dot product of q_m and k_n.
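Here's a minimal sketch of that computation (my own code, not from the video): with a hidden size of two, each entry of the attention matrix is just the dot product of one query and one key.

```python
# Minimal sketch: attention scores as dot products of 2-D queries and keys.
import numpy as np

np.random.seed(0)
seq_len, d = 5, 2                 # 5 positions, hidden size 2
Q = np.random.randn(seq_len, d)   # one query vector per position
K = np.random.randn(seq_len, d)   # one key vector per position

# attention_matrix[m, n] = dot(Q[m], K[n])
attention_matrix = Q @ K.T
print(attention_matrix.shape)     # (5, 5)
```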
02:52 since we are working with 2D vectors, we
02:55 can visualize the query and the key on a
03:00 2D plane. Let's then switch to the polar
03:02 coordinates. Now here's one way to think
03:05 about the relationships between the Q
03:08 and K vectors and the two versions of
03:11 the attention matrix that need to be
03:13 captured. The magnitude, or the radial
03:16 component of the vector, contributes to
03:19 the token similarity matrix. That is,
03:22 the similarity between radial components,
03:25 denoted by ‖q‖ · ‖k‖,
03:29 corresponds to the token embedding
03:31 similarity at the (m, n) position. The
03:35 angles of Q and K, θ_m and θ_n,
03:39 contribute to the positional similarity
03:42 at the (m, n) position. Here's something
03:45 important about this design
03:48 the angle component only depends on the
03:51 position of the vector in the sequence
03:54 and is independent of the actual token.
03:58 On the other hand, the token similarity
04:01 is captured by the radial component of
04:03 the vector and is independent of the position.
04:08 Remember this, as this school of thought
04:11 forms the intuition behind RoPE embeddings.
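One way to write the decomposition described here in polar form (the symbols are my own shorthand for the narration, with θ_m and θ_n the angles of q_m and k_n):

```latex
% dot product of a query at position m and a key at position n, in polar form
q_m \cdot k_n
  = \underbrace{\lVert q_m \rVert \, \lVert k_n \rVert}_{\text{token similarity}}
    \; \underbrace{\cos(\theta_m - \theta_n)}_{\text{positional similarity}}
```

The radial factor carries the token information, while the cosine factor depends only on the two angles, which is exactly the separation described above.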
04:16 next we will discuss how we can design
04:19 functions to capture these properties
04:21 during the attention computation
04:24 we are not so much worried about the
04:26 token similarity as we are about the
04:30 positional similarity
04:31 so for the rest of this talk we will
04:34 assume the same token everywhere.
04:37 so the token similarity attention Matrix
04:40 has the same value everywhere now the
04:43 only differentiator is how we design the
04:46 positional embeddings
04:48 before we dive headfirst into the RoPE
04:51 computation, let's motivate ourselves by
04:54 learning what's wrong with the sinusoidal
04:57 embeddings. Here's what the sinusoidal
04:59 embeddings look like for 2D query/key
05:02 vectors: all we do is push these dots
05:06 along the curves and take measurements.
05:10 so at position 0 it's (0, 1), and at
05:14 position 1 it's (0.88, 0.61), and so on.
05:22 If you compute the actual values at m
05:25 equals 1, 2, 3, and so on, you will get just
05:30 slightly different results due to
05:33 precision issues in our computation. This
05:36 looks like a sensible thing for a
05:37 positional embedding: the values seem to
05:40 gracefully go up and down based on their position.
05:44 so what's the issue? Once the positional
05:47 embeddings are computed, they are added
05:50 to X. So let's see how this affects an
05:53 example vector. For simplicity, let's pick
05:57 (1, 1). Ah, here's what's wrong: though
06:01 conceptually it looks good, not so much
06:04 in practice. The vector seems to move a
06:07 bit chaotically as the position changes.
06:11 It's difficult to see a pattern in the
06:14 way it moves; both the magnitude and the
06:17 angle change significantly from each position to the next.
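To make this concrete, here's a small sketch (assumptions mine, not code from the video) that adds a 2D sinusoidal embedding [sin(m), cos(m)] to the example vector (1, 1) and prints how the magnitude and angle of the result jump around as the position changes.

```python
# Sketch: add a 2-D sinusoidal positional embedding to a fixed token vector
# and watch how erratically the resulting vector's magnitude and angle move.
import numpy as np

x = np.array([1.0, 1.0])                    # the example token vector
for m in range(6):                          # positions 0..5
    pe = np.array([np.sin(m), np.cos(m)])   # 2-D sinusoidal embedding
    v = x + pe                              # sinusoidal PEs are added to x
    magnitude = np.linalg.norm(v)
    angle = np.degrees(np.arctan2(v[1], v[0]))
    print(f"m={m}: v={v.round(2)}, |v|={magnitude:.2f}, angle={angle:.1f} deg")
```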
06:21 also remember, this is only two dimensions.
06:25 Standard Transformer models have
06:27 hundreds of them, and things probably
06:29 get unimaginably complex when you
06:32 scale up to those numbers. Here's an
06:35 informal view of what might be
06:37 happening in this model.
06:39 first you give the model some data
06:42 'dog' appears at position 4, 'bark' at
06:47 6, and 'purr' at 20. Then the model asks,
06:51 what are these positions
06:54 then you tell the model that you designed
06:56 specialized sinusoids to provide a
06:59 vector at each position.
07:01 But all that the model sees is a table of numbers.
07:07 as we saw it's hard to see the pattern
07:10 through those numbers. The model does the
07:12 best it can given the situation: it's
07:15 going to memorize these positions instead of
07:18 learning the pattern. Here's an analogy:
07:20 imagine you have an exam coming up and
07:23 you haven't studied for it.
07:24 So the night before the exam, you hit the
07:27 books and try to study the material, but the
07:30 concepts are tough to learn and you're
07:33 running out of time. So what do you do?
07:35 You try to memorize the answers instead.
07:39 the Transformer model opts for
07:42 something similar, but unlike you, the
07:45 Transformer model has its reasons to
07:48 memorize answers. First reason: these
07:51 models have billions of parameters and
07:53 as you know the more parameters the more
07:57 overfitting or memorization takes place
08:00 second reason when being trained on
08:03 trillions of tokens it probably sees
08:06 most of the words at all possible
08:09 positions in the sequence so there's
08:11 hardly any reason for the model to go
08:13 above and beyond to learn patterns
08:16 continuing our conversation: you train
08:19 the model this way using 20 positions,
08:22 and then at inference time you ask
08:25 about position 21. This will confuse the
08:29 model so much because it has no notion
08:31 of a position 21. The model asks, "what is
08:37 this?" and outputs something erratic.
08:39 research has already confirmed this
08:41 behavior: the perplexity of models
08:44 trained using sinusoidals exploded past
08:48 the training length. Now let's learn how
08:51 RoPE fixes this. RoPE is built on a simple
08:55 concept: instead of trying to mash the
08:58 token and position embeddings into a
09:00 paste by adding the two together
09:03 and computing Q, let's find a
09:06 transformation for Q and its position m
09:09 that gives a new vector.
09:11 the idea they came up with is rotating Q
09:14 by a factor of m·θ, where θ is a constant;
09:20 it depends on the index of the hidden dimension.
09:24 Since we only have d = 2, θ will
09:28 be 1. If you had a hidden dimensionality
09:31 greater than 2, you'd have several thetas,
09:34 causing each pair of dimensions to
09:37 rotate at different speeds. We'll chat
09:40 about the d > 2 case later. That's
09:43 what all these equations on the left do:
09:46 rotating Q by m·θ.
09:49 the uppercase R denotes the standard rotation matrix.
09:53 When you multiply a matrix or a vector
09:56 by this, you're simply rotating that
09:58 matrix or vector by some angle. Here's
10:01 what you get for different positions of
10:03 Q when you use this equation.
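Here's a minimal sketch of that d = 2 case (my own code, not from the paper or the video): the query is multiplied by the standard 2D rotation matrix with angle m·θ, so each step in position rotates it by the same amount while its length stays fixed.

```python
# Sketch of RoPE for d = 2: rotate the query q by m * theta, where m is the
# token position. With d = 2 the video uses theta = 1.
import numpy as np

def rotate(q, angle):
    """Multiply q by the standard 2-D rotation matrix R(angle)."""
    R = np.array([[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle),  np.cos(angle)]])
    return R @ q

q = np.array([1.0, 0.0])
theta = 1.0
for m in range(4):
    print(f"position {m}: {rotate(q, m * theta).round(3)}")
```

Because rotation preserves length, the radial (token) part of q is untouched; only its angle carries the position.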
10:06 you can already see how well-behaved Q
10:09 is with this new embedding: Q rotates
10:12 counterclockwise by some θ
10:14 every time you shift the position to the
10:17 right. But one question still remains:
10:20 how does it scale to d greater than two,
10:24 since we have an R that can only rotate
10:27 a vector on a 2D plane?
10:30 for this they use a trick:
10:32 they break Q or K into blocks of two.
10:36 So for d = 4 we have two blocks,
10:39 and more generally, for a d-dimensional
10:43 vector you'd have d divided by two
10:46 blocks. Then we can repeat the same
10:49 operations we did for d = 2 for
10:52 each block independently.
10:54 Specifically, we convert R to a
10:57 block-diagonal rotation matrix.
11:00 Each block is associated with one block
11:03 of Q or K. Each rotation block will have
11:06 its own θ constant, and θ depends
11:10 on the index along the hidden dimension.
11:12 Multiplying a d-dimensional vector by a
11:16 block-diagonal R means each block in Q
11:19 or K will be rotated by the angle unique to it.
11:25 to see how this works, we will break our
11:27 four-dimensional vector into two 2D vectors
11:30 and see how the rotation changes.
11:33 You can see that the new vector, in green,
11:37 moves ever so slightly every time the
11:40 yellow vector makes a big jump.
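Here's a sketch of that d > 2 trick (my own minimal version): the vector is split into d/2 two-dimensional blocks, and block i is rotated by m·θ_i. The θ_i schedule below, θ_i = 10000^(−2(i−1)/d), is the one used in the RoPE paper; with d = 4 the first block rotates quickly (θ_1 = 1) and the second much more slowly, which matches the green-versus-yellow behavior just described.

```python
# Sketch of RoPE for d > 2: rotate each 2-D block of the vector by m * theta_i,
# where theta_i decreases for later blocks, so each block spins at its own speed.
import numpy as np

def rope_rotate(v, m):
    d = v.shape[0]                       # assume d is even
    out = np.empty_like(v)
    for i in range(d // 2):              # one 2-D block per index i
        theta_i = 10000 ** (-2.0 * i / d)
        angle = m * theta_i
        c, s = np.cos(angle), np.sin(angle)
        x, y = v[2 * i], v[2 * i + 1]
        out[2 * i]     = c * x - s * y   # same as multiplying the block by R(angle)
        out[2 * i + 1] = s * x + c * y
    return out

q = np.array([1.0, 1.0, 1.0, 1.0])       # d = 4, so two blocks
for m in range(3):
    print(f"position {m}: {rope_rotate(q, m).round(3)}")
```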
11:43 now you have the full idea:
11:45 each block of Q or K will rotate at a different speed,
11:50 but they rotate in a consistent,
11:52 predictable way. Taken in combination, this
11:57 gives the positionally embedded version
11:58 of Q or K. As you can see in this graph,
12:02 RoPE embeddings are much more resilient
12:05 to predictions beyond the training length than
12:07 Transformers with the sinusoidal embeddings.
12:11 They also generally perform better on
12:13 log-likelihood loss.
12:15 there's also a paper that explains how
12:18 to increase the context window length of
12:21 an already trained model using
12:23 interpolation techniques, but that's a
12:26 story for another day. Let's also look at
12:29 how the equations change when going from
12:32 sinusoidal positional embeddings to RoPE
12:35 embeddings. First, we're going to ditch the
12:38 positional embeddings, which leaves us
12:41 with (W_q x_m)^T · (W_k x_n).
12:46 then we're going to introduce an R_Θ,
12:49 one for Q and one for K. Then we can rearrange
12:53 this so that the R_Θ matrices are
12:56 close together, and finally we arrive at
12:59 the equation that is in the paper, shown below.
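In symbols, with R_{Θ,m} denoting the block-diagonal rotation matrix for position m (a compact restatement of the paper's result, in my own notation):

```latex
q_m^\top k_n
  = \left( R_{\Theta,m}\, W_q\, x_m \right)^\top \left( R_{\Theta,n}\, W_k\, x_n \right)
  = x_m^\top\, W_q^\top\, R_{\Theta,\,n-m}\, W_k\, x_n
```

Since R_{Θ,m}^⊤ R_{Θ,n} = R_{Θ,n−m}, the attention score ends up depending only on the relative position n − m.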
13:02 This brings us to the end of this video.
13:04 Though the RoPE paper is full of
13:06 mathematics, conceptually this is all
13:09 that happens: rotating Q and K by some
13:12 angle which is dependent on their
13:14 position in the sequence. As we saw, the
13:18 erratic nature in which sinusoidal
13:20 positional embeddings move results in
13:23 models that aren't able to generalize
13:25 beyond the training sequence length. Herein
13:27 lies the motivation for RoPE: because RoPE
13:31 behaves in a predictable manner as the
13:33 position changes, it's able to adapt much
13:36 better to sequence lengths beyond the
13:39 training length. There's another
13:41 technique called ALiBi, which claims to
13:43 be even superior to RoPE, but
13:46 right now RoPE is the most commonly used
13:50 positional embedding for LLMs. If you
13:53 enjoyed this video and would like to see more
13:55 videos about machine learning and large
13:58 language models, subscribe to my channel,
14:01 Deep Learning Hero. I will see you soon.