00:00 Transformer models rely on positional

00:03 embeddings to help them uphold the

00:06 sequential nature of the data when they process it

00:11 the original Transformer model

00:13 introduced sinusoidal positional

00:15 embeddings, where each hidden dimension

00:18 is modeled via a sinusoidal curve

00:22 since then, different solutions have come along

00:26 in this modern era of LLMs, most models

00:31 have found one type of embedding to be

00:33 performant and generalizable: that is

00:37 RoPE embeddings. RoPE stands for rotary

00:41 positional embeddings and as the name

00:44 suggests its primary functionality is

00:47 rotating query and key vectors based on

00:51 their position in the sequence

00:54 as we go through the video we will see

00:56 why this is important and things will fall into place

01:01 so, with no pun intended, let's get the ball rolling

01:06 the flagship computation of a Transformer

01:09 model is the self-attention layer

01:12 computation which revolves around three

01:15 ingredients queries keys and values out

01:20 of these queries and keys are used to

01:23 compute the attention matrix. if you'd

01:26 like to learn about the end-to-end

01:28 computation in Transformer models check

01:31 out my video on the Transformer model

01:35 each position has a query and a key in

01:39 our case let's assume a dimensionality of 2

01:43 so each query or key is a 2D vector. when

01:48 computing the attention matrix we want

01:51 to encode two characteristics

01:54 tokens that have similar token

01:56 embeddings should have a higher score

01:59 which is a no-brainer for example if the

02:02 two tokens are cat and dog it makes

02:06 sense to give them a higher score as

02:08 they appear in very similar contexts

02:12 next the score should be higher for

02:15 words that are close together

02:17 again quite sensible

02:20 the further apart two words are, the less

02:23 likely they are to be related. you can

02:25 see the effect of this positional behavior

02:28 from the bottom attention matrix where

02:31 the intensity is highest along the

02:33 diagonal the diagonal is where the query

02:36 and the key share the same position and

02:40 the score declines as you go in the

02:42 orthogonal direction to the diagonal

02:44 each entry in this attention matrix is

02:48 the dot product of q_m and k_n
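To make this concrete, here's a minimal sketch in Python with NumPy (an illustration added for this write-up, not code from the video) showing that each entry of the attention matrix is just the dot product of a query at position m and a key at position n:

import numpy as np

# toy 2D queries and keys for a sequence of 4 positions (d = 2)
Q = np.array([[1.0, 0.5],
              [0.3, 0.9],
              [0.8, 0.2],
              [0.1, 0.7]])   # shape (seq_len, d): one query per position
K = np.array([[0.9, 0.4],
              [0.2, 1.0],
              [0.7, 0.3],
              [0.0, 0.8]])   # shape (seq_len, d): one key per position

# attention matrix: entry (m, n) is the dot product of q_m and k_n
scores = Q @ K.T
print(scores.shape)        # (4, 4)
print(scores[1, 2])        # identical to np.dot(Q[1], K[2])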

02:52 since we are working with 2D vectors we

02:55 can visualize the query and the key on a 2D plane

03:00 let's then switch to the polar

03:02 coordinates now here's one way to think

03:05 about the relationships between the Q

03:08 and K vectors and the two versions of

03:11 the attention matrix that need to be

03:13 captured. the magnitude or the radial

03:16 component of the vector contributes to

03:19 the token similarity matrix. that is,

03:22 the similarity between radial components,

03:25 denoted by ||q|| · ||k||,

03:29 corresponds to the token embedding

03:31 similarity at position (m, n). the

03:35 angles of q and k, theta_m and theta_n,

03:39 contribute to the positional similarity

03:42 at position (m, n). here's something

03:45 important about this design

03:45 important about this design

03:48 the angle component only depends on the

03:51 position of the vector in the sequence

03:54 and is independent of the actual token

03:58 on the other hand the token similarity

04:01 is captured by the radial component of

04:03 the vector and is independent of the position

04:08 remember this, as this school of thought

04:11 forms the intuition behind RoPE embeddings

04:16 next we will discuss how we can design

04:19 functions to capture these properties

04:21 during the attention computation

04:24 we are not so much worried about the

04:26 token similarity as we are about the

04:30 positional similarity

04:31 so for the rest of this talk we will

04:34 assume the same token everywhere

04:37 so the token similarity attention matrix

04:40 has the same value everywhere now the

04:43 only differentiator is how we design the

04:46 positional embeddings

04:48 before we dive headfirst into the RoPE

04:51 computation, let's motivate ourselves by

04:54 learning what's wrong with the sinusoidal

04:57 embeddings. here's what the sinusoidal

04:59 embeddings look like for 2D query key

05:02 vectors. all we do is push these dots

05:06 along the curves and take measurements

05:10 so at position 0 it's (0, 1) and at

05:14 position 1 it's (0.88, 0.61) and so on
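As a rough sketch of that measurement process (my own illustration in Python/NumPy; the exact values depend on the frequency convention used, so they may differ slightly from the numbers quoted above):

import numpy as np

def sinusoidal_embedding(m, d=2, base=10000.0):
    """Sinusoidal positional embedding for position m (d must be even)."""
    pe = np.zeros(d)
    for i in range(0, d, 2):
        freq = 1.0 / base ** (i / d)   # frequency for this pair of dimensions
        pe[i] = np.sin(m * freq)       # even index: sine curve
        pe[i + 1] = np.cos(m * freq)   # odd index: cosine curve
    return pe

print(sinusoidal_embedding(0))   # [0. 1.]
print(sinusoidal_embedding(1))   # roughly [0.84 0.54] under this convention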

05:22 if you compute the actual values at m

05:25 equals 1 2 3 and so on you will get just

05:30 slightly different results due to

05:33 precision issues in our computation. this

05:36 looks like a sensible thing for a

05:37 positional embedding the values seem to

05:40 gracefully go up and down based on their position

05:44 so what's the issue once the position

05:47 embeddings are computed they are added

05:50 to X so let's see how this affects an

05:53 example vector. for simplicity let's pick

05:57 (1, 1). ah, here's what's wrong:

06:01 conceptually it looks good, not so much

06:04 in practice. the vector seems to move a

06:07 bit chaotically as the position changes

06:11 it's difficult to see a pattern in the

06:14 way it moves. both the magnitude and the

06:17 angle change significantly from each position to the next

06:21 also remember this is only two dimensions
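Here's a small self-contained sketch (again my own, for the d = 2, theta = 1 case) that adds the sinusoidal embedding to the token vector (1, 1) and prints the magnitude and angle at each position, so you can see how irregularly both quantities move:

import numpy as np

x = np.array([1.0, 1.0])                    # token embedding (same at every position)

for m in range(6):
    pe = np.array([np.sin(m), np.cos(m)])   # 2D sinusoidal positional embedding
    v = x + pe                              # sinusoidal embeddings are added to x
    magnitude = np.linalg.norm(v)
    angle = np.degrees(np.arctan2(v[1], v[0]))
    print(f"m={m}: |v|={magnitude:.2f}, angle={angle:.1f} deg")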

06:25 standard Transformer models have

06:27 hundreds of them and things are probably

06:29 getting unimaginably complex when you

06:32 scale up to those numbers. here's an

06:35 illustrative view of what might be

06:37 happening in this model

06:39 first you give the model some data

06:42 the token 'dog' appears at position 4, 'bark' at

06:47 six, and 'purr' at 20. then the model asks

06:51 what are these positions

06:54 then you tell the model that you designed

06:56 specialized sinusoids to provide a

06:59 vector at each position

07:01 but all that the model sees is a table

07:07 as we saw it's hard to see the pattern

07:10 through those numbers the model does the

07:12 best it can given the situation it's

07:15 going to memorize these positions instead of

07:18 learning the pattern. here's an analogy

07:20 imagine you have an exam coming up and

07:23 you haven't studied for it

07:24 so the night before the exam you hit the

07:27 books and try to study the material but

07:30 the concepts are tough to learn and you're

07:33 running out of time so what do you do

07:35 you try to memorize the answers instead

07:39 the Transformer model opts for

07:42 something similar but unlike you the

07:45 Transformer model has its reasons to

07:48 memorize answers first reason these

07:51 models have billions of parameters and

07:53 as you know the more parameters the more

07:57 overfitting or memorization takes place

08:00 second reason when being trained on

08:03 trillions of tokens it probably sees

08:06 most of the words at all possible

08:09 positions in the sequence so there's

08:11 hardly any reason for the model to go

08:13 above and beyond to learn patterns

08:16 continuing on our conversation you train

08:19 the model this way using 20 positions

08:22 and then at the inference time you ask

08:25 about position 21 this will confuse the

08:29 model so much because it has no notion

08:31 of a position 21. the model asks what

08:37 position 21 is and outputs something erratic

08:39 research has already confirmed this

08:41 behavior: the perplexity of models

08:44 trained using sinusoidal embeddings exploded past

08:48 the training length. now let's learn how

08:51 RoPE fixes this. RoPE is built on a simple

08:55 concept: instead of trying to mash the

08:58 token and position embeddings into a

09:00 paste by adding the two together

09:03 and computing Q, let's find a

09:06 transformation for Q and its position M

09:09 that gives a new vector

09:11 the idea they came up with is rotating Q

09:14 by a factor of m·theta, where theta is a constant

09:20 that depends on the index of the hidden dimension

09:24 since we only have d = 2, theta will

09:28 be 1. if you had a hidden dimensionality

09:31 greater than 2 you'd have several thetas,

09:34 causing each pair of dimensions to

09:37 rotate at different speeds. we'll chat

09:40 about the d > 2 case later. that's

09:43 what all these equations on the left do:

09:46 rotating Q by m·theta

09:49 the uppercase R denotes the standard rotation matrix

09:53 when you multiply a matrix or a vector

09:56 by this you're simply rotating that

09:58 matrix or vector by some angle. here's

10:01 what you get for different positions of

10:03 Q when you use this equation
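In code, the d = 2 case is just a plain 2D rotation by m·theta (a minimal sketch of the idea, with theta = 1 as noted above, not the paper's reference implementation):

import numpy as np

def rotate_2d(q, m, theta=1.0):
    """Rotate a 2D query/key vector q by the angle m * theta."""
    angle = m * theta
    R = np.array([[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle),  np.cos(angle)]])   # standard rotation matrix
    return R @ q

q = np.array([1.0, 0.0])
for m in range(4):
    print(m, rotate_2d(q, m))   # q sweeps counterclockwise by theta per position step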

10:06 you can already see how well-behaved Q

10:09 is with this new embedding. Q rotates

10:12 around counterclockwise by some theta

10:14 every time you shift the position to the

10:17 right. but one question still remains:

10:20 how does it scale to d greater than two,

10:24 since we have an R that can only rotate

10:27 a vector on a 2D plane?

10:30 for this they use a trick

10:32 they break Q or K into blocks of two

10:36 so for d = 4 we have two blocks

10:39 and more generally for a d-dimensional

10:43 vector you'd have d divided by two

10:46 blocks. then we can repeat the same

10:49 operations we did for d = 2 for

10:52 each block independently

10:54 specifically we convert R to a block-

10:57 diagonal rotation matrix

11:00 each block is associated with one block

11:03 of Q or K. each rotation block will have

11:06 its own theta constant, and theta depends

11:10 on the index along the hidden dimension

11:12 multiplying a d-dimensional vector by a

11:16 block-diagonal R means each block in Q

11:19 or K will be rotated by the angle unique to that block
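Here's a sketch of the block-wise version for general even d, assuming the usual theta_i = base^(-2i/d) schedule from the paper (again my own minimal implementation, not production code):

import numpy as np

def rope(x, m, base=10000.0):
    """Apply rotary positional embedding to a d-dimensional vector x at position m (d even)."""
    d = x.shape[-1]
    out = np.empty_like(x)
    for i in range(d // 2):
        theta = base ** (-2 * i / d)        # each 2D block rotates at its own speed
        angle = m * theta
        x1, x2 = x[2 * i], x[2 * i + 1]     # the i-th block of two dimensions
        out[2 * i] = x1 * np.cos(angle) - x2 * np.sin(angle)
        out[2 * i + 1] = x1 * np.sin(angle) + x2 * np.cos(angle)
    return out

q = np.array([1.0, 0.0, 1.0, 0.0])
print(rope(q, m=1))   # the first block rotates quickly, the second barely moves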

11:25 to see how this works we will break our

11:27 four-dimensional vector into two 2D vectors

11:30 and see how the rotation changes

11:33 you can see that the new vector in green

11:37 moves ever so slightly every time the

11:40 yellow vector makes a big jump

11:43 now you have the full idea

11:45 each block of Q or K will rotate at its own speed

11:50 but they rotate in a consistent,

11:52 predictable way. in combination, this

11:57 gives the positionally embedded version

11:58 of Q or K. as you can see in this graph,

12:02 RoPE embeddings are much more resilient

12:05 to predictions beyond the training length than

12:07 Transformers with sinusoidal embeddings

12:11 they also generally perform better on

12:13 log likelihood loss

12:15 there's also a paper that explains how

12:18 to increase the context window length of

12:21 an already trained model using

12:23 interpolation techniques but that's a

12:26 story for another day let's also look at

12:29 how the equations change from

12:32 sinusoidal positional embeddings to RoPE

12:35 embeddings. first we're gonna ditch the

12:38 positional embeddings, which leaves us

12:41 with (W_q x_m)^T (W_k x_n)

12:46 then we're going to introduce our R_Θ,

12:49 one for Q and one for K. then we can arrange

12:53 this in a way that the R_Θ matrices are

12:56 close together, and finally we arrive at

12:59 the equation that is in the paper.
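For reference, here is roughly what that rearrangement looks like written out (reconstructed from the definitions above; the notation in the paper itself may differ slightly):

q_m^T k_n = (R_{Θ,m} W_q x_m)^T (R_{Θ,n} W_k x_n)
          = x_m^T W_q^T R_{Θ,m}^T R_{Θ,n} W_k x_n
          = x_m^T W_q^T R_{Θ,n-m} W_k x_n

since composing the two rotations gives R_{Θ,m}^T R_{Θ,n} = R_{Θ,n-m}. The attention score ends up depending on positions only through the relative offset n - m, which is exactly the relative-position behavior RoPE is after.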

13:02 this brings us to the end of this video

13:04 though the RoPE paper is full of

13:06 mathematics, conceptually this is all

13:09 that happens: rotating Q and K by some

13:12 angle which is dependent on their

13:14 position in the sequence. as we saw, the

13:18 erratic nature in which sinusoidal

13:20 positional embeddings move produces

13:23 models that aren't able to generalize

13:25 beyond the training sequence length. herein

13:27 lies the motivation for RoPE. due to RoPE

13:31 behaving in a predictable manner as the

13:33 position changes it's able to adapt much

13:36 better to sequence lengths beyond

13:39 training length there's another

13:41 technique called ALiBi which claims to

13:43 be superior even to RoPE, but

13:46 right now RoPE is the most widely adopted

13:50 positional embedding for LLMs. if you

13:53 enjoyed this video and would like to see more

13:55 videos about machine learning and large

13:58 language models subscribe to my channel

14:01 deep learning hero I will see you soon