Rotary Positional Embeddings: Combining Absolute and Relative

Efficient NLP · 2023-08-08
Tags: positional encoding, relative positional encodings
💫 Short Summary

Rotary positional embeddings (RoPE) are an architectural improvement for Transformer models that combine the benefits of absolute and relative positional embeddings: each word vector is rotated by an angle proportional to its position in the sentence. Because the dot product used in self-attention is unchanged when both vectors are rotated by the same angle, attention scores end up depending only on the relative distance between tokens, while each token still encodes its absolute position. The method has been quickly adopted into many language models, has been shown to speed up training, and these results have been replicated across different model architectures and training setups.

✨ Highlights
A new architectural improvement called RoPE (Rotary Positional Embeddings) was proposed in 2021 and has been quickly adopted into many language models.
00:00
The Transformer architecture, introduced in the 'Attention is All You Need' paper in 2017, had seen relatively few widely adopted architectural changes before the proposal of RoPE.
RoPE combines the best of absolute and relative positional embeddings.
Rotation Matrix in Self-Attention
08:00
The rotation matrix rotates a vector by an angle of m·θ, where m is the absolute position of the token in the sentence.
The rotation is applied to the query and key vectors before the attention dot product, so the score between two tokens depends only on their relative positions.
For vectors with more than two dimensions, the vector is split into two-dimensional pairs and each pair is rotated independently.
A different rotation angle (frequency) is applied to each pair of dimensions in the vector.
The rotation matrix multiplication can be computed more cheaply using two elementwise vector multiplications and one vector addition.
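The cheaper elementwise form can be sketched as follows. This is a minimal illustration, not the original implementation; the function name `rope_fast` is mine, and it assumes the standard base-10000 frequency schedule from the RoPE paper:

```python
import math

def rope_fast(x, m, base=10000.0):
    """RoPE via elementwise ops: x * cos + rotate_half(x) * sin.

    Equivalent to multiplying by the block-diagonal rotation matrix,
    but needs only two elementwise products and one addition.
    Assumes an even-length vector x and position index m.
    """
    d = len(x)
    # One rotation angle m * theta_i per pair of dimensions; both
    # elements of pair i share the same angle.
    angles = [m * base ** (-(2 * (i // 2)) / d) for i in range(d)]
    cos = [math.cos(a) for a in angles]
    sin = [math.sin(a) for a in angles]
    # rotate_half: (x0, x1, x2, x3, ...) -> (-x1, x0, -x3, x2, ...)
    half = []
    for i in range(0, d, 2):
        half.extend([-x[i + 1], x[i]])
    return [xi * c + hi * s for xi, c, hi, s in zip(x, cos, sin, half)]
```

Because each pair is rotated rigidly, the transform leaves vector norms unchanged, and rotating a query and key by the same position index leaves their dot product unchanged.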
Effect of Word Proximity on Dot Product
09:00
Words that are close together are more likely to have a larger dot product, while words that are far apart are expected to have a smaller dot product on average.
This is due to the way the rotation is defined, and it intuitively makes sense because words that are far apart are less likely to have anything to do with each other.
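This decay can be checked numerically. The sketch below (my own illustration, assuming the base-10000 frequency schedule) computes the attention-style score between two identical all-ones vectors placed a given offset apart; for matched content the score reduces to a sum of cosines of the per-pair rotation angles:

```python
import math

def rope_score(d, offset, base=10000.0):
    """Dot product between two identical all-ones d-dim vectors placed
    `offset` positions apart, after rotary encoding.

    For matched query/key content the score is sum_i cos(offset * theta_i),
    maximal at offset 0 and decaying (with oscillation) as the offset grows.
    """
    return sum(math.cos(offset * base ** (-2 * i / d)) for i in range(d // 2))

# Score shrinks, on average, as tokens move apart
scores = [rope_score(64, off) for off in (0, 1, 4, 16, 64)]
```

The decay is not strictly monotonic because the cosines oscillate, but on average distant tokens contribute smaller scores, matching the intuition above.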
Experiments with Rotary Positional Embeddings
10:00
Models such as BERT and Performer, trained on language modeling tasks using rotary positional embeddings, trained faster than the same models using sinusoidal embeddings.
Other researchers have replicated these findings across different model architectures and training setups.
💫 FAQs about This YouTube Video

1. What are positional embeddings in Transformer models?

Positional embeddings are added to the input of Transformer models to preserve the order of tokens in a sequence, as the Transformer is inherently invariant to token order. There are absolute positional embeddings, which represent each position in the sequence with a unique vector, and relative positional embeddings, which capture the relationship between tokens in a sequence.
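For concreteness, here is a minimal sketch of the absolute sinusoidal positional embedding from 'Attention is All You Need': each position gets a unique vector of sines and cosines that is added to the token embedding. The function name is mine and the base-10000 schedule is assumed from the paper:

```python
import math

def sinusoidal_embedding(pos, d, base=10000.0):
    """Absolute positional embedding: a unique d-dimensional vector per
    position, built from sin/cos at geometrically spaced frequencies.
    It is added to the token embedding at the model input."""
    emb = []
    for i in range(0, d, 2):
        freq = base ** (-i / d)
        emb.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return emb
```

Because every position maps to a distinct vector, the model can recover token order even though self-attention itself ignores it.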

2. How do rotary positional embeddings combine the advantages of absolute and relative positional embeddings?

Rotary positional embeddings apply a rotation to the word vectors based on their positions in the sequence, allowing the embeddings to capture both the absolute position of a word and its relative position to other words. This approach combines the benefits of absolute positional embeddings, which preserve the order of tokens, and relative positional embeddings, which capture token relationships.
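The "absolute in, relative out" property can be verified numerically. In this sketch (my own illustration, with an assumed base-10000 frequency schedule), the query is rotated by its absolute position m and the key by its absolute position n, yet the resulting dot product matches the one obtained from the offset m − n alone:

```python
import math

def rope(x, m, base=10000.0):
    """Rotate each 2-D pair of x by m * theta_i, theta_i = base^(-i/d)."""
    d, out = len(x), []
    for i in range(0, d, 2):
        a = m * base ** (-i / d)
        c, s = math.cos(a), math.sin(a)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.3, -1.2, 0.8, 2.0]
k = [1.0, 0.5, -0.7, 0.4]

# Absolute positions 10 and 7 enter individually...
s_abs = dot(rope(q, 10), rope(k, 7))
# ...but the score depends only on the relative offset 10 - 7 = 3.
s_rel = dot(rope(q, 3), rope(k, 0))
```

Shifting both tokens by the same amount (say, positions 110 and 107) leaves the score unchanged, which is exactly the relative-position behavior described above.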

3. What is the advantage of using rotary positional embeddings in Transformer models?

The advantage of using rotary positional embeddings in Transformer models is that they can capture both the absolute and relative positions of tokens in a sequence, providing a more comprehensive representation of token order and relationships. This can lead to improved performance in tasks that require understanding of token positions and dependencies.

4. How are rotary positional embeddings implemented in Transformer models?

Rotary positional embeddings are implemented in Transformer models by applying a rotation to the word vectors based on their positions in the sequence. This rotation allows the embeddings to capture both the absolute position of a word and its relative position to other words, combining the benefits of absolute and relative positional embeddings.
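In matrix form, the implementation amounts to multiplying each query and key vector by a block-diagonal rotation matrix for its position. This is a sketch for illustration only (helper names are mine; the base-10000 schedule is assumed); real implementations use the cheaper elementwise form, but the matrix makes the structure explicit:

```python
import math

def rope_matrix(m, d, base=10000.0):
    """Build the d x d block-diagonal RoPE rotation matrix for position m.

    Each 2 x 2 block on the diagonal rotates one pair of dimensions by
    m * theta_i; all entries outside the blocks are zero.
    """
    R = [[0.0] * d for _ in range(d)]
    for i in range(0, d, 2):
        a = m * base ** (-i / d)
        c, s = math.cos(a), math.sin(a)
        R[i][i], R[i][i + 1] = c, -s
        R[i + 1][i], R[i + 1][i + 1] = s, c
    return R

def apply(R, x):
    """Plain matrix-vector product R @ x."""
    return [sum(R[r][c] * x[c] for c in range(len(x))) for r in range(len(x))]
```

Because the matrix is orthogonal (a pure rotation), applying it preserves vector norms, and only the attention scores, not the magnitudes, are affected by position.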

5. What are the key features of rotary positional embeddings in Transformer models?

The key features of rotary positional embeddings are: positions are encoded by rotating the query and key vectors rather than by adding a separate position vector; attention scores depend only on the relative distance between tokens while each token still encodes its absolute position; and dot products decay, on average, as tokens get farther apart. Together these give a comprehensive representation of token order and relationships, which can improve performance on tasks that require understanding token positions and dependencies.

6. What are rotary positional embeddings and how do they work?

Rotary positional embeddings encode a token's position by rotating its query and key vectors by an angle proportional to that position. Because the attention dot product is unchanged when both vectors are rotated by the same angle, the resulting attention scores depend only on the relative positions of tokens, which helps the model capture the relationships between words.

7. How are rotary positional embeddings helpful in language modeling?

Experiments have shown that using rotary positional embeddings in language modeling tasks can lead to faster training compared to other methods such as sinusoidal embeddings. Researchers have found this to be relatively robust across different types of model architectures and training setups.

8. What mathematical property is associated with rotary positional embeddings?

When using rotary positional embeddings, words that are close together are more likely to have a larger dot product, while words that are far apart are expected to have a smaller dot product on average. This is due to the way the rotation is defined, and it makes intuitive sense in capturing the relationships between words based on their positions.

9. What are the key findings regarding the use of rotary positional embeddings in language modeling?

The key finding is that using rotary positional embeddings can lead to faster training in language modeling tasks. This has been replicated by other researchers and is found to be relatively robust across different types of model architectures and training setups.