00:00 the neat thing about working in machine
00:01 learning is that every few years
00:03 somebody invents something crazy that
00:04 makes you totally reconsider what's possible
00:07 like models that can play go or generate
00:10 hyper-realistic faces
00:12 and today the mind-blowing discovery
00:13 that's rocking everyone's world is a
00:15 type of neural network called a transformer
00:17 transformers are models that can
00:18 translate text write poems and op-eds
00:21 and even generate computer code these
00:23 can be used in biology to solve the
00:24 protein folding problem
00:26 transformers are like this magical
00:28 machine learning hammer that seems to
00:29 make every problem into a nail if you've
00:31 heard of the trendy new ml models bert
00:33 or gpt-3 or t5 all of these models are
00:37 based on transformers
00:38 so if you want to stay hip in machine
00:40 learning and especially natural language
00:42 processing you have to know about the
00:44 transformer so in this video i'm going
00:46 to tell you about what transformers are
00:47 how they work and why they've been so
00:49 impactful let's get to it so what is a
00:52 transformer it's a type of neural
00:54 network architecture to recap neural
00:56 networks are a very effective type of
00:58 model for analyzing complicated data
01:00 types like images videos audio and text
01:02 but there are different types of neural
01:04 networks optimized for different types
01:05 of data like if you're analyzing images
01:07 you typically use a convolutional neural
01:09 network which is designed to vaguely
01:11 mimic the way that the human brain
01:13 processes vision and since around 2012
01:16 neural networks have been really good at
01:17 solving vision tasks like identifying
01:19 objects in photos but for a long time
01:21 we didn't have anything comparably good
01:23 for analyzing language whether for
01:25 translation or text summarization or anything else
01:28 and this is a problem because language
01:30 is the primary way that humans
01:31 communicate you see until transformers
01:33 came around the way we use deep learning
01:35 to understand text was with a type of
01:36 model called a recurrent neural network
01:39 or an rnn that looks something like this
01:42 let's say you wanted to translate a
01:44 sentence from english to french
01:46 an rnn would take as input an english
01:48 sentence and process the words one at a
01:50 time and then sequentially spit out
01:52 their french counterparts the key word
01:54 here is sequential in language the order
01:56 of words matters and you can't just
01:58 shuffle them around
01:59 for example the sentence jane went
02:02 looking for trouble means something very
02:04 different from the sentence trouble went looking for jane
02:07 so any model that's going to deal with
02:08 language has to capture word order and
02:10 recurrent neural networks do this by
02:12 looking at one word at a time
02:13 sequentially but rnns had a lot of problems
02:16 first they never really did well at
02:17 handling large sequences of text like
02:20 long paragraphs or essays by the time
02:22 they were analyzing the end of a
02:23 paragraph they'd forget what happened in the beginning
02:25 and even worse rnns were pretty hard to
02:28 train because they processed words
02:30 sequentially they couldn't parallelize well
02:32 which means that you couldn't just speed
02:33 them up by throwing lots of gpus at them
02:36 and when you have a model that's slow to
02:37 train you can't train it on all that
02:38 much data this is where the transformer
02:40 changed everything they were a model
02:42 developed in 2017 by researchers at
02:44 google and the university of toronto and
02:46 they were initially designed to do
02:47 translation but unlike recurrent neural
02:50 networks you could really efficiently
02:51 parallelize transformers and that meant
02:53 that with the right hardware you could
02:54 train some really big models
02:59 remember gpt-3 that model that writes
03:00 poetry and code and has conversations
03:03 that was trained on almost 45 terabytes
03:05 of text data including almost the entire public web
03:10 so if you remember anything about
03:11 transformers let it be this combine a
03:13 model that scales really well with a
03:15 huge data set and the results will
03:17 probably blow your mind so how do these
03:19 things actually work from the diagram in
03:21 the paper it should be pretty clear
03:24 or maybe not actually it's simpler than
03:26 you might think there are three main
03:28 innovations that make this model work so well
03:30 positional encodings attention and
03:32 specifically a type of attention called self-attention
03:36 let's start by talking about the first
03:37 one positional encodings
03:39 let's say we're trying to translate text
03:40 from english to french positional
03:42 encoding is the idea that instead of
03:44 looking at words sequentially you take
03:46 each word in your sentence and before
03:47 you feed it into the neural network you
03:49 slap a number on it one two three
03:51 depending on what number the word is in
03:53 the sentence in other words you store
03:55 information about word order in the data
03:57 itself rather than in the structure of
03:58 the network then as you train the
04:00 network on lots of text data it learns
04:02 how to interpret those positional encodings
04:05 in this way the neural network learns
04:07 the importance of word order from the data
04:10 this is a high-level way to understand
04:12 positional encodings but it's an
04:13 innovation that really helped make
04:15 transformers easier to train than rnns
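here's a minimal python sketch of the idea, using numpy. the "slap a number on it" framing above is the intuition; the actual scheme in the attention is all you need paper encodes each position as a pattern of sines and cosines, which is what's sketched below (sizes are toy values chosen for illustration):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # one vector per position, added to the word embeddings so that
    # "word 1" and "word 5" look different to the network
    positions = np.arange(seq_len)[:, np.newaxis]        # shape (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]             # shape (1, d_model)
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions: cosine
    return encoding

# toy sizes: a 4-word sentence with 8-dimensional embeddings
embeddings = np.random.randn(4, 8)
model_input = embeddings + positional_encoding(4, 8)  # word order now lives in the data itself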
04:18 the next innovation in this paper is a
04:20 concept called attention which you'll
04:21 see used everywhere in machine learning
04:23 these days in fact the title of the
04:25 original transformer paper is attention is all you need
04:29 the agreement on the european economic
04:31 area was signed in august 1992. did you get that
04:35 that's the example sentence given in the
04:36 original paper and remember the original
04:39 transformer was designed for translation
04:41 now imagine trying to translate that
04:43 sentence to french one bad way to
04:45 translate text is to try to translate
04:46 each word one for one but in french some
04:49 words are flipped like in the french
04:51 translation european comes after economic
04:54 plus french is a language that has
04:55 gendered agreement between words so the
04:57 word européenne needs to be in the
05:00 feminine form to match with la zone
05:02 the attention mechanism is a neural
05:03 network structure that allows a text
05:05 model to look at every single word in
05:07 the original sentence when making a
05:09 decision about how to translate a word
05:10 in the output sentence in fact here's a
05:13 nice visualization from that paper that
05:14 shows what words in the input sentence
05:16 the model is attending to when it makes
05:18 predictions about a word for the output sentence
05:21 so when the model outputs the word
05:24 european it's looking at the input words
05:26 european and economic you can think of
05:29 this diagram as a sort of heat map for the model's attention
05:32 and how does the model know which words
05:33 it should be attending to
05:35 it's something that's learned over time
05:38 by seeing thousands of examples of
05:39 french and english sentence pairs the
05:41 model learns about gender and word order
05:43 and plurality and all of that
05:44 grammatical stuff so we talked about two
05:47 key transformer innovations positional
05:49 encoding and attention
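as a concrete picture of what attention computes, here's a minimal numpy sketch of scaled dot-product attention, the formulation from the attention is all you need paper: every output position scores every input position, the scores are softmaxed into weights, and the output is a weighted average of the inputs. names and sizes are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    # scores[i, j]: how relevant input position j is to output position i
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    # softmax each row of scores into weights that sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights  # weights is the "heat map" in the paper's figure

d = 8                             # embedding size (toy)
queries = np.random.randn(3, d)   # 3 output (e.g. french) positions
keys = np.random.randn(5, d)      # 5 input (e.g. english) words
values = np.random.randn(5, d)
output, heat_map = scaled_dot_product_attention(queries, keys, values)
print(heat_map.shape)  # (3, 5): each row shows where one output word "looks"
```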
05:51 but actually attention had been invented
05:52 before this paper the real innovation in
05:55 transformers was something called
05:56 self-attention a twist on traditional attention
06:00 the type of attention we just talked
06:01 about had to do with aligning words in
06:03 english and french which is really
06:05 important for translation but what if
06:06 you're just trying to understand the
06:08 underlying meaning in language so that
06:10 you can build a network that can do any
06:12 number of language tasks
06:14 what's incredible about neural networks
06:16 like transformers is that as they
06:18 analyze tons of text data they begin to
06:20 build up this internal representation or
06:22 understanding of language automatically
06:25 they may learn for example that the
06:27 words programmer and software engineer
06:29 and software developer are all
06:30 synonymous and they might also naturally
06:33 learn the rules of grammar and gender
06:34 and tense and so on the better this
06:37 internal representation of language the
06:38 neural network learns the better it will
06:40 be at any language task
06:42 and it turns out that attention can be a
06:44 very effective way to get a neural
06:45 network to understand language if it's
06:47 applied to the input text itself
06:50 let me give you an example take these
06:52 two sentences server can i have the check
06:55 versus looks like i just crashed the
06:57 server the word server here means two
06:59 very different things and i know that
07:01 because i'm looking at the context of
07:03 the surrounding words
07:04 self-attention allows a neural network
07:06 to understand a word in the context of
07:08 the words around it so when a model
07:10 processes the word server in the first
07:12 sentence it might be attending to the
07:14 word check which helps it disambiguate
07:16 a human server from a metal one
07:19 in the second sentence the model might
07:21 be attending to the word crashed to
07:22 determine that this server is a machine
07:24 self-attention can also help neural
07:26 networks disambiguate words recognize
07:28 parts of speech and even identify word tense
07:31 this in a nutshell is the value of self-attention
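mechanically, self-attention is the same computation as the attention sketch above, just pointed back at the input: queries, keys, and values are all derived from the same sentence through learned projection matrices (random here, purely for illustration):

```python
import numpy as np
# reuses scaled_dot_product_attention and d from the earlier sketch

x = np.random.randn(7, d)  # embeddings for a 7-word sentence
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))  # learned in a real model
out, weights = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
# weights[i, j]: how much word i attends to word j in the same sentence --
# after training, the row for "server" could weight "crashed" heavily
print(weights.shape)  # (7, 7)
```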
07:34 so to summarize transformers boil down
07:36 to positional encodings attention and self-attention
07:41 of course this is a 10,000-foot look
07:43 transformers but how are they actually
07:45 useful one of the most popular
07:46 transformer based models is called bert
07:49 which was invented just around the time
07:50 that i joined google in 2018.
07:52 bert was trained on a massive text
07:54 corpus and has become this sort of
07:56 general pocket knife for nlp that can be
07:58 adapted to a bunch of different tasks
08:01 like text summarization question
08:02 answering classification and finding similar sentences
08:06 it's used in google search to help
08:08 understand search queries and it powers
08:10 a lot of google cloud's nlp tools like
08:12 google cloud automl natural language
08:14 bert also proved that you could build
08:16 very good models on unlabeled data like
08:18 text scraped from wikipedia or reddit
08:21 this is called semi-supervised learning
08:23 and it's a big trend in machine learning
08:27 so if i've sold you on how cool
08:28 transformers are you might want to start
08:30 using them in your app no problem
08:32 tensorflow hub is a great place to grab
08:34 pre-trained transformer models like
08:36 bert you can download them for free in
08:38 multiple languages and drop them straight into your app
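as a sketch of what that looks like (the two tfhub.dev handles below are representative bert models; browse tfhub.dev for current names and versions):

```python
import tensorflow as tf
import tensorflow_hub as hub

# illustrative handles -- check tfhub.dev for the variant and version you want
preprocess = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

sentences = tf.constant(["looks like i just crashed the server"])
outputs = encoder(preprocess(sentences))
print(outputs["pooled_output"].shape)  # (1, 768): one embedding per sentence
```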
08:41 you can also check out the popular
08:42 transformers python library built by the
08:45 company hugging face that's one of the
08:46 community's favorite ways to train and
08:48 use transformer models for more
08:50 transformer tips check out my blog post
08:52 linked below and thanks for watching
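as a parting example, a couple of lines with hugging face's pipeline api gets you a working transformer. this is a sketch: the task names are the library's own, but the default model behind each task changes between library versions:

```python
from transformers import pipeline

# downloads a pretrained transformer model the first time it runs
classifier = pipeline("sentiment-analysis")
print(classifier("transformers are like a magical machine learning hammer"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

translator = pipeline("translation_en_to_fr")
print(translator("the agreement on the european economic area was signed in august 1992"))
```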