00:00 the neat thing about working in machine
00:01 learning is that every few years
00:03 somebody invents something crazy that
00:04 makes you totally reconsider what's possible
00:07 like models that can play go or generate
00:10 hyper-realistic faces
00:12 and today the mind-blowing discovery
00:13 that's rocking everyone's world is a
00:15 type of neural network called a transformer
00:17 transformers are models that can
00:18 translate text write poems and op-eds
00:21 and even generate computer code these
00:23 can be used in biology to solve the
00:24 protein folding problem
00:26 transformers are like this magical
00:28 machine learning hammer that seems to
00:29 make every problem into a nail if you've
00:31 heard of the trendy new ml models bert
00:33 or gpt-3 or t5 all of these models are
00:37 based on transformers
00:38 so if you want to stay hip in machine
00:40 learning and especially natural language
00:42 processing you have to know about the
00:44 transformer so in this video i'm going
00:46 to tell you about what transformers are
00:47 how they work and why they've been so
00:49 impactful let's get to it so what is a
00:52 transformer it's a type of neural
00:54 network architecture to recap neural
00:56 networks are a very effective type of
00:58 model for analyzing complicated data
01:00 types like images videos audio and text
01:02 but there are different types of neural
01:04 networks optimized for different types
01:05 of data like if you're analyzing images
01:07 you typically use a convolutional neural
01:09 network which is designed to vaguely
01:11 mimic the way that the human brain
01:13 processes vision and since around 2012
01:16 neural networks have been really good at
01:17 solving vision tasks like identifying
01:19 objects in photos but for a long time
01:21 we didn't have anything comparably good
01:23 for analyzing language whether for
01:25 translation or text summarization or anything else
01:28 and this is a problem because language
01:30 is the primary way that humans
01:31 communicate you see until transformers
01:33 came around the way we use deep learning
01:35 to understand text was with a type of
01:36 model called a recurrent neural network
01:39 or an rnn that looks something like this
01:42 let's say you wanted to translate a
01:44 sentence from english to french
01:46 an rnn would take as input an english
01:48 sentence and process the words one at a
01:50 time and then sequentially spit out
01:52 their french counterparts the key word
01:54 here is sequential in language the order
01:56 of words matters and you can't just
01:58 shuffle them around
01:59 for example the sentence jane went
02:02 looking for trouble means something very
02:04 different from the sentence trouble went looking for jane
02:07 so any model that's going to deal with
02:08 language has to capture word order and
02:10 recurrent neural networks do this by
02:12 looking at one word at a time
02:13 sequentially but rnns had a lot of problems
02:16 first they never really did well at
02:17 handling large sequences of text like
02:20 long paragraphs or essays by the time
02:22 they were analyzing the end of a
02:23 paragraph they'd forget what happened in the beginning
02:25 and even worse rnns were pretty hard to
02:28 train because they processed words
02:30 sequentially they couldn't parallelize well
02:32 which means that you couldn't just speed
02:33 them up by throwing lots of gpus at them
02:36 and when you have a model that's slow to
02:37 train you can't train it on all that
02:38 much data this is where the transformer
02:40 changed everything they were a model
02:42 developed in 2017 by researchers at
02:44 google and the university of toronto and
02:46 they were initially designed to do
02:47 translation but unlike recurrent neural
02:50 networks you could really efficiently
02:51 parallelize transformers and that meant
02:53 that with the right hardware you could
02:54 train some really big models
02:59 remember gpt-3 that model that writes
03:00 poetry and code and has conversations
03:03 that was trained on almost 45 terabytes
03:05 of text data including almost the entire public web
03:10 so if you remember anything about
03:11 transformers let it be this combine a
03:13 model that scales really well with a
03:15 huge data set and the results will
03:17 probably blow your mind so how do these
03:19 things actually work from the diagram in
03:21 the paper it should be pretty clear
03:24 or maybe not actually it's simpler than
03:26 you might think there are three main
03:28 innovations that make this model work so well
03:30 positional encodings attention and
03:32 specifically a type of attention called self-attention
03:36 let's start by talking about the first
03:37 one positional encodings
03:39 let's say we're trying to translate text
03:40 from english to french positional
03:42 encoding is the idea that instead of
03:44 looking at words sequentially you take
03:46 each word in your sentence and before
03:47 you feed it into the neural network you
03:49 slap a number on it one two three
03:51 depending on what number the word is in
03:53 the sentence in other words you store
03:55 information about word order in the data
03:57 itself rather than in the structure of
03:58 the network then as you train the
04:00 network on lots of text data it learns
04:02 how to interpret those positional encodings
04:05 in this way the neural network learns
04:07 the importance of word order from the data
04:10 this is a high-level way to understand
04:12 positional encodings but it's an
04:13 innovation that really helped make
04:15 transformers easier to train than rnns
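here's a minimal python sketch of the idea, using numpy. the "slap a number on it" framing above is the intuition; the actual scheme in the attention is all you need paper encodes each position as a pattern of sines and cosines, which is what's sketched below (sizes are toy values chosen for illustration):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # one vector per position, added to the word embeddings so that
    # "word 1" and "word 5" look different to the network
    positions = np.arange(seq_len)[:, np.newaxis]        # shape (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]             # shape (1, d_model)
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions: cosine
    return encoding

# toy sizes: a 4-word sentence with 8-dimensional embeddings
embeddings = np.random.randn(4, 8)
model_input = embeddings + positional_encoding(4, 8)  # word order now lives in the data itself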
04:18 the next innovation in this paper is a
04:20 concept called attention which you'll
04:21 see used everywhere in machine learning
04:23 these days in fact the title of the
04:25 original transformer paper is attention is all you need
04:29 the agreement on the european economic
04:31 area was signed in august 1992. did you get that
04:35 that's the example sentence given in the
04:36 original paper and remember the original
04:39 transformer was designed for translation
04:41 now imagine trying to translate that
04:43 sentence to french one bad way to
04:45 translate text is to try to translate
04:46 each word one for one but in french some
04:49 words are flipped like in the french
04:51 translation european comes after economic
04:54 plus french is a language that has
04:55 gendered agreement between words so the
04:57 word européenne needs to be in the
05:00 feminine form to match with la zone
05:02 the attention mechanism is a neural
05:03 network structure that allows a text
05:05 model to look at every single word in
05:07 the original sentence when making a
05:09 decision about how to translate a word
05:10 in the output sentence in fact here's a
05:13 nice visualization from that paper that
05:14 shows what words in the input sentence
05:16 the model is attending to when it makes
05:18 predictions about a word for the output sentence
05:21 so when the model outputs the word
05:24 european it's looking at the input words
05:26 european and economic you can think of
05:29 this diagram as a sort of heat map for the model's attention
05:32 and how does the model know which words
05:33 it should be attending to
05:35 it's something that's learned over time
05:38 by seeing thousands of examples of
05:39 french and english sentence pairs the
05:41 model learns about gender and word order
05:43 and plurality and all of that
05:44 grammatical stuff so we talked about two
05:47 key transformer innovations positional
05:49 encoding and attention
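as a concrete picture of what attention computes, here's a minimal numpy sketch of scaled dot-product attention, the formulation from the attention is all you need paper: every output position scores every input position, the scores are softmaxed into weights, and the output is a weighted average of the inputs. names and sizes are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    # scores[i, j]: how relevant input position j is to output position i
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    # softmax each row of scores into weights that sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights  # weights is the "heat map" in the paper's figure

d = 8                             # embedding size (toy)
queries = np.random.randn(3, d)   # 3 output (e.g. french) positions
keys = np.random.randn(5, d)      # 5 input (e.g. english) words
values = np.random.randn(5, d)
output, heat_map = scaled_dot_product_attention(queries, keys, values)
print(heat_map.shape)  # (3, 5): each row shows where one output word "looks"
```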
05:51 but actually attention had been invented
05:52 before this paper the real innovation in
05:55 transformers was something called
05:56 self-attention a twist on traditional attention
06:00 the type of attention we just talked
06:01 about had to do with aligning words in
06:03 english and french which is really
06:05 important for translation but what if
06:06 you're just trying to understand the
06:08 underlying meaning in language so that
06:10 you can build a network that can do any
06:12 number of language tasks
06:14 what's incredible about neural networks
06:16 like transformers is that as they
06:18 analyze tons of text data they begin to
06:20 build up this internal representation or
06:22 understanding of language automatically
06:25 they may learn for example that the
06:27 words programmer and software engineer
06:29 and software developer are all
06:30 synonymous and they might also naturally
06:33 learn the rules of grammar and gender
06:34 and tense and so on the better this
06:37 internal representation of language the
06:38 neural network learns the better it will
06:40 be at any language task
06:42 and it turns out that attention can be a
06:44 very effective way to get a neural
06:45 network to understand language if it's
06:47 applied to the input text itself
06:50 let me give you an example take these
06:52 two sentences server can i have the check
06:55 versus looks like i just crashed the
06:57 server the word server here means two
06:59 very different things and i know that
07:01 because i'm looking at the context of
07:03 the surrounding words
07:04 self-attention allows a neural network
07:06 to understand a word in the context of
07:08 the words around it so when a model
07:10 processes the word server in the first
07:12 sentence it might be attending to the
07:14 word check which helps it disambiguate
07:16 a human server from a metal one
07:19 in the second sentence the model might
07:21 be attending to the word crashed to
07:22 determine that this server is a machine
07:24 self-attention can also help neural
07:26 networks disambiguate words recognize
07:28 parts of speech and even identify word tense
07:31 this in a nutshell is the value of self-attention
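mechanically, self-attention is the same computation as the attention sketch above, just pointed back at the input: queries, keys, and values are all derived from the same sentence through learned projection matrices (random here, purely for illustration):

```python
import numpy as np
# reuses scaled_dot_product_attention and d from the earlier sketch

x = np.random.randn(7, d)  # embeddings for a 7-word sentence
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))  # learned in a real model
out, weights = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
# weights[i, j]: how much word i attends to word j in the same sentence --
# after training, the row for "server" could weight "crashed" heavily
print(weights.shape)  # (7, 7)
```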
07:34 so to summarize transformers boil down
07:36 to positional encodings attention and self-attention
07:41 of course this is a 10,000-foot look
07:43 transformers but how are they actually
07:45 useful one of the most popular
07:46 transformer based models is called bert
07:49 which was invented just around the time
07:50 that i joined google in 2018.
07:52 bert was trained on a massive text
07:54 corpus and has become this sort of
07:56 general pocket knife for nlp that can be
07:58 adapted to a bunch of different tasks
08:01 like text summarization question
08:02 answering classification and finding similar sentences
08:06 it's used in google search to help
08:08 understand search queries and it powers
08:10 a lot of google cloud's nlp tools like
08:12 google cloud automl natural language
08:14 bert also proved that you could build
08:16 very good models on unlabeled data like
08:18 text scraped from wikipedia or reddit
08:21 this is called semi-supervised learning
08:23 and it's a big trend in machine learning
08:27 so if i've sold you on how cool
08:28 transformers are you might want to start
08:30 using them in your app no problem
08:32 tensorflow hub is a great place to grab
08:34 pre-trained transformer models like
08:36 bert you can download them for free in
08:38 multiple languages and drop them straight into your app
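as a sketch of what that looks like (the two tfhub.dev handles below are representative bert models; browse tfhub.dev for current names and versions):

```python
import tensorflow as tf
import tensorflow_hub as hub

# illustrative handles -- check tfhub.dev for the variant and version you want
preprocess = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

sentences = tf.constant(["looks like i just crashed the server"])
outputs = encoder(preprocess(sentences))
print(outputs["pooled_output"].shape)  # (1, 768): one embedding per sentence
```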
08:41 you can also check out the popular
08:42 transformers python library built by the
08:45 company hugging face that's one of the
08:46 community's favorite ways to train and
08:48 use transformer models for more
08:50 transformer tips check out my blog post
08:52 linked below and thanks for watching
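as a parting example, a couple of lines with hugging face's pipeline api gets you a working transformer. this is a sketch: the task names are the library's own, but the default model behind each task changes between library versions:

```python
from transformers import pipeline

# downloads a pretrained transformer model the first time it runs
classifier = pipeline("sentiment-analysis")
print(classifier("transformers are like a magical machine learning hammer"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

translator = pipeline("translation_en_to_fr")
print(translator("the agreement on the european economic area was signed in august 1992"))
```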