
Conference: Jensen Huang (NVIDIA) and Ilya Sutskever (OpenAI). AI Today and the Vision of the Future

Mind Cathedral | 2023-03-23
32K views | 1 year ago
💫 Short Summary

The video delves into the journey of a computer scientist from the University of Toronto to co-inventing AlexNet, emphasizing the importance of deep learning in AI development. It discusses the shift towards large neural networks, optimization breakthroughs, and the impact of GPUs. OpenAI's work on unsupervised learning and scaling laws is highlighted, along with the importance of reinforcement learning. GPT-4's reasoning skills and limitations are explored, emphasizing the need for neural networks to admit uncertainty. Multi-modality learning, combining text and images, enhances understanding. The future of language models lies in improving reliability and trustworthiness for broader applications.

✨ Highlights
✦
The journey of a computer scientist from the University of Toronto to co-inventing AlexNet with Alex Krizhevsky and Geoffrey Hinton.
00:29
Initial challenges faced and importance of deep learning in AI development.
Impact of neural networks and parallel computing in revolutionizing the field.
Role of scale in training neural networks emphasized.
Visionary approach to AI research and collaborative efforts highlighted.
✦
Emphasis on large and deep neural networks for solving hard tasks.
06:04
Optimization methods developed by James Martens were a crucial breakthrough for training neural networks.
Importance of data selection, such as the challenging ImageNet dataset, for successful training.
Evolution of neural network training methods and impact of advancements in optimization techniques on the field.
✦
Introduction of GPUs in the lab revolutionized neural network training.
07:51
ImageNet dataset showcased the effectiveness of convolutional neural networks on GPUs, leading to faster and more efficient training.
Fast programming of convolutional kernels by Alex Krizhevsky set new standards in computer vision.
Breakthroughs with the ImageNet dataset paved the way for the establishment of OpenAI.
OpenAI initially focused on diverse AI research before developing GPT models.
✦
OpenAI's early work on unsupervised learning through compression was groundbreaking.
12:37
The sentiment neuron, discovered through training neural networks on Amazon reviews, demonstrated the potential of unsupervised learning.
Predicting the next character accurately led to the discovery of a neuron corresponding to sentiment.
This work influenced OpenAI's approach to machine learning and emphasized the significance of data compression for revealing hidden insights.
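The sentiment-neuron result came from training a network to predict the next character of review text. A toy illustration of that objective, using a simple bigram count model rather than the recurrent network the original work used, might look like this (the sample text is invented for the example):

```python
from collections import Counter, defaultdict

# Toy sketch of "learning by predicting the next character".
# The original work trained an LSTM on millions of Amazon reviews;
# here a bigram count table stands in for the learned model.
text = "great product great price great service bad packaging"

# Count how often each character follows each other character.
follows = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    follows[prev][nxt] += 1

def predict_next(prev):
    """Most likely next character given the previous one."""
    return follows[prev].most_common(1)[0][0]

print(predict_next("g"))  # 'r' — in this text 'g' is usually followed by 'r'
```

Pushing this objective to very high accuracy forces a model to capture regularities far beyond bigrams, which is how a sentiment-tracking neuron could emerge without any sentiment labels.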
✦
Importance of scaling in improving model performance, particularly in models like GPT.
17:20
OpenAI paper on scaling laws and the relationship between loss, model size, and data set size.
Strong belief in the benefits of larger models from the beginning.
Journey of GPT models from GPT-1 to GPT-3, driven by the intuition that bigger is better.
Emphasis on utilizing scale correctly and scalability from the start.
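The scaling-law papers model loss as an approximate power law in model size, dataset size, and compute. A minimal sketch of how such an exponent can be recovered by a straight-line fit in log-log space, using made-up constants rather than OpenAI's actual measurements:

```python
import numpy as np

# Hypothetical power-law scaling: L(N) = a * N^(-b) for model size N.
# The constants below are illustrative, not from the scaling-laws paper.
sizes = np.array([1e6, 1e7, 1e8, 1e9, 1e10])   # parameter counts (assumed)
a, b = 50.0, 0.076
losses = a * sizes ** (-b)                      # idealized, noise-free losses

# In log space the power law becomes a line: log L = log a - b * log N,
# so ordinary linear regression recovers the exponent.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
print(f"fitted exponent b = {-slope:.3f}")
```

The practical consequence highlighted in the talk: once such a fit holds, the loss of a much larger model can be predicted before anyone trains it.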
✦
Importance of reinforcement learning in training AI agents for real-time strategy games like Dota 2.
21:55
OpenAI's project to train a reinforcement learning agent to play against itself and reach a competitive level.
Training large neural networks to predict text shows how the network learns a representation of the world and the human condition from statistical correlations in text.
The neural network acquires a compressed abstract representation of human experiences and interactions.
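The self-play idea can be sketched in miniature: an agent plays against a frozen copy of its own policy and reinforces whatever wins. This toy rock-paper-scissors loop is an invented illustration, nothing like the actual Dota 2 system:

```python
import random

# Minimal self-play sketch: play against a snapshot of yourself,
# reinforce moves that beat the snapshot. (Toy example only.)
random.seed(0)
MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

weights = {m: 1.0 for m in MOVES}          # current policy (move preferences)

def sample(policy):
    """Sample a move in proportion to its weight."""
    total = sum(policy.values())
    r = random.uniform(0, total)
    for move, w in policy.items():
        r -= w
        if r <= 0:
            return move
    return move

for episode in range(1000):
    opponent = dict(weights)               # frozen copy: the "past self"
    mine, theirs = sample(weights), sample(opponent)
    if BEATS[mine] == theirs:              # my move beat the snapshot's move
        weights[mine] += 0.1

print(weights)
```

Because the opponent is always yesterday's policy, the curriculum automatically keeps pace with the agent's skill, which is what let the Dota agent reach a competitive level.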
✦
Training neural networks involves communicating desired behaviors and setting boundaries to avoid unsafe actions.
24:44
Research and innovation are ongoing to improve communication fidelity and make neural networks more reliable and precise.
Applications like ChatGPT have grown rapidly for their ease of use and ability to exceed expectations.
Users can interact with these applications without specific instructions, refining their intents through conversation with the AI.
The impact of AI applications is evolving and shaping the future.
✦
GPT-4 improvements over ChatGPT.
26:33
GPT-4 shows higher SAT and GRE scores, better exam performance, and improved understanding.
GPT-4's enhanced prediction of the next word leads to deeper text understanding and challenges the idea that deep learning cannot lead to reasoning.
GPT-4 excels in reasoning tests that ChatGPT struggled with, demonstrating advancements in neural network capabilities and problem-solving skills.
✦
The segment discusses GPT-4's reasoning skills and the limitations that can still be improved.
30:57
Neural networks can address limitations by thinking out loud and improving reliability.
The goal is to achieve higher reliability and accuracy in predictions for more precise responses.
Emphasis is on improving the reliability of neural networks for various applications.
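One concrete reliability behavior the summary mentions is a model admitting uncertainty instead of guessing. A minimal sketch of that idea, applied to a classifier's output probabilities (the threshold and examples are invented):

```python
# Toy sketch of "admitting uncertainty": answer only when the top
# probability clears a confidence threshold, otherwise abstain.
def answer_or_abstain(probs, threshold=0.8):
    """Return the top answer, or an abstention when confidence is low."""
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else "I'm not sure"

print(answer_or_abstain({"Paris": 0.95, "Lyon": 0.05}))   # Paris
print(answer_or_abstain({"Paris": 0.55, "Lyon": 0.45}))   # I'm not sure
```

The hard research problem is upstream of this snippet: getting the model's stated confidence to actually track how often it is right.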
✦
GPT-4 capabilities and benefits for learning from text and images.
35:45
GPT-4 is a strong next-token predictor that can also consume images, and is fine-tuned with curated data and reinforcement learning.
Multi-modality GPT-4 enhances understanding of the world by learning from text and images.
Human reliance on vision makes multi-modality essential for neural networks to see and learn from images.
Learning from images in addition to text significantly enhances learning and perception.
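At its simplest, multi-modal learning means the network consumes one representation built from both modalities. This fusion-by-concatenation sketch is an assumed illustration, not GPT-4's actual architecture:

```python
import numpy as np

# Toy multi-modal fusion: concatenate a text embedding and an image
# embedding into one vector the rest of the network would consume.
# (Illustrative stand-ins for learned features, not real model outputs.)
rng = np.random.default_rng(0)
text_embedding = rng.standard_normal(8)    # stand-in for a text feature vector
image_embedding = rng.standard_normal(8)   # stand-in for an image feature vector

fused = np.concatenate([text_embedding, image_embedding])
print(fused.shape)  # one vector carrying both modalities
```

The point of the talk is that training on the fused signal lets knowledge that is rare in text (colors, shapes, spatial layout) enter the model through the image side.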
✦
Importance of accessing multiple sources of information.
39:00
Neural networks can learn from text and make connections without visual stimuli.
Text-based learning is slower but provides valuable knowledge.
Adding vision enhances learning by capturing additional details.
Audio can contribute to learning models, but not as significantly as images or video.
✦
The importance of audio in AI recognition and production was emphasized, focusing on advancements in GPT-3 and GPT-4 testing.
43:12
GPT-3.5 faced challenges in tests involving diagrams, but GPT-4 showed improvement when visual input was added, leading to a notable increase in success rates.
Visual reasoning and communication were identified as key factors in enhancing AI capabilities, potentially enabling more effective learning and problem-solving.
The idea of AI generating its own training data was discussed, raising questions about the future of synthetic data generation in AI development.
✦
Advancements in language model technology will focus on improving reliability and trustworthiness in systems.
47:02
Systems are expected to recognize important details, ask for clarification when needed, and provide accurate and useful summaries.
The progress in language model technology will significantly impact the usability of these systems.
Advancements in the next two years are crucial for building trust among users and expanding the applications of the technology.
✦
Advancements in GPT models like GPT-4 have shown improved ability in various tasks.
51:54
GPT models can now solve math problems, produce poems, explain jokes and memes, and provide clear explanations for complex images.
Evolution of neural networks to handle larger data sets and different training algorithms has been surprising and effective.
The speakers reflect on the exponential growth in computational power over the past 10 years and praise the accomplishments in advancing AI technology.
Inventions like AlexNet and GPT at OpenAI are highlighted for their contributions to AI technology.
✦
Conclusion of the video segment.
53:02
The speaker expressed gratitude and appreciation.
The experience was enjoyable for the speaker.