Scaling laws: these underpin the success of large language models today, but the relationship between datasets, compute, and the number of parameters was not always clear. In fact, in 2022 a pivotal paper came out that changed the way many people in the research community thought about this very calculus, demonstrating that datasets were actually more important than the sheer size of the model. One of the key authors of that paper was Arthur Mensch, who was working at DeepMind at the time. Earlier this year, Arthur banded together with two other researchers, Guillaume Lample and Timothée Lacroix, both at Meta, where they worked on Llama, and together the three of them founded a new company: Mistral. That team has been hard at work, releasing Mistral 7B in September, a state-of-the-art open source model that quickly became a go-to for developers, and just in the last few days they released a new mixture-of-experts model that, naturally, they're calling Mixtral.
So today you'll get to hear directly from Arthur as he sits down with a16z General Partner Anjney Midha as the battleground for large language models heats up, to say the least. Together they discuss the many misconceptions around open source and the war being waged on the industry, plus the current performance reality of open versus closed models and whether that gap will realistically close with time, plus the kind of compute, data, and algorithmic innovations required to keep scaling. Now, it's really rare to have someone at the frontier of this kind of research be so candid about what they're building and why, so I hope you come out of this episode as excited about the future of open source as I did.
Enjoy.

As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see our disclosures.
You've got quite the founding team story. If we flash back to a few years ago, labs were building foundation models, and the consensus across the research community was that the size of these models was what mattered most; how many million or billion parameters went into the model seemed to be the primary debate people were having. But it seems like you had a hunch that datasets mattered more. Could you give us the backstory on the Chinchilla paper you co-wrote? What were the key takeaways, and how was it received?
Yeah, so I guess the backstory is that in 2019-2020, people were relying a lot on a paper called "Scaling Laws for Neural Language Models" that advocated for scaling the size of models basically infinitely while keeping the number of data points rather fixed. It said that if you had four times the amount of compute, you should mostly be multiplying your model size by about 3.5, and your data by maybe 1.2. A lot of work was done on top of that; in particular at DeepMind, when I joined, I joined a project called Gopher, and there was a misconception there. There was also a misconception behind GPT-3, and basically every paper in 2021 made this mistake. At the end of 2021 we started to realize there were some issues. We went back to the mathematical paper that was actually talking about scaling laws, which was a bit hard to understand, and we figured out that if you thought about it from a more theoretical perspective, and if you looked at the empirical evidence we had, it didn't really make sense to grow the model size faster than the data. We did some measurements, and as it turned out, what was actually true was what we expected, which in common words is: if you multiply your compute capacity by four, you should multiply the model size by two and the data size by two. That's approximately what you should be doing, which is good, because if you push everything to infinity, everything remains consistent: you don't end up with a model that is infinitely big, or a model that is infinitely small with infinite compression, or close to zero compression. So it really makes sense, and as it turns out, it's really what you observe if you do multiple runs. That's how we trained Chinchilla, and that's how we wrote the Chinchilla paper.
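A minimal sketch of that arithmetic, assuming the commonly used approximation that training compute C is about 6 x parameters x tokens, and the roughly 20-tokens-per-parameter ratio often quoted from the Chinchilla paper. Both constants are illustrative, not exact figures from the conversation:

```python
import math

def compute_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that are roughly compute-optimal.

    Assumes C ~ 6 * N * D and D = r * N, which gives N = sqrt(C / (6 * r)).
    tokens_per_param ~ 20 is the oft-quoted Chinchilla ratio; treat it as
    an illustrative constant, not gospel.
    """
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n1, d1 = compute_optimal(1e21)
n4, d4 = compute_optimal(4e21)
# Because N and D each scale as sqrt(C), 4x compute doubles both:
print(f"4x compute -> {n4 / n1:.1f}x params, {d4 / d1:.1f}x tokens")  # ~2.0x each
```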
At the time you were at DeepMind and your co-founders were at Meta. What's the backstory around how you three ended up coming together to form Mistral after the compute-optimal scaling laws work you just described?
We've known each other for a while, because Guillaume and I were in school together, and Timothée and I were in a master's program together in Paris. We had very parallel careers. Timothée and I actually worked together again when I was doing a postdoc in mathematics, and then I joined DeepMind as Guillaume and Timothée went on to become permanent researchers at Meta. We continued along those lines: I was working on large language models between 2020 and 2023, while Guillaume and Timothée were working on solving mathematical problems with large language models. If I understand correctly, I wasn't there, but they realized they needed stronger models, and they started doing large language models themselves at that point, I guess a year after I started. On my side, I was mostly working in a small team at DeepMind. We did some very interesting work: Retro, a paper on retrieval for large language models; we did Chinchilla; then I was on the team doing Flamingo, which is actually one of the good ways of building a model that can see things. When ChatGPT came out, we knew, I mean we knew from before, that the technology was very much game-changing, but it was a signal that there was a strong opportunity for building a small team focusing on a different way of distributing the technology, redoing things in a more open source manner, which was not the direction that Google, at least, was taking. So we had this opportunity, we left our companies at the beginning of last year, and we created the team that started work on the 5th of June.
And if I recall correctly, right before they left, Timothée and Guillaume had started to work on Llama over at Meta. Could you describe that project and how it was related to the Chinchilla scaling laws work you'd done?
So Llama was a small-team reproduction of Chinchilla, at least in its approach to parameterization and all of those things. It was one of the first papers to establish that you could go beyond the Chinchilla scaling laws. The Chinchilla scaling laws tell you what you should train if you want an optimal model for a certain compute cost at training time. But if you take into account the fact that your model should also be efficient at inference time, you probably want to go far beyond the Chinchilla scaling laws. That means you want to overtrain the model, training on more tokens than would be optimal for training-compute performance. The reason you do that is that you compress the model more, and then when you do inference, you end up with a model that is much more efficient for a given performance. By spending more compute during training, you spend less during inference, and so you save cost. I guess we had observed that at Google as well, but the Llama paper was the first to establish it in the open, and it opened a lot of opportunities.
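A hedged back-of-the-envelope of that overtraining trade-off. It uses the ~20-tokens-per-parameter compute-optimal ratio commonly attributed to Chinchilla and the publicly reported 1 trillion training tokens for the original 7B Llama; all numbers are illustrative, not from the conversation:

```python
# Back-of-the-envelope for overtraining, per the discussion above.
# Illustrative constants: ~20 tokens/param is the oft-quoted Chinchilla-optimal
# ratio; 1e12 tokens is the publicly reported figure for the first 7B Llama.
params = 7e9                                   # a 7B-parameter model
optimal_tokens = 20 * params                   # ~1.4e11 tokens at optimality
actual_tokens = 1e12                           # Llama-style overtraining

train_flops = 6 * params * actual_tokens       # ~4.2e22, vs ~5.9e21 if optimal
infer_flops_per_token = 2 * params             # inference cost tracks model size
print(f"trained {actual_tokens / optimal_tokens:.1f}x past compute-optimal")
# The payoff: the extra training buys a small model that stays cheap to serve;
# a larger compute-optimal model of equal quality would cost more per token
# at inference for the whole life of the deployment.
```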
Yep. I remember both the impact of the Chinchilla scaling laws work on multiple labs, which realized just how suboptimal their compute setups were, and then the impact of Llama being dramatic on the industry, which realized how to be much more efficient about inference time. So I can imagine those were some of the top insights and top concerns on your mind when you left to start Mistral. Let's fast-forward to today; it's December 2023. We'll get to the role of open source in a bit, but let's level-set on what you've built so far. A couple of months ago you released Mistral 7B, which was a best-in-class model, and this week you're releasing a new mixture-of-experts model. So tell us a little bit more about Mixtral, I believe, is what you're calling it, and how it compares to other models.
Yeah, so Mixtral is our new model. It uses a technology called sparse mixture-of-experts, which hadn't been released in open source in a usable form before. It's quite simple: you take all of the dense layers of your Transformer and you duplicate them; you call these layers expert layers. Then, for each token in your sequence, you have a router mechanism, just a very simple network, that decides which expert should be looking at which token. So you send all of the tokens to their experts, you apply the experts, you get back the outputs and combine them, and then you go forward in the network. You have eight experts per layer and you execute only two of them. What that means at the end of the day is that you have a lot of parameters in your model, 46 billion parameters, but the number of parameters you actually execute is much lower, because you only execute two branches out of eight. So at the end of the day you only execute 12 billion parameters per token, and that is what counts for latency and throughput. You have a model with the cost profile of a 12-billion-parameter network, but with performance much higher than what you could get, even by compressing data a lot, out of a 12-billion-parameter dense Transformer. So sparse mixture-of-experts is a technology that lets you be much more efficient at inference time, and also much more efficient at training time, and that's the reason we were motivated to develop it very quickly.
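As a concrete illustration of the routing just described, here is a minimal sparse mixture-of-experts layer in PyTorch: eight experts, two active per token, exactly as in the description. This is a hedged sketch for intuition, not Mistral's actual implementation; all sizes and names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=8, top_k=2):
        super().__init__()
        # Each "expert" is a copy of the Transformer's dense feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # The router is "just a very simple network": a single linear layer.
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x)                  # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # pick 2 of the 8 experts
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen two
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e            # tokens routed to expert e
                if mask.any():                   # only chosen experts run, so only
                    w = weights[mask, k].unsqueeze(-1)  # ~2/8 of the FF parameters
                    out[mask] += w * expert(x[mask])    # are active per token
        return out
```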
Just for folks who are listening who might not be familiar with state-of-the-art architectures in language models, could you describe the difference between dense models, which have been the primary architecture to date, and mixture of experts? Intuitively, what are the biggest differences between these architectures?
They're very similar, except for what we call the dense network. In a dense Transformer, you alternate between an attention layer and a dense layer, generally; that's the idea. In a sparse mixture of experts, you take the dense layer and duplicate it several times, and that's where you actually increase the number of parameters. You increase the capacity of the model without increasing the cost, so it's a way of decoupling the memorization, what the model can remember, the capacity of the network, from its cost at inference time.
If you had to describe the biggest benefits for developers as a result of that inference efficiency?

It's cost and latency. Usually that's what you look at as a developer: you want something cheap and you want something fast. Generally speaking, the trade-off is strictly favorable to using Mixtral compared to a 12-billion-parameter dense model. The other way to think about it is that if you want a model as good as Llama 2 70B, you should be using Mixtral, because Mixtral is actually on par with Llama 2 70B while being approximately six times cheaper, or six times faster for the same price.
Could you talk a little bit about why it's been so challenging for research labs and research teams to really get the mixture-of-experts model right? It sounds like, for a while now, folks have known that the dense model architectures all of us have been using in the most well-known products are slow, expensive, and difficult to scale, so for a while people have been looking for an alternative architecture that could be, like you were saying, cheaper, faster, more efficient. What were some of the biggest challenges you had to figure out to get the MoE model right?
Well, I guess I won't disclose all the trade secrets, but there are basically two challenges. The first one is that you need to figure out how to train it correctly, from a mathematical perspective. The other challenge is to train it efficiently, that is, to actually use the hardware as efficiently as possible. You have new challenges coming from the fact that you have tokens flying around from one expert to another; that creates communication constraints, and you need to make it fast. And then, on top of that, you have new constraints that apply when you deploy the model: you need to do inference efficiently as well. That's also the reason we released an open source package based on vLLM, so that the community can take this code, modify it, and see how it works.
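For readers who want to try that, here is a hedged usage sketch of serving Mixtral through vLLM. The import and generate call follow vLLM's documented offline-inference API; the model identifier and sampling values are assumptions to check against the current Mistral and vLLM docs:

```python
# Hedged sketch: running Mixtral with vLLM's offline inference API.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mixtral-8x7B-v0.1")   # assumed Hugging Face repo id
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(
    ["Sparse mixture-of-experts models are efficient because"], params
)
print(outputs[0].outputs[0].text)
```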
Yeah, obviously we're excited to see what the community does with the Mixtral release you're putting out this week. Let's talk about open source, an approach and a philosophy that's permeated all the work you've been doing so far. Why choose to tackle this increasingly competitive space with an open source approach, which is quite different from the way everybody else is approaching it?
I guess it's a good question. The answer is that it's partly ideological and partly pragmatic. We have grown up with a field of AI that went from detecting cats and dogs in 2012 to generating human-like text in 2022, so it really made a lot of progress. And if you look at the reason we made all of this progress, most of it is explained by the free flow of information. You had academic labs, you had very big industry-backed labs, communicating all the time about their results and building on top of each other's results, and that's how we advanced the architectures and techniques so significantly; we just made everything work as a community. Then all of a sudden, in 2020, with GPT-3, that tide reversed, and companies started to be more opaque about what they were doing, because they realized there was actually a very big market. All of a sudden, from 2022 on, for the important aspects of AI and of LLMs, once we went to Chinchilla and beyond, there was basically no communication at all. That's something that I as a researcher, and Timothée and Guillaume and all of the people who joined us, deeply regretted, because we think we're definitely not at the end of the story. We need to invent new things; there's no reason to stop now, because the technology is effectively good but not yet working completely well enough. So we believe it's still the case that we should be communicating a lot about models, and we should be allowing the community to take the models and make them their own. That's the ideological reason we went in that direction. The other reason is that developers want to modify things, and having deep access to a very good model is a good way of engaging with this community and addressing their needs, so that the platform we're building will be used by them. So that's also a business reason. Obviously, as a business, we do need a valid monetization approach at some point, but we've seen many businesses build open-core approaches, with a very strong open source community and also a very good offering of services, and that's what we want to build.
That resonates. I remember a very detectable shift. You're right: the early days of deep learning were largely driven by a bunch of open collaboration between researchers from different labs, who would often publish all their work and share it at conferences. The Transformer paper, famously, was published and opened to the entire research community. But that has definitely changed.
Yes. So I think there are several levels of open sourcing in AI, and we offer the open weights and the inference code. That's the end product, and it is already super usable, so it's already a very big step forward compared to closed APIs, because you can modify it and you can look at what's happening under the hood, look at activations and so on. You get interpretability, and the possibility of modifying the model: adapting it to an editorial tone, to proprietary data, to specific instructions, which is something that is actually much harder to do if you only have access to a closed-source API. That also goes with our approach to the technology, which is to say that pre-trained models should be neutral, and we should empower our customers to take these models and put their editorial approaches in: their instructions, their constitution, if you want to talk like Anthropic, into the model. That's the way we approach the technology. We don't want to pour our own biases into the pre-trained model; on the other hand, we want to enable developers to control exactly how the model behaves, what kinds of biases it has and what kinds it doesn't have. So we really take this modular approach, and that goes very well with the fact that we release some very strong open-weight models.
Could you ground us in the reality of where these models are today, just to give people a sense of where in the timeline we are? Is open source really a viable competitor to proprietary, closed models, or is there a performance gap? What are the trade-offs or limitations people should be aware of with open source?
So Mixtral has similar performance to GPT-3.5; that's a good grounding. Internally, we have stronger models that sit between 3.5 and 4, basically the second- or third-best models in the world. So we really think the gap is closing; the gap is approximately six months at this point. And the reason it's six months is that things actually go faster if you do open source, because you get the community modifying the model and testing very good ideas that can then be consolidated by us, for instance, and we just go faster because of that. It has always been the case that open source, in the end, ends up going faster; that's the reason the entire internet runs on Linux. I don't see why it would be any different for AI. Obviously there are some constraints that are slightly different, because the infrastructure cost is quite high; training a model costs a lot of money. But I really think we'll converge to a setting where you have proprietary models and open source models that are just as good, and I think eventually the field will be much more open, because if you want to go beyond the biggest models of today, you do need to find new paradigms, and that means we also need to do research. We're very excited by this prospect, because we like competitive environments and research.
So let's talk about that a little more. How are you seeing people use and innovate on the open source models, and are there any use cases that diverge from proprietary, closed models at all?
I think we've seen several categories of usage. There are a few companies that know how to strongly fine-tune models to their needs: they took Mistral 7B, had a lot of human annotations, had a lot of proprietary data, and modified Mistral 7B so that it solves their task just as well as GPT-3.5, but at a lower cost and with a higher level of control. We've also seen very interesting community efforts to add capabilities to Mistral 7B. We saw a context-length extension to 128k that worked very well; again, it was done in the open, so the recipe was available, and this is something we were able to consolidate. We've seen image encoders added to make it a visual language model. A very actionable thing we saw: I think the Hugging Face folks were the first to do direct preference optimization (sketched below) on top of Mistral 7B, and they made a much stronger model than the instructed model we proposed at the initial release. It turned out to be a very good idea, and it's something we've consolidated as well. So generally speaking, the community is super eager to take the model and add new capabilities, put it on a laptop, put it on an iPhone. I saw Mistral 7B on an iPhone; I saw Mistral 7B on a stuffed parrot as well. Fun things, useful things. But generally speaking, it's been super exciting to see the research community take hold of our technology, and with Mixtral, which is a new architecture, I think we're also going to see much more interesting things, in the interpretability field and also in the safety field, because as it turns out there is a lot you can do when you have deep access to an open model. So we're really eager to help that and to engage with the community.
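Since direct preference optimization comes up above, here is a minimal sketch of the DPO objective from Rafailov et al. (2023), the technique the Hugging Face team applied on top of Mistral 7B. The helper name and the assumption that log-probabilities are summed per response are illustrative:

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Inputs: per-example summed log-probs, tensors of shape (batch,)."""
    # Policy-vs-reference log-ratios, chosen response minus rejected response.
    logits = (pi_chosen_logp - ref_chosen_logp) - (pi_rejected_logp - ref_rejected_logp)
    # Push the policy to prefer the chosen response over the rejected one,
    # relative to a frozen reference model; beta controls the strength.
    return -F.logsigmoid(beta * logits).mean()
```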
Safety, you know, is an important piece to talk about. The immediate reaction of a lot of folks is to deem open source less safe than closed models. How would you respond to that?
So I think, we believe, that it's actually not the case for the current generation of models. The models we are using today are not much more than a compression of whatever is available on the internet. They do make access to knowledge more fluid, but this is the story of humanity: making access to knowledge more free. It's no different from inventing the printing press, where apparently we had a similar debate; I wasn't there, but that was the debate. So we are not making the world any less safe by providing more interactive access to knowledge. That's the first thing. Now, the other thing is that you do have immediate risks of misuse of large language models, and you have them for open source models but also for closed models. The way you address these problems and come up with countermeasures is to know about them: you need to know about breaches, the same way you need to know about breaches in operating systems and in networks, and it's no different for AI. Putting models under the highest level of scrutiny is a way of learning how they can be misused, and it's a way of coming up with countermeasures. A good example is that it's actually super easy to exploit an API: it's super easy, especially if you have fine-tuning access, to make GPT-4 behave in a very bad way. Since that's the case, and it's always going to be the case, because it's super hard to be adversarially robust, it means that otherwise we're trusting only the teams of large companies to figure out ways of addressing these problems, whereas if you do open sourcing, you trust the community, and the community is much larger. If you look at the history of software, in cybersecurity, in operating systems, that's the way we made systems safe. So if we want to make current AI systems safe, and then move on to a next generation that will potentially be even stronger, and then we can have this discussion again, well, you do need to do open sourcing. So today we think that open sourcing is the safe way.
Yeah, I think this is not widely understood: when you have thousands or hundreds of thousands of people able to red-team models, because they're open source, the likelihood that you'll detect biases and built-in breaches and risks is just dramatically higher. If you were talking to policymakers, how would you advise them? How do you think they should be thinking about regulating open source models, given that the safest way to battle-harden software and tools is often to put them out in the open?
Well, we've been saying precisely this: that the current technology is not dangerous. On the other hand, the fact that we are effectively making models stronger means that we need to monitor what's happening, to empirically monitor performance, and the best way of empirically monitoring software performance is through open source. That's what we've been saying. There have been some efforts to come up with very complex governance structures, where you would have several companies talking together, with some safe space, some sandbox, for red-teamers who would potentially be independent, things that are super complex. But as it turns out, if you look at the history of software, the only way we have built software collaboratively is through open source. So why change the recipe today, when the technology we're looking at is nothing other than a compression of the internet? That's what we've been saying to the regulators. Another thing we've added, for the regulators, is that if they want to enforce that AI products need to be safe, if you want a diagnosis assistant, you want it to be safe, right? Well, in order to monitor and evaluate whether it's actually safe, you need very good tooling, and the tooling requires access to LLMs. If you can only access closed-source LLMs through APIs, you're a bit in troubled waters, because it's hard to be independent in that setting. So we think that independent controllers of product safety should have access to very strong open source models and should own the technology.
And if open source LLMs were to fail relative to closed-source models, what would the reason be?

Well, I guess the regulation burden is potentially one thing that could make it harder to release open source models. Also, generally speaking, it's a very competitive market, and I think in order for open source models to be widely adopted, they need to be as strong as the closed-source models. They have a little advantage, because you have more control, so you can do heavier fine-tuning and make performance jump a lot on a specific task, because you have deep access. But really, at the end of the day, developers look at performance and latency, and that's why we think that, as a company, we need to be very much on the frontier if we want to be relevant.
Given the complexity of frontier models and foundation models and these systems, there are just tons of misconceptions that folks have about these models. So if you step back and look at the battle that's raging between folks pushing for closed-source systems versus open source systems, what do you think is at stake here? What do you think the battle is for?
Well, I think the battle is for the neutrality of the technology. A technology, by essence, is something neutral: you can use it for bad purposes, you can use it for good purposes. If you look at what an LLM does, it's not really different from a programming language; it's actually used very much like a programming language by application makers. There's a strong confusion between what we call a model and what we call an application. A model is really the programming language of an application. If you talk to all of the startups doing amazing products with generative AI, they're using LLMs just as a function (sketched below), and on top of that they have very big systems with filters, with decision-making, with control flow, all of these things. What you want to regulate, if you want to regulate something, is the system; the system is the product. For instance, a healthcare diagnosis assistant is an application. You want it to be unbiased; you want it to make good decisions, even under high pressure, so you want its statistical accuracy to be very high, and you want to measure that. And it doesn't matter whether it uses a large language model under the hood: what you want to regulate is the application. The issue we had, and the issue we're still having, is that we hear a lot of people saying we should regulate the tech, so we should regulate the function, the mathematics. But you never use a large language model by itself; you only ever use it in an application, with a user interface, and so that's the thing you want to regulate. What that means is that companies like us, foundation-model companies, will obviously make the model as controllable as possible, so that the applications on top of it can be compliant and safe. We'll also build the tools that allow people to measure the compliance and safety of the application, because that's super useful for application makers; it's needed. But there's no point in regulating something that is neutral in itself, that is just a mathematical tool. I think that's the one thing we've been hammering on a lot, and I think we've been heard in places, which is good, but there's still a lot of effort to put into making this strong distinction, which is super important to understand what's going on.
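To make the model-versus-application distinction concrete, here is a hedged sketch of the shape of such a system: the LLM is called like a function, while the application around it supplies the filters, decision-making, and control flow that product-safety rules would actually target. Every name and check here is illustrative:

```python
# Hedged sketch: the application (the regulatable product) wraps the model
# (a neutral function), per the distinction drawn above.

def call_llm(prompt: str) -> str:
    """Stand-in for any model call; the 'programming language' layer."""
    raise NotImplementedError("wire up a real model client here")

BLOCKED_TERMS = ("example-blocked-term",)      # illustrative input filter

def diagnosis_assistant(symptoms: str) -> str:
    # Input filtering: an application-level policy, not a model property.
    if any(term in symptoms.lower() for term in BLOCKED_TERMS):
        return "Request refused by application policy."
    draft = call_llm(f"List plausible causes and next steps for: {symptoms}")
    # Decision-making / control flow: route low-confidence answers to a human.
    if "uncertain" in draft.lower():
        return "Low confidence; please consult a clinician."
    return draft + "\n(Informational only; not medical advice.)"
```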
So "regulate apps, not math" seems like the right direction, one that a lot of folks who understand the inner workings of these models, and how they're actually implemented in reality, are pushing for. What do you think is the best way to clear up this misconception for folks who maybe don't have technical backgrounds and don't actually understand how foundation models work and how the scaling laws work?
I've been using a lot of metaphors to make it understood: large language models are like programming languages. You don't regulate programming languages; you regulate malware, you ban malware. We've also been actively vocal about the fact that pre-market conditions like FLOPs, the number of FLOPs you spend to create a model, are definitely not the right way of measuring the performance of a model. We're very much in favor of having very strong evaluations. As I've said, this is something we want to provide to our customers: the ability to evaluate our models in their applications. So that's a very strong thing we've been stressing: we want to provide the tools for application makers to be compliant. We find it a bit unfortunate that we haven't been heard everywhere, and that there's still a big focus on the tech, probably because things are not completely well understood; it's a very complex field, and it's also a very fast-moving one. But eventually, I'm very optimistic that we'll find a way to continue innovating while having safe products, and also a high level of competition at the foundation-model layer.
Well, let's channel your optimism a little bit. There are very few people who have the ground-level understanding of scaling laws that you, Guillaume, and Timothée and your team have. When you step back and look at the entire space of language modeling, in addition to open source, what are the key differentiators you see in the next wave of cutting-edge models, things like self-play, process reward models, the uses of synthetic data? If you had to conjecture, what do you think some of the most exciting or important breakthroughs in the field will be going forward?
So I guess it's good to start with a diagnosis: what is not working that well? Reasoning is not working that well, and it's super inefficient to train a model; if you compare the training process of a large language model to the brain, you have a factor of, I think, 100,000. So there's real progress to be made in terms of data efficiency. I think the frontier is increasing data efficiency and increasing reasoning capabilities, so adaptive compute is one direction. And to increase data efficiency, you need to work on producing very high-quality data, on filtering, on many new techniques that still need to be invented, but that's really where the lock is. Data is the one important thing, and the ability of the model to decide how much compute it wants to allocate to a certain problem is definitely on the frontier as well. These are things we're actively looking at.
You know, this is a raging debate, and we've talked about it a few times before: can models actually reason today? Do they actually generalize out of distribution? What's your take on it, and what would convince you that models are actually capable of multi-step, complex reasoning?
Yeah, it's very hard, because you train on the entirety of human knowledge, and so you have a lot of instances of reasoning in there. It's hard to say whether models reason, or whether they do retrieval of reasoning that merely looks like reasoning. I guess at the end of the day, what matters is whether it works or not, and on many simple reasoning tasks it does, so we can call it reasoning. It doesn't really matter whether they reason like we do; we don't even know how we reason, so we're not going to know how machines reason anytime soon. So yes, it's a raging debate. The way you evaluate it is to try to be as far out of distribution as possible. Working on mathematics is not something I've ever done, but it's something Timothée and Guillaume are very sensitive to, because they did it for a while when they were at Meta; that's probably one way of measuring whether you have a very good model or not. And actually, we're starting to see some very good mathematicians, I'm thinking of Terence Tao, using large language models for some things: obviously not the high-level reasoning, but for some parts of their proofs. So I think we will move up in abstraction, and the question is where that stops. We do need to find new paradigms to go one step further, and we will be actively looking for them.
We've talked a lot about developers so far. If you had to channel your product view and conjecture on what these advances in scaling laws, in representation learning, in teaching the models to reason faster, better, cheaper will mean: what will these advances mean for end users, in terms of how they consume, how they program, and how they generally work with models?
What we think is that, fast-forward five years, everybody will be using their own specialized models as parts of complex applications and systems. Developers will be very focused on latency: for any specific task in the system, they will want the lowest cost and the lowest latency. The way you make that happen is that you look at the task, at user preferences, at what you want the model to do, and you try to make the model as small as possible and as well suited to the task as possible. So I think that's how things will evolve on the developer side. I also think that, generally speaking, the fact that we have access to large language models is going to completely reform the way we interact with machines; the internet of five years from now is going to be much different, much more interactive. I think this is already unlocked; it's just about making very good applications with very fast systems and very fast models. So yeah, very exciting times ahead.
So what would those interaction modalities look like?

That's very interesting, and I think in games, for instance, it's going to be fascinating. We've seen some very good applications. You do need small models, because you want to have swarms of them, and it starts to get a bit costly if the model is too big, but having them interact is going to make for pretty complex systems, and interesting systems to observe and to use. We have a few friends making applications in the enterprise space with different personas playing different roles, relying on the same language model but with different prompts and different functioning, and I think that's going to be quite interesting to look at as well. As I've said, complex applications in three years' time are just going to use different LLMs for different parts, and that's going to be quite exciting.
Well, what's your call to action for builders, researchers, folks who are excited about the space? What would you ask them to do?

I would take the Mistral models and try to build amazing applications, the way many developers already have. It's not that hard; the stack is starting to be pretty clear and pretty efficient. You only need a couple of GPUs; you can even do it on your MacBook Pro if you want. It's going to run a bit hot, but it's good enough to build applications. Really, the way we do software today is very different from the way we did it last year, so I'm really calling application makers to action, because we are going to try to enable them to build as fast as possible.
Thank you so much for listening to the a16z podcast. What we're trying to do here is provide an informed, clear-eyed, but also optimistic view of technology and its future, and we're trying to do that by featuring some of the most inspiring people and the things they're building. So if you believe in that and you'd like to join us on this journey, make sure to click subscribe, and also let us know in the comments below what you'd like to see us cover next. Thank you so much for listening, and we