Scaling laws: these underpin the success of large language models today, but the relationship between datasets, compute, and the number of parameters was not always clear. In fact, in 2022 a pivotal paper came out that changed the way many people in the research community thought about this very calculus, demonstrating that datasets were actually more important than the sheer size of the model. One of the key authors of that paper was Arthur Mensch, who was working at DeepMind at the time. Earlier this year, Arthur banded together with two other researchers, Guillaume Lample and Timothée Lacroix, both at Meta, where they worked on Llama, and together the three of them founded a new company: Mistral. That team has been hard at work, releasing Mistral 7B in September, a state-of-the-art open source model that quickly became a go-to for developers, and just in the last few days they released a new mixture-of-experts model that, naturally, they're calling Mixtral.
So today you'll get to hear directly from Arthur as he sits down with a16z General Partner Anjney Midha as the battleground for large language models heats up, to say the least. Together they discuss the many misconceptions around open source and the war being waged on the industry, plus the current performance reality of open versus closed models and whether that gap will realistically close with time, plus the kind of compute, data, and algorithmic innovations required to keep scaling. Now, it's really rare to have someone at the frontier of this kind of research be so candid about what they're building and why, so I hope you come out of this episode as excited about the future of open source as I did.
Enjoy.

As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see our disclosures.
You've got quite the founding team story. If we flash back to a few years ago, labs were building foundation models, and the consensus across the research community was that the size of these models was what mattered most; how many million or billion parameters went into the model seemed to be the primary debate people were having. But it seems like you had a hunch that datasets mattered more. Could you give us the backstory on the Chinchilla paper you co-wrote? What were the key takeaways, and how was it received?
Yeah, so I guess the backstory is that in 2019-2020, people were relying a lot on a paper called "Scaling Laws for Neural Language Models" that advocated for scaling the size of models basically infinitely while keeping the number of data points rather fixed. It said that if you had four times the amount of compute, you should mostly be multiplying your model size by about 3.5, and your data by maybe 1.2. A lot of work was done on top of that; in particular at DeepMind, when I joined, I joined a project called Gopher, and there was a misconception there. There was also a misconception behind GPT-3, and basically every paper in 2021 made this mistake. At the end of 2021 we started to realize there were some issues. We went back to the mathematical paper that was actually talking about scaling laws, which was a bit hard to understand, and we figured out that if you thought about it from a more theoretical perspective, and if you looked at the empirical evidence we had, it didn't really make sense to grow the model size faster than the data. We did some measurements, and as it turned out, what was actually true was what we expected, which in common words is: if you multiply your compute capacity by four, you should multiply the model size by two and the data size by two. That's approximately what you should be doing, which is good, because if you push everything to infinity, everything remains consistent: you don't end up with a model that is infinitely big, or a model that is infinitely small with infinite compression, or close to zero compression. So it really makes sense, and as it turns out, it's really what you observe if you do multiple runs. That's how we trained Chinchilla, and that's how we wrote the Chinchilla paper.
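A minimal sketch of that arithmetic, assuming the commonly used approximation that training compute C is about 6 x parameters x tokens, and the roughly 20-tokens-per-parameter ratio often quoted from the Chinchilla paper. Both constants are illustrative, not exact figures from the conversation:

```python
import math

def compute_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that are roughly compute-optimal.

    Assumes C ~ 6 * N * D and D = r * N, which gives N = sqrt(C / (6 * r)).
    tokens_per_param ~ 20 is the oft-quoted Chinchilla ratio; treat it as
    an illustrative constant, not gospel.
    """
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n1, d1 = compute_optimal(1e21)
n4, d4 = compute_optimal(4e21)
# Because N and D each scale as sqrt(C), 4x compute doubles both:
print(f"4x compute -> {n4 / n1:.1f}x params, {d4 / d1:.1f}x tokens")  # ~2.0x each
```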
At the time you were at DeepMind and your co-founders were at Meta. What's the backstory around how you three ended up coming together to form Mistral after the compute-optimal scaling laws work you just described?
We've known each other for a while, because Guillaume and I were in school together, and Timothée and I were in a master's program together in Paris. We had very parallel careers. Timothée and I actually worked together again when I was doing a postdoc in mathematics, and then I joined DeepMind as Guillaume and Timothée went on to become permanent researchers at Meta. We continued along those lines: I was working on large language models between 2020 and 2023, while Guillaume and Timothée were working on solving mathematical problems with large language models. If I understand correctly, I wasn't there, but they realized they needed stronger models, and they started doing large language models themselves at that point, I guess a year after I started. On my side, I was mostly working in a small team at DeepMind. We did some very interesting work: Retro, a paper on retrieval for large language models; we did Chinchilla; then I was on the team doing Flamingo, which is actually one of the good ways of building a model that can see things. When ChatGPT came out, we knew, I mean we knew from before, that the technology was very much game-changing, but it was a signal that there was a strong opportunity for building a small team focusing on a different way of distributing the technology, redoing things in a more open source manner, which was not the direction that Google, at least, was taking. So we had this opportunity, we left our companies at the beginning of last year, and we created the team that started work on the 5th of June.
And if I recall correctly, right before they left, Timothée and Guillaume had started to work on Llama over at Meta. Could you describe that project and how it was related to the Chinchilla scaling laws work you'd done?
So Llama was a small-team reproduction of Chinchilla, at least in its approach to parameterization and all of those things. It was one of the first papers to establish that you could go beyond the Chinchilla scaling laws. The Chinchilla scaling laws tell you what you should train if you want an optimal model for a certain compute cost at training time. But if you take into account the fact that your model should also be efficient at inference time, you probably want to go far beyond the Chinchilla scaling laws. That means you want to overtrain the model, training on more tokens than would be optimal for training-compute performance. The reason you do that is that you compress the model more, and then when you do inference, you end up with a model that is much more efficient for a given performance. By spending more compute during training, you spend less during inference, and so you save cost. I guess we had observed that at Google as well, but the Llama paper was the first to establish it in the open, and it opened a lot of opportunities.
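A hedged back-of-the-envelope of that overtraining trade-off. It uses the ~20-tokens-per-parameter compute-optimal ratio commonly attributed to Chinchilla and the publicly reported 1 trillion training tokens for the original 7B Llama; all numbers are illustrative, not from the conversation:

```python
# Back-of-the-envelope for overtraining, per the discussion above.
# Illustrative constants: ~20 tokens/param is the oft-quoted Chinchilla-optimal
# ratio; 1e12 tokens is the publicly reported figure for the first 7B Llama.
params = 7e9                                   # a 7B-parameter model
optimal_tokens = 20 * params                   # ~1.4e11 tokens at optimality
actual_tokens = 1e12                           # Llama-style overtraining

train_flops = 6 * params * actual_tokens       # ~4.2e22, vs ~5.9e21 if optimal
infer_flops_per_token = 2 * params             # inference cost tracks model size
print(f"trained {actual_tokens / optimal_tokens:.1f}x past compute-optimal")
# The payoff: the extra training buys a small model that stays cheap to serve;
# a larger compute-optimal model of equal quality would cost more per token
# at inference for the whole life of the deployment.
```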
Yep. I remember both the impact of the Chinchilla scaling laws work on multiple labs, which realized just how suboptimal their compute setups were, and then the impact of Llama being dramatic on the industry, which realized how to be much more efficient about inference time. So I can imagine those were some of the top insights and top concerns on your mind when you left to start Mistral. Let's fast-forward to today; it's December 2023. We'll get to the role of open source in a bit, but let's level-set on what you've built so far. A couple of months ago you released Mistral 7B, which was a best-in-class model, and this week you're releasing a new mixture-of-experts model. So tell us a little bit more about Mixtral, I believe, is what you're calling it, and how it compares to other models.
Yeah, so Mixtral is our new model. It uses a technology called sparse mixture-of-experts, which hadn't been released in open source in a usable form before. It's quite simple: you take all of the dense layers of your Transformer and you duplicate them; you call these layers expert layers. Then, for each token in your sequence, you have a router mechanism, just a very simple network, that decides which expert should be looking at which token. So you send all of the tokens to their experts, you apply the experts, you get back the outputs and combine them, and then you go forward in the network. You have eight experts per layer and you execute only two of them. What that means at the end of the day is that you have a lot of parameters in your model, 46 billion parameters, but the number of parameters you actually execute is much lower, because you only execute two branches out of eight. So at the end of the day you only execute 12 billion parameters per token, and that is what counts for latency and throughput. You have a model with the cost profile of a 12-billion-parameter network, but with performance much higher than what you could get, even by compressing data a lot, out of a 12-billion-parameter dense Transformer. So sparse mixture-of-experts is a technology that lets you be much more efficient at inference time, and also much more efficient at training time, and that's the reason we were motivated to develop it very quickly.
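As a concrete illustration of the routing just described, here is a minimal sparse mixture-of-experts layer in PyTorch: eight experts, two active per token, exactly as in the description. This is a hedged sketch for intuition, not Mistral's actual implementation; all sizes and names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=8, top_k=2):
        super().__init__()
        # Each "expert" is a copy of the Transformer's dense feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # The router is "just a very simple network": a single linear layer.
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x)                  # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # pick 2 of the 8 experts
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen two
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e            # tokens routed to expert e
                if mask.any():                   # only chosen experts run, so only
                    w = weights[mask, k].unsqueeze(-1)  # ~2/8 of the FF parameters
                    out[mask] += w * expert(x[mask])    # are active per token
        return out
```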
Just for folks who are listening who might not be familiar with state-of-the-art architectures in language models, could you describe the difference between dense models, which have been the primary architecture to date, and mixture of experts? Intuitively, what are the biggest differences between these architectures?
They're very similar, except for what we call the dense network. In a dense Transformer, you alternate between an attention layer and a dense layer, generally; that's the idea. In a sparse mixture of experts, you take the dense layer and duplicate it several times, and that's where you actually increase the number of parameters. You increase the capacity of the model without increasing the cost, so it's a way of decoupling the memorization, what the model can remember, the capacity of the network, from its cost at inference time.
If you had to describe the biggest benefits for developers as a result of that inference efficiency?

It's cost and latency. Usually that's what you look at as a developer: you want something cheap and you want something fast. Generally speaking, the trade-off is strictly favorable to using Mixtral compared to a 12-billion-parameter dense model. The other way to think about it is that if you want a model as good as Llama 2 70B, you should be using Mixtral, because Mixtral is actually on par with Llama 2 70B while being approximately six times cheaper, or six times faster for the same price.
Could you talk a little bit about why it's been so challenging for research labs and research teams to really get the mixture-of-experts model right? It sounds like, for a while now, folks have known that the dense model architectures all of us have been using in the most well-known products are slow, expensive, and difficult to scale, so for a while people have been looking for an alternative architecture that could be, like you were saying, cheaper, faster, more efficient. What were some of the biggest challenges you had to figure out to get the MoE model right?
Well, I guess I won't disclose all the trade secrets, but there are basically two challenges. The first one is that you need to figure out how to train it correctly, from a mathematical perspective. The other challenge is to train it efficiently, that is, to actually use the hardware as efficiently as possible. You have new challenges coming from the fact that you have tokens flying around from one expert to another; that creates communication constraints, and you need to make it fast. And then, on top of that, you have new constraints that apply when you deploy the model: you need to do inference efficiently as well. That's also the reason we released an open source package based on vLLM, so that the community can take this code, modify it, and see how it works.
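For readers who want to try that, here is a hedged usage sketch of serving Mixtral through vLLM. The import and generate call follow vLLM's documented offline-inference API; the model identifier and sampling values are assumptions to check against the current Mistral and vLLM docs:

```python
# Hedged sketch: running Mixtral with vLLM's offline inference API.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mixtral-8x7B-v0.1")   # assumed Hugging Face repo id
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(
    ["Sparse mixture-of-experts models are efficient because"], params
)
print(outputs[0].outputs[0].text)
```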
Yeah, obviously we're excited to see what the community does with the Mixtral release you're putting out this week. Let's talk about open source, an approach and a philosophy that's permeated all the work you've been doing so far. Why choose to tackle this increasingly competitive space with an open source approach, which is quite different from the way everybody else is approaching it?
I guess it's a good question. The answer is that it's partly ideological and partly pragmatic. We have grown up with a field of AI that went from detecting cats and dogs in 2012 to generating human-like text in 2022, so it really made a lot of progress. And if you look at the reason we made all of this progress, most of it is explained by the free flow of information. You had academic labs, you had very big industry-backed labs, communicating all the time about their results and building on top of each other's results, and that's how we advanced the architectures and techniques so significantly; we just made everything work as a community. Then all of a sudden, in 2020, with GPT-3, that tide reversed, and companies started to be more opaque about what they were doing, because they realized there was actually a very big market. All of a sudden, from 2022 on, for the important aspects of AI and of LLMs, once we went to Chinchilla and beyond, there was basically no communication at all. That's something that I as a researcher, and Timothée and Guillaume and all of the people who joined us, deeply regretted, because we think we're definitely not at the end of the story. We need to invent new things; there's no reason to stop now, because the technology is effectively good but not yet working completely well enough. So we believe it's still the case that we should be communicating a lot about models, and we should be allowing the community to take the models and make them their own. That's the ideological reason we went in that direction. The other reason is that developers want to modify things, and having deep access to a very good model is a good way of engaging with this community and addressing their needs, so that the platform we're building will be used by them. So that's also a business reason. Obviously, as a business, we do need a valid monetization approach at some point, but we've seen many businesses build open-core approaches, with a very strong open source community and also a very good offering of services, and that's what we want to build.
That resonates. I remember a very detectable shift. You're right: the early days of deep learning were largely driven by a bunch of open collaboration between researchers from different labs, who would often publish all their work and share it at conferences. The Transformer paper, famously, was published and opened to the entire research community. But that has definitely changed.
Yes. So I think there are several levels of open sourcing in AI, and we offer the open weights and the inference code. That's the end product, and it is already super usable, so it's already a very big step forward compared to closed APIs, because you can modify it and you can look at what's happening under the hood, look at activations and so on. You get interpretability, and the possibility of modifying the model: adapting it to an editorial tone, to proprietary data, to specific instructions, which is something that is actually much harder to do if you only have access to a closed-source API. That also goes with our approach to the technology, which is to say that pre-trained models should be neutral, and we should empower our customers to take these models and put their editorial approaches in: their instructions, their constitution, if you want to talk like Anthropic, into the model. That's the way we approach the technology. We don't want to pour our own biases into the pre-trained model; on the other hand, we want to enable developers to control exactly how the model behaves, what kinds of biases it has and what kinds it doesn't have. So we really take this modular approach, and that goes very well with the fact that we release some very strong open-weight models.
Could you ground us in the reality of where these models are today, just to give people a sense of where in the timeline we are? Is open source really a viable competitor to proprietary, closed models, or is there a performance gap? What are the trade-offs or limitations people should be aware of with open source?
So Mixtral has similar performance to GPT-3.5; that's a good grounding. Internally, we have stronger models that sit between 3.5 and 4, basically the second- or third-best models in the world. So we really think the gap is closing; the gap is approximately six months at this point. And the reason it's six months is that things actually go faster if you do open source, because you get the community modifying the model and testing very good ideas that can then be consolidated by us, for instance, and we just go faster because of that. It has always been the case that open source, in the end, ends up going faster; that's the reason the entire internet runs on Linux. I don't see why it would be any different for AI. Obviously there are some constraints that are slightly different, because the infrastructure cost is quite high; training a model costs a lot of money. But I really think we'll converge to a setting where you have proprietary models and open source models that are just as good, and I think eventually the field will be much more open, because if you want to go beyond the biggest models of today, you do need to find new paradigms, and that means we also need to do research. We're very excited by this prospect, because we like competitive environments and research.
So let's talk about that a little more. How are you seeing people use and innovate on the open source models, and are there any use cases that diverge from proprietary, closed models at all?
I think we've seen several categories of usage. There are a few companies that know how to strongly fine-tune models to their needs: they took Mistral 7B, had a lot of human annotations, had a lot of proprietary data, and modified Mistral 7B so that it solves their task just as well as GPT-3.5, but at a lower cost and with a higher level of control. We've also seen very interesting community efforts to add capabilities to Mistral 7B. We saw a context-length extension to 128k that worked very well; again, it was done in the open, so the recipe was available, and this is something we were able to consolidate. We've seen image encoders added to make it a visual language model. A very actionable thing we saw: I think the Hugging Face folks were the first to do direct preference optimization (sketched below) on top of Mistral 7B, and they made a much stronger model than the instructed model we proposed at the initial release. It turned out to be a very good idea, and it's something we've consolidated as well. So generally speaking, the community is super eager to take the model and add new capabilities, put it on a laptop, put it on an iPhone. I saw Mistral 7B on an iPhone; I saw Mistral 7B on a stuffed parrot as well. Fun things, useful things. But generally speaking, it's been super exciting to see the research community take hold of our technology, and with Mixtral, which is a new architecture, I think we're also going to see much more interesting things, in the interpretability field and also in the safety field, because as it turns out there is a lot you can do when you have deep access to an open model. So we're really eager to help that and to engage with the community.
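Since direct preference optimization comes up above, here is a minimal sketch of the DPO objective from Rafailov et al. (2023), the technique the Hugging Face team applied on top of Mistral 7B. The helper name and the assumption that log-probabilities are summed per response are illustrative:

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Inputs: per-example summed log-probs, tensors of shape (batch,)."""
    # Policy-vs-reference log-ratios, chosen response minus rejected response.
    logits = (pi_chosen_logp - ref_chosen_logp) - (pi_rejected_logp - ref_rejected_logp)
    # Push the policy to prefer the chosen response over the rejected one,
    # relative to a frozen reference model; beta controls the strength.
    return -F.logsigmoid(beta * logits).mean()
```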
Safety, you know, is an important piece to talk about. The immediate reaction of a lot of folks is to deem open source less safe than closed models. How would you respond to that?
So I think, we believe, that it's actually not the case for the current generation of models. The models we are using today are not much more than a compression of whatever is available on the internet. They do make access to knowledge more fluid, but this is the story of humanity: making access to knowledge more free. It's no different from inventing the printing press, where apparently we had a similar debate; I wasn't there, but that was the debate. So we are not making the world any less safe by providing more interactive access to knowledge. That's the first thing. Now, the other thing is that you do have immediate risks of misuse of large language models, and you have them for open source models but also for closed models. The way you address these problems and come up with countermeasures is to know about them: you need to know about breaches, the same way you need to know about breaches in operating systems and in networks, and it's no different for AI. Putting models under the highest level of scrutiny is a way of learning how they can be misused, and it's a way of coming up with countermeasures. A good example is that it's actually super easy to exploit an API: it's super easy, especially if you have fine-tuning access, to make GPT-4 behave in a very bad way. Since that's the case, and it's always going to be the case, because it's super hard to be adversarially robust, it means that otherwise we're trusting only the teams of large companies to figure out ways of addressing these problems, whereas if you do open sourcing, you trust the community, and the community is much larger. If you look at the history of software, in cybersecurity, in operating systems, that's the way we made systems safe. So if we want to make current AI systems safe, and then move on to a next generation that will potentially be even stronger, and then we can have this discussion again, well, you do need to do open sourcing. So today we think that open sourcing is the safe way.
Yeah, I think this is not widely understood: when you have thousands or hundreds of thousands of people able to red-team models, because they're open source, the likelihood that you'll detect biases and built-in breaches and risks is just dramatically higher. If you were talking to policymakers, how would you advise them? How do you think they should be thinking about regulating open source models, given that the safest way to battle-harden software and tools is often to put them out in the open?
Well, we've been saying precisely this: that the current technology is not dangerous. On the other hand, the fact that we are effectively making models stronger means that we need to monitor what's happening, to empirically monitor performance, and the best way of empirically monitoring software performance is through open source. That's what we've been saying. There have been some efforts to come up with very complex governance structures, where you would have several companies talking together, with some safe space, some sandbox, for red-teamers who would potentially be independent, things that are super complex. But as it turns out, if you look at the history of software, the only way we have built software collaboratively is through open source. So why change the recipe today, when the technology we're looking at is nothing other than a compression of the internet? That's what we've been saying to the regulators. Another thing we've added, for the regulators, is that if they want to enforce that AI products need to be safe, if you want a diagnosis assistant, you want it to be safe, right? Well, in order to monitor and evaluate whether it's actually safe, you need very good tooling, and the tooling requires access to LLMs. If you can only access closed-source LLMs through APIs, you're a bit in troubled waters, because it's hard to be independent in that setting. So we think that independent controllers of product safety should have access to very strong open source models and should own the technology.
And if open source LLMs were to fail relative to closed-source models, what would the reason be?

Well, I guess the regulation burden is potentially one thing that could make it harder to release open source models. Also, generally speaking, it's a very competitive market, and I think in order for open source models to be widely adopted, they need to be as strong as the closed-source models. They have a little advantage, because you have more control, so you can do heavier fine-tuning and make performance jump a lot on a specific task, because you have deep access. But really, at the end of the day, developers look at performance and latency, and that's why we think that, as a company, we need to be very much on the frontier if we want to be relevant.
Given the complexity of frontier models and foundation models and these systems, there are just tons of misconceptions that folks have about these models. So if you step back and look at the battle that's raging between folks pushing for closed-source systems versus open source systems, what do you think is at stake here? What do you think the battle is for?
Well, I think the battle is for the neutrality of the technology. A technology, by essence, is something neutral: you can use it for bad purposes, you can use it for good purposes. If you look at what an LLM does, it's not really different from a programming language; it's actually used very much like a programming language by application makers. There's a strong confusion between what we call a model and what we call an application. A model is really the programming language of an application. If you talk to all of the startups doing amazing products with generative AI, they're using LLMs just as a function (sketched below), and on top of that they have very big systems with filters, with decision-making, with control flow, all of these things. What you want to regulate, if you want to regulate something, is the system; the system is the product. For instance, a healthcare diagnosis assistant is an application. You want it to be unbiased; you want it to make good decisions, even under high pressure, so you want its statistical accuracy to be very high, and you want to measure that. And it doesn't matter whether it uses a large language model under the hood: what you want to regulate is the application. The issue we had, and the issue we're still having, is that we hear a lot of people saying we should regulate the tech, so we should regulate the function, the mathematics. But you never use a large language model by itself; you only ever use it in an application, with a user interface, and so that's the thing you want to regulate. What that means is that companies like us, foundation-model companies, will obviously make the model as controllable as possible, so that the applications on top of it can be compliant and safe. We'll also build the tools that allow people to measure the compliance and safety of the application, because that's super useful for application makers; it's needed. But there's no point in regulating something that is neutral in itself, that is just a mathematical tool. I think that's the one thing we've been hammering on a lot, and I think we've been heard in places, which is good, but there's still a lot of effort to put into making this strong distinction, which is super important to understand what's going on.
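To make the model-versus-application distinction concrete, here is a hedged sketch of the shape of such a system: the LLM is called like a function, while the application around it supplies the filters, decision-making, and control flow that product-safety rules would actually target. Every name and check here is illustrative:

```python
# Hedged sketch: the application (the regulatable product) wraps the model
# (a neutral function), per the distinction drawn above.

def call_llm(prompt: str) -> str:
    """Stand-in for any model call; the 'programming language' layer."""
    raise NotImplementedError("wire up a real model client here")

BLOCKED_TERMS = ("example-blocked-term",)      # illustrative input filter

def diagnosis_assistant(symptoms: str) -> str:
    # Input filtering: an application-level policy, not a model property.
    if any(term in symptoms.lower() for term in BLOCKED_TERMS):
        return "Request refused by application policy."
    draft = call_llm(f"List plausible causes and next steps for: {symptoms}")
    # Decision-making / control flow: route low-confidence answers to a human.
    if "uncertain" in draft.lower():
        return "Low confidence; please consult a clinician."
    return draft + "\n(Informational only; not medical advice.)"
```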
So "regulate apps, not math" seems like the right direction, one that a lot of folks who understand the inner workings of these models, and how they're actually implemented in reality, are pushing for. What do you think is the best way to clear up this misconception for folks who maybe don't have technical backgrounds and don't actually understand how foundation models work and how the scaling laws work?
I've been using a lot of metaphors to make it understood: large language models are like programming languages. You don't regulate programming languages; you regulate malware, you ban malware. We've also been actively vocal about the fact that pre-market conditions like FLOPs, the number of FLOPs you spend to create a model, are definitely not the right way of measuring the performance of a model. We're very much in favor of having very strong evaluations. As I've said, this is something we want to provide to our customers: the ability to evaluate our models in their applications. So that's a very strong thing we've been stressing: we want to provide the tools for application makers to be compliant. We find it a bit unfortunate that we haven't been heard everywhere, and that there's still a big focus on the tech, probably because things are not completely well understood; it's a very complex field, and it's also a very fast-moving one. But eventually, I'm very optimistic that we'll find a way to continue innovating while having safe products, and also a high level of competition at the foundation-model layer.
Well, let's channel your optimism a little bit. There are very few people who have the ground-level understanding of scaling laws that you, Guillaume, and Timothée and your team have. When you step back and look at the entire space of language modeling, in addition to open source, what are the key differentiators you see in the next wave of cutting-edge models, things like self-play, process reward models, the uses of synthetic data? If you had to conjecture, what do you think some of the most exciting or important breakthroughs in the field will be going forward?
So I guess it's good to start with a diagnosis: what is not working that well? Reasoning is not working that well, and it's super inefficient to train a model; if you compare the training process of a large language model to the brain, you have a factor of, I think, 100,000. So there's real progress to be made in terms of data efficiency. I think the frontier is increasing data efficiency and increasing reasoning capabilities, so adaptive compute is one direction. And to increase data efficiency, you need to work on producing very high-quality data, on filtering, on many new techniques that still need to be invented, but that's really where the lock is. Data is the one important thing, and the ability of the model to decide how much compute it wants to allocate to a certain problem is definitely on the frontier as well. These are things we're actively looking at.
You know, this is a raging debate, and we've talked about it a few times before: can models actually reason today? Do they actually generalize out of distribution? What's your take on it, and what would convince you that models are actually capable of multi-step, complex reasoning?
Yeah, it's very hard, because you train on the entirety of human knowledge, and so you have a lot of instances of reasoning in there. It's hard to say whether models reason, or whether they do retrieval of reasoning that merely looks like reasoning. I guess at the end of the day, what matters is whether it works or not, and on many simple reasoning tasks it does, so we can call it reasoning. It doesn't really matter whether they reason like we do; we don't even know how we reason, so we're not going to know how machines reason anytime soon. So yes, it's a raging debate. The way you evaluate it is to try to be as far out of distribution as possible. Working on mathematics is not something I've ever done, but it's something Timothée and Guillaume are very sensitive to, because they did it for a while when they were at Meta; that's probably one way of measuring whether you have a very good model or not. And actually, we're starting to see some very good mathematicians, I'm thinking of Terence Tao, using large language models for some things: obviously not the high-level reasoning, but for some parts of their proofs. So I think we will move up in abstraction, and the question is where that stops. We do need to find new paradigms to go one step further, and we will be actively looking for them.
We've talked a lot about developers so far. If you had to channel your product view and conjecture on what these advances in scaling laws, in representation learning, in teaching the models to reason faster, better, cheaper will mean: what will these advances mean for end users, in terms of how they consume, how they program, and how they generally work with models?
What we think is that, fast-forward five years, everybody will be using their own specialized models as parts of complex applications and systems. Developers will be very focused on latency: for any specific task in the system, they will want the lowest cost and the lowest latency. The way you make that happen is that you look at the task, at user preferences, at what you want the model to do, and you try to make the model as small as possible and as well suited to the task as possible. So I think that's how things will evolve on the developer side. I also think that, generally speaking, the fact that we have access to large language models is going to completely reform the way we interact with machines; the internet of five years from now is going to be much different, much more interactive. I think this is already unlocked; it's just about making very good applications with very fast systems and very fast models. So yeah, very exciting times ahead.
So what would those interaction modalities look like?

That's very interesting, and I think in games, for instance, it's going to be fascinating. We've seen some very good applications. You do need small models, because you want to have swarms of them, and it starts to get a bit costly if the model is too big, but having them interact is going to make for pretty complex systems, and interesting systems to observe and to use. We have a few friends making applications in the enterprise space with different personas playing different roles, relying on the same language model but with different prompts and different functioning, and I think that's going to be quite interesting to look at as well. As I've said, complex applications in three years' time are just going to use different LLMs for different parts, and that's going to be quite exciting.
Well, what's your call to action for builders, researchers, folks who are excited about the space? What would you ask them to do?

I would take the Mistral models and try to build amazing applications, the way many developers already have. It's not that hard; the stack is starting to be pretty clear and pretty efficient. You only need a couple of GPUs; you can even do it on your MacBook Pro if you want. It's going to run a bit hot, but it's good enough to build applications. Really, the way we do software today is very different from the way we did it last year, so I'm really calling application makers to action, because we are going to try to enable them to build as fast as possible.
Thank you so much for listening to the a16z podcast. What we're trying to do here is provide an informed, clear-eyed, but also optimistic view of technology and its future, and we're trying to do that by featuring some of the most inspiring people and the things they're building. So if you believe in that and you'd like to join us on this journey, make sure to click subscribe, and also let us know in the comments below what you'd like to see us cover next. Thank you so much for listening, and we