Finding the compute capacity to run their applications is actually a real challenge. What really is stopping companies from going and 10x-ing their production? The crazy exponential growth of AI at the moment. How do I get access to the compute that I need? I think my number one advice would be to shop around. It might be that for a certain process you don't want to use this capacity, but for another one that you do want to use, they don't have the capacity. How much should founders know about hardware? There's probably a certain scale where it makes sense for you. It's a whole new ecosystem forming, and I think there's a ton of opportunities to build great companies.
With software becoming more important than ever, hardware is following suit. And with the world constantly generating more data, unlocking the full potential of AI means a constant need for faster and more resilient hardware. That is exactly why we've created this mini-series on AI hardware. In part one, we took you through the emerging architecture powering LLMs, from GPUs to TPUs, including how they work, who's creating them, and also whether we can expect Moore's law to continue. But part two is for the founders trying to build AI companies, and here we dive into the delta between supply and demand, why we can't just print our way out of a shortage, how founders can get access to inventory, whether they should think about renting or owning, where moats can be found, and even where open source comes into play. You should also look out for part three, coming very soon, where we break down exactly how much all of this costs, from training to inference.

Today we're joined again by a16z special advisor Guido Appenzeller, someone who is uniquely suited for this deep dive as a storied infrastructure expert, with experience like CTO of Intel's Data Center and AI Group.
Dealing a lot with hardware, the low-level components, has given me, I think, a good insight into how large data centers work and what the basic components are that make all of this AI boom possible today.

Despite working with infrastructure for quite some time, here's Guido commenting on how the momentum of the recent AI wave is shifting supply and demand dynamics.

The biggest thing driving that is just the crazy exponential growth of AI. AI has been booming since the middle of last year. I think nobody expected how quickly it would move, and that has just created demand which, at the moment...
As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures.
In a recent article, Guido even stated that some reputable sources indicate that demand for AI hardware outstrips supply by a factor of 10. Here's him commenting on how that dynamic is impacting competition.

We currently don't have as many AI chips or servers as we'd like to have, so for some of our portfolio companies, finding the compute capacity that they need to run their applications is actually a real challenge. There's a whole value chain behind that. It's a combination of many things: we have some bottlenecks on the chip manufacturing side, we have some bottlenecks on building the actual cards, and these development cycles take some time.

Maybe this is a silly question, but what really is stopping companies like Intel, like Nvidia, from going in and 10x-ing their production? Is that on the roadmap, where we're just going to see a lot more chips and we won't see this discrepancy between supply and demand, or is there something more complex at play?
It's a bit more complex, because if you want to make a chip, the way you do it is you make it in a foundry, which are extremely large, extremely complex facilities. Intel makes chips in their own foundries, but most companies manufacture with Taiwan Semiconductor, TSMC, and they are capacity constrained. You often have to reserve capacity long in advance, and there are different processes, so it might be that for a certain process you don't want to use this capacity, but for another one that you do want to use, they don't have the capacity. You could just say, well, in that case let's just build more fabs, but building a fab takes you a couple of years and probably a couple of billion, or ten billion, of investment. So you're looking at some very large investment projects that take some time to adjust, and that's what prevents us from reacting more quickly at the moment.

And while some countries are indeed making major multi-billion-dollar investments in new semiconductor production plants, aka fabs, these will take time to scale, and there are also no promises, given that expertise is concentrated in a few companies. So with demand not subsiding, what does this mean for who gets access to the supply available?

I mean, it doesn't sound like the demand is going to subside, especially because we see this really what seems like intrinsic relationship between the power of these models and the compute that's thrown at them. So how does a company, let's say if I'm a founder today, how do I get access to the compute that I need, and who decides? Is it just who's willing to pay the most, or how is that supply allocated?
Yeah, there's some of that. At the moment, capacity is expensive wherever you go. I tried to run some personal experiments and tried to reserve an instance with one of the cloud service providers a few days ago, and they just didn't have any; it's like, no, not available. What we're seeing is that often, in order to get access to the newer cards or newer chips at scale, you have to pre-reserve capacity. So often these are negotiations between a company and a large cloud, where you say, okay, I need this many chips for this amount of time. What they'll often ask for is a certain time commitment, so it'll be like, okay, we can give you this many chips, but we want you to sign, basically, that you take them exclusively for two years, for that amount. But we've also seen, I think OpenAI was in the news with that, investment deals where, for example, a cloud provider comes in and invests in a company, and as a result the company gets capacity. So we're seeing all kinds of deals being struck. As with any scarce resource, there's a lot of deal-making going on.

And it's not just a matter of getting access to compute; it's about ensuring you get access to the compute tailored to your needs, and cost is not the only factor.

What would you say in terms of the considerations that they should be keeping in mind? Really, how much should founders know about hardware, and again, about selecting which hardware to use?
Fantastic question. I mean, I think the first question honestly I would ask is: do you really need to consume the hardware directly, or do you really just want to consume something that runs on top of the hardware? Let's take an example: if I want to generate images with Stable Diffusion, say for my mobile phone app or something like that, it might be easier to go to a SaaS company like Replicate, for example, that essentially will host the model for you, where you just pay for access to the model and they send you the generated images. They will manage all the provisioning of the compute infrastructure and find the GPUs for you.
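To make that concrete, here's a minimal sketch of the hosted-model approach, assuming the replicate Python client with an API token already configured; the model identifier and prompt are purely illustrative.

```python
import replicate  # assumes REPLICATE_API_TOKEN is set in the environment

# Generate an image through a hosted model: Replicate finds the GPUs and
# runs inference; you just pay per call, with no infrastructure to manage.
output = replicate.run(
    "stability-ai/stable-diffusion",  # illustrative model identifier
    input={"prompt": "an astronaut riding a horse, watercolor"},
)
print(output)  # typically URL(s) to the generated image(s)
```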
If you do want to run your own model, I think my number one advice would be to shop around. There's a fair number of providers, and the large clouds, in my experience, are not always the best option if you price it out. We've seen that startups typically are more likely to go with specialized clouds, like CoreWeave or Lambda, that obviously specialize in providing AI infrastructure to startups. So shop around and look at the different offers.
Yeah, and when you're shopping around, in addition to price, which I feel like is a major motivating factor, what other factors are there? In terms of these other companies who maybe aren't the big clouds, how are they differentiating relative to one another? How are they standing out in that market?
Yeah, there's a whole sort of decision tree there. The first thing, one thing that often drives the decision, is how much memory do I need in my cards? If I have a small image model, I might be able to work with a more consumer-grade card, which is much cheaper per hour if I reserve it in a cloud. Whereas if I'm, for example, training a large language model, I not only need a card with the most memory I can find, but I probably want as many cards as possible in one server, because communication between them matters, and I may even care about the networking fabric behind it; some of the very large models are actually network constrained in terms of how quickly you can train them. So it really becomes a question of: what's your objective? Is it inference, is it training? If it's training, how big is your model? And based on that, you figure out what the card is, what kind of server you need, what kind of fabric you need between those servers, and then you can decide what the right fit is for your application.
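As a rough illustration of that first branch of the decision tree, here's a back-of-the-envelope memory estimate. The 2-bytes-per-parameter figure assumes fp16/bf16 weights, and the 8x training multiplier is a common rule of thumb for gradients plus optimizer state, not a precise number.

```python
def estimate_gpu_memory_gb(params_billion: float,
                           bytes_per_param: int = 2,
                           training: bool = False) -> float:
    """Memory to hold a model: fp16/bf16 weights are 2 bytes/parameter;
    training adds gradients and optimizer state, crudely ~8x the weights."""
    weights_gb = params_billion * bytes_per_param
    return weights_gb * 8 if training else weights_gb

# A small ~1B image model fits easily on a consumer card...
print(estimate_gpu_memory_gb(1))                  # ~2 GB
# ...while a 70B LLM needs ~140 GB just for weights at inference, and far
# more for training -- hence multiple cards per server and a fast fabric.
print(estimate_gpu_memory_gb(70))                 # ~140 GB
print(estimate_gpu_memory_gb(70, training=True))  # ~1,120 GB
```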
Even prior to this AI wave, compute was a major line item for many software companies, and the calculus of leaning on the easily accessible cloud versus bringing infrastructure in-house was becoming an increasingly important consideration. Here is Guido touching further on that very calculus in today's era, and where scale comes into play.

Compute is expensive; it's a major line item for many companies, and this was even before today. So how do you think about how that impacts different companies' bottom lines, and whether they really factor that in to having their own allocated GPUs versus using something more like Replicate?
You really have to ask what is the right fit for you, and it probably depends a lot on the scale at which you need them. If you need a lot, frankly, you have to pre-reserve them; you have to have your own, there's just no way around that. If you need a smaller quantity, you may be able to reserve them on a more short-term basis, or there are various models where you consume only while your application runs, but at a higher price. And so this really comes down to: what kind of load do you have? What we're seeing is, if somebody is training, they're more likely to do a long-term reservation for a GPU, because you want to make sure you have access to it. If somebody has a more continuous workload where availability is important, like, I just do inference, but I want to make 100% sure that if a request comes in I can service it, I can never be down, they'd probably choose reserved capacity as well. On the other hand, if I have more batch jobs, where if this job runs an hour later that's not the end of the world, then you probably can go with variable capacity and just reserve it ad hoc. But it's really a conversation of: what is the usage pattern, what is your demand pattern, and from that comes the best pick for you and the partners that you work with.
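Here's a toy version of that reservation math; the hourly rates are hypothetical placeholders, not quotes from any provider, and the crossover point depends entirely on your real prices and utilization.

```python
# Hypothetical hourly rates for illustration -- real prices vary widely.
ON_DEMAND_PER_GPU_HR = 4.00   # variable capacity: pay only while jobs run
RESERVED_PER_GPU_HR = 2.20    # long-term commitment: billed around the clock

def monthly_cost(gpus: int, busy_hours: float) -> dict:
    hours_in_month = 730
    return {
        "on_demand": gpus * busy_hours * ON_DEMAND_PER_GPU_HR,
        "reserved": gpus * hours_in_month * RESERVED_PER_GPU_HR,
    }

# Bursty batch jobs (8 GPUs, ~100 busy hours/month) favor on-demand;
# an always-on inference service (~730 hours) favors a reservation.
print(monthly_cost(8, 100))  # {'on_demand': 3200.0, 'reserved': 12848.0}
print(monthly_cost(8, 730))  # {'on_demand': 23360.0, 'reserved': 12848.0}
```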
We've seen that companies, even prior to AI, have benefited from building their own infrastructure, basically bringing that in-house, because before that they were renting, and they were paying a lot to rent that compute. Do you think that will be a differentiator for companies moving forward, or how should founders be thinking about that relationship between owning the infrastructure and renting it?
Owning the infrastructure comes with cost as well, because you now need to hire people that run it, you need to get money for the capex, and so on. So my guess is that most early-stage founders, and probably even most mid-stage and late-stage founders, are better off renting capacity, renting in a cloud, or consuming a SaaS service. There's a couple of exceptions. If you have really, really specialized needs, you may just not find anybody who has exactly the kind of hardware that you need. There might be some cases where you have geopolitical concerns, or your data is just too sensitive and you need to run it yourself. And there's probably a certain scale at which it makes sense for you to run your own data center, but it's a pretty large scale. If you're spending 10 million dollars a year, you're probably still below critical mass; if you're spending 100 million dollars a year on infrastructure, that may be a reason to look into options for your own data center.

But if everyone is competing for the same compute, are there other ways to stand out? Where's the moat here? Or, you could say a moat is getting access to different training data, but that actually doesn't necessarily have to do with compute or money being thrown at the problem; it's getting access to differentiated data.
If you have access to differentiated data, that could be a moat. I mean, it's a bit more subtle, because look, if you had an area where there's just not much public training data, that's probably right, and there might be areas like finance where that's the case. But for a large language model, for example, it turns out that just making a larger model and training on more data has more benefits than just absorbing more knowledge: it also means that it's better at reasoning, at understanding abstract context, and at answering really complex multi-stage questions, and so on. So, if I have to guess, I think the future will be that we'll still train on all the data we can find, and then maybe you fine-tune, meaning you yourself do some additional training on a particular problem domain with your private data, if that makes sense. And so you first go to elementary school to learn reading and writing, and then you go to your vocational training for the specialized job that you have to do in the future.
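As a sketch of that "vocational training" step, here's what fine-tuning a base model on private text could look like with the Hugging Face transformers library. A minimal sketch only: the model name and data file are placeholders, and a real run would add things like parameter-efficient methods (LoRA) and tuned hyperparameters.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# "Elementary school" already happened: start from a pretrained base model.
base = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# "Vocational training": your private, domain-specific corpus.
data = load_dataset("json", data_files="private_domain.jsonl")["train"]
tokenized = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=data.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False => causal LM: labels are the inputs, shifted for next-token loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```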
Another important question worth addressing is who can realistically compete. If compute is expensive, will the largest, most heavily capitalized companies win, since they can build the largest models with the most data? And what role does open source play? As one of many emerging examples, Vicuna was created by fine-tuning Meta's Llama 1. The fine-tuning added only an additional $300 of cost, but the result is competitive with much larger models like ChatGPT or Bard. So what might this example, and a growing number of open source projects, tell us about the future of open LLMs?
So first of all, in general, larger models, everything else being equal, perform better. The really small open source models that we're seeing out there today are not yet at the level of a GPT-3.5 or GPT-4, and there's actually a website that runs sort of regular bake-offs, where they basically ask users to compare answers, and it seems to be pretty clear that the large ones are still a little bit ahead. That said, we're making big advances there; we're figuring out a couple of things. One thing we've learned is there's something called the Chinchilla scaling laws, which basically give us an idea of how data corresponds to model size. And if we "overtrain", so train less compute-efficiently than we could, we can actually get a potentially smaller and better model: you can match the performance of a large model with a smaller model if you train it on more data. So that's interesting; that reduces model sizes, and the trend at the moment is to make slightly smaller models and train them more to get equal performance.
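The Chinchilla rule of thumb is easy to state in code: roughly 20 training tokens per parameter is compute-optimal, with training compute commonly approximated as 6 FLOPs per parameter per token. A quick sketch; the 7B-on-2T-tokens "overtraining" example is ours, for illustration only.

```python
def chinchilla_optimal_tokens(params: float) -> float:
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return 20 * params

def training_flops(params: float, tokens: float) -> float:
    """Common approximation: ~6 FLOPs per parameter per training token."""
    return 6 * params * tokens

n = 70e9                          # a 70B-parameter model
d = chinchilla_optimal_tokens(n)  # ~1.4e12 tokens (1.4 trillion)
print(f"{d:.1e} tokens, {training_flops(n, d):.1e} FLOPs")

# "Overtraining" a smaller model: feed a 7B model far more than its
# Chinchilla-optimal ~140B tokens and it can rival larger, less-trained models.
print(f"{training_flops(7e9, 2e12):.1e} FLOPs for 7B params on 2T tokens")
```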
The second thing is that when we talk about models, there are models for slightly different purposes. You have the base large language models, and all they're trained to do, practically speaking, is complete text. Literally how you train them is you give them text and say, guess the next letter, and then you tell them, nope, that was wrong, or yes, that was right, and it backpropagates to update how they predict. And they're really good at that, completing text. But that's not quite the same as what you want from a chatbot, or from a model that you can tell to do something. So there's usually another step afterwards, which is called fine-tuning for instruction following, or for chat specifically, where basically I tell the model, look, if somebody asks you to come up with a list of to-dos, or a list of steps for how to make pizza, this is roughly what I expect you to answer. These models are very good at learning these things. So basically you first train them to just complete text, and then you train them how to react to human requests and instructions; that's the instruction fine-tuning.
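That base training objective is compact enough to show directly. A minimal PyTorch sketch, with a toy stand-in model rather than a real transformer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = 32000
# Toy stand-in for a transformer: an embedding plus a linear head is
# enough to demonstrate the objective (a real LLM replaces this).
model = nn.Sequential(nn.Embedding(vocab, 64), nn.Linear(64, vocab))

token_ids = torch.randint(0, vocab, (4, 128))  # a batch of tokenized text
logits = model(token_ids)                      # (batch, seq_len, vocab)

# "Guess the next token": the prediction at position t is scored against
# the actual token at position t+1, and the error is backpropagated.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),
    token_ids[:, 1:].reshape(-1),
)
loss.backward()
print(f"next-token loss: {loss.item():.3f}")
```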
Llama, for example, was a Facebook model where they published the weights for researchers, and then some people took that and fine-tuned it, meaning they took a bunch of instruction-following examples to turn it into a much, much nicer model in terms of interacting with it, much more useful for humans. And so the biggest challenge we have at the moment on the open source side is that there's currently no large open source LLM out there. GPT-3 was 175 billion parameters, and there's currently nothing in that weight class that's open source and that people could use to fine-tune, or to play with, or modify.
It's worth noting that since this recording, several more open source models have been released, including Llama 2, with 70 billion parameters and an open license, unlike its predecessor Llama 1. Another 40-billion-parameter open source model, Falcon, was released as well. But both of these are still dwarfed in parameters compared to closed models like OpenAI's GPT-3, at 175 billion parameters, or GPT-4, at an estimated 1.8 trillion parameters, although the latter is speculated to be a collection of multiple smaller models. However, parameter count isn't the only driver of performance. For example, while Llama 2 has fewer parameters than GPT-3, its performance is actually much better due to being trained on more data. In fact, Llama 2 is currently comparable to GPT-3's successor, GPT-3.5, the current default for ChatGPT.
And as these models continue to get larger, we may actually see some models compress, becoming more efficient and enabling inference on your device.

Stable Diffusion can run on your computer's GPU. Do we expect to see more of that? Because right now they're all hosted by these companies: they're trained by these companies on their dedicated servers, and then even when you interface with ChatGPT, it's running that inference for you. Do we expect to see that change at all as compute becomes cheaper, maybe more decentralized, or how would you think about that?

That's a really good question, and we're speculating a little bit here, but my guess is we will. We're seeing some of these smaller models getting pretty good; they run on your laptop or even your phone. There are essentially Stable Diffusion implementations that run well on phones, which I would have never thought. They take a couple of tens of seconds to create an image, which is comparatively slow, but there are certain applications where that's acceptable. So my guess is, as both the devices get faster and the models get more optimized, this will be a trend that we see more and more, and in the future it might just be part of the operating system to have a basic large language model or a basic image generation model.
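For a sense of what "runs on your computer's GPU" looks like in practice, here's a minimal local-inference sketch using the Hugging Face diffusers library; the checkpoint id is one common public choice, and the device string depends on your machine.

```python
import torch
from diffusers import StableDiffusionPipeline

# Download the weights once, then inference runs entirely on local hardware.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # a common public checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # or "mps" on Apple Silicon

image = pipe("a watercolor of a data center at sunset").images[0]
image.save("local_inference.png")  # no cloud bill for this image
```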
We've talked about how expensive compute can be and how, ultimately, that can be a major line item for companies. And I guess model training will probably remain with those companies and not necessarily on folks' devices, but in terms of the inference, I assume that's still a pretty significant cost. And in a way, if someone is able to run that locally, doesn't that free the company from having to pay for that compute, because it's running on, let's say, someone's MacBook GPU?

Oh yeah, totally. I mean, look, if I can generate an image on my phone directly, all it takes is some battery power and it gets a little warm, and that's it, so that's a huge advantage. At the same time, there's probably going to be a little bit of bifurcation there around quality and parameters. You can run things locally, but you can probably run them a lot better in the cloud, because you have a much bigger server there. So it probably depends on what you want to do. If I just want a better spell checker that checks my email, or maybe just some simple completion, that's perfectly fine; I can run that on my phone. On the other hand, if I want something that can write a good speech or summarize a complex text, then it might be that I'm going to run that in the cloud, because it takes so many more operations. Hopefully this is getting...
Here is Guido speaking to how this presents a fundamentally new stack, and what that means in terms of opportunity.

It feels like this really is this massive wave, this renaissance of innovation.

It's full of opportunities. I mean, we're rebuilding a stack. You can look at AI just as a new application, but honestly, I think it's probably better to look at it as a different type of compute. We traditionally built software by composing algorithms that we understand, where the end result was a program, so it was constructed bottom-up. Now we have a second type of compute, where we're just training a large neural network, and the big advantage is we don't actually need to know how to solve a problem: as long as the neural network can figure it out, we're fine. And that opens up a bunch of possibilities, but it also means you need a completely different stack in terms of all the different pieces. You probably want vector DBs to retrieve context, you want different types of hosting providers that are good at hosting these models and providing them to you as a service. It's a whole Cambrian explosion of creativity, a whole new ecosystem forming, and I think there's a ton of opportunities to build great companies.
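On the vector DB piece of that new stack: the core operation is similarity search over embeddings. A toy sketch of the idea, with random vectors standing in for a real embedding model (a production system would use a dedicated vector database and real embeddings):

```python
import numpy as np

# Documents a model might need as context, embedded as vectors. Random
# vectors stand in for a real embedding model here.
docs = ["GPUs excel at parallel math",
        "Fabs take years and billions to build",
        "Fine-tuning adapts a base model to a domain"]
emb = np.random.rand(len(docs), 384)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # normalize for cosine

def retrieve(query_vec: np.ndarray, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query embedding."""
    scores = emb @ (query_vec / np.linalg.norm(query_vec))
    return [docs[i] for i in np.argsort(-scores)[:k]]

print(retrieve(np.random.rand(384)))  # a real app embeds the user's query
```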
I think that paints a pretty incredible picture of opportunity across the stack. And as many of these trends continue to progress, like supply and demand, the calculus of renting versus owning compute, and closed versus open source models, we look to part three of the series to answer a very important question: how much does all of this cost? We'll explore all of this in depth, including how much startups are really spending on AI compute and whether that's sustainable, how much it really costs to train a model like GPT-3, the difference in cost between training and inference, and how all of this will change with time. We will see you there.
Thank you so much for listening to the a16z podcast. What we're trying to do here is provide an informed, clear-eyed, but also optimistic take on technology and its future, and we're trying to do that by featuring some of the most inspiring people and the things that they're building. So if that is interesting to you and you'd like to join us on this journey, go ahead and click subscribe, and make sure to let us know in the comments below what you'd like to see us cover next. Thank you so much for listening, and we'll see you next time.