There are very few computational problems that complex that mankind has, you know, undertaken.

So, napkin math: 175 billion parameters, 350 billion floating-point operations, three times ten to the 23, and that's a completely crazy number.

Got it, got it.

The expectation at the moment is that the cost for training these models may actually sort of top out, or even go down a little bit.

How do you think about the relationship between compute, capital, and then the technology that we have today?

Yeah, that's the million-dollar question, or maybe the trillion-dollar question, I don't know.
With software becoming more important than ever, hardware is following suit. And with the world constantly generating more data, unlocking the full potential of AI means a constant need for faster and more resilient hardware. But how much does all of this really cost? In this final segment of our AI Hardware series, we tackle that question head on. If you're just catching up, be sure to check out part one and part two, where we explored the emerging architectures and the momentous competition for AI hardware.

Today we're joined again by a16z special advisor Guido Appenzeller, someone who is uniquely suited for this deep dive as a storied infrastructure expert and CTO of Intel's Data Center Group.

Dealing a lot with hardware, the low-level components, has given me, I think, a good insight into how large data centers work, you know, what the basic components are that make all of this AI boom possible today.

Here is Guido touching on the reality of these models and how much they cost today.
Training one of these large language models today is not a hundred-thousand-dollar thing, it's probably a millions-of-dollars thing. Practically speaking, what we're seeing in industry is that it's actually more of a tens-of-millions-of-dollars thing.

As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com.
In Guido's recent article, Navigating the High Cost of AI Compute, Guido even noted that access to compute resources has become a determining factor for the success of AI companies. And this is not just true for the largest companies building the largest models. In fact, many companies are spending more than 80 percent of their total capital raised on compute resources.

We've specifically seen that with founders that want to train their own models, which is extremely expensive. You have to spend a large chunk of your funding just on compute capacity, and they often have very small, lean teams, so that's the plus side. Over time I would expect this to normalize a little as you move from the core technology that you're building in your early days towards a more complete product offering. There are just a lot more boxes to check, features to implement, and all the administrative parts of your application if you're getting to the enterprise. So you'll probably have more normal software development that's not AI, or classic software development, happening, and you'll probably also have a larger headcount of people that you have to pay. So at the end of the day, I would expect that as a percentage this will go down over time. As an absolute amount, I think it'll be going up for some time, just because this AI boom is still just in its infancy.
The AI boom has just begun, and in part two we discussed how it's unlikely for compute demand to subside anytime soon. There we also discussed how the decision to own or rent infrastructure can make a non-trivial difference to a company's bottom line. But there are other considerations when it comes to cost: batch size, learning rate, and the duration of the training process can all contribute to the final price tag.

How much does it cost to train a model? It depends on a myriad of factors, right?
Now, the good news is we can simplify this a little bit, because the vast majority of models being used today are transformer models. The transformer architecture was a huge breakthrough in AI. These models have proven to be incredibly versatile, and they're easier to train because they parallelize a little bit better than previous models. And in a transformer, you can approximate the inference cost as two floating-point operations per parameter per token, and the training cost as about six floating-point operations per parameter per training token. So if you take something like GPT-3, the OpenAI model, it has 175 billion parameters, so you need twice that, 350 billion floating-point operations, to do one inference.

And based on that, you can figure out how much compute capacity you need, how this is going to scale, how you should price it, how much it'll cost you at the end of the day. For model training, this also gives you an idea of how long the training is going to take. You know how much your AI accelerator can do in terms of floating-point operations per second, so you can theoretically calculate how many operations it takes to train your model. In practice the math is more complicated, because there are certain ways to accelerate it, so maybe you can train with reduced precision, but it's also very hard to achieve 100% utilization on these cards. If you're implementing naively, you're probably going to be below 10% utilization, but you can probably get into the tens of percent with a little bit of work. This gives you a rough swag at how much capacity you need for training and for inference, but at the end you probably do want to test it before you make any final decisions, and make sure that your assumptions hold on how much you need.
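To make that rule of thumb concrete, here is a minimal sketch of the napkin math in Python, assuming the standard transformer approximations of roughly 2 FLOPs per parameter per generated token and 6 FLOPs per parameter per training token. The 300-billion-token training figure is an assumption taken from the published GPT-3 paper, not a number stated in the conversation.

```python
# Napkin math for transformer compute, a rough sketch.
# Rules of thumb: ~2 FLOPs/param per generated token (inference),
#                 ~6 FLOPs/param per training token (training).

def inference_flops_per_token(params: float) -> float:
    """Approximate floating-point operations to generate one token."""
    return 2 * params

def training_flops(params: float, tokens: float) -> float:
    """Approximate floating-point operations for one full training run."""
    return 6 * params * tokens

params = 175e9   # GPT-3-scale model
tokens = 300e9   # assumed training-set size (from the GPT-3 paper)

print(f"inference: {inference_flops_per_token(params):.2e} FLOPs/token")  # ~3.5e11
print(f"training:  {training_flops(params, tokens):.2e} FLOPs total")     # ~3.2e23
```

Which lands right at the 350 billion FLOPs per token and the roughly 3 times 10 to the 23 total that come up in the worked example below.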
Now, if all those numbers confused you, that's okay. We'll walk through a very specific example: GPT-3. GPT-3 has about 175 billion parameters, and here's Guido on the computation requirements for training the model and, ultimately, inference, which is when you're prompting the already-trained model to elicit a response.

So if you just do the math very naively, let's start with training. We know how many tokens it was trained on, and we know how many parameters the model has, so we can do some napkin math, and you end up with something like 3 times 10 to the 23 floating-point operations. And that's a completely crazy number, a number with 23 digits. There are very few computational problems that complex that mankind has actually undertaken. It's a huge effort. From there you can say, okay, let's take an A100, the most commonly used card. We know how many floating-point operations it can do per second, so we can divide and get an order-of-magnitude estimate of how much time it would take. And we know how much one of these cards costs. Renting an A100 costs you between, I would say, one and four dollars an hour, depending on who you rent from. And you end up with something in the order of half a million dollars with this very, very naive analysis.
Now, there are a couple of things there. We didn't take into account optimizations. We also didn't take into account that you probably cannot run this at full capacity because of memory bandwidth and network limitations. And, last but not least, you probably need more than one run to get this right, and you probably need a bunch of test runs, which are probably not going to be full runs, and so on. But this gives you an idea: training one of these large language models today is not a hundred-thousand-dollar thing, it's probably a millions-of-dollars thing. Practically speaking, what we're seeing in industry is that it's actually more of a tens-of-millions-of-dollars thing, and that's because you need to reserve capacity. So if I could get all my cards for just the next two months, it would only cost me a million dollars, but the problem is they want a two-year reservation, so really the cost is twelve times as high, and that basically adds a zero to the cost.
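To see where the half-million-dollar figure comes from, here is a hedged sketch of that naive division. The 3 times 10 to the 23 FLOP total is the napkin math from above; the 312 TFLOPS peak FP16 throughput for an A100 is a published spec, while the $2-per-hour price (within the one-to-four-dollar rental range Guido mentions) and the utilization levels are illustrative assumptions.

```python
# Naive training-cost estimate, a sketch under stated assumptions.

TOTAL_FLOPS = 3e23            # GPT-3-scale training run (napkin math above)
A100_PEAK_FLOPS = 312e12      # peak FP16/BF16 tensor throughput, per second
PRICE_PER_GPU_HOUR = 2.0      # dollars; within the cited $1-$4 rental range

def training_cost_usd(utilization: float) -> float:
    """Rental cost of the A100-hours needed at a given utilization."""
    gpu_seconds = TOTAL_FLOPS / (A100_PEAK_FLOPS * utilization)
    return gpu_seconds / 3600 * PRICE_PER_GPU_HOUR

print(f"100% utilization: ${training_cost_usd(1.0):,.0f}")  # ~$534,000
print(f" 30% utilization: ${training_cost_usd(0.3):,.0f}")  # ~$1.8 million
print(f" 10% utilization: ${training_cost_usd(0.1):,.0f}")  # ~$5.3 million
```

Realistic utilization alone pushes the naive half million into the millions, and a two-year reservation instead of a two-month rental adds roughly another order of magnitude, which is how you arrive at a tens-of-millions-of-dollars thing.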
So inference is much, much cheaper. Basically, for a modern text model, for example, the training set is about a trillion tokens, and when I run inference, each word that comes out is roughly one token. So it's a factor of a trillion or so faster on the inference side. If you run the numbers for a large language model, you actually land at a fraction of a cent, like a tenth of a cent or a hundredth of a cent, somewhere in that ballpark, for the inference, again if we just look at this naively. For inference, your problem is usually that you have to provision for peak capacity. So if everybody wants to use your model at 9am on a Monday, you still have to pay for Saturday night at midnight when nobody's using it, and that increases your cost substantially.
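The same style of sketch works for the inference side, again hedged: the 2-FLOPs-per-parameter-per-token rule from above, the same assumed A100 throughput and $2-per-hour price, and illustrative utilization figures. Provisioning for peak capacity, as Guido notes, would multiply whatever number comes out.

```python
# Naive inference-cost estimate, a sketch under stated assumptions.

PARAMS = 175e9
FLOPS_PER_TOKEN = 2 * PARAMS   # ~3.5e11 FLOPs per generated token
A100_PEAK_FLOPS = 312e12       # per second, peak FP16 tensor throughput
PRICE_PER_GPU_HOUR = 2.0       # dollars, illustrative

def cost_per_1k_tokens_usd(utilization: float) -> float:
    """Rental cost of generating 1,000 tokens at a given utilization."""
    seconds = 1000 * FLOPS_PER_TOKEN / (A100_PEAK_FLOPS * utilization)
    return seconds / 3600 * PRICE_PER_GPU_HOUR

print(f"50% utilization: ${cost_per_1k_tokens_usd(0.5):.4f} per 1k tokens")  # ~$0.0012
print(f"10% utilization: ${cost_per_1k_tokens_usd(0.1):.4f} per 1k tokens")  # ~$0.0062
```

A tenth of a cent or so per thousand tokens, which is the fraction-of-a-cent ballpark from the conversation.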
There, for some models, specifically image models, what you can do for inference is use much, much cheaper cards, because the model is small enough that you can run it on essentially the server version of a consumer graphics card, and that can save a lot of cost.

And unfortunately, as we discussed in part one, you can't just make up for these inefficiencies by piecing together a bunch of less performant chips, at least for model training.

You need some super sophisticated software for that, because the overhead of distributing the data between these cards will probably outweigh any savings you get from cheaper cards. Inference, on the other hand, you can often do on a single card, so that's not really a problem. If you take something like Stable Diffusion, a very popular model for image generation, that runs on a MacBook, for example, one that has enough memory and enough compute power, so you can generate an image locally. So that will run on a relatively cheap consumer card, and you don't have to use an A100 to do inference.
So when we're talking about the training of the models, clearly the amount of compute is drastically more than for inference. And something else that we've already talked about is that more compute, often, not always, but often, means a better model. So do these factors all ladder up to the idea that heavily capitalized incumbents win this race, or this competition? Or how do you think about the relationship between compute, capital, and then the technology that we have today?

Yeah, that's the million-dollar question, or maybe the trillion-dollar question, I don't know.
So training these models is expensive. For example, we haven't yet seen a really good open-source large language model, and I'm sure part of the reason is that training these models is just really, really expensive. I mean, there's a bunch of enthusiasts out there who would love to do this, but you need to find, I think, a couple of million or ten million dollars of compute capacity to do it, and that makes it so much harder. It means you need a substantial effort before something like that can happen.

All that said, the cost for training these models overall seems to be topping out, and in part I think that is because it seems to me like we're becoming data-limited. It turns out there is a correspondence between how big your model is and what the optimal amount of training data is for the model. Having a super large model with very little data doesn't get you anything, and having a ton of data with a small model also doesn't get you anything. Roughly speaking, the size of your brain needs to correspond to the length of your education, otherwise it doesn't work. What this means is that some of the large models today already leverage a good percentage of all human knowledge in a particular area. I mean, GPT was probably trained on something like 10% of the internet, all of Wikipedia, and a good chunk of all books. So going up by a factor of 10, that's probably possible. Going up by a factor of a hundred, it's not clear that's possible. We as mankind just haven't produced enough knowledge yet that you could absorb all of it into one of these models.
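As a rough sketch of that correspondence between model size and training data: the scaling-law literature (the "Chinchilla" result of Hoffmann et al., 2022) suggests on the order of 20 training tokens per parameter. The conversation doesn't cite that ratio explicitly, so treat it as an assumed rule of thumb, but it illustrates why data, not compute, becomes the binding constraint.

```python
# Compute-optimal training-data sketch, assuming the ~20 tokens/param
# rule of thumb from the Chinchilla scaling-law work (an assumption,
# not a figure from the conversation).

TOKENS_PER_PARAM = 20

def compute_optimal_tokens(params: float) -> float:
    """Training tokens that roughly 'match' a model of this size."""
    return TOKENS_PER_PARAM * params

for params in (7e9, 70e9, 175e9):
    print(f"{params / 1e9:.0f}B params -> ~{compute_optimal_tokens(params) / 1e12:.2f}T tokens")
# 7B -> ~0.14T, 70B -> ~1.40T, 175B -> ~3.50T
```

Under that rule, a 175-billion-parameter model already "wants" a few trillion tokens of training text, which is a meaningful fraction of the high-quality text readily available. That is the data wall being described.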
And so I think the expectation at the moment is that the cost for training these models may actually sort of top out, or even go down a little bit, as the chips get faster but we don't discover new training material as quickly, unless somebody comes up with a new idea for how to generate training material. If that assumption is true, I think it means that the moat created by these large capital investments is actually not particularly deep. It's more of a speed bump than something that prevents new entrants. I mean, today, training a large language model is something that is definitely within reach for a well-funded startup, and for that reason we expect to see more innovation in that area in the future.
All right, that is a wrap for our AI Hardware series. We genuinely hope you came away with a little more knowledge about this increasingly important space, because if software is indeed eating the world, well, hardware is coming along for that ride. And as a reminder, if you haven't yet listened to part one, where we explore the emerging architectures and who's creating them, or part two, where we dive into the future AI stack and how founders can participate, those are already live and ready for consumption. And as always, thank you so much for listening. We'd actually like to leave you with a fun fact from GPT-4 itself, commenting on the technology that created it. And yes, we did fact-check this, and this is also AI-generated audio.
ChatGPT and its sibling models are trained on diverse internet text. However, the exact amount of data used can be hard to comprehend. If we were to print all of the data used to train these models, it could fill a large library. Consider that one single book may contain around 1 million characters. If we estimate that the training data is hundreds of gigabytes of text data, let's take a conservative estimate and say 100 gigabytes, then, considering that one character is approximately one byte, this would mean the model was trained on approximately 100 billion characters. If each book has 1 million characters, then the data used to train ChatGPT is equivalent to the text in approximately 100,000 books, enough to fill a sizable library, though still only a fraction of the roughly 53 million items, not just books, held by a large library such as the New York Public Library.
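For the curious, here is the fun fact's arithmetic restated as a quick sketch, using the figures given in the quote (about 1 byte per character, about 1 million characters per book) plus the 4-characters-per-token ratio implied by the closing note below.

```python
# Library napkin math, a sketch using the fun fact's own assumptions.

CHARS_PER_BOOK = 1e6
BYTES_PER_CHAR = 1

training_bytes = 100e9                          # the "conservative" 100 GB estimate
chars = training_bytes / BYTES_PER_CHAR
print(f"~{chars / CHARS_PER_BOOK:,.0f} books")  # ~100,000 books

llama2_chars = 2e12 * 4                         # 2T tokens at ~4 chars/token
print(f"~{llama2_chars / CHARS_PER_BOOK:,.0f} books")  # ~8,000,000 books
```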
Thanks, ChatGPT. A quick note to close out: many models today are even bigger, with Llama 2, for example, being trained on 2 trillion tokens, or about 8 trillion characters, roughly 8 million books' worth. Now that is a lot of libraries. We'll see you next time.
Thank you so much for listening to the a16z Podcast. What we're trying to do here is provide an informed, clear-eyed, but also optimistic take on technology and its future, and we're trying to do that by featuring some of the most inspiring people and the things that they're building. So if that is interesting to you and you'd like to join us on this journey, go ahead and click subscribe, and make sure to let us know in the comments below what you'd like to see us cover next. Thank you so much for listening, and we'll see you next time.