There are very few computational problems that complex that mankind has, you know, undertaken.

So, napkin math: 175 billion parameters, 350 billion floating-point operations, three times ten to the 23, and that's a completely crazy number.

Got it, got it.

The expectation at the moment is that the cost for training these models may actually sort of top out, or even go down a little bit.

How do you think about the relationship between compute, capital, and then the technology that we have today?

Yeah, that's the million-dollar question, or maybe the trillion-dollar question, I don't know.
With software becoming more important than ever, hardware is following suit. And with the world constantly generating more data, unlocking the full potential of AI means a constant need for faster and more resilient hardware. But how much does all of this really cost? In this final segment of our AI Hardware series, we tackle that question head on. If you're just catching up, be sure to check out part one and part two, where we explored the emerging architectures and the momentous competition for AI hardware.

Today we're joined again by a16z special advisor Guido Appenzeller, someone who is uniquely suited for this deep dive as a storied infrastructure expert and CTO of Intel's Data Center Group.

Dealing a lot with hardware, the low-level components, has given me, I think, a good insight into how large data centers work, you know, what the basic components are that make all of this AI boom possible today.

Here is Guido touching on the reality of these models and how much they cost today.
Training one of these large language models today is not a hundred-thousand-dollar thing, it's probably a millions-of-dollars thing. Practically speaking, what we're seeing in industry is that it's actually more of a tens-of-millions-of-dollars thing.

As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com.
In Guido's recent article, Navigating the High Cost of AI Compute, Guido even noted that access to compute resources has become a determining factor for the success of AI companies. And this is not just true for the largest companies building the largest models. In fact, many companies are spending more than 80 percent of their total capital raised on compute resources.

We've specifically seen that with founders that want to train their own models, which is extremely expensive. You have to spend a large chunk of your funding just on compute capacity, and they often have very small, lean teams, so that's the plus side. Over time I would expect this to normalize a little as you move from the core technology that you're building in your early days towards a more complete product offering. There are just a lot more boxes to check, features to implement, and all the administrative parts of your application if you're getting to the enterprise. So you'll probably have more normal software development that's not AI, or classic software development, happening, and you'll probably also have a larger headcount of people that you have to pay. So at the end of the day, I would expect that as a percentage this will go down over time. As an absolute amount, I think it'll be going up for some time, just because this AI boom is still just in its infancy.
The AI boom has just begun, and in part two we discussed how it's unlikely for compute demand to subside anytime soon. There we also discussed how the decision to own or rent infrastructure can make a non-trivial difference to a company's bottom line. But there are other considerations when it comes to cost: batch size, learning rate, and the duration of the training process can all contribute to the final price tag.

How much does it cost to train a model? It depends on a myriad of factors, right?
Now, the good news is we can simplify this a little bit, because the vast majority of models being used today are transformer models. The transformer architecture was a huge breakthrough in AI. These models have proven to be incredibly versatile, and they're easier to train because they parallelize a little bit better than previous models. And in a transformer, you can approximate the inference cost as two floating-point operations per parameter per token, and the training cost as about six floating-point operations per parameter per training token. So if you take something like GPT-3, the OpenAI model, it has 175 billion parameters, so you need twice that, 350 billion floating-point operations, to do one inference.

And based on that, you can figure out how much compute capacity you need, how this is going to scale, how you should price it, how much it'll cost you at the end of the day. For model training, this also gives you an idea of how long the training is going to take. You know how much your AI accelerator can do in terms of floating-point operations per second, so you can theoretically calculate how many operations it takes to train your model. In practice the math is more complicated, because there are certain ways to accelerate it, so maybe you can train with reduced precision, but it's also very hard to achieve 100% utilization on these cards. If you're implementing naively, you're probably going to be below 10% utilization, but you can probably get into the tens of percent with a little bit of work. This gives you a rough swag at how much capacity you need for training and for inference, but at the end you probably do want to test it before you make any final decisions, and make sure that your assumptions hold on how much you need.
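To make that rule of thumb concrete, here is a minimal sketch of the napkin math in Python, assuming the standard transformer approximations of roughly 2 FLOPs per parameter per generated token and 6 FLOPs per parameter per training token. The 300-billion-token training figure is an assumption taken from the published GPT-3 paper, not a number stated in the conversation.

```python
# Napkin math for transformer compute, a rough sketch.
# Rules of thumb: ~2 FLOPs/param per generated token (inference),
#                 ~6 FLOPs/param per training token (training).

def inference_flops_per_token(params: float) -> float:
    """Approximate floating-point operations to generate one token."""
    return 2 * params

def training_flops(params: float, tokens: float) -> float:
    """Approximate floating-point operations for one full training run."""
    return 6 * params * tokens

params = 175e9   # GPT-3-scale model
tokens = 300e9   # assumed training-set size (from the GPT-3 paper)

print(f"inference: {inference_flops_per_token(params):.2e} FLOPs/token")  # ~3.5e11
print(f"training:  {training_flops(params, tokens):.2e} FLOPs total")     # ~3.2e23
```

Which lands right at the 350 billion FLOPs per token and the roughly 3 times 10 to the 23 total that come up in the worked example below.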
Now, if all those numbers confused you, that's okay. We'll walk through a very specific example: GPT-3. GPT-3 has about 175 billion parameters, and here's Guido on the computation requirements for training the model and, ultimately, inference, which is when you're prompting the already-trained model to elicit a response.

So if you just do the math very naively, let's start with training. We know how many tokens it was trained on, and we know how many parameters the model has, so we can do some napkin math, and you end up with something like 3 times 10 to the 23 floating-point operations. And that's a completely crazy number, a number with 23 digits. There are very few computational problems that complex that mankind has actually undertaken. It's a huge effort. From there you can say, okay, let's take an A100, the most commonly used card. We know how many floating-point operations it can do per second, so we can divide and get an order-of-magnitude estimate of how much time it would take. And we know how much one of these cards costs. Renting an A100 costs you between, I would say, one and four dollars an hour, depending on who you rent from. And you end up with something in the order of half a million dollars with this very, very naive analysis.
Now, there are a couple of things there. We didn't take into account optimizations. We also didn't take into account that you probably cannot run this at full capacity because of memory bandwidth and network limitations. And, last but not least, you probably need more than one run to get this right, and you probably need a bunch of test runs, which are probably not going to be full runs, and so on. But this gives you an idea: training one of these large language models today is not a hundred-thousand-dollar thing, it's probably a millions-of-dollars thing. Practically speaking, what we're seeing in industry is that it's actually more of a tens-of-millions-of-dollars thing, and that's because you need to reserve capacity. So if I could get all my cards for just the next two months, it would only cost me a million dollars, but the problem is they want a two-year reservation, so really the cost is twelve times as high, and that basically adds a zero to the cost.
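To see where the half-million-dollar figure comes from, here is a hedged sketch of that naive division. The 3 times 10 to the 23 FLOP total is the napkin math from above; the 312 TFLOPS peak FP16 throughput for an A100 is a published spec, while the $2-per-hour price (within the one-to-four-dollar rental range Guido mentions) and the utilization levels are illustrative assumptions.

```python
# Naive training-cost estimate, a sketch under stated assumptions.

TOTAL_FLOPS = 3e23            # GPT-3-scale training run (napkin math above)
A100_PEAK_FLOPS = 312e12      # peak FP16/BF16 tensor throughput, per second
PRICE_PER_GPU_HOUR = 2.0      # dollars; within the cited $1-$4 rental range

def training_cost_usd(utilization: float) -> float:
    """Rental cost of the A100-hours needed at a given utilization."""
    gpu_seconds = TOTAL_FLOPS / (A100_PEAK_FLOPS * utilization)
    return gpu_seconds / 3600 * PRICE_PER_GPU_HOUR

print(f"100% utilization: ${training_cost_usd(1.0):,.0f}")  # ~$534,000
print(f" 30% utilization: ${training_cost_usd(0.3):,.0f}")  # ~$1.8 million
print(f" 10% utilization: ${training_cost_usd(0.1):,.0f}")  # ~$5.3 million
```

Realistic utilization alone pushes the naive half million into the millions, and a two-year reservation instead of a two-month rental adds roughly another order of magnitude, which is how you arrive at a tens-of-millions-of-dollars thing.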
So inference is much, much cheaper. Basically, for a modern text model, for example, the training set is about a trillion tokens, and when I run inference, each word that comes out is roughly one token. So it's a factor of a trillion or so faster on the inference side. If you run the numbers for a large language model, you actually land at a fraction of a cent, like a tenth of a cent or a hundredth of a cent, somewhere in that ballpark, for the inference, again if we just look at this naively. For inference, your problem is usually that you have to provision for peak capacity. So if everybody wants to use your model at 9am on a Monday, you still have to pay for Saturday night at midnight when nobody's using it, and that increases your cost substantially.
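The same style of sketch works for the inference side, again hedged: the 2-FLOPs-per-parameter-per-token rule from above, the same assumed A100 throughput and $2-per-hour price, and illustrative utilization figures. Provisioning for peak capacity, as Guido notes, would multiply whatever number comes out.

```python
# Naive inference-cost estimate, a sketch under stated assumptions.

PARAMS = 175e9
FLOPS_PER_TOKEN = 2 * PARAMS   # ~3.5e11 FLOPs per generated token
A100_PEAK_FLOPS = 312e12       # per second, peak FP16 tensor throughput
PRICE_PER_GPU_HOUR = 2.0       # dollars, illustrative

def cost_per_1k_tokens_usd(utilization: float) -> float:
    """Rental cost of generating 1,000 tokens at a given utilization."""
    seconds = 1000 * FLOPS_PER_TOKEN / (A100_PEAK_FLOPS * utilization)
    return seconds / 3600 * PRICE_PER_GPU_HOUR

print(f"50% utilization: ${cost_per_1k_tokens_usd(0.5):.4f} per 1k tokens")  # ~$0.0012
print(f"10% utilization: ${cost_per_1k_tokens_usd(0.1):.4f} per 1k tokens")  # ~$0.0062
```

A tenth of a cent or so per thousand tokens, which is the fraction-of-a-cent ballpark from the conversation.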
There, for some models, specifically image models, what you can do for inference is use much, much cheaper cards, because the model is small enough that you can run it on essentially the server version of a consumer graphics card, and that can save a lot of cost.

And unfortunately, as we discussed in part one, you can't just make up for these inefficiencies by piecing together a bunch of less performant chips, at least for model training.

You need some super sophisticated software for that, because the overhead of distributing the data between these cards will probably outweigh any savings you get from cheaper cards. Inference, on the other hand, you can often do on a single card, so that's not really a problem. If you take something like Stable Diffusion, a very popular model for image generation, that runs on a MacBook, for example, one that has enough memory and enough compute power, so you can generate an image locally. So that will run on a relatively cheap consumer card, and you don't have to use an A100 to do inference.
So when we're talking about the training of the models, clearly the amount of compute is drastically more than for inference. And something else that we've already talked about is that more compute, often, not always, but often, means a better model. So do these factors all ladder up to the idea that heavily capitalized incumbents win this race, or this competition? Or how do you think about the relationship between compute, capital, and then the technology that we have today?

Yeah, that's the million-dollar question, or maybe the trillion-dollar question, I don't know.
So training these models is expensive. For example, we haven't yet seen a really good open-source large language model, and I'm sure part of the reason is that training these models is just really, really expensive. I mean, there's a bunch of enthusiasts out there who would love to do this, but you need to find, I think, a couple of million or ten million dollars of compute capacity to do it, and that makes it so much harder. It means you need a substantial effort before something like that can happen.

All that said, the cost for training these models overall seems to be topping out, and in part I think that is because it seems to me like we're becoming data-limited. It turns out there is a correspondence between how big your model is and what the optimal amount of training data is for the model. Having a super large model with very little data doesn't get you anything, and having a ton of data with a small model also doesn't get you anything. Roughly speaking, the size of your brain needs to correspond to the length of your education, otherwise it doesn't work. What this means is that some of the large models today already leverage a good percentage of all human knowledge in a particular area. I mean, GPT was probably trained on something like 10% of the internet, all of Wikipedia, and a good chunk of all books. So going up by a factor of 10, that's probably possible. Going up by a factor of a hundred, it's not clear that's possible. We as mankind just haven't produced enough knowledge yet that you could absorb all of it into one of these models.
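As a rough sketch of that correspondence between model size and training data: the scaling-law literature (the "Chinchilla" result of Hoffmann et al., 2022) suggests on the order of 20 training tokens per parameter. The conversation doesn't cite that ratio explicitly, so treat it as an assumed rule of thumb, but it illustrates why data, not compute, becomes the binding constraint.

```python
# Compute-optimal training-data sketch, assuming the ~20 tokens/param
# rule of thumb from the Chinchilla scaling-law work (an assumption,
# not a figure from the conversation).

TOKENS_PER_PARAM = 20

def compute_optimal_tokens(params: float) -> float:
    """Training tokens that roughly 'match' a model of this size."""
    return TOKENS_PER_PARAM * params

for params in (7e9, 70e9, 175e9):
    print(f"{params / 1e9:.0f}B params -> ~{compute_optimal_tokens(params) / 1e12:.2f}T tokens")
# 7B -> ~0.14T, 70B -> ~1.40T, 175B -> ~3.50T
```

Under that rule, a 175-billion-parameter model already "wants" a few trillion tokens of training text, which is a meaningful fraction of the high-quality text readily available. That is the data wall being described.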
And so I think the expectation at the moment is that the cost for training these models may actually sort of top out, or even go down a little bit, as the chips get faster but we don't discover new training material as quickly, unless somebody comes up with a new idea for how to generate training material. If that assumption is true, I think it means that the moat created by these large capital investments is actually not particularly deep. It's more of a speed bump than something that prevents new entrants. I mean, today, training a large language model is something that is definitely within reach for a well-funded startup, and for that reason we expect to see more innovation in that area in the future.
All right, that is a wrap for our AI Hardware series. We genuinely hope you came away with a little more knowledge about this increasingly important space, because if software is indeed eating the world, well, hardware is coming along for that ride. And as a reminder, if you haven't yet listened to part one, where we explore the emerging architectures and who's creating them, or part two, where we dive into the future AI stack and how founders can participate, those are already live and ready for consumption. And as always, thank you so much for listening. We'd actually like to leave you with a fun fact from GPT-4 itself, commenting on the technology that created it. And yes, we did fact-check this, and this is also AI-generated audio.
ChatGPT and its sibling models are trained on diverse internet text. However, the exact amount of data used can be hard to comprehend. If we were to print all of the data used to train these models, it could fill a large library. Consider that one single book may contain around 1 million characters. If we estimate that the training data is hundreds of gigabytes of text data, let's take a conservative estimate and say 100 gigabytes, then, considering that one character is approximately one byte, this would mean the model was trained on approximately 100 billion characters. If each book has 1 million characters, then the data used to train ChatGPT is equivalent to the text in approximately 100,000 books, enough to fill a sizable library, though still only a fraction of the roughly 53 million items, not just books, held by a large library such as the New York Public Library.
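For the curious, here is the fun fact's arithmetic restated as a quick sketch, using the figures given in the quote (about 1 byte per character, about 1 million characters per book) plus the 4-characters-per-token ratio implied by the closing note below.

```python
# Library napkin math, a sketch using the fun fact's own assumptions.

CHARS_PER_BOOK = 1e6
BYTES_PER_CHAR = 1

training_bytes = 100e9                          # the "conservative" 100 GB estimate
chars = training_bytes / BYTES_PER_CHAR
print(f"~{chars / CHARS_PER_BOOK:,.0f} books")  # ~100,000 books

llama2_chars = 2e12 * 4                         # 2T tokens at ~4 chars/token
print(f"~{llama2_chars / CHARS_PER_BOOK:,.0f} books")  # ~8,000,000 books
```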
Thanks, ChatGPT. A quick note to close out: many models today are even bigger, with Llama 2, for example, being trained on 2 trillion tokens, or about 8 trillion characters, roughly 8 million books' worth. Now that is a lot of libraries. We'll see you next time.
Thank you so much for listening to the a16z Podcast. What we're trying to do here is provide an informed, clear-eyed, but also optimistic take on technology and its future, and we're trying to do that by featuring some of the most inspiring people and the things that they're building. So if that is interesting to you and you'd like to join us on this journey, go ahead and click subscribe, and make sure to let us know in the comments below what you'd like to see us cover next. Thank you so much for listening, and we'll see you next time.