a16z Podcast | Why the Datacenter Needs an Operating System

a16z | 2019-01-02

32 views|5 years ago

💫 Short Summary

The video explores the challenges of managing complex systems like Kafka and Cassandra in data centers, advocating for a Data Center Operating System (DCOS) to optimize resource allocation. It discusses the evolution of operating systems, task scheduling, deployment of tasks and services, security, and integration with containerization technologies like Docker. Emphasis is placed on efficient resource management, the use of the DCOS's open-source components at companies large and small, and the importance of making distributed systems easier to build and manage. It highlights the need for better utilization and for moving maintenance from manual processes to software.

✨ Highlights

📊 Transcript

✦

Challenges of managing complex systems like Kafka and Cassandra in modern data centers.

02:10

Early days of computing involved manual allocation of resources by programmers.

Buying more resources for specific purposes leads to 85% of resources going unused.

Introduction of the Data Center Operating System (DCOS) to optimize resource allocation.

The DCOS functions like a traditional operating system, but for a whole data center rather than a single computer.

✦

Evolution of operating systems from monolithic to microkernel-like systems using Apache Mesos at the core.

03:11

Companies like Twitter and Airbnb are utilizing this evolution in their operations.

Data center services such as HDFS, Kafka, Hadoop, and Cassandra are leveraging these frameworks.

A distributed init system like Marathon is central to task management.

Kernel provides core primitives for task and process management, resource allocation, and isolation in multi-tenant environments.

✦

Overview of task scheduling in a data center.

06:11

Mesos serves as a system call interface and API for launching tasks in the data center.

Mesos communicates with data center services, monitors tasks, and handles failures.

The importance of using higher-level tools like Marathon for task management in a data center environment.
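The flow described in this segment (an init-like service asks the kernel for resources, the kernel places and tracks the task, and failures are reported so the task can be relaunched) can be sketched as a toy in-memory model. This is only an illustration under stated assumptions: `Host`, `ToyKernel`, and `ToyInit` are hypothetical names, placement is simple first-fit, and none of this is the Mesos or Marathon API.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    cpus: float
    mem_mb: int
    tasks: dict = field(default_factory=dict)  # task name -> (cpus, mem_mb)

    def fits(self, cpus, mem_mb):
        return cpus <= self.cpus and mem_mb <= self.mem_mb


class ToyKernel:
    """Hypothetical stand-in for the cluster kernel; not the Mesos API."""

    def __init__(self, hosts):
        self.hosts = list(hosts)

    def launch(self, task, cpus, mem_mb):
        # First-fit placement: use the first host with enough free resources.
        for host in self.hosts:
            if host.fits(cpus, mem_mb):
                host.cpus -= cpus
                host.mem_mb -= mem_mb
                host.tasks[task] = (cpus, mem_mb)
                return host.name
        return None  # nothing fits anywhere

    def fail_host(self, name):
        # Simulate a machine failure and report which tasks were lost.
        for host in self.hosts:
            if host.name == name:
                lost, host.tasks = host.tasks, {}
                self.hosts.remove(host)
                return lost
        return {}


class ToyInit:
    """Hypothetical init-like service that keeps tasks running."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.placement = {}  # task name -> host name (or None)

    def run(self, task, cpus, mem_mb):
        self.placement[task] = self.kernel.launch(task, cpus, mem_mb)
        return self.placement[task]

    def on_host_failure(self, name):
        # Relaunch every task the failed host was running.
        for task, (cpus, mem_mb) in self.kernel.fail_host(name).items():
            self.run(task, cpus, mem_mb)


kernel = ToyKernel([Host("a", cpus=4, mem_mb=8192), Host("b", cpus=8, mem_mb=8192)])
init = ToyInit(kernel)
first = init.run("web", cpus=2, mem_mb=1024)
second = init.run("db", cpus=3, mem_mb=4096)
init.on_host_failure("a")
print(first, second, init.placement)
```

The caller never names a machine: it states what the task needs, and the placement and relaunch decisions stay inside the toy kernel and init service.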

✦

Setting up individual machines to communicate through a master, which manages connected machines.

08:34

A command-line interface is provided for running tasks through services like Marathon, letting users specify the command to run and flags describing how the task should run.

The CLI enables users to monitor processes and tasks, with plans to drill down into specific tasks and processes.

Resource consumption understanding is crucial for diagnostics or performance monitoring, even with abstracted interfaces like CLI.

Abstractions like these do not prevent DevOps staff from knowing what is actually going on; the tools let them drill down when they need to.
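As a rough illustration of the points in this segment, the sketch below models agents on individual machines registering with a master, the master exposing the combined resource pool, and a `ps`-style listing of tasks. `ToyMaster` and the node names are hypothetical and exist only for this example; the real master and CLI are far more involved.

```python
class ToyMaster:
    """Hypothetical registry of agents; illustrative only, not the Mesos API."""

    def __init__(self):
        self.agents = {}  # agent id -> {"cpus": ..., "mem_mb": ...}
        self.tasks = []   # (task name, agent id)

    def register(self, agent_id, cpus, mem_mb):
        # Called once per machine when its agent process starts up.
        self.agents[agent_id] = {"cpus": cpus, "mem_mb": mem_mb}

    def pool(self):
        # The data center viewed as one big computer: summed resources.
        return {
            "cpus": sum(a["cpus"] for a in self.agents.values()),
            "mem_mb": sum(a["mem_mb"] for a in self.agents.values()),
        }

    def record_task(self, name, agent_id):
        if agent_id not in self.agents:
            raise KeyError(f"unknown agent {agent_id!r}")
        self.tasks.append((name, agent_id))

    def ps(self):
        # Cluster-wide task listing, analogous to `ps` on a single machine.
        return sorted(self.tasks)


master = ToyMaster()
master.register("rack1-node1", cpus=8, mem_mb=16384)
master.register("rack1-node2", cpus=16, mem_mb=32768)
master.record_task("web", "rack1-node2")
master.record_task("cache", "rack1-node1")
print(master.pool())
print(master.ps())
```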

✦

Deployment of tasks and services using a repository system for software installation.

12:11

Twitter's architecture transformation from monolithic to small services, with examples like tweet processing and SMS notifications.

Emphasis on the importance of operating systems in resource management and security, focusing on conceptual models.

Mention of the flexibility of running applications on different computers within a distributed system.

✦

Challenges in centralized security measures and distributed systems.

14:18

Varied approaches and increased vulnerabilities result from the lack of centralized security measures in organizations.

Building personalized distributed systems hinders security and portability between organizations.

The need for a standardized data center operating system is emphasized to provide security primitives for applications to run seamlessly across organizations.

The complexity of building distributed systems often requires advanced expertise, limiting accessibility to these systems.

✦

Importance of building distributed systems without requiring PhDs.

17:08

Emphasis on the significance of security in distributed systems and ease of moving applications across organizations.

Highlight on the use of containerization technologies, specifically Docker.

Explanation of how the data center operating system integrates with Docker images for launching and running containers seamlessly.

Noting the flexibility in container creation and deployment for efficient scaling and rescheduling during failures.
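The "give us bits and we'll run those bits" idea, where new image formats can be plugged in over time, can be sketched as a small registry that dispatches on image format. Everything here is an assumption for illustration: the `launcher` decorator, the format names, and the returned strings are hypothetical, and no container is actually started.

```python
from typing import Callable, Dict

# Hypothetical registry: image format name -> function that launches it.
LAUNCHERS: Dict[str, Callable[[str], str]] = {}


def launcher(fmt: str):
    """Register a launch function for one image format."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        LAUNCHERS[fmt] = fn
        return fn
    return register


@launcher("docker")
def launch_docker(ref: str) -> str:
    # A real system would hand the image to its containerizer here.
    return f"would start container from Docker image {ref}"


@launcher("binary")
def launch_binary(ref: str) -> str:
    return f"would exec binary {ref}"


def launch(fmt: str, ref: str) -> str:
    try:
        return LAUNCHERS[fmt](ref)
    except KeyError:
        raise ValueError(f"no launcher registered for format {fmt!r}") from None


print(launch("docker", "example/web:1.0"))
print(launch("binary", "/opt/example/server"))
```

Supporting a new image format then means registering one more function, without changing the code that requests a launch.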

✦

Importance of utilizing unused resources efficiently in a manner similar to how an operating system manages resources.

18:51

Redefining abstractions in operating systems can enhance resource management and scheduling.

Drawing parallels to the impact of virtual memory highlights the potential for more sophisticated resource allocation methods.

Adapting to new abstraction layers during transitions is significant for efficient resource utilization.

Better utilization and customer service in data center operations are essential for improved efficiency.
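For a sense of scale, here is a back-of-the-envelope calculation using the two figures quoted in the conversation: 85 percent of resources unused today, and a goal of roughly 10 percent unused. It assumes, as a deliberate simplification, that all machines are identical and that work can be packed perfectly; `machines_needed` is a hypothetical helper written for this example.

```python
import math


def machines_needed(current_machines, unused_now_pct, unused_target_pct):
    """Machines required to carry the same busy capacity at a lower unused share."""
    busy = current_machines * (100 - unused_now_pct)  # in machine-percent units
    return math.ceil(busy / (100 - unused_target_pct))


# 1000 machines at 85% unused carry 150 machines' worth of actual work.
print(machines_needed(1000, 85, 10))  # 167
```

Under these idealized assumptions the same work fits on about one sixth of the machines; real clusters would land somewhere short of that because of headroom, failures, and imperfect packing.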

✦

The open-source components that make up much of the DCOS are used by companies such as Twitter, Airbnb, HubSpot, eBay, PayPal, and Netflix, as well as smaller companies with fewer machines.

21:55

Software evolution allows the DCOS to work effectively at both small and large scales, adapting to hardware advancements.

The DCOS enables hardware innovation at a different pace and provides abstractions for companies of all sizes.

The focus is on building new distributed systems with abstractions and primitives, setting it apart from traditional infrastructure as a service models.

✦

Importance of data center operating system in building and managing distributed systems.

23:37

Moving to the cloud without proper understanding can lead to bugs and inefficiencies.

Utilizing a data center operating system is valuable for those already using infrastructure as a service.

Careful consideration is needed when starting from scratch to determine the necessity of virtualization.

Informed decisions based on specific needs and existing resources are essential.

✦

Benefits of bypassing virtualization overhead and using platforms like Mesos for cost savings.

25:20

Importance of building new applications instead of virtualizing old ones for efficient resource utilization.

Platform-as-a-service abstracts machines for seamless task execution.

Comparison between platform-as-a-service and data center operating systems, focusing on resource management and application execution.

✦

The discussion focuses on the evolution of operating systems to efficiently run distributed applications.

28:18

Developers can find resources and information at mesos.apache.org to learn about the kernel and new developments.

There is an emphasis on advancing towards a more effective data center operating system.

Automation of maintenance tasks such as machine repairs and data rescheduling is highlighted.

The significance of transitioning from manual to software-based maintenance processes is emphasized.

✦

Benefits of direct operating system and application interaction for smarter functionality in distributed environments.

29:50

Reimagining basic primitives to meet modern needs and enable smarter ways of working at scale.

Transitioning manual tasks to software-based solutions for efficient handling of complexities.

Vision towards enhancing productivity and efficiency in various operations by leveraging technology.

00:00good afternoon everybody this is Steven

00:02Sinofsky here with the a16z Podcast

00:05very excited today to have Benjamin

00:08Hindman of Mesosphere here today and

00:10we're gonna talk about a new concept

00:13that the company's coming out with

00:14called the data center operating system

00:16or DCOS

00:18you know today you know you know apps

00:20they span servers there are things like

00:23Kafka and spark and MapReduce Cassandra

00:26that's super super complex to roll out

00:29these these huge systems in fact the

00:31real challenge of just allocating

00:33resources and figuring things out

00:34reminds me personally of the very early

00:37days of computing when when programmers

00:40were responsible for allocating the

00:42resources of a machine you know if you

00:44wanted a file you sort of wrote your

00:46own file system if you wanted a process

00:48you had to figure out which part of the

00:50CPU to save and store and load and you

00:53know great programmers back in those

00:55days which really weren't as long ago as

00:56people seem to think knew how to squeeze

00:59the most out of a computer by being able

01:02to manually allocate resources you know

01:04my old boss Bill Gates was famous for

01:07how many things he could squeeze into an

01:088-bit byte of BASIC you know over the

01:11weekend and it's very very important

01:14back then to do that and it the problem

01:16was if you were really good at it your

01:18code became completely unmanageable and

01:20hard to deal with and that turns out to

01:23be a little bit of what's going on today

01:25and in the data center except I think

01:27it's a little bit of the opposite today

01:30you know an enterprise of the big data

01:32center is taking the opposite approach

01:33which is let's just keep buying more and

01:35more resources and use them for special

01:38purposes so I don't have to think hard

01:39about packing more bits into a byte so

01:42to speak and so there's more servers and

01:44more complexity and more VMs you know

01:46you're in this world of like it's

01:47basically one app per server one app per

01:50VM and you know what the problem is

01:52that's simpler but not simple to manage

01:54but it leads to this unbelievable waste

01:57and waste in a data center is is a big

02:00mess eighty-five percent of the

02:02resources go unused and and I think to

02:05me that's where where the data center

02:07operating system really comes in and so

02:10I think that

02:12do is just sort of talk about this well

02:15we call it the DCOS which is weird cuz

02:17data center is is one word so it really

02:19should be DOS but that would take this

02:21podcast a whole different level and I

02:23don't think you know if we think about

02:26traditional operating system as

02:27allocating the CPU and the memory and

02:30the disk in the network all for a single

02:32computer what what is the Ben what's

02:36the DCOS yeah yeah great

02:38so so the data center operating system

02:41consists of a bunch of components and

02:44when you really think about an operating

02:45system it itself consists of a bunch of

02:47components in fact operating systems

02:49have evolved over the years we've had

02:51you know monolithic operating system

02:53microservice-based based based

02:56operating systems and what we've really

03:00done with with Mesosphere's data center

03:03operating system is we've created

03:04something that's more like a microkernel

03:07like operating system where at the core

03:10of it is this open source project that

03:11we have called Apache Mesos and it's

03:14what's being used at a bunch of

03:15companies like Twitter and and Airbnb

03:17and other companies to actually run run

03:19run their infrastructure and then a lot

03:21of the other what we're calling data

03:22center services which are these these

03:24these software frameworks which run on

03:26top of Mesos and can take

03:28advantage of Mesos to actually

03:29execute the computations that you want

03:31to do things like Kafka and HDFS and

03:34Hadoop and Cassandra and those

03:38components really make up the core parts

03:40of what makes data center operating

03:43system so you can really think about the

03:44base level is the kernel which is which

03:46is may so it's just like a kernel in an

03:48operating system and then things like

03:50storage something like HDFS which

03:53leverages the kernel Mesos to actually

03:55actually provide its storage and then

03:58these these data center services that I

04:00mentioned and and then a really really

04:03key one for us is what we call our

04:05distributed init.d and that's that's a

04:09Linux and not Windows that's true yeah

04:16and our distributed init.d is

04:20what we ship is something called

04:21Marathon but there's alternatives to

04:24that just like in fact in the Linux

04:25world today there's alternatives to init.d

04:27like systemd you've got a bunch of

04:29different init.d alternatives kind of

04:31like on the on your operating system

04:34today you have many alternative browsers

04:36Internet Explorer Chrome Firefox but we

04:41we we use Marathon and and that's kind

04:45of that's that that's the core of how

04:46you end up running a lot of your tasks

04:47because that's that's your your init

04:48system where you describe all your tasks

04:50and then of course to interact with your

04:54operating system you need some kind of

04:55interface what about going back just

04:57before we jump into the interface tell

04:58me like you know when I think about what

05:00an operating system needs to do one of

05:02the things that needs to do is it needs

05:03to like schedule things so there's a scheduler

05:05yeah yeah so the kernel Mesos the

05:10core primitives that it really provides

05:11is task management a process management

05:14but task management resource allocation

05:16resource isolation the things the things

05:18you'd expect to get from something that

05:21needs to run multi-tenant lots of

05:23applications at the same time

05:24what's the difference between a task and a

05:26process great yeah

05:27so we chose tasks because we didn't want

05:30to overload the the process nomenclature

05:34and a task is just it's the entity that

05:36we use to describe something that we

05:38have launched on some host in the data

05:41center and so it could be a process it

05:43could be a collection of processes but

05:45it's the thing it's the unit that we use

05:47to actually schedule it's the thing that

05:48consumes resources really at the end of

05:50the day so so I have a conceptual

05:53understanding of like a level of

05:55services but how does it actually work

05:57like how do i how do I get all of this

05:59onto a machine on to a data center

06:01what's the what's the mechanism that

06:03everybody's connected up yeah yeah so

06:05it's the bus yeah yeah yeah so so the

06:09way it works is that using one of these

06:11data center services we talked about

06:12that consists of the entire operating

06:14system something like Marathon for

06:16for running your tasks you would

06:19interface through Marathon you would

06:20ask Marathon you'd say hey Marathon

06:21launch this task just like you would

06:23tell init.d on Linux hey run this task

06:25when when when you boot up and then what

06:28it does is it uses what we really think

06:31of as kind of a system call interface

06:33Mesos to get resources allocated to

06:35it and then launch a task so so it says

06:38to Mesos hey I'd like to run this I'd

06:40like to run a task I need these

06:41resources to get the resources allocated

06:43to it and then it launches the task and

06:45then Mesos at that point takes it

06:46takes care of making sure that it gets

06:48the tasks to the right machine the right

06:50host launches the task monitors it

06:53isolates it when it fails it tells the

06:55system that it's failed so it can either

06:56be relaunched whatever it needs to

06:58happen and so that the communication

07:00really is between one of these data

07:02center services like marathon that's

07:04running on top of Mesos and

07:06Mesos which is really providing kind of

07:07this system call the system call API and

07:10when you think about it this is one of

07:13the interesting things about Mesos

07:14itself it really is much more like a

07:17kernel anyway you know if you download

07:19Mesos by itself today there's not really

07:22much you can do with it just like if

07:23you download like the Linux kernel by

07:24itself today right great now I got the

07:26kernel what do I do I'm not gonna program

07:27code which is gonna do interrupt 80

07:29to you know do a system call you're

07:32going to use something at a higher level

07:33you're gonna say say bash at a higher

07:35level to launch tasks we're gonna use

07:38some kind of window manager at a higher

07:39level and that's exactly what something

07:41like Marathon is providing on top of

07:43Mesos today so so first how do all of

07:46the you know I think of datacenter I

07:48think a rack and I think of all these

07:50boxes how does how does Mesos know that

07:54the boxes are part of its resource pool

07:56yeah what's connects them all yeah so on

07:59each individual machine we run an agent

08:01process and and so that process could

08:04could be launched either via a system

08:06image that you would use one of our

08:07system images or if you are using some

08:10more traditional configuration

08:11management software you could use that

08:13to set up here to set up all your

08:15individual machines physical or virtual

08:17and then they all communicate back

08:20through the the Mesos master as we call

08:22it the sort of the brain of Mesos which

08:26is responsible for managing all these

08:28machines that have connected through

08:29their agents and then the bus is

08:32basically between those machines and the

08:34Masters themselves cool so then so now

08:37I'm sitting in front of the machine yeah

08:41or of the the cluster or whatever and

08:44how do I know I'm running it like

08:46there you mentioned command line so like

08:48that I'm sort of in my head I have this

08:50now the data center is now like one one

08:52big computer yep and so well I want to

08:54tell it to do something yep what do I do

08:57yeah so so the interface really the

09:00first interface that we've provided is a

09:01command-line interface and so we did

09:04this for a bunch of reasons

09:05so not a card reader okay oh yeah we

09:10made it pluggable we can make that

09:13interface as well but not that it made a

09:15lot of sense for us to actually make

09:16this be really the first interface - or

09:18- to the to the to the DCOS

09:21and and so what you can do is is you can

09:23actually type from from from a terminal

09:25you can type dcos space and then one of

09:28these data center services that I was I

09:29was mentioning something like marathon

09:30you can say marathon and then you can

09:31give it some information to run a task

09:33you can say like dcos marathon run and

09:36then the command you want to run and

09:37maybe some extra flag information to

09:39describe how it gets its its artifacts

09:40its you know it's its resources to run

09:42and then you do that and it starts

09:44running and so of course what does that

09:46mean it starts running well it could

09:48mean that you could go to some web

09:49browser if the task that you launched

09:50happen to be a web server but of course

09:54you can also do something with a CLI

09:55which is dcos

09:56ps so you can actually see all the

09:58processes that are running all the tasks

10:00you have running so all the processes

10:02were all the tasks all the tasks yes

10:03right tasks you've immediately like I'm

10:05kind of done with processes yeah so now

10:07I'm looking at a task might be spanning

10:10resource yes yeah so in the the the CLI

10:13today what we have is just just all the

10:15tasks but as we evolved the CLI we'll be

10:18able to drill down so you can see for

10:20this tasks what processes represent

10:22those tasks for those processes what

10:24threads in those processes so you'll get

10:27to see all the resources are actually

10:28being consumed to define because of all

10:30even even in the best cases of single

10:33machine computing at some point for

10:35diagnostics or performance or something

10:37you're gonna actually have to know how

10:40things are done my staff so the fact

10:41that you're using these abstractions

10:43doesn't prohibit a DevOps person from

10:46really knowing what's going on that's

10:48exactly right

10:48yep and that's that's just the same

10:50today where you know if you just type PS

10:52on say Linux box you just do just see

10:54the processes but if you want you can

10:55really dive in and you can say show me

10:57all the threads for those

10:58so okay so so you sort of describe how I

11:02get a something going like is that do I

11:05install software on it what what do I

11:08think of is like where does where does

11:09the tasks come from yeah yeah so once

11:12once once the Mesosphere DCOS

11:15software's really installed everywhere

11:16and you want to run other tasks we have

11:20built a repository a registry like

11:23system that allows you to to describe a

11:26task and just kind of like homebrew or

11:28like the the package managers out there

11:31you can say hey I want to install one of

11:33these one of these frameworks one of

11:34these services you can do that it'll

11:36pull down from a repository the

11:38necessary bits of information you can

11:40have it either get installed on the

11:41distributed file system you might have

11:43running something like HDFS or Ceph

11:45which again is is something that's

11:47running on top of the the DCOS and so

11:52you know you can point to where it is

11:53and then you can say hey my init.d you

11:55know my service scheduler go ahead and

11:57now run this service pull it from this

11:58location so you have the bits and go

12:00from there just so folks can have a

12:01clearer view like give me what are some

12:03specific examples of services that

12:05you're that come to mind or tasks that

12:07you would yeah I would really think yeah

12:08so making it really concrete yeah yeah

12:11yeah so at a company like Twitter which

12:14is a big user of of Mesos they've

12:17basically decomposed their architecture

12:19from this monolithic architecture into a

12:21bunch of small services and each of

12:23those individual apps each of those

12:24individual services which is say when a

12:27tweet comes in it's sending out a you

12:30know post to an SMS or it's us you know

12:36hydrating the tweet for other people's

12:38timelines so they so that other people

12:39can see that this tweet has come in

12:40because it should show up each of these

12:42individual services would be the kind of

12:44task and app that you might want to run

12:45and so you could just say to the DCOS

12:47hey I want to run this this this

12:49application I don't care you know where

12:51I want to run it just here's the

12:52information here's the binary needs to

12:53run go your big computer run this some

12:56really big computer so so one of the

12:59things that jumps to mind is is that you

13:01know when I think of an OS I think not

13:03just of like the resource management but

13:05it also provides

13:06conceptual models for really important

13:08things like one that jumps to mind is

13:09security and

13:11any time you start telling me like oh

13:13by the way code's running anywhere I

13:15start to worry like well if code is

13:17anywhere and I don't know where it is

13:19doesn't that make me vulnerable in

13:20places that I'm not predicting great so

13:22tell me a little bit about how like

13:24something like like isolation or I think

13:25of security in a DCOS model yeah so you

13:28know I think this is a really

13:30interesting topic because what tends to

13:32happen a lot of these organizations when

13:33there isn't some centralized way and

13:35people are thinking about how they want

13:36to do resource management and run their

13:38applications is you get a bunch of

13:42disaggregated you know everyone's doing

13:44it slightly differently yeah so often

13:46times you have worse security because

13:48you know rather than a security team

13:49being able to audit just the one way in

13:51which everything gets to run they have

13:52to audit a whole bunch of different

13:54processes and some people get a little

13:55bit differently and then the worst part

13:57about that is they can't compose right

13:59and this to me is is one of the the

14:01fundamental issues I have with a lot of

14:03distributed systems is because people

14:06are building distributed systems in such

14:07a in such a personalized way and are

14:10personalized for their organization or

14:11their company you can't you can't easily

14:13build a distributed system in one

14:14organization and move it to another

14:16organization and right and security is a

14:18perfect example that you know one

14:19organization uses LDAP so the first way

14:21that they build it in is it hooks into

14:23LDAP and it's so ingrained that they're

14:24gonna do LDAP and another organization

14:26doesn't use LDAP they use some other

14:28mechanisms of authentication or identity

14:32earlier yes like I you always see this

14:34like with like come with when you have a

14:36big giant web presence you have the

14:38company that operates the web server

14:39part yeah and then they went and did

14:41analytics and a completely different

14:43sort of stack yeah and they're figuring

14:45out how to get the access to the logs to

14:47do the analysis yeah

14:48and then no one can either do both do

14:50audit both exactly exactly so I mean

14:53this is one of the biggest drivers of

14:55why we are we're trying to build why

14:57we're building a data center operating

14:59system is because I think at the end of the day

15:00somebody should be able to build an

15:01application against the primitives like

15:03security primitives that could be

15:05provided by it by a data center

15:06operating system and go and run it in

15:08another organization because it's just

15:11an app that you built and you know it

15:13was very interesting at the beginning of

15:14the podcast when you were talking about

15:15the people that wrote the you know the

15:17hardcore applications that that's the

15:20case with distributed systems today so

15:21you know you kind of have to have a PhD to

15:23write a distributed system

15:25many PhDs came about showing you how to

15:27write distributed systems and they went and built

15:29them yeah that's right and and and but

15:31we're at the point now where everyone is

15:33basically building a distributed system

15:35they don't all have PhDs and we want to

15:37be able to build those distributed

15:38systems in one organization run them in

15:40another organization I'm going to do

15:42that in a really really efficient manner

15:44and and and and that like security is a

15:45perfect example of something that if we

15:47can provide the interfaces for doing

15:50security and our distributed systems and

15:52people can build against those

15:53interfaces then we can easily move our

15:55applications across across organizations

15:58so so speed building on the applications

16:00part like one of the the things that

16:02obviously has a huge amount of attention

16:04and excitement right now whether it's

16:06from Docker or CoreOS is just the

16:08notion of containers yeah so in

16:09listening to you I'm sort of trying to

16:11parse in my head like do i no longer

16:14need containers are you gonna provide a

16:17container that i have to use am i gonna

16:19be able to use containers that i've

16:20already created

16:21where did containers fit in on your

16:23stack yeah that's a great question so um

16:27so Mesos has used containerization

16:30technologies what we've used to underpin

16:32the Mesosphere DCOS

16:34has used containerization technologies

16:36for a long time since 2009 in fact in

16:392009 we even had Solaris zones support

16:41so we had containerization technologies

16:43so from from from from even even outside

16:45of linux and we've provided that

16:47containerization technology and will

16:49continue to do so so with when people

16:51have created have have used the existing

16:53containers containerization technology

16:55to build new things like docker on top

16:57that's been something that we've been

16:58able to integrate with very very easily

17:00so if you're creating docker images this

17:02is a fantastic thing you can give us you

17:04can give it directly to us we can launch

17:06those those those docker

17:07docker images directly using our

17:08containerization technology and as this

17:10stuff evolves as other companies

17:12introduce new image like formats to

17:15describe the bits you need to run your

17:18containers again this is just going to

17:19be something that we can plug in to our

17:23data center operating system you just

17:24give us bits and we'll run those bits

17:26and if those bits happen to be a docker

17:28a Docker image or a Rocket

17:33App Container specification we'll take

17:35those things and we can actually run

17:36them and so but the benefit is of course

17:38so first you can go create your

17:39container however you want to go create

17:41it and then the neat thing is you were

17:43able to deploy it in in a in a

17:45distributed way like where you don't

17:46where you're scaling in a highly

17:48efficient way without really realizing

17:49it yeah and when there are failures they

17:51get rescheduled when and when when when

17:53we want to do even smarter things like

17:55oversubscription because we want to we

17:57want to move that 85 percent unused

17:59resources to say 10% unused resources we

18:03can start to do all that just like an

18:05operating system does for you under the

18:06covers today on say your laptop and you

18:09just gave us you know the binary that we

18:10need to run whether it's a container

18:12image or whether it's or whether it's a

18:14you know some some real binary well so

18:16if I want to take a step back because to

18:19me this is what's so fascinating is that

18:21that what you're really doing is just

18:22changing what I view is the abstractions

18:24of an operating system and you're you're

18:26you're basically directly or by

18:28implication saying wow you know the

18:30abstractions that people deal with like

18:32the notion of having a virtual machine

18:33is just completely wrong and that we

18:35really need a new set of abstractions

18:38and to me what this feels like is is

18:40when virtual memory came out the

18:42abstraction just you blew your mind

18:44because you went from like I literally

18:46personally went from like figuring out

18:48where to put stuff in 640 K of memory to

18:51having having two gigabytes of memory

18:53yeah and and not only that but the

18:55address space was linear so I actually

18:56got to just you know just not worry

18:58about where it went whereas I I spent

19:00the first two years of my career like

19:02swap tuning code so I knew exactly where

19:04in memory it was gonna be and so it

19:06seems crazy to think like that like cuz

19:10aren't a bunch of hardcore people just

19:11gonna say no the problem is if I have a

19:13whole datacenter I'm gonna be better at

19:15organizing what goes where than some

19:17piece of software that doesn't know the

19:18loads the resource needs and why would

19:21why would the DCOS know better than me

19:24I'm a smart PhD yeah no no I I think I

19:29think that's exactly right I think that

19:31that what we're doing is is we're doing

19:34exactly what virtual memory did for for

19:37existing operating systems which is

19:38providing the abstractions so that we

19:41can really really effectively do the

19:42resource management the scheduling

19:44with the failures and I think just like

19:47what you saw in virtual memory there

19:48will probably be a lot of people who

19:50believe that they can do it better but

19:51time's going to show that actually we can

19:53start to do far more sophisticated

19:54things and we will be able to do far

19:57better scheduling for utilization for

19:59meeting SLAs for serving the customers
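
Editor's sketch (not part of the conversation): the utilization point above can be made concrete with a toy calculation. It compares dedicating whole machines to each workload against letting a scheduler pack everything into one shared pool. The workload names, the CPU counts, and the assumption that work divides cleanly into 1-CPU tasks are all made up for illustration; a real scheduler such as Mesos has to handle many more constraints.

```python
# Toy comparison of static partitioning vs. a shared resource pool.
# All names and numbers here are hypothetical, chosen only for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    cpus_needed: int


def machines_static(workloads, cpus_per_machine):
    """Dedicated machines: each workload rounds up to whole machines on its own."""
    total = 0
    for w in workloads:
        total += -(-w.cpus_needed // cpus_per_machine)  # ceiling division
    return total


def machines_shared(workloads, cpus_per_machine):
    """Shared pool: assumes work splits into 1-CPU tasks that can run anywhere."""
    total_cpus = sum(w.cpus_needed for w in workloads)
    return -(-total_cpus // cpus_per_machine)  # ceiling division


def utilization(workloads, machines, cpus_per_machine):
    """Fraction of purchased CPUs that are actually doing work."""
    used = sum(w.cpus_needed for w in workloads)
    return used / (machines * cpus_per_machine)


if __name__ == "__main__":
    demo = [
        Workload("kafka", 5),
        Workload("cassandra", 9),
        Workload("web", 3),
        Workload("batch", 2),
    ]
    for label, count in (
        ("static", machines_static(demo, 8)),
        ("shared", machines_shared(demo, 8)),
    ):
        print(label, count, "machines,", f"{utilization(demo, count, 8):.0%} utilized")
```

The gap between the two numbers is the rounding waste that each dedicated silo pays separately; pooling pays it once.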

20:03yeah and I think to me that that's just

20:06a super important point for folks to

20:07understand because in these kind of

20:09transitions when you're changing

20:10abstraction layers like you tend to

20:12there's this sort of management

20:14retrenching of like wow security is

20:16really important we know how to secure

20:18this so we're gonna stick with it even

20:20though you know and a few percentage of

20:22utilization won't change it even though

20:23the system isn't secure it's just

20:26comfortably insecure that's right yeah

20:28and and like it was great to be

20:31comfortable even though you were failing

20:33and so I I think that like for me that's

20:35the big transition that people gonna

20:37have to just sort of get over their own

20:38perceived expertise yeah and let

20:41computers do stuff that they're good at

20:42that's right yeah and then that's why I

20:44think bring pulling in analogies of the

20:46past is so valuable yes it helps to

20:49people start to realize you know what

20:50maybe yeah this is a good idea so so

20:54who's using it today yeah so the

20:58open-source components that make up a

21:00large part of the DC OS are used by a

21:02large number of companies today some of

21:04the biggest users out there are

21:05companies like Twitter Airbnb HubSpot

21:09eBay and and paypal are using it for

21:11running things Netflix is using it for

21:13running things some of the smaller

21:15companies without a lot of machines

21:16that's right ya know I mean one of the

21:19great things about the way that that the

21:20software has evolved over the years is

21:23we've we've made it so that it works

21:24well at small scale but it also scales

21:28and it works very very well for for the

21:29large scale guys and as hardware itself

21:32is starting to evolve in our data

21:33centers and maybe the rack is gonna

21:35start looking less like the rack or a

21:37machine is gonna start looking less like

21:38a machine you really need these levels

21:40of abstraction for both the small guys

21:42and for the big guys yeah it certainly

21:44seems to me that that one of the things

21:46that an operating system brings is it

21:47allows hardware to proceed at a

21:49different pace of innovation and so I

21:52when I look at DC OS I think wow this is

21:55really gonna free a set of people to go

21:57well let's just go replace our servers

21:59with arm servers let's go replace our

22:01networking infrastructure in a certain

22:02way because they'll be able to map those

22:05abstractions up yeah rather than today I

22:07mean you can't once you say it's a VM

22:09running this instruction set that

22:11assumes this level of you're stuck

22:13that's right yeah so a lot of people are

22:15looking at the stack of cloud today you

22:18know or we haven't even used a lot here

22:19because we're really focused on

22:20distributed operating system but you

22:22know and they think of platform as a

22:24service or infrastructure as a service

22:26and so to me like let's assume that this

22:28isn't platform as a service let's take

22:31that let's assume what we understand

22:34PaaS to be but but IaaS you know is

22:36definitely at this VM server level so

22:39why is this not an IaaS yeah I think yeah

22:42yeah yeah so one of the biggest

22:44differentiators between what we've done

22:46versus what they've done with the

22:48infrastructure as a

22:49service space is really try to provide

22:53these abstractions and these primitives

22:55that enable you building new distributed

22:58systems on top and again that's really

22:59what an operating system should be

23:00providing what infrastructure as a

23:02service provides to you is another

23:04machine you know it turns a physical

23:07machine into a virtual machine or maybe

23:09a virtual machine into a virtual

23:10machine when you're running say

23:12OpenStack on EC2 and that does not help

23:15the developer build another system it's

23:18the same primitive it's just kind of

23:19wrapped up and so really what you get

23:22from from something like data center

23:23operating system are the abstractions

23:25and primitives that make it easier to

23:26build new distributed systems and that's

23:28what makes it easier to then move those

23:30distributed systems from one organization

23:32to another organization because that's

23:33the abstraction that everybody has and

23:35they can use those yeah you know I think

23:37that this is super interesting because I

23:38think from a IT leadership and the

23:40enterprise perspective you know right

23:42now we're on the verge where everybody

23:44wants to move to cloud they don't know

23:45what that means and so they're very

23:46quickly virtualizing the servers that

23:50they have laying around and I'm a big

23:51believer that that's just not a useful a

23:53good use of time yeah I think it might

23:55be cost effective in some marginal way

23:56but a cost of moving and the bugs you

23:58introduced and stuff and so I think what

24:00would you say to sort of your typical

24:02enterprise CIO there's not really a

24:04typical but an enterprise CIO is

24:05overseeing a move like like what what is

24:09it that they

24:10should understand about moving to a

24:13Mesos kind of environment rather than

24:15take this intermediate step of doing a

24:17bunch more VM stuff or better managing

24:19VMs right right right

24:20yeah I mean I think one thing that's

24:22really really clear is that one of the

24:24nice things about a data center

24:25operating system is that it doesn't

24:27really compete with an infrastructure

24:29service at the end of the day because

24:30it's still about just taking all your

24:32resources whether those resources come

24:34from virtual or physical machines and

24:35using those resources effectively so for

24:38folks that do already have

24:39infrastructure as a service like

24:40deployments there's still a ton of value

24:42in using Mesos and the data center

24:45operating system because you still want

24:46to best take advantage of the resources

24:48that you already have again if you're

24:49just bunch of virtual machines and the

24:51same thing applies why something like

24:53Mesos and the DC OS is still so

24:55valuable in EC2-like environments on AWS

24:57is because again still you want to best

24:59take advantage of all the resources that

25:01you have but for people that are

25:02starting from scratch I think you can

25:04really now start to take a very close

25:06look on whether or not you need to go

25:08through that first level of

25:09virtualization or not and we've had a

25:12lot of reports of people that can go

25:14directly to using something like Mesos

25:16and the data center operating system and

25:18then you don't have to start paying that

25:2030% virtualization overhead for running

25:22your applications which can start to

25:24save a lot of money well because that's

25:26how I sort of think of it as as you know

25:29both our cost savings and then like if

25:31you're gonna go a Greenfield in like if

25:33you're gonna build a new expense app

25:34rather than just virtualize the old

25:36expense app you probably want to build

25:38it because you know it's never gonna use

25:39a whole rack yeah like so why would but

25:42you're gonna probably if you were to go

25:43build it you would dedicate the rack

25:44yeah and then you get all the overhead

25:46of a bunch of VMs and so it seems like

25:49you should just go straight to building

25:51it as a distributed app and then you'll

25:53have your thousand apps over the next

25:54ten years that get rewritten are all just

25:56gonna squeeze in and use the right

25:57amount of resource yes that's exactly

25:59right so but don't I want to go back one

26:01quick sec to the platform as a service

26:02because to me they're like platform as a

26:05service infrastructure as a service are

26:06sort of almost inherently connected in

26:09an inefficient way yeah like so what

26:11would you say that well oh no we're okay

26:13because we're just going to use you know

26:15a cloud vendors platform right but that

26:17doesn't solve the distributed yeah

26:19no I mean I mean what ends up happening

26:21at the end of the day with platform

26:23as a service is again it's so it's that's a

26:25high level abstraction on top of

26:27infrastructure as a service what platform as a

26:28service really solves is the fact that

26:29oh great from infrastructure as a service I

26:31got a bunch of machines now what do I do

26:32so platform as a service said okay well

26:34we'll abstract away the machines and

26:36we'll let you just run your tasks your

26:38your processes your apps whatever it is

26:40but but then you just run the processes

26:42and what you really want is you want to

26:43be able to launch those processes those

26:45applications and then you want those

26:46applications to be able to continue to

26:49execute by using the underlying

26:51infrastructure by calling back into

26:53something like the data center operating

26:55system and say hey now I need more

26:56resources or for us to be able to call

26:58into the apps the data center operating

27:00system could call into the apps and say hey

27:01this machine is going down for reboot

27:03because it's doing maintenance you

27:05should know about this just like in a

27:06normal operating system we actually did

27:07we actually you know you you do memory

27:10paging and that's the big distinguisher

27:12again between something like

27:13platform-as-a-service

27:14and the data center operating system is

27:15platform-as-a-service is about okay here's

27:17an app I run it I go and the data center

27:19operating system is about okay here's an app

27:21I run it and then while that app is

27:23running it uses the data center

27:24operating system to continue to run it

27:26calls back in it uses the system call

27:28API and as that API gets bigger and

27:31bigger and bigger it makes a really

27:33really rich environment for programmers

27:34to be able to build really sophisticated

27:35distributed applications one last

27:37question is I mean you just read a lot

27:39of stuff so I'll make it two parts a

27:42where can I get the stuff today yep and

27:44what can I do with it and then b like

27:46like what comes next
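
Editor's sketch (not part of the conversation): the distinction drawn at 26:42 to 27:35 is that a running app keeps talking to the data center operating system in both directions. The toy class below tries to capture only that shape. `ToyDatacenterOS` and every method name on it are hypothetical, invented for this sketch; this is not the Mesos API.

```python
# Toy two-way interface between an app and a "data center OS".
# Everything here is hypothetical; it only illustrates the call directions.


class ToyDatacenterOS:
    def __init__(self, free_cpus):
        self.free_cpus = free_cpus
        self.allocations = {}           # app name -> CPUs currently held
        self.maintenance_handlers = {}  # app name -> callback taking a machine name

    def request_more(self, app_name, cpus):
        """App -> OS: ask for more CPUs while running. Returns True if granted."""
        if cpus > self.free_cpus:
            return False
        self.free_cpus -= cpus
        self.allocations[app_name] = self.allocations.get(app_name, 0) + cpus
        return True

    def launch(self, app_name, cpus, on_maintenance):
        """The PaaS-like step: start an app with an initial allocation."""
        if not self.request_more(app_name, cpus):
            raise RuntimeError(f"not enough free CPUs to launch {app_name}")
        self.maintenance_handlers[app_name] = on_maintenance

    def announce_maintenance(self, machine):
        """OS -> app: warn every running app that a machine is going down."""
        for handler in self.maintenance_handlers.values():
            handler(machine)


if __name__ == "__main__":
    dcos = ToyDatacenterOS(free_cpus=10)
    dcos.launch("web", 4, lambda m: print("web: moving work off", m))
    print("grew:", dcos.request_more("web", 3))
    dcos.announce_maintenance("machine-17")
```

A launch-only platform would stop after `launch`; the other two methods are the ongoing "system call" traffic the guest describes.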

27:48yeah and go to mesos.apache.org and and

27:52that's where you can where you can learn

27:54a lot about the kernel itself the Mesos

27:56kernel and the new stuff that was a

28:02second yeah yeah this effort was like

28:04well tell everybody now that they've

28:05absorbed all this what's coming next

28:06yeah yeah so the new stuff's the most fun

28:10stuff to me it's really where we start

28:13to take the beginning steps of what it

28:15means to be you know a data center

28:17operating system and take it to the next

28:18level and it means we start to take the

28:20things that historically have been

28:22really really tough to run regardless of

28:25whether or not you've used higher levels

28:26of abstraction like things like PaaSes

28:28or infrastructure as a service like

28:30stateful services

28:32and we get to start running those things

28:33in a really really really effective way

28:35in the data center that historically

28:38have required a lot of humans to

28:40actually deal with that kind of stuff so

28:41there are two examples I want to give

28:43here two primitives that are being built

28:44that I think are really really cool one

28:47primitive we're building is this

28:48notion of maintenance so because we have

28:51this software layer the kernel

28:53actually running in our data center

28:55operating system when the applications

28:56are running on top we can have it start

28:59to actually deal with maintenance of

29:03things that are happening in your data

29:05center so for example when a machine or

29:06a rack needs to go offline we can have the

29:09software talk to the other software and

29:11say hey you know what this machine is

29:12going down for repair you should you

29:15know we need to reschedule you or you

29:16should get rescheduled you need to move

29:18data let's treat it like it was a

29:19failure but a planned failure that's

29:20right that's right it's a failure but a

29:21planned one that's exactly right and that

29:23this is this is huge because usually the

29:25way this works in most most

29:26organizations is a human walks up to

29:28another human and says hey I'm going to

29:29be taking this rack down what can we

29:31actually do about this we can turn this

29:32into software right and the analogy that

29:35I'd like to give from just

29:37traditional operating systems is the

29:40operating systems today would do things

29:41like page out memory but what they do is

29:44they just they just say hey you know

29:45we're gonna use the LRU algorithm we're

29:47gonna page out the least recently used

29:49and that doesn't always work great and

29:50it wouldn't it be better if actually the

29:52operating system could work with the

29:54applications right on top to do smarter

29:55things when it comes to failures or

29:58needing more resources whatever it is

29:59and that I think is like that realm of

30:01things is is to me one of the most

30:03exciting things that we're going to be

30:04working on because we get to reimagine a

30:06lot of the basic primitives that existed

30:08for single machines and rebuild them in

30:11a way that makes sense in a distributed

30:12environment and make sense for people

30:15that want to do things in a smarter way

30:17sort of what we've been working with for

30:19a lot at a scale that people can only

30:21imagine yeah at a scale that it's it's

30:23already hard enough to do it manually

30:25and so we have to do it in software

30:28based ways and so so we can do that

30:29awesome well thanks so much this has

30:32been Benjamin Hindman

30:34from Mesosphere and I'm Steven

30:36Sinofsky signing off this episode of the

30:38a16z podcast thanks everybody great

30:41thank you
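
Editor's sketch (not part of the conversation): the maintenance primitive described at 28:47 to 29:32 treats taking a machine offline as "a failure but a planned failure", handled in software rather than by one person asking another. A minimal way to picture that is a drain step that moves tasks elsewhere before the machine goes down. The function name, the one-slot-per-task model, and the first-machine-with-room policy are assumptions made for this sketch, not how Mesos implements it.

```python
# Toy "planned failure": drain a machine before it goes offline.
# The data model (one slot per task) and placement policy are hypothetical.


def drain_machine(placement, capacity, machine):
    """Move every task off `machine` onto other machines with spare slots.

    placement: dict of machine name -> list of task names (mutated in place)
    capacity:  dict of machine name -> maximum number of tasks
    Returns a list of (task, new_machine) moves.
    Raises RuntimeError if a task has nowhere to go; moves made before the
    error are not rolled back in this sketch.
    """
    moves = []
    for task in list(placement.get(machine, [])):
        # Pick the first other machine that still has a free slot.
        target = next(
            (m for m in placement
             if m != machine and len(placement[m]) < capacity[m]),
            None,
        )
        if target is None:
            raise RuntimeError(f"no spare capacity to move {task}")
        placement[machine].remove(task)
        placement[target].append(task)
        moves.append((task, target))
    return moves


if __name__ == "__main__":
    placement = {"m1": ["a", "b"], "m2": ["c"], "m3": []}
    capacity = {"m1": 2, "m2": 2, "m3": 2}
    for task, target in drain_machine(placement, capacity, "m1"):
        print(f"rescheduled {task} -> {target}")
```

An unplanned failure would run the same rescheduling after the tasks had already died; announcing it ahead of time lets the move happen first.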
