Hello everyone, welcome to the a16z podcast. I'm Sonal, and I'm here today with Matei Zaharia, the CTO and co-founder of Databricks, which is the primary company driving and developing Spark. We're actually just coming out of the Spark Summit, which took place this week, and it's one of the biggest events for developers who are working on Spark, for companies that are interested in Spark, and pretty much for anyone who cares about trends in the big data space. Just to start off, Matei, start by giving us a description of what Spark is.
So Spark is software for processing large volumes of data on a cluster, and the things that make it unique are, first of all, it has a very powerful programming model that lets you do many kinds of advanced analytics and processing, such as machine learning or graph computation or stream processing, and second, it's designed to be very easy to use, much easier to use than previous systems for working with large data.
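[A minimal sketch of the kind of concise program that programming model allows, not code from the episode: a distributed word count written against Spark's Python RDD API, with a hypothetical input path.]

    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")

    # Load a (hypothetical) large text file from cluster storage, split it into words,
    # and count how often each word appears across the whole dataset.
    counts = (sc.textFile("hdfs:///data/pages.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    # Bring only the ten most frequent words back to the driver.
    print(counts.takeOrdered(10, key=lambda kv: -kv[1]))

    sc.stop()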
So what were some of the previous systems for working with large data sets before Spark?

The most widely used system was probably MapReduce, which was invented at Google and popularized through the open source Hadoop project. MapReduce itself was a major step over just writing distributed programs from scratch, but it was still very difficult to adopt and use, and it led to very complicated applications and also very poor performance in some of them.

So what were some of the reasons for inventing Spark in the first place, then? I mean, besides the problems and the limitations of that, were you just trying to solve the problems of MapReduce, or were you actually trying to do something more?
Yeah, it's a good question. So we started building Spark after several years of working on MapReduce and working with companies that were very early in using MapReduce. I was a PhD student at UC Berkeley, and we actually started working with Hadoop users back in 2007. I did, for example, an internship at Facebook when Facebook was only about 300 people and they were just starting to set up Hadoop, and in all these companies I saw that there was a lot of potential in putting together large data sets and processing them on a cluster, but they were all hitting the same kinds of limitations and they all wanted to do more with it. So basically, our lab set out to address these limitations.
Let's take a step back for a moment and talk again about your experience at Facebook, though. Why was the problem challenging there? I mean, big data has been around forever, so what about that problem was interesting and different that made you want something better than what you already had? What was it about that data, I guess?
So Facebook, like other companies starting to use this kind of data, was able to collect a lot of very valuable data about how users were interacting with it, and Facebook was also growing very quickly, so they were adding many tens of millions of users every few months, and they definitely couldn't just talk to every user, or even send someone to every country where Facebook was used, and figure out how people were using the site. So they needed to use this data to improve the user experience. There were two things that made it especially challenging: one was the scale of the data, which was much higher than you could handle with traditional tools, and the second one was how many different people within Facebook wanted to interact with it. It wasn't just one person or one team doing it; it was many people, often without that many technical skills, so it needed to be very easy to work with this data.
What was the limitation? Why wasn't what was in place, MapReduce I guess you're describing, enough?
So MapReduce was actually great for running sort of large batch jobs that scan through the whole data set, summarize all of it, and give you an answer, but it was designed mainly for jobs that take tens of minutes to hours. MapReduce initially came out of Google, where it was used for web indexing, and the whole point was, we'll run this giant job every night and in the morning it's built a new index of the web. But what Facebook wanted to do was different: they had a lot of questions that they wanted to ask almost interactively, where there's a person sitting there who launches a question at the cluster and needs to get an answer back, and MapReduce wasn't very well suited for that.
So it's more like the move fast and break things kind of model at Facebook, where you want to just deploy code, or, more importantly, quickly iterate in real time; you've got your feature testing.
Yeah, you know, different feature tests. But just in general, when you work with data you want to ask multiple questions repeatedly when you're doing ad hoc exploration of the data, as opposed to when you have a certain application where you know, okay, I'm just going to run this every night.
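[A minimal sketch of that ad hoc pattern, not code from the episode: load a data set once, cache it in cluster memory, and then ask several different questions against it without rescanning storage each time. The file path and column layout of the event log are hypothetical.]

    from pyspark import SparkContext

    sc = SparkContext(appName="AdHocExploration")

    # Parse a (hypothetical) tab-separated event log and keep it cached in memory,
    # so repeated questions don't have to re-read it from disk.
    events = (sc.textFile("hdfs:///data/events.tsv")
                .map(lambda line: line.split("\t"))
                .cache())

    # Question 1: how many events came from each country? (assumes column 2 is a country code)
    by_country = events.map(lambda e: (e[2], 1)).reduceByKey(lambda a, b: a + b)
    print(by_country.collect())

    # Question 2: how many distinct users are in the log? (assumes column 0 is a user id)
    print(events.map(lambda e: e[0]).distinct().count())

    sc.stop()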
So the second feature that you mentioned, the usability: I mean, I feel like it's obvious we care about things being pretty and easy to use, because no one wants to deal with a kludgy interface, but does that really matter in this case? Because if you're really an insider who knows how to work with data, do you even really need to care about that?
The interesting thing is, nobody wants only the insiders to work with data; basically everyone wants to be able to access it directly. Actually, there was a great keynote about this at the Spark Summit by Gloria Lau, where she said that the insiders themselves also don't want to be answering questions for other people. If they have the specialized skills, if they're data scientists for example, they'd prefer to be doing something advanced, and the other users outside would prefer to ask the questions themselves. So ease of use was quite important.
So one of the most interesting announcements that came out of the Spark Summit, that I think a lot of people saw, was the announcement that IBM is backing Spark.
Basically, IBM is doing two things. First of all, it's investing in the development of Spark, in a similar way to how they've invested in other technologies such as Java or Linux; they see Spark as a key technology and are actually putting developer resources behind it. And second, IBM is also moving some of its internal products and product lines to use Spark, or to offer Spark to customers, because they think it will improve these products to build them on Spark.
So does that mean that IBM is basically making a big bet on the cloud, in a bigger way than ever before?
I think it's much more than the cloud. Even though the cloud is one area that IBM is interested in expanding, they have huge business lines that are completely unrelated to the cloud, in solutions and consulting, and also in products and services such as Watson or database products and so on, and I hope that what will happen is really great integration between these products and Spark.
Was there any other major news or interesting things that you saw, any other interesting keynotes or trends for the audience to know about, that came out of Spark Summit this year?
One of the coolest ones I saw was a talk from Toyota about how they use Spark to basically look at social media feedback, what people are writing about their cars, and figure out things like, is there a problem with the brakes, and how they can improve their products as a result of this.
So you have a car company that's basically using social media as big data, because it's real-time and it's fast and they probably got a ton of feedback, and they're using Spark to figure out how to change their product in real time?
Right, they're not using it to deal with it in real time, but they're using it to get a lot more insight into how their vehicles behave once they're out there. So one example is, if people report, say, some noise coming from the brakes or something like that, they might hear about it if a person goes and talks to their mechanic, but there might be many other people who just ask their friends online: look, I hear this kind of weird scratching sound from my car. And it's actually very hard to take this kind of text data, where people have many, many different ways of describing, say, this noise problem, and actually understand it and classify it, cluster together all the messages about a specific problem. But for the engineers at Toyota who are trying to design the next car, or to figure out any potential issues with the current components, it's very useful to have this.
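[A minimal sketch of that kind of text clustering with Spark's MLlib, not Toyota's actual pipeline: TF-IDF features plus k-means is just one reasonable way to group free-text messages that describe the same problem in different words, and the input path is hypothetical.]

    from pyspark import SparkContext
    from pyspark.mllib.feature import HashingTF, IDF
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="FeedbackClustering")

    # One (hypothetical) social media message per line, lower-cased and tokenized.
    messages = sc.textFile("hdfs:///data/feedback.txt").map(lambda m: m.lower().split())

    # Turn each message into a TF-IDF feature vector.
    tf = HashingTF(numFeatures=10000).transform(messages)
    tf.cache()
    tfidf = IDF().fit(tf).transform(tf)

    # Group similar messages, e.g. different phrasings of the same brake-noise complaint.
    model = KMeans.train(tfidf, k=20, maxIterations=10)
    print(model.clusterCenters[:1])  # inspect one learned cluster center

    sc.stop()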
That makes a lot of sense. So they're basically analyzing this big social media data set. Were there many other interesting things that you saw coming out of it, apart from Toyota?

We also saw a lot of other great companies starting to talk about their use of Spark. We've known about some of them for a while, but it's nice to see companies talking publicly about interesting applications they've built. So some of the other highlights, for example, included Netflix and Capital One at the summit; we also had Goldman Sachs, and Novartis in biotech.
One other question, just on the big picture side. You know, I'm fascinated by the tension between open and closed when it comes to the open source topic and the open source community as well, and one of the things that I think is really interesting is that every open side has a closed side as well, and the evolution of open source, historically as well as now, always involves big corporate players getting in the game, as well as a lot of individual developers and academics. How does that sort of affect the ecosystem in general? I'd love to hear your thoughts about that, and also how that applies to Spark.
So the Spark community is already very large; it's actually the most active open source project in data processing in general, as far as we can tell, and it has many companies participating in it. I think we've been able to grow the community in a way that everyone gets what they want out of it, and there aren't really any major tensions between the companies working on it, or people trying to keep certain things closed or not.
Is there something unique about this that's allowing that to happen?

I think it's just a matter of the culture in the project, and setting it up early on to be very welcoming to contributors and to actually have people converge on how they're going to work together, how they keep it stable and reliable. We started doing that back in the UC Berkeley days, when we had other people contribute to it, and I think we just have that pattern in place, and it's sort of best for every contributor that's involved.
There have been plenty of open source projects that have not reached this kind of scale or grown as fast, and the ecosystem you're describing, it's just not random chance that it ended up there. I guess what I'm really interested in is that you are the creator of one of the most popular and fastest-growing open source projects ever. What are some of the ingredients of a successful open source project? What does it take to get here, for other people working in open source?
Yeah, so I should say from the beginning that we certainly didn't imagine that Spark would be this widely used when we started, and it's kind of been a feedback loop: as we saw people being excited about it, we also decided to spend a lot of time to make it better and to foster the community around it. But I think there are several things that helped. So first of all, Spark was actually tackling a real problem that people had, and that more and more people were beginning to have in the future, which was the problem of working with really large-scale data sets, so it kind of resonated with a real need people...

Already had?

Yes, exactly, and at that time there wasn't much else they could use, so obviously that's helpful. Second, we had a really fantastic set of people working on the project, and I think that's especially important when a project goes from the original, initial team to bigger and bigger teams, because you're going to need a really great team to work with it in the future. The people at UC Berkeley and the people we got at Databricks to work on Spark are just a really fantastic team that's able to build great software very quickly.

So the people aspect of it is exactly because it is a community, open source.
Yes, definitely. And third, apart from the sort of core team that was around early on, we tried from the beginning to be very engaged with the community and to foster new contributors, help them actually contribute to the project and learn how to do things and actually participate, and there are hundreds of people that do that. Many people might only send in one or two patches, but we've tried to make the barrier for that extremely low so that they can actually help out. It takes some effort to do that, because at the beginning, if you're someone working on it every day and then someone comes in and wants help to get some idea in, it's always faster for you to do it yourself than to help this other person, but you have to do that, you have to help them get set up and help them contribute, in order to actually grow the total set of people who can contribute.
So is there just a lot of documentation on the project then, or, I mean, what really makes it easier for them to be able to contribute? Like what, concretely?
Yeah, so there are several things. First, even before anyone contributes, they have to be able to use it, so there's been a lot of focus on making Spark very easy to download and use, and on having as much documentation and as many examples out of the box as possible, and we're still doing a lot to expand this, actually. The second thing you need is, once people are actually trying to send in patches or to understand something about the project, you do have to talk with them, to review the patches and so on, and help them actually get them in. And the third thing that's really important to keep a project moving quickly is just really great infrastructure for testing, for checking the quality, for making sure that it continues to be good, and by investing in this kind of infrastructure, much the same as you do in any other engineering organization, you can then end up moving a lot faster. So these are the things we spend time on.
The most interesting thing about open source is the ecosystem that grows up around it, because otherwise why would it even be open source? Could you talk a little bit more about that? I've actually seen you share a chart that shows a really rich ecosystem growing around Spark, and why that matters.
Yeah, definitely. So over the past few years we've seen a lot of other open source projects integrate with Spark and build on top of it, and really put together this set of software you can use together to build applications. So in particular, one of the things we've seen is that many of the projects that were built on top of Hadoop, such as Hive, which is SQL processing at scale, and Pig, and Mahout for machine learning, are starting to run on top of Spark as well, so that users of those can get the speedups and the benefits from using Spark. And we've also seen quite a few of the data storage projects, for example MongoDB or Cassandra or Tachyon or many of the NoSQL key-value stores, now connecting to Spark and offering ways to read the data in, and for Spark users that's exciting because it means they can write an application against Spark and use it against data in all these storage systems; they don't have to change the application to talk to each one. So I think, even beyond the activity happening in Spark itself, these projects on top and on the side are one of the most valuable things for the users.
And in fact, isn't Spark itself able to sit on top of Hadoop, for those that use the Hadoop file storage?
Yeah, definitely. From the beginning we designed Spark to sit on top of Hadoop and to talk to Hadoop, but we also left it open so that you can run it in other environments, and basically our philosophy is to try to integrate with all these kinds of environments where data can be stored, and give users one really simple way to work with the data no matter where it is.
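[A minimal sketch of that idea using Spark's DataFrame reader interface, not code from the episode; the paths, table names, and analysis are hypothetical. Built-in formats like JSON and Parquet work out of the box, while external systems such as Cassandra use the same .format(...) pattern once their connector package is installed, which is an assumption about your setup.]

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("AnyStorage").getOrCreate()

    def top_countries(df):
        # The same analysis, regardless of where the data came from
        # (assumes the data has a "country" column).
        return df.groupBy("country").count().orderBy("count", ascending=False).limit(10)

    # Same application code, different storage systems:
    events_json = spark.read.json("hdfs:///data/events.json")        # files on HDFS
    events_parquet = spark.read.parquet("s3a://my-bucket/events/")   # files on S3
    # events_cassandra = (spark.read.format("org.apache.spark.sql.cassandra")
    #                          .options(keyspace="app", table="events").load())  # needs the Cassandra connector

    top_countries(events_json).show()
    top_countries(events_parquet).show()

    spark.stop()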
You've talked about the community, you've talked about all the successful ingredients of an open source project, but I guess I'm kind of fascinated by the story of an inventor, and we don't often have an opportunity to talk to the inventor of a really interesting thing, and I want to hear a little bit more color on what that was like.
One of the most interesting ones was Lester Mackey, who was a PhD student in the same year as me, and he was actually on the team that got second place in the Netflix challenge; they were extremely close to winning the whole challenge.
What was the Netflix challenge? It would be good for our audience, to remind people what it was.
Yeah, the Netflix challenge was this $1,000,000 prize to improve the accuracy of movie recommendations on Netflix. So Netflix released this data set and said, currently we can predict a user's score for a movie to, you know, something like within 0.85, and if you can decrease that to just 0.8 or something like that, we'll give you a million dollars. So it was an open challenge, very large-scale, to encourage innovation in this area.
Okay, so what happened with Lester?

Yeah, so Lester was one of the people who was developing a ton of new algorithms, and combinations of algorithms, for this problem. He had this pretty large data set, especially for that time, and he wanted to run these very quickly, so actually one of the applications that I first tried to support in Spark was the recommendation algorithm he was working on.
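[Spark's MLlib library includes a collaborative filtering algorithm, alternating least squares, for exactly this kind of movie recommendation problem; here is a minimal sketch of it with a hypothetical ratings file, and it is not meant to be the specific algorithm Lester was working on.]

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="MovieRecs")

    # Each (hypothetical) input line: userId,movieId,rating
    ratings = (sc.textFile("hdfs:///data/ratings.csv")
                 .map(lambda line: line.split(","))
                 .map(lambda r: Rating(int(r[0]), int(r[1]), float(r[2]))))

    # Train a collaborative filtering model with 10 latent factors.
    model = ALS.train(ratings, rank=10, iterations=10)

    # Recommend five movies for user 42.
    print(model.recommendProducts(42, 5))

    sc.stop()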
So did he win the Netflix prize?

Well, his team won second place, so they didn't quite make it.

Did they at least get something for winning second place? I mean, first place got a million dollars.

I'm actually not sure.
Let's talk about another transition here, which is the transition from an open source project to becoming a commercial one, and part of what you guys are doing, obviously, is involved in that. Can you talk a little bit more about that transition, what it takes to move from open source to a commercial application that people actually use and have expectations of?
Yeah, definitely. So as we saw Spark do very well in the open source domain, we wanted to start a company around it to push on it really hard and to bring it to a much wider class of commercial users, and we really wanted to find a model that lets it continue to be fully open source and continue to be a successful project that way for everyone participating in it. Traditionally there has always been a tension in companies that try to commercialize open source projects, because they build all this great stuff and then they kind of give it away for free, and it's this tension between, are there some things we just shouldn't put into it, or how else can we actually power a successful business around it? So the way we're doing this at Databricks is actually quite different, and I think it's a very nice, very powerful model for doing this, which is that we're offering Spark in a cloud service.
Let's talk about why that is. Like, why are you guys doing that, and actually, why is that so different? Why don't other people do that?
We're offering Spark as a cloud service, and what that means is, it's the same Spark that anyone else gets in the open source; all the libraries, all the improvements we put into the engine, you can just download them and run them yourself, or, if you want, you can talk to a vendor that provides support on it. And there isn't any tension for us between, do we put something in Spark, or does it become some kind of premium feature.
Thank you, Matei, it's great hearing your story and the evolution of Spark and what it is. Thank you, everyone, and that's another episode of the a16z podcast.