Hi everyone, welcome to the a16z podcast. I'm Sonal. Today's episode, moderated by a16z board partner Steven Sinofsky, is all about the data edge in machine learning startups. The conversation covers everything from machine learning algorithms, academic papers versus products, and how startups can compete with big companies on the data front, to machine learning as a service and how to tease apart hype versus reality.

Joining us to have this conversation are Jensen Harris, the CTO and co-founder of Textio, which is a platform for augmented writing of business documents like job descriptions. Let me tell you a little bit more about what they do, only as it's relevant for the discussion on this podcast: their platform provides, in real time, quantitative guidance based on evidence that's continually mined from tens of millions of documents. We also have AJ Shankar, CEO and co-founder of Everlaw, which is an a16z portfolio company. He's a computer scientist who fell into law, and not in a bad way: they help lawyers sift through huge piles of evidence to find the proverbial needle in the haystack for litigation discovery, and for helping build the narrative of a convincing case.

Just to give a quick sense of the contrast between people and machines here: this is work that would take people hours of manual, linear review, often on very tight deadlines, like booking a warehouse over a single weekend just to sort through everything. Now, with cases that involve over 10 million documents and terabytes and terabytes of data, machine learning helps sift through all of that at both a content and a context level, to figure out who's saying what, what they're talking about, and why. Now, over to Steven.

So, the category of e-discovery,
with or without machine learning, is a big category. It's been established for a long time, and there's a lot of legacy tech, but the big players are there: Google has an e-discovery product, Microsoft has an e-discovery product. So what is it that makes it so that, hey, they have all the data, it's all in Gmail already, and yet they don't have this huge advantage over a startup? What is it that really enables a startup to have an advantage in this machine learning world? Because a lot of people just believe that deep learning is a big-company thing: they have all the data, and they're just gonna win.

It's a great question to ask for
any startup that's looking at machine learning, right? Why try, when some big co will have all the data and all the expertise? So, the first compelling reason is that there are many areas that these big companies don't actually invest in, that they don't care about. I'll discuss legal tech as best I can, but legal, I actually think, is one of them. Medical; HR, at the level that you guys are operating at; agriculture; fintech: these are huge, valuable areas where Google or Microsoft isn't building domain-specific products. They're fundamentally, by and large, generic B2B or B2C companies that address broad areas. So if you go for a narrow area, there's a great opportunity.
The second thing: there are a couple of other reasons why you can have an advantage, or at least be on equal footing. One is if the data sets you're dealing with are isolated, and having more data doesn't help you that much more. In our case, the individual matters that we deal with are largely, and should be, isolated. (A matter is like a specific case.) Yeah, exactly, a case. It's totally fine to look at them independently, by and large. A particular data set in a particular litigation context will have a particular set of documents that are actually interesting, and that set might differ depending on the context. So we don't even know, coming in, what's interesting until someone says, here's why this is interesting for this particular case. The training has to happen every time, in a case-by-case setting, and you can get pretty far with that. Another
area, of course, is where you can get a lot of data yourself, fast. If you're a self-driving car company, even a small one, you drive a couple hundred miles and you have an unbelievable amount of data to start working with. But the third thing to unpack there: you mentioned deep learning as being this thing that belongs to big companies, and in many cases that is true, for the big problems they're trying to solve. We saw breakthroughs in transcription and translation and AlphaGo and all the rest. Yeah, exactly; that's obviously the most fundamental one: if you can find a cat in a picture, you've probably solved everything. But most ML implementations are not deep learning; they're not neural networks. There is a ton of implementations that are incredibly valuable and don't involve a neural network. For example, what IBM calls Watson, which is not some brain that solves every problem but a whole agglomeration of different techniques, is largely regression, which is an incredibly powerful technique. The poker bot that won heads-up poker recently? Also not a neural network. So there are many areas where, if you know what to optimize, statistical machine learning will take you really, really far, and there's no reason to shy away from it; it can be powerful in the right domain.
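To make that concrete, here is a toy sketch of the kind of statistical machine learning being described: a plain logistic regression classifier trained with gradient descent, no neural network involved. The features, labels, and "relevance" framing are all invented for illustration; this is not either company's actual model.

```python
# Toy sketch: logistic regression via gradient descent on a tiny,
# made-up "relevant vs. not relevant" data set. No deep learning needed.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=2000):
    """Fit weights w and bias b by minimizing logistic loss, one sample at a time."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5

# Hypothetical features: (keyword overlap, sender importance); label = relevant?
X = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.6), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2)]
y = [1, 1, 1, 0, 0, 0]

w, b = train(X, y)
print(all(predict(w, b, xi) == yi for xi, yi in zip(X, y)))  # -> True on this toy set
```

Six labeled rows are enough here; the point is that when the data is small but tuned to the problem, a simple, interpretable model can already separate it.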
One of the things I hear you saying, a little bit, is that if you're getting started in your company and you have access to a data source, it doesn't necessarily mean your first reaction should be to get a VM going, start Torch up, and figure out what to go do. There's a whole bunch of work to decide what type of learning technique to apply to what you're doing. Yeah, I would definitely suggest that. I feel it's better to actually understand your own domain and the problem you're trying to solve, and then figure out how machine learning works into it, as opposed to saying, hey, I'm an expert in ML, let me just do something; here's the hammer.
I guess, all things being equal, you'd like to have more data versus less data. But actually having the right data, having it be data that is tuned for the problem you're trying to solve, and then having a purpose-built stack that lets you build the right user experience, the right service, and the right business on top of it, oftentimes doesn't take hundreds of millions of rows of data to achieve. You can actually get pretty good prediction in a lot of areas with a relatively small data set, if you build the right product on top of it. So the value that you get out of a machine learning product isn't always about how much data is underneath it. I think that's a really important thing, especially in the early days of a startup, when you're just not gonna have the huge proprietary data set. Can you get the right data set, and can you use techniques that are appropriate for that size of data set, to find the interesting wedge: the one that lets you find the smoking gun in the discovery, or the thing that all of a sudden brings twenty percent more people to apply for a job? And I think all of the algorithmic stuff in machine learning is going to be commodity. There are like twenty places in the world where they're inventing new algorithms, at educational institutions and huge companies, and that's really important, but that stuff more and more rapidly ends up in the public domain anyway. So it's not so much about that as it is: can you craft the thing that blends machine learning with other techniques, with statistical techniques, with user experience techniques, to build a product that has actual value? In navigating these
domains, you've both sort of achieved product-market fit using machine learning as an ingredient, but not doing it in the way that we read so much about, like, hey, we're gonna do machine learning here, so we got all the images on Earth; or, we're gonna do machine learning here, and we've ingested all written words to do translation between two languages. That to me is super interesting, because it really says that for product-market fit there's a whole bunch of stuff beyond a giant corpus of data and the latest deep learning algorithm. I think that's the
difference between an academic paper and a company. For the academic paper, the sole concern is: given this corpus, what can I extract from it? We're trying to solve people's problems in the real world, and those problems are rarely reduced to just extracting the information content of a corpus. Whatever problem people have, you attack it from many angles, in many different ways. I would say our machine learning component is one of the features, but there are many others, because you can't just solve this problem with that one technique. Well, not just that, but you can go after huge data, right? You can go scrape the web to try to amass some database of data, and if it's bad data, or if it doesn't have an outcome associated with it, it doesn't matter how many thousands or millions or tens of millions of records you have. It's garbage in, garbage out: you're not gonna have a good product at the end. What you're gonna have is a slide that says you have ten million documents, and a product that no one wants. I want to, then, building
on this, since we're going in an interesting direction, have you share more: your path to product-market fit involves a lot of other code. You don't have just a command-line tool that takes a job description in; you've built a whole system around it. Yeah, so the first thing is that even within the predictive engine, the brain of it, machine learning is one part. It turns out that to get the best results, it had to be combined with statistics, traditional natural language processing, and heuristic analysis, and all of these things blended together actually worked much better than just a single model. Oh, and then you built a word processor. Well, then of course, on top of it, you have the other really important thing, which is user experience. From the time you type something and lift the key, we have 300 milliseconds to round-trip your whole document down into the cloud, get all the predictions, bring them back up, and light them up on screen, so that it's as fast as spell check. To do that, you have to build the editor, you have to build the word-processor piece, and you have to build, of course, authentication and all the other pieces. And there's a huge advantage that a startup has that some big company trying to do the same thing doesn't, which is that we can tailor our security policies, tailor the way we handle the data, the way we sanitize the data, and the way we use the data, in a very clear way that a blanket terms of service covering, for instance, all of Google's services or all of Microsoft's services can't. Yeah,
you're a full-on enterprise thing, so you've got all of that. That's right. So all of that is part of it: the sign-in, and you have to be able to share documents, so you have to build a document library; you have the whole service stack; and you have the whole monitoring stack, so you can tell whether or not people are using it, how much they're using it, and things like that. And so the heart of it, of course, is the set of pieces that are about the predictive engine, but that's not all of it. In fact, we found in our earliest days, our first six months, that we didn't have to really ingest tens of billions of things. What we really needed was tens of thousands, or hundreds of thousands, of really good pieces of data, with the results, with the outcomes, associated with them. Then we could build the right models, figure out the kind of data that we needed, and build the right experience on top of it, to figure out whether we had product-market fit. We could have spent a year doing nothing but aggregating data and getting PhDs.
Both of you are post-Series A startups. You're not super, super far along, but you both have a SaaS business: you have ongoing revenue, you have many customers signed up. What I think is the rich thing here is that you basically are, in my view, modern SaaS companies. You've taken machine learning, and just like a SaaS company would say before, yeah, we have hosting and we use a database, now you also have machine learning as just part of your ongoing value proposition. It's just like an ingredient. That's right, and that's how it should be, in my view. There's certainly a ton of hype around machine learning, and with recent advances there should be a lot of hype; it's really compelling. But I do feel like, typically, when companies are hyping up the machine learning component, they might be doing it more to raise money than to provide value to the customer. The customer, of course, wants to know what this is doing for them at the end of the day, not what algorithms you're using to do it. So I think it should be packaged as part of a broader message about what you're doing for them. Yeah, it's not clear that you're gonna pay just for machine learning; you're paying for solving a business problem, if you're a customer. So the core lesson there is: make sure you're articulating
that. Maybe you could just tick off a little bit of the core tech that you did use. Are either of you using any of the open-source things that people have heard about, and how are you integrating them? I mean, we use a whole bunch of stuff. I would say that's a terrible answer. So, we use a lot of the open-source tools that are available; Spark, I think, is the go-to one for us now. But we also spend a lot of time on how we get clean data into that system. We basically take stuff from a bunch of different sources and synthesize it to get good results. A lot of what we do, actually, to your point earlier, is about the notion that cleaning up the data is so important to getting good results. Garbage in, garbage out is certainly as true in AI as in anything else, and so a lot of what we do is take these off-the-shelf components but spend a lot of effort on putting really clean input into them. One brief example of that, and this is the kind of thing where domain specificity, understanding a user's workflow, and providing a tool that's catered to how they like to work matters; again, not a big-company skill. Sure, sure;
yeah, I would imagine Microsoft doesn't know a ton about this one specific kind of thing. So here's the example. If you have an email thread in your discovery process, there are all these emails, and we automatically thread them for you. Sometimes people will say, well, this is the really interesting email in the thread; I'm gonna mark it as really relevant, but I'm gonna mark these other emails as not relevant, because this is the one that encapsulates everything. But by and large, there's actually a lot of shared content between the emails in a thread, the replies and all that other stuff. So if you were to feed all of these into a machine, and one email was hot and the others were cold, but they largely shared the same content, it's going to mitigate that signal. It's gonna muddle it, because you suddenly have these two opposite ends of the spectrum with the same data. We're able to take that kind of information and clean it up, because we know what all the threads are. We look at exact duplicates and near-duplicates in the data and say, these things are actually really similar; they probably picked this one because they thought it was interesting, and they don't want to look at the other duplicates, but we don't want those to muddle the signal. So we have all these other tools that go into analyzing the data coming in, so that it's really clean when it comes into these algorithms, and that actually helps with performance.
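As a hypothetical sketch of that cleaning step (this is not Everlaw's actual pipeline; the shingling fingerprint, the similarity threshold, and the sample emails are all invented for illustration), one way to keep conflicting labels on near-duplicate documents out of a training set:

```python
# Hypothetical sketch: drop near-duplicate documents that carry conflicting
# relevance labels, so opposite labels on shared content don't muddle training.
import re

def shingles(text, k=3):
    """Word k-grams used as a crude near-duplicate fingerprint."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def drop_conflicting_duplicates(docs, threshold=0.8):
    """docs: list of (text, label). Keep a doc only if no near-duplicate
    (Jaccard similarity over shingles >= threshold) has the opposite label."""
    fps = [shingles(t) for t, _ in docs]
    keep = []
    for i, (text, label) in enumerate(docs):
        conflict = any(
            j != i and lab != label and jaccard(fps[i], fps[j]) >= threshold
            for j, (_, lab) in enumerate(docs)
        )
        if not conflict:
            keep.append((text, label))
    return keep

docs = [  # made-up emails: (text, relevant?)
    ("re: merger please review the attached draft agreement today", 1),
    ("re: merger please review the attached draft agreement today thanks", 0),
    ("lunch on friday anyone interested in thai food", 0),
]
clean = drop_conflicting_duplicates(docs)
print(len(clean))  # -> 1: the two conflicting near-duplicates are dropped
```

At real scale you would lean on the known thread structure, or a MinHash-style index, rather than the quadratic pairwise comparison here; the point is only that label conflicts on shared content get resolved before the model ever sees them.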
That's a hugely important point. We use probably many of the same technologies; we use Spark, as pretty much anyone who has to do massively parallel data analysis does. But our actual machine learning algorithms, and the core NLP stuff we do, use the standard Python libraries that you can go download and use. Where we have put an enormous amount of time is into our data-processing pipeline, and that is the very tailored thing that we write. It does the same things you mentioned: take all the job listings that come in, in all sorts of different formats; dedupe them; clean them; find the outcomes; normalize the outcomes; all of the stuff you do to make the data coming in be really meaningful, and throw out the data that isn't. That is hugely important.
We've also found, talking about technologies, that when you're getting your company off the ground, you're gonna end up doing a lot of ad hoc data analysis to try to figure out which models work and which patterns work. What that means is you're just gonna be looking at your data, looking for patterns. You're not gonna be operating at 100-million scale; you're gonna be looking at five thousand things and trying to figure out where the patterns are and which models work. Something like Athena, which lets you do basically serverless SQL queries on an S3 bucket, can let you, super fast, just dump a bunch of data into AWS, into an S3 bucket, which is like a storage container, and without deploying any infrastructure you can just start writing SQL queries and see what you can find. Five years ago you would have been managing a big infrastructure to do this; now, in your first two weeks of creating your startup, you can be spending more time cleaning the data and finding the patterns, and less time actually trying to build the algorithms or doing the other kinds of things.
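Athena itself needs an AWS account, so as a local stand-in, here is the same exploratory pattern sketched with Python's built-in sqlite3 instead (the table, columns, and rows are all made up for illustration): dump a small sample somewhere queryable and poke at it with ad hoc SQL, no infrastructure deployed.

```python
# Local stand-in for the Athena workflow described above: load a small,
# hypothetical sample of rows and explore it with ad hoc SQL queries.
import sqlite3

rows = [  # made-up sample: (job_title, applications received)
    ("engineer", 40), ("engineer", 55), ("designer", 30),
    ("designer", 20), ("analyst", 10),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE listings (title TEXT, applications INT)")
con.executemany("INSERT INTO listings VALUES (?, ?)", rows)

# Ad hoc question: which titles draw the most applications on average?
for title, avg in con.execute(
    "SELECT title, AVG(applications) FROM listings "
    "GROUP BY title ORDER BY 2 DESC"
):
    print(title, avg)  # engineer 47.5, then designer 25.0, then analyst 10.0
```

With Athena the spirit is the same: the data sits in S3, a table is declared over it, and you pay per query instead of managing servers, which is what makes this kind of week-one exploration cheap.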
That was super interesting on some of the tools that were used, but I think there's probably some really good learning in just, wow, there's a lot of stuff out there: all the hosted infrastructure providers have machine learning services. So do you have any lessons in how you approached choosing your technology stack relative to what's out there, build versus use versus buy? Yeah, the tooling is really good now, and not only do they have these services, there actually seems to be a race among all these big companies to open-source their technologies for people to use, which is great. I think the best thing to do is just try to learn some of the fundamentals, or get people on the team who can learn enough of the fundamentals, to be able to choose rationally among the fundamental decision points you have about what kind of problem you're trying to solve. Then any one of these tools that are released will get you enough of the way to understand: is this the kind of thing where I want to use ML or not? Yeah, you don't need to custom-build something, and you shouldn't spend any of your time working on that. You should figure out what cloud platform you're using, whether it's AWS or Azure or something else; they all have built-in ML services you can use that are totally fine. They're using the same algorithms, the same exact things you might use if you custom-built something, and they have plenty of power for you to do the more important thing, which is figure out where the actual predictions are in your data. So you shouldn't try, especially early on, to build anything custom there; you should use what's there. Also, one of the things I think is really cool is that they make it easy for you to iterate and try different models and different networks very, very quickly. Again, if you don't build any of your own infrastructure and use theirs, you can just change between deep learning models pretty quickly. That's right, and you're not buying any hardware, you're not managing any hardware. No one's buying hardware anymore, for sure, I hope. Please
don't. Because you're both past this product-market-fit point and in the scaling kind of phase, but using machine learning, using data, I want to ask each of you to offer some advice on something you might have done differently. So, one of the things for us, that I wish we had known, was this: we felt a lot of imposter syndrome in the first two or three months, especially around machine learning, this idea that because we didn't have the huge data set, we weren't gonna be able to build something that was of value. And I think we learned that it was actually okay to start out based on a smaller amount of data that was tailored, not just to a specific domain, but to a specific customer. That allowed us to prime our learning loop: by having the relationship with that customer, we got their data, which then allowed us to make our predictions better and get the next customer. So you got the flywheel going. What you describe is essentially the machine learning version of just early adopters in the enterprise space, but using the data side of it, and not being afraid to do it at a really small scale, like one customer, one department. One department, yeah. Get the really clean, really tailored data for whatever you're trying to do, and then do the flywheel very small, because it blows up very fast.
But isn't LinkedIn gonna win, because it has all the job descriptions? Which is what led to sort of that impostor feeling, I think: there were a lot of job descriptions that someone already had. Yeah, but here you are, doing something that they can't possibly imagine doing right now, because we went and found the data that LinkedIn actually doesn't have. A small amount of that data is more valuable, in some ways, than the large pile of aggregate data. When you talk about big
data, obviously the algorithms are really compelling and the results are really compelling, but the other half of the equation is presenting the data to the user in a way that they can understand and actually make use of. Certainly in our arena, it was insufficient, as we learned, to just say, hey, here's this great stuff, it's gonna tell you what to do. For a lot of lawyers, that feels like a black box, and that black box is a problem when they have to go explain it to the client: I did this because this thing said this thing. So a lot of what we do now is give greater transparency about what the system is doing, what you can trust and what you can't trust. Building that trust with the user, I think, is really important to actually drive that kind of usage, because at the end of the day it is their jobs. And the more information we give them about how it's doing, when it's doing a good job and when it's doing a bad job, the better we're seeing the response be. So that's a
little bit counter to expectations, and I think a really awesome learning: sometimes people think that with machine learning you do away with a lot of UI, you do away with a lot of stuff, but it turns out that in a world where people have to be comfortable with what led to the recommendation or the solution, actually presenting a journey through that is helpful. Yeah, the key thing you want to do is present it as a partner in the humans', in the people's, endeavors, in what they're trying to do: this is something that's gonna help you, and you're gonna work with it. When you do that, of course, you need a lot of trust. It's really about the human and the machine, together, accomplishing something that neither could accomplish on its own. So many machine learning products have been about trying either to look backwards and tell you what happened, or to predict the future. It's much harder to predict the future and then help you change the future, and that takes user experience and model together. It's not just the algorithms; the secret sauce is the algorithms plus the product experience that blends humans and computers together in this sort of beautiful learning loop. Awesome. Well, I really want to thank Jensen and AJ for sharing their advice and wisdom on building startups using machine learning in the enterprise.