Welcome, and thank you all for turning up. I wasn't sure if even ten people would come, so I'm glad more than ten of you came. I may not have picked the best title for what I'm describing; a colleague referred to it as "cerebral," so I'm glad that didn't put too many of you off. Before I start, I want to point out, since a lot of talks have mentioned making this stuff open source, that everything I'll talk about today, along with a bunch of other Solr plugins we've built at Dice over the last couple of years, is available on GitHub right now for you to check out. The repositories are under DiceTechJobs; the GitHub URL on the slide may be too small to read, but grab it if you're interested in checking it out.
We've built a number of Solr plugins over the years. One is a custom MLT (More Like This) handler that lets you do a lot of extra things the regular MLT handler doesn't. For example, you can apply boost functions to boost your output, so you can do location-based matching and boosting; you can do content-based queries; and you can do recommendations based on multiple documents. That's all in the MLT handler. We also have a custom spellcheck plugin that allows you to inject your own spellcheck corrections ahead of the Solr ones. It's very simple, but we find it very useful, because Solr can only search within an edit distance of two, and a lot of our most common typos are worse than that, so injecting your own corrections can give you better spellcheck suggestions. So go there and check it out; if you have any questions or find any bugs, just post them on the issues log.
Okay, so let's get to the talk. I'll be talking about conceptual search, and what that is, in a nutshell, is how to get relevant matches to your query even if none of your query terms appear in a document.
I'm the chief data scientist at Dice.com. I work on a data science R&D team, and we've done a number of projects in the two and a half years I've been there; a lot of them involve Solr. We've built custom recommender systems that leverage Solr; we built the custom MLT handler, which is fully available; we've used Solr for "did you mean" functionality, which is spellcheck correction; and we've built type-aheads based on title, skills, and company. All of those are on the site right now. I've also been working a lot on improving the relevancy of the algorithms on our job search site. We switched to Solr last December, and we found big increases in accuracy and performance and so on, but as you all know, relevancy is an ongoing process, so we're continually looking at the algorithms and improving them, and I'm the sort of lead on that project.
Some other things I've worked on at Dice.com are our skills pages. This is something we released a few months ago: if you go to dice.com/skills you can look up technology skills on the site, such as Lucene or Solr, and get some interesting information about them. This is the page for Apache Lucene. At the top it shows you the skills most related to Lucene; in this case Apache Solr, search engines, and Elasticsearch. Those associations were learned using the kinds of algorithms we'll talk about in this talk, although they're not the goal of the talk. We also have a hierarchy of skills, built with one of our partners; you'll see that on the page, along with trends for the skill over time, related jobs, and so on.
There's a very similar page for Solr. So, what does Dice do? I should mention, in case you haven't heard of us: we're the most popular technology-focused job board in the US. We serve IT jobs, so if you're looking for a technology job, Dice has those jobs posted; we also take people's resumes and allow employers to search them. Okay, so enough about the company; I want to get on with the actual talk.
Okay, so before I talk about what conceptual search is, I want to talk about why it's important; why you would care. When I was trying to think about how best to explain what we do, what I came down to is this: there's one big relevancy-tuning mistake, or limitation, that I see a lot of companies make. My company owns a number of sites besides Dice.com, and I've worked on a lot of them, and they all have roughly the same challenges with relevancy tuning. To my mind, the problem is that most people ignore the importance of recall, and there's a good reason for that: it's very hard to actually tune your algorithms for recall.

So what is recall, if you don't know? There are many metrics people use to evaluate search engines, but the two most important ones that most people care about are precision and recall; they're often combined as the F1 measure. Precision is very easy to explain: it's the percentage of the search results you get back for a given query that are relevant. Most relevancy-tuning efforts, and most relevancy-tuning metrics, really look at precision: people run a query, look at which documents come back that are not relevant to the query, and spend a lot of time tuning the query to correct those results; I've seen a lot of talks today along those lines. What precision doesn't capture, and what recall does capture, is the documents you don't get back from the query. You can spend all day tuning your search algorithm to improve its precision, but that doesn't help it match documents that don't match your query. And the reason recall is really hard to measure accurately is that you'd need to go over your entire document set, figure out for a given query which of all those documents are relevant, and build some metric around that. For most practical applications, where you have hundreds of thousands or millions of documents, that's really unfeasible. So I'm not going to tell you how to automate or fix that measurement problem. What I am going to show is how to use automated mechanisms to improve the recall of your search engine without hurting precision.
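To make those two definitions concrete, here is a minimal sketch; the document IDs and relevance judgments are made up purely for illustration:

```python
# Precision: fraction of the returned documents that are relevant.
# Recall: fraction of all relevant documents that were returned.
def precision_recall(returned, relevant):
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A query returns 4 documents; 8 documents in the corpus are truly relevant.
p, r = precision_recall(
    returned=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5", "d6", "d7", "d8", "d9", "d10"])
print(p, r, f1(p, r))  # 0.5 0.25 0.3333333333333333
```

Note that computing recall requires the full `relevant` set, which is exactly the judgment data that is unfeasible to collect over millions of documents; precision only needs judgments on the documents actually returned.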
So what is conceptual search, and what does it have to do with improving recall? It also goes by the name of semantic search. The key idea behind conceptual search is that, rather than doing keyword matching, rather than relying on keywords in a document exactly matching the query terms, conceptual search tries to learn the concepts that are important in your domain. Concepts are ideas; they're very intuitive to people, but it's extremely hard to get an algorithm to grasp them. So rather than matching on exact keywords or synonyms, conceptual search tries to match on concepts. That allows you to do things like match documents that don't contain any of the query terms and still get pretty good results.

There are two key challenges with traditional keyword matching that this attempts to solve. One is polysemy, which I always have a hard time saying; it doesn't really roll off the tongue. That's when a word has more than one meaning. To pick an example from our site: the term "engineer" on Dice, by itself, is very ambiguous, because without more surrounding context, if someone just searches for "engineer" (and people do), we don't know if they mean a mechanical engineer, a programmer, or an automation engineer. That's one example of one term meaning many different things. The other problem is synonyms, which all of you are probably familiar with: many different variants of the same word, many different words with the same meaning. For example, QA, quality assurance, and tester all have roughly the same meaning, as do VB, VB.NET, and Visual Basic, so it's a challenge to make sure you match across all of these common terms. Conceptual search tries to solve this by mapping these terms into concepts and then matching on those concepts. It will also help you handle things like typos and common spelling errors, and also idioms, which are phrases that mean something very different from the words they contain.
Okay, so why conceptual search? Again: we're trying to improve recall without diminishing precision, and it can match documents containing none of the query terms. I keep repeating this because it really is the thesis of the talk, so I'm sorry if I sound repetitive.

Okay, so what are concepts? They represent high-level ideas in your domain. In our domain, Java technologies, big data jobs, or healthcare support positions all represent concepts that are important when searching IT jobs. These concepts can be automatically learned from documents using unsupervised machine learning techniques. Typically, what gets learned is a set of concepts plus a mapping of terms to concepts, where each term has some degree of association with the different concepts. So a term can belong to multiple concepts, as you might expect; for example, Hadoop may belong to both a Java concept and a big data concept.
So how have people tried to solve this; what are some of the techniques used in the literature? A lot of you have probably heard of LSA or LSI, latent semantic indexing; it's in the title of this talk, and it's one of the most popular techniques for doing this. There's LDA, Latent Dirichlet Allocation (I'm probably saying that wrong), which is used for topic modeling, a very similar kind of idea. And then Google fairly recently released an algorithm called word2vec, and that's really what I'll be focusing on today, for various reasons I'll explain in a minute.
All of these algorithms work by mapping a document to a low-dimensional vector, that is, a list of real numbers, and that list represents how well the document expresses each concept. The problem with LSA/LSI and LDA is that they're very computationally intensive. They rely on factorizations of very large matrices, which is a slow, computationally intensive process that requires a lot of memory. And then, and this is a really important point, if you want to use these in your search engine, you have to embed a very hefty model into it, and as queries come in, you have to map each one into this concept space. That's not easy, it has a big performance impact, and it's not really desirable if you just want to get something up and running quickly. And in something like Solr, performance is very poor, because documents are no longer represented by a number of terms, each unique to some subset of documents; they're represented by these lists of numbers. So you can't use Solr's inverted index to get fast document matching, which means you can't really get good query-time performance with these techniques, and that's why they don't scale well. It's why I believe companies like Google don't use them directly, as far as I know; it just wouldn't scale.

So these traditional techniques work by learning a vector representation of a document. What I really want to do, in order to use something like this within Solr, is to map words, not documents, to vectors, and then, rather than embedding those vectors themselves, do some processing so I can get that information into Solr. Why words and not documents? Because Solr is very much built around a word- or token-based analysis chain, so I want to use that to bootstrap this into working.
Okay, so the next couple of slides have a little math in them, so fair warning. Word2vec was developed by Google around 2013; it learns vector representations for words. There's a link to the paper; if you're mathematically inclined, you'll like that kind of thing. Basically, it works by training a machine learning model to predict which words will come before and after a given word in a document, within a fixed window. So rather than requiring a very computationally intensive algorithm where you're decomposing a matrix, it just learns word by word, and it's much faster, which is one of the reasons Google built it: they need stuff that works well, but it also has to work very fast. So you can give it a large number of documents and it will scale very well, unlike the traditional techniques. But the important thing is that once you train the model, similar words get similar representations, so you can use it to mine synonyms and related terms from your document set.
One of the interesting things people found when they trained this model is that you can do what I would call "word math." It's a little crazy that this actually works, but surprisingly it did when they trained it on a large data set, something like Wikipedia. They looked at some of the vectors the model learned and tried some very simple arithmetic operations on them, and one of the things that worked is this: if they took the vector for king, subtracted man, and added woman, it mapped approximately to queen. What that means is that the model learned this notion of gender automatically, by itself, from the Wikipedia data, or whatever data set they trained it on. To give you another example, it also learned notions of capital cities: for these countries on the left and these capital cities on the right, the difference between them in the vector space is fairly constant. So it's able to learn concepts automatically from just raw data, and it's supposed to be state-of-the-art, or near state-of-the-art, on tasks like SAT verbal reasoning; people have used it for that kind of thing, and it does very well for a machine.
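That "word math" can be illustrated with a toy example. The vectors below are hand-made stand-ins; a real word2vec model learns hundreds of dimensions from data, but the arithmetic and nearest-neighbor lookup work the same way:

```python
import math

# Tiny made-up vectors standing in for learned word2vec embeddings
# (purely illustrative; real embeddings have 100-300 learned dimensions).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.2, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# vector("king") - vector("man") + vector("woman")
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Find the nearest remaining word by cosine similarity
best = max((w for w in vectors if w not in ("king", "man", "woman")),
           key=lambda w: cosine(vectors[w], target))
print(best)  # queen
```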
So at this point, and I apologize for this, you might be thinking: what's all this math-y stuff? What's it got to do with search? Why do I care? This is a search conference; why are you telling me about this weird word-math stuff? Well, as I've mentioned, this can be used to represent documents as vectors of concepts, and we can use that to do conceptual search. We do it by learning vectors for words, then learning which words are related to one another, and using that to boost recall, which, as I mentioned, was the main goal. One way to think of this is as a way to automatically learn synonyms. It's not really learning synonyms; the similarities it finds are more like related terms. An example would be Elasticsearch and Solr: they're not really synonyms, but they tend to occur together a lot and tend to be very related, so documents about Elasticsearch are probably pretty relevant to a search for Solr. Okay, so I want to give you a quick demo of this working.
I've got Solr running on my machine, with the config files from this conceptual search work. The first query I'll run is "data scientist." This is a simple search interface that runs over our Dice jobs collection, and to illustrate how this works, I'm explicitly excluding any document that matches any of the search terms, because I want to see which documents I'm surfacing that a regular search engine would miss. So, I'll repeat that: none of the matches here contain those search keywords in any way, shape, or form.

Okay, so if I search for "data scientist," I get a lot of Hadoop and Python engineer jobs. Now, I'm a data scientist, but a lot of you here might not be, so: Python is one of the main programming languages that data scientists like me use, so that's what I would consider a relevant match. You can also see other Hadoop jobs, and Hadoop is a big data technology that a lot of data scientists use. Then if I search for a programming language like C#, I get .NET jobs that don't mention C#: .NET developers, VB.NET leads, .NET architects. This is a search conference, so I figured I had to do a search for "information retrieval" (and yes, I can spell it). So if I search for "information retrieval," what do I get? I get data science positions; I also get a natural language processing engineer. These seem pretty relevant to my search term even though the documents don't contain the words "information" or "retrieval." Down here I get machine learning specialist, machine learning library, text mining. And just to show this doesn't only work for tech jobs, I can search for "project manager" and I get managerial jobs. Notice none of these contain the term "manager"; they have variants like "managers" or "mgr." I'm explicitly excluding synonyms here, just for this analysis; normally we'd map those to "manager," but I'm using this setup purely to evaluate the tool. Also, if you search for "director," you'll see development manager positions and so on. So you can see it's able to retrieve what I would at least consider relevant documents, even though those documents don't contain any of the search terms.
So how do I do this? How am I making this work within Solr, without using a machine learning model inside Solr itself? I've provided a number of Python scripts that go along with this talk. You can feed those scripts a large document set; they'll do some pre-processing on the documents, try to extract important keywords and phrases, and then train the word2vec model for you. After that, you have various options for what to do with the trained model, and this is really where it gets interesting: how do we ingest this into Solr? I'm heavily using Solr's synonym functionality, and hacking it, in some ways.

One method is to embed the raw vectors for the terms directly into Solr using a few plugins. I don't recommend doing that in production; it's more of a proof of concept, and it doesn't perform well in terms of how it scales, but if you really want the most accurate results, it's something you can play around with. To get it to scale well, I looked at other approaches. You can take the top n terms, where n could be 10, 20, or 30, that the model found most similar in your document set, extract those, and put them into a synonym file. But I don't stop there: I also take the similarity weights the model computed, reflecting how related it finds the terms within your document set, and use them to augment those query expansion terms at query time, so that certain terms get a higher weighting than others. You can do that with a little bit of tweaking to the synonym filter and a slightly modified query parser, or you could just handle it in your API just as easily.
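As a rough sketch of that top-n extraction step: the similarity scores below are made up, and the `term|weight` output format is only meant to suggest the idea; in practice the scores come from the trained word2vec model, and the exact file format expected by the Dice plugins is documented in the repo.

```python
# Hypothetical top-n most-similar terms with model similarity scores;
# in practice these come from a trained word2vec model.
similar = {
    "java developer": [("java", 0.91), ("j2ee developer", 0.87),
                       ("java architect", 0.82)],
    "data scientist": [("machine learning", 0.88), ("python", 0.79)],
}

def synonym_lines(similar, top_n=5):
    """Emit one synonym mapping per term, keeping the similarity
    score after a '|' so it can later be read as a payload."""
    lines = []
    for term, related in similar.items():
        expansions = [f"{term}|1.0"] + [
            f"{t}|{score:.2f}" for t, score in related[:top_n]]
        lines.append(f"{term}=>{','.join(expansions)}")
    return lines

for line in synonym_lines(similar):
    print(line)
# java developer=>java developer|1.0,java|0.91,j2ee developer|0.87,java architect|0.82
# data scientist=>data scientist|1.0,machine learning|0.88,python|0.79
```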
The last method for injecting this information into Solr is to use a machine learning clustering algorithm: take the vectors you've learned for the words, compute clusters over them, and then map everything in a cluster to the same token in the synonym file. So if you have a cluster containing "Java" and "Java developer," you map both to the same token. That's the easiest way to consume the output of this.
For this to work, it's important to have a set of important keywords for your domain. Trey Grainger, earlier, suggested using your top search terms, obtained by query log mining, as a starting point for this kind of analysis. That's a very good starting point for determining which words matter in your domain. But going beyond that, I've built a simple algorithm that will mine your documents for you and find commonly occurring phrases and terms, and you can use those as the vocabulary to learn synonyms over.
I've kind of just gone over this, but when you're doing this, it's important to take common phrases and match them as single tokens. If you work purely on a single-word basis, you don't always capture terms like "data scientist." One thing we found early on with our search is that queries for "data scientist" would produce a lot of scientist positions, because the whole phrase wasn't being taken into account. So it's very important to configure Solr to extract tokens for these phrases when you're defining your keywords.
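A minimal sketch of that longest-phrase-first matching is below; the phrase list is hypothetical, and inside Solr this kind of collapsing is what the synonym filter (or the Solr Text Tagger) does for you:

```python
# Known multi-word phrases, each collapsed to a single token.
phrases = {
    ("data", "scientist"): "data_scientist",
    ("machine", "learning"): "machine_learning",
}

def tokenize_with_phrases(text, phrases, max_len=3):
    """Greedy longest-match tokenizer: prefer a known phrase
    over emitting its individual words."""
    words = text.lower().split()
    tokens, i = [], 0
    while i < len(words):
        for n in range(max_len, 1, -1):  # try longest phrases first
            if tuple(words[i:i + n]) in phrases:
                tokens.append(phrases[tuple(words[i:i + n])])
                i += n
                break
        else:
            tokens.append(words[i])
            i += 1
    return tokens

print(tokenize_with_phrases("Senior Data Scientist machine learning role", phrases))
# ['senior', 'data_scientist', 'machine_learning', 'role']
```

With this in place, a query for "data scientist" matches the single `data_scientist` token instead of flooding the results with every "scientist" position.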
As I mentioned, you can do all of this yourself; the code is now publicly available on GitHub. I'll be adding documentation over the next few weeks, but I've already tried to document it enough to get you started. There's a set of Solr plugins; some relate to this talk, but as I mentioned, they do a lot of other things too. I have a repository of Solr config examples; the easiest way to understand how to use the plugins is to see them used in an example config, where you'll also find the parameters documented. And then there's the conceptual search repository that contains the machine learning pieces. That's all in Python, set up so you can run the scripts, point them at your documents, and give them your keywords.
So what additional tricks am I using to make this happen? You can use the synonym filter in Solr to extract keywords from documents, a kind of primitive parsing, or you can use the Solr Text Tagger, which will do the same thing. What that looks like is this: we're basically hacking Solr here to do simple term extraction. Say you have a bunch of noisy information like this, a Lucene position description. If you configure your synonym filter with a set of important phrases and terms and then use a type token filter factory, you can use Solr to do a sort of primitive term extraction, although it's actually quite sophisticated: it always extracts the longest matching phrase. Or, as I mentioned, you can use the Solr Text Tagger to do this for you. As you can see at the bottom here, given that initial input, we've extracted the terms Lucene, Solr, Elasticsearch, and ZooKeeper, and normalized them into canonical forms. This is just using standard Solr functionality and a synonyms file.
The second trick for making conceptual search happen in Solr is to make use of payloads. Payloads are a very useful piece of Solr functionality that I think is often overlooked. They allow you to tag each term in your document with a value, in this case a real number. Then you just need to make some small modifications to a similarity class and a query parser to allow them to use those payloads in scoring. This is very powerful and has many applications beyond this one. For example, we use it in our recommender engine to weight terms in a document with our own custom weighting scheme. You could even use it to implement a simple machine learning model within Solr, a linear model, because it lets you set arbitrary weights on the terms in a document rather than just using the tf-idf values.
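A field type wired up this way might look roughly like the following sketch. The file names are illustrative, and the custom boost/similarity pieces from the Dice plugins are omitted; only standard Solr filter factories are shown here:

```xml
<!-- Sketch: expand terms via a synonym file whose entries carry "|score"
     suffixes, then store those scores as float payloads on each token. -->
<fieldType name="conceptual_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- expand each keyword to its related terms, e.g. "java|0.91" -->
    <filter class="solr.SynonymFilterFactory" synonyms="concept_synonyms.txt"
            ignoreCase="true" expand="true"/>
    <!-- strip the "|score" suffix and attach it as a float payload -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="|" encoder="float"/>
  </analyzer>
</fieldType>
```

The payload-aware similarity and query parser from the plugins repo then pick up those float payloads at scoring time.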
So we use a synonym filter to expand the keyword into multiple tokens, each token carries an associated payload, and we use that payload to adjust the relevancy scores at index or query time to make this conceptual search happen. You can think of it as a form of query expansion, but you can do it inside Solr at index time if you wish, you can do it at query time, or, as I mentioned, you can do it in your API.

So what does this look like in the analysis chain? Again, I'm using a synonym filter and a type filter to extract and normalize the terms, Lucene in this case, and at the bottom here you'll see the bit I'm now talking about. There's a synonym mapping for "Apache Lucene" that maps it to a set of top-n terms, and each of those has a payload associated with it; you'll see the payload after the pipe symbol. So first it takes "Apache Lucene" and expands it to the most related terms as found by the algorithm; then it takes each payload and, at query time, converts it into a term boost, which you'll see down here. You can also do this at index time, using a version of edismax, available on my plugins site, that can read payloads and include the payload in the score.
Okay, so what do some of these example synonym files look like? I'm going to skip the vector-based method, but you can read more about that on the plugin page. This is an example analysis chain for this kind of technique. We do some pre-processing here to remove HTML characters and whitespace, tokenize, and so on; then we use this job-titles file to filter the set of terms down to just our job titles; then we use a type token filter factory to remove anything that's not a synonym, that is, not in that file; then we use a delimited payload token filter to extract the payload; and finally I use these custom token filter factories to turn the payloads into term boosts.

So what does a file built by this model look like, the one you'd ingest into Solr as a synonym file? Here's an example entry. I'm using a top n of five so you can see what it looks like; you might want larger values. It takes the phrase "Java developer" (again collapsed into a single token, as I mentioned) and maps it to "Java," "J2EE developer," "Java architect," "lead Java developer": all terms the algorithm learned, by itself, are relevant to "Java developer." And you'll see the different payloads; they're ordered, so the terms closest to "Java developer" are the most similar or related to it. As I mentioned, you can configure this at index time with payloads, or at query time. Doing it at query time gives you a lot of choices, because you can experiment with different approaches, different field weights, and so on without re-indexing. And I know that for some people re-indexing can take anywhere from an hour to weeks, so being able to configure this at query time matters, at least for some. On one of our sites we have millions of documents, and we could not have run many experiments if we'd had to keep re-indexing, so that's important. And unlike the vector method, and unlike LSA, we're still able to make use of the inverted index, because we're basically just doing term expansion. So it's fast at both index and query time, as long as you don't set n too high.
So, the last method I mentioned: instead of the top-n-terms approach, you can do clustering, and again, there's code in that GitHub repo that will do the clustering for you. What you do is take the model's vectors, learned over your top keywords and phrases, and cluster them by similarity. Rather than doing a single round of clustering, you might want to do several passes with different cluster sizes and combine those into different fields, so you can give them different weights: the tighter, smaller clusters get more weight, because they're a narrower set of related terms.

So what does a synonym file output by the clustering approach look like? As you can see, these are all terms that landed in the same cluster, and they simply map to some arbitrary cluster token label in your synonym file. This is probably the easiest way to use this out of the box: you don't need any custom payload functionality or anything; you can just generate the file and configure it as a synonym file. I've not seen a noticeable impact on query or indexing performance; in fact, in some ways it may actually improve it.
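To sketch the idea, here is a toy version of that clustering step; the vectors are made up, and instead of the repo's real clustering over learned word2vec vectors this uses a simple greedy single-pass grouping by cosine similarity:

```python
import math

# Toy "learned" vectors; related skills get similar vectors.
vectors = {
    "solr":          [0.9, 0.1],
    "elasticsearch": [0.85, 0.15],
    "lucene":        [0.88, 0.12],
    "french":        [0.1, 0.9],
    "bilingual":     [0.15, 0.85],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster(vectors, threshold=0.95):
    """Greedy single-pass clustering: put each term into the first
    cluster whose seed it is close enough to, else start a new cluster."""
    clusters = []  # list of (seed_vector, [terms])
    for term, vec in vectors.items():
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(term)
                break
        else:
            clusters.append((vec, [term]))
    return [members for _, members in clusters]

# Each cluster becomes one synonym-file line mapping terms to a shared token.
for i, members in enumerate(cluster(vectors)):
    print(f"{','.join(members)}=>cluster_{i}")
# solr,elasticsearch,lucene=>cluster_0
# french,bilingual=>cluster_1
```

Since every term in a cluster indexes to the same token, queries for any one of them match documents containing any of the others, with no payload machinery required.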
32:23understand the power of this algorithm
32:25it helps if you kind of look at some of
32:27the clusters that it's able to learn
32:28from our jobs data so I picked some
32:31examples from the first time I ran the
32:33algorithm you know there's no like
32:35there's no tuning or anything this is
32:37just the first time I ran it to see if
32:39it kind of made sense what it was doing
32:42so there was a cluster for what I would
32:44call natural language information so I
32:46had bilingual Chinese French fluent
32:50speak speaker these are all kind of
32:52terms related to natural languages that
32:55were all in one cluster there's the
32:57entire cluster I haven't moved anything
32:59it made a small cluster for Apple
33:02programming languages of which you found
33:03to cocoa and swift by the way these
33:07these labels here I have added for
33:09interpretation but obviously kind of
33:11that is a little hard automatically
33:13it'll in the cluster for search engine
33:16technologies so it festered solar
33:19elasticsearch Lucene search engines they
33:23all came in the same cluster and they
33:25kinda had closed the food net beyond
33:28Beyond pure tech skills, it also did some
33:29interesting things. Another thing I
33:31found was it was able to cluster the
33:34terms that employers use when they're
33:36describing the sort of ideal candidate for
33:37a role; these terms kind of relate
33:41to attention and attitude
33:43and focus. It also produced a larger cluster
33:46of these kinds of attribute-type terms,
33:50just describing someone's attitude,
33:52which I thought was really interesting:
33:53terms like pays attention,
33:56self-motivated, diligent, detail-oriented;
33:59they were all kind of mapped to that same
34:01cluster. And again, this is all ultimately
34:04extracted by the algorithm;
34:05there's no supervised labeling going on
34:07here, you don't have to go through and
34:09do labeling of your data; it's just learning
34:11what terms are related in
34:13your, you know, data set for you. It's
34:16doing a lot of work without much
34:19configuration. So, to summarize: it's easy to
34:28overlook recall when you're performing
34:29relevancy tuning. Conceptual search
34:33provides you with a method of improving
34:35recall while still, hopefully, maintaining
34:37high precision, by trying to match
34:39documents on related terms, or concepts.
34:43In reality this involves using a
34:47machine learning algorithm that will
34:49learn what terms are related to
34:51one another. word2vec is a very
34:54scalable algorithm for doing this: it's
34:56very fast, it doesn't seem to use much
35:01more memory even if you give it multiple
35:03times the number of documents, and
35:06it's been shown to do well with
35:08analogy tasks and mining similar
35:10words and so on, so it's an appropriate
35:12tool for this task; there's a reason
35:14that Google developed it for doing this
35:16kind of thing. And we can train a word2vec
35:19model offline, and we can take
35:21its output, if we process it, and use
35:26various methods to try to make Solr do
35:28conceptual search without embedding some
35:31complicated machine learning model, and
35:37that may or may not
35:39require the use of payloads and
35:41custom plugins, depending on which
35:43approach you take. I would start with
35:45the clustering approach, because that's
35:47very easy, and then maybe experiment with
35:49some of the other ones that make sense
35:51to you. So that is the end of the talk; I
35:55have about five minutes for questions.
36:19so I used several approaches: the
36:23first, and this gave the best
36:25results, used a set of keywords that we
36:27defined ourselves. One of those was very
36:30simply a list of top job titles, which
36:32is very easy to define, but the
36:34skills set took more work. But I also,
36:37and I include code for this,
36:38experimented with taking a set of
36:41keywords and also extracting commonly
36:43occurring phrases and using those to
36:47train the model, and that still did
36:48pretty well; that second approach
36:51didn't require any kind of manual
36:53engineering or determining of keywords
36:55or so on. I switched to keyword clusters
36:59here; it doesn't work quite as well for that
37:02one, it's not quite as good, but it
37:06still works pretty well. So if I type in
37:07data scientist I get MATLAB positions,
37:09I still get some Hadoop positions, I get
37:12some kind of data-related positions, see?
37:19C# gives me a bunch of .NET
37:20positions; it got a little confused about
37:23one term, because that's a .NET
37:26technology but it's also a server-side
37:28thing, but you're getting VB jobs. That
37:32works pretty well, so if you don't have
37:34a set of keywords defined you can
37:36try the second approach, but I would
37:38definitely start with this kind of
37:39list of your top search terms;
37:41everyone should be capable of
37:42extracting that, and if you want
37:46you can run my algorithm to do the
37:47phrase mining and use those also. Yep?
38:13yeah, so you have to be
38:17careful how you want to use this,
38:18because sometimes if you show users
38:20documents that don't match their terms,
38:21even if they're relevant, they may complain,
38:22right? So it's important you test this,
38:25and also when you're building a UI you
38:26may need to tell your users that, okay,
38:29this isn't an exact match. So you may
38:31want to offer this as an additional
38:32option. There are many ways you can
38:34use this: one would be as a different
38:36option, right, "not finding any good
38:37documents? click here"; a second would be
38:40just to include them in the matches and
38:41then use the qf functionality to weight
38:44these kind of lower than the exact
38:46matches, so that once they start running
38:47out of good matches then they start to
38:49see the conceptual matches. Yeah,
38:55just for the purpose of evaluation, right,
38:57but how I would configure this in
38:59production is I would set a high value
39:00on the phrase matches and exact matches
39:03and then pull these in at a lower kind of
39:06weight. Another way you could do it is
39:07with re-ranking functionality:
39:09you could, you know, use the
39:12Solr re-ranking function, so you could
39:13take the top thousand matches and
39:14then re-rank them with this, so
39:17that if you only have, like, one or two
39:18terms that match, if you
39:21configure it correctly in Solr you
39:23could then make sure that, well, okay, I
39:26only matched on one term but I matched on
39:27five related terms here, or five clusters
39:30or whatever, and rank those higher.
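The re-ranking idea can be illustrated outside Solr with a toy rescoring function (in Solr itself this would go through the re-rank query, the `rq` parameter); the `alpha` weight and scoring scheme below are assumptions for illustration, not what Solr computes:

```python
def rerank(results, cluster_hits, alpha=0.1):
    """Re-score the top results: keep the original relevancy score but add
    a small bonus per matched cluster token, so exact matches still dominate.
    `results` is [(doc_id, score)], `cluster_hits` maps doc_id -> count of
    conceptual-cluster matches (alpha is an illustrative weight)."""
    rescored = [(doc, score + alpha * cluster_hits.get(doc, 0))
                for doc, score in results]
    return sorted(rescored, key=lambda p: p[1], reverse=True)

top = [("job1", 2.0), ("job2", 1.9), ("job3", 0.5)]
hits = {"job2": 3, "job3": 1}          # conceptual-cluster matches per doc
print(rerank(top, hits))               # job2 now edges ahead of job1
```

The point of keeping `alpha` small is exactly what the answer describes: a document matching one exact term plus several related clusters can overtake one with a single exact match, without conceptual matches swamping the exact ones.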
39:32So there's many ways you could use this;
39:34we're still kind of evaluating how we
39:35want to go forward. When we showed this
39:37to our recruiters, everybody liked it,
39:38because, and I didn't even think of this,
39:40there's a lot of tech jobs that are really
39:41hard for people to place. Like, DevOps is in
39:43extremely high demand right now, and they're
39:45having a really hard time placing
39:47DevOps developers. This gives them, like,
39:49good suggestions for alternate types of
39:51developers they could find, so if they don't
39:52find any good DevOps developers,
39:54or they're too expensive, this shows them
39:56really good matches, the kind of developer
39:58that would be a good fit because
39:59they would have a similar skill set.
40:06Currently this is still in a kind of
40:08evaluation phase, but we do intend to use it
40:10on the site soon. Yes? Sure, sorry.
40:26So the approaches I focused on I think scale
40:28just fine; like, I don't notice any
40:30detriment. My initial idea with this,
40:32and I kind of avoided it because I
40:34think it complicates matters, was to
40:36actually trick Solr into matching on the
40:38vectors themselves. So you could take a
40:41term, and say for a
40:43given word you have a hundred-element
40:45vector: you can use synonym files and
40:48other stuff in Solr to map that to a
40:50hundred tokens with payloads on them, and
40:52then do the vector matching that
40:54way, and then you kind of have something
40:55that's very much like LSA inside of
40:57Solr, but that doesn't scale well and
40:59it's kind of complicated. The code is
41:01out there if you want to look through and see
41:03how I do that. It uses the same concept,
41:05but rather than taking the top ten or
41:07twenty related terms, there's a mapping
41:09to artificial tokens: zero-one, zero-two,
41:13zero-three, something like that, for the
41:14elements in the vector, and then the
41:16value of that element is your payload.
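That vector-to-tokens trick can be sketched as follows; the zero-padded token names mirror the artificial "zero-one, zero-two" tokens just described, and the `token|payload` form follows Solr's delimited-payload convention, though the repo's exact encoding may differ:

```python
def vector_to_payload_tokens(vec):
    """Encode a dense vector as payload-carrying tokens: element i becomes
    token "i" with its value attached as the payload, in the
    "token|payload" format that Solr's delimited-payload filter parses."""
    return " ".join("%02d|%.3f" % (i, v) for i, v in enumerate(vec))

print(vector_to_payload_tokens([0.12, -0.3, 0.0]))
# 00|0.120 01|-0.300 02|0.000
```

Every document then shares the same hundred tokens, which is why this defeats the inverted index: nothing is filtered out, and all the work happens in payload scoring.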
41:20Yeah, yep. Yeah, so I mean
41:27that's probably going to be the
41:28most accurate approach, but it has a big
41:30cost in terms of performance, because you
41:32can't use the inverted index. You could use
41:34it for re-ranking, though, if
41:37you're just re-ranking the top hundred
41:39matches, or a hundred
41:41thousand matches, or something like that.
41:49I used jobs for this, and
41:52the reason for that is: we have jobs
41:54data and then we have resume data. A job
41:56contains a single job; a resume contains a
41:58bunch of jobs, so the jobs are a lot more
42:01focused. However, word2vec
42:02probably wouldn't care, because it's
42:03looking within a very narrow window
42:05around the word. But when I've used LSA
42:08on that, jobs were definitely much better,
42:10because LSA is looking at what terms
42:12occur in the same document, whereas this is
42:14using a very kind of narrow window, so
42:16you could use it with either. I
42:19would say it's more flexible as well as
42:21being better. Does that answer your question?
42:27This one was just under a hundred
42:29thousand. Well, I mean, I prune a lot,
42:33though, because we have a lot of ones
42:34that are very short, so after the pruning
42:36and pre-processing I think it was about
42:37sixty-six thousand, but it runs pretty fast.
42:42And, like, I have a lot of
42:45pre-processing steps, and those
42:47actually take longer than the model takes
42:48to train, probably, like doing the HTML
42:50parsing and all that; I'm just
42:52using some open-source libraries. The
42:54model training itself, I think to train
42:56ten iterations took about ten minutes, so
42:59it's pretty fast, and this is Python
43:01code, so there are implementations of it
43:04in Java and other libraries that are
43:06probably faster, but it's pretty fast.
43:16Yeah, so I'm doing clustering here, so
43:31the cluster sizes can vary, right? The
43:33clustering algorithm just
43:36tries to find the most similar
43:37items, and you set the number of clusters;
43:38some of them have two items in them, some
43:40of them have a lot, so you'll find the
43:43larger clusters have a lot of very, very
43:44similar items in them, right. So it's
43:52learning that by basically going through
43:53the documents and looking at what words
43:55occur close to one another, and then
43:57training a machine learning model. Yeah, yes.
44:04right, right, yeah. Yeah, I mean,
44:09sometimes it's, like, co-occurrence, so
44:11you get quite far by doing
44:12just a simple co-occurrence model with a
44:14window, which is what people have done in
44:16the past, but this uses a machine
44:17learning approach that probably produces
44:19better results, at least in the tests
44:22they've done. But you can think of it as
44:24doing that; it's not explicitly
44:26computing that, but it is in a sense. So
44:32you can read the paper if you want to know.
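The simple windowed co-occurrence baseline mentioned here can be sketched as a plain frequency count, with none of word2vec's learning; the window size and example text are illustrative:

```python
from collections import Counter

def cooccurrences(tokens, window=2):
    """Count how often each pair of words appears within `window`
    positions of one another (the simple baseline word2vec improves on).
    Pairs are stored as sorted tuples so (a, b) and (b, a) pool together."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            counts[tuple(sorted((w, tokens[j])))] += 1
    return counts

text = "java developer java spring developer".split()
print(cooccurrences(text, window=2).most_common(2))
```

word2vec is trained over the same kind of narrow windows, which is why it can be thought of as implicitly capturing these statistics even though it never builds this table.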
44:53so not all of the clusters are perfect,
44:57but, you know, you're using this to
44:59sort of augment your existing search, not
45:01to replace it, you know, so if you do that
45:04I don't think you'll get
45:06a lot of bad hits. With the
45:08clusters, not all the clusters are quite
45:10as neat as the ones I showed you there.
45:11I'm providing everyone
45:13with the code, but I can't
45:14provide you the output of the algorithm,
45:15because that would be too much and I
45:19would get in trouble. But, oh yeah, most of
45:22them made sense to me, like, most
45:24of these are pretty good
45:25clusterings of terms. You can always, if
45:28you have time, you know, go and
45:29inspect them and delete the ones you don't want.
45:38Well, right now I'm just processing the
45:40whole document as text. You could do
45:41separate training on different sections of a
45:43document; that might be a good thing to
45:45do, actually, and just train it that way.
45:48And then in Solr you can
45:50configure as many fields as you want,
45:52with as many weights, so I typically
45:54take advantage of that: I can have
45:56several different fields, some of which may be
45:57more important, and separate them out,
45:59and then I can always change it at
46:00query time. So, yeah, I would definitely, if
46:02you have different fields that are more
46:04important, then I would do this.
46:06Actually, for the one I showed you, I used
46:09titles and skills separately: the skills
46:11are extracted from the job description,
46:12and I trained a model on that; then I had
46:14a separate title model, and it was using a
46:16combination of those two to work.
46:23Mm-hmm, yeah, words with different meanings with
46:40this? Well, I mean, there are some ways this
46:41can work out. I mean, one way to
46:43deal with this, and why I said it's important,
46:44is just to make sure you're extracting
46:45phrases, not just single words, because
46:49the phrase itself will help
46:50disambiguate those kinds of situations.
46:51But the other way this kind of works is
46:53that it's not just mapping, for a given
46:55document, one word to one
46:57cluster; it's mapping a thousand or more
47:00words to different clusters, so that
47:01combination of data should help
47:02disambiguate the problem, because you're
47:06using the whole document context,
47:08not just one or two words. This is, I
47:11guess, better at solving the
47:13synonym problem, but people have found
47:16that, you know, that mapping of concepts
47:18does seem to help solve the polysemy
47:20problem too; I'm not totally sure how.
47:29yeah, there were different experiments, so,
47:36like, the first one I showed you was
47:37using the set of terms that we'd already
47:39defined, and that worked best, but the
47:41second one I showed you was using the
47:42clusters, and while not as good as I showed
47:44you, we're not using any kind of taxonomy
47:46or any set of keywords: I was just
47:49using the most commonly occurring terms
47:51or phrases in the document set, as
47:53extracted by the Python code that I put out
47:55there, which will kind of reasonably
47:57efficiently mine your data and
47:59find phrases, which you can also do with
48:01the n-gram filter; I mean, there's
48:02a shingle filter in Solr too. Yeah, yeah.
48:11so I give it you can give bad minimal
48:14you can give it as a set of documents in
48:16my in my scripts or if you just want to
48:18use it yourself you just give it a
48:19documents I would recommend including
48:21within that like I said those set of
48:23keywords because you want to make sure
48:27that it covers those because those are
48:28very important right but it will do
48:30reasonably well without that but yeah
48:33there's really those two things I was
48:35try was experiment to another approach
48:37where I was gonna use the keywords and
48:38then find terms that are very closely
48:40related to them and use those as
48:41keywords but that's kind of what this is
48:42doing anyway so I kind of been in that
49:12right, I mean, that's the
49:14beauty of search too: you get
49:16pretty good results even if it's not
49:18doing a perfect job, and this kind of
49:20makes that better. I think she just gave
49:22me a signal that I have to leave, but
49:25what I was going to say is, yeah, what we
49:27really like about this is there's no
49:28manual maintenance
49:29of synonyms, you know.
49:31Some of our sites, like in the medical
49:33profession, we have medical job
49:34sites where the domain doesn't change very
49:36rapidly; you don't get a lot of new
49:38keywords appearing. But in the tech jobs
49:41sector we have new skills coming in all the
49:43time, so we need something that can
49:46handle that and automatically learn it;
49:48I don't have time to go and keep
49:50amending the synonym files constantly. What's the