Welcome, and thank you all for turning up. I wasn't sure if even ten people would come, so I'm glad more than ten of you came. I may not have picked the best title for what I'm describing; a colleague referred to it as "cerebral," so I'm glad that didn't put too many of you off. Before I start, I want to point out, since a lot of talks have mentioned making this stuff open source, that everything I'll talk about today, along with a bunch of other Solr plugins we've built at Dice over the last couple of years, is available on GitHub right now for you to check out. The repositories are under DiceTechJobs; the GitHub URL on the slide may be too small to read, but grab it if you're interested in checking it out.
We've built a number of Solr plugins over the years. One is a custom MLT (More Like This) handler that lets you do a lot of extra things the regular MLT handler doesn't. For example, you can apply boost functions to boost your output, so you can do location-based matching and boosting; you can do content-based queries; and you can do recommendations based on multiple documents. That's all in the MLT handler. We also have a custom spellcheck plugin that allows you to inject your own spellcheck corrections ahead of the Solr ones. It's very simple, but we find it very useful, because Solr can only search within an edit distance of two, and a lot of our most common typos are worse than that, so injecting your own corrections can give you better spellcheck suggestions. So go there and check it out; if you have any questions or find any bugs, just post them on the issues log.
Okay, so let's get to the talk. I'll be talking about conceptual search, and what that is, in a nutshell, is how to get relevant matches to your query even if none of your query terms appear in a document.
I'm the chief data scientist at Dice.com. I work on a data science R&D team, and we've done a number of projects in the two and a half years I've been there; a lot of them involve Solr. We've built custom recommender systems that leverage Solr; we built the custom MLT handler, which is fully available; we've used Solr for "did you mean" functionality, which is spellcheck correction; and we've built type-aheads based on title, skills, and company. All of those are on the site right now. I've also been working a lot on improving the relevancy of the algorithms on our job search site. We switched to Solr last December, and we found big increases in accuracy and performance and so on, but as you all know, relevancy is an ongoing process, so we're continually looking at the algorithms and improving them, and I'm the sort of lead on that project.
Some other things I've worked on at Dice.com are our skills pages. This is something we released a few months ago: if you go to dice.com/skills you can look up technology skills on the site, such as Lucene or Solr, and get some interesting information about them. This is the page for Apache Lucene. At the top it shows you the skills most related to Lucene; in this case Apache Solr, search engines, and Elasticsearch. Those associations were learned using the kinds of algorithms we'll talk about in this talk, although they're not the goal of the talk. We also have a hierarchy of skills, built with one of our partners; you'll see that on the page, along with trends for the skill over time, related jobs, and so on.
There's a very similar page for Solr. So, what does Dice do? I should mention, in case you haven't heard of us: we're the most popular technology-focused job board in the US. We serve IT jobs, so if you're looking for a technology job, Dice has those jobs posted; we also take people's resumes and allow employers to search them. Okay, so enough about the company; I want to get on with the actual talk.
Okay, so before I talk about what conceptual search is, I want to talk about why it's important; why you would care. When I was trying to think about how best to explain what we do, what I came down to is this: there's one big relevancy-tuning mistake, or limitation, that I see a lot of companies make. My company owns a number of sites besides Dice.com, and I've worked on a lot of them, and they all have roughly the same challenges with relevancy tuning. To my mind, the problem is that most people ignore the importance of recall, and there's a good reason for that: it's very hard to actually tune your algorithms for recall.

So what is recall, if you don't know? There are many metrics people use to evaluate search engines, but the two most important ones that most people care about are precision and recall; they're often combined as the F1 measure. Precision is very easy to explain: it's the percentage of the search results you get back for a given query that are relevant. Most relevancy-tuning efforts, and most relevancy-tuning metrics, really look at precision: people run a query, look at which documents come back that are not relevant to the query, and spend a lot of time tuning the query to correct those results; I've seen a lot of talks today along those lines. What precision doesn't capture, and what recall does capture, is the documents you don't get back from the query. You can spend all day tuning your search algorithm to improve its precision, but that doesn't help it match documents that don't match your query. And the reason recall is really hard to measure accurately is that you'd need to go over your entire document set, figure out for a given query which of all those documents are relevant, and build some metric around that. For most practical applications, where you have hundreds of thousands or millions of documents, that's really unfeasible. So I'm not going to tell you how to automate or fix that measurement problem. What I am going to show is how to use automated mechanisms to improve the recall of your search engine without hurting precision.
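To make those two definitions concrete, here is a minimal sketch; the document IDs and relevance judgments are made up purely for illustration:

```python
# Precision: fraction of the returned documents that are relevant.
# Recall: fraction of all relevant documents that were returned.
def precision_recall(returned, relevant):
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A query returns 4 documents; 8 documents in the corpus are truly relevant.
p, r = precision_recall(
    returned=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5", "d6", "d7", "d8", "d9", "d10"])
print(p, r, f1(p, r))  # 0.5 0.25 0.3333333333333333
```

Note that computing recall requires the full `relevant` set, which is exactly the judgment data that is unfeasible to collect over millions of documents; precision only needs judgments on the documents actually returned.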
So what is conceptual search, and what does it have to do with improving recall? It also goes by the name of semantic search. The key idea behind conceptual search is that, rather than doing keyword matching, rather than relying on keywords in a document exactly matching the query terms, conceptual search tries to learn the concepts that are important in your domain. Concepts are ideas; they're very intuitive to people, but it's extremely hard to get an algorithm to grasp them. So rather than matching on exact keywords or synonyms, conceptual search tries to match on concepts. That allows you to do things like match documents that don't contain any of the query terms and still get pretty good results.

There are two key challenges with traditional keyword matching that this attempts to solve. One is polysemy, which I always have a hard time saying; it doesn't really roll off the tongue. That's when a word has more than one meaning. To pick an example from our site: the term "engineer" on Dice, by itself, is very ambiguous, because without more surrounding context, if someone just searches for "engineer" (and people do), we don't know if they mean a mechanical engineer, a programmer, or an automation engineer. That's one example of one term meaning many different things. The other problem is synonyms, which all of you are probably familiar with: many different variants of the same word, many different words with the same meaning. For example, QA, quality assurance, and tester all have roughly the same meaning, as do VB, VB.NET, and Visual Basic, so it's a challenge to make sure you match across all of these common terms. Conceptual search tries to solve this by mapping these terms into concepts and then matching on those concepts. It will also help you handle things like typos and common spelling errors, and also idioms, which are phrases that mean something very different from the words they contain.
Okay, so why conceptual search? Again: we're trying to improve recall without diminishing precision, and it can match documents containing none of the query terms. I keep repeating this because it really is the thesis of the talk, so I'm sorry if I sound repetitive.

Okay, so what are concepts? They represent high-level ideas in your domain. In our domain, Java technologies, big data jobs, or healthcare support positions all represent concepts that are important when searching IT jobs. These concepts can be automatically learned from documents using unsupervised machine learning techniques. Typically, what gets learned is a set of concepts plus a mapping of terms to concepts, where each term has some degree of association with the different concepts. So a term can belong to multiple concepts, as you might expect; for example, Hadoop may belong to both a Java concept and a big data concept.
So how have people tried to solve this; what are some of the techniques used in the literature? A lot of you have probably heard of LSA or LSI, latent semantic indexing; it's in the title of this talk, and it's one of the most popular techniques for doing this. There's LDA, Latent Dirichlet Allocation (I'm probably saying that wrong), which is used for topic modeling, a very similar kind of idea. And then Google fairly recently released an algorithm called word2vec, and that's really what I'll be focusing on today, for various reasons I'll explain in a minute.
All of these algorithms work by mapping a document to a low-dimensional vector, that is, a list of real numbers, and that list represents how well the document expresses each concept. The problem with LSA/LSI and LDA is that they're very computationally intensive. They rely on factorizations of very large matrices, which is a slow, computationally intensive process that requires a lot of memory. And then, and this is a really important point, if you want to use these in your search engine, you have to embed a very hefty model into it, and as queries come in, you have to map each one into this concept space. That's not easy, it has a big performance impact, and it's not really desirable if you just want to get something up and running quickly. And in something like Solr, performance is very poor, because documents are no longer represented by a number of terms, each unique to some subset of documents; they're represented by these lists of numbers. So you can't use Solr's inverted index to get fast document matching, which means you can't really get good query-time performance with these techniques, and that's why they don't scale well. It's why I believe companies like Google don't use them directly, as far as I know; it just wouldn't scale.

So these traditional techniques work by learning a vector representation of a document. What I really want to do, in order to use something like this within Solr, is to map words, not documents, to vectors, and then, rather than embedding those vectors themselves, do some processing so I can get that information into Solr. Why words and not documents? Because Solr is very much built around a word- or token-based analysis chain, so I want to use that to bootstrap this into working.
Okay, so the next couple of slides have a little math in them, so fair warning. Word2vec was developed by Google around 2013; it learns vector representations for words. There's a link to the paper; if you're mathematically inclined, you'll like that kind of thing. Basically, it works by training a machine learning model to predict which words will come before and after a given word in a document, within a fixed window. So rather than requiring a very computationally intensive algorithm where you're decomposing a matrix, it just learns word by word, and it's much faster, which is one of the reasons Google built it: they need stuff that works well, but it also has to work very fast. So you can give it a large number of documents and it will scale very well, unlike the traditional techniques. But the important thing is that once you train the model, similar words get similar representations, so you can use it to mine synonyms and related terms from your document set.
One of the interesting things people found when they trained this model is that you can do what I would call "word math." It's a little crazy that this actually works, but surprisingly it did when they trained it on a large data set, something like Wikipedia. They looked at some of the vectors the model learned and tried some very simple arithmetic operations on them, and one of the things that worked is this: if they took the vector for king, subtracted man, and added woman, it mapped approximately to queen. What that means is that the model learned this notion of gender automatically, by itself, from the Wikipedia data, or whatever data set they trained it on. To give you another example, it also learned notions of capital cities: for these countries on the left and these capital cities on the right, the difference between them in the vector space is fairly constant. So it's able to learn concepts automatically from just raw data, and it's supposed to be state-of-the-art, or near state-of-the-art, on tasks like SAT verbal reasoning; people have used it for that kind of thing, and it does very well for a machine.
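That "word math" can be illustrated with a toy example. The vectors below are hand-made stand-ins; a real word2vec model learns hundreds of dimensions from data, but the arithmetic and nearest-neighbor lookup work the same way:

```python
import math

# Tiny made-up vectors standing in for learned word2vec embeddings
# (purely illustrative; real embeddings have 100-300 learned dimensions).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.2, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# vector("king") - vector("man") + vector("woman")
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Find the nearest remaining word by cosine similarity
best = max((w for w in vectors if w not in ("king", "man", "woman")),
           key=lambda w: cosine(vectors[w], target))
print(best)  # queen
```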
So at this point, and I apologize for this, you might be thinking: what's all this math-y stuff? What's it got to do with search? Why do I care? This is a search conference; why are you telling me about this weird word-math stuff? Well, as I've mentioned, this can be used to represent documents as vectors of concepts, and we can use that to do conceptual search. We do it by learning vectors for words, then learning which words are related to one another, and using that to boost recall, which, as I mentioned, was the main goal. One way to think of this is as a way to automatically learn synonyms. It's not really learning synonyms; the similarities it finds are more like related terms. An example would be Elasticsearch and Solr: they're not really synonyms, but they tend to occur together a lot and tend to be very related, so documents about Elasticsearch are probably pretty relevant to a search for Solr. Okay, so I want to give you a quick demo of this working.
I've got Solr running on my machine, with the config files from this conceptual search work. The first query I'll run is "data scientist." This is a simple search interface that runs over our Dice jobs collection, and to illustrate how this works, I'm explicitly excluding any document that matches any of the search terms, because I want to see which documents I'm surfacing that a regular search engine would miss. So, I'll repeat that: none of the matches here contain those search keywords in any way, shape, or form.

Okay, so if I search for "data scientist," I get a lot of Hadoop and Python engineer jobs. Now, I'm a data scientist, but a lot of you here might not be, so: Python is one of the main programming languages that data scientists like me use, so that's what I would consider a relevant match. You can also see other Hadoop jobs, and Hadoop is a big data technology that a lot of data scientists use. Then if I search for a programming language like C#, I get .NET jobs that don't mention C#: .NET developers, VB.NET leads, .NET architects. This is a search conference, so I figured I had to do a search for "information retrieval" (and yes, I can spell it). So if I search for "information retrieval," what do I get? I get data science positions; I also get a natural language processing engineer. These seem pretty relevant to my search term even though the documents don't contain the words "information" or "retrieval." Down here I get machine learning specialist, machine learning library, text mining. And just to show this doesn't only work for tech jobs, I can search for "project manager" and I get managerial jobs. Notice none of these contain the term "manager"; they have variants like "managers" or "mgr." I'm explicitly excluding synonyms here, just for this analysis; normally we'd map those to "manager," but I'm using this setup purely to evaluate the tool. Also, if you search for "director," you'll see development manager positions and so on. So you can see it's able to retrieve what I would at least consider relevant documents, even though those documents don't contain any of the search terms.
So how do I do this? How am I making this work within Solr, without using a machine learning model inside Solr itself? I've provided a number of Python scripts that go along with this talk. You can feed those scripts a large document set; they'll do some pre-processing on the documents, try to extract important keywords and phrases, and then train the word2vec model for you. After that, you have various options for what to do with the trained model, and this is really where it gets interesting: how do we ingest this into Solr? I'm heavily using Solr's synonym functionality, and hacking it, in some ways.

One method is to embed the raw vectors for the terms directly into Solr using a few plugins. I don't recommend doing that in production; it's more of a proof of concept, and it doesn't perform well in terms of how it scales, but if you really want the most accurate results, it's something you can play around with. To get it to scale well, I looked at other approaches. You can take the top n terms, where n could be 10, 20, or 30, that the model found most similar in your document set, extract those, and put them into a synonym file. But I don't stop there: I also take the similarity weights the model computed, reflecting how related it finds the terms within your document set, and use them to augment those query expansion terms at query time, so that certain terms get a higher weighting than others. You can do that with a little bit of tweaking to the synonym filter and a slightly modified query parser, or you could just handle it in your API just as easily.
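As a rough sketch of that top-n extraction step: the similarity scores below are made up, and the `term|weight` output format is only meant to suggest the idea; in practice the scores come from the trained word2vec model, and the exact file format expected by the Dice plugins is documented in the repo.

```python
# Hypothetical top-n most-similar terms with model similarity scores;
# in practice these come from a trained word2vec model.
similar = {
    "java developer": [("java", 0.91), ("j2ee developer", 0.87),
                       ("java architect", 0.82)],
    "data scientist": [("machine learning", 0.88), ("python", 0.79)],
}

def synonym_lines(similar, top_n=5):
    """Emit one synonym mapping per term, keeping the similarity
    score after a '|' so it can later be read as a payload."""
    lines = []
    for term, related in similar.items():
        expansions = [f"{term}|1.0"] + [
            f"{t}|{score:.2f}" for t, score in related[:top_n]]
        lines.append(f"{term}=>{','.join(expansions)}")
    return lines

for line in synonym_lines(similar):
    print(line)
# java developer=>java developer|1.0,java|0.91,j2ee developer|0.87,java architect|0.82
# data scientist=>data scientist|1.0,machine learning|0.88,python|0.79
```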
The last method for injecting this information into Solr is to use a machine learning clustering algorithm: take the vectors you've learned for the words, compute clusters over them, and then map everything in a cluster to the same token in the synonym file. So if you have a cluster containing "Java" and "Java developer," you map both to the same token. That's the easiest way to consume the output of this.
For this to work, it's important to have a set of important keywords for your domain. Trey Grainger, earlier, suggested using your top search terms, obtained by query log mining, as a starting point for this kind of analysis. That's a very good starting point for determining which words matter in your domain. But going beyond that, I've built a simple algorithm that will mine your documents for you and find commonly occurring phrases and terms, and you can use those as the vocabulary to learn synonyms over.
I've kind of just gone over this, but when you're doing this, it's important to take common phrases and match them as single tokens. If you work purely on a single-word basis, you don't always capture terms like "data scientist." One thing we found early on with our search is that queries for "data scientist" would produce a lot of scientist positions, because the whole phrase wasn't being taken into account. So it's very important to configure Solr to extract tokens for these phrases when you're defining your keywords.
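A minimal sketch of that longest-phrase-first matching is below; the phrase list is hypothetical, and inside Solr this kind of collapsing is what the synonym filter (or the Solr Text Tagger) does for you:

```python
# Known multi-word phrases, each collapsed to a single token.
phrases = {
    ("data", "scientist"): "data_scientist",
    ("machine", "learning"): "machine_learning",
}

def tokenize_with_phrases(text, phrases, max_len=3):
    """Greedy longest-match tokenizer: prefer a known phrase
    over emitting its individual words."""
    words = text.lower().split()
    tokens, i = [], 0
    while i < len(words):
        for n in range(max_len, 1, -1):  # try longest phrases first
            if tuple(words[i:i + n]) in phrases:
                tokens.append(phrases[tuple(words[i:i + n])])
                i += n
                break
        else:
            tokens.append(words[i])
            i += 1
    return tokens

print(tokenize_with_phrases("Senior Data Scientist machine learning role", phrases))
# ['senior', 'data_scientist', 'machine_learning', 'role']
```

With this in place, a query for "data scientist" matches the single `data_scientist` token instead of flooding the results with every "scientist" position.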
As I mentioned, you can do all of this yourself; the code is now publicly available on GitHub. I'll be adding documentation over the next few weeks, but I've already tried to document it enough to get you started. There's a set of Solr plugins; some relate to this talk, but as I mentioned, they do a lot of other things too. I have a repository of Solr config examples; the easiest way to understand how to use the plugins is to see them used in an example config, where you'll also find the parameters documented. And then there's the conceptual search repository that contains the machine learning pieces. That's all in Python, set up so you can run the scripts, point them at your documents, and give them your keywords.
So what additional tricks am I using to make this happen? You can use the synonym filter in Solr to extract keywords from documents, a kind of primitive parsing, or you can use the Solr Text Tagger, which will do the same thing. What that looks like is this: we're basically hacking Solr here to do simple term extraction. Say you have a bunch of noisy information like this, a Lucene position description. If you configure your synonym filter with a set of important phrases and terms and then use a type token filter factory, you can use Solr to do a sort of primitive term extraction, although it's actually quite sophisticated: it always extracts the longest matching phrase. Or, as I mentioned, you can use the Solr Text Tagger to do this for you. As you can see at the bottom here, given that initial input, we've extracted the terms Lucene, Solr, Elasticsearch, and ZooKeeper, and normalized them into canonical forms. This is just using standard Solr functionality and a synonyms file.
The second trick for making conceptual search happen in Solr is to make use of payloads. Payloads are a very useful piece of Solr functionality that I think is often overlooked. They allow you to tag each term in your document with a value, in this case a real number. Then you just need to make some small modifications to a similarity class and a query parser to allow them to use those payloads in scoring. This is very powerful and has many applications beyond this one. For example, we use it in our recommender engine to weight terms in a document with our own custom weighting scheme. You could even use it to implement a simple machine learning model within Solr, a linear model, because it lets you set arbitrary weights on the terms in a document rather than just using the tf-idf values.
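A field type wired up this way might look roughly like the following sketch. The file names are illustrative, and the custom boost/similarity pieces from the Dice plugins are omitted; only standard Solr filter factories are shown here:

```xml
<!-- Sketch: expand terms via a synonym file whose entries carry "|score"
     suffixes, then store those scores as float payloads on each token. -->
<fieldType name="conceptual_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- expand each keyword to its related terms, e.g. "java|0.91" -->
    <filter class="solr.SynonymFilterFactory" synonyms="concept_synonyms.txt"
            ignoreCase="true" expand="true"/>
    <!-- strip the "|score" suffix and attach it as a float payload -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="|" encoder="float"/>
  </analyzer>
</fieldType>
```

The payload-aware similarity and query parser from the plugins repo then pick up those float payloads at scoring time.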
So we use a synonym filter to expand the keyword into multiple tokens, each token carries an associated payload, and we use that payload to adjust the relevancy scores at index or query time to make this conceptual search happen. You can think of it as a form of query expansion, but you can do it inside Solr at index time if you wish, you can do it at query time, or, as I mentioned, you can do it in your API.

So what does this look like in the analysis chain? Again, I'm using a synonym filter and a type filter to extract and normalize the terms, Lucene in this case, and at the bottom here you'll see the bit I'm now talking about. There's a synonym mapping for "Apache Lucene" that maps it to a set of top-n terms, and each of those has a payload associated with it; you'll see the payload after the pipe symbol. So first it takes "Apache Lucene" and expands it to the most related terms as found by the algorithm; then it takes each payload and, at query time, converts it into a term boost, which you'll see down here. You can also do this at index time, using a version of edismax, available on my plugins site, that can read payloads and include the payload in the score.
Okay, so what do some of these example synonym files look like? I'm going to skip the vector-based method, but you can read more about that on the plugin page. This is an example analysis chain for this kind of technique. We do some pre-processing here to remove HTML characters and whitespace, tokenize, and so on; then we use this job-titles file to filter the set of terms down to just our job titles; then we use a type token filter factory to remove anything that's not a synonym, that is, not in that file; then we use a delimited payload token filter to extract the payload; and finally I use these custom token filter factories to turn the payloads into term boosts.

So what does a file built by this model look like, the one you'd ingest into Solr as a synonym file? Here's an example entry. I'm using a top n of five so you can see what it looks like; you might want larger values. It takes the phrase "Java developer" (again collapsed into a single token, as I mentioned) and maps it to "Java," "J2EE developer," "Java architect," "lead Java developer": all terms the algorithm learned, by itself, are relevant to "Java developer." And you'll see the different payloads; they're ordered, so the terms closest to "Java developer" are the most similar or related to it. As I mentioned, you can configure this at index time with payloads, or at query time. Doing it at query time gives you a lot of choices, because you can experiment with different approaches, different field weights, and so on without re-indexing. And I know that for some people re-indexing can take anywhere from an hour to weeks, so being able to configure this at query time matters, at least for some. On one of our sites we have millions of documents, and we could not have run many experiments if we'd had to keep re-indexing, so that's important. And unlike the vector method, and unlike LSA, we're still able to make use of the inverted index, because we're basically just doing term expansion. So it's fast at both index and query time, as long as you don't set n too high.
So, the last method I mentioned: instead of the top-n-terms approach, you can do clustering, and again, there's code in that GitHub repo that will do the clustering for you. What you do is take the model's vectors, learned over your top keywords and phrases, and cluster them by similarity. Rather than doing a single round of clustering, you might want to do several passes with different cluster sizes and combine those into different fields, so you can give them different weights: the tighter, smaller clusters get more weight, because they're a narrower set of related terms.

So what does a synonym file output by the clustering approach look like? As you can see, these are all terms that landed in the same cluster, and they simply map to some arbitrary cluster token label in your synonym file. This is probably the easiest way to use this out of the box: you don't need any custom payload functionality or anything; you can just generate the file and configure it as a synonym file. I've not seen a noticeable impact on query or indexing performance; in fact, in some ways it may actually improve it.
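To sketch the idea, here is a toy version of that clustering step; the vectors are made up, and instead of the repo's real clustering over learned word2vec vectors this uses a simple greedy single-pass grouping by cosine similarity:

```python
import math

# Toy "learned" vectors; related skills get similar vectors.
vectors = {
    "solr":          [0.9, 0.1],
    "elasticsearch": [0.85, 0.15],
    "lucene":        [0.88, 0.12],
    "french":        [0.1, 0.9],
    "bilingual":     [0.15, 0.85],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster(vectors, threshold=0.95):
    """Greedy single-pass clustering: put each term into the first
    cluster whose seed it is close enough to, else start a new cluster."""
    clusters = []  # list of (seed_vector, [terms])
    for term, vec in vectors.items():
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(term)
                break
        else:
            clusters.append((vec, [term]))
    return [members for _, members in clusters]

# Each cluster becomes one synonym-file line mapping terms to a shared token.
for i, members in enumerate(cluster(vectors)):
    print(f"{','.join(members)}=>cluster_{i}")
# solr,elasticsearch,lucene=>cluster_0
# french,bilingual=>cluster_1
```

Since every term in a cluster indexes to the same token, queries for any one of them match documents containing any of the others, with no payload machinery required.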
32:23understand the power of this algorithm
32:25it helps if you kind of look at some of
32:27the clusters that it's able to learn
32:28from our jobs data so I picked some
32:31examples from the first time I ran the
32:33algorithm you know there's no like
32:35there's no tuning or anything this is
32:37just the first time I ran it to see if
32:39it kind of made sense what it was doing
32:42so there was a cluster for what I would
32:44call natural language information so I
32:46had bilingual Chinese French fluent
32:50speak speaker these are all kind of
32:52terms related to natural languages that
32:55were all in one cluster there's the
32:57entire cluster I haven't moved anything
32:59it made a small cluster for Apple
33:02programming languages of which you found
33:03to cocoa and swift by the way these
33:07these labels here I have added for
33:09interpretation but obviously kind of
33:11that is a little hard automatically
33:13it'll in the cluster for search engine
33:16technologies so it festered solar
33:19elasticsearch Lucene search engines they
33:23all came in the same cluster and they
33:25kinda had closed the food net beyond
33:28Beyond pure tech skills, it also did some
33:29interesting things. Another thing I
33:31found was it was able to cluster the
33:34terms that employers use when they're
33:36describing the sort of ideal candidate for
33:37a role; these terms kind of relate
33:41to attention and attitude
33:43and focus. It also produced a larger cluster
33:46of these kinds of attribute-type terms,
33:50just describing someone's attitude,
33:52which I thought was really interesting:
33:53terms like pays attention,
33:56self-motivated, diligent, detail-oriented;
33:59they were all kind of mapped to that same
34:01cluster. And again, this is all ultimately
34:04extracted by the algorithm;
34:05there's no supervised labeling going on
34:07here, you don't have to go through and
34:09do labeling of your data; it's just learning
34:11what terms are related in
34:13your, you know, data set for you. It's
34:16doing a lot of work without much
34:19configuration. So, to summarize: it's easy to
34:28overlook recall when you're performing
34:29relevancy tuning. Conceptual search
34:33provides you with a method of improving
34:35recall while still, hopefully, maintaining
34:37high precision, by trying to match
34:39documents on related terms, or concepts.
34:43In reality this involves using a
34:47machine learning algorithm that will
34:49learn what terms are related to
34:51one another. word2vec is a very
34:54scalable algorithm for doing this: it's
34:56very fast, it doesn't seem to use much
35:01more memory even if you give it multiple
35:03times the number of documents, and
35:06it's been shown to do well with
35:08analogy tasks and mining similar
35:10words and so on, so it's an appropriate
35:12tool for this task; there's a reason
35:14that Google developed it for doing this
35:16kind of thing. And we can train a word2vec
35:19model offline, and we can take
35:21its output, if we process it, and use
35:26various methods to try to make Solr do
35:28conceptual search without embedding some
35:31complicated machine learning model, and
35:37that may or may not
35:39require the use of payloads and
35:41custom plugins, depending on which
35:43approach you take. I would start with
35:45the clustering approach, because that's
35:47very easy, and then maybe experiment with
35:49some of the other ones that make sense
35:51to you. So that is the end of the talk; I
35:55have about five minutes for questions.
36:19so I used several approaches: the
36:23first, and this gave the best
36:25results, used a set of keywords that we
36:27defined ourselves. One of those was very
36:30simply a list of top job titles, which
36:32is very easy to define, but the
36:34skills set took more work. But I also,
36:37and I include code for this,
36:38experimented with taking a set of
36:41keywords and also extracting commonly
36:43occurring phrases and using those to
36:47train the model, and that still did
36:48pretty well; that second approach
36:51didn't require any kind of manual
36:53engineering or determining of keywords
36:55or so on. I switched to keyword clusters
36:59here; it doesn't work quite as well for that
37:02one, it's not quite as good, but it
37:06still works pretty well. So if I type in
37:07data scientist I get MATLAB positions,
37:09I still get some Hadoop positions, I get
37:12some kind of data-related positions, see?
37:19C# gives me a bunch of .NET
37:20positions; it got a little confused about
37:23one term, because that's a .NET
37:26technology but it's also a server-side
37:28thing, but you're getting VB jobs. That
37:32works pretty well, so if you don't have
37:34a set of keywords defined you can
37:36try the second approach, but I would
37:38definitely start with this kind of
37:39list of your top search terms;
37:41everyone should be capable of
37:42extracting that, and if you want
37:46you can run my algorithm to do the
37:47phrase mining and use those also. Yep?
38:13yeah, so you have to be
38:17careful how you want to use this,
38:18because sometimes if you show users
38:20documents that don't match their terms,
38:21even if they're relevant, they may complain,
38:22right? So it's important you test this,
38:25and also when you're building a UI you
38:26may need to tell your users that, okay,
38:29this isn't an exact match. So you may
38:31want to offer this as an additional
38:32option. There are many ways you can
38:34use this: one would be as a different
38:36option, right, "not finding any good
38:37documents? click here"; a second would be
38:40just to include them in the matches and
38:41then use the qf functionality to weight
38:44these kind of lower than the exact
38:46matches, so that once they start running
38:47out of good matches then they start to
38:49see the conceptual matches. Yeah,
38:55just for the purpose of evaluation, right,
38:57but how I would configure this in
38:59production is I would set a high value
39:00on the phrase matches and exact matches
39:03and then pull these in at a lower kind of
39:06weight. Another way you could do it is
39:07with re-ranking functionality:
39:09you could, you know, use the
39:12Solr re-ranking function, so you could
39:13take the top thousand matches and
39:14then re-rank them with this, so
39:17that if you only have, like, one or two
39:18terms that match, if you
39:21configure it correctly in Solr you
39:23could then make sure that, well, okay, I
39:26only matched on one term but I matched on
39:27five related terms here, or five clusters
39:30or whatever, and rank those higher.
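The re-ranking idea can be illustrated outside Solr with a toy rescoring function (in Solr itself this would go through the re-rank query, the `rq` parameter); the `alpha` weight and scoring scheme below are assumptions for illustration, not what Solr computes:

```python
def rerank(results, cluster_hits, alpha=0.1):
    """Re-score the top results: keep the original relevancy score but add
    a small bonus per matched cluster token, so exact matches still dominate.
    `results` is [(doc_id, score)], `cluster_hits` maps doc_id -> count of
    conceptual-cluster matches (alpha is an illustrative weight)."""
    rescored = [(doc, score + alpha * cluster_hits.get(doc, 0))
                for doc, score in results]
    return sorted(rescored, key=lambda p: p[1], reverse=True)

top = [("job1", 2.0), ("job2", 1.9), ("job3", 0.5)]
hits = {"job2": 3, "job3": 1}          # conceptual-cluster matches per doc
print(rerank(top, hits))               # job2 now edges ahead of job1
```

The point of keeping `alpha` small is exactly what the answer describes: a document matching one exact term plus several related clusters can overtake one with a single exact match, without conceptual matches swamping the exact ones.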
39:32So there's many ways you could use this;
39:34we're still kind of evaluating how we
39:35want to go forward. When we showed this
39:37to our recruiters, everybody liked it,
39:38because, and I didn't even think of this,
39:40there's a lot of tech jobs that are really
39:41hard for people to place. Like, DevOps is in
39:43extremely high demand right now, and they're
39:45having a really hard time placing
39:47DevOps developers. This gives them, like,
39:49good suggestions for alternate types of
39:51developers they could find, so if they don't
39:52find any good DevOps developers,
39:54or they're too expensive, this shows them
39:56really good matches, the kind of developer
39:58that would be a good fit because
39:59they would have a similar skill set.
40:06Currently this is still in a kind of
40:08evaluation phase, but we do intend to use it
40:10on the site soon. Yes? Sure, sorry.
40:26So the approaches I focused on I think scale
40:28just fine; like, I don't notice any
40:30detriment. My initial idea with this,
40:32and I kind of avoided it because I
40:34think it complicates matters, was to
40:36actually trick Solr into matching on the
40:38vectors themselves. So you could take a
40:41term, and say for a
40:43given word you have a hundred-element
40:45vector: you can use synonym files and
40:48other stuff in Solr to map that to a
40:50hundred tokens with payloads on them, and
40:52then do the vector matching that
40:54way, and then you kind of have something
40:55that's very much like LSA inside of
40:57Solr, but that doesn't scale well and
40:59it's kind of complicated. The code is
41:01out there if you want to look through and see
41:03how I do that. It uses the same concept,
41:05but rather than taking the top ten or
41:07twenty related terms, there's a mapping
41:09to artificial tokens: zero-one, zero-two,
41:13zero-three, something like that, for the
41:14elements in the vector, and then the
41:16value of that element is your payload.
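That vector-to-tokens trick can be sketched as follows; the zero-padded token names mirror the artificial "zero-one, zero-two" tokens just described, and the `token|payload` form follows Solr's delimited-payload convention, though the repo's exact encoding may differ:

```python
def vector_to_payload_tokens(vec):
    """Encode a dense vector as payload-carrying tokens: element i becomes
    token "i" with its value attached as the payload, in the
    "token|payload" format that Solr's delimited-payload filter parses."""
    return " ".join("%02d|%.3f" % (i, v) for i, v in enumerate(vec))

print(vector_to_payload_tokens([0.12, -0.3, 0.0]))
# 00|0.120 01|-0.300 02|0.000
```

Every document then shares the same hundred tokens, which is why this defeats the inverted index: nothing is filtered out, and all the work happens in payload scoring.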
41:20Yeah, yep. Yeah, so I mean
41:27that's probably going to be the
41:28most accurate approach, but it has a big
41:30cost in terms of performance, because you
41:32can't use the inverted index. You could use
41:34it for re-ranking, though, if
41:37you're just re-ranking the top hundred
41:39matches, or a hundred
41:41thousand matches, or something like that.
41:49I used jobs for this, and
41:52the reason for that is: we have jobs
41:54data and then we have resume data. A job
41:56contains a single job; a resume contains a
41:58bunch of jobs, so the jobs are a lot more
42:01focused. However, word2vec
42:02probably wouldn't care, because it's
42:03looking within a very narrow window
42:05around the word. But when I've used LSA
42:08on that, jobs were definitely much better,
42:10because LSA is looking at what terms
42:12occur in the same document, whereas this is
42:14using a very kind of narrow window, so
42:16you could use it with either. I
42:19would say it's more flexible as well as
42:21being better. Does that answer your question?
42:27This one was just under a hundred
42:29thousand. Well, I mean, I prune a lot,
42:33though, because we have a lot of ones
42:34that are very short, so after the pruning
42:36and pre-processing I think it was about
42:37sixty-six thousand, but it runs pretty fast.
42:42And, like, I have a lot of
42:45pre-processing steps, and those
42:47actually take longer than the model takes
42:48to train, probably, like doing the HTML
42:50parsing and all that; I'm just
42:52using some open-source libraries. The
42:54model training itself, I think to train
42:56ten iterations took about ten minutes, so
42:59it's pretty fast, and this is Python
43:01code, so there are implementations of it
43:04in Java and other libraries that are
43:06probably faster, but it's pretty fast.
43:16Yeah, so I'm doing clustering here, so
43:31the cluster sizes can vary, right? The
43:33clustering algorithm just
43:36tries to find the most similar
43:37items, and you set the number of clusters;
43:38some of them have two items in them, some
43:40of them have a lot, so you'll find the
43:43larger clusters have a lot of very, very
43:44similar items in them, right. So it's
43:52learning that by basically going through
43:53the documents and looking at what words
43:55occur close to one another, and then
43:57training a machine learning model. Yeah, yes.
44:04right, right, yeah. Yeah, I mean,
44:09sometimes it's, like, co-occurrence, so
44:11you get quite far by doing
44:12just a simple co-occurrence model with a
44:14window, which is what people have done in
44:16the past, but this uses a machine
44:17learning approach that probably produces
44:19better results, at least in the tests
44:22they've done. But you can think of it as
44:24doing that; it's not explicitly
44:26computing that, but it is in a sense. So
44:32you can read the paper if you want to know.
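The simple windowed co-occurrence baseline mentioned here can be sketched as a plain frequency count, with none of word2vec's learning; the window size and example text are illustrative:

```python
from collections import Counter

def cooccurrences(tokens, window=2):
    """Count how often each pair of words appears within `window`
    positions of one another (the simple baseline word2vec improves on).
    Pairs are stored as sorted tuples so (a, b) and (b, a) pool together."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            counts[tuple(sorted((w, tokens[j])))] += 1
    return counts

text = "java developer java spring developer".split()
print(cooccurrences(text, window=2).most_common(2))
```

word2vec is trained over the same kind of narrow windows, which is why it can be thought of as implicitly capturing these statistics even though it never builds this table.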
44:53so not all of the clusters are perfect,
44:57but, you know, you're using this to
44:59sort of augment your existing search, not
45:01to replace it, you know, so if you do that
45:04I don't think you'll get
45:06a lot of bad hits. With the
45:08clusters, not all the clusters are quite
45:10as neat as the ones I showed you there.
45:11I'm providing everyone
45:13with the code, but I can't
45:14provide you the output of the algorithm,
45:15because that would be too much and I
45:19would get in trouble. But, oh yeah, most of
45:22them made sense to me, like, most
45:24of these are pretty good
45:25clusterings of terms. You can always, if
45:28you have time, you know, go and
45:29inspect them and delete the ones you don't want.
45:38Well, right now I'm just processing the
45:40whole document as text. You could do
45:41separate training on different sections of a
45:43document; that might be a good thing to
45:45do, actually, and just train it that way.
45:48And then in Solr you can
45:50configure as many fields as you want,
45:52with as many weights, so I typically
45:54take advantage of that: I can have
45:56several different fields, some of which may be
45:57more important, and separate them out,
45:59and then I can always change it at
46:00query time. So, yeah, I would definitely, if
46:02you have different fields that are more
46:04important, then I would do this.
46:06Actually, for the one I showed you, I used
46:09titles and skills separately: the skills
46:11are extracted from the job description,
46:12and I trained a model on that; then I had
46:14a separate title model, and it was using a
46:16combination of those two to work.
46:23Mm-hmm, yeah, words with different meanings with
46:40this? Well, I mean, there are some ways this
46:41can work out. I mean, one way to
46:43deal with this, and why I said it's important,
46:44is just to make sure you're extracting
46:45phrases, not just single words, because
46:49the phrase itself will help
46:50disambiguate those kinds of situations.
46:51But the other way this kind of works is
46:53that it's not just mapping, for a given
46:55document, one word to one
46:57cluster; it's mapping a thousand or more
47:00words to different clusters, so that
47:01combination of data should help
47:02disambiguate the problem, because you're
47:06using the whole document context,
47:08not just one or two words. This is, I
47:11guess, better at solving the
47:13synonym problem, but people have found
47:16that, you know, that mapping of concepts
47:18does seem to help solve the polysemy
47:20problem too; I'm not totally sure how.
47:29yeah, there were different experiments, so,
47:36like, the first one I showed you was
47:37using the set of terms that we'd already
47:39defined, and that worked best, but the
47:41second one I showed you was using the
47:42clusters, and while not as good as I showed
47:44you, we're not using any kind of taxonomy
47:46or any set of keywords: I was just
47:49using the most commonly occurring terms
47:51or phrases in the document set, as
47:53extracted by the Python code that I put out
47:55there, which will kind of reasonably
47:57efficiently mine your data and
47:59find phrases, which you can also do with
48:01the n-gram filter; I mean, there's
48:02a shingle filter in Solr too. Yeah, yeah.
48:11so I give it you can give bad minimal
48:14you can give it as a set of documents in
48:16my in my scripts or if you just want to
48:18use it yourself you just give it a
48:19documents I would recommend including
48:21within that like I said those set of
48:23keywords because you want to make sure
48:27that it covers those because those are
48:28very important right but it will do
48:30reasonably well without that but yeah
48:33there's really those two things I was
48:35try was experiment to another approach
48:37where I was gonna use the keywords and
48:38then find terms that are very closely
48:40related to them and use those as
48:41keywords but that's kind of what this is
48:42doing anyway so I kind of been in that
49:12right, I mean, that's the
49:14beauty of search too: you get
49:16pretty good results even if it's not
49:18doing a perfect job, and this kind of
49:20makes that better. I think she just gave
49:22me a signal that I have to leave, but
49:25what I was going to say is, yeah, what we
49:27really like about this is there's no
49:28manual maintenance
49:29of synonyms, you know.
49:31Some of our sites, like in the medical
49:33profession, we have medical job
49:34sites where the domain doesn't change very
49:36rapidly; you don't get a lot of new
49:38keywords appearing. But in the tech jobs
49:41sector we have new skills coming in all the
49:43time, so we need something that can
49:46handle that and automatically learn it;
49:48I don't have time to go and keep
49:50amending the synonym files constantly. What's the