LangChain "Hallucinations in Document Question-Answering" Webinar
LangChain · 2023-05-17
5K views | 1 year ago
💫 Short Summary
The video features presentations on NLP applications, hallucination in AI models, the importance of document validity, evaluation in retrieval models, and grounding generation techniques. The focus is on improving ML models, transitioning to production environments, automating QA systems, detecting hallucinations, and overcoming limitations in large language models. Strategies for information retrieval, data set quality, and human input in model calibration are discussed, along with prioritization in search queries. The ultimate goal is to enhance technology for better customer support and knowledge discovery.
✨ Highlights
📊 Transcript
✦
Highlights from NLP Applications Webinar
01:58Nick from Mendable showcases chat search for developer docs.
Mathis from deepset presents their open-source NLP framework and commercial product.
Daniel discusses Auto-evaluator, an evaluation framework for testing QA systems.
Speakers share insights and experiences related to their projects and industry involvement after resolving technical connectivity issues.
✦
Defining hallucination in AI models.
05:57Hallucination in AI models occurs when incorrect results are produced due to lack of training on specific data.
Mendable's evaluation dataset defines hallucination as the AI generating results not present in the input documents.
Different perspectives on hallucination in AI models are shared, focusing on unexpected outputs.
Challenges of training AI models on diverse data and the impact on result accuracy are discussed.
✦
Strategies to reduce hallucinations in large language models.
08:34Requesting evidence from provided documents can help prevent the generation of incorrect information.
Setting boundaries for responses is another effective way to reduce hallucinations in language models.
Employing step-by-step reasoning can also help in minimizing the occurrence of hallucinations.
Manual evaluation is crucial for challenging cases where documents are not fully retrieved.
✦
Importance of verifying document validity in responses to requests for evidence.
11:05Emphasis on evaluating each step of the pipeline separately to ensure accuracy and match the right documents.
Robust evaluation system for retrieval and completion mechanisms is crucial.
Improving retrieval system key to reducing hallucinations and enhancing response quality.
Hybrid search combining semantic and keyword approaches mentioned as successful method to improve results.
✦
Accuracy of language model depends on quality of documents provided.
16:04Confidence levels are adjusted when necessary information is missing to prevent inaccuracies.
Enterprises may opt for a less creative model for reliability, while others may prefer a more creative approach.
Customers can choose the desired level of creativity for the model.
Temperature adjustment based on document quality is being tested, with most cases currently defaulting to zero temperature.
✦
Importance of setting higher priorities for data sources in improving query results.
18:12Emphasizes the need for a system that evaluates each step of the process.
Investing time in creating quality evaluation datasets is crucial.
Encourages the use of OpenAI Evals and similar tools for model improvement.
Conducting adversarial testing with questions designed to provoke hallucinations is recommended.
✦
Importance of evaluation in retrieval models for QA systems using the EQA framework.
21:06Process involves data extraction, encoding, retrieval from a vector database, and re-ranking with a re-ranker model.
Vectara's system simplifies the process for developers by balancing performance, cost, and retrieval times.
Key topics include reducing hallucination effect, real-time updates, and cross-lingual capabilities.
Future goals involve adding images, audio, and video recognition to enhance information retrieval.
✦
Vision for modern search engines:
24:10The goal is to transition towards search engines that provide direct answers instead of just search results.
Evolving applications with advanced technology:
The aim is to improve customer support and knowledge discovery through the use of more advanced technology.
Challenges with language models:
Fine-tuning language models with customer data poses challenges such as the risk of hallucinations and long training times.
✦
Grounded generation is a technique in which information retrieval models are trained to find relevant facts to generate responses.
27:14Customer data is encoded and stored in a vector database, with the closest proximity vectors retrieved to provide results.
Generative models are used to summarize facts into answers for end users while ensuring responses stay factual.
Grounding responses in external knowledge and data sets is emphasized to reduce hallucinations.
✦
Improving information retrieval ML models through a hybrid approach.
30:55Combining generative and retrieval models is crucial for enhanced performance.
Major companies are investing in generative models, reflecting an industry trend towards open source.
Utilizing available resources to make ML models the best they can be.
The Auto-evaluator project aims to facilitate the transition from demo to production environments, backed by accurate data.
✦
The tool demonstrated is designed to automatically generate questions and answers for a QA system based on provided parameters.
33:50It uses AI to grade the QA chain against test data generated in the first stage.
The tool aims to enable developers to quickly generate test sets without manual labor.
It offers a summary of experiments with different retriever methods and parameters, showing how they scored against each other.
Overall, the tool streamlines the QA process and provides efficient and automated solutions for developers.
✦
Strategies for detecting hallucinations using smaller fine-tuned models.
39:28Focus on hallucination and retrieval augmentation, document question answering, and open-source frameworks.
Insights on verifiability and attribution in recent research.
Results from training smaller Transformer models and evaluating statistical methods.
Overview of patterns in hallucination detection and the importance of effective evaluation methods.
✦
Limitations of large language models like GPT-4 in accurately generating information.
42:35Issues with missing small contextual details and generating incorrect information, such as inaccurate river lengths.
Stanford researchers found a high error rate in supported statements by generative search engines.
Ohio State researchers developed an attribution score to judge the reliability of cited references in generated content.
The FLAN-T5 model outperformed other models in the evaluation of reliability of generated content.
✦
Evaluation of information retrieval and generation models, focusing on performance metrics and correlations with actual scores.
47:21Models like UniEval-fact (T5) showed subpar results, and dataset quality issues (mislabeled hallucinations) complicated evaluation.
Challenges include the need for score adjustments and refining evaluation methods to better reflect human perception.
Generalization across datasets and handling large context sizes are key considerations for future research and development.
✦
Importance of Information Retrieval (IR) in research and evaluation processes, focusing on metrics like mean reciprocal rank and mean average precision.
50:21Emphasis on optimizing search engines through standard metrics and evaluating performance using F1 score to show improvement over legacy systems.
Significance of blending precision and recall in the F1 score for information retrieval models.
Need to store intermediate results in vector databases or SQL for future analysis and testing purposes.
✦
Importance of labeled datasets for calibration in machine learning models and the significance of human input.
54:10Complexity of labeling datasets and the necessity of experts creating high-quality datasets for accurate evaluation.
Various high-quality datasets available for training models from Reddit, Amazon, and Stanford.
Challenges in determining correct answers in vast datasets, highlighting the need for human expertise in data evaluation.
Discussion on dynamically prioritizing different data sources based on input queries in machine learning models.
✦
Importance of metadata attributes and custom elements in search query prioritization.
57:25Boosting sources based on recency and social graph connections is crucial for ranking functions.
The ultimate goal is for the system to independently determine relevance.
Brief mention of a Pomsky named Luna and appreciation for collaborative efforts in advancing technology for a better world.
00:00thank you
00:00hello everyone and hello emmer who
00:03joined right as we're going live we're
00:06now live so thank you everyone for for
00:08joining and and tuning in to the uh
00:12hallucinations and document question
00:13answering webinar so it should be a
00:15really fun one we've heard this as a big
00:16pain point from a lot of folks
um as they try to put question answering
systems in production
00:24um and so yeah really excited to be here
00:27we've got four awesome presentations
00:30um Daniel should hopefully be joining he
00:31was on in the Green Room earlier um and
00:33then he went offline but we hope to see
00:35him back here soon
00:36um so what we're going to do uh is maybe
00:39we can do quick introductions from
00:40everyone then we'll jump into
00:42presentations from each each group for
00:45like eightish minutes
00:46um and then we'll do question answering
00:48at the end minor logistical stuff before
00:50we get started this is being recorded
00:52um so so it will be available at the
00:54link after the fact
00:56um the for for questions
01:00if you have questions for the presenters
01:02please put them in the question
01:04answering box on the right so you can
01:06see there's a little chat icon and then
01:08below that that there's a little box
01:09with a question mark please put
questions in there and then upvote the
ones that you want to hear answered as
well and those will be the ones that
we'll prioritize if there's immediate
01:17things that pop up
01:19um like zooming in or audio gets bad
01:21during the presentation please put those
01:22in the chat
01:24um although for video quality I'd
01:25recommend kind of like refreshing the
01:27page we've seen that a few times
01:29um
01:30yeah I think that's all the logistical
01:32stuff so maybe we can jump in quick to
01:34Quick introductions from everyone and
01:36then get into the first presentation
01:37Nick do you want to start
01:39yeah for sure so hi everyone my name is
Nick I'm the co-founder of Mendable
um and yeah Mendable I mean if you if
you haven't seen us in the LangChain
docs yet we basically provide AI-powered
search for developer documentation
01:54awesome Mathis do you want to go next
01:58yeah um hi uh nice uh talking to you
02:01today and thanks for the invitation
02:03definitely
um I'm Mathis
um I'm head of product at deepset and
um we have an open source framework
called Haystack basically to build NLP
applications and then commercial product
called deepset Cloud which is more
02:23applications
02:26great and Daniel do you want to go
02:37oh
we seem to be having some internet
connection issues with Daniel so maybe Ofer
and Amr from Vectara do you guys want
to do a quick intro
02:47sure I see Daniel moving again now he
02:50was disconnected for a second hey Daniel
02:51we were asking you to introduce yourself
02:55try again
02:56yeah guys uh check check sound check can
02:58you hear me
03:00all good yeah
03:02yeah and now we work uh they're having
03:05some troubles
um hey guys I'm Daniel I'm a founding
engineer at Shepherd uh an insurtech
in the commercial construction
03:12space and on the nights and weekends
I've been hacking on this project called
Auto-evaluator which is an evaluation
framework for testing QA systems we're
excited to give you all a demo and here
03:24are some learnings building that
okay I'll go next uh so hi hi folks I'm
Amr Awadallah I previously was the
co-founder and CTO of Cloudera obviously
some of the Hadoop and MapReduce and Spark
and some of the great things we have in
our distribution uh currently
co-founder and CEO of Vectara and
our mission at Vectara is to allow you
03:50to easily have a conversation with your
03:52data so I have an API that make it very
03:54straightforward to upload whatever data
03:56you have and then make another call to
03:58have a conversation with that data and
then today I also have Ofer with me
Ofer do you want to briefly introduce
yourself
04:03yes hi everyone can you can you hear me
04:07yeah uh so nice nice to meet everyone
04:09thanks all for coming uh I'm also at
Vectara working with Amr on developer
04:14advocacy
04:15uh and have been working on uh llms uh
04:19for a while for about three or four
years so I like to think we're
talking about hallucinations today I
think back when it was GPT-2 we called
it nonsense but it was a different thing
because it was just incoherent stuff but
04:31anyway glad to be part of this
04:33conversation and looking forward to
04:34what's next and next
04:37all right so so we've got a packed
04:40packed webinar awesome guest really
04:42excited to hear from everyone let's
04:43start with you Nick
04:46awesome yeah so let me share my screen
04:49real quick
04:59all right
05:01let me know when y'all can see
05:04I can see it awesome
05:07great so let me skip through this uh so
05:10before we start I just want to give a
quick overview of Mendable again you
know so we're basically we provide
chatbot search for documentation uh
we're in the LangChain docs and Llama
Index uh right now we're currently
focused on developer-focused companies
05:23and Enterprises
05:25um and we also we provide like
05:26embeddable components uh that's on react
05:29JavaScript in our API so you can
05:31Implement chat uh in your own products
so um going to the next slide I think it's
05:38pretty important for us to Define what
we're talking about with hallucination
and I think like
um everyone has kind of like a different
persp uh different definition of what
hallucination means uh so I'm gonna go
I'm gonna say like I'm gonna go with
what Mendable thinks uh and what
Mendable defines as a hallucination in
05:57our evaluation data set but I would love
06:00to hear from everyone what
06:02um what and hallucination really means
06:04so for us
06:06um is when the AI model produces a
06:08result that was not present or formed on
06:10the documents that we fed through the
06:12prompt so that's when we call it a
06:14hallucination and I mean in our use case
06:17this often leads to incorrect result but
06:20not always right because it definitely
06:22leads uh to creative answers sometimes
06:25and I I mean the incorrect results part
06:28as um because the model probably wasn't
06:31trained on our customers data for
06:34example right so the model has no idea
06:36what it's saying so if we feed something
06:38wrong
06:40um the model will most likely produce
06:42something incorrect
06:44so that's how we Define what I would
06:46love to hear from our speakers here
06:48um what they think uh hallucination
06:51really means
well so one quick comment here and I'm
curious for other folks' takes as well but
06:58so just to clarify like there's kind of
07:01like getting a question wrong or
07:02hallucinating because so you're doing
07:05like retrieval augmented generation
07:06where you have a question then you
07:07retrieve some documents and then you
07:09basically ask the model to come up with
07:11some answer based on those documents and
07:12you're saying Hallucination is when like
07:14uh could you go back to the definition
07:15it's like uh it's it produces a result
07:19that was not present or formed based on
the documents fed through the prompt so I
07:22guess like how do you think about or in
07:24like do you distinguish between cases
07:26when like the because the big step in
07:29there is the retrieval step right and
07:30putting the documents into the prompt
07:32and so yeah how how do you think about
07:34that and distinguish between that in
07:35hallucination like is there is there a
07:37hallucination where it like retrieves
07:39the right results but the model just
07:41doesn't pick them out and is that like
07:44and how do you think about that
07:45differently from when the model doesn't
07:47retrieve the right results or maybe
07:49retrieves like half of the results and
07:51start and hallucinates half of it or
07:53something like how do you yeah totally
07:56totally can I give an example actually
07:58uh that that it's a very funny example
08:00of somebody a very famous reporter from
08:03The New York Times they're not going to
08:04say his name but his name starts with c
08:06and he was at some other uh famous
08:09machine Learning Company their name also
08:10starts with C which is very interesting
08:12and they had indexed their
08:14documentations and then they asked the
08:16question what is temperature and
08:18temperature above large language models
08:19here right which is the the randomness
08:21in the output words being produced and
08:24that the response came back explaining
08:25temperature well like in in terms of the
08:27context of a large language model but
08:29then the last sentence was and
08:31temperature is measured in Kelvin
08:34now that was hallucination because this
08:37is not the same temperature that we use
08:38for physics this is the temperature that
we use for large language models so
that's what we mean when we say
hallucination like there's additional
08:44outputs that the model contributes that
08:46has nothing to do with the facts that we
08:48give to it
08:51exactly
08:53yeah I think this is really it's really
08:56interesting and I think the for yeah
08:58exactly the information retrieval piece
08:59for us is very critical
09:02um but there are cases where of course
09:05there's like not the majority of cases
09:06at all but there are cases where the
09:08document did we did pass the right
09:10documents and it actually didn't like
09:13complete and those are honestly like the
09:15hard ones to really like uh like Auto
09:18evaluate
09:19um normally we use some we need to like
09:23uh like one of the biggest things that
09:25we do is we go manually through it uh
09:28and see it and check it
09:30um
09:32but moving forward here I think like
09:34some ways that we use uh to to reduce
09:38General hallucinations I think this is
09:39really important uh and the first one is
09:41like pretty classic and pretty um
09:43obvious and it would be like a true
09:45prompt engineering so I think here like
09:48some four techniques they listed here
09:49that we use and um that might help you
09:52and your use case
09:54um so like the first one is like a
09:55request for evidence right so like
09:56always making sure like hey according to
09:58the provided documents refer to the
10:00documents when answering quote the
10:02original document uh the second one is
10:04like setting boundaries so this is what
10:07um we basically set a boundary in the
10:09prompt like hey here's where the the
10:11verified documents are going to be
10:13here's where it's going to end answer
10:15solely based on that chunk
10:18um and then another one
10:20is the step-by-step reasoning you've
10:22probably seen in a million websites and
10:25but you know let's think step by step
10:27you know
10:29um how can we answer this based on the
10:31provided documents
10:33um and I think my favorite is the
10:36described first in detail and it's
10:37probably very
it's very similar to what ChatGPT
10:41does but basically like describe the
10:43question in detail before attempting to
10:45answer you know really provide that
10:48description really like
10:50um teaching them having the model say uh
10:54exactly what it's thinking so it can
10:56basically be more creative when
10:58completing uh I think it's really
11:00important
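A minimal sketch of how the four prompt techniques Nick describes (request evidence, set boundaries, step-by-step reasoning, describe the question first) could be combined into one template; the delimiters and wording below are illustrative, not Mendable's actual prompt:

```python
# Hypothetical prompt template combining the four techniques: request
# evidence, set boundaries, step-by-step reasoning, describe the question
# first. Delimiters and wording are illustrative only.
GROUNDED_QA_TEMPLATE = """You are a documentation assistant.

The verified documents start below and end at the closing tag.
<documents>
{documents}
</documents>

Question: {question}

Instructions:
1. First, describe in detail what the question is asking.
2. Think step by step about how the documents answer it.
3. Answer solely based on the documents above, and quote the original
   passages you used as evidence.
4. If the documents do not contain the answer, say "I don't know".
"""


def build_prompt(question: str, documents: list[str]) -> str:
    """Fill the template with the retrieved document chunks."""
    return GROUNDED_QA_TEMPLATE.format(
        documents="\n\n".join(documents),
        question=question,
    )
```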
11:01Nick can I ask one quick question on the
11:03request for evidence but
11:05um do you do you verify that the
11:08documents that they quote or cite are
11:10actually valid documents that were kind
11:12of like passed in based on the retrieval
11:14step
11:17yes yes we do
11:20um we have
11:21um we do verify if the documents is
11:24actually correct we do have a we also
11:26like and then I'm gonna get into later
11:28but we do
11:29um something that helps us a lot is like
11:31evaluating each step of our pipeline
11:33separately so we have like an evaluation
11:36system just for the retrieval mechanism
11:38we have evaluation system just for the
11:40completion mechanism but we also have an
11:42evaluation system that takes into
11:44account like both the uh retrieval and
11:47the completion and we need to make we we
11:50try to make sure that those documents
11:52match the right ones
11:54um
11:55yeah so
11:57I mean going going on the in retrieval
12:00system I think another General way to
12:01like reduce hallucinations is just
12:03improving your retrieval system at least
12:05for us right because we are doing all
12:08this
12:08um document injection in the prompt so
12:11for us having a good retrieval system
12:12means that our answers are going to be
12:15really good or are going to be better
12:16than
12:18um so basically like for those of you
12:21that are not aware like if you're like
12:23retriever for example for us if it grabs
12:25irrelevant documents documents that are
12:28not split accordingly you know the
12:29completion will probably hallucinate
12:31most of the time
12:32um at least in our use case so investing
12:34in a good IR evaluation system is really
12:36important
12:38um also like how the way you split your
text like do you want to have a larger
text splitting to maintain context or do
12:45you want to have smaller ones for like
12:46efficiency and more accurate cases
12:50um and another thing we use and you guys
12:52can hear about on the we on the previous
12:54webinars we we use hybrid search which
12:57is a mixture between like the semantics
12:59um search and keyword search to retrieve
13:02the documents which
13:03um for us it does improve quite a bit
13:05the results
13:07um and you guys can feel free to listen
13:09on YouTube I think on the other webinar
13:10we talk a lot more about that
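A rough sketch of the hybrid search idea, blending BM25 keyword retrieval with vector similarity; this assumes LangChain's BM25Retriever and EnsembleRetriever with placeholder weights, not Mendable's production setup:

```python
# Sketch of hybrid retrieval: blend keyword (BM25) and semantic (vector)
# scores. The weights, k values, and embedding model are placeholder choices.
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import FAISS

chunks = ["...pre-split documentation chunks..."]  # your own corpus

keyword_retriever = BM25Retriever.from_texts(chunks)
keyword_retriever.k = 4

vector_store = FAISS.from_texts(chunks, OpenAIEmbeddings())  # needs OPENAI_API_KEY
semantic_retriever = vector_store.as_retriever(search_kwargs={"k": 4})

hybrid_retriever = EnsembleRetriever(
    retrievers=[keyword_retriever, semantic_retriever],
    weights=[0.5, 0.5],  # tune per corpus
)

docs = hybrid_retriever.get_relevant_documents("How do I set the temperature?")
```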
13:13um
13:14but
13:16um just moving forward here I think
13:18um yeah go for it
13:24yeah I think I'm curious um would you
still consider this a hallucination if
an improper context or improper
documents were provided in the prompt
13:36anyone feel free to jump in
13:39that's a good question maybe so how we
13:41think about it is that sometimes we also
13:44pass the metadata of the document and
13:46then we look at okay from that document
13:50enter metadata could the language model
13:52have known that this is not the correct
13:54document so if I don't know I'm asking
13:57about the revenue of Google but we pass
13:59in a document about Microsoft and it
14:01actually has
14:03company Microsoft in in the header kind
14:06of
14:07um then the large language model can
14:09actually know that's not the right
14:11document and what we then do is
providing an escape hatch so that it says
I don't know if the answer is not
there
14:20um yeah
14:23yeah
14:23but clearly the the relevance is super
14:27important for this right if you get the
14:28wrong documents you
14:30give something bad for the summarizer to
14:33work with and so it'll it'll fail and do
14:36a lot more hallucinations right
14:38right yeah I think yeah we would
14:41consider as like I think it would be
14:42like hallucination as like the complete
14:45like workflow but we would identify that
was because the IR retrieved the wrong
documents and not necessarily the model
14:51being wrong and not responding
14:55but yeah that's really interesting
14:57um
14:58and so
15:00some ways that we use to prevent
15:01incorrect results now so going from the
15:04hallucination to like actually being
15:05factually incorrect
15:07so I mean on the IR retrieval system we
detect when the retrieval returns zero
documents for example right it can't
15:14retrieve any documents or the documents
15:16are like too low score so that raises
like a big flag for us in the backend
15:20and we're like okay like we're not
15:21feeding anything to the model it's just
15:23going to be creative it's going to do
15:25what it does
15:26um so what we do is we decrease the
15:29confidence level to the user and we let
15:30the user know like hey like sorry
um we can't find the information in the
15:35documentation provided so expect your
15:37answer to be less accurate you know so
15:39like hey like I like you may not know
15:42what it's talking about
15:44um and that's something that we leave a
15:45lot to our customers to decide what they
15:47want to do so some of the customers that
15:49we work on like Enterprises they really
15:52don't want the model to be like anything
creative if it's not confident if our IR
15:57fails so then we just say like oh I
15:59don't know like you know documents are
16:01not provided and then some of them want
16:03the model to be a little bit more
16:04creative because maybe the training data
16:06does include the documentation uh
16:08because they're a big company or
16:09something so like we kind of let our
16:12customers decide what they want to do
16:13that
16:15um and then once when they are creative
16:17you know the model can just it can tries
16:20to answer so it will say like oh I'm
16:22sorry I couldn't find anything exactly
16:24on our documents but here's an example
16:27of how
16:29um
16:30and then another example I'm very
16:33curious in this setting do you
16:35dynamically set the temperature
16:37like if you have no documents
um returned from the retrieval would
you update the temperature to be more
creative or would you still have it I
assume like near zero if you have at
least some documents retrieved
16:52right
16:53um so we
16:57so right now I mean we actually just
16:59we're testing with actually dynamically
17:01changing but right now we kind of leave
17:03up to the customer I mean we don't
17:05uh we still let it zero most cases
17:09um but I think that's an interesting
17:10take because then
17:12um
17:13you could potentially of course with a
17:16higher temperature get better results
17:17but because it's a setting it's a
17:20customer setting we haven't done this
17:22yet but
17:24um we're definitely starting to try it
17:26out
17:30um
17:30it's a great question
17:32so another thing that we do too
um it's the teach-the-model functionality
17:37um you know so we let users revise and
17:39correct results in our and so we
17:42basically like we can we provide some
17:43insights to the user based on like how
17:46um users write the questions and also we
17:49show all the conversations and if the if
17:52our customer like says sees an incorrect
17:54answer they can go and revise it uh and
17:57then the revision is pretty it's pretty
17:59basic right now basically like the input
18:01the correct response the input The
18:02Source link we just re-index into the
18:05vector database with some metadata and
18:07some higher priority uh so that someone
18:10asks the similar questions it will query
18:12that document as well
18:14um and you say higher priority because
18:17basically your retrieval step not only
18:19takes into account relevance but also
18:21some notion of priority and so yes
18:23exactly so we have
18:26um yeah we set up in our system to
18:28account for like higher priority like
18:31sources versus like low priority so
18:34there's a lot of people that want to
18:35ingest like their documentation but they
18:37also want to ingest YouTube videos they
18:39also want to ingest zendesk's help
18:41support center and some of those data
18:44sources might not be as relevant as the
18:45docs for example are as not as updated
18:47as the docs so we want to be able to set
18:50that priority
18:51um so the model knows like oh this is
18:53the most up-to-date data source I should
18:56probably rely on that one
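A small sketch of the source-priority idea: priority stored as metadata at index time and blended into the ranking at query time. The vector_db interface, field names, and blend formula here are hypothetical:

```python
# Sketch of source-priority weighting: store a priority in each chunk's
# metadata at index time, then blend it into the similarity score at query
# time. The vector_db interface, field names, and blend formula are
# hypothetical.
SOURCE_PRIORITY = {"docs": 1.0, "revision": 1.2, "zendesk": 0.7, "youtube": 0.5}


def index_chunk(vector_db, text, source, **extra_metadata):
    vector_db.add(  # hypothetical add() call on your vector store
        text,
        metadata={
            "source": source,
            "priority": SOURCE_PRIORITY.get(source, 0.5),
            **extra_metadata,
        },
    )


def rerank_by_priority(hits, alpha=0.8):
    """hits: list of (chunk, similarity, metadata) tuples; highest blend first."""
    return sorted(
        hits,
        key=lambda h: alpha * h[1] + (1 - alpha) * h[2]["priority"],
        reverse=True,
    )
```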
18:58makes sense also we're running a little
19:00bit short on time so maybe let's speed
19:02through the the rest of the slides
19:03awesome I mean yeah I think it's pretty
19:05close to the end now here is just some
19:07some general evaluation uh things and I
19:09would love to hear from all you how you
19:12all evaluate but just this is mostly for
19:14the audience you know I think something
19:16that helped us a lot is just having an
19:18evaluation system for every step of the
19:19process you know so having a IR
19:22evaluation having a completion
19:23evaluation having a pre-processing one
19:26you know and then having one all
19:28together
19:29uh second one of course is just invest
19:31time in creating great evaluation data
19:33set you know you're your results are
19:35going to be as good as your evaluations
19:36so make sure that you have you went and
19:39do the work even if that takes like
19:41manual like manual work for like days
19:43you know having good evaluation data set
19:46makes a big difference uh the third one
is now is just using OpenAI Evals it's
great and also the LangChain auto-
19:52evaluator uh use the tools around you to
19:56help improve the model and improve your
19:58responses and reduce hallucination
20:01and I just put something here for
adversarial testing you know like having
20:05your data sets
20:08um questions that will
20:10um that will basically prompt the model
20:12to hallucinate or basically that you you
20:14know that there are no documents or
20:16they're like partial documents about
20:17that answer but you wanted to try to
20:20um either say I don't know or you want
20:22to try to let it be more creative
20:26um so I think these are just some
20:27general things General takeaways that we
20:29use and I think it's useful for everyone
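A tiny sketch of what an adversarial slice of such an evaluation set might look like, where the expected behavior is an explicit "I don't know"; the entries and checks are illustrative:

```python
# Sketch of an adversarial slice of an evaluation set: questions with no
# (or only partial) supporting documents, where the expected behaviour is
# an explicit "I don't know". Entries and the checks are illustrative.
eval_cases = [
    {"question": "How do I set the temperature?",
     "expected": "grounded",
     "reference": "temperature controls the randomness of the output"},
    {"question": "Do you offer on-prem deployments?",  # not covered by any doc
     "expected": "abstain"},
]


def run_eval(qa_chain, cases):
    passed = []
    for case in cases:
        answer = qa_chain(case["question"]).lower()
        if case["expected"] == "abstain":
            passed.append("don't know" in answer)  # should refuse, not invent
        else:
            # crude keyword check; in practice grade with an LLM or a human
            passed.append("randomness" in answer)
    return sum(passed) / len(passed)
```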
20:37all right all right we'll we'll surely
20:39get into a lot more around evaluations
20:40later with Daniel but for now let's
transition over to Vectara so
20:45you guys are out okay let me share my
20:48screen so first I I wanna also agree
20:51with the importance of evaluation and
20:53how that's very critical uh to do this
20:55uh correctly I want to highlight a key
20:58paper that my co-founder and our CTO at
21:03Victoria had published and worked on and
21:06led the research for a while at Google
21:08actually it's not this one is this one
21:10the eqa framework which is one of the
21:12key methods for evaluation for
21:15evaluating retrieval models for Q a
21:17systems and so I don't know uh
21:21if you folks are using it or not already
21:23but you should definitely take take a
21:25look at that
21:26so with that said let me jump into my
21:28presentation and I'll try to make it
21:30very quick here
so this is what the Vectara system
21:34looks like our goal is to simplify all
21:37of these steps these are the same steps
21:38I think in all of our systems we all
21:40have a very similar approach you extract
21:42the input data from different sources
21:44you encode that data into an embedding
21:46space that unifies the information based
21:49on meaning and though sometimes you can
21:51augment that with keywords to support
21:53the hybrid approach and then you have a
21:56vector database that does the retrieval
21:57to find the best meanings that match the
21:59query that the user is inputting you use
a cross-attentional re-ranker model to
22:04re-rank the outputs that's not necessary
22:06but sometimes that can improve the
22:07results significantly we do actually
22:09have a cross-attentional re-ranker in
22:11our Pipeline and then finally calibrate
22:13the results and feed them into a
22:15summarizer model to provide the final
22:17answer that the user sees so that's what
the pipeline looks like at Vectara our
22:22goal is to simplify this entire process
22:24so for our users they only see this kind
22:27of view where they have one API to
22:29upload the data and then another API to
issue the prompt and get the answers
22:33back from from that data
22:35what we focus on is the aspects of
22:38making this easier for developers
22:40meaning how do you balance the
22:42performance for on disk versus in memory
22:44and to balance the cost and to balance
22:46the retrieval times and the latency of
22:48the responses and keep that at the
22:50minimum how to reduce the hallucination
22:52effect that is the key topic of this
22:54discussion today and then how to do it
22:56in a way that is very real time meaning
22:58the updates can be coming in real time
23:00and the documents could be inserted
23:02deleted changed etc etc and then the
23:05effects of that represent themselves on
23:08the output right away one of the key
23:10benefits of our system which my my
23:12colleagues here might also have in their
23:14systems is that we are cross-lingual
23:16meaning that the input documents can be
23:18in many many languages French English
23:20Arabic German Chinese Korean and the
23:24questions the prompts can be in any
23:25language and that the best answer is
23:27extracted and provided back to the
23:28prompt in the language of the prompt
23:31that the user is asking it and the
23:33question ends so this kind of removes
23:34language as a barrier for finding
23:37information this is some longer term
23:39stuff that we want to be working on
23:41which is to add images audio and video
23:43not just extract the text out of them
23:45but to add an embedding recognition to
23:48what's in them for example if this is an
23:49image of a kubernetes cluster but the
23:52word of kubernetes is not there anywhere
23:53in the image with the proper embedding
23:56you can still retain that meaning and
23:58that's not a feature that we have yet
23:59but we plan to add that I'm curious if
24:01some of the other folks the Systems
24:03Support or not we can hear about later
24:06on in the in the conversation there's a
24:08longer term Vision here for all of us I
24:10think is to move from a Legacy Legacy is
24:13about search results right or search
24:15engines where you have a query string
24:17and then you get a back a list of
24:19results that I call that Legacy
24:21modern meaning current present time is
24:24about answer engines instead where you
24:26issue a question or a prompt and get
24:29back the answer for your question right
24:30away you don't have to click on the
24:31results to try and read and find out
24:33what the answer is the answer is is
24:35provided to you and if we do that
24:36correctly in some cases we'll be able to
24:38move to what's called action engines
24:40where that the answer to that problem
24:41I'm trying to solve for example my
24:43snowflake system is too slow the query
24:46is not running at the speed it needs to
24:47be and then it tells you the answer and
24:49then it tells you at the end would you
24:50like me to do that for you and that's
24:53what I refer to as action engine and
24:55that's I think the longer term mission
24:56that many of us are on uh we truly
25:00believe that in five years every single
25:02application whether that be SAS mobile
25:04e-commerce you name it would be powered
25:07by something like this I think all of us
in this room not just Vectara we all
25:10believe that that that's that's the
25:12mission that we're after is to enable
25:13applications to evolve from the Legacy
25:16Way for how we interact with them to a
25:18more futuristic way more verbally we can
25:21just Express what we want and what we
25:22need and it just happens and that's what
25:24we're working towards these are some of
25:26the use cases that we can predict and
25:28see today I think that's what many of us
25:29are working on here how to make customer
25:31support better knowledge Discovery and
25:33you can see in some of the other ones
25:34listed here here but I think this these
25:36are only the use cases that our feeble
25:38Minds can predict for what is happening
25:40right now I think there will be new use
25:42cases unlocked in the future that we
25:44can't even predict yet in the same way
25:46that when the iPhone came out nobody
25:47could have predicted Snapchat but
25:49Snapchat still happens right and I think
25:51there will be some things like that that
25:53will happen over time now uh this was
25:55already covered earlier but briefly
25:57there's two ways to have large language
25:59models speak about your data and provide
26:01answers to your data the first one is to
26:05fine-tune the model to take your
26:07customer data feed feed that customer
26:09data into a fine tuning engine to
26:12fine-tune the large language model on
26:14your own data and then get a new model
26:16that's now trained with your own data
26:18and then you issue the question to that
26:20model and you get back an answer this is
26:22actually not ideal to be doing it has
26:24many disadvantages one of the key ones
26:26is it will lead to a significant
26:28hallucinations where sometimes you can
26:30get outputs that have nothing to do with
26:32your data that were in the existing
26:34model itself depending on the
26:35temperature and how you set the
26:36temperature obviously another
26:38disadvantage is training training these
26:40models can take a very long time so that
26:42means new data coming in might not
26:43materialize on the output until hours if
26:46not Days Later until the fine tuning
26:48process is is finished and then last but
26:51not least that's very costly actually to
26:52do this kind of retraining over and over
26:54again compared to some of the other
26:56techniques that we are discussing today
26:58so the technique that many of us employ
27:00in this conversation today is what we
27:04refer to as grounded generation or
27:06sometimes you read the term retrieval
27:09augment generation I prefer the term
27:11grounded generation because I'm a gamer
27:13as you can see from my t-shirt and the
27:14ground regeneration is GG so what's a
27:17better acronym to go with than that and
27:19ground degeneration
27:23in a nutshell is about we're gonna still
27:25provide an alternative outputs but we're
27:26going to ground it in the facts so we're
27:28going to retrieve the facts first we're
27:29going to use a high-end information
27:31retrieval semantic model that's still
27:32using neural networks to find the best
27:34facts for that prompt and then we're
27:36going to summarize these facts with the
27:38large language models that's kind of a
27:39in a nutshell what another generation is
27:41about so I have a couple of pictures I
27:43want to show you here this one is just
27:45doing the aspect of training a model to
27:49be very good at finding facts meaning
27:52the information retrieval models so you
27:53have training data that's not the
27:55customer data that's training data that
27:57trains this retrieval model on how can
27:59it be very good at finding a fact the
28:01facts relevant to the prompt or the
28:03question that the user has and then once
28:06that model finishes training in training
28:08time we now use the customer data to
28:11index it using or encode it you can hear
28:13the team encoding or indexing or
28:15embeddings or vectors this means the
28:18process of running that customer data
28:19through the model to produce these
28:20embeddings that we then store in the
28:22vector database the customer question
28:24comes in to the convector database and
28:25then we produce a list of results that
28:27is Step number one step number two looks
28:30like this
28:31this next slide here and I'm sorry for
28:33the animations uh let me just forward to
28:35them so you get the final picture so
28:37essentially this is what what happens
28:38here is the question or the prompt from
28:40the user goes to the information
28:41retrieval model the customer data have
28:44already been encoded with that model
28:45into the vector database the vector
28:47database finds the closest proximity
28:49vectors using dot product or cosine
28:51product or other techniques you get back
28:53a list of results these are the facts
28:55the facts that addressing your your
28:56question the same question goes to the
28:58generative model and we tell it these
29:00are the facts for this question give us
29:02the best response that summarizes these
29:05facts into an answer for the end user
29:08and try to not go beyond the facts and
29:10that's the hard part is restricting the
29:12model to not go beyond the facts and
29:13that's kind of a key theme that we're
29:15all discussing today so that's very
29:17roughly it my last slide I want to show
you is that I asked ChatGPT about some
ways to reduce hallucinations and it gave
29:24this very very long list of ways to
29:26reduce hallucinations the key technique
29:28that we're all employing here I mean we
29:29have some of these other aspects in here
29:31but the key one we have is this number
29:33five technique right is that we are
29:35grounding it in the external knowledge
29:37about the question and underlying data
29:39set and trying to limit the response
29:41back to that but some of these other
29:43techniques are also very good and we're
29:45investigating some of them from a
29:46research perspective to try and deploy
29:48in the product going forward so I'm
29:49sorry if I went over my time but that's
uh the quick uh Vectara story
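A rough sketch of the grounded-generation flow Amr describes (encode the question, retrieve the closest facts by cosine similarity, summarize only those facts); the encode() and llm() callables are placeholders, not Vectara's actual API:

```python
# Rough sketch of grounded generation: encode the question, retrieve the
# closest facts by cosine similarity, then ask a generative model to
# summarize only those facts. encode() and llm() are placeholder callables.
import numpy as np


def cosine_top_k(query_vec, doc_vecs, k=5):
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return np.argsort(-sims)[:k]


def grounded_answer(question, encode, doc_vecs, doc_texts, llm, k=5):
    idx = cosine_top_k(encode(question), doc_vecs, k)
    facts = "\n".join(f"- {doc_texts[i]}" for i in idx)
    prompt = (
        "Here are the facts relevant to the question. Summarize them into an "
        "answer for the end user and do not go beyond these facts.\n"
        f"Facts:\n{facts}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)
```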
29:54no that's that's great I I have one
29:56quick question for you before we go to
29:58the other folks which is so when you're
30:00breaking it down in this like I mean
30:02we're talking we've talked a bunch about
30:04kind of like retrieval versus generation
30:05as well and so when you guys are
thinking about it at Vectara like where
30:09do you focus like how do you think about
30:11those different steps where do you think
30:12where do you focus most of your time
30:14what do you think is the harder problem
30:15where do you see most of the errors
30:17coming from
30:19excellent question excellent question so
30:21everybody in the industry right now is
30:24uh open source and lots of the activity
30:27happening of course from major companies
like Stability AI with StableLM
30:31uh of course the the more
30:34proprietary Technologies from open AI
30:36etc etc they're all focused on this on
30:38this problem right how do we make a very
30:40very good generative model so that's not
30:42what we're focusing our efforts and I
30:44think everybody in this room might be
30:46following that approach we're focused on
30:48this if we can get this to be really
30:49really good if you can get this
30:51information retrieval ml model that
30:53retrieves the facts that then we're
30:55grounding the generative Model N if you
30:57get this to be superior in its
30:59performance and do it in a hybrid way
31:01properly because sometimes when it's
31:02semantic long question the performance
31:04might not be as good when it's a keyword
31:06lookup or a Boolean kind of expression
31:07question you really want to mix The Best
31:09of Both Worlds if you do this really
31:11really well then you win so that's what
31:13we're very very focused on is how to
31:15make this the best it can be and then on
31:18this part we'll Leverage whatever we can
31:20get our hands on out there that does
31:21this a good job at that and there's many
31:22many it's become a commodity now where
31:24there's many good options for that I
hope hopefully that addresses your your
question
31:28yeah it does Super insightful here how
31:31do you think about it
31:33all right um I'm sure I see a lot more
31:36questions but we'll answer those in the
31:37QA part so uh Daniel do you want to take
31:40over
31:43thanks Harrison uh guys I will give you
31:46a a nice brief intro uh around the
31:49project I've been working on alongside
uh Lance Martin over the
past um month and a half two months it's
called Auto-evaluator and we've built a
hosted
um framework it's an open source project
but it's also hosted on LangChain thanks
Harrison for uh sponsoring it and we
enable developers to go from a demo
prototype into a more production
environment with confidence because one
thing I have noticed is that we all
have awesome cute demos on Twitter
around our QA you know ChatGPT for
32:20documents but I've seen a lack of tools
32:24uh how can we get into a production
32:28environment
32:29um with uh data behind it how can we
32:32come up to our leaders within our
32:34organizations and give them data that
our QA chatbot is going to be accurate
32:40um around maybe 85 percent of the time
32:42and if that's good enough we are good
32:44through the rigorous production because
32:46in a production use case you know think
32:48legal Insurance health care we don't
32:51want Health stations we don't want false
32:52positives we don't want to damage the
32:55brand and that is something that we care
a lot about at Shepherd which is an
insurtech in the commercial
construction space and you can obviously
assume like as an insurance company we
cannot afford to tell our users that
oh we do like nuclear insurance in our
chatbot so we have to build guard rails
33:12and use data to enable us to go to
33:14production I will give a quick
33:16um demo of the tool it is open source
contributions are welcome so let me
33:22jump in and walk you through
okay are you guys able to see my
tab the Auto-evaluator
33:33yes you see it now
33:36awesome awesome awesome so I'll walk you
33:39through a
33:40um demo environment over here so a high
33:44level what this tool does is it has
33:46three stages uh stage one is uh we will
33:50you will provide a document or a set of
33:53documents that you want to build your
33:55um q a system with
let's say here we have added a Karpathy
podcast from Lex Fridman
34:02and stage one is we're going to
34:05dynamically automatically uh generate a
synthetic test set for you so
34:11here
um we can see that this was generated by
the tool in stage one and we use an
LLM to do that so let's say we will put
this document in a in an index right it
will randomly pick from the index we'll
randomly pick um a vector and then we'll
try to come up with a question and
answer pair out of that vector so here we
have dynamically generated a test data
set from this document and the goal here
34:40is
34:41to enable a developer to automatically
generate their test sets instead of
34:45hiring a team or doing it themselves
34:49and we have seen this to be a quite
34:52accurate and quite effective it's not
34:54perfect it's not as amazing as you know
34:56a manual testing a manual uh human labor
35:00around this task but it's free uh it's
35:04automatic it takes 10 seconds to do it
35:06so uh that's stage one so stage two is
35:10given the parameters that you you will
35:11provide here in the left hand side let's
say a model you can use Anthropic or
GPT-3.5
35:17uh and uh chunk size chunk overlap all
35:20these different parameters that you uh
35:22we were all used to when building a QA
chain uh using the LangChain library
uh we will build in stage two a QA chain
35:30a question answer system for you on the
35:32Fly
and then step three we
grade it uh stage three will grade
the QA chain against the test data that
we have generated for you in stage one
and then we'll run metrics from it so let's
35:47take a look what happens here so here we
35:50have
um grades this is the output of stage
three so we have graded each entry
here and again we use an LLM to do that
36:00so kind of cool right we use AI to
36:02create AI uh so we use some clever-ish
prompt engineering uh to do that so we
36:09provide uh in the prompt we provide the
36:12question and answer here and then we
36:14provide the observed answer now we ask
an LLM to provide us um with a relevancy
36:19and how accurate is that in its own
36:22terms and then all the way at the end
36:24here you can see that we will um give
36:27you a summary of experiments so for this
36:29demo run we have uh three experiments
36:32that has a different retriever methods
36:36and different chunk sizes different
36:37overlap parameters as well and here we
36:41can see how those experiments have
36:44scored against each other so on the
36:47x-axis we have the uh kind of
36:50correctness score like how accurate
36:53um
uh those runs had been and then on the
36:57y-axis we have the latency I mean that
36:59was already two big things we care about
37:00right now latency and correctness so
37:03here we can see that experiment
37:06uh one
37:08this one had the highest latency but
37:12also had pretty good accuracy right but
37:15we want to see that Experiment three had
37:18the lowest latency and also had a high
37:20accuracy so here we can deduce that uh
based off data and evaluation uh we
37:27can say that for that set of documents
37:29that you have provided Us in this Tool
37:31uh these are the parameters the chunk
size 500 overlap of zero this split
method and this retriever
37:41are the best ones to use for that chain
37:43and of course you can keep going and
37:46maybe take a look at anthropic models so
um Lance Martin he released the Anthropic
Claude 100K model in Auto-evaluator a
day ago and he did awesome testing
around that we can see that the accuracy
is around the same uh however
um latency is quite higher there
38:06I will end around here and just give you
38:10one little thing here so a playground
38:12environment here you can upload your own
38:14text
38:15and if you follow the GitHub link you
38:17will see that
38:19um I don't know if you guys can see the
38:20GitHub link but yeah if you can just
38:22click on the GitHub
um icon here to see the repository you can
also find the repository by going to
the LangChain um yeah organization
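A minimal sketch of the stage-three grading loop described above, where an LLM judges each observed answer against the generated test set and latency is recorded per run; the grader prompt wording is illustrative, not Auto-evaluator's actual prompt:

```python
# Sketch of the stage-three grading loop: run the QA chain on each generated
# test case, time it, and have an LLM judge the observed answer against the
# reference answer.
import time

GRADER_PROMPT = (
    "You are grading a QA system.\n"
    "Question: {question}\nReference answer: {reference}\n"
    "Observed answer: {observed}\n"
    "Reply 'GRADE: CORRECT' or 'GRADE: INCORRECT' with a one-line justification."
)


def grade_experiment(qa_chain, test_set, grader_llm):
    correct, latencies = [], []
    for case in test_set:
        start = time.time()
        observed = qa_chain(case["question"])
        latencies.append(time.time() - start)
        verdict = grader_llm(GRADER_PROMPT.format(
            question=case["question"],
            reference=case["answer"],
            observed=observed,
        ))
        correct.append("GRADE: CORRECT" in verdict)
    # accuracy and mean latency: the two axes plotted in the demo
    return sum(correct) / len(correct), sum(latencies) / len(latencies)
```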
thank you guys let me know if you guys uh have
38:33any questions about that awesome thank
38:36you thank you you know I I put two links
38:37for those in the chat as well and um
reminder that after the the final
presentation by Mathis we'll be doing
38:44question and answers for the whole group
38:46I see a few that are that are really
highly upvoted but reminder to everyone
to go over and upvote their favorites or
38:52add new ones if you have some new ones
38:54um so with that being said over to the
38:56last and and uh saving the best for last
I'm sure so Mathis you're up
39:02um thanks Harrison can you can you see
39:03my screen
39:04yes
39:05yes all right great uh and really neat
39:09Tool uh Daniel that look great from a
39:12tech background and also Sleek UI I I
39:14love that
39:16um all right uh so I want to talk about
39:19um some strategies that we've been
39:21working on on detecting hallucinations
39:24using smaller fine-tuned models
39:28um so the setup is the same it's about
39:31hallucination and retrieval augmentation
39:33or document question answering and
39:37um we've tried we've been trying out a
39:40few things
39:41um generally here David said I said it
39:44before we work on an open source
39:46framework and a commercial product
39:48um you can check that out
39:50and what I'm going to do now is I'm
39:53going to skip the basics because we
already covered them
39:57um and then I'm gonna talk about two
39:59hallucination patterns that we see
40:03um I would talk about verifiability and
40:07attribution in recent research there's a
40:10few interesting papers that we've been
40:12looking at and then I'm going to share
40:15some of the results that we had training
40:19smaller Transformer models but also
40:21evaluate evaluating statistical methods
40:24of evaluation and also some other kind
40:28of fact chat checking models that we
40:30tried
40:32um so yeah let's I'm gonna Skip and I'm
40:36gonna go right to the patterns and
40:40hallucination patterns in retrieval
40:42augmentation with large language models
40:45the first one we call it not paying
40:48attention pun intended so this one here
40:52you have a question uh on the left side
40:55uh what types of aircrafts operate
40:58between Guernsey and Southampton to
41:00cities in England and then you have a
41:04gpt4 generated answer below that and
41:08then you have some retrieved documents
41:09and if you look at that at first sight
41:12it kind of looks alright so there's two
41:14types of aircraft
and these are Flybe's Embraer jet
41:19aircraft and an ATR 72. but if you take
41:23a closer look then you can see that
41:25actually in the retrieved documents
41:27um the flyby jet operates between
41:30Guernsey and London Gatwick which is not
41:33Southampton so the model just didn't
41:36pick up on that and we see that pretty
41:38often even with a strong model like gpt4
41:42that it doesn't pick up on small
41:44contextual things in the data it doesn't
41:47make up something completely but it's
41:49pretty frequent and then
41:52the other thing I think many of you of
41:55you know that numbers uh so here I'm
41:58asking or the question from from a data
42:00set actually is how long is Krishna
42:02River and here we have a generated
42:05answer from you.com so the generative
42:08search engine and it says that the river
42:12is
42:141288 kilometers long and actually if you
42:18look at the supporting document that was
42:20cited as a source even
42:22um you can see okay that's incorrect and
42:25we've seen that pretty often like that
42:27it just
42:29besides the wrong numbers for whatever
42:31reason I think
42:33um in general they've seen way less
42:35numbers than text so pretty
42:38understandable but can be tricky
42:43and then we thought about okay how can
42:47we now detect these hallucinations
42:50reliably because our customers kind of
42:53want that the models rely on their data
42:55and that they don't make things up
42:59um so we looked at a little bit of
43:01research one paper I want to share here
43:04comes from researchers at Stanford
43:07evaluating verifiability and generative
43:10search engines pretty interesting they
43:12looked at four search engines Bing chat
Perplexity You.com and Neeva and they
43:19checked if a statement is supported and
43:22then if that cited reference actually
43:24with a citation and then if that cited
43:27reference actually supports the
43:29statement and they found that only like
43:31half of the statements had a citation
43:33and then out of those only about 75
43:37percent
43:39actually supported the reference
43:41actually supported the statement so you
43:43get an error rate of about 25 for the
43:46state statements that are supported
43:49which isn't great in some contexts might
43:52be enough for others but some context is
43:56it isn't great and then this
44:00yeah the question
44:04all right
44:06um so and then the other one automatic
44:09evaluation of attribution by llms
44:12um comes from researchers at Ohio State
44:15and they've basically developed an
44:18attribution score where they judged uh
44:22cited references as uh attributable uh
44:27kind of
44:29expertory so that it extrapolates it
44:32says something that is not in the source
44:34document and then contradictory so that
44:37it even
44:38contradicts the document and then they
44:41fine-tune the range of models on the
44:43task they had some data sets that they
44:45repurposed and had a small evaluation
set and they found FLAN-T5 to outperform
both the fine-tuned Alpaca and also a
few-shot ChatGPT so even that was
surprising a little bit because even the
much bigger Alpaca model which was
fine-tuned on the data
underperformed here and of course Chat
GPT not being fine-tuned on the task that
45:09kind of seems reasonable
And then we've been going down the same route. We looked at what existing methods there are to evaluate this and whether we can train a model. We took that dataset from the first research paper and created an evaluation set and a training set from it. The task for the model is basically: it gets a statement, for example about when the Pittsburgh Steelers won their first championship, and it gets the evidence, which comes from the source document. You feed that to the model as statement, separator token, evidence, and it produces a score. As you can see here, the score is 0.74; that's quite good, we still want to work on it a little, but it indicates that this is not a hallucination.
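As a rough illustration of this statement-separator-evidence setup, the sketch below scores a pair with a Hugging Face sequence-classification model; the checkpoint name is a placeholder, and mapping the logit through a sigmoid assumes the head was trained to output a 0-to-1 support score.

```python
# Sketch: scoring a (statement, evidence) pair with a cross-encoder-style classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "my-org/hallucination-scorer"  # placeholder, not a real checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

statement = "The Pittsburgh Steelers won their first championship in 1975."
evidence = "The Steelers won their first Super Bowl, Super Bowl IX, in January 1975."

# Passing two texts makes the tokenizer build: [CLS] statement [SEP] evidence [SEP]
inputs = tokenizer(statement, evidence, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = torch.sigmoid(model(**inputs).logits).item()

print(f"support score: {score:.2f}")  # a high score (e.g. 0.74) would indicate no hallucination
```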
We also tested this on our test set, and what we found is that statistical methods that come from a machine-translation evaluation background, like BLEU or ROUGE, underperform here a little bit; they have a low correlation with the actual scores. We also looked at BERTScore, which is more about semantic similarity: also not that great. We compared UniEval-fact, a T5-large model trained, I think, last year by a research group: also not great. The DeBERTa models are very small, and the DeBERTa-based model that we fine-tuned has quite good correlation, but we also looked at the results manually and there are definitely still a few things to fix there.
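For anyone who wants to reproduce this kind of comparison, here is a minimal sketch that checks how well one automatic metric tracks human support labels; the texts and labels are made-up placeholders, and it assumes the Hugging Face `evaluate` package plus SciPy.

```python
# Sketch: correlating an automatic metric with human hallucination labels.
from scipy.stats import spearmanr
import evaluate

# Toy data: 1.0 = supported by the evidence, 0.0 = hallucinated (placeholder values).
human_labels = [1.0, 0.0, 1.0, 0.0]
predictions = ["the river is 1400 km long", "the jet flies to Southampton",
               "the team won in 1975", "the paper was published in 2010"]
references = ["the river flows for about 1400 km", "the jet operates to London Gatwick",
              "their first title came in 1975", "the paper appeared in 2021"]

rouge = evaluate.load("rouge")
rouge_l = rouge.compute(predictions=predictions, references=references,
                        use_aggregator=False)["rougeL"]  # per-example scores

corr, _ = spearmanr(rouge_l, human_labels)
print("ROUGE-L vs human labels:", corr)
```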
And then, remaining challenges: the dataset itself, the search-verifiability dataset, has quality issues. We found using our model that oftentimes, when the model was pretty sure that something is a hallucination, it actually was a hallucination and the annotator had made a mistake. So we want to relabel a little bit there. We also still want to get better performance, and we want to tweak the scores to reflect a bit better how a human would perceive that hallucination. And then the question is, as always, how well this approach will generalize across datasets and domains, and maybe also how we would deal with very large context sizes. Yeah, and that's it.
Awesome, thank you for the presentation. We've got 10 minutes for some questions. Continuing off of the last points you were making, Mathis, about evaluation, the number one question is basically: can you provide examples of good IR evaluation models and practices? There are two components here: the information retrieval step, and then the generation step. Mathis, if I'm understanding correctly, you're evaluating after the answer has been generated, asking whether it's hallucinating or not based on the retrieved documents, and I think that's also what we've talked about most in the general context. So maybe switching to the other side, for the IR step in particular: we've heard retrieval is really important, and you guys are spending a lot of time there; Victoria, you're doing some IR research stuff; Nick, how do you evaluate the IR step by itself? Do one of you maybe want to take that?
I think, yeah, so go ahead.
I think for us, basically, we also evaluate it separately, and for the retrieval part we put more weight on metrics like mean reciprocal rank or mean average precision, because recall matters less: you're probably only going to feed three, four, five documents into GPT, so you really want precision at the top and you don't care that much about the rest of the results. That's what we usually do.
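Since precision-oriented metrics keep coming up, here is a minimal, library-free sketch of reciprocal rank and precision@k over ranked document IDs; the IDs and relevance labels are illustrative.

```python
# Sketch: top-heavy retrieval metrics for a single query.
def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved documents that are labelled relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

# Only the first few documents reach the prompt, so the top of the ranking is what matters.
print(reciprocal_rank(["d7", "d2", "d9"], {"d2"}))      # 0.5
print(precision_at_k(["d7", "d2", "d9"], {"d2"}, k=3))  # 0.333...
```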
Yeah, I was just going to say that we focus a lot, as Omar mentioned earlier, on the IR component; that's really core to what we do, and one of the co-founders has spent a lot of his career on that. So there are standard metrics, and we just optimize the hell out of this engine to make it the best it can be. We use a lot of the standard metrics you see in academia to optimize the engine in general, not just for a particular setting, whether we use three facts or five facts or whatever, just to make it the best it can be.
Nick, Daniel, anything to add there?
Yeah, I know some people are asking how you structure the dataset and so on. Something we do for the IR eval part is basically: we have the question, we also have the correct answer, and then we have a list of IDs that correspond to documents in our database. That list of IDs basically says the answer is contained in these documents. We also have a list of references, so if inside one of those documents there is a quote that would answer the question, you list it as a reference, and you can have that in your dataset. That way you know exactly which documents contain the answer and which parts of the text can serve as references, and then you can correlate and evaluate using F1 and so on.
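For reference, here is a minimal sketch of what one record in such an IR eval set could look like, together with a simple hit@k check and the F1 combination mentioned here and in the next answer; the field names, IDs and the ACME example are made up for illustration.

```python
# Sketch: one labelled record for retrieval evaluation (schema is illustrative).
eval_record = {
    "question": "When was ACME Corp founded?",     # hypothetical question
    "answer": "ACME Corp was founded in 1999.",     # gold answer written by an annotator
    "relevant_doc_ids": ["doc_0042", "doc_0317"],   # documents that contain the answer
    "references": [                                 # supporting quotes inside those documents
        {"doc_id": "doc_0042", "quote": "ACME Corp, founded in 1999, ..."},
    ],
}

def hit_at_k(retrieved_ids, record, k=5):
    """Did any document labelled as relevant show up in the top-k results?"""
    return any(doc_id in record["relevant_doc_ids"] for doc_id in retrieved_ids[:k])

def f1(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```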
Yeah, I want to stress F1. F1 is definitely one of the key metrics: in a lot of our customer engagements, when we're doing evaluation compared to whatever legacy systems they might have, the F1 score is one of the key metrics we measure to show how much better these new techniques are. As many of you know, the F1 score blends both the precision and the recall of the information retrieval model, and that's really what our customers want to see: a significant lift in the F1 score compared to whatever they had before.
Harrison, I would just add one point there: when you want to test either the output of the LLM for hallucinations or the retrieval quality, think deeply about the process. Make sure you are storing those intermediate results, either in a vector DB or maybe in Postgres, so that you're later able to perform the testing, either manually yourself or by exporting it to something like Scale AI or Mechanical Turk and hiring someone else to do it for you. Because again, without that data you can't test it.
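A minimal sketch of that kind of logging, assuming a plain Postgres table accessed via psycopg; the table name and columns are illustrative (text columns, with the retrieved chunks stored as a JSON string), and a vector DB's metadata fields could play the same role.

```python
# Sketch: persisting intermediate RAG results (query, retrieved chunks, answer) for later eval.
import json
import psycopg

def log_rag_run(conn, query: str, retrieved_chunks: list[dict], answer: str) -> None:
    """Store one end-to-end run so it can be labelled or re-scored later."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO rag_runs (query, retrieved_chunks, answer) VALUES (%s, %s, %s)",
            (query, json.dumps(retrieved_chunks), answer),
        )
    conn.commit()

# Usage (connection details depend on your setup):
# with psycopg.connect("dbname=eval") as conn:
#     log_rag_run(conn, "example question", [{"id": "doc_1", "text": "..."}], "generated answer")
```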
That makes sense. And actually there's another question a bit further down, which is: what do these IR datasets look like? Is it as simple as a question and one document that has the right answer? What if there are multiple documents that have the same answer, what does the labeling look like for that? What if there are questions that need pieces from multiple different documents, how do you judge the IR retrieval step for those? I guess, what do these actual datasets look like in practice?
I would again double down on my earlier point about recording the data that you're producing: record everything. With Pinecone, for example, or any other vector DB, you have metadata available, so if you're using, say, a priority or a ranking in the retrieval step, you can record that too. Really enable yourself to record all the data points. At a high level, that means the top-k results that you're querying against, and also record k itself if you're setting it dynamically.
And Harrison, yes, that's why for the datasets there is a human that helps calibrate, labeling the datasets: what's a good answer, what's another good answer. That's how we calibrate, because it's very complex, as you correctly identified: sometimes the same answer can exist in multiple documents, so which is the best one? For the purpose of doing the measurements and calibration, that labeling is what we rely on, so labeling is still a very, very important task. The good news is there are a number of very good datasets out there, from Reddit, from Amazon, the SQuAD dataset from Stanford, a Wikipedia-based dataset; there are a bunch of really good ones people can use. But if you want to do it for your own data internally at your company, then one of the first exercises we would do with you is to create a labeled dataset that a human editor helped define, so we can do the calibration against it.
Yeah, exactly. And I would also reinforce that you probably want to create a dataset on data you're an expert in, so you can really evaluate manually and know whether the right answer is there. That saves you some time, because there are so many datasets out there, but it's a bit hard to know whether a given answer is actually correct, especially for the completion evals. So yeah, having something you know well is important.
Awesome. There's one question down at the end which I think is a little bit in the weeds, but I think it's kind of interesting, and it gets at prioritization and importance for documents: do you dynamically prioritize different sources based on the input query? I'm guessing the prioritization is probably fixed, but expanding on why I think it's interesting: a lot of times when people think of a query, they think of semantic search or hybrid search where you're looking things up, but there are also other metadata attributes the query could be asking about; it could be asking about an author, for example. So this gets at parsing the query out into different things, whether that's a dynamically prioritized source, as the question points out, or just some type of metadata filter. How do you guys think about that? Very open-ended, and I know we have one minute left, so this is a bad question to ask, but any quick thoughts?
My thought is: how do humans do it, right? So again, we're trying to rely on the large language model to be smart enough, or really the large language model, the retrieval model and the re-ranking model in the middle together, to figure it out even without the additional metadata. We want it to be that good: it looks at the question, it looks at the response, and it knows this is really a good response because it understands the domain and the language. That said, you're right: if you can nudge it here or there, that helps a lot. So we do have custom attributes you can add as metadata that can be further used in the ranking functions and in the cross-attention during re-ranking, to boost certain sources, to boost recency if one document is more recent than another, or to boost social-graph connections, meaning if the CEO posted this then maybe it's more accurate than if it came from somebody else. So you can add these additional elements to try to improve the scoring, but at the end of the day the nirvana is to have the system figure out, based on the information content alone, that this is the best response.
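As an illustration of the kind of nudging described here, below is a small, vendor-neutral sketch that combines a retriever's relevance score with recency and source boosts; the weights, the half-life and the field names are made-up assumptions, not any particular product's ranking function.

```python
# Sketch: re-scoring retrieved documents with metadata boosts (recency, source priority).
import math
import time

def boosted_score(base_score, metadata, recency_weight=0.1, source_boosts=None):
    """Combine the retriever's relevance score with recency and source priors."""
    source_boosts = source_boosts or {}
    age_days = (time.time() - metadata["published_ts"]) / 86400
    recency = math.exp(-age_days * math.log(2) / 180)  # ~180-day half-life, purely illustrative
    source = source_boosts.get(metadata.get("source"), 0.0)
    return base_score + recency_weight * recency + source

docs = [
    {"id": "a", "score": 0.82, "metadata": {"published_ts": time.time() - 30 * 86400, "source": "ceo_blog"}},
    {"id": "b", "score": 0.85, "metadata": {"published_ts": time.time() - 900 * 86400, "source": "forum"}},
]
ranked = sorted(
    docs,
    key=lambda d: boosted_score(d["score"], d["metadata"], source_boosts={"ceo_blog": 0.05}),
    reverse=True,
)
print([d["id"] for d in ranked])  # the fresher, boosted source can outrank a slightly higher raw score
```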
Awesome. And then the very last question: what's the husky's name from earlier?
Thank you for the excellent question. She's a Pomsky, actually, which is a mix between a Pomeranian and a Husky, so that keeps her small in size, and her name is Luna, which means the moon.
Okay, and an embedding, a vector mapping, would tell you that right away: Luna, L-U-N-A. Yeah, right, we're going to end it there.
Thank you guys for joining and sharing all your wisdom. I definitely learned a lot, and I think there are a lot of really interesting papers to read and things to look into. And thank you everyone for tuning in and for all the excellent questions; hope you enjoyed it.
Yeah, thank you for all you do. You're amazing in terms of how you're really building out this community and doing it in a very collaborative way; you just bring people together. We're all trying to make the world better with this technology and we really love what you're doing, so please keep doing that.
We're all trying to figure it out.