00:14 hey everyone, my name is Jerry,
00:16 co-founder and CEO of LlamaIndex, and today
00:18 we'll be talking about how to build
00:19 production-ready RAG applications. I
00:22 think there's still time for the raffle
00:23 for the bucket hat, so if you stop
00:24 by our booth, please fill out the
00:26 Google form. Okay, let's get started. So
00:29 as everybody knows, there's been a
00:31 ton of amazing use cases in GenAI recently:
00:33 knowledge search and QA,
00:36 conversational agents, workflow
00:37 automation, document processing. These are
00:40 all things that you can build,
00:41 especially using the reasoning
00:43 capabilities of LLMs over your
00:46 data. So if we just do a quick refresher
00:49 in terms of paradigms for how you
00:51 actually get language models to
00:53 understand data they haven't been trained
00:54 over, there are really two main
00:56 paradigms. One is retrieval augmentation,
00:59 where you fix the model and
01:01 basically create a data pipeline to put
01:03 context from some data
01:05 source into the input prompt of the
01:08 language model, so like a vector
01:10 database, unstructured data,
01:13 etc. The next paradigm here is
01:16 fine-tuning: how can we bake knowledge
01:18 into the weights of the network by
01:20 actually updating the weights of the
01:22 model itself, or some adapter on top of the
01:24 model, but basically some sort of
01:25 training process over some new data to
01:28 actually incorporate knowledge. We'll
01:30 probably talk a little bit more about
01:31 retrieval augmentation, but this is just
01:33 to help you get started and
01:35 really understand the mission
01:36 statement of the
01:38 company. Okay, let's talk about RAG,
01:41 retrieval-augmented generation. It's
01:44 become kind of a buzzword recently, but
01:47 we'll first walk through the current RAG
01:49 stack for building a QA system. This
01:51 really consists of two main components:
01:53 data ingestion, as well as data
01:55 querying, which contains retrieval and
01:56 synthesis. If you're just getting
01:58 started with LlamaIndex, you can basically
02:00 do this in around five-ish lines of
02:02 code, so you don't really need to
02:03 think about it. But if you do want to
02:05 learn some of the lower-level components,
02:07 and I do encourage every
02:09 AI engineer to
02:10 learn how these components work under
02:12 the hood, I would encourage you to
02:14 check out some of our docs to really
02:15 understand how you actually do data
02:17 ingestion and data querying: how do you
02:19 actually retrieve from a vector database,
02:21 and how do you synthesize that with an
02:23 LLM.
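As a concrete illustration, here's a minimal sketch of that five-ish-line quickstart, assuming the llama_index package roughly as it existed at the time of this talk (import paths have shifted across versions) with an OPENAI_API_KEY set; the "data" folder and the question are illustrative placeholders:

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()  # data ingestion
index = VectorStoreIndex.from_documents(documents)     # chunk, embed, store
query_engine = index.as_query_engine()                 # retrieval + synthesis
print(query_engine.query("What are the key takeaways of this document?"))
```

Each line maps to a stage of the stack: loading is ingestion, building the index handles chunking and embedding into a vector store, and the query engine wraps both retrieval and LLM synthesis.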
02:27 So that's basically the key stack that's emerging these days:
02:29 for every sort of chatbot, like
02:31 chat over your PDF, over your
02:33 unstructured data, a lot of these
02:35 things are basically using the same
02:37 principles of how you actually
02:39 load data from some data source and
02:41 retrieve and query
02:43 over it. But I think as developers are
02:47 actually building these applications,
02:48 they're realizing that this isn't quite
02:50 enough; there are certain
02:53 issues that you run into that are
02:54 blockers for actually being able to
02:56 productionize these applications. And so
02:58 what are these challenges with naive RAG?
03:01 One aspect here, and
03:04 this is the key thing
03:05 that we're focused on, is that the
03:06 response quality is not very good. You
03:08 run into, for instance, bad retrieval
03:10 issues: during the retrieval
03:12 stage from your vector database, if
03:14 you're not actually returning the
03:15 relevant chunks from your vector
03:17 database, you're not going to be able to
03:18 have the correct context actually put
03:20 into the LLM. This includes certain
03:22 issues like low precision: not all chunks
03:24 in the retrieved set are relevant. This
03:27 leads to hallucination and lost-in-
03:28 the-middle problems; you have a lot of
03:30 fluff in the returned response. It could also
03:32 mean low recall: your top-k isn't
03:34 high enough, or basically the
03:35 set of information that you
03:37 need to actually answer the question is
03:38 just not there. And of course there are
03:41 other issues too, like outdated
03:43 information. And many of you who are
03:44 building apps these days might be
03:46 familiar with some key reasons
03:48 why the LLM isn't always
03:50 guaranteed to give you a correct
03:52 answer: there's hallucination, irrelevance,
03:54 toxicity, bias; there are a lot of
03:56 issues on the LLM side as
03:58 well. So what can we
04:02 actually do to try to improve the
04:04 performance of a retrieval-augmented
04:06 generation application? For
04:08 many of you, you might be running
04:09 into certain issues, and it really runs
04:12 the gamut across the entire
04:14 pipeline. There's stuff you can do on the
04:16 data side: can we store additional
04:17 information beyond just the raw
04:19 text chunks that you're
04:21 putting in the vector database? Can you
04:22 optimize that data pipeline somehow, play
04:24 around with chunk sizes, that type of
04:25 thing? Can you optimize the embedding
04:27 representation itself? A lot of times
04:29 when you're using a pre-trained embedding
04:31 model, it's not really optimal for giving
04:33 you the best performance. There's the
04:35 retrieval algorithm: the default
04:37 thing you do is just look up the top-k
04:39 most similar elements from your vector
04:41 database to return to the LLM. Many
04:44 times that's not enough, and what are
04:45 both simple things you can
04:46 do as well as hard things? And there's
04:48 also synthesis: can
04:53 we use LLMs for more than generation?
04:55 Basically, you can use the
04:57 LLM to actually help you with
04:59 reasoning, as opposed to just
05:05 pure generation. You can actually
05:06 use it to reason: given a
05:08 question, can you break it down into
05:10 simpler questions, route to different
05:12 data sources, and have a
05:15 more sophisticated way of querying
05:18 data? Of course, if you've
05:21 been around some of my recent talks,
05:22 I always say: before you actually try any
05:24 of these techniques, you need to be
05:26 pretty task-specific and make sure
05:27 that you actually have
05:29 a way to measure
05:30 performance. So I'll probably spend
05:33 two minutes talking about evaluation.
05:35 Simon, my co-founder, just ran a workshop
05:36 yesterday on how to
05:38 build a data set, evaluate
05:40 RAG systems, and iterate on them.
05:42 If you missed the workshop, don't worry,
05:44 we'll have the slides and
05:45 materials available online so that
05:47 you can take a look. At a very high
05:50 level, evaluation is
05:52 important because you basically need to
05:53 define a benchmark for your system to
05:55 understand how you're going to iterate
05:57 on and improve it. And there are a
05:59 few different ways you can try to do
06:00 evaluation. I think Anton from
06:02 Chroma was just saying some of this,
06:04 but you basically need a way to
06:07 evaluate both the end-to-end solution,
06:09 where you have your input query as well
06:11 as the output response, and you also want to
06:13 be able to evaluate
06:14 specific components: if you've
06:16 diagnosed that the retrieval is
06:18 the portion that needs improving, you
06:19 need retrieval metrics to really
06:21 understand how you can improve your
06:23 retrieval system. So there's retrieval
06:27 and there's synthesis; let's spend
06:29 about 30 seconds on each one.
06:31 For evaluation of retrieval, what does this
06:33 look like? You basically want to make
06:35 sure that the stuff that's returned
06:37 actually answers the query, that
06:39 you're not returning a
06:41 bunch of fluff, and that the stuff
06:42 you return is relevant to the
06:44 question. So first you need an
06:46 evaluation data set. A lot of people
06:48 have human-labeled data sets; if
06:50 you're building stuff in prod, you
06:52 might have user feedback as well; if
06:54 not, you can synthetically generate a
06:56 data set. This data set has as input a
06:58 query, and as output the IDs of the
07:01 returned documents that are relevant to the
07:02 query. So you need that somehow. Once you
07:05 have that, you can measure stuff with
07:06 ranking metrics: you can measure
07:08 things like success rate, hit rate, MRR, NDCG,
07:12 a variety of these things. And
07:14 once you're able to evaluate this,
07:16 note this really isn't
07:18 an LLM problem, this is an IR
07:20 problem, and it has been around for at
07:21 least a decade or two. But a lot
07:24 of this
07:26 is still very relevant in the face of
07:27 actually building these LLM apps.
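To make those ranking metrics concrete, here's a toy sketch of hit rate and MRR in plain Python; the (retrieved IDs, relevant IDs) data format is an assumed illustration, not any particular framework's schema:

```python
def hit_rate(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1.0 if any relevant document shows up in the retrieved top-k, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in retrieved_ids) else 0.0

def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1/rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Averaging over a (retrieved, relevant) eval set gives hit rate and MRR.
eval_set = [(["d3", "d1", "d7"], {"d1"}), (["d2", "d5", "d9"], {"d4"})]
print(sum(hit_rate(r, g) for r, g in eval_set) / len(eval_set))         # 0.5
print(sum(reciprocal_rank(r, g) for r, g in eval_set) / len(eval_set))  # 0.25
```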
07:31 The next piece here: there's a
07:33 retrieval portion, but then you
07:34 generate a response from it, so how
07:36 do you actually evaluate the whole thing
07:37 end-to-end? For evaluation of the final
07:40 response given the input, you still
07:42 want to generate some sort of data set.
07:44 You could do that through human
07:45 annotations or user feedback; you could have
07:47 ground-truth reference answers
07:49 given the query that really indicate,
07:50 hey, this is the proper answer to
07:52 this question. And you can also
07:54 synthetically generate it
07:56 with something like GPT-4. You run this through
07:58 the full RAG pipeline that you built,
08:00 the retrieval and synthesis, and you can
08:02 run LLM-based evals, both label-free
08:04 evals and with-label evals. There are a lot
08:07 of projects these days about
08:09 how to properly evaluate
08:11 the predicted outputs of a
08:15 model.
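As one example of a label-free, LLM-based eval, here's a hedged sketch using LlamaIndex's faithfulness evaluator, which asks an LLM whether the response is supported by the retrieved context; class names and constructor arguments follow the docs of roughly this era and have changed across versions:

```python
from llama_index import ServiceContext
from llama_index.evaluation import FaithfulnessEvaluator
from llama_index.llms import OpenAI

# Use a strong model (e.g. GPT-4) as the judge.
eval_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"))
evaluator = FaithfulnessEvaluator(service_context=eval_context)

# `response` is assumed to be the object returned by query_engine.query(...),
# which carries both the answer text and the source nodes it was built from.
result = evaluator.evaluate_response(response=response)
print(result.passing)
```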
08:17 Once you've defined your eval benchmark, now you want to think about
08:19 how to actually optimize your RAG
08:20 system. I showed a teaser of this slide
08:23 yesterday, but the way I
08:26 think about this is: when do you
08:28 want to actually improve your system?
08:30 There are a million things you
08:31 can do to try to improve your
08:33 RAG system, and you probably
08:35 don't want to start with the hard stuff
08:37 first, just because part
08:38 of the value of language models is how
08:40 they've democratized access for
08:42 every developer; it's really just made it
08:43 easy for people to get up and running.
08:45 And so if, for instance, you're running
08:47 into some performance issues with RAG,
08:49 I'd probably start with the basics,
08:50 what I call table-stakes RAG
08:52 techniques: better parsing, so
08:54 that you don't just split into even chunks;
08:56 adjusting your chunk sizes; trying out
08:58 stuff that's already integrated with a
08:59 vector database, like hybrid search as
09:01 well as metadata
09:03 filters. There are also advanced
09:05 retrieval methods that you could try;
09:07 this is a little bit more advanced,
09:08 some of it pulls from traditional
09:10 IR, and some of it is
09:12 new to this age of
09:15 LLM-based apps. There's
09:17 reranking, which is a traditional
09:19 concept; there are also concepts in
09:20 LlamaIndex like recursive retrieval,
09:22 dealing with embedded tables,
09:24 small-to-big retrieval, and a lot of
09:26 other stuff that we have to help you
09:27 potentially improve the performance of
09:29 your application. And then the last
09:31 bit gets into more
09:33 expressive stuff that might be harder to
09:35 implement, might incur higher latency and
09:36 cost, but is potentially more powerful
09:38 and forward-looking: agents.
09:40 How do you incorporate agents into
09:42 better RAG pipelines, to better
09:45 answer different types of questions and
09:46 synthesize
09:50 information? Let's talk a little bit about the
09:52 table stakes first. So, chunk sizes: tuning
09:55 your chunk size can have outsize impacts
09:57 on performance. If you've
09:59 played around with RAG systems,
10:01 this may or may not be obvious to you.
10:03 What's interesting, though, is that
10:05 more retrieved tokens does not always
10:06 equate to higher performance, and if
10:09 you do reranking of your retrieved
10:10 tokens, it doesn't necessarily mean that
10:12 your final generated response is going
10:14 to be better. This is again due to
10:16 stuff like lost-in-the-middle problems,
10:17 where stuff in the middle of the LLM
10:19 context window tends to get lost, while
10:20 stuff at the end tends to be a little
10:22 bit better remembered by the LLM.
10:26 We did a workshop with
10:27 Arize a few weeks ago where
10:30 basically we showed there is
10:32 an optimal chunk size given
10:33 your data set, and a lot of times, when
10:35 you try out stuff like reranking, it
10:36 can actually increase your error metrics.
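A minimal sketch of what tuning chunk size looks like, assuming the ServiceContext API from the LlamaIndex version of this era; the candidate sizes are illustrative, and the "right" one is whatever scores best on your eval benchmark:

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()
for chunk_size in (128, 256, 512, 1024):
    # chunk_size controls how the node parser splits documents before embedding
    ctx = ServiceContext.from_defaults(chunk_size=chunk_size)
    index = VectorStoreIndex.from_documents(documents, service_context=ctx)
    query_engine = index.as_query_engine()
    # ... run your retrieval / response eval benchmark here and compare ...
```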
10:39 Metadata filtering: this is
10:42 another very table-stakes thing that
10:44 I think everybody should look into, and I
10:46 think vector databases,
10:48 Chroma, Pinecone, Weaviate, these vector
10:50 databases are all implementing these
10:52 capabilities under the hood. Metadata
10:54 filtering is basically: how can
10:56 you add structured context to
10:59 your chunks, your text chunks?
11:01 You can use this for both
11:02 embeddings as well as synthesis, and it
11:04 also integrates with the
11:06 metadata filter capabilities of a vector
11:08 database. So metadata is just a
11:10 structured JSON dictionary: it
11:12 could be the page number, it could be
11:13 the document title, it could be the
11:15 summary of adjacent chunks. You can get
11:16 creative with it too; you could
11:17 hallucinate questions that the
11:19 chunk answers. It can help
11:21 retrieval, it can help augment your
11:22 response quality, and it also integrates with
11:24 the vector database
11:26 filters. So as an example, let's say the
11:29 question is over an SEC 10-Q
11:32 document: can you tell me the
11:34 risk factors in 2021? If you just do raw
11:37 semantic search, typically it's very low-
11:38 precision: you're going to return a bunch
11:40 of stuff that may or may not match this,
11:42 and you might even return stuff from
11:43 other years if you have a bunch of
11:45 documents from different years in the
11:46 same vector collection. So
11:48 you're kind of rolling the dice a
11:52 bit. But one idea here is: if
11:55 you have access to the metadata
11:58 of the documents and you ask a
12:00 question like this, you basically combine
12:01 structured query capabilities, by
12:03 inferring metadata filters like a
12:05 WHERE clause in a SQL statement, like
12:06 year = 2021, and you combine that
12:08 with semantic search to return the most
12:10 relevant candidates given your query, and
12:12 this improves the precision of your results.
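Here's a hedged sketch of that combination, assuming LlamaIndex's vector store filter types of this era and an `index` whose chunks carry a "year" metadata key (both assumptions for illustration); in practice the filter can also be inferred from the query by an LLM:

```python
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

retriever = index.as_retriever(
    similarity_top_k=2,
    # the structured part: equivalent to "WHERE year = 2021" in SQL,
    # applied alongside the semantic search in the vector database
    filters=MetadataFilters(filters=[ExactMatchFilter(key="year", value=2021)]),
)
nodes = retriever.retrieve("Can you tell me the risk factors in 2021?")
```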
12:17 Moving on to stuff that's maybe
12:20 a bit more advanced: one advanced
12:21 retrieval technique that we found
12:23 generally helps is this idea of
12:25 small-to-big retrieval. So what does
12:27 that mean? Basically, right now, when you
12:29 embed a big text chunk, you also
12:31 synthesize over that text chunk, and
12:33 that's a little suboptimal, because what
12:35 if the embedding representation is
12:37 biased, because there's a bunch
12:39 of fluff in that text chunk
12:40 alongside the relevant information?
12:42 You're not actually optimizing your
12:43 retrieval quality; embedding a big
12:46 text chunk sometimes feels a little
12:47 suboptimal. One thing you could do
12:49 is embed text at the sentence
12:51 level, or at a smaller level, and then
12:53 expand that window during synthesis time.
12:55 This is contained in a variety
12:57 of LlamaIndex abstractions, but the idea is
13:00 that you retrieve on more
13:02 granular pieces of information, so
13:04 smaller chunks. This makes it so that
13:06 these chunks are more likely to be
13:07 retrieved when you actually ask a query
13:09 over these specific pieces of context, but
13:11 then you want to make sure that the LLM
13:12 actually has access to more information
13:14 to actually synthesize a proper
13:16 result. So this leads to more
13:19 precise retrieval. We
13:22 tried this out, and it helps avoid
13:24 some lost-in-the-middle problems: you can
13:25 set a smaller top-k value, like k = 2,
13:28 whereas over this data set, if
13:30 you set k = 5 for naive
13:32 retrieval over big text chunks, you
13:34 basically start returning a lot of
13:36 context, and that leads to
13:38 issues where maybe the
13:40 relevant context is in the middle but
13:41 you're not able to find it, or
13:43 the LLM is not
13:45 able to synthesize over that information.
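A minimal sketch of one form of this, sentence-window retrieval, assuming LlamaIndex's node parser and postprocessor APIs of this era: each sentence is embedded on its own, and the surrounding window of sentences is swapped back in before synthesis:

```python
from llama_index import VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

# Split into single-sentence nodes, each storing a window of neighbors in metadata.
parser = SentenceWindowNodeParser.from_defaults(
    window_size=3, window_metadata_key="window"
)
nodes = parser.get_nodes_from_documents(documents)  # `documents` as loaded earlier
index = VectorStoreIndex(nodes)

# Retrieve small (precise sentences, small top-k), synthesize big (the window).
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
```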
13:50 A very related idea here is
13:53 embedding a reference to the
13:55 parent chunk, as opposed to the actual
13:57 text chunk itself. So for instance, if
14:00 you embed not the raw text
14:01 chunk itself but actually
14:03 a smaller chunk, or a summary, or
14:06 questions that the chunk answers, we
14:08 have found that that actually helps
14:09 improve retrieval performance a decent
14:11 amount. It again goes
14:14 along with this idea that a lot of times
14:15 you want to embed something that's more
14:17 amenable to embedding-based retrieval,
14:19 but then you want to return enough
14:20 context so that the LLM can actually
14:22 synthesize over that context.
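A hedged sketch of this, using LlamaIndex's IndexNode and RecursiveRetriever from roughly this era (the signatures are assumptions taken from docs of that time): you embed a small proxy, a summary or a generated question, holding a reference back to its parent chunk, and retrieval resolves the reference so the LLM sees the full parent:

```python
from llama_index import VectorStoreIndex
from llama_index.retrievers import RecursiveRetriever
from llama_index.schema import IndexNode

# `parent_chunks` and `summarize` are illustrative stand-ins: parent nodes from
# your parser, and some LLM call that produces a short summary per chunk.
proxy_nodes = [
    IndexNode(text=summarize(chunk), index_id=chunk.node_id)
    for chunk in parent_chunks
]
vector_retriever = VectorStoreIndex(proxy_nodes).as_retriever(similarity_top_k=2)
retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    node_dict={chunk.node_id: chunk for chunk in parent_chunks},  # proxy -> parent
)
```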
14:29 The next bit here is
14:32 even more advanced stuff: this gets
14:34 into agents, and into
14:35 that last pillar that I mentioned,
14:37 which is how can you use LLMs for
14:39 reasoning as opposed to just synthesis.
14:42 The intuition here is that for a
14:44 lot of RAG, if you're just using the LLM
14:46 at the end, you're, one, constrained by the
14:47 quality of your retriever, and you're
14:49 really only able to do stuff like
14:51 question answering. There are certain
14:53 types of questions, and more advanced
14:55 analysis that you might want to run,
14:56 that top-k RAG can't really answer:
14:58 it's not necessarily just a one-off
15:00 question; you might need an
15:02 entire sequence of reasoning steps to
15:03 actually pull together a piece of
15:04 information, or you might want to
15:06 summarize a document and compare it with
15:08 other documents. So one kind of
15:11 architecture we're exploring right now
15:13 is this idea of multi-document
15:14 agents: what if, instead of just
15:16 RAG, we moved a little bit more into
15:18 agent territory? We model each document
15:21 not just as a sequence of text chunks
15:23 but actually as a set of tools that
15:24 contains the ability both to
15:26 summarize that document as well as to do
15:28 QA over that document, over specific
15:30 facts. And of course, if you want to
15:32 scale to hundreds or
15:34 thousands or millions of documents,
15:36 typically an agent can only have access
15:38 to a limited window of tools, so you
15:41 probably want to do some sort of
15:42 retrieval over these tools, similar to how
15:44 you want to retrieve text chunks
15:46 from a document. The main difference is
15:48 that because these are tools, you
15:49 actually want to act upon them, you want
15:51 to use them, as opposed to just
15:52 taking the raw text and plugging it into
15:54 the context window. So blending this
15:56 combination of
15:59 embedding-based retrieval, or any sort of
16:01 retrieval, with agent tool use
16:03 is a very interesting paradigm that I
16:05 think is really only possible in this
16:06 age of LLMs and hasn't really existed before.
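A hedged sketch of the multi-document agent idea, assuming LlamaIndex's agent and tool abstractions of this era; the per-document indices are illustrative placeholders, and at scale you'd retrieve over the tools themselves rather than passing all of them in:

```python
from llama_index.agent import OpenAIAgent
from llama_index.tools import QueryEngineTool, ToolMetadata

tools = []
for name, vector_index, summary_index in doc_indices:  # assumed per-doc indices
    # each document becomes two tools: factual QA and summarization
    tools.append(QueryEngineTool(
        query_engine=vector_index.as_query_engine(),
        metadata=ToolMetadata(name=f"qa_{name}", description=f"QA over {name}"),
    ))
    tools.append(QueryEngineTool(
        query_engine=summary_index.as_query_engine(),
        metadata=ToolMetadata(name=f"summary_{name}", description=f"Summarize {name}"),
    ))

# The agent reasons over which tool(s) to call, instead of one-shot top-k RAG.
agent = OpenAIAgent.from_tools(tools, verbose=True)
print(agent.chat("Compare the risk factors across these documents."))
```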
16:12 Another kind of advanced concept is
16:15 this idea of fine-tuning. Some
16:18 other
16:19 presenters have talked about this as
16:21 well, but the idea of fine-tuning in
16:23 a RAG system is that it optimizes
16:26 specific pieces of the RAG pipeline to
16:28 improve
16:31 the performance of either the retrieval or
16:33 synthesis capabilities. One thing you
16:35 can do is fine-tune your embeddings. I
16:37 think Anton was talking about this as
16:38 well: if you just use a pre-trained
16:40 model, the embedding representations are
16:42 not going to be optimized over your
16:43 specific data, so sometimes you're just
16:45 going to retrieve the wrong
16:47 information. If you can somehow tune
16:50 these embeddings so that, given any sort
16:52 of relevant question the user
16:53 might ask, you're actually returning
16:55 the relevant response, then you're going
16:57 to have better
16:58 performance. So an idea here is
17:01 to generate a synthetic query data set
17:03 from raw text chunks using LLMs and use
17:05 this to fine-tune an embedding model.
17:07 And, if we go
17:10 back really quick, you can do
17:12 this by fine-tuning
17:14 the base model itself, or you can
17:16 fine-tune an adapter on top of the model.
17:19 Fine-tuning an adapter on top of
17:20 the model has a few advantages, in that
17:22 you don't need the base model's
17:23 weights to fine-tune stuff, and
17:25 if you just fine-tune the query side you don't
17:27 have to reindex your entire document corpus.
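A hedged sketch of that embedding fine-tuning loop, assuming LlamaIndex's finetuning module of this era: generate synthetic (question, chunk) pairs with an LLM, then fine-tune a local sentence-transformers model on them; the base model ID is an illustrative choice:

```python
from llama_index.finetuning import (
    SentenceTransformersFinetuneEngine,
    generate_qa_embedding_pairs,
)

# `train_nodes` is assumed: parsed text chunks held out for training.
train_dataset = generate_qa_embedding_pairs(train_nodes)  # LLM writes questions per chunk

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",        # illustrative open-source base model
    model_output_path="finetuned_model",
)
finetune_engine.finetune()
embed_model = finetune_engine.get_finetuned_model()  # plug back into your index
```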
17:32 There's also fine-tuning LLMs,
17:34 which of course a lot of people are
17:36 very interested in doing these days.
17:38 The intuition here, specifically for RAG,
17:40 is that weaker LLMs, like GPT-3.5
17:42 Turbo or Llama 2 7B,
17:52 are maybe a little bit worse at
17:53 response synthesis, reasoning, structured
17:56 outputs, etc., compared to bigger models.
17:58 So the solution here is: what if you
18:00 generate a synthetic data set using a
18:02 bigger model like GPT-4 (that's something
18:04 we're exploring) and you actually distill
18:06 that into GPT-3.5 Turbo, so it gets better at
18:08 chain of thought, longer response quality,
18:11 better structured outputs, and a lot
18:13 of other possibilities as well?
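A hedged sketch of that distillation idea, assuming LlamaIndex's OpenAI fine-tuning wrapper of this era; "finetuning_events.jsonl" is an assumed file of logged GPT-4 prompt/completion pairs from your RAG pipeline:

```python
from llama_index.finetuning import OpenAIFinetuneEngine

# Distill logged GPT-4 traces into gpt-3.5-turbo via OpenAI's fine-tuning API.
finetune_engine = OpenAIFinetuneEngine("gpt-3.5-turbo", "finetuning_events.jsonl")
finetune_engine.finetune()                      # kicks off the fine-tuning job
ft_llm = finetune_engine.get_finetuned_model()  # use as the RAG synthesis LLM
```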
18:15 So all these things are in our docs: there's
18:17 production RAG, there's fine-tuning,
18:19 and I have two seconds left, so thank you.