00:14 hey everyone, my name is Jerry,
00:16 co-founder and CEO of LlamaIndex, and today
00:18 we'll be talking about how to build
00:19 production-ready RAG applications. I
00:22 think there's still time for the raffle
00:23 for the bucket hat, so if you stop
00:24 by our booth, please fill out the
00:26 Google form. Okay, let's get started. So
00:29 as everybody knows, there's been a
00:31 ton of amazing use cases in GenAI recently:
00:33 knowledge search and QA,
00:36 conversational agents, workflow
00:37 automation, document processing. These are
00:40 all things that you can build,
00:41 especially using the reasoning
00:43 capabilities of LLMs over your
00:46 data. So if we just do a quick refresher
00:49 in terms of paradigms for how you
00:51 actually get language models to
00:53 understand data they haven't been trained
00:54 over, there are really two main
00:56 paradigms. One is retrieval augmentation,
00:59 where you fix the model and
01:01 basically create a data pipeline to put
01:03 context from some data
01:05 source into the input prompt of the
01:08 language model, so like a vector
01:10 database, unstructured data,
01:13 etc. The next paradigm here is
01:16 fine-tuning: how can we bake knowledge
01:18 into the weights of the network by
01:20 actually updating the weights of the
01:22 model itself, or some adapter on top of the
01:24 model, but basically some sort of
01:25 training process over some new data to
01:28 actually incorporate knowledge. We'll
01:30 probably talk a little bit more about
01:31 retrieval augmentation, but this is just
01:33 to help you get started and
01:35 really understand the mission
01:36 statement of the
01:38 company. Okay, let's talk about RAG,
01:41 retrieval-augmented generation. It's
01:44 become kind of a buzzword recently, but
01:47 we'll first walk through the current RAG
01:49 stack for building a QA system. This
01:51 really consists of two main components:
01:53 data ingestion, as well as data
01:55 querying, which contains retrieval and
01:56 synthesis. If you're just getting
01:58 started with LlamaIndex, you can basically
02:00 do this in around five-ish lines of
02:02 code, so you don't really need to
02:03 think about it. But if you do want to
02:05 learn some of the lower-level components,
02:07 and I do encourage every
02:09 AI engineer to
02:10 learn how these components work under
02:12 the hood, I would encourage you to
02:14 check out some of our docs to really
02:15 understand how you actually do data
02:17 ingestion and data querying: how do you
02:19 actually retrieve from a vector database,
02:21 and how do you synthesize that with an
02:23 LLM.
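As a concrete illustration, here's a minimal sketch of that five-ish-line quickstart, assuming the llama_index package roughly as it existed at the time of this talk (import paths have shifted across versions) with an OPENAI_API_KEY set; the "data" folder and the question are illustrative placeholders:

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()  # data ingestion
index = VectorStoreIndex.from_documents(documents)     # chunk, embed, store
query_engine = index.as_query_engine()                 # retrieval + synthesis
print(query_engine.query("What are the key takeaways of this document?"))
```

Each line maps to a stage of the stack: loading is ingestion, building the index handles chunking and embedding into a vector store, and the query engine wraps both retrieval and LLM synthesis.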
02:27 So that's basically the key stack that's emerging these days:
02:29 for every sort of chatbot, like
02:31 chat over your PDF, over your
02:33 unstructured data, a lot of these
02:35 things are basically using the same
02:37 principles of how you actually
02:39 load data from some data source and
02:41 retrieve and query
02:43 over it. But I think as developers are
02:47 actually building these applications,
02:48 they're realizing that this isn't quite
02:50 enough; there are certain
02:53 issues that you run into that are
02:54 blockers for actually being able to
02:56 productionize these applications. And so
02:58 what are these challenges with naive RAG?
03:01 One aspect here, and
03:04 this is the key thing
03:05 that we're focused on, is that the
03:06 response quality is not very good. You
03:08 run into, for instance, bad retrieval
03:10 issues: during the retrieval
03:12 stage from your vector database, if
03:14 you're not actually returning the
03:15 relevant chunks from your vector
03:17 database, you're not going to be able to
03:18 have the correct context actually put
03:20 into the LLM. This includes certain
03:22 issues like low precision: not all chunks
03:24 in the retrieved set are relevant. This
03:27 leads to hallucination and lost-in-
03:28 the-middle problems; you have a lot of
03:30 fluff in the returned response. It could also
03:32 mean low recall: your top-k isn't
03:34 high enough, or basically the
03:35 set of information that you
03:37 need to actually answer the question is
03:38 just not there. And of course there are
03:41 other issues too, like outdated
03:43 information. And many of you who are
03:44 building apps these days might be
03:46 familiar with some key reasons
03:48 why the LLM isn't always
03:50 guaranteed to give you a correct
03:52 answer: there's hallucination, irrelevance,
03:54 toxicity, bias; there are a lot of
03:56 issues on the LLM side as
03:58 well. So what can we
04:02 actually do to try to improve the
04:04 performance of a retrieval-augmented
04:06 generation application? For
04:08 many of you, you might be running
04:09 into certain issues, and it really runs
04:12 the gamut across the entire
04:14 pipeline. There's stuff you can do on the
04:16 data side: can we store additional
04:17 information beyond just the raw
04:19 text chunks that you're
04:21 putting in the vector database? Can you
04:22 optimize that data pipeline somehow, play
04:24 around with chunk sizes, that type of
04:25 thing? Can you optimize the embedding
04:27 representation itself? A lot of times
04:29 when you're using a pre-trained embedding
04:31 model, it's not really optimal for giving
04:33 you the best performance. There's the
04:35 retrieval algorithm: the default
04:37 thing you do is just look up the top-k
04:39 most similar elements from your vector
04:41 database to return to the LLM. Many
04:44 times that's not enough, and what are
04:45 both simple things you can
04:46 do as well as hard things? And there's
04:48 also synthesis: can
04:53 we use LLMs for more than generation?
04:55 Basically, you can use the
04:57 LLM to actually help you with
04:59 reasoning, as opposed to just
05:05 pure generation. You can actually
05:06 use it to reason: given a
05:08 question, can you break it down into
05:10 simpler questions, route to different
05:12 data sources, and have a
05:15 more sophisticated way of querying
05:18 data? Of course, if you've
05:21 been around some of my recent talks,
05:22 I always say: before you actually try any
05:24 of these techniques, you need to be
05:26 pretty task-specific and make sure
05:27 that you actually have
05:29 a way to measure
05:30 performance. So I'll probably spend
05:33 two minutes talking about evaluation.
05:35 Simon, my co-founder, just ran a workshop
05:36 yesterday on how to
05:38 build a data set, evaluate
05:40 RAG systems, and iterate on them.
05:42 If you missed the workshop, don't worry,
05:44 we'll have the slides and
05:45 materials available online so that
05:47 you can take a look. At a very high
05:50 level, evaluation is
05:52 important because you basically need to
05:53 define a benchmark for your system to
05:55 understand how you're going to iterate
05:57 on and improve it. And there are a
05:59 few different ways you can try to do
06:00 evaluation. I think Anton from
06:02 Chroma was just saying some of this,
06:04 but you basically need a way to
06:07 evaluate both the end-to-end solution,
06:09 where you have your input query as well
06:11 as the output response, and you also want to
06:13 be able to evaluate
06:14 specific components: if you've
06:16 diagnosed that the retrieval is
06:18 the portion that needs improving, you
06:19 need retrieval metrics to really
06:21 understand how you can improve your
06:23 retrieval system. So there's retrieval
06:27 and there's synthesis; let's spend
06:29 about 30 seconds on each one.
06:31 For evaluation of retrieval, what does this
06:33 look like? You basically want to make
06:35 sure that the stuff that's returned
06:37 actually answers the query, that
06:39 you're not returning a
06:41 bunch of fluff, and that the stuff
06:42 you return is relevant to the
06:44 question. So first you need an
06:46 evaluation data set. A lot of people
06:48 have human-labeled data sets; if
06:50 you're building stuff in prod, you
06:52 might have user feedback as well; if
06:54 not, you can synthetically generate a
06:56 data set. This data set has as input a
06:58 query, and as output the IDs of the
07:01 returned documents that are relevant to the
07:02 query. So you need that somehow. Once you
07:05 have that, you can measure stuff with
07:06 ranking metrics: you can measure
07:08 things like success rate, hit rate, MRR, NDCG,
07:12 a variety of these things. And
07:14 once you're able to evaluate this,
07:16 note this really isn't
07:18 an LLM problem, this is an IR
07:20 problem, and it has been around for at
07:21 least a decade or two. But a lot
07:24 of this
07:26 is still very relevant in the face of
07:27 actually building these LLM apps.
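To make those ranking metrics concrete, here's a toy sketch of hit rate and MRR in plain Python; the (retrieved IDs, relevant IDs) data format is an assumed illustration, not any particular framework's schema:

```python
def hit_rate(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1.0 if any relevant document shows up in the retrieved top-k, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in retrieved_ids) else 0.0

def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1/rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Averaging over a (retrieved, relevant) eval set gives hit rate and MRR.
eval_set = [(["d3", "d1", "d7"], {"d1"}), (["d2", "d5", "d9"], {"d4"})]
print(sum(hit_rate(r, g) for r, g in eval_set) / len(eval_set))         # 0.5
print(sum(reciprocal_rank(r, g) for r, g in eval_set) / len(eval_set))  # 0.25
```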
07:31 The next piece here: there's a
07:33 retrieval portion, but then you
07:34 generate a response from it, so how
07:36 do you actually evaluate the whole thing
07:37 end-to-end? For evaluation of the final
07:40 response given the input, you still
07:42 want to generate some sort of data set.
07:44 You could do that through human
07:45 annotations or user feedback; you could have
07:47 ground-truth reference answers
07:49 given the query that really indicate,
07:50 hey, this is the proper answer to
07:52 this question. And you can also
07:54 synthetically generate it
07:56 with something like GPT-4. You run this through
07:58 the full RAG pipeline that you built,
08:00 the retrieval and synthesis, and you can
08:02 run LLM-based evals, both label-free
08:04 evals and with-label evals. There are a lot
08:07 of projects these days about
08:09 how to properly evaluate
08:11 the predicted outputs of a
08:15 model.
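As one example of a label-free, LLM-based eval, here's a hedged sketch using LlamaIndex's faithfulness evaluator, which asks an LLM whether the response is supported by the retrieved context; class names and constructor arguments follow the docs of roughly this era and have changed across versions:

```python
from llama_index import ServiceContext
from llama_index.evaluation import FaithfulnessEvaluator
from llama_index.llms import OpenAI

# Use a strong model (e.g. GPT-4) as the judge.
eval_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"))
evaluator = FaithfulnessEvaluator(service_context=eval_context)

# `response` is assumed to be the object returned by query_engine.query(...),
# which carries both the answer text and the source nodes it was built from.
result = evaluator.evaluate_response(response=response)
print(result.passing)
```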
08:17 Once you've defined your eval benchmark, now you want to think about
08:19 how to actually optimize your RAG
08:20 system. I showed a teaser of this slide
08:23 yesterday, but the way I
08:26 think about this is: when do you
08:28 want to actually improve your system?
08:30 There are a million things you
08:31 can do to try to improve your
08:33 RAG system, and you probably
08:35 don't want to start with the hard stuff
08:37 first, just because part
08:38 of the value of language models is how
08:40 they've democratized access for
08:42 every developer; it's really just made it
08:43 easy for people to get up and running.
08:45 And so if, for instance, you're running
08:47 into some performance issues with RAG,
08:49 I'd probably start with the basics,
08:50 what I call table-stakes RAG
08:52 techniques: better parsing, so
08:54 that you don't just split into even chunks;
08:56 adjusting your chunk sizes; trying out
08:58 stuff that's already integrated with a
08:59 vector database, like hybrid search as
09:01 well as metadata
09:03 filters. There are also advanced
09:05 retrieval methods that you could try;
09:07 this is a little bit more advanced,
09:08 some of it pulls from traditional
09:10 IR, and some of it is
09:12 new to this age of
09:15 LLM-based apps. There's
09:17 reranking, which is a traditional
09:19 concept; there are also concepts in
09:20 LlamaIndex like recursive retrieval,
09:22 dealing with embedded tables,
09:24 small-to-big retrieval, and a lot of
09:26 other stuff that we have to help you
09:27 potentially improve the performance of
09:29 your application. And then the last
09:31 bit gets into more
09:33 expressive stuff that might be harder to
09:35 implement, might incur higher latency and
09:36 cost, but is potentially more powerful
09:38 and forward-looking: agents.
09:40 How do you incorporate agents into
09:42 better RAG pipelines, to better
09:45 answer different types of questions and
09:46 synthesize
09:50 information? Let's talk a little bit about the
09:52 table stakes first. So, chunk sizes: tuning
09:55 your chunk size can have outsize impacts
09:57 on performance. If you've
09:59 played around with RAG systems,
10:01 this may or may not be obvious to you.
10:03 What's interesting, though, is that
10:05 more retrieved tokens does not always
10:06 equate to higher performance, and if
10:09 you do reranking of your retrieved
10:10 tokens, it doesn't necessarily mean that
10:12 your final generated response is going
10:14 to be better. This is again due to
10:16 stuff like lost-in-the-middle problems,
10:17 where stuff in the middle of the LLM
10:19 context window tends to get lost, while
10:20 stuff at the end tends to be a little
10:22 bit better remembered by the LLM.
10:26 We did a workshop with
10:27 Arize a few weeks ago where
10:30 basically we showed there is
10:32 an optimal chunk size given
10:33 your data set, and a lot of times, when
10:35 you try out stuff like reranking, it
10:36 can actually increase your error metrics.
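A minimal sketch of what tuning chunk size looks like, assuming the ServiceContext API from the LlamaIndex version of this era; the candidate sizes are illustrative, and the "right" one is whatever scores best on your eval benchmark:

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()
for chunk_size in (128, 256, 512, 1024):
    # chunk_size controls how the node parser splits documents before embedding
    ctx = ServiceContext.from_defaults(chunk_size=chunk_size)
    index = VectorStoreIndex.from_documents(documents, service_context=ctx)
    query_engine = index.as_query_engine()
    # ... run your retrieval / response eval benchmark here and compare ...
```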
10:39 Metadata filtering: this is
10:42 another very table-stakes thing that
10:44 I think everybody should look into, and I
10:46 think vector databases,
10:48 Chroma, Pinecone, Weaviate, these vector
10:50 databases are all implementing these
10:52 capabilities under the hood. Metadata
10:54 filtering is basically: how can
10:56 you add structured context to
10:59 your chunks, your text chunks?
11:01 You can use this for both
11:02 embeddings as well as synthesis, and it
11:04 also integrates with the
11:06 metadata filter capabilities of a vector
11:08 database. So metadata is just a
11:10 structured JSON dictionary: it
11:12 could be the page number, it could be
11:13 the document title, it could be the
11:15 summary of adjacent chunks. You can get
11:16 creative with it too; you could
11:17 hallucinate questions that the
11:19 chunk answers. It can help
11:21 retrieval, it can help augment your
11:22 response quality, and it also integrates with
11:24 the vector database
11:26 filters. So as an example, let's say the
11:29 question is over an SEC 10-Q
11:32 document: can you tell me the
11:34 risk factors in 2021? If you just do raw
11:37 semantic search, typically it's very low-
11:38 precision: you're going to return a bunch
11:40 of stuff that may or may not match this,
11:42 and you might even return stuff from
11:43 other years if you have a bunch of
11:45 documents from different years in the
11:46 same vector collection. So
11:48 you're kind of rolling the dice a
11:52 bit. But one idea here is: if
11:55 you have access to the metadata
11:58 of the documents and you ask a
12:00 question like this, you basically combine
12:01 structured query capabilities, by
12:03 inferring metadata filters like a
12:05 WHERE clause in a SQL statement, like
12:06 year = 2021, and you combine that
12:08 with semantic search to return the most
12:10 relevant candidates given your query, and
12:12 this improves the precision of your results.
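Here's a hedged sketch of that combination, assuming LlamaIndex's vector store filter types of this era and an `index` whose chunks carry a "year" metadata key (both assumptions for illustration); in practice the filter can also be inferred from the query by an LLM:

```python
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

retriever = index.as_retriever(
    similarity_top_k=2,
    # the structured part: equivalent to "WHERE year = 2021" in SQL,
    # applied alongside the semantic search in the vector database
    filters=MetadataFilters(filters=[ExactMatchFilter(key="year", value=2021)]),
)
nodes = retriever.retrieve("Can you tell me the risk factors in 2021?")
```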
12:17 Moving on to stuff that's maybe
12:20 a bit more advanced: one advanced
12:21 retrieval technique that we found
12:23 generally helps is this idea of
12:25 small-to-big retrieval. So what does
12:27 that mean? Basically, right now, when you
12:29 embed a big text chunk, you also
12:31 synthesize over that text chunk, and
12:33 that's a little suboptimal, because what
12:35 if the embedding representation is
12:37 biased, because there's a bunch
12:39 of fluff in that text chunk
12:40 alongside the relevant information?
12:42 You're not actually optimizing your
12:43 retrieval quality; embedding a big
12:46 text chunk sometimes feels a little
12:47 suboptimal. One thing you could do
12:49 is embed text at the sentence
12:51 level, or at a smaller level, and then
12:53 expand that window during synthesis time.
12:55 This is contained in a variety
12:57 of LlamaIndex abstractions, but the idea is
13:00 that you retrieve on more
13:02 granular pieces of information, so
13:04 smaller chunks. This makes it so that
13:06 these chunks are more likely to be
13:07 retrieved when you actually ask a query
13:09 over these specific pieces of context, but
13:11 then you want to make sure that the LLM
13:12 actually has access to more information
13:14 to actually synthesize a proper
13:16 result. So this leads to more
13:19 precise retrieval. We
13:22 tried this out, and it helps avoid
13:24 some lost-in-the-middle problems: you can
13:25 set a smaller top-k value, like k = 2,
13:28 whereas over this data set, if
13:30 you set k = 5 for naive
13:32 retrieval over big text chunks, you
13:34 basically start returning a lot of
13:36 context, and that leads to
13:38 issues where maybe the
13:40 relevant context is in the middle but
13:41 you're not able to find it, or
13:43 the LLM is not
13:45 able to synthesize over that information.
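A minimal sketch of one form of this, sentence-window retrieval, assuming LlamaIndex's node parser and postprocessor APIs of this era: each sentence is embedded on its own, and the surrounding window of sentences is swapped back in before synthesis:

```python
from llama_index import VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

# Split into single-sentence nodes, each storing a window of neighbors in metadata.
parser = SentenceWindowNodeParser.from_defaults(
    window_size=3, window_metadata_key="window"
)
nodes = parser.get_nodes_from_documents(documents)  # `documents` as loaded earlier
index = VectorStoreIndex(nodes)

# Retrieve small (precise sentences, small top-k), synthesize big (the window).
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
```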
13:50 A very related idea here is
13:53 embedding a reference to the
13:55 parent chunk, as opposed to the actual
13:57 text chunk itself. So for instance, if
14:00 you embed not the raw text
14:01 chunk itself but actually
14:03 a smaller chunk, or a summary, or
14:06 questions that the chunk answers, we
14:08 have found that that actually helps
14:09 improve retrieval performance a decent
14:11 amount. It again goes
14:14 along with this idea that a lot of times
14:15 you want to embed something that's more
14:17 amenable to embedding-based retrieval,
14:19 but then you want to return enough
14:20 context so that the LLM can actually
14:22 synthesize over that context.
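A hedged sketch of this, using LlamaIndex's IndexNode and RecursiveRetriever from roughly this era (the signatures are assumptions taken from docs of that time): you embed a small proxy, a summary or a generated question, holding a reference back to its parent chunk, and retrieval resolves the reference so the LLM sees the full parent:

```python
from llama_index import VectorStoreIndex
from llama_index.retrievers import RecursiveRetriever
from llama_index.schema import IndexNode

# `parent_chunks` and `summarize` are illustrative stand-ins: parent nodes from
# your parser, and some LLM call that produces a short summary per chunk.
proxy_nodes = [
    IndexNode(text=summarize(chunk), index_id=chunk.node_id)
    for chunk in parent_chunks
]
vector_retriever = VectorStoreIndex(proxy_nodes).as_retriever(similarity_top_k=2)
retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    node_dict={chunk.node_id: chunk for chunk in parent_chunks},  # proxy -> parent
)
```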
14:29 The next bit here is
14:32 even more advanced stuff: this gets
14:34 into agents, and into
14:35 that last pillar that I mentioned,
14:37 which is how can you use LLMs for
14:39 reasoning as opposed to just synthesis.
14:42 The intuition here is that for a
14:44 lot of RAG, if you're just using the LLM
14:46 at the end, you're, one, constrained by the
14:47 quality of your retriever, and you're
14:49 really only able to do stuff like
14:51 question answering. There are certain
14:53 types of questions, and more advanced
14:55 analysis that you might want to run,
14:56 that top-k RAG can't really answer:
14:58 it's not necessarily just a one-off
15:00 question; you might need an
15:02 entire sequence of reasoning steps to
15:03 actually pull together a piece of
15:04 information, or you might want to
15:06 summarize a document and compare it with
15:08 other documents. So one kind of
15:11 architecture we're exploring right now
15:13 is this idea of multi-document
15:14 agents: what if, instead of just
15:16 RAG, we moved a little bit more into
15:18 agent territory? We model each document
15:21 not just as a sequence of text chunks
15:23 but actually as a set of tools that
15:24 contains the ability both to
15:26 summarize that document as well as to do
15:28 QA over that document, over specific
15:30 facts. And of course, if you want to
15:32 scale to hundreds or
15:34 thousands or millions of documents,
15:36 typically an agent can only have access
15:38 to a limited window of tools, so you
15:41 probably want to do some sort of
15:42 retrieval over these tools, similar to how
15:44 you want to retrieve text chunks
15:46 from a document. The main difference is
15:48 that because these are tools, you
15:49 actually want to act upon them, you want
15:51 to use them, as opposed to just
15:52 taking the raw text and plugging it into
15:54 the context window. So blending this
15:56 combination of
15:59 embedding-based retrieval, or any sort of
16:01 retrieval, with agent tool use
16:03 is a very interesting paradigm that I
16:05 think is really only possible in this
16:06 age of LLMs and hasn't really existed before.
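A hedged sketch of the multi-document agent idea, assuming LlamaIndex's agent and tool abstractions of this era; the per-document indices are illustrative placeholders, and at scale you'd retrieve over the tools themselves rather than passing all of them in:

```python
from llama_index.agent import OpenAIAgent
from llama_index.tools import QueryEngineTool, ToolMetadata

tools = []
for name, vector_index, summary_index in doc_indices:  # assumed per-doc indices
    # each document becomes two tools: factual QA and summarization
    tools.append(QueryEngineTool(
        query_engine=vector_index.as_query_engine(),
        metadata=ToolMetadata(name=f"qa_{name}", description=f"QA over {name}"),
    ))
    tools.append(QueryEngineTool(
        query_engine=summary_index.as_query_engine(),
        metadata=ToolMetadata(name=f"summary_{name}", description=f"Summarize {name}"),
    ))

# The agent reasons over which tool(s) to call, instead of one-shot top-k RAG.
agent = OpenAIAgent.from_tools(tools, verbose=True)
print(agent.chat("Compare the risk factors across these documents."))
```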
16:12 Another kind of advanced concept is
16:15 this idea of fine-tuning. Some
16:18 other
16:19 presenters have talked about this as
16:21 well, but the idea of fine-tuning in
16:23 a RAG system is that it optimizes
16:26 specific pieces of the RAG pipeline to
16:28 improve
16:31 the performance of either the retrieval or
16:33 synthesis capabilities. One thing you
16:35 can do is fine-tune your embeddings. I
16:37 think Anton was talking about this as
16:38 well: if you just use a pre-trained
16:40 model, the embedding representations are
16:42 not going to be optimized over your
16:43 specific data, so sometimes you're just
16:45 going to retrieve the wrong
16:47 information. If you can somehow tune
16:50 these embeddings so that, given any sort
16:52 of relevant question the user
16:53 might ask, you're actually returning
16:55 the relevant response, then you're going
16:57 to have better
16:58 performance. So an idea here is
17:01 to generate a synthetic query data set
17:03 from raw text chunks using LLMs and use
17:05 this to fine-tune an embedding model.
17:07 And, if we go
17:10 back really quick, you can do
17:12 this by fine-tuning
17:14 the base model itself, or you can
17:16 fine-tune an adapter on top of the model.
17:19 Fine-tuning an adapter on top of
17:20 the model has a few advantages, in that
17:22 you don't need the base model's
17:23 weights to fine-tune stuff, and
17:25 if you just fine-tune the query side you don't
17:27 have to reindex your entire document corpus.
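A hedged sketch of that embedding fine-tuning loop, assuming LlamaIndex's finetuning module of this era: generate synthetic (question, chunk) pairs with an LLM, then fine-tune a local sentence-transformers model on them; the base model ID is an illustrative choice:

```python
from llama_index.finetuning import (
    SentenceTransformersFinetuneEngine,
    generate_qa_embedding_pairs,
)

# `train_nodes` is assumed: parsed text chunks held out for training.
train_dataset = generate_qa_embedding_pairs(train_nodes)  # LLM writes questions per chunk

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",        # illustrative open-source base model
    model_output_path="finetuned_model",
)
finetune_engine.finetune()
embed_model = finetune_engine.get_finetuned_model()  # plug back into your index
```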
17:32 There's also fine-tuning LLMs,
17:34 which of course a lot of people are
17:36 very interested in doing these days.
17:38 The intuition here, specifically for RAG,
17:40 is that weaker LLMs, like GPT-3.5
17:42 Turbo or Llama 2 7B,
17:52 are maybe a little bit worse at
17:53 response synthesis, reasoning, structured
17:56 outputs, etc., compared to bigger models.
17:58 So the solution here is: what if you
18:00 generate a synthetic data set using a
18:02 bigger model like GPT-4 (that's something
18:04 we're exploring) and you actually distill
18:06 that into GPT-3.5 Turbo, so it gets better at
18:08 chain of thought, longer response quality,
18:11 better structured outputs, and a lot
18:13 of other possibilities as well?
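A hedged sketch of that distillation idea, assuming LlamaIndex's OpenAI fine-tuning wrapper of this era; "finetuning_events.jsonl" is an assumed file of logged GPT-4 prompt/completion pairs from your RAG pipeline:

```python
from llama_index.finetuning import OpenAIFinetuneEngine

# Distill logged GPT-4 traces into gpt-3.5-turbo via OpenAI's fine-tuning API.
finetune_engine = OpenAIFinetuneEngine("gpt-3.5-turbo", "finetuning_events.jsonl")
finetune_engine.finetune()                      # kicks off the fine-tuning job
ft_llm = finetune_engine.get_finetuned_model()  # use as the RAG synthesis LLM
```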
18:15 So all these things are in our docs: there's
18:17 production RAG, there's fine-tuning,
18:19 and I have two seconds left, so thank you.