00:00 Hello everyone, welcome to my video series. I rotate through three types of topics: educational topics, use-case topics, and bias, ethics, and safety topics. We're now on the education rotation, and today I wanted to talk about retrieval-augmented generation, or RAG.
00:28 You may think I'm going into some nook and cranny of the AI field, but this is a very important and popular solution pattern that I see used over and over again for leveraging large language models, so I thought I would explain it to you. What it's used for, basically, is systems that leverage large language models on your own content.
00:55 Let me describe that. Think of the ChatGPT experience relative to the search engine experience we had before. If you ask a question like "what color is the sky?" or "how do I fix this plumbing issue?", a search engine would go out (or appear to go out), search the internet, find relevant content, and then just list that content, list those links. You as a user would need to click on the links that seemed right, read them, digest them, and figure out the answer to your question.
01:31 What a large language model does is seem to do that first part, meaning leverage the content of the whole internet, but instead of just listing that content, it digests it, combines it, assembles it, and answers your question; it generates an answer. So it's a whole lot better. Search engines have been great, but this takes the whole experience to another level.
01:55 In addition to question answering, you can also give it instructions like "write me this document" or "write me a lesson plan to teach geometry to seventh graders," and it will do something similar: it will assemble content it has seen that talks about geometry, or seventh graders, or how to write lesson plans, pull that together, and then write out a lesson plan. Okay, so
it's a much better experience than just taking the raw content from the internet; it really creates something new from it.
02:28 Now let's say you want that same experience, but on your own content. It might be a chatbot on your website. Or you might have a library of PDF documents, say the documentation for one of your products, and instead of just linking the user to sections of the documentation, you want to actually answer their question. Or it might be your service ticketing system, so that when a new issue comes in you could ask "how would I resolve this issue?" and it can assemble past similar issues and come up with a new solution based on them.
03:04 So this is an incredible experience that these large language models offer, but how can you create that experience on your own content, which might not be available on the internet or available to these large language models? Well, the solution is this RAG architecture, this retrieval-augmented generation architecture. Now I'm going to do my best to explain it to
you.
03:28 So, let's say you have a user, and I'm going to use the example of a patient chatbot. The content source is going to be content from your website, say, or it could be content from PDF documents or whatever, but you want this to be the content that answers the patient's questions. So if the patient has a question like "how do I prepare for my knee surgery?", instead of just going to ChatGPT and getting a generic answer, you'd like to provide an answer from your health system. Or for a question like "do you have parking?", you'd like to provide an answer for your health system, for the office where the patient is seen.
04:10 Okay, so that's the scenario I'd like to walk through. So there's the question, and I'm going to use "do you have parking?". You can imagine that question being bundled up into what's called a prompt, which I'll describe later. That prompt is sent to a large language model, and the large language model will come up with a response to that question.
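In code, that basic flow, bundle the question into a prompt and send it off, can be sketched like this. This is a minimal sketch: `build_prompt` and `call_llm` are hypothetical names, and `call_llm` is stubbed out so the example runs offline, standing in for whatever real LLM API you use.

```python
def build_prompt(question: str) -> str:
    """Bundle the raw user question into a simple prompt."""
    return f"Please answer the following patient question:\n\n{question}"

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call (e.g., a chat-completions
    endpoint). Here it just echoes, so the sketch runs offline."""
    return f"[LLM response to: {prompt}]"

question = "Do you have parking?"
response = call_llm(build_prompt(question))
```

With no extra content, this is all a generic chatbot does: the model answers from whatever it learned in training.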
04:48 Okay, now if you just wanted to use ChatGPT, say, or some other LLM without any extra content, you could just use this flow: "how do I prepare for my knee surgery?" or "do you have parking?" goes into a prompt, that's sent to the large language model, and you get a response back.
05:06 But what we want to do is enhance this experience with our own content. So let's say here is your content source; again, this might be all the content of your website, or PDF documents, or an internal ticketing system, or databases, that sort of thing. And what you'd like to do is something called "the prompt before the prompt."
05:32 In these systems, you don't just send the user's question to the large language model; you usually include some level of instructions. The instructions might be: "You are a contact center specialist working for a hospital, answering patient questions that come in over the internet. Please be nice to the patients, and responsive, and folksy, because that fits with our brand." Instructions like that are sometimes sent with the prompt.
06:03 Additionally, you want to provide the information that the LLM needs to answer the question. What you'd ideally like is for information from your website to be included here and sent to the LLM as well. So the full prompt might be your instructions, then something like "please use this content to answer the patient question at the end," then a bunch of information about parking or knee surgery or whatever the patient asked, that's the prompt before the prompt, and then the question itself. You send that whole package to the LLM, and the LLM will give a great response grounded in your content. Okay, with me so far?
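The "prompt before the prompt" is really just string assembly: instructions, then your content, then the question. Here's a sketch; the instruction wording, section labels, and the parking snippet are all made up for illustration.

```python
def assemble_prompt(instructions: str, context_chunks: list[str], question: str) -> str:
    """Combine system-style instructions, retrieved content, and the
    user's question into one package for the LLM."""
    context = "\n\n".join(context_chunks)
    return (
        f"{instructions}\n\n"
        f"Please use the following content to answer the patient's question at the end.\n\n"
        f"--- CONTENT ---\n{context}\n\n"
        f"--- QUESTION ---\n{question}"
    )

instructions = (
    "You are a contact center specialist working for a hospital, "
    "answering patient questions. Be friendly and responsive."
)
chunks = ["Parking is free in the visitor garage next to the main entrance."]
prompt = assemble_prompt(instructions, chunks, "Do you have parking?")
```

The model never "knows" your website exists; it only sees whatever text you pack into this one string, which is why the quality of the packed-in content matters so much.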
06:51 So this notion is the prompt before the prompt, and that's why prompt engineering and these kinds of things are a big field right now: you can really hone these systems by doing a better and better job with the prompt before the prompt.
07:10 Now, the last trick here: your website or your content is huge, and it talks about all kinds of topics beyond parking and beyond knee surgery, so you really want to somehow pull out only the parts of your content that are relevant to the patient's question. This is another tricky part of the whole RAG architecture, and the way it works is that you take all your content and break it into chunks, or these systems will break it into chunks for you. A chunk might be a paragraph of content, or a couple of paragraphs, or a page, something like that.
07:50 Then those chunks are sent to a large language model (it could be the same one or a different one), and each chunk is turned into a vector, which is just a series of numbers. You can think of that series of numbers as the numeric representation of the essence of the paragraph. What's different about these numbers is that they're not random: paragraphs that talk about a similar topic have nearby numbers; they almost have the same vectors. So in addition to being a numeric version of the paragraph, it's such that similar paragraphs on similar topics will have similar vectors, similar numbers.
08:46 So what happens is, when a user asks a question like "do you have parking?", that is also sent to the LLM in real time, right after the user asks, and it comes back with a vector as well; you can think of that as the question vector. Then we do a quick mathematical comparison between the vector of the question and the vectors of your content, and pick, say, the top five documents that are closest to the question. So "do you have parking?" becomes a vector, and across all your content the system tries to find the five documents that talk the most about parking, basically. It'll find those documents, grab the paragraphs associated with them, and use that content here: those paragraphs are the subset of your content that is used as part of the prompt before the prompt.
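The retrieval step above, vectorize the chunks, vectorize the question, compare, take the top k, can be sketched end to end. One big caveat: a real system gets its vectors from a learned embedding model, which captures meaning, not just word overlap. To keep this sketch self-contained and runnable offline, `embed` below is a toy bag-of-words counter standing in for that model; the mechanics of the comparison (cosine similarity, top-k ranking) are the real thing.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count
    vector. Real systems return a dense vector of floats that
    captures the meaning, not just the words."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, chunks: list[str], k: int = 5) -> list[str]:
    """Rank content chunks against the question vector and return the
    top k -- the 'retrieval' step of RAG."""
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
    return ranked[:k]

# Hypothetical chunks from a health-system website:
chunks = [
    "Visitor parking is available in the garage next to the main entrance.",
    "Before knee surgery, stop eating eight hours ahead of your arrival.",
    "Our billing office is open weekdays from nine to five.",
]
top = retrieve("do you have parking", chunks, k=1)
```

Only the parking chunk ends up in `top`, which is exactly the point: the other chunks never reach the prompt, so the LLM isn't distracted by them.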
09:51 This whole concept, then, is vectorizing your content, which is then typically stored in something called a vector database, basically a representation of your content in this numeric form. Then this system that you build, this RAG system, will take the question, retrieve the most relevant content, make that part of the prompt before the prompt, send that to the LLM, and you'll get a good, grounded response back.
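Putting the pieces together, the whole loop is only three steps. In this sketch, `search_vector_db` and `call_llm` are hypothetical stand-ins (stubbed so the example runs offline) for your vector-database query and your LLM API; the prompt wording is made up as well.

```python
def search_vector_db(question: str, k: int = 5) -> list[str]:
    """Stand-in for a vector-database query: a real system embeds the
    question and returns the k nearest content chunks."""
    return ["Visitor parking is available in the garage next to the main entrance."]

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; echoes so the sketch runs offline."""
    return f"[answer based on prompt of {len(prompt)} characters]"

def answer(question: str) -> str:
    # 1. Retrieve: find the content chunks most relevant to the question.
    chunks = search_vector_db(question)
    # 2. Augment: build the prompt before the prompt.
    prompt = (
        "You are a hospital contact center specialist.\n\n"
        "Use this content to answer the question at the end:\n\n"
        + "\n\n".join(chunks)
        + f"\n\nQuestion: {question}"
    )
    # 3. Generate: send the whole package to the LLM.
    return call_llm(prompt)

result = answer("Do you have parking?")
```

Those three numbered steps are the retrieval, the augmentation, and the generation that give the pattern its name.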
10:23 So it's a little bit confusing, but it's actually not that confusing; I just made it more confusing with this horrible drawing. But this whole thing is what's called RAG: retrieval, because you're retrieving the relevant documents from your content; augmented, because you're augmenting the generation process, the LLM's ability to do generative AI, based on the documents you retrieve. That's why it's retrieval-augmented generation.
10:55 Okay, so I hope that made sense. Like I said, this is a very popular solution pattern that I'm seeing over and over again. In fact, the majority of LLM projects I see are this kind of thing: "take my content, package it up with an LLM system, and create a kind of ChatGPT-like experience for my employees, or my customers, or my users." And it works extremely well; that's why it's so popular. So I hope that was interesting and educational and made sense. If you have any questions, please leave them in the comments. Thank you.