64. Hands-on ~ Basic RAG Pipeline
The passage explains how to build a simple RAG pipeline with LangChain.
Main points

Imports and setup
- Uses common LangChain components like `OpenAIEmbeddings`, `PromptTemplate`, `RunnablePassthrough`, `RunnableParallel`, and related utilities.
- An embeddings model is already initialized for retrieval.

1) Create the knowledge base
- Defines a `create_kb` function.
- Splits a manually created `Document` using `RecursiveCharacterTextSplitter` with:
  - `chunk_size = 5500`
  - `chunk_overlap = 50`
- The document includes metadata such as `source="blank_chain.md"`.
- Uses `split_documents()` because the input is a `Document` object.
- Creates a vector store from the chunks using `from_documents(…)`, the embeddings model, and a persistence directory like `./temp`.
- Returns the vector store.
2) Create a basic RAG system
- Calls `create_kb()` to get the vector store.
- Builds a retriever with:
  - `search_type="similarity"`
  - `k=2`
- Initializes a chat model with `temperature=0.2`.
- Creates a prompt that instructs the model to answer only from the given context and say “I don’t know” if unsure.

3) Format retrieved documents
- Defines a helper like `format_docs(docs)` to join retrieved chunks into a single string for the prompt.
4) Build the RAG chain
- Creates a chain with two inputs:
  - `context`: retrieved docs passed through `format_docs`
  - `question`: passed through unchanged with `RunnablePassthrough()`
- Pipes the inputs through:
  - the prompt template
  - the LLM
  - `StrOutputParser`
- Produces a final string answer.

5) Test the chain
- Invokes the chain with sample questions like:
  - “What is LangChain?”
  - “Who created LangChain?”
  - “What is LangGraph used for?”
- Prints the responses.
Why it works

It follows the standard RAG flow:
- Retrieve relevant documents
- Format them as context
- Add them to the prompt
- Generate an answer with the LLM
- Parse the output cleanly
The key benefit is that LangChain’s runnable and pipe syntax makes the entire pipeline modular, readable, and easy to compose.
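The flow above can be sketched framework-free; `fake_retrieve`, `fake_llm`, and the tiny `KB` dict below are invented stand-ins for the real retriever, chat model, and vector store:

```python
# Framework-free sketch of retrieve -> format -> prompt -> generate -> parse.
KB = {
    "What is LangChain?": ["LangChain is a framework for building LLM apps."],
}

def fake_retrieve(question):          # stands in for vectorstore.as_retriever()
    return KB.get(question, [])

def format_docs(docs):                # join chunks into one context string
    return "\n\n".join(docs)

def build_prompt(context, question):  # stands in for the prompt template
    return f"Answer only from this context:\n{context}\n\nQuestion: {question}"

def fake_llm(prompt):                 # stands in for the chat model
    return "LangChain is a framework for building LLM apps."

def rag_answer(question):
    docs = fake_retrieve(question)
    if not docs:
        return "I don't know"
    return fake_llm(build_prompt(format_docs(docs), question)).strip()

print(rag_answer("What is LangChain?"))
```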
65. Hands-on ~ RAG with Resources
The passage explains how to extend basic RAG into RAG with sources.
Key ideas
- The core setup stays the same: knowledge base/vector store, retriever, and LLM are reused.
- The main change is that the system now returns sources or citations along with the answer.

Main steps
- Create a prompt that tells the model to:
  - answer using the provided context
  - include the sources used
- Format retrieved documents with source info using a helper like `format_docs_with_sources`.
- Build the chain in the same RAG flow:
  - retriever → prompt → LLM → output parser
- Ask a question, and the response includes:
  - the answer
  - the documents or sources it came from
Why it matters

This is useful for:
- Q&A bots
- enterprise search
- trustworthy AI systems

Because users can:
- verify answers
- trace them back to original documents
- trust the results more easily
Bottom line
RAG with sources works like basic RAG, but adds source formatting, citations, and a prompt that asks for references, making the output more transparent and practical.
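A sketch of what `format_docs_with_sources` might look like (the exact layout is an assumption; the idea is to prefix each chunk with its `source` metadata so the model can cite it):

```python
# Stand-in for LangChain's Document: page_content plus a metadata dict.
class Doc:
    def __init__(self, page_content, metadata):
        self.page_content = page_content
        self.metadata = metadata

def format_docs_with_sources(docs):
    """Prefix every chunk with its source so answers can carry citations."""
    return "\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in docs
    )

docs = [Doc("LangChain is a framework.", {"source": "blank_chain.md"})]
print(format_docs_with_sources(docs))
```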
66. Hands-on ~ RAG with Fallback
The passage explains how to add a fallback mechanism to a RAG pipeline so it can handle out-of-scope questions safely.
Key points
-
The pipeline uses:
-
a vector store
-
a retriever
-
a prompt
-
-
The prompt instructs the model to:
-
answer only using the provided context
-
reply with: “I don’t have information about that in my knowledge base.” if the answer is not present
-
-
The chain works by:
-
retrieving relevant documents
-
formatting and inserting them into the prompt
-
sending the prompt to the LLM
-
parsing the output as text
-
Testing
- Questions covered by the knowledge base produce normal answers.
- Questions outside the knowledge base trigger the fallback response.

Why it matters
- It reduces hallucinations by preventing the model from guessing.
- It makes the system more honest, reliable, and useful in real-world situations where users may ask unsupported questions.
Overall result
This approach makes the RAG pipeline more robust by keeping answers grounded in the available context and gracefully handling unknown queries.
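The fallback behavior can be illustrated with a toy sketch. Note that the passage implements the fallback through the prompt instruction, with the LLM emitting the sentence itself; here a code-level check on empty retrieval stands in for that, and the tiny `KB` and keyword match are invented:

```python
FALLBACK = "I don't have information about that in my knowledge base."

KB = {
    "langchain": "LangChain is a framework for building LLM applications.",
}

def retrieve(question):
    # Crude keyword lookup standing in for vector-store similarity search.
    return [text for key, text in KB.items() if key in question.lower()]

def answer(question):
    docs = retrieve(question)
    if not docs:
        return FALLBACK       # out-of-scope question: refuse instead of guess
    return docs[0]            # stand-in for prompt -> LLM -> parser

print(answer("What is LangChain?"))
print(answer("What is the weather today?"))
```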
67. Hands-on ~ RAG with Structured Outputs
demo_structured_rag() demonstrates a small structured RAG pipeline.
- It creates a knowledge base and turns it into a retriever that fetches the top 3 relevant documents.
- It defines a `RAGResponse` Pydantic schema with fields for:
  - `answer`
  - `confidence`
  - `sources_used`
  - `follow_up`
- It wraps the LLM with `with_structured_output(RAGResponse)` so the model returns a validated structured object instead of free-form text.
- It builds a prompt that includes the retrieved `context` and the user `question`.
- A helper formats retrieved docs by combining each document’s source metadata and content into one context string.
- The pipeline uses runnable composition:
  - question → retriever → formatted context
  - question → passthrough
  - both go into the prompt
  - prompt goes to the structured LLM
- It invokes the chain with `"What is LangGraph?"` and prints the structured fields from the result.
Key ideas:
- RAG grounds answers in retrieved documents.
- Structured output makes responses predictable and easier to use programmatically.
- The `|` operator composes retrieval, formatting, prompting, and generation into one chain.
It also notes that confidence is only described as
"high, medium, or low" but not strictly enforced; using an Enum
would add validation.
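A framework-free sketch of the schema shape, using stdlib dataclasses plus the Enum the passage suggests for confidence (the real code uses a Pydantic model with `with_structured_output`; field names follow the passage, the example values are invented):

```python
from dataclasses import dataclass, field
from enum import Enum

class Confidence(str, Enum):
    """Enum-backed confidence: invalid values raise instead of passing through."""
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass
class RAGResponse:
    answer: str
    confidence: Confidence
    sources_used: list = field(default_factory=list)
    follow_up: list = field(default_factory=list)

resp = RAGResponse(
    answer="LangGraph builds stateful agent workflows.",
    confidence=Confidence("high"),  # Confidence("certain") would raise ValueError
    sources_used=["langgraph.md"],
)
print(resp.confidence.value)
```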
70. Hands-on ~ Advanced RAG - Multi-Query Retriever
The section introduces several advanced RAG retrieval patterns:
- multi-query retrieval
- self-query retrieval
- contextual compression
- hybrid search
It explains that the examples use some langchain_community imports
because LangChain has been reorganizing its packages, and this
compatibility layer is still useful for learning, even though some parts
may be deprecated later. Some of these retrievers are also moving into
LangGraph.
New components introduced include:
- `MultiQueryRetriever`
- `ContextualCompressionRetriever`
- `LLMChainExtractor`
- `EnsembleRetriever`
- `BM25Retriever`
- `ParentDocumentRetriever`
Logging is enabled so the generated sub-queries from multi-query retrieval can be inspected during execution.
The demo builds a small technical knowledge base, creates a Chroma
vector store with embeddings like text-embedding-3-small, and uses
that as the foundation for retrieval experiments.
The main example covered is Multi-Query Retriever:
- It uses an LLM to rewrite a single user query into multiple alternative phrasings.
- These different versions help surface documents that might not match the original wording exactly.
- For example, “What tools can I use to build AI applications?” might be expanded into several related queries about AI app development tools, platforms, or software.
When run, the retriever generates these alternate queries, searches the vector store for each one, and returns a broader set of relevant documents. This improves recall but requires more computation and may increase cost.
In the example, the retrieved results included documents related to AI tools, AI platforms, and databases/infrastructure, showing how multi-query retrieval can expand coverage beyond a single query.
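The core mechanic can be sketched without LangChain: run each query variant against the store and merge results with duplicates removed. The hard-coded variants and tiny index below are illustrative stand-ins for the LLM rewrites and the vector store:

```python
INDEX = {
    "ai tools": ["doc_tools"],
    "ai platforms": ["doc_platforms"],
    "software for building ai apps": ["doc_tools", "doc_infra"],
}

def search(query):  # stand-in for one vector-store similarity search
    return INDEX.get(query, [])

def multi_query_retrieve(query_variants):
    seen, merged = set(), []
    for q in query_variants:
        for doc_id in search(q):
            if doc_id not in seen:   # de-duplicate across variants
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

variants = ["ai tools", "ai platforms", "software for building ai apps"]
print(multi_query_retrieve(variants))  # broader recall than any single query
```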
71. Hands-on ~ Advanced RAG - Contextual Compression
Contextual compression is a retrieval technique that uses an LLM to extract only the most relevant parts of retrieved documents before passing them to the final model.
How it works
- Set up the vector store, retriever, and LLM.
- Create an LLM chain extractor to act as the compressor.
- Wrap the base retriever with a `ContextualCompressionRetriever` using:
  - the compressor
  - the base retriever
- Run the query and compare:
  - Without compression: full document chunks are returned.
  - With compression: only relevant excerpts are returned.
What it shows
- In simple cases, compression may not seem dramatic because the documents are already focused.
- In more complex documents, the reduction is much clearer:
  - full chunks may be around 1500–1700 characters
  - compressed results may shrink to around 214 characters
- The output keeps only the information needed to answer the question, such as framework names like LangChain and LangGraph.

Benefits
- Lower token usage
- Better answer quality due to less noise
- Faster processing for large contexts

Trade-off
- It adds extra LLM calls during retrieval, which increases latency and cost.
Overall
Contextual compression is useful when documents are long, noisy, or expensive to send to the model. It improves precision and efficiency, but at the cost of extra retrieval-time computation.
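A crude stand-in for `LLMChainExtractor` can illustrate the effect: keep only the sentences of each chunk that share a word with the query. The real compressor makes an LLM call per document; this keyword filter and the sample chunk are only a toy:

```python
def compress(chunk, query):
    """Keep only sentences that overlap the query (toy LLM-extractor stand-in)."""
    query_words = set(query.lower().split())
    kept = [
        s.strip() for s in chunk.split(".")
        if query_words & set(s.lower().split())
    ]
    return ". ".join(kept)

chunk = (
    "LangChain is a framework for LLM apps. "
    "It was released in 2022. "
    "LangGraph extends it with stateful graphs"
)
# Irrelevant sentences are dropped; only the LangGraph sentence survives.
print(compress(chunk, "What does LangGraph do"))
```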
72. Hands-on ~ Advanced RAG - Hybrid Search
This walkthrough explains how to build a hybrid search system that combines BM25 keyword search and semantic search using a tech docs dataset.
Main steps
- BM25 retriever
  - Built from the documents with `from_documents`
  - Configured with `k=3` to return the top 3 keyword matches
- Semantic retriever
  - Uses the existing semantic retriever setup
  - Also set to `k=3`
- Ensemble retriever
  - Combines BM25 and semantic retrievers using rank fusion
  - Example weighting: 40% BM25, 60% semantic
  - Weights should be tuned based on the kinds of queries users ask
- Testing queries
  - Keyword-heavy queries like Postgres, SQL, and pgvector work well with BM25
  - More meaning-based queries benefit from semantic search
- BM25 installation issue
  - The `rank-bm25` package was missing
  - After installing it, the hybrid retriever worked correctly
Results and takeaways
- For “What is Postgres?”, the ensemble gives the best combined result.
- For “What database stores vectors?”, both retrievers identify relevant vector database content.
- For “asset transactions”, BM25 succeeds where semantic search drifts off-topic.
- For “How do I store AI model outputs for later retrieval?” and “fast similarity lookup embeddings”, both retrievers contribute useful signals.

Why it matters
- BM25 is strong for exact keyword matching
- Semantic search is strong for intent and meaning
- Ensemble retrieval combines both to improve accuracy and robustness
The key lesson is that hybrid search handles both exact terms and conceptual similarity, making it more reliable than using either approach alone.
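The rank-fusion step can be sketched in a few lines. `EnsembleRetriever` is based on reciprocal rank fusion; the version below applies the 40/60 weights from the example, and the document names are invented:

```python
def weighted_rrf(rankings, weights, k=60):
    """Weighted reciprocal rank fusion over several ranked result lists."""
    scores = {}
    for ranked_docs, w in zip(rankings, weights):
        for rank, doc in enumerate(ranked_docs):
            # Earlier ranks contribute more; weights scale each retriever.
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["postgres_intro", "sql_basics", "pgvector_guide"]
semantic_hits = ["postgres_intro", "vector_dbs", "embeddings_101"]
fused = weighted_rrf([bm25_hits, semantic_hits], weights=[0.4, 0.6])
print(fused[0])  # "postgres_intro": top-ranked in both lists, so it wins
```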
73. Hands-on ~ Advanced RAG - Parent Document Retriever
The document explains how to build a parent document retriever, which combines small chunks for retrieval accuracy with large chunks for better context.
Main idea
- Split documents into:
  - Parent chunks: larger pieces of about 800 characters
  - Child chunks: smaller pieces of about 200 characters with overlap
- Search is done over the small child chunks
- The system returns the full parent chunk to the LLM

Setup
- Use an in-memory vector store for embeddings
- Use an in-memory document store for parent documents
- Name the collection something like `parent-child-demo`
- Build the retriever with:
  - vector store
  - document store
  - child splitter
  - parent splitter

How it works
- Add documents
- Query something like “What is LangGraph used for?”
- Compare:
  - regular retrieval: returns a small, focused chunk
  - parent document retrieval: returns a larger chunk with more context
Why it helps
- Small chunks
  - better retrieval precision
  - more focused embeddings
- Large chunks
  - better context for generation
  - less fragmentation

Key benefit

This approach gives the best of both worlds:
- accurate search from small chunks
- complete context from large chunks

Summary

A parent document retriever is a two-stage retrieval system:
- first, find the most relevant small child chunk
- then, return its corresponding larger parent document
It is especially useful for larger documents where both precision and context matter.
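The two-stage lookup can be sketched framework-free; keyword overlap stands in for embedding similarity, and the chunk texts are invented:

```python
# Parent chunks: large, full-context pieces keyed by id.
parents = {
    "p1": "Full section about LangGraph: it builds stateful, multi-step "
          "agent workflows on top of LangChain, with persistence and routing.",
}
# Child chunks: small, searchable pieces that point back to their parent.
children = [
    {"parent_id": "p1", "text": "LangGraph builds stateful agent workflows"},
    {"parent_id": "p1", "text": "it adds persistence and routing"},
]

def retrieve_parent(query):
    q = set(query.lower().split())
    # Stage 1: score the small child chunks (keyword overlap stands in
    # for vector similarity).
    best = max(children, key=lambda c: len(q & set(c["text"].lower().split())))
    # Stage 2: return the larger parent chunk for full context.
    return parents[best["parent_id"]]

print(retrieve_parent("What is LangGraph used for?"))
```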
74. Hands-on ~ Advanced RAG - Combining Multi-Query and Compression Strategies
The passage describes how to combine advanced RAG techniques into one retrieval chain:
- Start with a vector store and an LLM.
- Add multi-query retrieval to improve recall by generating query variations.
- Add contextual compression to improve precision by filtering retrieved results for relevance.
- Define a RAG prompt, format retrieved documents, and build the final chain.
- Test the chain with example questions.

Key takeaways:
- Multi-query retrieval helps find more relevant documents.
- Contextual compression helps keep only the most useful context.
- You do not need to use every RAG strategy at once; choose what fits your use case.

The example setup uses:
- ChromaDB for vector storage
- OpenAI for embeddings and completions
- Multi-query retrieval
- Contextual compression
- An LLM to generate the final answer
Overall, it presents a clean blueprint for a more advanced, effective RAG pipeline.
76. Hands-on ~ Conversation Memory - Basics
This document explains how to build conversation_memory.py to
demonstrate conversational memory in LangChain.
Main idea
The chat model remembers earlier parts of a conversation by storing messages in session-based history and reusing them in later turns.
Key components
- Chat model setup using `init_chat_model` or `ChatOpenAI`
- Prompt template with:
  - a system message
  - a human input
  - a `MessagesPlaceholder` for chat history
- Message history storage with:
  - `InMemoryChatMessageHistory`
  - a dictionary keyed by `session_id`
- `RunnableWithMessageHistory` to automatically load and save messages for each session
- `StrOutputParser` to format model output

basic_memory() workflow
- Initialize the chat model.
- Build a prompt that includes history.
- Chain the prompt, model, and parser.
- Create an in-memory store for session histories.
- Define `get_session_history()` to retrieve or create history for a session.
- Wrap the chain with `RunnableWithMessageHistory`.
- Use a fixed `session_id` to simulate one conversation.
- Send several user messages through the chain.
- Print the stored history to inspect saved human and AI messages.
Result

The model can answer follow-up questions using earlier context, such as:
- remembering the user’s name
- remembering what the user is learning
Conclusion
This is a simple example of session-based conversational memory in LangChain using modern runnable utilities and message history placeholders.
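The session store and `get_session_history()` pattern boils down to a dict keyed by `session_id`. A framework-free sketch, with plain tuples standing in for `InMemoryChatMessageHistory` and its message objects:

```python
store = {}  # session_id -> list of (role, text) messages

def get_session_history(session_id):
    """Return the session's history, creating an empty one on first use."""
    if session_id not in store:
        store[session_id] = []
    return store[session_id]

history = get_session_history("demo-session")
history.append(("human", "Hi, my name is Paulo."))
history.append(("ai", "Nice to meet you, Paulo!"))

# The same list object comes back on the next turn of the same session.
print(len(get_session_history("demo-session")))
```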
77. Hands-on ~ Multiple Sessions Memory
This describes how to support multiple independent chat sessions with one shared LLM by giving each user their own memory.
Main idea
- Use one shared LLM
- Build a prompt that accepts:
  - `history` for prior messages
  - the current `input`
- Create a chain from the prompt and LLM
- Store conversation histories in a dictionary
- Add a helper function that gets or creates a session’s history
- Wrap the chain so:
  - `message` maps to the current user input
  - `history` maps to that user’s stored conversation history
How it works
If a session ID is new, a history object is created and saved.
That means each user gets separate memory instead of sharing one global
chat history.
Example
- User A says: “My favorite language is Python.”
- User B says: “I love JavaScript.”

When they later ask:
- “What is my favorite language?”

the system uses the correct session ID to load the right history:
- User A → Python
- User B → JavaScript
Why it matters

This setup lets the model:
- remember past messages
- keep conversations separate by user
- answer based on each user’s own history
Summary: each session has its own memory, so multiple users can talk to the same model without mixing their conversations.
78. Hands-on ~ Message Trimming
The passage explains message trimming, which is the process of shortening a conversation history so it fits within a model’s context window and token limits.
Key points:
- It uses a simulated long chat made of `SystemMessage`, `HumanMessage`, and `AIMessage`.
- Trimming is done with a token limit and a strategy such as `"last"` or `"first"`.
- In the example, the `last` strategy is used, so the most recent messages are kept.
- `include_system=True` means system messages are preserved.
- `allow_partial=False` means messages are only kept if they fit completely.
Why it matters:
- Reduces token usage
- Keeps conversations within context limits
- Avoids sending unnecessary history
- Helps manage long-term memory efficiently

Example outcome:
- The original chat had 8 messages
- After trimming with a small token limit, it may shrink to only 2 messages
Overall, message trimming is a practical way to control how much conversation history is retained in AI applications.
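A simulation of the `last` strategy with `include_system=True` and `allow_partial=False`; word counts stand in for real token counting, which the actual helper delegates to a model-aware counter, and the sample chat is invented:

```python
def trim_last(messages, max_tokens):
    """Keep system messages plus the newest whole messages that fit the budget."""
    def tokens(msg):
        return len(msg[1].split())   # crude word count standing in for tokens

    system = [m for m in messages if m[0] == "system"]
    budget = max_tokens - sum(tokens(m) for m in system)
    kept = []
    for msg in reversed([m for m in messages if m[0] != "system"]):
        if tokens(msg) <= budget:    # whole message must fit (no partials)
            kept.insert(0, msg)
            budget -= tokens(msg)
        else:
            break                    # stop at the first message that won't fit
    return system + kept

chat = [
    ("system", "You are helpful."),
    ("human", "Tell me a very long story about dragons and castles please"),
    ("ai", "Once upon a time there was a dragon"),
    ("human", "Thanks!"),
    ("ai", "You're welcome!"),
]
print(len(trim_last(chat, max_tokens=10)))  # older long messages are dropped
```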
79. Hands-on ~ Windowed Memory
The passage explains sliding window memory for LLM conversations:
- LLM costs grow because each new request may include the full chat history.
- To control this, sliding window memory keeps only the last K exchanges and discards older messages.
- In the demo, a custom `WindowChatHistory` class extends LangChain’s `InMemoryChatMessageHistory`.
- It overrides `add_messages` to check whether the number of messages exceeds `K * 2`:
  - 1 exchange = 1 human + 1 AI message
  - So `K` exchanges = `K * 2` messages
- If the limit is exceeded, it slices the list to keep only the newest messages:
  - `self.messages = self.messages[-(K * 2):]`
The demo conversation shows the memory shrinking as new messages arrive.
With K = 2, only the last two exchanges remain, so the model remembers
recent facts like:
- “I work as an engineer.”
- “I have two cats.”

It forgets earlier ones like:
- “My name is Paulo.”
- “I live in Seattle.”
Main takeaway
Sliding window memory provides fixed-size, predictable conversation memory, lowering cost and avoiding context-window overflow, but it loses older context.
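The windowing logic can be shown in plain Python; this class mirrors the `K * 2` check and slice described above, with tuples standing in for message objects:

```python
class WindowChatHistory:
    """Keeps only the last K exchanges (K * 2 messages: one human + one AI each)."""

    def __init__(self, k):
        self.k = k
        self.messages = []

    def add_messages(self, new_messages):
        self.messages.extend(new_messages)
        if len(self.messages) > self.k * 2:
            # Same slice as in the demo: keep only the newest K exchanges.
            self.messages = self.messages[-(self.k * 2):]

history = WindowChatHistory(k=2)
history.add_messages([("human", "My name is Paulo."), ("ai", "Hi Paulo!")])
history.add_messages([("human", "I live in Seattle."), ("ai", "Noted.")])
history.add_messages([("human", "I work as an engineer."), ("ai", "Great!")])
print(len(history.messages))  # 4: the oldest exchange has been dropped
```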
80. Hands-on ~ Summary Memory
Summary memory keeps a conversation manageable by compressing older messages into a running summary instead of deleting them. It uses:
- a summary LLM to maintain a stable, deterministic summary,
- a chat LLM for the live conversation,
- a prompt built from running summary + recent message buffer + current user input.

How it works
- The model responds using the summary, recent messages, and new input.
- The new exchange is added to a recent-message buffer.
- When the buffer gets too large, the oldest messages are summarized.
- That summary is merged into the running summary.

Why it’s useful
- Recent context stays exact
- Older context is preserved in compressed form
- Token usage remains bounded, preventing context overflow

Key idea

It’s a hybrid memory strategy:
- old info → summarized
- new info → kept verbatim
This is especially useful for chatbots, RAG systems, and other long-running AI interactions.
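A sketch of the fold-into-summary step; `summarize` is a stub for the dedicated summary LLM, and the buffer limit of 4 messages is an arbitrary choice:

```python
def summarize(running_summary, old_messages):
    """Stub for the summary LLM: just concatenates the old message texts."""
    facts = "; ".join(text for _, text in old_messages)
    return (running_summary + " " + facts).strip()

class SummaryMemory:
    def __init__(self, buffer_limit=4):
        self.buffer_limit = buffer_limit
        self.summary = ""     # compressed older context
        self.buffer = []      # recent messages, kept verbatim

    def add_exchange(self, human, ai):
        self.buffer += [("human", human), ("ai", ai)]
        if len(self.buffer) > self.buffer_limit:
            # Fold the oldest exchange into the running summary.
            old, self.buffer = self.buffer[:2], self.buffer[2:]
            self.summary = summarize(self.summary, old)

mem = SummaryMemory(buffer_limit=4)
mem.add_exchange("My name is Paulo.", "Hi Paulo!")
mem.add_exchange("I live in Seattle.", "Nice city!")
mem.add_exchange("I have two cats.", "Cats are great.")
print(mem.summary)      # the oldest exchange, compressed
print(len(mem.buffer))  # the recent messages stay exact
```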
81. Exercise and Solution ~ Persistent Memory
The passage explains how to build a chatbot with persistent memory using LangChain and SQLite.
Main points
- Use `RunnableWithMessageHistory` and `SQLChatMessageHistory` to store conversation history in a local SQLite database.
- Each chat session is identified by a session ID, so messages are saved and retrieved per user/session.
- The chatbot can remember preferences across restarts, such as:
  - “I prefer dark mode themes.”
  - “What theme do I prefer?”
- To make this work, you:
  - Import the needed LangChain chat history tools.
  - Set a SQLite `.db` file path.
  - Create a function that returns a `SQLChatMessageHistory` for a given session.
  - Build a prompt with a system message, `history`, and user input.
  - Wrap the chain with `RunnableWithMessageHistory`.
  - Pass a config containing the `session_id`.
  - Test that the bot remembers past messages.

Persistence verification
- To confirm memory is truly saved, you can:
  - run a conversation,
  - restart the chain,
  - reuse the same SQLite database,
  - and ask about earlier information.
- You can also inspect the SQLite database directly to see stored human and AI messages.

Summarization idea
- After about 10 messages, the conversation can be summarized automatically.
- The summary can be stored as memory, while older raw messages may be pruned if desired.

Overall goal

The result is a chatbot that:
- remembers preferences,
- persists across restarts,
- stores memory locally,
- and can later be extended with automatic summarization.
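The persistence idea can be demonstrated with the stdlib `sqlite3` module alone; this minimal class is an invented stand-in for `SQLChatMessageHistory`, storing messages per `session_id` in a `.db` file that survives restarts:

```python
import os
import sqlite3
import tempfile

class SQLiteHistory:
    """Toy persistent chat history: one row per message, keyed by session_id."""

    def __init__(self, db_path, session_id):
        self.conn = sqlite3.connect(db_path)
        self.session_id = session_id
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS messages "
            "(session_id TEXT, role TEXT, content TEXT)"
        )

    def add_message(self, role, content):
        self.conn.execute(
            "INSERT INTO messages VALUES (?, ?, ?)",
            (self.session_id, role, content),
        )
        self.conn.commit()

    def messages(self):
        rows = self.conn.execute(
            "SELECT role, content FROM messages WHERE session_id = ?",
            (self.session_id,),
        )
        return rows.fetchall()

db = os.path.join(tempfile.mkdtemp(), "memory.db")

first = SQLiteHistory(db, "user-1")
first.add_message("human", "I prefer dark mode themes.")

# "Restart": a fresh connection to the same file still sees the message.
second = SQLiteHistory(db, "user-1")
print(second.messages())
```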
83. Project ~ AI Research Assistant - Indexing Documents (Part 1)
The document outlines the setup of an AIResearchAssistant for a RAG pipeline using Chroma, OpenAIEmbeddings, and RecursiveCharacterTextSplitter.
Main points
- Introduces structured output models:
  - `ResearchResponse`: includes `answer`, `confidence`, `sources`, and `key_quotes`
  - `follow_up_questions`: for generating follow-up prompts
- Builds an `AIResearchAssistant` class that bundles the three core RAG components:
  - Embedding model (`OpenAIEmbeddings`, `text-embedding-3-small`)
  - Text splitter (`RecursiveCharacterTextSplitter`)
  - Vector store (`Chroma` with persistent storage)
- The constructor sets defaults like:
  - `persistent_directory="research_db"`
  - `chunk_size=1000`
  - `chunk_overlap=200`
- Adds document ingestion methods:
  - `add_documents` to split, tag, timestamp, and store documents
  - `add_text` and `add_texts` as convenience wrappers for raw text
- Includes inspection utilities:
  - `get_document_count`
  - `list_sources`
- Confirms persistence and indexing through tests and cleanup steps
Overall takeaway
The assistant is now able to ingest, chunk, index, and persist documents, but it does not yet support retrieval or question answering. The next step is to add those capabilities so it can respond to user queries.
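The splitter settings can be illustrated with a fixed-size character splitter (scaled down to `chunk_size=10`, `chunk_overlap=3` so the output is readable; `RecursiveCharacterTextSplitter` additionally prefers natural separators like paragraph breaks, which this version skips):

```python
def split_text(text, chunk_size, chunk_overlap):
    """Cut text into overlapping fixed-size chunks."""
    step = chunk_size - chunk_overlap   # advance by size minus overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("abcdefghijklmnop", chunk_size=10, chunk_overlap=3)
print(chunks)  # adjacent chunks share their last/first 3 characters
```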
84. Project ~ AI Research Assistant - LLM Prompt and Output Parser (Part 2)
The passage explains how to turn a basic document retriever into a simple RAG-style Q&A chain using three main parts: an LLM, a prompt, and an output parser.
Main steps covered
- Add a `ChatOpenAI` model to the assistant.
- Build a retriever that uses similarity search and returns the top 4 relevant chunks.
- Test the retriever to confirm it returns relevant document fragments.
- Add a function to format retrieved documents into plain-text context for the LLM.
- Create an `ask` method that:
  - retrieves documents,
  - formats them,
  - builds a prompt with system and human instructions,
  - runs a chain like `prompt | llm | StrOutputParser()`,
  - returns the generated answer.

Testing and behavior

The assistant is tested with three questions:
- A factual question about RAG
- A question needing information from multiple sources
- A follow-up question
Key result
The system works, but it has a major weakness: no memory.
Because of that, follow-up questions can be misinterpreted or
hallucinated instead of being answered correctly. This shows why
grounding helps, but also why conversational memory will be needed next.
85. Project ~ AI Research Assistant - Adding Memory (Part 3)
The passage explains how to add session-based memory to an AI Research Assistant so each user session keeps its own conversation history.
Key points:
-
Add
self.session_storeas an in-memory dictionary to hold per-session chat history. -
Create
_get_session_history(self, session_id)to return or initialize a session’s message list. -
Update the prompt in
askby inserting aMessagesPlaceholdernamedhistorybetween the system and human messages. -
Inspecting session history shows it is just a list of stored messages, which only becomes useful when injected into the prompt.
-
In
ask, retrieve the session history first and passhistory.messagesinto the chain, optionally limiting the number of recent messages. -
Before returning a response, save both sides of the exchange:
-
HumanMessagefor the user question -
AIMessagefor the assistant reply
-
-
Add utility methods:
-
clear_session(…)to erase a session’s history -
get_session_history_display(…)to view history in a readable format
-
Testing shows:
-
Follow-up questions now work because prior context is available.
-
Each Q&A pair adds two messages to memory.
-
Different session IDs have isolated histories, so one user’s conversation does not affect another’s.
Overall, the update gives the chatbot real conversational memory while keeping session histories separate.
86. Project ~ AI Research Assistant - Multi-Query Implementation (Part 4)
The passage explains how to improve a RAG retriever by adding multi-query retrieval.
Main idea
- The current retriever only matches the query using similar words.
- This works, but it can miss relevant chunks if the answer uses different terminology.
- Multi-query retrieval fixes this by using an LLM to generate several semantically similar queries from the original question.

Basic vs. advanced retriever
- Basic retriever: simple similarity search, returns about four documents.
- Advanced retriever: multi-query retrieval, which expands the original question into multiple related searches.
Why it helps
Different people may ask the same thing in different ways. Multi-query retrieval improves recall by searching from multiple angles, making it more likely to find useful chunks even when wording differs.
Implementation described
- Update `_build_retriever` in `AIResearchAssistant`.
- Add a `use_advanced` flag.
- If `False`, use the basic retriever.
- If `True`, use the multi-query retriever.
- Update `ask` to pass this flag through and handle retrieval the same way afterward.

Testing and results
- With debugging enabled, the advanced retriever shows generated alternate queries.
- It returns more relevant and unique chunks than the basic retriever.
- This gives the model richer context and usually improves answer quality.
Conclusion
Multi-query retrieval makes the RAG system smarter and more flexible by retrieving information from several semantically related searches instead of relying on one keyword-based query.
The passage ends by noting the next step: moving from raw string answers to structured output, such as returning confidence, sources, and answer text in a predictable format.
87. Project ~ AI Research Assistant - Structured Output - Final Part
The passage explains how to improve an ask function in a RAG system by
changing its output from plain text to a structured object.
Key points
-
The current
askfunction returns a plain string, which is hard to use programmatically. -
A
ResearchResponsedata model already exists to solve this by structuring outputs with fields like:-
answer -
confidence -
sources -
key_quotes -
follow_up_questions
-
-
A new function,
ask_structured, is introduced right beforeask.-
It takes the same inputs as
ask:-
question -
session_id -
use_default -
use_advanced
-
-
But it returns a
ResearchResponseinstead of a string.
-
-
Inside
ask_structured, the LLM is wrapped withwith_structured_output(ResearchResponse)so the model returns data in the schema format. -
This makes it easy to access individual parts of the response directly in code, such as
response.answerorresponse.sources.
Why this matters
-
Plain text is fine for display, but structured output is better for downstream processing.
-
It makes it easier to extract answers, confidence scores, sources, and follow-up questions.
Broader context
- The project now includes a full RAG pipeline with:
  - a research assistant
  - advanced retrieval
  - conversation memory
  - structured responses
- Document ingestion was intentionally left out for simplicity, but should be implemented separately in a real project:
  - upload file
  - extract content
  - store it in the database
Conclusion
The lesson shows how to use LangChain to build a more powerful RAG system, and it sets up the next topic: LangGraph, stateful agents, and how LangGraph and LangChain work together.