64. Hands-on ~ Basic RAG Pipeline

The passage explains how to build a simple RAG pipeline with LangChain.

Main points

  • Imports and setup

    • Uses common LangChain components like OpenAIEmbeddings, PromptTemplate, RunnablePassthrough, RunnableParallel, and related utilities.

    • An embeddings model is already initialized for retrieval.

1) Create the knowledge base

  • Defines a create_kb function.

  • Splits a manually created Document using RecursiveCharacterTextSplitter with:

    • chunk_size = 5500

    • chunk_overlap = 50

  • The document includes metadata such as source="blank_chain.md".

  • Uses split_documents() because the input is a Document object.

  • Creates a vector store from the chunks using a from_documents constructor (e.g., Chroma.from_documents(…)), the embeddings model, and a persistence directory like ./temp.

  • Returns the vector store.

2) Create a basic RAG system

  • Calls create_kb() to get the vector store.

  • Builds a retriever with:

    • search_type="similarity"

    • k=2

  • Initializes a chat model with temperature=0.2.

  • Creates a prompt that instructs the model to answer only from the given context and say “I don’t know” if unsure.

3) Format retrieved documents

  • Defines a helper like format_docs(docs) to join retrieved chunks into a single string for the prompt.

4) Build the RAG chain

  • Creates a chain with two inputs:

    • context: retrieved docs passed through format_docs

    • question: passed through unchanged with RunnablePassthrough()

  • Pipes the inputs through:

    1. prompt template

    2. LLM

    3. StrOutputParser

  • Produces a final string answer.

5) Test the chain

  • Invokes the chain with sample questions like:

    • “What is LangChain?”

    • “Who created LangChain?”

    • “What is LangGraph used for?”

  • Prints the responses.
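
A minimal end-to-end sketch of the steps above, assuming Chroma as the vector store and OpenAI models (both used elsewhere in these notes); the document text, model name, and prompt wording are illustrative placeholders:

```python
from langchain_chroma import Chroma  # or: from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

def create_kb():
    doc = Document(
        page_content="LangChain is a framework for building LLM apps...",  # placeholder text
        metadata={"source": "blank_chain.md"},
    )
    # split_documents() because the input is a Document object; sizes as given in the passage
    splitter = RecursiveCharacterTextSplitter(chunk_size=5500, chunk_overlap=50)
    chunks = splitter.split_documents([doc])
    return Chroma.from_documents(chunks, embeddings, persist_directory="./temp")

def format_docs(docs):
    # Join retrieved chunks into a single context string for the prompt
    return "\n\n".join(d.page_content for d in docs)

vectorstore = create_kb()
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2})
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)  # model name is an assumption

prompt = ChatPromptTemplate.from_template(
    "Answer only from the context below. If unsure, say \"I don't know\".\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

rag_chain = (
    RunnableParallel(context=retriever | format_docs, question=RunnablePassthrough())
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What is LangChain?"))
```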

Why it works

It follows the standard RAG flow:

  1. Retrieve relevant documents

  2. Format them as context

  3. Add them to the prompt

  4. Generate an answer with the LLM

  5. Parse the output cleanly

The key benefit is that LangChain’s runnable and pipe syntax makes the entire pipeline modular, readable, and easy to compose.

65. Hands-on ~ RAG with Resources

The passage explains how to extend basic RAG into RAG with sources.

Key ideas

  • The core setup stays the same: knowledge base/vector store, retriever, and LLM are reused.

  • The main change is that the system now returns sources or citations along with the answer.

Main steps

  1. Create a prompt that tells the model to:

    • answer using the provided context

    • include the sources used

  2. Format retrieved documents with source info using a helper like format_docs_with_sources.

  3. Build the chain in the same RAG flow:

    • retriever → prompt → LLM → output parser

  4. Ask a question, and the response includes:

    • the answer

    • the documents or sources it came from
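
A hedged sketch of the delta from basic RAG, reusing the retriever, llm, and imports from the previous section; format_docs_with_sources is named in the passage, the prompt wording is illustrative:

```python
def format_docs_with_sources(docs):
    # Prefix each chunk with its source metadata so the model can cite it
    return "\n\n".join(
        f"[source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in docs
    )

sources_prompt = ChatPromptTemplate.from_template(
    "Answer using only the context below, and list the sources you used.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

rag_with_sources = (
    RunnableParallel(
        context=retriever | format_docs_with_sources,
        question=RunnablePassthrough(),
    )
    | sources_prompt
    | llm
    | StrOutputParser()
)
```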

Why it matters

This is useful for:

  • Q&A bots

  • enterprise search

  • trustworthy AI systems

Because users can:

  • verify answers

  • trace them back to original documents

  • trust the results more easily

Bottom line

RAG with sources works like basic RAG, but adds source formatting, citations, and a prompt that asks for references, making the output more transparent and practical.

66. Hands-on ~ RAG with Fallback

The passage explains how to add a fallback mechanism to a RAG pipeline so it can handle out-of-scope questions safely.

Key points

  • The pipeline uses:

    • a vector store

    • a retriever

    • a prompt

  • The prompt instructs the model to:

    • answer only using the provided context

    • reply with: “I don’t have information about that in my knowledge base.” if the answer is not present

  • The chain works by:

    1. retrieving relevant documents

    2. formatting and inserting them into the prompt

    3. sending the prompt to the LLM

    4. parsing the output as text
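
A sketch of the fallback prompt; only the refusal sentence comes from the passage, the rest of the wording is illustrative (chain construction is the same as in section 64):

```python
fallback_prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n"
    "If the answer is not in the context, reply exactly:\n"
    "\"I don't have information about that in my knowledge base.\"\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
```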

Testing

  • Questions covered by the knowledge base produce normal answers.

  • Questions outside the knowledge base trigger the fallback response.

Why it matters

  • It reduces hallucinations by preventing the model from guessing.

  • It makes the system more honest, reliable, and useful in real-world situations where users may ask unsupported questions.

Overall result

This approach makes the RAG pipeline more robust by keeping answers grounded in the available context and gracefully handling unknown queries.

67. Hands-on ~ RAG with Structured Outputs

demo_structured_rag() demonstrates a small structured RAG pipeline.

  • It creates a knowledge base and turns it into a retriever that fetches the top 3 relevant documents.

  • It defines a RAGResponse Pydantic schema with fields for:

    • answer

    • confidence

    • sources_used

    • follow_up

  • It wraps the LLM with with_structured_output(RAGResponse) so the model returns a validated structured object instead of free-form text.

  • It builds a prompt that includes retrieved context and the user question.

  • A helper formats retrieved docs by combining each document’s source metadata and content into one context string.

  • The pipeline uses runnable composition:

    • question → retriever → formatted context

    • question → passthrough

    • both go into the prompt

    • prompt goes to the structured LLM

  • It invokes the chain with "What is LangGraph?" and prints the structured fields from the result.
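
A sketch of the structured pieces, reusing the vector store, llm, prompt, and source-formatting helper from earlier sections; the field descriptions are illustrative:

```python
from pydantic import BaseModel, Field

class RAGResponse(BaseModel):
    answer: str = Field(description="Answer grounded in the retrieved context")
    confidence: str = Field(description="high, medium, or low")
    sources_used: list[str] = Field(description="Sources drawn on for the answer")
    follow_up: str = Field(description="A suggested follow-up question")

structured_llm = llm.with_structured_output(RAGResponse)

structured_chain = (
    RunnableParallel(
        context=vectorstore.as_retriever(search_kwargs={"k": 3}) | format_docs_with_sources,
        question=RunnablePassthrough(),
    )
    | prompt
    | structured_llm  # returns a validated RAGResponse, not free-form text
)

result = structured_chain.invoke("What is LangGraph?")
print(result.answer, result.confidence, result.sources_used, result.follow_up)
```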

Key ideas:

  • RAG grounds answers in retrieved documents.

  • Structured output makes responses predictable and easier to use programmatically.

  • The | operator composes retrieval, formatting, prompting, and generation into one chain.

It also notes that confidence is only described as "high, medium, or low" but not strictly enforced; using an Enum would add validation.
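
A minimal sketch of that stricter validation:

```python
from enum import Enum

class Confidence(str, Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

# In RAGResponse, declare the field as:
#     confidence: Confidence
# Pydantic will then reject any value outside the enum.
```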

70. Hands-on ~ Advanced RAG - Multi-Query Retriever

The section introduces several advanced RAG retrieval patterns:

  • multi-query retrieval

  • self-query retrieval

  • contextual compression

  • hybrid search

It explains that the examples use some langchain_community imports because LangChain has been reorganizing its packages, and this compatibility layer is still useful for learning, even though some parts may be deprecated later. Some of these retrievers are also moving into LangGraph.

New components introduced include:

  • MultiQueryRetriever

  • ContextualCompressionRetriever

  • LLMChainExtractor

  • EnsembleRetriever

  • BM25Retriever

  • ParentDocumentRetriever

Logging is enabled so the generated sub-queries from multi-query retrieval can be inspected during execution.

The demo builds a small technical knowledge base, creates a Chroma vector store with embeddings like text-embedding-3-small, and uses that as the foundation for retrieval experiments.

The main example covered is Multi-Query Retriever:

  • It uses an LLM to rewrite a single user query into multiple alternative phrasings.

  • These different versions help surface documents that might not match the original wording exactly.

  • For example, “What tools can I use to build AI applications?” might be expanded into several related queries about AI app development tools, platforms, or software.

When run, the retriever generates these alternate queries, searches the vector store for each one, and returns a broader set of relevant documents. This improves recall but requires more computation and may increase cost.

In the example, the retrieved results included documents related to AI tools, AI platforms, and databases/infrastructure, showing how multi-query retrieval can expand coverage beyond a single query.
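
A minimal sketch, reusing the vector store and llm from earlier; the logging lines expose the generated sub-queries as described:

```python
import logging

from langchain.retrievers.multi_query import MultiQueryRetriever

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm,  # rewrites the question into several alternative phrasings
)

docs = multi_query_retriever.invoke("What tools can I use to build AI applications?")
```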

71. Hands-on ~ Advanced RAG - Contextual Compression

Contextual compression is a retrieval technique that uses an LLM to extract only the most relevant parts of retrieved documents before passing them to the final model.

How it works

  • Set up the vector store, retriever, and LLM.

  • Create an LLM chain extractor to act as the compressor.

  • Wrap the base retriever with a Contextual Compression Retriever using:

    • the compressor

    • the base retriever

  • Run the query and compare:

    • Without compression: full document chunks are returned.

    • With compression: only relevant excerpts are returned.
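
A sketch of the wrapping step, reusing the vector store and llm:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)  # extracts only query-relevant spans
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(),
)

compressed_docs = compression_retriever.invoke("What is LangChain used for?")
```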

What it shows

  • In simple cases, compression may not seem dramatic because the documents are already focused.

  • In more complex documents, the reduction is much clearer:

    • full chunks may be around 1500–1700 characters

    • compressed results may shrink to around 214 characters

  • The output keeps only the information needed to answer the question, such as framework names like LangChain and LangGraph.

Benefits

  1. Lower token usage

  2. Better answer quality due to less noise

  3. Faster processing for large contexts

Trade-off

  • It adds extra LLM calls during retrieval, which increases latency and cost.

Overall

Contextual compression is useful when documents are long, noisy, or expensive to send to the model. It improves precision and efficiency, but at the cost of extra retrieval-time computation.

72. Hands-on ~ Advanced RAG - Hybrid Search

This walkthrough explains how to build a hybrid search system that combines BM25 keyword search and semantic search using a tech docs dataset.

Main steps

  1. BM25 retriever

    • Built from the documents with from_documents

    • Configured with k=3 to return the top 3 keyword matches

  2. Semantic retriever

    • Uses the existing semantic retriever setup

    • Also set to k=3

  3. Ensemble retriever

    • Combines BM25 and semantic retrievers using rank fusion

    • Example weighting: 40% BM25, 60% semantic

    • Weights should be tuned based on the kinds of queries users ask

  4. Testing queries

    • Keyword-heavy queries like Postgres, SQL, and pgvector work well with BM25

    • More meaning-based queries benefit from semantic search

  5. BM25 installation issue

    • The rank-bm25 package was missing

    • After installing it, the hybrid retriever worked correctly
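
A sketch under the assumption that documents holds the tech-docs Document list; the weights follow the example 40/60 split:

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # requires: pip install rank-bm25

bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 3  # top 3 keyword matches

semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.4, 0.6],  # tune for the kinds of queries users actually ask
)

results = hybrid_retriever.invoke("What is Postgres?")
```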

Results and takeaways

  • For “What is Postgres?”, the ensemble gives the best combined result.

  • For “What database stores vectors?”, both retrievers identify relevant vector database content.

  • For “asset transactions”, BM25 succeeds where semantic search drifts off-topic.

  • For “How do I store AI model outputs for later retrieval?” and “fast similarity lookup embeddings”, both retrievers contribute useful signals.

Why it matters

  • BM25 is strong for exact keyword matching

  • Semantic search is strong for intent and meaning

  • Ensemble retrieval combines both to improve accuracy and robustness

The key lesson is that hybrid search handles both exact terms and conceptual similarity, making it more reliable than using either approach alone.

73. Hands-on ~ Advanced RAG - Parent Document Retriever

The document explains how to build a parent document retriever, which combines small chunks for retrieval accuracy with large chunks for better context.

Main idea

  • Split documents into:

    • Parent chunks: larger pieces of about 800 characters

    • Child chunks: smaller pieces of about 200 characters with overlap

  • Search is done over the small child chunks

  • The system returns the full parent chunk to the LLM

Setup

  • Use an in-memory vector store for embeddings

  • Use an in-memory document store for parent documents

  • Name the collection something like parent-child-demo

  • Build the retriever with:

    • vector store

    • document store

    • child splitter

    • parent splitter
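
A sketch of the wiring, assuming the embeddings model and documents list from earlier; the child overlap value is illustrative:

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=800)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)

parent_retriever = ParentDocumentRetriever(
    vectorstore=Chroma(collection_name="parent-child-demo", embedding_function=embeddings),
    docstore=InMemoryStore(),  # holds the full parent chunks
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

parent_retriever.add_documents(documents)  # indexes children, stores parents
docs = parent_retriever.invoke("What is LangGraph used for?")  # returns parent chunks
```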

How it works

  1. Add documents

  2. Query something like “What is LangGraph used for?”

  3. Compare:

    • regular retrieval: returns a small, focused chunk

    • parent document retrieval: returns a larger chunk with more context

Why it helps

  • Small chunks

    • better retrieval precision

    • more focused embeddings

  • Large chunks

    • better context for generation

    • less fragmentation

Key benefit

This approach gives the best of both worlds:

  • accurate search from small chunks

  • complete context from large chunks

Summary

A parent document retriever is a two-stage retrieval system:

  • first, find the most relevant small child chunk

  • then, return its corresponding larger parent document

It is especially useful for larger documents where both precision and context matter.

74. Hands-on ~ Advanced RAG - Combining Multi-Query and Compression Strategies

The passage describes how to combine advanced RAG techniques into one retrieval chain:

  • Start with a vector store and an LLM.

  • Add multi-query retrieval to improve recall by generating query variations.

  • Add contextual compression to improve precision by filtering retrieved results for relevance.

  • Define a RAG prompt, format retrieved documents, and build the final chain.

  • Test the chain with example questions.
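
A sketch of the stacked retriever, combining the pieces from sections 70 and 71 and reusing names from earlier blocks:

```python
# Multi-query widens recall; compression then trims each hit for precision.
recall_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(), llm=llm
)
combined_retriever = ContextualCompressionRetriever(
    base_compressor=LLMChainExtractor.from_llm(llm),
    base_retriever=recall_retriever,
)

advanced_rag_chain = (
    RunnableParallel(
        context=combined_retriever | format_docs,
        question=RunnablePassthrough(),
    )
    | prompt
    | llm
    | StrOutputParser()
)
```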

Key takeaways:

  • Multi-query retrieval helps find more relevant documents.

  • Contextual compression helps keep only the most useful context.

  • You do not need to use every RAG strategy at once; choose what fits your use case.

The example setup uses:

  • ChromaDB for vector storage

  • OpenAI for embeddings and completions

  • Multi-query retrieval

  • Contextual compression

  • An LLM to generate the final answer

Overall, it presents a clean blueprint for a more advanced, effective RAG pipeline.

76. Hands-on ~ Conversation Memory - Basics

This document explains how to build conversation_memory.py to demonstrate conversational memory in LangChain.

Main idea

The chat model remembers earlier parts of a conversation by storing messages in session-based history and reusing them in later turns.

Key components

  • Chat model setup using init_chat_model or ChatOpenAI

  • Prompt template with:

    • a system message

    • a human input

    • a MessagesPlaceholder for chat history

  • Message history storage with:

    • InMemoryChatMessageHistory

    • a dictionary keyed by session_id

  • RunnableWithMessageHistory to automatically load and save messages for each session

  • StrOutputParser to format model output

basic_memory() workflow

  1. Initialize the chat model.

  2. Build a prompt that includes history.

  3. Chain the prompt, model, and parser.

  4. Create an in-memory store for session histories.

  5. Define get_session_history() to retrieve or create history for a session.

  6. Wrap the chain with RunnableWithMessageHistory.

  7. Use a fixed session_id to simulate one conversation.

  8. Send several user messages through the chain.

  9. Print the stored history to inspect saved human and AI messages.
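
A minimal sketch of basic_memory(); the system text and model name are illustrative:

```python
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])
chain = prompt | llm | StrOutputParser()

store = {}  # session_id -> InMemoryChatMessageHistory

def get_session_history(session_id):
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]

chat = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

config = {"configurable": {"session_id": "demo"}}  # fixed id = one conversation
chat.invoke({"input": "Hi, I'm learning LangChain."}, config=config)
print(chat.invoke({"input": "What am I learning?"}, config=config))
print(store["demo"].messages)  # inspect saved human and AI messages
```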

Result

The model can answer follow-up questions using earlier context, such as:

  • remembering the user’s name

  • remembering what the user is learning

Conclusion

This is a simple example of session-based conversational memory in LangChain using modern runnable utilities and message history placeholders.

77. Hands-on ~ Multiple Sessions Memory

This describes how to support multiple independent chat sessions with one shared LLM by giving each user their own memory.

Main idea

  • Use one shared LLM

  • Build a prompt that accepts:

    • history for prior messages

    • current input

  • Create a chain from the prompt and LLM

  • Store conversation histories in a dictionary

  • Add a helper function that gets or creates a session’s history

  • Wrap the chain so:

    • message maps to the current user input

    • history maps to that user’s stored conversation history

How it works

If a session ID is new, a history object is created and saved.
That means each user gets separate memory instead of sharing one global chat history.

Example

  • User A says: “My favorite language is Python.”

  • User B says: “I love JavaScript.”

When they later ask:

  • “What is my favorite language?”

the system uses the correct session ID to load the right history:

  • User A → Python

  • User B → JavaScript
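
Reusing the wrapped chat chain from the previous section, the only per-user difference is the session_id in the config:

```python
chat.invoke({"input": "My favorite language is Python."},
            config={"configurable": {"session_id": "user_a"}})
chat.invoke({"input": "I love JavaScript."},
            config={"configurable": {"session_id": "user_b"}})

# Each lookup loads only that user's own history:
chat.invoke({"input": "What is my favorite language?"},
            config={"configurable": {"session_id": "user_a"}})  # -> Python
chat.invoke({"input": "What is my favorite language?"},
            config={"configurable": {"session_id": "user_b"}})  # -> JavaScript
```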

Why it matters

This setup lets the model:

  • remember past messages

  • keep conversations separate by user

  • answer based on each user’s own history

Summary: each session has its own memory, so multiple users can talk to the same model without mixing their conversations.

78. Hands-on ~ Message Trimming

The passage explains message trimming, which is the process of shortening a conversation history so it fits within a model’s context window and token limits.

Key points:

  • It uses a simulated long chat made of SystemMessage, HumanMessage, and AIMessage.

  • Trimming is done with a token limit and a strategy such as "last" or "first".

  • In the example, the last strategy is used, so the most recent messages are kept.

  • include_system=True means system messages are preserved.

  • allow_partial=False means messages are only kept if they fit completely.
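
A sketch with langchain_core's trim_messages and the options named above; the sample history and token limit are illustrative, and the chat model itself serves as the token counter:

```python
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage, trim_messages

history = [
    SystemMessage("You are a helpful assistant."),
    HumanMessage("Hi, I'm learning LangChain."),
    AIMessage("Great! How can I help?"),
    HumanMessage("What is a retriever?"),
    AIMessage("A retriever fetches documents relevant to a query."),
]

trimmed = trim_messages(
    history,
    max_tokens=60,          # illustrative limit
    strategy="last",        # keep the most recent messages
    token_counter=llm,      # count tokens with the chat model's tokenizer
    include_system=True,    # always preserve the system message
    allow_partial=False,    # keep a message only if it fits completely
)
```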

Why it matters:

  • Reduces token usage

  • Keeps conversations within context limits

  • Avoids sending unnecessary history

  • Helps manage long-term memory efficiently

Example outcome:

  • The original chat had 8 messages

  • After trimming with a small token limit, it may shrink to only 2 messages

Overall, message trimming is a practical way to control how much conversation history is retained in AI applications.

79. Hands-on ~ Windowed Memory

The passage explains sliding window memory for LLM conversations:

  • LLM costs grow because each new request may include the full chat history.

  • To control this, sliding window memory keeps only the last K exchanges and discards older messages.

  • In the demo, a custom WindowChatHistory class extends LangChain’s InMemoryChatMessageHistory.

  • It overrides add_messages to check whether the number of messages exceeds K * 2:

    • 1 exchange = 1 human + 1 AI message

    • So K exchanges = K * 2 messages

  • If the limit is exceeded, it slices the list to keep only the newest messages:

    • self.messages = self.messages[-(K * 2):]
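
A sketch of the custom class as described, subclassing InMemoryChatMessageHistory; the attribute name k is an assumption:

```python
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.messages import BaseMessage

class WindowChatHistory(InMemoryChatMessageHistory):
    k: int = 2  # number of exchanges (human + AI pairs) to keep

    def add_messages(self, messages: list[BaseMessage]) -> None:
        super().add_messages(messages)
        if len(self.messages) > self.k * 2:
            # 1 exchange = 1 human + 1 AI message, so keep the newest k * 2
            self.messages = self.messages[-(self.k * 2):]
```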

The demo conversation shows the memory shrinking as new messages arrive. With K = 2, only the last two exchanges remain, so the model remembers recent facts like:

  • “I work as an engineer.”

  • “I have two cats.”

It forgets earlier ones like:

  • “My name is Paulo.”

  • “I live in Seattle.”

Main takeaway

Sliding window memory provides fixed-size, predictable conversation memory, lowering cost and avoiding context-window overflow, but it loses older context.

80. Hands-on ~ Summary Memory

Summary memory keeps a conversation manageable by compressing older messages into a running summary instead of deleting them. It uses:

  • a summary LLM to maintain a stable, deterministic summary,

  • a chat LLM for the live conversation,

  • a prompt built from running summary + recent message buffer + current user input.

How it works

  1. The model responds using the summary, recent messages, and new input.

  2. The new exchange is added to a recent-message buffer.

  3. When the buffer gets too large, the oldest messages are summarized.

  4. That summary is merged into the running summary.
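
A hedged sketch of the folding step; the buffer threshold, prompt wording, and function name are illustrative, and summary_llm would typically be a temperature-0 model, matching the stable, deterministic summarizer the passage mentions:

```python
MAX_BUFFER = 6  # recent messages kept verbatim before folding begins

def fold_into_summary(summary, buffer, summary_llm):
    """Summarize the oldest buffered messages and merge them into the running summary."""
    if len(buffer) <= MAX_BUFFER:
        return summary, buffer  # buffer still small enough; nothing to fold
    old, recent = buffer[:-MAX_BUFFER], buffer[-MAX_BUFFER:]
    transcript = "\n".join(f"{m.type}: {m.content}" for m in old)
    new_summary = summary_llm.invoke(
        f"Current summary:\n{summary}\n\n"
        f"New messages to incorporate:\n{transcript}\n\n"
        "Return an updated, concise summary."
    ).content
    return new_summary, recent
```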

Why it’s useful

  • Recent context stays exact

  • Older context is preserved in compressed form

  • Token usage remains bounded, preventing context overflow

Key idea

It’s a hybrid memory strategy:

  • old info → summarized

  • new info → kept verbatim

This is especially useful for chatbots, RAG systems, and other long-running AI interactions.

81. Exercise and Solution ~ Persistent Memory

The passage explains how to build a chatbot with persistent memory using LangChain and SQLite.

Main points

  • Use RunnableWithMessageHistory and SQLChatMessageHistory to store conversation history in a local SQLite database.

  • Each chat session is identified by a session ID, so messages are saved and retrieved per user/session.

  • The chatbot can remember preferences across restarts, such as:

    • “I prefer dark mode themes.”

    • “What theme do I prefer?”

  • To make this work, you:

    1. Import the needed LangChain chat history tools.

    2. Set a SQLite .db file path.

    3. Create a function that returns a SQLChatMessageHistory for a given session.

    4. Build a prompt with a system message, history, and user input.

    5. Wrap the chain with RunnableWithMessageHistory.

    6. Pass a config containing the session_id.

    7. Test that the bot remembers past messages.
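
A sketch of the SQLite-backed wiring, reusing the prompt/LLM chain from section 76; note the connection argument name has varied across LangChain versions (connection vs. connection_string), and the .db path is a placeholder:

```python
from langchain_community.chat_message_histories import SQLChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

DB_URL = "sqlite:///chat_memory.db"  # placeholder path

def get_sql_history(session_id):
    # Messages are stored on disk, keyed by session_id, so they survive restarts
    return SQLChatMessageHistory(session_id=session_id, connection=DB_URL)

persistent_chat = RunnableWithMessageHistory(
    chain,
    get_sql_history,
    input_messages_key="input",
    history_messages_key="history",
)

config = {"configurable": {"session_id": "user-123"}}
persistent_chat.invoke({"input": "I prefer dark mode themes."}, config=config)
# Even after a restart, the same session_id reloads the saved messages:
persistent_chat.invoke({"input": "What theme do I prefer?"}, config=config)
```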

Persistence verification

  • To confirm memory is truly saved, you can:

    • run a conversation,

    • restart the chain,

    • reuse the same SQLite database,

    • and ask about earlier information.

  • You can also inspect the SQLite database directly to see stored human and AI messages.

Summarization idea

  • After about 10 messages, the conversation can be summarized automatically.

  • The summary can be stored as memory, while older raw messages may be pruned if desired.

Overall goal

The result is a chatbot that:

  • remembers preferences,

  • persists across restarts,

  • stores memory locally,

  • and can later be extended with automatic summarization.

83. Project ~ AI Research Assistant - Indexing Documents (Part 1)

The document outlines the setup of an AIResearchAssistant for a RAG pipeline using Chroma, OpenAIEmbeddings, and RecursiveCharacterTextSplitter.

Main points

  • Introduces structured output models:

    • ResearchResponse: includes answer, confidence, sources, and key_quotes

    • follow_up_questions: for generating follow-up prompts

  • Builds an AIResearchAssistant class that bundles the three core RAG components:

    1. Embedding model (OpenAIEmbeddings, text-embedding-3-small)

    2. Text splitter (RecursiveCharacterTextSplitter)

    3. Vector store (Chroma with persistent storage)

  • The constructor sets defaults like:

    • persistent_directory="research_db"

    • chunk_size=1000

    • chunk_overlap=200

  • Adds document ingestion methods:

    • add_documents to split, tag, timestamp, and store documents

    • add_text and add_texts as convenience wrappers for raw text

  • Includes inspection utilities:

    • get_document_count

    • list_sources

  • Confirms persistence and indexing through tests and cleanup steps
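
A hedged sketch of the constructor and ingestion wiring described above; the tagging and timestamp details are simplified:

```python
from datetime import datetime, timezone

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

class AIResearchAssistant:
    def __init__(self, persistent_directory="research_db",
                 chunk_size=1000, chunk_overlap=200):
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=chunk_overlap
        )
        self.vectorstore = Chroma(
            persist_directory=persistent_directory,
            embedding_function=self.embeddings,
        )

    def add_documents(self, documents):
        chunks = self.text_splitter.split_documents(documents)
        for chunk in chunks:  # tag each chunk with an ingestion timestamp
            chunk.metadata.setdefault("indexed_at", datetime.now(timezone.utc).isoformat())
        self.vectorstore.add_documents(chunks)

    def get_document_count(self):
        return self.vectorstore._collection.count()  # Chroma-internal shortcut
```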

Overall takeaway

The assistant is now able to ingest, chunk, index, and persist documents, but it does not yet support retrieval or question answering. The next step is to add those capabilities so it can respond to user queries.

84. Project ~ AI Research Assistant - LLM Prompt and Output Parser (Part 2)

The passage explains how to turn a basic document retriever into a simple RAG-style Q&A chain using three main parts: an LLM, a prompt, and an output parser.

Main steps covered

  • Add a ChatOpenAI model to the assistant.

  • Build a retriever that uses similarity search and returns the top 4 relevant chunks.

  • Test the retriever to confirm it returns relevant document fragments.

  • Add a function to format retrieved documents into plain-text context for the LLM.

  • Create an ask method that:

    1. retrieves documents,

    2. formats them,

    3. builds a prompt with system and human instructions,

    4. runs a chain like prompt | llm | StrOutputParser(),

    5. returns the generated answer.

Testing and behavior

The assistant is tested with three questions:

  1. A factual question about RAG

  2. A question needing information from multiple sources

  3. A follow-up question

Key result

The system works, but it has a major weakness: no memory.
Because of that, follow-up questions can be misinterpreted or hallucinated instead of being answered correctly. This shows why grounding helps, but also why conversational memory will be needed next.

85. Project ~ AI Research Assistant - Adding Memory (Part 3)


The passage explains how to add session-based memory to an AI Research Assistant so each user session keeps its own conversation history.

Key points:

  • Add self.session_store as an in-memory dictionary to hold per-session chat history.

  • Create _get_session_history(self, session_id) to return or initialize a session’s message list.

  • Update the prompt in ask by inserting a MessagesPlaceholder named history between the system and human messages.

  • Inspecting session history shows it is just a list of stored messages, which only becomes useful when injected into the prompt.

  • In ask, retrieve the session history first and pass history.messages into the chain, optionally limiting the number of recent messages.

  • Before returning a response, save both sides of the exchange:

    • HumanMessage for the user question

    • AIMessage for the assistant reply

  • Add utility methods:

    • clear_session(…) to erase a session’s history

    • get_session_history_display(…) to view history in a readable format
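
A sketch of the session-memory additions as a class fragment, following the names in the passage; the remember helper is hypothetical shorthand for the save step at the end of ask:

```python
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.messages import AIMessage, HumanMessage

class AIResearchAssistant:
    def __init__(self):
        self.session_store = {}  # session_id -> InMemoryChatMessageHistory

    def _get_session_history(self, session_id):
        if session_id not in self.session_store:
            self.session_store[session_id] = InMemoryChatMessageHistory()
        return self.session_store[session_id]

    def remember(self, session_id, question, answer):
        # Hypothetical helper: save both sides of the exchange after ask()
        history = self._get_session_history(session_id)
        history.add_messages([HumanMessage(question), AIMessage(answer)])

    def clear_session(self, session_id):
        self.session_store.pop(session_id, None)
```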

Testing shows:

  • Follow-up questions now work because prior context is available.

  • Each Q&A pair adds two messages to memory.

  • Different session IDs have isolated histories, so one user’s conversation does not affect another’s.

Overall, the update gives the chatbot real conversational memory while keeping session histories separate.

86. Project ~ AI Research Assistant - Multi-Query Implementation (Part 4)

The passage explains how to improve a RAG retriever by adding multi-query retrieval.

Main idea

  • The current retriever only matches the query using similar words.

  • This works, but it can miss relevant chunks if the answer uses different terminology.

  • Multi-query retrieval fixes this by using an LLM to generate several semantically similar queries from the original question.

Basic vs. advanced retriever

  • Basic retriever: simple similarity search, returns about four documents.

  • Advanced retriever: multi-query retrieval, which expands the original question into multiple related searches.

Why it helps

Different people may ask the same thing in different ways. Multi-query retrieval improves recall by searching from multiple angles, making it more likely to find useful chunks even when wording differs.

Implementation described

  • Update _build_retriever in AIResearchAssistant.

  • Add a use_advanced flag.

  • If false, use the basic retriever.

  • If true, use the multi-query retriever.

  • Update ask to pass this flag through and handle retrieval the same way afterward.

Testing and results

  • With debugging enabled, the advanced retriever shows generated alternate queries.

  • It returns more relevant and unique chunks than the basic retriever.

  • This gives the model richer context and usually improves answer quality.

Conclusion

Multi-query retrieval makes the RAG system smarter and more flexible by retrieving information from several semantically related searches instead of relying on one keyword-based query.

The passage ends by noting the next step: moving from raw string answers to structured output, such as returning confidence, sources, and answer text in a predictable format.

87. Project ~ AI Research Assistant - Structured Output - Final Part

The passage explains how to improve an ask function in a RAG system by changing its output from plain text to a structured object.

Key points

  • The current ask function returns a plain string, which is hard to use programmatically.

  • A ResearchResponse data model already exists to solve this by structuring outputs with fields like:

    • answer

    • confidence

    • sources

    • key_quotes

    • follow_up_questions

  • A new function, ask_structured, is introduced right before ask.

    • It takes the same inputs as ask:

      • question

      • session_id

      • use_default

      • use_advanced

    • But it returns a ResearchResponse instead of a string.

  • Inside ask_structured, the LLM is wrapped with with_structured_output(ResearchResponse) so the model returns data in the schema format.

  • This makes it easy to access individual parts of the response directly in code, such as response.answer or response.sources.

Why this matters

  • Plain text is fine for display, but structured output is better for downstream processing.

  • It makes it easier to extract answers, confidence scores, sources, and follow-up questions.

Broader context

  • The project now includes a full RAG pipeline with:

    • a research assistant

    • advanced retrieval

    • conversation memory

    • structured responses

  • Document ingestion was intentionally left out for simplicity, but should be implemented separately in a real project:

    • upload file

    • extract content

    • store it in the database

Conclusion

The lesson shows how to use LangChain to build a more powerful RAG system, and it sets up the next topic: LangGraph, stateful agents, and how LangGraph and LangChain work together.