Chapter 1. LLM Fundamentals with LangChain
Getting Set Up with LangChain
The chapter recommends setting up LangChain on your computer to follow along with the examples. First, create an OpenAI account and generate an API key from the OpenAI website. LangChain supports both Python and JavaScript, and the book provides equivalent code examples in both languages.
For Python users:
- Ensure Python is installed.
- Optionally install Jupyter Notebook (pip install notebook).
- Install LangChain and related libraries via pip.
- Set the OpenAI API key in your terminal environment.
- Launch Jupyter Notebook to run the examples.
For JavaScript users:
- Set the OpenAI API key in your terminal environment.
- Install Node.js if needed.
- Install LangChain and related packages via npm.
- Save code examples as .js files and run them using Node.js.
This setup ensures you can run the provided LangChain code examples smoothly throughout the book.
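As a rough illustration, a small Python check like the following (the file name and package list are illustrative, not from the book) can confirm the environment is ready before launching Jupyter:

```python
# Hypothetical sanity check; the exact packages you need depend on the chapter.
# Install beforehand with:  pip install langchain langchain-openai notebook
# Export the key with:      export OPENAI_API_KEY="sk-..."
import os

if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set OPENAI_API_KEY in your shell before running the examples.")
print("OpenAI API key found; you're ready to run the examples.")
```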
Using LLMs in LangChain
This document details how to interact with Large Language Models (LLMs) using the LangChain framework, focusing on two primary interfaces: LLMs and Chat Models.
LLM Interface: This interface takes a simple string prompt, sends it to the LLM provider (like OpenAI), and returns the model’s prediction. Key parameters for configuration include model (specifying the underlying LLM), temperature (controlling output randomness/creativity), and max_tokens (limiting output length/cost).
Chat Model Interface: This interface is designed for conversational interactions. It differentiates messages by role:
- System: Instructions for the model.
- User: The user’s input/query.
- Assistant: The model’s response.
LangChain provides message classes like HumanMessage, AIMessage, SystemMessage, and ChatMessage to manage these roles. Using SystemMessage allows pre-configuring the AI’s behavior, ensuring more predictable responses.
Key Takeaways:
- LangChain simplifies LLM interaction. It provides consistent interfaces regardless of the underlying provider.
- Choosing between LLM and Chat Model depends on the application. Simple prompts use the LLM interface, while conversations benefit from the Chat Model interface.
- Parameters like temperature and max_tokens are crucial for controlling LLM behavior and cost.
- SystemMessage is powerful for guiding the model’s responses.
The document includes code examples in both Python and JavaScript demonstrating how to use both interfaces with OpenAI’s gpt-3.5-turbo model.
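A minimal Python sketch of the two interfaces might look like the following; it assumes the langchain-openai package, and note that the string-completion LLM interface needs an instruct-style model (gpt-3.5-turbo-instruct here) rather than the chat model named in the text:

```python
# A minimal sketch, assuming langchain-openai; import paths vary between versions.
from langchain_openai import OpenAI, ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

# LLM interface: plain string in, string completion out.
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0.7, max_tokens=100)
print(llm.invoke("Suggest a name for a bakery."))

# Chat Model interface: role-based messages in, an AIMessage out.
chat = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
messages = [
    SystemMessage(content="You are a helpful assistant that answers concisely."),
    HumanMessage(content="What is LangChain?"),
]
print(chat.invoke(messages).content)
```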
Making LLM Prompts Reusable
The content explains how prompt instructions shape the output of language models by providing context and guiding responses. It demonstrates creating detailed prompts with dynamic inputs (like context and question) using LangChain’s PromptTemplate and ChatPromptTemplate interfaces in both Python and JavaScript. These templates allow placeholders (e.g., {context}, {question}) to be filled at runtime, enabling flexible prompt construction.
Examples show:
- Defining a prompt template with placeholders.
- Invoking the template with specific context and question values.
- Passing the formatted prompt to an LLM (OpenAI) to generate answers.
- Using ChatPromptTemplate for chat-based interactions with role-based messages (system, human).
The key takeaway is that LangChain simplifies building reusable, dynamic prompts that can be integrated with LLMs for question answering or chat applications, ensuring prompts are structured and adaptable to user inputs.
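A minimal Python sketch of both template styles, assuming langchain-core and langchain-openai; the context and question values are illustrative:

```python
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_openai import ChatOpenAI

# String template with runtime placeholders.
template = PromptTemplate.from_template(
    "Answer the question using only this context:\n{context}\n\nQuestion: {question}"
)
prompt_value = template.invoke(
    {"context": "LangChain was released in 2022.", "question": "When was LangChain released?"}
)

# Chat template with role-based messages.
chat_template = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context."),
    ("human", "Context: {context}\nQuestion: {question}"),
])

llm = ChatOpenAI(model="gpt-3.5-turbo")
answer = llm.invoke(chat_template.invoke(
    {"context": "LangChain was released in 2022.", "question": "When was LangChain released?"}
))
print(answer.content)
```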
Chapter 2. RAG Part I: Indexing Your Data
Splitting Your Text into Chunks
The content explains how LangChain’s RecursiveCharacterTextSplitter helps split large texts into semantically meaningful chunks by recursively splitting text using a prioritized list of separators (paragraphs, lines, words) to respect chunk size limits. It outputs chunks as Document objects with metadata and position info.
Key points:
- Default separators: paragraphs (\n\n), lines (\n), words (space).
- Splitting starts with the largest separator and moves to smaller ones if chunks exceed the size limit.
- Supports chunk size and overlap to maintain context.
- Can split raw text strings or documents loaded from files.
- Specialized splitting for code and Markdown uses language-specific separators to keep semantic units (e.g., function bodies) intact.
- LangChain provides built-in separators for many languages (Python, JS, Markdown, HTML).
- The from_language method creates splitter instances tailored to specific languages.
- The create_documents method splits raw text strings into documents, optionally attaching metadata per chunk.
- Metadata is preserved and attached to each chunk, useful for tracking source or provenance.
Examples show usage in Python and JavaScript for plain text, Python code, and Markdown, demonstrating how chunks align with natural text/code boundaries and how metadata is propagated.
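A minimal Python sketch, assuming the langchain-text-splitters package; the file names are illustrative:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

# Plain text: split on paragraphs, then lines, then words, keeping some overlap.
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
docs = splitter.create_documents(
    [open("report.txt").read()],            # hypothetical source file
    metadatas=[{"source": "report.txt"}],   # metadata copied onto every chunk
)

# Code: language-aware separators keep functions and classes intact.
py_splitter = RecursiveCharacterTextSplitter.from_language(
    Language.PYTHON, chunk_size=200, chunk_overlap=0
)
code_docs = py_splitter.create_documents([open("example.py").read()])

print(len(docs), docs[0].metadata, len(code_docs))
```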
Generating Text Embeddings
The content explains how LangChain’s Embeddings class interfaces with text embedding models (like OpenAI, Cohere, Hugging Face) to convert text into vector representations. It provides two methods: one for embedding multiple documents (list of strings) and one for embedding a single query string. Examples in Python and JavaScript demonstrate embedding multiple documents at once for efficiency, returning lists of numeric vectors.
An end-to-end example shows how to:
- Load documents using document loaders (e.g., TextLoader).
- Split large documents into smaller chunks with text splitters (e.g., RecursiveCharacterTextSplitter).
- Generate embeddings for each chunk using an embeddings model (e.g., OpenAIEmbeddings).
The example code is provided in both Python and JavaScript. After generating embeddings, the next step is to store them in a vector store database for further use.
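A minimal end-to-end Python sketch, assuming langchain-community, langchain-text-splitters, and langchain-openai; ./test.txt stands in for your own file:

```python
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

# Load, split, then embed each chunk in a single batch for efficiency.
docs = TextLoader("./test.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

embeddings_model = OpenAIEmbeddings()
vectors = embeddings_model.embed_documents([c.page_content for c in chunks])
query_vector = embeddings_model.embed_query("What does the document say about pricing?")

print(len(vectors), len(vectors[0]), len(query_vector))
```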
Storing Embeddings in a Vector Store
The chapter explains vector stores—databases optimized for storing vectors and performing similarity calculations like cosine similarity, especially for unstructured data such as text and images. Unlike traditional structured-data databases, vector stores support CRUD and search operations on vector embeddings, enabling AI-powered applications like querying large documents.
There are many vector store providers, each with different features (multitenancy, filtering, performance, cost, scalability). However, vector stores are a relatively new technology, can be complex to manage, and add another moving part to your system.
To address this, PostgreSQL supports vector storage via the pgvector extension, allowing users to manage both traditional relational data and vector embeddings in one familiar database.
The setup involves running a Docker container with Postgres + pgvector, then connecting via a connection string.
Examples in Python and JavaScript show how to:
- Load and split documents into chunks
- Generate embeddings using OpenAIEmbeddings (or other models)
- Store embeddings and documents in PGVector (Postgres)
- Perform similarity searches to retrieve relevant documents
- Add new documents with metadata and UUIDs
- Delete documents by ID
The process includes embedding queries, searching for nearest vectors in Postgres, and returning documents sorted by similarity.
This integration simplifies vector search by leveraging a popular relational database, reducing the need for separate vector store infrastructure while enabling scalable AI applications.
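A minimal Python sketch of the PGVector workflow, assuming the langchain-postgres package and a local pgvector container; constructor argument names differ slightly across LangChain versions, and the connection string is illustrative:

```python
import uuid
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_postgres.vectorstores import PGVector

connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"

docs = [Document(page_content="LangChain supports many vector stores.",
                 metadata={"source": "notes"})]
db = PGVector.from_documents(documents=docs, embedding=OpenAIEmbeddings(),
                             connection=connection)

# Similarity search: the query is embedded and compared against stored vectors.
results = db.similarity_search("What is a vector store?", k=4)

# Add and delete documents by ID.
ids = [str(uuid.uuid4())]
db.add_documents([Document(page_content="pgvector adds vector search to Postgres",
                           metadata={"source": "notes"})], ids=ids)
db.delete(ids=ids)
```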
Tracking Changes to Your Documents
The content explains how LangChain’s indexing API helps manage vector stores with frequently changing data by avoiding costly re-indexing and duplicate embeddings. It uses a RecordManager class to track documents via hashes, write times, and source IDs. The API supports cleanup modes to control deletion of outdated documents:
- None: no automatic cleanup.
- Incremental: deletes previous versions if content changes.
- Full: deletes previous versions and any documents not in the current batch.
Examples in Python and JavaScript demonstrate setting up a Postgres-backed vector store and record manager, creating documents, and indexing them with incremental cleanup to prevent duplicates. When documents are modified, the API replaces old versions sharing the same source ID. This approach keeps the vector store synchronized efficiently by only updating changed documents.
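A minimal Python sketch of incremental indexing, assuming LangChain's SQLRecordManager and index helper plus a PGVector store like the one above; the namespace, documents, and sources are illustrative:

```python
from langchain.indexes import SQLRecordManager, index
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_postgres.vectorstores import PGVector

db_url = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"
db = PGVector(embeddings=OpenAIEmbeddings(), collection_name="my_docs", connection=db_url)

record_manager = SQLRecordManager("postgres/my_docs", db_url=db_url)  # tracks hashes + sources
record_manager.create_schema()

docs = [
    Document(page_content="v1 of the pricing page", metadata={"source": "pricing.html"}),
    Document(page_content="the about page", metadata={"source": "about.html"}),
]

# First run inserts both documents; re-running after editing the pricing page
# replaces only the chunks that share source "pricing.html".
result = index(docs, record_manager, db, cleanup="incremental", source_id_key="source")
print(result)   # counts of added / updated / skipped / deleted documents
```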
Indexing Optimization
MultiVectorRetriever
The document explains a method to handle documents containing both text and tables for retrieval-augmented generation (RAG) without losing table data. Instead of embedding raw text chunks (which can omit tables), it proposes a two-level indexing approach:
- Summarize table elements using an LLM, generating summaries that include an ID referencing the full raw table.
- Store summaries in a vector store for efficient similarity search.
- Store full raw tables separately in a document store (docstore) keyed by the summary IDs.
- When a query retrieves a summary, fetch the full referenced raw table from the docstore and pass it as context to the LLM for answer synthesis.
This decoupling allows retrieval of concise summaries for fast search while preserving access to complete table data for accurate answers.
The document provides detailed Python and JavaScript code examples demonstrating:
- Loading and splitting documents into chunks.
- Using an LLM to generate summaries of chunks.
- Indexing summaries in a vector store (Postgres PGVector).
- Storing original chunks in an in-memory docstore.
- Using a MultiVectorRetriever to first retrieve summaries by similarity, then fetch full original chunks by ID.
- Querying the retriever to get relevant full context documents for downstream LLM prompting.
This approach ensures that tables and other complex document structures are not lost during chunking and embedding, enabling richer and more accurate retrieval and answer synthesis.
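A minimal Python sketch of the two-level index, assuming langchain's MultiVectorRetriever and an in-memory docstore; the summary below stands in for one produced by an LLM, and the connection string is illustrative:

```python
import uuid
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_postgres.vectorstores import PGVector

connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"
vectorstore = PGVector(embeddings=OpenAIEmbeddings(),
                       collection_name="summaries", connection=connection)
docstore = InMemoryStore()
id_key = "doc_id"

chunks = [Document(page_content="<full chunk containing a large revenue table>")]
summaries = [Document(page_content="Summary: quarterly revenue table by region")]  # from an LLM

ids = [str(uuid.uuid4()) for _ in chunks]
for summary, doc_id in zip(summaries, ids):
    summary.metadata[id_key] = doc_id         # summary points back to the full chunk

vectorstore.add_documents(summaries)           # similarity search runs over the summaries
docstore.mset(list(zip(ids, chunks)))          # full chunks are kept by ID

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=docstore, id_key=id_key)
print(retriever.invoke("revenue by region"))   # returns the full chunk, not the summary
```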
ColBERT: Optimizing Embeddings
The text discusses a challenge with embedding models that compress documents into fixed-length vectors, which can include irrelevant or redundant content and cause hallucinations in LLM outputs. A more granular approach involves generating contextual embeddings for each token in both the query and document, scoring token-level similarities, and summing maximum similarity scores to rank documents. The ColBERT embedding model implements this solution effectively.
An example Python workflow using the RAGatouille library demonstrates how to:
- Retrieve full Wikipedia page text via its API
- Index the document with ColBERT embeddings
- Perform similarity search queries
- Integrate with LangChain retrievers for improved document retrieval
Using ColBERT in this way enhances the relevance of documents retrieved as context for large language models, reducing hallucinations and improving output quality.
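A minimal Python sketch of that workflow, assuming the ragatouille package; method names follow its documented API but may change between releases, and here the Wikipedia text is read from a local file rather than fetched via the API:

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

full_document = open("miyazaki_wikipedia.txt").read()   # hypothetical page dump
RAG.index(
    collection=[full_document],
    index_name="Miyazaki-123",
    max_document_length=180,
    split_documents=True,
)

# Token-level MaxSim scoring ranks passages for the query.
results = RAG.search(query="What animation studio did Miyazaki found?", k=3)

# Wrap as a LangChain retriever to drop into an existing RAG chain.
retriever = RAG.as_langchain_retriever(k=3)
docs = retriever.invoke("What animation studio did Miyazaki found?")
```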
Chapter 3. RAG Part II: Chatting with Your Data
Introducing Retrieval-Augmented Generation
The text explains Retrieval-Augmented Generation (RAG), a technique that improves the accuracy of large language models (LLMs) by providing them with up-to-date external context. Without RAG, LLMs rely solely on pretrained data, which can be outdated, leading to incorrect answers—as illustrated by ChatGPT incorrectly naming France as the latest FIFA World Cup winner instead of Argentina (2022). By appending relevant, current information (e.g., from Wikipedia) as context to the prompt, the LLM can generate accurate responses. However, manually adding such context is impractical for real-world applications, highlighting the need for automated systems that retrieve and supply relevant information dynamically to LLMs during generation.
Retrieving Relevant Documents
The content explains the three core stages of a Retrieval-Augmented Generation (RAG) system for AI applications:
- Indexing: Preprocess external data by loading documents, splitting them into chunks, embedding these chunks into vector representations, and storing them in a vector store for efficient retrieval. Code examples in Python and JavaScript demonstrate loading a text file, splitting it, embedding chunks using OpenAI embeddings, and storing them in a PostgreSQL-backed vector store (PGVector).
- Retrieval: When a user query is received, it is converted into embeddings and compared against stored embeddings using similarity metrics (e.g., cosine similarity) to find the most relevant document chunks. The vector store provides an as_retriever method to abstract query embedding and similarity search. The number of documents retrieved can be controlled by a parameter k to balance relevance, performance, and prompt size. Code examples show how to retrieve relevant documents using this method.
- Generation: The final stage involves combining the original user prompt with the retrieved relevant documents to form a comprehensive prompt that is sent to the language model for generating a prediction or answer.
Figures referenced illustrate the flow of indexing and retrieval processes, including similarity calculations using structures like Hierarchical Navigable Small World (HNSW) graphs.
Overall, the chapter emphasizes practical implementation of RAG with LangChain libraries, highlighting the importance of efficient indexing, controlled retrieval, and prompt synthesis for effective AI applications.
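A minimal Python sketch of the retrieval stage, assuming db is the PGVector store built during indexing:

```python
# k bounds how many chunks are retrieved, trading relevance against prompt size.
retriever = db.as_retriever(search_kwargs={"k": 2})
relevant_docs = retriever.invoke("What did the author say about pricing?")
for doc in relevant_docs:
    print(doc.metadata.get("source"), doc.page_content[:80])
```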
Generating LLM Predictions Using Relevant Documents
The content explains how to build a Retrieval-Augmented Generation (RAG) system by integrating relevant documents retrieved from a vector store into a prompt for a large language model (LLM) to generate informed answers. It provides Python and JavaScript code examples demonstrating:
- Creating a dynamic prompt template with context and question variables.
- Using a retriever to fetch relevant documents based on a user query.
- Composing a chain that pipes the prompt output into the LLM.
- Invoking the chain with retrieved documents and the user question to generate a final answer.
Further, it shows how to encapsulate this retrieval and generation logic into a reusable function or runnable (qa) that takes a question, fetches documents, formats the prompt, and returns the answer—optionally including the retrieved documents for inspection.
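A minimal Python sketch of such a qa runnable, assuming the retriever from the previous step and LCEL's pipe syntax:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import chain
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Answer the question based only on the context below.\n\n"
    "Context: {context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-3.5-turbo")

@chain
def qa(question: str):
    docs = retriever.invoke(question)                      # fetch relevant chunks
    context = "\n\n".join(doc.page_content for doc in docs)
    answer = (prompt | llm).invoke({"context": context, "question": question})
    return {"answer": answer.content, "docs": docs}        # keep docs for inspection

print(qa.invoke("What did the author say about pricing?")["answer"])
```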
The text highlights that while this basic RAG system works for personal use, production-ready systems require addressing challenges such as handling variable user input quality, routing queries across multiple data sources, translating natural language to query languages, and optimizing indexing (embedding and text splitting).
Finally, it previews upcoming discussion on research-backed strategies to optimize RAG system accuracy, summarized in an accompanying figure.
Query Transformation
Rewrite-Retrieve-Read
The Rewrite-Retrieve-Read (RRR) strategy improves question answering by first prompting a large language model (LLM) to rewrite a user’s poorly phrased or distracted query into a clearer, more focused search query. This rewritten query is then used to retrieve relevant documents, which are subsequently passed along with the original question to the LLM to generate an answer.
An example shows that a distracted user query containing irrelevant details leads to a failure in retrieving useful information and thus no answer. By contrast, applying the RRR approach—where the LLM rewrites the query before retrieval—results in fetching relevant documents and producing a meaningful answer.
This method can be implemented in various programming languages (Python, JavaScript) and works with any retrieval system, including vector stores or web search tools. The main trade-off is increased latency due to the extra LLM call required for rewriting the query.
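A minimal Python sketch of the Rewrite-Retrieve-Read flow, assuming a retriever is already defined (e.g., from the indexing sketches); the prompts are illustrative:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

rewrite_prompt = ChatPromptTemplate.from_template(
    "Rewrite the following user input as a concise search query, "
    "removing any irrelevant details:\n\n{question}"
)
rewriter = rewrite_prompt | llm | StrOutputParser()

answer_prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

def rewrite_retrieve_read(question: str) -> str:
    search_query = rewriter.invoke({"question": question})   # the extra LLM call
    docs = retriever.invoke(search_query)                     # retrieve with the cleaned-up query
    context = "\n\n".join(d.page_content for d in docs)
    return (answer_prompt | llm).invoke(
        {"context": context, "question": question}            # answer the original question
    ).content
```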
Multi-Query Retrieval
The multi-query retrieval strategy enhances information retrieval by generating multiple related queries from a user’s initial question using an LLM. These queries are run in parallel against a data source, and the retrieved documents are combined and deduplicated to form a comprehensive context. This approach helps overcome limitations of single-query retrieval, especially when answers require multiple perspectives.
Key points include:
- Using a prompt template to generate several query variations from the original question.
- Running all queries in parallel with a .batch method to retrieve relevant documents.
- Deduplicating documents by their content to avoid repetition.
- Combining the retrieved documents as context for a final prompt to the LLM to generate a comprehensive answer.
- Encapsulating the multi-query retrieval logic in a standalone chain (retrieval_chain) for modularity and ease of integration.
Code examples in Python and JavaScript illustrate:
- Generating multiple queries.
- Parallel retrieval and deduplication of documents.
- Constructing a prompt with combined context.
- Producing the final answer using the LLM.
This strategy is particularly useful for complex questions requiring diverse information sources and can be integrated seamlessly into existing QA workflows.
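A minimal Python sketch of the multi-query retrieval step, assuming a retriever is already defined; the query-generation prompt is illustrative:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

generate_queries = (
    ChatPromptTemplate.from_template(
        "Generate 3 different search queries for this question, one per line:\n{question}"
    )
    | llm
    | StrOutputParser()
    | (lambda text: [q for q in text.split("\n") if q.strip()])
)

def retrieval_chain(question: str):
    queries = generate_queries.invoke({"question": question})
    results = retriever.batch(queries)              # one retrieval per query, run in parallel
    unique = {doc.page_content: doc                 # deduplicate by content
              for docs in results for doc in docs}
    return list(unique.values())

docs = retrieval_chain("How do I tune chunk size and overlap?")
```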
RAG-Fusion
The RAG-Fusion strategy enhances multi-query retrieval by adding a final reranking step using the Reciprocal Rank Fusion (RRF) algorithm. RRF combines ranked document lists from multiple queries into a single unified ranking by scoring documents based on their positions across all lists, effectively promoting the most relevant documents to the top. This approach handles differences in score scales across queries and allows lower-ranked documents to influence the final ranking through a tunable parameter k.
The process involves:
- Generating multiple search queries from a single user query using a language model prompt.
- Retrieving relevant documents for each query.
- Applying the RRF algorithm to fuse and rerank these documents into one consolidated list.
- Using the reranked documents as context to answer the original question with a language model.
Code examples in Python and JavaScript demonstrate how to implement query generation, document retrieval, RRF reranking, and final question answering in a pipeline. RAG-Fusion improves retrieval by capturing diverse user intents, handling complex queries, and broadening document coverage to enable serendipitous discovery.
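A minimal Python sketch of the RRF step itself, operating on the per-query result lists returned by something like retriever.batch(queries); the function name and the k default of 60 are illustrative:

```python
def reciprocal_rank_fusion(results: list[list], k: int = 60) -> list:
    """Fuse several ranked document lists into one ranking by summed RRF scores."""
    scores, by_content = {}, {}
    for docs in results:                         # one ranked list per generated query
        for rank, doc in enumerate(docs):
            key = doc.page_content
            by_content[key] = doc
            # Lower ranks contribute less; k dampens the tail's influence.
            scores[key] = scores.get(key, 0.0) + 1.0 / (k + rank + 1)
    ranked_keys = sorted(scores, key=scores.get, reverse=True)
    return [by_content[key] for key in ranked_keys]   # best-supported documents first

# Usage: fused_docs = reciprocal_rank_fusion(retriever.batch(queries))
```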
Hypothetical Document Embeddings
The document explains the Hypothetical Document Embeddings (HyDE) technique for improving document retrieval in RAG (Retrieval-Augmented Generation) systems. HyDE works by first generating a hypothetical document that answers the user’s query using a large language model (LLM). This generated passage is then embedded and used to retrieve relevant documents based on vector similarity, as it tends to be closer in embedding space to relevant documents than the original query.
The process involves:
- Defining a prompt to generate the hypothetical document from the query.
- Passing the generated document to a retriever to find similar documents in a vector store.
- Using the retrieved documents as context in a final prompt to the LLM to produce the answer.
Code examples in Python and JavaScript illustrate how to implement each step using LangChain and OpenAI APIs.
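A minimal Python sketch of HyDE, assuming a retriever and chat model are already defined; the prompts are illustrative:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

hyde_prompt = ChatPromptTemplate.from_template(
    "Write a short passage that answers the question:\n{question}"
)
generate_hypothetical_doc = hyde_prompt | llm | StrOutputParser()

answer_prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

def hyde(question: str) -> str:
    fake_doc = generate_hypothetical_doc.invoke({"question": question})
    docs = retriever.invoke(fake_doc)             # search with the hypothetical answer
    context = "\n\n".join(d.page_content for d in docs)
    return (answer_prompt | llm).invoke({"context": context, "question": question}).content
```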
The document also discusses query transformation strategies, which involve rewriting or decomposing the original user query to improve retrieval. Techniques include removing irrelevant text, grounding queries with conversation history, broadening the search with related queries, and breaking complex questions into simpler ones. The choice of rewriting strategy depends on the use case.
Finally, the text hints at the next topic: routing queries to retrieve relevant data from multiple sources to build a robust RAG system.
Reference: Gao et al., “Precise Zero-Shot Dense Retrieval Without Relevance Labels,” arXiv, 2022.
Query Routing
Logical Routing
The text explains the concept of logical routing in LLM applications, where the model is given knowledge of multiple data sources and decides which one to use based on the user’s query. This is achieved using function-calling models like GPT-3.5 Turbo, which generate structured outputs conforming to a predefined schema to classify queries.
Key points include:
- Defining a schema (e.g., with Python’s Pydantic or JavaScript’s Zod) that specifies possible data sources (like "python_docs" or "js_docs").
- Using a prompt that instructs the LLM to route queries based on the programming language referenced.
- Invoking the LLM to produce JSON outputs indicating the chosen data source.
- Passing the LLM’s output into further logic (e.g., a function) to select the appropriate processing chain.
- Implementing resilience by using case-insensitive substring matching rather than exact string comparisons to handle slight deviations in LLM output.
- Logical routing is ideal when there is a fixed set of data sources (vector stores, databases, APIs) to choose from for accurate information retrieval.
Overall, logical routing leverages structured function calls to enable LLMs to reason about and select the most relevant data source for a given query, improving the accuracy and reliability of multi-source applications.
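A minimal Python sketch of logical routing with structured output, assuming langchain-openai's with_structured_output and Pydantic; the schema and downstream routing logic are illustrative:

```python
from typing import Literal
from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

class RouteQuery(BaseModel):
    """Route a user question to the most relevant data source."""
    datasource: Literal["python_docs", "js_docs"] = Field(
        description="The documentation set most relevant to the question."
    )

router_llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0).with_structured_output(RouteQuery)
router_prompt = ChatPromptTemplate.from_messages([
    ("system", "Route the question based on the programming language it refers to."),
    ("human", "{question}"),
])
router = router_prompt | router_llm

def choose_chain(route: RouteQuery) -> str:
    # Case-insensitive substring matching tolerates small deviations in the output.
    if "python_docs" in route.datasource.lower():
        return "chain for python_docs"
    return "chain for js_docs"

route = router.invoke({"question": "Why doesn't my PromptTemplate import work in Python?"})
print(choose_chain(route))
```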
Semantic Routing
The content explains semantic routing, a method that routes user queries to the most relevant data source by embedding both the query and various prompt templates, then using vector similarity (cosine similarity) to select the closest matching prompt. This approach contrasts with logical routing by leveraging semantic meaning rather than fixed rules.
Two example prompts are given—one for physics questions and one for math questions. Both prompts are embedded using OpenAI embeddings. When a user query arrives, it is embedded and compared to the prompt embeddings to find the most similar prompt. The selected prompt is then used to generate a response via a language model (ChatOpenAI).
Code examples in Python and JavaScript illustrate this process:
- Embedding prompts and queries
- Calculating cosine similarity
- Selecting the best prompt based on similarity
- Passing the routed prompt to a chat model for answering
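A minimal Python sketch of semantic routing, assuming langchain-openai embeddings; cosine similarity is computed directly with NumPy, and the two prompt templates are illustrative:

```python
import numpy as np
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

physics_template = "You are a physics professor. Answer concisely:\n{query}"
math_template = "You are a mathematician. Answer step by step:\n{query}"
templates = [physics_template, math_template]

embeddings = OpenAIEmbeddings()
template_vectors = np.array(embeddings.embed_documents(templates))

def route_and_answer(query: str) -> str:
    q = np.array(embeddings.embed_query(query))
    # Cosine similarity between the query and each prompt template.
    sims = template_vectors @ q / (np.linalg.norm(template_vectors, axis=1) * np.linalg.norm(q))
    chosen = templates[int(sims.argmax())]          # most similar prompt wins
    prompt = ChatPromptTemplate.from_template(chosen)
    return (prompt | ChatOpenAI(model="gpt-3.5-turbo")).invoke({"query": query}).content

print(route_and_answer("What is a black hole?"))
```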
The section concludes by transitioning to the next topic: transforming natural language queries into the query language of the target data source in retrieval-augmented generation (RAG) systems.
Query Construction
Text-to-Metadata Filter
The content explains how to perform metadata-filtered vector searches using LangChain’s SelfQueryRetriever. When embedding data into a vector store, metadata key-value pairs can be attached to vectors. Later, queries can specify filters on this metadata to narrow search results.
LangChain’s SelfQueryRetriever simplifies this by leveraging an LLM to convert natural language queries into structured metadata filters and semantic search queries. Users define the metadata schema (fields with names, descriptions, and types), which is included in the prompt to the LLM.
The retriever workflow is:
- The LLM receives the user query plus metadata schema and generates a rewritten search query plus metadata filters.
- The filters are parsed and translated into the vector store’s filter format.
- A similarity search is performed on the vector store, applying the metadata filters to restrict results.
Code examples in Python and JavaScript demonstrate defining metadata fields (e.g., genre, year, director, rating), initializing the retriever with an LLM and database, and invoking it with a natural language query like “What’s a highly rated (above 8.5) science fiction film?”
This approach enables combining semantic search with precise metadata filtering automatically derived from user queries.
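A minimal Python sketch, assuming an existing vectorstore of movie documents carrying the metadata fields described above; import paths for AttributeInfo vary slightly across LangChain versions:

```python
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import ChatOpenAI

metadata_field_info = [
    AttributeInfo(name="genre", description="The movie's genre", type="string"),
    AttributeInfo(name="year", description="Release year", type="integer"),
    AttributeInfo(name="director", description="The movie's director", type="string"),
    AttributeInfo(name="rating", description="A 1-10 rating", type="float"),
]

retriever = SelfQueryRetriever.from_llm(
    ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    vectorstore,                          # e.g., a PGVector store of movie summaries
    "Brief plot summary of a movie",      # description of the document contents
    metadata_field_info,
)

docs = retriever.invoke("What's a highly rated (above 8.5) science fiction film?")
```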
Text-to-SQL
The text discusses strategies for translating natural language queries into SQL using large language models (LLMs), emphasizing the importance of grounding SQL generation with accurate database descriptions and few-shot examples. Key points include:
- Database Description: Providing the LLM with detailed CREATE TABLE statements (including column names and types) and sample rows helps it generate accurate SQL queries.
- Few-shot Examples: Including example question-to-SQL pairs in the prompt improves query generation accuracy by guiding the model on expected outputs.
- Implementation: Code examples in Python and JavaScript demonstrate how to create a chain that converts user questions into SQL queries and executes them on a database (e.g., SQLite with the Chinook sample database) using LangChain and OpenAI’s GPT-4.
- Security Considerations: Running LLM-generated SQL queries directly on production databases poses risks. Recommended precautions include:
  - Using read-only database users.
  - Restricting access to only necessary tables.
  - Implementing query timeouts to prevent resource exhaustion.
The text highlights that security for LLM-driven database querying is an evolving area requiring ongoing attention.
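A minimal Python sketch, assuming the Chinook SQLite sample database and LangChain's SQL helpers; in production this should run as a read-only user with query timeouts, as noted above:

```python
from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain
from langchain_openai import ChatOpenAI

# The SQLDatabase wrapper exposes table definitions and sample rows to the prompt.
db = SQLDatabase.from_uri("sqlite:///Chinook.db")
llm = ChatOpenAI(model="gpt-4", temperature=0)

write_query = create_sql_query_chain(llm, db)          # natural language -> SQL string
sql = write_query.invoke({"question": "How many employees are there?"})
print(sql)
print(db.run(sql))                                     # execute against the database
```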
Chapter 4. Using LangGraph to Add Memory to Your Chatbot
Building a Chatbot Memory System
The text discusses two fundamental design decisions for building a robust chatbot memory system: how to store state and how to query it. A simple approach is to store the entire chat history as a list of messages, updating it by appending new messages after each turn, and including this history in the prompt to the model. This method enables context-aware responses, as demonstrated with example code in Python and JavaScript using LangChain, where previous conversation messages are passed to the model to answer follow-up questions effectively.
However, while this simple memory system works, scaling it for production introduces challenges such as ensuring atomic updates of memory after each interaction, storing memories in durable storage like databases, controlling which and how many messages are retained and used, and enabling inspection and modification of the stored state outside of LLM calls. The text hints at introducing better tooling to address these challenges in subsequent sections.
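A minimal Python sketch of that list-based memory, assuming langchain-openai; the entire history is re-sent to the model on every turn:

```python
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-3.5-turbo")
history = []                                      # the whole "memory system"

def chat(user_input: str) -> str:
    history.append(HumanMessage(content=user_input))
    ai_message = model.invoke(history)            # prior turns give the model context
    history.append(ai_message)                    # update memory after each turn
    return ai_message.content

print(chat("Hi, my name is Sam."))
print(chat("What is my name?"))                   # answered from the stored history
```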
Introducing LangGraph
The chapter introduces LangGraph, an open-source library by LangChain designed to build multiactor, multistep, stateful cognitive architectures called "graphs." These graphs enable complex applications by coordinating multiple specialized actors (e.g., LLM prompts, search engines) working together, much like a team of specialists collaborating.
Key concepts explained include:
- Multiactor: Multiple actors (nodes) collaborate, passing work among themselves (edges), requiring a coordination layer to define actors, manage handoffs, and schedule execution deterministically.
- Multistep: Interactions occur over multiple discrete steps, tracking the order and frequency of actor calls, with each handoff scheduling the next computation step until completion.
- Stateful: A central shared state tracks data across steps, allowing snapshotting, pausing/resuming execution, error recovery, and human-in-the-loop controls.
A LangGraph graph consists of:
- State: The data passed into the graph and modified during execution.
- Nodes: Functions representing steps that receive and update the state.
- Edges: Connections between nodes, either fixed or conditional, defining execution flow.
LangGraph provides visualization and debugging tools and supports scalable production deployment. Installation instructions for Python and JavaScript are given.
To demonstrate LangGraph, the chapter plans to build a simple chatbot that responds directly to user messages, illustrating core LangGraph concepts with a single LLM call.
Creating a StateGraph
The content explains how to build a simple chatbot using LangGraph by defining a state, adding a node for an LLM call, connecting nodes with edges, and running the graph.
Key steps:
- Define the graph state
  - The state is a schema describing the data the graph manages.
  - Here, the state has a single key, messages, a list of chat messages.
  - A reducer function (add_messages) is used to append new messages instead of overwriting the list.
  - This is shown in both Python (using Annotated and TypedDict) and JavaScript (using Annotation with a reducer).
- Add a chatbot node
  - Nodes represent units of work, typically functions.
  - The chatbot node takes the current state, calls an LLM (ChatOpenAI), and returns new messages to append.
  - This updates the state by adding the LLM’s response to the messages list.
- Add edges to define execution flow
  - Edges connect nodes and define the start and end of the graph execution.
  - The graph starts at START, runs the chatbot node, then ends at END.
  - The graph is compiled into a runnable object with invoke and stream methods.
- Visualize the graph
  - The graph can be visualized using a Mermaid diagram.
- Run the graph
  - Input is provided as a dictionary/object matching the state shape (e.g., {"messages": [HumanMessage('hi!')]}).
  - The stream() method runs the graph and streams state updates after each step.
  - Output shows the chatbot’s AI message appended to the messages list.
Overall, this example demonstrates how to build a simple chatbot workflow with LangGraph by defining state, nodes, edges, and running the graph with streaming output.
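A minimal Python sketch of the graph described above, assuming the langgraph and langchain-openai packages:

```python
from typing import Annotated, TypedDict
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages

class State(TypedDict):
    # The add_messages reducer appends new messages instead of overwriting the list.
    messages: Annotated[list, add_messages]

model = ChatOpenAI(model="gpt-3.5-turbo")

def chatbot(state: State) -> dict:
    # Call the LLM on the conversation so far; the returned message is appended to state.
    return {"messages": [model.invoke(state["messages"])]}

builder = StateGraph(State)
builder.add_node("chatbot", chatbot)
builder.add_edge(START, "chatbot")
builder.add_edge("chatbot", END)
graph = builder.compile()

# stream() yields state updates after each node finishes.
for update in graph.stream({"messages": [HumanMessage("hi!")]}):
    print(update)
```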