This section covers the core building blocks of a RAG-style pipeline in LangChain: loading documents, structuring them as Document objects, splitting them into chunks, embedding them as vectors, and storing/searching them in a vector database.
Document Loading
TextLoader
`TextLoader` from `langchain_community.document_loaders` loads a text file into a list of `Document` objects.

Key ideas:

- Call `TextLoader(file_path).load()`
- Output is a list, typically with one `Document` for one text file
- Each document contains:
  - `page_content`: the file text
  - `metadata`: source path and related info

Why it matters:

- automatically reads content
- preserves provenance through metadata
- fits naturally into retrieval pipelines
WebLoader
`WebLoader` loads content directly from a web page.

Key ideas:

- Instantiate with a URL and call `.load()`
- Optional settings include:
  - `proxies`
  - `verify_ssl`
  - `header_template`
  - `encoding`
  - `requests_per_second`
  - `bs_kwargs`

Useful details:

- `bs_kwargs={"features": "html.parser"}` lets Beautiful Soup use a specific parser
- You can parse the full page or target specific HTML sections

Requirements:

- `bs4` must be installed

Result:

- returns `Document` objects containing scraped page content and source URL metadata
Lazy Loading with DirectoryLoader
For many files, especially large collections, lazy loading is more memory-efficient.
Key ideas:

- `DirectoryLoader` can scan a directory
- `TextLoader` can be used as the `loader_cls`
- `glob` patterns select matching files, including subdirectories
- lazy loading yields documents incrementally instead of all at once

Benefit:

- reduces memory usage when processing large datasets
Document Structure
LangChain’s `Document` is the standard container for text plus metadata.

Main fields:

- `page_content`: required text
- `metadata`: optional dictionary with fields like source, author, tags, date, or custom labels

Notes:

- documents are often created by loaders
- they can also be created manually
- updates are usually done by creating a new `Document`

Why it matters:

- all downstream tasks such as splitting, embedding, and retrieval operate on this structure
Text Splitting and Chunking
RecursiveCharacterTextSplitter
This is presented as a strong default text splitter.
Typical configuration:

- `chunk_size=500`
- `chunk_overlap=50`
- separators like:
  - paragraph breaks
  - line breaks
  - spaces
  - character-level fallback

How it works:

- recursively tries to split on larger, more meaningful boundaries first
- preserves coherence better than naive fixed-length splitting

Usage:

- `split_text()` for strings
- `split_documents()` for `Document` lists

Why it matters:

- improves chunk quality for embeddings, retrieval, summarization, and QA
Chunk Size Comparison
Chunk size has a major impact on retrieval behavior.
Example comparison:

- `chunk_size=200` → 6 chunks
- `chunk_size=500` → 3 chunks
- `chunk_size=1000` → 1 chunk

Tradeoff:

- smaller chunks:
  - more precise retrieval
  - less context per chunk
- larger chunks:
  - more context
  - less precise retrieval

Takeaway:

- there is no universal best chunk size
- choose based on document type, queries, and model context window
Overlap Importance
Chunk overlap helps preserve context across boundaries.
Without overlap:

- important phrases may be split apart
- retrieval may miss complete meaning

With overlap:

- some text is repeated between chunks
- related details are more likely to stay together

Takeaway:

- overlap adds small redundancy but often improves retrieval reliability
Markdown Header Splitter
Useful for Markdown documents such as docs, READMEs, wikis, and notes.
Key ideas:

- split based on headers like `#`, `##`, and `###`
- each chunk retains metadata about header hierarchy

Benefit:

- preserves document structure and section context for better retrieval
Code Splitter
Code should be split with language awareness.
Approach:

- use `RecursiveCharacterTextSplitter.from_language(…)`
- specify the programming language, such as Python
- use chunking settings like `chunk_size=500` and `chunk_overlap=50`

Benefit:

- preserves meaningful units such as classes and functions
- improves code retrieval quality

Takeaway:

- always use a language-aware splitter for code when possible
PDF Splitting
A common real-world workflow uses `PyPDFLoader` plus `RecursiveCharacterTextSplitter`.

Pattern:

- load PDF pages as `Document` objects
- split with `split_documents()`
- preserve metadata such as page number and creation details

Why it matters:

- works for real document formats
- keeps metadata intact
- prepares chunks for embeddings and RAG
Embeddings
OpenAI Embeddings
Embeddings convert text into vectors representing semantic meaning.
Models mentioned:

- `text-embedding-3-small` → 1536 dimensions
- `text-embedding-3-large` → 3072 dimensions
- `text-embedding-ada-002` → deprecated

Typical usage:

- create `OpenAIEmbeddings(model="text-embedding-3-small")`
- `embed_query(text)` for one string
- `embed_documents(texts)` for multiple strings

Why it matters:

- embeddings are the basis for similarity search and vector retrieval
Free / Local Embedding Options
LangChain supports alternative embedding providers with similar APIs.
Options mentioned:

- Hugging Face / Sentence Transformers
  - example: `all-MiniLM-L6-v2`
  - 384 dimensions
- Ollama embeddings
  - requires the `langchain-ollama` package

Benefit:

- easy to switch providers
- useful for local or no-cost setups
Embeddings and Similarity Search Basics
Core retrieval process:

- embed documents
- embed the query
- compare vectors
- rank by similarity

Key ideas:

- semantically similar texts lie near each other in vector space
- cosine similarity is commonly used
- OpenAI embeddings are typically normalized, so direction matters more than vector length

Use case:

- programming-related queries rank programming documents above unrelated ones
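The ranking step above can be shown with plain Python and toy 2-dimensional vectors (real embeddings have hundreds or thousands of dimensions, but the math is identical):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: a "programming" query should point near programming docs.
query = [1.0, 0.0]
docs = {"programming": [0.9, 0.1], "cooking": [0.1, 0.9]}

ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]),
                reverse=True)
print(ranked)  # ['programming', 'cooking']
```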
Embedding Caching
Caching avoids recomputing the same embeddings.
Approach:

- wrap embeddings with `CacheBackedEmbeddings`
- use `LocalFileStore` as the cache

Benefit:

- reduces API cost
- reduces latency
- repeated requests return cached vectors
Vector Stores with Chroma
Basic Chroma Setup
Chroma is used as the vector store.
Typical workflow:

- create sample `Document` objects
- choose an embedding model
- create the store with `Chroma.from_documents(…)`
- optionally persist to disk

This handles:

- embedding the documents
- storing the vectors
- saving the database
Similarity Search
Basic search:

- `similarity_search(query, k=2)`

Result:

- returns the top `k` most relevant documents

Benefit:

- simple semantic retrieval without manually implementing vector search
Similarity Search with Scores
Use:

- `similarity_search_with_score()`

Important detail:

- Chroma returns distance scores here, not similarity scores
- smaller scores mean better matches

Example interpretation:

- a distance of `0.066` is more relevant than `1.34`

If needed, a similarity-like value can be approximated from the distance.
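One common heuristic for that conversion, in plain Python (this is an illustrative mapping, not Chroma's internal formula):

```python
def distance_to_similarity(distance: float) -> float:
    """Map a non-negative distance to a (0, 1] similarity-like score.

    Smaller distances map to scores closer to 1.0.
    """
    return 1.0 / (1.0 + distance)

print(distance_to_similarity(0.066))  # ~0.94 -> strong match
print(distance_to_similarity(1.34))   # ~0.43 -> weaker match
```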
Metadata Filtering
Search can combine semantic similarity with metadata constraints.
Example:

- `filter={"topic": "database"}`

Effect:

- only documents matching the metadata filter are considered
- results are then ranked semantically within that subset

Why it matters:

- narrows retrieval by source, topic, type, date, and other fields
Persistence
Chroma can be persisted locally and reloaded later.
Pattern:

- set a `persist_directory`
- build the vector store
- reload from the same directory after restart
- verify searches still work

Why it matters:

- avoids rebuilding embeddings every session
- enables local inspection of stored data
- supports practical long-running applications
Vector Store as Retriever
A vector store can be converted into a retriever with `as_retriever()`.

Modes:

- `search_type="similarity"`
  - returns the top most relevant documents
- `search_type="mmr"`
  - uses Maximum Marginal Relevance for relevance plus diversity

Important parameter:

- `fetch_k` controls how many candidates are considered before selecting final results

Use cases:

- similarity for precision
- MMR for broader, less redundant retrieval
End-to-End Exercise
The final exercise combines the full workflow:
- convert raw texts into `Document` objects
- split them with `RecursiveCharacterTextSplitter`
- embed the chunks
- store them in Chroma
- return a retriever

This demonstrates the standard RAG pipeline:

- documents
- chunking
- embeddings
- vector storage
- retrieval
Overall Takeaway
The chapter presents a practical LangChain workflow for retrieval systems:
- load documents from files, web pages, directories, and PDFs
- represent them as `Document` objects with metadata
- split them into meaningful chunks
- convert chunks into embeddings
- store and search them in Chroma
- expose the vector store as a retriever for downstream chains

The main theme is that good retrieval depends on the full pipeline, especially:

- preserving metadata
- choosing appropriate chunk size and overlap
- using suitable embeddings
- selecting the right retrieval strategy