37. Hands-on ~ TextLoaders
The passage explains how to use LangChain’s TextLoader from
langchain_community.document_loaders to load a text file as a
document.
Main points
- Import standard modules like os, tempfile, and Path.
- Install langchain-community to access document loaders.
- Use TextLoader(file_path) and call .load() to read a text file.
- The result is a list of documents.
Document contents
Each loaded document has:
- page_content: the text from the file
- metadata: extra info such as the file source path
Example workflow
- Create a temporary .txt file.
- Write sample text into it.
- Load it with a helper function:

def load_text_file(file_path: str):
    loader = TextLoader(file_path)
    documents = loader.load()
    return documents
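Putting the steps together, a minimal sketch (the file name and sample text are illustrative):

import tempfile
from pathlib import Path

from langchain_community.document_loaders import TextLoader

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a temporary .txt file and write sample text into it
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("LangChain makes loading documents easy.")

    documents = load_text_file(str(sample_path))
    print(len(documents))             # 1
    print(documents[0].page_content)  # the file text
    print(documents[0].metadata)      # includes the source path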
Expected output
- len(documents) is 1
- documents[0].page_content contains the file text
- documents[0].metadata includes the source path
Why this matters
Loaders are useful because they:
- read document content
- attach metadata automatically
- help in retrieval and document-processing pipelines by tracking where data came from
38. Hands-on ~ WebBaseLoader
Next, let’s look at the WebBaseLoader.
First, I’m going to define a new function called demo. Then I’ll import WebBaseLoader and instantiate it by passing in a URL. In this example, I’m using a simple Wikipedia page for web scraping.
After that, I call the load() method, just like before. This returns the documents from the web page.
There are several optional parameters you can pass to WebBaseLoader, including:
- proxies
- verify_ssl
- header_template
- encoding
- requests_per_second
and more, depending on what you need.
One useful option I want to highlight is bs_kwargs. This lets you pass arguments to Beautiful Soup. For example, you can specify the parser with:
bs_kwargs={"features": "html.parser"}
Since HTML pages are being parsed, this is a common setup.
You can also control what part of the page gets parsed. For example, you might target a specific element like a div, or leave it as None if you want to parse the whole page.
Now let’s print a content preview. I’ll display the source, content length, and a preview of the loaded document.
When I run it, I hit an issue: WebBaseLoader depends on Beautiful Soup, which needs to be installed first.
So I add bs4 to the environment and run it again. This time it works.
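A minimal sketch of the whole demo, assuming the Wikipedia article on web scraping as the example URL:

from langchain_community.document_loaders import WebBaseLoader  # needs: pip install bs4

def demo():
    loader = WebBaseLoader(
        "https://en.wikipedia.org/wiki/Web_scraping",  # assumed example URL
        bs_kwargs={"features": "html.parser"},         # passed through to Beautiful Soup
    )
    documents = loader.load()

    doc = documents[0]
    print("Source:", doc.metadata["source"])
    print("Length:", len(doc.page_content))
    print("Preview:", doc.page_content[:200])

demo()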
Now you can see that one document was loaded from the web. It shows:
- the source URL
- the content length
- a preview of the page content
So it successfully went to the Wikipedia page, scraped the content, and created a document from it.
We can also change the URL and load from other web pages as well. When I do that, it again returns one document with the new URL, its length, and a preview of the extracted content.
39. Hands-on ~ Lazy Loader
The passage explains a simple example of using lazy loading to efficiently load many files, especially large datasets.
- A temporary directory is created with some sample .txt files.
- A DirectoryLoader is configured to load files from that directory.
- TextLoader is set as the loader_cls because the files are text files.
- A glob pattern is used so only .txt files, including those in subdirectories, are selected.
- Instead of loading everything at once, lazy loading loads documents incrementally, which saves memory.
- The example prints both document contents and metadata, including the source field, to show where each file came from.
- Running the lazy loader confirms that the files are loaded correctly one by one (see the sketch below).
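A minimal sketch of this pattern (file names and contents are illustrative):

import tempfile
from pathlib import Path

from langchain_community.document_loaders import DirectoryLoader, TextLoader

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create sample .txt files, one of them in a subdirectory
    Path(tmp_dir, "a.txt").write_text("First sample file.")
    sub = Path(tmp_dir, "nested")
    sub.mkdir()
    (sub / "b.txt").write_text("Second sample file.")

    loader = DirectoryLoader(
        tmp_dir,
        glob="**/*.txt",        # include .txt files in subdirectories
        loader_cls=TextLoader,  # each matched file is read as plain text
    )

    # lazy_load() yields one document at a time instead of building the full list
    for doc in loader.lazy_load():
        print(doc.metadata["source"], "→", doc.page_content)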
Overall, it shows how lazy loading can be a practical, memory-efficient approach for working with large collections of files.
40. Hands-on ~ Document Structure
LangChain’s Document class is a structured container for text and
metadata. It is typically created by loaders, but you can also construct
it manually.
Main parts of a Document
- page_content: required string field containing the actual text
- metadata: optional dictionary for extra details like source, author, tags, creation date, or custom labels
Example
from langchain_core.documents import Document

doc = Document(
    page_content="This is a sample document",
    metadata={
        "source": "sample.txt",
        "creation": "manual",
        "author": "Paolo",
        "length": 25,
        "tags": ["sample", "demo"],
        "created_at": "2024-01-01"
    }
)
Inspecting and updating
- Printing a Document shows its text and metadata clearly.
- Since documents are usually treated as immutable, updates are made by creating a new Document with modified content or metadata.
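For example, a sketch of such an “update” (the reviewed flag is illustrative):

# Build a new Document instead of mutating the existing one
updated_doc = Document(
    page_content=doc.page_content,
    metadata={**doc.metadata, "reviewed": True},  # copy the metadata and add a key
)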
Why it matters
Understanding Document helps you:
- work with LangChain loader outputs
- add custom metadata
- prepare data for splitting, embedding, and vector databases
In short, a Document is simply text plus flexible metadata.
43. Hands-on ~ Text Splitter - RecursiveCharacterTextSplitter
This example explains how to split text into chunks using LangChain, focusing on RecursiveCharacterTextSplitter.
Main points
- A file called TextSplitters.py is used with imported text-splitting tools, Language, Document, and .env loading.
- Two sample inputs are mentioned:
  - A text document about machine learning
  - A code sample for testing how splitting works on code
- RecursiveCharacterTextSplitter is configured with:
  - chunk_size=500
  - chunk_overlap=50
  - separators: ["\n\n", "\n", " ", ""]
- It splits text hierarchically, trying to preserve meaning and structure: paragraphs, sentences, words, then characters.
- Usage note:
  - Use split_text() for plain strings
  - Use split_documents() for Document objects
Result inspection
The example shows checking (all demonstrated in the sketch below):
- original text length
- number of chunks
- chunk sizes
- a preview of the first chunk
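A minimal sketch tying the configuration and inspection together (the sample text is illustrative):

from langchain_text_splitters import RecursiveCharacterTextSplitter

sample_text = "Machine learning lets systems learn patterns from data. " * 20

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""],  # paragraphs → lines → words → characters
)
chunks = splitter.split_text(sample_text)

print("Original length:", len(sample_text))
print("Number of chunks:", len(chunks))
print("Chunk sizes:", [len(c) for c in chunks])
print("First chunk preview:", chunks[0][:100])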
Why it’s useful
Recursive splitting keeps related ideas together, producing more coherent chunks that work better for:
- embeddings
- retrieval
- summarization
- question answering
Key takeaway
RecursiveCharacterTextSplitter is presented as a strong default choice
for chunking text while preserving natural boundaries.
44. Hands-on ~ Chunk Comparison
The passage explains a chunk size comparison for text splitting in a RAG workflow.
Main idea
Chunk size controls how much context is stored in each vector chunk, which affects retrieval quality later.
What the code does
- Defines chunk sizes: 200, 500, 1000
- Prints "Chunk size comparison"
- For each size (see the sketch below):
  - Creates a RecursiveCharacterTextSplitter
  - Sets chunk_overlap to about 20% of the chunk size
  - Splits the text
  - Prints how many chunks are produced
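A sketch of that comparison loop (the sample text is assumed):

from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "Retrieval-augmented generation combines search with LLMs. " * 30  # illustrative

print("Chunk size comparison")
for chunk_size in [200, 500, 1000]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size * 0.2),  # ~20% of the chunk size
    )
    chunks = splitter.split_text(text)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks")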
Observed results
- 200 → 6 chunks
- 500 → 3 chunks
- 1000 → 1 chunk
Why this happens
Smaller chunk sizes cut the same text into more pieces, even with overlap added.
Why it matters
- Smaller chunks: better retrieval precision, but less context per chunk
- Larger chunks: more context, but less precise retrieval
- 500 is suggested as a reasonable middle ground in this example
Key takeaway
There is no single best chunk size. It should be chosen based on:
- document type
- query patterns
- LLM context window
The point of the demo is that chunk size strongly affects retrieval performance, so it must be tested for each use case.
45. Hands-on ~ Overlap Importance In Code
Overlap in text splitting helps preserve context at chunk boundaries.
Main idea
When text is split into chunks:
- Without overlap, important phrases can be cut in half between chunks.
- With overlap, chunks repeat some text, so context is shared across boundaries.
Why it matters
This makes retrieval more reliable because:
- key information is less likely to be lost,
- related details stay together in at least one chunk,
- the retriever is more likely to return a complete answer instead of partial context.
Example
If a sentence about an API expiring is split across two chunks, one chunk may mention the expiration and another may mention the fix. Without overlap, the retriever may only find one piece. With overlap, a chunk can contain both the problem and the solution.
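A small sketch that makes the difference visible (the text, chunk size, and overlap values are illustrative):

from langchain_text_splitters import RecursiveCharacterTextSplitter

text = (
    "The legacy API will expire on June 1. "
    "To keep your integration working, migrate to the v2 endpoint before that date."
)

for overlap in [0, 30]:
    splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=overlap)
    print(f"chunk_overlap={overlap}:")
    for chunk in splitter.split_text(text):
        print("  ", repr(chunk))  # with overlap, boundary text appears in both chunks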
Takeaway
Overlap is like “cheap insurance” for retrieval: a little redundancy improves the chance that important context is available when needed.
46. Hands-on ~ Markdown Header Splitter
The Markdown Header Text Splitter in LangChain is used to split Markdown documents into chunks based on header structure like #, ##, and ###. You specify which headers to split on, pass them to the splitter, and then call split_text() on your Markdown content.
Each resulting chunk includes:
- the text content
- metadata showing the header hierarchy
This preserves context, so chunks remain tied to where they appear in the document. It’s especially useful for structured Markdown content such as documentation, README files, wikis, and notes, because it keeps both the meaning and location of the text intact for better retrieval and downstream use.
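A minimal sketch (the Markdown sample is illustrative):

from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_text = """# Guide
Intro paragraph.

## Setup
Install the package.

### Notes
Extra details.
"""

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown_text)

for chunk in chunks:
    print(chunk.metadata, "→", chunk.page_content)  # the header hierarchy rides along as metadata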
47. Hands-on ~ Code Splitter
The passage explains that splitting code is more complex than splitting
plain text because code has syntax and logical structure. It describes
using a language-aware recursive character text splitter with
from_language, set to Python, along with a chunk size of 500
and overlap of 50.
The main point is that specifying the language helps preserve meaningful code units like functions and classes, instead of cutting them into incoherent pieces. In the example, the code is split into two chunks, and the function definition remains intact. This improves retrieval quality, making it more likely to return the right code block when answering programming questions.
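A minimal sketch of this setup (the sample code is illustrative; a snippet this short may fit into a single chunk):

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

python_code = '''
def greet(name):
    """Return a greeting for the given name."""
    return f"Hello, {name}!"

class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return greet(self.name)
'''

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,  # Python-aware separators (class, def, ...)
    chunk_size=500,
    chunk_overlap=50,
)
chunks = splitter.split_text(python_code)
print(len(chunks))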
Overall takeaway: always use from_language with the correct
language when splitting code so the chunks stay coherent and
syntax-aware.
48. PDF Document Splitting
The passage explains how to load and split a real PDF document in
LangChain using PyPDFLoader and RecursiveCharacterTextSplitter.
Main steps
- Import the loader and splitter
  - PyPDFLoader loads the PDF.
  - RecursiveCharacterTextSplitter splits the content.
- Load the PDF
  - The PDF is loaded into a list of Document objects, usually one per page.
- Create a splitter
  - Configure chunk size and overlap.
- Split documents
  - Use split_documents() instead of split_text() because the input is a list of Document objects.
  - The output chunks remain Document objects with preserved metadata.
- Inspect output
  - You can print the chunk text and metadata to see source information like page number, creator, and creation date.
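These steps as a minimal sketch (the PDF path is hypothetical; the chunk size and overlap are assumptions):

from langchain_community.document_loaders import PyPDFLoader  # needs: pip install pypdf
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("example.pdf")  # hypothetical file path
pages = loader.load()                # one Document per page

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(pages)  # chunks keep each page's metadata

print(len(pages), "pages →", len(chunks), "chunks")
print(chunks[0].metadata)  # e.g. source, page number, creator, creation date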
Why it matters
This workflow is useful because it:
- handles real file types like PDFs
- preserves metadata
- creates chunks suitable for search, embeddings, and RAG
- supports many loaders and splitters for different document types
Key takeaway
For real documents, LangChain’s standard pattern is:
load documents → split documents → keep metadata → use chunks for retrieval or downstream NLP tasks
51. Hands-on ~ OpenAI Embedding
The passage explains how to create embeddings in practice using OpenAI embeddings through LangChain.
Key points
- OpenAI provides several embedding models:
  - text-embedding-3-small → 1,536 dimensions
  - text-embedding-3-large → 3,072 dimensions
  - text-embedding-ada-002 → deprecated
- More dimensions can capture richer meaning, but usually cost more.
- For the example, text-embedding-3-small is used because it is cheaper and suitable for general-purpose use.
How to use it in LangChain
- Import the wrapper: from langchain_openai import OpenAIEmbeddings
- Create the embeddings model: embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
Embedding text
- Use embed_query() for a single text: embedding = embeddings.embed_query(text)
- This returns a vector: a list of numbers representing the text’s semantic meaning.
- You can check the vector size with len(embedding), which should be 1536.
Embedding multiple documents
- Use embed_documents() to embed a list of texts at once: embeddings_list = embeddings.embed_documents(texts)
- This returns one vector per document, each with 1,536 values.
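Put together, a minimal sketch (assumes OPENAI_API_KEY is set; the sample texts are illustrative):

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Single text
embedding = embeddings.embed_query("LangChain makes RAG pipelines easier.")
print(len(embedding))  # 1536

# Multiple texts at once
texts = ["First document.", "Second document."]
embeddings_list = embeddings.embed_documents(texts)
print(len(embeddings_list), len(embeddings_list[0]))  # 2 1536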
Main takeaway
Embeddings turn text into numerical vectors that capture meaning. These vectors are essential for:
- similarity search
- vector databases
- retrieval-augmented generation (RAG) systems
52. Free Embedding Models
The passage explains that while OpenAI embeddings like
text-embedding-3-small are inexpensive, they still incur API cost. If
you want a fully local, free alternative, you can use LangChain wrapper
classes for:
- Hugging Face / Sentence Transformers embeddings
  - Example model: all-MiniLM-L6-v2
  - Uses 384 dimensions, which is smaller than many OpenAI models
  - Good for testing and smaller projects
  - Same general usage pattern as OpenAI embeddings
- Ollama embeddings
  - Also supported through LangChain
  - Requires installing langchain-ollama first
  - You pass a model name and use it similarly to other embedding providers
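A hedged sketch of both options (requires pip install langchain-huggingface and langchain-ollama; the Ollama model name is an assumption):

# Hugging Face / Sentence Transformers: runs fully locally
from langchain_huggingface import HuggingFaceEmbeddings

hf_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
print(len(hf_embeddings.embed_query("hello")))  # 384

# Ollama: needs a running Ollama server
from langchain_ollama import OllamaEmbeddings

ollama_embeddings = OllamaEmbeddings(model="nomic-embed-text")  # assumed model name
print(len(ollama_embeddings.embed_query("hello")))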
Overall, the key point is that LangChain makes switching embedding backends easy, so you can use OpenAI, Hugging Face, or Ollama with nearly the same code.
53. Hands-on ~ Embeddings Deep Dive - Basics - Similarity Search
This section explains how to use embeddings with LangChain and why they matter for retrieval and RAG.
Main points
- Create one reusable OpenAIEmbeddings model.
- Generate:
  - a single embedding with embed_query()
  - batch embeddings with embed_documents()
- Inspect embedding properties like:
  - vector length
  - first few values
  - vector norm
- Use cosine similarity to compare a query embedding with document embeddings.
- Rank documents by similarity to find the most relevant ones.
Key idea
Embeddings convert text into vectors so similar meanings are placed near each other in vector space. In the example, a query about programming languages ranks Python and JavaScript higher than unrelated documents like cats or machine learning.
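A sketch of that ranking (the sample documents and query are illustrative):

import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

documents = [
    "Python is a popular programming language.",
    "JavaScript runs in every web browser.",
    "Cats sleep for most of the day.",
]
doc_vectors = np.array(embeddings.embed_documents(documents))
query_vector = np.array(embeddings.embed_query("Which programming languages should I learn?"))

# Cosine similarity: dot product divided by the vector norms
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
for doc, score in sorted(zip(documents, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")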
Why normalization matters
OpenAI embeddings are typically normalized, so vector comparisons depend more on direction than length. This helps make similarity scores more meaningful.
Relevance to RAG
This is the basic retrieval process used in RAG:
- embed documents
- embed the query
- compare them
- return the most relevant results
Overall, the section introduces the core workflow behind embedding-based search and retrieval.
54. Hands-on ~ Embedding Caching
The passage explains that caching embeddings is useful because it
prevents repeated API calls, reducing both latency and cost. It shows
how to wrap an OpenAIEmbeddings model with CacheBackedEmbeddings
using a LocalFileStore cache. In the example, the first call to
embed_query() computes and stores the embedding, while the second
call retrieves the same result from cache. The two outputs should match,
confirming the cache works.
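A minimal sketch (the cache directory is illustrative; query_embedding_cache=True is passed so that embed_query() results are cached too, which recent LangChain versions support):

from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_openai import OpenAIEmbeddings

underlying = OpenAIEmbeddings(model="text-embedding-3-small")
store = LocalFileStore("./embedding_cache")  # illustrative cache directory

cached = CacheBackedEmbeddings.from_bytes_store(
    underlying,
    store,
    namespace=underlying.model,   # avoids collisions between different models
    query_embedding_cache=True,   # cache embed_query() results as well
)

first = cached.embed_query("What is a vector store?")   # computed via the API, then stored
second = cached.embed_query("What is a vector store?")  # served from the local cache
print(first == second)  # True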
56. Hands-on ~ Setting Up Chroma and Running Chroma Basics
The passage explains how to use Chroma as a vector store in LangChain.
Main points
- Import Chroma from langchain_chroma.
- Also import:
  - Document from LangChain Core
  - RecursiveCharacterTextSplitter from LangChain text splitters
  - TemporaryDirectory from Python’s tempfile
- Use an embedding model, specifically OpenAIEmbeddings with text-embedding-3-small.
What the code does
- Creates a few sample Document objects with page content and metadata.
- Builds a Chroma vector store using:

Chroma.from_documents(
    documents=simple_documents,
    embedding=embedding_model,
    persist_directory=tmp_dir
)

- This:
  - embeds the documents
  - stores them in Chroma
  - saves them to disk
Search example
- Runs a similarity search like:

query = "What is LangChain?"
results = vector_store.similarity_search(query, k=2)

- k=2 means return the two most relevant documents.
- The top result is the LangChain description from the docs, and the next result may be related content like LangGraph.
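End to end, a minimal sketch (the sample documents are illustrative):

from tempfile import TemporaryDirectory

from langchain_chroma import Chroma  # needs: pip install langchain-chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

simple_documents = [
    Document(page_content="LangChain is a framework for building LLM applications.",
             metadata={"source": "docs"}),
    Document(page_content="LangGraph builds stateful agent workflows on top of LangChain.",
             metadata={"source": "docs"}),
]

with TemporaryDirectory() as tmp_dir:
    vector_store = Chroma.from_documents(
        documents=simple_documents,
        embedding=embedding_model,
        persist_directory=tmp_dir,  # embeddings are saved to disk here
    )
    results = vector_store.similarity_search("What is LangChain?", k=2)
    for doc in results:
        print(doc.page_content)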
Key takeaway
LangChain makes it very easy to work with Chroma:
- no need to implement your own vector search
- from_documents simplifies setup
- documents, embeddings, and persistence are handled in one place
Overall, it shows a straightforward way to create, persist, and query a vector store.
58. Hands-on ~ Similarity Search with Scores
The passage explains how to use LangChain’s Chroma wrapper to perform a similarity search with scores.
Key points:
- Set up a vector store using a temporary directory, sample documents, an embedding model, and a persistence directory.
- Instead of similarity_search, use vector_store.similarity_search_with_score() to retrieve documents along with their scores.
- In the example, the query is "explain vector store" and the top 3 results are returned.
Important note about scores:
- The scores returned by Chroma here are distance scores, not similarity scores.
- For distance scores, smaller values mean better matches.
- Example interpretation:
  - 0.066 = most relevant
  - 1.34 = least relevant
It also notes that if you want a similarity value from a distance score, you can convert it with:
similarity = 1 / (1 + distance)
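A brief sketch, reusing the vector_store from the previous section:

results = vector_store.similarity_search_with_score("explain vector store", k=3)

for doc, distance in results:
    similarity = 1 / (1 + distance)  # convert distance into a 0-1 similarity value
    print(f"distance={distance:.3f}  similarity={similarity:.3f}  {doc.page_content[:60]}")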
Overall, the main takeaway is that similarity search with scores is straightforward, but you must know whether your vector store returns distance or similarity scores.
59. Hands-on ~ Metadata Filtering
The passage explains metadata filtering in similarity search.
- A normal similarity_search() returns documents based only on semantic similarity.
- You can add a filter argument, such as {"topic": "database"}, to restrict results to documents whose metadata matches that condition.
- This means the search still uses meaning to rank documents, but only among documents that satisfy the metadata filter.
- It’s useful for narrowing results by attributes like topic, source, date, or type (see the sketch below).
- Because of filtering, you may get fewer results than the requested k if not enough documents match the metadata.
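A small sketch against a Chroma vector_store whose documents carry a topic metadata key (assumed from the example filter):

# Unfiltered: ranked purely by semantic similarity
results = vector_store.similarity_search("How do I store vectors?", k=2)

# Filtered: only documents whose metadata has topic == "database" are considered
filtered = vector_store.similarity_search(
    "How do I store vectors?",
    k=2,
    filter={"topic": "database"},
)
for doc in filtered:
    print(doc.metadata, "→", doc.page_content[:60])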
Main takeaway: metadata filtering makes vector search more precise by combining semantic relevance with metadata constraints.
60. Hands-on ~ Chroma DB Persistence
This section explains how to persist a Chroma vector store to disk, reload it after a restart, and verify that it still works.
Key points
- Set a persist_directory such as "chroma_db".
- Create the vector store with Chroma.from_documents(…) using your loaded documents and embedding function.
- Save it locally, then confirm how many documents were stored.
- Simulate a restart by deleting the in-memory vector store and reloading it from the same directory.
- Check the document count again to confirm it reloaded successfully.
- Run a similarity search to verify the database still functions after reloading (see the sketch below).
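A minimal sketch of the persist-and-reload cycle (documents is assumed to hold previously loaded documents; _collection.count() is a private attribute, used here only to check the stored count):

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

# Build and persist
vector_store = Chroma.from_documents(
    documents=documents,            # previously loaded documents
    embedding=embedding_model,
    persist_directory="chroma_db",
)
print(vector_store._collection.count())  # how many documents were stored

# Simulate a restart: drop the in-memory object, then reload from disk
del vector_store
reloaded = Chroma(
    persist_directory="chroma_db",
    embedding_function=embedding_model,
)
print(reloaded._collection.count())                   # same count as before
print(reloaded.similarity_search("test query", k=1))  # search still works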
Behind the scenes
- Chroma stores data locally in the specified persistence folder.
- This means you can:
  - create the vector store,
  - persist it,
  - delete it from memory,
  - reload it later,
  - and continue searching normally.
Inspecting the database
- Since Chroma uses SQLite internally, you can inspect the database files directly with a SQLite extension in VS Code.
- Useful tables include:
  - collections
  - embeddings
  - embedding_metadata
Why it matters
Persisting Chroma lets you:
- keep vector data locally,
- avoid rebuilding embeddings every time,
- inspect stored data manually,
- and confirm your vector database is working as expected.
61. Hands-on ~ Vector Store as a Retriever for Chains
The passage explains how to use a vector store as a retriever.
- You can convert a vector store into a retriever with as_retriever().
- A similarity retriever (search_type="similarity") returns the top k most relevant documents for a query.
- The retriever can be queried with .invoke(), which returns matching documents without generating an LLM answer yet.
- An MMR retriever (search_type="mmr") uses Maximum Marginal Relevance to return results that are both relevant and diverse.
- fetch_k controls how many candidate documents are first considered before selecting the final k results (both modes are shown in the sketch below).
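Both modes in a minimal sketch, assuming an existing vector_store (the k and fetch_k values are illustrative):

# Similarity retriever: the top-k closest matches
similarity_retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3},
)
docs = similarity_retriever.invoke("What is LangChain?")

# MMR retriever: relevant but diverse results
mmr_retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3, "fetch_k": 10},  # consider 10 candidates, return 3
)
diverse_docs = mmr_retriever.invoke("What is LangChain?")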
Key difference:
- Similarity retrieval = best for precision and the most relevant matches.
- MMR retrieval = best for variety, diversity, and avoiding redundancy.
In short: use similarity when you want the closest matches, and MMR when you want a broader, more balanced set of documents.
62. Exercise and Solution ~ Vector Stores
The exercise explains how to build a simple retrieval pipeline using Chroma:
- Define a helper function, create_retriever, that takes a list of text documents, a chunk overlap value, and the number of documents to retrieve.
- Convert the text strings into Document objects.
- Split the documents into smaller chunks with RecursiveCharacterTextSplitter.
- Store the chunks in an in-memory Chroma vector store using an embedding model.
- Return a retriever from that vector store (one possible solution is sketched below).
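One possible solution, as a hedged sketch (the chunk size, embedding model, and default argument values are assumptions; the exercise only fixes the function’s inputs):

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

def create_retriever(texts: list[str], chunk_overlap: int = 50, k: int = 2):
    documents = [Document(page_content=text) for text in texts]
    splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=chunk_overlap)
    chunks = splitter.split_documents(documents)
    vector_store = Chroma.from_documents(  # no persist_directory → in-memory store
        documents=chunks,
        embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    )
    return vector_store.as_retriever(search_kwargs={"k": k})

retriever = create_retriever([
    "JavaScript is widely used for web development.",
    "Python is great for data science and scripting.",
    "Rust emphasizes memory safety without garbage collection.",
])
print(retriever.invoke("What's good for web development?"))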
It also describes testing the function with sample text and example queries like:
- “What’s good for web development?”
- “Which language is safest?”
The expected relevant outputs include languages such as JavaScript, Python, and Rust.
Overall, the goal is to demonstrate the core components of a RAG workflow: documents, chunking, embeddings, vector storage, and retrieval.