37. Hands-on ~ TextLoaders

The passage explains how to use LangChain’s TextLoader from langchain_community.document_loaders to load a text file as a document.

Main points

  • Import standard modules like os, tempfile, and Path.

  • Install langchain-community to access document loaders.

  • Use TextLoader(file_path) and call .load() to read a text file.

  • The result is a list of documents.

Document contents

Each loaded document has:

  • page_content: the text from the file

  • metadata: extra info such as the file source path

Example workflow

  • Create a temporary .txt file.

  • Write sample text into it.

  • Load it with a helper function:

    from langchain_community.document_loaders import TextLoader

    def load_text_file(file_path: str):
        loader = TextLoader(file_path)
        documents = loader.load()  # returns a list of Document objects
        return documents

Expected output

  • len(documents) is 1

  • documents[0].page_content contains the file text

  • documents[0].metadata includes the source path

Why this matters

Loaders are useful because they:

  • read document content

  • attach metadata automatically

  • help in retrieval and document-processing pipelines by tracking where data came from

38. Hands-on ~ WebBaseLoader

Next, let’s look at the WebBaseLoader.

First, I’m going to define a new function called demo. Then I’ll import WebBaseLoader and instantiate it by passing in a URL. In this example, I’m using the Wikipedia page on web scraping, which is a simple page to work with.

After that, I call the load() method, just like before. This returns the documents from the web page.

There are several optional parameters you can pass to WebBaseLoader, including:

  • proxies

  • verify_ssl

  • header_template

  • encoding

  • requests_per_second

and more, depending on what you need.

One useful option I want to highlight is bs_kwargs. This lets you pass arguments to Beautiful Soup. For example, you can specify the parser with:

bs_kwargs={"features": "html.parser"}

Since HTML pages are being parsed, this is a common setup.

You can also control what part of the page gets parsed. For example, you might target a specific element like a div, or leave it as None if you want to parse the whole page.

Now let’s print a content preview. I’ll display the source, content length, and a preview of the loaded document.

When I run it, I hit an issue: WebBaseLoader depends on Beautiful Soup, which needs to be installed first.

So I add bs4 to the environment and run it again. This time it works.

Now you can see that one document was loaded from the web. It shows:

  • the source URL

  • the content length

  • a preview of the page content

So it successfully went to the Wikipedia page, scraped the content, and created a document from it.

We can also change the URL and load from other web pages as well. When I do that, it again returns one document with the new URL, its length, and a preview of the extracted content.

39. Hands-on ~ Lazy Loader

The passage explains a simple example of using lazy loading to load many files efficiently, which matters most for large datasets.

  • A temporary directory is created with some sample .txt files.

  • A DirectoryLoader is configured to load files from that directory.

  • TextLoader is set as the loader_cls because the files are text files.

  • A glob pattern is used so only .txt files, including those in subdirectories, are selected.

  • Instead of loading everything at once, lazy loading loads documents incrementally, which saves memory.

  • The example prints both document contents and metadata, including the source field, to show where each file came from.

  • Running the lazy loader confirms that the files are loaded correctly one by one.
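The pattern above can be sketched with the standard library alone. This is a toy stand-in for DirectoryLoader’s lazy loading (the file names are made up), showing the generator idea: each file is read only when the loop asks for it, so memory use stays flat no matter how many files there are.

```python
import tempfile
from pathlib import Path

def lazy_load_texts(directory: Path):
    """Yield one document at a time instead of reading every file up front."""
    for path in sorted(directory.rglob("*.txt")):  # recursive glob: subdirectories included
        yield {"page_content": path.read_text(), "metadata": {"source": str(path)}}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("first file")
    sub = root / "sub"
    sub.mkdir()
    (sub / "b.txt").write_text("second file")
    for doc in lazy_load_texts(root):  # each file is read only at this point
        print(doc["metadata"]["source"], "->", doc["page_content"])
```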

Overall, it shows how lazy loading can be a practical, memory-efficient approach for working with large collections of files.

40. Hands-on ~ Document Structure

LangChain’s Document class is a structured container for text and metadata. It is typically created by loaders, but you can also construct it manually.

Main parts of a Document

  • page_content: required string field containing the actual text

  • metadata: optional dictionary for extra details like source, author, tags, creation date, or custom labels

Example

from langchain_core.documents import Document

doc = Document(
    page_content="This is a sample document",
    metadata={
        "source": "sample.txt",
        "creation": "manual",
        "author": "Paolo",
        "length": 25,
        "tags": ["sample", "demo"],
        "created_at": "2024-01-01"
    }
)

Inspecting and updating

  • Printing a Document shows its text and metadata clearly.

  • Since documents are usually treated as immutable, updates are made by creating a new Document with modified content or metadata.

Why it matters

Understanding Document helps you:

  • work with LangChain loader outputs

  • add custom metadata

  • prepare data for splitting, embedding, and vector databases

In short, a Document is simply text plus flexible metadata.

43. Hands-on ~ Text Splitter - RecursiveCharacterTextSplitter

This example explains how to split text into chunks using LangChain, focusing on RecursiveCharacterTextSplitter.

Main points

  • A file called TextSplitters.py is set up with the text-splitting imports, the Language enum, Document, and .env loading.

  • Two sample inputs are mentioned:

    1. A text document about machine learning

    2. A code sample for testing how splitting works on code

RecursiveCharacterTextSplitter

  • Configured with:

    • chunk_size=500

    • chunk_overlap=50

    • separators: ["\n\n", "\n", " ", ""]

  • It splits text hierarchically, trying to preserve meaning and structure:

    1. Paragraphs

    2. Sentences

    3. Words

    4. Characters
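The hierarchy can be sketched in plain Python. This is a simplified stand-in for the real splitter, not its actual implementation: it ignores chunk_overlap, and the empty-string fallback is modeled as a hard character cut. It only shows the core idea of trying coarse separators first and recursing with finer ones on oversized pieces.

```python
def recursive_split(text: str, chunk_size: int,
                    separators=("\n\n", "\n", " ")) -> list[str]:
    """Split by the coarsest separator present, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    sep = next((s for s in separators if s in text), None)
    if sep is None:
        # last resort: hard cut at chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) > chunk_size:
            # the piece alone is still too big: retry with the finer separators
            chunks.extend(recursive_split(part, chunk_size, separators))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks

text = "A short paragraph.\n\n" + "word " * 30
for chunk in recursive_split(text, chunk_size=50):
    print(repr(chunk))
```

The short paragraph survives as one intact chunk, while the long run of words is broken at word boundaries — which is exactly why recursive splitting produces more coherent chunks than blind fixed-size cuts.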

Usage note

  • Use split_text() for plain strings

  • Use split_documents() for Document objects

Result inspection

The example shows checking:

  • original text length

  • number of chunks

  • chunk sizes

  • a preview of the first chunk

Why it’s useful

Recursive splitting keeps related ideas together, producing more coherent chunks that work better for:

  • embeddings

  • retrieval

  • summarization

  • question answering

Key takeaway

RecursiveCharacterTextSplitter is presented as a strong default choice for chunking text while preserving natural boundaries.

44. Hands-on ~ Chunk Comparison

The passage explains a chunk size comparison for text splitting in a RAG workflow.

Main idea

Chunk size controls how much context is stored in each vector chunk, which affects retrieval quality later.

What the code does

  • Defines chunk sizes: 200, 500, 1000

  • Prints "Chunk size comparison"

  • For each size:

    • Creates a RecursiveCharacterTextSplitter

    • Sets chunk_overlap to about 20% of the chunk size

    • Splits the text

    • Prints how many chunks are produced

Observed results

  • chunk size 200 → 6 chunks

  • chunk size 500 → 3 chunks

  • chunk size 1000 → 1 chunk

Why this happens

Because smaller chunk sizes, even with overlap, create more pieces of text.
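As a rough model, the count follows from the step size chunk_size − overlap. This sketch assumes fixed-size chunks over a 1,000-character text with overlap at 20% of the chunk size; the real splitter also respects separators, so exact counts vary.

```python
import math

def count_chunks(text_len: int, chunk_size: int, chunk_overlap: int) -> int:
    """Windows of chunk_size, advancing by chunk_size - overlap, needed to cover the text."""
    if text_len <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return math.ceil((text_len - chunk_overlap) / step)

for size in (200, 500, 1000):
    overlap = size // 5  # ~20% of the chunk size, as in the demo
    print(f"chunk_size={size}: {count_chunks(1000, size, overlap)} chunks")
```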

Why it matters

  • Smaller chunks: better retrieval precision, but less context per chunk

  • Larger chunks: more context, but less precise retrieval

  • 500 is suggested as a reasonable middle ground in this example

Key takeaway

There is no single best chunk size. It should be chosen based on:

  • document type

  • query patterns

  • LLM context window

The point of the demo is that chunk size strongly affects retrieval performance, so it must be tested for each use case.

45. Hands-on ~ Overlap Importance In Code

Overlap in text splitting helps preserve context at chunk boundaries.

Main idea

When text is split into chunks:

  • Without overlap, important phrases can be cut in half between chunks.

  • With overlap, chunks repeat some text, so context is shared across boundaries.

Why it matters

This makes retrieval more reliable because:

  • key information is less likely to be lost,

  • related details stay together in at least one chunk,

  • the retriever is more likely to return a complete answer instead of partial context.

Example

If a sentence about an API expiring is split across two chunks, one chunk may mention the expiration and another may mention the fix. Without overlap, the retriever may only find one piece. With overlap, a chunk can contain both the problem and the solution.
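That failure mode can be shown with a naive fixed-size chunker and a made-up phrase (this is not LangChain’s splitter, just a minimal stdlib sketch of the boundary problem):

```python
def split_fixed(text: str, size: int, overlap: int) -> list[str]:
    """Naive fixed-size chunker: advance by size - overlap each step."""
    step = size - overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += step
    return chunks

text = "x" * 30 + "SECRET TOKEN EXPIRES SOON" + "y" * 30
phrase = "SECRET TOKEN EXPIRES SOON"

no_overlap = split_fixed(text, size=40, overlap=0)
with_overlap = split_fixed(text, size=40, overlap=30)
print(any(phrase in c for c in no_overlap))    # False: the phrase straddles a chunk boundary
print(any(phrase in c for c in with_overlap))  # True: one overlapping chunk holds it whole
```

Without overlap, no single chunk contains the full phrase; with overlap, at least one does — the “cheap insurance” in action.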

Takeaway

Overlap is like “cheap insurance” for retrieval: a little redundancy improves the chance that important context is available when needed.

46. Hands-on ~ Markdown Header Splitter

The Markdown Header Text Splitter in LangChain is used to split Markdown documents into chunks based on header structure such as #, ##, and ###. You specify which headers to split on, pass them to the splitter, and then call split_text() on your Markdown content.

Each resulting chunk includes:

  • the text content

  • metadata showing the header hierarchy

This preserves context, so chunks remain tied to where they appear in the document. It’s especially useful for structured Markdown content such as documentation, README files, wikis, and notes, because it keeps both the meaning and location of the text intact for better retrieval and downstream use.
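The behavior can be approximated in plain Python. This is a simplified stand-in for MarkdownHeaderTextSplitter (it handles only #/##/### and the sample content is made up), showing how each chunk carries its header trail as metadata:

```python
import re

def split_markdown(md: str) -> list[dict]:
    """Split on #/##/### headers, attaching the current header trail as metadata."""
    chunks: list[dict] = []
    meta: dict[str, str] = {}
    buf: list[str] = []

    def flush():
        text = "\n".join(buf).strip()
        if text:
            chunks.append({"content": text, "metadata": dict(meta)})
        buf.clear()

    for line in md.splitlines():
        match = re.match(r"^(#{1,3}) +(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            # a new header invalidates any deeper headers collected earlier
            meta = {k: v for k, v in meta.items() if int(k[1]) < level}
            meta[f"h{level}"] = match.group(2)
        else:
            buf.append(line)
    flush()
    return chunks

md = "# Guide\nIntro.\n## Setup\nInstall steps.\n### Notes\nDetails."
for chunk in split_markdown(md):
    print(chunk["metadata"], "->", chunk["content"])
```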

47. Hands-on ~ Code Splitter

The passage explains that splitting code is more complex than splitting plain text because code has syntax and logical structure. It describes using the language-aware RecursiveCharacterTextSplitter.from_language, with the language set to Python, a chunk size of 500, and an overlap of 50.

The main point is that specifying the language helps preserve meaningful code units like functions and classes, instead of cutting them into incoherent pieces. In the example, the code is split into two chunks, and the function definition remains intact. This improves retrieval quality, making it more likely to return the right code block when answering programming questions.

Overall takeaway: always use from_language with the correct language when splitting code so the chunks stay coherent and syntax-aware.

48. PDF Document Splitting

The passage explains how to load and split a real PDF document in LangChain using PyPDFLoader and RecursiveCharacterTextSplitter.

Main steps

  1. Import the loader and splitter

    • PyPDFLoader loads the PDF.

    • RecursiveCharacterTextSplitter splits the content.

  2. Load the PDF

    • The PDF is loaded into a list of Document objects, usually one per page.

  3. Create a splitter

    • Configure chunk size and overlap.

  4. Split documents

    • Use split_documents() instead of split_text() because the input is a list of Document objects.

    • The output chunks remain Document objects with preserved metadata.

  5. Inspect output

    • You can print the chunk text and metadata to see source information like page number, creator, and creation date.

Why it matters

This workflow is useful because it:

  • handles real file types like PDFs

  • preserves metadata

  • creates chunks suitable for search, embeddings, and RAG

  • supports many loaders and splitters for different document types

Key takeaway

For real documents, LangChain’s standard pattern is:

load documents → split documents → keep metadata → use chunks for retrieval or downstream NLP tasks

51. Hands-on ~ OpenAI Embedding

The passage explains how to create embeddings in practice using OpenAI embeddings through LangChain.

Key points

  • OpenAI provides several embedding models:

    • text-embedding-3-small → 1,536 dimensions

    • text-embedding-3-large → 3,072 dimensions

    • text-embedding-ada-002 → deprecated

  • More dimensions can capture richer meaning, but usually cost more.

  • For the example, text-embedding-3-small is used because it is cheaper and suitable for general-purpose use.

How to use it in LangChain

  1. Import the wrapper:

    from langchain_openai import OpenAIEmbeddings
  2. Create the embeddings model:

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Embedding text

  • Use embed_query() for a single text:

    embedding = embeddings.embed_query(text)
  • This returns a vector: a list of numbers representing the text’s semantic meaning.

  • You can check the vector size with len(embedding), which should be 1536.

Embedding multiple documents

  • Use embed_documents() to embed a list of texts at once:

    embeddings_list = embeddings.embed_documents(texts)
  • This returns one vector per document, each with 1,536 values.

Main takeaway

Embeddings turn text into numerical vectors that capture meaning. These vectors are essential for:

  • similarity search

  • vector databases

  • retrieval-augmented generation (RAG) systems

52. Free Embedding Models

The passage explains that while OpenAI embeddings like text-embedding-3-small are inexpensive, they still incur API cost. If you want a fully local, free alternative, you can use LangChain wrapper classes for:

  • Hugging Face / Sentence Transformers embeddings

    • Example model: all-MiniLM-L6-v2

    • Uses 384 dimensions, which is smaller than many OpenAI models

    • Good for testing and smaller projects

    • Same general usage pattern as OpenAI embeddings

  • Ollama embeddings

    • Also supported through LangChain

    • Requires installing langchain-ollama first

    • You pass a model name and use it similarly to other embedding providers

Overall, the key point is that LangChain makes switching embedding backends easy, so you can use OpenAI, Hugging Face, or Ollama with nearly the same code.

This section explains how to use embeddings with LangChain and why they matter for retrieval and RAG.

Main points

  • Create one reusable OpenAIEmbeddings model.

  • Generate:

    • a single embedding with embed_query()

    • batch embeddings with embed_documents()

  • Inspect embedding properties like:

    • vector length

    • first few values

    • vector norm

  • Use cosine similarity to compare a query embedding with document embeddings.

  • Rank documents by similarity to find the most relevant ones.

Key idea

Embeddings convert text into vectors so similar meanings are placed near each other in vector space. In the example:

  • a query about programming languages ranks Python and JavaScript higher than unrelated documents like cats or machine learning.

Why normalization matters

OpenAI embeddings are typically normalized, so vector comparisons depend more on direction than length. This helps make similarity scores more meaningful.

Relevance to RAG

This is the basic retrieval process used in RAG:

  1. embed documents

  2. embed the query

  3. compare them

  4. return the most relevant results
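Those four steps can be sketched with toy 3-dimensional vectors standing in for real 1,536-dimensional embeddings (the vectors below are made up for illustration; a real pipeline would get them from embed_documents/embed_query):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# step 1: "embed" the documents (toy vectors in place of embed_documents output)
doc_vectors = {
    "Python is a programming language": [0.9, 0.1, 0.0],
    "JavaScript runs in the browser":   [0.8, 0.2, 0.1],
    "Cats sleep most of the day":       [0.0, 0.1, 0.9],
}
# step 2: "embed" the query (pretend embedding of "programming languages")
query_vector = [0.85, 0.15, 0.05]

# steps 3-4: compare and rank by similarity
ranked = sorted(doc_vectors,
                key=lambda d: cosine_similarity(query_vector, doc_vectors[d]),
                reverse=True)
for doc in ranked:
    print(f"{cosine_similarity(query_vector, doc_vectors[doc]):.3f}  {doc}")
```

The programming-related documents score near 1.0 while the cats document scores near 0 — the geometric picture behind “similar meanings sit near each other in vector space.”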

Overall, the section introduces the core workflow behind embedding-based search and retrieval.

54. Hands-on ~ Embedding Caching

The passage explains that caching embeddings is useful because it prevents repeated API calls, reducing both latency and cost. It shows how to wrap an OpenAIEmbeddings model with CacheBackedEmbeddings using a LocalFileStore cache. In the example, the first call to embed_query() computes and stores the embedding, while the second call retrieves the same result from cache. The two outputs should match, confirming the cache works.
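The caching idea can be sketched without LangChain: hash the text and compute the embedding only on a miss. The embed function below is a made-up stand-in for the real API call, and the miss counter makes the cache hit visible.

```python
import hashlib

class CachedEmbedder:
    """Toy stand-in for CacheBackedEmbeddings: memoize embeddings by a hash of the text."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache: dict[str, list[float]] = {}
        self.misses = 0  # counts how often the underlying embedder was actually called

    def embed_query(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.misses += 1  # only a cache miss triggers the (slow, paid) computation
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]

# stand-in embedder: in real code this would be the OpenAI API call
fake_embed = lambda text: [float(len(text)), float(text.count(" "))]

embedder = CachedEmbedder(fake_embed)
first = embedder.embed_query("hello world")
second = embedder.embed_query("hello world")  # served from cache, no second computation
print(first == second, embedder.misses)  # True 1
```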

56. Hands-on ~ Setting Up Chroma and Running Chroma Basics

The passage explains how to use Chroma as a vector store in LangChain.

Main points

  • Import Chroma from langchain_chroma.

  • Also import:

    • Document from LangChain Core

    • RecursiveCharacterTextSplitter from LangChain text splitters

    • TemporaryDirectory from Python’s tempfile

  • Use an embedding model, specifically OpenAIEmbeddings with text-embedding-3-small.

What the code does

  • Creates a few sample Document objects with page content and metadata.

  • Builds a Chroma vector store using:

    Chroma.from_documents(
        documents=simple_documents,
        embedding=embedding_model,
        persist_directory=tmp_dir
    )
  • This:

    1. embeds the documents

    2. stores them in Chroma

    3. saves them to disk

Search example

  • Runs a similarity search like:

    query = "What is LangChain?"
    results = vector_store.similarity_search(query, k=2)
  • k=2 means return the two most relevant documents.

  • The top result is the LangChain description from the docs, and the next result may be related content like LangGraph.

Key takeaway

LangChain makes it very easy to work with Chroma:

  • no need to implement your own vector search

  • from_documents simplifies setup

  • documents, embeddings, and persistence are handled in one place

Overall, it shows a straightforward way to create, persist, and query a vector store.

58. Hands-on ~ Similarity Search with Scores

The passage explains how to use LangChain’s Chroma wrapper to perform a similarity search with scores.

Key points:

  • Set up a vector store using a temporary directory, sample documents, an embedding model, and a persistence directory.

  • Instead of similarity_search, use vector_store.similarity_search_with_score() to retrieve documents along with their scores.

  • In the example, the query is "explain vector store" and the top 3 results are returned.

Important note about scores:

  • The scores returned by Chroma here are distance scores, not similarity scores.

  • For distance scores, smaller values mean better matches.

  • Example interpretation:

    • 0.066 = most relevant

    • 1.34 = least relevant

It also notes that if you want a similarity value from a distance score, you can convert it with:

similarity = 1 / (1 + distance)
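A quick check of that conversion against the example distances above:

```python
def distance_to_similarity(distance: float) -> float:
    """Convert a distance score (smaller = better) into a similarity (larger = better)."""
    return 1.0 / (1.0 + distance)

print(round(distance_to_similarity(0.066), 3))  # 0.938 -> the most relevant result
print(round(distance_to_similarity(1.34), 3))   # 0.427 -> the least relevant result
```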

Overall, the main takeaway is that similarity search with scores is straightforward, but you must know whether your vector store returns distance or similarity scores.

59. Hands-on ~ Metadata Filtering

The passage explains metadata filtering in similarity search.

  • A normal similarity_search() returns documents based only on semantic similarity.

  • You can add a filter argument, such as {"topic": "database"}, to restrict results to documents whose metadata matches that condition.

  • This means the search still uses meaning to rank documents, but only among documents that satisfy the metadata filter.

  • It’s useful for narrowing results by attributes like topic, source, date, or type.

  • Because of filtering, you may get fewer results than the requested k if not enough documents match the metadata.

Main takeaway: metadata filtering makes vector search more precise by combining semantic relevance with metadata constraints.

60. Hands-on ~ Chroma DB Persistence

This section explains how to persist a Chroma vector store to disk, reload it after a restart, and verify that it still works.

Key points

  • Set a persist_directory such as "chroma_db".

  • Create the vector store with Chroma.from_documents(…) using your loaded documents and embedding function.

  • Save it locally, then confirm how many documents were stored.

  • Simulate a restart by deleting the in-memory vector store and reloading it from the same directory.

  • Check the document count again to confirm it reloaded successfully.

  • Run a similarity search to verify the database still functions after reloading.

Behind the scenes

  • Chroma stores data locally in the specified persistence folder.

  • This means you can:

    1. create the vector store,

    2. persist it,

    3. delete it from memory,

    4. reload it later,

    5. and continue searching normally.

Inspecting the database

  • Since Chroma uses SQLite internally, you can inspect the database files directly with a SQLite extension in VS Code.

  • Useful tables include:

    • collections

    • embeddings

    • embedding_metadata

Why it matters

Persisting Chroma lets you:

  • keep vector data locally,

  • avoid rebuilding embeddings every time,

  • inspect stored data manually,

  • and confirm your vector database is working as expected.

61. Hands-on ~ Vector Store as a Retriever for Chains

The passage explains how to use a vector store as a retriever.

  • You can convert a vector store into a retriever with as_retriever().

  • A similarity retriever (search_type="similarity") returns the top k most relevant documents for a query.

  • The retriever can be queried with .invoke(), which returns matching documents without generating an LLM answer yet.

  • An MMR retriever (search_type="mmr") uses Maximum Marginal Relevance to return results that are both relevant and diverse.

  • fetch_k controls how many candidate documents are first considered before selecting the final k results.

Key difference:

  • Similarity retrieval = best for precision and the most relevant matches.

  • MMR retrieval = best for variety, diversity, and avoiding redundancy.

In short: use similarity when you want the closest matches, and MMR when you want a broader, more balanced set of documents.
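Greedy MMR selection can be sketched in plain Python with toy 2-D vectors (the candidate vectors and λ value are made up for illustration; real retrievers work on embedding vectors):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query, candidates, k, lambda_mult=0.5):
    """Greedy MMR: each pick balances relevance to the query against redundancy with prior picks."""
    selected = []
    while len(selected) < k:
        best_idx, best_score = None, -math.inf
        for idx, vec in enumerate(candidates):
            if idx in selected:
                continue
            relevance = cosine(query, vec)
            redundancy = max((cosine(vec, candidates[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
    return selected

query = [1.0, 0.0]
candidates = [
    [0.95, 0.31],   # A: relevant
    [0.94, 0.34],   # B: relevant, but nearly a duplicate of A
    [0.50, -0.87],  # C: less relevant, but different
]
by_similarity = sorted(range(3), key=lambda i: cosine(query, candidates[i]), reverse=True)[:2]
print(by_similarity)                       # [0, 1]: closest matches, but A and B are redundant
print(mmr_select(query, candidates, k=2))  # [0, 2]: trades a little relevance for diversity
```

Plain similarity picks the two near-duplicates; MMR swaps the second duplicate for the distinct candidate, which is the relevance-vs-diversity trade-off described above.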

62. Exercise and Solution ~ Vector Stores

The exercise explains how to build a simple retrieval pipeline using Chroma:

  • Define a helper function, create_retriever, that takes a list of text documents, a chunk overlap value, and the number of documents to retrieve.

  • Convert the text strings into Document objects.

  • Split the documents into smaller chunks with RecursiveCharacterTextSplitter.

  • Store the chunks in an in-memory Chroma vector store using an embedding model.

  • Return a retriever from that vector store.

It also describes testing the function with sample text and example queries like:

  • “What’s good for web development?”

  • “Which language is safest?”

The expected relevant outputs include languages such as JavaScript, Python, and Rust.

Overall, the goal is to demonstrate the core components of a RAG workflow: documents, chunking, embeddings, vector storage, and retrieval.