37. Hands-on ~ TextLoaders
The passage explains how to use LangChain’s TextLoader from
langchain_community.document_loaders to load a text file as a
document.
Main points
- Import standard modules like `os`, `tempfile`, and `Path`.
- Install `langchain-community` to access document loaders.
- Use `TextLoader(file_path)` and call `.load()` to read a text file.
- The result is a list of documents.
Document contents
Each loaded document has:
- `page_content`: the text from the file
- `metadata`: extra info such as the file source path
Example workflow
- Create a temporary `.txt` file.
- Write sample text into it.
- Load it with a helper function:

```python
def load_text_file(file_path: str):
    loader = TextLoader(file_path)
    documents = loader.load()
    return documents
```
Expected output
- `len(documents)` is `1`
- `documents[0].page_content` contains the file text
- `documents[0].metadata` includes the source path
Why this matters
Loaders are useful because they:
- read document content
- attach metadata automatically
- help in retrieval and document-processing pipelines by tracking where data came from
38. Hands-on ~ WebLoader
Next, let’s look at the WebLoader; in LangChain the class is `WebBaseLoader`, imported from `langchain_community.document_loaders`.
First, I’m going to define a new function called demo. Then I’ll import `WebBaseLoader` and instantiate it by passing in a URL. In this example, I’m using a simple Wikipedia page for web scraping.
After that, I call the load() method, just like before. This returns the documents from the web page.
There are several optional parameters you can pass to the loader, including:
- `proxies`
- `verify_ssl`
- `header_template`
- `encoding`
- `requests_per_second`
and more, depending on what you need.
One useful option I want to highlight is bs_kwargs. This lets you pass arguments to Beautiful Soup. For example, you can specify the parser with:
`bs_kwargs={"features": "html.parser"}`
Since HTML pages are being parsed, this is a common setup.
You can also control what part of the page gets parsed. For example, you might target a specific element like a div, or leave it as None if you want to parse the whole page.
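A small sketch of that idea using Beautiful Soup directly (the HTML snippet and the `content` class name are made up): a `SoupStrainer` restricts parsing to chosen elements, and the same strainer can be handed to the loader via `bs_kwargs`.

```python
import bs4

# Parse only <div> elements with class "content"; everything else is skipped.
only_content = bs4.SoupStrainer("div", class_="content")

html = """
<html><body>
  <nav>site navigation</nav>
  <div class="content">The article text we actually want.</div>
  <footer>footer links</footer>
</body></html>
"""

soup = bs4.BeautifulSoup(html, "html.parser", parse_only=only_content)
print(soup.get_text(strip=True))  # only the div's text survives

# The same idea when loading a page (sketch):
# WebBaseLoader(url, bs_kwargs={"parse_only": only_content})
```

Leaving `parse_only` out parses the whole page, which is the default behavior.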
Now let’s print a content preview. I’ll display the source, content length, and a preview of the loaded document.
When I run it, I hit an issue: the loader depends on Beautiful Soup, which needs to be installed first.
So I add bs4 to the environment and run it again. This time it works.
Now you can see that one document was loaded from the web. It shows:
- the source URL
- the content length
- a preview of the page content
So it successfully went to the Wikipedia page, scraped the content, and created a document from it.
We can also change the URL and load from other web pages as well. When I do that, it again returns one document with the new URL, its length, and a preview of the extracted content.
39. Hands-on ~ Lazy Loader
The passage explains a simple example of using lazy loading to efficiently load many files, especially large datasets.
- A temporary directory is created with some sample `.txt` files.
- A `DirectoryLoader` is configured to load files from that directory.
- `TextLoader` is set as the `loader_cls` because the files are text files.
- A `glob` pattern is used so only `.txt` files, including those in subdirectories, are selected.
- Instead of loading everything at once, lazy loading loads documents incrementally, which saves memory.
- The example prints both document contents and metadata, including the `source` field, to show where each file came from.
- Running the lazy loader confirms that the files are loaded correctly one by one.
Overall, it shows how lazy loading can be a practical, memory-efficient approach for working with large collections of files.