Part 1: Introduction to Generative AI and LlamaIndex
Chapter 2: LlamaIndex: The Hidden Jewel - An Introduction to the LlamaIndex Ecosystem
Introducing PITS – our LlamaIndex hands-on project
The author introduces PITS, an AI tutor built with LlamaIndex, designed to provide a personalized and interactive learning experience. Users can upload study materials, after which PITS will assess the user’s knowledge with a quiz and then create customized learning material, including slides and narration, divided into chapters. PITS will adapt to the user’s knowledge level, answer questions, and remember the conversation context across multiple sessions. LlamaIndex will be used to understand and index the study materials, while GPT-4 will power the teaching interactions.
Familiarizing ourselves with the structure of the LlamaIndex code repository
The LlamaIndex framework’s code, reorganized for modularity and efficiency, is structured as follows:
- llama-index-core: The foundational package, providing essential framework components.
- llama-index-integrations: Add-on packages for customizing the framework with specific LLMs, data loaders, embedding models, and vector store providers.
- llama-index-packs: Ready-made templates developed by the community to kickstart applications.
- llama-index-cli: Supports the LlamaIndex command-line interface.
- OTHERS: Contains fine-tuning abstractions and experimental features.
Each subfolder within llama-index-integrations and llama-index-packs represents an individual package that can be installed via pip. For example, to use llama_index.llms.mistralai, you must first install the llama-index-llms-mistralai package. The book lists the necessary packages at the beginning of each chapter.
Part 2: Starting Your First LlamaIndex Project
Chapter 3: Kickstarting your Journey with LlamaIndex
Uncovering the essential building blocks of LlamaIndex – documents, nodes, and indexes
Documents
LlamaIndex uses Document objects to contain and structure raw data from various sources such as PDFs, databases, or APIs. A Document holds the text content, a unique ID, and metadata (additional information) that enables more specific queries. Documents can be created manually or, more commonly, generated in bulk using data loaders from LlamaHub, which supports a wide range of data formats and sources. An example uses the WikipediaReader to load data from Wikipedia articles into Document objects. The next step is converting these raw Document objects into a format that LLMs can process, which is where Nodes come in.
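A minimal sketch of both approaches, assuming the llama-index-readers-wikipedia package (plus its wikipedia dependency) is installed; the article title is only illustrative:

```python
from llama_index.core import Document
from llama_index.readers.wikipedia import WikipediaReader

# Manual creation: wrap raw text and metadata in a Document
doc = Document(
    text="Lionel Messi is an Argentine footballer.",
    metadata={"source": "manual", "topic": "football"},
)

# Bulk creation: a LlamaHub reader returns ready-made Document objects
wiki_docs = WikipediaReader().load_data(pages=["Lionel Messi"])
print(doc.doc_id, len(wiki_docs), wiki_docs[0].metadata)
```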
Nodes
Nodes are smaller, manageable chunks of content extracted from Documents; they address prompt size limits by allowing only the relevant information to be selected. They create semantic units of data centered around specific information and allow relationships to be defined between Nodes. In LlamaIndex, the TextNode class is the main focus, with attributes such as text, start_char_idx, end_char_idx, text_template, metadata_template, metadata_seperator, and metadata. Nodes inherit Document-level metadata but can also be individually customized.
Manually creating the Node objects
The provided code demonstrates how to manually create TextNode objects from a Document object in LlamaIndex. It involves slicing the document's text and assigning the slices to individual nodes. Each node is automatically assigned a unique ID, but this can be customized. This manual approach offers full control over each node's text and metadata.
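A minimal sketch of the manual approach; the text and slice boundaries are only illustrative:

```python
from llama_index.core import Document
from llama_index.core.schema import TextNode

doc = Document(
    text="Rome was founded in 753 BC. It grew into a vast empire. "
         "Its legacy still shapes law, language, and architecture."
)

# Slice the document text by hand and wrap each slice in a TextNode
nodes = [
    TextNode(text=doc.text[:27], metadata={"source_doc": doc.doc_id}),
    TextNode(text=doc.text[28:56], metadata={"source_doc": doc.doc_id}),
]
for node in nodes:
    # node_id is auto-generated unless you assign one explicitly
    print(node.node_id, repr(node.text))
```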
Automatically extracting Nodes from Documents using splitters
The TokenTextSplitter in LlamaIndex is a tool for chunking documents into nodes, which is important for RAG workflows. It splits text into chunks of whole sentences with a default overlap to maintain context. The splitter can be customized with parameters such as chunk_size and chunk_overlap. The example shows how to use TokenTextSplitter on a Document object, splitting the text into nodes that inherit metadata from the original document. A warning is triggered if the metadata is too large, leaving less room for the actual content text. The next chapter covers more text-splitting and node-parsing techniques available in LlamaIndex.
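A short sketch of the splitter in action; the sample text and chunk sizes are arbitrary:

```python
from llama_index.core import Document
from llama_index.core.node_parser import TokenTextSplitter

doc = Document(
    text="LlamaIndex turns raw files into queryable nodes. " * 40,
    metadata={"category": "documentation"},
)

# chunk_size is measured in tokens; chunk_overlap carries context across chunks
splitter = TokenTextSplitter(chunk_size=64, chunk_overlap=8)
nodes = splitter.get_nodes_from_documents([doc])

print(len(nodes))
print(nodes[0].metadata)  # metadata is inherited from the source Document
```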
Nodes don’t like to be alone – they crave relationships
This content explains how to manually create relationships between nodes in LlamaIndex, focusing on the "previous" and "next" relationships to maintain order within a document. It highlights that LlamaIndex can automatically create these relationships during node parsing. Additionally, it introduces other relationship types like "SOURCE," "PARENT," and "CHILD," which are useful for tracking the origin of nodes and representing hierarchical structures within the data. The content concludes by posing the question of why these relationships are important, setting the stage for further discussion on their utility.
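A minimal sketch of wiring the "previous" and "next" relationships by hand:

```python
from llama_index.core.schema import NodeRelationship, RelatedNodeInfo, TextNode

node1 = TextNode(text="Chapter 1: Rome is founded.")
node2 = TextNode(text="Chapter 2: The Republic emerges.")

# Preserve the original ordering with explicit PREVIOUS/NEXT links
node1.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(node_id=node2.node_id)
node2.relationships[NodeRelationship.PREVIOUS] = RelatedNodeInfo(node_id=node1.node_id)

print(node1.relationships)
```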
Why are relationships important?
Creating relationships between Nodes in LlamaIndex enhances querying by providing more context, tracking provenance, enabling navigation, supporting knowledge graph construction, and improving index structure. These relationships augment Nodes with contextual connections, leading to more expressive querying and complex index topologies. After structuring raw data into queryable Nodes, the next step is to organize them into efficient indexes.
Indexes
The passage explains the concept of indexing in LlamaIndex, which is crucial for organizing data for retrieval-augmented generation (RAG). Indexing transforms messy data into structured knowledge that AI can use effectively. LlamaIndex supports various index types, including SummaryIndex, DocumentSummaryIndex, VectorStoreIndex, TreeIndex, KeywordTableIndex, KnowledgeGraphIndex, and ComposableGraph, each with its own strengths and trade-offs. All index types share common features such as building the index, inserting new nodes, and querying the index. A SummaryIndex example is provided, illustrating its creation and its function as a simple list-based data structure that organizes nodes in order.
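A minimal sketch of building a SummaryIndex and inserting new content afterwards; the documents are placeholders:

```python
from llama_index.core import Document, SummaryIndex

docs = [
    Document(text="Lionel Messi was born in Rosario, Argentina."),
    Document(text="He has won multiple Ballon d'Or awards."),
]

# The SummaryIndex simply keeps the underlying nodes in an ordered list
index = SummaryIndex.from_documents(docs)

# New content can be inserted after the index has been built
index.insert(Document(text="Messi joined Inter Miami in 2023."))
```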
Are we there yet?
The text discusses how to retrieve answers from an index using retrievers and response synthesizers. It uses a Lionel Messi index as an example, querying "What is Messi’s hometown?" The summary index retrieves all nodes to synthesize a response with full context.
How does this actually work under the hood?
The QueryEngine in LlamaIndex retrieves relevant Nodes from an index using a retriever, which fetches and ranks them. A node postprocessor then transforms, re-ranks, or filters these Nodes. Finally, a response synthesizer formulates an LLM prompt with the query and Node context, generates a response, and post-processes it into a natural language answer. Calling index.as_query_engine() creates a complete query engine with default components. The overall process involves loading data, parsing it into Nodes, building an index, querying the index, and synthesizing a response. Different index types such as SummaryIndex, TreeIndex, and KeywordTableIndex affect performance and use cases, and the index structure defines the data management logic.
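A sketch of what as_query_engine() assembles, built here explicitly from a retriever and a response synthesizer; an LLM (OpenAI by default) is assumed to be configured, and the index is the Messi example again:

```python
from llama_index.core import Document, SummaryIndex, get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

index = SummaryIndex.from_documents(
    [Document(text="Lionel Messi was born in Rosario, Argentina.")]
)

# The convenience one-liner builds a query engine with default components...
default_engine = index.as_query_engine()

# ...which is roughly equivalent to wiring the parts together yourself
explicit_engine = RetrieverQueryEngine(
    retriever=index.as_retriever(),
    response_synthesizer=get_response_synthesizer(),
)
print(explicit_engine.query("What is Messi's hometown?"))
```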
Starting our PITS project – hands-on exercise
The chapter introduces the hands-on development of the PITS project, emphasizing a modular code structure for clarity and ease of understanding. The project is built using Python and integrates with LlamaIndex, with a focus on creating a learning application. The author provides a disclaimer that the current implementation lacks certain features, such as authentication and error handling, which can be improved upon later.
A detailed overview of the Python source code files is provided, including their functions:
- app.py: Main entry point for the Streamlit app.
- document_uploader.py: Manages document ingestion and indexing.
- training_material_builder.py: Creates learning materials based on user knowledge.
- training_interface.py: Displays teaching content and facilitates user interaction.
- quiz_builder.py: Generates quizzes based on user knowledge.
- quiz_interface.py: Administers quizzes and evaluates user performance.
- conversation_engine.py: Manages user interactions and maintains conversational context.
- storage_manager.py: Handles file operations for session states and user uploads.
- session_functions.py: Manages session state saving, loading, and deletion.
- logging_functions.py: Records user interactions and application events.
- global_settings.py: Contains application configurations and settings.
- user_onboarding.py: Manages user onboarding processes.
- index_builder.py: Builds indexes for the application.
The chapter also highlights the importance of the YAML package for session management and provides installation instructions. It delves into the global_settings.py, session_functions.py, and logging_functions.py modules, explaining their roles in managing configurations, session states, and logging user actions, respectively. The author emphasizes the necessity of logging for debugging and monitoring the application. The chapter concludes with a promise of further coding in subsequent chapters.
Chapter 4: Ingesting Data into Our RAG Workflow
Ingesting data via LlamaHub
This section emphasizes the importance of data ingestion and processing in a RAG workflow, highlighting common challenges and potential solutions.
Key Challenges:
- Data Quality: The quality of the RAG output depends on the quality of the input data. Cleaning, deduplicating, and removing redundant, ambiguous, biased, incomplete, or outdated information is crucial.
- Data Dynamics: Knowledge repositories evolve, requiring a system for regularly updating content to incorporate new information and remove outdated data.
- Data Variety: Data comes in various formats, and a RAG system should handle them all. While LlamaIndex offers many data loaders, automated ingestion can be challenging. LlamaParse is introduced as a solution for automated data ingestion and processing.
The section then transitions to discussing data ingestion using LlamaHub data loaders.
An overview of LlamaHub
LlamaHub is a library of integrations, including over 180 data connectors (also known as data readers or data loaders), that allow seamless integration of external data with LlamaIndex. These connectors extract data from various sources such as databases, APIs, files, and websites, converting it into LlamaIndex Document objects and saving you from writing custom parsers. LlamaIndex's modular architecture means these integrations aren't included in the core installation, so the corresponding package must be installed separately. These readers may also rely on specialized libraries and tools tailored to each data type. The LlamaHub website lists all available readers with documentation and samples. The source code for the readers can be found in the llama-index-integrations/readers subfolder of the LlamaIndex GitHub repository. Before using a data reader, make sure to install any additional dependencies required by the specific connector.
Using the LlamaHub data loaders to ingest content
Ingesting data from a web page
The SimpleWebPageReader in LlamaIndex extracts text content from web pages. It requires the llama-index-readers-web package to be installed. The reader fetches content from URLs, converts HTML to plain text (if specified and if the html2text package is installed), and attaches metadata using a custom function if provided. The content, URL, and metadata are then encapsulated in a Document object. While effective for simple web pages, it may not be suitable for complex, interactive websites. It simplifies the process of ingesting and structuring basic web content, allowing developers to focus on building RAG applications.
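A minimal sketch, assuming the llama-index-readers-web and html2text packages are installed; the URL is only illustrative:

```python
from llama_index.readers.web import SimpleWebPageReader

documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["https://docs.llamaindex.ai/en/stable/"]
)
# Each URL becomes one Document object
print(len(documents), documents[0].text[:200])
```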
Ingesting data from a database
This text discusses using databases for efficient data management and introduces the DatabaseReader connector in LlamaIndex for querying various database systems. It explains how to install the connector, connect to a database (using a URI, a SQLAlchemy engine, or credentials), execute a SQL query, and convert the results into LlamaIndex Document objects. The text provides an example using an SQLite database and points to the official documentation for a more general example. It also highlights the ease of use of LlamaHub readers, mentioning the wide variety of supported data formats and hinting at more efficient methods for ingesting multiple documents in the next section.
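A minimal sketch against a local SQLite file, assuming the llama-index-readers-database package is installed; the database file, table, and columns are hypothetical:

```python
from llama_index.readers.database import DatabaseReader

reader = DatabaseReader(uri="sqlite:///customers.db")  # hypothetical database
documents = reader.load_data(
    query="SELECT name, city FROM customers"  # each returned row becomes one Document
)
print(len(documents), documents[0].text)
```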
Bulk-ingesting data from sources with multiple file formats
This document discusses two methods for loading data into LlamaIndex for use in Retrieval-Augmented Generation (RAG) systems.
- SimpleDirectoryReader: A simple and easy-to-use reader that can ingest multiple data formats (PDFs, Word docs, text files, CSVs) from a directory or a list of files. It automatically detects the file type and uses the appropriate reader to extract the content (a sketch follows this list).
- LlamaParse: A more advanced parsing service that is part of the LlamaCloud enterprise platform. It is designed for complex file formats and uses multi-modal capabilities and LLM intelligence to provide high-quality document parsing. It allows users to provide natural language instructions to guide the parsing process and offers a JSON output mode for structured data. It can be used in combination with SimpleDirectoryReader for bulk ingestion. It supports a wide range of file types and offers a free tier. It is a paid service, so users should review the privacy policy before submitting proprietary data.
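A minimal sketch of bulk ingestion with SimpleDirectoryReader; the folder name and extensions are only illustrative:

```python
from llama_index.core import SimpleDirectoryReader

# Reads every supported file in the folder and picks a parser per file type
documents = SimpleDirectoryReader(
    input_dir="./study_materials",           # hypothetical folder
    recursive=True,                          # descend into subfolders
    required_exts=[".pdf", ".docx", ".txt"],
).load_data()
print(f"Loaded {len(documents)} documents")
```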
Parsing the documents into nodes
Understanding the simple text splitters
This text discusses text splitters in LlamaIndex, which break down documents into smaller pieces at the raw text level. It provides code examples and explanations for three specific text splitters:
- SentenceSplitter: Splits text while maintaining sentence boundaries, creating nodes containing groups of sentences (a sketch follows this list).
- TokenTextSplitter: Splits text at the token level while respecting sentence boundaries. Key parameters include chunk_size (max tokens per chunk), chunk_overlap (token overlap between chunks), separator (primary token boundary), and backup_separators (additional splitting points).
- CodeSplitter: Designed for source code, splitting based on programming language using an abstract syntax tree (AST) to keep related statements together. Requires installing tree_sitter and tree_sitter_languages. Key parameters include language (programming language), chunk_lines (lines per chunk), chunk_lines_overlap (line overlap), and max_chars (max characters per chunk).
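A short sketch of the SentenceSplitter, the simplest of the three; the text and chunk sizes are arbitrary:

```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

doc = Document(
    text="Sentence splitting keeps sentences intact. Each node groups whole "
         "sentences. A small overlap carries context between neighbouring nodes."
)

splitter = SentenceSplitter(chunk_size=64, chunk_overlap=10)
nodes = splitter.get_nodes_from_documents([doc])
for node in nodes:
    print(repr(node.text))
```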
Using more advanced node parsers
This text discusses advanced tools in LlamaIndex for chunking text into nodes, focusing on NodeParser and its derived classes. Key aspects include:
- NodeParser Basics: All node parsers inherit from the NodeParser class, which allows customization of include_metadata, include_prev_next_rel, and callback_manager.
- SentenceWindowNodeParser: Splits text into sentences and includes a window of surrounding sentences in the metadata (a sketch follows this list).
- LangchainNodeParser: Integrates Langchain text splitters into LlamaIndex.
- SimpleFileNodeParser: Automatically selects a node parser based on the file type.
- HTMLNodeParser: Parses HTML files using Beautiful Soup, converting them into nodes based on HTML tags.
- MarkdownNodeParser: Processes Markdown text, creating nodes for each header and incorporating the header hierarchy into the metadata.
- JSONNodeParser: Processes structured data in JSON format.
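A minimal sketch of the SentenceWindowNodeParser; the window size and metadata key names follow the documented defaults and the text is illustrative:

```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

doc = Document(
    text="Rome was founded in 753 BC. It became a republic in 509 BC. "
         "The empire followed centuries later. Its roads spanned the Mediterranean."
)

parser = SentenceWindowNodeParser.from_defaults(
    window_size=1,                             # one sentence of context on each side
    window_metadata_key="window",
    original_text_metadata_key="original_sentence",
)
nodes = parser.get_nodes_from_documents([doc])
print(nodes[1].text)                           # the single sentence
print(nodes[1].metadata["window"])             # the sentence plus its neighbours
```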
Practical ways of using these node creation models
The provided text outlines three main ways to implement node parsers or text splitters in LlamaIndex:
- Standalone Usage: Directly calling get_nodes_from_documents() on a parser instance. This allows for explicit control and inspection of the generated nodes and their metadata.
- Configuring in Settings: Setting a custom text_splitter in Settings makes it the default for all subsequent operations that rely on text splitting.
- Ingestion Pipeline: Defining the parser as a transformation step within an ingestion pipeline, which is a structured process for data ingestion. This is explained later in the chapter (a brief sketch of all three approaches follows this list).
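A brief sketch of the three approaches side by side; the index build assumes an embedding model (OpenAI by default) is configured:

```python
from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

docs = [Document(text="Some study material to be chunked and indexed.")]
splitter = SentenceSplitter(chunk_size=256, chunk_overlap=20)

# 1. Standalone: call the parser directly and inspect the resulting nodes
nodes = splitter.get_nodes_from_documents(docs)

# 2. Global default: later index builds pick this splitter up automatically
Settings.text_splitter = splitter
index = VectorStoreIndex.from_documents(docs)

# 3. Ingestion pipeline: the splitter becomes one transformation step
pipeline = IngestionPipeline(transformations=[splitter])
pipeline_nodes = pipeline.run(documents=docs)
```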
Working with metadata to improve the context
SummaryExtractor
The SummaryExtractor in LlamaIndex generates concise summaries of nodes and their adjacent nodes ("prev", "self", "next"). This is useful in RAG architectures to improve retrieval by allowing search to consider summaries instead of full document content. It can be customized by specifying which summaries to generate and by defining a custom prompt template. A practical use case is summarizing customer support issues and resolutions to quickly retrieve relevant past cases for new support requests.
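A minimal sketch, assuming an OpenAI key is configured for the LLM calls; the metadata key name reflects the extractor's documented default:

```python
from llama_index.core import Document
from llama_index.core.extractors import SummaryExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

docs = [Document(text="A long support ticket describing an outage and its resolution...")]

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=256),
        # Summaries of the previous, current, and next node land in node metadata
        SummaryExtractor(summaries=["prev", "self", "next"]),
    ]
)
nodes = pipeline.run(documents=docs)
print(nodes[0].metadata.get("section_summary"))
```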
QuestionsAnsweredExtractor
The QuestionsAnsweredExtractor in LlamaIndex generates a specified number of questions that a given text node can answer. This helps focus retrieval on nodes directly addressing specific inquiries, making it useful for applications like FAQ systems.
Key features include:
- Customizable Question Count: You can control how many questions are generated.
- Prompt Customization: The prompt used to generate questions can be modified via the prompt_template parameter.
- Embedding Option: The embedding_only parameter controls whether the generated metadata is used solely for embeddings (a short sketch follows this list).
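A short sketch, assuming an OpenAI key is configured for the LLM calls:

```python
from llama_index.core import Document
from llama_index.core.extractors import QuestionsAnsweredExtractor
from llama_index.core.node_parser import SentenceSplitter

docs = [Document(text="To reset your password, open Settings and choose Security...")]
nodes = SentenceSplitter(chunk_size=256).get_nodes_from_documents(docs)

# Generates three questions per node via the configured LLM
extractor = QuestionsAnsweredExtractor(questions=3)
metadata_list = extractor.extract(nodes)
print(metadata_list[0])   # one metadata dict per node, containing the questions
```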
Estimating the potential cost of using metadata extractors
Estimate your maximal costs before running the actual extractors
This section explains how to estimate LLM costs before running extractors on a real LLM using LlamaIndex tools.
- MockLLM: A stand-in LLM that simulates LLM behavior locally without API calls. It uses a max_tokens parameter to mimic token generation limits for cost prediction. The actual cost will likely be lower than the max_tokens value.
- CallbackManager and TokenCountingHandler: CallbackManager is a debugging tool, used here with TokenCountingHandler to count tokens used in LLM operations.
- Tokenizer: Converts text into tokens for LLMs. It's crucial to use a tokenizer compatible with the specific LLM for accurate cost predictions. LlamaIndex defaults to CL100K (the GPT-4 tokenizer) but can be customized.
- Workflow: The extractor uses MockLLM locally. TokenCountingHandler intercepts the prompt and response to count tokens (see the sketch after this list).
- Multiple Extractors: Use token_counter.reset_counts() to estimate costs for multiple extractors individually in the same run.
- Key Takeaway: Metadata extraction costs should be estimated and optimized to avoid high operating costs.
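A sketch of the estimation workflow described above; tiktoken's GPT-4 encoding is assumed as the tokenizer:

```python
import tiktoken
from llama_index.core import Document, Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.core.extractors import QuestionsAnsweredExtractor
from llama_index.core.llms import MockLLM
from llama_index.core.node_parser import SentenceSplitter

# Count tokens with the GPT-4 tokenizer; swap in your model's tokenizer if needed
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4").encode
)
Settings.callback_manager = CallbackManager([token_counter])
Settings.llm = MockLLM(max_tokens=256)                  # no API calls are made

nodes = SentenceSplitter().get_nodes_from_documents(
    [Document(text="Some study material. " * 500)]
)
QuestionsAnsweredExtractor(questions=3).extract(nodes)
print("Worst-case LLM tokens:", token_counter.total_llm_token_count)

token_counter.reset_counts()                            # reuse for the next extractor
```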
Hands-on – ingesting study materials into our PITS
This text details the creation of a document_uploader.py module designed to ingest and prepare study materials for the tutoring project. Here's a summary:
- Purpose: The module handles uploading books, documentation, and articles to provide context for the tutor.
- Key Function: ingest_documents(). This function is the core of the module (a condensed sketch appears at the end of this section). It:
  - Loads Documents: Reads files from a designated STORAGE_PATH (defined in global_settings.py).
  - Logs Uploads: Records each uploaded file using a logging function.
  - Utilizes Caching: Checks for a pre-existing cache file (CACHE_FILE) to speed up processing. If found, it uses the cached data; otherwise, it processes the documents from scratch.
  - Ingestion Pipeline: Employs an IngestionPipeline with three transformations:
    - TokenTextSplitter: Splits documents into chunks.
    - SummaryExtractor: Summarizes each chunk.
    - OpenAIEmbedding: Generates embeddings (explained in a later chapter).
  - Saves Cache: Persists the processed data to the cache file for future use.
  - Returns Nodes: Returns the processed data as "nodes."
The module aims to streamline document processing and improve efficiency through caching, preparing the study materials for indexing in the next step of the project.
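A condensed, hypothetical sketch of what ingest_documents() does, reusing the STORAGE_PATH and CACHE_FILE names from the summary above; logging and error handling are omitted and the exact transformations may differ from the book's code:

```python
import os

from llama_index.core import SimpleDirectoryReader
from llama_index.core.extractors import SummaryExtractor
from llama_index.core.ingestion import IngestionCache, IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

STORAGE_PATH = "./content"                  # upload folder, per global_settings.py
CACHE_FILE = "./cache/pipeline_cache.json"

def ingest_documents():
    documents = SimpleDirectoryReader(STORAGE_PATH).load_data()
    # Reuse cached transformation results when a cache file already exists
    cache = (
        IngestionCache.from_persist_path(CACHE_FILE)
        if os.path.exists(CACHE_FILE)
        else IngestionCache()
    )
    pipeline = IngestionPipeline(
        transformations=[
            TokenTextSplitter(chunk_size=1024, chunk_overlap=20),
            SummaryExtractor(summaries=["self"]),
            OpenAIEmbedding(),
        ],
        cache=cache,
    )
    nodes = pipeline.run(documents=documents)
    pipeline.cache.persist(CACHE_FILE)       # speed up the next run
    return nodes
```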
Chapter 5: Indexing with LlamaIndex
Indexing data – a bird’s-eye view
Common features of all Index types
LlamaIndex's index types share common features inherited from the BaseIndex class, allowing for customization across all index types. These shared features include:
- Nodes: Indexes are built upon nodes, which can be customized and dynamically updated through insertion and deletion. Indexes can be built from pre-existing nodes or from documents, with settings available to customize the underlying mechanics.
- Storage Context: Defines how and where data is stored, which is crucial for efficient data management.
- Progress Display: The show_progress option uses tqdm to display progress bars for long operations.
- Retrieval Modes: Indexes offer pre-defined retrieval modes and customizable Retriever classes for query processing.
- Asynchronous Operations: The use_async parameter enables asynchronous processing for performance optimization.
Indexing may involve LLM calls, potentially raising cost and privacy concerns.
Understanding the VectorStoreIndex
A simple usage example for the VectorStoreIndex
The VectorStoreIndex in LlamaIndex provides a simple way to ingest documents and make them searchable. It automatically handles node parsing (breaking documents into chunks) using default or customizable parameters such as chunk size and overlap.
Here’s a breakdown of the process:
- Ingestion: Documents are loaded using SimpleDirectoryReader.
- Node Creation: Documents are split into nodes (chunks of text).
- Embedding: These nodes are converted into high-dimensional vectors using an embedding model.
- Storage: The vectors are stored in a vector store.
- Querying: Incoming queries are also embedded, and their similarity to the stored vectors is calculated using cosine similarity.
- Retrieval: The most similar vectors (and their corresponding document chunks) are returned as the query result.
Key Parameters:
- use_async: Enables asynchronous calls (default: False).
- show_progress: Displays progress bars during index construction (default: False).
- store_nodes_override: Forces storage of Node objects (default: False).
The index utilizes fixed-size chunking by default, but performance can be optimized by testing different chunk sizes. The core strength of this index lies in its ability to perform semantic search by leveraging vector similarity.
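A minimal end-to-end sketch, assuming an OpenAI key is configured and a ./data folder contains a few files:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, show_progress=True)

query_engine = index.as_query_engine()
print(query_engine.query("What topics do these documents cover?"))
```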
Understanding embeddings
Vector embeddings are a way to translate data (text, images, sounds, etc.) into a numerical format that Large Language Models (LLMs) can understand. Think of them as converting information into a "standard language" for the LLM.
Here’s a breakdown of the key ideas:
- Numerical Representation: Embeddings represent data as lists of numbers (vectors). These numbers capture the meaning of the data.
- Semantic Understanding: LLMs use these numbers to understand relationships between concepts – like synonyms or different meanings of the same word (e.g., "bank" as a riverbank vs. a financial institution).
- Similarity Search: Embeddings allow LLMs to find data that is similar in meaning. This is done by calculating the "distance" between vectors. A process called "top-k similarity search" finds the k most similar pieces of data.
- Context is Key: The size of the text chunks used to create embeddings matters. Too small, and context is lost; too large, and meaning can be diluted.
Essentially, vector embeddings allow LLMs to "see" and "think" about data in a structured way, enabling them to process information and generate relevant responses. They are fundamental to how LLMs work with and understand the world around them.
Understanding similarity search
This text discusses the importance of similarity search in machine learning, particularly with the rise of embeddings which capture semantic meaning in vector form. Identifying similar vectors allows machines to understand relationships in data and is crucial for applications like recommendation systems and information retrieval.
The document focuses on three methods LlamaIndex uses to measure vector similarity:
- Cosine Similarity: Measures the angle between two vectors – a smaller angle indicates higher similarity. It's less sensitive to vector length and is the default method in LlamaIndex.
- Dot Product: Calculates similarity based on the alignment and length of vectors. Higher values indicate greater similarity, but it is sensitive to vector length, potentially biasing results towards longer documents.
- Euclidean Distance: Measures the actual distance between vector values, useful when vector dimensions represent real-world measurements.
The key difference lies in how each method approaches similarity: cosine similarity and dot product focus on direction, while Euclidean distance focuses on magnitude/distance. Understanding these differences is important for choosing the right method for a specific Retrieval-Augmented Generation (RAG) scenario.
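A small numeric illustration of how the three measures behave on two vectors that point in the same direction but differ in length:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the length

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
euclidean = np.linalg.norm(a - b)

print(cosine)     # 1.0  – direction only, length ignored
print(dot)        # 28.0 – grows with vector length
print(euclidean)  # ~3.74 – distance between the two points
```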
OK, but how does LlamaIndex generate these embeddings?
LlamaIndex defaults to OpenAI's text-embedding-ada-002 model for creating text embeddings, which are crucial for tasks like semantic search. However, it offers the flexibility to use alternative models for cost, privacy, or specialization reasons.
Key takeaways:
- Alternatives to OpenAI: LlamaIndex supports various embedding models beyond OpenAI, including local models and those from other providers.
- Hugging Face Integration: A popular option is using models from Hugging Face, a community-driven platform for AI models (particularly in NLP). The llama-index-embeddings-huggingface package enables this, with BAAI/bge-small-en-v1.5 as a well-balanced default local model (see the sketch at the end of this section).
- Custom Models: Advanced users can create and integrate their own custom embedding models by extending LlamaIndex's BaseEmbedding class.
- Further Integrations: LlamaIndex also integrates with Langchain, Azure, CohereAI, and other providers, expanding the range of available embedding models.
In essence, LlamaIndex provides a versatile system for handling text embeddings, allowing users to choose the model that best fits their requirements and constraints.
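A minimal sketch of switching to the local Hugging Face model mentioned above, assuming llama-index-embeddings-huggingface is installed:

```python
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Local model: no API calls, data stays on your machine
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

vector = Settings.embed_model.get_text_embedding("Rome was founded in 753 BC.")
print(len(vector))   # 384 dimensions for this model
```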
How do I decide which embedding model I should use?
Choosing the right embedding model is crucial for a successful Retrieval-Augmented Generation (RAG) application, impacting performance, quality, and cost. Key considerations include:
- Performance: Both qualitative factors (semantic understanding, domain specificity) and quantitative factors (semantic similarity, benchmarks such as the MTEB Leaderboard - https://huggingface.co/spaces/mteb/leaderboard) are important.
- Speed & Efficiency: Latency and throughput matter for real-time applications, as queries need to be embedded quickly. Consider input chunk size limitations.
- Language Support: Choose a model that supports the languages your application requires.
- Resources & Cost: Balance embedding accuracy with computational costs, storage, and API usage fees.
- Accessibility: Consider availability (API vs. local install) and ease of integration.
- Privacy & Connectivity: Local models offer privacy and offline functionality.
LlamaIndex offers flexibility and supports many embedding models (see https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html#list-of-supported-embeddings).
While OpenAI's text-embedding-ada-002 is a good default choice, benchmarking different models is recommended to optimize for specific application needs. Resources like https://blog.getzep.com/text-embedding-latency-a-semi-scientific-look/ can help evaluate model performance.
Persisting and reusing Indexes
This text discusses the importance of storing vector embeddings generated by LlamaIndex to avoid redundant computation and ensure consistent query results. Here’s a summary:
- Why persist embeddings? Re-embedding documents is computationally expensive and slow. Storing embeddings allows for faster processing, lower costs, and consistent query accuracy.
- Vector Stores in LlamaIndex: LlamaIndex uses vector stores for efficient storage and retrieval of these embeddings. It defaults to in-memory storage but offers persistence via the .persist() method.
- How to persist and load (see the sketch below):
  - Use index.storage_context.persist(persist_dir="index_cache") to save the index data to disk.
  - Use StorageContext.from_defaults() and load_index_from_storage() to reload the index from the saved directory in future sessions, avoiding re-indexing.
In essence, the text explains how to save and reload LlamaIndex indexes to disk for efficiency and consistency.
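A minimal persist-and-reload sketch; the directory name is arbitrary and an embedding model is assumed to be configured for the initial build:

```python
from llama_index.core import (
    Document,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

# First session: build the index and write it to disk
index = VectorStoreIndex.from_documents([Document(text="Persist me, please.")])
index.storage_context.persist(persist_dir="index_cache")

# Later session: reload without re-embedding anything
storage_context = StorageContext.from_defaults(persist_dir="index_cache")
index = load_index_from_storage(storage_context)
```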
Understanding the StorageContext
The StorageContext in LlamaIndex is a central component for managing data storage during indexing and querying. It encompasses four key stores:
- Document Store: Stores documents locally in docstore.json.
- Index Store: Stores index structures locally in index_store.json.
- Vector Stores: Manages multiple vector stores (locally in vector_store.json by default).
- Graph Store: Stores graph data structures in graph_store.json.
LlamaIndex automatically creates these local storage files when using the persist() method, but it also allows for custom persistence locations.
While basic local stores are provided, the StorageContext is designed to be flexible, supporting integrations with more robust solutions such as AWS S3, Pinecone, and MongoDB.
The example demonstrates customizing vector storage using ChromaDB:
- Install chromadb via pip.
- Initialize a Chroma client and create a collection (my_chroma_store).
- Create a ChromaVectorStore instance linked to the Chroma collection.
- Integrate the ChromaVectorStore into the StorageContext.
- Build an index using the customized StorageContext (see the sketch below).
This approach simplifies working with vector databases, abstracting away complexity and allowing developers to focus on application logic. LlamaIndex offers a scalable solution, ranging from simple in-memory storage to cloud-hosted databases, with easy component swapping.
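A sketch of the five steps above, assuming the chromadb and llama-index-vector-stores-chroma packages are installed; the collection name matches the example and the document is a placeholder:

```python
import chromadb
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("my_chroma_store")

vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    [Document(text="Vectors now live in Chroma instead of in memory.")],
    storage_context=storage_context,
)
```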
Exploring other index types in LlamaIndex
The SummaryIndex
The SummaryIndex is a simple and efficient indexing method in LlamaIndex. It differs from the VectorStoreIndex by storing data as a sequential list of nodes without using embeddings or a vector store, which makes it faster and less resource-intensive.
Key features and use cases:
- Simple Structure: Data is stored as a list of chunks from ingested documents.
- No LLM or Embeddings: Operates locally without requiring large language models or embedding models during indexing.
- Linear Scan: Retrieval involves scanning the list sequentially for relevant information.
- Suitable for: Documentation search, scenarios with resource constraints, or when complex semantic search isn't necessary.
- Usage: Easily created using SummaryIndex.from_documents().
- Refinement Process: Uses a "create and refine" approach during queries, building an initial response and then refining it with additional context.
- Retrievers: Compatible with different retrievers (SummaryIndexRetriever, SummaryIndexEmbeddingRetriever, SummaryIndexLLMRetriever) for varied search mechanisms.
In essence, the SummaryIndex provides a straightforward way to index and search data when speed and simplicity are prioritized over complex semantic understanding.
The DocumentSummaryIndex
The DocumentSummaryIndex is a specialized indexing tool within LlamaIndex designed for efficient document retrieval, particularly useful for large datasets where quick access to specific documents is needed.
Key Features & Functionality:
- Summarization: It works by summarizing each document and linking these summaries to the document's underlying nodes.
- Efficient Retrieval: These summaries act as a quick filter, identifying relevant documents before deeper analysis.
- Use Case: Ideal for knowledge management systems within organizations dealing with extensive documentation (reports, policies, manuals, etc.). It avoids issues with embedding-based retrieval on entire datasets with similar text chunks.
- Customization: Offers parameters to control:
  - response_synthesizer: How summaries are generated.
  - summary_query: The prompt used for summarization.
  - show_progress: Display progress bars during indexing.
  - embed_summaries: Embed summaries for similarity-based searches (default is True).
- Retrieval Methods: Supports both embedding-based and LLM-based retrievers.
Basic Usage:
Creating a DocumentSummaryIndex involves loading documents, summarizing them, and associating the summaries with the document nodes. The get_document_summary() method allows access to the generated summaries for individual documents.
In essence, the DocumentSummaryIndex prioritizes speed and relevance by leveraging document summaries to narrow the search space, making it a valuable tool for specific retrieval scenarios.
The KeywordTableIndex
The KeywordTableIndex in LlamaIndex is an efficient index structure designed for rapid, targeted factual lookup based on keyword matching. It functions like a glossary, creating a keyword-to-node mapping for quick retrieval of relevant information.
Key Features:
- Keyword-Based: Instead of relying on complex embedding spaces, it uses a straightforward keyword table.
- Efficient Search: Enables fast retrieval by directly matching keywords in queries to those in the index.
- Customizable: Offers parameters such as keyword_extract_template (for prompt customization), max_keywords_per_chunk (to manage table size), and use_async (for performance).
- Keyword Extraction: Extracts keywords from documents using an LLM and a defined prompt, linking them to the source text chunks.
- Retrieval Modes: Supports simple keyword matching, RAKE, and LLM-based keyword extraction/matching.
- Alternatives: Offers SimpleKeywordTableIndex (regex-based) and RAKEKeywordTableIndex (using rake_nltk) as LLM-free options.
- Create and Refine: Like SummaryIndex, it uses a create-and-refine approach for final response synthesis.
The index is particularly useful when precise keyword matching is crucial, and provides a versatile tool for applications requiring keyword precision. A simple example demonstrates its ease of use, automatically extracting keywords and setting up the retrieval system.
The TreeIndex
The TreeIndex is a hierarchical data structure within LlamaIndex designed for efficient information organization and retrieval, particularly useful for complex datasets. Unlike a flat index, it organizes data in a tree format where each node summarizes its children, built recursively using LLMs and customizable summarization prompts.
Key Features & Parameters:
- Hierarchical Structure: Data is organized in a tree, allowing for abstraction and efficient querying.
- Customizable Parameters:
  - summary_template: Prompt for summarization during index construction.
  - insert_prompt: Prompt for integrating new nodes into the tree.
  - num_children: Maximum number of child nodes per node (default is 10).
  - build_tree: Determines whether the tree is built during index construction or at query time.
  - use_async: Enables asynchronous operation for faster processing of large datasets.
- Retrieval Modes: Offers various retrieval strategies, including TreeSelectLeafRetriever, TreeSelectLeafEmbeddingRetriever, TreeRootRetriever, and TreeAllLeafRetriever.
- Query Process: Queries traverse the tree, identifying relevant keywords in node summaries to pinpoint relevant leaf nodes.
Usage:
The TreeIndex is created from documents and used with a query engine to retrieve information. A simple example demonstrates loading documents and querying the index.
Drawbacks:
While powerful, the TreeIndex has potential drawbacks:
- Increased Computation: Building and maintaining the tree is computationally intensive.
- Recursive Retrieval: Querying involves recursive tree traversal, which can be slow.
- Summarization Overhead: Summarizing nodes adds to the processing cost.
- Storage Requirements: Requires more storage than flat indexes.
- Maintenance: Updates and insertions can be complex.
Overall:
The TreeIndex is a valuable tool for RAG applications dealing with large, complex datasets where context and relationships are important. However, its computational and storage costs should be carefully weighed against the benefits of improved retrieval performance. It excels in scenarios needing efficient, context-aware retrieval, particularly within organizations managing hierarchical data.
The KnowledgeGraphIndex
The KnowledgeGraphIndex in LlamaIndex is a tool for enhancing query processing by building a knowledge graph (KG) from text data. It primarily uses an LLM to extract triplets (subject-predicate-object) from text, but allows for custom extraction functions.
Key Features & Benefits:
- Relationship Focus: Excels at understanding complex relationships between entities and concepts, providing context-aware responses. Ideal for multifaceted questions.
- Use Cases: Suitable for applications like news aggregation, where tracking entities and their relationships over time is valuable.
- Customization: Offers several customizable parameters:
  - kg_triple_extract_template: Controls how triplets are identified.
  - max_triplets_per_chunk: Limits triplets per text chunk.
  - graph_store: Defines the graph storage type.
  - include_embeddings: Adds embeddings for enhanced retrieval.
  - max_object_length: Limits the length of the object in a triplet.
  - kg_triplet_extract_fn: Allows for custom triplet extraction.
- Construction: Builds the KG either with the default LLM-based triplet extraction method or with a user-provided custom function. Embeddings can be included for each triplet.
- Querying: Utilizes three distinct retrievers (KGTableRetriever, KnowledgeGraphRAGRetriever, and a hybrid mode) to retrieve relevant information from the KG.
In essence, the KnowledgeGraphIndex transforms text into a structured knowledge representation, enabling more intelligent and contextually relevant query responses.
Building Indexes on top of other Indexes with ComposableGraph
The ComposableGraph in LlamaIndex is a method for structuring information by hierarchically stacking Indexes. It allows you to build lower-level Indexes within individual documents (such as a TreeIndex) and then aggregate those into higher-level Indexes over a collection of documents (such as a SummaryIndex).
Key features and functionality:
- Hierarchical Structure: Enables organization of detailed information within documents and summarization across collections.
- Construction: Built using ComposableGraph.from_indices(), requiring a root Index class (e.g., SummaryIndex), child Indexes (e.g., TreeIndex), and summaries for each child Index (a sketch appears at the end of this section).
- Querying: A ComposableGraphQueryEngine recursively traverses the hierarchy, starting from the root summary Index, to retrieve relevant information from lower-level Indexes.
- Customization: Allows for custom query engines at each Index level for tailored retrieval strategies.
- Summaries: Summaries can be manually defined or automatically generated using queries or SummaryExtractor.
Benefits:
- Efficient retrieval of information from both high-level summaries and detailed, low-level Indexes.
- Comprehensive understanding of complex datasets.
- Deep, hierarchical understanding of data.
In essence, ComposableGraph provides a powerful way to organize and query complex information through a layered indexing approach.
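A hedged sketch of stacking a SummaryIndex over two per-document TreeIndex instances; the documents and summaries are placeholders, LLM calls occur at build and query time, and the exact import path may vary by version:

```python
from llama_index.core import Document, SummaryIndex, TreeIndex
from llama_index.core.indices.composability import ComposableGraph

doc_a = Document(text="An internal report about Q1 sales performance...")
doc_b = Document(text="An engineering postmortem for the March outage...")

index_a = TreeIndex.from_documents([doc_a])
index_b = TreeIndex.from_documents([doc_b])

graph = ComposableGraph.from_indices(
    SummaryIndex,                        # root Index class
    [index_a, index_b],                  # child Indexes
    index_summaries=[
        "Summary of the Q1 sales report.",
        "Summary of the March outage postmortem.",
    ],
)
print(graph.as_query_engine().query("What caused the March outage?"))
```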
Estimating the potential cost of building and querying Indexes
This text details the potential costs and privacy concerns associated with using Indexes in LlamaIndex, primarily due to their reliance on Large Language Models (LLMs) for building and querying.
Key takeaways:
- Cost Considerations: Repeated LLM calls, especially during index construction (as with TreeIndex or KeywordTableIndex) and embedding generation (as with VectorStoreIndex), can quickly become expensive.
- Best Practices for Cost Reduction:
  - Utilize Indexes that minimize LLM calls during building (e.g., SummaryIndex, SimpleKeywordTableIndex).
  - Employ cheaper LLM models when full accuracy isn't essential.
  - Cache and reuse existing Indexes to avoid redundant building.
  - Optimize query parameters (e.g., similarity_top_k) to reduce LLM calls.
  - Use local LLM and embedding models for cost control and enhanced data privacy.
- Cost Estimation: The text provides practical examples using MockLLM and MockEmbedding with TokenCountingHandler to estimate LLM and embedding token usage before building and querying indexes, allowing for proactive cost management.
- RAG & Smaller Models: Retrieval-Augmented Generation (RAG) enhances the performance of smaller models by providing access to external knowledge, mitigating the need for excessively large, costly models.
- Importance of Prediction: Always estimate token usage before indexing large datasets to avoid unexpected expenses.
In essence, the document advocates for a proactive approach to cost and privacy management when using LlamaIndex Indexes, emphasizing estimation, optimization, and the potential benefits of local models.
Indexing our PITS study materials – hands-on
This text details the implementation of an index_builder.py module for the tutoring application using LlamaIndex. The module is responsible for creating and loading indexes for efficient data retrieval.
Here’s a summary of the key points:
- Two Index Types: The module creates two types of indexes: a VectorStoreIndex and a TreeIndex.
- Persistence: The code first attempts to load existing indexes from a specified storage location (INDEX_STORAGE). This avoids rebuilding the indexes if they already exist, saving time and resources.
- Index IDs: When multiple indexes are stored in the same location, index_id is used to differentiate and correctly load them.
- Building New Indexes: If the indexes are not found in storage, they are built from the provided nodes (document chunks). Each index is assigned a unique ID ("vector" and "tree") using set_index_id.
- Storage: Newly created indexes are persisted to the INDEX_STORAGE directory for future use.
- Return Value: The build_indexes function returns both the vector_index and tree_index objects (a condensed sketch follows).
The code provides a basic implementation with potential for improvement, and the next step (covered in Chapter 6) will focus on querying the data using these indexes.
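A condensed sketch of build_indexes(), reusing the INDEX_STORAGE constant and the "vector"/"tree" IDs mentioned above; error handling is reduced to a broad try/except and details may differ from the book's code:

```python
from llama_index.core import (
    StorageContext,
    TreeIndex,
    VectorStoreIndex,
    load_index_from_storage,
)

INDEX_STORAGE = "./index_storage"   # per global_settings.py

def build_indexes(nodes):
    try:
        # Reuse previously persisted indexes when they exist
        storage_context = StorageContext.from_defaults(persist_dir=INDEX_STORAGE)
        vector_index = load_index_from_storage(storage_context, index_id="vector")
        tree_index = load_index_from_storage(storage_context, index_id="tree")
    except Exception:
        # Otherwise build both indexes from the ingested nodes and persist them
        storage_context = StorageContext.from_defaults()
        vector_index = VectorStoreIndex(nodes, storage_context=storage_context)
        vector_index.set_index_id("vector")
        tree_index = TreeIndex(nodes, storage_context=storage_context)
        tree_index.set_index_id("tree")
        storage_context.persist(persist_dir=INDEX_STORAGE)
    return vector_index, tree_index
```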
Part 3: Retrieving and Working with Indexed Data
Chapter 6: Querying Our Data, Part 1 – Context Retrieval
Understanding the basic retrievers
This text explains retrieval mechanisms within the LlamaIndex RAG (Retrieval-Augmented Generation) system. Here’s a summary:
- Core Function: Retrievers find relevant information ("nodes") from an index to provide context for generating responses. They return results as NodeWithScore objects, which include a relevance score (though not all retrievers provide one).
- Construction Methods: Retrievers can be created in two main ways (see the sketch after this list):
  - From an Index: Using the as_retriever() method of an index object (e.g., summary_index.as_retriever()).
  - Direct Instantiation: Directly creating a retriever object (e.g., SummaryIndexEmbeddingRetriever(index=summary_index)).
- Upcoming Information: The text previews a detailed list of available retriever options for each index type within LlamaIndex, intended as a reference for building applications.
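A minimal sketch of both construction styles; an embedding model is assumed to be configured and the document is a placeholder:

```python
from llama_index.core import Document, SummaryIndex
from llama_index.core.indices.list import SummaryIndexEmbeddingRetriever

summary_index = SummaryIndex.from_documents(
    [Document(text="Lionel Messi was born in Rosario, Argentina.")]
)

# Option 1: let the index hand you its default retriever
default_retriever = summary_index.as_retriever()

# Option 2: instantiate a specific retriever class directly
embedding_retriever = SummaryIndexEmbeddingRetriever(index=summary_index)

results = default_retriever.retrieve("Where was Messi born?")
print(results[0].node.text, results[0].score)
```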
The VectorStoreIndex retrievers
This document details various retriever options available within the LlamaIndex framework for different index types, focusing on how they function and their customization options.
1. VectorIndex Retrievers:
- VectorIndexRetriever: The default retriever for VectorStoreIndex, it uses vector similarity search. Key customizable parameters include:
  - similarity_top_k: Number of top results returned.
  - vector_store_query_mode: Query mode for the vector store (e.g., Pinecone, OpenSearch).
  - filters, doc_ids, node_ids: Ways of narrowing the search scope using metadata or IDs.
  - alpha, sparse_top_k: Parameters for hybrid (sparse and dense) search.
  - vector_store_kwargs: For passing specific arguments to the vector store.
- VectorIndexAutoRetriever: A more advanced retriever that uses an LLM to automatically optimize query parameters based on content description and metadata, useful for complex or ambiguous data.
2. SummaryIndex Retrievers:
- SummaryIndexRetriever: Returns all nodes in the index without filtering or sorting – useful for a complete data view.
- SummaryIndexEmbeddingRetriever: Uses embeddings (created dynamically) to find the most relevant nodes based on similarity to the query, returning nodes with a relevance score (NodeWithScore).
- SummaryIndexLLMRetriever: Leverages an LLM and a prompt to select relevant nodes. Customizable via:
  - choice_select_prompt: Override the default prompt.
  - choice_batch_size: Batch size for query processing.
  - format_node_batch_fn, parse_choice_select_answer_fn: Functions for formatting node batches and parsing LLM responses (including relevance score calculation).
  - service_context: Allows customization of the LLM used.
General Considerations:
- Security: Filtering information early in the RAG process (at the retriever stage) is a secure design principle.
- Cost: Reducing the amount of information processed by the LLM (through filtering) can lower costs.
The document emphasizes choosing the appropriate retriever based on the data’s structure, the user’s familiarity with the data, and the desired level of control over the search process.
The DocumentSummaryIndex retrievers
The text details two retrieval options for a DocumentSummaryIndex: DocumentSummaryIndexLLMRetriever and DocumentSummaryIndexEmbeddingRetriever.
DocumentSummaryIndexLLMRetriever:
- Uses an LLM to select relevant summaries from document summaries.
- Processes queries in batches, configurable with choice_batch_size.
- Allows custom prompts (choice_select_prompt) and functions for formatting nodes for the LLM (format_node_batch_fn) and parsing the LLM's response (parse_choice_select_answer_fn).
- Returns results sorted by relevance and includes a relevance score for each node.
- Note: Experimentation showed LLM-assigned relevance scores tend to be consistently high, potentially requiring prompt adjustments for nuanced differentiation.
DocumentSummaryIndexEmbeddingRetriever:
- Relies on embeddings to find summaries with the highest similarity to the query.
- Requires the index to be built with embed_summaries=True.
- Uses similarity_top_k to specify the number of summaries to return.
- Does not return a relevance score.
- Effective for finding relevant summaries based on embedding similarity.
In essence, the LLM retriever leverages natural language understanding for more sophisticated relevance assessment (with scores), while the embedding retriever uses a faster, similarity-based approach.
The TreeIndex retrievers
This text details the TreeIndex retrievers in LlamaIndex. The TreeIndex is a complex index type suited to hierarchical data such as filesystems or organizational charts. It is important to note that the TreeIndex builds a new hierarchical structure based on summaries of the original data rather than simply reflecting existing hierarchies. Querying this structure can be computationally expensive due to its recursive nature.
Here's a breakdown of the different retrieval methods available for the TreeIndex:
- TreeSelectLeafRetriever (Default): Recursively navigates the tree, using an LLM to identify the most relevant leaf nodes. The child_branch_factor controls how many child nodes are considered at each level (defaults to 1). Offers customizable prompt templates for query refinement. Doesn't return relevance scores.
- TreeSelectLeafEmbeddingRetriever: Similar to TreeSelectLeafRetriever, but uses embedding similarity to select nodes instead of an LLM. Includes an embed_model parameter for specifying the embedding model. Doesn't return relevance scores.
- TreeAllLeafRetriever: Retrieves all leaf nodes, regardless of hierarchy, and sorts them. The fastest option, useful for ensuring no information is missed, but doesn't provide relevance scores.
- TreeRootRetriever: Retrieves responses directly from the root nodes of the tree, assuming answers are pre-computed and stored there. Efficient when information is already summarized at the top level. Doesn't return relevance scores.
Practical Use Case: The text highlights a clinical decision support system (CDSS) as a good example, where pre-computed answers to common medical questions are stored in root nodes for quick retrieval.
In essence, the TreeIndex offers flexibility in how you navigate and retrieve information from hierarchical data, with trade-offs between speed, computational cost, and the need for relevance scoring.
The KeywordTableIndex retrievers
The KeywordTableIndex retrieves information by first extracting keywords from a query. The extraction method varies depending on the retriever used. Once keywords are extracted, the retriever counts their frequency within the indexed nodes and sorts nodes by matching keyword count (typically descending, indicating relevance). Results are returned as NodeWithScore objects, though relevance scores are not directly provided by the index itself.
There are three main retriever options:
- KeywordTableGPTRetriever: Uses an LLM to identify keywords.
- KeywordTableSimpleRetriever: Uses a faster, regex-based keyword extraction method.
- KeywordTableRAKERetriever: Employs the RAKE method for keyword extraction.
Common arguments for configuring these retrievers include query_keyword_extract_template (for the default retriever), max_keywords_per_query, and num_chunks_per_query, which control query complexity and system performance.
The KnowledgeGraphIndex retrievers
This text details two types of retrievers used with Knowledge Graph Indices in LlamaIndex: KGTableRetriever and KnowledgeGraphRAGRetriever. Both extract relevant information (nodes) from a knowledge graph, whose facts are stored as triplets (subject, predicate, object), based on user queries.
KGTableRetriever:
- The default retriever; operates in three modes:
  - Keyword: Uses keywords from the query to find matching nodes (case-sensitive).
  - Embedding: Converts the query to an embedding and finds similar nodes.
  - Hybrid: Combines keyword and embedding searches for precision and semantic understanding.
- Offers several customizable parameters to control keyword extraction, query refinement, and the amount of information retrieved (e.g., max_keywords_per_query, similarity_top_k).
- Returns a default score of 1000 for retrieved nodes.
- If no nodes are found, returns a placeholder node indicating "No relationships found".
KnowledgeGraphRAGRetriever:
- Identifies key entities in the query and uses them to navigate the graph.
- Utilizes entity extraction and synonym expansion to broaden the query context.
- Traverses the graph to a specified depth (graph_traversal_depth).
- Also operates in keyword, embedding, and hybrid modes (though as of January 2024, only keyword mode was fully implemented in v0.9.25).
- Includes a with_nl2graphquery option to convert natural language queries into graph queries.
- Offers parameters to control entity/synonym limits, expansion policies, and verbosity.
Both retrievers allow prompts to be customized using BasePromptTemplate objects (detailed in a later chapter). Both aim to retrieve relevant knowledge sequences to answer user queries, balancing information quality and quantity through parameters such as max_knowledge_sequence.
Efficient use of retrieval mechanisms – asynchronous operation
This text discusses the benefits of using asynchronous execution in LlamaIndex, as opposed to the previously used synchronous methods. While synchronous methods are simpler to understand, asynchronous operations improve performance, reduce latency, and enhance user experience—especially in applications with frequent, complex queries and large datasets.
The provided code example demonstrates how to run two retrievers in parallel using asyncio.gather(). Although the performance gain is minimal with a small dataset, the benefits become significant in real-world applications. The text then moves on to more advanced retrieval methods.
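A minimal sketch of running two retrievers concurrently with asyncio.gather(); an embedding model is assumed to be configured and the documents are placeholders:

```python
import asyncio

from llama_index.core import Document, SummaryIndex, VectorStoreIndex

docs = [Document(text="Lionel Messi was born in Rosario, Argentina.")]
vector_retriever = VectorStoreIndex.from_documents(docs).as_retriever()
summary_retriever = SummaryIndex.from_documents(docs).as_retriever()

async def retrieve_in_parallel(question: str):
    # Both retrievers run concurrently instead of one after the other
    return await asyncio.gather(
        vector_retriever.aretrieve(question),
        summary_retriever.aretrieve(question),
    )

vector_nodes, summary_nodes = asyncio.run(
    retrieve_in_parallel("Where was Messi born?")
)
```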
Building more advanced retrieval mechanisms
Implementing metadata filters
This text demonstrates how to implement a retrieval system using LlamaIndex that filters results based on metadata, specifically to handle situations where the same term has different meanings depending on the user’s context (in this case, their department).
Here’s a breakdown:
- The Problem: Different departments within an organization may have differing definitions for the same concepts (e.g., "incident").
- The Solution: Use metadata filtering to retrieve only the definition relevant to the current user's department.
- Implementation (a sketch appears at the end of this section):
  - Define User Departments: A dictionary maps users to their respective departments.
  - Create Nodes with Metadata: Text nodes are created, each containing a definition and metadata specifying the relevant department.
  - Filtering Function: A function show_report uses MetadataFilters to retrieve nodes matching the user's department.
  - Retrieval: The as_retriever method is used with the filters to create a retriever that only returns relevant nodes.
- Example: Running the same query ("What is an incident?") for users "Alice" (Security) and "Bob" (IT) returns different definitions tailored to their departments.
- Advanced Filtering: While the default vector store in LlamaIndex only supports equality (EQ) filtering, more sophisticated vector stores (such as Pinecone or ChromaDB) support a wider range of operators (greater than, less than, in, not in, etc.) for more complex filtering scenarios, such as access control based on clearance levels.
In essence, the text showcases a practical application of metadata filtering in LlamaIndex to achieve a form of "polymorphism" in information retrieval, delivering contextually appropriate results to different users.
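A condensed sketch of the department-filtering idea described above; the node texts, user mapping, and function name mirror the summary, and an embedding model is assumed to be configured:

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

nodes = [
    TextNode(text="Incident: any confirmed security breach.",
             metadata={"department": "Security"}),
    TextNode(text="Incident: any unplanned service interruption.",
             metadata={"department": "IT"}),
]
index = VectorStoreIndex(nodes)

user_department = {"Alice": "Security", "Bob": "IT"}

def show_report(user: str, question: str):
    filters = MetadataFilters(
        filters=[ExactMatchFilter(key="department", value=user_department[user])]
    )
    retriever = index.as_retriever(filters=filters)
    return [result.node.text for result in retriever.retrieve(question)]

print(show_report("Alice", "What is an incident?"))  # Security definition only
print(show_report("Bob", "What is an incident?"))    # IT definition only
```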
Using selectors for more advanced decision logic
This text discusses the importance of selectors in advanced Retrieval-Augmented Generation (RAG) applications, particularly when dealing with diverse user queries. Because users may ask specific questions, seek general information, or request summaries/comparisons, a RAG system needs a way to dynamically choose the best retrieval method.
Selectors act as this decision-making component, implementing conditional logic to route queries to the appropriate tool (retriever, parser, index, etc.). LlamaIndex offers five types of selectors: LLMSingleSelector, LLMMultiSelector, EmbeddingSingleSelector, PydanticSingleSelector, and PydanticMultiSelector, which differ in how they make their selections (LLM reasoning, similarity calculations, or Pydantic objects).
The example provided demonstrates a simple LLMSingleSelector that uses an LLM to choose from a predefined list of options based on a user query, returning both the selected option and the reasoning behind the choice. The text emphasizes that selectors are a generic mechanism applicable to various conditional logic scenarios within a RAG application, not just retrievers. It then introduces the concept of ToolMetadata as a more advanced selection method, setting the stage for further explanation.
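A minimal sketch of an LLMSingleSelector choosing between two described options; an LLM (OpenAI by default) is assumed to be configured and the tool names are hypothetical:

```python
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import ToolMetadata

choices = [
    ToolMetadata(name="vector_tool", description="Answers specific factual questions."),
    ToolMetadata(name="summary_tool", description="Produces broad document summaries."),
]

selector = LLMSingleSelector.from_defaults()   # uses the configured LLM to decide
result = selector.select(choices, query="Give me an overview of the whole report")

selection = result.selections[0]
print(selection.index, selection.reason)       # which choice was picked, and why
```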
Understanding tools
This text explains how to implement an adaptive retrieval mechanism using LlamaIndex, enabling an application to intelligently choose the best retriever for a given query.
Here’s a summary of the key concepts and steps:
- Agentic Functionality & Tool Containers: The core idea is to use a generic container holding different functionalities (retrievers in this case) that can be selected at runtime based on context.
- LlamaHub Tools: LlamaHub provides a collection of pre-built tools for various tasks.
- RetrieverTool: This class encapsulates a retriever and a textual description, allowing a selector to understand its purpose.
- RouterRetriever: This object uses a selector to decide which RetrieverTool to use for a given query. It takes the selector and a list of RetrieverTool objects as input.
- Selectors (PydanticMultiSelector): These determine which retriever(s) to use. PydanticMultiSelector can select multiple retrievers simultaneously, handling complex queries that require information from multiple sources; PydanticSingleSelector would choose only one.
- Implementation: The example code demonstrates creating two retrievers (one for Ancient Rome, one for dogs), wrapping them in RetrieverTool objects with descriptive text, and then combining them into a RouterRetriever. Queries are then passed to the RouterRetriever, which dynamically selects the appropriate retriever based on the query's content (a sketch appears at the end of this section).
The text sets the stage for further discussion of more advanced retrieval and query engine techniques in later chapters.
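A sketch of the routing setup described above; an OpenAI key is assumed (the Pydantic selectors rely on OpenAI function calling) and the documents are placeholders:

```python
from llama_index.core import Document, SummaryIndex, VectorStoreIndex
from llama_index.core.retrievers import RouterRetriever
from llama_index.core.selectors import PydanticMultiSelector
from llama_index.core.tools import RetrieverTool

rome_index = VectorStoreIndex.from_documents(
    [Document(text="Ancient Rome was founded in 753 BC on the Tiber.")]
)
dogs_index = SummaryIndex.from_documents(
    [Document(text="Dogs were domesticated from wolves thousands of years ago.")]
)

router = RouterRetriever(
    selector=PydanticMultiSelector.from_defaults(),   # may pick more than one tool
    retriever_tools=[
        RetrieverTool.from_defaults(
            retriever=rome_index.as_retriever(),
            description="Facts about Ancient Rome.",
        ),
        RetrieverTool.from_defaults(
            retriever=dogs_index.as_retriever(),
            description="General information about dogs.",
        ),
    ],
)
print(router.retrieve("When was Rome founded?"))
```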
Transforming and rewriting queries
This text introduces QueryTransform as a powerful tool for Retrieval-Augmented Generation (RAG) applications. It allows queries to be modified and rewritten before they are used to search an index, improving retrieval relevance and accuracy.
Key takeaways:
- Purpose: To refine user queries into more effective search terms. A practical example given is a technical support chatbot where vague user descriptions can be transformed into specific technical queries.
- Variations: Several QueryTransform types exist, each with a specific function:
  - IdentityQueryTransform: No modification – maintains default behavior.
  - HyDEQueryTransform: Generates hypothetical documents to improve relevance.
  - DecomposeQueryTransform: Breaks down complex queries into simpler subqueries.
  - ImageOutputQueryTransform: Formats results for image output (e.g., generating <img> tags).
  - StepDecomposeQueryTransform: Decomposes queries while considering previous reasoning/context.
- Example: The provided Python code demonstrates DecomposeQueryTransform taking a broad query ("Tell me about buildings in ancient Rome") and refining it into a more focused one ("What were some famous buildings in ancient Rome?"). This illustrates how transformation can lead to more accurate retrieval.
In essence, QueryTransform enhances RAG systems by bridging the gap between how users ask questions and how the index best understands and responds to them.
Creating more specific sub-queries
This text explains how to improve query performance in LlamaIndex by breaking down complex questions into simpler sub-queries using the OpenAIQuestionGenerator.
Here’s a summary of the key points:
- Problem: Ambiguous or complex questions can lead to poor results from information retrieval systems.
- Solution: OpenAIQuestionGenerator automatically generates more specific sub-questions from an initial query.
- How it works:
  - It utilizes LLMs (specifically OpenAI's by default) to understand the query and the available tools.
  - ToolMetadata is used to describe each retrieval tool (e.g., a vector index for Ancient Rome, a summary index for dogs).
  - The generator receives a list of tools and the original query, then outputs a list of SubQuestion objects, each containing a tool_name and a refined sub_question.
- Benefits: More specific queries lead to better context for retrieval and higher-quality answers.
- Alternatives: LLMQuestionGenerator (allows use of any LLM) and GuidanceQuestionGenerator (guides query processing order) are also available.
- Next Steps: These sub-queries are used with a SubQuestionQueryEngine (discussed in a later chapter) to process the information.
In essence, the text demonstrates a technique for enhancing query accuracy by strategically decomposing complex requests into manageable, focused sub-questions.