Using VectorStoreIndex
This document details the VectorStoreIndex in LlamaIndex, a core component for Retrieval-Augmented Generation (RAG). Here’s a breakdown:
Core Functionality:
- VectorStoreIndex builds an index from a list of Node objects (text chunks with metadata).
- It’s fundamental to most LlamaIndex applications, used directly or indirectly.
Loading Data & Creating the Index:
- Basic Usage: The simplest method uses from_documents to load documents and automatically create the index.
- Ingestion Pipeline: Offers more control over indexing by customizing chunking, metadata, and embedding using a pipeline of transformations.
- Manual Node Creation: Allows complete control by creating and defining Node objects directly before passing them to the index. Includes methods for handling updates (insertion, deletion, update, refresh).
- Batch Size: The insert_batch_size parameter can be adjusted for memory management, especially when using remote vector databases.
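To make the effect of a batch size concrete, here is a minimal standard-library sketch of splitting nodes into fixed-size batches. The helper name iter_batches and the string "nodes" are hypothetical stand-ins for illustration, not LlamaIndex's own code; the point is only that a smaller batch size means fewer items are held and sent at once.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


# Stand-in "nodes": plain strings instead of real Node objects.
nodes = [f"node-{i}" for i in range(5)]
batches = list(iter_batches(nodes, batch_size=2))
batch_sizes = [len(batch) for batch in batches]  # [2, 2, 1]
```

With five items and a batch size of two, only two items are in flight at any time, which is the kind of trade-off the insert_batch_size parameter is described as controlling.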
Storing the Index:
- LlamaIndex supports numerous vector stores.
- You specify the desired vector store using a StorageContext, passing the store (e.g., Pinecone) as its vector_store argument.
Advanced Features:
- Composable Retrieval: The index can retrieve not just nodes, but also query engines, retrievers, and query pipelines, automatically executing them when retrieved.
Resources:
- Links are provided to further documentation on loading, ingestion pipelines, node usage, metadata extraction, document management, vector stores, and example notebooks.
VectorStoreIndex Code
This code defines a VectorStoreIndex class, which is a type of index
in the LlamaIndex library designed to work with vector stores. Here’s a
breakdown of the code, section by section:
1. Imports:
- A variety of modules are imported from the llama_index library, covering asynchronous operations, base classes for retrievers and indices, data structures, embedding utilities, schema definitions (nodes, metadata), settings, storage, and vector store types.
- asyncio is imported for asynchronous programming.
- logging is used for logging messages.
- typing is used for type hinting.
2. VectorStoreIndex Class Definition:
- Inheritance: The VectorStoreIndex class inherits from BaseIndex[IndexDict]. This means it’s a specialized type of index that uses an IndexDict to store its internal structure.
- index_struct_cls: This class attribute is set to IndexDict, specifying the type of index structure to be used.
- __init__ Method (Constructor):
  - This method initializes the VectorStoreIndex object.
  - It takes several arguments, including:
    - nodes: A list of BaseNode objects representing the data to be indexed.
    - use_async: A boolean flag to enable asynchronous operations.
    - store_nodes_override: A flag to control whether to always store Node objects in the index and document store, even if the vector store already stores the text.
    - embed_model: The embedding model to use for generating vector embeddings.
    - insert_batch_size: The number of nodes to process in each batch during insertion.
    - Other arguments related to the parent BaseIndex class.
  - It initializes instance variables (e.g., _use_async, _embed_model, _insert_batch_size) based on the input arguments.
  - It calls the __init__ method of the parent BaseIndex class to perform common initialization tasks.
- from_vector_store Class Method:
  - This is a class method (indicated by the @classmethod decorator) that creates a VectorStoreIndex from an existing BasePydanticVectorStore.
  - It checks if the vector store stores text. If not, it raises a ValueError.
  - It creates a StorageContext from the vector store.
  - It returns a new VectorStoreIndex object initialized with the vector store and other parameters.
- vector_store Property:
  - This property provides access to the underlying BasePydanticVectorStore object.
- as_retriever Method:
  - This method returns a BaseRetriever object that can be used to retrieve data from the index.
  - It creates a VectorIndexRetriever object, passing in the index itself, the node IDs, and other relevant parameters.
- _get_node_with_embedding Method:
  - This method takes a list of BaseNode objects and generates embeddings for them using the specified embedding model.
  - It uses the embed_nodes function to perform the embedding in batches.
  - It creates a new list of BaseNode objects, each with its embedding added.
- _aget_node_with_embedding Method:
  - This is an asynchronous version of _get_node_with_embedding. It uses async_embed_nodes to generate embeddings asynchronously.
- _async_add_nodes_to_index Method:
  - This is an asynchronous method that adds nodes to the index.
  - It processes nodes in batches using iter_batch.
  - It generates embeddings for each batch of nodes using _aget_node_with_embedding.
  - It adds the nodes to the vector store using self._vector_store.async_add.
  - It handles the storage of nodes in the index struct and document store based on whether the vector store stores text and the _store_nodes_override flag.
- _add_nodes_to_index Method:
  - This is the synchronous version of _async_add_nodes_to_index. It performs the same operations but without using asynchronous calls.
- _build_index_from_nodes Method:
  - This method builds the index from a list of nodes.
  - It creates an IndexDict object.
  - It calls either _async_add_nodes_to_index or _add_nodes_to_index based on the _use_async flag.
  - It returns the created IndexDict object.
- build_index_from_nodes Method:
  - This method builds the index from nodes, filtering out nodes without content.
- _insert Method:
  - This method inserts nodes into the index.
- insert_nodes Method:
  - This method inserts nodes into the index, handling potential errors and updating the storage context.
- _delete_node Method:
  - Placeholder for deleting a single node.
- delete_nodes Method:
  - This method deletes a list of nodes from the index. It deletes from the vector store and, optionally, from the document store.
- _delete_from_index_struct Method:
  - Deletes nodes from the index structure.
- _delete_from_docstore Method:
  - Deletes nodes from the document store.
- delete_ref_doc Method:
  - Deletes a document and its associated nodes using a reference document ID.
- _adelete_from_index_struct Method:
  - Asynchronous version of _delete_from_index_struct.
- _adelete_from_docstore Method:
  - Asynchronous version of _delete_from_docstore.
- adelete_ref_doc Method:
  - Asynchronous version of delete_ref_doc.
- ref_doc_info Property:
  - Retrieves information about ingested documents and their nodes.
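The paired synchronous and asynchronous methods above follow a common pattern: one flag selects which path runs, and both paths return new node objects with embeddings attached. The following is a simplified standard-library sketch of that pattern. SketchNode, toy_embed, and build are hypothetical stand-ins, and the real class is assumed (not confirmed here) to differ in its details.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class SketchNode:
    """Hypothetical stand-in for a node: an id, some text, an optional embedding."""
    node_id: str
    text: str
    embedding: Optional[List[float]] = None


def toy_embed(text: str) -> List[float]:
    # Stand-in "embedding": text length and vowel count, not a real model.
    return [float(len(text)), float(sum(ch in "aeiou" for ch in text))]


def embed_nodes_sync(nodes: List[SketchNode]) -> List[SketchNode]:
    # Return copies with embeddings attached; the inputs are left untouched.
    return [replace(node, embedding=toy_embed(node.text)) for node in nodes]


async def embed_nodes_async(nodes: List[SketchNode]) -> List[SketchNode]:
    async def embed_one(node: SketchNode) -> SketchNode:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return replace(node, embedding=toy_embed(node.text))

    # gather preserves the order of its inputs.
    return list(await asyncio.gather(*(embed_one(node) for node in nodes)))


def build(nodes: List[SketchNode], use_async: bool = False) -> List[SketchNode]:
    # A single flag picks the path, mirroring the use_async idea described above.
    if use_async:
        return asyncio.run(embed_nodes_async(nodes))
    return embed_nodes_sync(nodes)


nodes = [SketchNode("a", "hello"), SketchNode("b", "vector store")]
sync_result = build(nodes)
async_result = build(nodes, use_async=True)
```

Both paths produce the same embedded copies; the asynchronous one only changes how the work is scheduled, which matters when embedding calls are network-bound.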
3. GPTVectorStoreIndex = VectorStoreIndex:
- This line creates an alias GPTVectorStoreIndex that points to the VectorStoreIndex class. This is likely done for compatibility or to provide a more specific name for a particular use case.
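A class alias of this kind is just a second name bound to the same class object. The short sketch below uses a hypothetical stand-in class rather than the real library class, purely to show what the assignment does.

```python
class VectorStoreIndex:
    """Hypothetical stand-in class, used only to illustrate aliasing."""

    def __init__(self, nodes=None):
        self.nodes = list(nodes or [])


# The alias is a plain assignment: no subclass or wrapper is created.
GPTVectorStoreIndex = VectorStoreIndex

legacy_index = GPTVectorStoreIndex(nodes=["n1"])
same_class = GPTVectorStoreIndex is VectorStoreIndex
```

Code written against either name gets the same class, so isinstance checks and behavior are identical under both.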
In Summary:
The VectorStoreIndex class provides a way to build an index on top
of an existing vector store. It handles the process of embedding nodes,
adding them to the vector store, and managing the index structure. It
supports both synchronous and asynchronous operations and provides
methods for inserting, deleting, and retrieving data. The class is
designed to be flexible and adaptable to different vector store
implementations. The store_nodes_override parameter is a key feature
that allows control over how nodes are stored: it forces nodes to be
kept in the index structure and document store even when the vector
store already stores the original text content.
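The role of the vector store's text-storage capability can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: SketchVectorStore, should_store_nodes, and index_from_existing_store are hypothetical names, and the exact condition in the real class is assumed from the descriptions of store_nodes_override and from_vector_store given earlier.

```python
from dataclasses import dataclass


@dataclass
class SketchVectorStore:
    """Hypothetical stand-in exposing only a stores_text capability flag."""
    stores_text: bool


def should_store_nodes(store: SketchVectorStore, store_nodes_override: bool = False) -> bool:
    # Assumed rule: keep nodes in the index structure and document store when
    # the vector store cannot hold text, or when the override forces it.
    return (not store.stores_text) or store_nodes_override


def index_from_existing_store(store: SketchVectorStore) -> dict:
    # Rebuilding from a store alone only works if the store kept the text,
    # matching the ValueError described for from_vector_store.
    if not store.stores_text:
        raise ValueError("Cannot build an index from a vector store that does not store text.")
    return {"vector_store": store}


text_store = SketchVectorStore(stores_text=True)
vector_only_store = SketchVectorStore(stores_text=False)

default_for_text_store = should_store_nodes(text_store)
forced_for_text_store = should_store_nodes(text_store, store_nodes_override=True)
default_for_vector_only = should_store_nodes(vector_only_store)
```

Under these assumptions the override only changes the outcome for stores that already keep text; a store without text always needs the nodes kept elsewhere.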