Building Indexing Pipelines

Indexing is the first step in your RAG journey. Indexing involves reading text from your documents and splitting it into smaller units called chunks. The chunks are converted into vector embeddings using an embedding model. These embeddings, along with the text and associated metadata, are inserted into a vector store.

Creating your first indexing pipeline

1. When you first log in to the application, you will see the Indexing Pipelines page, as shown in the image below.


Creating an indexing pipeline

2. You will see a message saying that no indexing pipelines were found. Click the Create button to create a new indexing pipeline. A popup dialog box then asks you to name the indexing pipeline, as shown in the image below.


Name the indexing pipeline

Pipeline name requirements

  • The indexing pipeline name must contain only lowercase letters and hyphens.

  • The indexing pipeline name cannot contain numbers or any special characters other than the hyphen.
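
The naming rule above can be checked before you submit the dialog. The following is a minimal sketch, not MRAG's own validation code: the function name and the regular expression are assumptions derived only from the two requirements listed, and the source does not say whether leading or trailing hyphens are accepted.

```python
import re

# Assumed pattern: one or more lowercase ASCII letters or hyphens, nothing else.
NAME_PATTERN = re.compile(r"[a-z-]+")


def is_valid_pipeline_name(name: str) -> bool:
    """Return True if name contains only lowercase letters and hyphens."""
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["product-docs", "Product-Docs", "docs-v2", "docs_faq"]:
        print(candidate, is_valid_pipeline_name(candidate))
```

Under these assumptions, only the first candidate passes: the others contain uppercase letters, a digit, or an underscore.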


3. After entering the indexing pipeline name and clicking the Create button in the popup dialog box, you will see the Indexing Pipeline canvas, where you can build your indexing pipeline, as shown in the image below.


Indexing pipeline canvas
  1. Drag and drop the components present in the sidebar on the left to build your indexing pipeline.

Supported document formats

Upload Files Component

Currently, MRAG allows a user to upload .txt and .pdf documents only. Each document must be smaller than 1 MB, and the combined size of all documents must be less than 5 MB.

Use the Upload File component shown in the image above to upload your documents. It should be the first component of your indexing pipeline.
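
If you want to check a batch of documents against the upload limits before opening the application, a small local check along these lines may help. This is only a sketch: the function and constant names are hypothetical, and it assumes that 1 MB means 1,048,576 bytes, which the source does not specify.

```python
from pathlib import Path

# Assumption: 1 MB is interpreted as 1024 * 1024 bytes.
MAX_FILE_BYTES = 1 * 1024 * 1024
MAX_TOTAL_BYTES = 5 * 1024 * 1024
ALLOWED_SUFFIXES = {".txt", ".pdf"}


def check_uploads(sizes: dict[str, int]) -> list[str]:
    """Return a list of problems for {filename: size_in_bytes}; empty means OK."""
    problems = []
    for name, size in sizes.items():
        if Path(name).suffix.lower() not in ALLOWED_SUFFIXES:
            problems.append(f"{name}: unsupported format")
        if size >= MAX_FILE_BYTES:
            problems.append(f"{name}: must be less than 1 MB")
    if sum(sizes.values()) >= MAX_TOTAL_BYTES:
        problems.append("total size must be less than 5 MB")
    return problems


if __name__ == "__main__":
    print(check_uploads({"guide.pdf": 500_000, "slides.docx": 10}))
```

To check real files, the sizes could be collected with Path(name).stat().st_size.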

Supported document splitters

Document splitters divide the document text into smaller units called chunks.

  • MRAG supports multiple document splitters.

  • Each splitter can be applied to multiple documents.

  • Multiple splitters can be applied to a single document by selecting the document in multiple splitter components.

  • Files uploaded using the Upload File component must be selected in at least one splitter component.
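
The last rule in the list above (every uploaded file must be selected in at least one splitter) can be expressed as a simple set check. The sketch below is illustrative only; the function name and the dictionary layout of splitter selections are assumptions, not MRAG's internal representation.

```python
def unassigned_files(
    uploaded: list[str], splitter_selections: dict[str, list[str]]
) -> list[str]:
    """Return uploaded files that are not selected in any splitter component."""
    selected = {name for files in splitter_selections.values() for name in files}
    return [name for name in uploaded if name not in selected]


if __name__ == "__main__":
    uploaded = ["guide.pdf", "faq.txt", "notes.txt"]
    selections = {
        "token-splitter": ["faq.txt"],
        # The same file may appear in more than one splitter.
        "regex-splitter": ["guide.pdf", "faq.txt"],
    }
    print(unassigned_files(uploaded, selections))
```

An empty result means that the pipeline satisfies the rule; here notes.txt would still need to be selected in some splitter.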

MRAG currently supports the following document splitters:

  • Token Splitter

  • Sentence Splitter

  • Regex Splitter

  • PDF Font Splitter

  • Dummy Splitter

A user can drag and drop these document splitter components onto the canvas to build the indexing pipeline. The document splitters are described below.


Token Splitter


Token Splitter splits a document into chunks based on the tokens present in the document. This component has multiple parameters that are described below.

Select Files

The Select Files dropdown list enables a user to select the documents to which the splitter must be applied. A user can select multiple files in the dropdown list.

Chunk Size

Chunk Size enables a user to specify the maximum number of tokens to be present in a single chunk.

Chunk Overlap

Chunk Overlap enables a user to specify the number of tokens from the end of the previous chunk that are repeated at the start of the current chunk. This overlap keeps the text from being split abruptly, which would otherwise cause context loss.

Separator

Separator enables a user to specify the delimiter to use for splitting the tokens.
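
To make the interaction between Chunk Size, Chunk Overlap, and Separator concrete, here is a minimal sketch of a sliding-window token splitter. It is not MRAG's implementation: it assumes that a token is any piece of text between separators, whereas the actual component may count tokens with a model tokenizer, and the function name is hypothetical.

```python
def split_by_tokens(
    text: str, chunk_size: int, chunk_overlap: int, separator: str = " "
) -> list[str]:
    """Split text into chunks of at most chunk_size tokens with overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    # Assumption: tokens are the non-empty pieces between separators.
    tokens = [t for t in text.split(separator) if t]
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(separator.join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks


if __name__ == "__main__":
    print(split_by_tokens("a b c d e f g", chunk_size=3, chunk_overlap=1))
```

With a chunk size of 3 and an overlap of 1, each chunk starts with the last token of the previous one, so neighbouring chunks share a little context.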

Include Filename

Include Filename enables a user to choose whether to include the filename in the chunk metadata. Including the filename in the chunk can improve retriever performance and LLM responses.

Metadata Schema

Metadata Schema dropdown list enables a user to choose the metadata schema to extract metadata from the document/chunk. Including metadata in a chunk enables self query. Only a single schema can be selected in a splitter component.

Chunk Level Metadata

Chunk Level Metadata enables a user to choose whether the metadata must be extracted from the document as a whole or from each chunk independently.
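
The difference between document-level and chunk-level metadata extraction, together with the Include Filename option, can be sketched as follows. Everything here is hypothetical: the function names, the record layout, and the toy extractor are assumptions made for illustration, and the sketch says nothing about how MRAG applies a metadata schema in practice.

```python
from typing import Callable

Extractor = Callable[[str], dict]


def attach_metadata(
    filename: str,
    chunks: list[str],
    extract: Extractor,
    chunk_level: bool,
    include_filename: bool = True,
) -> list[dict]:
    """Pair each chunk with metadata extracted per chunk or once per document."""
    # Document-level mode: run the extractor once on the whole text.
    doc_metadata = None if chunk_level else extract(" ".join(chunks))
    records = []
    for chunk in chunks:
        metadata = dict(extract(chunk) if chunk_level else doc_metadata)
        if include_filename:
            metadata["filename"] = filename
        records.append({"text": chunk, "metadata": metadata})
    return records


def toy_extract(text: str) -> dict:
    """Stand-in extractor; a real one would follow the selected schema."""
    return {"n_words": len(text.split())}


if __name__ == "__main__":
    for record in attach_metadata("faq.txt", ["a b", "c"], toy_extract, chunk_level=True):
        print(record)
```

In chunk-level mode each record carries values computed from its own text, while in document-level mode every chunk of a file shares the same metadata.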


Sentence Splitter


Sentence Splitter splits a document into chunks while keeping sentences intact. This component has multiple parameters that are described below.

Select Files

The Select Files dropdown list enables a user to select the documents to which the splitter must be applied. A user can select multiple files in the dropdown list.

Chunk Size

Chunk Size enables a user to specify the maximum number of tokens to be present in a single chunk.

Chunk Overlap

Chunk Overlap enables a user to specify the number of tokens from the end of the previous chunk that are repeated at the start of the current chunk. This overlap keeps the text from being split abruptly, which would otherwise cause context loss.

Separator

Separator enables a user to specify the delimiter to use for splitting the document.
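
The idea of filling a chunk with whole sentences can be sketched as below. This is an illustration under stated assumptions, not MRAG's algorithm: sentences are detected by terminal punctuation followed by whitespace, tokens are approximated by whitespace-separated words, overlap is left out for brevity, and a sentence longer than the chunk size is kept whole.

```python
import re


def split_by_sentences(text: str, chunk_size: int) -> list[str]:
    """Pack whole sentences into chunks of at most chunk_size words."""
    # Assumption: a sentence ends with ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if this sentence would overflow the current one.
        if current and current_len + n_words > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    print(split_by_sentences("One two. Three four five. Six.", chunk_size=4))
```

Unlike a plain token window, a chunk boundary here always falls between two sentences.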

Include Filename

Include Filename enables a user to choose whether to include the filename in the chunk metadata. Including the filename in the chunk can improve retriever performance and LLM responses.

Metadata Schema

Metadata Schema dropdown list enables a user to choose the metadata schema to extract metadata from the document/chunk. Including metadata in a chunk enables self query. Only a single schema can be selected in a splitter component.

Chunk Level Metadata

Chunk Level Metadata enables a user to choose whether the metadata must be extracted from the document as a whole or from each chunk independently.


Regex Splitter


Regex Splitter splits a document into chunks based on the regular expressions provided by the user. This enables smart chunking, where a complete section of a document is kept in a single chunk. For example, a user can split the document into chunks based on the section number (1, 1.2, 1.2.2, etc.). This component has multiple parameters that are described below.

Select Files

The Select Files dropdown list enables a user to select the documents to which the splitter must be applied. A user can select multiple files in the dropdown list.

Regex

Regex enables a user to specify the regular expressions to use for splitting the document. A user can specify multiple regular expressions using ~ as the delimiter.
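
As an illustration of section-based splitting with several patterns joined by ~, consider the sketch below. It makes assumptions the source does not confirm: each match is treated as the start of a new chunk (so the section number stays with its text), patterns are applied in multiline mode, and the function name is hypothetical.

```python
import re


def split_by_regex(text: str, regex_field: str) -> list[str]:
    """Start a new chunk wherever any of the ~-delimited patterns matches."""
    patterns = [p for p in regex_field.split("~") if p]
    if not patterns:
        return [text.strip()] if text.strip() else []
    combined = re.compile("|".join(f"(?:{p})" for p in patterns), re.MULTILINE)
    # Chunk boundaries: the document start plus the start of every match.
    starts = sorted({0} | {m.start() for m in combined.finditer(text)})
    bounds = starts + [len(text)]
    chunks = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]


if __name__ == "__main__":
    document = "1 Intro\nWelcome.\n1.1 Scope\nDetails.\n2 Usage\nRun it."
    # Two patterns: top-level numbers ("1 ") and second-level numbers ("1.1 ").
    for chunk in split_by_regex(document, r"^\d+ ~^\d+\.\d+ "):
        print(repr(chunk))
```

Each numbered section of the sample document ends up in its own chunk, with the section heading as its first line.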

Include Filename

Include Filename enables a user to choose whether to include the filename in the chunk metadata. Including the filename in the chunk can improve retriever performance and LLM responses.

Metadata Schema

Metadata Schema dropdown list enables a user to choose the metadata schema to extract metadata from the document/chunk. Including metadata in a chunk enables self query. Only a single schema can be selected in a splitter component.

Chunk Level Metadata

Chunk Level Metadata enables a user to choose whether the metadata must be extracted from the document as a whole or from each chunk independently.


PDF Font Splitter


PDF Font Splitter splits a document into chunks based on a combination of font size, case (upper or lower), and font weight (bold or not). This enables smart chunking, where a complete section of a document is kept in a single chunk. For example, a user can split the document into chunks based on the font style of a section title (size 18, bold, and uppercase). This component has multiple parameters that are described below.

Select Files

The Select Files dropdown list enables a user to select the documents to which the splitter must be applied. A user can select multiple files in the dropdown list.

Font Size

Font Size enables a user to specify the font size where a document must be split.

Is Bold

Is Bold enables a user to choose the font weight where a document must be split.

Is Uppercase

Is Uppercase enables a user to choose the case where a document must be split.
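
The way Font Size, Is Bold, and Is Uppercase combine to identify a section title can be sketched on already-extracted text spans. This is a simplified illustration, not MRAG's implementation: the Span type and function names are hypothetical, the spans are assumed to come from some PDF parsing step that is not shown, and a span is treated as a title only when all three settings match.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A run of text with uniform styling (assumed output of a PDF parser)."""
    text: str
    font_size: float
    is_bold: bool


def is_title(span: Span, font_size: float, is_bold: bool, is_uppercase: bool) -> bool:
    """Return True when the span matches all three style settings."""
    return (
        abs(span.font_size - font_size) < 0.01
        and span.is_bold == is_bold
        and span.text.isupper() == is_uppercase
    )


def split_by_font(
    spans: list[Span], font_size: float, is_bold: bool, is_uppercase: bool
) -> list[str]:
    """Start a new chunk at every span that looks like a section title."""
    chunks, current = [], []
    for span in spans:
        if is_title(span, font_size, is_bold, is_uppercase) and current:
            chunks.append(" ".join(current))
            current = []
        current.append(span.text)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    spans = [
        Span("INTRODUCTION", 18, True),
        Span("Welcome to the guide.", 11, False),
        Span("SETUP", 18, True),
        Span("Install the tool.", 11, False),
    ]
    print(split_by_font(spans, font_size=18, is_bold=True, is_uppercase=True))
```

With the settings from the example in the text (size 18, bold, uppercase), each title and the body text that follows it form one chunk.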

Include Filename

Include Filename enables a user to choose whether to include the filename in the chunk metadata. Including the filename in the chunk can improve retriever performance and LLM responses.

Metadata Schema

Metadata Schema dropdown list enables a user to choose the metadata schema to extract metadata from the document/chunk. Including metadata in a chunk enables self query. Only a single schema can be selected in a splitter component.

Chunk Level Metadata

Chunk Level Metadata enables a user to choose whether the metadata must be extracted from the document as a whole or from each chunk independently.


Dummy Splitter


Dummy Splitter, as the name suggests, acts as a placeholder for cases where the whole document must be treated as a single chunk. It is useful when the document is very small or when it is difficult to decide on a splitting strategy. This component has multiple parameters that are described below.

Select Files

The Select Files dropdown list enables a user to select the documents to which the splitter must be applied. A user can select multiple files in the dropdown list.

Include Filename

Include Filename enables a user to choose whether to include the filename in the chunk metadata. Including the filename in the chunk can improve retriever performance and LLM responses.

Metadata Schema

Metadata Schema dropdown list enables a user to choose the metadata schema to extract metadata from the document/chunk. Including metadata in a chunk enables self query. Only a single schema can be selected in a splitter component.

Chunk Level Metadata

Chunk Level Metadata enables a user to choose whether the metadata must be extracted from the document as a whole or from each chunk independently.

Context Enrichment

Context Enrichment adds additional metadata to the chunks, which improves retriever performance. MRAG supports the following context enrichment techniques.

HyPE


HyPE (Hypothetical Prompt Embedding) is a context enrichment technique that uses an LLM to generate queries that the chunk can answer. These queries are then used to generate embeddings. HyPE improves retriever performance because, during retrieval, a user’s query is compared to the generated queries, which is a like-for-like comparison, rather than to a large chunk of text. This component has multiple parameters that are described below.

N Questions

Number of queries to be generated per chunk.

Include Chunk Text

When a chunk is large, HyPE might not generate every question the chunk can answer. To prevent this loss of information, the whole chunk text can also be embedded and stored in the vector store alongside the query embeddings.

Embed Per Question

If True, each generated query is embedded separately and the chunk text is added as metadata to each query. For example, with 5 queries, 5 embeddings are stored in the vector store and the metadata is added 5 times (once for each query). If False, the generated queries are embedded together and the chunk text is added as metadata once. For example, with 5 queries, a single embedding covering all the queries is stored in the vector store and the metadata is added only once.
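
The effect of Embed Per Question and Include Chunk Text on what gets stored can be sketched by building the records that would be sent to the embedding model. This is an assumption-laden illustration: the function name and record layout are hypothetical, the queries are taken as already generated by an LLM, and embedding the queries "together" is interpreted here as joining them into one text.

```python
def build_hype_records(
    chunk_text: str,
    questions: list[str],
    embed_per_question: bool,
    include_chunk_text: bool,
) -> list[dict]:
    """Return one record per embedding, each carrying the chunk text as metadata."""
    metadata = {"chunk_text": chunk_text}
    if embed_per_question:
        # One embedding per generated query.
        records = [{"embed_text": q, "metadata": dict(metadata)} for q in questions]
    elif questions:
        # Assumption: all queries are joined and embedded as a single text.
        records = [{"embed_text": "\n".join(questions), "metadata": dict(metadata)}]
    else:
        records = []
    if include_chunk_text:
        # Also embed the chunk itself so that no information is lost.
        records.append({"embed_text": chunk_text, "metadata": dict(metadata)})
    return records


if __name__ == "__main__":
    questions = [f"Question {i}?" for i in range(1, 6)]
    per_question = build_hype_records("chunk body", questions, True, False)
    combined = build_hype_records("chunk body", questions, False, True)
    print(len(per_question), len(combined))
```

With 5 queries, the first call yields 5 records and the second yields 2 (one for the joined queries and one for the chunk text).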

Vector Index

A vector index is a data store that holds the document chunks along with their vector embeddings and metadata. When a user submits a query, its vector embedding is computed and the top-k most similar chunks are retrieved from the vector index. These chunks are added to the query as context and passed to an LLM to generate the response.
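
The top-k retrieval step described above can be sketched with a brute-force search over a small in-memory index. This is a conceptual illustration only: it assumes cosine similarity as the metric (the source does not name one), uses hand-written toy vectors instead of a real embedding model, and does not reflect how any particular vector store is implemented.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: list[dict], k: int) -> list[dict]:
    """Return the k entries whose embeddings are most similar to the query."""
    ranked = sorted(
        index, key=lambda item: cosine(query_vec, item["embedding"]), reverse=True
    )
    return ranked[:k]


if __name__ == "__main__":
    index = [
        {"text": "chunk a", "embedding": [1.0, 0.0]},
        {"text": "chunk b", "embedding": [0.0, 1.0]},
        {"text": "chunk c", "embedding": [0.9, 0.1]},
    ]
    for hit in top_k([1.0, 0.0], index, k=2):
        print(hit["text"])
```

The retrieved chunk texts would then be placed in the prompt as context for the LLM.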

Vector Index

MRAG provides a Vector Index component to build an indexing pipeline. This component has multiple parameters that are described below.

Embedding Model

The embedding model used to compute the vector embeddings of the chunks.

Batch Size

The number of chunks processed in each batch when computing vector embeddings.
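
Batching simply means that chunks are embedded in groups rather than one at a time. The sketch below shows the idea; the function names are hypothetical and embed_batch stands in for whatever call the selected embedding model exposes, which the source does not describe.

```python
from typing import Callable, Iterator


def batched(chunks: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most batch_size chunks."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(chunks), batch_size):
        yield chunks[start:start + batch_size]


def embed_all(
    chunks: list[str],
    batch_size: int,
    embed_batch: Callable[[list[str]], list[list[float]]],
) -> list[list[float]]:
    """Embed all chunks, calling the model once per batch."""
    vectors: list[list[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def toy_embed_batch(texts: list[str]) -> list[list[float]]:
    """Stand-in for a real embedding model: one number per text."""
    return [[float(len(t))] for t in texts]


if __name__ == "__main__":
    print(list(batched(["a", "b", "c", "d", "e"], batch_size=2)))
    print(embed_all(["a", "bb", "ccc"], batch_size=2, embed_batch=toy_embed_batch))
```

A larger batch size means fewer calls to the model, at the cost of more data per call.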

Vector Store

The vector store provider, such as ChromaDB or Pinecone.

Executing the pipeline


Execute Pipeline

After you drag and drop the components onto the canvas and build the indexing pipeline, click the Execute Pipeline button in the left sidebar to execute the pipeline. The pipeline is executed as a background job. Once the job begins, you are automatically redirected to the Indexing Pipelines screen. There you can view the list of your indexing pipelines and track the execution status of a pipeline by clicking the Refresh button.


Pipeline List