
Mastering RAG Vectorization

Updated: Mar 14



Vectorized Chunks for RAG
Chunks are all that remains of a document once it has been vectorized for a RAG pipeline.

How can you influence vectorization in a RAG pipeline?

While you can't directly alter the internals of a commercial LLM, you have significant control over how your content is vectorized for retrieval: you decide what gets vectorized, how it is embedded, and which vectors are used at query time. Here are the key levers in Retrieval-Augmented Generation (RAG):

1. Text Preprocessing:

Segmentation/Chunking:

The way you divide your text into chunks has a huge impact. You can experiment with different chunking strategies:

  • Fixed Size: Chunks with a fixed number of words or characters. Simple, but can ignore semantic boundaries.

  • Sentence-Based Segmentation: Splitting by sentences. Often a good compromise.

  • Paragraph-Based Segmentation:  Preserves context better, but can lead to very long chunks.

  • Semantic Segmentation: More sophisticated. Tries to identify thematically coherent units (e.g., using topic modeling or by detecting headings/section breaks).

  • Recursive Chunking: First coarse-grained, then finer subdivision (e.g., first paragraphs, then sentences). Helps capture hierarchical relationships.

  • Overlapping Chunks (Sliding Window):  Chunks overlap by a certain number of tokens. Can help minimize context loss at chunk boundaries.

  • Chunk Size: Small chunks are more precise for specific questions; larger chunks provide more context. You need to find a balance that suits your application.

  • Libraries: Libraries like LangChain or LlamaIndex offer pre-built chunking methods and often allow you to define your own strategies; a minimal, dependency-free sketch follows below.

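To make the fixed-size and sliding-window ideas concrete, here is a minimal sketch in plain Python; the chunk size and overlap are illustrative values, and libraries like LangChain or LlamaIndex provide more sophisticated splitters out of the box.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():                      # skip whitespace-only chunks
            chunks.append(chunk)
    return chunks
```
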
Cleaning/Normalization:

  • Removal of Irrelevant Content: Remove HTML tags, JavaScript, excessive whitespace, and special characters that don't have semantic value.

  • Stemming/Lemmatization: Reduce words to their root form (e.g., "running" -> "run"). Can improve vectorization, but caution: It can also lead to information loss. Lemmatization is often better for domain-specific vocabulary.

  • Stop Word Removal: Remove common words ("the", "a", "is") that carry little meaning. Caution: Sometimes, stop words can be important in certain contexts (e.g., "to be or not to be"). Consider whether to use a standard list or a custom one for your domain.

  • Case Folding: Convert to lowercase (or uppercase). Standardizes the representation.

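A hedged sketch of such a cleaning step is shown below; the regex-based HTML stripping is deliberately naive (a real pipeline would typically use an HTML parser), and stemming and stop word removal are left out because they are optional and domain-dependent.

```python
import html
import re

def clean_chunk(text: str, lowercase: bool = True) -> str:
    """Basic cleaning/normalization before embedding (illustrative defaults)."""
    text = html.unescape(text)                                           # decode entities like &amp;
    text = re.sub(r"<script.*?</script>", " ", text, flags=re.S | re.I)  # drop inline JavaScript
    text = re.sub(r"<[^>]+>", " ", text)                                 # strip remaining HTML tags
    text = re.sub(r"\s+", " ", text).strip()                             # collapse excessive whitespace
    if lowercase:
        text = text.lower()                                              # case folding
    return text
```
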
Enrichment (Augmentation):

    • Metadata: Add additional information to the chunks before they are vectorized (a sketch of a metadata-enriched chunk record follows at the end of this subsection):

      • Source: URL, filename, document title.

      • Author: Name(s) of the document's author(s).

      • Date: Creation or modification date.

      • Topic Tags: Manually or automatically assigned keywords.

      • Entities: Recognized proper nouns (people, places, organizations) – use Named Entity Recognition (NER).

      • Summaries: Short summaries of the chunk (manually or automatically created). Vectorizing the summary can be done in addition to vectorizing the entire chunk.

      • Hierarchical Information: If your text has a structure (e.g., chapter -> section -> subsection), store this information as metadata.

      • Relationships to Other Chunks: If there are logical connections between chunks (e.g., "previous section," "next section," "see also"), store these.

    • Why Metadata?

      • Filtering: You can restrict retrieval to specific sources, authors, time periods, etc.

      • Boosting: Certain metadata fields (e.g., title) can be given more weight in the similarity search.

      • Context for the LLM: The LLM can use the metadata to formulate and justify the answer better (e.g., "According to article X from May 12th...").

    • Hypothetical Document Embeddings (HyDE): Instead of searching with the embedding of the raw question, first have the LLM generate a hypothetical answer to it. Vectorize this hypothetical answer and use it to search the knowledge base. The idea is that the hypothetical answer is semantically closer to relevant documents than the question itself; a hedged sketch follows below.

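The HyDE idea can be sketched as follows; `generate_answer` and `embed` are placeholders for whatever LLM and embedding model you use, not a specific library's API.

```python
def hyde_search(question: str, generate_answer, embed, index, k: int = 5):
    """Search with the embedding of a hypothetical answer instead of the raw question."""
    hypothetical = generate_answer(
        f"Write a short passage that answers the following question:\n{question}"
    )
    query_vector = embed(hypothetical)     # embed the hypothetical answer
    return index.search(query_vector, k)   # retrieve the k nearest real chunks
```
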
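Here is also a small sketch of what a metadata-enriched chunk record might look like before indexing; the field names and example values are purely illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """A chunk plus the metadata stored alongside its vector (fields are examples)."""
    text: str
    source: str                                             # URL or filename
    author: str | None = None
    date: str | None = None                                 # creation or modification date
    topic_tags: list[str] = field(default_factory=list)
    section_path: list[str] = field(default_factory=list)   # e.g. chapter -> section path

chunk = Chunk(
    text="Revenue grew by 12% in the second quarter...",
    source="https://example.com/quarterly-report.pdf",
    author="Jane Doe",
    date="2023-05-12",
    topic_tags=["finance", "quarterly report"],
    section_path=["Results", "Revenue"],
)
```
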
Choice of Embedding Method (Indirect Influence):

    • You can (and should) try different embedding models. Each model has strengths and weaknesses:

      • Word Embeddings (Word2Vec, GloVe, FastText): Good for word-level semantics, but don't fully consider the context of a word in a sentence.

      • Sentence Embeddings (Sentence-BERT, Universal Sentence Encoder): Create vectors for entire sentences/sections. Better suited for RAG because they consider context.

      • Transformer-based Embeddings (BERT, RoBERTa, etc.): State-of-the-art. Can be fine-tuned to improve performance for a specific domain or task.

    • Training Data of the Model: Models trained on different data will produce different embeddings. A model trained on scientific texts is likely better suited for scientific questions than a model trained on news articles.

    • Fine-tuning: For maximum precision, you can fine-tune a pre-trained embedding model on your data. For this, you need a dataset of question-answer pairs or question-context pairs. Fine-tuning teaches the model to give more weight to the aspects relevant to your domain.

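As an example of generating chunk embeddings, the sketch below uses the sentence-transformers library; the model name is just a common general-purpose choice, not a recommendation for every domain.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")      # illustrative general-purpose model
chunk_texts = ["First chunk of text...", "Second chunk of text..."]
embeddings = model.encode(chunk_texts, normalize_embeddings=True)  # one vector per chunk
print(embeddings.shape)                              # (number of chunks, embedding dimension)
```
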
2. Indexing and Retrieval:

  • Vector Database: The choice of vector database influences how efficiently and flexibly you can search:

    • FAISS (Facebook AI Similarity Search): A library for very efficient similarity search over large datasets (see the sketch at the end of this section).

    • Annoy (Approximate Nearest Neighbors Oh Yeah): Also fast and lightweight; builds a forest of random projection trees for approximate nearest-neighbor search.

    • Pinecone, Weaviate, Qdrant, Chroma: Dedicated vector databases (available as managed services and/or self-hosted) that offer additional features like metadata storage, filtering, and scalability.

  • Search Algorithm:

    • k-Nearest Neighbors (kNN): Find the k vectors closest to the query vector.

    • Approximate Nearest Neighbors (ANN): Faster than exact kNN, at the cost of a small loss in accuracy. Often the better choice for large datasets.

    • Radius Search: Find all vectors within a certain radius of the query vector.

  • Similarity Metric:

    • Cosine Similarity: Measures the angle between two vectors. Standard measure for text embeddings.

    • Euclidean Distance: Measures the direct distance between two vectors.

    • Inner Product (Dot Product): Slightly cheaper to compute than cosine similarity; for L2-normalized vectors the two are equivalent.

  • Filtering: Use metadata to restrict the search (e.g., "only documents by author X," "only documents from 2023").

  • Re-Ranking:

    • By Similarity: The vector database provides a list of results, sorted by similarity.

    • Additional Criteria: You can re-rank this list based on:

      • Metadata: Prefer more recent documents, documents with higher authority, etc.

      • Diversity: Ensure that the results are not too similar to each other, e.g., using Maximum Marginal Relevance (MMR); a small sketch follows at the end of this section.

      • LLM-based Re-Ranking: Have an LLM evaluate and re-rank the results based on relevance to the query. This is more computationally expensive but can significantly improve quality.

  • Hybrid Retrieval: Combine vector search with traditional search methods (e.g., keyword search) to leverage the advantages of both approaches.

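To make the vector-database and similarity-metric points concrete, here is a hedged sketch of exact kNN search with FAISS; the dimensions and random data are placeholders for real chunk embeddings, and normalizing the vectors makes the inner product equal to cosine similarity.

```python
import faiss
import numpy as np

# Placeholder embeddings; in practice these come from your embedding model.
embeddings = np.random.rand(1000, 384).astype("float32")
faiss.normalize_L2(embeddings)                  # normalize so inner product == cosine similarity

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product (kNN) index
index.add(embeddings)

query = np.random.rand(1, 384).astype("float32")  # placeholder query embedding
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)              # top-5 most similar chunks
```
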
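The diversity-based re-ranking mentioned above can be sketched as a simplified Maximum Marginal Relevance (MMR) function; it assumes L2-normalized embeddings, and the lambda weight is an illustrative default.

```python
import numpy as np

def mmr(query_vec, candidate_vecs, candidate_ids, k=5, lambda_weight=0.7):
    """Maximum Marginal Relevance: trade off relevance to the query against diversity."""
    relevance = candidate_vecs @ query_vec            # cosine similarity (vectors are normalized)
    selected, remaining = [], list(range(len(candidate_ids)))
    while remaining and len(selected) < k:
        if not selected:
            best = max(remaining, key=lambda i: relevance[i])
        else:
            chosen = candidate_vecs[selected]
            def score(i):
                redundancy = np.max(chosen @ candidate_vecs[i])  # similarity to already-chosen results
                return lambda_weight * relevance[i] - (1 - lambda_weight) * redundancy
            best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [candidate_ids[i] for i in selected]
```
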
3. Query Transformation:

  • Reformulation: The LLM can reformulate the original question to increase the chances of a good retrieval:

    • Synonyms: Replace words with synonyms.

    • Abstraction/Concretization: Make the question more general or more specific.

    • Decomposition: Break down a complex question into multiple sub-questions.

    • Adding Context: Supplement the question with information known to the LLM from the previous conversation history.

  • Query Expansion: Expand the query with related terms (e.g., from a thesaurus or by querying a knowledge graph).

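A hedged sketch of LLM-based query reformulation follows; `llm` is a placeholder for whatever chat or completion client you use, and the prompt wording is illustrative.

```python
def transform_query(question: str, llm) -> list[str]:
    """Ask the LLM to rewrite a question into several retrieval-friendly variants."""
    prompt = (
        "Rewrite the following question as three alternative search queries. "
        "Use synonyms, make one version more general and one more specific, "
        "and return one query per line.\n\n"
        f"Question: {question}"
    )
    response = llm(prompt)   # the model's raw text output
    return [line.strip() for line in response.splitlines() if line.strip()]

# Each variant is then embedded and searched separately, and the result lists are merged.
```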

Summary: Influencing Factors

| Factor | Influence | Examples |
| --- | --- | --- |
| Chunking Strategy | Determines how the text is divided into sections. | Fixed size, sentence-based, paragraph-based, semantic, recursive, overlapping. |
| Chunk Size | Affects the granularity of the information units. | Small chunks for precision, large chunks for context. |
| Preprocessing | Cleans and normalizes the text. | Stop word removal, stemming/lemmatization, case folding, removal of irrelevant content. |
| Metadata | Adds context and allows for filtering and boosting. | Source, author, date, topic tags, entities, summaries, hierarchical information, relationships to other chunks. |
| Embedding Model | Determines how text is converted into vectors. | Word embeddings, sentence embeddings, transformer-based embeddings, fine-tuned models. |
| Vector Database | Affects search efficiency and flexibility. | FAISS, Annoy, Pinecone, Weaviate, Qdrant, Chroma. |
| Search Algorithm | Determines how similar vectors are found. | k-Nearest Neighbors (kNN), Approximate Nearest Neighbors (ANN), radius search. |
| Similarity Metric | Measures the similarity between vectors. | Cosine similarity, Euclidean distance, inner product. |
| Filtering | Restricts the search based on metadata. | Filter by author, date, source, etc. |
| Re-Ranking | Reorders the results based on additional criteria. | Metadata-based re-ranking, diversity-based re-ranking, LLM-based re-ranking. |
| Query Transformation | Modifies the query to improve retrieval. | Reformulation, query expansion. |
| Hybrid Retrieval | Combines vector search with traditional search methods. | Keyword search + vector search. |

