Enhancing LLM Performance with Vector Search and Vector Databases

Machine Mind


This series of tutorials is dedicated to exploring Large Language Models (LLMs) and their real-life applications across various use cases. If you’ve missed any previous posts, you can catch up on them here (links attached):

  1. Gentle Introduction to Large Language Models
  2. Semantic Search and RAG with Large Language Models
  3. Open-Sourced and Closed-Sourced Large Language Models
  4. Comprehensive Guide on Prompt Engineering
  5. Enhancing LLM Performance with Vector Search and Vector Databases

Don’t forget to subscribe to receive practical use cases from the world of NLP.

For those eager to delve deeper into the world of LLMs and Natural Language Processing (NLP), feel free to join the LLM Zoomcamp course by DataTalks Club. It’s completely free.

This Zoomcamp offers a wealth of information on LLMs, their practical applications, and, most importantly, connects you with a community of open-minded professionals. Whether you need professional advice or career guidance, you’ll find support from experienced Data Engineers, Machine Learning Engineers, and Data Scientists.

Join the community to share knowledge and grow together.

Recap: Vector Basics and Vector Search

Understanding language is a complex task for computers. Traditional NLP methods had their limitations, but deep learning and large text corpora have enabled us to encode language into high-dimensional vectors, enhancing language modeling.

Computers process numbers, so we need to convert text into numerical form. This is where vectors, or embeddings, come into play. Vectors represent text as sets of values in a multi-dimensional space.

Think of a vector as a point in space. For example, 3D vectors can be visualized on a graph with X, Y, and Z axes. In NLP, vectors have many more dimensions to capture the nuances of language. Similar meanings are clustered together, while different meanings are spread apart.

Vector representation in 3D space

Consider a list of movies and their genres, encoded into multi-dimensional vectors based on their genres:

  • “Interstellar” might be represented as the vector [0.8, 0.1, 0.1] for Action, Sci-Fi, and Drama.
  • “Limitless” could be [0.1, 0.9, 0.0] for Drama, Crime, and Action.
  • “Inside Out” might be [0.1, 0.1, 0.8] for Animation, Comedy, and Family.

Here, each dimension represents an encoded list of genres. Movies with similar genres will have vectors that are close to each other in this multi-dimensional space. When you search for movies similar to “Interstellar,” the system looks for vectors near [0.8, 0.1, 0.1]. This vector representation captures the essence of the movies and helps in finding similar items effectively (we are performing search).
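As a minimal sketch of how such a search works, the snippet below compares the toy genre vectors above against the “Interstellar” vector using cosine similarity (the titles and numbers are just the made-up examples from this section):

import numpy as np

movies = {
    "Interstellar": np.array([0.8, 0.1, 0.1]),
    "Limitless":    np.array([0.1, 0.9, 0.0]),
    "Inside Out":   np.array([0.1, 0.1, 0.8]),
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction, 0.0 means unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = movies["Interstellar"]
for title, vector in movies.items():
    print(title, round(cosine_similarity(query, vector), 3))

The movie whose vector scores highest (other than “Interstellar” itself) is the closest match.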

By transforming text into these numerical vectors, we enable machine learning models to process and understand language more effectively.

But here one important question arises: “Hmm, OK, I got it! But wait… how many dimensions are enough to represent the vectors/embeddings?”

Choosing the right number of dimensions for your vectors is crucial for effectively representing text.

One-Hot Encoding

One simple way to represent words as vectors is one-hot encoding. Each word is represented by a vector as long as your vocabulary (if there are 1,000 unique words, each one-hot vector will have length 1,000), with a single 1 in the position corresponding to the word and 0s elsewhere. For example:

  • “apple” = [1, 0, 0]
  • “banana” = [0, 1, 0]
  • “cherry” = [0, 0, 1]
from sklearn.preprocessing import OneHotEncoder
import numpy as np

words = ["apple", "banana", "cherry"]
data = np.array(words).reshape(-1, 1)

encoder = OneHotEncoder()
one_hot_encoded = encoder.fit_transform(data)

print("One-Hot Encoded Vectors:")
print(one_hot_encoded)
One-Hot Encoded Vectors:
<Compressed Sparse Row sparse matrix of dtype 'float64' with 3 stored elements and shape (3, 3)>
Coords Values
(0, 0) 1.0
(1, 1) 1.0
(2, 2) 1.0

While easy to understand, one-hot encoding has significant limitations. It doesn’t capture any relationship between words, and it is very expensive in terms of memory usage. For instance, “apple” and “banana” are both fruits, but one-hot encoding cannot reflect this similarity.

Dense Vectors

To capture the nuances and relationships between words, we use dense vectors. Unlike one-hot encoding, dense vectors have fewer dimensions and each dimension is a continuous value rather than just 0 or 1. These vectors are generated through training on large datasets, and they encode semantic information, meaning they capture the meaning and context of words. For example:

  • “apple” = [0.9, 0.1, 0.1]
  • “banana” = [0.8, 0.1, 0.6]
  • “cherry” = [0.1, 0.8, 0.2]

Here, each dimension doesn’t directly correspond to a human-understandable feature. Instead, the values are learned representations that capture complex relationships. Words with similar meanings or contexts will have similar vectors.

Dense vectors, also known as embeddings, allow us to perform operations that were impossible with one-hot encoding. For example, we can calculate the similarity between words by measuring the distance between their vectors. Similar words will have vectors that are close together in the multi-dimensional space.

For instance, in the vectors above, the distance between “apple” and “banana” is smaller than the distance between “apple” and “cherry,” reflecting their similarity as fruits.
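A quick sketch of that comparison, using the made-up three-dimensional vectors above (with real embeddings the same idea applies, just in hundreds of dimensions):

import numpy as np

embeddings = {
    "apple":  np.array([0.9, 0.1, 0.1]),
    "banana": np.array([0.8, 0.1, 0.6]),
    "cherry": np.array([0.1, 0.8, 0.2]),
}

# Smaller Euclidean distance means more similar words
print(np.linalg.norm(embeddings["apple"] - embeddings["banana"]))  # ~0.51
print(np.linalg.norm(embeddings["apple"] - embeddings["cherry"]))  # ~1.07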

For dense representations (embeddings), we typically use pre-trained models like Word2Vec, GloVe, or any other pre-trained embedding model from Hugging Face, as you will see in another part of the tutorial.

For the purposes of this example, you need to install the library called gensim:

$ pip install numpy scikit-learn gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

sentences = [
["apple", "banana", "cherry"],
["apple", "fruit", "healthy"],
["banana", "yellow", "fruit"],
["cherry", "red", "fruit"]
]

model = Word2Vec(sentences, vector_size=3, window=5, min_count=1, workers=4)

apple_vector = model.wv['apple']
banana_vector = model.wv['banana']
cherry_vector = model.wv['cherry']

print("Dense Vectors:")
print("apple:", apple_vector)
print("banana:", banana_vector)
print("cherry:", cherry_vector)
Dense Vectors:
apple : [-0.12544572 0.24601682 -0.05111571]
banana: [ 0.21529575 0.2990996 -0.16718094]
cherry: [ 0.3003091 -0.31009832 -0.23722696]
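Once the model is trained, gensim lets you query word similarities directly. For example (the exact numbers will vary between runs, since this toy model is trained on only four tiny sentences):

# Cosine similarity between two words in the trained model
print(model.wv.similarity("apple", "banana"))

# Most similar words to "apple" within the toy vocabulary
print(model.wv.most_similar("apple", topn=2))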

How many dimensions are enough to represent vectors effectively? Typically, vectors have hundreds or even thousands of dimensions. The exact number depends on the complexity of the language and the task. More dimensions allow the model to capture more nuanced information but also require more computational resources.

For practical purposes, techniques like Principal Component Analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (t-SNE) are used to reduce the dimensionality of vectors while retaining as much meaningful information as possible. These techniques help in visualizing high-dimensional data and improving model efficiency.
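As a rough sketch of how dimensionality reduction looks in code, here PCA from scikit-learn projects a batch of stand-in 128-dimensional vectors down to two dimensions (with real embeddings you would pass your actual vectors instead of random data):

import numpy as np
from sklearn.decomposition import PCA

# Stand-in for real embeddings: 100 vectors, 128 dimensions each
high_dim_vectors = np.random.random((100, 128))

pca = PCA(n_components=2)
reduced = pca.fit_transform(high_dim_vectors)

print(reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)  # variance retained by the two components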

Dense vector search is powerful for capturing semantic similarity, but it has limitations when it comes to lexical similarity.

Consider a document corpus of legal texts, like the United States Code (U.S.C.), which contains specific titles and section numbers. If a user searches for a specific code section, such as “18 U.S.C. § 1341” (which refers to mail fraud), they expect to find exact matches. Dense vector encoding can fail here because it focuses on semantic similarity, potentially retrieving unrelated sections that are close in the vector space but irrelevant to the search intent.

Hybrid Search

For highly specific terms in specialized domains, such as legal citations or medical terms, keyword search often yields better results. Therefore, combining dense vector search with keyword search — a technique known as hybrid search — provides a more effective solution.

This combination is widely used in practice because it yields more accurate and semantically relevant results.

In hybrid search, dense vector search retrieves the top k documents with semantic scores, while keyword search using TF/IDF retrieves the top k documents with lexical scores. These scores are then combined using strategies like:

  1. Interleaving: Alternating between the top results of each method (see the sketch after this list).
  2. Score Normalization: Normalizing TF/IDF and dense vector scores between 0 and 1 before combining.
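Here is a minimal sketch of the interleaving strategy, assuming each method returns a ranked list of document IDs (the helper below is purely illustrative and not part of any library):

def interleave(keyword_ranked, dense_ranked, k=5):
    # Alternate between the two ranked lists, skipping duplicates,
    # until we have k results.
    combined = []
    for kw_doc, dense_doc in zip(keyword_ranked, dense_ranked):
        for doc in (kw_doc, dense_doc):
            if doc not in combined:
                combined.append(doc)
    return combined[:k]

# Hypothetical ranked document IDs from each method
print(interleave(["d1", "d4", "d2"], ["d3", "d1", "d5"]))
# ['d1', 'd3', 'd4', 'd2', 'd5']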

You can filter by attributes like legal jurisdiction or document type to narrow down results, ensuring that searches for specific codes like “18 U.S.C. § 1341” yield highly relevant matches.

Here is a practical example of the hybrid search approach on legal docs (you can try it with your own corpora; all the code is reproducible):

Install the following libraries, as we will need them:

$ pip install numpy scikit-learn faiss-cpu

faiss: A library for efficient similarity search and clustering of dense vectors. It performs very fast and effective search without consuming a lot of your machine’s memory.

Keyword Search:

import numpy as np
import faiss
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


legal_docs = [
{"title": "18 U.S.C. § 1341",
"text": "Frauds and swindles involving mail."},

{"title": "26 U.S.C. § 7201",
"text": "Attempt to evade or defeat tax."},

{"title": "42 U.S.C. § 1983",
"text": "Civil action for deprivation of rights."},

{"title": "18 U.S.C. § 1029",
"text": "Fraud and related activity in connection with access devices."},
]

# Extract titles and texts
titles = [doc["title"] for doc in legal_docs]
texts = [doc["text"] for doc in legal_docs]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

TfidfVectorizer is initialized and fitted to the texts, converting them into a TF-IDF matrix. This matrix represents the importance of terms in the documents relative to the entire corpus.

def keyword_search(query, vectorizer, tfidf_matrix):
    # Transform the query into a TF-IDF vector
    query_vec = vectorizer.transform([query])
    # Cosine similarity between the query vector and the TF-IDF matrix
    scores = cosine_similarity(query_vec, tfidf_matrix).flatten()
    return scores

# Example keyword search
query = "mail fraud"
keyword_scores = keyword_search(query, vectorizer, tfidf_matrix)
print("Keyword Scores:", keyword_scores)
Keyword Scores: [0.32891916, 0., 0., 0.24081928]

When the code is run, it prints the similarity scores for the query “mail fraud” against the sample legal documents. These scores represent the cosine similarity between the query and each document’s TF-IDF vector. Here’s what each score indicates:

First Document (“18 U.S.C. § 1341”):

  • Score: 0.32891916
  • The document text “Frauds and swindles involving mail.” is the most similar to the query “mail fraud”, with a moderate similarity score.

Second Document (“26 U.S.C. § 7201”):

  • Score: 0.0
  • The document text “Attempt to evade or defeat tax.” has no similarity with the query, hence the score is 0.

Third Document (“42 U.S.C. § 1983”):

  • Score: 0.0
  • The document text “Civil action for deprivation of rights.” also has no similarity with the query.

Fourth Document (“18 U.S.C. § 1029”):

  • Score: 0.24081928
  • The document text “Fraud and related activity in connection with access devices.” has some similarity with the query, as it contains the word “Fraud”.

In a practical application, where embeddings are meaningful and represent the semantic content of documents, the similarity scores would reflect how closely each document matches the query, allowing for more effective and relevant search results. For now, let’s imagine that we have some sort of pre-trained embeddings for legal documents that we can use for our task:

np.random.seed(42)
embeddings = np.random.random((len(texts), 128)).astype('float32')

# Initialize FAISS index for fast search across embeddings
index = faiss.IndexFlatL2(128) # the same as embedding dim
index.add(embeddings)

# Dense vector search functionality
def dense_vector_search(query_embedding, index):
    D, I = index.search(query_embedding, k=len(texts))
    return D.flatten(), I.flatten()

query_embedding = np.random.random((1, 128)).astype('float32')

dense_distances, dense_indices = dense_vector_search(query_embedding, index)
dense_scores = 1 / (1 + dense_distances) # Convert distances to similarity scores
print("Dense Vector Scores:", dense_scores)
Dense Vector Scores: [0.05064009 0.05011931 0.04504744 0.03674265]

Given that the embeddings and the query embedding are randomly generated, the similarity scores will likely be very close to each other. This is because random vectors in high-dimensional spaces tend to be roughly equidistant from each other. In a real-world scenario, these embeddings would be generated by a model like BERT or another transformer model, capturing the semantic meaning of the documents, leading to more meaningful similarity scores.

Detailed Breakdown of the Output:

  • The output array [0.05064009, 0.05011931, 0.04504744, 0.03674265] represents the similarity scores of the four documents in the dataset to the randomly generated query embedding.
  • These scores are calculated based on the Euclidean distances between the query embedding and each document embedding, converted to similarity scores using 1 / (1 + distance).
  • The similarity scores are close to each other because the embeddings are random and not semantically meaningful.
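In a real-world setting, the random vectors above would be replaced with embeddings from an actual model. Here is a hedged sketch of what that could look like with sentence-transformers, assuming the package is installed (the model name is just one common choice, and texts is the list of document texts defined earlier):

import faiss
from sentence_transformers import SentenceTransformer

# Encode the legal texts with a real embedding model instead of random vectors
st_model = SentenceTransformer("all-MiniLM-L6-v2")   # produces 384-dimensional embeddings
real_embeddings = st_model.encode(texts).astype("float32")

real_index = faiss.IndexFlatL2(real_embeddings.shape[1])  # dim must match the model output
real_index.add(real_embeddings)

real_query = st_model.encode(["mail fraud"]).astype("float32")
D, I = real_index.search(real_query, len(texts))
print(I)  # indices of the most semantically similar documents first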

Usually we normalize the scores before combining them. The normalization function scales the scores to a range between 0 and 1. This is essential because the scores from TF-IDF and dense vector searches may have different ranges and scales. By normalizing them, we can combine them meaningfully.

def normalize_scores(scores):
    min_score, max_score = np.min(scores), np.max(scores)
    return (scores - min_score) / (max_score - min_score)

normalized_keyword_scores = normalize_scores(keyword_scores)

# FAISS returned the dense scores sorted by distance, so map them back to
# the original document order (via dense_indices) before combining
dense_scores_by_doc = np.empty_like(dense_scores)
dense_scores_by_doc[dense_indices] = dense_scores
normalized_dense_scores = normalize_scores(dense_scores_by_doc)


# The normalized scores from both search methods are combined.
# This simple addition assumes equal importance for both types of scores
combined_scores = normalized_keyword_scores + normalized_dense_scores

# The combined scores are sorted in descending order to get the indices
# of the top documents. The documents are then selected
# based on these indices.
top_indices = np.argsort(combined_scores)[::-1]
top_docs = [legal_docs[i] for i in top_indices]

# the top documents are printed in order of their rank
for i, doc in enumerate(top_docs):
    print(f"Rank {i+1}:")
    print(f"Title: {doc['title']}")
    print(f"Text: {doc['text']}\n")

Rank 1:
Title: 18 U.S.C. § 1341
Text: Frauds and swindles involving mail.

Rank 2:
Title: 26 U.S.C. § 7201
Text: Attempt to evade or defeat tax.

Rank 3:
Title: 18 U.S.C. § 1029
Text: Fraud and related activity in connection with access devices.

Rank 4:
Title: 42 U.S.C. § 1983
Text: Civil action for deprivation of rights.

This approach ensures that both keyword relevance and semantic relevance are considered in retrieving the most appropriate documents, enhancing the search effectiveness for specialized queries like legal codes.
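The equal-weight addition above is only one option. A common variant is a weighted sum, sketched below, where the weight is a hypothetical value you would tune on your own data:

alpha = 0.6  # hypothetical weight: 0.6 for keyword relevance, 0.4 for semantic relevance
weighted_scores = alpha * normalized_keyword_scores + (1 - alpha) * normalized_dense_scores

top_indices = np.argsort(weighted_scores)[::-1]
print([legal_docs[i]["title"] for i in top_indices])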

So, as you saw, by transforming text into these numerical vectors, we enable machine learning models to process and understand language more effectively and to solve much more complex tasks.

Hands-On with Vector Search and Vector Databases

Now that you know the details about vectors and vector search, let’s dig deeper into the practical part, where we can apply vector search concepts and vector databases.

Vector databases are specialized systems designed to store, manage, and retrieve high-dimensional vector representations of data. Unlike traditional databases that handle structured data like numbers and strings, vector databases excel at managing complex data types such as embeddings generated by machine learning models. These embeddings capture semantic relationships and similarities in a multidimensional vector space.

Key Components of Vector Databases

  1. Vector Storage: Efficiently stores high-dimensional vectors, often handling millions or billions of entries.
  2. Indexing Engine: Creates indices for fast retrieval using techniques like FAISS (Facebook AI Similarity Search), Annoy (Approximate Nearest Neighbors Oh Yeah), and HNSW (Hierarchical Navigable Small World graphs).
  3. Query Engine: Processes search queries by comparing query vectors against stored vectors using distance metrics such as cosine similarity, Euclidean distance, or Manhattan distance.
  4. Metadata Storage: Stores associated metadata like document IDs, timestamps, or additional attributes for filtering and retrieval.

These are usually the basic components of vector databases, but depending on system complexity there can be additional layers such as monitoring, caching, and so on.

Advantages Over Traditional Databases

  • High-Dimensional Data Handling: Optimized for complex, high-dimensional data.
  • Efficient Similarity Search: Excels at finding similar vectors, crucial for image retrieval, recommendation systems, and semantic search.
  • Scalability: Built to handle large datasets and high query loads by distributing data across multiple nodes.
  • Speed: Uses advanced indexing techniques for quick retrieval, essential for real-time applications.
  • Flexibility: Can store and index various data types (text, images, audio) as vectors.
  • Enhanced Search Accuracy: Leverages vector representations to provide more accurate and relevant search results.

Integrating vector databases with Large Language Models (LLMs) significantly enhances NLP applications, enabling more sophisticated and accurate information retrieval, recommendation, and search functionalities.

Open-Source Vector Databases

  1. FAISS (Facebook AI Similarity Search): A library developed by Facebook for efficient similarity search and clustering of dense vectors.
  2. Annoy (Approximate Nearest Neighbors Oh Yeah): A C++ library with Python bindings for fast approximate nearest neighbors search, developed by Spotify.
  3. Milvus: An open-source vector database built for scalable similarity search and AI applications.
  4. Weaviate: An open-source vector search engine that allows you to store data objects and vector embeddings for efficient retrieval.
  5. Vespa: An open-source search engine and analytics engine, which provides efficient retrieval for high-dimensional vectors.
  6. Elasticsearch: While not a vector database by design, Elasticsearch supports vector search through plugins like k-NN, enabling similarity searches on vector data.

Closed-Source Vector Databases

  1. Pinecone: A managed vector database service designed for high performance and scalability, offering fast and accurate vector search.
  2. AWS Kendra: An enterprise search service by Amazon Web Services that can index and search across various data sources using vectors.
  3. Azure Cognitive Search: A cloud search service by Microsoft that supports vector search among other advanced search capabilities.
  4. Google Cloud AI and Machine Learning Products: Google offers several managed services that include vector search capabilities integrated with other AI and machine learning tools.

Let’s examine vector search using Elasticsearch. But before running all the code, please start Elasticsearch via Docker:

docker run -it \
--rm \
--name elasticsearch \
-p 9200:9200 \
-p 9300:9300 \
-e "discovery.type=single-node" \
-e "xpack.security.enabled=false" \
docker.elastic.co/elasticsearch/elasticsearch:8.4.3

The Architecture of Search using Elasticsearch

Elasticsearch is a powerful search engine that can be extended to perform semantic search using its capabilities for handling vector data. While Elasticsearch was initially designed for full-text search, it has evolved to support vector search, making it a versatile tool for modern NLP applications.

Before diving into semantic search, it’s essential to understand two fundamental concepts in Elasticsearch: documents and indexes.

  • Document: A document in Elasticsearch is a collection of fields and their associated values. Think of it as a JSON object where each field can hold various data types, such as text, numbers, dates, or even nested objects (see the example after this list).
  • Index: An index is a collection of documents. It is stored in a highly optimized format designed for efficient searching. An index in Elasticsearch is akin to a database in traditional SQL systems. Each index can hold multiple types of documents, making it flexible and scalable.
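For instance, a single document in the course-FAQ dataset used in the steps below could look like the following (a hypothetical example; the values are made up, only the field names match the mapping we define later):

example_doc = {
    "course": "llm-zoomcamp",  # hypothetical course identifier
    "section": "General course-related questions",
    "question": "Can I still join the course after it has started?",
    "text": "Yes, you can still join and submit the homework assignments.",
}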

To implement semantic search using Elasticsearch, you need to leverage its ability to handle vector data and its powerful search capabilities. Here’s a step-by-step guide to setting up semantic search with Elasticsearch:

  1. Data Preparation: Organize your data into documents. Each document should represent an entity you want to search, such as a product, article, or user profile.
  2. Index Creation: Create an index in Elasticsearch to store your documents. Define the structure of your documents and specify the fields that will store vector embeddings.

The Notebook can be found here.

Step 1: Data Preparation:

import json

with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)

Step 2: Embeddings:

We need to generate vector embeddings for the textual content of documents using a pre-trained language model. Tools like BERT, Sentence Transformers, or any other embedding model can be used to create these vectors. Let’s use SentenceTransformer.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

operations = []
for doc in documents:
    # Transform the document text into an embedding using the model
    doc["text_vector"] = model.encode(doc["text"]).tolist()
    operations.append(doc)

Step 3: Set up the Elasticsearch connection and create the mapping and index:

  • Mapping is the process of defining how a document, and the fields it contains, are stored and indexed.
  • Each document is a collection of fields, which each have their own data type.
  • We can compare a mapping to a database schema in how it describes the fields and properties that documents hold, the data type of each field (e.g., string, integer, or date), and how those fields should be indexed and stored.
from elasticsearch import Elasticsearch
es_client = Elasticsearch('http://localhost:9200')
index_name = "course-questions"

index_settings = {
    "settings": {
        "number_of_shards": 1,   # Define the number of primary shards
        "number_of_replicas": 0  # Define the number of replica shards
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},       # Field for storing the main text content
            "section": {"type": "text"},    # Field for storing section information as text
            "question": {"type": "text"},   # Field for storing questions as text
            "course": {"type": "keyword"},  # Field for storing course identifiers
            "text_vector": {                # Field for storing vector embeddings
                "type": "dense_vector",
                "dims": 768,                # Number of dimensions in the vector
                "index": True,              # Enable indexing for this vector
                "similarity": "cosine"      # Use cosine similarity for vector comparison
            }
        }
    }
}



es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

number_of_shards: Defines the number of primary shards for the index. Shards are subdivisions of an index that allow Elasticsearch to distribute and parallelize the data. In this case, it is set to 1, meaning the entire index will be stored in a single shard.

number_of_replicas: Specifies the number of replica shards. Replicas are copies of the primary shards and provide redundancy to ensure data availability in case of hardware failure. Here, it is set to 0, meaning no replicas will be created.

mappings Section: This part defines the structure of the documents within the index, specifying the data types for each field.

  • text: A field for storing the main text content of the document. It is of type text, which means it will be analyzed and indexed for full-text search.
  • section: Another field for storing text, likely representing a specific section of a larger document. It is also of type text.
  • question: A field for storing questions. This is useful if the documents contain FAQ or question-answer pairs. It is of type text.
  • course: A field for storing course identifiers. It is of type keyword, meaning it will be indexed as-is without being analyzed. This is useful for exact match queries and aggregations.
  • text_vector: A field for storing vector embeddings. This field has several specific settings:
  • type: dense_vector, indicating that this field will store high-dimensional vectors.
  • dims: The number of dimensions in the vector, set to 768. This must match the output dimensionality of your embedding model.
  • index: Set to True, enabling indexing of this vector field for efficient search.
  • similarity: Specifies the similarity measure to be used for comparing vectors. It is set to cosine, which is commonly used for measuring the similarity between high-dimensional vectors.

Step 4: Adding the documents to the ElasticSearch Index:

for doc in operations:
    try:
        es_client.index(index=index_name, document=doc)
    except Exception as e:
        print(e)

Step 5: End-user query and search:

search_term = "windows or mac?"
vector_search_term = model.encode(search_term)

# Pure k-NN (semantic) search over the text_vector field
query = {
    "field": "text_vector",
    "query_vector": vector_search_term,
    "k": 5,
    "num_candidates": 10000,
}

res = es_client.search(index=index_name, knn=query, source=["text", "section", "question", "course"])

# k-NN search combined with a keyword filter on the "section" field
knn_query = {
    "field": "text_vector",
    "query_vector": vector_search_term,
    "k": 5,
    "num_candidates": 10000
}

response = es_client.search(
    index=index_name,
    query={
        "match": {"section": "General course-related questions"},
    },
    knn=knn_query,
    size=5
)

response["hits"]["hits"]

Here, we define a k-NN (k-Nearest Neighbors) query to search for the top 5 documents most similar to the encoded search term. We specify the field that contains the vector representations (text_vector) and the query vector (vector_search_term). We also set num_candidates to 10,000, meaning Elasticsearch will consider up to 10,000 candidate vectors before returning the top 5 results.

In this example, we demonstrated how to perform a vector-based search in Elasticsearch using a k-NN query. We started by encoding a search term into a vector, defining a k-NN query to find similar vectors, and performing the search. Additionally, we combined the k-NN search with a filter to narrow down the results based on specific criteria. This approach enhances the search capabilities by leveraging the semantic understanding of the search term, resulting in more relevant and contextually appropriate search results.
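To inspect the results, you can loop over the hits and pull out the relevance score together with the stored fields, for example:

for hit in response["hits"]["hits"]:
    # _score is the relevance score; _source holds the stored document fields
    print(hit["_score"], hit["_source"]["question"])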

I hope you discovered valuable insights and new ideas in this post. To stay updated with detailed use-case analyses, tips, and much more, don’t forget to subscribe. Never miss out on the latest advancements and practical applications in NLP and machine learning!
