Discovering Semantic Search and RAG with Large Language Models (LLMs)

Machine Mind

Introductory image: an abstract illustration of the Retrieval-Augmented Generation process.

Large Language Models (LLMs) are incredibly powerful tools capable of performing a wide range of tasks, from text generation to answering questions. But what exactly are LLMs, and how do they work? If you’re new to Machine Learning and Natural Language Processing (NLP), start with the introductory post A Gentle Introduction to Large Language Models (LLMs). This beginner-friendly guide will help you get familiar with the basics of NLP and LLMs without diving into complex mathematics or jargon — just clear, straightforward explanations.

This series of tutorials is dedicated to exploring Large Language Models (LLMs) and their real-life applications across various use cases. If you’ve missed any previous posts, you can catch up on them here (links attached):

  1. Gentle Introduction to Large Language Models
  2. Semantic Search and RAG with Large Language Models
  3. Open-Sourced and Closed-Sourced Large Language Models
  4. Comprehensive Guide on Prompt Engineering

Don’t forget to subscribe in order to receive more practical use cases from the world of NLP.

Before proceeding, I highly recommend the LLMs Zoomcamp course from Alex Grigorev and DataTalks Club, where you can dive deep into complex concepts in practice, build your own solution during the course, and join a community of aspiring Data Engineers, Machine Learning Engineers, and Data Scientists to share knowledge together.

Refresher on Large Language Models

Large Language Models (LLMs) are massive transformer-based models capable of tackling a wide array of tasks, such as text classification (e.g., Named Entity Recognition or NER) and text generation. Think of an LLM as a “magic” black box (hello Schrödinger’s cat!) where you input a question with some context (known as a prompt) and receive a meaningful answer as output.

Large Language Model concept

However, is this the only way to utilize LLMs? Not quite. LLMs that haven’t been fine-tuned on specific data don’t understand context outside their training data. For instance, if you have notes from client conversations, an LLM won’t comprehend a specific query like, “What price did I propose to client XYZ yesterday?” because it lacks that particular context.

So, what’s the solution? The Retrieval-Augmented Generation (RAG) approach allows you to build a custom knowledge base and instruct the model to use this information as context to generate relevant responses. Sounds like magic, right? But it’s not — it’s a blend of advanced math, clever coding, and some engineering finesse. Let’s demystify this concept and explore how it works in practice.

Embeddings and Search in LLMs

To smoothly transition into RAG (Retrieval-Augmented Generation) and its implementation, let’s revisit the foundational concepts that underpin this approach. This will help you intuitively grasp the details.

As you might already know from the basics of NLP, LLMs work with text embeddings. These embeddings convert words, sentences, and texts into machine-readable formats represented in a multidimensional vector space.

Concept of representing the words in vector space

With these embeddings, we can perform various vector calculations. One key concept is similarity search, where two vectors are considered similar if they are close to each other based on certain metrics. Remember measures like Cosine Similarity, Euclidean Distance, and Manhattan Distance? These come into play here.
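
To make this concrete, here is a minimal sketch of embedding two sentences and comparing them with these three metrics. It uses the Sentence Transformers library (covered in more detail later in this post) together with NumPy; the model name and the example sentences are purely illustrative choices.

import numpy as np
from sentence_transformers import SentenceTransformer

# Any sentence-embedding model works here; this one is small and freely available
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

a = model.encode("The cat sleeps on the sofa")
b = model.encode("A dog naps on the couch")

# Cosine similarity: angle between the vectors (closer to 1 = more similar direction)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean (L2) distance: straight-line distance between the two points
euclidean = np.linalg.norm(a - b)

# Manhattan (L1) distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))

print(cosine, euclidean, manhattan)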

Embeddings are essentially the core of many NLP applications. They capture semantic relationships between words, enabling models to “understand context and meaning”. For instance, in the vector space, words like “cat” and “dog” will be positioned closer together due to their related meanings (both of them are animals), compared to unrelated words like “human” and “space”.

Similarity search is crucial in various applications, from document retrieval to recommendation systems. By leveraging embeddings, we can efficiently find and rank documents or items that are most relevant to a given query. This is the foundational idea behind semantic search, which we will explore further in the context of RAG.

Concept of Semantic Search with Embeddings

Using embeddings allows us to capture the semantic value of different text-based information (and not just text — you can represent any kind of data this way). By leveraging these embeddings, we can build plenty of fantastic applications, using them as a rich source of linguistic information.

Semantic search is a method of searching that aims to understand the meaning behind the words in your query, rather than just matching exact words. It seeks to grasp the context, intent, and relationships between words to deliver more relevant and accurate results.

Large Language Models (LLMs) like GPT-4 enhance semantic search by utilizing their deep understanding of language. These models have been trained on vast amounts of text and can comprehend nuances, synonyms, and context. For instance, if you search for “best way to prevent a cold” an LLM-powered search might return results about boosting your immune system, getting enough sleep, and other related topics, rather than just pages containing the exact phrase “prevent a cold”.

By understanding user intent and context, semantic search can vastly improve search accuracy. This is especially useful in areas like customer support, where understanding the precise nature of a query can lead to quicker and more effective solutions.

Symmetric Semantic Search:

In symmetric semantic search, the system treats both the query (what you type into the search bar) and the documents (the content being searched) in the same way. It converts both into a common format, such as numerical vectors, which can be easily compared to find matches. This approach is called “symmetric” because it handles both sides of the search process similarly.

Asymmetric Semantic Search:

In contrast, asymmetric semantic search treats the query and documents differently. This method is particularly useful when the query’s information type is fundamentally different from that of the documents. For example, the query might be a short question, while the documents could be long articles or detailed passages.

Asymmetric semantic search acknowledges the inherent imbalance between the query (input text) and the documents/information (e.g., your database, various embeddings) that the system, with the help of LLMs, needs to retrieve. This type of search can produce highly accurate and relevant results, even if the exact words aren’t used in the query. It leverages the learning capabilities of LLMs to understand the intent behind the query, rather than relying on the user to pinpoint the exact terms.

Symmetric vs Asymmetric: which approach is best?

Symmetric semantic search is often used in scenarios where both queries and documents are of similar nature, such as product searches on e-commerce sites.

Asymmetric semantic search, on the other hand, is useful in more complex information retrieval tasks, such as legal document searches or academic research.

In symmetric search, embeddings for both queries and documents are typically generated using the same model, ensuring consistency. In asymmetric search, different models or variations of the same model might be used to handle the distinct characteristics of queries and documents.

Imagine using a medical database. A symmetric search might involve looking up symptoms and finding articles with those symptoms. An asymmetric search could involve a patient’s short description of symptoms and retrieving detailed medical research papers relevant to those symptoms.
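
To illustrate the difference in code, here is a hedged sketch using the Sentence Transformers library (introduced in more detail later in this post). The symmetric setup encodes the query and the documents with the same general-purpose model, while the asymmetric setup uses a model trained on question-to-passage pairs. The model names below are common public checkpoints chosen for illustration, not a prescription.

from sentence_transformers import SentenceTransformer, util

query = "best way to prevent a cold"
docs = ["Boosting your immune system starts with sleep and nutrition.",
        "Our store sells cold brew coffee makers."]

# Symmetric: one general-purpose model encodes both the query and the documents
sym_model = SentenceTransformer('all-MiniLM-L6-v2')
sym_scores = util.cos_sim(sym_model.encode(query), sym_model.encode(docs))

# Asymmetric: a model trained on (short question, longer passage) pairs,
# so short queries and longer documents live in a shared vector space
asym_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
asym_scores = util.cos_sim(asym_model.encode(query), asym_model.encode(docs))

print(sym_scores)
print(asym_scores)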

Practical Implementation of a Simple Semantic Search

To use semantic search right out of the box, you need to follow a few straightforward steps. These steps are based on basic mathematical principles and require minimal effort.

Here’s a high-level overview of the architecture and steps involved:

Preprocessing step (1):

  1. Document Collection: Gather all documents to form your text corpus. These documents can be in various formats, such as DOCX, PDF, JSON, or TXT files (ensure you have methods to preprocess each of these formats).
  2. Creating Text Embeddings: Convert the textual information into vectors by creating text embeddings. This step encodes the textual data into a format that can be easily processed and compared.
  3. Storing Embeddings: If your document corpus is too large to fit in memory, use a vector database to store the embeddings. For smaller datasets, you can perform in-memory operations (a small storage sketch follows right after this list).
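
For step 3, here is a minimal sketch of storing corpus embeddings in a dedicated vector index. It assumes the faiss-cpu package and a Sentence Transformers model; both are illustrative choices, and any vector database (Qdrant, Elasticsearch, and so on) would play the same role.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

documents = ["Refund policy: returns accepted within 30 days.",
             "Shipping usually takes 3-5 business days.",
             "Support is available via chat and email."]

model = SentenceTransformer('all-MiniLM-L6-v2')

# Step 2: embed the corpus; normalized vectors let inner product act as cosine similarity
embeddings = model.encode(documents, normalize_embeddings=True).astype('float32')

# Step 3: store the embeddings in a FAISS index (inner-product / cosine search)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

print(index.ntotal)  # number of stored vectors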

Semantic Search Flow (2):

To retrieve the necessary information based on the input query, follow these steps:

  1. Preprocess and Embed the Query: Convert the input query into a vector using the same embedding technique as your document corpus.
  2. Extract Candidate Documents: Use a distance metric (e.g., cosine similarity, Euclidean distance) to find the most relevant documents. A higher similarity score indicates a better match. Setting a threshold value can help filter out irrelevant documents.
  3. Sort and Rerank Candidates: If necessary, sort or rerank the candidate documents to ensure the most relevant results are at the top.
  4. Return Results: Provide the sorted, relevant results back to the user (a minimal end-to-end sketch of this flow follows right after this list).
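
Here is a minimal, self-contained sketch of this query flow using plain NumPy and in-memory embeddings (reusing the same tiny example corpus as above for continuity). The model, the corpus, and the threshold value are illustrative placeholders only.

import numpy as np
from sentence_transformers import SentenceTransformer

documents = ["Refund policy: returns accepted within 30 days.",
             "Shipping usually takes 3-5 business days.",
             "Support is available via chat and email."]

model = SentenceTransformer('all-MiniLM-L6-v2')
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# 1. Preprocess and embed the query with the same model as the corpus
query_vec = model.encode("How long does delivery take?", normalize_embeddings=True)

# 2. Extract candidates: with normalized vectors, a dot product equals cosine similarity
scores = doc_embeddings @ query_vec

# 3. Sort and keep documents above an (arbitrary) threshold, best first
threshold = 0.3
ranked = [i for i in np.argsort(-scores) if scores[i] >= threshold]

# 4. Return the results
for i in ranked:
    print(round(float(scores[i]), 3), documents[i])
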
Building the Search Index with TF-IDF and Cosine Similarity

As mentioned above, at the core of any semantic search system is the text embedding mechanism. This component converts a document into a vector that represents the encoded text.

The choice of embedding approach significantly impacts the quality of the text representation. You can use simple methods like TfidfVectorizer or more advanced ones like OpenAI’s embeddings, which are trained on massive corpora. Keep in mind that proprietary embeddings such as OpenAI’s tie you to an external API and its pricing, which limits how and where you can use them.

Once you have chosen your embedding approach, you need to calculate the similarity between the input vector and the document embeddings. One common method is cosine similarity, which measures the angle between two vectors. A cosine similarity score close to 1 indicates that the vectors are in a similar direction, while a score close to -1 indicates they are in opposite directions.

Now, let’s dive into a Python implementation that constructs a simple yet effective search index using TF-IDF (Term Frequency-Inverse Document Frequency) and cosine similarity. This search index is designed to handle both text fields and keyword fields, providing a robust solution for searching and filtering text data.

First, import the following libraries and make sure they are installed. Put the code below into a script named minsearch.py at the root of your project (we will import it later as minsearch):

1. Import the necessary libraries:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from concurrent.futures import ThreadPoolExecutor
from scipy.sparse import csr_matrix

If any of them are missing, install them with pip.

2. Define the class that will build the index and perform the semantic search:

class Index:
    def __init__(self, text_fields, keyword_fields, vectorizer_params=None):
        if vectorizer_params is None:
            vectorizer_params = {}
        self.text_fields = text_fields
        self.keyword_fields = keyword_fields
        self.vectorizers = {field: TfidfVectorizer(**vectorizer_params) for field in text_fields}
        self.keyword_df = None
        self.text_matrices = {}
        self.docs = []

The Index class encapsulates the logic for building and searching the index. It has several important attributes and methods.

__init__ method initializes the class with text fields, keyword fields, and optional parameters for the TF-IDF vectorizer. It sets up the necessary structures for storing the vectorizers, keyword data, TF-IDF matrices, and documents.

3. Implementation of the fit method:

def fit(self, docs):
    self.docs = docs
    keyword_data = {field: [] for field in self.keyword_fields}

    def fit_text_field(field):
        texts = [doc.get(field, '') for doc in docs]
        return field, self.vectorizers[field].fit_transform(texts)

    with ThreadPoolExecutor() as executor:
        results = executor.map(fit_text_field, self.text_fields)
        for field, matrix in results:
            self.text_matrices[field] = matrix

    for doc in docs:
        for field in self.keyword_fields:
            keyword_data[field].append(doc.get(field, ''))

    self.keyword_df = pd.DataFrame(keyword_data)
    return self

The fit method processes a list of documents, extracting text and keyword data. It uses a thread pool to parallelize the fitting of TF-IDF vectorizers for each text field, which improves efficiency. The keyword data is stored in a DataFrame for easy filtering later.

4. Implementing the search feature for the index:

def search(self, query, filter_dict={}, boost_dict={}, num_results=10):
    query_vecs = {field: self.vectorizers[field].transform([query]) for field in self.text_fields}
    scores = np.zeros(len(self.docs))

    def compute_similarity(field, query_vec):
        # Cosine similarity between the query and every document, scaled by the field boost
        sim = cosine_similarity(query_vec, self.text_matrices[field]).flatten()
        boost = boost_dict.get(field, 1)
        return sim * boost

    with ThreadPoolExecutor() as executor:
        similarities = executor.map(lambda field: compute_similarity(field, query_vecs[field]), self.text_fields)
        for sim in similarities:
            scores += sim

    # Zero out documents that do not match the keyword filters
    for field, value in filter_dict.items():
        if field in self.keyword_fields:
            mask = self.keyword_df[field] == value
            scores = scores * mask.to_numpy()

    # Select the top candidates and sort them by descending score
    top_indices = np.argpartition(scores, -num_results)[-num_results:]
    top_indices = top_indices[np.argsort(-scores[top_indices])]

    top_docs = [self.docs[i] for i in top_indices if scores[i] > 0]

    return top_docs

The search method takes a query string and optional filters and boosts. It converts the query into TF-IDF vectors, computes cosine similarities between the query and the indexed documents, and applies any keyword filters. Finally, it returns the top matching documents based on their relevance scores.

So now we have a fully implemented class that can build the index and perform a basic semantic search over the encoded documents. If you want to test it in practice, run the following chunk of code. You can also extend this class with other embedding methods.

import minsearch
import requests
import json

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

index.fit(documents)

Now you have an index that you can query by providing your question, e.g.:

query = "Can I join the course if it has already started?"

filter_dict = {"course": "data-engineering-zoomcamp"}
boost_dict = {"question": 3}

results = index.search(query, filter_dict, boost_dict, num_results=5)

for result in results:
pprint(json.dumps(result, indent=2))

The output will be in JSON format, one document per result:

{
  "text": "Yes, even if you don't register, you're still eligible to...",
  "section": "General course-related questions",
  "question": "Course - Can I still join the course after the start date?",
  "course": "data-engineering-zoomcamp"
}
...

That covers the basics of semantic search.

If you want to extend the functionality to proprietary embeddings like OpenAI’s, you need to add extra functionality to the class. OpenAI provides several embedding models, each offering different levels of accuracy and optimized for different types of text data.

First of all, you should create an account on OpenAI’s platform to obtain an API key for communicating with their models and embeddings (IMPORTANT: it is not free; you pay for usage depending on how many tokens you send and receive — see the pricing page in the documentation for details).

import os
import openai
from openai.embeddings_utils import get_embedding  # helper from the pre-1.0 openai package

# Put OPENAI_API_KEY in your environment (e.g., load it from a .env file at the project root)
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
openai.api_key = OPENAI_API_KEY

# Using OpenAI's embedding model to get back a vector representation
embedded_text = get_embedding('The result of embedding is a vector representation',
                              engine='text-embedding-ada-002')

len(embedded_text)  # 1536 dimensions for text-embedding-ada-002

Open-Sourced Embeddings:

While companies like OpenAI offer powerful text embedding products, there are also several excellent open-source alternatives. One popular option is the bi-encoder approach using BERT, a robust deep learning algorithm known for achieving state-of-the-art results across various natural language processing tasks. You can find pre-trained bi-encoders in many open-source repositories, including the Sentence Transformers library, which provides models ready to use for a range of NLP tasks.

The bi-encoder approach leverages BERT’s ability to understand and encode complex language patterns. By using two encoders — one for the input (query) and one for the output (document) — the system can handle diverse NLP tasks with greater accuracy. Training these encoders jointly ensures that the embeddings they generate are highly tuned to capturing semantic similarity.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

input_texts = ["How to prevent a cold?",
               "What are the symptoms of flu?",
               "Best ways to boost immune system"]

output_texts = ["To prevent a cold, you should...",
                "Flu symptoms include...",
                "Boosting immunity involves..."]

# Encode the texts
input_embeddings = model.encode(input_texts, convert_to_tensor=True)
output_embeddings = model.encode(output_texts, convert_to_tensor=True)

# Compute the similarity between input and output texts
similarity_scores = util.pytorch_cos_sim(input_embeddings, output_embeddings)

# Print the similarity scores
print(similarity_scores)

Retrieval-Augmented Generation (RAG)

As we delve deeper into the capabilities of Large Language Models (LLMs), it’s crucial to understand Retrieval-Augmented Generation (RAG), a powerful technique that enhances the performance and applicability of these models.

RAG combines the strengths of traditional retrieval-based methods with the generative prowess of LLMs, leading to more accurate and contextually relevant responses.

Why RAG?

Standard LLMs like GPT-4 are trained on vast amounts of data, enabling them to generate coherent and contextually appropriate responses to a wide range of queries. However, they have a significant limitation: they cannot access or incorporate information that was not part of their training data. This becomes problematic in scenarios requiring up-to-date information or specific knowledge that is not commonly available in the general training corpus.

Imagine you have a repository of internal documents, such as client interactions, proprietary research papers, or specialized knowledge bases. An LLM, without additional context, wouldn’t be able to accurately answer specific queries related to this information.

This is where RAG comes into play. RAG allows the model to retrieve relevant documents from an external knowledge base and use this information to generate informed responses.

How Does RAG Work?

RAG integrates two main components: a retriever and a generator.

  1. Retriever: This component is responsible for fetching relevant documents from a knowledge base based on the input query. It uses techniques like text embeddings and similarity search to identify the most relevant pieces of information. Think of it as the librarian who quickly finds the books or articles you need from the vast library.
  2. Generator: Once the relevant documents are retrieved, the generator (typically an LLM like GPT-4) takes over. It processes the input query and the retrieved documents to generate a response that is both accurate and contextually enriched. The generator is akin to an expert who not only understands the query but also synthesizes information from multiple sources to provide a comprehensive answer.

The RAG Workflow

Let’s break down the RAG process into clear steps:

  1. Input Query: A user submits a query to the system. This could be a question or a request for specific information.
  2. Document Retrieval: The retriever converts the query into a vector representation and searches the knowledge base for relevant documents. It ranks these documents based on their similarity to the query.
  3. Document Embedding: The retrieved documents are also converted into vector representations to facilitate seamless integration with the query.
  4. Response Generation: The generator takes the input query and the embeddings of the retrieved documents. It uses this combined information to generate a detailed and contextually appropriate response.
  5. Output: The system presents the generated response to the user, ensuring that the answer is informed by the most relevant and up-to-date information available in the knowledge base.

Let’s see the practical usage and implementation of RAG. First, ask the LLM the question without any additional context:

from openai import OpenAI

q = 'the course has already started, can I still enroll?'

client = OpenAI(api_key=OPENAI_API_KEY)

response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{"role": "user", "content": q}]
)

response.choices[0].message.content

It will output the following results:

"It depends on the specific course and institution offering it. Here are some factors to consider:\n\n1. **Course Enrollment Policies:** Check with the institution or course administrator. Some courses have a fixed enrollment period, while others may allow late registration.\n  \n2. **Availability:** Ensure there are still open spots in the course. Popular courses may fill up quickly.\n\n3. **Prerequisites and Preparedness:** Make sure you meet any prerequisites and consider how much of the course material you have missed. Catching up may require extra effort.\n\n4. **Impact on Grades:** Determine if the institution allows late enrollees to make up for missed work and whether this might impact your grade.\n\n5. **Instructor Approval:** Sometimes, enrolling late requires approval from the instructor, who can provide guidance on how to catch up with coursework.\n\n6. **Fees and Deadlines:** Be aware of any additional fees that might apply to late registration and check deadlines for adding or dropping courses.\n\nTo proceed, contact the course administrator, academic advisor, or the institution's registration office as soon as possible for specific information and to discuss your options."

The next step is to implement the search function, which wraps the index we built earlier:

def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5
    )

    return results

After this, we need to build the prompt that injects the retrieved documents as context:

def build_prompt(query, search_results):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

    context = ""

    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

The LLM call with the prompt:

def llm(prompt):
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content

And the final step is to wire everything together into the rag function:

query = 'how do I run kafka?'

def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

rag(query)

The result of this flow will be:

'To run Kafka, follow these steps based on the provided CONTEXT:\n\nIf you are running a Java producer/consumer/kstreams application in the terminal:\n\nIn the project directory, execute the following command:\n```bash\njava -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java\n```\n\nIf you are running a Python producer:\n\n1. Create and activate a virtual environment:\n   ```bash\n   python -m venv env\n   source env/bin/activate\n   ```\n\n2. Install the required packages:\n   ```bash\n   pip install -r ../requirements.txt\n   ```\n\n3. Make sure the docker images for your Kafka setup are up and running.\n\nTo activate the virtual environment later:\n```bash\nsource env/bin/activate\n```\n\nTo deactivate the virtual environment when done:\n```bash\ndeactivate\n```\n\nFor Windows, the activation path differs:\n```bash\nenv/Scripts/activate\n```\n\nMake sure you adjust paths and filenames according to your specific setup and operating system.'

And that’s it!

Practical Applications of RAG

RAG’s ability to combine retrieval and generation makes it incredibly versatile. Here are a few practical applications:

  • Customer Support: RAG can enhance customer support systems by providing accurate answers based on a company’s internal knowledge base, leading to faster and more reliable support.
  • Research Assistance: Researchers can use RAG to access and synthesize information from extensive research databases, aiding in literature reviews and data analysis.
  • Content Creation: Writers and content creators can leverage RAG to gather and integrate information from various sources, producing well-informed and comprehensive articles.
  • Healthcare: In the medical field, RAG can help clinicians access the latest research and guidelines, providing evidence-based recommendations for patient care.

Retrieval-Augmented Generation (RAG) represents a significant advancement in the field of natural language processing. By combining the strengths of retrieval-based methods and generative models, RAG enables more accurate, contextually relevant, and informed responses. Whether used in customer support, research, content creation, or healthcare, RAG’s ability to integrate external knowledge with powerful language models opens up a world of possibilities, making it an indispensable tool for modern applications.

By understanding and implementing RAG, we can push the boundaries of what LLMs can achieve, ensuring that they provide not just coherent, but also highly relevant and informed responses to user queries.

Machine Mind

Machine Learning Engineer | Data Scientist with 10+ years of experience working in the industry. Sharing interesting content with you. https://x.com/pythonmaverick