Gentle Introduction to Large Language Models (LLMs)


Many of you have probably experimented with tools like ChatGPT for fun. Over the past few years, this tech has infiltrated every aspect of our lives — from business processes to daily tasks. And guess what? We’re just getting started. Many still don’t fully understand Machine Learning, Neural Networks, or AI, but that’s about to change. This is a starting point; if you don’t grasp the basics now, you might find it harder to catch up later. So, it’s high time to start gaining new knowledge.

When OpenAI launched ChatGPT, it was a game-changer, revealing the incredible power of Large Language Models (LLMs). These models have transformed everything from trip planning to cooking recipes. But let’s demystify them: at their core, LLMs are sophisticated mathematical models that excel at processing language.

This subset of Natural Language Processing (NLP) has become a leading force in machine learning. Now, everyone’s eager to learn about LLMs, NLP, and how to harness their potential.

So, let’s break it down — what LLMs are, how they function, and how you can use them for learning, life, and beyond. All you need to follow along with this series is basic school math, some Python programming skills, and a healthy dose of curiosity. A foundational understanding of Machine Learning is a plus but not mandatory. I’ll provide additional sources where you can catch up on all of this if you want to dig deeper.

This series of tutorials is dedicated to exploring Large Language Models (LLMs) and their real-life applications across various use cases. If you’ve missed any previous posts, you can catch up on them here (links attached):

  1. Gentle Introduction to Large Language Models
  2. Semantic Search and RAG with Large Language Models
  3. Open-Sourced and Closed-Sourced Large Language Models
  4. Comprehensive Guide on Prompt Engineering
  5. Enhancing LLM Performance with Vector Search and Vector Databases

Don’t forget to subscribe so you don’t miss the practical use cases from the world of NLP.

Demystifying Large Language Models

The rise of LLMs began even before OpenAI’s ChatGPT made headlines. In 2017, Google’s research team introduced a groundbreaking deep learning architecture known as the Transformer (research paper: “Attention Is All You Need”).

This architecture quickly set the standard for a variety of Natural Language Processing (NLP) tasks. You’ve probably interacted with transformer models without realizing it — think Google Translate, Google’s Search Engine, or Autocomplete. Before transformers, Recurrent Neural Networks (RNNs) were commonly used for these tasks. Now, let’s dive into how LLMs work and explore the basics of NLP.

NLP, or Natural Language Processing, is a field of Machine Learning focused on understanding and interpreting human language. Using mathematical and statistical methods, we address two primary types of tasks: classification (e.g., identifying whether an email is spam) and generation (e.g., writing a poem about Artificial Intelligence). Thanks to rapid advancements, there are numerous frameworks and packages for experimenting with open-source LLMs. One popular package is Transformers from Hugging Face.

Large Language Models (LLMs) are typically based on the Transformer architecture and are designed to understand and generate human language, code, and other data. These deep learning models are trained on enormous text datasets to capture the nuances of human language.

The Transformer neural network architecture is both impressive and straightforward. It can be highly parallelized and scaled in ways other architectures cannot. A crucial component of the Transformer architecture is self-attention, which allows each word in a sequence to consider all other words in the sequence, capturing long-range dependencies and contextual relationships. However, transformers are not without limitations. One challenge is their input context window — the maximum length of text they can process at a time. There are other challenges, but we will talk about them later.

While this post isn’t about delving deep into the Transformer architecture, a basic understanding is helpful. If you’re curious about the inner workings of Transformers, I highly recommend the YouTube video by 3Blue1Brown titled Visual Intro To Transformers.

The most common subtasks in the field of NLP are the following:

  • Classification: Categorizing text, sentences, or words. Examples include identifying spam vs. non-spam emails, correcting grammar, determining the sentiment of a book, or defining its genre. You can even classify more granular elements like individual words, such as grammatical tagging or named entity recognition, where each word gets a label like person, place, or object.
  • Text Generation: This involves creating new textual content, such as generating responses to questions, writing new sentences for language translation, or even crafting entirely new texts.

To start using Transformers with Python, install the package in your environment via pip:

$ pip install transformers

Using Transformers is as easy as it gets. For example, a text classification task:

from transformers import pipeline

# Create a text-classification pipeline; with no model specified,
# it downloads a default English sentiment model on first use.
classifier = pipeline("text-classification")

text = "I love using transformers for natural language processing!"
result = classifier(text)

# Prints a list with the predicted label and its confidence score,
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
print(result)

There are plenty of pipelines and models available. Some of the common tasks available through the pipeline API are:

  • fill-mask: Predicts the masked token in a sentence.
  • feature-extraction: Provides vector/embedding representation of text.
  • ner: Named Entity Recognition.
  • question-answering: Provides answers to questions based on context.
  • sentiment-analysis: Determines the sentiment of the text.
  • summarization: Shortens the text to main points.
  • text-generation: Generates new text based on a prompt.
  • translation: Translates text from one language to another.
  • zero-shot-classification: Classifies texts that haven’t been labeled.
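For instance, here is a small sketch of the zero-shot-classification task from the list above. It lets you classify text against labels the model has never been trained on (with no model specified, the pipeline downloads a default checkpoint on first use):

from transformers import pipeline

# Zero-shot classification: score a text against arbitrary candidate labels.
classifier = pipeline("zero-shot-classification")
result = classifier(
    "The new graphics card delivers excellent performance for gaming.",
    candidate_labels=["technology", "sports", "politics"],
)

# The labels come back sorted by score, most likely first.
print(result["labels"][0], result["scores"][0])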

Another example, this time using Named Entity Recognition:

from transformers import pipeline

# grouped_entities=True merges sub-word tokens back into whole entities (e.g. "Hugging Face").
ner = pipeline("ner", grouped_entities=True)

text = "The Transformers library from Hugging Face is a powerful tool for natural language processing."
entities = ner(text)

# Prints each detected entity with its type (person, organization, location, ...) and score.
print(entities)

Next, an example of a text generation task:

from transformers import pipeline

# GPT-2 is a small auto-regressive model suited for open-ended text generation.
generator = pipeline("text-generation", model="gpt2")

prompt = "Once upon a time, in a land far, far away,"
# max_length caps the total number of tokens (prompt included);
# num_return_sequences controls how many completions come back.
generated_text = generator(prompt, max_length=50, num_return_sequences=1)

print(generated_text)

How LLMs work at a high level:

There are numerous LLMs available today, and it can be overwhelming if you don’t understand the differences between them. Transformers can be grouped into several categories, each designed to solve specific tasks.

1. GPT-like models (auto-regressive Transformers): These models generate text one word at a time, with each word depending on the previously generated words. They excel at text generation tasks like story completion and conversational agents. Examples include GPT-2, GPT-3, and GPT-4.

2. BERT-like models (auto-encoding Transformers): These models are designed to understand and interpret text by analyzing the context of words within a sentence. They are excellent for tasks such as text classification, named entity recognition, and question answering. Examples include BERT and RoBERTa.

3. BART/T5-like models (sequence-to-sequence Transformers): These versatile models can handle a variety of tasks by converting input sequences to output sequences. They are particularly effective for tasks like text summarization, translation, and other transformation tasks. Examples include BART and T5.

(Diagram: Transformers can be grouped into auto-regressive, auto-encoding, or sequence-to-sequence models.)

The Transformer models mentioned earlier (such as GPT, BERT, BART, and T5) are fundamentally trained as language models. They learn from massive amounts of raw text data through a method called self-supervised learning. In self-supervised learning, the model generates its own training signals from the input data, which means it doesn’t require human-labeled data.

During this training process, these models develop a statistical understanding of the language they are exposed to. However, this broad linguistic comprehension isn’t directly applicable to specific practical tasks. To make these models useful for particular applications, we employ a process called transfer learning. In transfer learning, the pretrained model is fine-tuned using supervised learning, which involves human-labeled data tailored to a specific task.

One common training objective is causal language modeling, where the model predicts the next word in a sentence based on the preceding words. This approach generates predictions using past and present inputs without considering future inputs, making it well suited for tasks like text generation and autocomplete.
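As a toy illustration of causal language modeling (a sketch for intuition, not how production systems are served), you can ask GPT-2 which tokens it considers most likely to come next after a prefix:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Encode a prefix and run it through the model to get next-token scores.
inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Take the scores at the last position and look at the five highest-scoring candidates.
next_token_logits = logits[0, -1]
top = torch.topk(next_token_logits, k=5)

# GPT-2 tokens that start with a space are shown with a leading "Ġ" in this view.
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))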

In contrast, masked language modeling, used in models like BERT, involves masking certain words in a sentence and training the model to predict these masked words based on the context provided by the surrounding words. This method enables the model to understand bidirectional context, making it highly effective for comprehension tasks such as question answering and text classification.
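Here is a quick sketch of masked language modeling in code, using BERT through the fill-mask pipeline (bert-base-uncased expects the [MASK] token):

from transformers import pipeline

# Masked language modeling: predict the hidden word from its surrounding context.
fill = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))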

Modern Transformer models are extraordinarily large in terms of parameters and storage size. For instance, GPT-4 is reported to have on the order of 1.76 trillion parameters (OpenAI has not officially confirmed the figure). Training such models requires vast amounts of high-quality data and substantial computational resources, which makes it impractical for individual developers to train them from scratch.

Despite these challenges, the techniques mentioned above, particularly transfer learning, enable you to fine-tune these pre-trained models for specific use cases effectively. Fine-tuning involves adjusting the weights of a pretrained model on a smaller, task-specific dataset. This process significantly reduces the computational resources and time required compared to training a model from scratch.

Example of Transfer Learning Workflow:

1. Pretraining: The model is initially trained on a large corpus of text data using self-supervised learning. This step helps the model learn general language features such as grammar, syntax, and semantics.

2. Fine-Tuning: The pretrained model is then fine-tuned on a smaller, labeled dataset specific to the desired application. For example, fine-tuning BERT on a dataset of customer reviews can improve its performance in sentiment analysis tasks.

3. Evaluation and Deployment: After fine-tuning, the model is evaluated to ensure it meets the performance criteria for the specific task. Once validated, the model can be deployed for practical applications.
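To make the fine-tuning step above concrete, here is a minimal sketch using the Hugging Face Trainer API and the publicly available IMDB reviews dataset from the datasets library. The dataset, checkpoint, and hyperparameters are illustrative assumptions chosen to keep the example small, not a recipe for best results:

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# 1. Load a labeled dataset (movie reviews with positive/negative sentiment) and tokenize it.
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

# 2. Start from the pretrained BERT weights and add a fresh 2-class classification head.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# 3. Fine-tune on a small subset so the example stays cheap to run, then evaluate.
args = TrainingArguments(output_dir="bert-imdb", num_train_epochs=1, per_device_train_batch_size=16)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(500)),
)
trainer.train()
print(trainer.evaluate())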

While Transformers have revolutionized NLP, they also come with challenges such as computational expense, energy consumption, and the need for vast datasets. Researchers are continuously exploring new architectures and techniques to make these models more efficient and accessible. Techniques like model distillation, which reduces the size of a large model while retaining its performance, and sparse attention mechanisms, which aim to reduce computational overhead, are promising areas of research.

In summary for this part, understanding and leveraging the capabilities of Transformer models like GPT, BERT, BART, and T5 through self-supervised learning and transfer learning can unlock numerous practical applications in NLP. By fine-tuning these models for specific tasks, we can harness their power without the need for extensive computational resources.

High-Level Transformer Architecture:

Today’s LLMs are mostly Transformers. The original Transformer architecture (from the “Attention Is All You Need” paper) can look complex at first, so let’s break it down step by step. It primarily consists of two crucial components: the Encoder and the Decoder:

  • Encoder: Think of the encoder as a sophisticated reader. It takes the input text and builds a detailed representation of its features, helping the model understand the input deeply. This involves encoding the input into continuous representations that capture its nuances.
  • Decoder: The decoder is like a creative writer. It uses the encoder’s representations along with other inputs to generate a target sequence, optimizing the model for generating outputs. This could be translating text, generating new sentences, or summarizing information.

Depending on the task, these components can be used independently or together:

1. Encoder-only models: Ideal for tasks that require a deep understanding of the input, such as sentence classification and named entity recognition. Examples include BERT and RoBERTa.

2. Decoder-only models: Suited for generative tasks where the goal is to produce text, like text generation. Examples include GPT-2, GPT-3, and GPT-4.

3. Encoder-Decoder models (aka sequence-to-sequence models): Perfect for tasks that require both understanding and generation, such as translation and summarization. Examples include BART and T5.
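For example, here is a small sketch of an encoder-decoder model in action, using the t5-small checkpoint through the translation pipeline (the checkpoint is an illustrative choice):

from transformers import pipeline

# T5 is an encoder-decoder model: the encoder reads the English sentence,
# and the decoder generates the French translation token by token.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("Transformers changed natural language processing.")[0]["translation_text"])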

A key feature of the Transformer architecture is its advanced attention mechanisms. These mechanisms instruct the model to focus on specific parts of the input sequence, leading to more accurate and context-aware results.

Attention is a mechanism used in deep learning models (RNNs, Transformers) that assigns different weights to different parts of the input, allowing the model to prioritize the most important information while performing tasks like generation, translation, or sentiment analysis. As I said above, attention lets the model focus on different parts of the input dynamically, which leads to improved performance and higher accuracy.

Attention mechanisms include:

  • Self-Attention: Allows the model to weight the importance of each word in a sentence relative to every other word, enabling it to capture dependencies and relationships over long distances. For a detailed explanation, see The Illustrated Transformer.
  • Multi-Head Attention: Enhances the model’s ability to focus on different parts of the input simultaneously by using multiple attention heads. This parallel processing significantly boosts the model’s performance.

Analogy for Attention: Imagine reading a complex book. Self-attention is like taking notes and cross-referencing them with each other to understand the story better, while multi-head attention is like having several people read the book and highlight different important sections, providing a richer understanding when combined.
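To make self-attention less abstract, here is a minimal NumPy sketch of scaled dot-product attention, the core computation inside each attention head. In a real Transformer the queries, keys, and values come from learned linear projections; here the same toy matrix is reused for brevity:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Similarity between every query and every key, scaled by sqrt(d_k).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys turns the scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted mix of the value vectors.
    return weights @ V

# Toy example: 3 tokens, each represented by a 4-dimensional vector.
np.random.seed(0)
x = np.random.randn(3, 4)
output = scaled_dot_product_attention(x, x, x)
print(output.shape)  # (3, 4): one context-aware vector per token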

For any LLM to learn any kind of rule, however, it has to convert what we perceive as text into something machine-readable. This is done through embedding (vectorization). The output of this process is embeddings: mathematical (vector) representations of words, sentences, or tokens in a high-dimensional space. Embeddings capture semantic meaning, and relationships between words can be measured via vector distance calculations. There are several types of embeddings:

  • Positional Embeddings: Encode the position of a token within a sentence, helping the model understand the order of words.
  • Token Embeddings: Represent the semantic meaning of individual words or tokens.
  • Mixed Embeddings: Combine positional and token embeddings to provide a comprehensive representation of the input.
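As a small sketch of embeddings in practice, the feature-extraction pipeline listed earlier turns text into vectors, and cosine similarity between those vectors reflects how semantically close two sentences are. The mean-pooling step and the default checkpoint are simplifying assumptions; dedicated sentence-embedding models usually work better:

from transformers import pipeline
import numpy as np

extractor = pipeline("feature-extraction")

def embed(text):
    # The pipeline returns one vector per token; average them into a single sentence vector.
    vectors = np.array(extractor(text)[0])
    return vectors.mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = embed("The cat sat on the mat.")
b = embed("A kitten is resting on the rug.")
c = embed("Quarterly revenue grew by 12 percent.")

print(cosine(a, b))  # semantically close sentences score higher
print(cosine(a, c))  # unrelated sentences score lower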

So far we have covered the following parts at a high level: Transformers, the Encoder, the Decoder, Encoder-Decoder models, Attention, and Embeddings. But another technique that is really important in how modern LLMs are built and used is RLHF (Reinforcement Learning from Human Feedback).

This technique is used to align language models with the user’s intent, and it is done via Reinforcement Learning (RL). RLHF is a popular way of aligning pre-trained LLMs that uses human feedback to improve their behavior. It allows an LLM to learn from a relatively small amount of human feedback to correct its outputs. RLHF has driven significant improvements in modern LLMs such as GPT-3.5 and GPT-4, and it is used in the ChatGPT products.

That was a high-level overview of LLMs and Transformers. I have a separate post where I explain the details and math behind Transformers, but for now this is enough to make clear what an LLM is and how it performs its work.

And where can you practically use LLMs?

  • Classical NLP (text/word/sentence classification)
  • Translation from one language to another
  • Code/SQL/Review/Simple Text Generation
  • Information Retrieval
  • Semantic Search
  • Chatbots

In summary, the Transformer architecture, with its Encoder, Decoder, advanced attention mechanisms, and embeddings, provides a powerful foundation for LLMs. Techniques like RLHF further enhance these models, making them incredibly versatile and effective for a wide range of NLP tasks.
