Fine-Tuning in Sentence Transformers 3
Embedding models are one of the backbones of successful Retrieval-Augmented Generation (RAG) applications, as they are crucial for retrieving the relevant context needed to generate accurate answers. However, embedding models are usually trained on general-purpose data, which limits their effectiveness on specific domains or company-specific data. Customizing embeddings for your particular use case can improve the retrieval performance of your RAG application, leading to more accurate and relevant results.
With the release of Sentence Transformers 3, fine-tuning embedding models has become more accessible and efficient than before. This powerful library allows developers and researchers to fine-tune or enhance pre-trained models with domain-specific data, potentially achieving performance comparable to or surpassing proprietary models at a fraction of the cost.
This blog post will walk you through fine-tuning an embedding model using Sentence Transformers 3. We'll demonstrate how, with just a modest dataset and accessible computational resources, you can significantly improve retrieval performance for your specific use case. Our example will focus on fine-tuning the all-mpnet-base-v2 model for biomedical question answering, showcasing the potential for domain-specific improvements.
We'll cover everything from dataset preparation and model selection to fine-tuning, including:
- Installing necessary libraries and setting up your environment
- Preparing and formatting your dataset
- Establishing a baseline and evaluation protocol
- Defining an appropriate loss function
- Configuring training arguments
- Fine-tuning the model using the SentenceTransformerTrainer
- Evaluating the results and comparing performance
By the end of this tutorial, you'll have a clear understanding of how to leverage Sentence Transformers 3 to create custom embedding models tailored to your specific needs, potentially kickstarting your own data flywheel and improving your RAG application's performance.
Prerequisites
We will install the following libraries:
- PyTorch
- Sentence Transformers (HF)
- Transformers (HF)
- Datasets (HF)
Throughout this tutorial, we are using Python 3.11.5.
After installing the necessary libraries, you should register on Hugging Face, as we are going to use the Hugging Face Hub to push our models and training logs.
Get your access token here.
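As a minimal sketch, the setup can look like this (package versions are up to you; `login()` from `huggingface_hub` prompts for the token interactively):

```python
# Install the required libraries first, for example:
#   pip install torch sentence-transformers transformers datasets
from huggingface_hub import login

# Authenticate against the Hugging Face Hub so we can push models and logs later.
login()  # paste your access token when prompted
```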
Dataset Preparation
The Hugging Face Hub has many datasets that we can use to fine-tune embedding models. You can look here at the dataset structure required for fine-tuning embeddings.
We will use enelpol/rag-mini-bioasq, which includes 4,719 question-answer pairs from the BioASQ challenge datasets for biomedical semantic indexing and Question Answering (QA). We will use this dataset in a positive pair configuration.
We load the dataset using the Hugging Face datasets library and print an example to inspect its format.
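A minimal sketch of this step; the configuration name `question-answer-passages` is an assumption, so check the dataset card if `load_dataset` complains about it:

```python
from datasets import load_dataset

# Load the BioASQ-derived question-answer pairs. The configuration name below
# is an assumption based on the dataset card.
dataset = load_dataset("enelpol/rag-mini-bioasq", "question-answer-passages")

print(dataset)              # available splits and columns
print(dataset["train"][0])  # inspect a single example
```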
Because this format differs from what Sentence Transformers expects, we select and rename the columns to match the expected positive-pair format.
Once the formatting is ready, we save the train and test datasets to disk.
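A hedged sketch of the reshaping and saving steps; the original column names `question` and `answer` are assumptions based on the dataset card, so adapt them to the actual schema:

```python
# Keep only the two text columns and rename them to the (anchor, positive)
# convention used by Sentence Transformers for positive pairs.
def to_pair_format(split):
    return (
        split.select_columns(["question", "answer"])
             .rename_column("question", "anchor")
             .rename_column("answer", "positive")
    )

train_dataset = to_pair_format(dataset["train"])
test_dataset = to_pair_format(dataset["test"])

# Persist the formatted splits so the training script can reload them later.
train_dataset.save_to_disk("data/train")
test_dataset.save_to_disk("data/test")
```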
Baseline and Evaluation
Following dataset preparation, our next step is establishing a baseline method and evaluation protocol. This crucial step allows us to gauge the effectiveness of future model refinements against a known starting point. We'll assess how well a pre-existing model handles our specific data and how it performs after fine-tuning.
We've selected all-mpnet-base-v2 as the base model we will fine-tune. This model isn't particularly performant compared to other models of similar size, but let's see how far we can go with fine-tuning. With only 110 million parameters and a 768-dimensional embedding space, it scores 57.78 on the MTEB Leaderboard, below OpenAI's text-embedding-ada-002 at 60.99. We will also compare it against bge-base-en-v1.5, which has 109 million parameters and the same 768-dimensional embedding space, and achieves an impressive 63.55 on the MTEB Leaderboard.
Since we want to improve the Information Retrieval (IR) capabilities of the embeddings, we will quantify performance with the InformationRetrievalEvaluator. This tool assesses how well our model can fetch the most relevant documents for given queries, calculating metrics such as Mean Reciprocal Rank (MRR), Recall@K, and Normalized Discounted Cumulative Gain (NDCG). A useful explanation of these IR metrics can be found here.
To conduct our evaluation, we will utilize a comprehensive document pool combining train and test data for the corpus, while queries will be sourced exclusively from the test set. This approach ensures we assess the model’s ability to retrieve relevant documents from a larger corpus that includes unseen data, providing a more robust and realistic evaluation of its retrieval capabilities.
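A sketch of building the evaluator under these assumptions (document and query IDs are synthetic; the `anchor`/`positive` column names come from the formatting step above):

```python
from datasets import concatenate_datasets
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Corpus: all answer passages from train + test; queries: test questions only.
corpus_dataset = concatenate_datasets([train_dataset, test_dataset])
corpus = {f"doc_{i}": text for i, text in enumerate(corpus_dataset["positive"])}
queries = {f"q_{i}": text for i, text in enumerate(test_dataset["anchor"])}

# Each test query has exactly one relevant document: its paired answer, which
# sits at offset len(train_dataset) + i in the concatenated corpus.
offset = len(train_dataset)
relevant_docs = {f"q_{i}": {f"doc_{offset + i}"} for i in range(len(test_dataset))}

ir_evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="rag-mini-bioasq-test",
)
```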
We use the evaluator to score both the all-mpnet-base-v2 baseline and the bge-base-en-v1.5 reference model; later, we will reuse it to evaluate the fine-tuned model.
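For example, scoring the two reference models could look like this minimal sketch:

```python
from sentence_transformers import SentenceTransformer

for model_id in ("sentence-transformers/all-mpnet-base-v2", "BAAI/bge-base-en-v1.5"):
    model = SentenceTransformer(model_id)
    results = ir_evaluator(model)  # returns a dict of IR metrics
    print(model_id, results)
```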
We obtain the following results for both models.
| Metric | all-mpnet-base-v2 | bge-base-en-v1.5 |
|---|---|---|
| accuracy@1 | 0.7850 | 0.8515 |
| accuracy@3 | 0.8755 | 0.9349 |
| accuracy@5 | 0.9024 | 0.9491 |
| accuracy@10 | 0.9278 | 0.9590 |
| precision@1 | 0.7850 | 0.8515 |
| precision@3 | 0.2918 | 0.3116 |
| precision@5 | 0.1805 | 0.1898 |
| precision@10 | 0.0928 | 0.0959 |
| recall@1 | 0.7850 | 0.8515 |
| recall@3 | 0.8755 | 0.9349 |
| recall@5 | 0.9024 | 0.9491 |
| recall@10 | 0.9278 | 0.9590 |
| ndcg@10 | 0.8571 | 0.9122 |
| mrr@10 | 0.8348 | 0.8965 |
| map@100 | 0.8367 | 0.8973 |
Defining the Loss Function
We use the MultipleNegativesRankingLoss to fine-tune our embedding model. This loss matches our dataset format of positive text pairs: each (question, answer) pair is treated as a match, and the other answers in the batch serve as negatives. You can consult the dataset format and loss function overviews to determine which loss function fits your use case.
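A minimal sketch of instantiating the loss; it receives the model so it can embed both sides of each pair:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
train_loss = MultipleNegativesRankingLoss(model)
```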
Fine-tuning the Model
Now that we've prepared our data and model, we're ready to fine-tune our embedding model using the SentenceTransformerTrainer.
To configure our training process, we'll use the SentenceTransformerTrainingArguments class. This tool allows us to specify various parameters that can impact training performance and help with tracking and debugging. We'll be using parameter values based on those recommended in the Sentence Transformers documentation. However, it's important to note that these are just starting points. You should experiment with different values tailored to your specific dataset and task for optimal results.
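Below is a hedged sketch of one possible configuration and training loop. The hyperparameter values are illustrative starting points drawn from the documentation's recommendations, and the output directory / Hub repo name is an assumption:

```python
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="all-mpnet-base-v2-bioasq",       # local dir (and Hub repo name if pushed)
    num_train_epochs=1,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,                                   # use bf16=True instead on GPUs that support it
    batch_sampler=BatchSamplers.NO_DUPLICATES,   # avoid duplicate texts in a batch (helps MNRL)
    eval_strategy="steps",
    eval_steps=50,
    logging_steps=50,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    loss=train_loss,
    evaluator=ir_evaluator,
)
trainer.train()

# Optionally push the fine-tuned model and training logs to the Hugging Face Hub.
model.push_to_hub("all-mpnet-base-v2-bioasq")
```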
Training on roughly 4k samples took around one minute on an NVIDIA A10G instance from Modal Labs. At the time of writing (August 2024), the instance costs 1.1 USD/hour, which puts the cost of the training run at less than 0.1 USD.
Now we can evaluate the fine-tuned model using the same evaluator as before.
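For instance, a minimal sketch continuing from the training step above:

```python
# trainer.train() updates `model` in place, so we can evaluate it directly;
# alternatively, reload a saved checkpoint from the output directory.
fine_tuned_results = ir_evaluator(model)
print(fine_tuned_results)
```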
Focusing on the two metrics most relevant to our use case, we get the following comparison.
| Model | MRR@10 | NDCG@10 |
|---|---|---|
| all-mpnet-base-v2 (Baseline) | 0.8347 | 0.8571 |
| bge-base-en-v1.5 | 0.8965 | 0.9122 |
| all-mpnet-base-v2 Fine-tuned | 0.8919 | 0.9093 |
The fine-tuned model shows significant improvements over the baseline, with a 6.85% increase in MRR@10 and a 6.09% increase in NDCG@10, nearly matching the performance of the bge-base-en-v1.5 embeddings.
Conclusion
Embedding models play a crucial role in the success of RAG applications, as the quality of retrieved context directly impacts the generated answers. Using the Sentence Transformers 3 library, we fine-tuned the all-mpnet-base-v2 model on a biomedical question-answering dataset.
Results show substantial improvements:
- MRR@10 increased from 0.8347 to 0.8919 (6.85% improvement)
- NDCG@10 improved from 0.8571 to 0.9093 (6.09% improvement)
Our fine-tuned model achieved performance comparable to the stronger bge-base-en-v1.5 model despite starting from a lower baseline.
The fine-tuning process has become highly accessible and efficient. With only 4,719 question-answer pairs, we achieved these improvements in approximately one minute of training time on an NVIDIA A10G GPU. The estimated cost for this training was less than 0.1 USD, making it a cost-effective approach for enhancing domain-specific retrieval tasks. These results show the value of customizing embedding models for specific domains or use cases: we can obtain significant performance gains even with a relatively small dataset and minimal training time.