In my previous blog post, Exploring the Power of LLMs for Technical Content Review, I discussed the potential of large language models (LLMs) to enhance the efficiency of content reviews. This post covers the final part of that exploration: fine-tuning a model on custom data and checking the quality of its responses.
## Learning about fine-tuning and tools
In simple terms, fine-tuning is the process of taking a pre-trained model and training it further on a new dataset to adapt it to a specific task. I have been watching a lot of AI-related videos on YouTube, so the algorithm keeps surfacing a variety of AI topics, and along the way I had seen a few different tools for tuning a model. Through these resources, I built an understanding of the fine-tuning process and the various tools available to achieve it.
## Deciding on a tool and dataset
While fine-tuning is a complex and resource-intensive process that typically requires powerful GPUs, there are tools that allow you to fine-tune a model on a free Google Colab instance.
Google Colab is a service that requires no setup and provides free access to computing resources, including GPUs. So, I picked one of the tools I remember seeing in a video: Unsloth.
They offer free Jupyter notebooks (interactive web-based environments containing code and explanatory text) that allow anyone to fine-tune LLMs.
One thing to note is that different LLMs need data in different formats for fine-tuning, and the columns the dataset must provide also depend on the fine-tuning technique. For example, one technique requires data in the Alpaca format: `instruction`, `input`, and `output`. Another method, called Odds Ratio Preference Optimization (ORPO), requires data in the format: `instruction`, `input`, `accepted`, and `rejected`.
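For illustration, a single Alpaca-format record might look like this (the values here are hypothetical, modeled on my content review use case):

```json
{
  "instruction": "Rewrite the sentence to be concise and use inclusive language.",
  "input": "The master server distributes tasks to the slave servers.",
  "output": "The primary server distributes tasks to the replica servers."
}
```

An ORPO record would replace the single `output` with an `accepted` and a `rejected` response, giving the model a preferred and a dispreferred answer to learn from.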
Additionally, model size plays a role in training time. The size of a model is determined by its number of parameters: the values the model adjusts during training to capture the connections between words and phrases. For instance, Meta’s Llama 3 8B has 8 billion adjustable parameters. The more parameters a model has, the more complex the patterns it can learn and the more data it can handle.
However, the bigger the model, the more time it takes to train. I originally planned to train the `llama3 8B` model using the `instruction`, `input`, and `output` columns of my sample dataset. However, I later selected the `phi3-3.8B` model because of its smaller size. And since my sample dataset was already in the Alpaca format, I decided to use it for fine-tuning.
## Fine-tuning the model
Thanks to the Unsloth team, their Google Colab notebook made it easy to fine-tune the model. I used their free Colab notebook to train the `unsloth/Phi-3-mini-4k-instruct` model on my dataset.
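For context, loading the model in the notebook looks roughly like this (a condensed sketch based on Unsloth’s published notebooks; the exact arguments may differ from the current version):

```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit so it fits on a free Colab GPU
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3-mini-4k-instruct",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Attach small LoRA adapters; only these weights are updated during training
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)
```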
In their `load_dataset` instruction, they use a dataset hosted on Huggingface. So, first, I had to convert my Google Sheet to a JSON dataset.
LLMs to the rescue!
I downloaded the dataset from Google Sheets as a CSV file, asked my local LLM for a script to convert it into JSON, and provided a sample record from the `yahma/alpaca-cleaned` dataset as the target format. I used that script to generate the JSON dataset and uploaded it to Huggingface at https://huggingface.co/datasets/gnelson/sup-sample-1.
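The script it gave me was similar in spirit to this sketch (the column and file names here are assumptions; adjust them to match your sheet):

```python
import csv
import json

# Convert the exported Google Sheet (CSV) into an Alpaca-style JSON dataset.
records = []
with open("dataset.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        records.append({
            "instruction": row["instruction"],
            "input": row["input"],
            "output": row["output"],
        })

with open("dataset.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)
```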
Then, I updated the `load_dataset` instruction in the Colab notebook to use my dataset:
```python
from datasets import load_dataset

# Load the training split of my dataset hosted on Huggingface
dataset = load_dataset("gnelson/sup-sample-1", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
```
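The `formatting_prompts_func` passed to `dataset.map` is defined earlier in the notebook. In the Unsloth Alpaca notebooks it looks roughly like this (condensed here for brevity):

```python
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# tokenizer comes from FastLanguageModel.from_pretrained above;
# appending EOS teaches the model where a response should end
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    # Merge the three dataset columns into one training string per example
    texts = [
        alpaca_prompt.format(ins, inp, out) + EOS_TOKEN
        for ins, inp, out in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"text": texts}
```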
I ran through the remaining steps and updated the Inference code to test the fine-tuned model and check the quality of its responses. Everything ran smoothly until it was time to save and download the model.
I hit an error in the `GGUF / llama.cpp Conversion` step, so I checked the repository and found an existing issue: https://github.com/unslothai/unsloth/issues/476. I left a comment, and later in the day the maintainer replied with a fix. I ran the notebook again and successfully saved the model.
## Preparing the fine-tuned model for running locally
I then downloaded the GGUF-format model to test it locally. I have also uploaded the fine-tuned model to Huggingface if you want to try it out. Follow the steps in this section to test the model using Ollama.
### Download the model from Huggingface
Go to https://huggingface.co/gnelson/sup-sample-1-phi3-mini/tree/main and download the file `phi3-sup-sample1-unsloth.Q8_0.gguf`.
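If you prefer to script the download, something like this with the `huggingface_hub` library should also work:

```python
from huggingface_hub import hf_hub_download

# Downloads the GGUF file into the local Huggingface cache and
# returns the path to the downloaded file
path = hf_hub_download(
    repo_id = "gnelson/sup-sample-1-phi3-mini",
    filename = "phi3-sup-sample1-unsloth.Q8_0.gguf",
)
print(path)
```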
### Create a Modelfile
Create a new file called `Modelfile` and include the following content:
```
# Point at the downloaded GGUF weights
FROM ./phi3-sup-sample1-unsloth.Q8_0.gguf

# Phi-3 chat template
TEMPLATE "{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>
"

# Retain the first 4 tokens of the prompt when the context fills up
PARAMETER num_keep 4

# Stop generating when any of Phi-3's special tokens appear
PARAMETER stop <|user|>
PARAMETER stop <|assistant|>
PARAMETER stop <|system|>
PARAMETER stop <|end|>
PARAMETER stop <|endoftext|>
```
### Add the model to Ollama
Run the following command to add the model to Ollama:

```
ollama create sup-sample-1-phi3-mini -f Modelfile
```
### Verify the model
Run the `ollama list` command to verify that the model has been added.
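You can also send the model a quick prompt from the terminal to confirm it responds (note that this runs without the system prompt from my dataset, so the output will differ from the examples below):

```
ollama run sup-sample-1-phi3-mini "The master server distributes tasks to the slave servers."
```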
## Running the fine-tuned model
After the model was ready, I ran it locally using Msty, which provides a straightforward graphical interface for interacting with local models. I set the system prompt to the same one I had used in my dataset, and below are some responses from the fine-tuned model.
| Input | Response |
| --- | --- |
| The current configuration uses a whitelist to filter incoming traffic. To add a new IP address, you need to update the list manually. | The current setup utilizes an allowlist to permit incoming network traffic. To incorporate a new source IP address, modify the list accordingly. |
| The Red Hat AMQ message broker is the main component you need to download before running the backup commands. | The Red Hat AMQ Queue Manager is the primary software package that must be installed prior to executing the restore procedures. |
| The master server distributes tasks to the slave servers. | In this setup, the primary server delegates work assignments to multiple auxiliary servers for load balancing and redundancy purposes. |
| The team is working on Red Hat A-MQ implementation. | This group of professionals focuses on developing an application using the Red Hat AMQ technology framework. |
To compare, I ran the same inputs with the same system prompt on the original model. Here are its responses:
| Input | Response |
| --- | --- |
| The current configuration uses a whitelist to filter incoming traffic. To add a new IP address, you need to update the list manually. | To optimize and streamline the process of adding new IP addresses while maintaining strict security control using an automated dynamic whitelisting system that incorporates artificial intelligence (AI), anomaly detection techniques, real-time threat intelligence feeds integration, multi-factor authentication requirements for added access points, as well as a centralized logging framework to manage logs across all systems efficiently: Evaluate Current Configuration and Identify Challenges: Thoroughly analyze your existing IP whitelist system’s limitations such as manual updates, scalability issues, security gaps due to human error or lack of real-time threat intelligence integration. This evaluation will provide a clear understanding of how an advanced solution can address these challenges … (redacted) |
| The Red Hat AMQ message broker is the main component you need to download before running the backup commands. | While it’s true that a message broker like RabbitMQ (formerly known as Red Hat Amq) can be an essential part of certain automation or integration tasks, its direct relevance to backups depends on your specific use case and infrastructure setup. If you are looking for backup commands related to data management systems such as… (redacted) |
## Observations
Based on these responses and more testing, the fine-tuned model performs well. Here are some of my observations:
- The response lengths show that the fine-tuned model learned to respond with shorter sentences, as I intended.
- Given that 19 of my dataset’s 76 examples covered inclusive language terms, the fine-tuned model’s responses used more inclusive language, as intended.
- The number of examples in the dataset affected how well the model learned. There was only one example of the usage of "AMQ message broker", which it did not learn.
## Conclusion
It was a great learning experience. Fine-tuning a model does enhance the quality of responses. A good dataset with sufficient examples is essential for effectively training the model.
Since this is the most basic example of fine-tuning, I am confident we can achieve better results and use fine-tuning to create an LLM dedicated to augmenting content review.
I hope this post helped you understand the process of fine-tuning a model and its impact on the quality of responses.