- What Is an LLM?
- Why Fine-Tune an LLM?
- Understanding Base, Instruct, and Chat Models
- What Is Reinforcement Learning and DPO?
- When to Use DPO
- Creating a DPO Dataset
- Fine-Tuning Qwen2.5 3B Instruct with Axolotl
- Setting Up the Cloud Environment
- Steps to Start Training
- Uploading the Model to HuggingFace
- Evaluating Your Fine-Tuned Model
- Conclusion
LLMs have unlocked countless new opportunities for AI applications. If you’ve ever wanted to fine-tune your own model, this guide will show you how to do it easily and without writing any code. Using tools like Axolotl and DPO, we’ll walk through the process step by step.
What Is an LLM?
A Large Language Model (LLM) is a powerful AI model trained on vast amounts of text data, on the order of tens of trillions of characters, to predict the next word in a sequence. Training at this scale has only become possible in the last 2-3 years thanks to advances in GPU compute, which allow such huge models to be trained in a matter of weeks.
You’ve likely interacted with LLMs through products like ChatGPT or Claude before and have experienced firsthand their ability to understand and generate human-like responses.
Why Fine-Tune an LLM?
Can’t we just use GPT-4o for everything? Well, while it is one of the most powerful models available at the time of writing, it’s not always the most practical choice. Fine-tuning a smaller model, ranging from 3 to 14 billion parameters, can yield comparable results at a small fraction of the cost. Moreover, fine-tuning allows you to own your intellectual property and reduces your reliance on third parties.
Understanding Base, Instruct, and Chat Models
Before diving into fine-tuning, it’s essential to understand the different types of LLMs that exist:
- Base Models: These are pretrained on large amounts of unstructured text, such as books or internet data. While they have an intrinsic understanding of language, they are not optimized for following instructions or holding conversations, and their raw outputs are often incoherent. Base models serve as a starting point for developing more specialized models.
- Instruct Models: Built on top of base models, instruct models are fine-tuned using structured data like prompt-response pairs. They are designed to follow specific instructions or answer questions.
- Chat Models: Also built on base models, but unlike instruct models, chat models are trained on conversational data, enabling them to engage in back-and-forth dialogue.
What Is Reinforcement Learning and DPO?
Reinforcement Learning (RL) is a technique where a model learns by receiving feedback on its outputs. It is applied to instruct or chat models in order to further refine the quality of their responses. RL is typically not done on top of base models, as it uses a much lower learning rate than earlier training stages and would therefore not move the needle enough.
Direct Preference Optimization (DPO) is a form of RL where the model is trained using pairs of good and bad answers to the same prompt or conversation. By presenting these pairs, the model learns to favor the good examples and avoid the bad ones.
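To make the objective concrete, below is a minimal PyTorch sketch of the DPO loss (illustrative only, not Axolotl’s internal implementation). It assumes the log-probabilities of each chosen and rejected answer have already been summed over their tokens, under both the model being trained and a frozen reference copy:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much more the trained policy prefers each answer than the frozen reference does
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # The loss shrinks when the chosen answer wins by a larger margin than the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

Intuitively, the loss decreases as the model assigns relatively more probability to the chosen answers than the reference model does, and relatively less to the rejected ones.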
When to Use DPO
DPO is particularly useful when you want to adjust the style or behavior of your model, for example:
- Style Adjustments: Modify the length of responses, the level of detail, or the degree of confidence expressed by the model.
- Safety Measures: Train the model to decline answering potentially unsafe or inappropriate prompts.
However, DPO is not suitable for teaching the model new knowledge or facts. For that purpose, Supervised Fine-Tuning (SFT) or Retrieval-Augmented Generation (RAG) techniques are more appropriate.
Creating a DPO Dataset
In a production environment, you would typically generate a DPO dataset using feedback from your users, for example by:
- User Feedback: Implementing a thumbs-up/thumbs-down mechanism on responses.
- Comparative Choices: Presenting users with two different outputs and asking them to choose the better one.
If you lack user data, you can also create a synthetic dataset by leveraging larger, more capable LLMs. For example, you can generate bad answers using a smaller model and then use GPT-4o to correct them.
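Here is a minimal sketch of that synthetic approach using the openai Python SDK. The prompt, the “bad” answer, and the field names are placeholders; in practice the rejected answer would come from your smaller model, and you would accumulate many such pairs:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

prompt = "What is the capital of Australia?"
# A weak answer, e.g. produced by a small local model (hardcoded here for illustration)
rejected = "The capital of Australia is Sydney."

# Ask a stronger model for a corrected answer to use as the "chosen" example
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
chosen = response.choices[0].message.content

pair = {"prompt": prompt, "chosen": chosen, "rejected": rejected}
print(pair)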
For simplicity, we’ll use a ready-made dataset from HuggingFace: olivermolenschot/alpaca_messages_dpo_test. If you inspect the dataset, you’ll notice it contains prompts with chosen and rejected answers—these are the good and bad examples. This data was created synthetically using GPT-3.5-turbo and GPT-4.
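You can peek at the data with the datasets library before training. The split name here is assumed to be train, and the expected columns follow the fields referenced in the Axolotl config later in this article:

from datasets import load_dataset

ds = load_dataset("olivermolenschot/alpaca_messages_dpo_test", split="train")
print(ds.column_names)  # expected: conversation, chosen, rejected
print(ds[0])            # one prompt with its chosen (good) and rejected (bad) answer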
You’ll generally need between 500 and 1,000 pairs of data at a minimum to have effective training without overfitting. The biggest DPO datasets contain up to 15,000–20,000 pairs.
Fine-Tuning Qwen2.5 3B Instruct with Axolotl
We’ll be using Axolotl to fine-tune the Qwen2.5 3B Instruct model which currently ranks at the top of the OpenLLM Leaderboard for its size class. With Axolotl, you can fine-tune a model without writing a single line of code—just a YAML configuration file. Below is the config.yml we’ll use:
base_model: Qwen/Qwen2.5-3B-Instruct
strict: false
# Axolotl will automatically map the dataset from HuggingFace to the prompt template of Qwen 2.5
chat_template: qwen_25
rl: dpo
datasets:
  - path: olivermolenschot/alpaca_messages_dpo_test
    type: chat_template.default
    field_messages: conversation
    field_chosen: chosen
    field_rejected: rejected
    message_field_role: role
    message_field_content: content
# We pick a directory inside /workspace since that's typically where cloud hosts mount the volume
output_dir: /workspace/dpo-output
# Qwen 2.5 supports up to 32,768 tokens with a max generation of 8,192 tokens
sequence_len: 8192
# Sample packing does not currently work with DPO. Pad to sequence length is added to avoid a Torch bug
sample_packing: false
pad_to_sequence_len: true
# Add your WandB account details if you want to get nice reporting on your training performance
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
# Can make training more efficient by batching multiple rows together
gradient_accumulation_steps: 1
micro_batch_size: 1
# Do one pass on the dataset. Can set to a higher number like 2 or 3 to do multiple
num_epochs: 1
# Optimizer choice doesn't make much of a difference when training LLMs. AdamW is the standard
optimizer: adamw_torch
# DPO requires a smaller learning rate than regular SFT
lr_scheduler: constant
learning_rate: 0.00005
# Train in bf16 precision since the base model is also bf16
bf16: auto
# Reduces memory requirements
gradient_checkpointing: true
# Makes training faster (only supported on Ampere, Ada, or Hopper GPUs)
flash_attention: true
# Can save multiple times per epoch to get multiple checkpoint candidates to compare
saves_per_epoch: 1
logging_steps: 1
warmup_steps: 0
Setting Up the Cloud Environment
To run the training, we’ll use a cloud hosting service like Runpod or Vultr. Here’s what you’ll need:
- Docker Image: Pull the winglian/axolotl-cloud:main Docker image provided by the Axolotl team.
- Hardware Requirements*: An 80GB VRAM GPU (such as a 1×A100 PCIe node) is more than enough for a model of this size.
- Storage: 200GB of volume storage will accommodate all the files we need.
- CUDA Version: Your CUDA version should be at least 12.1.
*This type of training is considered a full fine-tune of the LLM and is therefore very VRAM intensive. If you’d like to run training locally without relying on cloud hosts, you could attempt QLoRA, a parameter-efficient fine-tuning technique that drastically reduces memory requirements and is most commonly used for Supervised Fine-Tuning. Although it is theoretically possible to combine DPO and QLoRA, this is very seldom done.
Steps to Start Training
- Set HuggingFace Cache Directory:
export HF_HOME=/workspace/hf
This ensures that the original model is downloaded to our volume storage, which is persistent.
- Create Configuration File: Save the config.yml file we created earlier to /workspace/config.yml.
- Start Training:
python -m axolotl.cli.train /workspace/config.yml
And voila! Your training should start. Once Axolotl downloads the model and the training data, you should see output similar to this:
[2024-12-02 11:22:34,798] [DEBUG] [axolotl.train.train:98] [PID:3813] [RANK:0] loading model
[2024-12-02 11:23:17,925] [INFO] [axolotl.train.train:178] [PID:3813] [RANK:0] Starting trainer...
The training should take just a few minutes to complete since this is a small dataset of only 264 rows. The fine-tuned model will be saved to /workspace/dpo-output.
Uploading the Model to HuggingFace
You can upload your model to HuggingFace using the CLI:
- Install the HuggingFace Hub CLI:
pip install huggingface_hub[cli]
- Upload the Model:
huggingface-cli upload yourname/yourrepo /workspace/dpo-output
Replace yourname/yourrepo with your actual HuggingFace username and repository name.
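If you prefer Python over the CLI, the huggingface_hub library can do the same thing. This sketch assumes you have already authenticated (for example with huggingface-cli login) and again uses the placeholder yourname/yourrepo:

from huggingface_hub import HfApi

api = HfApi()
# Create the repository if it doesn't exist yet, then push the checkpoint folder
api.create_repo("yourname/yourrepo", exist_ok=True)
api.upload_folder(folder_path="/workspace/dpo-output", repo_id="yourname/yourrepo")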
Evaluating Your Fine-Tuned Model
For evaluation, it’s recommended to host both the original and fine-tuned models using a tool like Text Generation Inference (TGI). Then, perform inference on both models with a temperature setting of 0 (to ensure deterministic outputs) and manually compare the responses of the two models.
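As a rough illustration, here is how such a side-by-side comparison could look with the huggingface_hub InferenceClient, assuming the original model is served by TGI on port 8080 and the fine-tuned one on port 8081 (the URLs, ports, and prompt are placeholders; for a faithful comparison you would also apply the Qwen chat template to your prompts):

from huggingface_hub import InferenceClient

original = InferenceClient("http://localhost:8080")
finetuned = InferenceClient("http://localhost:8081")

prompt = "Explain in two sentences when DPO is preferable to SFT."
for name, client in [("original", original), ("fine-tuned", finetuned)]:
    # do_sample=False gives greedy, deterministic decoding (the temperature-0 behavior)
    answer = client.text_generation(prompt, max_new_tokens=200, do_sample=False)
    print(f"--- {name} ---\n{answer}\n")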
This hands-on approach provides better insights than solely relying on training evaluation loss metrics, which may not capture the nuances of language generation in LLMs.
Conclusion
Fine-tuning an LLM using DPO allows you to customize models to better suit your application’s needs, all while keeping costs manageable. By following the steps outlined in this article, you can harness the power of open-source tools and datasets to create a model that aligns with your specific requirements. Whether you’re looking to adjust the style of responses or implement safety measures, DPO provides a practical approach to refining your LLM.
Happy fine-tuning!
Komninos Chatzipapas is an expert in AI and emerging technologies with a diverse background in software development and business leadership. You can follow him on his personal blog here.