
Releases: huggingface/trl

v0.17.0

25 Apr 02:20
cd6b3de

Major and breaking

The TRL v0.17 release introduces three major changes that, together, enable significantly faster generation performance in GRPO—up to 10x faster in some configurations.


These three changes are:

  • Data parallelism (DP) for the vLLM server
  • A new GRPO training strategy that generates once per effective batch
  • Support for the V1 engine in vLLM

Below, we provide a summary of these changes and how to use them.

⚡ Up to 4x faster: Data Parallel for vLLM server

The TRL vLLM server now supports data parallelism (DP), enabling significantly faster generation speeds—especially for smaller models. This new feature can be used by adding the --data_parallel_size N argument when launching the vLLM server.

trl vllm-serve --model Qwen/Qwen2.5-14B-Instruct --tensor_parallel_size 2 --data_parallel_size 2
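
On the training side, nothing else changes beyond pointing the trainer at the running server. A minimal sketch, assuming GRPOConfig's vllm_server_host / vllm_server_port arguments; the host and port values below are illustrative and must match wherever trl vllm-serve is running:

from trl import GRPOConfig

# Sketch: connect GRPO training to the vLLM server launched above.
training_args = GRPOConfig(
    ...,
    use_vllm=True,
    vllm_server_host="127.0.0.1",  # address of the machine running `trl vllm-serve`
    vllm_server_port=8000,         # port the server listens on
)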

by @qgallouedec in #3310

☝️ [GRPO] Generate once per effective batch

Previously, GRPO made one generation request per global batch. The global batch is the union of all local batches across processes, without accounting for gradient accumulation. In other words, if gradient_accumulation_steps was set to 8, GRPO would make 8 generation requests per optimization step.

Now, GRPO groups these global batches into a single "effective batch" and makes only one generation request per effective batch. Since vLLM applies optimizations that are especially effective for large batches, this new approach leads to significantly faster training overall.

No changes are required in the training script, as this is handled internally by the GRPO trainer.
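
As a rough illustration of the batching arithmetic (the numbers below are hypothetical, and the snippet only mirrors the idea rather than TRL's internals):

# Hypothetical configuration, for illustration only.
per_device_train_batch_size = 4
num_processes = 8                 # number of training GPUs
gradient_accumulation_steps = 8

global_batch = per_device_train_batch_size * num_processes        # 32 prompts
effective_batch = global_batch * gradient_accumulation_steps      # 256 prompts

# Before: one generation request per global batch    -> 8 requests per optimization step.
# Now:    one generation request per effective batch -> 1 large request, which vLLM batches efficiently.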


by @qgallouedec in #3283

⏱️ Fix vLLM server to support V1 Engine

vLLM provides two versions of its engine (V0 and V1), and V1 is significantly faster. This version is now supported by TRL and requires vLLM version 0.8.3 or higher.
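
A quick way to confirm your environment meets this requirement; a minimal sketch, assuming the packaging library is installed (it ships as a dependency of most ML stacks):

import vllm
from packaging import version

# The V1 engine path requires vLLM >= 0.8.3.
assert version.parse(vllm.__version__) >= version.parse("0.8.3"), (
    f"vLLM {vllm.__version__} is too old; upgrade to 0.8.3 or higher"
)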

by @I-l-l-I in #3276

👎 [GRPO] Adds option to disable dropout

Disabling dropout has been shown to stabilize training. You can now disable dropout in GRPO by setting the disable_dropout argument to True in the GRPO config.

from trl import GRPOConfig

training_args = GRPOConfig(..., disable_dropout=True)

by @edbeeching in #3234

🩺 Dr. GRPO loss

GRPO now supports the various losses proposed in the recent literature, including the Dr. GRPO loss. The loss type can be set in the GRPO config:

from trl import GRPOConfig

training_args = GRPOConfig(..., loss_type="dr_grpo")

by @qgallouedec in #3256

🎲 [GRPO] Make training dataset shuffle optional

The GRPO trainer now has an option to disable shuffling of the training dataset. This is useful for curriculum learning, where the order of the training data is important.

from trl import GRPOConfig

training_args = GRPOConfig(..., shuffle_dataset=False)

by @LeonEricsson in #3334

☕ Overlong-filtering for GRPO

Overlong filtering has been shown to significantly stabilize learning and improve performance. You can now use it in TRL!

It simply consists of masking out the loss of truncated completions:

from trl import GRPOConfig

training_args = GRPOConfig(..., mask_truncated_completions=True)


by @shirinyamani in #3248

🐯 Integrate Liger GRPO Loss to GRPO Trainer

Liger significantly reduces the peak memory of the loss computation. You can now use it in TRL with the use_liger_loss argument in the GRPO config:

from trl import GRPOConfig

training_args = GRPOConfig(..., use_liger_loss=True)

by @shivam15s in #3184

Bug fixes

What's Changed


v0.16.1

04 Apr 18:23

What's Changed

Full Changelog: v0.16.0...v0.16.1

v0.16.0

22 Mar 21:18
23a635e

Major and breaking

🚀 Scaling GRPO to 70B+ Models and Multi-Node Training with vLLM Server & NCCL Communication

Previously, vLLM could only be used by dedicating a single GPU to generation, which prevented both the scalability benefits of vLLM and multi-node training. This limitation has now been removed!

GRPO can now scale efficiently with models exceeding 70B parameters, supporting multi-node training with super-fast performance.


To take advantage of this, simply launch a vLLM server using the following command:

trl vllm-serve --model <model_name> --tensor_parallel_size <tp_size>

Then, start GRPO training with use_vllm=True.
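
For reference, the training side only needs the flag enabled; a minimal sketch:

from trl import GRPOConfig

# Generation is delegated to the external vLLM server started above.
training_args = GRPOConfig(..., use_vllm=True)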

[Chart: comparison of GRPO throughput with and without vLLM, across different TP values and model sizes.]

by @binary-husky and @qgallouedec in #3094

🐦‍🔥 6x faster GRPO with multi-step optimization

This release introduces the multi-step trick, which allows for the reuse of generated data across multiple steps, speeding up the training process.

To support this, we've implemented importance sampling and clipping logic. This enhancement should lead to significant improvements in training speed.


To use it, simply set num_iterations to a value greater than 1.

training_args = GRPOConfig(..., num_iterations=4)

by @qgallouedec in #2899

🌍 Use global normalization in GRPO

As demonstrated in Dr. GRPO, sequence-level normalization can introduce a response-level length bias.


To address this, we have now switched to normalizing the loss by the total number of tokens in the batch, ensuring more consistent and unbiased training.

- loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
+ loss = (per_token_loss * completion_mask).sum() / completion_mask.sum()

by @edbeeching in #2881

⚖️ Add option not to scale rewards

As demonstrated in Dr. GRPO, scaling rewards can introduce a question-level difficulty bias. To address this, we have now added an option to disable reward scaling in GRPO.

training_args = GRPOConfig(..., scale_rewards=False)

  advantages = rewards - mean_grouped_rewards
- advantages = advantages / std_grouped_rewards
+ if self.args.scale_rewards:
+     advantages = advantages / std_grouped_rewards

It's likely that we'll make this (scale_rewards=False) the default behavior in the future.

by @qgallouedec in #3135

🤸‍♀️ Domain-specific rewards in GRPO

When optimizing across multiple domains, not all reward functions are relevant for every sample. For example, a math verifier's reward does not apply to grammar samples, and a grammar verifier's reward does not apply to math samples.

It is now possible to return None for rewards that do not make sense for a given sample. For instance, when the domain is specified in a column like domain, you can implement it as follows:


def math_reward(completions, domain, **kwargs):
    rewards = []
    for completion, dom in zip(completions, domain):
        if dom == "math":
            rewards.append(verify(completion))
        else:
            rewards.append(None)
    return rewards

This allows for more domain-specific reward handling, ensuring that irrelevant rewards are ignored and don’t interfere with optimization.
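
As a usage sketch, both domain-specific reward functions can then be passed to the trainer together. Here grammar_reward is a hypothetical counterpart of math_reward, and train_dataset is assumed to be already loaded and to contain a domain column:

from trl import GRPOConfig, GRPOTrainer

# Sketch: rewards that return None for out-of-domain samples are simply ignored.
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=[math_reward, grammar_reward],  # grammar_reward is hypothetical
    args=GRPOConfig(output_dir="grpo-multi-domain"),
    train_dataset=train_dataset,                 # assumed to include a "domain" column
)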

by @shirinyamani in #3079

🍃 Do not load reference model when beta == 0.0

It has been observed that not minimizing the KL divergence between the trained model and the reference model can still yield good results, while significantly reducing memory usage and compute. This is because there is no need to store the reference model in memory or perform a forward pass for it.

When beta is set to 0.0, the reference model is not loaded, and the KL divergence is not computed, leading to savings in both time and memory.

training_args = GRPOConfig(..., beta=0.0)

by @ingambe in #2806

🕊️ Padding-free for SFT

Padding-free batching is an alternative approach to packing for reducing memory usage. In this method, a batch is first sampled and then flattened into a single sequence, avoiding padding. Unlike packing, which can result in incomplete sequences by combining parts of different samples, padding-free batching ensures that all sequences remain complete and intact.

To enable padding-free batching in SFT, simply set padding_free=True in the SFTConfig, and make sure to use flash_attention_2 as the attention implementation.

training_args = SFTConfig(..., padding_free=True, model_init_kwargs={"attn_implementation": "flash_attention_2"})

by @qgallouedec in #3076

🎬 Clip Higher for Better Exploration

As outlined in the DAPO paper, increasing the upper bound epsilon leads to higher entropy during generation, promoting better exploration. To enable this, we’ve added support for adjusting the upper bound epsilon directly in the default GRPO trainer.

training_args = GRPOConfig(..., epsilon_high=0.28)

by @shirinyamani in #3118

Bug fixes

Minor

What's Changed

  • [SFT] fix check for AutoLigerKernelForCausalLM by @kashif in #2874
  • 🆙 Bump vLLM min version to 0.7.2 by @edbeeching in #2860
  • [GRPO] Fix loss normalization by @edbeeching in #2881
  • 💬 Add maybe_convert_to_chatml map for conversational datasets in SFT by @kashif in #2862
  • 🧶 [GRPO][vLLM + LoRA] Move unmerge of PEFT model after weight loading by @XZ-X in #2873
  • 🍟 [SFT] Handles the dataset if it has been preprocessed by @BenasdTW in #2863
  • Optimize vllm num_generations ...

v0.15.2

25 Feb 22:40

What's Changed

Full Changelog: v0.15.1...v0.15.2

v0.15.1

18 Feb 14:57

What's Changed

  • 💬 Add maybe_convert_to_chatml map for conversational datasets in SFT by @kashif in #2862
  • [SFT] fix check for AutoLigerKernelForCausalLM by @kashif in #2874
  • 🍟 [SFT] Handles the dataset if it has been preprocessed by @BenasdTW in #2863
  • 🧶 [GRPO][vLLM + LoRA] Move unmerge of PEFT model after weight loading by @XZ-X in #2873
  • 🪂 Don't gather logits in SFT to avoid hanging by @qgallouedec in #2890
  • Release: v0.15.1 by @qgallouedec

Full Changelog: v0.15.0...v0.15.1

v0.15.0

13 Feb 14:42

Major and breaking changes

Coming soon

What's Changed

New Contributors

Full Changelog: v0.9.6...v0.15.0

v0.14.0

29 Jan 16:47

Major and breaking changes

👨‍👨‍👧‍👧 GRPO

by @qgallouedec in #2565

What's Changed

New Contributors

Full Changelog: v0.13.0...v0.14.0

v0.13.0

16 Dec 18:31

Major and breaking changes

🐾 Process-supervised RM Trainer

We introduced a new trainer to train Process-supervised Reward Models (PRMs) in TRL. A PRM rewards the quality of intermediate steps, promoting structured reasoning over focusing solely on the final outcome. With this trainer, we introduce a new dataset type: stepwise supervision, a variant of the prompt-completion type in which the completion is divided into several intermediate steps, each associated with a label. Find out more in the stepwise-supervision section of the TRL documentation.

Here is an example of how to use the PRMTrainer to train a PRM on the Math Shepherd dataset:

# train_prm.py
from datasets import load_dataset
from trl import PRMConfig, PRMTrainer
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("Qwen/Qwen2-0.5B", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
train_dataset = load_dataset("trl-lib/math_shepherd", split="train[:10%]")

training_args = PRMConfig(output_dir="Qwen2-0.5B-Reward-Math-Sheperd", logging_steps=10)
trainer = PRMTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()

For more information, check out the PRMTrainer documentation.

by @qgallouedec and @gaetanlop in #2127 and #2148

🔀 Add MergeModelCallBack

Various works show that model merging can non-trivially improve performance, especially if the models belong to the same architecture. TRL now features a callback that merges the reference model with the current policy and optionally pushes the merged checkpoint to the Hub. The merge can be performed at the end of each step or epoch, and/or at the end of training. This callback uses Arcee's mergekit library: https://github.com/arcee-ai/mergekit

from trl import DPOTrainer, MergeModelCallback
from trl.mergekit_utils import MergeConfig

config = MergeConfig()
merge_callback = MergeModelCallback(config)
trainer = DPOTrainer(..., callbacks=[merge_callback])

by @August-murr in #2282

🔨 Support for tools for data utils

TRL preprocessing utils now support tools, a first step toward agent fine-tuning.

from trl import apply_chat_template

def get_current_temperature(location: str):
    """
    Gets the temperature at a given location.

    Args:
        location: The location to get the temperature for
    """
    return 22.0

example = apply_chat_template(example, tokenizer, tools=[get_current_temperature])

by @August-murr in #2455

🌋 Add support for LLaVA-Next in DPOTrainer

VLMs have their own specificities which require special treatment in the trainer. DPOTrainer now supports LLaVA-Next models natively.

model = AutoModelForVision2Seq.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
trainer = DPOTrainer(model=model, ...)

by @chenweize1998 in #2413

🕹️ CLI and TRLParser refactor

The TRL CLI has been refactored to be more user-friendly and easier to extend. We plan to extend support to all trainers soon.

(simplified output, for readability)

$ trl dpo --help
usage: trl dpo [-h] --dataset_name DATASET_NAME [--dataset_config DATASET_CONFIG] --output_dir OUTPUT_DIR [--loss_type {sigmoid,hinge,ipo}]

options:
  -h, --help            show this help message and exit
  --dataset_name DATASET_NAME, --dataset-name DATASET_NAME
  --dataset_config DATASET_CONFIG, --dataset-config DATASET_CONFIG
  --output_dir OUTPUT_DIR, --output-dir OUTPUT_DIR
                        The output directory where the model predictions and checkpoints will be written. (default: None)
  --loss_type {sigmoid,hinge,ipo}, --loss-type {sigmoid,hinge,ipo}

by @qgallouedec in #2380 and #2412

🤝 Mixture of judges

TRL features a new judge AllTrueJudge that unifies the decision of multiple binary judges. This judge implements the Mixture of Judges as described in the CGPO paper.

import random

from trl import AllTrueJudge, BaseBinaryJudge

class RandomBinaryJudge(BaseBinaryJudge):
    """
    Random binary judge, for testing purposes.
    """

    def judge(self, prompts, completions, gold_completions=None, shuffle_order=True):
        return [random.choice([0, 1, -1]) for _ in range(len(prompts))]


prompts = ["The capital of France is", "The biggest planet in the solar system is"]
completions = [["Paris", "Marseille"], ["Saturn", "Jupiter"]]
judge = AllTrueJudge(judges=[RandomBinaryJudge(), RandomBinaryJudge()])
judgements = judge.judge(prompts=prompts, completions=completions)
print(judgements)  # [0, 1]

by @gaetanlop in #2159

❄️ DPO trainer supports num_logits_to_keep to save memory

Save memory in the DPO trainer by computing only the logits that are actually needed (via the model's num_logits_to_keep argument) instead of the full sequence of logits.

training_args = DPOConfig(..., use_num_logits_to_keep=True)

by @xyangk in #2129

🗺️ Implementation DiscoPOP Loss

The DiscoPOP paper uses LLMs to discover more efficient offline preference optimization losses. In the paper, the proposed DiscoPOP loss (a log-ratio modulated loss) outperformed other optimization losses on different tasks (IMDb positive text generation, Reddit TL;DR summarization, and Alpaca Eval 2.0).

training_args = DPOConfig(..., loss_type="discopop", discopop_tau=0.05)

by @fanconic in #2323

🧑‍🍳 Add precompute batch size argument in DPOTrainer for reference model

We can now control the batch size for precomputing reference model logits.

training_args = DPOConfig(
    ...,
    precompute_ref_log_probs=True,
    precompute_ref_batch_size=4,
)

by @SwayamInSync in #2426

📦 Support for packing tokenized datasets for SFT

SFTTrainer has supported packing datasets for faster training. Now, it supports packing tokenized datasets as well.
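
A minimal sketch of what this enables, assuming train_dataset has already been tokenized (i.e. it contains an input_ids column); the model name and output directory are illustrative:

from trl import SFTConfig, SFTTrainer

# Sketch: packing now also works when the dataset is already tokenized.
training_args = SFTConfig(output_dir="sft-packed", packing=True)
trainer = SFTTrainer(
    model="Qwen/Qwen2-0.5B",
    args=training_args,
    train_dataset=train_dataset,  # pre-tokenized: contains "input_ids"
)
trainer.train()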

by @kmehant in #2011

📉 Add PEFT support for PPOTrainer

PPOTrainer now supports PEFT for efficient training.

PPOTrainer(
    ...,
    peft_config=peft_config,
)
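
For completeness, peft_config is a standard PEFT configuration object; a minimal sketch with illustrative LoRA hyperparameters:

from peft import LoraConfig

# Illustrative LoRA settings; tune for your model and compute budget.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)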

by @ccs96307 in #2344

💾 Deprecate config in favor of args in PPOTrainer

config has been deprecated in favor of args in PPOTrainer.

  PPOTrainer(
-   config=training_args,
+   args=training_args,
  )

by @qgallouedec in #2384

👮 Deprecate policy in favor of model in PPOTrainer

policy has been deprecated in favor of model in PPOTrainer.

  PPOTrainer(
-   policy=model,
+   model=model,
  )

by @qgallouedec in #2386

What's Changed


v0.12.2

06 Dec 13:01
4c71daf

What's Changed

Full Changelog: v0.12.1...v0.12.2

v0.12.1

15 Nov 12:30

What's Changed

  • 👈 Add tokenizer arg back and add deprecation guidelines by @qgallouedec in #2348

Full Changelog: v0.12.0...v0.12.1