[GRPO]: Fix Multi-GPU training for Entropy based masking of tokens. #3964
Conversation
Tested that training works correctly on a machine with 2 A40s using this script (collapsed in the original comment):
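The actual test script is collapsed in the original comment. The sketch below is a hypothetical stand-in, not the author's script: the dataset, model, reward function, and config values are all assumptions, chosen only to show the kind of run that exercises entropy-based masking on 2 GPUs (launched with `accelerate launch --num_processes 2`).

```python
# Hypothetical reproduction sketch (NOT the author's script): a minimal GRPO run
# with top_entropy_quantile < 1, launched on 2 GPUs via
#   accelerate launch --num_processes 2 repro_entropy_mask.py
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 50 characters long.
    return [-abs(50 - len(c)) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train[:128]")

args = GRPOConfig(
    output_dir="grpo-entropy-mask-repro",
    per_device_train_batch_size=4,
    num_generations=4,
    max_completion_length=64,
    top_entropy_quantile=0.2,  # < 1 enables entropy-based token masking
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```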
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Pull Request Overview
This PR fixes a multi-GPU training hang that occurred when using entropy-based token masking with top_entropy_quantile < 1 in GRPO training. The gather operation attempted to collect entropy tensors of different shapes across devices, causing training to hang indefinitely.
- Implements safe gathering of variable-length entropy tensors across multiple GPUs
- Adds a padding strategy to ensure consistent tensor shapes before gathering (see the sketch after this list)
- Maintains single-GPU compatibility with conditional branching
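A rough sketch of the padding-and-masking approach the overview describes; this is not the PR's exact code, and the function and variable names are illustrative:

```python
# Sketch: gather 1-D tensors whose length differs per process by padding them
# to the global max length and carrying a validity mask through the gather.
import torch

def gather_variable_length(accelerator, values):
    """Gather a 1-D tensor whose length may differ on each process."""
    # 1. Find the longest tensor across all processes.
    local_len = torch.tensor([values.numel()], device=values.device)
    max_len = accelerator.gather(local_len).max().item()

    # 2. Pad the values, and a validity mask, up to that common length.
    pad_len = max_len - values.numel()
    padded = torch.cat([values, torch.zeros(pad_len, device=values.device, dtype=values.dtype)])
    mask = torch.cat([
        torch.ones(values.numel(), device=values.device),
        torch.zeros(pad_len, device=values.device),
    ])

    # 3. Every rank now contributes the same shape, so the gather cannot hang.
    all_padded = accelerator.gather(padded)
    all_mask = accelerator.gather(mask)

    # 4. Drop the padding entries after the gather.
    return all_padded[all_mask.bool()]
```

Gathering the mask alongside the values matters here because the gathered entropies feed a quantile computation; zeros left in from padding would skew that threshold.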
trl/trainer/grpo_trainer.py (Outdated)
all_padded_entropies = self.accelerator.gather(padded_entropies)
all_padded_entropies_mask = self.accelerator.gather(padded_entropies_mask)
all_non_padded_entropies = all_padded_entropies[all_padded_entropies_mask.bool()].flatten()
else:
do we need this if/else?
I think it can be removed
Ummm there are a few ops that aren't useful to have in the single-GPU case, right? Like, the first four lines in the if block aren't relevant/needed for a single GPU.
they'll probably be no-ops, no?
These two won't be no-ops:
non_pad_entropies_seq_length = torch.tensor([non_pad_entropies.numel()], device=entropies.device)
max_non_pad_entropies_seq_length = self.accelerator.gather(non_pad_entropies_seq_length).max().item()
torch.zeros(
    max_non_pad_entropies_seq_length - non_pad_entropies.numel(),
    device=non_pad_entropies.device,
),
indeed but that's ok I think
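For reference, a small single-process illustration of why those two ops are cheap: with one process, Accelerate's gather returns its input unchanged, so the extra work reduces to a length lookup and a zero-length pad. This is a sketch under that assumption, not code from the PR:

```python
# Single-process case: accelerator.gather returns its input unchanged when only
# one process is running, so the padding degenerates to a zero-length tensor and
# the values pass through untouched.
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # assume a plain single-process run
values = torch.randn(7)

local_len = torch.tensor([values.numel()])
max_len = accelerator.gather(local_len).max().item()  # == 7: only one process

pad = torch.zeros(max_len - values.numel())           # length-0 tensor
padded = torch.cat([values, pad])                     # identical to values
assert torch.equal(padded, values)
```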
…m/pramodith/trl into pramodith/entropy_mask_gather_bug
Just merge main into your branch to fix the CI.
trl/trainer/grpo_trainer.py (Outdated)
)
all_padded_entropies = self.accelerator.gather(padded_entropies)
all_padded_entropies_mask = self.accelerator.gather(padded_entropies_mask)
all_non_padded_entropies = all_padded_entropies[all_padded_entropies_mask.bool()].flatten()
Suggested change:
- all_non_padded_entropies = all_padded_entropies[all_padded_entropies_mask.bool()].flatten()
+ all_non_padded_entropies = all_padded_entropies[all_padded_entropies_mask.bool()]
already flat
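For context, boolean-mask indexing in PyTorch always returns a 1-D tensor of the selected elements, so the trailing .flatten() is redundant; a quick check:

```python
# Boolean-mask indexing already returns a flat 1-D tensor.
import torch

x = torch.arange(6.0).reshape(2, 3)
mask = torch.tensor([[True, False, True], [False, True, False]])
print(x[mask].shape)                         # torch.Size([3])
assert torch.equal(x[mask], x[mask].flatten())
```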
can you also remove
…m/pramodith/trl into pramodith/entropy_mask_gather_bug
LGTM! Thanks
…uggingface#3964) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
What does this PR do?
#3933 identified that training a model with top_entropy_quantile < 1 causes training to hang indefinitely in a multi-GPU setting, because the gather operation used to collect the non-padded entropy tensor can receive tensors of different shapes on different devices. This PR fixes that issue by padding each process's non-padded entropy tensor, together with a boolean mask, to a common length before gathering, then using the gathered mask to recover only the valid entropy values.
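As an aside, and not the approach taken in this PR, Accelerate also ships a pad_across_processes utility for the same shape-mismatch problem; the sketch below (a hypothetical repro.py, meant to be launched with accelerate launch --num_processes 2) shows the general pattern. The PR instead pads manually and gathers a mask alongside the values, which lets it drop the padding afterwards so padded zeros don't skew the entropy quantile.

```python
# Sketch (hypothetical repro.py, run with: accelerate launch --num_processes 2 repro.py):
# each rank keeps a different number of high-entropy values, so a plain gather of
# the raw tensors can block -- the underlying all_gather expects identical shapes
# on every rank. Padding to a common length first makes the gather safe.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
rank = accelerator.process_index

# Variable-length tensor per rank: length 3 on rank 0, length 5 on rank 1, ...
values = torch.randn(3 + 2 * rank, device=accelerator.device)

# accelerator.gather(values) here could hang, since shapes differ across ranks.
padded = accelerator.pad_across_processes(values, dim=0, pad_index=0.0)
gathered = accelerator.gather(padded)

if accelerator.is_main_process:
    print(gathered.shape)  # (num_processes * max_len,)
```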
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.