
@pramodith (Collaborator) commented Aug 20, 2025

What does this PR do?

Addresses #3927, where it's possible for all the assistant_mask tokens to be 0 when the inputs are truncated because max_length is set.

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@pramodith changed the title from "Update assistant mask exception." to "[SFTTrainer]: Check for assistant mask up to max_length" on Aug 20, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec (Member) commented Aug 20, 2025

In my understanding, if there is one sample in the dataset for which the prompt is too long, so that the prompt+completion part gets truncated, this would make the training fail?

EDIT for clarification:
For example, if there is one sample with prompt_len=70, completion_len=32, and max_length=64, this would make the training fail instead of ignoring this example?
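
Concretely, something like this toy sketch (the mask layout is simplified and only illustrative, not TRL internals):

```python
# One sample: the completion starts only after the prompt
prompt_len, completion_len, max_length = 70, 32, 64

# 0 = prompt token (masked), 1 = completion token (contributes to the loss)
completion_mask = [0] * prompt_len + [1] * completion_len

# Truncating to max_length keeps only prompt tokens
truncated_mask = completion_mask[:max_length]
print(sum(truncated_mask))  # 0 -> no trainable tokens left for this example
```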

@pramodith (Collaborator, Author)

> In my understanding, if there is one sample in the dataset for which the prompt is too long, so that the prompt+completion part gets truncated, this would make the training fail?

True! Should we check for some threshold ratio of rows and, if that threshold is reached, just log a warning instead of raising an exception, to handle the truncation case?

@pramodith (Collaborator, Author) commented Aug 20, 2025

> EDIT for clarification:
> For example, if there is one sample with prompt_len=70, completion_len=32, and max_length=64, this would make the training fail instead of ignoring this example?

I actually like the idea of ignoring a row! I think we should drop/ignore the rows in the dataset that have no assistant tokens post-truncation and log the % of ignored rows to the user. I think I can accomplish this with some dataset ops.
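
A rough sketch of what I mean with 🤗 Datasets (the column name, the max_length handling, and the logger are assumptions here, not the final implementation):

```python
import logging

logger = logging.getLogger(__name__)

def has_trainable_tokens(example, max_length):
    # Keep the row only if at least one assistant token survives truncation
    return sum(example["assistant_masks"][:max_length]) > 0

# `dataset` and `max_length` come from the trainer's dataset-preparation step (assumed here)
kept = dataset.filter(has_trainable_tokens, fn_kwargs={"max_length": max_length})
dropped_pct = 100 * (len(dataset) - len(kept)) / len(dataset)
logger.warning(f"{dropped_pct:.1f}% of rows have no assistant tokens after truncation and were dropped.")
```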

@qgallouedec (Member)

Yep, agreed. Maybe also log the number of active (not masked) tokens.

@pramodith requested a review from Copilot on August 21, 2025 10:57
@Copilot (Copilot AI) left a comment


Pull Request Overview

This PR addresses an issue where all assistant mask tokens could become 0 after truncation when max_length is set, making training ineffective. The fix adds validation and filtering to ensure trainable assistant tokens remain after truncation.

  • Adds validation to check for remaining assistant tokens after truncation
  • Filters out examples with no assistant tokens and provides detailed logging
  • Raises an error if no trainable tokens remain after truncation


@qgallouedec (Member)

Maybe we can even check the labels directly; it would allow accounting for all unmasked tokens (just check label != -100).
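
For example, on a collated batch (a sketch, assuming labels has already been built by the collator with -100 for masked positions):

```python
# Count the tokens that actually contribute to the loss in a collated batch
num_trainable = (batch["labels"] != -100).sum().item()
```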

@pramodith (Collaborator, Author)

> Maybe we can even check the labels directly; it would allow accounting for all unmasked tokens (just check label != -100).

Hmmm, my understanding is that the labels column is populated in the data collator and not at the dataset preparation stage. This'd mean that we'd need to keep a running sum of the number of tokens, since the data collator is called per iteration of the DataLoader. Is this something we want to do?

I also think that just counting the assistant tokens would give the same number as the total number of trainable tokens; everything else would be either a pad token or a non-assistant token, both of which need to be masked in assistant-only training mode.

@qgallouedec (Member)

> Hmmm, my understanding is that the labels column is populated in the data collator

Ah yes you're right!

> I also think that just counting the assistant tokens would give the same number as the total number of trainable tokens

Not exactly: when you train on completion-only data, you have a completion_mask that can also be full of 0s after truncation.

@pramodith (Collaborator, Author)

> Not exactly: when you train on completion-only data, you have a completion_mask that can also be full of 0s after truncation.

Ahh yes, I forgot about that. I made some changes to cover all three dataset types: conversational, prompt-completion/instruction-tuning, and language modeling.

Comment on lines 1021 to 1025:

if args.assistant_only_loss:
    total_trainable_tokens_before_truncation = get_trainable_tokens(dataset, "assistant_masks")
# Prompt Completions/Instruction Tuning Dataset
elif "completion_mask" in first_row:
    total_trainable_tokens_before_truncation = get_trainable_tokens(dataset, "completion_mask")

So it can be both conversational and prompt-completion. In such a case, the loss is only computed for tokens that are (1) in the completion and (2) from the assistant role.

@qgallouedec (Member) commented Sep 2, 2025

@albertvillanova can you review this one?
Basically the idea is that we have a dataset containing the column input_ids (list of ints), maybe a column assistant_masks (list of bools), and maybe a column completion_mask (list of bools).
A token contributes to the loss only if:

  • assistant_masks is not a column or assistant_masks=1 for this token AND
  • completion_mask is not a column or completion_mask=1 for this token

We want to log the number of tokens contributing to the loss and filter the examples where none of the tokens contribute to the loss.
I'm not sure if it's the best way to leverage 🤗 Datasets in this case.
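
Something like this is what I have in mind (a sketch only; the helper and column names follow the description above and are not an actual implementation):

```python
def count_trainable(example, mask_columns):
    # A token contributes to the loss only if every mask column that exists is 1 for it
    mask = [1] * len(example["input_ids"])
    for col in mask_columns:
        mask = [m and v for m, v in zip(mask, example[col])]
    return {"num_trainable": sum(mask)}

mask_columns = [c for c in ("assistant_masks", "completion_mask") if c in dataset.column_names]
dataset = dataset.map(count_trainable, fn_kwargs={"mask_columns": mask_columns})
total_trainable = sum(dataset["num_trainable"])  # number of tokens contributing to the loss
dataset = dataset.filter(lambda ex: ex["num_trainable"] > 0)  # drop fully-masked examples
```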

@albertvillanova (Member) left a comment

I think the approach of looping row-by-row is going to be problematic for scale:

  • Performance / memory: Iterating over the dataset in pure Python materializes every row and defeats the purpose of 🤗 Datasets' efficient Arrow backend. On anything larger than a toy dataset, this will be orders of magnitude slower than a vectorized map/filter.
  • No batching / parallelism: You're not leveraging the dataset pipeline’s ability to process in batches and in parallel.

I'd recommend re-implementing with dataset.map(..., batched=True, num_proc=...) + dataset.filter(...), which keeps the whole pipeline efficient, parallel, and robust.
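
For instance, something along these lines (a sketch; the column names, num_proc value, and the num_trainable helper column are placeholders, not the final code):

```python
def add_num_trainable(batch, mask_columns):
    # Per-row count of tokens for which every present mask column is 1
    counts = []
    for i in range(len(batch["input_ids"])):
        mask = [1] * len(batch["input_ids"][i])
        for col in mask_columns:
            mask = [m and v for m, v in zip(mask, batch[col][i])]
        counts.append(sum(mask))
    return {"num_trainable": counts}

mask_columns = [c for c in ("assistant_masks", "completion_mask") if c in dataset.column_names]
dataset = dataset.map(add_num_trainable, batched=True, num_proc=4, fn_kwargs={"mask_columns": mask_columns})
dataset = dataset.filter(lambda n: n > 0, input_columns="num_trainable")
```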

@pramodith (Collaborator, Author) commented Sep 4, 2025

> I'd recommend re-implementing with dataset.map(..., batched=True, num_proc=...) + dataset.filter(...), which keeps the whole pipeline efficient, parallel, and robust.

@albertvillanova let me know if I'm wrong, but what we're trying to do is more of a reduce operation than a map operation, since we're trying to compute the total number of non-masked tokens.

I can run a map with batched=True to get the row-wise sums, but I'll still need to iterate over the entire dataset to reduce those to a final dataset-level sum. Am I missing something here?

I came across an example of converting a dataset to a Polars dataframe and running a reduce operation on the Polars dataframe here. Would that help in this scenario? My main concern is whether we'd effectively be duplicating the dataset in memory with this approach.

@albertvillanova (Member) left a comment

Yes, I think that is a good approach:

  • Do as many intermediate steps as possible with map(batched=True) and/or filter.
    • These are efficient and fast: 🤗 Datasets caches all the transformations in Arrow.
    • At this stage, data is not in memory, but memory-mapped on disk.
  • For the final aggregation, convert to Polars and run the reduce (see the sketch below).
    • This operation works with zero-copy on the underlying Arrow buffers.
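
Concretely, something like this (a sketch; it assumes a num_trainable column was already added by a batched map, and that a Polars dependency is acceptable):

```python
import polars as pl

# Intermediate map/filter steps stay Arrow-backed and memory-mapped on disk;
# the final reduce hands the underlying Arrow table to Polars without copying the data.
total_trainable = pl.from_arrow(dataset.data.table)["num_trainable"].sum()
print(f"Trainable tokens: {total_trainable}")
```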

@pramodith closed this Sep 8, 2025
@pramodith deleted the pramodith/update_assistant_token_exception branch September 8, 2025 20:27
@pramodith reopened this Sep 10, 2025
@pramodith (Collaborator, Author) commented Sep 10, 2025

@albertvillanova re-opened this one per your request 😄.

Adding here my comment from the new PR that I had opened (and subsequently closed after re-opening this one), so that it doesn't get lost.

I don't think I can avoid the for-loops: even with batched=True, I have to iterate through all the rows in the batch to get the sum of tokens in each row, so I'm not sure how much this'll help.

@qgallouedec (Member)

Can we try to do it without Polars? I'd like to avoid adding this dependency just for this feature.

@pramodith (Collaborator, Author)

Feels like there isn't a nice, efficient way of counting the number of trainable tokens ahead of time without multiple passes through the entire dataset, and I don't feel this is a valuable enough addition to the library to pursue any further. Unless anyone has a simple solution in mind, I'm considering closing out this PR.
