
Conversation

artem-spector

What does this PR do?

This PR fixes an issue with prompt truncation when max_prompt_length is set for multimodal inputs.
Previously, the code generated prompt_inputs by calling the entire processing_class (processor) on prompts_text and the images.
When the processor handled image inputs, it could insert many image placeholder tokens into the tokenized sequence.
This sometimes caused truncate_with_protected_tokens to fail when reducing the sequence to max_prompt_length, because the protected image tokens alone consumed more than the allowed length.
Now only the tokenizer is called on prompts_text, ensuring that just the textual prompt is tokenized for truncation while still respecting protected tokens.
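A minimal sketch of the change, assuming processing_class is a multimodal processor exposing a .tokenizer attribute (variable names mirror the trainer code, but the snippet is illustrative rather than the exact diff; the real code passes the images through keyword arguments):

# Previously: the full processor expands image placeholders into many image
# tokens, so the sequence handed to truncation is dominated by protected tokens.
prompt_inputs = self.processing_class(
    text=prompts_text,
    images=images,
    return_tensors="pt",
    padding=True,
    padding_side="left",
    add_special_tokens=False,
)

# Now: only the tokenizer runs before truncation, so the sequence contains just
# the textual prompt (with at most a single protected image placeholder).
tokenizer = self.processing_class.tokenizer
prompt_inputs = tokenizer(
    prompts_text,
    return_tensors="pt",
    padding=True,
    padding_side="left",
    add_special_tokens=False,
)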

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@artem-spector artem-spector changed the title Fix prompt truncation for multimodal inputs with multiple image tokens GRPOTrainer : fix prompt truncation for multimodal inputs with multiple image tokens Aug 11, 2025
Collaborator

@pramodith pramodith left a comment


I think I'm missing something: the inputs to the truncate_with_protected_tokens function will be the same even after the changes, right? The tokenization code moved down into the if block will be applied in both the previous and proposed versions as long as max_prompt_length is set.

    padding_side="left",
    add_special_tokens=False
)
prompt_inputs = Trainer._prepare_inputs(self, prompt_inputs)
Collaborator


Suggested change
prompt_inputs = Trainer._prepare_inputs(self, prompt_inputs)
prompt_inputs = super()._prepare_inputs(prompt_inputs)

Collaborator


This block of code also ensures that prompts_text never contains more than one image_token, so I'm wondering where you were running into failures; an example of where the current code breaks would be helpful.

if self.image_token is not None:
    escaped_img_token = re.escape(self.image_token)
    # Search for the image token in the chat template
    if re.search(escaped_img_token, self.processing_class.chat_template):
        prompts_text = [
            re.sub(rf"({escaped_img_token})+", self.image_token, text) for text in prompts_text
        ]
    else:
        # If the chat template doesn't use the image token, we remove all instances of it + vision_end_token_id
        if self.vision_end_token_id is not None:
            escaped_eoi_token = re.escape(
                self.processing_class.tokenizer.decode([self.vision_end_token_id])
            )
            prompts_text = [
                re.sub(rf"({escaped_img_token})+{escaped_eoi_token}", "", text) for text in prompts_text
            ]
        else:
            # If vision_end_token_id is None, just remove the image tokens
            prompts_text = [re.sub(rf"({escaped_img_token})+", "", text) for text in prompts_text]
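For illustration, here is a standalone version of the collapse step; the "<image>" string is just a stand-in for whatever token the processor actually uses:

import re

image_token = "<image>"  # placeholder; the real token string comes from the processor
escaped_img_token = re.escape(image_token)

text = "<image><image><image> Describe the picture."
# A run of repeated image tokens collapses to a single token.
print(re.sub(rf"({escaped_img_token})+", image_token, text))
# -> "<image> Describe the picture."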

Author


The problem is that this code:
prompt_inputs = self.processing_class(
    text=prompts_text,
    return_tensors="pt",
    padding=True,
    padding_side="left",
    add_special_tokens=False,
    **kwargs,
)
may generate input_ids that include both text and image tokens.
For example, LlavaNextProcessor, given a short prompt and a 459x320 image, produces prompt_ids of length 1822. Most of those are image tokens, and truncate_with_protected_tokens would fail trying to truncate the sequence to 512.
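Roughly, the failure mode looks like this; the check is a simplified stand-in for what truncate_with_protected_tokens enforces, and the token counts are only illustrative, derived from the numbers above:

max_prompt_length = 512
num_prompt_ids = 1822   # text + expanded image tokens produced by the processor
num_protected = 1781    # image placeholder tokens that must survive truncation (illustrative)

# If the protected tokens alone exceed the target length, truncation cannot succeed.
if num_protected > max_prompt_length:
    raise ValueError(
        f"Cannot truncate to {max_prompt_length} tokens while keeping {num_protected} protected tokens."
    )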

In the fixed version
prompt_inputs = tokenizer(
    prompts_text,
    return_tensors="pt",
    padding=True,
    padding_side="left",
    add_special_tokens=False
)
the tokenizer processes only the prompt text, and the prompt_ids length is 41, so there is nothing to truncate.

Co-authored-by: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com>
Collaborator

@pramodith pramodith left a comment


Thanks for the clarification! Looks good to me.

@qgallouedec
Member

So if I understand correctly, instead of enforcing the maximum number of tokens on text tokens + image tokens, you apply it only to text tokens?

@artem-spector
Author

So if I understand correctly, instead of enforcing the maximum number of tokens on text tokens + image tokens, you apply it only to text tokens?

Correct.

@qgallouedec
Member

qgallouedec commented Sep 3, 2025

The issue, though, is that the final sequence (the one taken by the model as input) may be longer than max_length.
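A quick arithmetic sketch of that concern, reusing the illustrative counts from the LlavaNext example above:

max_prompt_length = 512
text_token_count = 41       # text-only prompt ids; already under the limit, so truncation is a no-op
image_token_count = 1781    # image placeholders expanded when the processor builds the model input (illustrative)

final_input_length = text_token_count + image_token_count
print(final_input_length, final_input_length > max_prompt_length)  # 1822 True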
