Fix Qwen-Image long prompt dimension mismatch error (issue #12083) #12087
Conversation
@naykun could you also give this a look?
```python
pos_index = torch.arange(self._current_max_len)
neg_index = torch.arange(self._current_max_len).flip(0) * -1 - 1
self.register_buffer(
    "pos_freqs",
    torch.cat(
        [
            self.rope_params(pos_index, self.axes_dim[0], self.theta),
            self.rope_params(pos_index, self.axes_dim[1], self.theta),
            self.rope_params(pos_index, self.axes_dim[2], self.theta),
        ],
        dim=1,
    ),
)
self.register_buffer(
    "neg_freqs",
    torch.cat(
        [
            self.rope_params(neg_index, self.axes_dim[0], self.theta),
            self.rope_params(neg_index, self.axes_dim[1], self.theta),
            self.rope_params(neg_index, self.axes_dim[2], self.theta),
        ],
        dim=1,
    ),
)
```
Some changes seem to be overlapping with #12061

Hi @sayakpaul, the solution looks good for addressing the runtime error. However, I'd like to point out that the Qwen image model is not trained on prompts longer than 512 tokens, so extremely long prompts may lead to unpredictable behavior. Perhaps we should add a warning to highlight this limitation.
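The warning the reviewer proposes could look roughly like the following. This is a minimal sketch, not the code that landed in the PR: the constant name `QWEN_IMAGE_MAX_TRAINED_TOKENS` and the helper `check_prompt_length` are hypothetical, and only the 512-token limit comes from the discussion above.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical constant: per the review discussion, Qwen-Image was
# trained on prompts of at most 512 tokens.
QWEN_IMAGE_MAX_TRAINED_TOKENS = 512


def check_prompt_length(num_tokens: int) -> None:
    # Warn but do not fail: generation still runs, quality may degrade.
    if num_tokens > QWEN_IMAGE_MAX_TRAINED_TOKENS:
        logger.warning(
            f"Prompt has {num_tokens} tokens, but Qwen-Image was trained on "
            f"prompts of up to {QWEN_IMAGE_MAX_TRAINED_TOKENS} tokens. "
            "Longer prompts may lead to unpredictable behavior."
        )
```

A soft warning (rather than a hard error) matches the reviewer's framing: long prompts are merely untested territory, not invalid input.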
Tremendous suggestion! @robin-ede, can we incorporate this and modify the test so that we verify we raise the warning? Here's an example of how we test warnings: Lines 1811 to 1820 in 0611631

Yea, for sure! I'll get this done in a bit.
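The test pattern referenced above captures a logger's output and asserts on it. Below is a minimal, self-contained stand-in for that pattern (diffusers ships a `CaptureLogger` helper in its testing utilities; this sketch reimplements the idea rather than importing it, and the logger name is arbitrary):

```python
import logging


class CaptureLogger:
    # Minimal stand-in for a capture-logger test helper: collects
    # everything the logger emits while the context is active.
    def __init__(self, logger):
        self.logger = logger
        self.out = ""
        # StreamHandler accepts any object with write()/flush().
        self.handler = logging.StreamHandler(self)

    def write(self, message):
        self.out += message

    def flush(self):
        pass

    def __enter__(self):
        self.logger.addHandler(self.handler)
        return self

    def __exit__(self, *exc):
        self.logger.removeHandler(self.handler)


logger = logging.getLogger("qwenimage_warning_demo")
logger.setLevel(logging.WARNING)
```

A test then runs the pipeline (or here, logs directly) inside the context and asserts the expected substring appears in `cap.out`.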
…e#12083)

- Add dynamic expansion capability to QwenEmbedRope pos_freqs buffer
- Expand buffer when max_vid_index + max_len exceeds current size
- Prevent RuntimeError when text prompts exceed 1024 tokens with large images
- Add comprehensive test case for long prompt scenarios
- Maintain backward compatibility with existing functionality

Fixes: huggingface#12083
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
- Add warning when prompts exceed 512 tokens (model's training limit)
- Warn users about potential unpredictable behavior with long prompts
- Add comprehensive test with CaptureLogger to verify warning system
- Follow established diffusers warning patterns for consistency

- Move CaptureLogger import to top level following established patterns
- Use logging.WARNING constant instead of hardcoded value
- Simplify device handling to match other QwenImage tests
- Remove redundant variable assignments and comments

- Fix whitespace and string quote consistency
- Add trailing commas where appropriate
- Clean up formatting per diffusers code standards
Force-pushed from f250165 to 35cb2c8
Should be fixed! @sayakpaul
- Fix test_long_prompt_warning to properly trigger the 512-token warning
- Replace inefficient wall-of-text approach with elegant hardcoded multiplier
- Use precise token counting to ensure required_len > _current_max_len threshold
- Add runtime assertion for test robustness and maintainability
- Fix max_sequence_length validation error in test_long_prompt_no_error

- Replace character counting with actual token counting for accuracy
- Use multiplier that generates ~521 tokens (well within limits)
- Add runtime assertions to verify token count assumptions
- Ensure test validates the original fix without triggering warnings
- Make test intent clearer with proper token-based thresholds
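The "actual token counting with runtime assertions" approach from the commits above can be sketched as follows. The helper name `build_prompt_with_tokens` is hypothetical, and the whitespace tokenizer is a stand-in; the real test would use the pipeline's own tokenizer:

```python
def build_prompt_with_tokens(count_tokens, unit: str, target: int) -> str:
    """Repeat `unit` until the prompt reaches at least `target` tokens,
    then assert the assumption at runtime, as the PR's test does."""
    prompt = unit
    while count_tokens(prompt) < target:
        prompt += unit
    # Runtime assertion: fail loudly if the tokenizer ever changes
    # in a way that invalidates the test's length assumption.
    assert count_tokens(prompt) >= target
    return prompt


# Stand-in tokenizer: whitespace split. A real test would call
# tokenizer(prompt, return_tensors="pt") and count input_ids instead.
count = lambda s: len(s.split())
prompt = build_prompt_with_tokens(count, "a photo of a cat ", 521)
```

Counting tokens directly (rather than characters) is what makes the 512-token threshold check precise across tokenizers.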
@bot /style

Style bot fixed some files and pushed the changes.

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Fix Qwen-Image long prompt dimension mismatch error
Fixes: #12083
What does this PR do?

Fixes a critical bug in Qwen-Image where long text prompts (>1024 tokens) combined with large images cause `RuntimeError: The size of tensor a (1024) must match the size of tensor b (983)`.

Problem

`QwenEmbedRope` had a fixed 1024-length buffer for positional frequencies. Large images (1024×1024) plus long prompts required accessing `pos_freqs[1024:2048]` from a 1024-element buffer.

Solution

Added dynamic buffer expansion that automatically resizes `pos_freqs` when needed, using `register_buffer()` for proper tensor management.

Changes

- Added an `_expand_pos_freqs_if_needed()` method
- Updated `forward()` to check expansion requirements

Before: accessing `pos_freqs[1024:2048]` raised a RuntimeError
After: the buffer auto-expands and the call succeeds

Fixes #12083
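The dynamic-expansion idea can be illustrated with a self-contained sketch. This is not the actual diffusers implementation: the class name, `_build_freqs`, the growth policy (round up to the next multiple of 1024), and the toy RoPE frequency table are all illustrative assumptions; only the method name `_expand_pos_freqs_if_needed` and the `_current_max_len` attribute come from the PR:

```python
import torch
import torch.nn as nn


class QwenEmbedRopeSketch(nn.Module):
    """Illustrative sketch of the PR's dynamic-buffer idea:
    pos_freqs grows whenever a longer sequence is requested."""

    def __init__(self, initial_max_len: int = 1024, dim: int = 8, theta: float = 10000.0):
        super().__init__()
        self._current_max_len = initial_max_len
        self.dim = dim
        self.theta = theta
        self.register_buffer("pos_freqs", self._build_freqs(initial_max_len))

    def _build_freqs(self, max_len: int) -> torch.Tensor:
        # Toy 1D RoPE frequency table of shape (max_len, dim // 2).
        index = torch.arange(max_len, dtype=torch.float32)
        inv_freq = 1.0 / (self.theta ** (torch.arange(0, self.dim, 2).float() / self.dim))
        return torch.outer(index, inv_freq)

    def _expand_pos_freqs_if_needed(self, required_len: int) -> None:
        if required_len <= self._current_max_len:
            return
        # Round up to the next multiple of 1024 so we don't rebuild
        # the buffer on every forward call.
        new_max_len = ((required_len + 1023) // 1024) * 1024
        # Re-registering an existing buffer name replaces the tensor.
        self.register_buffer("pos_freqs", self._build_freqs(new_max_len))
        self._current_max_len = new_max_len

    def forward(self, required_len: int) -> torch.Tensor:
        self._expand_pos_freqs_if_needed(required_len)
        return self.pos_freqs[:required_len]
```

With a 1024-entry initial buffer, requesting 1500 positions triggers one expansion to 2048 entries instead of a size-mismatch error; requests within the current size leave the buffer untouched.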
Before submitting

- Read the documentation guidelines and the tips on formatting docstrings.
- Added `test_long_prompt_no_error()` in `tests/pipelines/qwenimage/test_qwenimage.py`.
Implementation Details

Architecture Analysis

Our solution follows established patterns in the diffusers codebase:

- Uses `register_buffer()` like other components (`modeling_ctx_clip.py`, `embeddings.py`)
- Mirrors `get_1d_rotary_pos_embed()`, which computes frequencies on demand

Performance Impact

Backward Compatibility

Who can review?

This PR affects:

The fix is focused on the QwenImage transformer implementation and follows established PyTorch patterns for dynamic buffer management.