Qwen3 uses more memory than Qwen2.5 for a similar model size? #1332

DhruvaKartik · 2025-04-30T06:20:29Z


I was checking out Qwen/Qwen3-0.6B on vLLM and noticed this:

vllm serve Qwen/Qwen3-0.6B --max-model-len 8192

INFO 04-30 05:33:17 [kv_cache_utils.py:634] GPU KV cache size: 353,456 tokens

INFO 04-30 05:33:17 [kv_cache_utils.py:637] Maximum concurrency for 8,192 tokens per request: 43.15x

Right after this, I served Qwen2.5-0.5B-Instruct with the same settings and saw:

vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 8192

INFO 04-30 05:39:41 [kv_cache_utils.py:634] GPU KV cache size: 3,317,824 tokens

INFO 04-30 05:39:41 [kv_cache_utils.py:637] Maximum concurrency for 8,192 tokens per request: 405.01x
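
As a sanity check on the log lines, the reported concurrency appears to be simply the KV cache size in tokens divided by `--max-model-len`. A minimal sketch using only the numbers quoted above (the helper name `max_concurrency` is made up for illustration and is not a vLLM function):

```python
# Numbers copied from the two vLLM log excerpts in this post.
MAX_MODEL_LEN = 8192
KV_CACHE_TOKENS = {
    "Qwen/Qwen3-0.6B": 353_456,
    "Qwen/Qwen2.5-0.5B-Instruct": 3_317_824,
}


def max_concurrency(cache_tokens: int, max_model_len: int) -> float:
    """Hypothetical helper: how many full-length requests fit in the cache."""
    return cache_tokens / max_model_len


for name, tokens in KV_CACHE_TOKENS.items():
    print(f"{name}: {max_concurrency(tokens, MAX_MODEL_LEN):.2f}x")

# Ratio between the two cache sizes (Qwen2.5 over Qwen3).
ratio = (
    KV_CACHE_TOKENS["Qwen/Qwen2.5-0.5B-Instruct"]
    / KV_CACHE_TOKENS["Qwen/Qwen3-0.6B"]
)
print(f"cache size ratio: {ratio:.2f}x")
```

This reproduces the 43.15x and 405.01x figures, so the gap comes entirely from how many tokens fit in the KV cache, not from how concurrency is computed.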

How can there be a 10x difference in concurrency for a similar model size? Am I missing something?
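
One possible explanation, not confirmed in this thread, is that the per-token KV cache footprint depends on the attention shape (layers, KV heads, head dimension) rather than on parameter count. The sketch below estimates bytes of KV cache per token. The shape values are assumptions taken from my reading of each model's published `config.json` and should be checked there; `AttnShape` is a hypothetical helper, and a 2-byte dtype (bf16/fp16) is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnShape:
    """Hypothetical container for the fields that determine KV cache size."""
    num_layers: int
    num_kv_heads: int
    head_dim: int

    def kv_bytes_per_token(self, dtype_bytes: int = 2) -> int:
        # Factor 2 covers one key and one value vector per layer and KV head.
        return 2 * self.num_layers * self.num_kv_heads * self.head_dim * dtype_bytes


# ASSUMED values: verify against each model's config.json before relying on them.
qwen3_0_6b = AttnShape(num_layers=28, num_kv_heads=8, head_dim=128)
qwen2_5_0_5b = AttnShape(num_layers=24, num_kv_heads=2, head_dim=64)

q3 = qwen3_0_6b.kv_bytes_per_token()
q25 = qwen2_5_0_5b.kv_bytes_per_token()

print(f"Qwen3-0.6B:   {q3:,} bytes per token")
print(f"Qwen2.5-0.5B: {q25:,} bytes per token")
print(f"ratio: {q3 / q25:.2f}x")
```

Under those assumptions the ratio is about 9.3x, which is close to the ratio between the two cache sizes in the logs (about 9.4x). The small remainder could plausibly come from the difference in weight memory leaving slightly different amounts of GPU memory for the cache.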
