Gemma 4 31B on vLLM/SGLang: NVFP4/FP8 and MTP Benchmark

A record of running Gemma 4 31B IT on vLLM 0.21.0 and SGLang gemma4-mtp, comparing NVFP4/FP8 block quantization, FP8/BF16 KV cache, and Gemma 4 MTP speculative decoding.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

I ran Gemma 4 31B IT on a single RTX PRO 6000 Blackwell 96GB and compared vLLM 0.21.0 with SGLang gemma4-mtp. What I wanted to understand was not just which runtime was faster, but how KV cache dtype, KV pool size, and MTP speculative decoding behaved in real coding-agent work.

In this record, I compared NVFP4/FP8 block on the vLLM side and FP8 block + BF16 KV/FP8 KV on the SGLang side. In short, vLLM with NVFP4 + FP8 KV, or SGLang with FP8 block + FP8 KV, both fit practical use cases well. SGLang cannot match vLLM’s KV cache capacity, but in actual tasks it felt better at tool use and output quality. I also recorded videos of the generation runs.

Videos

The SGLang-side run record.

Video link: https://www.youtube.com/watch?v=YlysBl5Qg34

Video link: https://www.youtube.com/watch?v=cgYbqnwqygM

Comparison Table

I used the same RedHatAI FP8 model for the 256K-context checks. NVFP4 was measured with NVIDIA’s model.

Config	Runtime	Weights	KV cache	KV pool	Peak TG	Typical TG	Peak accept	Notes
A	vLLM 0.21.0	NVFP4	FP8	876K tok	109.8 tok/s	90-100 tok/s	99.5%	262K ctx, MTP x4
B	vLLM 0.21.0	FP8 block	FP8	876K tok	115.3 tok/s	80-100 tok/s	93.5%	262K ctx, MTP x4
C	SGLang gemma4-mtp	FP8 block	BF16	69.7K tok	~80 tok/s	55-75 tok/s	~90%	262K ctx, SWAKVPool, MTP x5/6
D	SGLang gemma4-mtp	FP8 block	FP8	139.5K tok	~93.8 tok/s	70-90 tok/s	~96%	262K ctx, SWAKVPool, MTP x5/6

NVFP4 sometimes collapsed into repetition. I may have made some mistake in sampling or configuration.
vLLM + MTP + prefix caching reaches the highest TG band in single-stream generation.
vLLM throughput split into two bands that tracked MTP accept rate more than load.
With FP8 KV, SGLang can take about twice the token capacity of BF16 KV at the same SWAKVPool footprint.
SGLang FP8 KV had lower raw throughput than vLLM, but the coding-agent quality felt better. Being able to set min_p is also useful.

What I Wanted To Compare

This run separated five axes:

Runtime: vLLM vs SGLang
Weight quantization: NVFP4 vs FP8 block
KV cache dtype: FP8 vs BF16
Context budget and KV pool topology: vLLM flat KV vs SGLang sliding-window KV pool
Speculative decoding: MTP x4 vs MTP x5/6

If all I cared about were benchmark numbers, peak throughput would be enough. But real coding-agent work mixes long context, repeated tool calls, free-form writing, and structured output. For day-to-day capacity planning, it is more useful to watch the gap between windows where MTP is highly accepted and windows where acceptance drops.

Hardware

Item	Value
GPU	NVIDIA RTX PRO 6000 Blackwell 96GB (SM120)
Host	AMD EPYC, 768GB DDR5 ECC

A. vLLM 0.21.0: NVFP4 weights + FP8 KV

The first baseline was Gemma 4 31B with NVFP4 4-bit weights, FP8 KV cache, and MTP speculative decoding enabled with the Gemma 4 Assistant draft model. Context was 262,144, and MTP used num_speculative_tokens=4. (*official cookbook)

Configuration

Item	Value
Runtime	vLLM 0.21.0
GPU	RTX PRO 6000 Blackwell 96GB
Quantization	ModelOpt NVFP4 weights
KV cache	FP8
Context length	262,144
Tensor parallel	1
Speculative decoding	Gemma 4 MTP, 4 speculative tokens
Attention backend	Triton Attention
Sampling backend	FlashInfer
Compile backend	torch.compile / Inductor

Server Command

This is the actual command. It includes some unnecessary environment variables left over from reuse.

  podman run --rm \
  --name test-model \
  --device nvidia.com/gpu=0 \
  --ipc=host \
  -p 8000:8000 \
  -v /mnt/data/models:/hf/hub:ro \
  -e HF_HOME=/hf \
  -e HF_HUB_OFFLINE=1 \
  -e TORCH_CUDA_ARCH_LIST=12.0 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_TARGET_DEVICE=cuda \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  registry.home.arpa/vllm/vllm-openai:v0.21.0-ubuntu2404 \
  /hf/hub/models--nvidia--Gemma-4-31B-IT-NVFP4/snapshots/e5ef03afa233c35cb000323ff098d4291e1dd07c \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name test-model \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 1 \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --async-scheduling \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --speculative-config '{"model":"/hf/hub/models--google--gemma-4-31B-it-assistant/snapshots/4735700dca7bd22fad5dc348c228b50ec6cbac6d","num_speculative_tokens":4}' \
  --max-model-len 262144 \
  --override-generation-config '{"temperature":0.9,"top_p":0.85,"top_k":64,"presence_penalty":0.0,"repetition_penalty":1.13}'

Observed Logs

Representative values during the run looked like this.

  Generation throughput: 95.9 tok/s   acceptance length: 4.28   draft acceptance: 82.0%
Generation throughput: 95.3 tok/s   acceptance length: 4.18   draft acceptance: 79.5%
Generation throughput: 109.8 tok/s  acceptance length: 4.97   draft acceptance: 99.2%
Generation throughput: 107.6 tok/s  acceptance length: 4.98   draft acceptance: 99.5%

Aggregated:

Metric	Value
Generation throughput	Up to 109.8 tok/s
Typical active generation	Around 90-100 tok/s
Mean MTP acceptance length	Up to 4.98, equivalent to max 5
Average draft acceptance rate	Around 80-90%
Best draft acceptance rate	99.5%
GPU memory used for model load	32.06 GiB
KV cache size	876,006 tokens

Throughput split into a band around 95 tok/s and a band above 107 tok/s. The difference was pulled more by MTP accept rate than by GPU load.

The KV cache had room for 876K tokens, about 3.3x the 262K context window. However, with --max-num-seqs 1, a single session cannot use that surplus. I also tried several sampling settings to improve tool-call iteration stability. For this model, vLLM + MTP, text-only, and min_p could not be used together, so I adjusted temperature, top_p, top_k, and repetition_penalty. I did not land on a definitive setting for this workload.

When it collapsed, the Zed-side output fell into repeated (Sigh) and (Laughter) strings, while the vLLM log on the right kept reporting accepted/drafted throughput and acceptance rate. I was the one sighing.

Screen showing the NVFP4 configuration collapsing into repeated Sigh and Laughter output in Zed Agent — A repetition-collapse case in the NVFP4 configuration. The Zed Agent output on the left repeats (Sigh) and (Laughter), while the vLLM MTP log from the same time is visible on the right.

B. vLLM 0.21.0: FP8 block weights + FP8 KV

This used the same vLLM 0.21.0 runtime as A, changing only the weights from NVFP4 to FP8 block. KV cache stayed FP8.

Configuration

Item	Value
Runtime	vLLM 0.21.0
GPU	RTX PRO 6000 Blackwell 96GB
Quantization	FP8 block
KV cache	FP8
Context length	262,144
GPU KV cache size	876,006 tokens
Speculative decoding	Gemma 4 Assistant draft model, `num_speculative_tokens=4`

Observed Values

  Generation throughput: up to 115.3 tok/s
Typical active generation: around 80-100 tok/s
Mean MTP acceptance length: up to 4.74
Average draft acceptance rate: around 80-88%
Best draft acceptance rate: 93.5%
GPU KV cache size: 876,006 tokens

Compared with NVFP4 in A, peak throughput rose slightly from 109.8 tok/s to 115.3 tok/s. On the other hand, peak draft acceptance fell from 99.5% to 93.5%. The KV dtype stayed FP8, so the KV pool size was the same 876K tokens as A.

C. SGLang gemma4-mtp: FP8 block weights + BF16 KV

Next, I switched the runtime to SGLang gemma4-mtp. Weights stayed FP8 block, and the KV cache was BF16. On the SGLang side, this becomes a sliding-window-aware SWAKVPool, so capacity looks different from vLLM’s flat KV cache.

The timing was convenient: lmsysorg/sglang:gemma4-mtp images had been pushed the day before, so I used that line of images. This may have been the commit where NEXTN changed to FROZEN_KV_MTP.

Docker Hub screen showing lmsysorg/sglang gemma4-mtp tags — Docker image tags for SGLang gemma4-mtp. gemma4-mtp is listed as linux/amd64, while gemma4-mtp-cu13 and gemma4-mtp-arm64-cu13 are listed as linux/arm64. The wording is easy to misread: it is Linux and arm64, not Linux or arm64, and I wasted a download by confusing that.

Configuration

Item	Value
Runtime	SGLang gemma4-mtp
GPU	RTX PRO 6000 Blackwell 96GB
Quantization	FP8 block
KV cache dtype	`torch.bfloat16`
Context length	262,144
Speculative decoding	Auto-promoted from NEXTN to FROZEN_KV_MTP
Draft model	Gemma 4 Assistant draft model
Speculative steps	`speculative_num_steps=5`
Draft tokens	`speculative_num_draft_tokens=6`

SWAKVPool Initialization Values

  full_layer_tokens=69757
swa_layer_tokens=55805
max_total_num_tokens=69757

KV Memory

  SWA KV:  21.29 GB K + 21.29 GB V
Full KV:  2.66 GB K +  2.66 GB V
Total SWAKVPool usage: 47.90 GB

Observed Values

  Typical decode throughput: around 55-75 tok/s
Peak throughput: around 80 tok/s class
Typical accept length: around 4.5-5.5
Typical accept rate: around 75-90%

In terms of FP8 block weights, this is close to vLLM B, but throughput was lower than vLLM. Representative peak values were in the 80 tok/s range, and typical throughput was 55-75 tok/s. The logs did show some low-90s samples, but the comparison table treats this as about 80 tok/s on the conservative side. KV capacity was 69.7K tokens, far below vLLM’s 876K tokens.

D. SGLang gemma4-mtp: FP8 block weights + FP8 KV

This used the same SGLang setup as C, changing only KV cache dtype to torch.float8_e4m3fn. In the SGLang runs, as the video makes fairly visible, this was the configuration that worked best for real coding-agent tasks.

Configuration

Item	Value
Runtime	SGLang gemma4-mtp
GPU	RTX PRO 6000 Blackwell 96GB
Quantization	FP8 block
KV cache dtype	`torch.float8_e4m3fn`
Context length	262,144
Speculative decoding	Auto-promoted from NEXTN to FROZEN_KV_MTP
Draft model	Gemma 4 Assistant draft model
Speculative steps	`speculative_num_steps=5`
Draft tokens	`speculative_num_draft_tokens=6`

SWAKVPool Initialization Values

  full_layer_tokens=139515
swa_layer_tokens=111612
max_total_num_tokens=139515

KV Memory

  SWA KV:  21.29 GB K + 21.29 GB V
Full KV:  2.66 GB K +  2.66 GB V
Total SWAKVPool usage: 47.90 GB

Observed Values

  Typical decode throughput: around 70-90 tok/s
Peak throughput: around 93.8 tok/s
Typical accept length: around 5.0-5.8
Typical accept rate: around 80-96%

During an actual task, Zed Agent was generating a Django reservation-management feature on the left, while I watched SGLang decode batch logs, accept length, accept rate, and generation throughput on the right.

SGLang gemma4-mtp generating a coding-agent Django reservation-management feature — SGLang gemma4-mtp during generation. Zed Agent is implementing a Django restaurant reservation feature on the left, while the SGLang decode batch log, accept rate, and generation throughput are visible on the right.

Compared with BF16 KV in C, full-layer tokens rose from 69,757 to 139,515 at the same 47.90GB SWAKVPool footprint. That is almost exactly 2x. SWA layer tokens also rose from 55,805 to 111,612.

SGLang warned that FP8 KV scaling factors defaulted to 1.0 in this configuration. Within the range of this run, I did not observe a visible quality regression.

Cross-Configuration Comparison

KV capacity was controlled more by KV cache dtype than by weight precision. In the SGLang C to D change, weights stayed FP8 block, and simply changing KV from BF16 to FP8 increased full-layer tokens from 69.7K to 139.5K. Conversely, in vLLM A and B, changing weights from NVFP4 to FP8 block did not change the KV pool, because the KV dtype stayed FP8.

Runtime choice also mattered a lot. At roughly equivalent weight precision, vLLM had higher raw throughput than SGLang. In this run, SGLang’s strength was not throughput but output quality and follow-through in agentic workloads.

Viewpoint	Best Configuration	Reason
Highest peak throughput	B: vLLM FP8 block + FP8 KV	Reached 115.3 tok/s
Daily throughput stability	A: vLLM NVFP4 + FP8 KV	Stable in the 90-100 tok/s band, with high accept peak
MTP acceptance ceiling	A: vLLM NVFP4 + FP8 KV	Reached 99.5%
Practical SGLang setup	D: SGLang FP8 block + FP8 KV	Higher capacity than BF16 KV and better subjective quality
KV capacity	A/B: vLLM FP8 KV	Reserved 876K tokens

Summary

If the only criterion is speed, vLLM is strong for single-stream generation. A and B both reserved a 876K-token KV pool, and B reached a peak throughput of 115.3 tok/s. A reached an acceptance ceiling of 99.5%, and it is very fast when MTP fits the window.

On the other hand, SGLang FP8 KV felt better in actual coding-agent work. D had about twice the KV capacity of C with BF16 KV, and typical decode throughput rose to 70-90 tok/s. It does not have vLLM’s KV pool size, but if tool use and output stability matter more, D is easy to use.

The caveats:

vLLM was measured as a single stream with --max-num-seqs 1.
vLLM reports runtime accepted throughput, while SGLang reports decode throughput, so the numbers are not perfectly identical metrics.
Configurations using MTP use the Gemma 4 Assistant draft model.
NVFP4 had cases where output collapsed into repetition, so there may still be an unresolved sampling or configuration issue.

Personally, I adopted SGLang FP8. Depending on the role, I would think of it as a CTX 96-128K, KV FP8 operating point.

Step-3.7-Flash-NVFP4 as a Local Orchestrator: Multi-Agent System Development

A local-only multi-agent …

Running MiMo V2.5 Pro IQ2_S Locally: RTX PRO 6000 Blackwell x1/x2 Benchmark

Running the Xiaomi MiMo V2.5 …