Qwen3-Coder-Next 80B in Three Modes: BF16 CPU / IQ4_NL Hybrid / nvfp4 GPU Measured
Qwen3-Coder-Next (~80B MoE) benchmarked in three modes on the same hardware: BF16 CPU inference (7.6 tok/s), IQ4_NL Hybrid GPU offload (59-85 tok/s), and nvfp4 full GPU (17-100 tok/s). The Expert Offload speed penalty is quantified and coding quality evaluated.
Background
Qwen3-Coder-Next is an ~80B parameter MoE model specialized for coding. Strong in code generation, review, and security auditing, it’s a candidate for local coding assistant deployment.
The problem is choosing among too many execution options: BF16 unquantized CPU inference, IQ4_NL Hybrid GPU offload (experts on CPU), and nvfp4 full GPU inference. Each has different speed/quality/VRAM characteristics, so all three were measured on the same hardware to establish per-use-case selection criteria.
Objective
- Confirm BF16 CPU inference speed and quality (maximum precision mode)
- Quantify Expert Offload speed penalty with IQ4_NL Hybrid
- Measure nvfp4 GPU throughput
- Evaluate coding task quality
Test Environment
| Item | Specification |
|---|---|
| CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM) |
| Memory | DDR5-6400 768GB (12ch) |
| OS | Ubuntu 24.04 LTS |
Three Configurations
| Config | Runtime | Quantization | Expert Placement | ctx |
|---|---|---|---|---|
| A: CPU BF16 | llama.cpp | BF16 (unquantized) | All CPU | 16K |
| B: Hybrid offload | ik_llama.cpp | IQ4_NL | Expert=CPU, Attn=GPU | 65K |
| C: GPU nvfp4 | vLLM | nvfp4 | All GPU | 32K |
Results
Config A: BF16 CPU (th=13, ctx=16K)
| Metric | Measured |
|---|---|
| PP (short prompt) | 33.37 tok/s |
| PP (287 tokens) | 117.40 tok/s |
| TG (sustained) | 7.59 tok/s |
| TTFT (287 tokens) | ~2.58s |
Throughput stayed consistent through a 2,233-token generation. The KV cache was quantized to q8_0 to manage memory pressure.
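The measured TTFT is consistent with the prefill rate; a quick sketch of the arithmetic using the numbers above:

```python
# Estimate time-to-first-token from the measured prompt-processing rate.
prompt_tokens = 287
pp_rate = 117.40                          # tok/s, measured PP for this prompt
prefill_time = prompt_tokens / pp_rate    # pure prefill component
# Measured TTFT was ~2.58 s; the remaining ~0.14 s is scheduling/sampling
# overhead plus the first decode step.
print(f"{prefill_time:.2f} s")            # → 2.44 s
```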
Config B: IQ4_NL Hybrid (Expert CPU offload, ctx=65K)
Run A: exps=CPU (Expert weights on CPU)
| Metric | Measured |
|---|---|
| GPU buffer (weights) | 1,403 MiB |
| CPU buffer (weights) | 41,472 MiB |
| graph_splits | 98 |
| TG (weighted avg) | 58.94 tok/s |
| PP (representative) | 761-1,120 tok/s |
Run B: No exps=CPU (All weights on GPU)
| Metric | Measured |
|---|---|
| GPU buffer (weights) | 42,875 MiB |
| graph_splits | 2 |
| TG (weighted avg) | 85.36 tok/s |
| PP (representative) | 880-3,572 tok/s |
Expert Offload speed penalty: -31% (85.36 → 58.94 tok/s)
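The -31% figure is relative to the all-GPU run; checking the arithmetic:

```python
tg_exps_cpu = 58.94   # tok/s with expert weights offloaded to CPU
tg_all_gpu  = 85.36   # tok/s with all weights on GPU
penalty = (tg_all_gpu - tg_exps_cpu) / tg_all_gpu * 100
print(f"-{penalty:.0f}%")   # → -31%
```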
Config C: nvfp4 GPU (vLLM, ctx=32K)
| Metric | Measured |
|---|---|
| TG (stable) | 17-100 tok/s (high variance) |
| PP | 17-669 tok/s (burst) |
Reported values fluctuate because vLLM logs rolling averages. Stable generation runs at 58-100 tok/s.
Comparison Summary
| Config | TG(tok/s) | VRAM | Quality | Use Case |
|---|---|---|---|---|
| A: BF16 CPU | 7.59 | 0 | Highest | Precision review/audit |
| B: Hybrid exps=CPU | 58.94 | ~3GB | Good | Normal coding |
| B: Hybrid all-GPU | 85.36 | ~43GB | Good | Fast coding |
| C: nvfp4 GPU | 58-100 | ~22GB | Good | vLLM integration |
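The selection logic in the table can be condensed into a small helper (a hypothetical function; thresholds are taken from the measurements above):

```python
def pick_config(vram_gb: float, precision_critical: bool = False,
                need_vllm: bool = False) -> str:
    """Map requirements to one of the three measured configurations."""
    if precision_critical:
        return "A: BF16 CPU"           # 7.59 tok/s, zero quantization loss
    if need_vllm:
        return "C: nvfp4 GPU"          # ~22 GB VRAM, vLLM feature set
    if vram_gb >= 43:
        return "B: Hybrid all-GPU"     # 85.36 tok/s
    if vram_gb >= 3:
        return "B: Hybrid exps=CPU"    # 58.94 tok/s
    return "A: BF16 CPU"               # CPU-only fallback

print(pick_config(96))                 # → B: Hybrid all-GPU
```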
Live Demonstration
To verify the output quality of Qwen3-Coder-Next in action, we recorded real-time execution demonstrating the model’s coding capability. The video below shows the model generating code and explaining its logic:
This demonstration shows:
- Real-time token generation proving the model is actually executing
- Verification of the actual output content and quality
- Live streaming of inference with visible token-by-token progression
- Proof that CPU/GPU inference is functioning as expected
Analysis
BF16 CPU Value Proposition
7.59 tok/s is slow for chat but sufficient for code review and security auditing. BF16 has zero quantization degradation, making it trustworthy for SQLi vulnerability detection, plaintext password risk identification, and similar precision-critical tasks.
The 12-channel DDR5-6400 bandwidth supports BF16’s massive data movement, with Zen 5’s native AVX-512 BF16 instructions contributing meaningfully.
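The bandwidth claim can be sanity-checked with back-of-envelope arithmetic (peak theoretical numbers, not measured):

```python
channels = 12
mt_per_s = 6400          # DDR5-6400: mega-transfers per second
bytes_per_transfer = 8   # 64-bit data path per channel
peak_gb_s = channels * mt_per_s * bytes_per_transfer / 1000
print(f"{peak_gb_s:.1f} GB/s")   # → 614.4 GB/s
# At the measured 7.59 tok/s, this bounds weight traffic at roughly
# peak_gb_s / 7.59 ≈ 81 GB per generated token.
```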
Expert Offload 31% Penalty
exps=CPU saves 40GB+ VRAM but costs 31% generation speed. graph_splits jumps from 2 to 98, indicating frequent CPU-GPU data transfers for every expert computation.
59 tok/s is still fast, but when VRAM allows, full GPU loading is clearly superior. Expert Offload is a “VRAM compromise” strategy, not well-suited as a speed optimization method.
Coding Quality
BF16 evaluation:
- Security audit: Correctly identified SQLi vulnerabilities and plaintext password risks, providing fixes with unit tests
- Hallucination control: Properly refused out-of-spec questions (“NOT IN SPEC”)
- Complex logic: Met 90% of constraint-heavy Django requirements but missed some multi-tenant safety nuances. Best as a high-quality draft generator + expert reviewer
Lessons Learned
Three-mode selection is now clear:
- Precision work (BF16 CPU): Security audits, final reviews. Speed sacrificed for zero quantization loss
- Normal coding (Hybrid / GPU): Daily development. 59-85 tok/s feels responsive
- vLLM integration (nvfp4): When tool-call-parser, prefix caching, or API integration is needed
The 31% Expert Offload penalty was larger than expected. Measurement confirms it is a "use when VRAM is insufficient" setting, not a general optimization.
Reproduction Steps
BF16 CPU
```bash
podman run --rm -it \
  -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:Z compute.home.arpa/llamacpp-zen5:qwen3-coder-next \
  -m /models/.../BF16/Qwen3-Coder-Next-BF16-00001-of-00004.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on --ctx-size 16384 \
  --parallel 1 --threads 13 --threads-batch 13 \
  --batch-size 2048 --ubatch-size 512 \
  --jinja --host 0.0.0.0 --port 8080
```
IQ4_NL Hybrid (Expert CPU offload)
```bash
podman run --rm -it --device nvidia.com/gpu=all \
  -p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 -m "$MODEL" \
  -c 65536 --threads 13 --threads-batch 23 \
  -b 2048 -ub 2048 -ngl 99 \
  -ot exps=CPU -fa on --no-mmap --jinja
```
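The `-ot` (`--override-tensor`) flag takes regex=backend pairs matched against tensor names; `exps=CPU` matches every expert tensor. A finer split is possible, for example pinning only some layers' experts to CPU (illustrative patterns; verify against your build's tensor names):

```shell
-ot exps=CPU                                  # all expert tensors on CPU
-ot "blk\.(1[0-9]|2[0-9])\.ffn_.*_exps=CPU"   # experts of layers 10-29 only
```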
nvfp4 GPU (vLLM)
```bash
vllm serve vincentzed-hf/Qwen3-Coder-Next-NVFP4 \
  --dtype auto --gpu-memory-utilization 0.88 \
  --max-num-seqs 1 --max-model-len 32768 \
  --enable-prefix-caching --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder
```
Technical Notes
256K Context Configuration
IQ4_NL can be set to ctx=262144, but Expert Offload + 256K context creates a massive KV cache. The practical upper limit is ctx=65536.
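For intuition on why 256K is impractical, a standard GQA KV-cache estimate (illustrative parameters only, not verified for this model; Qwen3-Next uses a hybrid attention design, so real usage differs):

```python
def kv_cache_gib(ctx, layers=48, kv_heads=8, head_dim=128, bytes_per=2):
    """K and V across all layers, f16 cache, per sequence (assumed params)."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per / 2**30

print(kv_cache_gib(262144))  # 256K context → 48.0 GiB under these assumptions
print(kv_cache_gib(65536))   # 64K context  → 12.0 GiB
```

Quantizing the cache to q8_0 roughly halves these figures, but the 4x growth from 64K to 256K remains.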
graph_splits Meaning
graph_splits indicates CPU-GPU data transfer frequency. exps=CPU yields 98 (returns to CPU for each expert layer), all-GPU yields 2 (input/output only). This difference directly maps to speed difference.

