Qwen3-Coder-Next 80B in Three Modes: BF16 CPU / IQ4_NL Hybrid / nvfp4 GPU Measured
Qwen3-Coder-Next (~80B MoE) benchmarked across BF16 CPU inference (7.59 tok/s), IQ4_NL Hybrid GPU offload (59-85 tok/s), and nvfp4 GPU (17-100 tok/s). Also quantifies the Expert Offload speed penalty, documents SWA cache invalidation behavior, and evaluates coding quality.
Background
Qwen3-Coder-Next is an ~80B parameter MoE model specialized for coding. Strong in code generation, review, and security auditing, it is a candidate for local coding assistant deployment.
The problem is too many execution options: BF16 unquantized CPU inference, IQ4_NL Hybrid GPU offload (experts on CPU), nvfp4 full GPU inference. Each has different speed/quality/VRAM characteristics. All three were measured on the same hardware to establish per-use-case selection criteria.
Objective
- Confirm BF16 CPU inference speed and quality (maximum precision mode)
- Quantify Expert Offload speed penalty with IQ4_NL Hybrid
- Measure nvfp4 GPU throughput
- Evaluate coding task quality
- Document SWA cache invalidation behavior with Qwen3-Next-80B-A3B-Thinking (Q4_K_M)
Test Environment
| Item | Specification |
|---|---|
| CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM) |
| Memory | DDR5-6400 768GB (12ch) |
| OS | Ubuntu 24.04 LTS |
Three Configurations
| Config | Runtime | Quantization | Expert Placement | ctx |
|---|---|---|---|---|
| A: CPU BF16 | llama.cpp | BF16 (unquantized) | All CPU | 16K |
| B: Hybrid offload | ik_llama.cpp | IQ4_NL | Expert=CPU, Attn=GPU | 65K |
| C: GPU nvfp4 | vLLM | nvfp4 | All GPU | 32K |
Results
Config A: BF16 CPU (th=13, ctx=16K)
| Metric | Measured |
|---|---|
| PP (short prompt) | 33.37 tok/s |
| PP (287 tokens) | 117.40 tok/s |
| TG (sustained) | 7.59 tok/s |
| TTFT (287 tokens) | ~2.58s |
Throughput stayed consistent through a 2,233-token generation. The KV cache was quantized to q8_0 to manage memory pressure.
I chose BF16 because I wanted the cleanest possible view of the model’s coding behavior without quantization noise. The 12-channel DDR5-6400 bandwidth supports BF16’s massive data movement, and Zen 5’s native AVX-512 BF16 instructions contribute meaningfully. At 80B scale, memory movement is the real story. The 12-channel platform is what makes the BF16 result credible.
Config B: IQ4_NL Hybrid (Expert CPU offload, ctx=65K)
Run A: exps=CPU (Expert weights on CPU)
| Metric | Measured |
|---|---|
| GPU buffer (weights) | 1,403 MiB |
| CPU buffer (weights) | 41,472 MiB |
| graph_splits | 98 |
| TG (weighted avg) | 58.94 tok/s |
| PP (representative) | 761-1,120 tok/s |
Run B: No exps=CPU (All weights on GPU)
| Metric | Measured |
|---|---|
| GPU buffer (weights) | 42,875 MiB |
| graph_splits | 2 |
| TG (weighted avg) | 85.36 tok/s |
| PP (representative) | 880-3,572 tok/s |
Expert Offload speed penalty: -31% (85.36 -> 58.94 tok/s)
In Run A, most weights ended up on the CPU side while the GPU held only about 1.4 GiB. Even though GPU offload was enabled, the effective execution path crossed the GPU/CPU boundary far more aggressively than intended. graph_splits reached 98, meaning the compute graph was heavily fragmented with repeated synchronization and transfer across boundaries.
Run B was not just faster but also much more uniform, with most tasks clustering around 85 tok/s. GPU weights rose to 42,875 MiB (about 41.9 GiB) while graph_splits fell to 2. That gap is big enough to matter in real coding sessions.
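The buffer sizes reported for the two runs can be cross-checked with simple arithmetic. The sketch below uses only the MiB figures from the two tables above; the variable names are illustrative. It suggests that exps=CPU relocates weights rather than changing what is loaded, since the Run A buffers add up to the Run B GPU buffer.

```python
# Weight buffer sizes in MiB, copied from the Run A and Run B tables.
run_a_gpu_mib = 1_403
run_a_cpu_mib = 41_472
run_b_gpu_mib = 42_875

# Run A splits the weights across two devices; Run B keeps them on one.
run_a_total_mib = run_a_gpu_mib + run_a_cpu_mib
cpu_share = run_a_cpu_mib / run_a_total_mib

print(f"Run A total: {run_a_total_mib} MiB")
print(f"Matches Run B GPU buffer: {run_a_total_mib == run_b_gpu_mib}")
print(f"Share of weights on CPU in Run A: {cpu_share:.1%}")
```

With these figures, roughly 97% of the weight bytes sit on the CPU side in Run A, which fits the observation that the GPU held only about 1.4 GiB.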
Weighted Average Methodology
The gen_tokens weighted average weights each task’s gen_tps by its gen_tokens count. Tasks that emit more output matter more, so the final average tracks long-form generation feel better than a plain mean.
weighted_gen_tps = Σ(gen_tokens_i * gen_tps_i) / Σ(gen_tokens_i)
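As a minimal sketch of this formula, the snippet below applies it to the five representative Run A rows listed in the next table. Because it covers only that subset and not the full task log, it is not expected to reproduce the 58.94 tok/s headline figure exactly. The function names are illustrative.

```python
# (gen_tokens, gen_tps) pairs copied from the Run A representative table.
RUN_A_ROWS = [
    (66, 59.14),
    (141, 59.71),
    (67, 58.94),
    (180, 58.52),
    (3810, 58.80),
]


def weighted_gen_tps(rows):
    """Average gen_tps where each task is weighted by its gen_tokens."""
    total_tokens = sum(tokens for tokens, _ in rows)
    if total_tokens == 0:
        raise ValueError("no generated tokens to average over")
    return sum(tokens * tps for tokens, tps in rows) / total_tokens


def plain_mean(rows):
    """Unweighted mean of gen_tps, for comparison."""
    return sum(tps for _, tps in rows) / len(rows)


weighted = weighted_gen_tps(RUN_A_ROWS)
mean = plain_mean(RUN_A_ROWS)
print(f"weighted: {weighted:.2f} tok/s, plain mean: {mean:.2f} tok/s")
```

On this subset the weighted value sits close to the speed of the 3,810-token task, because that single task contributes most of the generated tokens.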
Representative Task Results (Run A: exps=CPU)
| task ID | prompt tokens | prompt tps | gen tokens | gen tps | total ms |
|---|---|---|---|---|---|
| 82 | 100 | 291.93 | 66 | 59.14 | 1,458.59 |
| 256 | 3,757 | 1,045.89 | 141 | 59.71 | 5,953.40 |
| 652 | 1,594 | 761.22 | 67 | 58.94 | 3,230.68 |
| 922 | 908 | 678.81 | 180 | 58.52 | 4,413.36 |
| 2,663 | 2,284 | 961.73 | 3,810 | 58.80 | 67,175.38 |
Representative Task Results (Run B: exps=CPU disabled)
| task ID | prompt tokens | prompt tps | gen tokens | gen tps | total ms |
|---|---|---|---|---|---|
| 21 | 21 | 356.71 | 471 | 85.75 | 5,551.35 |
| 1,718 | 76 | 880.41 | 1,036 | 85.41 | 12,215.63 |
| 3,539 | 647 | 2,561.14 | 1,421 | 85.53 | 16,866.17 |
| 4,961 | 155 | 1,397.33 | 2,007 | 85.28 | 23,646.46 |
| 9,638 | 30 | 513.45 | 475 | 85.64 | 5,605.00 |
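The columns in these tables appear to be related in a simple way: total time is close to prompt time plus generation time. The sketch below checks that reading against a few rows from both tables. That relation is an assumption inferred from the numbers, not something the server documents here, and the helper name is illustrative.

```python
# (run, task, prompt_tokens, prompt_tps, gen_tokens, gen_tps, total_ms)
# Rows copied from the two representative tables above.
ROWS = [
    ("A", 82, 100, 291.93, 66, 59.14, 1458.59),
    ("A", 2663, 2284, 961.73, 3810, 58.80, 67175.38),
    ("B", 21, 21, 356.71, 471, 85.75, 5551.35),
    ("B", 4961, 155, 1397.33, 2007, 85.28, 23646.46),
    ("B", 9638, 30, 513.45, 475, 85.64, 5605.00),
]


def split_times_ms(prompt_tokens, prompt_tps, gen_tokens, gen_tps):
    """Return (prompt_ms, gen_ms) implied by token counts and speeds."""
    prompt_ms = 1000.0 * prompt_tokens / prompt_tps
    gen_ms = 1000.0 * gen_tokens / gen_tps
    return prompt_ms, gen_ms


for run, task, p_tok, p_tps, g_tok, g_tps, total_ms in ROWS:
    prompt_ms, gen_ms = split_times_ms(p_tok, p_tps, g_tok, g_tps)
    estimate = prompt_ms + gen_ms
    rel_err = abs(estimate - total_ms) / total_ms
    gen_share = gen_ms / estimate
    print(
        f"run {run} task {task}: estimate {estimate:.0f} ms, "
        f"reported {total_ms:.0f} ms, rel. error {rel_err:.4%}, "
        f"generation share {gen_share:.0%}"
    )
```

For the long-output tasks, generation accounts for nearly all of the wall time, which is why gen tps is the figure that matters most for coding sessions.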
Config C: nvfp4 GPU (vLLM, ctx=32K)
| Metric | Measured |
|---|---|
| TG (logged range) | 17-100 tok/s (high variance) |
| PP | 17-669 tok/s (burst) |
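The wide ranges are an artifact of vLLM’s rolling-average throughput logs, which can dip whenever a logging window includes idle time or prompt processing. Stable generation runs at 58-100 tok/s.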
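One plausible way such low readings arise is a generation that only partly overlaps a logging window. The sketch below models that with hypothetical numbers: a fixed 10-second logging interval and a constant 80 tok/s generation are assumptions chosen for illustration, not measured values or confirmed vLLM settings.

```python
def interval_averages(start_s, duration_s, rate_tps, interval_s=10.0, n_intervals=3):
    """Average tok/s each logging window would report for one
    constant-rate generation that starts at start_s."""
    end_s = start_s + duration_s
    averages = []
    for i in range(n_intervals):
        win_start = i * interval_s
        win_end = win_start + interval_s
        # Seconds of this window during which tokens were being generated.
        overlap = max(0.0, min(end_s, win_end) - max(start_s, win_start))
        averages.append(rate_tps * overlap / interval_s)
    return averages


# Hypothetical request: starts 7.5 s into the first window, runs for 15 s.
print(interval_averages(start_s=7.5, duration_s=15.0, rate_tps=80.0))
```

In this toy case the first and last windows report a quarter of the true speed even though the generation rate never changed, so the middle window is the one to read as the stable figure.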
Comparison Summary
| Config | TG(tok/s) | VRAM | Quality | Use Case |
|---|---|---|---|---|
| A: BF16 CPU | 7.59 | 0 | Highest | Precision review/audit |
| B: Hybrid exps=CPU | 58.94 | ~3GB | Good | Normal coding |
| B: Hybrid all-GPU | 85.36 | ~43GB | Good | Fast coding |
| C: nvfp4 GPU | 58-100 | ~22GB | Good | vLLM integration |
Live Demonstration
To verify the output quality of Qwen3-Coder-Next in action, we recorded real-time execution demonstrating the model’s coding capability:
Video link: https://www.youtube.com/watch?v=Hm8e7864Fcw
This demonstration shows:
- Real-time token generation proving the model is actually executing
- Verification of the actual output content and quality
Analysis
BF16 CPU Value Proposition
7.59 tok/s is slow for chat but sufficient for code review and security auditing. BF16 has zero quantization degradation, making it trustworthy for SQLi vulnerability detection, plaintext password risk identification, and similar precision-critical tasks.
The 12-channel DDR5-6400 bandwidth supports BF16’s massive data movement, with Zen 5’s native AVX-512 BF16 instructions contributing meaningfully.
Expert Offload 31% Penalty
exps=CPU saves 40GB+ VRAM but costs 31% generation speed. graph_splits jumps from 2 to 98, indicating frequent CPU-GPU data transfers for every expert computation.
59 tok/s is still fast, but when VRAM allows, full GPU loading is clearly superior. Expert Offload is a “VRAM compromise” strategy, not well-suited as a speed optimization method.
Under this measurement setup (A3B, n_ctx=65536, KV f16, n_batch=2048, -ngl 99), exps=CPU is not a speed feature. It is a VRAM-conservation mode that trades throughput for memory relief. On a machine where the weights fit comfortably in GPU memory, it is the wrong default.
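The trade can be put in numbers using only the measured figures above. This is a small sketch with illustrative names; note that the same gap reads as about -31% when all-GPU is the baseline and about +45% when exps=CPU is the baseline.

```python
def relative_change(baseline, measured):
    """Signed fractional change of measured relative to baseline."""
    return (measured - baseline) / baseline


all_gpu_tps = 85.36    # Run B weighted average
exps_cpu_tps = 58.94   # Run A weighted average

penalty = relative_change(all_gpu_tps, exps_cpu_tps)
speedup = all_gpu_tps / exps_cpu_tps

# GPU weight buffers in MiB from the Run B and Run A tables.
vram_saved_mib = 42_875 - 1_403

print(f"exps=CPU vs all-GPU: {penalty:.1%}")
print(f"all-GPU vs exps=CPU: {speedup:.2f}x")
print(f"GPU weight memory freed by exps=CPU: {vram_saved_mib} MiB")
```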
Why It Gets Slower: Structural Analysis
- Transfer and graph-splitting overhead: graph_splits=98 indicates a heavily fragmented compute graph with repeated GPU/CPU synchronization. With the MoE expert weights on the CPU, the surrounding computation becomes finely interleaved across the two devices.
- No benefit when VRAM is ample: Run B holds 42,875 MiB (about 41.9 GiB) of GPU weights. On a 96GB card, the VRAM-relief argument for exps=CPU does not apply. The configuration pays the CPU and transfer cost without a stability win.
- KV f16 plus large context: With n_ctx=65536 and KV f16, exps=CPU does not reduce KV cost. It adds boundary complexity without addressing the largest pressure point.
Coding Quality
BF16 evaluation:
- Security audit: Correctly identified SQLi vulnerabilities and plaintext password risks, providing fixes with unit tests
- Hallucination control: Properly refused out-of-spec questions (“NOT IN SPEC”)
- Complex logic: Met 90% of constraint-heavy Django requirements but missed some multi-tenant safety nuances. Best as a high-quality draft generator + expert reviewer
SWA Cache Invalidation (Qwen3-Next-80B-A3B-Thinking)
When running Qwen3-Next-80B-A3B-Thinking (Q4_K_M) on CPU-only inference, full prompt re-processing was triggered by Sliding Window Attention (SWA) or hybrid/recurrent memory architecture behavior.
slot update_slots: id 1 | task 6686 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 1 | task 6686 | erased invalidated context checkpoint (pos_min = 223, pos_max = 223, n_swa = 1, size = 75.376 MiB)
An invalidated context checkpoint of approximately 75 MiB was erased and the prompt was fully re-processed. This behavior was distinct from my previous tests with Qwen3-Coder-Next IQ4_NL on GPU offload. Worth noting as a caveat when running SWA-based architectures in CPU inference mode.
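To see how often this happens over a longer session, the server log can be scanned for these two messages. The sketch below is a minimal parser whose patterns are written against the two log lines quoted above; other llama.cpp versions may word them differently, so the regular expressions and the `scan` helper should be treated as assumptions.

```python
import re

REPROCESS_RE = re.compile(r"task (\d+) \| forcing full prompt re-processing")
CHECKPOINT_RE = re.compile(
    r"task (\d+) \| erased invalidated context checkpoint "
    r"\(pos_min = (\d+), pos_max = (\d+), n_swa = (\d+), size = ([\d.]+) MiB\)"
)


def scan(lines):
    """Collect re-processing and checkpoint-erase events from log lines."""
    events = []
    for line in lines:
        m = REPROCESS_RE.search(line)
        if m:
            events.append({"task": int(m.group(1)), "event": "full_reprocess"})
            continue
        m = CHECKPOINT_RE.search(line)
        if m:
            events.append({
                "task": int(m.group(1)),
                "event": "checkpoint_erased",
                "n_swa": int(m.group(4)),
                "size_mib": float(m.group(5)),
            })
    return events


# The two log lines quoted in this section.
SAMPLE = [
    "slot update_slots: id 1 | task 6686 | forcing full prompt re-processing "
    "due to lack of cache data (likely due to SWA or hybrid/recurrent memory, "
    "see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)",
    "slot update_slots: id 1 | task 6686 | erased invalidated context checkpoint "
    "(pos_min = 223, pos_max = 223, n_swa = 1, size = 75.376 MiB)",
]

events = scan(SAMPLE)
for event in events:
    print(event)
```

Counting `full_reprocess` events per session gives a rough sense of how much prompt processing is being repeated.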
Lessons Learned
Three-mode selection is now clear:
- Precision work (BF16 CPU): Security audits, final reviews. Speed sacrificed for zero quantization loss
- Normal coding (Hybrid / GPU): Daily development. 59-85 tok/s feels responsive
- vLLM integration (nvfp4): When tool-call-parser, prefix caching, or API integration is needed
The 31% Expert Offload penalty was larger than expected. Confirmed by measurement: it is a “use when VRAM is insufficient” setting.
Reproduction Steps
BF16 CPU
podman run --rm -it \
-p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
-v /mnt/data/hf/hub/models--unsloth--Qwen3-Coder-Next-GGUF:/models:Z \
compute.home.arpa/llamacpp-zen5:qwen3-coder-next \
-m /models/snapshots/.../BF16/Qwen3-Coder-Next-BF16-00001-of-00004.gguf \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on --ctx-size 16384 \
--parallel 1 --threads 13 --threads-batch 13 \
--batch-size 2048 --ubatch-size 512 \
--jinja --host 0.0.0.0 --port 8080
IQ4_NL Hybrid (Expert CPU offload)
IMG=compute.home.arpa/ik_llama-cuda
MO=/mnt/data/hf/hub/models--ubergarm--Qwen3-Coder-Next-GGUF
MODEL=/models/snapshots/.../Qwen3-Coder-Next-IQ4_KSS.gguf
podman run --rm -it --device nvidia.com/gpu=all \
-p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
-v "$MO":/models:ro,Z $IMG \
--host 0.0.0.0 --port 8080 -m "$MODEL" \
-c 65536 --threads 13 --threads-batch 23 \
-b 2048 -ub 2048 -ngl 99 \
-ot exps=CPU -fa on --no-mmap --jinja
IQ4_NL Full GPU (Expert on GPU)
podman run --rm -it --device nvidia.com/gpu=all \
-p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
-v "$MO":/models:ro,Z $IMG \
--host 0.0.0.0 --port 8080 -m "$MODEL" \
-c 65536 --threads 13 --threads-batch 23 \
-b 2048 -ub 2048 -ngl 99 \
-fa on --no-mmap --jinja
nvfp4 GPU (vLLM)
podman run --rm --device nvidia.com/gpu=all \
--security-opt seccomp=unconfined --cap-add SYS_NICE --shm-size=16g \
-v /mnt/data/hf:/data/hf:Z \
-v /opt/containers/runtime/vllm/data/gpu_cache:/data/cache:Z \
-p 8000:8000 \
-e HF_HOME=/data/hf -e HF_DATASETS_CACHE=/data/hf \
-e VLLM_CACHE_ROOT=/data/cache -e HF_HUB_OFFLINE=1 \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 \
compute.home.arpa/vllm-gpu:nightly vincentzed-hf/Qwen3-Coder-Next-NVFP4 \
--dtype auto --gpu-memory-utilization 0.88 \
--max-num-seqs 1 --max-model-len 32768 \
--enable-prefix-caching --trust-remote-code \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--reasoning-parser qwen3 --served-model-name qwen3-coder-next-nvfp4
Qwen3-Next-80B-A3B-Thinking (CPU inference)
podman run --rm \
-p 8081:8080 --shm-size 1g \
-v /opt/containers/runtime/llamacpp/data/models:/models:Z \
compute.home.arpa/llamacpp-zen5:latest \
-m /models/Qwen3-Next-80B-A3B-Thinking-Q4_K_M.gguf \
--ctx-size 16384 --threads 15 \
--jinja --reasoning-budget 512 \
--host 0.0.0.0 --port 8080
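Once one of the llama.cpp-based servers above is running, per-request speeds can be read from the response itself. The sketch below assumes the server exposes the llama.cpp `/completion` endpoint and returns a `timings` object with `prompt_per_second`, `predicted_per_second`, and `predicted_n`; that holds for mainline llama.cpp as far as I know, but the ik_llama.cpp build may differ. The port 8001 matches the hybrid commands above, and the helper names are illustrative.

```python
import json
import urllib.request


def build_request(prompt, n_predict=128, base_url="http://localhost:8001"):
    """Build a POST request for the llama.cpp /completion endpoint."""
    payload = {"prompt": prompt, "n_predict": n_predict}
    return urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_speeds(response_body):
    """Pull the speed fields out of a parsed response (assumed field names)."""
    timings = response_body.get("timings", {})
    return {
        "prompt_tps": timings.get("prompt_per_second"),
        "gen_tps": timings.get("predicted_per_second"),
        "gen_tokens": timings.get("predicted_n"),
    }


def measure(prompt, n_predict=128, base_url="http://localhost:8001"):
    """Send one request to a running server and return its speeds."""
    with urllib.request.urlopen(build_request(prompt, n_predict, base_url)) as resp:
        return extract_speeds(json.load(resp))
```

Calling `measure("Write a binary search in Python.", n_predict=256)` against a live server returns one task's `gen_tps` and `gen_tokens`, which are the two inputs the weighted average needs.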
Technical Notes
256K Context Configuration
IQ4_NL can be run with ctx=262144, but combining Expert Offload with a 256K context creates a massive KV cache. The practical upper limit on this setup is ctx=65536.
graph_splits Meaning
graph_splits is a proxy for how often execution crosses the CPU-GPU boundary. exps=CPU yields 98 (execution returns to the CPU for each expert layer), while all-GPU yields 2 (input/output only). This difference maps directly to the speed difference.
Operational Guidelines
- On a 96GB-class GPU, keep exps=CPU disabled by default
- Use exps=CPU only as an escape hatch when VRAM becomes the limiting factor
- The next gains are in KV quantization and context tuning, not in CPU expert offload
- Reserve BF16 CPU inference as a precision-first background processing lane
