Background

Qwen3-Coder-Next is an ~80B parameter MoE model specialized for coding. Strong in code generation, review, and security auditing, it’s a candidate for local coding assistant deployment.

The problem is choosing among execution options: BF16 unquantized CPU inference, IQ4_NL hybrid GPU offload (experts on CPU), and nvfp4 full-GPU inference. Each has different speed/quality/VRAM characteristics, so all three were measured on the same hardware to establish per-use-case selection criteria.

Objective

  1. Confirm BF16 CPU inference speed and quality (maximum precision mode)
  2. Quantify Expert Offload speed penalty with IQ4_NL Hybrid
  3. Measure nvfp4 GPU throughput
  4. Evaluate coding task quality

Test Environment

Item      Specification
CPU       AMD EPYC 9175F (Zen 5, 16C, 512 MB L3)
GPU       NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB VRAM)
Memory    DDR5-6400, 768 GB (12 channels)
OS        Ubuntu 24.04 LTS

Three Configurations

Config             Runtime       Quantization        Expert placement        ctx
A: CPU BF16        llama.cpp     BF16 (unquantized)  All CPU                 16K
B: Hybrid offload  ik_llama.cpp  IQ4_NL              Experts=CPU, Attn=GPU   65K
C: GPU nvfp4       vLLM          nvfp4               All GPU                 32K

Results

Config A: BF16 CPU (th=13, ctx=16K)

Metric              Measured
PP (short prompt)   33.37 tok/s
PP (287 tokens)     117.40 tok/s
TG (sustained)      7.59 tok/s
TTFT (287 tokens)   ~2.58 s

Throughput stayed consistent through a 2,233-token generation. KV cache at q8_0 to manage memory pressure.
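The TTFT figure is consistent with the PP and TG rates above: prompt processing time plus the emission of the first token. A quick sanity check (pure arithmetic, no model needed):

```python
# Sanity-check TTFT from the measured PP and TG rates.
prompt_tokens = 287
pp_rate = 117.40   # tok/s, prompt processing
tg_rate = 7.59     # tok/s, token generation

# TTFT ~= time to process the prompt + time to emit the first token
ttft_est = prompt_tokens / pp_rate + 1 / tg_rate
print(f"estimated TTFT: {ttft_est:.2f}s")  # ~2.58s, matching the measurement
```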

Config B: IQ4_NL Hybrid (Expert CPU offload, ctx=65K)

Run A: exps=CPU (Expert weights on CPU)

Metric                 Measured
GPU buffer (weights)   1,403 MiB
CPU buffer (weights)   41,472 MiB
graph_splits           98
TG (weighted avg)      58.94 tok/s
PP (representative)    761-1,120 tok/s

Run B: No exps=CPU (All weights on GPU)

Metric                 Measured
GPU buffer (weights)   42,875 MiB
graph_splits           2
TG (weighted avg)      85.36 tok/s
PP (representative)    880-3,572 tok/s

Expert Offload speed penalty: -31% (85.36 → 58.94 tok/s)
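The penalty figure follows directly from the two weighted averages:

```python
# Relative TG slowdown when expert weights live on the CPU.
tg_all_gpu = 85.36   # tok/s, all weights on GPU
tg_exps_cpu = 58.94  # tok/s, expert weights offloaded to CPU

penalty = 1 - tg_exps_cpu / tg_all_gpu
print(f"Expert Offload penalty: {penalty:.1%}")  # prints "Expert Offload penalty: 31.0%"
```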

Config C: nvfp4 GPU (vLLM, ctx=32K)

Metric        Measured
TG (logged)   17-100 tok/s (high variance)
PP (logged)   17-669 tok/s (burst)

Values fluctuate because vLLM logs rolling averages over short windows; stable generation runs at 58-100 tok/s.

Comparison Summary

Config              TG (tok/s)  VRAM    Quality   Use case
A: BF16 CPU         7.59        0       Highest   Precision review/audit
B: Hybrid exps=CPU  58.94       ~3 GB   Good      Normal coding
B: Hybrid all-GPU   85.36       ~43 GB  Good      Fast coding
C: nvfp4 GPU        58-100      ~22 GB  Good      vLLM integration

Live Demonstration

To verify the output quality of Qwen3-Coder-Next in action, we recorded real-time execution demonstrating the model’s coding capability. The video below shows the model generating code and explaining its logic:

This demonstration shows:

  • Real-time token generation proving the model is actually executing
  • Verification of the actual output content and quality
  • Live streaming of inference with visible token-by-token progression
  • Proof that CPU/GPU inference is functioning as expected

Analysis

BF16 CPU Value Proposition

7.59 tok/s is slow for chat but sufficient for code review and security auditing. BF16 has zero quantization degradation, making it trustworthy for SQLi vulnerability detection, plaintext password risk identification, and similar precision-critical tasks.
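To make the audit target concrete, here is an illustrative example of the SQLi vulnerability class and its fix (the table and values are hypothetical, not from the actual test set):

```python
import sqlite3

# Hypothetical schema for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, pw_hash TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'h1'), ('bob', 'h2')")

user = "alice' OR '1'='1"  # attacker-controlled input

# Vulnerable: string interpolation lets the quote escape the literal
vulnerable = f"SELECT * FROM users WHERE name = '{user}'"
leaked = conn.execute(vulnerable).fetchall()
print(len(leaked))  # 2: every row leaks

# Fixed: a parameterized query treats the whole input as data
safe = conn.execute("SELECT * FROM users WHERE name = ?", (user,)).fetchall()
print(safe)  # []: no user has that literal name
```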

The 12-channel DDR5-6400 bandwidth supports BF16’s massive data movement, with Zen 5’s native AVX-512 BF16 instructions contributing meaningfully.
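For scale, a rough decode ceiling can be computed, assuming roughly 3B active parameters per token (the "A3B" class of Qwen3-Next MoE models; treat this figure as an assumption) and a decode phase limited purely by DRAM bandwidth:

```python
# Back-of-envelope decode ceiling for BF16 on CPU.
# Assumptions (not from the measurements): ~3B active params/token,
# 2 bytes each (BF16), decode limited only by peak DRAM bandwidth.
channels, bytes_per_transfer, mt_per_s = 12, 8, 6400e6
bandwidth = channels * bytes_per_transfer * mt_per_s  # ~614 GB/s peak
active_bytes = 3e9 * 2                                # BF16 active weights per token
ceiling = bandwidth / active_bytes                    # ~102 tok/s theoretical
print(f"peak: {bandwidth/1e9:.0f} GB/s, decode ceiling: {ceiling:.0f} tok/s")
```

The measured 7.59 tok/s sits far below this idealized ceiling, which suggests CPU compute and cache behavior, not raw DRAM bandwidth alone, set the pace here.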

Expert Offload 31% Penalty

exps=CPU saves 40GB+ VRAM but costs 31% generation speed. graph_splits jumps from 2 to 98, indicating frequent CPU-GPU data transfers for every expert computation.

59 tok/s is still fast, but when VRAM allows, full GPU loading is clearly superior. Expert Offload is a "VRAM compromise" strategy, not a speed optimization.

Coding Quality

BF16 evaluation:

  • Security audit: Correctly identified SQLi vulnerabilities and plaintext password risks, providing fixes with unit tests
  • Hallucination control: Properly refused out-of-spec questions (“NOT IN SPEC”)
  • Complex logic: Met 90% of constraint-heavy Django requirements but missed some multi-tenant safety nuances. Best as a high-quality draft generator + expert reviewer

Lessons Learned

Three-mode selection is now clear:

  • Precision work (BF16 CPU): Security audits, final reviews. Speed sacrificed for zero quantization loss
  • Normal coding (Hybrid / GPU): Daily development. 59-85 tok/s feels responsive
  • vLLM integration (nvfp4): When tool-call-parser, prefix caching, or API integration is needed
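The three-mode selection above can be sketched as a small helper (thresholds are illustrative values taken from the comparison table, not a library API):

```python
def pick_config(task: str, vram_gb: float) -> str:
    """Map a task and available VRAM to one of the three measured setups."""
    if task in {"security_audit", "final_review"}:
        # Precision work: zero quantization loss trumps speed
        return "A: BF16 CPU (7.59 tok/s, zero quantization loss)"
    if task == "vllm_integration":
        # tool-call-parser, prefix caching, API integration
        return "C: nvfp4 GPU (~22 GB VRAM)"
    # Daily coding: prefer all-GPU when the ~43 GB weight buffer fits
    if vram_gb >= 43:
        return "B: Hybrid all-GPU (85.36 tok/s)"
    return "B: Hybrid exps=CPU (58.94 tok/s, ~3 GB VRAM)"

print(pick_config("coding", vram_gb=96))
```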

The 31% Expert Offload penalty was larger than expected. Confirmed by measurement: it’s a “use when VRAM is insufficient” setting.

Reproduction Steps

BF16 CPU

  podman run --rm -it \
  -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:Z compute.home.arpa/llamacpp-zen5:qwen3-coder-next \
  -m /models/.../BF16/Qwen3-Coder-Next-BF16-00001-of-00004.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on --ctx-size 16384 \
  --parallel 1 --threads 13 --threads-batch 13 \
  --batch-size 2048 --ubatch-size 512 \
  --jinja --host 0.0.0.0 --port 8080
  

IQ4_NL Hybrid (Expert CPU offload)

  podman run --rm -it --device nvidia.com/gpu=all \
  -p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 -m "$MODEL" \
  -c 65536 --threads 13 --threads-batch 23 \
  -b 2048 -ub 2048 -ngl 99 \
  -ot exps=CPU -fa on --no-mmap --jinja
  

nvfp4 GPU (vLLM)

  vllm serve vincentzed-hf/Qwen3-Coder-Next-NVFP4 \
  --dtype auto --gpu-memory-utilization 0.88 \
  --max-num-seqs 1 --max-model-len 32768 \
  --enable-prefix-caching --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder
  

Technical Notes

256K Context Configuration

IQ4_NL can be set to ctx=262144, but Expert Offload combined with 256K context produces a massive KV cache. The practical upper limit is ctx=65536.

graph_splits Meaning

graph_splits indicates how often the compute graph crosses the CPU-GPU boundary. exps=CPU yields 98 (activations return to the CPU at each expert layer); all-GPU yields 2 (input/output only). This difference maps directly to the speed gap.
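A back-of-envelope check of what each extra split costs per decoded token, derived from the two TG figures (this attributes the entire slowdown to split overhead, which is a simplifying assumption):

```python
# Extra per-token latency with exps=CPU, spread over the extra graph splits.
tg_all_gpu, tg_exps_cpu = 85.36, 58.94   # tok/s
splits_all_gpu, splits_exps_cpu = 2, 98

extra_latency = 1 / tg_exps_cpu - 1 / tg_all_gpu          # seconds per token
per_split = extra_latency / (splits_exps_cpu - splits_all_gpu)
print(f"extra: {extra_latency*1e3:.2f} ms/token, ~{per_split*1e6:.0f} us per split")
```

The result, roughly 5 ms of extra latency per token (about 55 µs per extra split), is consistent with per-boundary synchronization and transfer overhead being the dominant cost.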