Background

Qwen3-Coder-Next is an ~80B parameter MoE model specialized for coding. Strong in code generation, review, and security auditing, it’s a candidate for local coding assistant deployment.

The problem is choosing among execution options: BF16 unquantized CPU inference, IQ4_NL hybrid GPU offload (experts on CPU), and nvfp4 full-GPU inference. Each has different speed/quality/VRAM characteristics, so all three were measured on the same hardware to establish per-use-case selection criteria.

Objective

  1. Confirm BF16 CPU inference speed and quality (maximum precision mode)
  2. Quantify Expert Offload speed penalty with IQ4_NL Hybrid
  3. Measure nvfp4 GPU throughput
  4. Evaluate coding task quality

Test Environment

Item      Specification
CPU       AMD EPYC 9175F (Zen 5, 16C, 512 MB L3)
GPU       NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB VRAM)
Memory    DDR5-6400, 768 GB (12 channels)
OS        Ubuntu 24.04 LTS

Three Configurations

Config             Runtime       Quantization        Expert placement        ctx
A: CPU BF16        llama.cpp     BF16 (unquantized)  All CPU                 16K
B: Hybrid offload  ik_llama.cpp  IQ4_NL              Experts=CPU, Attn=GPU   65K
C: GPU nvfp4       vLLM          nvfp4               All GPU                 32K

Results

Config A: BF16 CPU (th=13, ctx=16K)

Metric              Measured
PP (short prompt)   33.37 tok/s
PP (287 tokens)     117.40 tok/s
TG (sustained)      7.59 tok/s
TTFT (287 tokens)   ~2.58 s

Throughput stayed consistent through a 2,233-token generation. KV cache at q8_0 to manage memory pressure.
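The TTFT figure is consistent with the PP and TG rates above: prompt processing time plus the emission of the first token. A quick sanity check (pure arithmetic, no model needed):

```python
# Sanity-check TTFT from the measured PP and TG rates.
prompt_tokens = 287
pp_rate = 117.40   # tok/s, prompt processing
tg_rate = 7.59     # tok/s, token generation

# TTFT ~= time to process the prompt + time to emit the first token
ttft_est = prompt_tokens / pp_rate + 1 / tg_rate
print(f"estimated TTFT: {ttft_est:.2f}s")  # ~2.58s, matching the measurement
```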

Config B: IQ4_NL Hybrid (Expert CPU offload, ctx=65K)

Run A: exps=CPU (Expert weights on CPU)

Metric                 Measured
GPU buffer (weights)   1,403 MiB
CPU buffer (weights)   41,472 MiB
graph_splits           98
TG (weighted avg)      58.94 tok/s
PP (representative)    761-1,120 tok/s

Run B: No exps=CPU (All weights on GPU)

Metric                 Measured
GPU buffer (weights)   42,875 MiB
graph_splits           2
TG (weighted avg)      85.36 tok/s
PP (representative)    880-3,572 tok/s

Expert Offload speed penalty: -31% (85.36 → 58.94 tok/s)
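The penalty figure follows directly from the two weighted averages:

```python
# Relative TG slowdown when expert weights live on the CPU.
tg_all_gpu = 85.36   # tok/s, all weights on GPU
tg_exps_cpu = 58.94  # tok/s, expert weights offloaded to CPU

penalty = 1 - tg_exps_cpu / tg_all_gpu
print(f"Expert Offload penalty: {penalty:.1%}")  # prints "Expert Offload penalty: 31.0%"
```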

Config C: nvfp4 GPU (vLLM, ctx=32K)

Metric        Measured
TG (logged)   17-100 tok/s (high variance)
PP (logged)   17-669 tok/s (burst)

Values fluctuate because vLLM logs rolling averages over short windows; stable generation runs at 58-100 tok/s.

Comparison Summary

Config              TG (tok/s)  VRAM    Quality   Use case
A: BF16 CPU         7.59        0       Highest   Precision review/audit
B: Hybrid exps=CPU  58.94       ~3 GB   Good      Normal coding
B: Hybrid all-GPU   85.36       ~43 GB  Good      Fast coding
C: nvfp4 GPU        58-100      ~22 GB  Good      vLLM integration

Live Demonstration

To verify the output quality of Qwen3-Coder-Next in action, we recorded real-time execution demonstrating the model’s coding capability. The video below shows the model generating code and explaining its logic:

This demonstration shows:

  • Real-time token generation proving the model is actually executing
  • Verification of the actual output content and quality
  • Live streaming of inference with visible token-by-token progression
  • Proof that CPU/GPU inference is functioning as expected

Analysis

BF16 CPU Value Proposition

7.59 tok/s is slow for chat but sufficient for code review and security auditing. BF16 has zero quantization degradation, making it trustworthy for SQLi vulnerability detection, plaintext password risk identification, and similar precision-critical tasks.
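To make the audit target concrete, here is an illustrative example of the SQLi vulnerability class and its fix (the table and values are hypothetical, not from the actual test set):

```python
import sqlite3

# Hypothetical schema for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, pw_hash TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'h1'), ('bob', 'h2')")

user = "alice' OR '1'='1"  # attacker-controlled input

# Vulnerable: string interpolation lets the quote escape the literal
vulnerable = f"SELECT * FROM users WHERE name = '{user}'"
leaked = conn.execute(vulnerable).fetchall()
print(len(leaked))  # 2: every row leaks

# Fixed: a parameterized query treats the whole input as data
safe = conn.execute("SELECT * FROM users WHERE name = ?", (user,)).fetchall()
print(safe)  # []: no user has that literal name
```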

The 12-channel DDR5-6400 bandwidth supports BF16’s massive data movement, with Zen 5’s native AVX-512 BF16 instructions contributing meaningfully.
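For scale, a rough decode ceiling can be computed, assuming roughly 3B active parameters per token (the "A3B" class of Qwen3-Next MoE models; treat this figure as an assumption) and a decode phase limited purely by DRAM bandwidth:

```python
# Back-of-envelope decode ceiling for BF16 on CPU.
# Assumptions (not from the measurements): ~3B active params/token,
# 2 bytes each (BF16), decode limited only by peak DRAM bandwidth.
channels, bytes_per_transfer, mt_per_s = 12, 8, 6400e6
bandwidth = channels * bytes_per_transfer * mt_per_s  # ~614 GB/s peak
active_bytes = 3e9 * 2                                # BF16 active weights per token
ceiling = bandwidth / active_bytes                    # ~102 tok/s theoretical
print(f"peak: {bandwidth/1e9:.0f} GB/s, decode ceiling: {ceiling:.0f} tok/s")
```

The measured 7.59 tok/s sits far below this idealized ceiling, which suggests CPU compute and cache behavior, not raw DRAM bandwidth alone, set the pace here.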

Expert Offload 31% Penalty

exps=CPU saves 40GB+ VRAM but costs 31% generation speed. graph_splits jumps from 2 to 98, indicating frequent CPU-GPU data transfers for every expert computation.

59 tok/s is still fast, but when VRAM allows, full GPU loading is clearly superior. Expert Offload is a "VRAM compromise" strategy, not a speed optimization.

Coding Quality

BF16 evaluation:

  • Security audit: Correctly identified SQLi vulnerabilities and plaintext password risks, providing fixes with unit tests
  • Hallucination control: Properly refused out-of-spec questions (“NOT IN SPEC”)
  • Complex logic: Met 90% of constraint-heavy Django requirements but missed some multi-tenant safety nuances. Best as a high-quality draft generator + expert reviewer

Lessons Learned

Three-mode selection is now clear:

  • Precision work (BF16 CPU): Security audits, final reviews. Speed sacrificed for zero quantization loss
  • Normal coding (Hybrid / GPU): Daily development. 59-85 tok/s feels responsive
  • vLLM integration (nvfp4): When tool-call-parser, prefix caching, or API integration is needed
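The three-mode selection above can be sketched as a small helper (thresholds are illustrative values taken from the comparison table, not a library API):

```python
def pick_config(task: str, vram_gb: float) -> str:
    """Map a task and available VRAM to one of the three measured setups."""
    if task in {"security_audit", "final_review"}:
        # Precision work: zero quantization loss trumps speed
        return "A: BF16 CPU (7.59 tok/s, zero quantization loss)"
    if task == "vllm_integration":
        # tool-call-parser, prefix caching, API integration
        return "C: nvfp4 GPU (~22 GB VRAM)"
    # Daily coding: prefer all-GPU when the ~43 GB weight buffer fits
    if vram_gb >= 43:
        return "B: Hybrid all-GPU (85.36 tok/s)"
    return "B: Hybrid exps=CPU (58.94 tok/s, ~3 GB VRAM)"

print(pick_config("coding", vram_gb=96))
```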

The 31% Expert Offload penalty was larger than expected. Confirmed by measurement: it’s a “use when VRAM is insufficient” setting.

Reproduction Steps

BF16 CPU

  podman run --rm -it \
  -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:Z compute.home.arpa/llamacpp-zen5:qwen3-coder-next \
  -m /models/.../BF16/Qwen3-Coder-Next-BF16-00001-of-00004.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on --ctx-size 16384 \
  --parallel 1 --threads 13 --threads-batch 13 \
  --batch-size 2048 --ubatch-size 512 \
  --jinja --host 0.0.0.0 --port 8080
  

IQ4_NL Hybrid (Expert CPU offload)

  podman run --rm -it --device nvidia.com/gpu=all \
  -p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 -m "$MODEL" \
  -c 65536 --threads 13 --threads-batch 23 \
  -b 2048 -ub 2048 -ngl 99 \
  -ot exps=CPU -fa on --no-mmap --jinja
  

nvfp4 GPU (vLLM)

  vllm serve vincentzed-hf/Qwen3-Coder-Next-NVFP4 \
  --dtype auto --gpu-memory-utilization 0.88 \
  --max-num-seqs 1 --max-model-len 32768 \
  --enable-prefix-caching --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder
  

Technical Notes

256K Context Configuration

IQ4_NL can be set to ctx=262144, but Expert Offload combined with 256K context produces a massive KV cache. The practical upper limit is ctx=65536.

graph_splits Meaning

graph_splits indicates how often the compute graph crosses the CPU-GPU boundary. exps=CPU yields 98 (activations return to the CPU at each expert layer); all-GPU yields 2 (input/output only). This difference maps directly to the speed gap.
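A back-of-envelope check of what each extra split costs per decoded token, derived from the two TG figures (this attributes the entire slowdown to split overhead, which is a simplifying assumption):

```python
# Extra per-token latency with exps=CPU, spread over the extra graph splits.
tg_all_gpu, tg_exps_cpu = 85.36, 58.94   # tok/s
splits_all_gpu, splits_exps_cpu = 2, 98

extra_latency = 1 / tg_exps_cpu - 1 / tg_all_gpu          # seconds per token
per_split = extra_latency / (splits_exps_cpu - splits_all_gpu)
print(f"extra: {extra_latency*1e3:.2f} ms/token, ~{per_split*1e6:.0f} us per split")
```

The result, roughly 5 ms of extra latency per token (about 55 µs per extra split), is consistent with per-boundary synchronization and transfer overhead being the dominant cost.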