Qwen3-Coder-Next 80B in Three Modes: BF16 CPU / IQ4_NL Hybrid / nvfp4 GPU Measured

Qwen3-Coder-Next (~80B MoE) benchmarked across BF16 CPU inference (7.59 tok/s), IQ4_NL Hybrid GPU offload (59-85 tok/s), and nvfp4 GPU (17-100 tok/s). Quantifying Expert Offload speed penalty, SWA cache invalidation behavior, and coding quality evaluation.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

Background

Qwen3-Coder-Next is an ~80B parameter MoE model specialized for coding. Strong in code generation, review, and security auditing, it is a candidate for local coding assistant deployment.

The problem is too many execution options: BF16 unquantized CPU inference, IQ4_NL Hybrid GPU offload (experts on CPU), nvfp4 full GPU inference. Each has different speed/quality/VRAM characteristics. All three were measured on the same hardware to establish per-use-case selection criteria.

Objective

Confirm BF16 CPU inference speed and quality (maximum precision mode)
Quantify Expert Offload speed penalty with IQ4_NL Hybrid
Measure nvfp4 GPU throughput
Evaluate coding task quality
Document SWA cache invalidation behavior with Qwen3-Next-80B-A3B-Thinking (Q4_K_M)

Test Environment

Item	Specification
CPU	AMD EPYC 9175F (Zen 5, 16C, L3 512MB)
GPU	NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM)
Memory	DDR5-6400 768GB (12ch)
OS	Ubuntu 24.04 LTS

Three Configurations

Config	Runtime	Quantization	Expert Placement	ctx
A: CPU BF16	llama.cpp	BF16 (unquantized)	All CPU	16K
B: Hybrid offload	ik_llama.cpp	IQ4_NL	Expert=CPU, Attn=GPU	65K
C: GPU nvfp4	vLLM	nvfp4	All GPU	32K

Results

Config A: BF16 CPU (th=13, ctx=16K)

Metric	Measured
PP (short prompt)	33.37 tok/s
PP (287 tokens)	117.40 tok/s
TG (sustained)	7.59 tok/s
TTFT (287 tokens)	~2.58s

Throughput stayed consistent through a 2,233-token generation. KV cache at q8_0 to manage memory pressure.

I chose BF16 because I wanted the cleanest possible view of the model’s coding behavior without quantization noise. The 12-channel DDR5-6400 bandwidth supports BF16’s massive data movement, and Zen 5’s native AVX-512 BF16 instructions contribute meaningfully. At 80B scale, memory movement is the real story. The 12-channel platform is what makes the BF16 result credible.

Config B: IQ4_NL Hybrid (Expert CPU offload, ctx=65K)

Run A: exps=CPU (Expert weights on CPU)

Metric	Measured
GPU buffer (weights)	1,403 MiB
CPU buffer (weights)	41,472 MiB
graph_splits	98
TG (weighted avg)	58.94 tok/s
PP (representative)	761-1,120 tok/s

Run B: No exps=CPU (All weights on GPU)

Metric	Measured
GPU buffer (weights)	42,875 MiB
graph_splits	2
TG (weighted avg)	85.36 tok/s
PP (representative)	880-3,572 tok/s

Expert Offload speed penalty: -31% (58.94 -> 85.36 tok/s)

In Run A, most weights ended up on the CPU side while the GPU held only about 1.4 GiB. Even though GPU offload was enabled, the effective execution path crossed the GPU/CPU boundary far more aggressively than intended. graph_splits reached 98, meaning the compute graph was heavily fragmented with repeated synchronization and transfer across boundaries.

Run B was not just faster but also much more uniform, with most tasks clustering around 85 tok/s. GPU weights rose to 42.9 GiB while graph_splits fell to 2. That gap is big enough to matter in real coding sessions.

Weighted Average Methodology

The gen_tokens weighted average weights each task’s gen_tps by its gen_tokens count. Tasks that emit more output matter more, so the final average tracks long-form generation feel better than a plain mean.

  weighted_gen_tps = S(gen_tokens_i * gen_tps_i) / S(gen_tokens_i)

Representative Task Results (Run A: exps=CPU)

task	prompt tokens	prompt tps	gen tokens	gen tps	total ms
82	100	291.93	66	59.14	1,458.59
256	3,757	1,045.89	141	59.71	5,953.40
652	1,594	761.22	67	58.94	3,230.68
922	908	678.81	180	58.52	4,413.36
2,663	2,284	961.73	3,810	58.80	67,175.38

Representative Task Results (Run B: exps=CPU disabled)

task	prompt tokens	prompt tps	gen tokens	gen tps	total ms
21	21	356.71	471	85.75	5,551.35
1,718	76	880.41	1,036	85.41	12,215.63
3,539	647	2,561.14	1,421	85.53	16,866.17
4,961	155	1,397.33	2,007	85.28	23,646.46
9,638	30	513.45	475	85.64	5,605.00

Config C: nvfp4 GPU (vLLM, ctx=32K)

Metric	Measured
TG (stable)	17-100 tok/s (high variance)
PP	17-669 tok/s (burst)

Values fluctuate due to vLLM rolling average logs. Stable generation runs at 58-100 tok/s.

Comparison Summary

Config	TG(tok/s)	VRAM	Quality	Use Case
A: BF16 CPU	7.59	0	Highest	Precision review/audit
B: Hybrid exps=CPU	58.94	~3GB	Good	Normal coding
B: Hybrid all-GPU	85.36	~43GB	Good	Fast coding
C: nvfp4 GPU	58-100	~22GB	Good	vLLM integration

Live Demonstration

To verify the output quality of Qwen3-Coder-Next in action, we recorded real-time execution demonstrating the model’s coding capability:

Video link: https://www.youtube.com/watch?v=Hm8e7864Fcw

This demonstration shows:

Real-time token generation proving the model is actually executing
Verification of the actual output content and quality

Analysis

BF16 CPU Value Proposition

7.59 tok/s is slow for chat but sufficient for code review and security auditing. BF16 has zero quantization degradation, making it trustworthy for SQLi vulnerability detection, plaintext password risk identification, and similar precision-critical tasks.

The 12-channel DDR5-6400 bandwidth supports BF16’s massive data movement, with Zen 5’s native AVX-512 BF16 instructions contributing meaningfully.

Expert Offload 31% Penalty

exps=CPU saves 40GB+ VRAM but costs 31% generation speed. graph_splits jumps from 2 to 98, indicating frequent CPU-GPU data transfers for every expert computation.

59 tok/s is still fast, but when VRAM allows, full GPU loading is clearly superior. Expert Offload is a “VRAM compromise” strategy, not well-suited as a speed optimization method.

Under this measurement setup (A3B, n_ctx=65536, KV f16, n_batch=2048, -ngl 99), exps=CPU is not a speed feature. It is a VRAM-conservation mode that trades throughput for memory relief. On a machine where the weights fit comfortably in GPU memory, it is the wrong default.

Why It Gets Slower: Structural Analysis

Transfer and graph-splitting overhead: graph_splits=98 indicates heavily fragmented compute graph with repeated GPU/CPU synchronization. MoE exps weights on CPU cause surrounding computation to become finely interleaved
No benefit when VRAM is ample: Run B holds 42.9 GiB of GPU weights. On a 96GB card, the VRAM-relief argument for exps=CPU does not apply. The configuration pays CPU and transfer cost without a stability win
KV f16 plus large context: With n_ctx=65536 and KV f16, exps=CPU does not reduce KV cost. It adds boundary complexity without addressing the largest pressure point

Coding Quality

BF16 evaluation:

Security audit: Correctly identified SQLi vulnerabilities and plaintext password risks, providing fixes with unit tests
Hallucination control: Properly refused out-of-spec questions (“NOT IN SPEC”)
Complex logic: Met 90% of constraint-heavy Django requirements but missed some multi-tenant safety nuances. Best as a high-quality draft generator + expert reviewer

SWA Cache Invalidation (Qwen3-Next-80B-A3B-Thinking)

When running Qwen3-Next-80B-A3B-Thinking (Q4_K_M) on CPU-only inference, full prompt re-processing was triggered by Sliding Window Attention (SWA) or hybrid/recurrent memory architecture behavior.

  slot update_slots: id  1 | task 6686 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)

slot update_slots: id  1 | task 6686 | erased invalidated context checkpoint (pos_min = 223, pos_max = 223, n_swa = 1, size = 75.376 MiB)

An invalidated context checkpoint of approximately 75 MiB was erased and the prompt was fully re-processed. This behavior was distinct from my previous tests with Qwen3-Coder-Next IQ4_NL on GPU offload. Worth noting as a caveat when running SWA-based architectures in CPU inference mode.

Lessons Learned

Three-mode selection is now clear:

Precision work (BF16 CPU): Security audits, final reviews. Speed sacrificed for zero quantization loss
Normal coding (Hybrid / GPU): Daily development. 59-85 tok/s feels responsive
vLLM integration (nvfp4): When tool-call-parser, prefix caching, or API integration is needed

The 31% Expert Offload penalty was larger than expected. Confirmed by measurement: it is a “use when VRAM is insufficient” setting.

Reproduction Steps

BF16 CPU

  podman run --rm -it \
  -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Qwen3-Coder-Next-GGUF:/models:Z \
  compute.home.arpa/llamacpp-zen5:qwen3-coder-next \
  -m /models/snapshots/.../BF16/Qwen3-Coder-Next-BF16-00001-of-00004.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on --ctx-size 16384 \
  --parallel 1 --threads 13 --threads-batch 13 \
  --batch-size 2048 --ubatch-size 512 \
  --jinja --host 0.0.0.0 --port 8080

IQ4_NL Hybrid (Expert CPU offload)

  IMG=compute.home.arpa/ik_llama-cuda
MO=/mnt/data/hf/hub/models--ubergarm--Qwen3-Coder-Next-GGUF
MODEL=/models/snapshots/.../Qwen3-Coder-Next-IQ4_KSS.gguf

podman run --rm -it --device nvidia.com/gpu=all \
  -p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 -m "$MODEL" \
  -c 65536 --threads 13 --threads-batch 23 \
  -b 2048 -ub 2048 -ngl 99 \
  -ot exps=CPU -fa on --no-mmap --jinja

IQ4_NL Full GPU (Expert on GPU)

  podman run --rm -it --device nvidia.com/gpu=all \
  -p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 -m "$MODEL" \
  -c 65536 --threads 13 --threads-batch 23 \
  -b 2048 -ub 2048 -ngl 99 \
  -fa on --no-mmap --jinja

nvfp4 GPU (vLLM)

  podman run --rm --device nvidia.com/gpu=all \
  --security-opt seccomp=unconfined --cap-add SYS_NICE --shm-size=16g \
  -v /mnt/data/hf:/data/hf:Z \
  -v /opt/containers/runtime/vllm/data/gpu_cache:/data/cache:Z \
  -p 8000:8000 \
  -e HF_HOME=/data/hf -e HF_DATASETS_CACHE=/data/hf \
  -e VLLM_CACHE_ROOT=/data/cache -e HF_HUB_OFFLINE=1 \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  compute.home.arpa/vllm-gpu:nightly vincentzed-hf/Qwen3-Coder-Next-NVFP4 \
  --dtype auto --gpu-memory-utilization 0.88 \
  --max-num-seqs 1 --max-model-len 32768 \
  --enable-prefix-caching --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 --served-model-name qwen3-coder-next-nvfp4

Qwen3-Next-80B-A3B-Thinking (CPU inference)

  podman run --rm \
  -p 8081:8080 --shm-size 1g \
  -v /opt/containers/runtime/llamacpp/data/models:/models:Z \
  compute.home.arpa/llamacpp-zen5:latest \
  -m /models/Qwen3-Next-80B-A3B-Thinking-Q4_K_M.gguf \
  --ctx-size 16384 --threads 15 \
  --jinja --reasoning-budget 512 \
  --host 0.0.0.0 --port 8080

Technical Notes

256K Context Configuration

IQ4_NL can be set to ctx=262144, but Expert Offload + 256K creates massive KV cache. Practical upper limit is ctx=65536.

graph_splits Meaning

graph_splits indicates CPU-GPU data transfer frequency. exps=CPU yields 98 (returns to CPU for each expert layer), all-GPU yields 2 (input/output only). This difference directly maps to speed difference.

Operational Guidelines

On a 96GB-class GPU, keep exps=CPU disabled by default
Use exps=CPU only as an escape hatch when VRAM becomes the limiting factor
The next gains are in KV quantization and context tuning, not in CPU expert offload
Reserve BF16 CPU inference as a precision-first background processing lane

Llama-4-Maverick-17B-128E CPU Inference: Q4_K_M vs Q8_0 Speed-Quality Trade-off Measured

Llama-4-Maverick (17B active / …

GLM-4.7-Flash IQ5_K Benchmark: CPU vs Hybrid vs Full GPU Performance Comparison

Benchmarking GLM-4.7-Flash …