Background

Qwen3-Coder-Next is an ~80B parameter MoE model specialized for coding. Strong in code generation, review, and security auditing, it is a candidate for local coding assistant deployment.

The problem is too many execution options: BF16 unquantized CPU inference, IQ4_NL Hybrid GPU offload (experts on CPU), nvfp4 full GPU inference. Each has different speed/quality/VRAM characteristics. All three were measured on the same hardware to establish per-use-case selection criteria.

Objective

  1. Confirm BF16 CPU inference speed and quality (maximum precision mode)
  2. Quantify Expert Offload speed penalty with IQ4_NL Hybrid
  3. Measure nvfp4 GPU throughput
  4. Evaluate coding task quality
  5. Document SWA cache invalidation behavior with Qwen3-Next-80B-A3B-Thinking (Q4_K_M)

Test Environment

ItemSpecification
CPUAMD EPYC 9175F (Zen 5, 16C, L3 512MB)
GPUNVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM)
MemoryDDR5-6400 768GB (12ch)
OSUbuntu 24.04 LTS

Three Configurations

ConfigRuntimeQuantizationExpert Placementctx
A: CPU BF16llama.cppBF16 (unquantized)All CPU16K
B: Hybrid offloadik_llama.cppIQ4_NLExpert=CPU, Attn=GPU65K
C: GPU nvfp4vLLMnvfp4All GPU32K

Results

Config A: BF16 CPU (th=13, ctx=16K)

MetricMeasured
PP (short prompt)33.37 tok/s
PP (287 tokens)117.40 tok/s
TG (sustained)7.59 tok/s
TTFT (287 tokens)~2.58s

Throughput stayed consistent through a 2,233-token generation. KV cache at q8_0 to manage memory pressure.

I chose BF16 because I wanted the cleanest possible view of the model’s coding behavior without quantization noise. The 12-channel DDR5-6400 bandwidth supports BF16’s massive data movement, and Zen 5’s native AVX-512 BF16 instructions contribute meaningfully. At 80B scale, memory movement is the real story. The 12-channel platform is what makes the BF16 result credible.

Config B: IQ4_NL Hybrid (Expert CPU offload, ctx=65K)

Run A: exps=CPU (Expert weights on CPU)

MetricMeasured
GPU buffer (weights)1,403 MiB
CPU buffer (weights)41,472 MiB
graph_splits98
TG (weighted avg)58.94 tok/s
PP (representative)761-1,120 tok/s

Run B: No exps=CPU (All weights on GPU)

MetricMeasured
GPU buffer (weights)42,875 MiB
graph_splits2
TG (weighted avg)85.36 tok/s
PP (representative)880-3,572 tok/s

Expert Offload speed penalty: -31% (58.94 -> 85.36 tok/s)

In Run A, most weights ended up on the CPU side while the GPU held only about 1.4 GiB. Even though GPU offload was enabled, the effective execution path crossed the GPU/CPU boundary far more aggressively than intended. graph_splits reached 98, meaning the compute graph was heavily fragmented with repeated synchronization and transfer across boundaries.

Run B was not just faster but also much more uniform, with most tasks clustering around 85 tok/s. GPU weights rose to 42.9 GiB while graph_splits fell to 2. That gap is big enough to matter in real coding sessions.

Weighted Average Methodology

The gen_tokens weighted average weights each task’s gen_tps by its gen_tokens count. Tasks that emit more output matter more, so the final average tracks long-form generation feel better than a plain mean.

  weighted_gen_tps = S(gen_tokens_i * gen_tps_i) / S(gen_tokens_i)
  

Representative Task Results (Run A: exps=CPU)

taskprompt tokensprompt tpsgen tokensgen tpstotal ms
82100291.936659.141,458.59
2563,7571,045.8914159.715,953.40
6521,594761.226758.943,230.68
922908678.8118058.524,413.36
2,6632,284961.733,81058.8067,175.38

Representative Task Results (Run B: exps=CPU disabled)

taskprompt tokensprompt tpsgen tokensgen tpstotal ms
2121356.7147185.755,551.35
1,71876880.411,03685.4112,215.63
3,5396472,561.141,42185.5316,866.17
4,9611551,397.332,00785.2823,646.46
9,63830513.4547585.645,605.00

Config C: nvfp4 GPU (vLLM, ctx=32K)

MetricMeasured
TG (stable)17-100 tok/s (high variance)
PP17-669 tok/s (burst)

Values fluctuate due to vLLM rolling average logs. Stable generation runs at 58-100 tok/s.

Comparison Summary

ConfigTG(tok/s)VRAMQualityUse Case
A: BF16 CPU7.590HighestPrecision review/audit
B: Hybrid exps=CPU58.94~3GBGoodNormal coding
B: Hybrid all-GPU85.36~43GBGoodFast coding
C: nvfp4 GPU58-100~22GBGoodvLLM integration

Live Demonstration

To verify the output quality of Qwen3-Coder-Next in action, we recorded real-time execution demonstrating the model’s coding capability:

Video link: https://www.youtube.com/watch?v=Hm8e7864Fcw

This demonstration shows:

  • Real-time token generation proving the model is actually executing
  • Verification of the actual output content and quality

Analysis

BF16 CPU Value Proposition

7.59 tok/s is slow for chat but sufficient for code review and security auditing. BF16 has zero quantization degradation, making it trustworthy for SQLi vulnerability detection, plaintext password risk identification, and similar precision-critical tasks.

The 12-channel DDR5-6400 bandwidth supports BF16’s massive data movement, with Zen 5’s native AVX-512 BF16 instructions contributing meaningfully.

Expert Offload 31% Penalty

exps=CPU saves 40GB+ VRAM but costs 31% generation speed. graph_splits jumps from 2 to 98, indicating frequent CPU-GPU data transfers for every expert computation.

59 tok/s is still fast, but when VRAM allows, full GPU loading is clearly superior. Expert Offload is a “VRAM compromise” strategy, not well-suited as a speed optimization method.

Under this measurement setup (A3B, n_ctx=65536, KV f16, n_batch=2048, -ngl 99), exps=CPU is not a speed feature. It is a VRAM-conservation mode that trades throughput for memory relief. On a machine where the weights fit comfortably in GPU memory, it is the wrong default.

Why It Gets Slower: Structural Analysis

  1. Transfer and graph-splitting overhead: graph_splits=98 indicates heavily fragmented compute graph with repeated GPU/CPU synchronization. MoE exps weights on CPU cause surrounding computation to become finely interleaved
  2. No benefit when VRAM is ample: Run B holds 42.9 GiB of GPU weights. On a 96GB card, the VRAM-relief argument for exps=CPU does not apply. The configuration pays CPU and transfer cost without a stability win
  3. KV f16 plus large context: With n_ctx=65536 and KV f16, exps=CPU does not reduce KV cost. It adds boundary complexity without addressing the largest pressure point

Coding Quality

BF16 evaluation:

  • Security audit: Correctly identified SQLi vulnerabilities and plaintext password risks, providing fixes with unit tests
  • Hallucination control: Properly refused out-of-spec questions (“NOT IN SPEC”)
  • Complex logic: Met 90% of constraint-heavy Django requirements but missed some multi-tenant safety nuances. Best as a high-quality draft generator + expert reviewer

SWA Cache Invalidation (Qwen3-Next-80B-A3B-Thinking)

When running Qwen3-Next-80B-A3B-Thinking (Q4_K_M) on CPU-only inference, full prompt re-processing was triggered by Sliding Window Attention (SWA) or hybrid/recurrent memory architecture behavior.

  slot update_slots: id  1 | task 6686 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)

slot update_slots: id  1 | task 6686 | erased invalidated context checkpoint (pos_min = 223, pos_max = 223, n_swa = 1, size = 75.376 MiB)
  

An invalidated context checkpoint of approximately 75 MiB was erased and the prompt was fully re-processed. This behavior was distinct from my previous tests with Qwen3-Coder-Next IQ4_NL on GPU offload. Worth noting as a caveat when running SWA-based architectures in CPU inference mode.

Lessons Learned

Three-mode selection is now clear:

  • Precision work (BF16 CPU): Security audits, final reviews. Speed sacrificed for zero quantization loss
  • Normal coding (Hybrid / GPU): Daily development. 59-85 tok/s feels responsive
  • vLLM integration (nvfp4): When tool-call-parser, prefix caching, or API integration is needed

The 31% Expert Offload penalty was larger than expected. Confirmed by measurement: it is a “use when VRAM is insufficient” setting.

Reproduction Steps

BF16 CPU

  podman run --rm -it \
  -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Qwen3-Coder-Next-GGUF:/models:Z \
  compute.home.arpa/llamacpp-zen5:qwen3-coder-next \
  -m /models/snapshots/.../BF16/Qwen3-Coder-Next-BF16-00001-of-00004.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on --ctx-size 16384 \
  --parallel 1 --threads 13 --threads-batch 13 \
  --batch-size 2048 --ubatch-size 512 \
  --jinja --host 0.0.0.0 --port 8080
  

IQ4_NL Hybrid (Expert CPU offload)

  IMG=compute.home.arpa/ik_llama-cuda
MO=/mnt/data/hf/hub/models--ubergarm--Qwen3-Coder-Next-GGUF
MODEL=/models/snapshots/.../Qwen3-Coder-Next-IQ4_KSS.gguf

podman run --rm -it --device nvidia.com/gpu=all \
  -p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 -m "$MODEL" \
  -c 65536 --threads 13 --threads-batch 23 \
  -b 2048 -ub 2048 -ngl 99 \
  -ot exps=CPU -fa on --no-mmap --jinja
  

IQ4_NL Full GPU (Expert on GPU)

  podman run --rm -it --device nvidia.com/gpu=all \
  -p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 -m "$MODEL" \
  -c 65536 --threads 13 --threads-batch 23 \
  -b 2048 -ub 2048 -ngl 99 \
  -fa on --no-mmap --jinja
  

nvfp4 GPU (vLLM)

  podman run --rm --device nvidia.com/gpu=all \
  --security-opt seccomp=unconfined --cap-add SYS_NICE --shm-size=16g \
  -v /mnt/data/hf:/data/hf:Z \
  -v /opt/containers/runtime/vllm/data/gpu_cache:/data/cache:Z \
  -p 8000:8000 \
  -e HF_HOME=/data/hf -e HF_DATASETS_CACHE=/data/hf \
  -e VLLM_CACHE_ROOT=/data/cache -e HF_HUB_OFFLINE=1 \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  compute.home.arpa/vllm-gpu:nightly vincentzed-hf/Qwen3-Coder-Next-NVFP4 \
  --dtype auto --gpu-memory-utilization 0.88 \
  --max-num-seqs 1 --max-model-len 32768 \
  --enable-prefix-caching --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 --served-model-name qwen3-coder-next-nvfp4
  

Qwen3-Next-80B-A3B-Thinking (CPU inference)

  podman run --rm \
  -p 8081:8080 --shm-size 1g \
  -v /opt/containers/runtime/llamacpp/data/models:/models:Z \
  compute.home.arpa/llamacpp-zen5:latest \
  -m /models/Qwen3-Next-80B-A3B-Thinking-Q4_K_M.gguf \
  --ctx-size 16384 --threads 15 \
  --jinja --reasoning-budget 512 \
  --host 0.0.0.0 --port 8080
  

Technical Notes

256K Context Configuration

IQ4_NL can be set to ctx=262144, but Expert Offload + 256K creates massive KV cache. Practical upper limit is ctx=65536.

graph_splits Meaning

graph_splits indicates CPU-GPU data transfer frequency. exps=CPU yields 98 (returns to CPU for each expert layer), all-GPU yields 2 (input/output only). This difference directly maps to speed difference.

Operational Guidelines

  • On a 96GB-class GPU, keep exps=CPU disabled by default
  • Use exps=CPU only as an escape hatch when VRAM becomes the limiting factor
  • The next gains are in KV quantization and context tuning, not in CPU expert offload
  • Reserve BF16 CPU inference as a precision-first background processing lane