Background

Qwen3.5-397B is a MoE model with 397B total parameters and 17B active per token. It would normally require multiple H100s, but IQ4_NL quantization combined with cpu-moe and tensor offloading makes it runnable on EPYC + single-GPU hardware.

The question is whether it merely “runs” or achieves “daily-use speed.” 28 consecutive inference runs provide the answer.

Objective

  1. Statistically characterize steady-state TG/PP speed from 28 runs with IQ4_NL
  2. Document hybrid offload execution configs (cpu-moe / multi-GPU tensor offload)
  3. Evaluate context-length dependency and stability
  4. Determine daily viability of 400B-class MoE

Test Environment

  Item           Specification
  CPU            AMD EPYC 9175F (Zen 5, 16C, L3 512MB)
  Memory         DDR5-6400 768GB (12ch)
  GPU            NVIDIA RTX PRO 6000 (96GB VRAM)
  OS             Ubuntu 24.04 LTS
  Runtime        ik_llama.cpp (cpu-moe enabled)
  Quantization   IQ4_NL
  Context        Up to 262,144 tokens

Results

Throughput Statistics (28 Runs)

  Metric                 Prefill (PP)     Generation (TG)
  Maximum                372.24 tok/s     24.04 tok/s
  Minimum                101.49 tok/s     19.13 tok/s
  Mean (steady state)    ~160 tok/s       ~22.5 tok/s

Representative Runs

  #    PP (tok)   TG (tok)   PP (tok/s)   TG (tok/s)   Total (s)
  1    4,699      13         314.22       21.95        15.5
  3    1,125      2,048      161.67       20.81        105.4
  8    1,124      2,048      154.76       23.82        93.3
  14   15,866     2,048      372.24       22.98        131.7
  16   550        520        117.78       22.91        27.4

Run #14: a massive 15,866-token input was prefilled in 42.6s, with generation then sustaining 22.98 tok/s. That is roughly 15K tokens of source code ingested in about 40 seconds.
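As a sanity check, the totals in the table follow directly from the listed rates. For run #14, prefill time plus generation time reproduces the 131.7s total:

```shell
# Recompute run #14's total from the table:
# PP tokens / PP rate + TG tokens / TG rate.
awk 'BEGIN {
  pp = 15866 / 372.24      # prefill time (s)
  tg = 2048  / 22.98       # generation time (s)
  printf "PP %.1fs + TG %.1fs = %.1fs total\n", pp, tg, pp + tg
}'
# → PP 42.6s + TG 89.1s = 131.7s total
```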

Context Length Dependency

  • PP speed: Short prompts (<1k) stay at ~100 tok/s (overhead-dominated). Longer prompts increase parallel efficiency, accelerating to 300+ tok/s
  • TG speed: Remarkably stable at 21-24 tok/s regardless of context length

Hybrid Offload Configurations

Single GPU + cpu-moe

  IMG=compute.home.arpa/ik_llama-cuda
  # MO is the host-side directory holding the GGUF shards (mounted at /models)
  MODEL=/models/.../IQ4_KSS/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf

  podman run --rm -it --device nvidia.com/gpu=all \
    -p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
    -v "$MO":/models:ro,Z "$IMG" \
    --host 0.0.0.0 --port 8080 -m "$MODEL" \
    -c 262144 --threads 13 --threads-batch 24 \
    --jinja -b 2048 -ub 2048 -ngl 99 \
    -fa on --no-mmap --cpu-moe

--cpu-moe: Keeps the MoE expert weights in system RAM and runs the expert computation on the CPU. EPYC 9175F’s 12-channel DDR5 bandwidth supplies the active experts while the GPU accelerates the attention operations.

Multi-GPU Tensor Offload

With 2 GPUs, -ot enables regex-based layer distribution:

  ./build/bin/llama-server \
    --model "$model" \
    -fa on --ctx-size 135168 \
    -ctk q8_0 -ctv q8_0 \
    -ub 2048 -b 2048 -ngl 999 \
    -ot "blk\.(0|1|2|...|12)\.ffn_(gate|up|down)_exps.*=CUDA0,\
         blk\.(47|48|...|60)\.ffn_(gate|up|down)_exps.*=CUDA1" \
    --cpu-moe --threads 24 --no-mmap --jinja

Early expert layers (0-12) are pinned to CUDA0 and late layers (47-60) to CUDA1, while the middle layers fall through to cpu-moe. This flattens VRAM consumption across both cards while securing a 135K context.
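The elided alternations get tedious to type out in full. As one convenience (a sketch, assuming a POSIX shell with `seq` available), the layer lists can be generated instead of hand-written:

```shell
# Build the "0|1|...|12" and "47|48|...|60" alternations for the -ot regex.
EARLY=$(seq -s '|' 0 12)
LATE=$(seq -s '|' 47 60)
OT="blk\\.(${EARLY})\\.ffn_(gate|up|down)_exps.*=CUDA0,blk\\.(${LATE})\\.ffn_(gate|up|down)_exps.*=CUDA1"
echo "$OT"
```

The resulting string can be passed directly as the `-ot` argument; adjust the two ranges to whatever split your VRAM allows.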

Analysis

Why 397B Hits 22 tok/s

Despite the 397B total, only 17B parameters are active per token. IQ4_NL cuts the memory-bandwidth load roughly 4x versus FP16. The 12-channel DDR5 bandwidth supplies the active experts while the GPU accelerates attention, achieving “17B-class speed” through division of labor.
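A back-of-envelope calculation supports this. All figures below are approximations, not measurements: ~4.5 bits/weight for IQ4_NL, theoretical peak bandwidth for 12-channel DDR5-6400, and the assumption that every active weight is read from DRAM once per token:

```shell
# Memory-bandwidth ceiling for token generation (rough numbers only).
awk 'BEGIN {
  active = 17e9                 # active parameters per token (17B)
  bpw    = 4.5                  # approx. IQ4_NL bits per weight
  bw     = 12 * 8 * 6400e6      # 12ch DDR5-6400 peak: ~614.4 GB/s
  bytes  = active * bpw / 8     # bytes touched per generated token
  printf "%.1f GB/token, ceiling ~%.0f tok/s\n", bytes/1e9, bw/bytes
}'
# → 9.6 GB/token, ceiling ~64 tok/s
```

The observed ~22.5 tok/s is roughly a third of that theoretical ceiling, which is plausible once sustained-vs-peak bandwidth and CPU/GPU synchronization overhead are accounted for.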

IQ4_NL Quantization Choice

Unquantized deployment of a 397B model is impractical (~800GB at FP16 for the weights alone). IQ4_NL drastically reduces the memory footprint while minimizing quality degradation; no obvious quality loss was observed across the 28 runs.

Warm-up Requirement

The first few runs show jitter; it takes 2-3 dummy requests before TG speed stabilizes. For a resident deployment, incorporate a warm-up script.
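A minimal warm-up sketch, assuming the server from the single-GPU example above is listening on localhost:8001 and speaks the llama.cpp-style /completion JSON API (the endpoint and field names are that API's, nothing specific to this setup):

```shell
#!/bin/sh
# Fire a few throwaway completions so thread pools, page cache, and GPU
# state settle before real traffic. Best-effort: failures are ignored.
for i in 1 2 3; do
  curl -s -m 60 http://localhost:8001/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "warm-up", "n_predict": 16}' > /dev/null 2>&1 || true
done
echo "warm-up complete"
```

Run it once after the container starts, before any latency-sensitive requests.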

Lessons Learned

The assumption that “400B-class means experimental-only” is changing. 22.5 tok/s is sufficient for real-time human interaction, and it is respectable throughput for async batch processing.

TG speed stability at 21-24 tok/s regardless of context length is particularly important. Feeding 15K tokens of source code doesn’t slow generation. This matters for use cases involving entire repository ingestion.

Multi-GPU tensor offload is a “use it if you have 2 GPUs” configuration. cpu-moe alone works, but explicit layer distribution via -ot improves throughput.

Technical Notes

KV Cache Quantization

At 135K-262K context windows, KV cache consumes massive memory. -ctk q8_0 -ctv q8_0 quantizes KV cache to suppress memory pressure. Perceived quality impact is minimal.
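To put rough numbers on that, here is an illustrative calculation. The layer count, KV-head count, and head dimension below are placeholder GQA-style values, not confirmed specs for this model; only the q8_0 and f16 storage costs are fixed:

```shell
# KV cache = 2 (K and V) x layers x kv_heads x head_dim x context x bytes/value.
# q8_0 stores 32 values in 34 bytes (~1.0625 B/value); f16 uses 2 B/value.
# layers/kv_heads/head_dim are ASSUMED placeholder dims, not this model's specs.
awk 'BEGIN {
  layers = 60; kv_heads = 8; head_dim = 128; ctx = 262144
  vals = 2 * layers * kv_heads * head_dim   # KV values per token
  printf "f16: %.1f GiB, q8_0: %.1f GiB\n", vals*2*ctx/2^30, vals*1.0625*ctx/2^30
}'
# → f16: 60.0 GiB, q8_0: 31.9 GiB
```

Even roughly halved by q8_0, the full-context KV cache remains a major budget item, which is why -ctk/-ctv quantization matters at these window sizes.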

Sampling Parameters

For stable code generation: --temp 0.7 --repeat-penalty 1.2 --min-p 0.01 --top-p 0.95 --top-k 40. Temp 1.0 produces creative but verbose output.
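These can also be sent per request instead of being fixed at server launch. A sketch against the llama.cpp-style /completion JSON API, with the server address assumed from the single-GPU example above:

```shell
# Per-request sampling parameters, mirroring the CLI flags above.
curl -s http://localhost:8001/completion \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Write a Python function that reverses a string.",
    "n_predict": 256,
    "temperature": 0.7,
    "repeat_penalty": 1.2,
    "min_p": 0.01,
    "top_p": 0.95,
    "top_k": 40
  }'
```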