Background

Qwen3.5-397B is a MoE model with 397B total parameters and 17B active per token. It would normally require multiple H100s, but IQ4_NL quantization combined with cpu-moe and tensor offloading makes it runnable on EPYC + single-GPU hardware.

The question is whether it merely “runs” or achieves “daily-use speed.” 28 consecutive inference runs provide the answer.

Objective

  1. Statistically characterize steady-state TG/PP speed from 28 runs with IQ4_NL
  2. Document hybrid offload execution configs (cpu-moe / multi-GPU tensor offload)
  3. Evaluate context-length dependency and stability
  4. Determine daily viability of 400B-class MoE

Test Environment

  Item           Specification
  CPU            AMD EPYC 9175F (Zen 5, 16C, L3 512MB)
  Memory         DDR5-6400 768GB (12ch)
  GPU            NVIDIA RTX PRO 6000 (96GB VRAM)
  OS             Ubuntu 24.04 LTS
  Runtime        ik_llama.cpp (cpu-moe enabled)
  Quantization   IQ4_NL
  Context        Up to 262,144 tokens

Results

Throughput Statistics (28 Runs)

  Metric                 Prefill (PP)     Generation (TG)
  Maximum                372.24 tok/s     24.04 tok/s
  Minimum                101.49 tok/s     19.13 tok/s
  Mean (steady state)    ~160 tok/s       ~22.5 tok/s

Representative Runs

  #    PP (tok)   TG (tok)   PP (tok/s)   TG (tok/s)   Total (s)
  1    4,699      13         314.22       21.95        15.5
  3    1,125      2,048      161.67       20.81        105.4
  8    1,124      2,048      154.76       23.82        93.3
  14   15,866     2,048      372.24       22.98        131.7
  16   550        520        117.78       22.91        27.4

Run #14: a massive 15,866-token input was prefilled in 42.6s, with generation then sustaining 22.98 tok/s. That is roughly 15K tokens of source code ingested in about 40 seconds.
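As a sanity check, the totals in the table follow directly from the listed rates. For run #14, prefill time plus generation time reproduces the 131.7s total:

```shell
# Recompute run #14's total from the table:
# PP tokens / PP rate + TG tokens / TG rate.
awk 'BEGIN {
  pp = 15866 / 372.24      # prefill time (s)
  tg = 2048  / 22.98       # generation time (s)
  printf "PP %.1fs + TG %.1fs = %.1fs total\n", pp, tg, pp + tg
}'
# → PP 42.6s + TG 89.1s = 131.7s total
```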

Context Length Dependency

  • PP speed: Short prompts (<1k) stay at ~100 tok/s (overhead-dominated). Longer prompts increase parallel efficiency, accelerating to 300+ tok/s
  • TG speed: Remarkably stable at 21-24 tok/s regardless of context length

Hybrid Offload Configurations

Single GPU + cpu-moe

  IMG=compute.home.arpa/ik_llama-cuda
  # MO is the host-side directory holding the GGUF shards (mounted at /models)
  MODEL=/models/.../IQ4_KSS/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf

  podman run --rm -it --device nvidia.com/gpu=all \
    -p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
    -v "$MO":/models:ro,Z "$IMG" \
    --host 0.0.0.0 --port 8080 -m "$MODEL" \
    -c 262144 --threads 13 --threads-batch 24 \
    --jinja -b 2048 -ub 2048 -ngl 99 \
    -fa on --no-mmap --cpu-moe

--cpu-moe: Keeps the MoE expert weights in system RAM and runs the expert computation on the CPU. EPYC 9175F’s 12-channel DDR5 bandwidth supplies the active experts while the GPU accelerates the attention operations.

Multi-GPU Tensor Offload

With 2 GPUs, -ot enables regex-based layer distribution:

  ./build/bin/llama-server \
    --model "$model" \
    -fa on --ctx-size 135168 \
    -ctk q8_0 -ctv q8_0 \
    -ub 2048 -b 2048 -ngl 999 \
    -ot "blk\.(0|1|2|...|12)\.ffn_(gate|up|down)_exps.*=CUDA0,\
         blk\.(47|48|...|60)\.ffn_(gate|up|down)_exps.*=CUDA1" \
    --cpu-moe --threads 24 --no-mmap --jinja

Early expert layers (0-12) are pinned to CUDA0 and late layers (47-60) to CUDA1, while the middle layers fall through to cpu-moe. This flattens VRAM consumption across both cards while securing a 135K context.
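The elided alternations get tedious to type out in full. As one convenience (a sketch, assuming a POSIX shell with `seq` available), the layer lists can be generated instead of hand-written:

```shell
# Build the "0|1|...|12" and "47|48|...|60" alternations for the -ot regex.
EARLY=$(seq -s '|' 0 12)
LATE=$(seq -s '|' 47 60)
OT="blk\\.(${EARLY})\\.ffn_(gate|up|down)_exps.*=CUDA0,blk\\.(${LATE})\\.ffn_(gate|up|down)_exps.*=CUDA1"
echo "$OT"
```

The resulting string can be passed directly as the `-ot` argument; adjust the two ranges to whatever split your VRAM allows.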

Analysis

Why 397B Hits 22 tok/s

Despite the 397B total, only 17B parameters are active per token. IQ4_NL cuts the memory-bandwidth load roughly 4x versus FP16. The 12-channel DDR5 bandwidth supplies the active experts while the GPU accelerates attention, achieving “17B-class speed” through division of labor.
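A back-of-envelope calculation supports this. All figures below are approximations, not measurements: ~4.5 bits/weight for IQ4_NL, theoretical peak bandwidth for 12-channel DDR5-6400, and the assumption that every active weight is read from DRAM once per token:

```shell
# Memory-bandwidth ceiling for token generation (rough numbers only).
awk 'BEGIN {
  active = 17e9                 # active parameters per token (17B)
  bpw    = 4.5                  # approx. IQ4_NL bits per weight
  bw     = 12 * 8 * 6400e6      # 12ch DDR5-6400 peak: ~614.4 GB/s
  bytes  = active * bpw / 8     # bytes touched per generated token
  printf "%.1f GB/token, ceiling ~%.0f tok/s\n", bytes/1e9, bw/bytes
}'
# → 9.6 GB/token, ceiling ~64 tok/s
```

The observed ~22.5 tok/s is roughly a third of that theoretical ceiling, which is plausible once sustained-vs-peak bandwidth and CPU/GPU synchronization overhead are accounted for.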

IQ4_NL Quantization Choice

Unquantized deployment of a 397B model is impractical (~800GB at FP16 for the weights alone). IQ4_NL drastically reduces the memory footprint while minimizing quality degradation; no obvious quality loss was observed across the 28 runs.

Warm-up Requirement

The first few runs show jitter; it takes 2-3 dummy requests before TG speed stabilizes. For a resident deployment, incorporate a warm-up script.
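A minimal warm-up sketch, assuming the server from the single-GPU example above is listening on localhost:8001 and speaks the llama.cpp-style /completion JSON API (the endpoint and field names are that API's, nothing specific to this setup):

```shell
#!/bin/sh
# Fire a few throwaway completions so thread pools, page cache, and GPU
# state settle before real traffic. Best-effort: failures are ignored.
for i in 1 2 3; do
  curl -s -m 60 http://localhost:8001/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "warm-up", "n_predict": 16}' > /dev/null 2>&1 || true
done
echo "warm-up complete"
```

Run it once after the container starts, before any latency-sensitive requests.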

Lessons Learned

The assumption that “400B-class means experimental-only” is changing. 22.5 tok/s is sufficient for real-time human interaction, and it is respectable throughput for async batch processing.

TG speed stability at 21-24 tok/s regardless of context length is particularly important. Feeding 15K tokens of source code doesn’t slow generation. This matters for use cases involving entire repository ingestion.

Multi-GPU tensor offload is a “use it if you have 2 GPUs” configuration. cpu-moe alone works, but explicit layer distribution via -ot improves throughput.

Technical Notes

KV Cache Quantization

At 135K-262K context windows, KV cache consumes massive memory. -ctk q8_0 -ctv q8_0 quantizes KV cache to suppress memory pressure. Perceived quality impact is minimal.
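To put rough numbers on that, here is an illustrative calculation. The layer count, KV-head count, and head dimension below are placeholder GQA-style values, not confirmed specs for this model; only the q8_0 and f16 storage costs are fixed:

```shell
# KV cache = 2 (K and V) x layers x kv_heads x head_dim x context x bytes/value.
# q8_0 stores 32 values in 34 bytes (~1.0625 B/value); f16 uses 2 B/value.
# layers/kv_heads/head_dim are ASSUMED placeholder dims, not this model's specs.
awk 'BEGIN {
  layers = 60; kv_heads = 8; head_dim = 128; ctx = 262144
  vals = 2 * layers * kv_heads * head_dim   # KV values per token
  printf "f16: %.1f GiB, q8_0: %.1f GiB\n", vals*2*ctx/2^30, vals*1.0625*ctx/2^30
}'
# → f16: 60.0 GiB, q8_0: 31.9 GiB
```

Even roughly halved by q8_0, the full-context KV cache remains a major budget item, which is why -ctk/-ctv quantization matters at these window sizes.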

Sampling Parameters

For stable code generation: --temp 0.7 --repeat-penalty 1.2 --min-p 0.01 --top-p 0.95 --top-k 40. Temp 1.0 produces creative but verbose output.
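These can also be sent per request instead of being fixed at server launch. A sketch against the llama.cpp-style /completion JSON API, with the server address assumed from the single-GPU example above:

```shell
# Per-request sampling parameters, mirroring the CLI flags above.
curl -s http://localhost:8001/completion \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Write a Python function that reverses a string.",
    "n_predict": 256,
    "temperature": 0.7,
    "repeat_penalty": 1.2,
    "min_p": 0.01,
    "top_p": 0.95,
    "top_k": 40
  }'
```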