Qwen3.5-397B IQ4_NL Measured: 22.5tok/s Average from 28 Runs, Hybrid Offload Config and 400B-Class MoE Daily Viability
Qwen3.5-397B-A17B (397B total / 17B active MoE) deployed with IQ4_NL quantization on EPYC 9175F + GPU hybrid setup. 28 consecutive inference runs averaging TG 22.5tok/s, peak PP 372tok/s. Includes long-context code generation evaluation with a WordPress-like Django CMS specification.
Background
Qwen3.5-397B is a MoE model with 397B total parameters and 17B active. Normally requires multiple H100s, but IQ4_NL quantization + cpu-moe + tensor offloading makes it runnable on EPYC + consumer GPU hardware.
The question is whether it merely “runs” or achieves “daily-use speed.” 28 consecutive inference runs provide the answer. Additionally, I tested its long-context code generation ability by feeding it a WordPress-like Django CMS specification and asking it to scaffold a full project.
Objective
- Statistically characterize steady-state TG/PP speed from 28 runs with IQ4_NL
- Document hybrid offload execution configs (cpu-moe / multi-GPU tensor offload)
- Evaluate context-length dependency and stability
- Determine daily viability of 400B-class MoE
- Evaluate long-context code generation quality (Django CMS scaffold)
Test Environment
| Item | Specification |
|---|---|
| CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| Memory | DDR5-6400 768GB (12ch) |
| GPU | NVIDIA RTX PRO 6000 (96GB VRAM) |
| OS | Ubuntu 24.04 LTS |
| Runtime | ik_llama.cpp (cpu-moe enabled) |
| Quantization | IQ4_NL |
| Context | Up to 262,144 |
Results
Throughput Statistics (28 Runs)
| Metric | Prefill (PP) | Generation (TG) |
|---|---|---|
| Maximum | 372.24 tok/s | 24.04 tok/s |
| Minimum | 101.49 tok/s | 19.13 tok/s |
| Mean (Steady State) | ~160 tok/s | ~22.5 tok/s |
Representative Runs
| # | PP(tok) | TG(tok) | PP(tok/s) | TG(tok/s) | Total(s) |
|---|---|---|---|---|---|
| 1 | 4,699 | 13 | 314.22 | 21.95 | 15.5 |
| 3 | 1,125 | 2,048 | 161.67 | 20.81 | 105.4 |
| 8 | 1,124 | 2,048 | 154.76 | 23.82 | 93.3 |
| 14 | 15,866 | 2,048 | 372.24 | 22.98 | 131.7 |
| 16 | 550 | 520 | 117.78 | 22.91 | 27.4 |
Run #14: 15,866-token massive input processed in 42.6s, generation starting at 22.98 tok/s. 15K tokens of source code ingested in ~40 seconds.
Context Length Dependency
- PP speed: Short prompts (<1k) stay at ~100 tok/s (overhead-dominated). Longer prompts increase parallel efficiency, accelerating to 300+ tok/s
- TG speed: Remarkably stable at 21-24 tok/s regardless of context length
Hybrid Offload Configurations
Single GPU + cpu-moe
IMG=compute.home.arpa/ik_llama-cuda
MODEL=/models/.../IQ4_KSS/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf
podman run --rm -it --device nvidia.com/gpu=all \
-p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
-v "$MO":/models:ro,Z "$IMG" \
--host 0.0.0.0 --port 8080 -m "$MODEL" \
-c 262144 --threads 13 --threads-batch 24 \
--jinja -b 2048 -ub 2048 -ngl 99 \
-fa on --no-mmap --cpu-moe
--cpu-moe: Offloads expert computations that exceed GPU VRAM to CPU. EPYC 9175F’s 12-channel DDR5 bandwidth supplies active experts while GPU accelerates Attention operations.
Multi-GPU Tensor Offload
With 2 GPUs, -ot enables regex-based layer distribution:
./build/bin/llama-server \
--model "$model" \
-fa on --ctx-size 135168 \
-ctk q8_0 -ctv q8_0 \
-ub 2048 -b 2048 -ngl 999 \
-ot "blk\.(0|1|2|...|12)\.ffn_(gate|up|down)_exps.*=CUDA0,\
blk\.(47|48|...|60)\.ffn_(gate|up|down)_exps.*=CUDA1" \
--cpu-moe --threads 24 --no-mmap --jinja
Early layers (0-12) assigned to CUDA0, late layers (47-60) to CUDA1. Middle layers handled by cpu-moe. Flattens VRAM consumption while securing 135K context.
Long-Context Code Generation: WordPress-Like Django CMS
Evaluation Setup
I fed Qwen3.5-397B-A17B a detailed English specification for a WordPress-like Django CMS foundation. The specification covered Posts/Pages, Categories/Tags, Comments, Media Library, Menus/Navigation, Drafts/scheduled publishing, and Revision history.
Inference Configuration
IMG=compute.home.arpa/ik_llama-cuda
MO=/mnt/data/hf/hub/models--ubergarm--Qwen3.5-397B-A17B-GGUF
MODEL=/models/snapshots/.../IQ4_KSS/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf
podman run --rm -it --device nvidia.com/gpu=all \
-p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
-v "$MO":/models:ro,Z "$IMG" \
--host 0.0.0.0 --port 8080 -m "$MODEL" \
-c 262144 --threads 13 --threads-batch 26 \
--jinja --temp 0.7 --repeat-penalty 1.2 \
--min-p 0.01 --top-p 0.95 --seed 317 --top-k 40 \
-b 2048 -ub 2048 -ngl 99 \
-fa on --no-mmap -ger
The 262k context setting allowed the full specification and generated implementation code to live in a single session.
Specification Design Insights
The most effective element was placing a WordPress-to-Django concept mapping table at the beginning. Post/Page -> content.Post/content.Page, Category/Tag -> taxonomy.Term, Post Meta -> content.PostMeta, Revision -> content.Revision. This gave the model a stable vocabulary that kept field design consistent through the long generation.
Explicitly stating non-goals (no WordPress compatibility, no Gutenberg replication, no multisite UI parity, no full WYSIWYG) reduced the chance of the model drifting into overbuilt architectures.
Generated Output
The model produced at least these files:
apps/core/models.py/models_site.py/models_setting.pyapps/content/models.py/models_page.py/models_extra.pyapps/taxonomy/models.pyapps/media/models.pyapps/comments/models.pyapps/navigation/models.pyapps/seo/models.pyconfig/settings/base.py/dev.py/prod.pyconfig/urls.py/asgi.py/wsgi.pymanage.py/requirements.txt
That is a meaningful amount of scaffolding from a single specification input.
Generation Strengths
- Specification adherence: Post, Term, Revision, SEOEntry scaffolds were directly usable
- Concept mapping: WordPress mental model preserved while remaining Django-native
- Abstract models: TimeStampedModel, SoftDeleteModel, PublishableModel, SluggedModel generated naturally
- SEO constraints: SEOEntry included CheckConstraint binding to either post or page
Generation Weaknesses
- Field dropout: slug disappeared from Post mid-generation, had to be re-added via diff
- File placement drift: Page was first split to
models_page.py, then folded back intomodels.py - Field inconsistency: Comment.save() referenced
self.author_ipbefore the field existed in the model - Tool failures: ctree_check, ctree_init, filesystem_create_directory failed repeatedly before falling back to write_file
Code Generation Insight
Even with a well-structured specification, long-generation output is not a clean linear path. Tool failures, missing parent directories, model definition gaps, and field inconsistencies recur.
The value is not in trusting the output but in the speed of initial scaffolding. The human reviewer’s job is to quickly separate what is directly usable from what requires mandatory review.
Analysis
Why 397B Hits 22 tok/s
Despite the 397B total, only 17B is active per token. IQ4_NL compresses memory bandwidth load by 4x+. The 12-channel DDR5 bandwidth supplies active experts while GPU accelerates Attention, achieving “17B-class speed” through division of labor.
IQ4_NL Quantization Choice
Unquantized deployment of 397B is impractical. IQ4_NL minimizes quality degradation while drastically reducing memory footprint. No obvious quality degradation observed across 28 runs.
Warm-up Requirement
First few runs show behavioral jitter. 2-3 dummy requests before stable TG speed. For resident deployment, incorporate a warm-up script.
Lessons Learned
“400B-class is experimental” is changing. 22.5 tok/s is sufficient for real-time human interaction, and represents high throughput for async batch processing.
TG speed stability at 21-24 tok/s regardless of context length is particularly important. Feeding 15K tokens of source code does not slow generation. This matters for use cases involving entire repository ingestion.
The Django CMS generation test confirmed that 400B-class models with long context can genuinely accelerate project initialization. The structure of the specification itself (concept mapping tables, explicit non-goals, recommended implementation order) directly affects generation quality, a lesson that applies to future prompt design.
Multi-GPU tensor offload is a “use it if you have 2 GPUs” configuration. cpu-moe alone works, but explicit layer distribution via -ot improves throughput.
Technical Notes
KV Cache Quantization
At 135K-262K context windows, KV cache consumes massive memory. -ctk q8_0 -ctv q8_0 quantizes KV cache to suppress memory pressure. Perceived quality impact is minimal.
Sampling Parameters
For stable code generation: --temp 0.7 --repeat-penalty 1.2 --min-p 0.01 --top-p 0.95 --top-k 40. Temp 1.0 produces creative but verbose output.
Specification Design Guidelines for Code Generation
For repeating this workflow, split the source into three separate artifacts:
- A reusable inference launch template
- A clean CMS specification (with concept mapping + non-goals + implementation order)
- A generation log plus review notes
Placing a concept dictionary first, explicitly stating non-goals, and specifying implementation order helps stabilize model output through long generation sessions.
