Qwen3.5-397B IQ4_NL Measured: 22.5 tok/s Average over 28 Runs, Hybrid Offload Config, and 400B-Class MoE Daily Viability

Qwen3.5-397B-A17B (397B total / 17B active MoE) deployed with IQ4_NL quantization on an EPYC 9175F + GPU hybrid setup. Across 28 consecutive inference runs, generation (TG) averaged 22.5 tok/s and prefill (PP) peaked at 372 tok/s. Also includes a long-context code generation evaluation using a WordPress-like Django CMS specification.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

Background

Qwen3.5-397B is a MoE model with 397B total parameters and 17B active per token. It normally requires multiple H100s, but IQ4_NL quantization + cpu-moe + tensor offloading make it runnable on EPYC + workstation-class GPU hardware.

The question is whether it merely “runs” or achieves “daily-use speed.” 28 consecutive inference runs provide the answer. Additionally, I tested its long-context code generation ability by feeding it a WordPress-like Django CMS specification and asking it to scaffold a full project.

Objective

Statistically characterize steady-state TG/PP speed from 28 runs with IQ4_NL
Document hybrid offload execution configs (cpu-moe / multi-GPU tensor offload)
Evaluate context-length dependency and stability
Determine daily viability of 400B-class MoE
Evaluate long-context code generation quality (Django CMS scaffold)

Test Environment

Item	Specification
CPU	AMD EPYC 9175F (Zen 5, 16C, L3 512MB)
Memory	DDR5-6400 768GB (12ch)
GPU	NVIDIA RTX PRO 6000 (96GB VRAM)
OS	Ubuntu 24.04 LTS
Runtime	ik_llama.cpp (cpu-moe enabled)
Quantization	IQ4_NL
Context	Up to 262,144

Results

Throughput Statistics (28 Runs)

Metric	Prefill (PP)	Generation (TG)
Maximum	372.24 tok/s	24.04 tok/s
Minimum	101.49 tok/s	19.13 tok/s
Mean (Steady State)	~160 tok/s	~22.5 tok/s

Representative Runs

#	PP(tok)	TG(tok)	PP(tok/s)	TG(tok/s)	Total(s)
1	4,699	13	314.22	21.95	15.5
3	1,125	2,048	161.67	20.81	105.4
8	1,124	2,048	154.76	23.82	93.3
14	15,866	2,048	372.24	22.98	131.7
16	550	520	117.78	22.91	27.4

Run #14: the 15,866-token input was prefilled in 42.6 s (15,866 tokens / 372.24 tok/s), after which generation ran at 22.98 tok/s. In practice, ~15K tokens of source code are ingested in about 40 seconds.
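As a sanity check on the table, the Total(s) column can be approximated as prefill time plus generation time. The sketch below is only an illustration: the function name `estimated_total_seconds` is made up for this example, and it assumes that total time is simply the sum of the two phases with no other overhead.

```python
# Representative runs copied from the table above:
# (run, PP tokens, TG tokens, PP tok/s, TG tok/s, reported total seconds)
RUNS = [
    (1, 4699, 13, 314.22, 21.95, 15.5),
    (3, 1125, 2048, 161.67, 20.81, 105.4),
    (8, 1124, 2048, 154.76, 23.82, 93.3),
    (14, 15866, 2048, 372.24, 22.98, 131.7),
    (16, 550, 520, 117.78, 22.91, 27.4),
]


def estimated_total_seconds(pp_tokens, tg_tokens, pp_speed, tg_speed):
    """Assumption: total time = prefill time + generation time."""
    return pp_tokens / pp_speed + tg_tokens / tg_speed


for run, pp_tok, tg_tok, pp_speed, tg_speed, reported in RUNS:
    estimate = estimated_total_seconds(pp_tok, tg_tok, pp_speed, tg_speed)
    print(f"run {run:>2}: estimated {estimate:6.1f} s, reported {reported:6.1f} s")
```

For the runs with 2,048 generated tokens, almost all of the wall-clock time is generation, which is why TG speed dominates perceived latency.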

Context Length Dependency

PP speed: Short prompts (<1k) stay at ~100 tok/s (overhead-dominated). Longer prompts increase parallel efficiency, accelerating to 300+ tok/s
TG speed: Remarkably stable at 21-24 tok/s regardless of context length

Hybrid Offload Configurations

Single GPU + cpu-moe

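IMG=compute.home.arpa/ik_llama-cuda
# Host model directory mounted at /models (same path as in the Inference Configuration section)
MO=/mnt/data/hf/hub/models--ubergarm--Qwen3.5-397B-A17B-GGUF
MODEL=/models/.../IQ4_KSS/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf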

podman run --rm -it --device nvidia.com/gpu=all \
  -p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z "$IMG" \
  --host 0.0.0.0 --port 8080 -m "$MODEL" \
  -c 262144 --threads 13 --threads-batch 24 \
  --jinja -b 2048 -ub 2048 -ngl 99 \
  -fa on --no-mmap --cpu-moe

--cpu-moe: Keeps the MoE expert weights, which do not fit in GPU VRAM, in system RAM and runs the expert computations on the CPU. The EPYC 9175F’s 12-channel DDR5 bandwidth feeds the active experts while the GPU accelerates the attention operations.

Multi-GPU Tensor Offload

With 2 GPUs, -ot enables regex-based layer distribution:

./build/bin/llama-server \
  --model "$model" \
  -fa on --ctx-size 135168 \
  -ctk q8_0 -ctv q8_0 \
  -ub 2048 -b 2048 -ngl 999 \
  -ot "blk\.(0|1|2|...|12)\.ffn_(gate|up|down)_exps.*=CUDA0,\
       blk\.(47|48|...|60)\.ffn_(gate|up|down)_exps.*=CUDA1" \
  --cpu-moe --threads 24 --no-mmap --jinja

Early layers (0-12) assigned to CUDA0, late layers (47-60) to CUDA1. Middle layers handled by cpu-moe. Flattens VRAM consumption while securing 135K context.
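Writing out the layer alternation by hand is error-prone, so the -ot value can be generated instead. The following is a minimal sketch, assuming the elided alternations simply enumerate every layer index in the ranges described above (0-12 and 47-60). The helper names `expert_override` and `build_ot_argument` are hypothetical, and the tensor name used in the match check assumes the usual GGUF `blk.N.ffn_up_exps.weight` naming.

```python
import re


def expert_override(first_layer: int, last_layer: int, device: str) -> str:
    """Build one 'regex=device' entry covering the expert tensors of a layer range."""
    layers = "|".join(str(i) for i in range(first_layer, last_layer + 1))
    return rf"blk\.({layers})\.ffn_(gate|up|down)_exps.*={device}"


def build_ot_argument(assignments) -> str:
    """Join several (first, last, device) assignments into one comma-separated value."""
    return ",".join(expert_override(a, b, dev) for a, b, dev in assignments)


def matches(entry: str, tensor_name: str) -> bool:
    """Check whether the regex part of an entry matches a tensor name."""
    pattern = entry.rsplit("=", 1)[0]
    return re.search(pattern, tensor_name) is not None


ot_value = build_ot_argument([(0, 12, "CUDA0"), (47, 60, "CUDA1")])
print(ot_value)
```

The printed string contains no spaces or line breaks, so it can be passed as a single quoted argument after -ot.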

Long-Context Code Generation: WordPress-Like Django CMS

Evaluation Setup

I fed Qwen3.5-397B-A17B a detailed English specification for a WordPress-like Django CMS foundation. The specification covered Posts/Pages, Categories/Tags, Comments, Media Library, Menus/Navigation, Drafts/scheduled publishing, and Revision history.

Inference Configuration

IMG=compute.home.arpa/ik_llama-cuda
MO=/mnt/data/hf/hub/models--ubergarm--Qwen3.5-397B-A17B-GGUF
MODEL=/models/snapshots/.../IQ4_KSS/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf

podman run --rm -it --device nvidia.com/gpu=all \
  -p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z "$IMG" \
  --host 0.0.0.0 --port 8080 -m "$MODEL" \
  -c 262144 --threads 13 --threads-batch 26 \
  --jinja --temp 0.7 --repeat-penalty 1.2 \
  --min-p 0.01 --top-p 0.95 --seed 317 --top-k 40 \
  -b 2048 -ub 2048 -ngl 99 \
  -fa on --no-mmap -ger

The 262k context setting allowed the full specification and generated implementation code to live in a single session.

Specification Design Insights

The most effective element was placing a WordPress-to-Django concept mapping table at the beginning. Post/Page -> content.Post/content.Page, Category/Tag -> taxonomy.Term, Post Meta -> content.PostMeta, Revision -> content.Revision. This gave the model a stable vocabulary that kept field design consistent through the long generation.

Explicitly stating non-goals (no WordPress compatibility, no Gutenberg replication, no multisite UI parity, no full WYSIWYG) reduced the chance of the model drifting into overbuilt architectures.

Generated Output

The model produced at least these files:

apps/core/models.py / models_site.py / models_setting.py
apps/content/models.py / models_page.py / models_extra.py
apps/taxonomy/models.py
apps/media/models.py
apps/comments/models.py
apps/navigation/models.py
apps/seo/models.py
config/settings/base.py / dev.py / prod.py
config/urls.py / asgi.py / wsgi.py
manage.py / requirements.txt

That is a meaningful amount of scaffolding from a single specification input.

Generation Strengths

Specification adherence: Post, Term, Revision, SEOEntry scaffolds were directly usable
Concept mapping: WordPress mental model preserved while remaining Django-native
Abstract models: TimeStampedModel, SoftDeleteModel, PublishableModel, SluggedModel generated naturally
SEO constraints: SEOEntry included CheckConstraint binding to either post or page

Generation Weaknesses

Field dropout: slug disappeared from Post mid-generation, had to be re-added via diff
File placement drift: Page was first split to models_page.py, then folded back into models.py
Field inconsistency: Comment.save() referenced self.author_ip before the field existed in the model
Tool failures: ctree_check, ctree_init, filesystem_create_directory failed repeatedly before falling back to write_file
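The field inconsistency case (a method referencing `self.author_ip` with no matching field) is the kind of gap a small static check can flag before human review. The sketch below is an illustration, not the workflow used in this test: the function name `undefined_self_attrs` and the sample source are invented for the example. It only looks at attributes defined directly in the class body, so anything inherited from a base class (for example `self.pk` on a Django model) would show up as a false positive.

```python
import ast

# Invented sample that mimics the Comment.save() problem described above.
# The source is only parsed, never executed, so Django does not need to be installed.
SAMPLE = '''
from django.db import models

class Comment(models.Model):
    body = models.TextField()

    def save(self, *args, **kwargs):
        if not self.author_ip:
            self.author_ip = "0.0.0.0"
        super().save(*args, **kwargs)
'''


def undefined_self_attrs(source: str) -> dict:
    """Map class name -> names used as self.<name> but not defined in the class body."""
    report = {}
    for cls in ast.parse(source).body:
        if not isinstance(cls, ast.ClassDef):
            continue
        defined = set()
        for stmt in cls.body:
            if isinstance(stmt, ast.Assign):
                defined.update(t.id for t in stmt.targets if isinstance(t, ast.Name))
            elif isinstance(stmt, ast.AnnAssign) and isinstance(stmt.target, ast.Name):
                defined.add(stmt.target.id)
            elif isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef)):
                defined.add(stmt.name)
        used = {
            node.attr
            for node in ast.walk(cls)
            if isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == "self"
        }
        missing = used - defined
        if missing:
            report[cls.name] = missing
    return report


print(undefined_self_attrs(SAMPLE))
```

Running a check like this over each generated models file gives the reviewer a short list of suspicious attributes to look at first.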

Code Generation Insight

Even with a well-structured specification, long-generation output is not a clean linear path. Tool failures, missing parent directories, model definition gaps, and field inconsistencies recur.

The value is not in trusting the output but in the speed of initial scaffolding. The human reviewer’s job is to quickly separate what is directly usable from what requires mandatory review.

Analysis

Why 397B Hits 22 tok/s

Despite the 397B total, only 17B parameters are active per token. Quantizing to roughly 4 bits per weight cuts the bytes read per token by about 3.5-4x relative to 16-bit weights. The 12-channel DDR5 bandwidth supplies the active experts while the GPU accelerates attention, achieving “17B-class speed” through division of labor.
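A rough back-of-the-envelope estimate shows why this speed is plausible. The sketch below computes a memory-bandwidth ceiling under deliberately simple assumptions that are not measurements from this article: IQ4_NL is taken as 4.5 bits per weight, all 17B active parameters are assumed to stream from system RAM on every token (in reality attention weights sit in VRAM), and DDR5 bandwidth is the theoretical peak of 8 bytes per transfer per channel. The function names are hypothetical.

```python
def ddr5_peak_gb_per_s(channels: int, mega_transfers_per_s: int) -> float:
    """Theoretical peak: 8 bytes per transfer per channel, MB/s converted to GB/s."""
    return channels * mega_transfers_per_s * 8 / 1000


def gb_read_per_token(active_params_billion: float, bits_per_weight: float) -> float:
    """Bytes touched per generated token if every active weight is read once."""
    return active_params_billion * bits_per_weight / 8


peak = ddr5_peak_gb_per_s(channels=12, mega_transfers_per_s=6400)
per_token = gb_read_per_token(active_params_billion=17, bits_per_weight=4.5)
ceiling = peak / per_token   # upper bound on tok/s under these assumptions
measured = 22.5              # mean TG from the 28 runs

print(f"peak bandwidth : {peak:.1f} GB/s")
print(f"read per token : {per_token:.2f} GB")
print(f"ceiling        : {ceiling:.1f} tok/s")
print(f"measured/ceiling: {measured / ceiling:.0%}")
```

Under these assumptions the measured 22.5 tok/s sits well below the theoretical ceiling, which is expected since real memory throughput, CPU compute, and CPU-GPU synchronization all fall short of the ideal.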

IQ4_NL Quantization Choice

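Unquantized deployment of 397B is impractical: at 16 bits per weight, the weights alone would be roughly 794 GB, more than this machine’s 768 GB of RAM. IQ4_NL drastically reduces the memory footprint while keeping quality degradation small. No obvious quality degradation was observed across the 28 runs.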

Warm-up Requirement

The first few runs show jitter; TG speed stabilizes after 2-3 dummy requests. For a resident deployment, incorporate a warm-up script.
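A warm-up script can be as small as a few throwaway requests. The sketch below is one possible shape, not the script used here. It assumes the server exposes the llama.cpp-style `/completion` endpoint accepting `prompt` and `n_predict`, and that it is reachable on port 8001 as in the podman port mapping above. The function names and default values are illustrative.

```python
import json
import time
import urllib.request


def post_completion(base_url: str, prompt: str, n_predict: int) -> dict:
    """Send one completion request (assumed llama.cpp-style /completion endpoint)."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    request = urllib.request.Request(
        f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.load(response)


def warm_up(send, rounds: int = 3, prompt: str = "Hello", n_predict: int = 32) -> list:
    """Issue `rounds` throwaway requests via `send`; return seconds taken by each."""
    timings = []
    for _ in range(rounds):
        start = time.perf_counter()
        send(prompt, n_predict)
        timings.append(time.perf_counter() - start)
    return timings
```

Calling `warm_up(lambda p, n: post_completion("http://127.0.0.1:8001", p, n))` once after the container starts would issue three short requests. Passing the sender as a parameter keeps the loop testable without a running server.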

Lessons Learned

“400B-class is experimental” is changing. 22.5 tok/s is sufficient for real-time human interaction, and represents high throughput for async batch processing.

TG speed stability at 21-24 tok/s regardless of context length is particularly important. Feeding 15K tokens of source code does not slow generation. This matters for use cases involving entire repository ingestion.

The Django CMS generation test confirmed that 400B-class models with long context can genuinely accelerate project initialization. The structure of the specification itself (concept mapping tables, explicit non-goals, recommended implementation order) directly affects generation quality, a lesson that applies to future prompt design.

Multi-GPU tensor offload is a “use it if you have 2 GPUs” configuration. cpu-moe alone works, but explicit layer distribution via -ot improves throughput.

Technical Notes

KV Cache Quantization

At 135K-262K context windows, KV cache consumes massive memory. -ctk q8_0 -ctv q8_0 quantizes KV cache to suppress memory pressure. Perceived quality impact is minimal.
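The saving from q8_0 can be estimated with the standard KV cache size formula. In the sketch below, the architecture numbers (`n_kv_layers`, `n_kv_heads`, `head_dim`) are placeholders chosen only for illustration and are not Qwen3.5-397B's actual values; only the 135,168 context size comes from the configuration above. q8_0 is modeled as 34 bytes per block of 32 values (32 int8 values plus one 16-bit scale).

```python
F16_BYTES_PER_VALUE = 2.0
Q8_0_BYTES_PER_VALUE = 34 / 32  # 32 int8 values + one 2-byte scale per block


def kv_cache_bytes(n_ctx, n_kv_layers, n_kv_heads, head_dim, bytes_per_value):
    """K and V each store n_ctx * n_kv_heads * head_dim values per caching layer."""
    return 2 * n_ctx * n_kv_layers * n_kv_heads * head_dim * bytes_per_value


# Placeholder architecture values (hypothetical, for illustration only).
HYPOTHETICAL = dict(n_kv_layers=60, n_kv_heads=8, head_dim=128)

f16_size = kv_cache_bytes(135168, bytes_per_value=F16_BYTES_PER_VALUE, **HYPOTHETICAL)
q8_size = kv_cache_bytes(135168, bytes_per_value=Q8_0_BYTES_PER_VALUE, **HYPOTHETICAL)

print(f"f16 : {f16_size / 2**30:.1f} GiB")
print(f"q8_0: {q8_size / 2**30:.1f} GiB")
print(f"ratio: {q8_size / f16_size:.3f}")
```

Whatever the real architecture values are, the ratio between q8_0 and f16 stays the same, so the cache shrinks to a little over half its 16-bit size.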

Sampling Parameters

For stable code generation: --temp 0.7 --repeat-penalty 1.2 --min-p 0.01 --top-p 0.95 --top-k 40. Temp 1.0 produces creative but verbose output.

Specification Design Guidelines for Code Generation

For repeating this workflow, split the source into three separate artifacts:

A reusable inference launch template
A clean CMS specification (with concept mapping + non-goals + implementation order)
A generation log plus review notes

Placing a concept dictionary first, explicitly stating non-goals, and specifying implementation order helps stabilize model output through long generation sessions.
