Background

Qwen3.5-397B is a MoE model with 397B total parameters and 17B active. Normally requires multiple H100s, but IQ4_NL quantization + cpu-moe + tensor offloading makes it runnable on EPYC + consumer GPU hardware.

The question is whether it merely “runs” or achieves “daily-use speed.” 28 consecutive inference runs provide the answer. Additionally, I tested its long-context code generation ability by feeding it a WordPress-like Django CMS specification and asking it to scaffold a full project.

Objective

  1. Statistically characterize steady-state TG/PP speed from 28 runs with IQ4_NL
  2. Document hybrid offload execution configs (cpu-moe / multi-GPU tensor offload)
  3. Evaluate context-length dependency and stability
  4. Determine daily viability of 400B-class MoE
  5. Evaluate long-context code generation quality (Django CMS scaffold)

Test Environment

ItemSpecification
CPUAMD EPYC 9175F (Zen 5, 16C, L3 512MB)
MemoryDDR5-6400 768GB (12ch)
GPUNVIDIA RTX PRO 6000 (96GB VRAM)
OSUbuntu 24.04 LTS
Runtimeik_llama.cpp (cpu-moe enabled)
QuantizationIQ4_NL
ContextUp to 262,144

Results

Throughput Statistics (28 Runs)

MetricPrefill (PP)Generation (TG)
Maximum372.24 tok/s24.04 tok/s
Minimum101.49 tok/s19.13 tok/s
Mean (Steady State)~160 tok/s~22.5 tok/s

Representative Runs

#PP(tok)TG(tok)PP(tok/s)TG(tok/s)Total(s)
14,69913314.2221.9515.5
31,1252,048161.6720.81105.4
81,1242,048154.7623.8293.3
1415,8662,048372.2422.98131.7
16550520117.7822.9127.4

Run #14: 15,866-token massive input processed in 42.6s, generation starting at 22.98 tok/s. 15K tokens of source code ingested in ~40 seconds.

Context Length Dependency

  • PP speed: Short prompts (<1k) stay at ~100 tok/s (overhead-dominated). Longer prompts increase parallel efficiency, accelerating to 300+ tok/s
  • TG speed: Remarkably stable at 21-24 tok/s regardless of context length

Hybrid Offload Configurations

Single GPU + cpu-moe

  IMG=compute.home.arpa/ik_llama-cuda
MODEL=/models/.../IQ4_KSS/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf

podman run --rm -it --device nvidia.com/gpu=all \
  -p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z "$IMG" \
  --host 0.0.0.0 --port 8080 -m "$MODEL" \
  -c 262144 --threads 13 --threads-batch 24 \
  --jinja -b 2048 -ub 2048 -ngl 99 \
  -fa on --no-mmap --cpu-moe
  

--cpu-moe: Offloads expert computations that exceed GPU VRAM to CPU. EPYC 9175F’s 12-channel DDR5 bandwidth supplies active experts while GPU accelerates Attention operations.

Multi-GPU Tensor Offload

With 2 GPUs, -ot enables regex-based layer distribution:

  ./build/bin/llama-server \
  --model "$model" \
  -fa on --ctx-size 135168 \
  -ctk q8_0 -ctv q8_0 \
  -ub 2048 -b 2048 -ngl 999 \
  -ot "blk\.(0|1|2|...|12)\.ffn_(gate|up|down)_exps.*=CUDA0,\
       blk\.(47|48|...|60)\.ffn_(gate|up|down)_exps.*=CUDA1" \
  --cpu-moe --threads 24 --no-mmap --jinja
  

Early layers (0-12) assigned to CUDA0, late layers (47-60) to CUDA1. Middle layers handled by cpu-moe. Flattens VRAM consumption while securing 135K context.

Long-Context Code Generation: WordPress-Like Django CMS

Evaluation Setup

I fed Qwen3.5-397B-A17B a detailed English specification for a WordPress-like Django CMS foundation. The specification covered Posts/Pages, Categories/Tags, Comments, Media Library, Menus/Navigation, Drafts/scheduled publishing, and Revision history.

Inference Configuration

  IMG=compute.home.arpa/ik_llama-cuda
MO=/mnt/data/hf/hub/models--ubergarm--Qwen3.5-397B-A17B-GGUF
MODEL=/models/snapshots/.../IQ4_KSS/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf

podman run --rm -it --device nvidia.com/gpu=all \
  -p 8001:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z "$IMG" \
  --host 0.0.0.0 --port 8080 -m "$MODEL" \
  -c 262144 --threads 13 --threads-batch 26 \
  --jinja --temp 0.7 --repeat-penalty 1.2 \
  --min-p 0.01 --top-p 0.95 --seed 317 --top-k 40 \
  -b 2048 -ub 2048 -ngl 99 \
  -fa on --no-mmap -ger
  

The 262k context setting allowed the full specification and generated implementation code to live in a single session.

Specification Design Insights

The most effective element was placing a WordPress-to-Django concept mapping table at the beginning. Post/Page -> content.Post/content.Page, Category/Tag -> taxonomy.Term, Post Meta -> content.PostMeta, Revision -> content.Revision. This gave the model a stable vocabulary that kept field design consistent through the long generation.

Explicitly stating non-goals (no WordPress compatibility, no Gutenberg replication, no multisite UI parity, no full WYSIWYG) reduced the chance of the model drifting into overbuilt architectures.

Generated Output

The model produced at least these files:

  • apps/core/models.py / models_site.py / models_setting.py
  • apps/content/models.py / models_page.py / models_extra.py
  • apps/taxonomy/models.py
  • apps/media/models.py
  • apps/comments/models.py
  • apps/navigation/models.py
  • apps/seo/models.py
  • config/settings/base.py / dev.py / prod.py
  • config/urls.py / asgi.py / wsgi.py
  • manage.py / requirements.txt

That is a meaningful amount of scaffolding from a single specification input.

Generation Strengths

  • Specification adherence: Post, Term, Revision, SEOEntry scaffolds were directly usable
  • Concept mapping: WordPress mental model preserved while remaining Django-native
  • Abstract models: TimeStampedModel, SoftDeleteModel, PublishableModel, SluggedModel generated naturally
  • SEO constraints: SEOEntry included CheckConstraint binding to either post or page

Generation Weaknesses

  • Field dropout: slug disappeared from Post mid-generation, had to be re-added via diff
  • File placement drift: Page was first split to models_page.py, then folded back into models.py
  • Field inconsistency: Comment.save() referenced self.author_ip before the field existed in the model
  • Tool failures: ctree_check, ctree_init, filesystem_create_directory failed repeatedly before falling back to write_file

Code Generation Insight

Even with a well-structured specification, long-generation output is not a clean linear path. Tool failures, missing parent directories, model definition gaps, and field inconsistencies recur.

The value is not in trusting the output but in the speed of initial scaffolding. The human reviewer’s job is to quickly separate what is directly usable from what requires mandatory review.

Analysis

Why 397B Hits 22 tok/s

Despite the 397B total, only 17B is active per token. IQ4_NL compresses memory bandwidth load by 4x+. The 12-channel DDR5 bandwidth supplies active experts while GPU accelerates Attention, achieving “17B-class speed” through division of labor.

IQ4_NL Quantization Choice

Unquantized deployment of 397B is impractical. IQ4_NL minimizes quality degradation while drastically reducing memory footprint. No obvious quality degradation observed across 28 runs.

Warm-up Requirement

First few runs show behavioral jitter. 2-3 dummy requests before stable TG speed. For resident deployment, incorporate a warm-up script.

Lessons Learned

“400B-class is experimental” is changing. 22.5 tok/s is sufficient for real-time human interaction, and represents high throughput for async batch processing.

TG speed stability at 21-24 tok/s regardless of context length is particularly important. Feeding 15K tokens of source code does not slow generation. This matters for use cases involving entire repository ingestion.

The Django CMS generation test confirmed that 400B-class models with long context can genuinely accelerate project initialization. The structure of the specification itself (concept mapping tables, explicit non-goals, recommended implementation order) directly affects generation quality, a lesson that applies to future prompt design.

Multi-GPU tensor offload is a “use it if you have 2 GPUs” configuration. cpu-moe alone works, but explicit layer distribution via -ot improves throughput.

Technical Notes

KV Cache Quantization

At 135K-262K context windows, KV cache consumes massive memory. -ctk q8_0 -ctv q8_0 quantizes KV cache to suppress memory pressure. Perceived quality impact is minimal.

Sampling Parameters

For stable code generation: --temp 0.7 --repeat-penalty 1.2 --min-p 0.01 --top-p 0.95 --top-k 40. Temp 1.0 produces creative but verbose output.

Specification Design Guidelines for Code Generation

For repeating this workflow, split the source into three separate artifacts:

  1. A reusable inference launch template
  2. A clean CMS specification (with concept mapping + non-goals + implementation order)
  3. A generation log plus review notes

Placing a concept dictionary first, explicitly stating non-goals, and specifying implementation order helps stabilize model output through long generation sessions.