MiniMax-2.5 (229B MoE) Expert Offload and Web Generation: IQ5_K to IQ3_S

Summary

MiniMax-2.5 (229B MoE, 8 active out of 256 experts) runs stably on 96GB VRAM via Expert Offload (-ot exps=CPU). Decode speed holds at 34-37 tok/s (IQ5_K), and dropping to IQ3_S allows more weights on GPU for improved speed. Using this inference capability, two one-shot web generation tests were validated: a React landing page (IQ4_NL) and a dental clinic static site (IQ3_S).

Test Environment

Item	Specification
CPU	AMD EPYC 9175F (Zen 5, 16C/32T, L3 512MB)
GPU	NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM)
Memory	DDR5-6400 768GB (12ch)
OS	Ubuntu 24.04 LTS
Runtime	ik_llama.cpp (build 4192)
Container	Podman rootless

Model Specifications

Item	Value
Architecture	MiniMax-M2 (MoE)
Size	229B.A10B (229B total, 10B active)
Layers	62
Experts	256 (8 active)
Training context	196,608
rope freq_base	5,000,000

Part 1: Expert Offload Benchmark (IQ5_K)

Execution Command

  podman run --rm -it \
  --device nvidia.com/gpu=all \
  -p 8081:8080 \
  --shm-size 16g \
  --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z \
  $IMG \
  --host 0.0.0.0 --port 8080 \
  -m "$MODEL" \
  --no-mmap --jinja \
  -c 65536 \
  --threads 13 --threads-batch 25 \
  -b 2048 -ub 2048 \
  -ngl 99 \
  -ot exps=CPU \
  -ctk f16 -ctv f16 \
  --warmup-batch \
  -fa on

Key parameters:

-ot exps=CPU: Force Expert weights (bulk of 157GB) to CPU memory
-ngl 99: Shows 63/63 layers GPU-offloaded, but Experts override via -ot
--no-mmap: Pin 157GB in physical RAM, eliminate page faults
--warmup-batch: Run warmup on startup

Memory Layout

Region	Size	Location
CPU buffer (Expert weights)	157,356 MiB (~154GB)	CPU RAM
KV cache	15,872 MiB (~15.5GB)	CUDA0
Compute buffer	1,990 MiB	CUDA0
GPU buffer (Attention, etc.)	3,579 MiB	CUDA0

Actual GPU VRAM usage is ~21GB. The remaining 75GB is available for KV cache expansion.

Results: 8 Consecutive Runs (65K Context)

Run	Prompt(tok)	PP(tok/s)	Gen(tok)	TG(tok/s)	Total(s)
1	753	214.07	215	35.37	9.6
2	386	170.36	196	35.23	7.8
3	297	161.38	240	35.21	8.7
4	341	166.12	783	34.57	24.7
5	1,264	205.46	734	34.53	27.4
6	942	215.21	921	34.30	31.2
7	938	216.23	157	34.31	8.9
8	1,075	176.32	1,351	33.76	46.1

Metric	PP(tok/s)	TG(tok/s)
Mean	190.64	34.66
Median	190.89	34.55
Min	161.38	33.76
Max	216.23	35.37

Extended Session (131K Context)

#	PP(tok)	TG(tok)	Ctx Used	PP(tok/s)	TG(tok/s)
1	3,227	72	4,820	268.08	37.69
4	2,708	512	8,223	289.89	35.96
8	1,965	192	11,172	313.60	35.39

TG speed is extremely stable at 33.7-37.7 tok/s. No sharp degradation as context accumulates.

Analysis: Expert Offload Practicality

Speed stability: ~10% TG variation across 8 runs. PCIe bandwidth is the bottleneck but provides consistent rate-limiting
“All layers GPU offloaded” trap: Logs show “offloaded 63/63 layers to GPU” but Expert weights are actually on CPU via -ot. Display versus reality diverge
Prompt cache: 6,029 tokens → 1,460 MiB state, 23,104 tokens → 5,596 MiB. TTFT reduction confirmed on repeated runs
Tokenizer warning: special_eos_id is not in special_eog_ids — if left unaddressed, EOG/EOS judgment becomes unstable

One-Prompt Generation Live Demonstration

To verify the output quality of one-prompt generation, we recorded MiniMax-2.5 generating a website in real time. By observing the inference server output directly, you can confirm the actual generation quality.

This video demonstrates:

One-shot code generation process from a specification document (AGENTS.md)
Token-by-token generation phase output speed
Structure and quality of the generated code

Part 2: React LP One-Shot Generation (IQ4_NL)

Configuration

Quantization: IQ4_NL (lighter than IQ5_K, quality nearly maintained)
Context: 262,144 (256K)
Input: AGENTS.md (detailed site specification for loFT LLC)
Stack: Vite + React + TypeScript + Tailwind CSS v4

Results

MiniMax-2.5 generated a complete loFT LLC corporate site from the AGENTS.md specification in one shot.

Generated component structure (Atomic Design):

Atoms: Button, Heading, Text, Icon, Input, TextArea, Badge
Molecules: ServiceCard, CaseStudyCard, TestimonialCard, NavItem
Organisms: NavBar, HeroSection, ServiceGrid, ProcessTimeline, ContactForm, Footer

Implemented features:

Hero section (animated background with function curves)
Contact form (react-hook-form + zod validation)
SEO optimization (JSON-LD structured data, dynamic meta tags, favicon generation)
SPA routing (react-router-dom)

Notable: Autonomous Tailwind v4 adaptation Tailwind CSS v4 deprecated tailwind.config.js in favor of CSS-first configuration. MiniMax-2.5 noticed this from npx tailwindcss init failure logs and switched to @import "tailwindcss";.

TypeScript error handling: The model read build error output, attempted to identify and fix type import mismatches and unused variables, ultimately reaching “Build Succeeded.”

Part 3: Dental Clinic Static Site (IQ3_S GPU Full Load)

Design Intent: Why IQ3_S

Dropping to IQ3_S allows placing more Expert weights in GPU VRAM, reducing latency from -ot exps=CPU. Precision is partially traded, but the speed gain is net-positive for structured code generation tasks.

Requirements

6-page static HTML site (index, services, doctors, info, visit, access)
No build step, Tailwind CSS (CDN), Alpine.js (CDN)

Results: 6 Pages in ~18 Minutes

Page	Key Content	Alpine.js Interactions
index.html	Hero, 8 service cards, doctor previews, news	Mobile nav, language switch UI
services.html	14 treatment cards, FAQ section	Category filtering, accordion
doctors.html	6 doctor profiles, specialties	Expandable details
info.html	Fee table, insurance, payment methods	Estimate modal
visit.html	First visit flow, reservations	Validated form
access.html	Map, transit, parking	Direction tab switching

Total time: 18 minutes 2 seconds (design, implementation, diagnostics)
Diagnostics: Zero errors and warnings
Quality: Even at IQ3_S, accessibility (ESC key close, focus management) and semantic HTML structure were maintained

Generated Site Screenshots

Dental clinic site: Homepage hero section and service cards — Homepage — Hero + 8 Service Cards

Dental clinic site: Homepage footer with clinic hours, insurance, and first visit CTA — Homepage Bottom — Clinic Hours, First Visit CTA, Footer

Dental clinic site: Fees and insurance page — Info Page — Fee Schedule + Insurance & Payment Methods

Dental clinic site: Services page with category filter and FAQ accordion — Services Page — Category Filter + FAQ Accordion

Quantization-Level Operating Guidelines

Quantization	Model Size	TG Speed	Use Case
IQ5_K	157.7 GiB	34-37 tok/s	Benchmarking, quality-critical generation
IQ4_NL	~120 GiB (est.)	Improved	Long-context work, React LP generation
IQ3_S	~100 GiB (est.)	Further improved	High-volume structured code, speed-priority

“Running the biggest model at max precision” is not always optimal. Adjusting quantization to task characteristics and prioritizing GPU-resident weights is a practical operational strategy.

Takeaways

A 229B model running locally at 37 tok/s would have been hard to imagine a few years ago. Expert Offload — “if it doesn’t fit on GPU, move it to CPU” — is a rational approach exploiting MoE’s structural property of independent experts. Both the React LP and dental clinic site produced production-grade code from specifications. The 229B-class model’s judgment capacity demonstrably maintains cross-file consistency on complex tasks.

Reproduction

Model Download

  huggingface-cli download <quantizer>/MiniMax-M2.5-GGUF \
  --include "MiniMax-M2.5-IQ5_K.gguf" \
  --local-dir /mnt/data/hf/hub/models--MiniMax-2.5-GGUF

Benchmark

Refer to the “Part 1: Execution Command” section above. Requires ik_llama.cpp with Expert Offload support.

Measurement

  curl -s http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"minimax","messages":[{"role":"user","content":"Explain MoE architecture"}],"max_tokens":512}'

Technical Notes

–no-mmap Necessity

Reading 157GB via mmap causes intermittent page faults that destabilize latency. --no-mmap pins to physical RAM, eliminating page faults during inference. Startup time increases to several minutes.

Expert Count vs Decode Speed

MiniMax-2.5 selects 8 from 256 experts. Compared to GLM-4.7-Flash (4 from 64), Expert selection compute is larger but individual Expert size is smaller (FFN 1,536 dim). Expert count alone does not predict Decode speed.

MiniMax-2.5 (229B MoE) Expert Offload and Web Generation: IQ5_K to IQ3_S

Summary link

Test Environment link

Model Specifications link

Part 1: Expert Offload Benchmark (IQ5_K) link

Execution Command link

Memory Layout link

Results: 8 Consecutive Runs (65K Context) link

Extended Session (131K Context) link

Analysis: Expert Offload Practicality link

One-Prompt Generation Live Demonstration link

Part 2: React LP One-Shot Generation (IQ4_NL) link

Configuration link

Results link

Part 3: Dental Clinic Static Site (IQ3_S GPU Full Load) link

Design Intent: Why IQ3_S link

Requirements link

Results: 6 Pages in ~18 Minutes link

Generated Site Screenshots link

Quantization-Level Operating Guidelines link

Takeaways link

Reproduction link

Model Download link

Benchmark link

Measurement link

Technical Notes link

–no-mmap Necessity link

Expert Count vs Decode Speed link