Background

MiniMax-2.5 is a 229B-parameter mixture-of-experts (MoE) model featuring 256 experts, designed for long-context processing and knowledge-intensive tasks. When evaluating local deployment options, balancing GPU memory constraints against generation speed is critical. This benchmark validates the viability of IQ5_K quantization on Blackwell-class hardware.

The objective was to assess feasibility and performance on NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM) under a 65,536-token context window setting.

Objective

  1. Confirm that MiniMax-2.5 229B MoE runs under IQ5_K quantization on Blackwell hardware
  2. Separately measure prompt evaluation (prefill) vs. generation (decode) throughput
  3. Quantify the impact of expert CPU placement (-ot exps=CPU)
  4. Validate 65,536-token KV cache behavior and prompt cache stability

Experimental Environment

| Item | Specification |
| --- | --- |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (96GB VRAM, Compute Capability 12.0) |
| CPU | Intel/AMD EPYC with AVX/AVX2/AVX512 support |
| Memory | 768GB DDR5 |
| Model | MiniMax-2.5 (minimax-m2 architecture), 229B.A10B (MoE) |
| Quantization | IQ5_K (5.5 bits per weight nominal, model size 157.77 GiB) |
| Context Length | 65,536 tokens |
| Runtime | llama.cpp (commit: 1cb7e1bf, build: 4192) |

Implementation

Launch Command

  podman run --rm -it \
  --device nvidia.com/gpu=all \
  -p 8081:8080 \
  --shm-size 16g \
  --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z \
  $IMG \
  --host 0.0.0.0 --port 8080 \
  -m "$MODEL" \
  --no-mmap --jinja \
  -c 65536 \
  --threads 13 --threads-batch 25 \
  -b 2048 -ub 2048 \
  -ngl 99 \
  -ot exps=CPU \
  -ctk f16 -ctv f16 \
  --warmup-batch \
  -fa on
  

Parameter Explanation

  • -c 65536: Set context length to 65,536 tokens
  • --threads 13 / --threads-batch 25: CPU threads for generation and for batch (prefill) processing
  • -b 2048 -ub 2048: Logical and physical batch sizes for prompt processing
  • -ngl 99: Offload up to 99 layers to GPU (more than the model has, so all layers are offloaded)
  • -ot exps=CPU: Force MoE expert weights into host (CPU) memory
  • -fa on: Enable Flash Attention
  • --no-mmap: Load weights directly into memory instead of memory-mapping the file

Benchmark Execution

Eight request cycles with varying prompt and generation lengths were measured via the HTTP /v1/chat/completions endpoint; response times were recorded end-to-end.
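The measurement loop can be sketched as follows. This is a minimal sketch, not the harness used for the results below: the endpoint URL (host port 8081 from the podman mapping), model name, and prompt are assumptions, and the usage field names follow the OpenAI-compatible schema.

```python
# Minimal benchmark-loop sketch (assumptions: server from the launch command
# above, reachable on host port 8081; OpenAI-compatible "usage" fields).
import json
import time
import urllib.request

ENDPOINT = "http://localhost:8081/v1/chat/completions"

def tok_per_s(tokens: int, ms: float) -> float:
    """Throughput in tokens/second given a token count and a duration in milliseconds."""
    return tokens / (ms / 1000.0)

def run_once(prompt: str, max_tokens: int = 1024) -> dict:
    """POST one chat completion and return token counts plus end-to-end latency."""
    body = json.dumps({
        "model": "minimax",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"})
    t0 = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    total_ms = (time.monotonic() - t0) * 1000.0
    usage = payload.get("usage", {})
    return {
        "prompt_tok": usage.get("prompt_tokens", 0),
        "gen_tok": usage.get("completion_tokens", 0),
        "total_ms": total_ms,
    }

if __name__ == "__main__":
    # Sanity check against Run 1 of the results table:
    # 968 total tokens in 9,597 ms -> ~100.9 tok/s end to end.
    print(round(tok_per_s(968, 9_597), 2))
    # Live measurement (requires the server to be running):
    # for i in range(8):
    #     r = run_once("Your prompt...")
    #     print(i + 1, r, round(tok_per_s(r["prompt_tok"] + r["gen_tok"], r["total_ms"]), 2))
```

End-to-end latency includes HTTP overhead, so per-phase tok/s should come from the server's own timing fields when available.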

Results

Memory Layout (Measured)

| Memory Region | Size |
| --- | --- |
| CPU buffer | 157,356 MiB (IQ5_K weights) |
| CUDA0 buffer | 3,578.73 MiB (computation temporaries) |
| KV cache (CUDA0) | 15,872 MiB (K: 7.75 GiB + V: 7.75 GiB) |
| Compute buffer (CUDA0) | 1,990 MiB |

The KV cache is allocated in f16 and remains stable under the full 65,536-token load. The GPU-side compute buffer stays around 2 GiB, posing no operational impediment.
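The measured KV footprint is consistent with simple arithmetic. The layer count and per-layer KV width below are inferred to fit the measured sizes, not read from the logs, so treat them as assumptions:

```python
# f16 KV-cache sizing consistent with the measured 15,872 MiB.
# Assumed (inferred, not from logs): 62 transformer layers and a per-layer
# KV width of 1,024 elements (e.g. 8 KV heads x 128-dim heads).
N_CTX = 65_536       # context length in tokens
N_LAYERS = 62        # assumed transformer layer count
KV_DIM = 1_024       # K elements per token per layer (V is the same)
BYTES_F16 = 2        # bytes per f16 element

k_bytes = N_CTX * N_LAYERS * KV_DIM * BYTES_F16   # K cache alone
kv_mib = 2 * k_bytes / 2**20                      # K + V, in MiB

print(kv_mib)  # 15872.0 -> matches the measured 15,872 MiB (7.75 GiB each for K and V)
```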

Benchmark Results (8 Runs)

| Run | Prompt tok | PP ms | PP tok/s | Gen tok | Gen ms | Gen tok/s | Total tok | Total ms | Total tok/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 753 | 3,517 | 214.07 | 215 | 6,079 | 35.37 | 968 | 9,597 | 100.87 |
| 2 | 386 | 2,266 | 170.36 | 196 | 5,563 | 35.23 | 582 | 7,829 | 74.34 |
| 3 | 297 | 1,840 | 161.38 | 240 | 6,816 | 35.21 | 537 | 8,656 | 62.04 |
| 4 | 341 | 2,053 | 166.12 | 783 | 22,651 | 34.57 | 1,124 | 24,703 | 45.50 |
| 5 | 1,264 | 6,152 | 205.46 | 734 | 21,259 | 34.53 | 1,998 | 27,411 | 72.89 |
| 6 | 942 | 4,377 | 215.21 | 921 | 26,849 | 34.30 | 1,863 | 31,226 | 59.66 |
| 7 | 938 | 4,338 | 216.23 | 157 | 4,576 | 34.31 | 1,095 | 8,914 | 122.84 |
| 8 | 1,075 | 6,097 | 176.32 | 1,351 | 40,019 | 33.76 | 2,426 | 46,116 | 52.61 |

Statistical Summary

| Metric | Prompt tok/s | Gen tok/s | Total tok/s |
| --- | --- | --- | --- |
| Mean | 190.64 | 34.66 | 73.84 |
| Median | 190.89 | 34.55 | 67.46 |
| Min | 161.38 | 33.76 | 45.50 |
| Max | 216.23 | 35.37 | 122.84 |
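The summary can be reproduced directly from the per-run table; the snippet below recomputes the prompt and generation statistics:

```python
# Recompute the summary statistics from the eight per-run tok/s values above.
from statistics import mean, median

pp  = [214.07, 170.36, 161.38, 166.12, 205.46, 215.21, 216.23, 176.32]
gen = [35.37, 35.23, 35.21, 34.57, 34.53, 34.30, 34.31, 33.76]

print(round(mean(pp), 2), round(median(pp), 2))    # 190.64 190.89
print(round(mean(gen), 2), round(median(gen), 2))  # 34.66 34.55
print(min(pp), max(pp), min(gen), max(gen))        # 161.38 216.23 33.76 35.37
```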

Prompt Evaluation Variability

Prefill throughput varies widely across runs (161-216 tok/s, per the table above) due to:

  • Cache hit patterns: Reused prompts benefit from cached entries
  • KV maintenance overhead: As prompt cache approaches 15.8 GiB, eviction and consistency checks incur costs
  • NUMA/paging effects: 157 GiB CPU-side memory access patterns not uniform across requests

Generation Stability

Decode throughput holds steady at 33.76-35.37 tok/s across all runs, indicating that memory bandwidth, rather than compute, becomes the dominant constraint once expert computation shifts to the CPU.

Discussion

Expert CPU Placement Impact

Despite logs showing "offloaded 63/63 layers to GPU," the -ot exps=CPU override keeps all 256 expert weight tensors in host memory. This results in:

  • CPU-resident expert compute: each decode step reads the weights of the 8 routed experts from system RAM
  • Activation transfer overhead: intermediate activations cross the PCIe bus between GPU-resident attention and CPU-resident expert FFNs on every step
  • Generation ceiling: the ~35 tok/s plateau points to a host-memory/PCIe bandwidth bottleneck
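A back-of-envelope estimate supports the bandwidth interpretation. Assuming roughly 10B active parameters per token (the "A10B" designation) at the nominal 5.5 bits/weight, the decode plateau implies sustained weight reads far beyond PCIe Gen 5 x16 rates (~64 GB/s), suggesting the expert matmuls execute out of host DRAM rather than being streamed to the GPU per token. These figures are assumptions, not measurements:

```python
# Rough weight-traffic estimate behind the ~35 tok/s decode plateau.
# Assumptions: ~10B active parameters per token (A10B), 5.5 bits/weight (IQ5_K).
ACTIVE_PARAMS = 10e9
BITS_PER_WEIGHT = 5.5
DECODE_TOK_S = 35.0   # observed plateau

gb_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8 / 1e9
gb_per_s = gb_per_token * DECODE_TOK_S

print(round(gb_per_token, 2))  # ~6.88 GB of weights touched per decoded token
print(round(gb_per_s))         # ~241 GB/s sustained reads, far above PCIe Gen 5 x16
```

A sustained ~241 GB/s is plausible for many-channel DDR5 but not for per-token PCIe streaming, which is consistent with the observed flat decode rate.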

Prompt Cache Operation

Case B logs confirm functional prompt cache:

  • 6,029 tokens → 1,460 MiB state saved
  • 23,104 tokens → 5,595 MiB state saved
  • 8,192 MiB limit respected

Reusable system prompts benefit from cache hits, mitigating first-request prefill slowdown.
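Both saved-state log lines imply the same per-token cost, which matches the full-context KV footprint from the memory table; a quick consistency check:

```python
# Per-token prompt-cache state size implied by the two log lines above.
cases = [(6_029, 1_460), (23_104, 5_595)]     # (cached tokens, saved MiB)
for toks, mib in cases:
    print(round(mib / toks, 3))               # ~0.242 MiB per cached token

# The full 65,536-token KV cache (15,872 MiB) gives the same per-token figure,
# suggesting the saved state is dominated by the f16 KV entries.
print(round(15_872 / 65_536, 3))              # 0.242
```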

Flash Attention Effectiveness

With Flash Attention active (-fa on), attention over the full 15.8 GiB KV cache remains efficient: the fused kernel avoids materializing the full attention matrix, keeping memory access patterns favorable even at 65,536 tokens of context.

Conclusion

End-to-end stability, from startup through HTTP request acceptance, KV allocation, and Flash Attention activation, demonstrates that 65,536-token serving is technically achievable on this hardware.

Performance tuning, however, is limited while -ot exps=CPU and --no-mmap remain unchanged: minor parameter tweaks (threads, batch size) yield only marginal gains. Address the expert placement and mmap strategy first for meaningful improvement.

The tokenizer warning (special_eos_id not in special_eog_ids) warrants investigation, as it may degrade stop condition and tag interpretation reliability.

Reproduction

1. Model Retrieval

  huggingface-cli download TheBloke/MiniMax-2.5-A10B-IQ5_K-GGUF
  

2. Start llama.cpp Server

Follow the "Implementation" section above. Replace $MODEL with the GGUF file path and $IMG with the llama.cpp container image.

3. Benchmark Measurement

  curl -X POST http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax",
    "messages": [{"role": "user", "content": "Your prompt..."}],
    "max_tokens": 1024
  }' | jq .usage
  

Compute tok/s from the token counts reported in usage (prompt_tokens, completion_tokens) together with measured wall-clock time; where the server exposes per-phase timing fields, use those to separate prefill from decode. Average multiple requests for stability.

Technical Notes

Expert Offload Strategies

-ot exps=CPU minimizes the GPU memory footprint but sacrifices throughput. Alternatives:

  1. Full GPU placement (drop the override): fastest, but requires VRAM for all expert weights; not feasible here, as the 157.77 GiB model exceeds the 96GB of VRAM
  2. Mixed offload: pin a subset of expert layers to the GPU with per-layer -ot patterns while keeping the rest on CPU (more complex, but balances VRAM against throughput)

No-mmap Implications

--no-mmap loads all ~158 GiB of weights up front, increasing startup latency and the likelihood of paging pressure on NUMA systems. Enabling mmap shortens initialization but may increase page faults at runtime, depending on access patterns.

Tokenizer Configuration

The special_eos_id is not in special_eog_ids warning indicates a mismatch between the model's tokenizer metadata and llama.cpp's interpretation of end-of-generation tokens. Verify the special-token configuration with the model provider, and confirm that stop sequences behave correctly before relying on the server in production.

Context Length and Batch Efficiency

Larger context windows expand KV cache memory footprint, reducing effective batch efficiency. The -b 2048 -ub 2048 config proved stable here but requires tuning for different hardware/memory configurations.

Related Topics

  • MiniMax-2.5 Expert Offload and Web Generation — Quantization comparison across IQ4_NL/IQ3_S, one-shot generation of React LP and dental clinic site. Includes a video demonstrating actual generation output
  • NVIDIA Blackwell compute properties (Capability 12.0)
  • llama.cpp MoE support maturity
  • Flash Attention KV cache optimization
  • PCIe Gen 5 bandwidth and expert computation scaling