Background

MiniMax-2.5 is a 229B-parameter mixture-of-experts (MoE) model featuring 256 experts, designed for long-context processing and knowledge-intensive tasks. When evaluating local deployment options, balancing GPU memory constraints against generation speed is critical. This benchmark validates the viability of IQ5_K quantization on Blackwell-class hardware.

The objective was to assess feasibility and performance on NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM) under a 65,536-token context window setting.

Objective

  1. Confirm that MiniMax-2.5 229B MoE runs under IQ5_K quantization on Blackwell hardware
  2. Separately measure prompt evaluation (prefill) vs. generation (decode) throughput
  3. Quantify the impact of expert CPU placement (-ot exps=CPU)
  4. Validate 65,536-token KV cache behavior and prompt cache stability

Experimental Environment

| Item | Specification |
| --- | --- |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (96GB VRAM, Compute Capability 12.0) |
| CPU | Intel/AMD EPYC with AVX/AVX2/AVX512 support |
| Memory | 768GB DDR5 |
| Model | MiniMax-2.5 (minimax-m2 architecture), 229B.A10B (MoE) |
| Quantization | IQ5_K (5.5 bits per weight nominal, model size 157.77 GiB) |
| Context Length | 65,536 tokens |
| Runtime | llama.cpp (commit: 1cb7e1bf, build: 4192) |

Implementation

Launch Command

  podman run --rm -it \
  --device nvidia.com/gpu=all \
  -p 8081:8080 \
  --shm-size 16g \
  --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z \
  $IMG \
  --host 0.0.0.0 --port 8080 \
  -m "$MODEL" \
  --no-mmap --jinja \
  -c 65536 \
  --threads 13 --threads-batch 25 \
  -b 2048 -ub 2048 \
  -ngl 99 \
  -ot exps=CPU \
  -ctk f16 -ctv f16 \
  --warmup-batch \
  -fa on
  

Parameter Explanation

  • -c 65536: Set context length to 65,536 tokens
  • --threads 13 / --threads-batch 25: CPU threads for generation and for batch (prefill) processing
  • -b 2048 -ub 2048: Logical and physical batch sizes for prompt processing
  • -ngl 99: Offload up to 99 layers to GPU (more than the model has, so all layers are offloaded)
  • -ot exps=CPU: Force MoE expert weights into host (CPU) memory
  • -fa on: Enable Flash Attention
  • --no-mmap: Load weights directly into memory instead of memory-mapping the file

Benchmark Execution

Eight request cycles with varying prompt and generation lengths were measured via the HTTP /v1/chat/completions endpoint; response times were recorded end-to-end.
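The measurement loop can be sketched as follows. This is a minimal sketch, not the harness used for the results below: the endpoint URL (host port 8081 from the podman mapping), model name, and prompt are assumptions, and the usage field names follow the OpenAI-compatible schema.

```python
# Minimal benchmark-loop sketch (assumptions: server from the launch command
# above, reachable on host port 8081; OpenAI-compatible "usage" fields).
import json
import time
import urllib.request

ENDPOINT = "http://localhost:8081/v1/chat/completions"

def tok_per_s(tokens: int, ms: float) -> float:
    """Throughput in tokens/second given a token count and a duration in milliseconds."""
    return tokens / (ms / 1000.0)

def run_once(prompt: str, max_tokens: int = 1024) -> dict:
    """POST one chat completion and return token counts plus end-to-end latency."""
    body = json.dumps({
        "model": "minimax",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"})
    t0 = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    total_ms = (time.monotonic() - t0) * 1000.0
    usage = payload.get("usage", {})
    return {
        "prompt_tok": usage.get("prompt_tokens", 0),
        "gen_tok": usage.get("completion_tokens", 0),
        "total_ms": total_ms,
    }

if __name__ == "__main__":
    # Sanity check against Run 1 of the results table:
    # 968 total tokens in 9,597 ms -> ~100.9 tok/s end to end.
    print(round(tok_per_s(968, 9_597), 2))
    # Live measurement (requires the server to be running):
    # for i in range(8):
    #     r = run_once("Your prompt...")
    #     print(i + 1, r, round(tok_per_s(r["prompt_tok"] + r["gen_tok"], r["total_ms"]), 2))
```

End-to-end latency includes HTTP overhead, so per-phase tok/s should come from the server's own timing fields when available.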

Results

Memory Layout (Measured)

| Memory Region | Size |
| --- | --- |
| CPU buffer | 157,356 MiB (IQ5_K weights) |
| CUDA0 buffer | 3,578.73 MiB (computation temporaries) |
| KV cache (CUDA0) | 15,872 MiB (K: 7.75 GiB + V: 7.75 GiB) |
| Compute buffer (CUDA0) | 1,990 MiB |

The KV cache is allocated in f16 and remains stable under the full 65,536-token load. The GPU-side compute buffer stays around 2 GiB, posing no operational impediment.
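The measured KV footprint is consistent with simple arithmetic. The layer count and per-layer KV width below are inferred to fit the measured sizes, not read from the logs, so treat them as assumptions:

```python
# f16 KV-cache sizing consistent with the measured 15,872 MiB.
# Assumed (inferred, not from logs): 62 transformer layers and a per-layer
# KV width of 1,024 elements (e.g. 8 KV heads x 128-dim heads).
N_CTX = 65_536       # context length in tokens
N_LAYERS = 62        # assumed transformer layer count
KV_DIM = 1_024       # K elements per token per layer (V is the same)
BYTES_F16 = 2        # bytes per f16 element

k_bytes = N_CTX * N_LAYERS * KV_DIM * BYTES_F16   # K cache alone
kv_mib = 2 * k_bytes / 2**20                      # K + V, in MiB

print(kv_mib)  # 15872.0 -> matches the measured 15,872 MiB (7.75 GiB each for K and V)
```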

Benchmark Results (8 Runs)

| Run | Prompt tok | PP ms | PP tok/s | Gen tok | Gen ms | Gen tok/s | Total tok | Total ms | Total tok/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 753 | 3,517 | 214.07 | 215 | 6,079 | 35.37 | 968 | 9,597 | 100.87 |
| 2 | 386 | 2,266 | 170.36 | 196 | 5,563 | 35.23 | 582 | 7,829 | 74.34 |
| 3 | 297 | 1,840 | 161.38 | 240 | 6,816 | 35.21 | 537 | 8,656 | 62.04 |
| 4 | 341 | 2,053 | 166.12 | 783 | 22,651 | 34.57 | 1,124 | 24,703 | 45.50 |
| 5 | 1,264 | 6,152 | 205.46 | 734 | 21,259 | 34.53 | 1,998 | 27,411 | 72.89 |
| 6 | 942 | 4,377 | 215.21 | 921 | 26,849 | 34.30 | 1,863 | 31,226 | 59.66 |
| 7 | 938 | 4,338 | 216.23 | 157 | 4,576 | 34.31 | 1,095 | 8,914 | 122.84 |
| 8 | 1,075 | 6,097 | 176.32 | 1,351 | 40,019 | 33.76 | 2,426 | 46,116 | 52.61 |

Statistical Summary

| Metric | Prompt tok/s | Gen tok/s | Total tok/s |
| --- | --- | --- | --- |
| Mean | 190.64 | 34.66 | 73.84 |
| Median | 190.89 | 34.55 | 67.46 |
| Min | 161.38 | 33.76 | 45.50 |
| Max | 216.23 | 35.37 | 122.84 |
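The summary can be reproduced directly from the per-run table; the snippet below recomputes the prompt and generation statistics:

```python
# Recompute the summary statistics from the eight per-run tok/s values above.
from statistics import mean, median

pp  = [214.07, 170.36, 161.38, 166.12, 205.46, 215.21, 216.23, 176.32]
gen = [35.37, 35.23, 35.21, 34.57, 34.53, 34.30, 34.31, 33.76]

print(round(mean(pp), 2), round(median(pp), 2))    # 190.64 190.89
print(round(mean(gen), 2), round(median(gen), 2))  # 34.66 34.55
print(min(pp), max(pp), min(gen), max(gen))        # 161.38 216.23 33.76 35.37
```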

Prompt Evaluation Variability

Prefill throughput varies widely across runs (161-216 tok/s, per the table above) due to:

  • Cache hit patterns: Reused prompts benefit from cached entries
  • KV maintenance overhead: As prompt cache approaches 15.8 GiB, eviction and consistency checks incur costs
  • NUMA/paging effects: 157 GiB CPU-side memory access patterns not uniform across requests

Generation Stability

Decode throughput holds steady at 33.76-35.37 tok/s across all runs, indicating that memory bandwidth, rather than compute, becomes the dominant constraint once expert computation shifts to the CPU.

Discussion

Expert CPU Placement Impact

Despite logs showing "offloaded 63/63 layers to GPU," the -ot exps=CPU override keeps all 256 expert weight tensors in host memory. This results in:

  • CPU-resident expert compute: each decode step reads the weights of the 8 routed experts from system RAM
  • Activation transfer overhead: intermediate activations cross the PCIe bus between GPU-resident attention and CPU-resident expert FFNs on every step
  • Generation ceiling: the ~35 tok/s plateau points to a host-memory/PCIe bandwidth bottleneck
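A back-of-envelope estimate supports the bandwidth interpretation. Assuming roughly 10B active parameters per token (the "A10B" designation) at the nominal 5.5 bits/weight, the decode plateau implies sustained weight reads far beyond PCIe Gen 5 x16 rates (~64 GB/s), suggesting the expert matmuls execute out of host DRAM rather than being streamed to the GPU per token. These figures are assumptions, not measurements:

```python
# Rough weight-traffic estimate behind the ~35 tok/s decode plateau.
# Assumptions: ~10B active parameters per token (A10B), 5.5 bits/weight (IQ5_K).
ACTIVE_PARAMS = 10e9
BITS_PER_WEIGHT = 5.5
DECODE_TOK_S = 35.0   # observed plateau

gb_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8 / 1e9
gb_per_s = gb_per_token * DECODE_TOK_S

print(round(gb_per_token, 2))  # ~6.88 GB of weights touched per decoded token
print(round(gb_per_s))         # ~241 GB/s sustained reads, far above PCIe Gen 5 x16
```

A sustained ~241 GB/s is plausible for many-channel DDR5 but not for per-token PCIe streaming, which is consistent with the observed flat decode rate.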

Prompt Cache Operation

Case B logs confirm functional prompt cache:

  • 6,029 tokens → 1,460 MiB state saved
  • 23,104 tokens → 5,595 MiB state saved
  • 8,192 MiB limit respected

Reusable system prompts benefit from cache hits, mitigating first-request prefill slowdown.
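Both saved-state log lines imply the same per-token cost, which matches the full-context KV footprint from the memory table; a quick consistency check:

```python
# Per-token prompt-cache state size implied by the two log lines above.
cases = [(6_029, 1_460), (23_104, 5_595)]     # (cached tokens, saved MiB)
for toks, mib in cases:
    print(round(mib / toks, 3))               # ~0.242 MiB per cached token

# The full 65,536-token KV cache (15,872 MiB) gives the same per-token figure,
# suggesting the saved state is dominated by the f16 KV entries.
print(round(15_872 / 65_536, 3))              # 0.242
```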

Flash Attention Effectiveness

With Flash Attention active (-fa on), attention over the full 15.8 GiB KV cache remains efficient: the fused kernel avoids materializing the full attention matrix, keeping memory access patterns favorable even at 65,536 tokens of context.

Conclusion

End-to-end stability, from startup through HTTP request acceptance, KV allocation, and Flash Attention activation, demonstrates that 65,536-token serving is technically achievable on this hardware.

Performance tuning, however, is limited while -ot exps=CPU and --no-mmap remain unchanged: minor parameter tweaks (threads, batch size) yield only marginal gains. Address the expert placement and mmap strategy first for meaningful improvement.

The tokenizer warning (special_eos_id not in special_eog_ids) warrants investigation, as it may degrade stop condition and tag interpretation reliability.

Reproduction

1. Model Retrieval

  huggingface-cli download TheBloke/MiniMax-2.5-A10B-IQ5_K-GGUF
  

2. Start llama.cpp Server

Follow the "Implementation" section above. Replace $MODEL with the GGUF file path and $IMG with the llama.cpp container image.

3. Benchmark Measurement

  curl -X POST http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax",
    "messages": [{"role": "user", "content": "Your prompt..."}],
    "max_tokens": 1024
  }' | jq .usage
  

Compute tok/s from the token counts reported in usage (prompt_tokens, completion_tokens) together with measured wall-clock time; where the server exposes per-phase timing fields, use those to separate prefill from decode. Average multiple requests for stability.

Technical Notes

Expert Offload Strategies

-ot exps=CPU minimizes the GPU memory footprint but sacrifices throughput. Alternatives:

  1. Full GPU placement (drop the override): fastest, but requires VRAM for all expert weights; not feasible here, as the 157.77 GiB model exceeds the 96GB of VRAM
  2. Mixed offload: pin a subset of expert layers to the GPU with per-layer -ot patterns while keeping the rest on CPU (more complex, but balances VRAM against throughput)

No-mmap Implications

--no-mmap loads all ~158 GiB of weights up front, increasing startup latency and the likelihood of paging pressure on NUMA systems. Enabling mmap shortens initialization but may increase page faults at runtime, depending on access patterns.

Tokenizer Configuration

The special_eos_id is not in special_eog_ids warning indicates a mismatch between the model's tokenizer metadata and llama.cpp's interpretation of end-of-generation tokens. Verify the special-token configuration with the model provider, and confirm that stop sequences behave correctly before relying on the server in production.

Context Length and Batch Efficiency

Larger context windows expand KV cache memory footprint, reducing effective batch efficiency. The -b 2048 -ub 2048 config proved stable here but requires tuning for different hardware/memory configurations.

Related Topics

  • MiniMax-2.5 Expert Offload and Web Generation — Quantization comparison across IQ4_NL/IQ3_S, one-shot generation of React LP and dental clinic site. Includes a video demonstrating actual generation output
  • NVIDIA Blackwell compute properties (Capability 12.0)
  • llama.cpp MoE support maturity
  • Flash Attention KV cache optimization
  • PCIe Gen 5 bandwidth and expert computation scaling