Running MiMo V2.5 Pro IQ2_S Locally: RTX PRO 6000 Blackwell x1/x2 Benchmark
Running the Xiaomi MiMo V2.5 Pro 1.02T MoE IQ2_S GGUF on llama.cpp CUDA13 with RTX PRO 6000 Blackwell Max-Q x 2, measuring single/dual GPU prefill and decode, and recording expert tensor placement, unused MTP state, ik_llama.cpp fused-QKV compatibility limits, and its fit as an orchestrator candidate.
I ran the Xiaomi MiMo V2.5 Pro (1.02T total / 42B active) IQ2_S GGUF on my dual RTX PRO 6000 Blackwell Max-Q machine. The point was not just to measure speed, but to see whether it has enough throughput and operational shape to serve as a resident orchestrator candidate for a multi-agent system.
Video link: https://www.youtube.com/watch?v=tviNguY-HRE
The short version: on dual GPU, decode stabilized around 16.5 tok/s. That is about a 33% improvement over the 12.4-12.5 tok/s single GPU result, but the cost/performance ratio is not great if it requires two GPUs. Since the model is MIT licensed, it still remains a candidate for batch jobs such as dataset generation.
The main points are:
- decode was 12.4-12.5 tok/s on single GPU and 16.5-16.6 tok/s on dual GPU
- a head / tail placement did not scale well, so this run used a more balanced expert tensor placement
- llama.cpp has merged MTP support, but MiMo’s
mimo2MTP tensors were not used in this run - ik_llama.cpp already has MiMo 2.5 support, but it could not load this quant because of the fused QKV layout
Environment
| Item | Value |
|---|---|
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB x 2 |
| CPU | AMD EPYC 9175F 16-Core, SMT disabled |
| RAM | About 755GB DDR5 ECC, 6000 MT/s |
| Runtime | ggml-org/llama.cpp:full-cuda13 |
| Blackwell | native FP4 enabled |
The model is IQ2_S from AesSedai/MiMo-V2.5-Pro-GGUF. The Hugging Face quant table lists it as 297.45 GiB and 2.50 BPW. The base model, XiaomiMiMo/MiMo-V2.5-Pro, is MIT licensed, and the model card describes it as 1.02T total / 42B active with hybrid attention, 3-layer MTP, and up to 1M context.
| Item | Value |
|---|---|
| Model | MiMo V2.5 Pro |
| Architecture | mimo2 |
| Parameters | 1.02T total / 42B active |
| Quantization | IQ2_S |
| GGUF size | 297.45 GiB |
| Layer layout | blk.0-blk.69 active layers, blk.70-blk.72 MTP heads |
Server Commands
The single GPU baseline uses only GPU 0 and places part of the expert tensors on CUDA0.
podman run --rm \
--device nvidia.com/gpu=0 \
-p 8000:8000 \
--shm-size 8g \
--cap-add=SYS_NICE \
-v /mnt/data/models/models--AesSedai--MiMo-V2.5-Pro-GGUF:/models:ro,Z \
ggml-org/llama.cpp:full-cuda13 \
--server \
-m /models/snapshots/a8205a14c95fd13018a7776bf0778d1edb6ba2be/IQ2_S/MiMo-V2.5-Pro-IQ2_S-00001-of-00008.gguf \
--ctx-size 49152 \
-fa on -fit on \
-ngl 99 \
-ctk q8_0 -ctv q8_0 \
-ot "blk\.([0-3]|1[6-7]|2[6-7]|3[6-7]|4[6-7]|5[5-7]|6[6-9])\.ffn_.*_exps.*=CUDA0,exps=CPU" \
--no-mmap \
--parallel 1 \
--threads 15 \
--threads-batch 27 \
--batch-size 12288 \
--ubatch-size 6144 \
--jinja \
--kv-unified \
--host 0.0.0.0 --port 8000 \
--alias grandpa
The dual GPU run distributes expert tensors across CUDA0 and CUDA1. Non-expert tensors stay on GPU via -ngl 99, preserving Flash Attention.
podman run --rm \
--device nvidia.com/gpu=all \
-p 8000:8000 \
--shm-size 8g \
--cap-add=SYS_NICE \
-v /mnt/data/models/models--AesSedai--MiMo-V2.5-Pro-GGUF:/models:ro,Z \
ggml-org/llama.cpp:full-cuda13 \
--server \
-m /models/snapshots/a8205a14c95fd13018a7776bf0778d1edb6ba2be/IQ2_S/MiMo-V2.5-Pro-IQ2_S-00001-of-00008.gguf \
--ctx-size 49152 \
-fa on -fit on \
-ngl 99 \
-ctk q8_0 -ctv q8_0 \
-ot "blk\.([0-3]|1[6-7]|2[6-7]|3[6-7]|4[6-7]|5[5-7]|6[6-9])\.ffn_.*_exps.*=CUDA0,blk\.([4-6]|1[8-9]|2[8-9]|3[8-9]|4[8-9]|5[8-9]|6[2-5])\.ffn_.*_exps.*=CUDA1,exps=CPU" \
--no-mmap \
--parallel 1 \
--threads 15 \
--threads-batch 27 \
--batch-size 12288 \
--ubatch-size 6144 \
--jinja \
--kv-unified \
--host 0.0.0.0 --port 8000 \
--alias grandpa
Expert Tensor Placement
| Target | Layers |
|---|---|
| CUDA0 | blk.0-blk.3, blk.16-blk.17, blk.26-blk.27, blk.36-blk.37, blk.46-blk.47, blk.55-blk.57, blk.66-blk.69 |
| CUDA1 | blk.4-blk.6, blk.18-blk.19, blk.28-blk.29, blk.38-blk.39, blk.48-blk.49, blk.58-blk.59, blk.62-blk.65 |
| CPU | remaining expert tensors |
On single GPU, about 18 layers worth of experts sit on CUDA0 and about 52 layers remain on CPU. On dual GPU, about 18 layers sit on CUDA0, 17 on CUDA1, and about 35 remain on CPU.
The 297.45 GiB IQ2_S quant does not fit directly into 192GB of VRAM. So attention / norm / embedding stay on GPU, while only MoE expert tensors are split with the -ot regex. The goal of this run was first to confirm behavior and speed, so I did not do a deeper layer-placement search.
Benchmark Results
The raw timings were:
Single GPU (1x RTX PRO 6000):
prompt eval time = 27716.99 ms / 15594 tokens ( 1.78 ms per token, 562.62 tokens per second)
eval time = 13512.61 ms / 169 tokens ( 79.96 ms per token, 12.51 tokens per second)
prompt eval time = 4185.57 ms / 136 tokens ( 30.78 ms per token, 32.49 tokens per second)
eval time = 164812.58 ms / 2048 tokens ( 80.47 ms per token, 12.43 tokens per second)
Dual GPU (2x RTX PRO 6000):
prompt eval time = 24150.56 ms / 15594 tokens ( 1.55 ms per token, 645.70 tokens per second)
eval time = 7892.13 ms / 131 tokens ( 60.25 ms per token, 16.60 tokens per second)
prompt eval time = 3109.48 ms / 160 tokens ( 19.43 ms per token, 51.46 tokens per second)
eval time = 123901.10 ms / 2048 tokens ( 60.50 ms per token, 16.53 tokens per second)
| Metric | Single GPU | Dual GPU | Improvement |
|---|---|---|---|
| Long prefill | 562.62 tok/s | 645.70 tok/s | +14.8% |
| Short prefill | 32.49 tok/s | 51.46 tok/s | +58.4% |
| Decode short | 12.51 tok/s | 16.60 tok/s | +32.7% |
| Decode long | 12.43 tok/s | 16.53 tok/s | +33.0% |
Decode improved from 12.4-12.5 tok/s on single GPU to 16.5-16.6 tok/s on dual GPU, roughly +33%. The likely reason is straightforward: more active expert work happens on GPU, reducing the CPU compute and transfer share.
Still, even the dual GPU run leaves about 35 expert layers on CPU. For every token, the MoE router can select active experts that need CPU-side computation and synchronization with the GPU side, so decode is still CPU bound. That is why the run tops out around 16.5 tok/s.
Prefill improved by +14.8% for the long prompt and +58.4% for the short prompt. The short prefill improvement is larger because small prompts benefit more directly from the GPU parallelism once the CPU/GPU plumbing is not dominating the run.
MTP Is Still Unused
MiMo V2.5 Pro includes 3-layer MTP heads. AesSedai’s GGUF includes tensors for blk.70-blk.72, but llama.cpp did not use them in this run.
W model has unused tensor blk.70.attn_output.weight (size = 106954752 bytes) -- ignoring
W model has unused tensor blk.70.attn_norm.weight (size = 24576 bytes) -- ignoring
W model has unused tensor blk.70.attn_sinks.weight (size = 512 bytes) -- ignoring
W model has unused tensor blk.70.ffn_norm.weight (size = 24576 bytes) -- ignoring
W model has unused tensor blk.70.ffn_gate.weight (size = 106954752 bytes) -- ignoring
W model has unused tensor blk.70.ffn_down.weight (size = 106954752 bytes) -- ignoring
W model has unused tensor blk.70.ffn_up.weight (size = 106954752 bytes) -- ignoring
W model has unused tensor blk.70.nextn.eh_proj.weight (size = 80216064 bytes) -- ignoring
W model has unused tensor blk.70.nextn.enorm.weight (size = 24576 bytes) -- ignoring
W model has unused tensor blk.70.nextn.hnorm.weight (size = 24576 bytes) -- ignoring
W model has unused tensor blk.70.layer_output_norm.weight (size = 24576 bytes) -- ignoring
... (same for blk.71, blk.72)
llama.cpp PR #22673 merged MTP support into master on 2026-05-16. The PR examples use --spec-type draft-mtp and --spec-draft-n-max. But for this MiMo GGUF, the mimo2 MTP tensors are still not recognized by the runtime and are discarded at load time.
Xiaomi’s model card describes a 3x output speed improvement from MTP. I would like to hope for that directly from the current 16.5 tok/s, but even a +50% gain on dual GPU would already reach the 24 tok/s range. My current orchestrator models sit around 23-40 tok/s, so MiMo is worth retesting once mimo2 MTP actually works in llama.cpp mainline. The React SPA Spec run also felt promising; the video is useful context for that behavior.
ik_llama.cpp Stops on Fused QKV
ik_llama.cpp tends to reach higher TG, so I wanted to measure MiMo there as well. MiMo 2.5 support had already been merged on the ik_llama.cpp side, but this attempt did not work.
The model fails to load because of the fused QKV layout.
Image: registry.home.arpa/ik_llama:latest (main branch with #1723 "Support Mimo-2.5")
Memory required for model tensors + cache: 314076 MiB
Memory available on all devices - compute: 92720 MiB
llm_load_tensors: ggml ctx size = 0.66 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' not found
llama_model_load_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model
As covered in ik_llama.cpp issue #1769, the cause is the fused QKV layout in the GGUF. Xiaomi’s released MiMo V2.5 Pro safetensors use fused QKV, and ggml-org-style conversion keeps that as attn_qkv in GGUF. ik_llama.cpp expects separate attn_q / attn_k / attn_v, so it cannot find blk.0.attn_q.weight. Other quantizations may appear on HF later, but downloading a model this large takes time. Before pulling one, it is worth checking the model card to see whether the QKV layout matches what ik_llama.cpp expects.
If a separate Q/K/V GGUF appears, or if ik_llama.cpp starts un-merging fused QKV at load time, it can be retested. For now, MiMo V2.5 Pro IQ2_S is practically a llama.cpp mainline hybrid inference target.
Code Generation Behavior
Separately from speed, I also tried a React SPA code generation task. The benchmark asks the model to generate a hair salon booking SPA on top of a React SPA Spec scaffold. I stopped before it completed, so I do not have a score.
Even so, the initial tool-use behavior was solid.
- it inspected
package.json,vite.config.ts,tsconfig.json, andsrc/before starting - it ran
npm installto settle dependencies before implementation - it created directories before writing files
- it designed reservation status with TypeScript discriminated unions
- it used
str_replacefor incremental edits instead of clobbering existing files - it noticed that
package.jsonhad changed afternpm installand reread it
Local models often fail earlier than this: they cannot use the terminal, try to edit files that do not exist, or start writing code before installing dependencies. MiMo at least cleared that layer. As an orchestrator candidate, speed alone is not enough; the model also needs enough parameter scale. I want to run it for longer sessions.
Licensing
MiMo V2.5 Pro is attractive because it is a MIT-licensed 1T-class MoE model that clearly targets agentic workloads. Some models, such as Kimi-K2.6, are attractive on capability, but may require additional revenue-based license checks before commercial deployment. For 200B+ parameter models that are license-clear and practical right now, I think the realistic set is DeepSeek V4, GLM-5.1, and MiMo V2.5 Pro.
Conclusion
MiMo V2.5 Pro IQ2_S ran on RTX PRO 6000 Blackwell Max-Q with llama.cpp CUDA13 at roughly 12 tok/s on single GPU and 16.5 tok/s on dual GPU. If MTP support matures, output speed on existing models may improve substantially, and local LLM usage may become increasingly normal.
Claude / Codex are genuinely excellent, but I also suspect pricing may move further toward token-based accounting. If that happens, having an alternative will matter. The hard part is that agent development itself is heavy: without Claude / Codex, debugging, pattern generation, and sheer implementation volume become difficult. Token limits also feel closer than before with only modest use, so I want to get this foundation finished within the year.
