On this page

Running MiMo V2.5 Pro IQ2_S Locally: RTX PRO 6000 Blackwell x1/x2 Benchmark

Running the Xiaomi MiMo V2.5 Pro 1.02T MoE IQ2_S GGUF on llama.cpp CUDA13 with RTX PRO 6000 Blackwell Max-Q x 2, measuring single/dual GPU prefill and decode, and recording expert tensor placement, unused MTP state, ik_llama.cpp fused-QKV compatibility limits, and its fit as an orchestrator candidate.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

I ran the Xiaomi MiMo V2.5 Pro (1.02T total / 42B active) IQ2_S GGUF on my dual RTX PRO 6000 Blackwell Max-Q machine. The point was not just to measure speed, but to see whether it has enough throughput and operational shape to serve as a resident orchestrator candidate for a multi-agent system.

Video link: https://www.youtube.com/watch?v=tviNguY-HRE

The short version: on dual GPU, decode stabilized around 16.5 tok/s. That is about a 33% improvement over the 12.4-12.5 tok/s single GPU result, but the cost/performance ratio is not great if it requires two GPUs. Since the model is MIT licensed, it still remains a candidate for batch jobs such as dataset generation.

The main points are:

decode was 12.4-12.5 tok/s on single GPU and 16.5-16.6 tok/s on dual GPU
a head / tail placement did not scale well, so this run used a more balanced expert tensor placement
llama.cpp has merged MTP support, but MiMo’s mimo2 MTP tensors were not used in this run
ik_llama.cpp already has MiMo 2.5 support, but it could not load this quant because of the fused QKV layout

Environment

Item	Value
GPU	NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB x 2
CPU	AMD EPYC 9175F 16-Core, SMT disabled
RAM	About 755GB DDR5 ECC, 6000 MT/s
Runtime	`ggml-org/llama.cpp:full-cuda13`
Blackwell	native FP4 enabled

The model is IQ2_S from AesSedai/MiMo-V2.5-Pro-GGUF. The Hugging Face quant table lists it as 297.45 GiB and 2.50 BPW. The base model, XiaomiMiMo/MiMo-V2.5-Pro, is MIT licensed, and the model card describes it as 1.02T total / 42B active with hybrid attention, 3-layer MTP, and up to 1M context.

Item	Value
Model	MiMo V2.5 Pro
Architecture	`mimo2`
Parameters	1.02T total / 42B active
Quantization	IQ2_S
GGUF size	297.45 GiB
Layer layout	`blk.0`-`blk.69` active layers, `blk.70`-`blk.72` MTP heads

Server Commands

The single GPU baseline uses only GPU 0 and places part of the expert tensors on CUDA0.

  podman run --rm \
  --device nvidia.com/gpu=0 \
  -p 8000:8000 \
  --shm-size 8g \
  --cap-add=SYS_NICE \
  -v /mnt/data/models/models--AesSedai--MiMo-V2.5-Pro-GGUF:/models:ro,Z \
  ggml-org/llama.cpp:full-cuda13 \
  --server \
  -m /models/snapshots/a8205a14c95fd13018a7776bf0778d1edb6ba2be/IQ2_S/MiMo-V2.5-Pro-IQ2_S-00001-of-00008.gguf \
  --ctx-size 49152 \
  -fa on -fit on \
  -ngl 99 \
  -ctk q8_0 -ctv q8_0 \
  -ot "blk\.([0-3]|1[6-7]|2[6-7]|3[6-7]|4[6-7]|5[5-7]|6[6-9])\.ffn_.*_exps.*=CUDA0,exps=CPU" \
  --no-mmap \
  --parallel 1 \
  --threads 15 \
  --threads-batch 27 \
  --batch-size 12288 \
  --ubatch-size 6144 \
  --jinja \
  --kv-unified \
  --host 0.0.0.0 --port 8000 \
  --alias grandpa

The dual GPU run distributes expert tensors across CUDA0 and CUDA1. Non-expert tensors stay on GPU via -ngl 99, preserving Flash Attention.

  podman run --rm \
  --device nvidia.com/gpu=all \
  -p 8000:8000 \
  --shm-size 8g \
  --cap-add=SYS_NICE \
  -v /mnt/data/models/models--AesSedai--MiMo-V2.5-Pro-GGUF:/models:ro,Z \
  ggml-org/llama.cpp:full-cuda13 \
  --server \
  -m /models/snapshots/a8205a14c95fd13018a7776bf0778d1edb6ba2be/IQ2_S/MiMo-V2.5-Pro-IQ2_S-00001-of-00008.gguf \
  --ctx-size 49152 \
  -fa on -fit on \
  -ngl 99 \
  -ctk q8_0 -ctv q8_0 \
  -ot "blk\.([0-3]|1[6-7]|2[6-7]|3[6-7]|4[6-7]|5[5-7]|6[6-9])\.ffn_.*_exps.*=CUDA0,blk\.([4-6]|1[8-9]|2[8-9]|3[8-9]|4[8-9]|5[8-9]|6[2-5])\.ffn_.*_exps.*=CUDA1,exps=CPU" \
  --no-mmap \
  --parallel 1 \
  --threads 15 \
  --threads-batch 27 \
  --batch-size 12288 \
  --ubatch-size 6144 \
  --jinja \
  --kv-unified \
  --host 0.0.0.0 --port 8000 \
  --alias grandpa

Expert Tensor Placement

Target	Layers
CUDA0	`blk.0`-`blk.3`, `blk.16`-`blk.17`, `blk.26`-`blk.27`, `blk.36`-`blk.37`, `blk.46`-`blk.47`, `blk.55`-`blk.57`, `blk.66`-`blk.69`
CUDA1	`blk.4`-`blk.6`, `blk.18`-`blk.19`, `blk.28`-`blk.29`, `blk.38`-`blk.39`, `blk.48`-`blk.49`, `blk.58`-`blk.59`, `blk.62`-`blk.65`
CPU	remaining expert tensors

On single GPU, about 18 layers worth of experts sit on CUDA0 and about 52 layers remain on CPU. On dual GPU, about 18 layers sit on CUDA0, 17 on CUDA1, and about 35 remain on CPU.

The 297.45 GiB IQ2_S quant does not fit directly into 192GB of VRAM. So attention / norm / embedding stay on GPU, while only MoE expert tensors are split with the -ot regex. The goal of this run was first to confirm behavior and speed, so I did not do a deeper layer-placement search.

Benchmark Results

The raw timings were:

  Single GPU (1x RTX PRO 6000):
  prompt eval time =   27716.99 ms / 15594 tokens (    1.78 ms per token,   562.62 tokens per second)
  eval time         =   13512.61 ms /   169 tokens (   79.96 ms per token,    12.51 tokens per second)
  prompt eval time =    4185.57 ms /   136 tokens (   30.78 ms per token,    32.49 tokens per second)
  eval time         =  164812.58 ms /  2048 tokens (   80.47 ms per token,    12.43 tokens per second)

Dual GPU (2x RTX PRO 6000):
  prompt eval time =   24150.56 ms / 15594 tokens (    1.55 ms per token,   645.70 tokens per second)
  eval time         =    7892.13 ms /   131 tokens (   60.25 ms per token,    16.60 tokens per second)
  prompt eval time =    3109.48 ms /   160 tokens (   19.43 ms per token,    51.46 tokens per second)
  eval time         =  123901.10 ms /  2048 tokens (   60.50 ms per token,    16.53 tokens per second)

Metric	Single GPU	Dual GPU	Improvement
Long prefill	562.62 tok/s	645.70 tok/s	+14.8%
Short prefill	32.49 tok/s	51.46 tok/s	+58.4%
Decode short	12.51 tok/s	16.60 tok/s	+32.7%
Decode long	12.43 tok/s	16.53 tok/s	+33.0%

Decode improved from 12.4-12.5 tok/s on single GPU to 16.5-16.6 tok/s on dual GPU, roughly +33%. The likely reason is straightforward: more active expert work happens on GPU, reducing the CPU compute and transfer share.

Still, even the dual GPU run leaves about 35 expert layers on CPU. For every token, the MoE router can select active experts that need CPU-side computation and synchronization with the GPU side, so decode is still CPU bound. That is why the run tops out around 16.5 tok/s.

Prefill improved by +14.8% for the long prompt and +58.4% for the short prompt. The short prefill improvement is larger because small prompts benefit more directly from the GPU parallelism once the CPU/GPU plumbing is not dominating the run.

MTP Is Still Unused

MiMo V2.5 Pro includes 3-layer MTP heads. AesSedai’s GGUF includes tensors for blk.70-blk.72, but llama.cpp did not use them in this run.

  W model has unused tensor blk.70.attn_output.weight (size = 106954752 bytes) -- ignoring
W model has unused tensor blk.70.attn_norm.weight (size = 24576 bytes) -- ignoring
W model has unused tensor blk.70.attn_sinks.weight (size = 512 bytes) -- ignoring
W model has unused tensor blk.70.ffn_norm.weight (size = 24576 bytes) -- ignoring
W model has unused tensor blk.70.ffn_gate.weight (size = 106954752 bytes) -- ignoring
W model has unused tensor blk.70.ffn_down.weight (size = 106954752 bytes) -- ignoring
W model has unused tensor blk.70.ffn_up.weight (size = 106954752 bytes) -- ignoring
W model has unused tensor blk.70.nextn.eh_proj.weight (size = 80216064 bytes) -- ignoring
W model has unused tensor blk.70.nextn.enorm.weight (size = 24576 bytes) -- ignoring
W model has unused tensor blk.70.nextn.hnorm.weight (size = 24576 bytes) -- ignoring
W model has unused tensor blk.70.layer_output_norm.weight (size = 24576 bytes) -- ignoring
... (same for blk.71, blk.72)

llama.cpp PR #22673 merged MTP support into master on 2026-05-16. The PR examples use --spec-type draft-mtp and --spec-draft-n-max. But for this MiMo GGUF, the mimo2 MTP tensors are still not recognized by the runtime and are discarded at load time.

Xiaomi’s model card describes a 3x output speed improvement from MTP. I would like to hope for that directly from the current 16.5 tok/s, but even a +50% gain on dual GPU would already reach the 24 tok/s range. My current orchestrator models sit around 23-40 tok/s, so MiMo is worth retesting once mimo2 MTP actually works in llama.cpp mainline. The React SPA Spec run also felt promising; the video is useful context for that behavior.

ik_llama.cpp Stops on Fused QKV

ik_llama.cpp tends to reach higher TG, so I wanted to measure MiMo there as well. MiMo 2.5 support had already been merged on the ik_llama.cpp side, but this attempt did not work.

The model fails to load because of the fused QKV layout.

  Image: registry.home.arpa/ik_llama:latest (main branch with #1723 "Support Mimo-2.5")

Memory required for model tensors + cache: 314076 MiB
Memory available on all devices - compute: 92720 MiB
llm_load_tensors: ggml ctx size =    0.66 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' not found
llama_model_load_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model

As covered in ik_llama.cpp issue #1769, the cause is the fused QKV layout in the GGUF. Xiaomi’s released MiMo V2.5 Pro safetensors use fused QKV, and ggml-org-style conversion keeps that as attn_qkv in GGUF. ik_llama.cpp expects separate attn_q / attn_k / attn_v, so it cannot find blk.0.attn_q.weight. Other quantizations may appear on HF later, but downloading a model this large takes time. Before pulling one, it is worth checking the model card to see whether the QKV layout matches what ik_llama.cpp expects.

If a separate Q/K/V GGUF appears, or if ik_llama.cpp starts un-merging fused QKV at load time, it can be retested. For now, MiMo V2.5 Pro IQ2_S is practically a llama.cpp mainline hybrid inference target.

Code Generation Behavior

Separately from speed, I also tried a React SPA code generation task. The benchmark asks the model to generate a hair salon booking SPA on top of a React SPA Spec scaffold. I stopped before it completed, so I do not have a score.

Even so, the initial tool-use behavior was solid.

it inspected package.json, vite.config.ts, tsconfig.json, and src/ before starting
it ran npm install to settle dependencies before implementation
it created directories before writing files
it designed reservation status with TypeScript discriminated unions
it used str_replace for incremental edits instead of clobbering existing files
it noticed that package.json had changed after npm install and reread it

Local models often fail earlier than this: they cannot use the terminal, try to edit files that do not exist, or start writing code before installing dependencies. MiMo at least cleared that layer. As an orchestrator candidate, speed alone is not enough; the model also needs enough parameter scale. I want to run it for longer sessions.

Licensing

MiMo V2.5 Pro is attractive because it is a MIT-licensed 1T-class MoE model that clearly targets agentic workloads. Some models, such as Kimi-K2.6, are attractive on capability, but may require additional revenue-based license checks before commercial deployment. For 200B+ parameter models that are license-clear and practical right now, I think the realistic set is DeepSeek V4, GLM-5.1, and MiMo V2.5 Pro.

Conclusion

MiMo V2.5 Pro IQ2_S ran on RTX PRO 6000 Blackwell Max-Q with llama.cpp CUDA13 at roughly 12 tok/s on single GPU and 16.5 tok/s on dual GPU. If MTP support matures, output speed on existing models may improve substantially, and local LLM usage may become increasingly normal.

Claude / Codex are genuinely excellent, but I also suspect pricing may move further toward token-based accounting. If that happens, having an alternative will matter. The hard part is that agent development itself is heavy: without Claude / Codex, debugging, pattern generation, and sheer implementation volume become difficult. Token limits also feel closer than before with only modest use, so I want to get this foundation finished within the year.

References

Gemma 4 31B on vLLM/SGLang: NVFP4/FP8 and MTP Benchmark

A record of running Gemma 4 …

familiar - Building a Multi-Agent Development Platform That Runs Only on Local LLMs

A record of the origin and …