Running MiniMax-M2.7 (229B MoE) on 2x Blackwell 96GB: 71.9 t/s on Average, but No Commercial Use
A record of running MiniMax-M2.7 (229B MoE, smol-IQ3_KS) locally on dual RTX PRO 6000 Blackwell 96GB GPUs. I reconstructed 10 sample points from the llama.cpp slot print_timing entries visible in the video: 71.9 t/s average, 105.9 t/s peak, and a steady throughput decline out to 56k context. Performance was strong, but the modified-MIT non-commercial restriction kept it in benchmark-only territory.
Summary
MiniMax-M2.7 got off to a very strong start on 2x RTX PRO 6000 Blackwell 96GB: 105.9 t/s peak, 71.9 t/s average. A 229B MoE running this fast on a local box is already highly usable for agent workflows.
What stopped me in the end was not performance but licensing. MiniMax-M2.7 is under a modified-MIT license, and commercial use requires prior written permission from MiniMax. Output quality, tool behavior, and model size all looked excellent, but it is hard to drop it directly into business or commercial workflows under those terms, so I only took a light benchmark and deleted the model.
If usage is strictly personal, the equation changes. As a rough local buffer before handing work to Claude or Codex, or for quick tests of prompts and tool calls, it looks very attractive. This run was on dual Blackwell 96GB GPUs, and I have not validated DGX Spark or Mac Studio (128GB), but based on the size and behavior, it still looks like a plausible candidate for a personal-only local setup.
Context
This benchmark was reconstructed from the 14-minute uncut YouTube video I uploaded. I forgot to preserve the inference logs, so the 10 sample points were read frame by frame from the llama.cpp slot print_timing entries visible on the right side of the screen (details under Measurement Method).
The task was to hand Zed Editor’s AI Agent a WordPress-like Django blog specification and see how far it could autonomously build it.
Video link: https://www.youtube.com/watch?v=4QIc4A5Yly4
Test Environment
| Item | Specification |
|---|---|
| GPU | 2x NVIDIA RTX PRO 6000 Blackwell 96GB |
| CPU | AMD EPYC 9175F |
| RAM | 768GB |
| Engine | ik_llama.cpp (ubergarm fork) |
| Quantization | smol-IQ3_KS |
| Context | 262,144 tokens |
| KV buffer | CUDA0 32,768 MiB + CUDA1 30,720 MiB |
| Prompt cache | 8,192 MiB |
| Graph splits | 3 |
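As a rough sanity check on the table above, the two KV buffer figures can be turned into a per-token cost. This is only a sketch: it assumes the reported buffers are sized for the full 262,144-token context and that KV usage scales linearly with context length, neither of which the logs in the video confirm. The variable names are mine.

```python
# Figures copied from the Test Environment table.
kv_buffer_mib = {"CUDA0": 32_768, "CUDA1": 30_720}
context_tokens = 262_144

total_mib = sum(kv_buffer_mib.values())

# ASSUMPTION: the buffers cover the whole configured context and KV usage
# grows linearly with the number of tokens held in the cache.
kib_per_token = total_mib * 1024 / context_tokens

# Estimated KV footprint at the deepest sample point in this run (56.4k tokens).
gib_at_56k = 56_400 * kib_per_token / (1024 * 1024)

print(f"total KV buffer: {total_mib} MiB")
print(f"per token: {kib_per_token:.0f} KiB")
print(f"approx. KV in use at 56.4k tokens: {gib_at_56k:.1f} GiB")
```

Under those assumptions the session only touched a small fraction of the allocated KV buffer, so the slowdown seen later in this article is not a sign of the cache running out.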
Benchmark Results
Token Generation
| Phase | KV cache | Speed |
|---|---|---|
| Initial | ~2.7k tokens | 105.9 t/s |
| Early average | 18k-24k tokens | 77.5 t/s |
| Mid average | 27k-47k tokens | 64.7 t/s |
| Late average | 54k-56k tokens | 54.3 t/s |
- Mean: 71.9 t/s
- Max: 105.9 t/s
- Min: 53.9 t/s
- Degradation: -49% from 2.7k to 56.4k context
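The summary figures above can be recomputed from the ten points listed under "10 Sample Points". The following is a minimal sketch of that arithmetic; the phase boundaries (10k, 25k, 50k, 60k) are my own choice of cut-offs that reproduce the groupings in the table, not values taken from the logs.

```python
# (KV cache in thousands of tokens, token generation speed in t/s),
# copied from the "10 Sample Points" table in this article.
SAMPLES = [
    (2.7, 105.9), (18.3, 79.2), (19.1, 78.8), (21.2, 77.1), (23.7, 75.0),
    (27.2, 71.4), (38.4, 63.8), (46.6, 58.9), (53.9, 54.7), (56.4, 53.9),
]


def mean_speed(lo_k, hi_k):
    """Average t/s over samples whose KV cache size lies in [lo_k, hi_k)."""
    speeds = [tg for kv, tg in SAMPLES if lo_k <= kv < hi_k]
    return sum(speeds) / len(speeds)


speeds = [tg for _, tg in SAMPLES]
overall_mean = sum(speeds) / len(speeds)

# ASSUMPTION: cut-offs chosen to match the phases in the table; the lone
# 2.7k sample is reported separately as "Initial" and is not in "early".
early = mean_speed(10, 25)
mid = mean_speed(25, 50)
late = mean_speed(50, 60)

first_tg, last_tg = SAMPLES[0][1], SAMPLES[-1][1]
degradation_pct = (last_tg - first_tg) / first_tg * 100

print(f"mean {overall_mean:.1f}, max {max(speeds):.1f}, min {min(speeds):.1f}")
print(f"early {early:.1f}, mid {mid:.1f}, late {late:.1f}")
print(f"degradation {degradation_pct:.0f}%")
```

Note that the overall mean weights every sample equally, so it reflects where the samples happened to fall in the session rather than a time-weighted throughput.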
Prompt Processing
| Condition | Speed |
|---|---|
| Cold start (2,493 tokens) | 4,067 t/s |
| Cache-hit average | 911 t/s |
10 Sample Points
| KV cache | TG |
|---|---|
| 2.7k | 105.9 t/s |
| 18.3k | 79.2 t/s |
| 19.1k | 78.8 t/s |
| 21.2k | 77.1 t/s |
| 23.7k | 75.0 t/s |
| 27.2k | 71.4 t/s |
| 38.4k | 63.8 t/s |
| 46.6k | 58.9 t/s |
| 53.9k | 54.7 t/s |
| 56.4k | 53.9 t/s |
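It starts very fast, but speed falls steadily as the session grows. The steepest drop comes between the first sample and 18k context, and from there to 56k the decline is close to linear. Even at 56k context it is still usable, but for sessions that go past 100k it seems safer to plan on splitting the task into smaller pieces.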
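To put a number on that decline, here is a small least-squares sketch over the sample points from the table. It is an illustration, not part of the original measurement: it drops the single 2.7k sample (an assumption on my part, since it sits far from the others) and the 100k projection is a plain extrapolation beyond anything that was measured.

```python
# (KV cache in thousands of tokens, generation speed in t/s) from the table.
SAMPLES = [
    (2.7, 105.9), (18.3, 79.2), (19.1, 78.8), (21.2, 77.1), (23.7, 75.0),
    (27.2, 71.4), (38.4, 63.8), (46.6, 58.9), (53.9, 54.7), (56.4, 53.9),
]


def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# ASSUMPTION: fit only the 18k-56k range, where the samples are dense.
steady = [p for p in SAMPLES if p[0] >= 18.0]
slope, intercept, r2 = fit_line(steady)

# Extrapolation well outside the measured range; treat as a rough guide only.
projected_100k = slope * 100.0 + intercept

print(f"slope: {slope:.2f} t/s per 1k tokens, R^2: {r2:.3f}")
print(f"projected speed at 100k context: {projected_100k:.1f} t/s")
```

On these nine points the slope comes out at roughly -0.7 t/s per additional 1k tokens and a straight line fits closely. If that trend held, generation at 100k context would land somewhere in the low 20s t/s, though the last few samples hint that the curve may flatten, so the real figure could be higher.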
What Was Generated
Given a WordPress-like Django blog/CMS specification, Zed Agent produced the following in an uncut session of about 14 minutes:
- Full Django config
- 8 apps: accounts, content, core, media, comments, navigation, seo, taxonomy
- Model, admin, and app config for each app
- Test suite
- Gitea issue creation API integration
- More than 50 files
During the session it noticed an edit_file path prefix bug on its own and corrected it. That gave a stronger signal than simple code emission; it showed the model could adapt to the quirks of the tool.
How It Compared with MiniMax-M2.5
In the MiniMax-M2.5 validation article, a single 96GB GPU with Expert Offload held a stable 34-37 t/s. That run also felt more stable at longer context, and immediately usable for web generation tasks.
This M2.7 run is not a strict apples-to-apples comparison. M2.5 was single GPU + Expert Offload, while M2.7 was closer to full local inference on dual Blackwell + smol-IQ3_KS, so the premises are very different. Even so, my impression was that M2.5 already handled tool conventions and output quality very well, while M2.7 felt like it had another layer of headroom.
Despite working from a more complex specification than in the M2.5 test, token usage was roughly half as large: this run used around 66k tokens, while I remember the M2.5 generation taking roughly 122k.
M2.5 was already good enough that I had high expectations for M2.7, and in practice the size, tool behavior, and output quality all delivered. Because the licensing situation already made it unusable for my purposes, I was not planning to benchmark it much. The community comment threads were heated enough, though, that I held onto a faint hope there might be some opening, so I at least ran this one benchmark.
The Modified-MIT Label Means Something Very Different in M2.5 and M2.7
MiniMax-M2.7 is under a modified-MIT license, but commercial use is not freely allowed.
By contrast, when I read the M2.5 modified-MIT terms in March 2026, they did not appear to block commercial use in the same way. That difference matters. If I want something that could plausibly connect to commercial work, the more realistic path is still to fine-tune M2.5 and shape it toward my own system.
- Interesting for benchmarking, study, and strictly personal local validation
- Hard to place directly in business workflows, client work, or internal commercial development loops
- Impossible to justify as a default local model
Performance is strong, but licensing ends up being the real decision axis. I could not confidently judge how far the restriction might extend to generated outputs or secondary artifacts, and that uncertainty alone was enough to rule it out as a regular option.
Measurement Method
I was never treating this as something that could go straight into production, so the benchmark itself was fairly rough and I also forgot to preserve the inference logs.
Afterward I gave Claude only the relevant parts of the video and had it collect PP/TG values frame by frame, so some reading error is possible. Treat the rendered output in the video as the ground truth.
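Had the logs been kept, the same PP/TG values could have been pulled out mechanically instead of being read off video frames. The sketch below shows one way to do that. The line shape it parses is an assumption based on the timing lines that llama.cpp's server commonly prints (ik_llama.cpp may format them differently), and the sample text uses deliberately round placeholder numbers, not values from this run.

```python
import re

# ASSUMED line shape (not copied from this run's logs):
#   prompt eval time = <ms> ms / <n> tokens ( <ms> ms per token, <tps> tokens per second)
#          eval time = <ms> ms / <n> tokens ( <ms> ms per token, <tps> tokens per second)
TIMING_RE = re.compile(
    r"^\s*(?P<kind>prompt eval|eval) time\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*"
    r"(?P<tokens>\d+) tokens\s*\(.*?(?P<tps>[\d.]+) tokens per second\)",
    re.MULTILINE,
)


def parse_timings(log_text):
    """Return one dict per timing line: PP for prompt eval, TG for generation."""
    rows = []
    for m in TIMING_RE.finditer(log_text):
        rows.append({
            "phase": "PP" if m.group("kind") == "prompt eval" else "TG",
            "tokens": int(m.group("tokens")),
            "ms": float(m.group("ms")),
            "tokens_per_second": float(m.group("tps")),
        })
    return rows


# Hypothetical placeholder log text with round numbers, for illustration only.
SAMPLE_LOG = """\
prompt eval time =     500.00 ms /  2000 tokens (    0.25 ms per token,  4000.00 tokens per second)
       eval time =    1000.00 ms /   100 tokens (   10.00 ms per token,   100.00 tokens per second)
"""

parsed = parse_timings(SAMPLE_LOG)
for row in parsed:
    print(row["phase"], row["tokens"], row["tokens_per_second"])
```

Redirecting the server's output to a file at launch and running something like this afterward would remove the frame-reading step and its reading errors.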
