Running MiniMax-M2.7 (229B MoE) on 2x Blackwell 96GB: 71.9 t/s on Average, but No Commercial Use

Summary

MiniMax-M2.7 got off to a very strong start on 2x RTX PRO 6000 Blackwell 96GB: 105.9 t/s peak and 71.9 t/s average generation speed. The fact that a 229B MoE runs this smoothly on a local box already makes it highly usable for agent workflows.

What stopped me in the end was not performance but licensing. MiniMax-M2.7 is under a modified-MIT license, and commercial use requires prior written permission from MiniMax. Output quality, tool behavior, and model size all looked excellent, but it is hard to drop it directly into business or commercial workflows under those terms, so I ran only a light benchmark and then deleted the model.

If usage is strictly personal, the equation changes. As a rough local first pass before handing work to Claude or Codex, or for quick tests of prompts and tool calls, it looks very attractive. This run was on dual Blackwell 96GB GPUs, and I have not validated DGX Spark or Mac Studio (128GB), but based on the model's size and behavior, it looks like a plausible candidate for a personal-only local setup on those machines as well.

Context

This benchmark was reconstructed from the 14-minute uncut YouTube video I uploaded. I forgot to preserve the inference logs, so I sampled 10 points by stepping through the video frame by frame and reading the llama.cpp slot print_timing entries visible on the right side of the screen.

The task was to hand Zed Editor’s AI Agent a WordPress-like Django blog specification and see how far it could autonomously build it.

Video link: https://www.youtube.com/watch?v=4QIc4A5Yly4

Test Environment

Item	Specification
GPU	2x NVIDIA RTX PRO 6000 Blackwell 96GB
CPU	AMD EPYC 9175F
RAM	768GB
Engine	ik_llama.cpp (ubergarm fork)
Quantization	smol-IQ3_KS
Context	262,144 tokens
KV buffer	CUDA0 32,768 MiB + CUDA1 30,720 MiB
Prompt cache	8,192 MiB
Graph splits	3
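
As a rough sanity check on the memory figures in the table, the KV buffer sizes can be turned into a per-token cost. This is a minimal sketch that assumes the two KV buffers are allocated for the full 262,144-token context; the helper name kv_mib_at is made up for illustration.

```python
# Values copied from the Test Environment table.
KV_BUFFER_MIB = {"CUDA0": 32_768, "CUDA1": 30_720}
CONTEXT_TOKENS = 262_144

total_mib = sum(KV_BUFFER_MIB.values())

# Assumption: the buffers cover the whole configured context,
# so dividing by the context length gives the cost of one cached token.
kib_per_token = total_mib * 1024 / CONTEXT_TOKENS


def kv_mib_at(tokens: int) -> float:
    """Estimated share of the KV buffer (MiB) occupied by `tokens` cached tokens."""
    return tokens * kib_per_token / 1024


if __name__ == "__main__":
    print(f"total KV buffer: {total_mib} MiB ({total_mib / 1024:.1f} GiB)")
    print(f"per cached token: {kib_per_token:.0f} KiB")
    print(f"in use at 56.4k tokens: {kv_mib_at(56_400):.0f} MiB")
```

Under that assumption, the deepest sample point in this run (56.4k tokens) occupies only a bit over a fifth of the reserved KV space.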

Benchmark Results

Token Generation

Phase	KV cache	Speed
Initial	~2.7k tokens	105.9 t/s
Early average	18k-24k tokens (excluding the initial sample)	77.5 t/s
Mid average	27k-47k tokens	64.7 t/s
Late average	54k-56k tokens	54.3 t/s

Mean: 71.9 t/s
Max: 105.9 t/s
Min: 53.9 t/s
Degradation: -49% from 2.7k to 56.4k context
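
The summary figures above can be reproduced from the 10 sample points listed in this article. The sketch below is only a cross-check; the phase boundaries (10k-25k, 25k-50k, 50k-60k) are my assumption about how the samples were grouped, chosen so that the initial 2.7k sample is excluded from the early average.

```python
from statistics import mean

# (KV cache in thousands of tokens, generation speed in t/s),
# copied from the "10 Sample Points" table.
SAMPLES = [
    (2.7, 105.9), (18.3, 79.2), (19.1, 78.8), (21.2, 77.1), (23.7, 75.0),
    (27.2, 71.4), (38.4, 63.8), (46.6, 58.9), (53.9, 54.7), (56.4, 53.9),
]


def phase_mean(lo_k: float, hi_k: float) -> float:
    """Mean speed of the samples whose KV cache size falls in [lo_k, hi_k)."""
    return mean(tg for kv, tg in SAMPLES if lo_k <= kv < hi_k)


speeds = [tg for _, tg in SAMPLES]
summary = {
    "mean": round(mean(speeds), 1),
    "max": max(speeds),
    "min": min(speeds),
    # Assumed phase boundaries, see the note above.
    "early": round(phase_mean(10, 25), 1),
    "mid": round(phase_mean(25, 50), 1),
    "late": round(phase_mean(50, 60), 1),
    # Relative change from the first sample to the last one, in percent.
    "degradation_pct": round((speeds[-1] / speeds[0] - 1) * 100),
}

if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name}: {value}")
```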

Prompt Processing

Condition	Speed
Cold start (2,493 tokens)	4,067 t/s
Cache-hit average	911 t/s

10 Sample Points

KV cache (tokens)	Token generation (TG)
2.7k	105.9 t/s
18.3k	79.2 t/s
19.1k	78.8 t/s
21.2k	77.1 t/s
23.7k	75.0 t/s
27.2k	71.4 t/s
38.4k	63.8 t/s
46.6k	58.9 t/s
53.9k	54.7 t/s
56.4k	53.9 t/s

It starts very fast, but speed falls steadily over a long session; from about 18k tokens onward the decline is close to linear. Even at 56k context it is still usable, but for sessions that go past 100k it seems safer to plan on splitting the work into smaller tasks.
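
To put a number on that slope, here is a small least-squares fit over the sample points. It is a sketch under one assumption of mine: the 2.7k sample is left out, because the drop between 2.7k and 18.3k is much steeper than the rest. On these nine points the fit comes out at roughly 0.7 t/s lost per additional 1k tokens of context, and the fitted line would put the 2.7k point near 89 t/s, well below the measured 105.9 t/s.

```python
# (KV cache in thousands of tokens, generation speed in t/s),
# copied from the "10 Sample Points" table, without the 2.7k sample.
POINTS = [
    (18.3, 79.2), (19.1, 78.8), (21.2, 77.1), (23.7, 75.0), (27.2, 71.4),
    (38.4, 63.8), (46.6, 58.9), (53.9, 54.7), (56.4, 53.9),
]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(POINTS)


def predict(kv_k: float) -> float:
    """Speed in t/s that the fitted line gives for a KV cache of kv_k thousand tokens."""
    return slope * kv_k + intercept


if __name__ == "__main__":
    print(f"slope: {slope:.3f} t/s per 1k tokens")
    print(f"intercept: {intercept:.1f} t/s")
    print(f"fitted speed at 2.7k: {predict(2.7):.1f} t/s")
```

Anything the line says beyond 56.4k is extrapolation, not measurement, so it should not be read as a prediction of behavior at 100k and above.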

What Was Generated

Given a WordPress-like Django blog/CMS specification, Zed Agent produced the following in an uncut session of about 14 minutes:

Full Django config
8 apps: accounts, content, core, media, comments, navigation, seo, taxonomy
Model, admin, and app config for each app
Test suite
Gitea issue creation API integration
More than 50 files

During the session it noticed an edit_file path prefix bug on its own and corrected it. That gave a stronger signal than simple code emission; it showed the model could adapt to the quirks of the tool.

Session Screenshots

Startup: MiniMax-M2.7 server initialization and model loading

Early phase: Zed Agent generating Django configuration and structure

Mid-session: multiple Django apps scaffolded

Late phase: 50+ files and a test suite generated

How It Compared with MiniMax-M2.5

In the MiniMax-M2.5 validation article, a single 96GB GPU with Expert Offload sustained a stable 34-37 t/s. That run also felt steadier at longer context and was immediately usable for web generation tasks.

This M2.7 run is not a strict apples-to-apples comparison. M2.5 was single GPU + Expert Offload, while M2.7 was closer to full local inference on dual Blackwell + smol-IQ3_KS, so the premises are very different. Even so, my impression was that M2.5 already handled tool conventions and output quality very well, while M2.7 felt like it had another layer of headroom.

Despite working from a more complex specification than in the M2.5 test, this run also used roughly half as many tokens: around 66k, versus the roughly 122k I remember the M2.5 generation taking.

M2.5 was already good enough that I had high expectations for M2.7, and in practice the size, tool behavior, and output quality all delivered. I had not planned to benchmark it much at all, because the licensing situation already made it unusable for my purposes. But the community comment threads were heated enough that I held onto a faint hope there might be some opening, so I at least ran one benchmark.

The Modified-MIT Label Means Something Very Different in M2.5 and M2.7

MiniMax-M2.7 is under a modified-MIT license, but commercial use is not freely allowed.

By contrast, when I read the M2.5 modified-MIT terms in March 2026, they did not appear to block commercial use in the same way. That difference matters. If I want something that could plausibly connect to commercial work, the more realistic path is still to fine-tune M2.5 and shape it toward my own system.

Interesting for benchmarking, study, and strictly personal local validation
Hard to place directly in business workflows, client work, or internal commercial development loops
Impossible to justify as a default local model

Performance is strong, but licensing ends up being the real decision axis. I could not confidently judge how far the restriction might extend to generated outputs or secondary artifacts, and that uncertainty alone was enough to rule it out as a regular option.

Measurement Method

I never treated this as a candidate for production use, so the benchmark itself was fairly rough, and I also forgot to preserve the inference logs.

Afterward I gave Claude only the relevant parts of the video and had it collect PP/TG values frame by frame, so some reading error is possible. Treat the rendered output in the video as the ground truth.
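
Since reading errors are possible, one cheap safeguard is to check each transcribed timing line against itself: the reported tokens per second should match tokens divided by elapsed time. The sketch below assumes the on-screen lines follow the usual llama.cpp timing shape ("eval time = ... ms / ... tokens (..., ... tokens per second)"); I have not verified that against this run's exact output, and the sample text is hypothetical, not taken from the video.

```python
import re

# Assumed line shape; adjust the pattern if the real output differs.
TIMING_RE = re.compile(
    r"(?P<kind>prompt eval|eval) time\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*(?P<tokens>\d+) "
    r"(?:tokens|runs)\s*\(.*?(?P<tps>[\d.]+) tokens per second\)"
)


def parse_timings(text: str, tolerance: float = 0.02) -> list:
    """Extract PP/TG rows and flag rows whose t/s disagrees with tokens / time."""
    rows = []
    for m in TIMING_RE.finditer(text):
        ms, tokens, tps = float(m["ms"]), int(m["tokens"]), float(m["tps"])
        recomputed = tokens / (ms / 1000.0) if ms > 0 else float("nan")
        rows.append({
            "kind": "PP" if m["kind"] == "prompt eval" else "TG",
            "tokens": tokens,
            "tps": tps,
            # False when the transcribed numbers contradict each other.
            "consistent": abs(recomputed - tps) <= tolerance * tps,
        })
    return rows


# Hypothetical transcription; the last line contains a deliberate misread.
SAMPLE = """\
prompt eval time =    1000.00 ms /  2000 tokens (    0.50 ms per token,  2000.00 tokens per second)
       eval time =    5000.00 ms /   500 tokens (   10.00 ms per token,   100.00 tokens per second)
       eval time =    5000.00 ms /   500 tokens (   10.00 ms per token,   180.00 tokens per second)
"""

rows = parse_timings(SAMPLE)

if __name__ == "__main__":
    for row in rows:
        print(row)
```

A row flagged as inconsistent is a prompt to re-check that frame in the video, not proof of which of its numbers is wrong.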