Running GLM-5.2 (744B-A40B) GGUFs Locally: Did MTP Help? Notes From a Few Quant and Expert-Placement Tests

Notes from running two GLM-5.2 (744B-A40B MoE) GGUF quants (1.630bpw / 2.244bpw) on a dual RTX PRO 6000 Blackwell Max-Q (96GB x2) + 768GB RAM homelab. The quant author reports MTP raising TG 11.81->18.02 t/s with draft acceptance 0.98 on a 2x24GB VRAM setup; on this 192GB box, within the configs I tried, turning MTP on made TG drop instead.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

A record of running sokann/GLM-5.2-GGUF locally. I tried two of them: GLM-5.2-GGUF-1.630bpw and GLM-5.2-GGUF-2.244bpw.

The 2.244bpw README says that enabling Multi-Token Prediction (MTP) with --spec-type mtp:n_max=4,p_min=0.5 raises TG from 11.81 to 18.02 t/s, and that draft acceptance reached 0.98259 (508 accepted / 517 generated). On my side, with the same flags, acceptance never got past about 0.84 and most runs stayed under 0.8, although placing the expert weights on the layers the README specifies and using -mla 1 gave the best TG of my MTP runs on that quant. (-cram 0 might also have played a part.)

My box has RTX PRO 6000 Blackwell Max-Q 96GB x2 = 192GB VRAM, and I was curious what happens when VRAM isn’t the constraint, so I pulled down both models and measured while changing MTP on/off and expert placement a few different ways. The short version: in the configs I tried, turning MTP on made TG drop instead. I didn’t test exhaustively, so I can’t flatly say “MTP doesn’t help” — but I tried it once before on GLM-5.1 too, and enabling MTP cost about 5 t/s of TG there (I think that was the smol-IQ2K_S).

Video link: https://www.youtube.com/watch?v=Wm0SfXveHnQ

Test environment

GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB x2 (PCIe Gen5, no NVLink) = 192GB VRAM
CPU: AMD EPYC 9175F, 16 cores, L3 512MB
RAM: 768GB DDR5
Model: two GGUFs by sokann — 1.630bpw and 2.244bpw
Base model: zai-org/GLM-5.2 (753B params / A40B active, arch glm-dsa, 79 layers + 1 NextN/MTP layer, 256 experts, top-8)
Runtime: ik_llama.cpp (CUDA 13), commit 29a54f4, via Podman

The two models I pulled down:

1.630bpw — experts in IQ1_S_R4 / the rest in Q6_0, 142.9 GiB. Fits fully in 192GB VRAM.
2.244bpw — experts in IQ2_KT / the rest in Q6_0, 196.756 GiB (PPL 3.8402). The weights alone already exceed 192GB before the KV cache and compute buffers are added, so it won’t fit entirely on the GPUs.

Both IQ2_KT and IQ1_S_R4 are ik_llama-only quant types; mainline llama.cpp can’t load them (it rejects ggml type 133). The MTP/NextN layer (layer 78) is Q6_0 in both.

--spec-type mtp:n_max=4,p_min=0.5 is the flag that enables MTP. n_max is the max number of draft tokens proposed per speculative step; p_min is the lower bound on the probability the draft head must assign to a token for it to be proposed — anything below that isn’t proposed.
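As a rough illustration of how those two knobs interact, here is a minimal Python sketch of a draft loop. It is not ik_llama.cpp's implementation: `propose_draft` and `make_toy_head` are hypothetical names, and the toy head just replays fixed confidences instead of running a model.

```python
from typing import Callable, List, Sequence, Tuple

DraftHead = Callable[[List[int]], Tuple[int, float]]


def propose_draft(head: DraftHead, context: List[int],
                  n_max: int, p_min: float) -> List[int]:
    """Propose up to n_max draft tokens, stopping at the first token
    whose draft probability is below p_min."""
    draft: List[int] = []
    ctx = list(context)
    for _ in range(n_max):
        token, prob = head(ctx)
        if prob < p_min:
            break  # too uncertain: stop drafting and hand over to verification
        draft.append(token)
        ctx.append(token)
    return draft


def make_toy_head(probs: Sequence[float]) -> DraftHead:
    """Hypothetical draft head: emits token ids 100, 101, ... with fixed confidences."""
    state = {"i": 0}

    def head(_ctx: List[int]) -> Tuple[int, float]:
        i = state["i"]
        state["i"] += 1
        return 100 + i, probs[i]

    return head


confidences = [0.9, 0.7, 0.4, 0.8]
print(propose_draft(make_toy_head(confidences), [1, 2, 3], n_max=4, p_min=0.5))
print(propose_draft(make_toy_head(confidences), [1, 2, 3], n_max=4, p_min=0.0))
```

With p_min=0.5 the third token (confidence 0.4) cuts the draft short at two tokens; with p_min=0.0 all four are proposed, which is the "no filter" case tried in Part 2.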

Part 1 — 1.630bpw, full GPU, MTP off

1.630bpw (142.9 GiB) loads fully onto both cards; I ran it at 128K context.

  podman run --rm --device nvidia.com/gpu=all -p 8000:8000 --cap-add SYS_NICE \
  -v /mnt/data/hf/hub/models--sokann--GLM-5.2-GGUF-1.630bpw:/models:ro,Z \
  registry.home.arpa/ik_llama:latest \
  -m /models/snapshots/451d18c5c05952bc322b9f9e612abc625d48b211/GLM-5.2-GGUF-1.630bpw.gguf \
  --merge-qkv --ctx-size 131072 --parallel 1 --threads 15 --threads-batch 15 \
  -ctk q8_0 -b 8192 -ub 2048 -ngl 80 -mla 3 -muge -ger -amb 512 \
  --temp 0.6 --top-k 20 --top-p 0.95 --jinja --reasoning off \
  --host 0.0.0.0 --port 8000 --warmup-batch --alias test-model

Offload is 80/80 layers on GPU (CUDA0 71.0GB / CUDA1 72.7GB / CPU 0.7GB). Trying it through the Zed editor agent, small mistakes tend to show up frequently past roughly 4k tokens of output, so it feels usable mainly for short outputs.

Measured:

  Overall:        PP 359.5 t/s   TG 39.0 t/s   E2E 37.0 t/s
  Long prefill:   14,895 tokens at ~747 t/s, 78 gen tokens at ~39 t/s, ~22 s total
  Steady-state:   PP ~250-450 t/s (cached-prefix dependent), TG ~34-40 t/s, E2E ~35 t/s
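The "Long prefill" line can be cross-checked with simple arithmetic: prefill time plus generation time. The sketch below assumes those two phases are the only costs (no queueing or sampling overhead); `request_seconds` is a hypothetical helper name.

```python
def request_seconds(prompt_tokens: int, pp_tps: float,
                    gen_tokens: int, tg_tps: float) -> float:
    """Wall time of one request, modelled as prefill time + generation time."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


# Figures from the long-prefill line above.
total = request_seconds(prompt_tokens=14_895, pp_tps=747, gen_tokens=78, tg_tps=39)
print(f"{total:.1f} s")  # about 21.9 s, in line with the ~22 s measured
```

At these numbers almost all of the wall time is prefill, which is why PP matters as much as TG for agent-style workloads with long prompts.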

For reference, I’ve seen ~18-20 t/s claimed for CPU-only (EPYC + RAM, no GPU) on the HF community, though without logs. Even if TG holds up, PP is probably pretty rough there. With -cmoe, as shown in the video, I only got around 5 t/s.

Part 2 — Same config, MTP on

Same 1.630bpw full-GPU config as Part 1, just adding --spec-type mtp:n_max=4,p_min=0.5.

  # ... same as Part 1, plus:
  --spec-type mtp:n_max=4,p_min=0.5

Measured:

  MTP on:   TG ~15-22 t/s   draft acceptance swings 0.25-0.56   PP 350-750 t/s

Turning MTP on roughly halved TG. Draft acceptance never got high enough to pay back the draft/verify overhead.

To rule out the draft filter being too strict, I also dropped p_min to 0.0 (no filter, maximum speculation):

  MTP on, p_min=0.0:   TG 17-23 t/s   acceptance never cleared ~0.56

Even then, acceptance only climbed to ~0.56.

Config (1.630bpw, full GPU)	MTP	TG (t/s)	Acceptance
`-ngl 80 -mla 3 -ger`	off	34-40	—
`-ngl 80 -mla 3 -ger` + spec-type	on	15-22	0.25-0.56
`-ngl 80 -mla 3 -ger` + spec-type, p_min=0.0	on	17-23	<= 0.56
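To get a feel for why acceptance in this range may not pay back the overhead, here is a back-of-the-envelope model in Python. It uses the common simplification that each draft token is accepted independently with probability alpha, which is an assumption: the acceptance figure logged by the server (accepted / generated) is not exactly that quantity, and `step_cost` is a hypothetical parameter, not something measured here.

```python
def expected_tokens_per_step(alpha: float, n_max: int) -> float:
    """Expected tokens emitted per draft+verify step: the accepted draft
    prefix plus one token from the main model, assuming each draft token
    is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(n_max + 1)
    return (1.0 - alpha ** (n_max + 1)) / (1.0 - alpha)


def speedup(alpha: float, n_max: int, step_cost: float) -> float:
    """TG ratio (MTP on / MTP off) under the model above.

    step_cost is the cost of one draft+verify step measured in plain
    decode steps (hypothetical input)."""
    return expected_tokens_per_step(alpha, n_max) / step_cost


for alpha in (0.25, 0.56, 0.98):
    print(alpha, round(expected_tokens_per_step(alpha, n_max=4), 2))
```

Under this model, at alpha = 0.56 and n_max = 4 a step yields roughly 2.1 tokens, so MTP only wins if a draft+verify step costs less than about 2.1 plain decode steps. At acceptance near 0.98 the break-even cost is close to 5 steps, which leaves far more headroom.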

Part 3 — 2.244bpw (the larger model)

Once you count the KV and compute buffers, 2.244bpw (196.756 GiB) doesn’t fit in 192GB; some expert layers always spill to CPU. The most stable, fast config I found at ~100K context was 2 GPUs, -ngl 65, grouped expert routing (-ger), MTP off.

  podman run --rm -it --name test-model --network host --device nvidia.com/gpu=all --cap-add SYS_NICE \
  -v /mnt/data/hf/hub/models--sokann--GLM-5.2-GGUF-2.244bpw:/models:ro,Z \
  registry.home.arpa/ik_llama:latest \
  -m /models/snapshots/16264a0f7b811d976a9acf704d8511f1127ada30/GLM-5.2-GGUF-2.244bpw.gguf \
  --merge-qkv --ctx-size 102400 --parallel 1 --threads 15 --threads-batch 15 \
  -ctk q8_0 -b 2048 -ub 2048 -sm graph -ngl 65 -mla 3 -amb 512 -muge -ger \
  --temp 0.6 --top-k 20 --top-p 0.95 --jinja --reasoning off \
  --host 0.0.0.0 --port 8000 --warmup-batch --alias test-model

Measured:

  2.244bpw, 2 GPU, ngl65, -ger, MTP off:   PP 672 t/s   TG 15.28 t/s

At ~100K context the best I got for 2.244bpw is 15.28 t/s, less than half of 1.630bpw on full GPU (at 32K context, -ngl 70 reaches 17.6-20 t/s; see Part 6). The quality (PPL 3.84) is better, but it can’t fit entirely on the GPUs.

Part 4 — Packing as many experts as possible onto a single GPU (2.244bpw, MTP off)

How far can one card go if you pack as many experts onto it as possible? I routed the experts of layers 3-6 (layer 6: up/gate only), 40-42, and 60-77 to CUDA0 with -ot:

  podman run --rm -it --name test-model --network host --device nvidia.com/gpu=0 --cap-add SYS_NICE \
  -v /mnt/data/hf/hub/models--sokann--GLM-5.2-GGUF-2.244bpw:/models:ro,Z \
  registry.home.arpa/ik_llama:latest \
  -m /models/snapshots/16264a0f7b811d976a9acf704d8511f1127ada30/GLM-5.2-GGUF-2.244bpw.gguf \
  -ngl 99 -cmoe \
  -ot 'blk\.([345]|1[0-9]2[0-9]|6[0-9]|7[0-7])\.ffn_.*_exps\.weight=CUDA0' \
  -ot 'blk\.6\.ffn_(up|gate)_exps\.weight=CUDA0' \
  -ot 'blk\.(40|41|42)\.ffn_.*_exps\.weight=CUDA0' \
  --threads 15 --threads-batch 15 -mla 3 -amb 512 -c 131072 -ctk q8_0 -khad \
  -b 8192 -ub 4096 -wgt 1 -muge --jinja --parallel-tool-calls \
  --chat-template-kwargs '{"reasoning_effort": "low"}' \
  --host 0.0.0.0 --port 8000 --alias test-model

Offload: experts 3-6, 40-42, 60-77 on CUDA0, the rest on CUDA_Host (CPU). CUDA0 buffer 74347 MiB, CUDA_Host buffer 124459 MiB. Allocating the 121.54 GiB of pinned host memory took ~13 s at startup.
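A quick way to sanity-check which blocks those -ot patterns select is to run them against synthesized tensor names. The Python sketch below assumes blocks are numbered 0..78, that expert tensors are named `blk.N.ffn_{gate,up,down}_exps.weight`, and that patterns are applied as unanchored regex searches; it does not read the GGUF or reproduce the runtime's override precedence.

```python
import re

# Patterns copied from the -ot flags above (the part before "=CUDA0").
CUDA0_PATTERNS = [
    r"blk\.([345]|1[0-9]2[0-9]|6[0-9]|7[0-7])\.ffn_.*_exps\.weight",
    r"blk\.6\.ffn_(up|gate)_exps\.weight",
    r"blk\.(40|41|42)\.ffn_.*_exps\.weight",
]
EXPERT_PARTS = ("gate", "up", "down")
N_BLOCKS = 79  # assumption: block indices 0..78


def cuda0_experts(patterns=CUDA0_PATTERNS, n_blocks=N_BLOCKS):
    """Map block index -> set of expert tensors matched by any pattern."""
    compiled = [re.compile(p) for p in patterns]
    placed = {}
    for layer in range(n_blocks):
        for part in EXPERT_PARTS:
            name = f"blk.{layer}.ffn_{part}_exps.weight"  # synthesized name
            if any(c.search(name) for c in compiled):
                placed.setdefault(layer, set()).add(part)
    return placed


placed = cuda0_experts()
full = sorted(l for l, parts in placed.items() if len(parts) == 3)
partial = {l: sorted(parts) for l, parts in placed.items() if len(parts) < 3}
print("all three expert tensors on CUDA0:", full)
print("partially on CUDA0:", partial)
```

In this sketch the `1[0-9]2[0-9]` alternative never matches, since it would need a four-digit block index. The resulting set agrees with the offload line above: blocks 3-5, 40-42, and 60-77 in full, plus the up/gate tensors of block 6.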

Measured (20,508-token prefill, -ub 4096):

  2.244bpw, 1 GPU, expert-saturated, MTP off:   PP 576.73 t/s   TG 10.23 t/s

nvtop showed GPU0 at 85GB/95GB (memory 88%, compute ~32%). This is the fastest single-GPU result, and it also beats the best MTP run on this quant, which uses both GPUs (8.93 t/s, Part 8). Packing one card to its limit and running MTP off was faster.

Part 5 — The floor when experts are pushed onto CPU (`-ot exps=CPU` and `-cmoe`)

To see the bottom of the range, I also measured configs with experts pushed onto CPU. With mostly CPU + a few GPU layers:

  1.630bpw, -ot exps=CPU hybrid, MTP on:   TG 2.34-2.36 t/s (task 0/13)

  2.244bpw, -cmoe (all experts CPU):   prefill ~63 s for 20,480 tokens, TG a few t/s

Putting all experts on CPU with -cmoe only gives about 5 t/s of TG, as shown in the video. On top of that, just the 20,480-token prefill takes ~63 s, so this isn’t a usable config — only a floor measurement. The -ot exps=CPU hybrid above adds MTP on to that, and drops further to 2.34-2.36 t/s. Adding MTP to a config bottlenecked on the CPU forward pass only makes it slower, through the draft/verify overhead.

Part 6 — Lowering `-ngl` on 2.244bpw (how many layers stay on CPU)

On 2.244bpw with 2 GPUs, lowering -ngl leaves the upper layers’ experts on CPU. I lowered it step by step to see the cost of each layer placed on CPU.

-ngl 70 (10 layers on CPU), MTP off:

  podman run --rm -it --name test-model --network host --device nvidia.com/gpu=all --cap-add SYS_NICE \
  -v /mnt/data/hf/hub/models--sokann--GLM-5.2-GGUF-2.244bpw:/models:ro,Z \
  registry.home.arpa/ik_llama:latest \
  -m /models/snapshots/16264a0f7b811d976a9acf704d8511f1127ada30/GLM-5.2-GGUF-2.244bpw.gguf \
  --merge-qkv --ctx-size 32768 --parallel 1 --threads 15 --threads-batch 15 \
  -ctk q8_0 -b 2048 -ub 2048 -ngl 70 -mla 3 -amb 512 -muge -ger \
  --temp 0.6 --top-k 20 --top-p 0.95 --jinja --reasoning off \
  --host 0.0.0.0 --port 8000 --warmup-batch --alias test-model

Offload: offloaded 70/80 layers to GPU, CUDA0 91.6GB / CUDA1 89.0GB / CUDA_Host 18.1GB. Both GPUs nearly full, experts 69-78 on CPU.

  task 0:   PP 611 t/s   TG 17.64 t/s   (20,502-token prefill)
  task 69:  PP 630 t/s   TG 20.02 t/s

TG is 17.6-20 t/s, about half of 1.630bpw on full GPU. In nvtop you can see GPU1 swing to 100% while GPU0 drops to 0%, which shows layers crossing the CPU<->GPU boundary.

-ngl 65 (15 layers on CPU), MTP on: adding --spec-type mtp:n_max=4,p_min=0.5 to the -ngl 65 hybrid:

  2.244bpw, ngl65 hybrid, MTP on:   TG 4.6 t/s   acceptance 0.67-0.71

Even with a not-bad acceptance of 0.67-0.71, the base CPU forward pass is too slow for MTP to recover.

Part 7 — `-mla 1` vs `-mla 3`

-mla 1 uses less KV VRAM; against -mla 3 it was slower in the 1-GPU run and marginally faster on TG in the 2-GPU run. Separate from speed, -mla 1 consistently gave slightly higher MTP acceptance, and that trend held on both 2 GPUs and 1 GPU.

2 GPUs, -ngl 50, -ot for experts 3-9 / 40-49, MTP on:

  # -mla 1 variant
  ... -ngl 50 -mla 1 -amb 512 -ot "blk\.([3-9]|4[0-9])\.ffn_.*_exps\.weight=CPU" ... \
  --spec-type mtp:n_max=4,p_min=0.5

  ngl50, ot[3-9/40-49], -mla 1, MTP on:   PP 369 t/s   TG 5.40 t/s   acceptance 0.763
  ngl50, ot[3-9/40-49], -mla 3, MTP on:   PP 371 t/s   TG 5.14 t/s   acceptance 0.712

1 GPU, -ub 4096, MTP on:

  1 GPU, -mla 1, MTP on:   PP 406 t/s   TG 6.52 t/s   acceptance 0.766   (470-token gen)
  1 GPU, -mla 3, MTP on:   PP 505 t/s   TG 7.05 t/s   acceptance 0.682

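-mla 1 raises acceptance by about 0.05-0.08 (2 GPU 0.712→0.763, 1 GPU 0.682→0.766). For speed the picture is mixed: on 1 GPU -mla 3 gave higher PP and TG (505 vs 406, 7.05 vs 6.52), while on 2 GPUs PP was a tie and -mla 1 was slightly ahead on TG (5.40 vs 5.14). Either way, neither reaches the MTP-off baseline.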

Part 8 — Reproducing the author’s `-ot` placement (the best MTP run here)

The 2.244bpw README targets 2x24GB VRAM and pushes most experts onto CPU with -cmoe + -ot + -mla 1 + -ub 2048 + -cram 0. Reproducing that placement, scaled to my machine, gave the highest acceptance of any MTP run here:

  podman run --rm -it --name test-model --network host --device nvidia.com/gpu=all --cap-add SYS_NICE \
  -v /mnt/data/hf/hub/models--sokann--GLM-5.2-GGUF-2.244bpw:/models:ro,Z \
  registry.home.arpa/ik_llama:latest \
  -m /models/snapshots/16264a0f7b811d976a9acf704d8511f1127ada30/GLM-5.2-GGUF-2.244bpw.gguf \
  --no-mmap -ngl 99 -cmoe -sm graph \
  -ot 'blk\.([345])\.ffn_.*_exps\.weight=CUDA0' \
  -ot 'blk\.6\.ffn_(up|gate)_exps\.weight=CUDA0' \
  -ot 'blk\.(40|41|42)\.ffn_.*_exps\.weight=CUDA1' \
  -mla 1 -amb 512 -c 102400 -ctk q6_0 -khad \
  -b 2048 -ub 2048 -wgt 1 -cram 0 -muge -cuda graphs=1 \
  --jinja --parallel-tool-calls \
  --chat-template-kwargs '{"reasoning_effort": "high"}' \
  --spec-type mtp:n_max=4,p_min=0.5 ...

Offload: experts almost entirely on CPU, only blk 3-6 / 40-42 on GPU. Effective VRAM footprint ~48GB.

  -ot, MTP on:   PP 351 t/s   TG 8.93 t/s   acceptance 0.835

This is the best MTP result on 2.244bpw for TG (8.93 t/s) and the highest acceptance of any run here (0.835); only the 1.630bpw full-GPU MTP runs were faster (15-22 t/s). The author’s config is a VRAM-minimization strategy, and getting 8.93 t/s in ~48GB is good value for the VRAM.

Generation quality at 1.630bpw (sub-2-bit)

When I had it write a few short, small functions, it held together from implementation through fixes with no breakdown. But once output went past around 4k tokens, small mistakes started coming one after another — often things like == turning into ==== in repetitive code.

Personally it felt like it might be usable for short outputs. For something like mass-producing short SFT data with a check on the output, it could pay off — you get the kind of content only a large-parameter model produces — but for long outputs I think the yield is poor.

Results summary (within what I tried)

Config	Quant	GPUs	MTP	PP (t/s)	TG (t/s)	Acceptance	VRAM
Full GPU `-ngl 80`	1.630	2	off	250-750	34-40	—	144GB
Full GPU `-ngl 80` + spec	1.630	2	on	350-750	15-22	0.25-0.56	144GB
Full GPU, p_min=0.0	1.630	2	on	350-750	17-23	<= 0.56	144GB
`-ngl 65 -ger -sm graph`	2.244	2	off	672	15.28	—	2 GPUs full + 15 on CPU
`-ngl 70` (10 CPU)	2.244	2	off	611-630	17.6-20	—	GPU 180.6 + CPU 18.1
1 GPU expert-saturated	2.244	1	off	576	10.23	—	85GB
author `-ot` (blk 3-6/40-42)	2.244	2	on	351	8.93	0.835	~48GB
1 GPU `-mla 1`	2.244	1	on	406	6.52	0.766	—
1 GPU `-mla 3`	2.244	1	on	505	7.05	0.682	—
`-ngl 50` ot, `-mla 1`	2.244	2	on	369	5.40	0.763	—
`-ngl 50` ot, `-mla 3`	2.244	2	on	371	5.14	0.712	—
`-ngl 65` hybrid	2.244	2	on	—	4.6	0.67-0.71	—
`-ot exps=CPU` hybrid	1.630	—	on	—	2.34-2.36	—	—
`-cmoe` (all CPU)	2.244	2	on	~325 (20,480 tokens in ~63 s)	~5	—	—

The PP / TG / VRAM trade-off

Speed-first: 1.630bpw full GPU, MTP off — ~37 t/s, ~144GB VRAM.
Minimal VRAM: the author’s -ot placement, MTP on — TG 8.93 in ~48GB; PP tops out at 351, but the value for the VRAM is good.

Summary

These are my impressions from measuring GLM-5.2-GGUF while varying GPU count, layer placement, attention mode, and MTP on/off a few ways. The MTP configs I tried gave underwhelming results, but if GLM-5.2 runs in around 48GB of VRAM at TG ~9 t/s and PP ~350 t/s, there’s probably a use for it in things like long-running pipeline processing. For more on the cost of MTP, and on running GLM-5.2 on different hardware / a different engine, this article is a good reference: https://dnhkng.github.io/posts/gh200-benchmarking-part-3-glm52/
