GLM-4.7-Flash IQ5_K Benchmark: CPU vs Hybrid vs Full GPU Performance Comparison

Benchmarking GLM-4.7-Flash (IQ5_K GGUF) across CPU-only, MoE Expert Offload (Hybrid), and Full GPU configurations. Prefill reaches 100 vs 1,635 vs 3,723 tok/s and Decode 20 vs 67 vs 99 tok/s, which quantifies why Hybrid is the sweet spot.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

Background

GLM-4.7-Flash is a 30B-A3B MoE model from THUDM (Tsinghua University) that uses the DeepSeek2 architecture with 64 experts (4 active per token). With 30B total parameters and 3B active per token, it is lightweight yet capable, with multilingual support and long context (up to 128K tokens).

On our EPYC 9175F + RTX PRO 6000 Blackwell setup, the main question was: how much performance does the “MoE Expert Offload to CPU” hybrid configuration actually deliver compared to CPU-only and Full GPU?

Recommended Configuration by Use Case

Pattern	Setup	Max PP Speed	Avg TG Speed	Best For
A	CPU-only	100.32 t/s	20.23 t/s	Offline and batch-oriented work
B	`exps=CPU` (Hybrid)	1635.35 t/s	66.84 t/s	VRAM headroom + practical throughput
C	exps on GPU (Full)	3723.34 t/s	99.42 t/s	Interactive use, pipelines, and resident agents

Hybrid is not a fallback. In a homelab context, it is a credible mainline configuration.

Objective

Quantify Prefill/Decode speeds for GLM-4.7-Flash (IQ5_K) across CPU/Hybrid/Full GPU
Validate the practicality of MoE Expert Offload (exps=CPU)
Obtain comparison data with NVFP4 quantization on vLLM

Test Environment

Item	Specification
CPU	AMD EPYC 9175F (Zen 5, 16C, L3 512MB)
GPU	NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB
Memory	DDR5-6400 768GB (12ch)
OS	Ubuntu 24.04 LTS
Runtime (CPU/Hybrid/GPU)	ik_llama.cpp (build 4192, commit 1cb7e1bf)
Runtime (NVFP4)	vLLM (OpenAI API compatible)
Model	GLM-4.7-Flash IQ5_K (GGUF, ubergarm quantization)
Context	131,072 tokens (128K)

Model Specifications

Item	Value
Architecture	DeepSeek2 (MoE)
Layers	47
Experts	64 (4 active)
Shared Experts	1
Attention	MLA (Multi-head Latent Attention)
Training Context	202,752
Vocabulary	154,880

Methodology

Pattern A: CPU-Only

podman run --rm -it -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 -m "$MODEL" --no-mmap --jinja \
  -c 131072 -n 8192 --threads 13 --threads-batch 23 \
  -b 2048 -ub 2048 -ctk f16 -ctv f16

AVX-512 VNNI / BF16 active (AVX512 = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1). All layers on CPU.

Pattern B: Hybrid (Expert=CPU, Attention=GPU)

Same as Pattern A, plus --device nvidia.com/gpu=all (a podman option) and -ot exps=CPU (a server option). MoE Expert weights stay in CPU RAM, while Attention and the KV cache run on the GPU.

Pattern C: Full GPU

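All 48 layers are offloaded to the GPU (presumably the 47 transformer layers listed above plus the output layer, which llama.cpp-style runtimes count separately). No expert offloading.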

Results

3-Pattern Summary (128K Context, 30K+ Token Processing)

Pattern	Setup	Max PP Speed	Avg TG Speed	Total Time	Notes
A	CPU-only	100.32 t/s	20.23 t/s	879s	Pure CPU, slow for 128K
B	Hybrid (exps=CPU)	1,635.35 t/s	66.84 t/s	169s	16x PP boost over CPU
C	Full GPU	3,723.34 t/s	99.42 t/s	80s	Near 100 t/s generation

Pattern A: CPU-Only Detail

#	PP(tok)	TG(tok)	PP(t/s)	TG(t/s)	Total(s)
1	31,151	427	100.32	21.51	330.4
2	980	6,284	45.55	19.85	338.1
3	2,886	2,921	48.53	19.34	210.5
Total	35,017	9,632	89.44	19.76	879.0
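The Total row and the summary table report slightly different decode averages for Pattern A (19.76 vs 20.23 t/s). One reading, offered here as an assumption rather than a documented method, is that the Total row is token-weighted (total tokens over total time) while the summary figure is the simple mean of the per-run rates. A minimal awk sketch that reproduces both numbers from the table; the helper names are illustrative:

```shell
#!/usr/bin/env bash
# Pattern A decode runs as "tokens rate" pairs, copied from the table above.
runs="427 21.51
6284 19.85
2921 19.34"

# Simple mean of the per-run rates.
simple_mean() {
  awk '{ sum += $2; n++ } END { printf "%.2f\n", sum / n }'
}

# Token-weighted rate: total tokens divided by total time (tokens / rate per run).
weighted_rate() {
  awk '{ tokens += $1; seconds += $1 / $2 } END { printf "%.2f\n", tokens / seconds }'
}

echo "simple mean:    $(printf '%s\n' "$runs" | simple_mean) t/s"    # 20.23
echo "token-weighted: $(printf '%s\n' "$runs" | weighted_rate) t/s"  # 19.76
```

The token-weighted figure is the better predictor of wall-clock time, because long runs contribute in proportion to their length.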

Pattern B: Hybrid Detail

#	PP(tok)	TG(tok)	PP(t/s)	TG(t/s)	Total(s)
1	31,151	774	1,635.35	70.01	30.1
2	981	4,091	792.91	67.04	62.3
3	2,388	2,692	900.82	66.26	43.3
4	874	2,106	619.90	66.10	33.3
Total	35,394	9,663	1,453.76	66.84	168.9

16.3x PP improvement and 3.3x TG improvement over CPU-only.

Pattern C: Full GPU Detail

#	PP(tok)	TG(tok)	PP(t/s)	TG(t/s)	Total(s)
1	31,151	630	3,723.34	106.67	14.3
2	981	4,325	1,638.04	99.16	44.2
3	2,373	1,918	1,619.97	97.84	21.1
Total	34,505	6,873	3,308.19	99.43	79.6
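The Total(s) column in the detail tables is consistent with a simple two-phase model: prompt tokens divided by the prefill rate, plus generated tokens divided by the decode rate. The sketch below checks this against run 1 of each pattern. The function names are illustrative, and the model ignores any overhead outside the two phases.

```shell
#!/usr/bin/env bash
# phase_seconds TOKENS RATE -> seconds spent in one phase (prefill or decode).
phase_seconds() {
  awk -v tok="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", tok / rate }'
}

# request_seconds PP_TOK PP_RATE TG_TOK TG_RATE -> estimated end-to-end seconds.
request_seconds() {
  awk -v pp="$1" -v pr="$2" -v tg="$3" -v tr="$4" \
    'BEGIN { printf "%.1f\n", pp / pr + tg / tr }'
}

# Run 1 of each pattern, figures copied from the detail tables.
echo "A (CPU-only): $(request_seconds 31151 100.32 427 21.51) s"    # table: 330.4
echo "B (Hybrid):   $(request_seconds 31151 1635.35 774 70.01) s"   # table: 30.1
echo "C (Full GPU): $(request_seconds 31151 3723.34 630 106.67) s"  # table: 14.3
echo "A prefill only: $(phase_seconds 31151 100.32) s"
```

The same two functions can be used to estimate latency for other prompt and response lengths, under the assumption that the measured rates hold at those sizes.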

NVFP4 (vLLM) Reference

Metric	Value	Notes
Prefill	80-250 t/s (peak 459 t/s)	Peak with prefix cache
Decode	60-100 t/s (peak 112 t/s)	Stable range
TTFT (800-1100 token input)	4-6 seconds	Reduced with prefix cache
Prefix cache hit rate	20-40%	Rises with repeated agent calls

Analysis

The Hybrid Sweet Spot

Pattern B was the standout finding. Offloading only the MoE Experts to CPU looks like a compromise, yet TG at 67 t/s is fast enough for interactive use. Full GPU reaches 99 t/s, but keeping the Experts on CPU frees most of the VRAM, which allows longer contexts or several models running concurrently.

This makes Hybrid a viable strategy for systems with less than 96GB of VRAM that still need to run MoE models.

PP vs TG: Different Bottlenecks

PP (Prefill): Compute-bound. GPU parallelism scales it 37x over CPU
TG (Decode): Memory-bandwidth-bound. CPU-to-GPU improvement is “only” 5x

This asymmetry stems from how the two phases use the hardware: Prefill processes many prompt tokens in parallel within each batch, whereas Decode produces one token at a time, so each step is dominated by memory access rather than arithmetic.
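The ratios quoted in this article can be re-derived from the 3-pattern summary table. A minimal awk sketch; the speedup helper is an illustrative name, not part of any benchmark tooling:

```shell
#!/usr/bin/env bash
# speedup NEW OLD -> NEW/OLD rounded to one decimal place.
speedup() {
  awk -v new="$1" -v old="$2" 'BEGIN { printf "%.1f\n", new / old }'
}

# Peak PP and average TG figures from the 3-pattern summary table.
echo "Hybrid   vs CPU: PP $(speedup 1635.35 100.32)x, TG $(speedup 66.84 20.23)x"
echo "Full GPU vs CPU: PP $(speedup 3723.34 100.32)x, TG $(speedup 99.42 20.23)x"
```

Placing the PP and TG ratios side by side makes the asymmetry explicit: prefill gains an order of magnitude more from the GPU than decode does.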

CPU-Only at 20 t/s Could Be a Viable Option

Pattern A’s 20 t/s exceeds human reading speed (~6 t/s). That is sufficient for batch processing (for example, Dagster pipelines), but prefill of a 30K+ token prompt takes more than 5 minutes (31,151 tokens at 100.32 t/s is about 310 s), which makes it unsuitable for real-time long-context use.

Why Hybrid Matters Beyond VRAM Savings

Offloading expert weights to CPU is not just a VRAM-saving trick. It is a configuration that preserves GPU capacity for other uses while keeping GLM-4.7-Flash at a practical operating speed.

  GPU utilization: Hybrid vs Full GPU
┌───────────────────────────────┐
│ Full GPU                      │
│  Attention  │  Experts (GPU)  │  <- GPU fully saturated
└───────────────────────────────┘

┌───────────────────────────────┐
│ Hybrid                        │
│  Attention (GPU) │ [Free VRAM]│  <- Available for other models/jobs
└───────────────────────────────┘
        ↓
        Experts (CPU)

This makes Hybrid a strong choice when:

Full GPU monopolization of the card is not acceptable
Longer context windows or co-resident models are needed
CPU-only latency is too high for the workload

Lessons Learned

The Hybrid configuration (-ot exps=CPU) performed far better than expected. Even with the majority of model weights on CPU, GPU-accelerated Attention alone yields a 3.3x TG improvement. This demonstrates the maturity of ik_llama.cpp’s Expert Offload feature.

Full GPU is the clear winner for pure speed, but for homelabs running multiple models on a single GPU, the Hybrid approach offers “67 t/s while saving most of the VRAM” as a compelling trade-off.

NVFP4 + vLLM Operational Evaluation

In addition to the IQ5_K benchmarks above, GLM-4.7-Flash-NVFP4 was also evaluated on vLLM for operational suitability.

Output Quality

Instruction-following accuracy was high with almost no output breakdowns. The model completed a complex end-to-end workflow — “repository reading -> architecture explanation -> file generation -> Git commit” — without interruption. Quantization-induced degradation was not a practical concern.

Comparison with CPU MoE Inference

Metric	GLM-4.7-Flash (GPU/vLLM)	Maverick Q4/Q8 (CPU)
TTFT	~4–6s	12–20s
Prefill tok/s	80–250 (peak ~459)	50–68
Decode tok/s	60–100 (peak ~112)	15–24
Dialogue suitability	Excellent	Fair
Batch suitability	Excellent	Adequate

Use Case Suitability

Suitable:

Resident agent / multi-chat — fast responses with headroom for concurrent sessions
Stream-first UI / API — high Decode speeds enable smooth streaming
High-throughput generation, summarization, and transformation pipelines
Asynchronous workflows paired with NATS

Unsuitable:

Pure CPU-only environments (fundamentally depends on vLLM and a GPU)
Scenarios requiring extremely low per-request cost

Overall Assessment

Dimension	Assessment
Performance	Throughput in a local GPU environment presents no practical bottleneck
Stability	Prefix cache functions effectively; strong resilience in repetitive workflows
Practicality	Fully capable of balancing interactive agent tasks and backend batch workloads

GLM-4.7-Flash-NVFP4 is the strongest candidate for the primary local LLM in this environment, complementing or replacing the CPU-driven MoE models relied on previously.

Next Steps

Full GPU is the best choice for daily interactive use and maximum throughput.
Hybrid (-ot exps=CPU) is the best balance when preserving VRAM without falling back to CPU-only latency.
CPU-only is viable for batch-oriented or fully air-gapped setups, and confirms the EPYC 9175F can carry the model alone.

The next useful step is to push on context length and concurrency under the same GLM-4.7-Flash model family, and measure how much GPU capacity the hybrid layout leaves available for other resident models or concurrent jobs. The current data suggests the hybrid mode has real headroom to explore there.

Reproduction Steps

1. Download Model

huggingface-cli download ubergarm/GLM-4.7-Flash-GGUF \
  --include "GLM-4.7-Flash-IQ5_K.gguf" \
  --local-dir /mnt/data/hf/hub/models--ubergarm--GLM-4.7-Flash-GGUF

2. Build ik_llama.cpp

ik_llama.cpp is a llama.cpp fork with native MLA support and Expert Offload. Build with Zen 5 optimization (-march=znver5 or -DGGML_NATIVE=ON).
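A minimal build sketch, assuming the upstream repository at github.com/ikawrakow/ik_llama.cpp and a standard CMake toolchain. -DGGML_NATIVE=ON comes from the text above; -DGGML_CUDA=ON is an assumption that applies only to the GPU patterns (B and C) and was not stated in the original notes. The commands are wrapped in a function with an illustrative name so that sourcing the snippet does not start a build.

```shell
#!/usr/bin/env bash
# Hypothetical helper: clone and build ik_llama.cpp with native CPU tuning.
# Pass "cuda" as the first argument to also enable the CUDA backend (assumption).
build_ik_llama() {
  local backend=${1:-cpu}
  local cmake_args=(-B build -DGGML_NATIVE=ON)
  if [ "$backend" = "cuda" ]; then
    cmake_args+=(-DGGML_CUDA=ON)
  fi
  git clone https://github.com/ikawrakow/ik_llama.cpp &&
    cd ik_llama.cpp &&
    cmake "${cmake_args[@]}" &&
    cmake --build build --config Release -j "$(nproc)"
}
```

Calling build_ik_llama with no argument would produce a CPU-only build, and build_ik_llama cuda a CUDA-enabled one. How the container image used in the next step was actually built is not documented here.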

3. Run (3 Patterns)

# Common variables
IMG=compute.home.arpa/ik_llama-cpu:latest
MO=/mnt/data/hf/hub/models--ubergarm--GLM-4.7-Flash-GGUF
MODEL=/models/snapshots/.../GLM-4.7-Flash-IQ5_K.gguf

# Pattern A: CPU-only
podman run --rm -it -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  -m "$MODEL" --no-mmap --jinja \
  -c 131072 -n 8192 --threads 13 --threads-batch 23 \
  -b 2048 -ub 2048 -ctk f16 -ctv f16 \
  --host 0.0.0.0 --port 8080

# Pattern B: Hybrid - add --device nvidia.com/gpu=all before $IMG (podman option) and -ot exps=CPU after it (server option)
# Pattern C: Full GPU - add --device nvidia.com/gpu=all before $IMG (no -ot flag)
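The three patterns differ only in two flags, so they can be generated from one template. The sketch below is a dry run that prints the command instead of starting a container; pattern_cmd is a hypothetical helper name. The notes do not say whether patterns B and C need a different (CUDA-enabled) image or additional GPU layer flags, so none are added here.

```shell
#!/usr/bin/env bash
# Placeholders copied from the common variables; the "..." in MODEL is elided in the source.
IMG=${IMG:-compute.home.arpa/ik_llama-cpu:latest}
MO=${MO:-/mnt/data/hf/hub/models--ubergarm--GLM-4.7-Flash-GGUF}
MODEL=${MODEL:-/models/snapshots/.../GLM-4.7-Flash-IQ5_K.gguf}

# pattern_cmd A|B|C -> prints the podman command for that pattern (nothing is started).
pattern_cmd() {
  local podman_opts=(--rm -it -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE
                     -v "$MO":/models:ro,Z)
  local server_opts=(-m "$MODEL" --no-mmap --jinja
                     -c 131072 -n 8192 --threads 13 --threads-batch 23
                     -b 2048 -ub 2048 -ctk f16 -ctv f16
                     --host 0.0.0.0 --port 8080)
  case "$1" in
    A) ;;                                             # CPU-only: no extra flags
    B) podman_opts+=(--device nvidia.com/gpu=all)     # expose the GPU to the container
       server_opts+=(-ot exps=CPU) ;;                 # keep expert tensors in CPU RAM
    C) podman_opts+=(--device nvidia.com/gpu=all) ;;  # GPU exposed, no tensor override
    *) echo "usage: pattern_cmd A|B|C" >&2; return 1 ;;
  esac
  echo podman run "${podman_opts[@]}" "$IMG" "${server_opts[@]}"
}

pattern_cmd B
```

Keeping the podman options and the server options in separate arrays makes the placement rule visible: everything before the image name is interpreted by podman, everything after it by the server inside the container.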

Technical Notes

How Expert Offload Works

In MoE models, Expert weights dominate total parameter count (GLM-4.7-Flash has 64 experts, of which only 4 are active per token). -ot exps=CPU places only the Expert weights in CPU RAM, while the Attention, Embedding, and Router layers stay on the GPU.

After the router selects experts, their computation runs on the CPU. Attention, including KV cache access, runs on the GPU, which removes that work from the CPU and raised Decode speed from 20 to 67 t/s in these tests.
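For a sense of scale, total weight size can be approximated as parameter count times bits per weight, divided by 8. The sketch below assumes a nominal 5.5 bits per weight for IQ5_K; that value is an assumption, and the actual GGUF mixes quantization types per tensor, so the real file size will differ. The helper name is illustrative.

```shell
#!/usr/bin/env bash
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT -> approximate weight size in GB (10^9 bytes).
weights_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { print p * bpw / 8 }'
}

# 30B total parameters (from the article); 5.5 bits per weight is an assumed nominal value.
echo "all weights: ~$(weights_gb 30 5.5) GB"
```

Since experts make up most of the parameters, most of that footprint moves to CPU RAM under -ot exps=CPU, which is where the VRAM savings come from.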

ik_llama.cpp vs llama.cpp

ik_llama.cpp provides native MLA (Multi-head Latent Attention) support, optimized for DeepSeek2/GLM-4.7 architectures. Standard llama.cpp can load the GGUF but may lack MLA-specific optimizations.

For GPUs Under 96GB

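With Expert Offload, VRAM consumption drops to roughly the Attention layers plus the KV cache (~10-15GB for GLM-4.7-Flash IQ5_K). A 24GB+ GPU should therefore fit the Hybrid configuration and, on a comparable CPU platform, should be able to reach TG of around 60 t/s, although this was not measured on smaller GPUs here.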
