Background

GLM-4.7-Flash is a 30B-A3B MoE model from THUDM (Tsinghua University). It uses the DeepSeek2 architecture with 64 experts, 4 of which are active per token, so only about 3B of its 30B total parameters are used for each token. That keeps it lightweight yet capable, with multilingual support and long context (128K tokens in these tests).

On our EPYC 9175F + RTX PRO 6000 Blackwell setup, the main question was: how much performance does the “MoE Expert Offload to CPU” hybrid configuration actually deliver compared to CPU-only and Full GPU?

| Pattern | Setup | Max PP (prefill) Speed | Avg TG (decode) Speed | Best For |
|---|---|---|---|---|
| A | CPU-only | 100.32 t/s | 20.23 t/s | Offline and batch-oriented work |
| B | exps=CPU (Hybrid) | 1,635.35 t/s | 66.84 t/s | VRAM headroom + practical throughput |
| C | exps on GPU (Full) | 3,723.34 t/s | 99.42 t/s | Interactive use, pipelines, and resident agents |

Hybrid is not a fallback. In a homelab context, it is a credible mainline configuration.

Objective

  1. Quantify Prefill/Decode speeds for GLM-4.7-Flash (IQ5_K) across CPU/Hybrid/Full GPU
  2. Validate the practicality of MoE Expert Offload (exps=CPU)
  3. Obtain comparison data with NVFP4 quantization on vLLM

Test Environment

| Item | Specification |
|---|---|
| CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB |
| Memory | DDR5-6400 768GB (12ch) |
| OS | Ubuntu 24.04 LTS |
| Runtime (CPU/Hybrid/GPU) | ik_llama.cpp (build 4192, commit 1cb7e1bf) |
| Runtime (NVFP4) | vLLM (OpenAI API compatible) |
| Model | GLM-4.7-Flash IQ5_K (GGUF, ubergarm quantization) |
| Context | 131,072 tokens (128K) |

Model Specifications

| Item | Value |
|---|---|
| Architecture | DeepSeek2 (MoE) |
| Layers | 47 |
| Experts | 64 (4 active) |
| Shared Experts | 1 |
| Attention | MLA (Multi-head Latent Attention) |
| Training Context | 202,752 |
| Vocabulary | 154,880 |

Methodology

Pattern A: CPU-Only

  podman run --rm -it -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 -m "$MODEL" --no-mmap --jinja \
  -c 131072 -n 8192 --threads 13 --threads-batch 23 \
  -b 2048 -ub 2048 -ctk f16 -ctv f16
  

The runtime reports AVX-512 VNNI and BF16 as active (AVX512 = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1). All layers run on CPU.

Pattern B: Hybrid (Expert=CPU, Attention=GPU)

Same as Pattern A, plus --device nvidia.com/gpu=all (a podman option that exposes the GPU to the container) and -ot exps=CPU (a server option). MoE Expert weights stay in CPU RAM; Attention and the KV cache run on GPU.

Pattern C: Full GPU

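All 48 layers are offloaded to GPU. The count of 48 is presumably the 47 model layers listed under Model Specifications plus the output layer, which llama.cpp-style runtimes count separately. No expert offloading.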

Results

3-Pattern Summary (128K Context, 30K+ Token Processing)

| Pattern | Setup | Max PP Speed | Avg TG Speed | Total Time | Notes |
|---|---|---|---|---|---|
| A | CPU-only | 100.32 t/s | 20.23 t/s | 879s | Pure CPU, slow for 128K |
| B | Hybrid (exps=CPU) | 1,635.35 t/s | 66.84 t/s | 169s | 16x PP boost over CPU |
| C | Full GPU | 3,723.34 t/s | 99.42 t/s | 80s | Near 100 t/s generation |

Pattern A: CPU-Only Detail

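| # | PP (tok) | TG (tok) | PP (t/s) | TG (t/s) | Total (s) |
|---|---|---|---|---|---|
| 1 | 31,151 | 427 | 100.32 | 21.51 | 330.4 |
| 2 | 980 | 6,284 | 45.55 | 19.85 | 338.1 |
| 3 | 2,886 | 2,921 | 48.53 | 19.34 | 210.5 |
| Total | 35,017 | 9,632 | 89.44 | 19.76 | 879.0 |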

Pattern B: Hybrid Detail

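| # | PP (tok) | TG (tok) | PP (t/s) | TG (t/s) | Total (s) |
|---|---|---|---|---|---|
| 1 | 31,151 | 774 | 1,635.35 | 70.01 | 30.1 |
| 2 | 981 | 4,091 | 792.91 | 67.04 | 62.3 |
| 3 | 2,388 | 2,692 | 900.82 | 66.26 | 43.3 |
| 4 | 874 | 2,106 | 619.90 | 66.10 | 33.3 |
| Total | 35,394 | 9,663 | 1,453.76 | 66.84 | 168.9 |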

Hybrid delivers a 16.3x PP improvement (max PP, 1,635.35 vs 100.32 t/s) and a 3.3x TG improvement (avg TG, 66.84 vs 20.23 t/s) over CPU-only.
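
The two ratios can be recomputed from the summary figures. The snippet below is a minimal sketch: `speedup` is a hypothetical helper name introduced here, and it assumes the comparison is max PP against max PP and avg TG against avg TG, which is what the quoted ratios appear to be based on.

```shell
# speedup NEW OLD: print NEW/OLD rounded to one decimal place.
speedup() {
  LC_ALL=C awk -v new="$1" -v old="$2" 'BEGIN { printf "%.1f\n", new / old }'
}

echo "PP: $(speedup 1635.35 100.32)x"   # Hybrid max PP vs CPU-only max PP -> PP: 16.3x
echo "TG: $(speedup 66.84 20.23)x"      # Hybrid avg TG vs CPU-only avg TG -> TG: 3.3x
```

The same helper applies to any other pair of rows in the summary table.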

Pattern C: Full GPU Detail

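| # | PP (tok) | TG (tok) | PP (t/s) | TG (t/s) | Total (s) |
|---|---|---|---|---|---|
| 1 | 31,151 | 630 | 3,723.34 | 106.67 | 14.3 |
| 2 | 981 | 4,325 | 1,638.04 | 99.16 | 44.2 |
| 3 | 2,373 | 1,918 | 1,619.97 | 97.84 | 21.1 |
| Total | 34,505 | 6,873 | 3,308.19 | 99.43 | 79.6 |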
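
The Total rows in the detail tables appear to be token-weighted rates (total tokens divided by total seconds) rather than plain means of the per-run rates. That is an inference from the numbers, not something the write-up states. A minimal sketch of the calculation, using a hypothetical helper named `weighted_tps`:

```shell
# Read "tokens rate" pairs, one run per line, and print total tokens / total seconds.
# Each run's duration is reconstructed as tokens / rate.
weighted_tps() {
  LC_ALL=C awk '{ tokens += $1; seconds += $1 / $2 }
                END { if (seconds > 0) printf "%.2f\n", tokens / seconds }'
}

# Pattern C decode rows: TG tokens and TG t/s for runs 1 to 3.
printf '%s\n' "630 106.67" "4325 99.16" "1918 97.84" | weighted_tps   # -> 99.43
```

Feeding in the Pattern A decode rows gives 19.76, the value in that table's Total row. The 20.23 t/s quoted for Pattern A in the summary instead matches a plain mean of its three per-run rates.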

NVFP4 (vLLM) Reference

| Metric | Value | Notes |
|---|---|---|
| Prefill | 80-250 t/s (peak 459 t/s) | Peak with prefix cache |
| Decode | 60-100 t/s (peak 112 t/s) | Stable range |
| TTFT (800-1100 token input) | 4-6 seconds | Reduced with prefix cache |
| Prefix cache hit rate | 20-40% | Rises with repeated agent calls |

Analysis

The Hybrid Sweet Spot

Pattern B was the standout finding. Offloading only the MoE Experts to CPU looks like a compromise, but TG at 67 t/s is potentially fast enough for interactive use. Full GPU reaches 99 t/s, but keeping Experts on CPU leaves most of the card's VRAM free (the Technical Notes estimate roughly 10-15GB in use), which enables longer contexts or running several models concurrently.

This makes Hybrid a viable strategy for GPUs with less than 96GB of VRAM that still need to run MoE models.

PP vs TG: Different Bottlenecks

  • PP (Prefill): Compute-bound. GPU parallelism scales it about 37x over CPU (3,723.34 vs 100.32 t/s)
  • TG (Decode): Memory-bandwidth-bound. The CPU-to-GPU improvement is “only” about 5x (99.42 vs 20.23 t/s)

This asymmetry follows from how the two phases work: Prefill processes many prompt tokens in parallel, so additional compute helps directly, while Decode produces one token at a time and is dominated by memory access.

CPU-Only at 20 t/s Could Be a Viable Option

Pattern A’s 20 t/s exceeds human reading speed (~6 t/s). That is sufficient for batch processing (for example, Dagster pipelines), but prefilling a 30K+ token prompt takes more than 5 minutes, which makes CPU-only unsuitable for real-time long-context use.
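
The 5+ minute figure follows from the run 1 prompt size and the measured prefill rate. The sketch below repeats that arithmetic for all three patterns; `prefill_seconds` is a hypothetical helper name, and it assumes the prompt is processed at the measured maximum PP rate with no cache hits.

```shell
# prefill_seconds TOKENS RATE: seconds needed to prefill TOKENS at RATE tokens/s.
prefill_seconds() {
  LC_ALL=C awk -v tokens="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", tokens / rate }'
}

prefill_seconds 31151 100.32    # CPU-only -> 310.5 (about 5.2 minutes)
prefill_seconds 31151 1635.35   # Hybrid   -> 19.0
prefill_seconds 31151 3723.34   # Full GPU -> 8.4
```

For Pattern A, adding the decode time for run 1 (427 tokens at 21.51 t/s, roughly 20 seconds) lands on the 330.4 seconds recorded for that run.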

Why Hybrid Matters Beyond VRAM Savings

Offloading expert weights to CPU is not just a VRAM-saving trick. It is a configuration that preserves GPU capacity for other uses while keeping GLM-4.7-Flash at a practical operating speed.

  GPU utilization: Hybrid vs Full GPU
┌───────────────────────────────┐
│ Full GPU                      │
│  Attention  │  Experts (GPU)  │  <- GPU fully saturated
└───────────────────────────────┘

┌───────────────────────────────┐
│ Hybrid                        │
│  Attention (GPU) │ [Free VRAM]│  <- Available for other models/jobs
└───────────────────────────────┘
        ↓
        Experts (CPU)
  

This makes Hybrid a strong choice when:

  • Full GPU monopolization of the card is not acceptable
  • Longer context windows or co-resident models are needed
  • CPU-only latency is too high for the workload

Lessons Learned

The Hybrid configuration (-ot exps=CPU) performed far better than expected. Even with the majority of model weights on CPU, GPU-accelerated Attention alone yields a 3.3x TG improvement. This demonstrates the maturity of ik_llama.cpp’s Expert Offload feature.

Full GPU is the clear winner for pure speed, but for homelabs running multiple models on a single GPU, the Hybrid approach offers “67 t/s while saving most of the VRAM” as a compelling trade-off.

NVFP4 + vLLM Operational Evaluation

In addition to the IQ5_K benchmarks above, GLM-4.7-Flash-NVFP4 was also evaluated on vLLM for operational suitability.

Output Quality

Instruction-following accuracy was high with almost no output breakdowns. The model completed a complex end-to-end workflow — “repository reading -> architecture explanation -> file generation -> Git commit” — without interruption. Quantization-induced degradation was not a practical concern.

Comparison with CPU MoE Inference

| Metric | GLM-4.7-Flash (GPU/vLLM) | Maverick Q4/Q8 (CPU) |
|---|---|---|
| TTFT | ~4–6s | 12–20s |
| Prefill tok/s | 80–250 (peak ~459) | 50–68 |
| Decode tok/s | 60–100 (peak ~112) | 15–24 |
| Dialogue suitability | Excellent | Fair |
| Batch suitability | Excellent | Adequate |

Use Case Suitability

Suitable:

  • Resident agent / multi-chat — fast responses with headroom for concurrent sessions
  • Stream-first UI / API — high Decode speeds enable smooth streaming
  • High-throughput generation, summarization, and transformation pipelines
  • Asynchronous workflows paired with NATS

Unsuitable:

  • Pure CPU-only environments (fundamentally depends on vLLM and a GPU)
  • Scenarios requiring extremely low per-request cost

Overall Assessment

| Dimension | Assessment |
|---|---|
| Performance | Throughput in a local GPU environment presents no practical bottleneck |
| Stability | Prefix cache functions effectively; strong resilience in repetitive workflows |
| Practicality | Fully capable of balancing interactive agent tasks and backend batch workloads |

GLM-4.7-Flash-NVFP4 is the strongest candidate for the primary local LLM in this environment, complementing or replacing the CPU-driven MoE models relied on previously.

Next Steps

  • Full GPU is the best choice for daily interactive use and maximum throughput.
  • Hybrid (-ot exps=CPU) is the best balance when VRAM must be preserved without falling back to CPU-only latency.
  • CPU-only is viable for batch-oriented or fully air-gapped setups, and confirms the EPYC 9175F can carry the model alone.

The next useful step is to push on context length and concurrency under the same GLM-4.7-Flash model family, and measure how much GPU capacity the hybrid layout leaves available for other resident models or concurrent jobs. The current data suggests the hybrid mode has real headroom to explore there.

Reproduction Steps

1. Download Model

  huggingface-cli download ubergarm/GLM-4.7-Flash-GGUF \
  --include "GLM-4.7-Flash-IQ5_K.gguf" \
  --local-dir /mnt/data/hf/hub/models--ubergarm--GLM-4.7-Flash-GGUF
  

2. Build ik_llama.cpp

ik_llama.cpp is a llama.cpp fork with native MLA support and Expert Offload. Build with Zen 5 optimization (-march=znver5 or -DGGML_NATIVE=ON).

3. Run (3 Patterns)

  # Common variables
IMG=compute.home.arpa/ik_llama-cpu:latest
MO=/mnt/data/hf/hub/models--ubergarm--GLM-4.7-Flash-GGUF
MODEL=/models/snapshots/.../GLM-4.7-Flash-IQ5_K.gguf

# Pattern A: CPU-only
podman run --rm -it -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  -m "$MODEL" --no-mmap --jinja \
  -c 131072 -n 8192 --threads 13 --threads-batch 23 \
  -b 2048 -ub 2048 -ctk f16 -ctv f16 \
  --host 0.0.0.0 --port 8080

# Pattern B: Hybrid - add --device nvidia.com/gpu=all -ot exps=CPU
# Pattern C: Full GPU - add --device nvidia.com/gpu=all (no -ot flag)
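
The three patterns differ only in the two flags named in the comments above, so the commands can be generated from one template. The following is a dry-run sketch that prints the command instead of running it. `pattern_cmd` is a hypothetical helper, and the placement of each flag (--device on the podman side, -ot on the server side) is an assumption based on what each flag does. Any other GPU-related server flags the author may have used are not shown in the write-up and are not guessed here.

```shell
# Values copied from "Common variables"; the "..." in MODEL is elided in the source.
IMG="${IMG:-compute.home.arpa/ik_llama-cpu:latest}"
MO="${MO:-/mnt/data/hf/hub/models--ubergarm--GLM-4.7-Flash-GGUF}"
MODEL="${MODEL:-/models/snapshots/.../GLM-4.7-Flash-IQ5_K.gguf}"

# pattern_cmd A|B|C: print the podman command for that pattern without running it.
pattern_cmd() {
  podman_extra=""
  server_extra=""
  case "$1" in
    A) ;;                                              # CPU-only: no extra flags
    B) podman_extra="--device nvidia.com/gpu=all "     # expose the GPU to the container
       server_extra="-ot exps=CPU " ;;                 # keep expert tensors in CPU RAM
    C) podman_extra="--device nvidia.com/gpu=all " ;;  # Full GPU: no -ot override
    *) echo "usage: pattern_cmd A|B|C" >&2; return 1 ;;
  esac
  printf '%s\n' "podman run --rm -it -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
${podman_extra}-v \"$MO\":/models:ro,Z $IMG \
-m \"$MODEL\" --no-mmap --jinja \
-c 131072 -n 8192 --threads 13 --threads-batch 23 \
-b 2048 -ub 2048 -ctk f16 -ctv f16 \
${server_extra}--host 0.0.0.0 --port 8080"
}

pattern_cmd B   # prints the Hybrid command on one line
```

Printing first makes it easy to diff the three variants before launching any of them.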
  

Technical Notes

How Expert Offload Works

In MoE models, Expert weights dominate total parameter count. GLM-4.7-Flash has 64 experts with only 4 active per token, which is why 30B total parameters correspond to about 3B active. -ot exps=CPU places only the Expert weights in CPU RAM, while the Attention, Embedding, and Router layers stay on GPU.

After the router selects the experts for a token, their computation runs on CPU. Attention, including KV cache access, runs on GPU, which shifts the bottleneck and raised Decode speed 3.3x over CPU-only in these tests.
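
The -ot option takes the form pattern=buffer, where the pattern is matched against tensor names. The sketch below uses grep to show which names a plain pattern like exps would select. The tensor names are illustrative, following the usual GGUF naming for DeepSeek2-style models, and were not dumped from the actual file. `matched_tensors` is a hypothetical helper. The runtime uses its own regex engine, but for a literal pattern like this one the result should be the same.

```shell
# Read tensor names on stdin (one per line) and print those matching the pattern.
matched_tensors() {
  grep -E -- "$1" || true   # grep exits 1 when nothing matches; keep that non-fatal
}

# Illustrative names for one MoE layer (assumed, not read from the real GGUF).
sample_tensors='blk.1.attn_output.weight
blk.1.ffn_gate_inp.weight
blk.1.ffn_gate_exps.weight
blk.1.ffn_up_exps.weight
blk.1.ffn_down_exps.weight
blk.1.ffn_gate_shexp.weight'

printf '%s\n' "$sample_tensors" | matched_tensors 'exps'
```

With these sample names, the three routed-expert tensors (gate, up, down) match. The router (ffn_gate_inp) and the shared expert (shexp) do not, so under this reading they would stay on GPU.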

ik_llama.cpp vs llama.cpp

ik_llama.cpp provides native MLA (Multi-head Latent Attention) support, optimized for DeepSeek2/GLM-4.7 architectures. Standard llama.cpp can load the GGUF but may lack MLA-specific optimizations.

For GPUs Under 96GB

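With Expert Offload, VRAM consumption drops to roughly the Attention layers plus the KV cache (~10-15GB for GLM-4.7-Flash IQ5_K). On that basis, a GPU with 24GB or more should be able to deliver TG 60+ t/s in Hybrid mode, provided the host CPU and memory bandwidth are comparable to this system, since the Experts still run on CPU. This is an extrapolation from the 96GB card used here.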