1T MoE Kimi-K2.5 CPU Inference: Thread Optimization Through Long Context Operations
Complete CPU inference benchmark of Kimi-K2.5 (1.03T MoE, Q4_K_S/Q4_K_M) on EPYC 9175F: why th=13 is the sweet spot, measured long-context performance at 16K-32K, LCP cache effectiveness, and the design path to Dagster batch operations.
Background
Kimi-K2.5 is a 1.03 trillion parameter MoE model by Moonshot AI. It selects 8 out of 384 experts, keeping active parameters at ~32B. Built on the DeepSeek-V2 architecture (MLA: Multi-head Latent Attention), it achieves compressed KV cache for better memory efficiency.
At Q4_K_S quantization, RSS is ~523 GiB. At Q4_K_M, ~579 GiB. Both fit within 768GB DDR5 memory, making GPU-free CPU inference physically possible. The question is whether “physically possible” translates to “practically usable.”
Objective
- Benchmark Q4_K_S CPU inference speed (baseline performance)
- Measure thread count vs throughput relationship, identify optimal thread count
- Measure Q4_K_M Prefill/Decode speed at 32K context
- Validate Prompt Cache (LCP similarity) effectiveness
- Determine viability for Dagster pipeline batch operations
Test Environment
| Item | Specification |
|---|---|
| CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| Memory | DDR5-6400 768GB (12ch) |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB |
| OS | Ubuntu 24.04 LTS |
| Runtime | llama.cpp server (Podman rootless) |
Model Specifications
| Item | Q4_K_S | Q4_K_M |
|---|---|---|
| Architecture | deepseek2 (MoE + MLA) | Same |
| Total Parameters | 1.03T | Same |
| Layers | 61 | Same |
| Experts | 384 (8 active) | Same |
| Quantization | Q4_K_S | Q4_K_M (4.84 bpw) |
| Model Size | ~520 GiB (RSS) | 578.57 GiB |
| Training Context | 262,144 | Same |
Methodology
Benchmark Command (llama-sweep-bench)
```shell
MODEL=/models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf
IMG=compute.home.arpa/ik_llama-cuda:latest

podman run --rm -it \
  --device nvidia.com/gpu=all \
  --shm-size 16g \
  --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:ro,Z \
  $IMG \
  /app/llama-sweep-bench \
  --model "$MODEL" \
  --no-mmap --merge-qkv \
  -mla 3 -amb 512 \
  -b 4096 -ub 4096 \
  -ctk f16 -ctv f16 \
  -c 131072 \
  -ngl 999 -ot exps=CPU \
  --threads 13 \
  --threads-batch 26 \
  --warmup-batch \
  -n 128
```
This command performs:
- `llama-sweep-bench`: dedicated benchmarking tool
- `-ngl 999 -ot exps=CPU`: full GPU offload with Expert weights kept on CPU
- `-c 131072`: 131K context support
- `-ctk f16 -ctv f16`: KV cache in f16 precision
- `--threads 13 --threads-batch 26`: thread configuration
- `-mla 3 -amb 512`: MLA (Multi-head Latent Attention) parameters
Memory Layout (Q4_K_S)
| Region | Size |
|---|---|
| KV cache (K) | 1,098 MiB |
| KV cache (V) | 976 MiB |
| CPU compute buffer | 348 MiB |
| Total RSS | ~523 GiB / 755 GiB |
| Swap usage | 799 MiB (no si/so activity) |
Memory Layout (Q4_K_M / ctx=32K)
| Region | Size |
|---|---|
| KV cache | 4,148 MiB (K: 2,196 / V: 1,952) |
| CPU compute buffer | 348 MiB |
| Model buffers | 578.57 GiB (13-split GGUF) |
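The KV-cache figures in this table can be reproduced from the MLA dimensions. A minimal sketch, assuming DeepSeek-V2-style MLA with a 512-dim compressed latent (`kv_lora_rank`) plus a 64-dim RoPE component per token per layer; these dimensions are taken from the deepseek2 architecture, not measured here:

```python
# Estimate MLA KV-cache size for ctx=32K at f16 precision.
# Assumption (hedged, from the deepseek2/MLA design): the K cache stores the
# 512-dim compressed latent plus a 64-dim RoPE part per token per layer;
# the V cache stores only the 512-dim latent.
LAYERS = 61
CTX = 32_768
F16_BYTES = 2

k_dims = 512 + 64   # latent + RoPE component
v_dims = 512        # latent only

k_mib = LAYERS * CTX * k_dims * F16_BYTES / 2**20
v_mib = LAYERS * CTX * v_dims * F16_BYTES / 2**20

print(f"K cache: {k_mib:.0f} MiB, V cache: {v_mib:.0f} MiB")
# Matches the table exactly: 2,196 MiB (K) and 1,952 MiB (V)
```

The same arithmetic at ctx=16K gives exactly half these values, matching the Q4_K_S memory layout above.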
CPU Inference Live Demonstration
To verify that CPU inference is actually working on Kimi-K2.5, we recorded real-time execution footage on the EPYC 9175F. By watching token generation stream through the inference server's stdout, you can see tokens being produced as they are computed and judge the actual output quality—tangible proof that this large model runs on CPU alone.
The video demonstrates:
- llama.cpp server startup and model loading (quantized weights)
- Prefill phase (prompt evaluation) token generation speed and content
- Token-by-token Generate phase output
- Verification of actual generated text quality
Results
Q4_K_S Baseline (th=14, ctx=16K)
| Request | Prompt(tok) | PP(tok/s) | Gen(tok) | TG(tok/s) | Total(s) |
|---|---|---|---|---|---|
| 1st (no cache) | 823 | 22.24 | 438 | 10.27 | 79.7 |
| 2nd (cache saved) | 1,335 | 19.98 | 1,012 | 8.76 | 115.6 |
| 3rd (LCP hit) | - | - | - | - | cache lookup 62ms |
Thread Optimization (ctx=8K)
| Threads | PP(tok/s) | TG(tok/s) | Assessment |
|---|---|---|---|
| 16 | 24.43 | 12.94 | Maximum output (baseline) |
| 14 | 21.32 | 12.50 | Bandwidth saturation onset |
| 13 | 21.58 | 11.67 | Sweet spot |
| 12 | 14.58 | 11.86 | Resource efficiency focus |
Q4_K_M Long Context (th=13, ctx=32K)
| Request | Prompt(tok) | PP(tok/s) | Gen(tok) | TG(tok/s) | Notes |
|---|---|---|---|---|---|
| 1st | 16,148 | 6.15 | 333 | 2.44 | Full 16K prefill, ~44 min |
| 2nd (LCP 0.978) | 356 | 3.40 | 2,048 | 2.26 | Cache hit, diff-only prefill |
| 3rd (LCP 0.999) | 12 | 3.11 | 1,024 | 2.15 | Near-full cache restore |
| 4th (LCP 0.939) | 1,050 | 3.21 | 1,024 | 2.07 | Partial cache + diff prefill |
Prompt Cache Effectiveness (Q4_K_S)
| State | Size | Effect |
|---|---|---|
| 1,260 tokens saved | 159.5 MiB | LCP similarity > 0.5 triggers hit |
| Cache restore | - | Tens of ms (62ms measured) |
| TTFT reduction | - | Dramatic prompt eval time reduction on repeat |
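The LCP hit behavior can be sketched as follows. This is an illustrative model of longest-common-prefix matching, not the actual llama.cpp implementation; the 0.5 threshold is the value cited in the table above:

```python
# Illustrative sketch of longest-common-prefix (LCP) prompt-cache matching.
def lcp_len(cached, prompt):
    """Length of the shared token prefix between cache and new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

def plan_prefill(cached, prompt, threshold=0.5):
    """Return (similarity, tokens_to_prefill) for a new request."""
    n = lcp_len(cached, prompt)
    sim = n / max(len(prompt), 1)
    if sim > threshold:
        return sim, len(prompt) - n   # diff-only prefill
    return sim, len(prompt)           # below threshold: full prefill

# Example: a 16K cached conversation where only the tail changed.
cached = list(range(16_000))
prompt = cached[:15_650] + [9_990_000 + i for i in range(500)]
sim, todo = plan_prefill(cached, prompt)
print(f"similarity={sim:.3f}, tokens to prefill={todo}")
```

This is why high-similarity requests (LCP 0.978, 0.999 in the Q4_K_M table) only prefill a few hundred tokens instead of the full 16K.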
Addendum: Improvements with ik_llama.cpp (Expert CPU + Attention GPU Hybrid)
Using the optimized ik_llama.cpp build with Expert weights on CPU and Attention layers on GPU (-ngl 999 -ot exps=CPU), we collected the following performance data:
Execution Command
```shell
podman run --rm -it --device nvidia.com/gpu=all \
  -p 8081:8080 \
  --shm-size 32g \
  --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:ro,Z \
  $IMG \
  --host 0.0.0.0 --port 8080 \
  -m "$MODEL" --no-mmap --jinja \
  -c 131072 \
  -n 128 \
  --threads 13 --threads-batch 26 \
  -b 2048 -ub 512 \
  -ngl 999 -ot exps=CPU \
  -ctk f16 -ctv f16 \
  --merge-qkv -mla 3 -amb 512
```
`-ngl 999 -ot exps=CPU`: offloads attention layers to GPU while Expert weights remain on CPU (hybrid configuration)
Benchmark Results (Initial)
| Task | PP(tok) | TG(tok) | N_KV(tok) | T_PP(s) | S_PP(t/s) | T_TG(s) | S_TG(t/s) |
|---|---|---|---|---|---|---|---|
| 0 | 5,264 | 744 | 6,007 | 59.596 | 88.33 | 37.815 | 19.67 |
| 747 | 765 | 259 | 6,287 | 13.277 | 57.62 | 13.164 | 19.68 |
| 1,007 | 279 | 1,024 | 7,331 | 6.243 | 44.69 | 52.452 | 19.52 |
| 2,032 | 1,037 | 1,024 | 8,368 | 16.772 | 61.83 | 51.793 | 19.77 |
| 3,057 | 1,041 | 310 | 8,695 | 16.637 | 62.57 | 16.124 | 19.23 |
| Avg | - | - | - | - | 63.0 | - | 19.6 |
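The Avg row follows directly from the per-task numbers; a quick check using only values from the table:

```python
# Verify the averaged prefill/generation speeds from the initial benchmark table.
s_pp = [88.33, 57.62, 44.69, 61.83, 62.57]
s_tg = [19.67, 19.68, 19.52, 19.77, 19.23]

avg_pp = sum(s_pp) / len(s_pp)
avg_tg = sum(s_tg) / len(s_tg)
print(f"S_PP avg: {avg_pp:.1f} t/s, S_TG avg: {avg_tg:.1f} t/s")
# → 63.0 and 19.6, matching the Avg row
```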
Subsequent Measurements
After applying Prompt Cache (LCP) and template optimizations:
| Run | PP(tok) | TG(tok) | N_KV(tok) | T_PP(s) | S_PP(t/s) | T_TG(s) | S_TG(t/s) | Notes |
|---|---|---|---|---|---|---|---|---|
| 1 | 5,330 | 401 | 5,730 | 41.298 | 129.06 | 20.458 | 19.60 | Fresh request |
| 2 | 416 | 2,241 | 7,986 | 8.363 | 49.75 | 114.552 | 19.56 | Cache partial miss |
| 3 | 2,255 | 919 | 8,919 | 20.631 | 109.30 | 48.056 | 19.12 | Cache partial miss |
Metric Explanations
| Metric | Meaning | Example |
|---|---|---|
| PP (Prompt eval tokens) | Tokens evaluated during prefill phase | Prompt length + cache diff |
| TG (eval tokens) | Tokens evaluated during generation phase | Output token count |
| N_KV | Total KV cache tokens at decode completion | Accumulated cache depth |
| T_PP(s) | Prefill duration (seconds) | Input processing time |
| S_PP(t/s) | Prefill speed (tokens/sec) | PP / T_PP |
| T_TG(s) | Generation duration (seconds) | Token-by-token generation time |
| S_TG(t/s) | Generation speed (tokens/sec) | TG / T_TG |
Assessment (Improvements / Limitations)
Improvements:
- Prefill tolerance enhanced: Runs 1 and 3 achieve 100–130 t/s prefill speed. Long prompts process quickly
- Generation speed stabilized: S_TG hovers around 19 t/s across all runs. Blackwell + Q4_K_S hits a hard ceiling
- Cache effectiveness: Partial cache hits in Runs 2–3 maintain S_PP at 50–110 t/s
Current constraints:
- Generation bottleneck: S_TG fixed at ~19 t/s. Perceived latency dominated by prefill time and output token count
- Cache consistency: Runs 2–3 show “Common part does not match fully” warnings. Even minor changes to System Prompt or templates (whitespace, timestamps) fragment the cache
Next Steps (Priority Order)
1. Maximize cache efficiency ⭐ Highest impact
- Completely fix System Prompt, tool declarations, and templates
- Remove dynamic strings (timestamps, random IDs, session markers)
- Systemize OpenWebUI dynamic injection to unify prompt structure
- Effect: Runs 2–3 prefill speed approaches ideal (100+ t/s)
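A hedged sketch of the "remove dynamic strings" step. The regexes below are illustrative patterns for typical dynamic content (timestamps, UUID session markers), not taken from OpenWebUI or any specific template:

```python
import re

# Normalize a system prompt so repeated requests stay byte-identical and the
# LCP cache keeps hitting. Patterns are illustrative, not from a real template.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?")
UUID = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")

def stabilize(system_prompt: str) -> str:
    s = TIMESTAMP.sub("<TS>", system_prompt)
    s = UUID.sub("<ID>", s)
    return s.strip() + "\n"   # fixed trailing whitespace

a = stabilize("Current time: 2025-01-02 10:30  session=a1b2c3d4-0000-1111-2222-333344445555")
b = stabilize("Current time: 2025-01-03 09:15  session=deadbeef-0000-1111-2222-333344445555")
assert a == b   # byte-identical prefix → cache hit on the second request
```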
2. Ensure parameter alignment
- Match the server's `-n` (max generation tokens) with OpenWebUI's `max_tokens`
- Eliminate mismatches like `params.n_predict=2048 slot.n_predict=128` observed in earlier logs
- Remove wasted buffering and computation
3. Structural generation improvements (requires alternative approach)
- Kimi-K2.5 Q4_K_S architecture makes S_TG improvement difficult
- Next candidates:
- Quantization change: Q4_K_S → IQ4_XS/IQ3_M (ik_llama.cpp recommended)
- Model size: Switch to lighter MoE model
- Architecture: Select model with MoE activation patterns optimized for GPU
Build Verification (SM_120 Support)
To confirm Blackwell (SM 120) build is in use:
- Quick check: if `compiled for: 520` no longer appears in the server startup logs, the new build is active
- Thorough check: review the build log (cmake configure phase) for `CMAKE_CUDA_ARCHITECTURES=120`
Analysis
Memory Bandwidth and Why th=13
Decode speed saturates around th=13-14. Twelve-channel DDR5-6400 offers a theoretical ~614 GB/s, but the random access pattern over MoE expert weights cannot fully utilize it. th=16 yields 12.94 tok/s and th=13 yields 11.67 tok/s: giving up 3 threads costs only ~10% of throughput. That is the rationale for th=13—the freed cores host Dagster/Trino and other data-pipeline processes while ~90% of inference speed is retained. On EPYC parts with more cores, others have observed throughput continuing to scale with thread count; treat this ratio as the yield balance for this particular setup.
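The saturation point is consistent with a simple bandwidth roofline. A back-of-the-envelope sketch, assuming ~32B active parameters at roughly 4.5 bits/weight (both figures hedged from the model specs, with routing and KV-cache traffic ignored):

```python
# Roofline estimate for decode speed: each generated token must stream the
# active expert weights from RAM once. Assumptions: ~32B active params,
# ~4.5 bpw (Q4_K_S-class), 12-channel DDR5-6400 peak ~614 GB/s.
active_params = 32e9
bits_per_weight = 4.5
peak_bw = 614e9                                          # bytes/s

bytes_per_token = active_params * bits_per_weight / 8    # 18 GB per token
ideal_tps = peak_bw / bytes_per_token
print(f"{bytes_per_token / 1e9:.0f} GB/token -> ideal {ideal_tps:.0f} tok/s")

measured = 12.94                                         # th=16, ctx=8K
print(f"bandwidth efficiency: {measured / ideal_tps:.0%}")
```

The measured 11-13 tok/s sits well below the ~34 tok/s roofline, which is expected: random expert selection defeats prefetching, so adding threads past the saturation point buys little.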
Long Context Reality Check
16K bulk prefill taking 44 minutes suggests that “fill 256K from scratch every time” is unrealistic. At 20 tok/s, 256K would take ~3.5 hours.
The practical solution:
- Limit ctx to 16K-32K
- Prefill a System Digest (8K-16K) once at startup
- Use Prompt Cache (LCP similarity) for diff-only processing on subsequent requests
- Keep output length at 1K default, 2K only when necessary
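The arithmetic behind these limits, using the measured prefill rates from the tables above:

```python
# Prefill wall-clock estimates at measured CPU rates.
def prefill_minutes(tokens, tok_per_s):
    return tokens / tok_per_s / 60

# Q4_K_M, ctx=32K: 16,148 prompt tokens at the measured 6.15 tok/s
print(f"16K prefill: {prefill_minutes(16_148, 6.15):.0f} min")      # ~44 min

# Hypothetical full 256K prefill, even at an optimistic 20 tok/s
print(f"256K prefill: {prefill_minutes(262_144, 20) / 60:.1f} h")   # ~3.6 h
```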
Decode 2.4 tok/s vs 10 tok/s
Q4_K_S at ctx=16K delivers 10 tok/s. Q4_K_M at ctx=32K delivers 2.4 tok/s. The gap is context-length driven—the 4.1GB KV cache attention computation becomes the bottleneck at 32K. Unsuitable for interactive chat, but batch processing tolerates the wait.
Lessons Learned
Running a 1T model on CPU is technically validated. Decode at 10 tok/s is insufficient for interactive use but fully practical for Dagster pipeline batch generation, dataset augmentation, and distillation teacher generation.
Operational conclusion: run interactive/fast inference on the GPU side (RTX PRO 6000, vLLM), while CPU llama.cpp runs at th=13 as a resident batch intelligence engine. Of 768GB memory, 523 GiB goes to the model, leaving 200GB+ for DataFrame operations and Trino queries in parallel.
Reproduction Steps
1. Download Models
```shell
# Q4_K_S
huggingface-cli download unsloth/Kimi-K2.5-GGUF \
  --include "Q4_K_S/*" \
  --local-dir /mnt/data/hf/hub/models--Kimi-K2.5-GGUF

# Q4_K_M
huggingface-cli download unsloth/Kimi-K2.5-GGUF \
  --include "Q4_K_M/*" \
  --local-dir /mnt/data/hf/hub/models--Kimi-K2.5-GGUF
```
2. Run
See commands in “Methodology” section. Requires llama.cpp with flash-attn and prompt cache support.
3. Measure
```shell
curl -s http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"kimi","messages":[{"role":"user","content":"Explain MoE architecture"}],"max_tokens":512}'
```
Extract prompt eval time and eval time from server logs.
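The timing lines can be pulled out with a small parser. The sample line format below is an assumption based on common llama.cpp timing output and varies between builds, so treat the regex as a starting point to adapt:

```python
import re

# Parse llama.cpp-style timing summaries from server logs.
# Log format is an assumption (hedged); adjust the pattern to your build.
TIMING = re.compile(
    r"(?P<phase>prompt eval|eval) time\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*(?P<tok>\d+)"
)

def parse_timings(log_text):
    out = {}
    for m in TIMING.finditer(log_text):
        ms, tok = float(m["ms"]), int(m["tok"])
        out[m["phase"]] = {"ms": ms, "tokens": tok, "tok_per_s": tok / (ms / 1000)}
    return out

# Sample numbers reconstructed from the Q4_K_S baseline table (823 tok prompt
# at 22.24 t/s, 438 tok generated at 10.27 t/s).
sample = """
prompt eval time =   37005.4 ms /   823 tokens
       eval time =   42648.0 ms /   438 tokens
"""
t = parse_timings(sample)
print(t["prompt eval"]["tok_per_s"], t["eval"]["tok_per_s"])
```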
Technical Notes
Prompt Cache Design Principles
- System Digest must be byte-identical across requests (whitespace/date changes break cache)
- RAG context goes in user messages, not system (cache preservation is priority)
Q4_K_S vs Q4_K_M
Q4_K_M is ~60GB larger (520→579 GiB). Quality is marginally better, but speed difference is minimal. For ctx=16K batch operations, Q4_K_S is sufficient.
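The 4.84 bpw figure can be sanity-checked from the file size and parameter count in the spec table:

```python
# Effective bits-per-weight from on-disk size: Q4_K_M is 578.57 GiB for a
# 1.03T-parameter model (both values from the spec table above).
size_bits = 578.57 * 2**30 * 8
params = 1.03e12
bpw = size_bits / params
print(f"{bpw:.2f} bpw")
```

This lands near the quoted 4.84 bpw; the small gap comes from rounding in the 1.03T parameter figure and from tensors that are not quantized to 4 bits.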
Recommended Parameters (Batch Operations)
| Parameter | Recommended | Rationale |
|---|---|---|
| ctx | 16,384-32,768 | 256K is impractical |
| threads | 13 | Memory bandwidth saturation, frees 3 cores for pipeline |
| ubatch | 256 | More stable than 512 |
| cache-ram | 32,768 MiB | Stabilizes LCP hit rate |
| output | 1,024 | Generation speed is the bottleneck, keep output short |
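The table above can be turned into a launch configuration. The flag names mirror the commands shown earlier in this article; the wrapper itself is hypothetical, not part of llama.cpp:

```python
# Build a llama.cpp server argument list from the recommended batch-operation
# parameters. Flag names follow the commands used earlier in this article;
# this helper is illustrative only.
RECOMMENDED = {
    "-c": 32_768,      # context: stay in the 16K-32K band, 256K is impractical
    "--threads": 13,   # bandwidth saturation point, frees 3 cores for pipelines
    "-ub": 256,        # micro-batch: more stable than 512
    "-n": 1_024,       # keep output short; generation is the bottleneck
}

def build_args(model_path, overrides=None):
    params = {**RECOMMENDED, **(overrides or {})}
    args = ["llama-server", "-m", model_path]
    for flag, value in params.items():
        args += [flag, str(value)]
    return args

argv = build_args("/models/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf")
print(" ".join(argv))
```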