1T MoE Kimi-K2.5 CPU Inference: Thread Optimization Through Long Context Operations
Complete CPU inference benchmark of Kimi-K2.5 (1.03T MoE, Q4_K_S/Q4_K_M) on EPYC 9175F: why th=13 is the sweet spot, measured long-context performance at 16K-32K, LCP cache effectiveness, and the design path to Dagster batch operations.
Background
Kimi-K2.5 is a 1.03 trillion parameter MoE model by Moonshot AI. It selects 8 out of 384 experts, keeping active parameters at ~32B. Built on the DeepSeek-V2 architecture (MLA: Multi-head Latent Attention), it achieves compressed KV cache for better memory efficiency.
At Q4_K_S quantization, RSS is ~523 GiB. At Q4_K_M, ~579 GiB. Both fit within 768GB DDR5 memory, making GPU-free CPU inference physically possible. The question is whether “physically possible” translates to “practically usable.”
Objective
- Benchmark Q4_K_S CPU inference speed (baseline performance)
- Measure thread count vs throughput relationship, identify optimal thread count
- Measure Q4_K_M Prefill/Decode speed at 32K context
- Validate Prompt Cache (LCP similarity) effectiveness
- Determine viability for Dagster pipeline batch operations
Test Environment
| Item | Specification |
|---|---|
| CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| Memory | DDR5-6400 768GB (12ch) |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB |
| OS | Ubuntu 24.04 LTS |
| Runtime | llama.cpp server (Podman rootless) |
Model Specifications
| Item | Q4_K_S | Q4_K_M |
|---|---|---|
| Architecture | deepseek2 (MoE + MLA) | Same |
| Total Parameters | 1.03T | Same |
| Layers | 61 | Same |
| Experts | 384 (8 active) | Same |
| Quantization | Q4_K_S | Q4_K_M (4.84 bpw) |
| Model Size | ~520 GiB (RSS) | 578.57 GiB |
| Training Context | 262,144 | Same |
Methodology
Benchmark Command (llama-sweep-bench)
```shell
MODEL=/models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf
IMG=compute.home.arpa/ik_llama-cuda:latest

podman run --rm -it \
  --device nvidia.com/gpu=all \
  --shm-size 16g \
  --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:ro,Z \
  $IMG \
  /app/llama-sweep-bench \
  --model "$MODEL" \
  --no-mmap --merge-qkv \
  -mla 3 -amb 512 \
  -b 4096 -ub 4096 \
  -ctk f16 -ctv f16 \
  -c 131072 \
  -ngl 999 -ot exps=CPU \
  --threads 13 \
  --threads-batch 26 \
  --warmup-batch \
  -n 128
```
This command performs:
- `llama-sweep-bench`: dedicated benchmarking tool
- `-ngl 999 -ot exps=CPU`: full GPU offload with Expert weights kept on CPU
- `-c 131072`: 131K context support
- `-ctk f16 -ctv f16`: KV cache in f16 precision
- `--threads 13 --threads-batch 26`: thread configuration
- `-mla 3 -amb 512`: MLA (Multi-head Latent Attention) parameters
Memory Layout (Q4_K_S)
| Region | Size |
|---|---|
| KV cache (K) | 1,098 MiB |
| KV cache (V) | 976 MiB |
| CPU compute buffer | 348 MiB |
| Total RSS | ~523 GiB / 755 GiB |
| Swap usage | 799 MiB (no si/so activity) |
Memory Layout (Q4_K_M / ctx=32K)
| Region | Size |
|---|---|
| KV cache | 4,148 MiB (K: 2,196 / V: 1,952) |
| CPU compute buffer | 348 MiB |
| Model buffers | 578.57 GiB (13-split GGUF) |
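The KV-cache figures in this table can be reproduced from the MLA dimensions. A minimal sketch, assuming DeepSeek-V2-style MLA with a 512-dim compressed latent (`kv_lora_rank`) plus a 64-dim RoPE component per token per layer; these dimensions are taken from the deepseek2 architecture, not measured here:

```python
# Estimate MLA KV-cache size for ctx=32K at f16 precision.
# Assumption (hedged, from the deepseek2/MLA design): the K cache stores the
# 512-dim compressed latent plus a 64-dim RoPE part per token per layer;
# the V cache stores only the 512-dim latent.
LAYERS = 61
CTX = 32_768
F16_BYTES = 2

k_dims = 512 + 64   # latent + RoPE component
v_dims = 512        # latent only

k_mib = LAYERS * CTX * k_dims * F16_BYTES / 2**20
v_mib = LAYERS * CTX * v_dims * F16_BYTES / 2**20

print(f"K cache: {k_mib:.0f} MiB, V cache: {v_mib:.0f} MiB")
# Matches the table exactly: 2,196 MiB (K) and 1,952 MiB (V)
```

The same arithmetic at ctx=16K gives exactly half these values, matching the Q4_K_S memory layout above.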
CPU Inference Live Demonstration
To verify that CPU inference is actually working on Kimi-K2.5, we recorded real-time execution footage on the EPYC 9175F. By watching token generation stream through the inference server's stdout, you can see tokens being produced as they are computed and judge the actual output quality—tangible proof that this large model runs on CPU alone.
The video demonstrates:
- llama.cpp server startup and model loading (quantized weights)
- Prefill phase (prompt evaluation) token generation speed and content
- Token-by-token Generate phase output
- Verification of actual generated text quality
Results
Q4_K_S Baseline (th=14, ctx=16K)
| Request | Prompt(tok) | PP(tok/s) | Gen(tok) | TG(tok/s) | Total(s) |
|---|---|---|---|---|---|
| 1st (no cache) | 823 | 22.24 | 438 | 10.27 | 79.7 |
| 2nd (cache saved) | 1,335 | 19.98 | 1,012 | 8.76 | 115.6 |
| 3rd (LCP hit) | - | - | - | - | cache lookup 62ms |
Thread Optimization (ctx=8K)
| Threads | PP(tok/s) | TG(tok/s) | Assessment |
|---|---|---|---|
| 16 | 24.43 | 12.94 | Maximum output (baseline) |
| 14 | 21.32 | 12.50 | Bandwidth saturation onset |
| 13 | 21.58 | 11.67 | Sweet spot |
| 12 | 14.58 | 11.86 | Resource efficiency focus |
Q4_K_M Long Context (th=13, ctx=32K)
| Request | Prompt(tok) | PP(tok/s) | Gen(tok) | TG(tok/s) | Notes |
|---|---|---|---|---|---|
| 1st | 16,148 | 6.15 | 333 | 2.44 | Full 16K prefill, ~44 min |
| 2nd (LCP 0.978) | 356 | 3.40 | 2,048 | 2.26 | Cache hit, diff-only prefill |
| 3rd (LCP 0.999) | 12 | 3.11 | 1,024 | 2.15 | Near-full cache restore |
| 4th (LCP 0.939) | 1,050 | 3.21 | 1,024 | 2.07 | Partial cache + diff prefill |
Prompt Cache Effectiveness (Q4_K_S)
| State | Size | Effect |
|---|---|---|
| 1,260 tokens saved | 159.5 MiB | LCP similarity > 0.5 triggers hit |
| Cache restore | - | Tens of ms (62ms measured) |
| TTFT reduction | - | Dramatic prompt eval time reduction on repeat |
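The LCP hit behavior can be sketched as follows. This is an illustrative model of longest-common-prefix matching, not the actual llama.cpp implementation; the 0.5 threshold is the value cited in the table above:

```python
# Illustrative sketch of longest-common-prefix (LCP) prompt-cache matching.
def lcp_len(cached, prompt):
    """Length of the shared token prefix between cache and new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

def plan_prefill(cached, prompt, threshold=0.5):
    """Return (similarity, tokens_to_prefill) for a new request."""
    n = lcp_len(cached, prompt)
    sim = n / max(len(prompt), 1)
    if sim > threshold:
        return sim, len(prompt) - n   # diff-only prefill
    return sim, len(prompt)           # below threshold: full prefill

# Example: a 16K cached conversation where only the tail changed.
cached = list(range(16_000))
prompt = cached[:15_650] + [9_990_000 + i for i in range(500)]
sim, todo = plan_prefill(cached, prompt)
print(f"similarity={sim:.3f}, tokens to prefill={todo}")
```

This is why high-similarity requests (LCP 0.978, 0.999 in the Q4_K_M table) only prefill a few hundred tokens instead of the full 16K.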
Addendum: Improvements with ik_llama.cpp (Expert CPU + Attention GPU Hybrid)
Using the optimized ik_llama.cpp build with Expert weights on CPU and Attention layers on GPU (-ngl 999 -ot exps=CPU), we collected the following performance data:
Execution Command
```shell
podman run --rm -it --device nvidia.com/gpu=all \
  -p 8081:8080 \
  --shm-size 32g \
  --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:ro,Z \
  $IMG \
  --host 0.0.0.0 --port 8080 \
  -m "$MODEL" --no-mmap --jinja \
  -c 131072 \
  -n 128 \
  --threads 13 --threads-batch 26 \
  -b 2048 -ub 512 \
  -ngl 999 -ot exps=CPU \
  -ctk f16 -ctv f16 \
  --merge-qkv -mla 3 -amb 512
```
`-ngl 999 -ot exps=CPU`: offloads attention layers to GPU while Expert weights remain on CPU (hybrid configuration)
Benchmark Results (Initial)
| Task | PP(tok) | TG(tok) | N_KV(tok) | T_PP(s) | S_PP(t/s) | T_TG(s) | S_TG(t/s) |
|---|---|---|---|---|---|---|---|
| 0 | 5,264 | 744 | 6,007 | 59.596 | 88.33 | 37.815 | 19.67 |
| 747 | 765 | 259 | 6,287 | 13.277 | 57.62 | 13.164 | 19.68 |
| 1,007 | 279 | 1,024 | 7,331 | 6.243 | 44.69 | 52.452 | 19.52 |
| 2,032 | 1,037 | 1,024 | 8,368 | 16.772 | 61.83 | 51.793 | 19.77 |
| 3,057 | 1,041 | 310 | 8,695 | 16.637 | 62.57 | 16.124 | 19.23 |
| Avg | - | - | - | - | 63.0 | - | 19.6 |
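The Avg row follows directly from the per-task numbers; a quick check using only values from the table:

```python
# Verify the averaged prefill/generation speeds from the initial benchmark table.
s_pp = [88.33, 57.62, 44.69, 61.83, 62.57]
s_tg = [19.67, 19.68, 19.52, 19.77, 19.23]

avg_pp = sum(s_pp) / len(s_pp)
avg_tg = sum(s_tg) / len(s_tg)
print(f"S_PP avg: {avg_pp:.1f} t/s, S_TG avg: {avg_tg:.1f} t/s")
# → 63.0 and 19.6, matching the Avg row
```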
Subsequent Measurements
After applying Prompt Cache (LCP) and template optimizations:
| Run | PP(tok) | TG(tok) | N_KV(tok) | T_PP(s) | S_PP(t/s) | T_TG(s) | S_TG(t/s) | Notes |
|---|---|---|---|---|---|---|---|---|
| 1 | 5,330 | 401 | 5,730 | 41.298 | 129.06 | 20.458 | 19.60 | Fresh request |
| 2 | 416 | 2,241 | 7,986 | 8.363 | 49.75 | 114.552 | 19.56 | Cache partial miss |
| 3 | 2,255 | 919 | 8,919 | 20.631 | 109.30 | 48.056 | 19.12 | Cache partial miss |
Metric Explanations
| Metric | Meaning | Example |
|---|---|---|
| PP (Prompt eval tokens) | Tokens evaluated during prefill phase | Prompt length + cache diff |
| TG (eval tokens) | Tokens evaluated during generation phase | Output token count |
| N_KV | Total KV cache tokens at decode completion | Accumulated cache depth |
| T_PP(s) | Prefill duration (seconds) | Input processing time |
| S_PP(t/s) | Prefill speed (tokens/sec) | PP / T_PP |
| T_TG(s) | Generation duration (seconds) | Token-by-token generation time |
| S_TG(t/s) | Generation speed (tokens/sec) | TG / T_TG |
Assessment (Improvements / Limitations)
Improvements:
- Prefill tolerance enhanced: Runs 1 and 3 achieve 100–130 t/s prefill speed. Long prompts process quickly
- Generation speed stabilized: S_TG hovers around 19 t/s across all runs. Blackwell + Q4_K_S hits a hard ceiling
- Cache effectiveness: Partial cache hits in Runs 2–3 maintain S_PP at 50–110 t/s
Current constraints:
- Generation bottleneck: S_TG fixed at ~19 t/s. Perceived latency dominated by prefill time and output token count
- Cache consistency: Runs 2–3 show “Common part does not match fully” warnings. Even minor changes to System Prompt or templates (whitespace, timestamps) fragment the cache
Next Steps (Priority Order)
1. Maximize cache efficiency ⭐ Highest impact
- Completely fix System Prompt, tool declarations, and templates
- Remove dynamic strings (timestamps, random IDs, session markers)
- Systemize OpenWebUI dynamic injection to unify prompt structure
- Effect: Runs 2–3 prefill speed approaches ideal (100+ t/s)
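A hedged sketch of the "remove dynamic strings" step. The regexes below are illustrative patterns for typical dynamic content (timestamps, UUID session markers), not taken from OpenWebUI or any specific template:

```python
import re

# Normalize a system prompt so repeated requests stay byte-identical and the
# LCP cache keeps hitting. Patterns are illustrative, not from a real template.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?")
UUID = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")

def stabilize(system_prompt: str) -> str:
    s = TIMESTAMP.sub("<TS>", system_prompt)
    s = UUID.sub("<ID>", s)
    return s.strip() + "\n"   # fixed trailing whitespace

a = stabilize("Current time: 2025-01-02 10:30  session=a1b2c3d4-0000-1111-2222-333344445555")
b = stabilize("Current time: 2025-01-03 09:15  session=deadbeef-0000-1111-2222-333344445555")
assert a == b   # byte-identical prefix → cache hit on the second request
```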
2. Ensure parameter alignment
- Match the server's `-n` (max generation tokens) with OpenWebUI's `max_tokens`
- Eliminate mismatches like `params.n_predict=2048 slot.n_predict=128` observed in earlier logs
- Remove wasted buffering and computation
3. Structural generation improvements (requires alternative approach)
- Kimi-K2.5 Q4_K_S architecture makes S_TG improvement difficult
- Next candidates:
- Quantization change: Q4_K_S → IQ4_XS/IQ3_M (ik_llama.cpp recommended)
- Model size: Switch to lighter MoE model
- Architecture: Select model with MoE activation patterns optimized for GPU
Build Verification (SM_120 Support)
To confirm Blackwell (SM 120) build is in use:
- Quick check: if `compiled for: 520` no longer appears in the server startup logs, the new build is active
- Thorough check: review the build log (cmake configure phase) for `CMAKE_CUDA_ARCHITECTURES=120`
Analysis
Memory Bandwidth and Why th=13
Decode speed saturates around th=13-14. Twelve-channel DDR5-6400 offers a theoretical ~614 GB/s, but the random access pattern over MoE expert weights cannot fully utilize it. th=16 yields 12.94 tok/s and th=13 yields 11.67 tok/s: giving up 3 threads costs only ~10% of throughput. That is the rationale for th=13—the freed cores host Dagster/Trino and other data-pipeline processes while ~90% of inference speed is retained. On EPYC parts with more cores, others have observed throughput continuing to scale with thread count; treat this ratio as the yield balance for this particular setup.
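The saturation point is consistent with a simple bandwidth roofline. A back-of-the-envelope sketch, assuming ~32B active parameters at roughly 4.5 bits/weight (both figures hedged from the model specs, with routing and KV-cache traffic ignored):

```python
# Roofline estimate for decode speed: each generated token must stream the
# active expert weights from RAM once. Assumptions: ~32B active params,
# ~4.5 bpw (Q4_K_S-class), 12-channel DDR5-6400 peak ~614 GB/s.
active_params = 32e9
bits_per_weight = 4.5
peak_bw = 614e9                                          # bytes/s

bytes_per_token = active_params * bits_per_weight / 8    # 18 GB per token
ideal_tps = peak_bw / bytes_per_token
print(f"{bytes_per_token / 1e9:.0f} GB/token -> ideal {ideal_tps:.0f} tok/s")

measured = 12.94                                         # th=16, ctx=8K
print(f"bandwidth efficiency: {measured / ideal_tps:.0%}")
```

The measured 11-13 tok/s sits well below the ~34 tok/s roofline, which is expected: random expert selection defeats prefetching, so adding threads past the saturation point buys little.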
Long Context Reality Check
16K bulk prefill taking 44 minutes suggests that “fill 256K from scratch every time” is unrealistic. At 20 tok/s, 256K would take ~3.5 hours.
The practical solution:
- Limit ctx to 16K-32K
- Prefill a System Digest (8K-16K) once at startup
- Use Prompt Cache (LCP similarity) for diff-only processing on subsequent requests
- Keep output length at 1K default, 2K only when necessary
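The arithmetic behind these limits, using the measured prefill rates from the tables above:

```python
# Prefill wall-clock estimates at measured CPU rates.
def prefill_minutes(tokens, tok_per_s):
    return tokens / tok_per_s / 60

# Q4_K_M, ctx=32K: 16,148 prompt tokens at the measured 6.15 tok/s
print(f"16K prefill: {prefill_minutes(16_148, 6.15):.0f} min")      # ~44 min

# Hypothetical full 256K prefill, even at an optimistic 20 tok/s
print(f"256K prefill: {prefill_minutes(262_144, 20) / 60:.1f} h")   # ~3.6 h
```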
Decode 2.4 tok/s vs 10 tok/s
Q4_K_S at ctx=16K delivers 10 tok/s. Q4_K_M at ctx=32K delivers 2.4 tok/s. The gap is context-length driven—the 4.1GB KV cache attention computation becomes the bottleneck at 32K. Unsuitable for interactive chat, but batch processing tolerates the wait.
Lessons Learned
Running a 1T model on CPU is technically validated. Decode at 10 tok/s is insufficient for interactive use but fully practical for Dagster pipeline batch generation, dataset augmentation, and distillation teacher generation.
Operational conclusion: run interactive/fast inference on the GPU side (RTX PRO 6000, vLLM), while CPU llama.cpp runs at th=13 as a resident batch intelligence engine. Of 768GB memory, 523 GiB goes to the model, leaving 200GB+ for DataFrame operations and Trino queries in parallel.
Reproduction Steps
1. Download Models
```shell
# Q4_K_S
huggingface-cli download unsloth/Kimi-K2.5-GGUF \
  --include "Q4_K_S/*" \
  --local-dir /mnt/data/hf/hub/models--Kimi-K2.5-GGUF

# Q4_K_M
huggingface-cli download unsloth/Kimi-K2.5-GGUF \
  --include "Q4_K_M/*" \
  --local-dir /mnt/data/hf/hub/models--Kimi-K2.5-GGUF
```
2. Run
See commands in “Methodology” section. Requires llama.cpp with flash-attn and prompt cache support.
3. Measure
```shell
curl -s http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"kimi","messages":[{"role":"user","content":"Explain MoE architecture"}],"max_tokens":512}'
```
Extract prompt eval time and eval time from server logs.
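The timing lines can be pulled out with a small parser. The sample line format below is an assumption based on common llama.cpp timing output and varies between builds, so treat the regex as a starting point to adapt:

```python
import re

# Parse llama.cpp-style timing summaries from server logs.
# Log format is an assumption (hedged); adjust the pattern to your build.
TIMING = re.compile(
    r"(?P<phase>prompt eval|eval) time\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*(?P<tok>\d+)"
)

def parse_timings(log_text):
    out = {}
    for m in TIMING.finditer(log_text):
        ms, tok = float(m["ms"]), int(m["tok"])
        out[m["phase"]] = {"ms": ms, "tokens": tok, "tok_per_s": tok / (ms / 1000)}
    return out

# Sample numbers reconstructed from the Q4_K_S baseline table (823 tok prompt
# at 22.24 t/s, 438 tok generated at 10.27 t/s).
sample = """
prompt eval time =   37005.4 ms /   823 tokens
       eval time =   42648.0 ms /   438 tokens
"""
t = parse_timings(sample)
print(t["prompt eval"]["tok_per_s"], t["eval"]["tok_per_s"])
```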
Technical Notes
Prompt Cache Design Principles
- System Digest must be byte-identical across requests (whitespace/date changes break cache)
- RAG context goes in user messages, not system (cache preservation is priority)
Q4_K_S vs Q4_K_M
Q4_K_M is ~60GB larger (520→579 GiB). Quality is marginally better, but speed difference is minimal. For ctx=16K batch operations, Q4_K_S is sufficient.
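The 4.84 bpw figure can be sanity-checked from the file size and parameter count in the spec table:

```python
# Effective bits-per-weight from on-disk size: Q4_K_M is 578.57 GiB for a
# 1.03T-parameter model (both values from the spec table above).
size_bits = 578.57 * 2**30 * 8
params = 1.03e12
bpw = size_bits / params
print(f"{bpw:.2f} bpw")
```

This lands near the quoted 4.84 bpw; the small gap comes from rounding in the 1.03T parameter figure and from tensors that are not quantized to 4 bits.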
Recommended Parameters (Batch Operations)
| Parameter | Recommended | Rationale |
|---|---|---|
| ctx | 16,384-32,768 | 256K is impractical |
| threads | 13 | Memory bandwidth saturation, frees 3 cores for pipeline |
| ubatch | 256 | More stable than 512 |
| cache-ram | 32,768 MiB | Stabilizes LCP hit rate |
| output | 1,024 | Generation speed is the bottleneck, keep output short |
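The table above can be turned into a launch configuration. The flag names mirror the commands shown earlier in this article; the wrapper itself is hypothetical, not part of llama.cpp:

```python
# Build a llama.cpp server argument list from the recommended batch-operation
# parameters. Flag names follow the commands used earlier in this article;
# this helper is illustrative only.
RECOMMENDED = {
    "-c": 32_768,      # context: stay in the 16K-32K band, 256K is impractical
    "--threads": 13,   # bandwidth saturation point, frees 3 cores for pipelines
    "-ub": 256,        # micro-batch: more stable than 512
    "-n": 1_024,       # keep output short; generation is the bottleneck
}

def build_args(model_path, overrides=None):
    params = {**RECOMMENDED, **(overrides or {})}
    args = ["llama-server", "-m", model_path]
    for flag, value in params.items():
        args += [flag, str(value)]
    return args

argv = build_args("/models/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf")
print(" ".join(argv))
```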