1T MoE Kimi-K2.5 CPU Inference: Thread Optimization Through Long Context Operations

Complete CPU inference benchmark of Kimi-K2.5 (1.03T MoE, Q4_K_S/Q4_K_M) on EPYC 9175F. Why th=13 is the sweet spot, 16K long context measured performance, LCP cache effectiveness, and the design path to Dagster batch operations.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

Introduction

I was not trying to prove that a 1T-class model can technically boot on a local machine. The real question was whether it could serve as a useful part of a batch-oriented production workflow. My target was Dagster pipelines, asynchronous generation, dataset expansion, distillation, and local fallback inference. Interactive chat latency was not the primary goal.

With that framing, I evaluated Kimi-K2.5 Q4_K_S on llama.cpp server running on an AMD EPYC 9175F with 768GB of DDR5-6400 memory. The result was better than a simple “it works.” The platform reached a stable operating point where CPU-only inference remained practical, as long as I treated memory bandwidth and prompt-cache reuse as first-class design constraints.

Background

Kimi-K2.5 is a 1.03-trillion-parameter MoE model from Moonshot AI. It selects 8 of 384 experts per token, keeping active parameters at ~32B. It uses a DeepSeek-style architecture (llama.cpp loads it as deepseek2) with MLA (Multi-head Latent Attention), which compresses the KV cache for better memory efficiency.

At Q4_K_S quantization, RSS is ~523 GiB. At Q4_K_M, ~579 GiB. Both fit within 768GB DDR5 memory, making GPU-free CPU inference physically possible. The question is whether “physically possible” translates to “practically usable.”

The intended workload was:

Dagster pipelines
asynchronous batch generation
dataset expansion
distillation teacher generation
support or validation for a GPU-based agent stack
a local fallback LLM

That workload tolerates seconds of latency where an interactive assistant would not. So the benchmark had to answer a different question: where is the operational sweet spot, and how much reusable context can I carry before the system stops being practical?

Objective

Benchmark Q4_K_S CPU inference speed (baseline performance)
Measure thread count vs throughput relationship, identify optimal thread count
Measure Q4_K_M Prefill/Decode speed at 32K context
Validate Prompt Cache (LCP similarity) effectiveness
Determine viability for Dagster pipeline batch operations

Test Environment

Item	Specification
CPU	AMD EPYC 9175F (Zen 5, 16C, L3 512MB)
Memory	DDR5-6400 768GB (12ch)
GPU	NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB
OS	Ubuntu 24.04 LTS
Runtime	llama.cpp server (Podman rootless)

Model Specifications

Item	Q4_K_S	Q4_K_M
Architecture	deepseek2 (MoE + MLA)	Same
Total Parameters	1.03T	Same
Layers	61	Same
Experts	384 (8 active)	Same
Quantization	Q4_K_S	Q4_K_M (4.84 bpw)
Model Size	~523 GiB (RSS)	578.57 GiB
Training Context	262,144	Same

Methodology

Benchmark Command (llama-sweep-bench)

MODEL=/models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf
IMG=compute.home.arpa/ik_llama-cuda:latest
podman run --rm -it \
  --device nvidia.com/gpu=all \
  --shm-size 16g \
  --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:ro,Z \
  $IMG \
  /app/llama-sweep-bench \
    --model "$MODEL" \
    --no-mmap --merge-qkv \
    -mla 3 -amb 512 \
    -b 4096 -ub 4096 \
    -ctk f16 -ctv f16 \
    -c 131072 \
    -ngl 999 -ot exps=CPU \
    --threads 13 \
    --threads-batch 26 \
    --warmup-batch \
    -n 128

The key options in this command:

llama-sweep-bench: Dedicated benchmarking tool
-ngl 999 -ot exps=CPU: Offload all layers to the GPU, then override the expert weights back to CPU
-c 131072: 128K-token context (131,072 tokens)
-ctk f16 -ctv f16: KV cache in f16 precision
--threads 13 --threads-batch 26: 13 threads for decode, 26 threads for batch (prefill) processing
-mla 3 -amb 512: MLA (Multi-head Latent Attention) parameters

Thread-Scaling Launch Commands

The benchmark varied thread counts while keeping other parameters fixed. Below are representative CPU-only launch commands.

th=16 (Maximum Performance)

  podman run --rm -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:Z \
  compute.home.arpa/llamacpp-zen5:latest \
  -m /models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
  --ctx-size 8192 --parallel 1 --threads 16 --threads-batch 16 \
  --batch-size 2048 --ubatch-size 512 --jinja --host 0.0.0.0 --port 8080

th=13 (Operational Sweet Spot)

  podman run --rm -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:Z \
  compute.home.arpa/llamacpp-zen5:latest \
  -m /models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
  --ctx-size 8192 --parallel 1 --threads 13 --threads-batch 13 \
  --batch-size 2048 --ubatch-size 512 --jinja --host 0.0.0.0 --port 8080

Memory Layout (Q4_K_S)

Region	Size
KV cache (K)	1,098 MiB
KV cache (V)	976 MiB
CPU compute buffer	348 MiB
Total RSS	~523 GiB / 755 GiB
Swap usage	799 MiB (no si/so activity)

Memory Layout (Q4_K_M / ctx=32K)

Region	Size
KV cache	4,148 MiB (K: 2,196 / V: 1,952)
CPU compute buffer	348 MiB
CPU repack buffer	459,665 MiB
Model buffers	578.57 GiB (13-split GGUF)

A 768GB RAM machine can hold a 16k context workload comfortably. At this scale, the dominant cost is the model itself, not the KV cache.
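To put rough numbers on that claim, the sketch below extrapolates KV cache size from the measured ctx=32K row. It assumes KV memory grows linearly with the allocated context, which is an assumption of this sketch rather than something measured here, and the function name is only illustrative.

```python
# Rough KV-cache extrapolation from the measured Q4_K_M / ctx=32K row.
# Assumption: KV memory scales linearly with the allocated context size.

MEASURED_CTX = 32_768      # tokens, from the ctx=32K table
MEASURED_KV_MIB = 4_148    # MiB, K + V combined at that context
MODEL_GIB = 578.57         # Q4_K_M model buffers


def kv_cache_mib(ctx_tokens: int) -> float:
    """Estimate KV cache size in MiB for a given context length."""
    return MEASURED_KV_MIB * ctx_tokens / MEASURED_CTX


for ctx in (16_384, 32_768, 131_072, 262_144):
    kv_gib = kv_cache_mib(ctx) / 1024
    share = kv_gib / (kv_gib + MODEL_GIB)
    print(f"ctx={ctx:>7}: KV ~{kv_gib:6.2f} GiB ({share:.1%} of model + KV)")
```

At 16K the estimate comes out to 2,074 MiB, the same as the 1,098 + 976 MiB listed in the Q4_K_S table, so the linear assumption looks reasonable at these sizes.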

CPU Inference Live Demonstration

To verify CPU inference output quality, I recorded real-time execution on the EPYC 9175F.

Video link: https://www.youtube.com/watch?v=n8htU2pmzNI

The video demonstrates:

llama.cpp server startup and model loading (quantized weights)
Prefill phase prompt-processing speed
Token-by-token Generate phase output
Verification of actual generated text quality

Results

Q4_K_S Baseline (th=14, ctx=16K)

Request	Prompt(tok)	PP(tok/s)	Gen(tok)	TG(tok/s)	Total(s)
1st (no cache)	823	22.24	438	10.27	79.7
2nd (cache saved)	1,335	19.98	1,012	8.76	115.6
3rd (LCP hit)	-	-	-	-	cache lookup 62ms

That first request is heavy. It is not the profile for a chat-first UX. But for a batch job, it is still within a usable range. Getting one substantial generation back in roughly eighty seconds is entirely reasonable for overnight or background batch work.
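The "roughly eighty seconds" figure can be reproduced from the table with a simple two-phase latency model. This is a minimal sketch that assumes total time is just prefill plus decode, ignoring queueing, sampling, and network overhead; the function name is hypothetical.

```python
def request_seconds(prompt_tok: int, pp_tps: float, gen_tok: int, tg_tps: float) -> float:
    """Prefill time plus decode time, ignoring queueing and sampling overhead."""
    return prompt_tok / pp_tps + gen_tok / tg_tps


# 1st request of the Q4_K_S baseline: 823 prompt tokens, 438 generated tokens.
first = request_seconds(823, 22.24, 438, 10.27)
print(f"1st request: {first:.1f} s")
```

The result lands on the 79.7 s shown in the Total column, which suggests the simple model is adequate for budgeting batch jobs.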

The third request’s LCP similarity hit, reducing cache lookup to 62ms, was one of the most important results. Once the runtime can reuse a prior prompt state instead of pre-filling everything from scratch, 1T-class local CPU inference becomes much more compatible with recurring jobs.

Thread Optimization (ctx=8K)

Threads	PP(tok/s)	TG(tok/s)	Assessment
16	24.43	12.94	Maximum output (baseline)
14	21.32	12.50	Bandwidth saturation onset
13	21.58	11.67	Sweet spot
12	14.58	11.86	Resource efficiency focus

Decode stops improving much once I reach the th=13 to th=14 range. That strongly suggests I am hitting the memory-bandwidth wall rather than a compute wall. At that point, spending more cores on inference buys very little, while those same cores remain useful for orchestration and data tasks.
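A quick way to read the thread table is to normalize decode speed against th=16 and look at what each thread contributes. The sketch below only re-derives ratios from the measured values; the variable names are illustrative.

```python
# Decode (TG) throughput by thread count, copied from the ctx=8K table.
tg_by_threads = {16: 12.94, 14: 12.50, 13: 11.67, 12: 11.86}

baseline = tg_by_threads[16]
for threads, tg in sorted(tg_by_threads.items(), reverse=True):
    relative = tg / baseline      # fraction of th=16 decode speed
    per_thread = tg / threads     # tok/s contributed per thread
    freed = 16 - threads          # cores left for other services
    print(f"th={threads}: {relative:.1%} of max, "
          f"{per_thread:.3f} tok/s per thread, {freed} cores free")
```

Note that th=12 decodes slightly faster than th=13 in this table, so decode differences of this size are probably within run-to-run noise. The clearer difference is in prefill, where th=12 drops to 14.58 tok/s.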


Q4_K_M Long Context (th=13, ctx=32K)

Request	Prompt(tok)	PP(tok/s)	Gen(tok)	TG(tok/s)	Notes
1st	16,148	6.15	333	2.44	Full 16K prefill, ~44 min
2nd (LCP 0.978)	356	3.40	2,048	2.26	Cache hit, diff-only prefill
3rd (LCP 0.999)	12	3.11	1,024	2.15	Near-full cache restore
4th (LCP 0.939)	1,050	3.21	1,024	2.07	Partial cache + diff prefill

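A 16K bulk prefill taking about 44 minutes (16,148 tokens at 6.15 tok/s) shows that “fill 256K from scratch every time” is unrealistic. Even at an optimistic 20 tok/s, 256K tokens would take about 3.6 hours; at the measured 6.15 tok/s it would be closer to 12 hours.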

However, prompt cache (LCP similarity) changes the economics dramatically. Later requests only prefill 356, 12, and 1,050 token deltas. A stable digest design translates directly into reusable prefixes.
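The size of that change can be estimated directly from the table. The sketch below compares the full prefill with the delta-only prefills, assuming that without the cache each later request would have paid roughly the same full-prefill cost as the first one (an assumption, since that case was not measured).

```python
# (request label, delta tokens actually prefilled, measured PP tok/s)
cached_requests = [("2nd", 356, 3.40), ("3rd", 12, 3.11), ("4th", 1_050, 3.21)]

FULL_PROMPT_TOK = 16_148   # 1st request, no cache
FULL_PP_TPS = 6.15

full_minutes = FULL_PROMPT_TOK / FULL_PP_TPS / 60
print(f"full prefill: {full_minutes:.1f} min")

for label, delta_tok, pp_tps in cached_requests:
    delta_minutes = delta_tok / pp_tps / 60
    saved = full_minutes - delta_minutes
    print(f"{label}: {delta_minutes:.1f} min prefill, ~{saved:.1f} min saved vs. full")
```

Even the largest delta (1,050 tokens) costs a few minutes of prefill instead of the better part of an hour.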

The prefill slowdown was also pronounced. Throughput started at 10 tok/s, dropped to 7 tok/s midway, and bottomed out near 4 tok/s. Decode landed at 2.44 tok/s on the first request and 2.07-2.26 tok/s on later ones, so 1K of output takes about 7-8 minutes and 2K about 14-15 minutes. That is too slow for interactive chat, but workable for asynchronous batch tasks.

Prompt Cache Effectiveness (Q4_K_S)

State	Size	Effect
1,260 tokens saved	159.5 MiB	LCP similarity > 0.5 triggers hit
Cache restore	-	Tens of ms (62ms measured)
TTFT reduction	-	Dramatic prompt eval time reduction on repeat
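The idea behind longest-common-prefix (LCP) matching can be sketched in a few lines. This is an illustration of the concept only: the token ids are made up, and dividing by the incoming prompt length is an assumed normalization, not a statement of the exact formula llama.cpp uses.

```python
from typing import Sequence


def lcp_length(cached: Sequence[int], incoming: Sequence[int]) -> int:
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


def lcp_similarity(cached: Sequence[int], incoming: Sequence[int]) -> float:
    """Shared-prefix length as a fraction of the incoming prompt (assumed normalization)."""
    if not incoming:
        return 0.0
    return lcp_length(cached, incoming) / len(incoming)


cached_prompt = [1, 2, 3, 4, 5, 6, 7, 8]          # hypothetical token ids
new_prompt = [1, 2, 3, 4, 5, 6, 9, 10, 11, 12]    # diverges after 6 tokens

shared = lcp_length(cached_prompt, new_prompt)
print(shared, lcp_similarity(cached_prompt, new_prompt))  # 6 0.6
print(len(new_prompt) - shared)                           # 4 tokens still need prefill
```

The practical consequence is that only the tokens after the first mismatch need to be prefilled, so anything that changes early in the prompt is far more expensive than a change near the end.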

Long context logs show repeated LCP similarity reuse:

selected slot by LCP similarity, sim_best = 0.978 (> 0.100 thold), f_keep = 0.980
selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 0.870
selected slot by LCP similarity, sim_best = 0.939 (> 0.100 thold), f_keep = 0.940

Later requests are not rebuilding the entire fixed prefix. The server reprocesses only the delta, which is exactly why a stable digest design matters.
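For batch monitoring, these log lines are easy to extract automatically. The sketch below parses the three lines quoted above with a regular expression; the pattern is written against exactly that wording and would need adjusting if the server's log format differs between builds.

```python
import re

LOG_LINES = """\
selected slot by LCP similarity, sim_best = 0.978 (> 0.100 thold), f_keep = 0.980
selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 0.870
selected slot by LCP similarity, sim_best = 0.939 (> 0.100 thold), f_keep = 0.940
""".splitlines()

PATTERN = re.compile(
    r"sim_best = (?P<sim>[\d.]+) \(> (?P<thold>[\d.]+) thold\), f_keep = (?P<keep>[\d.]+)"
)


def parse_lcp_hits(lines):
    """Return (sim_best, threshold, f_keep) tuples for every LCP slot-selection line."""
    hits = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            hits.append((float(m["sim"]), float(m["thold"]), float(m["keep"])))
    return hits


hits = parse_lcp_hits(LOG_LINES)
for sim, thold, keep in hits:
    print(f"sim_best={sim:.3f} threshold={thold:.3f} f_keep={keep:.3f}")
print("lowest similarity:", min(sim for sim, _, _ in hits))
```

Tracking the lowest similarity per pipeline run is one way to notice when a prompt template change has started fragmenting the cache.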

Addendum: Improvements with ik_llama.cpp (Expert CPU + Attention GPU Hybrid)

Using the optimized ik_llama.cpp build with Expert weights on CPU and Attention layers on GPU (-ngl 999 -ot exps=CPU):

Execution Command

  podman run --rm -it --device nvidia.com/gpu=all \
  -p 8081:8080 \
  --shm-size 32g \
  --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:ro,Z \
  $IMG \
  --host 0.0.0.0 --port 8080 \
  -m "$MODEL" --no-mmap --jinja \
  -c 131072 \
  -n 128 \
  --threads 13 --threads-batch 26 \
  -b 2048 -ub 512 \
  -ngl 999 -ot exps=CPU \
  -ctk f16 -ctv f16 \
  --merge-qkv -mla 3 -amb 512

Benchmark Results (Initial)

Task	PP(tok)	TG(tok)	N_KV(tok)	T_PP(s)	S_PP(t/s)	T_TG(s)	S_TG(t/s)
0	5,264	744	6,007	59.596	88.33	37.815	19.67
747	765	259	6,287	13.277	57.62	13.164	19.68
1,007	279	1,024	7,331	6.243	44.69	52.452	19.52
2,032	1,037	1,024	8,368	16.772	61.83	51.793	19.77
3,057	1,041	310	8,695	16.637	62.57	16.124	19.23
Avg	-	-	-	-	63.0	-	19.6
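The Avg row can be re-derived from the per-task token counts and timings. The sketch below also computes a token-weighted prefill rate (total tokens over total time), which is an additional view not in the table and weights the long first prompt more heavily.

```python
# (PP tokens, T_PP seconds, TG tokens, T_TG seconds) per task, from the table above.
rows = [
    (5_264, 59.596, 744, 37.815),
    (765, 13.277, 259, 13.164),
    (279, 6.243, 1_024, 52.452),
    (1_037, 16.772, 1_024, 51.793),
    (1_041, 16.637, 310, 16.124),
]

s_pp = [pp / t_pp for pp, t_pp, _, _ in rows]
s_tg = [tg / t_tg for _, _, tg, t_tg in rows]

mean_pp = sum(s_pp) / len(s_pp)   # unweighted mean of per-task rates
mean_tg = sum(s_tg) / len(s_tg)
weighted_pp = sum(r[0] for r in rows) / sum(r[1] for r in rows)  # total tokens / total time

print(f"mean S_PP={mean_pp:.1f}  mean S_TG={mean_tg:.1f}  token-weighted S_PP={weighted_pp:.1f}")
```

The unweighted means match the 63.0 and 19.6 in the Avg row; the token-weighted prefill rate is higher because the 5,264-token prompt ran at the fastest rate.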

Subsequent Measurements

After applying Prompt Cache (LCP) and template optimizations:

Run	PP(tok)	TG(tok)	N_KV(tok)	T_PP(s)	S_PP(t/s)	T_TG(s)	S_TG(t/s)	Notes
1	5,330	401	5,730	41.298	129.06	20.458	19.60	Fresh request
2	416	2,241	7,986	8.363	49.75	114.552	19.56	Cache partial miss
3	2,255	919	8,919	20.631	109.30	48.056	19.12	Cache partial miss

Assessment

Improvements:

Faster prefill: Runs 1 and 3 reach 109-129 t/s, so long prompts are processed quickly
Stable generation speed: S_TG stays between 19.1 and 19.6 t/s across all runs, which looks like a hard ceiling for Blackwell + Q4_K_S in this configuration
Cache effectiveness: With partial cache hits, Runs 2-3 hold S_PP at roughly 50-110 t/s

Current constraints:

Generation bottleneck: S_TG fixed at ~19 t/s. Perceived latency dominated by prefill time and output token count
Cache consistency: Runs 2-3 show “Common part does not match fully” warnings. Even minor changes to System Prompt or templates (whitespace, timestamps) fragment the cache

Analysis

Memory Bandwidth and Why th=13

Decode speed saturates at th=13-14. The 12-channel DDR5-6400 theoretical bandwidth is ~614 GB/s, but MoE random access patterns cannot fully utilize it. th=16 gives 12.94 tok/s, th=13 gives 11.67 tok/s – a 3-thread reduction for only 10% speed loss.
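The ~614 GB/s figure and a crude decode ceiling can be worked out as follows. The bandwidth part is plain arithmetic; the ceiling is a back-of-the-envelope sketch that assumes every generated token streams all ~32B active parameters from RAM once at about 4.5 bits per weight. Both of those are assumptions for illustration, not measurements.

```python
CHANNELS = 12
TRANSFERS_PER_S = 6_400e6    # DDR5-6400: 6400 MT/s per channel
BYTES_PER_TRANSFER = 8       # 64-bit data path per channel

peak_gbs = CHANNELS * TRANSFERS_PER_S * BYTES_PER_TRANSFER / 1e9
print(f"theoretical peak: {peak_gbs:.1f} GB/s")

# Assumed for the ceiling estimate (not measured):
ACTIVE_PARAMS = 32e9         # ~32B active parameters per token
BITS_PER_WEIGHT = 4.5        # rough figure for a 4-bit K-quant
bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8
ceiling_tps = peak_gbs * 1e9 / bytes_per_token

for threads, tg in ((16, 12.94), (13, 11.67)):
    print(f"th={threads}: {tg} tok/s is {tg / ceiling_tps:.0%} "
          f"of the ~{ceiling_tps:.1f} tok/s bandwidth ceiling")
```

Under these assumptions the measured decode rates sit well below the theoretical ceiling, which is consistent with scattered expert reads not reaching peak streaming bandwidth.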

The rationale for th=13 is that the remaining three cores are freed for Dagster, Trino, and other data pipeline processes. About 90% of the th=16 decode speed is retained while those services run alongside the inference server.

Long Context Reality Check

As the 44-minute prefill for 16K tokens shows, filling a 256K context from scratch on every request is unrealistic: even at an optimistic 20 tok/s it would take about 3.6 hours.

The practical solution:

Limit ctx to 16K-32K
Prefill a System Digest (8K-16K) once at startup
Use Prompt Cache (LCP similarity) for diff-only processing on subsequent requests
Keep output length at 1K default, 2K only when necessary
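These limits can be turned into a simple pre-flight check in the pipeline. The sketch below is one hypothetical way to do it; the class name, field names, and default values are illustrative choices taken from the ranges above, not part of llama.cpp or Dagster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    ctx: int = 32_768            # server --ctx-size
    digest_tokens: int = 8_192   # fixed system digest, prefilled once
    max_output: int = 1_024      # default generation cap

    def room_for_user_tokens(self) -> int:
        """Tokens left for per-request user/RAG content."""
        return self.ctx - self.digest_tokens - self.max_output

    def fits(self, user_tokens: int) -> bool:
        """True if a request of this size stays inside the context window."""
        return 0 <= user_tokens <= self.room_for_user_tokens()


default = ContextBudget()
print(default.room_for_user_tokens())   # 23552

tight = ContextBudget(ctx=16_384, digest_tokens=12_288, max_output=2_048)
print(tight.fits(3_000))                # False
```

Rejecting or trimming oversized requests before they reach the server avoids spending minutes of prefill on a job that would overflow the context anyway.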

Decode 2.4 tok/s vs 10 tok/s

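Q4_K_S at ctx=16K delivers about 10 tok/s; Q4_K_M at ctx=32K delivers about 2.4 tok/s. The gap appears to be driven by filled context length rather than quantization: the baseline request carried an 823-token prompt, while the long-context request attended over roughly 16K tokens, and attention over the growing KV cache (4.1 GB allocated at 32K) becomes the bottleneck. The two runs also differ in quantization and thread count, so this is not a controlled comparison. That rate is unsuitable for interactive chat, but batch processing tolerates the wait.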

Why This Configuration Performs Well

MoE inference working this well on CPU is not because “everything fits in cache.” It is because the high-reuse working set probabilistically hits L3 at a high rate. The hot regions that L3 accelerates include:

Router / Gating logic
Projection-adjacent compute
Recent layer weights and intermediate tensors
KV reuse paths

The EPYC 9175F physical characteristics also contribute:

Huge L3 (512MB) with a low core count: 512MB of L3 is shared by only 16 cores, so cross-core contention is nearly absent
Low memory-latency configuration: 12 channels serve 16 cores, which keeps memory-controller queues shallow
Zen 5 BF16 / AVX-512: The physical 512-bit datapath and native BF16 align well with llama.cpp optimizations

Prompt Cache Design Principles

System Digest must be byte-identical across requests (whitespace/date changes break cache)
RAG context goes in user messages, not system (cache preservation is priority)
Keep output style short and controlled (since generation is slow, shorter output directly improves usability)

The aichat .file /path/to/file approach is also effective: keep a fixed digest in the system prompt and inject only necessary documents on the user side for better cache stability.
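The principles above can be sketched as a small request builder. This is a minimal illustration, not the author's pipeline code: the digest text, function names, and document strings are hypothetical, and only the request shape (model name, messages, max_tokens) follows the curl example used elsewhere in this article.

```python
import hashlib
import json

# Hypothetical fixed digest; in practice this would be loaded once from a file.
SYSTEM_DIGEST = "You are a batch summarizer.\nRules: answer in English, be concise.\n"


def digest_fingerprint(digest: str) -> str:
    """SHA-256 of the exact bytes sent as the system prompt."""
    return hashlib.sha256(digest.encode("utf-8")).hexdigest()


def build_payload(digest: str, rag_context: str, question: str, max_tokens: int = 1024) -> str:
    """Keep the system message fixed; put all per-request material in the user message."""
    body = {
        "model": "kimi",
        "messages": [
            {"role": "system", "content": digest},
            {"role": "user", "content": f"{rag_context}\n\n{question}"},
        ],
        "max_tokens": max_tokens,
    }
    return json.dumps(body, ensure_ascii=False)


EXPECTED = digest_fingerprint(SYSTEM_DIGEST)

# A single trailing space is enough to change the prefix the server sees.
assert digest_fingerprint(SYSTEM_DIGEST + " ") != EXPECTED

payload = build_payload(SYSTEM_DIGEST, "Doc A: ...", "Summarize Doc A.")
assert json.loads(payload)["messages"][0]["content"] == SYSTEM_DIGEST
print(EXPECTED[:12], len(payload))
```

Logging the fingerprint with each job makes accidental digest edits visible. A byte-identical digest is necessary but may not be sufficient: if the chat template injects a date or other variable text ahead of the system prompt, the tokenized prefix can still differ.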

Conclusion

Running a 1T model on CPU is technically validated. Decode at 10 tok/s is insufficient for interactive use but fully practical for Dagster pipeline batch generation, dataset augmentation, and distillation teacher generation.

Operational conclusion: run interactive/fast inference on the GPU side (RTX PRO 6000, vLLM), while CPU llama.cpp runs at th=13 as a resident batch intelligence engine. Of 768GB memory, 523 GiB goes to the model, leaving 200GB+ for DataFrame operations and Trino queries in parallel.

The biggest win was not just that the model ran. It was that I could identify the actual operating point: th=13, prompt-cache reuse through LCP similarity, and a bandwidth-aware split between inference work and surrounding data services.

Recommended Parameters (Batch Operations)

Parameter	Recommended	Rationale
ctx	16,384-32,768	256K is impractical
system digest	8k (at most 12k-16k)	Prefill once at startup, then cache
threads	13	Memory bandwidth saturation, frees 3 cores for pipeline
ubatch	256	More stable than 512 (may avoid slowdown)
cache-ram	32,768 MiB	Stabilizes LCP hit rate
output	1,024	Generation speed is the bottleneck, keep output short

Recommended Launch Command

  # ctx-size, ubatch-size, and cache-ram follow the Recommended Parameters table above
  podman run --rm -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:Z \
  compute.home.arpa/llamacpp-zen5:latest \
  -m /models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
  --ctx-size 32768 --parallel 1 --threads 13 --threads-batch 13 \
  --batch-size 2048 --ubatch-size 256 --cache-ram 32768 \
  --jinja --host 0.0.0.0 --port 8080

Reproduction Steps

1. Download Models

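# --cache-dir keeps the hub cache layout that the launch commands mount:
# /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF/snapshots/<revision>/...
# The <revision> directory must match the snapshot hash used in the -m path.

# Q4_K_S
huggingface-cli download unsloth/Kimi-K2.5-GGUF \
  --include "Q4_K_S/*" \
  --cache-dir /mnt/data/hf/hub

# Q4_K_M
huggingface-cli download unsloth/Kimi-K2.5-GGUF \
  --include "Q4_K_M/*" \
  --cache-dir /mnt/data/hf/hub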

2. Run

See commands in “Methodology” section. Requires llama.cpp with flash-attn and prompt cache support.

3. Measure

  curl -s http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"kimi","messages":[{"role":"user","content":"Explain MoE architecture"}],"max_tokens":512}'

Extract prompt eval time and eval time from server logs.
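Pulling those two numbers out of the logs by hand gets tedious across many runs. The sketch below parses timing lines in the usual llama.cpp "prompt eval time / eval time" format. The sample lines are synthetic, constructed from the baseline table values rather than captured from a real run, and the exact log wording may vary between builds.

```python
import re

# Synthetic sample in the usual llama.cpp timing format (values built from the
# Q4_K_S baseline table, not copied from an actual log).
SAMPLE_LOG = """\
prompt eval time =   37005.40 ms /   823 tokens (   44.96 ms per token,    22.24 tokens per second)
       eval time =   42648.49 ms /   438 tokens (   97.37 ms per token,    10.27 tokens per second)
"""

TIMING_RE = re.compile(
    r"(?P<phase>prompt eval|eval) time =\s*(?P<ms>[\d.]+) ms /\s*(?P<tok>\d+) (?:tokens|runs)"
)


def parse_timings(log_text: str) -> dict:
    """Map 'prompt eval' / 'eval' to (tokens, tok/s), keeping the last match of each."""
    out = {}
    for line in log_text.splitlines():
        m = TIMING_RE.search(line)
        if m:
            tokens = int(m["tok"])
            seconds = float(m["ms"]) / 1000.0
            out[m["phase"]] = (tokens, tokens / seconds)
    return out


timings = parse_timings(SAMPLE_LOG)
for phase, (tokens, tps) in timings.items():
    print(f"{phase}: {tokens} tokens at {tps:.2f} tok/s")
```

In a pipeline, the same function could be fed the output of podman logs for the server container after each request.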

Technical Notes

Q4_K_S vs Q4_K_M

Q4_K_M is about 56 GiB larger (~523 GiB vs. ~579 GiB). Quality is marginally better, but the speed difference is minimal. For ctx=16K batch operations, Q4_K_S is sufficient.

Reference Benchmark: Llama-4-Maverick-17B-128E

For calibration, I also kept one comparison table for another MoE model:

Quant	Prefill (tok/s)	Decode (tok/s)	TTFT (1k context)
Q4_K_M	65-68	21-24	12-17s
Q8_0	50-52	15-16	16-20s

This helped reinforce that quantization choice and model architecture materially affect the practical CPU inference envelope.
