Introduction

I was not trying to prove that a 1T-class model can technically boot on a local machine. The real question was whether it could serve as a useful part of a batch-oriented production workflow. My target was Dagster pipelines, asynchronous generation, dataset expansion, distillation, and local fallback inference. Interactive chat latency was not the primary goal.

With that framing, I evaluated Kimi-K2.5 Q4_K_S on llama.cpp server running on an AMD EPYC 9175F with 768GB of DDR5-6400 memory. The result was better than a simple “it works.” The platform reached a stable operating point where CPU-only inference remained practical, as long as I treated memory bandwidth and prompt-cache reuse as first-class design constraints.

Background

Kimi-K2.5 is a 1.03-trillion-parameter MoE model from Moonshot AI. It routes each token to 8 of 384 experts, keeping active parameters at ~32B. It uses a DeepSeek-style architecture with MLA (Multi-head Latent Attention), which llama.cpp loads as deepseek2, and the compressed KV cache improves memory efficiency.

At Q4_K_S quantization, RSS is ~523 GiB. At Q4_K_M, ~579 GiB. Both fit within 768GB DDR5 memory, making GPU-free CPU inference physically possible. The question is whether “physically possible” translates to “practically usable.”

The intended workload was:

  • Dagster pipelines
  • asynchronous batch generation
  • dataset expansion
  • distillation teacher generation
  • support or validation for a GPU-based agent stack
  • a local fallback LLM

That workload tolerates seconds of latency where an interactive assistant would not. So the benchmark had to answer a different question: where is the operational sweet spot, and how much reusable context can I carry before the system stops being practical?

Objective

  1. Benchmark Q4_K_S CPU inference speed (baseline performance)
  2. Measure thread count vs throughput relationship, identify optimal thread count
  3. Measure Q4_K_M Prefill/Decode speed at 32K context
  4. Validate Prompt Cache (LCP similarity) effectiveness
  5. Determine viability for Dagster pipeline batch operations

Test Environment

Item | Specification
CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB)
Memory | DDR5-6400 768GB (12ch)
GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB
OS | Ubuntu 24.04 LTS
Runtime | llama.cpp server (Podman rootless)

Model Specifications

Item | Q4_K_S | Q4_K_M
Architecture | deepseek2 (MoE + MLA) | Same
Total Parameters | 1.03T | Same
Layers | 61 | Same
Experts | 384 (8 active) | Same
Quantization | Q4_K_S | Q4_K_M (4.84 bpw)
Model Size | ~520 GiB (RSS) | 578.57 GiB
Training Context | 262,144 | Same

Methodology

Benchmark Command (llama-sweep-bench)

  MODEL=/models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf
  IMG=compute.home.arpa/ik_llama-cuda:latest
  podman run --rm -it \
    --device nvidia.com/gpu=all \
    --shm-size 16g \
    --cap-add=SYS_NICE \
    -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:ro,Z \
    $IMG \
    /app/llama-sweep-bench \
      --model "$MODEL" \
      --no-mmap --merge-qkv \
      -mla 3 -amb 512 \
      -b 4096 -ub 4096 \
      -ctk f16 -ctv f16 \
      -c 131072 \
      -ngl 999 -ot exps=CPU \
      --threads 13 \
      --threads-batch 26 \
      --warmup-batch \
      -n 128
  

What each part of the command does:

  • llama-sweep-bench: dedicated benchmarking tool
  • -ngl 999 -ot exps=CPU: offload all layers to the GPU, but keep the expert weights on the CPU
  • -c 131072: 128K-token context (131,072 tokens)
  • -ctk f16 -ctv f16: KV cache in f16 precision
  • --threads 13 --threads-batch 26: 13 threads for decode, 26 for batch (prefill) processing
  • -mla 3 -amb 512: MLA (Multi-head Latent Attention) parameters

Thread-Scaling Launch Commands

The benchmark varied thread counts while keeping other parameters fixed. Below are representative CPU-only launch commands.

th=16 (Maximum Performance)

  podman run --rm -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:Z \
  compute.home.arpa/llamacpp-zen5:latest \
  -m /models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
  --ctx-size 8192 --parallel 1 --threads 16 --threads-batch 16 \
  --batch-size 2048 --ubatch-size 512 --jinja --host 0.0.0.0 --port 8080
  

th=13 (Operational Sweet Spot)

  podman run --rm -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:Z \
  compute.home.arpa/llamacpp-zen5:latest \
  -m /models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
  --ctx-size 8192 --parallel 1 --threads 13 --threads-batch 13 \
  --batch-size 2048 --ubatch-size 512 --jinja --host 0.0.0.0 --port 8080
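
To repeat the sweep without hand-editing each command, the launch line can be generated per thread count. The following is a minimal Python sketch that only assembles the argument list from the flags shown above. The helper name build_server_cmd and the THREAD_COUNTS list are illustrative assumptions, and nothing is executed.

```python
import shlex

# Values copied from the launch commands above.
IMAGE = "compute.home.arpa/llamacpp-zen5:latest"
MODEL = (
    "/models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/"
    "Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf"
)
VOLUME = "/mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:Z"

# Thread counts from the results table; adjust as needed.
THREAD_COUNTS = [16, 14, 13, 12]


def build_server_cmd(threads: int, ctx_size: int = 8192) -> list[str]:
    """Return the podman argv for one CPU-only run (hypothetical helper)."""
    return [
        "podman", "run", "--rm", "-p", "8081:8080",
        "--shm-size", "16g", "--cap-add=SYS_NICE",
        "-v", VOLUME, IMAGE,
        "-m", MODEL,
        "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
        "--flash-attn", "on",
        "--ctx-size", str(ctx_size), "--parallel", "1",
        "--threads", str(threads), "--threads-batch", str(threads),
        "--batch-size", "2048", "--ubatch-size", "512",
        "--jinja", "--host", "0.0.0.0", "--port", "8080",
    ]


if __name__ == "__main__":
    # Print one shell-quoted command per thread count.
    for n in THREAD_COUNTS:
        print(shlex.join(build_server_cmd(n)))
```

Building the command as a list rather than a formatted string keeps quoting correct if a path ever contains spaces, and the same list can be handed to subprocess.run if the sweep is automated.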
  

Memory Layout (Q4_K_S)

Region | Size
KV cache (K) | 1,098 MiB
KV cache (V) | 976 MiB
CPU compute buffer | 348 MiB
Total RSS | ~523 GiB / 755 GiB
Swap usage | 799 MiB (no si/so activity)

Memory Layout (Q4_K_M / ctx=32K)

Region | Size
KV cache | 4,148 MiB (K: 2,196 / V: 1,952)
CPU compute buffer | 348 MiB
CPU repack buffer | 459,665 MiB
Model buffers | 578.57 GiB (13-split GGUF)

A 768GB RAM machine can hold a 16k context workload comfortably. At this scale, the dominant cost is the model itself, not the KV cache.
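
To put a number on "the model dominates", the figures from the Q4_K_M table can be combined directly. This is a rough sketch: it assumes the 755 GiB total shown in the Q4_K_S table also applies here, and it leaves the repack buffer out of the sum because the tables do not say whether it is counted on top of the model buffers.

```python
MIB_PER_GIB = 1024

model_gib = 578.57      # Q4_K_M model buffers
kv_mib = 4148           # KV cache at ctx=32K (K: 2,196 + V: 1,952)
compute_mib = 348       # CPU compute buffer
usable_gib = 755        # total shown next to RSS in the Q4_K_S table (assumed to apply here)

kv_gib = kv_mib / MIB_PER_GIB
footprint_gib = model_gib + kv_gib + compute_mib / MIB_PER_GIB
kv_share = kv_gib / footprint_gib
headroom_gib = usable_gib - footprint_gib

print(f"KV cache: {kv_gib:.2f} GiB ({kv_share:.2%} of model + KV + compute)")
print(f"Headroom: {headroom_gib:.0f} GiB")
```

Under these assumptions the KV cache is well under 1% of the footprint even at 32K context, which is why context length shows up as a speed problem here long before it becomes a capacity problem.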

CPU Inference Live Demonstration

To verify CPU inference output quality, I recorded real-time execution on the EPYC 9175F.

Video link: https://www.youtube.com/watch?v=n8htU2pmzNI

The video demonstrates:

  • llama.cpp server startup and model loading (quantized weights)
  • Prefill phase token generation speed and content
  • Token-by-token Generate phase output
  • Verification of actual generated text quality

Results

Q4_K_S Baseline (th=14, ctx=16K)

Request | Prompt(tok) | PP(tok/s) | Gen(tok) | TG(tok/s) | Total(s)
1st (no cache) | 823 | 22.24 | 438 | 10.27 | 79.7
2nd (cache saved) | 1,335 | 19.98 | 1,012 | 8.76 | 115.6
3rd (LCP hit) | - | - | - | - | cache lookup 62ms

That first request is heavy. It is not the profile for a chat-first UX. But for a batch job, it is still within a usable range. Getting one substantial generation back in roughly eighty seconds is entirely reasonable for overnight or background batch work.

The third request was one of the most important results: it hit the prompt cache by LCP similarity, and the cache lookup took only 62ms. Once the runtime can reuse a prior prompt state instead of pre-filling everything from scratch, 1T-class local CPU inference becomes much more compatible with recurring jobs.
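
The baseline rows can be sanity-checked by adding prefill time and decode time. The sketch below assumes total time is simply prompt tokens divided by PP plus generated tokens divided by TG, with no other overhead; expected_seconds is a hypothetical helper name.

```python
def expected_seconds(prompt_tok, pp_tps, gen_tok, tg_tps):
    """Return (prefill_s, decode_s, total_s), ignoring scheduling overhead."""
    prefill_s = prompt_tok / pp_tps
    decode_s = gen_tok / tg_tps
    return prefill_s, decode_s, prefill_s + decode_s


# (prompt tok, PP tok/s, gen tok, TG tok/s, reported total s) from the table above.
rows = {
    "1st": (823, 22.24, 438, 10.27, 79.7),
    "2nd": (1335, 19.98, 1012, 8.76, 115.6),
}

for name, (p, pp, g, tg, reported) in rows.items():
    prefill_s, decode_s, total_s = expected_seconds(p, pp, g, tg)
    print(
        f"{name}: prefill {prefill_s:.1f}s + decode {decode_s:.1f}s "
        f"= {total_s:.1f}s (reported {reported}s)"
    )
```

For the first row the sum lands within about 0.1 s of the reported total. For the second row the reported 115.6 s is close to the decode time alone, so that figure may exclude prefill. That reading comes from the arithmetic only, not from anything the logs state.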

Thread Optimization (ctx=8K)

Threads | PP(tok/s) | TG(tok/s) | Assessment
16 | 24.43 | 12.94 | Maximum output (baseline)
14 | 21.32 | 12.50 | Bandwidth saturation onset
13 | 21.58 | 11.67 | Sweet spot
12 | 14.58 | 11.86 | Resource efficiency focus

Decode stops improving much once I reach the th=13 to th=14 range. That strongly suggests I am hitting the memory-bandwidth wall rather than a compute wall. At that point, spending more cores on inference buys very little, while those same cores remain useful for orchestration and data tasks.

The rationale for th=13: the remaining 3 cores are freed for Dagster/Trino and other data pipeline processes, while decode retains about 90% of the th=16 speed (11.67 vs 12.94 tok/s).
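
The trade-off is easier to see once each row is normalized against th=16. This is a small sketch over the four table rows; the variable names are illustrative, and single-run numbers like these carry some measurement noise.

```python
# (threads, PP tok/s, TG tok/s) from the thread table above.
runs = [(16, 24.43, 12.94), (14, 21.32, 12.50), (13, 21.58, 11.67), (12, 14.58, 11.86)]

base_pp, base_tg = runs[0][1], runs[0][2]
relative_tg = {th: tg / base_tg for th, _, tg in runs}
relative_pp = {th: pp / base_pp for th, pp, _ in runs}

for th, _, tg in runs:
    print(
        f"th={th}: decode {relative_tg[th]:.1%}, prefill {relative_pp[th]:.1%} "
        f"of th=16, {tg / th:.2f} decode tok/s per thread"
    )
```

In this table, decode stays within roughly 10% of the th=16 figure from 12 to 14 threads, while prefill falls off sharply at th=12. On these numbers, the case for 13 threads over 12 rests mostly on prefill rather than decode.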

Q4_K_M Long Context (th=13, ctx=32K)

Request | Prompt(tok) | PP(tok/s) | Gen(tok) | TG(tok/s) | Notes
1st | 16,148 | 6.15 | 333 | 2.44 | Full 16K prefill, ~44 min
2nd (LCP 0.978) | 356 | 3.40 | 2,048 | 2.26 | Cache hit, diff-only prefill
3rd (LCP 0.999) | 12 | 3.11 | 1,024 | 2.15 | Near-full cache restore
4th (LCP 0.939) | 1,050 | 3.21 | 1,024 | 2.07 | Partial cache + diff prefill

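A 16K bulk prefill taking about 44 minutes means that “fill 256K from scratch every time” is unrealistic. Even at an optimistic 20 tok/s, 256K tokens would take about 3.6 hours; at the 6.15 tok/s measured here, it would be closer to 12 hours.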

However, prompt cache (LCP similarity) changes the economics dramatically. Later requests only prefill 356, 12, and 1,050 token deltas. A stable digest design translates directly into reusable prefixes.

The prefill slowdown was also pronounced. Throughput started at 10 tok/s, dropped to 7 tok/s midway, and bottomed out near 4 tok/s. Decode landed at 2.44 tok/s, meaning 1k output takes about 6-7 minutes and 2k takes 13-14 minutes. Too slow for interactive chat, but workable for asynchronous batch tasks.
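
The time estimates in this section are plain division, written out below so they can be rerun with other rates. The helper name hours is illustrative, and the sketch assumes a constant rate, which the slowdown described above shows is optimistic for long prefills.

```python
def hours(tokens: int, tok_per_s: float) -> float:
    """Wall-clock hours to process `tokens` at a constant rate."""
    return tokens / tok_per_s / 3600


FULL_CTX = 262_144  # training context from the model table

print(f"16,148 tok prefill @ 6.15 tok/s: {hours(16_148, 6.15) * 60:.1f} min")
print(f"256K prefill @ 20 tok/s:   {hours(FULL_CTX, 20):.2f} h")
print(f"256K prefill @ 6.15 tok/s: {hours(FULL_CTX, 6.15):.2f} h")
print(f"1,024 tok decode @ 2.44 tok/s: {hours(1024, 2.44) * 60:.1f} min")
print(f"2,048 tok decode @ 2.44 tok/s: {hours(2048, 2.44) * 60:.1f} min")
```

The same function makes the value of the cache concrete: a 356-token delta at the measured 3.40 tok/s is under two minutes of prefill, against roughly 44 minutes for the full 16K prompt.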

Prompt Cache Effectiveness (Q4_K_S)

State | Size | Effect
1,260 tokens saved | 159.5 MiB | LCP similarity > 0.5 triggers hit
Cache restore | - | Tens of ms (62ms measured)
TTFT reduction | - | Dramatic prompt eval time reduction on repeat

Long context logs show repeated LCP similarity reuse:

  • selected slot by LCP similarity, sim_best = 0.978 (> 0.100 thold), f_keep = 0.980
  • selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 0.870
  • selected slot by LCP similarity, sim_best = 0.939 (> 0.100 thold), f_keep = 0.940

Later requests are not rebuilding the entire fixed prefix. The server reprocesses only the delta, which is exactly why a stable digest design matters.
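
The idea behind the sim_best and f_keep values can be sketched with a longest-common-prefix comparison over token IDs. This is an illustration, not the server's code: it assumes similarity is the shared prefix length divided by the new prompt length, and f_keep is the shared prefix length divided by the cached sequence length. Those definitions are inferred from the log field names, and the function names are hypothetical.

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lcp_stats(cached: list[int], prompt: list[int]) -> tuple[float, float, int]:
    """Return (similarity, f_keep, tokens_to_prefill) under the assumed definitions."""
    lcp = common_prefix_len(cached, prompt)
    sim = lcp / len(prompt) if prompt else 0.0
    f_keep = lcp / len(cached) if cached else 0.0
    return sim, f_keep, len(prompt) - lcp


# Toy token IDs: a 1,000-token shared digest followed by different tails.
digest = list(range(1000))
cached = digest + [5001, 5002, 5003, 5004, 5005]
prompt = digest + [6001, 6002, 6003]

sim, f_keep, delta = lcp_stats(cached, prompt)
print(f"sim={sim:.3f} f_keep={f_keep:.3f} tokens to prefill={delta}")
```

Under that reading, a 356-token delta at sim_best = 0.978 implies a prompt of roughly 16K tokens, which is in line with the size of the first request. A single changed token near the start of the digest would drop the shared prefix to almost nothing, which is the failure mode the design principles later in this post try to avoid.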

Addendum: Improvements with ik_llama.cpp (Expert CPU + Attention GPU Hybrid)

Using the optimized ik_llama.cpp build with Expert weights on CPU and Attention layers on GPU (-ngl 999 -ot exps=CPU):

Execution Command

  podman run --rm -it --device nvidia.com/gpu=all \
  -p 8081:8080 \
  --shm-size 32g \
  --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:ro,Z \
  $IMG \
  --host 0.0.0.0 --port 8080 \
  -m "$MODEL" --no-mmap --jinja \
  -c 131072 \
  -n 128 \
  --threads 13 --threads-batch 26 \
  -b 2048 -ub 512 \
  -ngl 999 -ot exps=CPU \
  -ctk f16 -ctv f16 \
  --merge-qkv -mla 3 -amb 512
  

Benchmark Results (Initial)

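Task | PP(tok) | TG(tok) | N_KV(tok) | T_PP(s) | S_PP(t/s) | T_TG(s) | S_TG(t/s)
0 | 5,264 | 744 | 6,007 | 59.596 | 88.33 | 37.815 | 19.67
747 | 765 | 259 | 6,287 | 13.277 | 57.62 | 13.164 | 19.68
1,007 | 279 | 1,024 | 7,331 | 6.243 | 44.69 | 52.452 | 19.52
2,032 | 1,037 | 1,024 | 8,368 | 16.772 | 61.83 | 51.793 | 19.77
3,057 | 1,041 | 310 | 8,695 | 16.637 | 62.57 | 16.124 | 19.23
Avg | - | - | - | - | 63.0 | - | 19.6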
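
The Avg row can be reproduced from the per-task columns. The sketch below recomputes each rate as tokens divided by seconds and then averages in two ways: the plain mean of per-task rates, which matches the table, and a token-weighted rate, which is an alternative view that the original table does not report.

```python
# (PP tok, T_PP s, TG tok, T_TG s) per task, from the table above.
tasks = [
    (5264, 59.596, 744, 37.815),
    (765, 13.277, 259, 13.164),
    (279, 6.243, 1024, 52.452),
    (1037, 16.772, 1024, 51.793),
    (1041, 16.637, 310, 16.124),
]

s_pp = [pp / t_pp for pp, t_pp, _, _ in tasks]
s_tg = [tg / t_tg for _, _, tg, t_tg in tasks]

# Plain mean of per-task rates (what the Avg row shows).
mean_s_pp = sum(s_pp) / len(s_pp)
mean_s_tg = sum(s_tg) / len(s_tg)

# Token-weighted rate: total tokens over total seconds.
pooled_s_pp = sum(t[0] for t in tasks) / sum(t[1] for t in tasks)

print(f"mean S_PP {mean_s_pp:.1f} t/s, mean S_TG {mean_s_tg:.1f} t/s")
print(f"token-weighted S_PP {pooled_s_pp:.1f} t/s")
```

The token-weighted prefill rate comes out higher than the plain mean because the longest prompt is also the fastest per token. For capacity planning on long prompts it is arguably the more relevant figure.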

Subsequent Measurements

After applying Prompt Cache (LCP) and template optimizations:

Run | PP(tok) | TG(tok) | N_KV(tok) | T_PP(s) | S_PP(t/s) | T_TG(s) | S_TG(t/s) | Notes
1 | 5,330 | 401 | 5,730 | 41.298 | 129.06 | 20.458 | 19.60 | Fresh request
2 | 416 | 2,241 | 7,986 | 8.363 | 49.75 | 114.552 | 19.56 | Cache partial miss
3 | 2,255 | 919 | 8,919 | 20.631 | 109.30 | 48.056 | 19.12 | Cache partial miss

Assessment

Improvements:

  • Prefill tolerance enhanced: Runs 1 and 3 achieve 100-130 t/s prefill speed. Long prompts process quickly
  • Generation speed stabilized: S_TG hovers around 19 t/s across all runs. Blackwell + Q4_K_S hits a hard ceiling
  • Cache effectiveness: Partial cache hits in Runs 2-3 maintain S_PP at 50-110 t/s

Current constraints:

  • Generation bottleneck: S_TG fixed at ~19 t/s. Perceived latency dominated by prefill time and output token count
  • Cache consistency: Runs 2-3 show “Common part does not match fully” warnings. Even minor changes to System Prompt or templates (whitespace, timestamps) fragment the cache

Analysis

Memory Bandwidth and Why th=13

Decode speed saturates at th=13-14. The 12-channel DDR5-6400 theoretical bandwidth is ~614 GB/s, but MoE random access patterns cannot fully utilize it. th=16 gives 12.94 tok/s, th=13 gives 11.67 tok/s – a 3-thread reduction for only 10% speed loss.
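
The ~614 GB/s figure, and a very rough decode ceiling derived from it, can be worked out as follows. The ceiling is a back-of-envelope sketch under loud assumptions: every decoded token reads all ~32B active parameters once from DRAM, at the average bits per weight implied by the ~520 GiB Q4_K_S size. It ignores KV cache reads, L3 hits, and the fact that tensors are not all quantized to the same width.

```python
CHANNELS = 12
TRANSFERS_PER_S = 6400e6     # DDR5-6400
BYTES_PER_TRANSFER = 8       # 64-bit data path per channel

peak_gb_s = CHANNELS * TRANSFERS_PER_S * BYTES_PER_TRANSFER / 1e9

GIB = 2**30
total_params = 1.03e12       # from the model table
active_params = 32e9         # ~32B active per token
q4ks_bytes = 520 * GIB       # ~520 GiB Q4_K_S size

# Average bits per weight implied by the file size (assumption: uniform).
bits_per_weight = q4ks_bytes * 8 / total_params
bytes_per_token = active_params * bits_per_weight / 8

ceiling_tps = peak_gb_s * 1e9 / bytes_per_token
measured_tps = 12.94         # th=16 decode from the thread table
utilization = measured_tps / ceiling_tps

print(f"peak bandwidth: {peak_gb_s:.1f} GB/s")
print(f"implied bits/weight: {bits_per_weight:.2f}")
print(f"naive decode ceiling: {ceiling_tps:.1f} tok/s")
print(f"measured / ceiling: {utilization:.0%}")
```

On these assumptions the measured decode rate sits at roughly a third of the naive ceiling, which fits the observation that scattered expert access cannot reach the theoretical bandwidth. The exact ratio should not be read as a measurement.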


Long Context Reality Check

As the Q4_K_M results show, a 16K bulk prefill takes about 44 minutes, so “fill 256K from scratch every time” is unrealistic: even at 20 tok/s, 256K tokens would take about 3.6 hours.

The practical solution:

  • Limit ctx to 16K-32K
  • Prefill a System Digest (8K-16K) once at startup
  • Use Prompt Cache (LCP similarity) for diff-only processing on subsequent requests
  • Keep output length at 1K default, 2K only when necessary

Decode 2.4 tok/s vs 10 tok/s

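Q4_K_S at ctx=16K delivers about 10 tok/s, while Q4_K_M at ctx=32K delivers about 2.4 tok/s. The two runs differ in quantization as well as context, but the gap appears to be driven mainly by context length: the 10 tok/s figure comes from a request with under 1K prompt tokens, whereas the 2.4 tok/s figure follows a 16K-token prefill, where attention over the ~4.1 GB KV cache becomes the bottleneck. That is unsuitable for interactive chat, but batch processing tolerates the wait.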

Why This Configuration Performs Well

MoE inference working this well on CPU is not because “everything fits in cache.” It is because the high-reuse working set probabilistically hits L3 at a high rate. The hot regions that L3 accelerates include:

  • Router / Gating logic
  • Projection-adjacent compute
  • Recent layer weights and intermediate tensors
  • KV reuse paths

The EPYC 9175F physical characteristics also contribute:

  1. Huge L3 (512MB) x low core count: 512MB L3 for only 16 cores. Cross-core contention nearly absent
  2. Very low memory latency configuration: 12 channels for 16 cores, keeping memory-controller queues shallow
  3. Zen 5 BF16 / AVX-512: Physical 512-bit datapath and native BF16 align well with llama.cpp optimizations

Prompt Cache Design Principles

  • System Digest must be byte-identical across requests (whitespace/date changes break cache)
  • RAG context goes in user messages, not system (cache preservation is priority)
  • Keep output style short and controlled (since generation is slow, shorter output directly improves usability)

The aichat .file /path/to/file approach is also effective: keep a fixed digest in the system prompt and inject only necessary documents on the user side for better cache stability.
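
These principles can be expressed as a small request builder. The sketch below is illustrative only: build_messages and digest_fingerprint are hypothetical helpers, the digest text is a placeholder, and the payload fields mirror the curl example in the reproduction steps.

```python
import hashlib
import json


def build_messages(system_digest: str, documents: list[str], question: str) -> list[dict]:
    """Keep the system prompt fixed; put per-request material in the user turn."""
    context = "\n\n".join(documents)
    user = f"{context}\n\n{question}" if context else question
    return [
        {"role": "system", "content": system_digest},
        {"role": "user", "content": user},
    ]


def digest_fingerprint(system_digest: str) -> str:
    """SHA-256 of the exact bytes, to detect accidental drift between requests."""
    return hashlib.sha256(system_digest.encode("utf-8")).hexdigest()


# Placeholder digest; in practice this is the 8K-16K token System Digest.
DIGEST = "You are a batch generation worker.\nFollow the house style guide."

payload = {
    "model": "kimi",
    "messages": build_messages(DIGEST, ["<retrieved document text>"], "Summarize the document."),
    "max_tokens": 1024,
}
body = json.dumps(payload)
print(digest_fingerprint(DIGEST)[:16], len(body))
```

Logging the fingerprint with each request is a cheap way to catch a stray timestamp or trailing space in the digest before it shows up as a cache miss.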

Conclusion

Running a 1T model on CPU is technically validated. Decode at 10 tok/s is insufficient for interactive use but fully practical for Dagster pipeline batch generation, dataset augmentation, and distillation teacher generation.

Operational conclusion: run interactive/fast inference on the GPU side (RTX PRO 6000, vLLM), while CPU llama.cpp runs at th=13 as a resident batch intelligence engine. Of 768GB memory, 523 GiB goes to the model, leaving 200GB+ for DataFrame operations and Trino queries in parallel.

The biggest win was not just that the model ran. It was that I could identify the actual operating point: th=13, prompt-cache reuse through LCP similarity, and a bandwidth-aware split between inference work and surrounding data services.

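Parameter | Recommended | Rationale
ctx | 16,384-32,768 | 256K is impractical
system digest | 8k (at most 12k-16k) | Prefill once at startup, then cache
threads | 13 | Memory bandwidth saturation, frees 3 cores for pipeline
ubatch | 256 | More stable than 512 (may avoid slowdown)
cache-ram | 32,768 MiB | Stabilizes LCP hit rate
output | 1,024 | Generation speed is the bottleneck, keep output short

The th=13 launch command as run is below. It uses --ctx-size 131072 and --ubatch-size 512, so adjust those two flags to apply the recommendations in the table.
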
  podman run --rm -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:Z \
  compute.home.arpa/llamacpp-zen5:latest \
  -m /models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
  --ctx-size 131072 --parallel 1 --threads 13 --threads-batch 13 \
  --batch-size 2048 --ubatch-size 512 --jinja --host 0.0.0.0 --port 8080
  

Reproduction Steps

1. Download Models

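  # Download into the Hugging Face cache layout so the files land under
  # /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF/snapshots/<revision>/,
  # which is the path the run commands mount.

  # Q4_K_S
  huggingface-cli download unsloth/Kimi-K2.5-GGUF \
    --include "Q4_K_S/*" \
    --cache-dir /mnt/data/hf/hub

  # Q4_K_M
  huggingface-cli download unsloth/Kimi-K2.5-GGUF \
    --include "Q4_K_M/*" \
    --cache-dir /mnt/data/hf/hub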
  

2. Run

See commands in “Methodology” section. Requires llama.cpp with flash-attn and prompt cache support.

3. Measure

  curl -s http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"kimi","messages":[{"role":"user","content":"Explain MoE architecture"}],"max_tokens":512}'
  

Extract prompt eval time and eval time from server logs.
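
Pulling the two rates out of the log can be scripted. The sketch below assumes timing lines of the form "prompt eval time = … ms / … tokens" and "eval time = … ms / … tokens", which is how llama.cpp server output typically looks; the exact format can differ between builds. The sample text is synthetic, not taken from the runs above, and parse_timings is a hypothetical helper.

```python
import re

# Matches both "prompt eval time = X ms / N tokens" and "eval time = X ms / N tokens".
TIMING_RE = re.compile(
    r"(?P<kind>prompt eval|eval) time\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*(?P<tokens>\d+) tokens"
)


def parse_timings(log_text: str) -> dict[str, float]:
    """Return prefill and decode rates in tok/s from the last matching lines."""
    out: dict[str, float] = {}
    for m in TIMING_RE.finditer(log_text):
        ms, tokens = float(m["ms"]), int(m["tokens"])
        key = "pp_tok_s" if m["kind"] == "prompt eval" else "tg_tok_s"
        out[key] = tokens / (ms / 1000.0)
    return out


# Synthetic sample in the assumed format (values chosen for easy checking).
SAMPLE = """\
prompt eval time =    1000.00 ms /    20 tokens (   50.00 ms per token,    20.00 tokens per second)
       eval time =    5000.00 ms /    50 tokens (  100.00 ms per token,    10.00 tokens per second)
      total time =    6000.00 ms /    70 tokens
"""

print(parse_timings(SAMPLE))
```

Recomputing the rate from milliseconds and token counts, rather than trusting the printed tokens-per-second field, keeps the numbers comparable if the log rounding changes.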

Technical Notes

Q4_K_S vs Q4_K_M

Q4_K_M is ~59 GiB larger (~520 vs ~579 GiB). Quality is marginally better, but the speed difference is minimal. For ctx=16K batch operations, Q4_K_S is sufficient.

Reference Benchmark: Llama-4-Maverick-17B-128E

For calibration, I also kept one comparison table for another MoE model:

Quant | Prefill (tok/s) | Decode (tok/s) | TTFT (1k context)
Q4_K_M | 65-68 | 21-24 | 12-17s
Q8_0 | 50-52 | 15-16 | 16-20s

This helped reinforce that quantization choice and model architecture materially affect the practical CPU inference envelope.