Background

From late 2025 into 2026, multiple 1-trillion-parameter (1T) class Mixture-of-Experts (MoE) models were released. Kimi-K2.5 is a prime example: 1.04T total parameters, with only 32B active per token (selecting 8 out of 384 experts).

Running these on GPUs is straightforward but expensive—Moonshot AI recommends at least 4x NVIDIA H200 ($150k–$200k). A CPU server with 768GB DDR5 can be built for around $15k instead. The question was whether CPU inference on a 1T model could be practical for batch workloads.

The AMD EPYC 9175F has an unusual architecture: only 16 cores, but with 512MB of L3 cache. The hypothesis was that this cache-heavy, low-core-count design would benefit MoE inference patterns. This article documents the validation of that hypothesis.

Objective

Three questions to answer:

  1. Does the hypothesis “active MoE experts fit in the 512MB L3 cache, accelerating inference” hold?
  2. Is the throughput of Kimi-K2.5 (Q4_K_S) on EPYC 9175F practical for batch processing?
  3. What is the optimal thread count, given memory bandwidth constraints?

Test Environment

Hardware

| Item | Specification |
|---|---|
| CPU | AMD EPYC 9175F (Zen 5, 16C/16T, SMT=OFF) |
| L3 Cache | 512MB (32MB per core) |
| Memory | DDR5-6400 64GB x 12ch = 768GB |
| GPU | RTX PRO 6000 MAX-Q 96GB (unused in this test) |
| TDP | 320W (cTDP 400W) |

Software

| Item | Version |
|---|---|
| OS | Ubuntu 24.04 LTS |
| Runtime | llama.cpp (server mode) |
| Container | Podman rootless |

Model

| Item | Specification |
|---|---|
| Model | Kimi-K2.5 (Moonshot AI) |
| Total Parameters | 1.04T (61 layers: 60 MoE + 1 Dense) |
| Active Parameters | 32B (8 of 384 experts per token) |
| Quantization | Q4_K_S (GGUF) |
| KV Cache Quantization | Q8_0 |
| Context Length | Up to 256K (tested 8K–128K) |

CPU Topology

  ksh3@compute-server:~$ lscpu | egrep 'CPU\(s\)|Thread|Core|Socket|NUMA'
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Model name:                              AMD EPYC 9175F 16-Core Processor
Thread(s) per core:                      1
Core(s) per socket:                      16
Socket(s):                               1
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15

ksh3@compute-server:~$ cat /sys/devices/system/cpu/smt/active
0
  

SMT is OFF. 16 physical cores appear as 16 logical cores. Single NUMA node.

Methodology

llama.cpp Launch Parameters

Thread count was varied across 16, 14, 13, and 12 to measure Prefill (input processing) and Decode (token generation) throughput.

Base command (th=13 example):

  podman run --rm -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:Z \
  compute.home.arpa/llamacpp-zen5:latest \
  -m /models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
  --ctx-size 131072 --parallel 1 --threads 13 --threads-batch 13 \
  --batch-size 2048 --ubatch-size 512 --jinja --host 0.0.0.0 --port 8080
  

Parameter rationale:

  • --cache-type-k q8_0 --cache-type-v q8_0: Quantize KV cache to reduce memory footprint
  • --flash-attn on: Enable Flash Attention on CPU to reduce bandwidth pressure with long contexts
  • --cap-add=SYS_NICE: Allow thread priority optimization
  • --batch-size 2048 --ubatch-size 512: Maximize Prefill throughput with Zen 5’s AVX-512 VNNI

Prompt Cache Testing

Repeated requests sharing the same prompt prefix were issued to measure the effect of llama.cpp's slot selection by longest common prefix (LCP) similarity, which reuses cached prompt state instead of recomputing the shared prefix.

Results

Thread-Count Scaling (ctx=8K, Q4_K_S)

| Threads | Prefill (tok/s) | Decode (tok/s) | Latency (ms/tok) | Notes |
|---|---|---|---|---|
| 16 (all cores) | 24.43 | 12.94 | 77.28 | Maximum throughput |
| 14 | 21.32 | 12.50 | 79.97 | Bandwidth saturation visible |
| 13 (recommended) | 21.58 | 11.67 | 85.70 | Best efficiency/headroom balance |
| 12 | 14.58 | 11.86 | 84.32 | Compute resource starvation begins |

Key observation: Decode speed plateaus across th=13–16 (11.67–12.94 tok/s), indicating memory bandwidth is the bottleneck. Meanwhile Prefill drops sharply at th=12, marking the point where AVX-512 compute resources become insufficient.
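
The table's internal consistency is easy to verify: latency is the reciprocal of decode speed (up to rounding), and the recommended th=13 setting retains roughly 90% of maximum decode throughput. A quick check:

```shell
# Cross-check the thread-scaling table: latency = 1000 / decode tok/s,
# and th=13 decode relative to the th=16 maximum.
awk 'BEGIN {
  printf "th=16 latency: %.2f ms/tok\n", 1000 / 12.94;          # ~77.28
  printf "th=13 latency: %.2f ms/tok\n", 1000 / 11.67;          # ~85.69
  printf "th=13 vs th=16 decode: %.0f%%\n", 100 * 11.67 / 12.94; # ~90%
}'
```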

Long-Context Stability at 128K (th=13)

| Metric | Value | Assessment |
|---|---|---|
| Prefill | 22.39 tok/s | AVX-512 VNNI sweet spot |
| Decode | 9.34 tok/s | Exceeds human reading speed (~6 tok/s) even at 128K |
| TTFT (39 new tokens) | 1,741 ms | LCP cache effective |
| KV Cache Latency | 107.10 ms/tok | Stable bandwidth control via 12ch DDR5 |

Decode at 9.34 tok/s was maintained at 128K context. No catastrophic speed degradation observed as context depth increased.

Prompt Cache Effect (ctx=16K)

| Request | Condition | Prompt Processing | Generation | Notes |
|---|---|---|---|---|
| 1st | No cache | 22.24 tok/s (823 tok / 37 s) | 10.27 tok/s (438 tok / 42.7 s) | Cold start |
| 2nd | Cache saving | 19.98 tok/s | 8.76 tok/s | Includes save overhead |
| 3rd | LCP similarity hit | 62 ms (cache lookup) | 10.0+ tok/s | Dramatic TTFT reduction |

Prompt cache for 1260 tokens consumed 159.5 MiB. From the 3rd request onward, prefix recomputation was skipped, reducing TTFT from seconds to 62ms.
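
The cache effect can be observed from the client side by sending the same prefix repeatedly and watching time-to-first-byte shrink. A minimal sketch, assuming the server launched above is listening on port 8081 (the message content is illustrative; real gains require a long shared system prompt):

```shell
# Send the same prompt three times; curl's %{time_starttransfer} approximates
# TTFT. The 3rd request should hit llama.cpp's LCP similarity cache.
payload='{"model":"kimi-k2.5","messages":[{"role":"system","content":"(long shared system prompt goes here)"},{"role":"user","content":"Hello"}],"max_tokens":32}'
for i in 1 2 3; do
  curl -s -o /dev/null -w "request $i TTFT: %{time_starttransfer}s\n" \
    http://localhost:8081/v1/chat/completions \
    -H "Content-Type: application/json" -d "$payload"
done
```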

Memory Footprint

| Item | Measured Value | Notes |
|---|---|---|
| Model weights | ~522 GB | Q4_K_S quantization |
| KV cache (16K ctx) | ~2.0 GB | K: 1098 MiB / V: 976 MiB |
| Prompt cache | ~160 MB | At 1.2K tokens |
| Total RSS | ~523 GiB / 755 GiB | No swap activity |

Runs without swap on 768GB. Over 200GB headroom remains for OS and background services.
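
The "no swap activity" claim can be monitored with standard Linux counters; a minimal check using only /proc (no extra tools assumed):

```shell
# Available memory and free swap at a glance:
grep -E 'MemAvailable|SwapFree' /proc/meminfo
# Cumulative swap-in/out page counts; these should not grow while the
# server is under load:
grep -E '^pswp(in|out) ' /proc/vmstat
```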

Analysis

Original Hypothesis: Rejected (but Directionally Correct)

Original hypothesis:

MoE models have 10–30B active parameters, so the entire active expert set fits in EPYC 9175F’s 512MB L3 cache, accelerating inference.

Result: Rejected. Kimi-K2.5’s active parameters are 32B; even with Q4 quantization, ~16GB of data moves per token generation. 512MB cannot hold all of that.
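
The arithmetic that sinks the original hypothesis is short enough to do inline (assuming a flat ~4 bits/weight for Q4; the real Q4_K_S mix is slightly higher):

```shell
# Per-token active working set vs. the 512MB L3.
awk 'BEGIN {
  active_gb = 32e9 * 4 / 8 / 1e9;   # 32B active params x 0.5 bytes each
  l3_gb     = 512 / 1024;           # 512 MB L3 in GB
  printf "active set: %.0f GB, L3: %.1f GB, ratio: %.0fx\n",
         active_gb, l3_gb, active_gb / l3_gb;
}'
```

The active set exceeds L3 by more than an order of magnitude, so a wholesale cache fit was never possible.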

Refined Hypothesis: Partial/Probabilistic Cache Hits

What L3 actually accelerates are “high-reuse hot regions”:

  • Router / Gating logic: Accessed for every token to determine expert selection. Small, high-frequency—naturally stays in L3
  • Projection / Bias tensors: Layer input/output boundaries. Small size, high reuse
  • Previous-layer weights / intermediate tensors: Temporal locality keeps them in L3 briefly
  • KV cache reuse fragments: Frequently accessed portions during attention computation

MoE models are “faster than expected” on CPU not because “everything fits in cache,” but because the high-reuse working set probabilistically hits L3 at a high rate. EPYC 9175F’s 32MB-per-core L3 sustains this hit rate at an unusually high level.

Why “Fewer Cores, More Cache” Suits MoE

A typical high-core EPYC (e.g., 128 cores) shares 512MB L3 across all cores—just 4MB per core. MoE’s irregular access patterns cause inter-core L3 thrashing, degrading effective hit rates.

EPYC 9175F allocates 512MB to only 16 cores:

  1. Minimizes inter-core L3 contention
  2. Hot regions are less likely to be evicted
  3. Avoids memory controller request congestion
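
The per-core arithmetic behind this comparison:

```shell
# L3 per core: same 512MB total, very different sharing.
awk 'BEGIN {
  printf "128-core EPYC: %.0f MB L3 per core\n", 512 / 128;  # 4 MB
  printf "EPYC 9175F:    %.0f MB L3 per core\n", 512 / 16;   # 32 MB
}'
```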

Memory Bandwidth as the Bottleneck Boundary

Thread-count benchmarks show Decode speed saturating across th=13–16. The theoretical bandwidth of 12-channel DDR5-6400 is 614.4 GB/s, and effective bandwidth is already saturated by roughly 13 threads: a direct indication that token generation is memory-bound, consistent with the standard picture of LLM inference.
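
A simplistic roofline makes the bound concrete: theoretical bandwidth divided by bytes moved per token gives a decode ceiling. Effective bandwidth is well below theoretical, and decode also streams KV cache, so the measured 11.7–12.9 tok/s sits under this ceiling as expected:

```shell
# Roofline sketch: decode ceiling = bandwidth / bytes-per-token.
awk 'BEGIN {
  bw_gbs    = 12 * 6400e6 * 8 / 1e9;  # 12ch x DDR5-6400 x 8 bytes = 614.4 GB/s
  gb_per_tok = 16;                    # ~16 GB of Q4 weights read per token
  printf "theoretical bandwidth: %.1f GB/s\n", bw_gbs;
  printf "decode ceiling:        %.1f tok/s\n", bw_gbs / gb_per_tok;  # ~38.4
}'
```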

Zen 5 AVX-512 Contribution

Zen 5 implements a full 512-bit datapath, unlike the double-pumped 256-bit AVX-512 of the previous generation. Native BF16 support and AVX-512 VNNI instructions keep Q4_K_S dequantization and dot products fed even at the 5.0 GHz boost clock. The 24.43 tok/s Prefill figure is largely attributable to this.

Lessons Learned

The initial hypothesis that “experts fit in L3” was naive—basic arithmetic should have caught that. But the directional insight was correct: MoE access patterns have locality that L3 can exploit, unlike Dense models.

Thread count optimization at th=13 turned out to be the operational sweet spot: 90% of maximum inference performance while freeing 3 cores for Dagster, Trino, and network IO. This directly enables stable operation as a batch processing server.

Sustaining 9.34 tok/s at 128K context exceeded expectations. MLA (Multi-head Latent Attention) in Kimi-K2.5 compresses KV cache effectively enough that long contexts do not cause catastrophic slowdowns.

Reproduction Steps

1. Hardware Requirements

  • AMD EPYC 9175F server (768GB DDR5-6400 recommended)
  • Storage: ~600GB for model files (NVMe recommended)

2. Model Download

  huggingface-cli download unsloth/Kimi-K2.5-GGUF \
  --include "Q4_K_S/*" \
  --local-dir /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF
  

3. llama.cpp Container Build (Zen 5 Optimized)

Build with AVX-512 VNNI / BF16 enabled: compile with -march=znver5, or -DGGML_NATIVE=ON when building on the target host.

4. Run the Server

  podman run --rm -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:Z \
  compute.home.arpa/llamacpp-zen5:latest \
  -m /models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
  --ctx-size 131072 --parallel 1 --threads 13 --threads-batch 13 \
  --batch-size 2048 --ubatch-size 512 --jinja --host 0.0.0.0 --port 8080
  

5. Verify

  curl -s http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"kimi-k2.5","messages":[{"role":"user","content":"Hello"}],"max_tokens":100}'
  

Technical Notes

SMT (Hyper-Threading) Should Be OFF

With SMT=ON, the 9175F exposes 32 logical threads, but effective L3 utilization per thread drops. Using the 16 physical cores directly yields more stable MoE inference. Verify with /sys/devices/system/cpu/smt/active returning 0.

L3 Benefits Diminish with Dense 70B-Class Models

Dense models have uniform access patterns that do not benefit from L3 caching the way MoE’s expert-selection locality does. This configuration should be considered MoE-specific.

Prompt Cache Is the Real Weapon for CPU Inference

CPU’s weakness in Prefill speed can be compensated by aggressive prompt caching. For Dagster pipelines processing thousands of documents, caching the shared System Prompt dramatically reduces total execution time.

Cost Comparison

| Platform | Configuration | Est. Decode Speed | Hardware Cost |
|---|---|---|---|
| AMD EPYC 9175F | 1x CPU, 768GB RAM | 10–13 tok/s | ~$15k |
| Mac Studio M3 Ultra | 2x units (512GB) | ~21 tok/s | ~$20k |
| NVIDIA GPU Cluster | 4x H200 | 40+ tok/s | $150k–$200k |

Less than 1/10 the cost of a GPU cluster for a “working” environment. Not suitable for interactive chat, but sufficient for nightly batch processing and dataset generation.