Why EPYC 9175F's 512MB L3 Cache Accelerates MoE Inference: Hypothesis Validation with a 1T Model
Running Kimi-K2.5 (1T MoE) CPU-only on AMD EPYC 9175F to validate the hypothesis that massive L3 cache accelerates MoE inference. Covers hypothesis rejection and refinement, thread-count benchmarks, and reproduction steps.
Background
From late 2025 into 2026, multiple 1-trillion-parameter (1T) class Mixture-of-Experts (MoE) models were released. Kimi-K2.5 is a prime example: 1.04T total parameters, with only 32B active per token (selecting 8 out of 384 experts).
Running these on GPUs is straightforward but expensive—Moonshot AI recommends at least 4x NVIDIA H200 ($150k–$200k). A CPU server with 768GB DDR5 can be built for around $15k instead. The question was whether CPU inference on a 1T model could be practical for batch workloads.
The AMD EPYC 9175F has an unusual architecture: only 16 cores, but with 512MB of L3 cache. The hypothesis was that this cache-heavy, low-core-count design would benefit MoE inference patterns. This article documents the validation of that hypothesis.
Objective
Three questions to answer:
- Does the hypothesis “active MoE experts fit in the 512MB L3 cache, accelerating inference” hold?
- Is the throughput of Kimi-K2.5 (Q4_K_S) on EPYC 9175F practical for batch processing?
- What is the optimal thread count, given memory bandwidth constraints?
Test Environment
Hardware
| Item | Specification |
|---|---|
| CPU | AMD EPYC 9175F (Zen 5, 16C/16T, SMT=OFF) |
| L3 Cache | 512MB (32MB per core) |
| Memory | DDR5-6400 64GB x 12ch = 768GB |
| GPU | RTX PRO 6000 MAX-Q 96GB (unused in this test) |
| TDP | 320W (cTDP 400W) |
Software
| Item | Version |
|---|---|
| OS | Ubuntu 24.04 LTS |
| Runtime | llama.cpp (server mode) |
| Container | Podman rootless |
Model
| Item | Specification |
|---|---|
| Model | Kimi-K2.5 (Moonshot AI) |
| Total Parameters | 1.04T (61 layers: 60 MoE + 1 Dense) |
| Active Parameters | 32B (8 of 384 experts per token) |
| Quantization | Q4_K_S (GGUF) |
| KV Cache Quantization | Q8_0 |
| Context Length | Up to 256K (tested 8K–128K) |
CPU Topology
ksh3@compute-server:~$ lscpu | egrep 'CPU\(s\)|Thread|Core|Socket|NUMA'
CPU(s): 16
On-line CPU(s) list: 0-15
Model name: AMD EPYC 9175F 16-Core Processor
Thread(s) per core: 1
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
ksh3@compute-server:~$ cat /sys/devices/system/cpu/smt/active
0
SMT is OFF. 16 physical cores appear as 16 logical cores. Single NUMA node.
Methodology
llama.cpp Launch Parameters
Thread count was varied across 16, 14, 13, and 12 to measure Prefill (input processing) and Decode (token generation) throughput.
Base command (th=13 example):
podman run --rm -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
-v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:Z \
compute.home.arpa/llamacpp-zen5:latest \
-m /models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf \
--cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
--ctx-size 131072 --parallel 1 --threads 13 --threads-batch 13 \
--batch-size 2048 --ubatch-size 512 --jinja --host 0.0.0.0 --port 8080
Parameter rationale:
- --cache-type-k q8_0 --cache-type-v q8_0: Quantize the KV cache to reduce memory footprint
- --flash-attn on: Enable Flash Attention on CPU to reduce bandwidth pressure with long contexts
- --cap-add=SYS_NICE: Allow thread priority optimization
- --batch-size 2048 --ubatch-size 512: Maximize Prefill throughput with Zen 5’s AVX-512 VNNI
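For the thread-sweep measurements, throughput can be sampled directly from the server’s OpenAI-compatible endpoint instead of scraping logs. A minimal sketch, assuming the base URL, served model name, and prompt shown here; wall-clock tok/s includes Prefill time, so it slightly understates pure Decode speed:

```python
import json
import time
import urllib.request

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generation throughput from the usage field and wall-clock time."""
    return completion_tokens / elapsed_s

def measure(base_url: str, prompt: str, max_tokens: int = 256) -> float:
    """Send one chat completion and return end-to-end tok/s."""
    payload = {
        "model": "kimi-k2.5",  # served model name: an assumption, adjust to your deployment
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        usage = json.load(resp)["usage"]
    return tokens_per_second(usage["completion_tokens"], time.monotonic() - start)
```

For example, measure('http://localhost:8081', 'Hello') against the container launched above, repeated per thread-count setting.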
Prompt Cache Testing
Repeated requests sharing the same prompt prefix were issued to measure the effect of llama.cpp’s slot selection by longest-common-prefix (LCP) similarity.
Results
Thread-Count Scaling (ctx=8K, Q4_K_S)
| Threads | Prefill (tok/s) | Decode (tok/s) | Latency (ms/tok) | Notes |
|---|---|---|---|---|
| 16 (all cores) | 24.43 | 12.94 | 77.28 | Maximum throughput |
| 14 | 21.32 | 12.50 | 79.97 | Bandwidth saturation visible |
| 13 (recommended) | 21.58 | 11.67 | 85.70 | Best efficiency/headroom balance |
| 12 | 14.58 | 11.86 | 84.32 | Compute resource starvation begins |
Key observation: Decode speed plateaus across th=13–16 (11.67–12.94 tok/s), indicating memory bandwidth is the bottleneck. Meanwhile Prefill drops sharply at th=12, marking the point where AVX-512 compute resources become insufficient.
Long-Context Stability at 128K (th=13)
| Metric | Value | Assessment |
|---|---|---|
| Prefill | 22.39 tok/s | AVX-512 VNNI sweet spot |
| Decode | 9.34 tok/s | Exceeds human reading speed (~6 tok/s) even at 128K |
| TTFT (39 new tokens) | 1,741 ms | LCP cache effective |
| KV Cache Latency | 107.10 ms/tok | Stable bandwidth control via 12ch DDR5 |
Decode at 9.34 tok/s was maintained at 128K context. No catastrophic speed degradation observed as context depth increased.
Prompt Cache Effect (ctx=16K)
| Request | Condition | Prompt Processing | Generation | Notes |
|---|---|---|---|---|
| 1st | No cache | 22.24 tok/s (823 tok / 37 s) | 10.27 tok/s (438 tok / 42.7 s) | Cold start |
| 2nd | Cache saving | 19.98 tok/s | 8.76 tok/s | Includes save overhead |
| 3rd | LCP similarity hit | 62 ms (cache lookup) | 10.0+ tok/s | Dramatic TTFT reduction |
Prompt cache for 1260 tokens consumed 159.5 MiB. From the 3rd request onward, prefix recomputation was skipped, reducing TTFT from seconds to 62ms.
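From these measurements the cache costs roughly 130 KiB per cached token, which makes it easy to budget RAM for longer shared prefixes. A quick sketch; the 8K-token prefix is a hypothetical example, not a measured value:

```python
# Derive per-token prompt-cache cost from the measured figures above,
# then extrapolate to a longer (hypothetical) shared prefix.
measured_tokens = 1260
measured_mib = 159.5
kib_per_token = measured_mib * 1024 / measured_tokens

shared_prefix_tokens = 8000    # hypothetical shared System Prompt length
est_mib = shared_prefix_tokens * kib_per_token / 1024

print(f"~{kib_per_token:.0f} KiB/token")
print(f"~{est_mib:.0f} MiB for an {shared_prefix_tokens}-token prefix")
```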
Memory Footprint
| Item | Measured Value | Notes |
|---|---|---|
| Model weights | ~522 GB | Q4_K_S quantization |
| KV cache (16K ctx) | ~2.0 GB | K:1098MiB / V:976MiB |
| Prompt cache | ~160 MB | At 1.2K tokens |
| Total RSS | ~523 GiB / 755 GiB | No swap activity |
Runs without swap on 768GB. Over 200GB headroom remains for OS and background services.
Analysis
Original Hypothesis: Rejected (but Directionally Correct)
Original hypothesis:
MoE models have 10–30B active parameters, so the entire active expert set fits in EPYC 9175F’s 512MB L3 cache, accelerating inference.
Result: Rejected. Kimi-K2.5’s active parameters are 32B; even with Q4 quantization, ~16GB of data moves per token generation. 512MB cannot hold all of that.
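The arithmetic behind the rejection, as a quick sketch (the flat 4 bits/weight is an approximation; Q4_K_S averages slightly more per weight):

```python
# Back-of-envelope check of the original hypothesis: does the active
# expert set fit in 512MB of L3?
active_params = 32e9           # active parameters per generated token
bits_per_weight = 4.0          # Q4_K_S, approximated as flat 4 bits/weight
bytes_per_token = active_params * bits_per_weight / 8

l3_bytes = 512 * 1024**2       # EPYC 9175F total L3

print(f"Data touched per token: ~{bytes_per_token / 1e9:.0f} GB")
print(f"L3 capacity:            ~{l3_bytes / 1e9:.3f} GB")
print(f"Ratio: ~{bytes_per_token / l3_bytes:.0f}x larger than L3")
```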
Refined Hypothesis: Partial/Probabilistic Cache Hits
What L3 actually accelerates are “high-reuse hot regions”:
- Router / Gating logic: Accessed for every token to determine expert selection. Small, high-frequency—naturally stays in L3
- Projection / Bias tensors: Layer input/output boundaries. Small size, high reuse
- Previous-layer weights / intermediate tensors: Temporal locality keeps them in L3 briefly
- KV cache reuse fragments: Frequently accessed portions during attention computation
MoE models are “faster than expected” on CPU not because “everything fits in cache,” but because the high-reuse working set probabilistically hits L3 at a high rate. EPYC 9175F’s 32MB-per-core L3 sustains this hit rate at an unusually high level.
Why “Fewer Cores, More Cache” Suits MoE
A typical high-core EPYC (e.g., 128 cores with 512MB of total L3) gives each core an average of just 4MB, and the cores on each CCD contend for the same 32MB slice. MoE’s irregular access patterns cause inter-core L3 thrashing, degrading effective hit rates.
EPYC 9175F instead spreads its 16 cores across 16 CCDs, so each core owns a private 32MB L3 slice:
- Minimizes inter-core L3 contention
- Hot regions are less likely to be evicted
- Avoids memory controller request congestion
Memory Bandwidth as the Bottleneck Boundary
Thread-count benchmarks show Decode speed tends to saturate at th=13–16. The theoretical bandwidth of 12-channel DDR5-6400 is 614 GB/s, but effective bandwidth saturates around th=13. This is a direct indication that LLM inference is memory-bound, consistent with general understanding.
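A rough roofline-style check makes the memory-bound claim concrete. This sketch assumes, as a worst case, that all ~16 GB of Q4 active weights are streamed from DRAM on every token; L3 hits reduce the real traffic, so the implied figure is an upper bound:

```python
# Implied memory traffic at the measured Decode speed vs. theoretical peak.
bytes_per_token = 16e9             # Q4 active weights per token (worst case)
decode_tok_s = 12.94               # measured at th=16, ctx=8K

implied_bw = bytes_per_token * decode_tok_s
theoretical_bw = 12 * 6400e6 * 8   # 12 channels x DDR5-6400 x 8 bytes/transfer

print(f"Implied weight traffic: ~{implied_bw / 1e9:.0f} GB/s")
print(f"Theoretical peak:       ~{theoretical_bw / 1e9:.0f} GB/s")
```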
Zen 5 AVX-512 Contribution
Zen 5 has a physical 512-bit datapath, unlike the double-pumped 256-bit implementation of previous generations. Native BF16 processing and AVX-512 VNNI instructions run Q4_K_S dequantization and dot products at high throughput while the 9175F sustains clocks near its 5.0GHz boost. The 24.43 tok/s Prefill figure is largely attributable to this.
Lessons Learned
The initial hypothesis that “experts fit in L3” was naive—basic arithmetic should have caught that. But the directional insight was correct: MoE access patterns have locality that L3 can exploit, unlike Dense models.
Thread count optimization at th=13 turned out to be the operational sweet spot: 90% of maximum inference performance while freeing 3 cores for Dagster, Trino, and network IO. This directly enables stable operation as a batch processing server.
Sustaining 9.34 tok/s at 128K context exceeded expectations. MLA (Multi-head Latent Attention) in Kimi-K2.5 compresses KV cache effectively enough that long contexts do not cause catastrophic slowdowns.
Reproduction Steps
1. Hardware Requirements
- AMD EPYC 9175F server (768GB DDR5-6400 recommended)
- Storage: ~600GB for model files (NVMe recommended)
2. Model Download
huggingface-cli download unsloth/Kimi-K2.5-GGUF \
--include "Q4_K_S/*" \
--local-dir /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF
3. llama.cpp Container Build (Zen 5 Optimized)
Build with AVX-512 VNNI / BF16 enabled: pass -march=znver5 to the compiler, or let CMake auto-detect the host ISA with -DGGML_NATIVE=ON.
4. Launch (Recommended: th=13, ctx=128K)
podman run --rm -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
-v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:Z \
compute.home.arpa/llamacpp-zen5:latest \
-m /models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf \
--cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
--ctx-size 131072 --parallel 1 --threads 13 --threads-batch 13 \
--batch-size 2048 --ubatch-size 512 --jinja --host 0.0.0.0 --port 8080
5. Verify
curl -s http://localhost:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"kimi-k2.5","messages":[{"role":"user","content":"Hello"}],"max_tokens":100}'
Technical Notes
SMT (Hyper-Threading) Should Be OFF
With SMT=ON, the 9175F exposes 32 logical cores, but effective L3 utilization drops. Using the 16 physical cores directly yields more stable MoE inference. Verify that /sys/devices/system/cpu/smt/active returns 0.
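A small helper for asserting the SMT state at service startup; the parse is factored out of the sysfs read so it can be exercised without the file, and the path is the standard Linux sysfs location:

```python
def smt_active(sysfs_text: str) -> bool:
    """Interpret the contents of /sys/devices/system/cpu/smt/active."""
    return sysfs_text.strip() == "1"

def assert_smt_off(path: str = "/sys/devices/system/cpu/smt/active") -> None:
    """Raise if SMT is enabled on this host."""
    with open(path) as f:
        if smt_active(f.read()):
            raise RuntimeError("SMT is ON; disable it for stable MoE inference")
```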
L3 Benefits Diminish with Dense 70B-Class Models
Dense models have uniform access patterns that do not benefit from L3 caching the way MoE’s expert-selection locality does. This configuration should be considered MoE-specific.
Prompt Cache Is the Real Weapon for CPU Inference
CPU’s weakness in Prefill speed can be compensated by aggressive prompt caching. For Dagster pipelines processing thousands of documents, caching the shared System Prompt dramatically reduces total execution time.
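For that pattern to hit the LCP cache, the shared System Prompt must be byte-identical and come first in every request, with per-document content after it. A minimal sketch; the function name and prompt text are illustrative assumptions:

```python
# Shared prefix kept identical across all requests so llama.cpp's LCP slot
# matching can reuse the cached prefix computation.
SHARED_SYSTEM_PROMPT = "You are a document summarizer. ..."

def build_request(document: str, max_tokens: int = 512) -> dict:
    return {
        "model": "kimi-k2.5",
        "messages": [
            # Shared prefix first: only this part benefits from the prompt cache.
            {"role": "system", "content": SHARED_SYSTEM_PROMPT},
            # Per-document content goes last, after the common prefix.
            {"role": "user", "content": document},
        ],
        "max_tokens": max_tokens,
    }
```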
Cost Comparison
| Platform | Configuration | Est. Decode Speed | Hardware Cost |
|---|---|---|---|
| AMD EPYC 9175F | 1x CPU, 768GB RAM | 10–13 tok/s | ~$15k |
| Mac Studio M3 Ultra | 2x units (512GB) | ~21 tok/s | ~$20k |
| NVIDIA GPU Cluster | 4x H200 | 40+ tok/s | $150k–$200k |
Less than 1/10 the cost of a GPU cluster for a “working” environment. Not suitable for interactive chat, but sufficient for nightly batch processing and dataset generation.

