Background

From late 2025 into 2026, multiple 1-trillion-parameter (1T) class Mixture-of-Experts (MoE) models were released. Kimi-K2.5 is a prime example: 1.04T total parameters, with only 32B active per token (selecting 8 out of 384 experts).

Running these on GPUs is straightforward but expensive—Moonshot AI recommends at least 4x NVIDIA H200 ($150k–$200k). A CPU server with 768GB DDR5 can be built for around $15k instead. The question was whether CPU inference on a 1T model could be practical for batch workloads.

The AMD EPYC 9175F has an unusual architecture: only 16 cores, but with 512MB of L3 cache. The hypothesis was that this cache-heavy, low-core-count design would benefit MoE inference patterns. This article documents the validation of that hypothesis.

Objective

Three questions to answer:

  1. Does the hypothesis “active MoE experts fit in the 512MB L3 cache, accelerating inference” hold?
  2. Is the throughput of Kimi-K2.5 (Q4_K_S) on EPYC 9175F practical for batch processing?
  3. What is the optimal thread count, given memory bandwidth constraints?

Test Environment

Hardware

| Item | Specification |
|---|---|
| CPU | AMD EPYC 9175F (Zen 5, 16C/16T, SMT=OFF) |
| L3 Cache | 512MB (32MB per core) |
| Memory | DDR5-6400 64GB x 12ch = 768GB |
| GPU | RTX PRO 6000 MAX-Q 96GB (unused in this test) |
| TDP | 320W (cTDP 400W) |

Software

| Item | Version |
|---|---|
| OS | Ubuntu 24.04 LTS |
| Runtime | llama.cpp (server mode) |
| Container | Podman rootless |

Model

| Item | Specification |
|---|---|
| Model | Kimi-K2.5 (Moonshot AI) |
| Total Parameters | 1.04T (61 layers: 60 MoE + 1 Dense) |
| Active Parameters | 32B (8 of 384 experts per token) |
| Quantization | Q4_K_S (GGUF) |
| KV Cache Quantization | Q8_0 |
| Context Length | Up to 256K (tested 8K–128K) |

CPU Topology

  ksh3@compute-server:~$ lscpu | egrep 'CPU\(s\)|Thread|Core|Socket|NUMA'
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Model name:                              AMD EPYC 9175F 16-Core Processor
Thread(s) per core:                      1
Core(s) per socket:                      16
Socket(s):                               1
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15

ksh3@compute-server:~$ cat /sys/devices/system/cpu/smt/active
0
  

SMT is OFF. 16 physical cores appear as 16 logical cores. Single NUMA node.

Methodology

llama.cpp Launch Parameters

Thread count was varied across 16, 14, 13, and 12 to measure Prefill (input processing) and Decode (token generation) throughput.

Base command (th=13 example):

  podman run --rm -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:Z \
  compute.home.arpa/llamacpp-zen5:latest \
  -m /models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
  --ctx-size 131072 --parallel 1 --threads 13 --threads-batch 13 \
  --batch-size 2048 --ubatch-size 512 --jinja --host 0.0.0.0 --port 8080
  

Parameter rationale:

  • --cache-type-k q8_0 --cache-type-v q8_0: Quantize KV cache to reduce memory footprint
  • --flash-attn on: Enable Flash Attention on CPU to reduce bandwidth pressure with long contexts
  • --cap-add=SYS_NICE: Allow thread priority optimization
  • --batch-size 2048 --ubatch-size 512: Maximize Prefill throughput with Zen 5’s AVX-512 VNNI

Prompt Cache Testing

Repeated requests sharing the same prompt prefix were issued to measure the effect of llama.cpp's slot selection by longest common prefix (LCP) similarity, which reuses cached prompt state instead of recomputing the shared prefix.

Results

Thread-Count Scaling (ctx=8K, Q4_K_S)

| Threads | Prefill (tok/s) | Decode (tok/s) | Latency (ms/tok) | Notes |
|---|---|---|---|---|
| 16 (all cores) | 24.43 | 12.94 | 77.28 | Maximum throughput |
| 14 | 21.32 | 12.50 | 79.97 | Bandwidth saturation visible |
| 13 (recommended) | 21.58 | 11.67 | 85.70 | Best efficiency/headroom balance |
| 12 | 14.58 | 11.86 | 84.32 | Compute resource starvation begins |

Key observation: Decode speed plateaus across th=13–16 (11.67–12.94 tok/s), indicating memory bandwidth is the bottleneck. Meanwhile Prefill drops sharply at th=12, marking the point where AVX-512 compute resources become insufficient.
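
The table's internal consistency is easy to verify: latency is the reciprocal of decode speed (up to rounding), and the recommended th=13 setting retains roughly 90% of maximum decode throughput. A quick check:

```shell
# Cross-check the thread-scaling table: latency = 1000 / decode tok/s,
# and th=13 decode relative to the th=16 maximum.
awk 'BEGIN {
  printf "th=16 latency: %.2f ms/tok\n", 1000 / 12.94;          # ~77.28
  printf "th=13 latency: %.2f ms/tok\n", 1000 / 11.67;          # ~85.69
  printf "th=13 vs th=16 decode: %.0f%%\n", 100 * 11.67 / 12.94; # ~90%
}'
```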

Long-Context Stability at 128K (th=13)

| Metric | Value | Assessment |
|---|---|---|
| Prefill | 22.39 tok/s | AVX-512 VNNI sweet spot |
| Decode | 9.34 tok/s | Exceeds human reading speed (~6 tok/s) even at 128K |
| TTFT (39 new tokens) | 1,741 ms | LCP cache effective |
| KV Cache Latency | 107.10 ms/tok | Stable bandwidth control via 12ch DDR5 |

Decode at 9.34 tok/s was maintained at 128K context. No catastrophic speed degradation observed as context depth increased.

Prompt Cache Effect (ctx=16K)

| Request | Condition | Prompt Processing | Generation | Notes |
|---|---|---|---|---|
| 1st | No cache | 22.24 tok/s (823 tok / 37 s) | 10.27 tok/s (438 tok / 42.7 s) | Cold start |
| 2nd | Cache saving | 19.98 tok/s | 8.76 tok/s | Includes save overhead |
| 3rd | LCP similarity hit | 62 ms (cache lookup) | 10.0+ tok/s | Dramatic TTFT reduction |

Prompt cache for 1260 tokens consumed 159.5 MiB. From the 3rd request onward, prefix recomputation was skipped, reducing TTFT from seconds to 62ms.
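
The cache effect can be observed from the client side by sending the same prefix repeatedly and watching time-to-first-byte shrink. A minimal sketch, assuming the server launched above is listening on port 8081 (the message content is illustrative; real gains require a long shared system prompt):

```shell
# Send the same prompt three times; curl's %{time_starttransfer} approximates
# TTFT. The 3rd request should hit llama.cpp's LCP similarity cache.
payload='{"model":"kimi-k2.5","messages":[{"role":"system","content":"(long shared system prompt goes here)"},{"role":"user","content":"Hello"}],"max_tokens":32}'
for i in 1 2 3; do
  curl -s -o /dev/null -w "request $i TTFT: %{time_starttransfer}s\n" \
    http://localhost:8081/v1/chat/completions \
    -H "Content-Type: application/json" -d "$payload"
done
```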

Memory Footprint

| Item | Measured Value | Notes |
|---|---|---|
| Model weights | ~522 GB | Q4_K_S quantization |
| KV cache (16K ctx) | ~2.0 GB | K: 1098 MiB / V: 976 MiB |
| Prompt cache | ~160 MB | At 1.2K tokens |
| Total RSS | ~523 GiB / 755 GiB | No swap activity |

Runs without swap on 768GB. Over 200GB headroom remains for OS and background services.
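
The "no swap activity" claim can be monitored with standard Linux counters; a minimal check using only /proc (no extra tools assumed):

```shell
# Available memory and free swap at a glance:
grep -E 'MemAvailable|SwapFree' /proc/meminfo
# Cumulative swap-in/out page counts; these should not grow while the
# server is under load:
grep -E '^pswp(in|out) ' /proc/vmstat
```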

Analysis

Original Hypothesis: Rejected (but Directionally Correct)

Original hypothesis:

MoE models have 10–30B active parameters, so the entire active expert set fits in EPYC 9175F’s 512MB L3 cache, accelerating inference.

Result: Rejected. Kimi-K2.5’s active parameters are 32B; even with Q4 quantization, ~16GB of data moves per token generation. 512MB cannot hold all of that.
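
The arithmetic that sinks the original hypothesis is short enough to do inline (assuming a flat ~4 bits/weight for Q4; the real Q4_K_S mix is slightly higher):

```shell
# Per-token active working set vs. the 512MB L3.
awk 'BEGIN {
  active_gb = 32e9 * 4 / 8 / 1e9;   # 32B active params x 0.5 bytes each
  l3_gb     = 512 / 1024;           # 512 MB L3 in GB
  printf "active set: %.0f GB, L3: %.1f GB, ratio: %.0fx\n",
         active_gb, l3_gb, active_gb / l3_gb;
}'
```

The active set exceeds L3 by more than an order of magnitude, so a wholesale cache fit was never possible.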

Refined Hypothesis: Partial/Probabilistic Cache Hits

What L3 actually accelerates are “high-reuse hot regions”:

  • Router / Gating logic: Accessed for every token to determine expert selection. Small, high-frequency—naturally stays in L3
  • Projection / Bias tensors: Layer input/output boundaries. Small size, high reuse
  • Previous-layer weights / intermediate tensors: Temporal locality keeps them in L3 briefly
  • KV cache reuse fragments: Frequently accessed portions during attention computation

MoE models are “faster than expected” on CPU not because “everything fits in cache,” but because the high-reuse working set probabilistically hits L3 at a high rate. EPYC 9175F’s 32MB-per-core L3 sustains this hit rate at an unusually high level.

Why “Fewer Cores, More Cache” Suits MoE

A typical high-core EPYC (e.g., 128 cores) shares 512MB L3 across all cores—just 4MB per core. MoE’s irregular access patterns cause inter-core L3 thrashing, degrading effective hit rates.

EPYC 9175F allocates 512MB to only 16 cores:

  1. Minimizes inter-core L3 contention
  2. Hot regions are less likely to be evicted
  3. Avoids memory controller request congestion
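
The per-core arithmetic behind this comparison:

```shell
# L3 per core: same 512MB total, very different sharing.
awk 'BEGIN {
  printf "128-core EPYC: %.0f MB L3 per core\n", 512 / 128;  # 4 MB
  printf "EPYC 9175F:    %.0f MB L3 per core\n", 512 / 16;   # 32 MB
}'
```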

Memory Bandwidth as the Bottleneck Boundary

Thread-count benchmarks show Decode speed saturating across th=13–16. The theoretical bandwidth of 12-channel DDR5-6400 is 614.4 GB/s, and effective bandwidth is already saturated by roughly 13 threads: a direct indication that token generation is memory-bound, consistent with the standard picture of LLM inference.
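
A simplistic roofline makes the bound concrete: theoretical bandwidth divided by bytes moved per token gives a decode ceiling. Effective bandwidth is well below theoretical, and decode also streams KV cache, so the measured 11.7–12.9 tok/s sits under this ceiling as expected:

```shell
# Roofline sketch: decode ceiling = bandwidth / bytes-per-token.
awk 'BEGIN {
  bw_gbs    = 12 * 6400e6 * 8 / 1e9;  # 12ch x DDR5-6400 x 8 bytes = 614.4 GB/s
  gb_per_tok = 16;                    # ~16 GB of Q4 weights read per token
  printf "theoretical bandwidth: %.1f GB/s\n", bw_gbs;
  printf "decode ceiling:        %.1f tok/s\n", bw_gbs / gb_per_tok;  # ~38.4
}'
```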

Zen 5 AVX-512 Contribution

Zen 5 implements a full 512-bit datapath, unlike the double-pumped 256-bit AVX-512 of the previous generation. Native BF16 support and AVX-512 VNNI instructions keep Q4_K_S dequantization and dot products fed even at the 5.0 GHz boost clock. The 24.43 tok/s Prefill figure is largely attributable to this.

Lessons Learned

The initial hypothesis that “experts fit in L3” was naive—basic arithmetic should have caught that. But the directional insight was correct: MoE access patterns have locality that L3 can exploit, unlike Dense models.

Thread count optimization at th=13 turned out to be the operational sweet spot: 90% of maximum inference performance while freeing 3 cores for Dagster, Trino, and network IO. This directly enables stable operation as a batch processing server.

Sustaining 9.34 tok/s at 128K context exceeded expectations. MLA (Multi-head Latent Attention) in Kimi-K2.5 compresses KV cache effectively enough that long contexts do not cause catastrophic slowdowns.

Reproduction Steps

1. Hardware Requirements

  • AMD EPYC 9175F server (768GB DDR5-6400 recommended)
  • Storage: ~600GB for model files (NVMe recommended)

2. Model Download

  huggingface-cli download unsloth/Kimi-K2.5-GGUF \
  --include "Q4_K_S/*" \
  --local-dir /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF
  

3. llama.cpp Container Build (Zen 5 Optimized)

Build with AVX-512 VNNI / BF16 enabled: compile with -march=znver5, or -DGGML_NATIVE=ON when building on the target host.

4. Run the Server

  podman run --rm -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v /mnt/data/hf/hub/models--unsloth--Kimi-K2.5-GGUF:/models:Z \
  compute.home.arpa/llamacpp-zen5:latest \
  -m /models/snapshots/386fed8b054275941d6a495a9a7010fbf31b560d/Q4_K_S/Kimi-K2.5-Q4_K_S-00001-of-00013.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
  --ctx-size 131072 --parallel 1 --threads 13 --threads-batch 13 \
  --batch-size 2048 --ubatch-size 512 --jinja --host 0.0.0.0 --port 8080
  

5. Verify

  curl -s http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"kimi-k2.5","messages":[{"role":"user","content":"Hello"}],"max_tokens":100}'
  

Technical Notes

SMT (Hyper-Threading) Should Be OFF

With SMT=ON, the 9175F exposes 32 logical threads, but effective L3 utilization per thread drops. Using the 16 physical cores directly yields more stable MoE inference. Verify with /sys/devices/system/cpu/smt/active returning 0.

L3 Benefits Diminish with Dense 70B-Class Models

Dense models have uniform access patterns that do not benefit from L3 caching the way MoE’s expert-selection locality does. This configuration should be considered MoE-specific.

Prompt Cache Is the Real Weapon for CPU Inference

CPU’s weakness in Prefill speed can be compensated by aggressive prompt caching. For Dagster pipelines processing thousands of documents, caching the shared System Prompt dramatically reduces total execution time.

Cost Comparison

| Platform | Configuration | Est. Decode Speed | Hardware Cost |
|---|---|---|---|
| AMD EPYC 9175F | 1x CPU, 768GB RAM | 10–13 tok/s | ~$15k |
| Mac Studio M3 Ultra | 2x units (512GB) | ~21 tok/s | ~$20k |
| NVIDIA GPU Cluster | 4x H200 | 40+ tok/s | $150k–$200k |

Less than 1/10 the cost of a GPU cluster for a “working” environment. Not suitable for interactive chat, but sufficient for nightly batch processing and dataset generation.