Why DeepSeek-V3.2 Appears Slower Than Kimi-K2.5: Prompt Cache Mismatches and TG Bottleneck Analysis
Analyzing why DeepSeek-V3.2 decode speed plateaus at 14-15 tok/s in llama.cpp, traced to prompt cache mismatches and KV cache precision settings.
Background
After establishing stable 10-13 tok/s decode speeds with Kimi-K2.5 (1T MoE) on EPYC 9175F for batch processing, the next step was evaluating DeepSeek-V3.2 as a second-opinion model lane.
Running it in the same llama.cpp environment, prefill (PP) was fast at 50-100 tok/s, but decode (TG) was stuck at 14-15 tok/s and would not go higher. It appeared “slower” than Kimi-K2.5, but the root cause needed isolation: was it model capability, or operational configuration?
Objective
- Identify why DeepSeek-V3.2 decode speed stalls at 14-15 tok/s
- Quantify the impact of cache control and optimization flags from llama.cpp logs
- Prioritize improvement actions
Test Environment
- CPU: AMD EPYC 9175F (Zen 5, 16C)
- Memory: DDR5-6400 768GB (12ch)
- OS: Ubuntu 24.04 LTS
- Runtime: llama.cpp (server mode, fused_moe=1)
- Model: DeepSeek-V3.2 Speciale (MoE)
- KV Cache: f16 (default)
Results
Measured Inference Throughput
Values extracted from logs of 4 consecutive tasks:
| Task ID | PP (tok) | TG (tok) | Cumulative Tokens | PP Speed (tok/s) | TG Speed (tok/s) | PP Time (s) | TG Time (s) |
|---|---|---|---|---|---|---|---|
| 0 | 2,731 | 1,024 | 3,755 | 99.75 | 14.57 | 27.4 | 70.3 |
| 1026 | 857 | 1,024 | 5,636 | 74.47 | 15.21 | 11.5 | 67.3 |
| 2051 | 311 | 982 | 6,929 | 52.92 | 14.51 | 5.9 | 67.7 |
| 3034 | 4,865 | 1,024 | 12,818 | 100.22 | 14.22 | 48.5 | 72.0 |
PP varies between 50 and 100 tok/s depending on input size. TG, by contrast, sits in a tight 14.22-15.21 tok/s band regardless of input size or cumulative token count: a clear wall.
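The plateau is easy to verify by recomputing the speeds from the raw token counts and wall times in the table above (values below are copied straight from the table):

```python
# Recompute PP/TG speeds from the table's token counts and wall times,
# then check how tightly TG clusters compared to PP.
tasks = [
    # (pp_tok, tg_tok, pp_time_s, tg_time_s)
    (2731, 1024, 27.4, 70.3),
    (857,  1024, 11.5, 67.3),
    (311,   982,  5.9, 67.7),
    (4865, 1024, 48.5, 72.0),
]

pp_speeds = [pp / t for pp, _, t, _ in tasks]
tg_speeds = [tg / t for _, tg, _, t in tasks]

print("PP tok/s:", [round(s, 2) for s in pp_speeds])  # spans ~52-100, input-dependent
print("TG tok/s:", [round(s, 2) for s in tg_speeds])  # all within ~1 tok/s of each other
print("TG spread:", round(max(tg_speeds) - min(tg_speeds), 2))
```

The PP spread is roughly 48 tok/s while the TG spread is under 1 tok/s, which is what a hard resource ceiling looks like.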
Cache Mismatch in Logs
`Common part does not match fully → kv cache rm [p0, end)`
This line appeared on every task: the leading token sequence differed between requests, so the KV cache was discarded and the prompt re-prefilled each time.
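The mechanism is simple to model: the server can only reuse cached KV entries up to the longest common prefix (LCP) of the previous and current token sequences. A toy sketch with made-up token IDs:

```python
def common_prefix_len(prev_tokens, new_tokens):
    """Length of the shared leading token run. Only this part of the
    KV cache is reusable; everything after it must be re-prefilled."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Toy token IDs: same system prompt, but request B prepends a thinking tag,
# shifting every subsequent token.
req_a = [1, 100, 101, 102, 200, 201]       # <bos> system... user A
req_b = [1, 999, 100, 101, 102, 300, 301]  # <bos> <think-toggle> system... user B

reused = common_prefix_len(req_a, req_b)
print(f"reusable prefix: {reused} tokens")  # only <bos> survives
```

One token inserted at the front costs the entire cached prefix, which is exactly the failure mode the log line reports.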
Speculative Decoding Status
`no implementations specified for speculative decoding`
No draft model was configured, so speculative decoding was inactive.
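Had a draft model been configured, the potential gain can be estimated with the standard speculative-decoding model: a draft of k tokens, each accepted independently with probability a, yields (1 - a^(k+1)) / (1 - a) target tokens per verification pass. The acceptance rate and cost ratio below are illustrative assumptions, not measurements:

```python
def expected_tokens(a: float, k: int) -> float:
    """Mean target tokens per verification step when each of the k drafted
    tokens is accepted independently with probability a (geometric series
    from the speculative decoding literature)."""
    if a == 1.0:
        return k + 1.0
    return (1 - a ** (k + 1)) / (1 - a)

def speedup(a: float, k: int, c: float) -> float:
    """Rough speedup vs. plain decoding; c = draft-model cost as a
    fraction of the target model's per-token cost."""
    return expected_tokens(a, k) / (k * c + 1)

# Assumed numbers for illustration: 80% acceptance, 4-token drafts,
# draft model at 10% of the target's per-token cost.
print(round(speedup(0.8, 4, 0.10), 2))
```

Even a mediocre draft model can plausibly more than double perceived TG speed, which is why it appears as priority B in the table at the end.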
Analysis
What the 14-15 tok/s Wall Means
The near-constant decode speed across tasks points to a physical memory bandwidth limit. KV cache is stored at f16 precision, and as context grows, attention computation becomes memory-bandwidth-dominated.
Kimi-K2.5 was run with q8_0 KV cache quantization to reduce bandwidth pressure. Applying the same setting to DeepSeek-V3.2 should improve TG speed.
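A back-of-envelope check makes the wall plausible. Everything below except the memory configuration is an assumption: the active-parameter count is the DeepSeek-V3-family figure, and bytes-per-weight depends on the exact GGUF quantization used:

```python
# Rough memory-bandwidth ceiling for CPU decode on this box.
channels, bus_bytes, mt_per_s = 12, 8, 6400e6
peak_bw = channels * bus_bytes * mt_per_s  # ~614 GB/s theoretical

# Assumption: ~37B active parameters per token (DeepSeek-V3-family MoE)
# at roughly 4.5 bits/weight for a mid-size GGUF quant.
active_params = 37e9
bytes_per_tok = active_params * 4.5 / 8    # ~21 GB streamed per decoded token

ceiling = peak_bw / bytes_per_tok          # upper bound, ignores KV-cache reads
print(f"peak BW: {peak_bw/1e9:.0f} GB/s, ceiling: {ceiling:.1f} tok/s")
# At a realistic 50-60% of peak bandwidth this lands in the 14-18 tok/s
# range before counting KV-cache traffic, which is where q8_0 KV helps.
```

The observed 14-15 tok/s is consistent with this model, so the plateau is not a llama.cpp bug but the expected bandwidth ceiling for weights of this size.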
Impact of Prompt Cache Mismatches
With Kimi-K2.5, a fixed prefix (System Prompt + Knowledge Digest) maintained high LCP cache hit rates. The DeepSeek-V3.2 test had three issues:
- `<think>` tag inconsistency: Thinking Prompt presence varied per request, breaking leading token alignment
- System Prompt variance: templates were not locked down
- Context management difference: Kimi-K2.5 reused prior context; DeepSeek rebuilt context from scratch each time
The conclusion: “DeepSeek is slower” was largely “DeepSeek was tested without cache benefits.”
MoE Optimization Gap
While `fused_moe=1` is active in the logs, llama.cpp’s MoE implementation is more generic than the specialized kernels in vLLM or cloud services. Differences in expert-routing implementation likely account for part of the remaining gap.
Lessons Learned
This was a textbook “benchmark trap.” Same hardware, same runtime, but prompt cache management alone created a significant throughput difference. Initial assumption of “DeepSeek is slower than Kimi” turned out to be mostly an operational configuration issue.
The TG 14-15 tok/s plateau itself is explained by f16 KV cache settings and memory bandwidth. Applying the same q8_0 settings used for Kimi-K2.5 would likely have produced different results.
Reproduction Steps
1. Run DeepSeek-V3.2
```shell
podman run --rm -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v /path/to/deepseek-v3.2:/models:Z \
  compute.home.arpa/llamacpp-zen5:latest \
  -m /models/DeepSeek-V3.2-Speciale.gguf \
  --cache-type-k f16 --cache-type-v f16 --flash-attn on \
  --ctx-size 16384 --parallel 1 --threads 13 --threads-batch 13 \
  --batch-size 2048 --ubatch-size 512 --jinja --host 0.0.0.0 --port 8080
```
2. Improved Version (KV Cache Quantization + Fixed Prompt)
```shell
# Change KV cache to q8_0
--cache-type-k q8_0 --cache-type-v q8_0

# Enable prompt cache
--prompt-cache /tmp/deepseek-cache.bin
```
Additionally, lock down the System Prompt and `<think>` tag presence across all requests.
3. Measurement
Extract `S_PP` and `S_TG` from llama.cpp server logs. Compare before and after optimization.
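A minimal parser sketch for pulling per-request timings out of the server log. The regex targets llama.cpp’s `prompt eval time` / `eval time` timing lines; exact field names and spacing vary between builds, so treat the pattern as an assumption to adapt:

```python
import re

# Matches timing lines like:
#   prompt eval time =   27400.00 ms /  2731 tokens (  10.03 ms per token, 99.67 tokens per second)
#          eval time =   70300.00 ms /  1024 runs   (  68.65 ms per token, 14.57 tokens per second)
TIMING = re.compile(
    r"(prompt eval|eval) time\s*=\s*([\d.]+) ms /\s*(\d+)\s+\w+"
    r".*?([\d.]+) tokens per second"
)

def parse_timings(log_text: str):
    """Return a list of (phase, ms, n_tokens, tok_per_s) tuples."""
    out = []
    for m in TIMING.finditer(log_text):
        phase = "PP" if m.group(1) == "prompt eval" else "TG"
        out.append((phase, float(m.group(2)), int(m.group(3)), float(m.group(4))))
    return out

# Sample lines reconstructed from task 0 in the table above.
sample = (
    "prompt eval time =   27400.00 ms /  2731 tokens"
    "(  10.03 ms per token,    99.67 tokens per second)\n"
    "       eval time =   70300.00 ms /  1024 runs   "
    "(  68.65 ms per token,    14.57 tokens per second)\n"
)
print(parse_timings(sample))
```

Collect the tuples for each run, then compare the PP and TG distributions before and after the q8_0 and prompt-prefix changes.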
Technical Notes
Principles for Effective Prompt Caching
- Fix the leading token sequence: System Prompt → Fixed Context → Variable Parts, in strict order
- Keep Thinking mode consistent: If enabled, enable for all requests. Toggling per-request invalidates cache every time
- Align generation parameters: temperature and top_p do not change the token prefix itself, but fixing them keeps before/after measurements comparable
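As a concrete illustration of the first two principles, a hypothetical request builder that freezes everything before the variable part, so every request shares an identical leading token sequence (the names, constants, and `chat_template_kwargs` field are assumptions for illustration, not an existing API):

```python
# Hypothetical request builder: everything before `user_content` is frozen.
SYSTEM_PROMPT = "You are a code-review assistant."  # locked template
KNOWLEDGE_DIGEST = "Project conventions digest."    # fixed context block
THINKING = True                                     # same for ALL requests

def build_messages(user_content: str) -> list:
    return [
        {"role": "system", "content": SYSTEM_PROMPT + "\n\n" + KNOWLEDGE_DIGEST},
        {"role": "user", "content": user_content},
    ]

def build_request(user_content: str) -> dict:
    return {
        "messages": build_messages(user_content),
        # Toggling thinking per request would change the rendered prefix
        # and invalidate the cache, so it is a module-level constant.
        "chat_template_kwargs": {"enable_thinking": THINKING},
        "temperature": 0.2,  # fixed for comparability
    }

a, b = build_request("review diff A"), build_request("review diff B")
print(a["messages"][0] == b["messages"][0])  # identical leading block
```

Because only the final user message differs, the rendered prompt diverges as late as possible and the LCP cache hit covers the entire fixed prefix.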
Fair Comparison with Kimi-K2.5
- Match output token count, temperature, top_p, stop sequences, and stream settings exactly
- Unify thinking-token handling (logs show `Exclude reasoning tokens` for slot selection, but generation still occurs)
- Use identical hardware, thread count, and KV cache settings
Improvement Priority
| Priority | Action | Expected Impact | Implementation Cost |
|---|---|---|---|
| A | Fix prompt prefix consistency | Major PP reduction | Low (config change) |
| B | Enable Speculative Decoding | TG perceived speed gain | Medium (draft model selection) |
| C | Quantize KV cache to q8_0 | TG bandwidth relief | Low (flag change) |
| D | Standardize generation conditions | Fair comparison | Low (test design) |

