Llama-4-Scout-17B-16E Measured: CPU Q6_K 17 tok/s vs GPU nvfp4 60 tok/s, Cache Strategy and 100K Context Boundary
Llama-4-Scout (17B active / 16-expert MoE) benchmarked on an EPYC 9175F (CPU, Q6_K) and an RTX PRO 6000 Blackwell Max-Q (GPU, nvfp4). CPU decode ran at about 17 tok/s versus 30-60 tok/s on the GPU. The tests also validate the mmap cache strategy, prompt cache effectiveness, and the 100K context boundary.
Background
Llama-4-Scout is Meta’s MoE model with 16 experts and 17B active parameters. Compared to Maverick’s 128 experts, fewer experts mean structurally better CPU cache efficiency.
The same model was tested on both CPU inference (Q6_K / llama.cpp) and GPU inference (nvfp4 / vLLM), comparing speed, quality, and memory behavior. The impression that “CPU LLM is slow” may be shaped by cold-start behavior; steady-state performance after cache warming tells a different story.
What I wanted to verify was whether CPU-only llama.cpp deployment could handle large-prefill oneshot summarization and fire-and-forget style idempotent pipelines in a way that is actually practical. In parallel, I evaluated whether nvfp4 on a Blackwell GPU could serve as a genuine daily-use local model.
Objective
- Measure CPU Q6_K steady-state speed (after cache warming)
- Record GPU nvfp4 Prefill/Decode speed and VRAM allocation
- Quantify mmap page cache and prompt cache effectiveness
- Establish where the ~100K context boundary actually lies on a single GPU
Test Environment
| Item | Specification |
|---|---|
| CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| Memory | DDR5-6400 768GB (12ch) |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM) |
| OS | Ubuntu 24.04 LTS |
Two Configurations
| Config | Runtime | Quantization | Placement | ctx |
|---|---|---|---|---|
| A: CPU | llama.cpp | Q6_K (3-split GGUF) | All CPU | 8,192 |
| B: GPU | vLLM 0.14.0rc1 | nvfp4 (NVIDIA ModelOpt) | All GPU | ~110K |
Results
Config A: CPU Q6_K (th=16, ctx=8K)
| Metric | Measured | Notes |
|---|---|---|
| PP (Prefill) | ~40 tok/s | Stable even with 2.5k-5k token prompts |
| TG (Decode) | 16.3-17.5 tok/s | Uniform 16-core load, temperature stable at 64 °C |
The first run is extremely slow due to mmap page faults + full Prefill. The model loads through mmap, so weights are not eagerly materialized at process start. When a large prompt arrives, page faults, full prompt eval, and KV cache initialization all hit simultaneously.
From the second run onward, the OS page cache and the prompt cache are warm, and speed reaches steady state. The prompt cache matches on the LCP (Longest Common Prefix), and the logs show f_keep > 0.7. For repeated jobs with a stable front half, later runs avoid the full prefill cost.
Prefill does not disappear – it still happens every run. What changes is whether the system has to do a full prefill every time. With stable leading context, the cost drops to delta-only processing.
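The delta-only idea can be illustrated with a small sketch. This is not llama.cpp's implementation; the helper names and the token IDs are hypothetical, and it only shows how a shared leading prefix reduces the number of tokens that still need prompt evaluation.

```python
from typing import Sequence


def common_prefix_len(cached: Sequence[int], new: Sequence[int]) -> int:
    """Number of leading tokens shared by the cached prompt and the new prompt."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def tokens_to_prefill(cached: Sequence[int], new: Sequence[int]) -> int:
    """Tokens that still need prompt evaluation after reusing the shared prefix."""
    return len(new) - common_prefix_len(cached, new)


# Hypothetical token IDs: a 4,000-token stable head plus a per-job tail.
head = list(range(4000))
job1 = head + [9001] * 1000
job2 = head + [9002] * 800

reused = common_prefix_len(job1, job2)
remaining = tokens_to_prefill(job1, job2)
print(reused, remaining)  # 4000 800
```

At the ~40 tok/s prefill measured above, evaluating 800 tokens instead of 4,800 would take roughly 20 s instead of 120 s, which is why a stable front half matters so much on CPU.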
Config B: GPU nvfp4 (vLLM, Blackwell 96GB)
| Metric | Range | Peak | Notes |
|---|---|---|---|
| PP (Prefill) | 700-1,200 tok/s | 1,370 tok/s | Evident in long prompts |
| TG (Decode) | 30-60 tok/s | 112 tok/s | Stable post-warmup |
Prefix Cache Hit Rate: 10-48%. At 48%, TTFT drops dramatically.
GPU prefill ran roughly 17-30x faster than the CPU configuration (700-1,200 tok/s vs ~40 tok/s), and the 1,370 tok/s peak translates directly to development velocity in aider iterative refactoring. Output quality held up at FP4: technical Japanese and Django/Python design tasks remained practical, instruction following was stable, and degradation was nearly imperceptible in design review and implementation support.
VRAM Breakdown
| Item | Usage |
|---|---|
| Model weights | ~63.5 GiB |
| Available KV Cache | ~20.2 GiB |
| Max context length | ~110,256 tokens |
With BF16-equivalent KV management, ~100K tokens is the single-GPU ceiling. 256K+ requires FP8 KV or TP=2. Even with 96GB, ultra-long context at 256K/512K is not automatic – model residency and KV cache compete for the same VRAM.
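The ~110K figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes Scout's attention geometry is 48 layers, 8 KV heads, and a head dimension of 128; these values are not stated in this article and should be checked against the model's config. It also assumes every layer keeps a full-length KV cache.

```python
GIB = 2**30

# Assumed Llama-4-Scout attention geometry (not taken from this article).
NUM_LAYERS = 48
NUM_KV_HEADS = 8
HEAD_DIM = 128


def kv_bytes_per_token(bytes_per_elem: int) -> int:
    """Bytes for the K and V tensors of one token across all layers."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_elem


def max_tokens(kv_budget_gib: float, bytes_per_elem: int) -> int:
    """How many tokens fit in the given KV cache budget."""
    return int(kv_budget_gib * GIB // kv_bytes_per_token(bytes_per_elem))


bf16_tokens = max_tokens(20.2, 2)  # 2 bytes per element
fp8_tokens = max_tokens(20.2, 1)   # 1 byte per element
print(kv_bytes_per_token(2) // 1024, "KiB per token at BF16")
print(bf16_tokens, fp8_tokens)
```

Under these assumptions the BF16 result lands close to the ~110K tokens reported in the table, and halving the element size roughly doubles the capacity for the same 20.2 GiB budget.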
Comparison Summary
| Config | TG (tok/s) | VRAM | Use Case |
|---|---|---|---|
| A: CPU Q6_K | 16-17 | 0 | Batch summarization, fire-and-forget pipeline |
| B: GPU nvfp4 | 30-60 | ~64 GiB (weights) | aider, interactive code generation |
Analysis
CPU Cache Strategy Effectiveness
“CPU LLM is slow” is a limited verdict based on first-run behavior. With correct mmap and prompt cache design, 17 tok/s is stable in steady state. Key factors:
- Use --mlock to lock all weights in physical memory
- Fix System Prompt to maximize LCP match rate
- Avoid process restarts (keeps the in-process prompt cache and keeps the weights resident)
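One way to take the page-fault cost out of the first request is to read the model file once before starting the server, so the kernel already holds it in the page cache. The sketch below is an assumption about how one might do this, not the procedure used for these measurements; the function name is hypothetical and the demo runs on a small temporary file rather than a real GGUF shard.

```python
import os
import tempfile


def warm_page_cache(path: str, chunk_size: int = 16 * 1024 * 1024) -> int:
    """Read a file once, sequentially, so the kernel pulls it into the page cache.

    Returns the number of bytes read. The data itself is discarded.
    """
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


# Demo on a small temporary file; in practice the path would be each GGUF shard.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (1024 * 1024))
    demo_path = tmp.name

bytes_read = warm_page_cache(demo_path)
os.remove(demo_path)
print(bytes_read)
```

Whether the pages stay resident afterwards depends on free memory and other I/O on the host, so this complements --mlock rather than replacing it.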
If the evaluation axis is not limited to interactive latency, this configuration holds up well. For oneshot summarization and idempotent batch processing with reusable prefixes, CPU-only operation is genuinely practical.
16-Expert Structural Advantage
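Compared to Maverick’s 128 experts, Scout’s 16 experts form a much smaller expert pool, so successive tokens route to a smaller total set of weights. On the EPYC 9175F, where each CCD exposes a single core with its own L3 slice, this smaller pool should keep more of the working set resident in L3 cache and reduce dependence on memory bandwidth.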
The tok/s difference in CPU inference (Scout 17 vs Maverick 21-24) is dominated by model size rather than expert count, but Scout shows more predictable behavior in terms of stability.
nvfp4 Quality
Django/Python design task quality remains practical even at FP4. Instruction following and Japanese proficiency are stable. Mathematically rigorous reasoning is slightly weaker, but acceptable for boilerplate generation and logic reviews. The model stays coherent when asked to explain backend code structure in Japanese.
Lessons Learned
The gap between CPU 17 tok/s and GPU 30-60 tok/s is clearly perceptible. But CPU’s 17 tok/s is fully practical for non-interactive pipelines, freeing the GPU for other models.
GPU nvfp4’s 1,370 tok/s Prefill is overwhelming. In aider iterative refactoring, short TTFT directly translates to development velocity. At 48% Prefix Cache hit, responses are near-instant.
A split where CPU handles stable large-prefix batch jobs and GPU handles low-latency interactive work feels like the most realistic operational model.
Reproduction Steps
CPU Q6_K
# MO: host directory holding the GGUF files; IMG: llama.cpp server container image (both set beforehand)
numactl --cpunodebind=0 --membind=0 \
  podman run --rm -it \
  -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z "$IMG" \
  --host 0.0.0.0 --port 8080 \
  -m /models/Scout-17B-Q6_K.gguf \
  --threads 16 --threads-batch 16 \
  --batch-size 2048 --ubatch-size 512 \
  --mlock --ctx-size 8192 --flash-attn on \
  --parallel 1 --jinja
GPU nvfp4 (vLLM)
vllm serve nvidia/Llama-4-Scout-17B-16E-Instruct-NVFP4 \
--dtype auto --gpu-memory-utilization 0.88 \
--max-num-seqs 1 --max-model-len 110000 \
--enable-prefix-caching --trust-remote-code
Technical Notes
When Cache Stops Working
- Changing even a single character in prompt-head templates or tool definitions invalidates prompt cache, reverting to full Prefill
- Process restarts drop the in-process prompt cache, and if the OS page cache has been evicted in the meantime (reboot or memory pressure), cold-start I/O waits return
- Cache capacity limits cause old prompts to be evicted
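Because a one-character change at the head is enough to miss the cache, it can help to build the reusable head deterministically on the client side and log a fingerprint of it per job. The following is a minimal sketch under that assumption; the function names, system prompt, and tool definition are hypothetical, and the final token sequence still depends on the server's chat template.

```python
import hashlib
import json


def build_prompt_head(system_prompt: str, tools: list[dict]) -> str:
    """Serialize the reusable head deterministically: sorted keys, no timestamps."""
    tools_json = json.dumps(
        tools, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return f"{system_prompt}\n{tools_json}\n"


def head_fingerprint(head: str) -> str:
    """Short hash to log per job; a change means the cached prefix will be missed."""
    return hashlib.sha256(head.encode("utf-8")).hexdigest()[:12]


# Hypothetical system prompt and tool definition, for illustration only.
SYSTEM = "You summarize documents in Japanese."
tools_a = [{"name": "fetch", "params": {"url": "string", "timeout": "int"}}]
# Same content as tools_a, but with a different key order.
tools_b = [{"params": {"timeout": "int", "url": "string"}, "name": "fetch"}]

head_a = build_prompt_head(SYSTEM, tools_a)
head_b = build_prompt_head(SYSTEM, tools_b)
print(head_fingerprint(head_a) == head_fingerprint(head_b))  # True
```

Comparing the fingerprint between runs gives a cheap early warning that a template or tool edit is about to force a full prefill.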
100K Context Boundary
With the nvfp4 model (63.5 GiB) on a single 96GB GPU, the KV cache gets ~20 GiB. At BF16 KV precision, ~110K tokens is the limit for this configuration. Handling 256K/512K requires FP8 KV, TP=2, or both.
