Background

Llama-4-Scout is Meta’s MoE model with 16 experts and 17B active parameters. Compared to Maverick’s 128 experts, the much smaller expert pool is structurally friendlier to CPU caches.

The same model was tested on both CPU inference (Q6_K / llama.cpp) and GPU inference (nvfp4 / vLLM), comparing speed, quality, and memory behavior. The impression that “CPU LLM is slow” may be influenced by cold-start behavior; steady-state performance after cache warming tells a different story.

Objective

  1. Measure CPU Q6_K steady-state speed (after cache warming)
  2. Record GPU nvfp4 Prefill/Decode speed and VRAM allocation
  3. Quantify mmap page cache and prompt cache effectiveness
  4. Establish where the ~100K context ceiling actually lies

Test Environment

| Item   | Specification                                   |
|--------|-------------------------------------------------|
| CPU    | AMD EPYC 9175F (Zen 5, 16C, L3 512MB)           |
| Memory | DDR5-6400 768GB (12ch)                          |
| GPU    | NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM) |
| OS     | Ubuntu 24.04 LTS                                |

Two Configurations

| Config | Runtime        | Quantization            | Placement | ctx   |
|--------|----------------|-------------------------|-----------|-------|
| A: CPU | llama.cpp      | Q6_K (3-split GGUF)     | All CPU   | 8,192 |
| B: GPU | vLLM 0.14.0rc1 | nvfp4 (NVIDIA ModelOpt) | All GPU   | ~110K |

Results

Config A: CPU Q6_K (th=16, ctx=8K)

| Metric       | Measured        | Notes                                     |
|--------------|-----------------|-------------------------------------------|
| PP (Prefill) | ~40 tok/s       | Stable even with 2.5k-5k token prompts    |
| TG (Decode)  | 16.3-17.5 tok/s | Uniform 16-core load, temp stable at 64°C |

The first run is extremely slow due to mmap page faults plus a full Prefill. From the second run onward, the OS page cache and the prompt cache bring it to steady-state speed; the prompt cache’s LCP (Longest Common Prefix) match stays at f_keep > 0.7.
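
The reuse fraction can be sketched as a longest-common-prefix ratio. The token IDs below are illustrative placeholders, not real tokenizer output, and `f_keep` here is this article's notation rather than a llama.cpp API:

```python
# Sketch: prompt-cache reuse as the longest-common-prefix (LCP)
# fraction between the cached token sequence and the new prompt.
# Token IDs are illustrative, not real tokenizer output.

def f_keep(cached: list[int], new: list[int]) -> float:
    """Fraction of the new prompt covered by the cached prefix."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n / len(new) if new else 0.0

# A fixed system prompt (here, the first 800 tokens) shared across
# runs keeps the LCP long; only the trailing user turn differs.
cached = list(range(800)) + [1, 2, 3]
new    = list(range(800)) + [9, 8, 7, 6]
print(f_keep(cached, new))  # 800/804, comfortably above 0.7
```

This is why pinning the system prompt matters: any change near the head of the prompt shortens the LCP and drags f_keep toward zero.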

Config B: GPU nvfp4 (vLLM, Blackwell 96GB)

| Metric       | Range           | Peak        | Notes                       |
|--------------|-----------------|-------------|-----------------------------|
| PP (Prefill) | 700-1,200 tok/s | 1,370 tok/s | Peak reached on long prompts |
| TG (Decode)  | 30-60 tok/s     | 112 tok/s   | Stable post-warmup          |

Prefix Cache Hit Rate: 10-48%. At 48%, TTFT drops dramatically.
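
A back-of-envelope sketch of why the hit rate moves TTFT so much, assuming TTFT is dominated by prefill of the uncached suffix. The 20,000-token prompt is a hypothetical; 1,200 tok/s is the upper end of the measured prefill range:

```python
# Rough model: TTFT ≈ uncached_tokens / prefill_rate.
# Prompt size is hypothetical; hit rates are the observed endpoints.

def ttft_s(prompt_tokens: int, hit_rate: float, prefill_tok_s: float) -> float:
    """Seconds until first token, ignoring everything but prefill."""
    uncached = prompt_tokens * (1.0 - hit_rate)
    return uncached / prefill_tok_s

for hit in (0.10, 0.48):
    print(f"hit={hit:.0%}: TTFT ≈ {ttft_s(20_000, hit, 1200):.1f} s")
```

At these numbers the hit-rate jump cuts TTFT from 15.0 s to about 8.7 s, before accounting for any other overheads.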

VRAM Breakdown

| Item               | Usage           |
|--------------------|-----------------|
| Model weights      | ~63.5 GiB       |
| Available KV cache | ~20.2 GiB       |
| Max context length | ~110,256 tokens |

With BF16-equivalent KV management, ~100K tokens is the single-GPU ceiling. 256K+ requires FP8 KV or TP=2.
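
The ~110K figure can be reproduced from the 20.2 GiB budget. The architecture numbers below (48 layers, 8 KV heads, head_dim 128) are assumptions taken from Scout's published config, not something measured in this test:

```python
# Per-token KV cost at BF16, under assumed Scout geometry:
# 48 layers, 8 KV heads, head_dim 128 (assumptions, see above).
layers, kv_heads, head_dim = 48, 8, 128
bytes_per_elem = 2                                   # BF16
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K + V

budget = 20.2 * 2**30                                # reported KV budget
max_tokens = budget / kv_bytes_per_token
print(f"{kv_bytes_per_token // 1024} KiB/token -> ~{max_tokens:,.0f} tokens")
```

This lands at 192 KiB per token and roughly 110K tokens, within a fraction of a percent of the ~110,256 vLLM reports.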

Comparison Summary

| Config       | TG (tok/s) | VRAM  | Use Case                           |
|--------------|------------|-------|------------------------------------|
| A: CPU Q6_K  | 16-17      | 0     | Batch summarization, F&F pipeline  |
| B: GPU nvfp4 | 30-60      | ~64GB | aider, interactive code generation |

Analysis

CPU Cache Strategy Effectiveness

“CPU LLM is slow” is a verdict drawn from first-run behavior only. With correct mmap and prompt cache design, 17 tok/s is sustained in steady state. Key factors:

  • --mlock to lock all weights in physical memory
  • Fix System Prompt to maximize LCP match rate
  • Avoid process restarts (preserve OS page cache)

16-Expert Structural Advantage

Compared to Maverick’s 128 experts, Scout’s 16 experts mean each token’s computation draws from a much smaller expert pool. On the EPYC 9175F’s one-core-per-CCD layout, the fewer distinct active experts are more likely to stay resident in the 512MB L3, reducing dependency on memory bandwidth.
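
A deliberately crude way to see the locality argument: if routing were uniform top-1, the probability that the next token hits the expert already warm in cache scales as 1/num_experts. Real routers are strongly skewed, which only improves on this floor, so treat the numbers as illustrative:

```python
# Toy locality model: with uniform top-1 routing over E experts,
# the chance the next token reuses the cache-resident expert is 1/E.
# Real routing is non-uniform, so actual locality is better than this.

for experts in (16, 128):
    p_reuse = 1 / experts
    print(f"{experts:3d} experts -> P(reuse) = {p_reuse:.2%}")
```

The 8x gap between the two pool sizes is the structural advantage the section describes, independent of model size.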

nvfp4 Quality

Django/Python design task quality remains practical even at FP4. Instruction following and Japanese proficiency are stable. Mathematically rigorous reasoning is slightly weaker, but acceptable for boilerplate generation and logic reviews.

Lessons Learned

The gap between CPU 17 tok/s and GPU 30-60 tok/s is clearly perceptible. But CPU’s 17 tok/s is fully practical for non-interactive pipelines, freeing the GPU for other models.

GPU nvfp4’s 1,370 tok/s Prefill is overwhelming. In aider iterative refactoring, short TTFT directly translates to development velocity. At 48% Prefix Cache hit, responses are near-instant.

The 16-expert layout is “cache-friendlier” than Maverick’s 128 experts. The tok/s difference in CPU inference (Scout 17 vs Maverick 21-24) is dominated by model size rather than expert count, but Scout shows more predictable behavior in terms of stability.

Reproduction Steps

CPU Q6_K

  numactl --cpunodebind=0 --membind=0 \
  podman run --rm -it \
  -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 \
  -m /models/Scout-17B-Q6_K.gguf \
  --threads 16 --threads-batch 16 \
  --batch-size 2048 --ubatch-size 512 \
  --mlock --ctx-size 8192 --flash-attn on \
  --parallel 1 --jinja
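
A minimal client for the server started above, sketched with stdlib urllib; port 8081 matches the `-p 8081:8080` mapping, and keeping the system message byte-identical across calls is what keeps the prompt-cache LCP long. The request is built but not sent here, since it needs the server running:

```python
import json
import urllib.request

# llama.cpp's server (started with --jinja) exposes an
# OpenAI-compatible /v1/chat/completions endpoint on mapped port 8081.
payload = {
    "messages": [
        # Byte-identical system prompt across calls -> long LCP match.
        {"role": "system", "content": "You are a code reviewer."},
        {"role": "user", "content": "Summarize the attached diff."},
    ],
    "temperature": 0.2,
}
req = urllib.request.Request(
    "http://localhost:8081/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:  # uncomment with server running
#     print(json.load(resp)["choices"][0]["message"]["content"])
```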
  

GPU nvfp4 (vLLM)

  vllm serve nvidia/Llama-4-Scout-17B-16E-Instruct-NVFP4 \
  --dtype auto --gpu-memory-utilization 0.88 \
  --max-num-seqs 1 --max-model-len 110000 \
  --enable-prefix-caching --trust-remote-code
  

Technical Notes

When Cache Stops Working

  • Changing even a single character in prompt-head templates or tool definitions invalidates prompt cache, reverting to full Prefill
  • Process restarts purge OS page cache, triggering cold-start I/O waits again
  • Cache capacity limits cause old prompts to be evicted

100K Context Boundary

With nvfp4 model (63.5 GiB) on a single 96GB GPU, KV cache gets ~20 GiB. At BF16 KV precision, ~110K tokens is the physical limit. Handling 256K/512K requires FP8 KV + TP=2 expansion.
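
Under the same assumed geometry as the single-GPU calculation (48 layers, 8 KV heads, head_dim 128 — assumptions from Scout's published config, not measured here), halving KV precision to FP8 doubles the ceiling for a fixed budget:

```python
# Context ceiling vs KV precision, for a fixed 20.2 GiB KV budget.
# Geometry (48 layers, 8 KV heads, head_dim 128) is an assumption.

def max_ctx(kv_budget_gib: float, bytes_per_elem: int) -> int:
    per_token = 2 * 48 * 8 * 128 * bytes_per_elem    # K + V bytes/token
    return int(kv_budget_gib * 2**30 / per_token)

print(max_ctx(20.2, 2))   # BF16 KV: ~110K tokens
print(max_ctx(20.2, 1))   # FP8 KV:  ~220K tokens
```

FP8 KV alone reaches ~220K on the same budget; TP=2 then grows the budget itself by sharding the 63.5 GiB of weights across two GPUs, which is what puts 256K+ in reach.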