Llama-4-Maverick-17B-128E CPU Inference: Q4_K_M vs Q8_0 Speed-Quality Trade-off Measured
Llama-4-Maverick (17B active / 128-expert MoE) CPU inference on EPYC 9175F, comparing Q4_K_M and Q8_0. Q4 delivers 21-24 tok/s, Q8 delivers 15-16 tok/s. Quantization selection criteria for MoE CPU inference on 768GB RAM.
Background
Llama-4-Maverick is Meta’s MoE model with 128 experts. While the total parameter count is massive, only 17B parameters are active per token, making CPU inference viable despite the model’s apparent scale.
On EPYC 9175F with 768GB DDR5, both Q4_K_M (~290GB) and Q8_0 (~426GB) fit in physical memory. The question was which to choose. I compared both on identical hardware and tasks to establish per-use-case selection criteria.
Running models with hundreds of billions of parameters locally quickly runs into the GPU VRAM wall. I used an AMD EPYC 9175F with 768GB of high-bandwidth DDR5 memory to measure both Q4_K_M and Q8_0 of Llama-4-Maverick-17B-128E-Instruct through llama.cpp.
As trillion-scale MoE models keep appearing, CPU inference is getting renewed attention. EPYC (Zen 5 / Turin) processors bring 12-channel DDR5 memory bandwidth and large L3 caches, offering a GPU-independent inference path for large sparse models.
Objective
- Compare Q4_K_M and Q8_0 Prefill/Decode speeds on identical hardware
- Quantify TTFT (Time To First Token) difference
- Evaluate output quality difference on practical tasks
- Establish selection criteria for batch processing vs aider workflows
Test Environment
| Item | Specification |
|---|---|
| CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| Memory | DDR5-6400 768GB (12ch) |
| GPU | Not used (CPU inference) |
| OS | Ubuntu 24.04 LTS |
| Runtime | llama.cpp server (Podman rootless) |
| Threads | 14 |
| Context | 16,384 |
Thread count is set to 14 on the 16-core processor. This leaves two cores for the OS and network I/O and keeps the run just below the point where memory bandwidth saturates, which made operation more stable.
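For context on the bandwidth ceiling mentioned above, the theoretical peak of this memory configuration can be worked out from the table. This is a minimal sketch that assumes a 64-bit (8-byte) data path per DDR5 channel; sustained bandwidth in practice is lower than this peak, and the function name is only illustrative.

```python
# Theoretical peak memory bandwidth for the test system (DDR5-6400, 12 channels).
# Assumption: each channel moves 8 bytes per transfer (64-bit data path).

def peak_bandwidth_gbs(mega_transfers_per_s: int, channels: int,
                       bytes_per_transfer: int = 8) -> float:
    """Return the theoretical peak bandwidth in GB/s (decimal gigabytes)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9

peak = peak_bandwidth_gbs(6400, 12)
print(f"Theoretical peak: {peak:.1f} GB/s")
```

For the configuration in the table this works out to 614.4 GB/s, which is the upper bound that any weight-streaming workload on this machine is working against.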
Model Specifications
| Item | Q4_K_M | Q8_0 |
|---|---|---|
| Architecture | Llama-4-Maverick (MoE, 128E) | Same |
| Model Size | ~290GB | ~426GB |
| Quantization | 4-bit mixed | 8-bit |
Results
Speed Comparison
PP is prompt processing (Prefill) and TG is token generation (Decode), both in tokens per second.
| Metric | Q4_K_M | Q8_0 | Delta |
|---|---|---|---|
| PP(tok/s) | 65-68 | 50-52 | Q4 ~30% faster |
| TG(tok/s) | 21-24 | 15-16 | Q4 ~40% faster |
| TTFT (800-1,100 tok prompt) | 12-17s | 16-20s | Q4 ~3-4s faster |
Q4_K_M Measurements
| Prompt(tok) | PP time(s) | PP(tok/s) | Gen(tok) | TG(tok/s) |
|---|---|---|---|---|
| 819 | 12.0 | 68.4 | 165 | 23.7 |
| 1,101 | 16.8 | 65.4 | 814 | 21.3 |
Q8_0 Measurements
| Prompt(tok) | PP time(s) | PP(tok/s) | Gen(tok) | TG(tok/s) |
|---|---|---|---|---|
| 819 | 15.6 | 52.4 | 104 | 16.6 |
| 1,000 | 19.7 | 50.8 | 916 | 15.2 |
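The percentage deltas in the comparison table can be recomputed from the raw runs. The sketch below averages the two runs per quantization and takes the ratio; the variable and function names are illustrative, and because the runs used different prompt and generation lengths, the result is a rough comparison rather than a controlled one.

```python
# Measured runs copied from the two tables above:
# (prompt tokens, PP tok/s, generated tokens, TG tok/s)
Q4_RUNS = [(819, 68.4, 165, 23.7), (1101, 65.4, 814, 21.3)]
Q8_RUNS = [(819, 52.4, 104, 16.6), (1000, 50.8, 916, 15.2)]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def speedup(q4_runs, q8_runs, column):
    """Ratio of the mean Q4_K_M rate to the mean Q8_0 rate for one column index."""
    return mean(r[column] for r in q4_runs) / mean(r[column] for r in q8_runs)

pp_speedup = speedup(Q4_RUNS, Q8_RUNS, 1)  # column 1: PP tok/s
tg_speedup = speedup(Q4_RUNS, Q8_RUNS, 3)  # column 3: TG tok/s
print(f"PP: Q4_K_M is {100 * (pp_speedup - 1):.0f}% faster")
print(f"TG: Q4_K_M is {100 * (tg_speedup - 1):.0f}% faster")
```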
Memory Behavior
- Q4_K_M: mmap + page cache loads only needed portions. Prompt cache ~180-380 MiB per entry
- Q8_0: Hundreds of GB in buff/cache. RSS appears small (normal mmap behavior)
- MoE (128E): only active experts are frequently accessed; unused pages stay cold
Model weights are loaded through mmap and page cache. Linux free output shows hundreds of GB in buff/cache, while the application’s RSS looks small. This is normal llama.cpp behavior.
Due to the MoE (128E) architecture, only active experts are frequently referenced per token. Unused weight pages stay cold, and memory access locality has a major impact on performance.
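To check this on a running system, the page cache size and the process's resident memory can be read from the Linux `/proc` files. This is a sketch under the assumption of a standard Linux `/proc` layout; the helper names are made up for illustration, and the PID has to be looked up separately.

```python
# Sketch: compare the system page cache with one process's resident memory on Linux.
# Reads the standard "Key:   12345 kB" text format of /proc/meminfo and /proc/<pid>/status.

def parse_kib_fields(text: str) -> dict:
    """Parse 'Key:  12345 kB' lines into {key: KiB}; other lines are skipped."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields

def kib_to_gib(kib: int) -> float:
    return kib / (1024 * 1024)

def report(pid: int) -> None:
    """Print the page cache size and the resident memory of process `pid`."""
    with open("/proc/meminfo") as f:
        mem = parse_kib_fields(f.read())
    with open(f"/proc/{pid}/status") as f:
        status = parse_kib_fields(f.read())
    print(f"Page cache (Cached):    {kib_to_gib(mem['Cached']):.1f} GiB")
    print(f"Process RSS (VmRSS):    {kib_to_gib(status['VmRSS']):.1f} GiB")
    # RssFile counts resident file-backed pages, which includes touched mmap'd weights.
    if "RssFile" in status:
        print(f"  file-backed (RssFile): {kib_to_gib(status['RssFile']):.1f} GiB")

# Example: report(12345)  # 12345 is a hypothetical PID of the llama.cpp server
```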
Analysis
Speed Gap Explained
Q4_K_M is ~68% of Q8_0’s model size (290GB vs 426GB). MoE inference reads selected expert weights from memory per token. Smaller quantization means less data per read, directly reducing memory bandwidth pressure.
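Compared on the same basis, the two gaps are close: Q8_0 is ~47% larger than Q4_K_M (426GB / 290GB ≈ 1.47), while Q4_K_M generates 40-43% faster. TG speed therefore tracks the model size ratio almost one-to-one, which is what a memory-bandwidth-bound Decode would predict. The small shortfall is plausibly per-token work that does not shrink with weight quantization, such as attention over the KV cache and expert routing.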
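To put the size gap and the speed gap on the same basis, both can be expressed as ratios. The sketch below assumes that file size is a fair proxy for the bytes of weights read per token (the same experts are active in either quantization), so a purely bandwidth-bound Decode would speed up by exactly the size ratio.

```python
# If Decode were bound only by the bytes of weights read per token, throughput
# would scale inversely with model size. Figures are the ones reported above.
Q4_SIZE_GB, Q8_SIZE_GB = 290, 426
Q4_TG = (23.7, 21.3)  # measured TG tok/s, one value per run
Q8_TG = (16.6, 15.2)

size_ratio = Q8_SIZE_GB / Q4_SIZE_GB                   # how many times larger Q8_0 is
tg_ratios = [q4 / q8 for q4, q8 in zip(Q4_TG, Q8_TG)]  # how many times faster Q4_K_M is

print(f"Q8_0 / Q4_K_M size ratio: {size_ratio:.2f}x")
for ratio in tg_ratios:
    share = 100 * ratio / size_ratio
    print(f"Measured TG speedup: {ratio:.2f}x ({share:.0f}% of the bandwidth-only prediction)")
```

The size ratio comes out near 1.47x and the measured TG speedups near 1.40-1.43x, so the measured gain sits slightly below what bandwidth alone would predict.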
Quality Difference in Practice
Q8_0 advantages:
- More stable context retention (less topic drift in long conversations)
- Fewer destructive edits (notably in aider code modifications)
- General sense of reliability
The gap is not dramatic. Code scaffolding and summarization work fine at Q4_K_M. The difference surfaces in complex repository operations and long-form generation. High-level comprehension and structure generation ability are equivalent – Q4_K_M rarely committed critical errors in aider code generation or document scaffolding.
Why Q8_0 Is Viable on 768GB RAM
Q8_0’s 426GB is impractical on typical servers. On 768GB DDR5, the model + KV cache + page cache all fit in memory. “Use Q8_0 when memory is abundant” is a rational choice specific to this environment.
Selection Criteria by Use Case
Q4_K_M vs Q8_0 selection is driven by task nature, not a simple speed-quality trade-off.
- High-throughput one-shot generation, Dagster/dbt batch jobs: Q4_K_M. The 40% speed difference compounds over volume
- Aider workflows, complex repo operations, long-form generation: Q8_0. Reduced destructive-edit risk outweighs the speed loss
- Daily on/off operation: Q4_K_M. The smaller file loads faster at startup
- Always-on resident: Q8_0. Quality stability pays off over long sessions
- Local knowledge processing: Both work for structuring/distilling sensitive internal data locally
The MoE architecture is what makes both quantization levels practical on CPU. Only the active experts' weights are read per token, so memory bandwidth is used efficiently and even Q8_0 stays at a usable 15-16 tok/s.
Using Q4_K_M as the default and switching to Q8_0 for quality-critical phases is my current recommendation for on-premises LLM operation on this class of hardware.
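How much the speed difference compounds over a batch can be estimated with a simple latency model: prefill time plus decode time per request. The rates below are mid-range values from the Speed Comparison table; the workload (1,000 requests of 1,000 prompt tokens and 500 generated tokens) is hypothetical, and the model assumes sequential requests with no prompt-cache reuse.

```python
# Rough per-request latency model: prefill time plus decode time.
# Rates are mid-range values from the Speed Comparison table (tok/s).
RATES = {
    "Q4_K_M": {"pp": 66.5, "tg": 22.5},
    "Q8_0": {"pp": 51.0, "tg": 15.5},
}

def request_seconds(quant: str, prompt_tokens: int, gen_tokens: int) -> float:
    """Estimated seconds for one request at the given quantization."""
    rate = RATES[quant]
    return prompt_tokens / rate["pp"] + gen_tokens / rate["tg"]

def batch_hours(quant: str, n_requests: int, prompt_tokens: int, gen_tokens: int) -> float:
    """Estimated wall-clock hours for a sequential batch (one request at a time)."""
    return n_requests * request_seconds(quant, prompt_tokens, gen_tokens) / 3600

for quant in RATES:
    hours = batch_hours(quant, n_requests=1000, prompt_tokens=1000, gen_tokens=500)
    print(f"{quant}: {hours:.1f} h for 1,000 requests")
```

Under these assumptions the batch takes roughly 10 hours at Q4_K_M and roughly 14 hours at Q8_0, which is the kind of gap that matters for overnight jobs.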
Reproduction Steps
Q4_K_M
podman run --rm -it \
  -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 \
  -m /models/Llama-4-Maverick-17B-128E-Instruct-Q4_K_M.gguf \
  --jinja -c 16384 \
  --threads 14 --threads-batch 14 \
  -b 1024 -ub 256 \
  --parallel 1 --flash-attn on
Q8_0
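Same command, substituting the Q8_0 model file. In both cases, $MO is the host directory that holds the GGUF files and $IMG is the llama.cpp server container image; set both before running.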
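Once the server is up, PP and TG figures like those in the tables can be collected from a client. The sketch below posts to the llama.cpp server's `/completion` endpoint on the host port mapped above (8081) and reads the `timings` object from the response. The field names (`prompt_n`, `prompt_ms`, `predicted_n`, `predicted_ms`) are as I understand the server's response format and should be checked against your build; the helper names are illustrative.

```python
import json
import urllib.request

def summarize_timings(timings: dict) -> dict:
    """Convert a llama.cpp server `timings` object into the metrics used in the tables."""
    pp_seconds = timings["prompt_ms"] / 1000
    tg_seconds = timings["predicted_ms"] / 1000
    return {
        "prompt_tokens": timings["prompt_n"],
        "pp_seconds": pp_seconds,
        "pp_tok_per_s": timings["prompt_n"] / pp_seconds,
        "gen_tokens": timings["predicted_n"],
        "tg_tok_per_s": timings["predicted_n"] / tg_seconds,
    }

def measure(prompt: str, n_predict: int = 256,
            base_url: str = "http://localhost:8081") -> dict:
    """Send one completion request and return its speed metrics."""
    # cache_prompt is disabled so that PP time reflects a full prefill.
    body = json.dumps({"prompt": prompt, "n_predict": n_predict,
                       "cache_prompt": False}).encode()
    request = urllib.request.Request(
        f"{base_url}/completion", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return summarize_timings(json.load(response)["timings"])

# Example (requires the server started with the command above):
# print(measure("Explain mmap in two sentences."))
```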
Technical Notes
MoE mmap Behavior
Only a few of 128 experts are active per token. mmap page cache “warms” unevenly. Sustained single-topic processing hits the same experts repeatedly, improving cache hit rate. Frequent topic changes increase page faults.
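The locality effect can be illustrated with a toy simulation. This is not a measurement of llama.cpp or of the page cache: it models a small LRU cache of experts and compares evenly spread routing with routing where a few experts dominate. The cache size, the skew, and one expert lookup per token are all hypothetical choices made for the illustration.

```python
import random
from collections import OrderedDict

def hit_rate(weights, cache_slots, n_tokens=20000, seed=0):
    """Fraction of expert lookups served from an LRU cache holding `cache_slots` experts."""
    rng = random.Random(seed)
    experts = range(len(weights))
    cache = OrderedDict()
    hits = 0
    for _ in range(n_tokens):
        expert = rng.choices(experts, weights=weights)[0]
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)      # mark as most recently used
        else:
            cache[expert] = True
            if len(cache) > cache_slots:
                cache.popitem(last=False)  # evict the least recently used expert
    return hits / n_tokens

N_EXPERTS = 128
uniform = [1.0] * N_EXPERTS                         # routing spread evenly (frequent topic changes)
skewed = [1.0 / (i + 1) for i in range(N_EXPERTS)]  # a few experts dominate (sustained single topic)

uniform_rate = hit_rate(uniform, cache_slots=32)
skewed_rate = hit_rate(skewed, cache_slots=32)
print(f"uniform routing hit rate: {uniform_rate:.2f}")
print(f"skewed routing hit rate:  {skewed_rate:.2f}")
```

With evenly spread routing the hit rate stays near the cache's share of the experts (32 of 128), while skewed routing scores clearly higher. That is the same qualitative pattern as the warm-versus-cold behavior described above.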
Prompt Cache
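Prompt-cache reuse based on LCP (longest common prefix) similarity works with both Q4_K_M and Q8_0. With a fixed system prompt, the shared prefix does not have to be reprocessed, so TTFT on subsequent requests drops significantly.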
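The idea behind a longest-common-prefix match can be sketched in a few lines. This illustrates the concept only; llama.cpp's actual slot selection and cache logic may differ, and the token IDs below are hypothetical stand-ins for a tokenized system prompt and user messages.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def lcp_similarity(cached_tokens, new_tokens):
    """Share of the new prompt already covered by the cached prompt's prefix."""
    if not new_tokens:
        return 0.0
    return common_prefix_len(cached_tokens, new_tokens) / len(new_tokens)

system = list(range(100))                   # hypothetical token IDs of a fixed system prompt
first = system + [900, 901, 902]            # first request: system prompt + user message
second = system + [950, 951, 952, 953, 954] # second request: same system prompt, new message

similarity = lcp_similarity(first, second)  # 100 shared tokens out of 105
tokens_to_prefill = len(second) - common_prefix_len(first, second)
print(f"similarity: {similarity:.2f}, tokens left to prefill: {tokens_to_prefill}")
```

In this example only 5 of the 105 prompt tokens still need prefill, which is why a fixed system prompt shortens TTFT so much at 50-68 tok/s prompt processing.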
