Llama-4-Maverick-17B-128E CPU Inference: Q4_K_M vs Q8_0 Speed-Quality Trade-off Measured

Background

Llama-4-Maverick is Meta’s MoE model with 128 experts. The total parameter count is roughly 400B, but only 17B parameters are active per token, which is what makes CPU inference viable despite the model’s scale.

On EPYC 9175F with 768GB DDR5, both Q4_K_M (~290GB) and Q8_0 (~426GB) fit in physical memory. The question was which to choose. I compared both on identical hardware and tasks to establish per-use-case selection criteria.

Running a model of this size locally quickly runs into the GPU VRAM wall. I used an AMD EPYC 9175F with 768GB of high-bandwidth DDR5 memory to measure both Q4_K_M and Q8_0 of Llama-4-Maverick-17B-128E-Instruct through llama.cpp.

As trillion-scale MoE models keep appearing, CPU inference is getting renewed attention. EPYC (Zen 5 / Turin) processors bring 12-channel DDR5 memory bandwidth and large L3 caches, offering a GPU-independent inference path for large sparse models.

Objective

Compare Q4_K_M and Q8_0 Prefill/Decode speeds on identical hardware
Quantify TTFT (Time To First Token) difference
Evaluate output quality difference on practical tasks
Establish selection criteria for batch processing vs aider workflows

Test Environment

Item	Specification
CPU	AMD EPYC 9175F (Zen 5, 16C, L3 512MB)
Memory	DDR5-6400 768GB (12ch)
GPU	Not used (CPU inference)
OS	Ubuntu 24.04 LTS
Runtime	llama.cpp server (Podman rootless)
Threads	14
Context	16,384

Thread count is set to 14 against a 16-core processor. This leaves headroom for OS and network IO, staying below the memory bandwidth saturation point for stable operation.
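For a sense of where the memory bandwidth ceiling sits, the theoretical peak of the memory configuration in the table can be worked out directly. This is a minimal sketch that assumes a 64-bit (8-byte) data path per channel and ignores all real-world efficiency losses; the function name is illustrative and sustained bandwidth will be lower.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int,
                        bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal units)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


# DDR5-6400 on 12 channels, as listed in the test environment table.
print(peak_bandwidth_gb_s(12, 6400))  # 614.4
```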

Model Specifications

Item	Q4_K_M	Q8_0
Architecture	Llama-4-Maverick (MoE, 128E)	Same
Model Size	~290GB	~426GB
Quantization	4-bit mixed	8-bit

Results

Speed Comparison

Metric	Q4_K_M	Q8_0	Delta
PP(tok/s)	65-68	50-52	Q4 ~30% faster
TG(tok/s)	21-24	15-17	Q4 ~40% faster
TTFT (800-1,100 tok)	12-17s	16-20s	Q4 ~3-4s faster

Q4_K_M Measurements

Prompt(tok)	PP time(s)	PP(tok/s)	Gen(tok)	TG(tok/s)
819	12.0	68.4	165	23.7
1,101	16.8	65.4	814	21.3

Q8_0 Measurements

Prompt(tok)	PP time(s)	PP(tok/s)	Gen(tok)	TG(tok/s)
819	15.6	52.4	104	16.6
1,000	19.7	50.8	916	15.2
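The Delta column in the speed comparison can be rechecked from the per-run rows. The sketch below pairs the runs in table order and computes the Q4_K_M / Q8_0 ratio for PP and TG; note that the second pair uses slightly different prompt lengths (1,101 vs 1,000 tokens), so that comparison is approximate. Variable and function names are illustrative.

```python
# (prompt tokens, PP tok/s, TG tok/s) per run, copied from the tables above.
Q4_RUNS = [(819, 68.4, 23.7), (1101, 65.4, 21.3)]
Q8_RUNS = [(819, 52.4, 16.6), (1000, 50.8, 15.2)]


def speedups(fast_runs, slow_runs):
    """Pair runs in order and return (PP ratio, TG ratio) for each pair."""
    return [
        (round(f_pp / s_pp, 2), round(f_tg / s_tg, 2))
        for (_, f_pp, f_tg), (_, s_pp, s_tg) in zip(fast_runs, slow_runs)
    ]


for pp_ratio, tg_ratio in speedups(Q4_RUNS, Q8_RUNS):
    print(f"PP x{pp_ratio:.2f}, TG x{tg_ratio:.2f}")
```

Both pairs land near 1.3x for PP and 1.4x for TG, which matches the rounded figures in the summary table.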

Memory Behavior

Q4_K_M: mmap + page cache loads only needed portions. Prompt cache ~180-380 MiB per entry
Q8_0: Hundreds of GB in buff/cache. RSS appears small (normal mmap behavior)
MoE (128E): only active experts are frequently accessed; unused pages stay cold

Model weights are loaded through mmap and page cache. Linux free output shows hundreds of GB in buff/cache, while the application’s RSS looks small. This is normal llama.cpp behavior.

Due to the MoE (128E) architecture, only active experts are frequently referenced per token. Unused weight pages stay cold, and memory access locality has a major impact on performance.
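To confirm that the weights are sitting in page cache rather than in the process’s own memory, the kernel’s counters can be read from /proc/meminfo. The following is a small sketch, assuming a Linux host; it approximates the buff/cache figure as Buffers + Cached (the `free` command also counts reclaimable slab, so its number can differ slightly). The helper names are illustrative.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: numeric value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # memory fields are in kB
    return fields


def page_cache_gib(fields: dict[str, int]) -> float:
    """Approximate buff/cache in GiB as Buffers + Cached."""
    return (fields.get("Buffers", 0) + fields.get("Cached", 0)) / (1024 ** 2)


def read_meminfo(path: str = "/proc/meminfo") -> dict[str, int]:
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Calling `page_cache_gib(read_meminfo())` before and after a few requests shows the cache growing as weight pages are touched, while the server’s RSS stays comparatively small.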

Analysis

Speed Gap Explained

Q4_K_M is ~68% of Q8_0’s model size (290GB vs 426GB). MoE inference reads selected expert weights from memory per token. Smaller quantization means less data per read, directly reducing memory bandwidth pressure.

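The two percentages are not on the same scale: a model that is 32% smaller would decode 1/0.68 ≈ 1.47x faster if time per token scaled purely with bytes read. The measured TG gain (1.40-1.43x) sits just below that ceiling, which is consistent with Decode being almost entirely memory-bandwidth-bound. The small remainder is per-token cost that does not shrink with quantization, such as KV cache access and the memory latency exposed by Decode’s scattered access pattern.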
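To put the size and speed figures on a common scale, the size ratio can be converted into the speedup a purely bandwidth-bound Decode would show and then compared with the measured TG ratios. This is a rough sketch under the assumption that bytes read per token scale with total file size, which ignores layers that are quantized differently; names are illustrative.

```python
Q4_SIZE_GB, Q8_SIZE_GB = 290, 426  # approximate file sizes from the article


def bandwidth_bound_speedup(small_gb: float, large_gb: float) -> float:
    """Speedup of the smaller model if time per token scaled with bytes read."""
    return large_gb / small_gb


ideal = bandwidth_bound_speedup(Q4_SIZE_GB, Q8_SIZE_GB)
measured = [23.7 / 16.6, 21.3 / 15.2]  # TG ratios from the two run pairs

print(f"size ratio      : {Q4_SIZE_GB / Q8_SIZE_GB:.2f}")
print(f"ideal speedup   : x{ideal:.2f}")
for m in measured:
    print(f"measured speedup: x{m:.2f} ({m / ideal:.0%} of ideal)")
```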

Quality Difference in Practice

Q8_0 advantages:

More stable context retention (less topic drift in long conversations)
Fewer destructive edits (notably in aider code modifications)
General sense of reliability

The gap is not dramatic. Code scaffolding and summarization work fine at Q4_K_M. The difference surfaces in complex repository operations and long-form generation. High-level comprehension and structure generation ability are equivalent – Q4_K_M rarely committed critical errors in aider code generation or document scaffolding.

Why Q8_0 Is Viable on 768GB RAM

Q8_0’s 426GB is impractical on typical servers. On 768GB DDR5, the model + KV cache + page cache all fit in memory. “Use Q8_0 when memory is abundant” is a rational choice specific to this environment.

Selection Criteria by Use Case

Q4_K_M vs Q8_0 selection is driven by task nature, not a simple speed-quality trade-off.

High-throughput oneshot generation, Dagster/dbt batch: Q4_K_M. 40% speed difference compounds over volume
Aider workflows, complex repo operations, long-form: Q8_0. Reduced destructive edit risk outweighs speed loss
Daily on/off operation: Q4_K_M. Faster startup
Always-on resident: Q8_0. Quality stability pays off over long sessions
Local knowledge processing: Both work for structuring/distilling sensitive internal data locally
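How much the speed difference matters for batch work depends on volume, and a quick estimate makes that concrete. The sketch below uses the low end of each measured range and a hypothetical batch shape (1,000 requests of 1,000 prompt tokens and 500 generated tokens); it assumes sequential processing with no prompt-cache hits, so real runs with a shared system prompt should finish sooner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    pp_tok_s: float  # prefill speed
    tg_tok_s: float  # decode speed


# Low end of each measured range, to stay conservative.
Q4_K_M = Rates(pp_tok_s=65.0, tg_tok_s=21.0)
Q8_0 = Rates(pp_tok_s=50.0, tg_tok_s=15.0)


def request_seconds(rates: Rates, prompt_tok: int, gen_tok: int) -> float:
    """Estimated wall time for one request with no prompt-cache hit."""
    return prompt_tok / rates.pp_tok_s + gen_tok / rates.tg_tok_s


def batch_hours(rates: Rates, n_requests: int, prompt_tok: int, gen_tok: int) -> float:
    """Estimated wall time for a sequential batch (one request at a time)."""
    return n_requests * request_seconds(rates, prompt_tok, gen_tok) / 3600


# Hypothetical batch shape; adjust to the actual workload.
for name, rates in (("Q4_K_M", Q4_K_M), ("Q8_0", Q8_0)):
    print(f"{name}: {batch_hours(rates, 1000, 1000, 500):.1f} h")
```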

The MoE architecture is what makes both quantization levels practical on CPU. Because only the active experts’ weights are read for each token, memory traffic per token stays far below the full model size, which keeps Q8_0 at a usable 15-17 tok/s.

Using Q4_K_M as the default and switching to Q8_0 for quality-critical phases is my current recommendation for on-premises operation on this class of hardware.

Reproduction Steps

Q4_K_M

  # MO:  host directory that holds the GGUF files (set before running)
  # IMG: llama.cpp server container image (set before running)
  podman run --rm -it \
  -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z "$IMG" \
  --host 0.0.0.0 --port 8080 \
  -m /models/Llama-4-Maverick-17B-128E-Instruct-Q4_K_M.gguf \
  --jinja -c 16384 \
  --threads 14 --threads-batch 14 \
  -b 1024 -ub 256 \
  --parallel 1 --flash-attn on

Q8_0

Same command, substitute Q8_0 model file.
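One way to collect figures in the same shape as the measurement tables is to read the `timings` object that the llama.cpp server returns with a completion. The article does not say how its numbers were gathered, so this is only a sketch: it assumes the server is reachable on port 8081 as published by the podman command, and the `timings` field names are as I understand them from the llama.cpp server and may differ between versions.

```python
import json
import urllib.request


def post_completion(prompt: str, n_predict: int = 128,
                    base_url: str = "http://localhost:8081") -> dict:
    """Send one non-streaming request to the server's /completion endpoint."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        f"{base_url}/completion", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarize_timings(response: dict) -> dict:
    """Extract prompt/generation speeds from the response's `timings` object."""
    t = response["timings"]  # field names assumed; check your server version
    return {
        "prompt_tok": t["prompt_n"],
        "pp_time_s": round(t["prompt_ms"] / 1000, 1),
        "pp_tok_s": round(t["prompt_per_second"], 1),
        "gen_tok": t["predicted_n"],
        "tg_tok_s": round(t["predicted_per_second"], 1),
    }
```

With the server running, `summarize_timings(post_completion("..."))` returns one table row; the PP time is a reasonable stand-in for TTFT when no prompt cache is hit.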

Technical Notes

MoE mmap Behavior

Only a few of 128 experts are active per token. mmap page cache “warms” unevenly. Sustained single-topic processing hits the same experts repeatedly, improving cache hit rate. Frequent topic changes increase page faults.

Prompt Cache

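Prompt-cache reuse based on longest-common-prefix (LCP) similarity works with both Q4_K_M and Q8_0. With a fixed system prompt, TTFT on subsequent requests drops significantly because the shared prefix does not have to be prefilled again.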
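The idea behind an LCP match can be illustrated on token ID lists. The sketch below is not the server’s implementation, only the underlying comparison: count how many leading tokens a cached prompt and an incoming prompt share, then express that as a fraction of the incoming prompt. Function names and the token IDs are made up for illustration.

```python
def lcp_length(a: list[int], b: list[int]) -> int:
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lcp_similarity(cached: list[int], incoming: list[int]) -> float:
    """Fraction of the incoming prompt covered by the shared prefix."""
    if not incoming:
        return 0.0
    return lcp_length(cached, incoming) / len(incoming)


# Same system prompt (first three tokens), different user turn.
cached = [101, 7, 42, 900, 901]
incoming = [101, 7, 42, 555]
print(lcp_length(cached, incoming), lcp_similarity(cached, incoming))  # 3 0.75
```

The longer the fixed system prompt is relative to the user turn, the larger the share of prefill work that can be skipped.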