The Reality of 40B Dense Models: What Running IQuest-Coder-V1-40B on CPU/GPU/Aider Actually Showed

IQuest-Coder-V1-40B-Instruct (Dense 40B) tested across CPU Q5_K_M, GPU nvfp4, and Aider whole-edit. CPU inference proved structurally challenging, nvfp4 delivers 25-28 tok/s in the production range, Aider whole-edit is fundamentally incompatible with 40B. Performance Factor confirms 6-7x efficiency over 111B-class. Measured data on 40B Dense operational limits.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

Introduction

In daily development pipelines and agent-based workflows, constructing the right inference environment for Large Language Models is a constant challenge. In this article, I cover my full evaluation of “IQuest-Coder-V1-40B”—a model that has caught my attention for coding tasks. I measured performance across three configurations: CPU-only inference (EPYC 9175F), GPU-accelerated inference (vLLM with nvfp4), and Aider whole-edit mode. The goal was to find which configurations actually hold up in production.

To cut to the chase: running a 40B Dense model on a CPU is structurally unsuitable for pipeline use cases, and Aider’s whole-edit format compounds the problem. Migrating to GPU—specifically leveraging nvfp4—delivers overwhelming efficiency and controllability. This post breaks down the specific behaviors and measured data behind these conclusions.

Test Environment

Item	Specification
CPU	AMD EPYC 9175F (Zen 5, 16C, L3 512MB)
GPU	NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM)
Memory	DDR5-6400 768GB (12ch)
OS	Ubuntu 24.04 LTS

Test Configurations

Config	Runtime	Quantization	Placement
A: CPU	llama.cpp server (Podman)	Q5_K_M (GGUF)	CPU RAM
B: GPU	vLLM	nvfp4	GPU VRAM
C: Aider	vLLM (Loop-Instruct variant)	nvfp4	GPU VRAM

Background and Challenges: The Limits of CPU Inference

IQuest-Coder-V1-40B-Instruct is a 40B-class Dense (non-MoE) coding-specialized model. Unlike MoE models, Dense models use all parameters during inference—computation scales linearly with model size.

My first attempt ran IQuest-Coder-V1-40B-Instruct.q5_k_m.gguf using llama.cpp server within a Podman container, relying entirely on the CPU (AMD EPYC 9175F, 16 cores) without any GPU acceleration. The context length was set to 8192, targeting “wait-free” pipeline processing that demands low Time To First Token (TTFT) and low latency—essential for tools like Aider and other agent systems.

Once deployed, several fatal issues emerged:

Constant Core Maxout: Unlike MoE models, a 40B Dense model must compute every layer for every token. This inability to bypass computations placed an excessive load on the CPU.
TTFT Breakdown: When prompts exceeded 4k to 5k tokens (e.g., task.n_tokens=4640), the prompt evaluation phase dominated processing time. Perceived performance degraded completely before generation even began.
No Benefit from Storage Speed: Placing the model on an SSD improved load times but had virtually no impact on generation speed or prompt evaluation time.
Throughput-Biased Settings: The default llama-server configuration (batch-size 2048 and ubatch-size 512) is heavily biased toward throughput, which critically conflicts with the low-latency requirements of pipeline processing.

I applied fundamental tuning—NUMA binding, thread allocation, mlock—and the CPU was being fully utilized. But the root bottleneck was simply running a 40B Dense model on a CPU. This poor experience is not misconfiguration; it is the expected behavior.

Config A Results

Item	Measured
TTFT (4K-5K prompt)	Tens of seconds (UX failure)
CPU usage	All cores pinned at 100%
Root cause	40B full-layer computation per token, no shortcuts

Migration to GPU (nvfp4) and Measured Results

Given the CPU limitations, I transitioned to a GPU-based configuration using vLLM and nvfp4. For comparison, I also evaluated command-a-reasoning—a 111B class reasoning model—in parallel.

Measured Throughput

Metric	Measured
PP speed (Prompt throughput)	1,100-2,300 tok/s
TG speed (Generation throughput)	25-28 tok/s (stable)
KV cache usage	2-12%
Prefix cache hit rate	20-45%

From continuous vLLM logs:

  Engine 000: Avg generation throughput: 28.3 tokens/s, KV cache usage: 2.0%, Prefix cache hit rate: 22.8%
Engine 000: Avg generation throughput: 28.0 tokens/s, KV cache usage: 2.3%, Prefix cache hit rate: 22.8%
Engine 000: Avg generation throughput: 27.8 tokens/s, KV cache usage: 2.8%, Prefix cache hit rate: 22.8%
Engine 000: Avg generation throughput: 27.4 tokens/s, KV cache usage: 3.5%, Prefix cache hit rate: 22.8%

Prompt throughput of 1,100–2,300 tok/s means input processing is completely unproblematic, leaving ample headroom for the CPU, tokenizer, and IPC.
Generation throughput of 25–28 tok/s is exceptionally strong for a 40B class model—about 2.3 to 2.5 times faster than the 111B command-a-reasoning.
KV cache usage of 2–12% means the model is not bogged down by long histories or massive contexts, allowing smooth decoding.
The prefix cache hit rate of 20–45% has room to improve further by locking down system prompts and tool schemas.

Execution remains highly stable. As long as Running: 1 reqs is maintained, throughput does not drop. The occasional “0 tok/s” in logs simply reflects an aggregation window timing; there were no signs of hangs, deadlocks, or any underlying failure.

Latency by output length:

200 tokens: ~7-8 seconds
400 tokens: ~14-16 seconds
800 tokens: ~30 seconds

This is less than half the wait time of the 111B model, making a clear difference when working with agents like Aider or during test generation.

Aider Whole-Edit Evaluation

Testing Aider in whole-edit mode on the GPU nvfp4 configuration surfaced a different class of problem.

Metric	Measured
TG speed	0.6-8 tok/s (unstable)
KV cache usage	7-13% (rapid growth)
Prefix cache hit rate	6% (effectively disabled)

Whole-edit regenerates entire files, causing token count explosion. repo-map plus multi-file context inflates the prompt, and KV cache grows rapidly. The 6% prefix cache hit rate reflects the constantly shifting context—there is nothing stable to cache.

Comparison Summary

Config	TG (tok/s)	Viability	Use Case
A: CPU Q5_K_M	Unmeasurable (TTFT failure)	No	-
B: GPU nvfp4	25-28	Production	agent/test gen/CI
C: Aider whole-edit	0.6-8	No	-

Performance Factor: Generation Efficiency per Parameter

To translate these metrics into a practical sense of performance, I defined a simple metric called the “Performance Factor (PF)"—generated tokens per second divided by model size in billions of parameters.

PF ≈ generation tok/s ÷ model size (B)

Model	Params	TG speed	tok/s per B	Relative PF
command-a-reasoning	111B	~11 tok/s	0.10	1.0
IQuest-Coder-40B nvfp4	40B	26-28 tok/s	0.65-0.70	≈6.5-7.0
Small 7B fp16	7B	60-80 tok/s	9-11	Separate category

In terms of generation performance per parameter, the IQuest-Coder-40B delivers 6 to 7 times the efficiency of the 111B class. This proves nvfp4 is working correctly and places the model at an excellent sweet spot in the trade-off between large model intelligence and execution speed.

Analysis

Why Dense Fails on CPU

MoE models (e.g., Kimi-K2.5 with 32B active parameters) compute only a subset of experts per token. Dense computes all 40B layers every time—no computation reduction possible. “40B-class” means fundamentally different CPU loads between MoE and Dense.

Even EPYC 9175F’s 12-channel memory bandwidth saturates under Dense full-layer access patterns. The L3 cache expert-locality strategy that works for MoE is unusable here.

Item	Dense 40B	MoE 229B (10B active)
Per-token computation	All 40B layers	~10B equivalent
L3 cache utilization	Ineffective (full-layer access)	Effective (expert locality)
CPU TG speed	Unmeasurable	10-37 tok/s
CPU viability	Non-viable	Viable for batch

Aider Whole-Edit Structural Problem

Why whole-edit is slow:

Regenerating entire files produces massive output token counts
repo-map + attached files inflate the prompt
Prefix cache hit rate at 6% (constantly shifting context)
KV cache grows rapidly (7→13%)

The fix is switching to diff/patch format, which drastically reduces output tokens and structurally bypasses the generation speed problem.

Model Quality and Controllability

In generating long, structured Go test code, the model did not break down. Logical failures and hallucinations were sometimes fewer than with 111B Reasoning models. The straightforward EOS characteristic of an “Instruct type that doesn’t overthink,” combined with reliable stop and max_tokens behavior, pairs exceptionally well with automated code generation.

While command-a-reasoning undeniably excels in deep reasoning, proofs, and long-form thought, in practical workflows, speed, willingness to stop, and ease of control often hold far more value.

Why the Setup Looks Stable

The win is not just the model. It is the combination of model, serving path, and surrounding implementation.

nvfp4 is working correctly with vLLM main/nightly
stop and max_tokens are behaving as expected
The instruct-style EOS behavior is clean and predictable
The model lines up well with the agent-gateway design, especially where retrieval behavior differs between streaming and non-streaming paths

That makes the result more valuable than a one-off benchmark. The operational shape of the system is understandable, and the model does not have to be treated as a fragile special case.

Prioritizing Improvements

If you find yourself struggling with a similar setup, address the issues in this order:

Use a GPU (Highest Priority) Specify --device nvidia.com/gpu=all and --n-gpu-layers 999 to fully utilize GPU VRAM. With high-capacity VRAM like the RTX PRO 6000 MAX-Q 96GB, there is no reason not to use it.
Temporary Measures for CPU-Only Operation
- Lower batch-size to 512–1024.
- Lower ubatch-size to 128–256.
- Trim prompts by removing unnecessary history or repository maps (repo-map).
Scale Down the Model If pipeline constraints force CPU-only operation, stepping down to a 7B–14B class model is the most realistic solution.

Reproduction Steps

GPU nvfp4 (Recommended)

  vllm serve IQuestLab/IQuest-Coder-V1-40B-Instruct-nvfp4 \
  --max-num-seqs 1 \
  --max-model-len 32768

CPU Q5_K_M (Reference: Not Recommended)

  podman run --rm -it \
  -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 -m "$MODEL" \
  --jinja -c 8192 \
  --threads 14 --threads-batch 14 \
  -b 2048 -ub 512 \
  --parallel 1 --flash-attn on

Aider Optimization Settings

--edit-format diff: Avoid whole-edit, reduce output tokens
temperature=0: Greedy decoding for speed
Minimize repo-map and /add targets
Set max_model_len to minimum required

Conclusion and Next Steps

Running IQuest-Coder-V1-40B on nvfp4 and vLLM functions superbly as a production default, excelling in speed, stability, and control. The position of 40B Dense models is now clear:

Daily work (agent/aider/CI): IQuest-Coder-40B nvfp4 (primary)
Deep reasoning / design review: command-a-reasoning (secondary)
Batch processing (CPU resident): MoE models (Kimi-K2.5 etc.)

If a 40B class model can consistently output 25–28 tok/s, the need for a 111B model in daily agent tasks, CI, and test generation almost entirely disappears. The most elegant architecture positions IQuest-Coder-40B as the primary routing destination, offloading only those cases that require complex reasoning or design reviews to command-a-reasoning.

Next steps: fixing prompts to raise the prefix cache hit rate from 40% to over 60%, and final tuning based on SLOs targeting the p95 latency threshold.

How I'd Choose a Daily Quantization Setup for Hermes-4.3-36B

Comparing Hermes-4.3-36B …

What I Learned from Running Command-A Reasoning 08-2025 Inside an Aider Coding Loop

A hands-on evaluation of …