The Reality of 40B Dense Models: What Running IQuest-Coder-V1-40B on CPU/GPU/Aider Actually Showed
IQuest-Coder-V1-40B-Instruct (Dense 40B) tested across CPU Q5_K_M, GPU nvfp4, and Aider whole-edit. CPU inference proved structurally challenging, nvfp4 delivers 25-28 tok/s in the production range, Aider whole-edit is fundamentally incompatible with 40B. Performance Factor confirms 6-7x efficiency over 111B-class. Measured data on 40B Dense operational limits.
Introduction
In daily development pipelines and agent-based workflows, constructing the right inference environment for Large Language Models is a constant challenge. In this article, I cover my full evaluation of “IQuest-Coder-V1-40B”—a model that has caught my attention for coding tasks. I measured performance across three configurations: CPU-only inference (EPYC 9175F), GPU-accelerated inference (vLLM with nvfp4), and Aider whole-edit mode. The goal was to find which configurations actually hold up in production.
To cut to the chase: running a 40B Dense model on a CPU is structurally unsuitable for pipeline use cases, and Aider’s whole-edit format compounds the problem. Migrating to GPU—specifically leveraging nvfp4—delivers overwhelming efficiency and controllability. This post breaks down the specific behaviors and measured data behind these conclusions.
Test Environment
| Item | Specification |
|---|---|
| CPU | AMD EPYC 9175F (Zen 5, 16C, L3 512MB) |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM) |
| Memory | DDR5-6400 768GB (12ch) |
| OS | Ubuntu 24.04 LTS |
Test Configurations
| Config | Runtime | Quantization | Placement |
|---|---|---|---|
| A: CPU | llama.cpp server (Podman) | Q5_K_M (GGUF) | CPU RAM |
| B: GPU | vLLM | nvfp4 | GPU VRAM |
| C: Aider | vLLM (Loop-Instruct variant) | nvfp4 | GPU VRAM |
Background and Challenges: The Limits of CPU Inference
IQuest-Coder-V1-40B-Instruct is a 40B-class Dense (non-MoE) coding-specialized model. Unlike MoE models, Dense models use all parameters during inference—computation scales linearly with model size.
My first attempt ran IQuest-Coder-V1-40B-Instruct.q5_k_m.gguf using llama.cpp server within a Podman container, relying entirely on the CPU (AMD EPYC 9175F, 16 cores) without any GPU acceleration. The context length was set to 8192, targeting “wait-free” pipeline processing that demands low Time To First Token (TTFT) and low latency—essential for tools like Aider and other agent systems.
Once deployed, several fatal issues emerged:
- Constant Core Maxout: Unlike MoE models, a 40B Dense model must compute every layer for every token. This inability to bypass computations placed an excessive load on the CPU.
- TTFT Breakdown: When prompts exceeded 4k to 5k tokens (e.g.,
task.n_tokens=4640), the prompt evaluation phase dominated processing time. Perceived performance degraded completely before generation even began. - No Benefit from Storage Speed: Placing the model on an SSD improved load times but had virtually no impact on generation speed or prompt evaluation time.
- Throughput-Biased Settings: The default
llama-serverconfiguration (batch-size 2048andubatch-size 512) is heavily biased toward throughput, which critically conflicts with the low-latency requirements of pipeline processing.
I applied fundamental tuning—NUMA binding, thread allocation, mlock—and the CPU was being fully utilized. But the root bottleneck was simply running a 40B Dense model on a CPU. This poor experience is not misconfiguration; it is the expected behavior.
Config A Results
| Item | Measured |
|---|---|
| TTFT (4K-5K prompt) | Tens of seconds (UX failure) |
| CPU usage | All cores pinned at 100% |
| Root cause | 40B full-layer computation per token, no shortcuts |
Migration to GPU (nvfp4) and Measured Results
Given the CPU limitations, I transitioned to a GPU-based configuration using vLLM and nvfp4. For comparison, I also evaluated command-a-reasoning—a 111B class reasoning model—in parallel.
Measured Throughput
| Metric | Measured |
|---|---|
| PP speed (Prompt throughput) | 1,100-2,300 tok/s |
| TG speed (Generation throughput) | 25-28 tok/s (stable) |
| KV cache usage | 2-12% |
| Prefix cache hit rate | 20-45% |
From continuous vLLM logs:
Engine 000: Avg generation throughput: 28.3 tokens/s, KV cache usage: 2.0%, Prefix cache hit rate: 22.8%
Engine 000: Avg generation throughput: 28.0 tokens/s, KV cache usage: 2.3%, Prefix cache hit rate: 22.8%
Engine 000: Avg generation throughput: 27.8 tokens/s, KV cache usage: 2.8%, Prefix cache hit rate: 22.8%
Engine 000: Avg generation throughput: 27.4 tokens/s, KV cache usage: 3.5%, Prefix cache hit rate: 22.8%
- Prompt throughput of 1,100–2,300 tok/s means input processing is completely unproblematic, leaving ample headroom for the CPU, tokenizer, and IPC.
- Generation throughput of 25–28 tok/s is exceptionally strong for a 40B class model—about 2.3 to 2.5 times faster than the 111B
command-a-reasoning. - KV cache usage of 2–12% means the model is not bogged down by long histories or massive contexts, allowing smooth decoding.
- The prefix cache hit rate of 20–45% has room to improve further by locking down system prompts and tool schemas.
Execution remains highly stable. As long as Running: 1 reqs is maintained, throughput does not drop. The occasional “0 tok/s” in logs simply reflects an aggregation window timing; there were no signs of hangs, deadlocks, or any underlying failure.
Latency by output length:
- 200 tokens: ~7-8 seconds
- 400 tokens: ~14-16 seconds
- 800 tokens: ~30 seconds
This is less than half the wait time of the 111B model, making a clear difference when working with agents like Aider or during test generation.
Aider Whole-Edit Evaluation
Testing Aider in whole-edit mode on the GPU nvfp4 configuration surfaced a different class of problem.
| Metric | Measured |
|---|---|
| TG speed | 0.6-8 tok/s (unstable) |
| KV cache usage | 7-13% (rapid growth) |
| Prefix cache hit rate | 6% (effectively disabled) |
Whole-edit regenerates entire files, causing token count explosion. repo-map plus multi-file context inflates the prompt, and KV cache grows rapidly. The 6% prefix cache hit rate reflects the constantly shifting context—there is nothing stable to cache.
Comparison Summary
| Config | TG (tok/s) | Viability | Use Case |
|---|---|---|---|
| A: CPU Q5_K_M | Unmeasurable (TTFT failure) | No | - |
| B: GPU nvfp4 | 25-28 | Production | agent/test gen/CI |
| C: Aider whole-edit | 0.6-8 | No | - |
Performance Factor: Generation Efficiency per Parameter
To translate these metrics into a practical sense of performance, I defined a simple metric called the “Performance Factor (PF)"—generated tokens per second divided by model size in billions of parameters.
PF ≈ generation tok/s ÷ model size (B)
| Model | Params | TG speed | tok/s per B | Relative PF |
|---|---|---|---|---|
| command-a-reasoning | 111B | ~11 tok/s | 0.10 | 1.0 |
| IQuest-Coder-40B nvfp4 | 40B | 26-28 tok/s | 0.65-0.70 | ≈6.5-7.0 |
| Small 7B fp16 | 7B | 60-80 tok/s | 9-11 | Separate category |
In terms of generation performance per parameter, the IQuest-Coder-40B delivers 6 to 7 times the efficiency of the 111B class. This proves nvfp4 is working correctly and places the model at an excellent sweet spot in the trade-off between large model intelligence and execution speed.
Analysis
Why Dense Fails on CPU
MoE models (e.g., Kimi-K2.5 with 32B active parameters) compute only a subset of experts per token. Dense computes all 40B layers every time—no computation reduction possible. “40B-class” means fundamentally different CPU loads between MoE and Dense.
Even EPYC 9175F’s 12-channel memory bandwidth saturates under Dense full-layer access patterns. The L3 cache expert-locality strategy that works for MoE is unusable here.
| Item | Dense 40B | MoE 229B (10B active) |
|---|---|---|
| Per-token computation | All 40B layers | ~10B equivalent |
| L3 cache utilization | Ineffective (full-layer access) | Effective (expert locality) |
| CPU TG speed | Unmeasurable | 10-37 tok/s |
| CPU viability | Non-viable | Viable for batch |
Aider Whole-Edit Structural Problem
Why whole-edit is slow:
- Regenerating entire files produces massive output token counts
- repo-map + attached files inflate the prompt
- Prefix cache hit rate at 6% (constantly shifting context)
- KV cache grows rapidly (7→13%)
The fix is switching to diff/patch format, which drastically reduces output tokens and structurally bypasses the generation speed problem.
Model Quality and Controllability
In generating long, structured Go test code, the model did not break down. Logical failures and hallucinations were sometimes fewer than with 111B Reasoning models. The straightforward EOS characteristic of an “Instruct type that doesn’t overthink,” combined with reliable stop and max_tokens behavior, pairs exceptionally well with automated code generation.
While command-a-reasoning undeniably excels in deep reasoning, proofs, and long-form thought, in practical workflows, speed, willingness to stop, and ease of control often hold far more value.
Why the Setup Looks Stable
The win is not just the model. It is the combination of model, serving path, and surrounding implementation.
nvfp4is working correctly withvLLM main/nightlystopandmax_tokensare behaving as expected- The instruct-style EOS behavior is clean and predictable
- The model lines up well with the
agent-gatewaydesign, especially where retrieval behavior differs between streaming and non-streaming paths
That makes the result more valuable than a one-off benchmark. The operational shape of the system is understandable, and the model does not have to be treated as a fragile special case.
Prioritizing Improvements
If you find yourself struggling with a similar setup, address the issues in this order:
- Use a GPU (Highest Priority)
Specify
--device nvidia.com/gpu=alland--n-gpu-layers 999to fully utilize GPU VRAM. With high-capacity VRAM like the RTX PRO 6000 MAX-Q 96GB, there is no reason not to use it. - Temporary Measures for CPU-Only Operation
- Lower
batch-sizeto 512–1024. - Lower
ubatch-sizeto 128–256. - Trim prompts by removing unnecessary history or repository maps (repo-map).
- Lower
- Scale Down the Model If pipeline constraints force CPU-only operation, stepping down to a 7B–14B class model is the most realistic solution.
Reproduction Steps
GPU nvfp4 (Recommended)
vllm serve IQuestLab/IQuest-Coder-V1-40B-Instruct-nvfp4 \
--max-num-seqs 1 \
--max-model-len 32768
CPU Q5_K_M (Reference: Not Recommended)
podman run --rm -it \
-p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
-v "$MO":/models:ro,Z $IMG \
--host 0.0.0.0 --port 8080 -m "$MODEL" \
--jinja -c 8192 \
--threads 14 --threads-batch 14 \
-b 2048 -ub 512 \
--parallel 1 --flash-attn on
Aider Optimization Settings
--edit-format diff: Avoid whole-edit, reduce output tokenstemperature=0: Greedy decoding for speed- Minimize repo-map and /add targets
- Set
max_model_lento minimum required
Conclusion and Next Steps
Running IQuest-Coder-V1-40B on nvfp4 and vLLM functions superbly as a production default, excelling in speed, stability, and control. The position of 40B Dense models is now clear:
- Daily work (agent/aider/CI): IQuest-Coder-40B nvfp4 (primary)
- Deep reasoning / design review: command-a-reasoning (secondary)
- Batch processing (CPU resident): MoE models (Kimi-K2.5 etc.)
If a 40B class model can consistently output 25–28 tok/s, the need for a 111B model in daily agent tasks, CI, and test generation almost entirely disappears. The most elegant architecture positions IQuest-Coder-40B as the primary routing destination, offloading only those cases that require complex reasoning or design reviews to command-a-reasoning.
Next steps: fixing prompts to raise the prefix cache hit rate from 40% to over 60%, and final tuning based on SLOs targeting the p95 latency threshold.
