Introduction

In daily development pipelines and agent-based workflows, constructing the right inference environment for Large Language Models is a constant challenge. In this article, I cover my full evaluation of “IQuest-Coder-V1-40B”—a model that has caught my attention for coding tasks. I measured performance across three configurations: CPU-only inference (EPYC 9175F), GPU-accelerated inference (vLLM with nvfp4), and Aider whole-edit mode. The goal was to find which configurations actually hold up in production.

To cut to the chase: running a 40B Dense model on a CPU is structurally unsuitable for pipeline use cases, and Aider’s whole-edit format compounds the problem. Migrating to GPU—specifically leveraging nvfp4—delivers overwhelming efficiency and controllability. This post breaks down the specific behaviors and measured data behind these conclusions.

Test Environment

ItemSpecification
CPUAMD EPYC 9175F (Zen 5, 16C, L3 512MB)
GPUNVIDIA RTX PRO 6000 Blackwell Max-Q (96GB VRAM)
MemoryDDR5-6400 768GB (12ch)
OSUbuntu 24.04 LTS

Test Configurations

ConfigRuntimeQuantizationPlacement
A: CPUllama.cpp server (Podman)Q5_K_M (GGUF)CPU RAM
B: GPUvLLMnvfp4GPU VRAM
C: AidervLLM (Loop-Instruct variant)nvfp4GPU VRAM

Background and Challenges: The Limits of CPU Inference

IQuest-Coder-V1-40B-Instruct is a 40B-class Dense (non-MoE) coding-specialized model. Unlike MoE models, Dense models use all parameters during inference—computation scales linearly with model size.

My first attempt ran IQuest-Coder-V1-40B-Instruct.q5_k_m.gguf using llama.cpp server within a Podman container, relying entirely on the CPU (AMD EPYC 9175F, 16 cores) without any GPU acceleration. The context length was set to 8192, targeting “wait-free” pipeline processing that demands low Time To First Token (TTFT) and low latency—essential for tools like Aider and other agent systems.

Once deployed, several fatal issues emerged:

  • Constant Core Maxout: Unlike MoE models, a 40B Dense model must compute every layer for every token. This inability to bypass computations placed an excessive load on the CPU.
  • TTFT Breakdown: When prompts exceeded 4k to 5k tokens (e.g., task.n_tokens=4640), the prompt evaluation phase dominated processing time. Perceived performance degraded completely before generation even began.
  • No Benefit from Storage Speed: Placing the model on an SSD improved load times but had virtually no impact on generation speed or prompt evaluation time.
  • Throughput-Biased Settings: The default llama-server configuration (batch-size 2048 and ubatch-size 512) is heavily biased toward throughput, which critically conflicts with the low-latency requirements of pipeline processing.

I applied fundamental tuning—NUMA binding, thread allocation, mlock—and the CPU was being fully utilized. But the root bottleneck was simply running a 40B Dense model on a CPU. This poor experience is not misconfiguration; it is the expected behavior.

Config A Results

ItemMeasured
TTFT (4K-5K prompt)Tens of seconds (UX failure)
CPU usageAll cores pinned at 100%
Root cause40B full-layer computation per token, no shortcuts

Migration to GPU (nvfp4) and Measured Results

Given the CPU limitations, I transitioned to a GPU-based configuration using vLLM and nvfp4. For comparison, I also evaluated command-a-reasoning—a 111B class reasoning model—in parallel.

Measured Throughput

MetricMeasured
PP speed (Prompt throughput)1,100-2,300 tok/s
TG speed (Generation throughput)25-28 tok/s (stable)
KV cache usage2-12%
Prefix cache hit rate20-45%

From continuous vLLM logs:

  Engine 000: Avg generation throughput: 28.3 tokens/s, KV cache usage: 2.0%, Prefix cache hit rate: 22.8%
Engine 000: Avg generation throughput: 28.0 tokens/s, KV cache usage: 2.3%, Prefix cache hit rate: 22.8%
Engine 000: Avg generation throughput: 27.8 tokens/s, KV cache usage: 2.8%, Prefix cache hit rate: 22.8%
Engine 000: Avg generation throughput: 27.4 tokens/s, KV cache usage: 3.5%, Prefix cache hit rate: 22.8%
  
  • Prompt throughput of 1,100–2,300 tok/s means input processing is completely unproblematic, leaving ample headroom for the CPU, tokenizer, and IPC.
  • Generation throughput of 25–28 tok/s is exceptionally strong for a 40B class model—about 2.3 to 2.5 times faster than the 111B command-a-reasoning.
  • KV cache usage of 2–12% means the model is not bogged down by long histories or massive contexts, allowing smooth decoding.
  • The prefix cache hit rate of 20–45% has room to improve further by locking down system prompts and tool schemas.

Execution remains highly stable. As long as Running: 1 reqs is maintained, throughput does not drop. The occasional “0 tok/s” in logs simply reflects an aggregation window timing; there were no signs of hangs, deadlocks, or any underlying failure.

Latency by output length:

  • 200 tokens: ~7-8 seconds
  • 400 tokens: ~14-16 seconds
  • 800 tokens: ~30 seconds

This is less than half the wait time of the 111B model, making a clear difference when working with agents like Aider or during test generation.

Aider Whole-Edit Evaluation

Testing Aider in whole-edit mode on the GPU nvfp4 configuration surfaced a different class of problem.

MetricMeasured
TG speed0.6-8 tok/s (unstable)
KV cache usage7-13% (rapid growth)
Prefix cache hit rate6% (effectively disabled)

Whole-edit regenerates entire files, causing token count explosion. repo-map plus multi-file context inflates the prompt, and KV cache grows rapidly. The 6% prefix cache hit rate reflects the constantly shifting context—there is nothing stable to cache.

Comparison Summary

ConfigTG (tok/s)ViabilityUse Case
A: CPU Q5_K_MUnmeasurable (TTFT failure)No-
B: GPU nvfp425-28Productionagent/test gen/CI
C: Aider whole-edit0.6-8No-

Performance Factor: Generation Efficiency per Parameter

To translate these metrics into a practical sense of performance, I defined a simple metric called the “Performance Factor (PF)"—generated tokens per second divided by model size in billions of parameters.

PF ≈ generation tok/s ÷ model size (B)

ModelParamsTG speedtok/s per BRelative PF
command-a-reasoning111B~11 tok/s0.101.0
IQuest-Coder-40B nvfp440B26-28 tok/s0.65-0.70≈6.5-7.0
Small 7B fp167B60-80 tok/s9-11Separate category

In terms of generation performance per parameter, the IQuest-Coder-40B delivers 6 to 7 times the efficiency of the 111B class. This proves nvfp4 is working correctly and places the model at an excellent sweet spot in the trade-off between large model intelligence and execution speed.

Analysis

Why Dense Fails on CPU

MoE models (e.g., Kimi-K2.5 with 32B active parameters) compute only a subset of experts per token. Dense computes all 40B layers every time—no computation reduction possible. “40B-class” means fundamentally different CPU loads between MoE and Dense.

Even EPYC 9175F’s 12-channel memory bandwidth saturates under Dense full-layer access patterns. The L3 cache expert-locality strategy that works for MoE is unusable here.

ItemDense 40BMoE 229B (10B active)
Per-token computationAll 40B layers~10B equivalent
L3 cache utilizationIneffective (full-layer access)Effective (expert locality)
CPU TG speedUnmeasurable10-37 tok/s
CPU viabilityNon-viableViable for batch

Aider Whole-Edit Structural Problem

Why whole-edit is slow:

  1. Regenerating entire files produces massive output token counts
  2. repo-map + attached files inflate the prompt
  3. Prefix cache hit rate at 6% (constantly shifting context)
  4. KV cache grows rapidly (7→13%)

The fix is switching to diff/patch format, which drastically reduces output tokens and structurally bypasses the generation speed problem.

Model Quality and Controllability

In generating long, structured Go test code, the model did not break down. Logical failures and hallucinations were sometimes fewer than with 111B Reasoning models. The straightforward EOS characteristic of an “Instruct type that doesn’t overthink,” combined with reliable stop and max_tokens behavior, pairs exceptionally well with automated code generation.

While command-a-reasoning undeniably excels in deep reasoning, proofs, and long-form thought, in practical workflows, speed, willingness to stop, and ease of control often hold far more value.

Why the Setup Looks Stable

The win is not just the model. It is the combination of model, serving path, and surrounding implementation.

  1. nvfp4 is working correctly with vLLM main/nightly
  2. stop and max_tokens are behaving as expected
  3. The instruct-style EOS behavior is clean and predictable
  4. The model lines up well with the agent-gateway design, especially where retrieval behavior differs between streaming and non-streaming paths

That makes the result more valuable than a one-off benchmark. The operational shape of the system is understandable, and the model does not have to be treated as a fragile special case.

Prioritizing Improvements

If you find yourself struggling with a similar setup, address the issues in this order:

  1. Use a GPU (Highest Priority) Specify --device nvidia.com/gpu=all and --n-gpu-layers 999 to fully utilize GPU VRAM. With high-capacity VRAM like the RTX PRO 6000 MAX-Q 96GB, there is no reason not to use it.
  2. Temporary Measures for CPU-Only Operation
    • Lower batch-size to 512–1024.
    • Lower ubatch-size to 128–256.
    • Trim prompts by removing unnecessary history or repository maps (repo-map).
  3. Scale Down the Model If pipeline constraints force CPU-only operation, stepping down to a 7B–14B class model is the most realistic solution.

Reproduction Steps

  vllm serve IQuestLab/IQuest-Coder-V1-40B-Instruct-nvfp4 \
  --max-num-seqs 1 \
  --max-model-len 32768
  
  podman run --rm -it \
  -p 8081:8080 --shm-size 16g --cap-add=SYS_NICE \
  -v "$MO":/models:ro,Z $IMG \
  --host 0.0.0.0 --port 8080 -m "$MODEL" \
  --jinja -c 8192 \
  --threads 14 --threads-batch 14 \
  -b 2048 -ub 512 \
  --parallel 1 --flash-attn on
  

Aider Optimization Settings

  • --edit-format diff: Avoid whole-edit, reduce output tokens
  • temperature=0: Greedy decoding for speed
  • Minimize repo-map and /add targets
  • Set max_model_len to minimum required

Conclusion and Next Steps

Running IQuest-Coder-V1-40B on nvfp4 and vLLM functions superbly as a production default, excelling in speed, stability, and control. The position of 40B Dense models is now clear:

  • Daily work (agent/aider/CI): IQuest-Coder-40B nvfp4 (primary)
  • Deep reasoning / design review: command-a-reasoning (secondary)
  • Batch processing (CPU resident): MoE models (Kimi-K2.5 etc.)

If a 40B class model can consistently output 25–28 tok/s, the need for a 111B model in daily agent tasks, CI, and test generation almost entirely disappears. The most elegant architecture positions IQuest-Coder-40B as the primary routing destination, offloading only those cases that require complex reasoning or design reviews to command-a-reasoning.

Next steps: fixing prompts to raise the prefix cache hit rate from 40% to over 60%, and final tuning based on SLOs targeting the p95 latency threshold.