How I'd Choose a Daily Quantization Setup for Hermes-4.3-36B
Comparing Hermes-4.3-36B across BF16, FP8, and nvfp4 on a Blackwell GPU. Beyond raw throughput, this covers initial responsiveness, context headroom, and code safety, to reach a practical decision on which setup to use daily.
Introduction
I wanted a practical way to decide how to run Hermes-4.3-36B locally, not just which mode posts the best headline benchmark. For this comparison, I looked at BF16, FP8, and nvfp4 on vLLM 0.14.0rc1 with an RTX PRO 6000 Blackwell Max-Q 96GB. What mattered was not only raw throughput, but also initial responsiveness, room for context growth, and how safe each option felt for code-oriented work.
The goal was to clarify which setup I would keep as the default for chat and exploration, and which one I would switch to when I needed a more conservative model profile.
Background and Motivation
For local LLM workflows involving chat, code generation, and MCP tool integration, choosing a quantization level is a recurring decision. NousResearch Hermes-4.3-36B is a 36B-class model that is strong at tool use (Function Calling), and I evaluated it here as a candidate to serve with vLLM.
The target workload is broader than casual prompting. I am assuming:
- Interactive chat where TTFT noticeably affects usability
- Code generation and MCP-assisted workflows
- Future use with context7
Because of that, I treated the following as the real decision criteria:
- Generation throughput
- TTFT, which is strongly affected by prefill behavior
- Context headroom, especially via KV cache usage
- Perceived quality and stability for the intended workload
Those four criteria matter more than a single throughput number when choosing a model to use every day. With an RTX PRO 6000 Blackwell (96GB VRAM), BF16 runs fine but pushes VRAM above 90%, leaving little room for context. nvfp4 needs only around 22GB. The real question: what do you give up in exchange for speed?
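To make "context headroom" concrete, here is a rough sketch of the usual KV cache arithmetic. The layer count, KV head count, and head dimension are placeholder values chosen for illustration, not figures taken from this model; read the real ones from the model's config.json before relying on the result.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens: int, bytes_per_token: int) -> float:
    """KV cache size in GiB for a single sequence of the given length."""
    return context_tokens * bytes_per_token / 2**30


# HYPOTHETICAL architecture values, for illustration only.
per_token = kv_cache_bytes_per_token(num_layers=64, num_kv_heads=8, head_dim=128)
full_context = kv_cache_gib(65536, per_token)

print(f"{per_token / 1024:.0f} KiB per token, {full_context:.1f} GiB at 65536 tokens")
```

Under these assumed values, a full 65536-token sequence needs a double-digit number of GiB on top of the weights, which is why a smaller weight footprint translates so directly into room for context.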
Test Environment
| Item | Specification |
|---|---|
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB |
| CPU | AMD EPYC 9175F |
| Memory | DDR5-6400 768GB |
| Runtime | vLLM 0.14.0rc1 |
| Model | NousResearch / Hermes-4.3-36B |
Since the hardware already has plenty of GPU memory, the practical question is not whether the model fits. The question is how much responsiveness I can gain without fooling myself into treating speed as quality.
Quantization Patterns I Tested
I compared three operating modes: BF16, FP8, and nvfp4. The raw numbers matter, but the behavioral differences are just as important, so I am keeping the qualitative judgments alongside the measurements.
1. BF16 (Unquantized)
I used BF16 as the baseline because I need a trustworthy reference point. Without that, it is too easy to overrate a faster configuration.
Example
vllm serve NousResearch/Hermes-4.3-36B \
--dtype bfloat16 \
--max-num-seqs 1 \
--max-model-len 65536
Observed behavior
- Generation throughput: 17-19 tok/s
- Prompt throughput: around 300-500 tok/s
- VRAM usage: very high and easy to push above 90%
- KV cache usage: 6-8% for short prompts
- Stability: very high
Evaluation
- Quality and consistency are the most trustworthy here
- Initial response is heavier, so TTFT tends to be longer
- In interactive use, the added wait is noticeable
Best suited for
- Code edits where I want to minimize breakage
- Long-form specifications and strict procedures
- Baseline measurement and fallback use
BF16 is not unusably slow, but in a chat-driven workflow the extra wait accumulates quickly. On the other hand, for final edits and review work, it still feels like the safest anchor.
2. FP8 (vLLM / --quantization fp8)
FP8 was my attempt to keep more of the BF16 safety profile while improving TTFT.
Example
vllm serve NousResearch/Hermes-4.3-36B \
--dtype bfloat16 \
--quantization fp8 \
--max-num-seqs 1 \
--max-model-len 65536
Observed behavior
- Generation throughput: 18-20 tok/s
- Prompt throughput: over 1000 tok/s, with clearly faster prefill
- VRAM usage: reduced versus BF16
- Prefix cache hit rate: useful depending on the workload
Evaluation
- TTFT improves relative to BF16
- Decode throughput does not change dramatically, but the system feels lighter
- Quality loss is minor at the level I can perceive from these tasks
Best suited for
- Balanced interactive usage
- Cases where BF16 feels too heavy but 4-bit still feels risky
- A realistic day-to-day candidate
FP8 does not blow BF16 away on decode speed, but it does improve the overall feel because prefill is faster. For conversational use and lighter code tasks, that change is enough to matter.
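Note: vLLM’s --quantization fp8 converts BF16 models to FP8 at runtime. No pre-quantized model is needed. This test ran on a Blackwell-generation GPU (compute capability 12.0); vLLM’s documentation lists compute capability 8.9 and newer (Ada Lovelace, Hopper) for hardware FP8 compute, so check it for your own GPU.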
3. nvfp4 (4-bit, around 22GB)
nvfp4 changed the user experience the most. If the goal is a model I can keep running and interact with comfortably, this is the one that stands out.
Example
vllm serve NousResearch/Hermes-4.3-36B-nvfp4 \
--max-num-seqs 1 \
--max-model-len 32768
Observed behavior
- Generation throughput: 31-33 tok/s, stable
- Prompt throughput: 280-500 tok/s
- VRAM usage: around 22GB
- KV cache usage: 1-2%, leaving a lot of headroom
Evaluation
- Generation speed is roughly 1.6x to 1.9x that of BF16 (31-33 tok/s versus 17-19 tok/s)
- Interactive feel improves significantly
- I can allocate much more context comfortably
- Output quality still feels reasonably capable
Cautions
- Higher speed can create the illusion of higher intelligence
- Long-context consistency and strictness may be worse than BF16
- Code modification still requires tests as the final check
Best suited for
- Chat-first operation
- MCP + context7-oriented workflows
- Design exploration and other long-context tasks
- Daily use where I want the least friction
The most important part here is not just the throughput. With VRAM usage around 22GB and KV cache usage in the 1-2% range, the model leaves a large amount of operating headroom. On a 96GB system, it is feasible to load two or three nvfp4 models simultaneously.
At the same time, I do not want to confuse responsiveness with reliability. A model that answers fast can easily feel smarter than it is. That matters most in code-editing scenarios, where the safer answer is still to treat tests as the final arbiter.
Cross-Comparison Summary
| Metric | BF16 | FP8 | nvfp4 |
|---|---|---|---|
| Generation (TG) | 17-19 tok/s | 18-20 tok/s | 31-33 tok/s |
| Prefill (PP) | 300-500 tok/s | 1000+ tok/s | 280-500 tok/s |
| TTFT | Slower | Improved | Good |
| VRAM Usage | 90%+ | Medium | ~22GB |
| KV Cache Usage | 6-8% (short text) | Improved | 1-2% (ample headroom) |
| Quality / Stability | Most stable | Good | Some instability in edge cases |
| Interactive feel | △ | ○ | ◎ |
The high-level picture is straightforward. If I care most about responsiveness and context flexibility, nvfp4 wins. If I want a high-confidence baseline for delicate work, BF16 still matters. FP8 sits in the middle as the pragmatic compromise.
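As a quick check on the decode numbers in the table, the speedup range can be derived from the measured ranges themselves. This is a small sketch; the helper name ratio_range is mine, and the conservative/optimistic pairing is just one reasonable way to compare two ranges.

```python
def ratio_range(fast: tuple[float, float], slow: tuple[float, float]) -> tuple[float, float]:
    """Conservative and optimistic speedup from two (low, high) throughput ranges."""
    fast_lo, fast_hi = fast
    slow_lo, slow_hi = slow
    # Conservative: slowest fast run vs fastest slow run. Optimistic: the reverse.
    return fast_lo / slow_hi, fast_hi / slow_lo


# Generation throughput ranges (tok/s) copied from the summary table.
GENERATION_TOK_S = {"BF16": (17, 19), "FP8": (18, 20), "nvfp4": (31, 33)}

nvfp4_lo, nvfp4_hi = ratio_range(GENERATION_TOK_S["nvfp4"], GENERATION_TOK_S["BF16"])
fp8_lo, fp8_hi = ratio_range(GENERATION_TOK_S["FP8"], GENERATION_TOK_S["BF16"])

print(f"nvfp4 vs BF16 decode: {nvfp4_lo:.2f}x to {nvfp4_hi:.2f}x")
print(f"FP8 vs BF16 decode:   {fp8_lo:.2f}x to {fp8_hi:.2f}x")
```

The FP8 range straddles 1.0x, which matches the observation that FP8 helps prefill far more than decode.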
Analysis
The “Fast = Smart” Illusion
nvfp4’s snappy responses create a subjective impression that the model got smarter. In reality, quantization-induced quality degradation surfaces in long-form coherence and complex reasoning. This illusion is easy to miss with subjective evaluation alone.
The right evaluation tool is an objective metric like first-pass test success rate, not perceived responsiveness. Comfort and actual quality need to be tracked on separate axes.
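A minimal sketch of how first-pass test success rate could be tracked per profile follows. The EditAttempt record and the sample entries are hypothetical placeholders, not results from this comparison; the point is only that the metric is cheap to compute once each attempt is logged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EditAttempt:
    profile: str            # e.g. "nvfp4", "FP8", "BF16"
    passed_first_try: bool  # did the test suite pass without a retry?


def first_pass_rate(attempts: list[EditAttempt], profile: str) -> Optional[float]:
    """Fraction of attempts for one profile whose tests passed on the first run."""
    relevant = [a for a in attempts if a.profile == profile]
    if not relevant:
        return None  # no data is different from a 0% success rate
    return sum(a.passed_first_try for a in relevant) / len(relevant)


# PLACEHOLDER records to show the shape of the data; not measured results.
log = [
    EditAttempt("profile_a", True),
    EditAttempt("profile_a", False),
    EditAttempt("profile_b", True),
]

print(first_pass_rate(log, "profile_a"))
print(first_pass_rate(log, "profile_c"))
```

Keeping this number in a separate column from perceived responsiveness is what keeps the two axes from blurring together.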
Context-Switching Design Philosophy
The most rational approach was switching by use case:
Exploratory development (nvfp4 advantage):
- “Just try it” phase
- Rapid iteration on code fragments
- MCP + context7 conversational workflows
- Short latency maintains development rhythm
Destructive changes (BF16/FP8 advantage):
- Repository-wide refactoring requiring consistency
- Critical logic modifications
- Phases where first-pass test success rate determines efficiency
- Final review stages
Tool Use Capability
Hermes-4.3-36B showed limitations in deep reasoning but was relatively stable in tool use (MCP, Function Calling). Argument specification and task chaining worked reliably, which makes it practical in workflows that combine an LLM with external tools such as static analysis.
High-History Environments like aider
In aider-style workflows that resend around 6000 tokens of history each time, model quality matters less than context design — including MCP and context7. In those cases, nvfp4’s VRAM savings and context headroom translate directly into operational benefit.
This point is worth carrying forward. The bottleneck is not always the model itself. In workflows with large history payloads, how you structure what you send matters more than which quantization you choose.
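To see why payload size dominates, a back-of-the-envelope estimate of prefill time for a resent history is enough. This sketch assumes prefill time is simply prompt tokens divided by prompt throughput; it ignores prefix caching and scheduling overhead, so treat the output as an upper-bound intuition rather than a measurement.

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Rough time to process a prompt before the first token can be generated."""
    if prefill_tok_s <= 0:
        raise ValueError("prefill_tok_s must be positive")
    return prompt_tokens / prefill_tok_s


HISTORY_TOKENS = 6000  # the resent history size mentioned above

# Prompt throughput figures taken from the ranges reported in this article.
for rate in (300, 500, 1000):
    seconds = prefill_seconds(HISTORY_TOKENS, rate)
    print(f"{rate:>4} tok/s prefill -> about {seconds:.0f} s before the first token")
```

Halving the history that gets resent helps every profile by the same factor, which no choice of quantization can match on its own.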
Operational Conclusion
For daily interactive use → nvfp4 is the strongest option. Speed, context headroom, and responsiveness are all clearly better.
For quality- and safety-sensitive work → BF16 or FP8. Especially for final edits and review.
Practical best setup
- nvfp4: default for chat and exploration
- FP8/BF16: fallback profiles
- The real decision metric should be whether tests pass
Rather than treating this as “nvfp4 is fast but sloppy” versus “BF16 is slow but solid,” switching by development phase is the practical answer. Using nvfp4 as default and switching to BF16 only for final modifications produced the least friction.
FP8 fills the “BF16 is too heavy but 4-bit is scary” gap perfectly. Its prefill speed (1000+ tok/s) is particularly valuable for RAG and MCP workflows with long contexts.
Reproduction Steps
1. Download Models
# BF16/FP8
huggingface-cli download NousResearch/Hermes-4.3-36B
# nvfp4
huggingface-cli download NousResearch/Hermes-4.3-36B-nvfp4
2. Launch vLLM Server
Use the commands in the “Quantization Patterns I Tested” section. --max-num-seqs 1 targets single-user chat; raise it for batch processing.
3. Measure
Extract Avg generation throughput and Avg prompt throughput from vLLM logs. Confirm stable values across multiple requests.
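The extraction step can be scripted. The sketch below keys only on the two metric names mentioned above followed by a number and "tokens/s"; the exact shape of the rest of the log line is an assumption and may differ between vLLM versions. The sample lines are synthetic and exist only to exercise the parser.

```python
import re
from statistics import mean

# Assumed shape: "<name>: <number> tokens/s" for both metrics on one line.
STATS = re.compile(
    r"Avg prompt throughput:\s*([\d.]+)\s*tokens/s.*?"
    r"Avg generation throughput:\s*([\d.]+)\s*tokens/s"
)


def parse_throughput(lines):
    """Return (prompt_tok_s, generation_tok_s) pairs for every matching log line."""
    samples = []
    for line in lines:
        match = STATS.search(line)
        if match:
            samples.append((float(match.group(1)), float(match.group(2))))
    return samples


def mean_active(values):
    """Average the non-zero samples; idle intervals report 0.0 and would skew the mean."""
    active = [v for v in values if v > 0]
    return mean(active) if active else None


# SYNTHETIC lines for illustration only, not real measurements.
sample_log = [
    "INFO Avg prompt throughput: 100.0 tokens/s, Avg generation throughput: 10.0 tokens/s",
    "INFO Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.0 tokens/s",
    "INFO Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s",
    "INFO some unrelated line",
]

samples = parse_throughput(sample_log)
prompt_avg = mean_active([p for p, _ in samples])
generation_avg = mean_active([g for _, g in samples])
print(f"prompt: {prompt_avg} tok/s, generation: {generation_avg} tok/s")
```

To run it against a real capture, pass an open file object (for example a saved server log) to parse_throughput instead of sample_log.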
Technical Notes
FP8 Quantization in vLLM
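vLLM’s --quantization fp8 converts BF16 models to FP8 at runtime. No pre-quantized model is needed. I ran it on a Blackwell-generation GPU (compute capability 12.0); vLLM’s documentation lists compute capability 8.9 and newer (Ada Lovelace, Hopper) for hardware FP8 compute.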
nvfp4 VRAM Estimate
36B model in nvFP4: around 22GB. Runnable on 24GB GPUs, but 32GB+ is recommended for KV cache headroom. On a 96GB system, two or three nvfp4 models can be loaded simultaneously.
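The estimate can be sanity-checked with simple arithmetic. This sketch assumes exactly 36 billion parameters and, for nvfp4, about 4.5 effective bits per weight (4-bit values plus one 8-bit scale per 16-value block, which is my understanding of the format rather than something stated in this article). The 10GB per-model KV budget is likewise a hypothetical number.

```python
def weights_gb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes), ignoring runtime overhead."""
    return num_params * bits_per_weight / 8 / 1e9


def copies_that_fit(total_gb: float, weights_gb_each: float, kv_budget_gb_each: float) -> int:
    """How many model instances fit if each needs weights plus a KV cache budget."""
    return int(total_gb // (weights_gb_each + kv_budget_gb_each))


PARAMS = 36_000_000_000  # ASSUMPTION: "36B-class" taken as exactly 36e9 parameters

estimates = {
    "BF16": weights_gb(PARAMS, 16),
    "FP8": weights_gb(PARAMS, 8),
    "nvfp4": weights_gb(PARAMS, 4.5),  # ASSUMPTION: 4.5 effective bits per weight
}
for name, gb in estimates.items():
    print(f"{name}: about {gb:.2f} GB of weights")

# Uses the observed ~22GB footprint and a HYPOTHETICAL 10GB KV budget per model.
copies = copies_that_fit(total_gb=96, weights_gb_each=22, kv_budget_gb_each=10)
print(f"nvfp4 instances on 96GB: {copies}")
```

The nvfp4 estimate lands a little under the observed 22GB, which is plausible if some layers stay in higher precision. A larger KV budget per model pushes the instance count from three down to two.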
Quantization Selection Flowchart
- VRAM insufficient -> nvfp4 (only option)
- VRAM available + chat focus -> nvfp4 (speed priority)
- VRAM available + code editing -> FP8 (balanced)
- Final review / precision required -> BF16 (quality priority)
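The flowchart above can be written down as a small lookup so the choice is explicit in launch scripts. This is only a sketch of the four rules as listed; the function name and the workload labels are hypothetical.

```python
def choose_quantization(vram_sufficient: bool, workload: str) -> str:
    """Map the flowchart rules to a profile name."""
    if not vram_sufficient:
        return "nvfp4"  # only option when the larger formats do not fit

    by_workload = {
        "chat": "nvfp4",         # speed priority
        "code_editing": "FP8",   # balanced
        "final_review": "BF16",  # quality priority
    }
    try:
        return by_workload[workload]
    except KeyError:
        raise ValueError(f"unknown workload: {workload!r}") from None


for workload in ("chat", "code_editing", "final_review"):
    print(workload, "->", choose_quantization(True, workload))
```

Raising on an unknown workload is deliberate: silently falling back to the fast profile is exactly the kind of drift the flowchart is meant to prevent.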
Future Work
The next useful step is to split this into workload-specific launch templates and quantify quality more rigorously. I want separate profiles for chat, code generation, and final review, then compare them using a hard metric such as first-pass test success rate.
Responsiveness matters, but it is only half the decision. The durable setup is the one that preserves both speed and reproducibility.
