How I'd Choose a Daily Quantization Setup for Hermes-4.3-36B

Comparing Hermes-4.3-36B across BF16, FP8, and nvfp4 on a Blackwell GPU. Not just raw throughput — this covers initial responsiveness, context headroom, and code safety, to reach a practical decision on which setup to use daily.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

Introduction

I wanted a practical way to decide how to run Hermes-4.3-36B locally, not just which mode posts the best headline benchmark. For this comparison, I looked at BF16, FP8, and nvfp4 on vLLM 0.14.0rc1 with an RTX PRO 6000 Blackwell Max-Q 96GB. What mattered was not only raw throughput, but also initial responsiveness, room for context growth, and how safe each option felt for code-oriented work.

The goal was to clarify which setup I would keep as the default for chat and exploration, and which one I would switch to when I needed a more conservative model profile.

Background and Motivation

For local LLM workflows involving chat, code generation, and MCP tool integration, choosing a quantization level is a recurring decision. NousResearch Hermes-4.3-36B is a 36B-class model that is strong at tool use (function calling), so I evaluated it as a candidate for serving with vLLM.

The target workload is broader than casual prompting. I am assuming:

Interactive chat where TTFT noticeably affects usability
Code generation and MCP-assisted workflows
Future use with context7

Because of that, I treated the following as the real decision criteria:

Generation throughput
TTFT, which is strongly affected by prefill behavior
Context headroom, especially via KV cache usage
Perceived quality and stability for the intended workload

Those four criteria matter more than a single throughput number when choosing a model to use every day. With an RTX PRO 6000 Blackwell (96GB VRAM), BF16 runs fine but pushes VRAM above 90%, leaving little room for context. nvfp4 needs only around 22GB. The real question: what do you give up in exchange for speed?

Test Environment

Item	Specification
GPU	NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB
CPU	AMD EPYC 9175F
Memory	DDR5-6400 768GB
Runtime	vLLM 0.14.0rc1
Model	NousResearch / Hermes-4.3-36B

Since the hardware already has plenty of GPU memory, the practical question is not whether the model fits. The question is how much responsiveness I can gain without fooling myself into treating speed as quality.

Quantization Patterns I Tested

I compared three operating modes: BF16, FP8, and nvfp4. The raw numbers matter, but the behavioral differences are just as important, so I am keeping the qualitative judgments alongside the measurements.

1. BF16 (Unquantized)

I used BF16 as the baseline because I need a trustworthy reference point. Without that, it is too easy to overrate a faster configuration.

Example

  vllm serve NousResearch/Hermes-4.3-36B \
  --dtype bfloat16 \
  --max-num-seqs 1 \
  --max-model-len 65536

Observed behavior

Generation throughput: 17-19 tok/s
Prompt throughput: around 300-500 tok/s
VRAM usage: very high and easy to push above 90%
KV cache usage: 6-8% for short prompts
Stability: very high

Evaluation

Quality and consistency are the most trustworthy here
Initial response is heavier, so TTFT tends to be longer
In interactive use the extra wait is noticeable

Best suited for

Code edits where I want to minimize breakage
Long-form specifications and strict procedures
Baseline measurement and fallback use

BF16 is not unusably slow, but in a chat-driven workflow the extra wait accumulates quickly. On the other hand, for final edits and review work, it still feels like the safest anchor.

2. FP8 (vLLM / --quantization fp8)

FP8 was my attempt to keep more of the BF16 safety profile while improving TTFT.

Example

  vllm serve NousResearch/Hermes-4.3-36B \
  --dtype bfloat16 \
  --quantization fp8 \
  --max-num-seqs 1 \
  --max-model-len 65536

Observed behavior

Generation throughput: 18-20 tok/s
Prompt throughput: over 1000 tok/s, with clearly faster prefill
VRAM usage: reduced versus BF16
Prefix cache hit rate: useful depending on the workload

Evaluation

TTFT improves relative to BF16
Decode throughput does not change dramatically, but the system feels lighter
Quality loss is minor at the level I can perceive from these tasks

Best suited for

Balanced interactive usage
Cases where BF16 feels too heavy but 4-bit still feels risky
A realistic day-to-day candidate

FP8 does not blow BF16 away on decode speed, but it does improve the overall feel because prefill is faster. For conversational use and lighter code tasks, that change is enough to matter.

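Note: vLLM's --quantization fp8 quantizes BF16 weights to FP8 when the model loads, so no pre-quantized checkpoint is needed. This does not require Blackwell specifically: vLLM documents native FP8 support from compute capability 8.9 (Ada Lovelace, Hopper), and the Blackwell card used here (compute capability 12.0) also qualifies.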

3. nvfp4 (4-bit, around 22GB)

nvfp4 changed the user experience the most. If the goal is a model I can keep running and interact with comfortably, this is the one that stands out.

Example

  vllm serve NousResearch/Hermes-4.3-36B-nvfp4 \
  --max-num-seqs 1 \
  --max-model-len 32768

Observed behavior

Generation throughput: 31-33 tok/s, stable
Prompt throughput: 280-500 tok/s
VRAM usage: around 22GB
KV cache usage: 1-2%, leaving a lot of headroom

Evaluation

Generation speed is roughly 1.6x to 1.9x that of BF16 (31-33 vs 17-19 tok/s)
Interactive feel improves significantly
I can allocate much more context comfortably
Output quality still feels reasonably capable

Cautions

Higher speed can create the illusion of higher intelligence
Long-context consistency and strictness may be worse than BF16
Code modification still requires tests as the final check

Best suited for

Chat-first operation
MCP + context7-oriented workflows
Design exploration and other long-context tasks
Daily use where I want the least friction

The most important part here is not just the throughput. With VRAM usage around 22GB and KV cache usage in the 1-2% range, the model leaves a large amount of operating headroom. On a 96GB system, it is feasible to load two or three nvfp4 models simultaneously.

At the same time, I do not want to confuse responsiveness with reliability. A model that answers fast can easily feel smarter than it is. That matters most in code-editing scenarios, where the safer answer is still to treat tests as the final arbiter.

Cross-Comparison Summary

Metric	BF16	FP8	nvfp4
Generation (TG)	17-19 tok/s	18-20 tok/s	31-33 tok/s
Prefill (PP)	300-500 tok/s	1000+ tok/s	280-500 tok/s
TTFT	Slower	Improved	Good
VRAM Usage	90%+	Medium	~22GB
KV Cache Usage	6-8% (short text)	Improved	1-2% (ample headroom)
Quality / Stability	Most stable	Good	Some instability in edge cases
Interactive feel	△	○	◎

The high-level picture is straightforward. If I care most about responsiveness and context flexibility, nvfp4 wins. If I want a high-confidence baseline for delicate work, BF16 still matters. FP8 sits in the middle as the pragmatic compromise.
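To make the decode comparison concrete, here is a minimal sketch that derives speedup ranges from the generation throughput figures in the table. The function name and the worst-case/best-case pairing are my own illustration, not part of the original measurements.

```python
# Generation throughput ranges (tok/s) copied from the comparison table.
GENERATION_TOK_S = {
    "BF16": (17.0, 19.0),
    "FP8": (18.0, 20.0),
    "nvfp4": (31.0, 33.0),
}


def speedup_range(mode: str, baseline: str = "BF16") -> tuple[float, float]:
    """Return the (worst-case, best-case) speedup of `mode` over `baseline`."""
    low, high = GENERATION_TOK_S[mode]
    base_low, base_high = GENERATION_TOK_S[baseline]
    # Worst case pairs the slowest run of `mode` with the fastest baseline run;
    # best case pairs the fastest run of `mode` with the slowest baseline run.
    return low / base_high, high / base_low


for mode in ("FP8", "nvfp4"):
    worst, best = speedup_range(mode)
    print(f"{mode}: {worst:.2f}x to {best:.2f}x BF16 decode throughput")
```

Pairing the extremes this way gives a conservative range rather than a single headline ratio, which fits the point that one number should not drive the decision.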

Analysis

The “Fast = Smart” Illusion

nvfp4’s snappy responses create a subjective impression that the model got smarter. In reality, quantization-induced quality degradation surfaces in long-form coherence and complex reasoning. This illusion is easy to miss with subjective evaluation alone.

The right evaluation tool is an objective metric like first-pass test success rate, not perceived responsiveness. Comfort and actual quality need to be tracked on separate axes.
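As a rough sketch of how first-pass test success rate could be tracked per profile, the snippet below counts only the first attempt at each task. The record layout, profile labels, and sample entries are hypothetical and exist only to show the shape of the calculation; they are not measured results.

```python
from collections import defaultdict


def first_pass_success_rate(attempts):
    """Compute the share of tasks whose first generated patch passed the tests.

    `attempts` is an iterable of (profile, task_id, attempt_number, passed)
    tuples. Only records with attempt_number == 1 count toward the rate.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for profile, _task_id, attempt_number, ok in attempts:
        if attempt_number != 1:
            continue  # retries do not count as first-pass results
        total[profile] += 1
        passed[profile] += int(ok)
    return {profile: passed[profile] / total[profile] for profile in total}


# Made-up records, purely to illustrate the input format.
example_log = [
    ("nvfp4", "task-a", 1, False),
    ("nvfp4", "task-a", 2, True),
    ("nvfp4", "task-b", 1, True),
    ("bf16", "task-a", 1, True),
    ("bf16", "task-b", 1, True),
]
print(first_pass_success_rate(example_log))
```

Keeping this number in a separate log from any notes about perceived speed is one way to keep comfort and quality on separate axes.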

Context-Switching Design Philosophy

The most rational approach was switching by use case:

Exploratory development (nvfp4 advantage):

“Just try it” phase
Rapid iteration on code fragments
MCP + context7 conversational workflows
Short latency maintains development rhythm

Destructive changes (BF16/FP8 advantage):

Repository-wide refactoring requiring consistency
Critical logic modifications
Phases where first-pass test success rate determines efficiency
Final review stages

Tool Use Capability

Hermes-4.3-36B showed limitations in deep reasoning but was relatively stable in tool use (MCP, function calling). Argument specification and task chaining worked reliably, which makes it practical in workflows that combine an LLM with external tools such as static analysis.

High-History Environments like aider

In aider-style workflows that resend around 6000 tokens of history each time, model quality matters less than context design — including MCP and context7. In those cases, nvfp4’s VRAM savings and context headroom translate directly into operational benefit.

This point is worth carrying forward. The bottleneck is not always the model itself. In workflows with large history payloads, how you structure what you send matters more than which quantization you choose.
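A back-of-the-envelope sketch shows why the payload size matters so much. It divides the roughly 6000 resent tokens by the prompt throughput figures from the comparison table. This assumes prefill time is simply tokens divided by throughput and ignores prefix caching, which could shorten it considerably when the history prefix is reused. The function name is my own.

```python
def estimated_prefill_seconds(prompt_tokens: int, prompt_tok_s: float) -> float:
    """Rough prefill time: prompt tokens divided by prompt throughput."""
    if prompt_tok_s <= 0:
        raise ValueError("prompt throughput must be positive")
    return prompt_tokens / prompt_tok_s


HISTORY_TOKENS = 6000  # approximate history resent per request, from the text

# Prompt throughput values (tok/s) taken from the comparison table.
# FP8 is listed as "1000+", so only its lower bound is used here.
SCENARIOS = [
    ("BF16 @ 300 tok/s", 300.0),
    ("BF16 @ 500 tok/s", 500.0),
    ("FP8 @ 1000 tok/s", 1000.0),
    ("nvfp4 @ 280 tok/s", 280.0),
    ("nvfp4 @ 500 tok/s", 500.0),
]

for label, tok_s in SCENARIOS:
    seconds = estimated_prefill_seconds(HISTORY_TOKENS, tok_s)
    print(f"{label}: ~{seconds:.1f} s before the first token")
```

Under these assumptions, halving the history that gets resent saves as much waiting time as doubling prefill throughput, which is the sense in which context design can outweigh the quantization choice.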

Operational Conclusion

For daily interactive use → nvfp4 is the strongest option. Speed, context headroom, and responsiveness are all clearly better.
For quality- and safety-sensitive work → BF16 or FP8. Especially for final edits and review.
Practical best setup
- nvfp4: default for chat and exploration
- FP8/BF16: fallback profiles
- The real decision metric should be whether tests pass

Rather than treating this as “nvfp4 is fast but sloppy” versus “BF16 is slow but solid,” switching by development phase is the practical answer. Using nvfp4 as default and switching to BF16 only for final modifications produced the least friction.

FP8 fills the “BF16 is too heavy but 4-bit is scary” gap perfectly. Its prefill speed (1000+ tok/s) is particularly valuable for RAG and MCP workflows with long contexts.

Reproduction Steps

1. Download Models

  # BF16/FP8
  huggingface-cli download NousResearch/Hermes-4.3-36B

  # nvfp4
  huggingface-cli download NousResearch/Hermes-4.3-36B-nvfp4

2. Launch vLLM Server

See commands in the “Quantization Patterns I Tested” section. --max-num-seqs 1 is for single-user chat. Increase for batch processing.
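To avoid retyping the three launch commands, a small helper can assemble them from one table. This is only a sketch: the flags and model names are copied from the commands earlier in this article, while the `PROFILES` layout and `build_command` name are my own invention.

```python
import shlex

# Flags and model names copied from the three launch commands in this article.
PROFILES = {
    "bf16": {
        "model": "NousResearch/Hermes-4.3-36B",
        "flags": ["--dtype", "bfloat16"],
        "max_model_len": 65536,
    },
    "fp8": {
        "model": "NousResearch/Hermes-4.3-36B",
        "flags": ["--dtype", "bfloat16", "--quantization", "fp8"],
        "max_model_len": 65536,
    },
    "nvfp4": {
        "model": "NousResearch/Hermes-4.3-36B-nvfp4",
        "flags": [],
        "max_model_len": 32768,
    },
}


def build_command(profile: str, max_num_seqs: int = 1) -> list[str]:
    """Build the argv list for `vllm serve` for the given profile."""
    spec = PROFILES[profile]
    return [
        "vllm", "serve", spec["model"],
        *spec["flags"],
        "--max-num-seqs", str(max_num_seqs),
        "--max-model-len", str(spec["max_model_len"]),
    ]


# Print a copy-pasteable command line; nothing is launched here.
print(shlex.join(build_command("nvfp4")))
```

Returning an argv list rather than a string keeps the helper usable with `subprocess.run` later, and `max_num_seqs` stays a parameter so the batch case only changes one argument.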

3. Measure

Extract Avg generation throughput and Avg prompt throughput from vLLM logs. Confirm stable values across multiple requests.
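The extraction step can be scripted. The sketch below assumes the periodic stats line contains the text "Avg prompt throughput: X tokens/s, Avg generation throughput: Y tokens/s"; the exact log layout may differ between vLLM versions, so the pattern should be checked against real output. The sample lines are invented for illustration.

```python
import re
from statistics import mean

# Assumed shape of the periodic vLLM stats line; verify against your own logs.
LINE_RE = re.compile(
    r"Avg prompt throughput: (?P<prompt>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s"
)


def summarize_throughput(log_lines):
    """Average the non-zero prompt and generation throughput samples."""
    prompt_samples, gen_samples = [], []
    for line in log_lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        prompt = float(match["prompt"])
        gen = float(match["gen"])
        # Zero readings are idle intervals, so they are skipped.
        if prompt > 0:
            prompt_samples.append(prompt)
        if gen > 0:
            gen_samples.append(gen)
    return {
        "prompt_tok_s": mean(prompt_samples) if prompt_samples else None,
        "generation_tok_s": mean(gen_samples) if gen_samples else None,
        "prompt_samples": len(prompt_samples),
        "generation_samples": len(gen_samples),
    }


# Invented sample lines, only to exercise the parser.
sample_lines = [
    "INFO Avg prompt throughput: 400.0 tokens/s, Avg generation throughput: 18.0 tokens/s, Running: 1 reqs",
    "INFO Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.0 tokens/s, Running: 1 reqs",
    "INFO Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs",
    "INFO some unrelated line",
]
print(summarize_throughput(sample_lines))
```

Reporting the sample counts alongside the averages makes it easier to confirm that the values come from several requests rather than a single interval.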

Technical Notes

FP8 Quantization in vLLM

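vLLM's --quantization fp8 quantizes BF16 weights to FP8 when the model loads, so no pre-quantized checkpoint is needed. Blackwell is not strictly required: vLLM documents native FP8 support from compute capability 8.9 (Ada Lovelace, Hopper), and the Blackwell card used here (compute capability 12.0) also qualifies.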

nvfp4 VRAM Estimate

A 36B model in nvfp4 takes around 22GB for weights. It can load on a 24GB GPU, but that leaves only about 2GB for KV cache and activations, so 32GB+ is recommended. On a 96GB system, two or three nvfp4 models can be loaded simultaneously.
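The 22GB figure can be sanity-checked with simple arithmetic. This sketch assumes a nominal 36e9 parameters and the NVFP4 layout of 4-bit values with one 8-bit scale shared by each block of 16 weights, or about 4.5 bits per weight. The gap between the result and the observed ~22GB is plausibly layers kept at higher precision plus runtime overhead, but that part is my assumption rather than something measured here.

```python
def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 36e9  # nominal parameter count (assumption; the exact count may differ)

# NVFP4: 4-bit value plus one 8-bit block scale per 16 weights.
NVFP4_BITS = 4 + 8 / 16

for label, bits in [("BF16", 16), ("FP8", 8), ("nvfp4", NVFP4_BITS)]:
    print(f"{label}: ~{weight_size_gb(PARAMS, bits):.2f} GB of weights")
```

The estimate covers weights only; KV cache, activations, and any memory the runtime preallocates come on top of it.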

Quantization Selection Flowchart

VRAM insufficient -> nvfp4 (only option)
VRAM available + chat focus -> nvfp4 (speed priority)
VRAM available + code editing -> FP8 (balanced)
Final review / precision required -> BF16 (quality priority)
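The flowchart above can be written as a small function. This is a direct transcription of the four rules; the function name and the task labels ("chat", "code_editing", "final_review") are hypothetical names chosen for the sketch.

```python
def choose_quantization(vram_sufficient: bool, task: str) -> str:
    """Pick a quantization mode following the selection flowchart.

    `task` is one of "chat", "code_editing", or "final_review".
    """
    if not vram_sufficient:
        return "nvfp4"  # only option when the larger formats do not fit
    if task == "final_review":
        return "BF16"  # quality priority
    if task == "code_editing":
        return "FP8"  # balanced
    if task == "chat":
        return "nvfp4"  # speed priority
    raise ValueError(f"unknown task: {task!r}")


for task in ("chat", "code_editing", "final_review"):
    print(task, "->", choose_quantization(True, task))
```

Checking the VRAM condition first mirrors the flowchart: when memory is the constraint, the task type no longer changes the answer.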

Future Work

The next useful step is to split this into workload-specific launch templates and quantify quality more rigorously. I want separate profiles for chat, code generation, and final review, then compare them using a hard metric such as first-pass test success rate.

Responsiveness matters, but it is only half the decision. The durable setup is the one that preserves both speed and reproducibility.
