On this page

Running DeepSeek-V4-Flash with a llama.cpp WIP Branch: First Local Inference on Dual Blackwell Max-Q 96GB GPUs

A first-pass validation of DeepSeek-V4-Flash (284B MoE / 13B active) on dual RTX PRO 6000 Blackwell Max-Q 96GB GPUs using a llama.cpp WIP DeepSeek-V4 branch and native FP4/FP8 GGUF, covering official inference-code attempts, GGUF conversion attempts, PP/TG measurements, and the current Flash Attention limitation.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

I ran DeepSeek-V4-Flash locally on dual RTX PRO 6000 Blackwell Max-Q 96GB GPUs. The result is intentionally narrow: this is a first-pass “it runs” validation using a llama.cpp WIP branch and a community GGUF.

Even with that limitation, a 284B MoE / 13B active model is already reaching around 35 t/s TG locally while staying in native FP4/FP8 GGUF. Flash Attention is still disabled, and the DSV4 graph implementation is still in progress, so I am recording this as an experimental snapshot from 2026-04-27 rather than a final performance evaluation.

Video link: https://www.youtube.com/watch?v=Hjl4efNonxE

Result First

The working setup used nsparks/DeepSeek-V4-Flash-FP4-FP8-GGUF and the wip/deepseek-v4-support branch from llama.cpp PR #22378. The Hugging Face model card reports DeepSeek-V4-Flash as 284B params, and the GGUF is published as a deepseek4 architecture model.

Item	Value
Model	DeepSeek-V4-Flash
Parameters	284B MoE
Active parameters	13B
Experts	256 experts / 6 active
GGUF	`nsparks/DeepSeek-V4-Flash-FP4-FP8-GGUF`
Runtime	`llama.cpp` `wip/deepseek-v4-support`
Commit range	`b8942-ba173dd08`
Quantization	native FP4 + FP8
GGUF size	146GB
BPW	4.39
GPUs	RTX PRO 6000 Blackwell Max-Q 96GB x2

The benchmark summary:

Metric	Value
Prompt eval (PP)	36.5-39.4 t/s
Token generation (TG)	34.1-41.7 t/s
PP average	38.3 t/s
TG average	35.7 t/s
VRAM	GPU0: 75.1GB, GPU1: 72.8GB
Offloaded layers	44/44
CPU mapped	1010 MiB
Flash Attention	disabled
GPU utilization	30-40% burst
Peak power	GPU0: 97.8W, GPU1: 115W
Graph splits	3

GPU utilization stayed around 30-40% in bursts, and power draw stayed around 100-115W against a 300W TDP. Once Flash Attention and the expert dispatch graph improve, there should still be plenty of headroom.

DCGM GPU Monitoring dashboard while running DeepSeek-V4-Flash GGUF — DCGM while running DeepSeek-V4-Flash GGUF with the llama.cpp WIP branch. GPU utilization was around 40%, VRAM was GPU0 75.8GB / GPU1 73.4GB, and power stayed around 97.8W / 115W.

Launch Command

This is the final command I used:

  podman run --rm \
  -p 8000:8000 \
  --device nvidia.com/gpu=all \
  --shm-size 8g \
  -v /mnt/data/models/models--nsparks--DeepSeek-V4-Flash-FP4-FP8-GGUF:/models:Z \
  llama.cpp:deepseek-v4 \
  -s -m /models/snapshots/.../DeepSeek-V4-Flash-FP4-FP8-native.gguf \
  --n-gpu-layers 999 --threads 15 --threads-batch 24 \
  --ctx-size 8192 --parallel 1 -b 4096 -ub 2048 \
  --jinja --host 0.0.0.0 --port 8000 --alias deepseek-v4

llama-server already exposes an OpenAI-compatible endpoint, so I did not need to keep the API wrapper I had built for the official inference/ path.

Test Environment

Item	Value
GPU	NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB x2
Compute Capability	sm_120
Driver	580.126.09
CUDA	13.0
CPU	AMD EPYC 9175F
RAM	768GB DDR5-6400
Container	Podman
Native FP4	`BLACKWELL_NATIVE_FP4` enabled

When I loaded the model directly through transformers, both GPU0 and GPU1 used roughly 80GB of VRAM. I confirmed the first response, but the process crashed when I sent a prompt request. I then investigated the official inference-code path, but eventually switched to the GGUF that had appeared on Hugging Face.

Per-Request Benchmark

The prompts are short, so this is not a full PP benchmark. It is enough to understand the speed profile at this experimental stage.

#	Request	prompt tokens	PP (ms/t)	PP (t/s)	gen tokens	TG (ms/t)	TG (t/s)	total (s)
1	Japanese question	14	26.7	37.4	41	27.1	36.9	1.5
2	MoE explanation	20	25.9	38.6	132	27.8	36.0	4.2
3	FP4 vs FP8	19	25.9	38.7	125	27.7	36.1	4.0
4	system prompt	31	25.8	38.7	182	27.8	36.0	5.9
5	multi-turn	50	25.9	38.6	512	28.0	35.7	15.6
6	Go code	23	25.8	38.7	230	27.8	35.9	7.0
7	logic puzzle	23	25.9	38.7	207	27.8	35.9	6.4
8	comparison analysis	30	25.8	38.7	427	28.0	35.8	12.7
9	JSON output	26	27.4	36.5	122	29.3	34.1	4.3
10	DSV4 architecture	45	25.8	38.7	414	27.9	35.8	12.7

Quality looked fine on short tests such as MoE explanation, code generation, and logic questions, but I also saw odd Japanese streaming artifacts such as 東東京圜 a few times. Since this is still a test branch, I would not read much into quality yet. The useful signal here is mostly the TG estimate.

Flash Attention Is Disabled

The logs show Flash Attention being disabled automatically:

  sched_reserve: layer 0 is assigned to device CUDA0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
sched_reserve: Flash Attention was auto, set to disabled

DeepSeek-V4 uses a custom attention architecture that includes CSA, HCA, and an Indexer. In this WIP branch, Flash Attention support for that graph still appears incomplete. That is probably the main reason GPU utilization stays around 30-40%.

PP should improve significantly once Flash Attention lands. TG is different: with a 4.39 BPW native FP4/FP8 GGUF split across two GPUs, each token is more constrained by memory bandwidth and expert dispatch than by Flash Attention itself. Around 35 t/s is already good for an experimental bring-up.

What I Tried with the Official Inference Code

Before using GGUF, I first tried to run the official repository’s inference/*.py path directly. The official code is built around local generation through generate.py; it does not ship with an HTTP endpoint. I therefore wrote a thin FastAPI + uvicorn wrapper around the tokenizer, model, and distributed runtime used by generate.py, exposing it as an OpenAI-compatible /v1/chat/completions endpoint.

The MP=2 weight conversion succeeded. With direct transformers loading, both GPUs used around 80GB. Through the official inference/*.py path, I also got as far as confirming the first response. At startup, nvtop showed Python processes using about 79680MiB on both GPU0 and GPU1, roughly 81% VRAM usage.

nvtop showing VRAM usage on both GPUs while starting the official inference code — During the official `inference/*.py` attempt, Python processes allocated roughly 79,680MiB on both GPU0 and GPU1. I confirmed the first response, but prompt requests crashed.

However, the process crashed when I sent a prompt request. I worked through the issues below, but the remaining blocker was the combination of the NGC container’s torch version and the FP4 dtype requirement in DSV4.

  python convert.py --hf-ckpt-path ${HF_CKPT_PATH} --save-path ${SAVE_PATH} \
  --n-experts 256 --model-parallel 2

  NCCL_NET_PLUGIN=none NCCL_IB_DISABLE=1 PYTHONPATH=. \
torchrun --standalone --nproc-per-node 2 main.py \
  --ckpt-path ${SAVE_PATH} --config ${CONFIG} --port 8000

The main issues were:

Issue	Status
NCCL segfault	Segfault around `ncclNetPluginInit` during broadcast. Avoided with `NCCL_NET_PLUGIN=none NCCL_IB_DISABLE=1`
tilelang could not detect CUDA	The bare-metal environment did not have the CUDA toolkit; worked around by symlinking `nvcc` from a container overlay
sparse attention shared memory	tilelang’s CSA sparse attention kernel required 104KB of dynamic shared memory
block size adjustment	Lowering sparse attention block size from `64 -> 32` got past the shared-memory side
NGC torch too old	`nvcr.io/nvidia/pytorch:25.04-py3` pins torch 2.7.0
DSV4 FP4 dtype	DeepSeek-V4-Flash needs `torch.float4_e2m1fn_x2`, which requires torch 2.11+

Blackwell supports up to 228KB/SM of dynamic shared memory. The 104KB required by tilelang’s sparse attention kernel should be reachable on the hardware, and reducing block size from 64 -> 32 did get past that part.

The remaining blocker was torch. The NGC container was pinned to torch 2.7.0, which does not provide the float4_e2m1fn_x2 dtype used by DeepSeek-V4-Flash. The constraints were strong enough that a simple replacement was not viable. I stopped pursuing the official inference/*.py route there and continued with the native FP4/FP8 GGUF that had appeared on Hugging Face.

What I Tried for Self-Conversion to GGUF

Next I tried converting the model myself using convert_hf_to_gguf.py from the nsparks WIP branch.

  python3 convert_hf_to_gguf.py ${HF_SNAP} \
  --outtype native \
  --torch-threads 16 \
  --outfile dsv4-flash-native.gguf

This also hit several stages:

Stage	Result
torch 2.6	`F8_E8M0` `KeyError`
torch 2.11 CPU	`F8_E8M0` passed
transformers	`deepseek_v4` `model_type` was not recognized
tokenizer	Partly worked around by switching to `PreTrainedTokenizerFast`
pre-tokenizer	Stopped at unsupported `joyai-llm` pre-tokenizer

At that point, the community GGUF had already been published, so I prioritized runtime validation over continuing the conversion work.

Final GGUF Used

The model I used was nsparks/DeepSeek-V4-Flash-FP4-FP8-GGUF. The model card lists deepseek-ai/DeepSeek-V4-Flash as the upstream source, and shows a conversion command in this shape:

  python3 convert_hf_to_gguf.py /mnt/models/hf/DeepSeek-V4-Flash \
  --outtype moe-f8-e4m3-mxfp4 \
  --torch-threads 96 \
  --outfile DeepSeek-V4-Flash-FP4-FP8-native.gguf

The official DeepSeek Hugging Face repository is MIT licensed.

Upstream Work I Am Tracking

As of 2026-04-27, DeepSeek-V4 support in upstream llama.cpp is still WIP.

PR / Discussion	Purpose
llama.cpp PR #22378	`wip/deepseek-v4-support`, including runtime graph, FP4/FP8 support, and performance hot paths
llama.cpp PR #22359	DeepSeek-V4 GGUF conversion script
Discussion #22376	DeepSeek-V4 support discussion
nsparks GGUF	native FP4/FP8 GGUF
official HF	official DeepSeek-V4-Flash weights

Looking at the PR #22378 history, a lot has landed quickly: FP4/FP8 support, DeepSeek4 runtime state save, F8 decode tuning, TOP_K fast path, RMSNorm/copy kernel tuning, and more. TG may move closer to the numbers seen when using -ot exps=CPU.

Takeaways

The first important point is that I was lucky: the model size happens to fit my workstation well. A 284B MoE responding locally at around 35 t/s TG is already significant. The result is experimental and may change completely once the runtime is optimized. Flash Attention is disabled, graph splits are 3, GPU utilization is only 30-40%, and PP in particular should have a lot of room to improve.

TG is already in a practical range. The license is also clear, so for SFT/DPO distillation data, pipeline work, and batch jobs, 35 t/s is enough to be useful. I had been evaluating GLM-5.1, Kimi-K2.6, and Qwen3.5-397B as orchestrator candidates for my own agent system, and if DeepSeek-V4-Flash gets optimized in ik_llama.cpp or llama.cpp, it could become the best orchestrator in a CPU/GPU hybrid setup. Even as a fully GPU-loaded standalone model, the reported KV reduction of around 90% makes me interested in higher-context single-model use as well.

DeepSeek says DSV4 attention reduces KV cache by 93% and FLOPs by 90% compared with V3.2. In the current setup, total VRAM is 192GB. The model occupies 75.1GB on GPU0 and 72.8GB on GPU1, for 147.9GB total. That leaves 21.5GB on GPU0 and 23.9GB on GPU1, or 45.4GB free in total. If KV cache really uses only 7% of the usual footprint, that 32-45GB of free VRAM can hold a very large context. Right now I am working hard to coordinate multiple role-based models and optimize context management around them, but DSV4 feels like the kind of model meant to collapse that into a single model. If that works, some of the orchestration layer may no longer be necessary.

Measuring Qwen3.6-27B NVFP4+MTP on vLLM: ~190 tok/s TG on Dual RTX PRO 6000 Blackwell Max-Q

Running …

Qwen3.6-27B-FP8: Role-Specific Fine-Tuning Strategy and Integration into My Agent Stack

Running Qwen3.6-27B-FP8 on RTX …