Qwen3.6-27B-FP8: Role-Specific Fine-Tuning Strategy and Integration into My Agent Stack

I ran Qwen3.6-27B-FP8 for a day as a worker in my agent execution stack, and the first impression is very strong. It sustains about 100 tok/s in a single stream and 160–180 tok/s combined at two-way concurrency, and with EAGLE speculative decoding the accept rate holds at 0.8 or higher. In practice it feels easier to use than Qwen3-Coder-Next 80B, and at this point it looks like a serious option.

That said, raw inference speed is not the main point. What is actually interesting is how far this model can be pushed as a role-specific fine-tuned worker.

During the video recording, I did not include the dedicated system prompt I normally use in production. That may be why the outputs fluctuated more than usual and why it slipped into revision loops a few times.

Video link: https://www.youtube.com/watch?v=H5w4zBDmv2g

Benchmark Results

Item	Value
Token generation (TG), sustained, single stream	~100 tok/s
Token generation (TG), combined, 2 concurrent	160–180 tok/s
Prompt processing (PP), chunked (8k)	4.0k–5.3k tok/s
EAGLE accept rate	0.80–1.00
EAGLE accept length	3.0–4.0 tokens
VRAM breakdown	weights 28.5GB + KV/Mamba 28.5GB + MTP 7GB

Hardware

Item	Specification
GPU	NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB
CPU	AMD EPYC 9175F
RAM	768GB DDR5-6400
Inference engine	SGLang nightly (CUDA 13)

SGLang Launch Configuration

SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 \
SGLANG_ENABLE_SPEC_V2=1 \
sglang serve \
  --model-path Qwen/Qwen3.6-27B-FP8 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mamba-scheduler-strategy extra_buffer \
  --page-size 64 \
  --mem-fraction-static 0.9 \
  --context-length 262144 \
  --served-model-name frisky
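
The speculative settings above (3 draft steps, top-k 1, 4 draft tokens) put a hard ceiling of 4 tokens per verify step on the accept length. As a rough sanity check on the benchmark numbers, here is a minimal sketch of the expected accept length for a single draft chain. It assumes each drafted token is accepted independently with probability p, which is a simplification, and it treats the logged accept rate as that per-token probability, which may not match how SGLang actually computes the metric. The function name is my own.

```python
def expected_accept_length(p: float, num_steps: int) -> float:
    """Expected tokens emitted per verify step for one draft chain (top-k = 1).

    Assumption: every drafted token is accepted independently with
    probability p, and acceptance stops at the first rejection.
    The probability that at least k drafted tokens are accepted is p**k,
    and the target model always contributes one token of its own,
    which is the k = 0 term.
    """
    return sum(p ** k for k in range(num_steps + 1))


if __name__ == "__main__":
    # num_steps=3 mirrors --speculative-num-steps 3 from the launch command.
    for p in (0.8, 0.9, 1.0):
        print(f"p={p:.2f} -> {expected_accept_length(p, 3):.2f} tokens/step")
```

Under these assumptions, p = 0.8 gives about 2.95 tokens per step and p = 1.0 gives exactly 4, which lines up with the 3.0–4.0 accept length in the table.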

The Main Pitfall: SGLang Version Differences

SGLang 0.5.10.post1 does not include the Qwen3.6 model class (qwen3_6.py). It falls back to the Qwen3.5 code path (Qwen3_5ForConditionalGeneration), so the Qwen3.6-specific Gated Delta Network (GDN) layers are not handled correctly, and the thinking trace collapses from the very first turn.

The visible symptom is that the thinking output gets trapped in loops such as "weak weak 弱的弱的 weakest" or "atomic atomicAtomic atomicatomic". The engine itself keeps emitting tokens normally, but the content is completely broken.

Updating to nightly-dev-cu13-20260424 fixed it. That build includes qwen3_6.py, so the GDN layers are handled correctly.

On Blackwell GPUs, DeepGemm also emits a scale_fmt warning (not ue8m0). As far as I can tell, that means the FP8 checkpoint's scale format does not match the ue8m0 format that Blackwell's DeepGemm-optimized path expects. So far I have not observed a real accuracy penalty, but it is something to keep watching.

Role-Specific LoRA Adapter Strategy

In my agent execution stack, tasks are assigned to role-based talents. Each talent has its own system prompt, tool policy, and expected behavior. A coder and a reviewer do not think the same way, and a reviewer and a tester do not think the same way either. Simply swapping prompts on top of the same base model leaves performance on the table.

The basic strategy is simple: build role-specific LoRA adapters from operational data. I keep day-to-day execution logs and outcomes in structured form, then reuse them as training samples. Pass/fail judgments can serve as a natural DPO signal.
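
To make the pass/fail idea concrete, here is a minimal sketch of turning execution logs into preference pairs. The RunLog fields and the build_dpo_pairs helper are hypothetical, not my actual log schema. The prompt/chosen/rejected layout is a common convention for preference datasets, and the sketch assumes that a passing and a failing output exist for the same prompt.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunLog:
    """One structured execution record (hypothetical schema)."""
    role: str
    prompt: str
    output: str
    passed: bool


def build_dpo_pairs(logs, role):
    """Pair every passing output with every failing output for the same prompt."""
    by_prompt = defaultdict(lambda: {"pass": [], "fail": []})
    for log in logs:
        if log.role != role:
            continue  # role-filtered data: ignore other talents
        bucket = "pass" if log.passed else "fail"
        by_prompt[log.prompt][bucket].append(log.output)

    pairs = []
    for prompt, group in by_prompt.items():
        # Prompts with only passes or only fails yield no pairs.
        for chosen, rejected in product(group["pass"], group["fail"]):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


if __name__ == "__main__":
    logs = [
        RunLog("coder", "fix bug #1", "patch A", True),
        RunLog("coder", "fix bug #1", "patch B", False),
        RunLog("reviewer", "review PR", "LGTM", True),
    ]
    print(build_dpo_pairs(logs, "coder"))
```

Prompts that only ever passed are still useful, but as SFT material rather than as preference pairs.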

I plan to build dedicated adapters for four core roles and train each one on role-filtered data. Because SGLang supports dynamic LoRA loading, adapters can be switched per request without restarting the server, so the switching cost should be small.
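
As an illustration of per-request switching, here is a minimal sketch that only builds the request body and does not send it. The adapter names and the role mapping are placeholders. The text, sampling_params, and lora_path fields reflect my understanding of SGLang's native /generate endpoint with LoRA enabled, and should be treated as an assumption to verify against the installed version.

```python
import json

# Hypothetical role-to-adapter mapping; the adapter names are placeholders.
ROLE_ADAPTERS = {
    "coder": "coder-lora",
    "reviewer": "reviewer-lora",
    "tester": "tester-lora",
}


def build_generate_request(role, prompt, max_new_tokens=512):
    """Build a request body that selects the adapter for the given role.

    Assumption: the server accepts a `lora_path` field naming an
    already-loaded adapter. Unknown roles fall back to the base model
    by omitting the field.
    """
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens},
    }
    adapter = ROLE_ADAPTERS.get(role)
    if adapter is not None:
        payload["lora_path"] = adapter
    return payload


if __name__ == "__main__":
    print(json.dumps(build_generate_request("reviewer", "Review this diff."), indent=2))
```

Keeping the role lookup in one place means the rest of the stack only needs to know the role name, not which adapter backs it.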

Generating DPO Signals

The quality signal comes from two sources. A locally running MIT-licensed model acts as the judge, while an external frontier model evaluates a subset of the outputs for calibration. If both judges agree that an output is good, it becomes a strong SFT candidate. When they disagree, I can form a DPO pair and place the weaker output on the rejected side.
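
One possible reading of this routing, as a minimal sketch: outputs both judges approve go to SFT, and contested outputs are paired against an approved output for the same prompt. How the chosen side of the pair gets picked is my assumption here, and the Judged fields and bucket names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Judged:
    """An output with verdicts from both judges (hypothetical schema)."""
    prompt: str
    output: str
    local_ok: bool               # verdict from the local judge
    frontier_ok: Optional[bool]  # None when outside the calibration subset


def route(sample):
    """Sort a judged output into one of four buckets."""
    if sample.frontier_ok is None:
        return "uncalibrated"
    if sample.local_ok and sample.frontier_ok:
        return "sft"
    if sample.local_ok != sample.frontier_ok:
        return "contested"
    return "rejected"


def pair_contested(samples):
    """Pair each contested output with an agreed-good output for the same prompt."""
    agreed = {}
    for s in samples:
        if route(s) == "sft":
            agreed.setdefault(s.prompt, s.output)  # keep the first approved output

    pairs = []
    for s in samples:
        if route(s) == "contested" and s.prompt in agreed:
            pairs.append(
                {"prompt": s.prompt, "chosen": agreed[s.prompt], "rejected": s.output}
            )
    return pairs
```

Contested outputs without an approved counterpart are left unpaired in this sketch, since there is nothing trustworthy to put on the chosen side.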

I expect this loop to form a natural PDCA (plan-do-check-act) cycle. Better adapters produce better outputs, better outputs produce cleaner training data, and that cleaner data should in turn improve the next round of adapters.

Training Environment

LoRA training and small-scale exploration for a 27B model should be very workable on my local 2x96GB GPU setup. Full-parameter fine-tuning is a different story because both memory demand and training time become much heavier, so the plan is to use the homelab first to find the promising role-specific adapters and only scale further on cloud GPUs when needed.

Axolotl handles training execution, while Dagster manages the pipeline from operational logs through dataset packaging and evaluation.

Why This Model

Inference cost. At 27B parameters and FP8 quantization, it fits comfortably on a single GPU. Even including KV cache, Mamba state, and MTP weights, it stays within 96GB. Four-worker parallelism is realistic.
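
The memory claim can be checked with simple arithmetic on the VRAM breakdown from the benchmark table. This is a minimal sketch. The component figures are the measured values from that table, while the dictionary keys and the helper name are my own.

```python
GPU_VRAM_GB = 96.0

# Measured VRAM breakdown from the benchmark table, in GB.
COMPONENTS_GB = {
    "weights": 28.5,
    "kv_and_mamba_state": 28.5,
    "mtp": 7.0,
}


def vram_summary(components, capacity_gb):
    """Sum the components and report headroom against the card's capacity."""
    used = sum(components.values())
    return {
        "used_gb": used,
        "headroom_gb": capacity_gb - used,
        "utilization": used / capacity_gb,
    }


if __name__ == "__main__":
    summary = vram_summary(COMPONENTS_GB, GPU_VRAM_GB)
    print(f"used {summary['used_gb']:.1f} GB, "
          f"headroom {summary['headroom_gb']:.1f} GB, "
          f"utilization {summary['utilization']:.0%}")
```

The three components add up to 64 GB, which leaves 32 GB of headroom on the 96 GB card.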

Architecture. The hybrid design with Gated Delta Networks means tool calling, multi-step reasoning, and code generation already work well even without fine-tuning. The adapters are not there to patch a broken model. They are there to polish an already capable one for specific roles.

License. Apache 2.0. Outputs, adapters, and derived artifacts are all commercially usable, which fits well with a stack built around proprietary tooling and long-term internal reuse.

Next Steps

The near-term plan is to stabilize the four-role adapter pipeline, run local small-batch training, and compare the results against the base model with A/B evaluations. Anything that looks promising can then be scaled further with cloud GPUs when needed.

The long-term goal is a self-improving agent system where the operational loop continuously generates its own training data. Each model gets better over time within its specialist role, without depending on the availability or pricing volatility of external APIs.

I originally started with a structure centered on Claude CLI subprocesses (claude -p -r), but after Anthropic updated its usage guidance in April this year, I decided to move toward an OSS-model-centered setup. The migration was not small, but it finally feels like I have a shape that can hold up in production.

From here, I want to rebuild the agent stack around this assumption, mass-produce LoRA adapters, and push into larger-scale training where it actually matters.