About This Article

A record of unifying the homelab LLM inference stack around the Gemma 4 family. Covers hardware selection, model allocation, familiar orchestrator role design, Dagster pipeline separation, and telemetry visualization – all in one flow.


Hardware: EPYC + Dual Blackwell

Compute node specs:

  • CPU: AMD EPYC 9175F
  • RAM: 768GB
  • GPU: RTX 6000 PRO MAX-Q 96GB Blackwell x 2 (no NVLink)

I chose to run the two GPUs independently. Without NVLink, Tensor Parallel isn’t an option, but loading two separate models on separate GPUs to maximize parallel slots is more advantageous in the initial phase. The goal is data collection throughput, not single-model inference speed.


The Decision to Unify on Gemma 4

In the launch phase, I prioritized minimizing complexity over model diversity. When Gemma 4 was released, I picked the 31B dense and 26B-A4B MoE – both scoring well and fitting in GPU memory.

Data diversity from heterogeneous models is a problem for after familiar is running stably and the data pipeline is turning. Mixing multiple model families from the start means more prompt tuning and format debugging per model. First, run everything on Gemma 4 and validate the pipeline.


Model Allocation and GPU Memory Design

slotmodelformatruntimeVRAMparallelGPU
naughtyGemma 4 31B ITNVFP4vLLM~17GiB3GPU0
kindergartenGemma 4 26B-A4B ITIQ4_XSllama.cpp~13GiB5GPU1

GPU memory allocation:

  • GPU0 (96GB): Gemma 4 31B NVFP4 ~17GiB + KV cache x 3 parallel slots (~79GB free)
  • GPU1 (96GB): Gemma 4 26B-A4B IQ4_XS ~13GiB + KV cache x 6 parallel slots (~83GB free)

naughty: Gemma 4 31B NVFP4 on vLLM

NVFP4 is a TensorRT Model Optimizer format, fully optimized for Blackwell FP4 kernels. vLLM’s continuous batching maximizes throughput at parallel 3. Supports thinking mode, native function calling, and system prompts.

Benchmarks:

  • AIME 2026: 89.2%
  • LiveCodeBench v6: 80.0%
  • Codeforces ELO: 2150
  • GPQA Diamond: 84.3%
  • MMLU Pro: 85.2%

AIME 89.2% from a 31B dense model is quite strong. NVFP4 quantization fitting in ~17GiB VRAM is significant – plenty of room for KV cache on a 96GB GPU.

kindergarten: Gemma 4 26B-A4B MoE on llama.cpp

26B-A4B is MoE (Mixture of Experts) with 4B active parameters. Small inference memory footprint means more parallel slots. IQ4_XS (bartowski) at ~13GiB.

ik_llama.cpp is used because MoE model inference performance is good there. vLLM excels at continuous batching for dense models, but GGUF-quantized MoE models are easier to work with on ik_llama.cpp.


familiar Model Role Design

familiar has four roles: grandpa / naughty / kindergarten / translator. Configuration is switched via CLAUDE.md:

  grandpa x naughty x kindergarten x translator    # full config
naughty x translator                              # minimal config
naughty x kindergarten x translator               # standard config
  

grandpa is the slot that prioritizes quality over inference speed, handling token/sec balancing. However, in the initial phase where data collection throughput is the priority, grandpa is dropped – naughty x kindergarten x translator becomes the standard config.

translator uses plamo-2-translate, a dedicated translation slot. High ja-en translation quality, used in the multilingual article pipeline.


model-foundry: Dagster Pipeline Separation

In parallel with the inference stack, I also separated the Dagster pipeline into its own repository.

The Dagster pipeline had been co-located in agent-gateway’s devstack/dagster/, but it was functioning as an independent data pipeline platform, not a gateway infrastructure addon. Separated it as model-foundry.

What DuckDB Made Visible

After separation, introducing DuckDB dramatically improved Dagster UI visibility.

Before DuckDB, each asset had to be inspected individually. After, pipeline_event_inbox_record asset STEP_OUTPUT shows event_id, subject, correlation_id as structured dicts. Pipeline execution status visible in a single screen.

A type check error appeared right after DuckDB introduction:

  dagster._core.errors.DagsterTypeCheckDidNotPass: Type check failed for step input
"pipeline_event_inbox_record" - expected type "dict".
Value of type <class 'NoneType'> failed type check
  

The asset was returning None when the NATS queue was empty. Fixed with proper empty event handling.

Event Name Redesign

Also cleaned up knowledge domain pipeline event names:

Old nameNew name
knowledge.chat.persistllm.chat.persist
knowledge.embeddingobsidian.semantic_search
knowledge.flow.lineagellm.flow.lineage
knowledge.tool_callllm.tool_call

knowledge.* was leaking knowledge domain internals. The new names express “which system is this targeting” – a topic design change. This was also a stepping stone toward the 3-domain split in the later v3 redesign.


Grafana LLM Dashboard

Telemetry from the inference stack is visualized in Grafana.

Data Flow

  familiar -> Vector (HTTP port 8687 /telemetry) -> Prometheus / Loki -> Grafana
  

Rendering principles:

  • Grafana renders from real-time data via Vector (does not query PostgreSQL directly)
  • Persistence goes flat into PostgreSQL
  • Summarization needs are handled by Dagster + DuckDB

Vector’s http_telemetry source receives familiar’s telemetry events, handling Prometheus metric forwarding and Loki log storage simultaneously. Sessions can be traced end-to-end via correlation_id.

Vector configuration:

  [sources.http_telemetry]
type = "http_server"
address = "0.0.0.0:8687"
path = "/telemetry"
encoding = "json"
  

Telemetry events from familiar are POSTed as JSON. Vector receives them and runs Prometheus metric transformation and Loki log forwarding in parallel. The Grafana dashboard combines these two data sources to render real-time inference performance and session flows.


Impressions from Actually Running Gemma 4

Tested with nvidia/gemma-4-31b-it-nvfp4 on vLLM at --max-num-seqs 3, and ik_llama.cpp gemma4 branch with Q4_K_L at --parallel 5. Queueing controlled via Go channels, pushing full concurrency.

Gemma 4 31B session flow on Grafana dashboard
Gemma 4 31B naughty x 3 parallel execution flow -- 584 Edge Events, 702 Node Events. Model name gemma-4-31b-it highlighted in raw telemetry

As it turned out, the orchestrator model didn’t route to the 26B MoE side (kindergarten), so the only configuration I could test was 31B IT x 3 in parallel. The deliverables came out complete, but lacked coherence.

I expected KV cache co-location to nudge the outputs toward similar directions, but it didn’t really matter in practice. Even with the same planning passed to all three workers, coordination wasn’t there. Each worker proceeded with inference independently, resulting in one worker ignoring decisions made by another.

Lesson learned: for coding tasks specifically, any-to-any generalist models should only be used for the orchestrator. Workers need constrained prompts and context to maintain quality in parallel execution. Parallelism is effective for data collection throughput, but if you expect coherent collaborative output, the orchestrator design needs to come first.

After That: GLM-5.1 + Qwen3-Coder-Next Configuration

After this experiment, GLM-5.1 was released at just the right time for the grandpa role. I swapped out Kimi-K2.5, combined it with the coding-oriented Qwen3-Coder-Next, and got significantly better results.

  dev0 / GLM-5.1 (grandpa)
  PP:  340 tok/s
  TG:  12-16 tok/s

dev0 / Qwen3-Coder-Next (worker-0)
  PP:  2500 tok/s
  TG:  114 tok/s

dev1 / Qwen3-Coder-Next (worker-1)
  PP:  5100 tok/s
  TG:  161 tok/s

dev1 / PLaMo2Translate (translator)
  PP:  100-240 tok/s
  TG:  35-140 tok/s
  

The configuration with a high-reasoning model as orchestrator and coding-specialized models as workers showed clearly better coordination than the Gemma 4 unified setup. Details of this configuration for another time.