Go + NATS + Dagster AI Orchestration Platform: Design Philosophy and Middleware Selection

Go(Gin) OpenAI/Anthropic-compatible proxy, NATS JetStream event relay, Dagster sensor-driven job execution, pgvector ANN search, ColBERT reranking. Complete design record of a 3-host local AI orchestration platform.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

Conclusion

To integrate LLMs into operational pipelines, an OpenAI/Anthropic-compatible API gateway was built in Go. The gateway handles RAG orchestration on incoming requests and publishes events to NATS JetStream in a fire-and-forget pattern. On the consumption side, Dagster daemon sensors pull from JetStream and execute jobs directly. Telemetry is collected by Vector (Rust) subscribing to NATS and forwarding to Prometheus/Loki.

The core design principle: “the gateway only publishes, each middleware consumes independently.” This decision eliminated Dagster dependencies from the gateway code, fully decoupling deployments.

An initial prototype was designed and implemented in Rust (axum). After that experience, the system was fully migrated to Go. Managing async contexts for SSE relay + NATS subscription + PG/pgvector writes simultaneously was verbose in Rust — goroutine + channel fits this pattern more naturally.

Prerequisites

Language: Go 1.25, Gin framework
Messaging: NATS 2.11 (JetStream enabled)
Orchestration: Dagster (Python) — daemon + sensor + asset materialization
Data store: PostgreSQL 18 + pgvector (JIT enabled)
Telemetry: Vector 0.45 (Rust) → Prometheus + Loki + Grafana
Inference backends: vLLM, llama.cpp, LM Studio
Embedding + Reranker: multi-bert-inference — Rust (Axum) + ONNX Runtime (INT8, ~17MB, ~10ms). ColBERT 256d embed + 64d token MaxSim rerank. DB-free, gRPC 256d vec direct transmission (~1KB/candidate)
Containers: Podman rootless + podman-compose
Host topology: 3 hosts (storage server / desktop mac / compute server)

Design Decisions

Layered Architecture

Following Clean Architecture dependency direction rules: Transport → Domain → Infrastructure.

  Client (opencode CLI, API consumer)
    |
    v
Transport Layer (Gin) [internal/transport/http/]
    |-- middleware: RequestContext -> Logger -> Recovery
    |-- v1/ handlers (OpenAI/Anthropic compatible endpoints)
    |-- Request parsing and validation
    +-- Presentation DTO conversion
    |
    |  Presentation Layer [pkg/openai/, pkg/anthropic/]
    |    +-- OpenAI/Anthropic compatible request/response DTOs
    |
    v
Domain Layer [internal/domain/]
    |-- knowledge/  RAG orchestration
    |-- llm/        Multi-backend LLM routing
    |-- vectorstore/ Vector search (pgvector ANN)
    +-- pipeline/   Pipeline integration (NATS publish, fire-and-forget)
    |
    v
Infrastructure Layer [internal/infra/]
    |-- vllm/       vLLM client
    |-- llamacpp/   llama.cpp client
    |-- lmstudio/   LM Studio client
    |-- postgres/   PostgreSQL + pgvector
    |-- nats/       NATS JetStream publish
    |-- vector/     Vector telemetry publish
    +-- reranker/   multi-bert-inference gRPC client

The Domain layer has zero external dependencies. Adding LLM backends or changing data stores only requires swapping Infrastructure layer implementations.

Gateway Is Publish-Only

The gateway only publishes to NATS. No subscribing, no triggering. Consumers (Dagster sensor, Vector) independently pull/subscribe.

Rationale:

No Dagster dependency in gateway code → deployment isolation
Adding or changing consumers has zero impact on the gateway
Gateway failures do not propagate to the pipeline side

Dagster Sensor with JetStream Pull

Dagster daemon acts as a durable consumer, pulling from JetStream. Sensor → RunRequest → direct job execution. The GraphQL trigger bridge was eliminated in favor of Dagster’s native sensor mechanism.

Why Dagster (Python)

Dagster is an orchestrator (conductor), not a computation executor. Heavy processing runs entirely outside Python.

Actual Computation	Execution Engine	Python/Dagster Role
LLM inference	vLLM / llama.cpp (C++/CUDA)	API call only
Vector search	PostgreSQL + pgvector (C)	SQL query only
Embedding + Reranking	multi-bert-inference (Rust/ONNX Runtime)	gRPC call only
Telemetry conversion	Vector (Rust)	YAML config only

No Rust alternative to Dagster exists. The orchestrator’s performance requirements are I/O wait + state management — Rust’s strengths (CPU-bound, memory safety) do not apply here. Dagster’s value lies in its lineage tracking + sensor + asset materialization design philosophy.

LLM Backend Selection

Backend is automatically selected based on the requested model name.

Backend	Default URL	Example Models
vLLM	`http://$COMPUTE_HOST:8000`	qwen3-next:80b, qwen3-coder:30b
llama.cpp	`http://$COMPUTE_HOST:8081`	nemotron-3-nano:30b
LM Studio	`http://$COMPUTE_HOST:1234`	lfm2.5-1.2b-instruct-mlx

LLM routing is handled by gateway goroutines. The control flow that determines the next backend based on response content requires stateful judgment — Vector cannot substitute this.

Models are continuously evaluated. Verified models are archived on the storage server’s cold storage in GGUF, NVFP4, ONNX and other formats, with performance and quality assessments tracking new model releases.

Model archive on storage server — Verified model archive on storage server cold storage. Accumulated in GGUF, NVFP4, ONNX and other formats

Rust to Go Migration

The initial prototype was designed in Rust (axum). The precise type system was valuable for design, but the writing cost for heavy async patterns was high.

Managing async contexts for SSE relay + NATS subscription + PG/pgvector writes simultaneously was verbose in Rust
goroutine + channel naturally fits this pattern
Runtime overhead difference is negligible for this use case

All design principles were preserved after migration: OpenAI-compatible endpoints, NATS Pub/Sub event-driven architecture, trace_id on all logs, idempotent design. The Rust design documents served as specifications for the Go implementation.

Implementation

Endpoints

Method	Path	Description
POST	`/v1/chat/completions`	Chat completions (OpenAI compatible)
POST	`/v1/messages`	Anthropic Messages API
POST	`/v1/responses`	Responses API
POST	`/v1/embeddings`	Embeddings
GET	`/v1/models`	Model listing
GET	`/healthz`	Health check

Middleware Chain

Gin middleware: RequestContext → Logger → Recovery

RequestContext — Extracts request tracking ID from X-Correlation-ID or X-Request-ID header. Auto-generates a timestamp-based ID when absent, setting it on both context and response header
Logger — Records method, path, status, duration per HTTP request via slog
Recovery — Gin standard panic recovery

RAG Data Flow

  Client
  |
  v
Chat Completion Request
  |
  v
Knowledge Service (RAG orchestration)
  |-- 1. User query extraction
  |-- 2. Embedding generation (multi-bert-inference -> 256d ColBERT)
  |-- 3. Vector search (pgvector HNSW ANN) + reranking (ColBERT MaxSim)
  |-- 4. Context injection
  |-- 5. Send to LLM backend
  |-- 6. NATS publish (pipeline.* + telemetry.*)  <- fire-and-forget
  +-- 7. Response to client

System Topology

                                 NATS JetStream :4222
                              +----------------------+
                              |  pipeline.*           |
                              |  telemetry.*          |
                              +---+----------+--------+
                    publish ------+          | pull subscribe
                    (fire & forget)         | (durable consumer)
                                           |
+-- agent-gateway :8080 ------+    +-- Dagster daemon -----------------------+
|                             |    |                                         |
|  Client Request             |    |  sensor: nats_dagster_chat_persist      |
|    |                        |    |  sensor: nats_dagster_embedding         |
|    v                        |    |  sensor: nats_dagster_retrieve          |
|  Knowledge Service          |    |  sensor: nats_dagster_compose           |
|    |-- embed (multi-bert)   |    |  sensor: nats_dagster_flow_lineage     |
|    |-- vector search (pg)   |    |  sensor: nats_dagster_tool_call        |
|    |-- rerank (ColBERT)     |    |         |                               |
|    |-- LLM inference -------|-->-|  vLLM / llama.cpp / LM Studio          |
|    |                        |    |         |                               |
|    |-- NATS pub: pipeline.* |    |         v                               |
|    |-- NATS pub: telemetry.*|    |  RunRequest -> Dagster job execution   |
|    +-- Response             |    |    |-- chat persist -> pg               |
|                             |    |    |-- embedding event -> pg            |
|  LLM Backends               |    |    |-- retrieve event -> pg             |
|    |-- vLLM :8000           |    |    |-- compose event -> pg              |
|    |-- llama.cpp :8081      |    |    |-- flow lineage -> pg               |
|    +-- LM Studio :1234     |    |    +-- tool call -> pg                  |
|                             |    |                                         |
+-----------------------------+    +-----------------------------------------+
                                           |
                              +------------+
                              | pull subscribe
                              v
                    +-- Vector (Rust) ----------------------+
                    |  source: nats (telemetry.*)           |
                    |    |-- sink: Prometheus exporter      |
                    |    +-- sink: Loki (CorrelationID)    |
                    +--------------------------------------+

NATS Event Design

Two independent paths share the same CorrelationID.

Pipeline (Dagster sensor → job execution):

Topic	Dagster Job
`pipeline.knowledge.chat.persist`	`knowledge_chat_persist` + `knowledge_chat_pair_materialize`
`pipeline.knowledge.embedding`	`knowledge_embedding`
`pipeline.knowledge.retrieve`	`knowledge_retrieve`
`pipeline.knowledge.compose`	`knowledge_compose`
`pipeline.knowledge.flow.lineage`	`knowledge_flow_lineage`
`pipeline.knowledge.tool_call`	`knowledge_tool_call`

Telemetry (Vector → Prometheus / Loki):

Topic	Content
`telemetry.knowledge.chat`	Model, backend, token usage, streaming flag
`telemetry.knowledge.embedding`	Model, input count, dimensions
`telemetry.knowledge.retrieve`	TopK, hit count, scores
`telemetry.knowledge.tool_call`	Function name, model

JetStream stream configuration:

PIPELINE — subjects: pipeline.>, retention: limits, max-age: 72h, storage: file
TELEMETRY — subjects: telemetry.>, retention: limits, max-age: 24h, storage: file

Dagster Assets and Jobs

The Dagster side is structured with asset materialization + sensor patterns.

Dagster Assets catalog — knowledge group asset listing — Dagster Assets catalog. chat_pair, flow_lineage, tool_call are Materialized; others show Never materialized

Each sensor pull-subscribes from NATS JetStream, detects events, and issues RunRequests for direct job execution. No GraphQL API involved.

Dagster tool_call_record asset detail — tool_call_record asset metadata. event_id, trace_id, function_name, model recorded at materialization time

Example tool_call_record: function_name: tokei, model: minimax-m2.5 — records which tool was called by which model as an event, enabling lineage tracking in Dagster.

Dagster knowledge_chat_pair_materialize job — chat_pair_record → chat_pair_dataset asset graph. Prompt-response pair persistence job

The knowledge_chat_pair_materialize job materializes two assets: chat_pair_record and chat_pair_dataset. It persists prompt-response pairs to PostgreSQL as raw material for future fine-tuning dataset construction.

devstack Container Layout

All middleware runs as rootless containers via podman-compose.

Podman Desktop — devstack container listing — Running devstack containers: postgres, reranker, vector, nats, dagster (user-code, webserver, daemon), grafana, node-exporter, promtail

Service	Image	Ports	Role
NATS	nats:2.11-alpine	4222, 8222	JetStream messaging
PostgreSQL	postgres:18-jit-vector	5432	pgvector + JIT, agent_gateway / dagster DB
Dagster	Custom build	3000	Webserver + Daemon + User-code gRPC
Vector	timberio/vector:0.45.0-alpine	8686, 9598	NATS telemetry → Prometheus/Loki
multi-bert-inference	Rust (Axum) + ONNX Runtime	3001 (HTTP), gRPC	ColBERT embed (256d) + rerank (64d MaxSim)

Lakehouse profile (for Phase 2, explicit startup):

Nessie (19120) — Iceberg metadata catalog (git-like branching)
Trino (8081) — SQL engine with Iceberg catalog
dbt-fusion — dbt CLI container

3-Host Network Topology

NATS co-located with the gateway, keeping all paths within 1 hop.

  +-- storage server (observability) --+  +-- desktop mac (gateway) ----------+  +-- compute server (inference) ------+
|                                    |  |                                   |  |                                     |
|  Prometheus <-- scrape --------------|-- :9598 (Vector exporter)          |  |  vLLM :8000 (GPU inference)         |
|  Grafana                           |  |                                   |  |  llama.cpp :8081                    |
|  Loki <-- push -----------------------  Vector (nats source)             |  |                                     |
|  Vector (aggregator)               |  |       |                          |  |  Dagster :3000                       |
|    |-- nats source --------------------  NATS JetStream :4222 <---------|--|---- daemon sensor (pull subscribe)  |
|    |-- -> Loki sink                |  |       ^ publish                  |  |    |-- webserver                     |
|    +-- -> Prometheus exporter      |  |       | (fire-and-forget)        |  |    |-- daemon (sensor -> job)        |
|  MinIO :9000                       |  |       |                          |  |    +-- user-code gRPC               |
|                                    |  |  agent-gateway :8080             |  |                                     |
|                                    |  |    |-- proxy API                 |  |  PostgreSQL 18 :5432                |
|                                    |  |    |   (OpenAI/Anthropic/        |  |    +-- pgvector (ANN + full-text)   |
|                                    |  |    |    Responses)               |  |                                     |
|                                    |  |    |-- knowledge orchestrator    |  |  multi-bert-inference :3001         |
|                                    |  |    +-- NATS pub ----------------|--->  pipeline.* + telemetry.*          |
|                                    |  |                                   |  |                                     |
|                                    |  |  LM Studio :1234 (CPU inference) |  |  Trino :8080                        |
|                                    |  |                                   |  |  Nessie :19120                       |
+------------------------------------+  +-----------------------------------+  +-------------------------------------+

Host	Role	Key Services	Uptime
storage server	Observability + object storage	Prometheus, Grafana, Loki, Vector, MinIO	24/7
desktop mac	Gateway + messaging	agent-gateway, NATS JetStream, LM Studio	Working hours
compute server	GPU inference + data platform	vLLM, llama.cpp, Dagster, PostgreSQL, Trino	Working hours

NATS placed on desktop mac because the gateway (largest publisher) co-locates for zero pub latency. Desktop/compute start and stop together, so NATS availability is not a concern. Only the storage server runs 24/7; Vector detects disconnection and auto-reconnects to NATS on restart.

Data Storage Strategy

Storage is split by data characteristics between PostgreSQL (pgvector) and Iceberg (Parquet).

                      PostgreSQL (pgvector)             Iceberg (Parquet via Trino/Nessie)
                    --------------------             -----------------------------------
Characteristics     Structured + vector search        File-oriented + bulk accumulation
Queries             WHERE + ANN (nearest neighbor)    SQL JOIN + time-travel
Update pattern      UPSERT (row-level)                append-only (immutable Parquet)
Versioning          None (self-managed history)       Nessie branches + Iceberg snapshots

In Phase 1 (real-time), all data is written directly to PostgreSQL. From Phase 2 onward, Dagster batch processes will convert PG data to Parquet → Iceberg append, with dbt-fusion building analytical datasets.

PostgreSQL Schema

document_chunks — 256-dimension ColBERT embeddings + HNSW vector index
chat_history — Conversation history by correlation_id
rerank_scores — Reranking scores per query x document x model
api_responses — Responses API persistence

Caveats

When desktop mac / compute server are stopped, NATS goes down too, but JetStream persistent data is retained on desktop mac disk and resumes from unconsumed messages on next startup
Vector detects disconnection and auto-reconnects, but telemetry events during NATS downtime are lost (pipeline side is protected by JetStream durable consumers)
PostgreSQL 18 JIT requires shm_size: 4gb configuration. Insufficient allocation risks OOM kills

Verification

Confirmed asset materialization status in Dagster UI: chat_pair, flow_lineage, tool_call successfully Materialized
Monitored stream state and consumer lag via NATS monitoring (:8222)
Verified telemetry metrics delivery through Vector exporter (:9598) → Prometheus
Integration tests: go test -tags=integration ./internal/infra/integration ./internal/domain/pipeline

Next Actions

Phase 1 (real-time platform) is currently operational. Roadmap:

Phase	Content	Status
1	gateway → NATS → Dagster/pg, Vector → Prometheus/Loki	Running
2	pg → Parquet → Iceberg batch, dbt-fusion processing	devstack ready
3	MCP server tool definitions, Trino query tool, hypothesis testing automation	Not started
4	Fine-tuning pipeline, MOE expert pruning, synthetic data augmentation	Not started

loFT LLC Established

loFT LLC was established in …

Running GLM-5.2 (744B-A40B) GGUFs Locally: Did MTP Help? Notes From a Few Quant and Expert-Placement Tests

Notes from running two GLM-5.2 …