About This Article

agent-gateway is a proxy gateway that exposes LLM and RAG capabilities through an OpenAI-compatible HTTP API. It was designed to serve as the core of my homelab AI infrastructure. This article covers Phase 1 – building the real-time foundation, spanning roughly two weeks from early to mid-March.

The subsequent v3 redesign split the domains, and the full-layer refactor applied Clean Architecture across the board. This article is about the earlier stage: getting the first working system assembled.


The Architecture Overview

The gateway is a layered architecture built with Go and Gin, with responsibilities clearly separated by layer:

Transport Layer (Gin) [internal/transport/http/]
    |-- middleware: RequestContext -> Logger -> Recovery
    +-- v1/ handlers (OpenAI-compatible endpoints)

Presentation Layer [pkg/openai/]
    +-- OpenAI-compatible / Anthropic-compatible request/response DTOs

Domain Layer [internal/domain/]
    |-- knowledge/  RAG orchestration
    |-- llm/        Multi-backend LLM routing
    |-- vectorstore/ Vector search
    +-- pipeline/   Pipeline integration (NATS pub/sub, Dagster triggers)

Infrastructure Layer [internal/infra/]
    |-- vllm/       vLLM client
    |-- llamacpp/   llama.cpp client
    |-- lmstudio/   LM Studio client
    |-- postgres/   PostgreSQL + pgvector
    |-- nats/       NATS JetStream pub/sub
    |-- dagster/    Dagster job triggers
    +-- reranker/   ColBERT reranker client
  

Three LLM backends are switchable by configuration. In the homelab, GPU inference and CPU inference are used for different purposes:

Backend      Host                      Example Models
vLLM         compute.home.arpa:8000    qwen3-next:80b, qwen3-coder:30b
llama.cpp    compute.home.arpa:8081    nemotron-3-nano:30b
LM Studio    compute.home.arpa:1234    lfm2.5-1.2b-instruct-mlx

The catalog manages over 40 models, and pattern matching on the model name routes each request to the appropriate backend automatically. Names starting with qwen3- go to vLLM, names starting with nemotron- go to llama.cpp, and so on.
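The routing rule can be sketched as a first-match prefix table. This is a minimal illustration rather than the gateway's actual catalog code: the `rule` type, the `resolveBackend` function, and the backend identifiers are hypothetical, and the `lfm2.5-` entry is an assumption inferred from the example model listed for LM Studio.

```go
package main

import (
	"fmt"
	"strings"
)

// rule maps a model-name prefix to a backend identifier.
// The type and the identifiers below are hypothetical.
type rule struct {
	prefix  string
	backend string
}

// rules are checked in order; the first matching prefix wins.
var rules = []rule{
	{prefix: "qwen3-", backend: "vllm"},
	{prefix: "nemotron-", backend: "llamacpp"},
	{prefix: "lfm2.5-", backend: "lmstudio"}, // assumption, not stated in the article
}

// resolveBackend returns the backend for a model name, or false when no
// rule matches. Matching is case-insensitive and ignores surrounding spaces.
func resolveBackend(model string) (string, bool) {
	name := strings.ToLower(strings.TrimSpace(model))
	for _, r := range rules {
		if strings.HasPrefix(name, r.prefix) {
			return r.backend, true
		}
	}
	return "", false
}

func main() {
	for _, m := range []string{"qwen3-coder:30b", "nemotron-3-nano:30b", "unknown-model"} {
		backend, ok := resolveBackend(m)
		fmt.Printf("%-22s backend=%q matched=%v\n", m, backend, ok)
	}
}
```

Keeping the rules in an ordered slice rather than a map makes precedence explicit when two prefixes overlap.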


Fire-and-Forget Event-Driven Design with NATS

The design decision I cared most about in the gateway was separating request processing from data persistence. I didn’t want persistence logic in the response path of a chat completion.

The solution is fire-and-forget publishing to NATS. The gateway processes the request, returns the response, then asynchronously fires events to NATS. Dagster consumes them asynchronously for persistence:

  User -> gateway -> LLM backend -> response returned
                \-> NATS publish (fire-and-forget)
                      \-> Dagster sensor -> JetStream pull -> PostgreSQL
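The shape of the fire-and-forget step can be sketched with the standard library alone. The `AsyncPublisher` type below is hypothetical, and the `publish` callback stands in for whatever NATS client call the gateway actually makes. The point it illustrates is that the request path only performs a non-blocking enqueue, so a slow or unavailable broker never delays the response.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical envelope for one pipeline message.
type Event struct {
	Subject string
	Payload []byte
}

// AsyncPublisher queues events and publishes them from one background goroutine.
type AsyncPublisher struct {
	queue   chan Event
	publish func(Event) error // stand-in for the real NATS publish call
	wg      sync.WaitGroup
}

// NewAsyncPublisher starts the background worker with a bounded queue.
func NewAsyncPublisher(buffer int, publish func(Event) error) *AsyncPublisher {
	p := &AsyncPublisher{queue: make(chan Event, buffer), publish: publish}
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		for ev := range p.queue {
			// Errors are logged and never reach the request path.
			if err := p.publish(ev); err != nil {
				fmt.Println("publish failed:", ev.Subject, err)
			}
		}
	}()
	return p
}

// Fire enqueues without blocking. It returns false when the queue is full
// and the event was dropped.
func (p *AsyncPublisher) Fire(ev Event) bool {
	select {
	case p.queue <- ev:
		return true
	default:
		return false
	}
}

// Close stops accepting events, drains the queue, and waits for the worker.
func (p *AsyncPublisher) Close() {
	close(p.queue)
	p.wg.Wait()
}

func main() {
	p := NewAsyncPublisher(16, func(ev Event) error {
		fmt.Printf("published %s (%d bytes)\n", ev.Subject, len(ev.Payload))
		return nil
	})
	p.Fire(Event{Subject: "pipeline.knowledge.chat.persist", Payload: []byte(`{"model":"qwen3-coder:30b"}`)})
	p.Close()
}
```

Dropping on a full queue is one possible policy. It trades completeness of the persisted history for a bounded memory footprint and a response path that never waits.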
  

Published topics:

pipeline.knowledge.chat.persist    -- chat persistence
pipeline.knowledge.embedding       -- embedding generation events
pipeline.knowledge.retrieve        -- vector search events
pipeline.knowledge.flow.lineage    -- flow DAG lineage tracking
pipeline.knowledge.tool_call       -- tool call logs
  

Telemetry goes through a separate path: Vector subscribes to NATS and forwards to Prometheus and Loki.

Dagster-Side NATS Pipeline

On the Dagster side in model-foundry, I implemented assets and jobs using NATS JetStream consumers:

  • Sensors: poll topics via NATS JetStream durable consumers
  • Assets: materialize received events as DuckDB tables
  • Jobs: knowledge_chat_persist, knowledge_embedding, knowledge_lineage

JetStream pull consumers can redeliver messages, so the persistence side has to be idempotent. The insert is keyed on correlation_id and silently ignores duplicates:

INSERT INTO chat_pairs (correlation_id, prompt, response, model, created_at)
VALUES ($1, $2, $3, $4, $5)
ON CONFLICT (correlation_id) DO NOTHING;
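To make the redelivery behavior concrete, here is an in-memory model of the same semantics. It is only a sketch for reasoning or unit tests: `Store` and `InsertIfAbsent` are hypothetical names, the map stands in for the chat_pairs table, and it is not safe for concurrent use.

```go
package main

import "fmt"

// ChatPair mirrors the columns of the INSERT above, minus created_at.
type ChatPair struct {
	CorrelationID string
	Prompt        string
	Response      string
	Model         string
}

// Store is an in-memory stand-in for the chat_pairs table.
type Store struct {
	rows map[string]ChatPair
}

// NewStore returns an empty store.
func NewStore() *Store { return &Store{rows: make(map[string]ChatPair)} }

// InsertIfAbsent mimics ON CONFLICT (correlation_id) DO NOTHING.
// It reports whether a new row was written.
func (s *Store) InsertIfAbsent(p ChatPair) bool {
	if _, exists := s.rows[p.CorrelationID]; exists {
		return false // duplicate delivery: keep the first row untouched
	}
	s.rows[p.CorrelationID] = p
	return true
}

// Len returns the number of stored rows.
func (s *Store) Len() int { return len(s.rows) }

func main() {
	s := NewStore()
	msg := ChatPair{CorrelationID: "req-1", Prompt: "hi", Response: "hello", Model: "qwen3-coder:30b"}

	fmt.Println(s.InsertIfAbsent(msg)) // first delivery: true
	fmt.Println(s.InsertIfAbsent(msg)) // redelivery: false
	fmt.Println(s.Len())               // 1
}
```

Because the second delivery is a no-op, the consumer can acknowledge it like any other message and no deduplication state is needed outside the database.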
  

RAG Orchestration Flow

The RAG flow for a chat completion request:

  1. Extract user query
  2. Generate embeddings (multi-bert-inference gRPC)
  3. Vector search with pgvector + ColBERT reranking
  4. Attach search results as context
  5. Send to LLM backend
  6. Publish pipeline events to NATS
  7. Return response
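The steps above can be sketched as a single function over injected collaborators. Everything here is hypothetical (the `Deps` struct, the `Answer` function, the prompt layout, the top-k of 8, and the event payloads). Only the step order and the two subject names come from the article.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"strings"
)

// Chunk is a retrieved passage. Every name in this file is hypothetical.
type Chunk struct {
	Content string
	Score   float64
}

// Deps bundles the collaborators as plain functions so they are easy to fake.
type Deps struct {
	Embed    func(ctx context.Context, text string) ([]float32, error)
	Search   func(ctx context.Context, vec []float32, k int) ([]Chunk, error)
	Rerank   func(ctx context.Context, query string, in []Chunk) ([]Chunk, error)
	Complete func(ctx context.Context, prompt string) (string, error)
	Publish  func(subject, payload string) // fire-and-forget: no error to handle
}

// Answer runs steps 2 to 7 for a user query that step 1 already extracted.
func Answer(ctx context.Context, d Deps, query string) (string, error) {
	vec, err := d.Embed(ctx, query) // step 2: embed the query
	if err != nil {
		return "", fmt.Errorf("embed: %w", err)
	}
	chunks, err := d.Search(ctx, vec, 8) // step 3: vector search
	if err != nil {
		return "", fmt.Errorf("search: %w", err)
	}
	chunks, err = d.Rerank(ctx, query, chunks) // step 3: rerank the candidates
	if err != nil {
		return "", fmt.Errorf("rerank: %w", err)
	}

	var b strings.Builder // step 4: attach results as context
	b.WriteString("Context:\n")
	for _, c := range chunks {
		b.WriteString("- " + c.Content + "\n")
	}
	b.WriteString("\nQuestion: " + query)

	out, err := d.Complete(ctx, b.String()) // step 5: call the LLM backend
	if err != nil {
		return "", fmt.Errorf("complete: %w", err)
	}
	d.Publish("pipeline.knowledge.retrieve", query) // step 6: publish events
	d.Publish("pipeline.knowledge.chat.persist", out)
	return out, nil // step 7: return the response
}

// runDemo wires in-memory fakes and returns the prompt the fake LLM received
// together with the subjects that were published.
func runDemo() (prompt string, published []string, err error) {
	d := Deps{
		Embed: func(context.Context, string) ([]float32, error) {
			return []float32{1, 0}, nil
		},
		Search: func(context.Context, []float32, int) ([]Chunk, error) {
			return []Chunk{
				{Content: "NATS is a message bus", Score: 0.4},
				{Content: "pgvector adds vector search", Score: 0.9},
			}, nil
		},
		Rerank: func(_ context.Context, _ string, in []Chunk) ([]Chunk, error) {
			out := append([]Chunk(nil), in...)
			sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
			return out, nil
		},
		Complete: func(_ context.Context, p string) (string, error) {
			prompt = p
			return "ok", nil
		},
		Publish: func(subject, _ string) { published = append(published, subject) },
	}
	_, err = Answer(context.Background(), d, "what is pgvector?")
	return prompt, published, err
}

func main() {
	prompt, published, err := runDemo()
	fmt.Println(prompt)
	fmt.Println(published, err)
}
```

Passing the collaborators as function values keeps the flow testable without a database, a gRPC server, or a broker.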

The vector store is PostgreSQL + pgvector. The document_chunks table has an embedding vector(256) column with an HNSW index on cosine distance:

CREATE TABLE document_chunks (
    id SERIAL PRIMARY KEY,
    document_id TEXT NOT NULL,
    chunk_index INT NOT NULL,
    content TEXT NOT NULL,
    embedding vector(256),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_chunks_embedding ON document_chunks
    USING hnsw (embedding vector_cosine_ops);
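For reference, the quantity this index orders by is cosine distance, which pgvector exposes through the `<=>` operator and defines as one minus cosine similarity. The function below is a small sketch of that formula in Go, useful for checking expectations in tests. The name `cosineDistance` and its error handling for zero vectors are choices made for this example, not pgvector behavior.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// cosineDistance returns 1 - cos(a, b). The result is 0 for vectors that
// point the same way, 1 for orthogonal vectors, and 2 for opposite vectors.
func cosineDistance(a, b []float32) (float64, error) {
	if len(a) == 0 || len(a) != len(b) {
		return 0, errors.New("vectors must be non-empty and of equal length")
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		// Returning an error here is a choice made for this sketch.
		return 0, errors.New("cosine distance is undefined for a zero vector")
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb)), nil
}

func main() {
	pairs := [][2][]float32{
		{{1, 0}, {2, 0}},  // same direction, different length
		{{1, 0}, {0, 1}},  // orthogonal
		{{1, 0}, {-1, 0}}, // opposite
	}
	for _, p := range pairs {
		d, err := cosineDistance(p[0], p[1])
		fmt.Println(p[0], p[1], d, err)
	}
}
```

Because the distance ignores vector length, embeddings do not need to be normalized before they are stored for this index.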
  

On a cache hit, existing embeddings are searched and context is attached. On a cache miss, embeddings are generated and stored before passing to the LLM.


Containerizing multi-bert-inference

The core of RAG – embeddings and reranking – is handled by multi-bert-inference, a Rust gRPC server. It originally ran as a standalone process on desktop.home.arpa; I containerized it for devstack integration.

3-Stage Containerfile

# Stage 1: Build only dependencies for caching
FROM rust:1.94.0-slim-trixie AS deps
RUN apt-get update && apt-get install -y \
    protobuf-compiler pkg-config libssl-dev g++
WORKDIR /app
COPY Cargo.toml Cargo.lock build.rs ./
# Copy proto/ as a directory. Listed among several sources with ./ as the
# destination, only its contents would be copied and the directory flattened.
COPY proto/ proto/
RUN mkdir src && echo "fn main() {}" > src/main.rs && \
    cargo build --release && rm -rf src

# Stage 2: Build actual source
FROM deps AS builder
COPY src/ src/
# Refresh the mtime so cargo rebuilds the crate instead of reusing the dummy build
RUN touch src/main.rs && cargo build --release && \
    strip target/release/multi-bert-inference

# Stage 3: Minimal runtime
FROM debian:trixie-slim
RUN apt-get update && apt-get install -y ca-certificates curl && \
    rm -rf /var/lib/apt/lists/*
RUN useradd -r -s /bin/false app
USER app
COPY --from=builder /app/target/release/multi-bert-inference /usr/local/bin/
EXPOSE 50051 3000
ENTRYPOINT ["multi-bert-inference"]
  

The key is the dummy main.rs cache layer in Stage 1. Building the Cargo dependencies first means a source-only change reuses the cached dependency layer and recompiles just the crate itself. Model files are not included in the image – they’re passed via volume mount. Rebuilding a multi-GB image every time a model updates is something I wanted to avoid.

  podman build -t desktop.home.arpa/multi-bert-inference .
  

Devstack Integration

I added it as the inference service in devstack/desktop/podman-compose.yml:

inference:
  image: desktop.home.arpa/multi-bert-inference
  ports:
    - "50051:50051"
    - "3800:3000"
  volumes:
    - /Users/ksh3/Development/multi-bert-inference/models:/app/models:ro,z
  environment:
    RUST_LOG: multi_bert_inference=info
  healthcheck:
    test: ["CMD-SHELL", "curl -sf http://localhost:3000/healthz || exit 1"]
    interval: 10s
    timeout: 3s
    retries: 3
  restart: unless-stopped
  

The gRPC port 50051 and the REST health check port (container port 3000, mapped to host port 3800) are exposed. The model directory is mounted read-only.


gRPC Client and Graceful Degradation

The inference service connection on the agent-gateway side is managed by internal/infra/inference/client.go. It provides three service clients:

  • EmbeddingServiceClient: Embed() (256-dimensional) and EmbedContext() (late chunking)
  • RerankServiceClient: Rerank() (ColBERT MaxSim)
  • SearchServiceClient: MaxSimSearch()
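ColBERT's MaxSim score, which Rerank() and MaxSimSearch() are named after, is simple enough to write out: for each query token embedding, take the highest similarity against any document token embedding, then sum those maxima. The sketch below uses the dot product as the similarity and assumes all token vectors share one dimension. The function names are made up for this example and say nothing about how multi-bert-inference implements it.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// dot returns the inner product. It assumes b is at least as long as a.
func dot(a, b []float32) float64 {
	var s float64
	for i := range a {
		s += float64(a[i]) * float64(b[i])
	}
	return s
}

// maxSim sums, over query tokens, the best similarity against any document
// token. An empty document scores 0.
func maxSim(query, doc [][]float32) float64 {
	if len(doc) == 0 {
		return 0
	}
	var score float64
	for _, q := range query {
		best := math.Inf(-1)
		for _, d := range doc {
			if s := dot(q, d); s > best {
				best = s
			}
		}
		score += best
	}
	return score
}

// rank returns document indices ordered by descending MaxSim score.
func rank(query [][]float32, docs [][][]float32) []int {
	scores := make([]float64, len(docs))
	order := make([]int, len(docs))
	for i, d := range docs {
		scores[i] = maxSim(query, d)
		order[i] = i
	}
	sort.SliceStable(order, func(i, j int) bool {
		return scores[order[i]] > scores[order[j]]
	})
	return order
}

func main() {
	query := [][]float32{{1, 0}, {0, 1}} // two query tokens
	docs := [][][]float32{
		{{1, 0}},         // covers only the first query token
		{{1, 0}, {0, 1}}, // covers both query tokens
	}
	for i, d := range docs {
		fmt.Printf("doc %d score %.2f\n", i, maxSim(query, d))
	}
	fmt.Println("ranking:", rank(query, docs))
}
```

Unlike a single dense vector comparison, this scoring rewards documents that match every query token somewhere, which is why it works well as a second pass over the pgvector candidates.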

The inference service connection is optional. The gateway runs fine without it:

// The inference client is optional: it stays nil when no address is configured.
var inferClient *inference.Client
if strings.TrimSpace(cfg.InferenceGRPCAddr) != "" {
    var inferErr error
    inferClient, inferErr = inference.NewClient(ctx, cfg.InferenceGRPCAddr)
    if inferErr != nil {
        // Log and continue: the LLM proxy still works without embed/rerank.
        slog.Warn("inference gRPC unavailable", "err", inferErr, "addr", cfg.InferenceGRPCAddr)
    } else {
        slog.Info("inference gRPC connected", "addr", cfg.InferenceGRPCAddr)
        // Wire the reranker only after a successful connection.
        orc.SetRerankerRepository(inference.NewRerankRepository(inferClient))
        defer inferClient.Close()
    }
}
  

If the inference service is down, only embed/rerank is disabled – the LLM proxy and other endpoints continue working normally. Not having to spin up every service during development turned out to be quietly important.
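On the consuming side, the same idea comes down to a nil check before the optional dependency is used. The sketch below is a simplified illustration with hypothetical names (`Orchestrator`, `Reranker`, `Order`). Falling back on a runtime rerank error as well is an assumption added here; the article only describes the startup case.

```go
package main

import (
	"errors"
	"fmt"
)

// Reranker is the optional dependency. All names in this file are hypothetical.
type Reranker interface {
	Rerank(query string, docs []string) ([]string, error)
}

// Orchestrator holds a nil reranker when the inference service is not
// configured or could not be reached at startup.
type Orchestrator struct {
	reranker Reranker
}

// SetReranker wires the dependency in after a successful connection.
func (o *Orchestrator) SetReranker(r Reranker) { o.reranker = r }

// Order returns reranked docs when possible and the input order otherwise.
func (o *Orchestrator) Order(query string, docs []string) []string {
	if o.reranker == nil {
		return docs // feature disabled: keep vector-search order
	}
	out, err := o.reranker.Rerank(query, docs)
	if err != nil {
		// Assumption: a failing reranker degrades instead of failing the request.
		return docs
	}
	return out
}

// reverseReranker is a fake that reverses the input order.
type reverseReranker struct{}

func (reverseReranker) Rerank(_ string, docs []string) ([]string, error) {
	out := make([]string, len(docs))
	for i, d := range docs {
		out[len(docs)-1-i] = d
	}
	return out, nil
}

// brokenReranker is a fake that always fails.
type brokenReranker struct{}

func (brokenReranker) Rerank(string, []string) ([]string, error) {
	return nil, errors.New("inference unavailable")
}

func main() {
	docs := []string{"a", "b", "c"}
	o := &Orchestrator{}
	fmt.Println(o.Order("q", docs)) // no reranker configured

	o.SetReranker(reverseReranker{})
	fmt.Println(o.Order("q", docs)) // reranked

	o.SetReranker(brokenReranker{})
	fmt.Println(o.Order("q", docs)) // reranker failing at runtime
}
```

Calling the setter only on the success path also avoids a classic Go pitfall: storing a typed nil pointer in an interface field would make the `== nil` check report false.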

Embedding Calls from Dagster

Dagster also hits the same gRPC endpoint. The EmbeddingResource directly encodes and decodes protobuf messages to call /search.EmbeddingService/Embed, generating 256-dimensional dense embeddings for the knowledge_events and chat_pairs assets.


col-bert-api: The INT8 Trap and Switch to FP32

In parallel with multi-bert-inference, I was also running col-bert-api, a candle-onnx based axum API for embed/rerank. This is where the INT8 quantization trap hit.

Loading onnx/model_int8.onnx and running inference produced a runtime error. The FP32 version (onnx/model.onnx) worked fine with the same code. The cause was CPU instruction set compatibility – the INT8 quantized model required i8/VNNI instructions that weren’t available.

// Before: INT8 priority fallback
// 1. onnx/model_int8.onnx -> 2. onnx/model_fp16.onnx -> 3. onnx/model.onnx

// After: FP32 fixed
let model = candle_onnx::read_file("onnx/model.onnx")?;
  

Quantized models are meant to reduce inference cost, but using them without verifying hardware compatibility leads to a subtle failure mode: “loads fine but crashes during inference.” I also updated the README from “Prefer provided INT8 models if available” to reflect the FP32-based reality.


What Phase 1 Delivered

Implemented Endpoints

Endpoint                                 Status
POST /v1/chat/completions                Done
POST /v1/messages (Anthropic API)        Done
POST /v1/responses + GET/DELETE          Done
GET /v1/models + GET /v1/models/:id      Done
POST /v1/embeddings                      Done
POST /v1/moderations                     Done
GET /healthz                             Done

LLM Parser Chain

LLM output is normalized through a 4-stage parser chain:

  1. ApplyToolCallParser – tool call normalization
  2. ApplyReasoningParser – thinking/reasoning tag extraction
  3. ApplyVisionParser – image base64 processing
  4. ApplyOutputParser – output format normalization
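A chain like this can be modeled as an ordered list of functions over one message value. The sketch below is illustrative only: `Message`, `Parser`, and `Chain` are hypothetical, and the two stages are simplified stand-ins for the reasoning and output stages (extracting a `<think>` block and trimming whitespace), not the gateway's real parsers.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a hypothetical normalized LLM output.
type Message struct {
	Content   string
	Reasoning string
}

// Parser is one stage of the chain.
type Parser func(Message) Message

// Chain composes stages into one Parser that applies them in order.
func Chain(parsers ...Parser) Parser {
	return func(m Message) Message {
		for _, p := range parsers {
			m = p(m)
		}
		return m
	}
}

// extractReasoning moves the first <think>...</think> block out of Content
// and into Reasoning. Messages without a complete block pass through unchanged.
func extractReasoning(m Message) Message {
	const openTag, closeTag = "<think>", "</think>"
	start := strings.Index(m.Content, openTag)
	end := strings.Index(m.Content, closeTag)
	if start < 0 || end < start {
		return m
	}
	m.Reasoning = strings.TrimSpace(m.Content[start+len(openTag) : end])
	m.Content = m.Content[:start] + m.Content[end+len(closeTag):]
	return m
}

// trimOutput strips surrounding whitespace from the visible content.
func trimOutput(m Message) Message {
	m.Content = strings.TrimSpace(m.Content)
	return m
}

func main() {
	normalize := Chain(extractReasoning, trimOutput)
	out := normalize(Message{Content: "<think>check units</think>\n42 km "})
	fmt.Printf("content=%q reasoning=%q\n", out.Content, out.Reasoning)
}
```

Order matters in a chain like this: trimming before the reasoning block is removed would leave the newline that followed the closing tag in place.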

Middleware

Requests pass through a RequestContext -> Logger -> Recovery chain. RequestContext reads the X-Correlation-ID header, or generates an ID when the header is absent, and binds it to all subsequent logs.
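The correlation ID step can be sketched with net/http from the standard library. The gateway itself uses Gin, so this shows the idea rather than the actual middleware. The helper names are hypothetical, and echoing the ID back in the response header is an assumption of this example.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

const headerName = "X-Correlation-ID"

// ctxKey is an unexported type so the context key cannot collide with others.
type ctxKey struct{}

// newID returns 16 random bytes as 32 hex characters.
func newID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "unknown"
	}
	return hex.EncodeToString(b)
}

// withCorrelationID reuses the incoming header or generates an ID, then
// stores it in the request context for downstream handlers and loggers.
func withCorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(headerName)
		if id == "" {
			id = newID()
		}
		w.Header().Set(headerName, id) // assumption: echo the ID to the client
		ctx := context.WithValue(r.Context(), ctxKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// correlationID reads the ID back out of a context, or "" if none is set.
func correlationID(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

// demoRequest sends one in-memory request and returns the ID the handler saw.
func demoRequest(incoming string) string {
	var seen string
	h := withCorrelationID(http.HandlerFunc(func(_ http.ResponseWriter, r *http.Request) {
		seen = correlationID(r.Context())
	}))
	req := httptest.NewRequest(http.MethodGet, "/healthz", nil)
	if incoming != "" {
		req.Header.Set(headerName, incoming)
	}
	h.ServeHTTP(httptest.NewRecorder(), req)
	return seen
}

func main() {
	fmt.Println("propagated:", demoRequest("req-123"))
	fmt.Println("generated: ", demoRequest(""))
}
```

Since chat_pairs is keyed on correlation_id, an ID assigned this early can follow a request from the first log line through to the persisted row.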

Service Topology

desktop.home.arpa
  -- agent-gateway :8080
  -- NATS :4222
  -- multi-bert-inference :50051

compute.home.arpa
  -- vLLM :8000
  -- llama.cpp :8081
  -- LM Studio :1234
  -- PostgreSQL :5432
  -- Dagster :3300

storage.home.arpa
  -- Prometheus :9090
  -- Loki :3100
  -- MinIO :9000
  

The Phase 1 goal of a “real-time foundation” was complete:

  • gateway -> NATS publish (fire-and-forget)
  • Dagster sensor -> JetStream pull -> PostgreSQL
  • Vector -> NATS subscribe -> Prometheus/Loki
  • pgvector ANN search + ColBERT reranking

From here, the v3 redesign would split the domains apart, but that’s another article.