Building agent-gateway -- Phase 1 Real-Time Knowledge Pipeline and Embedding Service Integration
Building agent-gateway Phase 1: designing the Go + Gin OpenAI-compatible gateway, wiring NATS + Dagster knowledge pipelines, and containerizing multi-bert-inference for gRPC embedding and reranking integration.
About This Article
agent-gateway is a proxy gateway that exposes LLM and RAG capabilities through an OpenAI-compatible HTTP API. It was designed to serve as the core of my homelab AI infrastructure. This article covers Phase 1 – building the real-time foundation, spanning roughly two weeks from early to mid-March.
The subsequent v3 redesign split the domains, and the full-layer refactor applied Clean Architecture across the board. This article is about the earlier stage: getting the first working system assembled.
The Architecture Overview
The gateway uses a layered architecture built with Go + Gin, with a clear separation of responsibilities at each layer:
Transport Layer (Gin) [internal/transport/http/]
-- middleware: RequestContext -> Logger -> Recovery
-- v1/ handlers (OpenAI-compatible endpoints)
Presentation Layer [pkg/openai/]
-- OpenAI-compatible / Anthropic-compatible request/response DTOs
Domain Layer [internal/domain/]
|-- knowledge/ RAG orchestration
|-- llm/ Multi-backend LLM routing
|-- vectorstore/ Vector search
+-- pipeline/ Pipeline integration (NATS pub/sub, Dagster triggers)
Infrastructure Layer [internal/infra/]
|-- vllm/ vLLM client
|-- llamacpp/ llama.cpp client
|-- lmstudio/ LM Studio client
|-- postgres/ PostgreSQL + pgvector
|-- nats/ NATS JetStream pub/sub
|-- dagster/ Dagster job triggers
+-- reranker/ ColBERT reranker client
Three LLM backends are switchable by configuration. In the homelab, GPU inference and CPU inference are used for different purposes:
| Backend | Host | Example Models |
|---|---|---|
| vLLM | compute.home.arpa:8000 | qwen3-next:80b, qwen3-coder:30b |
| llama.cpp | compute.home.arpa:8081 | nemotron-3-nano:30b |
| LM Studio | compute.home.arpa:1234 | lfm2.5-1.2b-instruct-mlx |
The catalog manages over 40 models and routes each one to the appropriate backend by pattern matching on the model name. Names starting with qwen3- go to vLLM, names starting with nemotron- go to llama.cpp, and so on.
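The prefix routing described above can be sketched roughly as follows. This is a minimal illustration, not the gateway's actual catalog code: the `Backend` type, the `routes` table, and `ResolveBackend` are hypothetical names, and the `lfm2.5-` prefix is only inferred from the example model in the table.

```go
package main

import (
	"fmt"
	"strings"
)

// Backend identifies one of the three LLM backends from the table above.
type Backend string

const (
	BackendVLLM     Backend = "vllm"
	BackendLlamaCpp Backend = "llama.cpp"
	BackendLMStudio Backend = "lmstudio"
)

// route pairs a model-name prefix with a backend (hypothetical structure).
type route struct {
	prefix  string
	backend Backend
}

// routes is checked in order, so more specific prefixes should come first.
var routes = []route{
	{"qwen3-", BackendVLLM},
	{"nemotron-", BackendLlamaCpp},
	{"lfm2.5-", BackendLMStudio}, // assumed prefix, inferred from the table
}

// ResolveBackend returns the backend for a model name. The boolean is false
// when no prefix matches, so the caller can reject the request.
func ResolveBackend(model string) (Backend, bool) {
	for _, r := range routes {
		if strings.HasPrefix(model, r.prefix) {
			return r.backend, true
		}
	}
	return "", false
}

func main() {
	for _, m := range []string{"qwen3-coder:30b", "nemotron-3-nano:30b", "unknown-model"} {
		b, ok := ResolveBackend(m)
		fmt.Println(m, b, ok)
	}
}
```

Keeping the table as ordered data rather than a switch statement means a new model family is a one-line change.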
Fire-and-Forget Event-Driven Design with NATS
The design decision I cared most about in the gateway was separating request processing from data persistence. I didn’t want persistence logic in the response path of a chat completion.
The solution is fire-and-forget publishing to NATS. The gateway processes the request, returns the response, and then fires events to NATS without waiting on the outcome. Dagster consumes them asynchronously and handles persistence:
User -> gateway -> LLM backend -> response returned
\-> NATS publish (fire-and-forget)
\-> Dagster sensor -> JetStream pull -> PostgreSQL
Published topics:
pipeline.knowledge.chat.persist -- chat persistence
pipeline.knowledge.embedding -- embedding generation events
pipeline.knowledge.retrieve -- vector search events
pipeline.knowledge.flow.lineage -- flow DAG lineage tracking
pipeline.knowledge.tool_call -- tool call logs
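A fire-and-forget publisher along these lines might look like the sketch below. It uses only the standard library: `Publisher` is a stand-in interface for the NATS client, and `AsyncPublisher`, `Fire`, and the drop-on-full-buffer policy are assumptions made for illustration, not the gateway's real implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// Publisher is a stand-in for the NATS client so the sketch runs on its own.
type Publisher interface {
	Publish(subject string, data []byte) error
}

type event struct {
	subject string
	data    []byte
}

// AsyncPublisher queues events and publishes them from a background
// goroutine, so the request path never waits on the broker.
type AsyncPublisher struct {
	queue chan event
	wg    sync.WaitGroup
}

func NewAsyncPublisher(p Publisher, buffer int) *AsyncPublisher {
	a := &AsyncPublisher{queue: make(chan event, buffer)}
	a.wg.Add(1)
	go func() {
		defer a.wg.Done()
		for ev := range a.queue {
			// Errors are logged, never returned to the request handler.
			if err := p.Publish(ev.subject, ev.data); err != nil {
				fmt.Println("publish failed:", ev.subject, err)
			}
		}
	}()
	return a
}

// Fire enqueues an event without blocking. It reports false when the buffer
// is full and the event was dropped (an assumed policy).
func (a *AsyncPublisher) Fire(subject string, data []byte) bool {
	select {
	case a.queue <- event{subject, data}:
		return true
	default:
		return false
	}
}

// Close drains the queue and stops the worker.
func (a *AsyncPublisher) Close() {
	close(a.queue)
	a.wg.Wait()
}

// memPublisher records subjects in memory for the demo.
type memPublisher struct {
	mu       sync.Mutex
	subjects []string
}

func (m *memPublisher) Publish(subject string, _ []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.subjects = append(m.subjects, subject)
	return nil
}

func main() {
	mem := &memPublisher{}
	pub := NewAsyncPublisher(mem, 16)
	pub.Fire("pipeline.knowledge.chat.persist", []byte(`{"correlation_id":"abc"}`))
	pub.Close()
	fmt.Println(mem.subjects)
}
```

The trade-off is explicit: a slow or unavailable broker costs dropped events, not slower chat completions.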
Telemetry goes through a separate path: Vector subscribes to NATS and forwards to Prometheus and Loki.
Dagster-Side NATS Pipeline
On the Dagster side in model-foundry, I implemented assets and jobs using NATS JetStream consumers:
- Sensors: poll topics via NATS JetStream durable consumers
- Assets: materialize received events as DuckDB tables
- Jobs: knowledge_chat_persist, knowledge_embedding, knowledge_lineage
JetStream pull consumers can redeliver messages, so the persistence side has to be idempotent. The insert is keyed on correlation_id and ignores duplicates:
INSERT INTO chat_pairs (correlation_id, prompt, response, model, created_at)
VALUES ($1, $2, $3, $4, $5)
ON CONFLICT (correlation_id) DO NOTHING;
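To make the redelivery behavior concrete, here is a small in-memory model of the same first-write-wins rule. The `Store` type and `InsertIfAbsent` are hypothetical stand-ins for the chat_pairs table and the statement above; the real persistence runs in Dagster against the database.

```go
package main

import "fmt"

// ChatPair mirrors a subset of the chat_pairs columns.
type ChatPair struct {
	CorrelationID string
	Prompt        string
	Response      string
}

// Store is an in-memory stand-in for the chat_pairs table.
type Store struct {
	rows map[string]ChatPair
}

func NewStore() *Store { return &Store{rows: make(map[string]ChatPair)} }

// InsertIfAbsent mimics INSERT ... ON CONFLICT (correlation_id) DO NOTHING:
// the first write wins and later deliveries of the same event are ignored.
func (s *Store) InsertIfAbsent(p ChatPair) bool {
	if _, exists := s.rows[p.CorrelationID]; exists {
		return false
	}
	s.rows[p.CorrelationID] = p
	return true
}

func main() {
	s := NewStore()
	msg := ChatPair{CorrelationID: "req-1", Prompt: "hi", Response: "hello"}
	fmt.Println(s.InsertIfAbsent(msg)) // first delivery
	fmt.Println(s.InsertIfAbsent(msg)) // redelivery of the same message
	fmt.Println(len(s.rows))
}
```

Because a duplicate is a no-op rather than an error, the consumer can acknowledge a redelivered message without special handling.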
RAG Orchestration Flow
The RAG flow for a chat completion request:
- Extract user query
- Generate embeddings (multi-bert-inference gRPC)
- Vector search with pgvector + ColBERT reranking
- Attach search results as context
- Send to LLM backend
- Publish pipeline events to NATS
- Return response
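The steps above could be wired together roughly like this. It is a sketch under stated assumptions: the `Embedder`, `Searcher`, `Reranker`, and `LLM` interfaces, the `Orchestrator` struct, and the prompt layout are all hypothetical, and the stubs exist only so the example runs without any backing service.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Hypothetical interfaces that name the roles in the flow.
type Embedder interface {
	Embed(ctx context.Context, text string) ([]float32, error)
}
type Searcher interface {
	Search(ctx context.Context, vec []float32, k int) ([]string, error)
}
type Reranker interface {
	Rerank(ctx context.Context, query string, docs []string) ([]string, error)
}
type LLM interface {
	Complete(ctx context.Context, prompt string) (string, error)
}

type Orchestrator struct {
	Embedder Embedder
	Searcher Searcher
	Reranker Reranker
	LLM      LLM
	Publish  func(subject, payload string) // fire-and-forget hook
}

// Answer runs embed -> search -> rerank -> LLM -> publish -> return.
func (o *Orchestrator) Answer(ctx context.Context, query string) (string, error) {
	vec, err := o.Embedder.Embed(ctx, query)
	if err != nil {
		return "", fmt.Errorf("embed: %w", err)
	}
	chunks, err := o.Searcher.Search(ctx, vec, 5)
	if err != nil {
		return "", fmt.Errorf("search: %w", err)
	}
	chunks, err = o.Reranker.Rerank(ctx, query, chunks)
	if err != nil {
		return "", fmt.Errorf("rerank: %w", err)
	}
	prompt := "Context:\n" + strings.Join(chunks, "\n") + "\n\nQuestion: " + query
	answer, err := o.LLM.Complete(ctx, prompt)
	if err != nil {
		return "", fmt.Errorf("llm: %w", err)
	}
	o.Publish("pipeline.knowledge.retrieve", query)
	return answer, nil
}

// Stubs so the sketch runs without any real service.
type stubEmbedder struct{}

func (stubEmbedder) Embed(context.Context, string) ([]float32, error) {
	return make([]float32, 256), nil
}

type stubSearcher struct{}

func (stubSearcher) Search(context.Context, []float32, int) ([]string, error) {
	return []string{"chunk A", "chunk B"}, nil
}

type passReranker struct{}

func (passReranker) Rerank(_ context.Context, _ string, docs []string) ([]string, error) {
	return docs, nil
}

type echoLLM struct{}

func (echoLLM) Complete(_ context.Context, prompt string) (string, error) { return prompt, nil }

// runDemo executes the flow with stubs and returns what the LLM received.
func runDemo(query string) (string, error) {
	o := &Orchestrator{
		Embedder: stubEmbedder{}, Searcher: stubSearcher{},
		Reranker: passReranker{}, LLM: echoLLM{},
		Publish: func(subject, payload string) {},
	}
	return o.Answer(context.Background(), query)
}

func main() {
	out, err := runDemo("what is NATS?")
	fmt.Println(out, err)
}
```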
The vector store is PostgreSQL + pgvector. The document_chunks table has an embedding vector(256) column with an HNSW index on cosine distance:
CREATE TABLE document_chunks (
id SERIAL PRIMARY KEY,
document_id TEXT NOT NULL,
chunk_index INT NOT NULL,
content TEXT NOT NULL,
embedding vector(256),
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_chunks_embedding ON document_chunks
USING hnsw (embedding vector_cosine_ops);
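For reference, the quantity that a cosine-distance index orders by is 1 minus the cosine similarity, which is what pgvector's `<=>` operator computes. The Go function below is an illustrative reimplementation for intuition only; `CosineDistance` is a hypothetical helper, not something the gateway is known to contain.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// CosineDistance returns 1 - cos(a, b). Smaller values mean more similar
// vectors: 0 for the same direction, 1 for orthogonal, 2 for opposite.
func CosineDistance(a, b []float32) (float64, error) {
	if len(a) == 0 || len(a) != len(b) {
		return 0, errors.New("vectors must be non-empty and of equal length")
	}
	var dot, normA, normB float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		normA += x * x
		normB += y * y
	}
	if normA == 0 || normB == 0 {
		return 0, errors.New("cosine distance is undefined for a zero vector")
	}
	return 1 - dot/(math.Sqrt(normA)*math.Sqrt(normB)), nil
}

func main() {
	same, _ := CosineDistance([]float32{3, 4}, []float32{3, 4})
	orth, _ := CosineDistance([]float32{1, 0}, []float32{0, 1})
	fmt.Println(same, orth)
}
```

Because only the direction matters, the stored embeddings do not need to be normalized for this index to rank them correctly.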
On a cache hit, existing embeddings are searched and context is attached. On a cache miss, embeddings are generated and stored before the request is passed to the LLM.
Containerizing multi-bert-inference
The core of RAG – embeddings and reranking – is handled by multi-bert-inference, a Rust gRPC server. It originally ran as a standalone process on desktop.home.arpa; I containerized it for devstack integration.
3-Stage Containerfile
# Stage 1: Build only dependencies for caching
FROM rust:1.94.0-slim-trixie AS deps
RUN apt-get update && apt-get install -y \
protobuf-compiler pkg-config libssl-dev g++
WORKDIR /app
COPY Cargo.toml Cargo.lock build.rs ./
# Copy proto/ as a directory; "COPY proto/ ./" would flatten its contents into /app
COPY proto/ proto/
RUN mkdir src && echo "fn main() {}" > src/main.rs && \
cargo build --release && rm -rf src
# Stage 2: Build actual source
FROM deps AS builder
COPY src/ src/
# Touch main.rs so cargo sees it as newer than the cached dummy build
RUN touch src/main.rs && cargo build --release && strip target/release/multi-bert-inference
# Stage 3: Minimal runtime
FROM debian:trixie-slim
RUN apt-get update && apt-get install -y ca-certificates curl && \
rm -rf /var/lib/apt/lists/*
RUN useradd -r -s /bin/false app
USER app
COPY --from=builder /app/target/release/multi-bert-inference /usr/local/bin/
EXPOSE 50051 3000
ENTRYPOINT ["multi-bert-inference"]
The key is the dummy main.rs cache layer in Stage 1. Building Cargo dependencies first means source changes result in fast incremental rebuilds. Model files are not included in the image – they’re passed via volume mount. Rebuilding a multi-GB image every time a model updates is something I wanted to avoid.
podman build -t desktop.home.arpa/multi-bert-inference .
Devstack Integration
The container is added as the inference service in devstack/desktop/podman-compose.yml:
inference:
image: desktop.home.arpa/multi-bert-inference
ports:
- "50051:50051"
- "3800:3000"
volumes:
- /Users/ksh3/Development/multi-bert-inference/models:/app/models:ro,z
environment:
RUST_LOG: multi_bert_inference=info
healthcheck:
test: ["CMD-SHELL", "curl -sf http://localhost:3000/healthz || exit 1"]
interval: 10s
timeout: 3s
retries: 3
restart: unless-stopped
The gRPC port 50051 and the REST health check port (container port 3000, mapped to host port 3800) are exposed. The model directory is mounted read-only.
gRPC Client and Graceful Degradation
The inference service connection on the agent-gateway side is managed by internal/infra/inference/client.go. It provides three service clients:
- EmbeddingServiceClient: Embed() (256-dimensional) and EmbedContext() (late chunking)
- RerankServiceClient: Rerank() (ColBERT MaxSim)
- SearchServiceClient: MaxSimSearch()
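ColBERT-style MaxSim, which Rerank() and MaxSimSearch() are named after, scores a document by taking each query token embedding, finding its best-matching document token by dot product, and summing those maxima. The sketch below shows the scoring rule itself in Go; the real computation happens inside the Rust service, and `MaxSim` and `dot` here are illustrative names.

```go
package main

import "fmt"

// dot computes the dot product over the shared length of a and b.
func dot(a, b []float32) float64 {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	var s float64
	for i := 0; i < n; i++ {
		s += float64(a[i]) * float64(b[i])
	}
	return s
}

// MaxSim sums, over every query token, the highest similarity that token
// reaches against any document token. An empty document scores 0.
func MaxSim(query, doc [][]float32) float64 {
	if len(doc) == 0 {
		return 0
	}
	var score float64
	for _, q := range query {
		best := dot(q, doc[0])
		for _, d := range doc[1:] {
			if s := dot(q, d); s > best {
				best = s
			}
		}
		score += best
	}
	return score
}

func main() {
	query := [][]float32{{1, 0}, {0, 1}}
	docA := [][]float32{{1, 0}, {0.5, 0.5}}
	docB := [][]float32{{0, 0.2}}
	fmt.Println(MaxSim(query, docA), MaxSim(query, docB))
}
```

Unlike a single pooled vector, this keeps per-token detail, which is why it suits a rerank stage over a small candidate set rather than the first-pass search.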
The inference service connection is optional. The gateway runs fine without it:
var inferClient *inference.Client
if strings.TrimSpace(cfg.InferenceGRPCAddr) != "" {
var inferErr error
inferClient, inferErr = inference.NewClient(ctx, cfg.InferenceGRPCAddr)
if inferErr != nil {
slog.Warn("inference gRPC unavailable", "err", inferErr, "addr", cfg.InferenceGRPCAddr)
} else {
slog.Info("inference gRPC connected", "addr", cfg.InferenceGRPCAddr)
orc.SetRerankerRepository(inference.NewRerankRepository(inferClient))
defer inferClient.Close()
}
}
If the inference service is down, only embed/rerank is disabled – the LLM proxy and other endpoints continue working normally. Not having to spin up every service during development turned out to be quietly important.
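On the request path, the same idea can be expressed as a reranker that is simply skipped when absent or failing. The sketch below is an assumption about how such a fallback might look; `Orchestrator`, `SetReranker`, and `Rank` are hypothetical names and not the gateway's actual orchestrator API.

```go
package main

import (
	"errors"
	"fmt"
)

// Reranker is the optional capability provided by the inference service.
type Reranker interface {
	Rerank(query string, docs []string) ([]string, error)
}

// Orchestrator holds a reranker only if the inference service connected.
type Orchestrator struct {
	reranker Reranker
}

func (o *Orchestrator) SetReranker(r Reranker) { o.reranker = r }

// Rank returns reranked documents, or the original search order when the
// reranker is missing or returns an error.
func (o *Orchestrator) Rank(query string, docs []string) []string {
	if o.reranker == nil {
		return docs
	}
	ranked, err := o.reranker.Rerank(query, docs)
	if err != nil {
		return docs
	}
	return ranked
}

// reverseReranker is a toy reranker that reverses the input order.
type reverseReranker struct{}

func (reverseReranker) Rerank(_ string, docs []string) ([]string, error) {
	out := make([]string, len(docs))
	for i, d := range docs {
		out[len(docs)-1-i] = d
	}
	return out, nil
}

// failingReranker simulates the inference service going down mid-run.
type failingReranker struct{}

func (failingReranker) Rerank(string, []string) ([]string, error) {
	return nil, errors.New("inference service unreachable")
}

func main() {
	docs := []string{"a", "b", "c"}
	o := &Orchestrator{}
	fmt.Println(o.Rank("q", docs)) // no reranker configured
	o.SetReranker(reverseReranker{})
	fmt.Println(o.Rank("q", docs)) // reranked
	o.SetReranker(failingReranker{})
	fmt.Println(o.Rank("q", docs)) // falls back to search order
}
```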
Embedding Calls from Dagster
Dagster also hits the same gRPC endpoint. The EmbeddingResource directly encodes and decodes protobuf messages to call /search.EmbeddingService/Embed, generating 256-dimensional dense embeddings for the knowledge_events and chat_pairs assets.
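Encoding by hand is feasible because both layers are simple on the wire: a protobuf string field is a tag varint, a length varint, and the raw bytes, and gRPC prefixes each message with a 1-byte compression flag and a 4-byte big-endian length. The Dagster resource itself is Python; the sketch below shows the same byte layout in Go for consistency with the rest of the article. Field number 1 for the input text is an assumption, not taken from the real proto definition.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendStringField appends a protobuf length-delimited field (wire type 2):
// a tag varint, a length varint, then the raw bytes.
func appendStringField(dst []byte, fieldNumber uint64, value string) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], fieldNumber<<3|2)
	dst = append(dst, tmp[:n]...)
	n = binary.PutUvarint(tmp[:], uint64(len(value)))
	dst = append(dst, tmp[:n]...)
	return append(dst, value...)
}

// frameGRPC wraps a serialized message in gRPC's length-prefixed framing:
// one compression flag byte (0 = uncompressed) and a 4-byte big-endian length.
func frameGRPC(msg []byte) []byte {
	out := make([]byte, 5, 5+len(msg))
	binary.BigEndian.PutUint32(out[1:5], uint32(len(msg)))
	return append(out, msg...)
}

func main() {
	// Hypothetical request with the text in field 1.
	msg := appendStringField(nil, 1, "hi")
	fmt.Printf("% x\n", frameGRPC(msg))
}
```

The framed bytes are what goes into the HTTP/2 request body for the method path; the response is unframed and decoded the same way in reverse.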
col-bert-api: The INT8 Trap and Switch to FP32
In parallel with multi-bert-inference, I was also running col-bert-api, a candle-onnx based axum API for embed/rerank. This is where the INT8 quantization trap hit.
Loading onnx/model_int8.onnx and running inference produced a runtime error. The FP32 version (onnx/model.onnx) worked fine with the same code. The cause was CPU instruction set compatibility – the INT8 quantized model required i8/VNNI instructions that weren’t available.
// Before: INT8 priority fallback
// 1. onnx/model_int8.onnx -> 2. onnx/model_fp16.onnx -> 3. onnx/model.onnx
// After: FP32 fixed
let model = candle_onnx::read_file("onnx/model.onnx")?;
Quantized models are meant to reduce inference cost, but using them without verifying hardware compatibility leads to a subtle failure mode: “loads fine but crashes during inference.” I also updated the README from “Prefer provided INT8 models if available” to reflect the FP32-based reality.
What Phase 1 Delivered
Implemented Endpoints
| Endpoint | Status |
|---|---|
| POST /v1/chat/completions | Done |
| POST /v1/messages (Anthropic API) | Done |
| POST /v1/responses + GET/DELETE | Done |
| GET /v1/models + GET /v1/models/:id | Done |
| POST /v1/embeddings | Done |
| POST /v1/moderations | Done |
| GET /healthz | Done |
LLM Parser Chain
LLM output is normalized through a 4-stage parser chain:
- ApplyToolCallParser – tool call normalization
- ApplyReasoningParser – thinking/reasoning tag extraction
- ApplyVisionParser – image base64 processing
- ApplyOutputParser – output format normalization
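A staged chain like this can be modeled as a list of functions applied in order. The sketch below is illustrative only: `Message`, `Parser`, and `Chain` are hypothetical, the two stages are simplified stand-ins for the reasoning and output parsers, and the `<think>` tag name is an assumption about the model's output format.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a simplified LLM output that parsers transform step by step.
type Message struct {
	Content   string
	Reasoning string
}

// Parser is one stage of the chain.
type Parser func(Message) Message

// Chain composes parsers so they run in the order given.
func Chain(parsers ...Parser) Parser {
	return func(m Message) Message {
		for _, p := range parsers {
			m = p(m)
		}
		return m
	}
}

// extractReasoning moves a <think>...</think> block out of Content and into
// Reasoning. Messages without a well-formed block pass through unchanged.
func extractReasoning(m Message) Message {
	const openTag, closeTag = "<think>", "</think>"
	start := strings.Index(m.Content, openTag)
	end := strings.Index(m.Content, closeTag)
	if start < 0 || end < start {
		return m
	}
	m.Reasoning = strings.TrimSpace(m.Content[start+len(openTag) : end])
	m.Content = m.Content[:start] + m.Content[end+len(closeTag):]
	return m
}

// trimOutput normalizes surrounding whitespace in the final content.
func trimOutput(m Message) Message {
	m.Content = strings.TrimSpace(m.Content)
	return m
}

func main() {
	parse := Chain(extractReasoning, trimOutput)
	out := parse(Message{Content: "<think>plan steps</think>\nFinal answer."})
	fmt.Printf("%q %q\n", out.Content, out.Reasoning)
}
```

Order matters here: trimming runs last so it can clean up the whitespace left behind once the reasoning block is removed.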
Middleware
The middleware chain runs RequestContext -> Logger -> Recovery. RequestContext extracts the X-Correlation-ID header or auto-generates one, binding it to all subsequent logs.
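The correlation ID handling can be sketched as follows. To stay self-contained, this uses the standard library's net/http rather than Gin, so it is an analogue of the RequestContext middleware and not its actual code; `WithCorrelationID`, `CorrelationID`, and the random hex ID format are assumptions.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type ctxKey struct{}

const headerCorrelationID = "X-Correlation-ID"

// newCorrelationID returns 16 random bytes as a 32-character hex string.
func newCorrelationID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "unknown"
	}
	return hex.EncodeToString(b)
}

// WithCorrelationID reuses the incoming header or generates a new ID, then
// stores it in the request context and echoes it on the response.
func WithCorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(headerCorrelationID)
		if id == "" {
			id = newCorrelationID()
		}
		w.Header().Set(headerCorrelationID, id)
		ctx := context.WithValue(r.Context(), ctxKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// CorrelationID reads the ID back, e.g. to attach it to log records.
func CorrelationID(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

// roundTrip sends one request through the middleware and returns the ID the
// downstream handler observed.
func roundTrip(incoming string) string {
	var seen string
	h := WithCorrelationID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen = CorrelationID(r.Context())
	}))
	req := httptest.NewRequest(http.MethodGet, "/healthz", nil)
	if incoming != "" {
		req.Header.Set(headerCorrelationID, incoming)
	}
	h.ServeHTTP(httptest.NewRecorder(), req)
	return seen
}

func main() {
	fmt.Println(roundTrip("req-123"))
	fmt.Println(len(roundTrip("")))
}
```

Putting the ID in the context is what lets the later Logger stage and any published NATS event carry the same identifier.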
Service Topology
desktop.home.arpa
-- agent-gateway :8080
-- NATS :4222
-- multi-bert-inference :50051
compute.home.arpa
-- vLLM :8000
-- llama.cpp :8081
-- LM Studio :1234
-- PostgreSQL :5432
-- Dagster :3300
storage.home.arpa
-- Prometheus :9090
-- Loki :3100
-- MinIO :9000
The Phase 1 goal of a “real-time foundation” was complete:
- gateway -> NATS publish (fire-and-forget)
- Dagster sensor -> JetStream pull -> PostgreSQL
- Vector -> NATS subscribe -> Prometheus/Loki
- pgvector ANN search + ColBERT reranking
From here, the v3 redesign would split the domains apart, but that’s another article.
