Conclusion

To unify the entry point of a local LLM platform behind an OpenAI-compatible API, I designed and prototyped a proxy in Rust (axum). Roughly 1,600 lines of Rust delivered a working OpenAI/Ollama-compatible multi-backend proxy with SSE streaming, model-name-based backend routing, and a layered architecture.

However, the broader vision (NATS event relay, Dagster oneshot jobs, RAG orchestration, Qdrant semantic cache) remained stubbed. Managing async contexts for simultaneous SSE relay, NATS subscription, and PostgreSQL/Qdrant writes in Rust was verbose enough that the implementation cost of wiring everything together outweighed the runtime benefits. That led to the decision to migrate to Go (Gin).

The Rust prototype did not produce a production system; it produced a locked-down design specification and type contract. The trait definitions and data structures served directly as the spec for the Go rewrite. Goroutines and channels mapped cleanly onto the same architecture, and the Go implementation moved much faster because the design questions were already settled.

This article is the prequel to the full design record of the Go-based AI orchestration platform.


Prerequisites

  • Language: Rust 2024 edition
  • Framework: axum 0.8, tokio 1.48
  • Serialization: serde / serde_json
  • HTTP client: reqwest 0.12
  • Messaging (design only): async-nats 0.45 (stubbed)
  • Data stores (design only): sqlx 0.8 (PostgreSQL), qdrant-client 1.16 (stubbed)
  • Logging: tracing / tracing-subscriber
  • Containers: Docker (proxy + PostgreSQL 18 + Qdrant)
  • Bind address: 0.0.0.0:8080

Design Decisions

Layered Architecture

Following Clean Architecture dependency rules, the system was split into HTTP, Domain, Service, and Infrastructure layers.

Client (CLI / IDE / API consumer)
    |
    v
HTTP Layer [bk/http/]
    |-- routes.rs         endpoint registration
    |-- handlers_chat.rs  OpenAI/Ollama chat handlers
    |-- handlers_embeddings.rs  embeddings handler
    |-- error.rs          OpenAI-compatible error mapping
    |
    v
Domain Layer [bk/domain/]
    |-- chat.rs     ChatCompletionRequest/Response, ChatRoute (Direct/Rag/Workflow)
    |-- rag.rs      RagParams, EmbeddingsRequest/Response
    |-- workflow.rs JobRunId
    |
    v
Service Layer [bk/services/]
    |-- ChatService trait     -> HttpChatService (prod) / StubChatService (dev)
    |-- EmbeddingsService trait -> StubEmbeddingsService
    |-- RagService trait      -> StubRagService
    |
    v
Infrastructure Layer [bk/infra/]
    |-- llm.rs          HttpClient (backend LLM calls)
    |-- qdrant_client.rs QdrantClient (vector search)
    |-- auth.rs         ApiKeyAuthorizer
  

The Domain layer has zero external dependencies. Adding an LLM backend or swapping a data store is an Infra-layer change only. This structure survived the Go migration intact.
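As a rough illustration of that dependency rule, the sketch below shows a handler that depends only on a service trait, with a stub implementation behind it. The field names, the `complete` method, and `handle_chat` are hypothetical, and the method is synchronous here only to keep the sketch dependency-free.

```rust
// Domain layer: plain data types with no external dependencies.
#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub role: String,
    pub content: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChatCompletionRequest {
    pub model: String,
    pub messages: Vec<ChatMessage>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChatCompletionResponse {
    pub model: String,
    pub content: String,
}

// Service layer: the HTTP layer sees only this trait (hypothetical signature).
pub trait ChatService {
    fn complete(&self, req: &ChatCompletionRequest) -> Result<ChatCompletionResponse, String>;
}

// Development stub: answers with a fixed-format reply to the last message.
pub struct StubChatService;

impl ChatService for StubChatService {
    fn complete(&self, req: &ChatCompletionRequest) -> Result<ChatCompletionResponse, String> {
        let last = req
            .messages
            .last()
            .ok_or_else(|| "messages must not be empty".to_string())?;
        Ok(ChatCompletionResponse {
            model: req.model.clone(),
            content: format!("stub reply to: {}", last.content),
        })
    }
}

// HTTP-layer handler: works with any ChatService implementation.
pub fn handle_chat(
    service: &dyn ChatService,
    req: &ChatCompletionRequest,
) -> Result<ChatCompletionResponse, String> {
    service.complete(req)
}
```

Swapping `StubChatService` for an HTTP-backed implementation would leave `handle_chat` untouched, which is the property the layering is meant to protect.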

Dual OpenAI/Ollama Compatibility

The proxy needed to serve both OpenAI and Ollama clients through a single entry point.

| Method | Path | Format |
| --- | --- | --- |
| POST | /v1/chat/completions | OpenAI compatible |
| POST | /v1/embeddings | OpenAI compatible |
| GET | /v1/models | OpenAI compatible |
| POST | /api/chat | Ollama compatible |
| GET | /api/tags | Ollama compatible |
| GET | /health | Health check |
| GET | /ready | Readiness |

The Ollama handler converts between Ollama-native DTOs and the internal ChatService interface. Internally, both paths converge on the same service.
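A minimal sketch of that conversion might look like the following. The struct fields are a hypothetical subset, not the prototype's actual DTOs. One detail worth handling is the `stream` default: Ollama's /api/chat streams unless `stream` is set to false, so the sketch treats an omitted value as true.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

// Ollama-native request body for /api/chat (hypothetical subset of fields).
pub struct OllamaChatRequest {
    pub model: String,
    pub messages: Vec<Message>,
    pub stream: Option<bool>,
}

// Internal request type consumed by the shared ChatService (hypothetical subset).
pub struct ChatCompletionRequest {
    pub model: String,
    pub messages: Vec<Message>,
    pub stream: bool,
}

impl From<OllamaChatRequest> for ChatCompletionRequest {
    fn from(req: OllamaChatRequest) -> Self {
        ChatCompletionRequest {
            model: req.model,
            messages: req.messages,
            // Ollama streams by default, so a missing flag means streaming.
            stream: req.stream.unwrap_or(true),
        }
    }
}
```

With a `From` impl in place, the Ollama handler can call `.into()` on the parsed body and hand the result to the same service the OpenAI handler uses.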

Multi-Backend Routing

The proxy selects a backend automatically based on model name prefix.

// HttpChatService
fn pick_client(&self, model: &str) -> &HttpClient {
    // model prefix -> route-specific client, fallback -> default client
}
  

A default LLM client plus additional route-specific clients are configured at startup. If no prefix matches, the request falls back to the default. Keeping this logic in the Service layer prevents routing changes from propagating into the HTTP layer.
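One way the prefix lookup could be filled in is sketched below. The `routes` and `default_client` fields, the `base_url` field, and the longest-prefix-wins tie-break are assumptions for illustration; the article only specifies prefix matching with a default fallback.

```rust
#[derive(Debug, PartialEq)]
pub struct HttpClient {
    pub base_url: String, // hypothetical field
}

pub struct HttpChatService {
    default_client: HttpClient,
    // (model-name prefix, client) pairs configured at startup.
    routes: Vec<(String, HttpClient)>,
}

impl HttpChatService {
    pub fn new(default_client: HttpClient, routes: Vec<(String, HttpClient)>) -> Self {
        Self { default_client, routes }
    }

    /// Picks the client whose prefix matches the model name.
    /// Assumption: when several prefixes match, the longest one wins.
    pub fn pick_client(&self, model: &str) -> &HttpClient {
        self.routes
            .iter()
            .filter(|(prefix, _)| model.starts_with(prefix.as_str()))
            .max_by_key(|(prefix, _)| prefix.len())
            .map(|(_, client)| client)
            .unwrap_or(&self.default_client)
    }
}
```

Under these assumptions, a model named `gpt-4o` would go to a client registered for `gpt-4` rather than one registered for `gpt-`, and an unmatched name such as `llama3` would go to the default client.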

ChatRoute Branching

Custom headers on the request explicitly select the processing path.

pub enum ChatRoute {
    Direct,    // forward to backend LLM
    Rag,       // vector search + context injection + LLM
    Workflow,  // delegate to Dagster pipeline
}
  

Additional extension headers: x-workspace (RAG workspace ID), x-pipeline (workflow pipeline ID), x-correlation-id (request tracing). This means a single /v1/chat/completions endpoint can express direct inference, RAG, and workflow execution.
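The header handling could be sketched roughly as follows. The route-selecting header name `x-chat-route`, its values, and the rule that Rag requires x-workspace and Workflow requires x-pipeline are all assumptions made for this example; only x-workspace, x-pipeline, and x-correlation-id come from the design. Header names are assumed to be lowercased before lookup.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum ChatRoute {
    Direct,
    Rag,
    Workflow,
}

#[derive(Debug, Clone, PartialEq)]
pub struct RequestContext {
    pub route: ChatRoute,
    pub workspace: Option<String>,
    pub pipeline: Option<String>,
    pub correlation_id: Option<String>,
}

// Hypothetical header name; the article does not name the route header.
const ROUTE_HEADER: &str = "x-chat-route";

pub fn parse_context(headers: &HashMap<String, String>) -> Result<RequestContext, String> {
    let get = |name: &str| headers.get(name).map(|v| v.trim().to_string());

    // A missing route header means plain forwarding to the backend LLM.
    let route = match get(ROUTE_HEADER).as_deref() {
        None | Some("direct") => ChatRoute::Direct,
        Some("rag") => ChatRoute::Rag,
        Some("workflow") => ChatRoute::Workflow,
        Some(other) => return Err(format!("unknown route: {other}")),
    };

    let workspace = get("x-workspace");
    let pipeline = get("x-pipeline");

    // Assumed validation: each non-direct route needs its own identifier.
    if route == ChatRoute::Rag && workspace.is_none() {
        return Err("x-workspace is required for the rag route".to_string());
    }
    if route == ChatRoute::Workflow && pipeline.is_none() {
        return Err("x-pipeline is required for the workflow route".to_string());
    }

    Ok(RequestContext {
        route,
        workspace,
        pipeline,
        correlation_id: get("x-correlation-id"),
    })
}
```

Returning an error string here leaves room for the HTTP layer to map it to a 400 response in whatever error format it uses.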

SSE Streaming

The service layer returns one of three response types.

pub enum ChatServiceResponse {
    Once(ChatCompletionResponse),         // non-stream: single JSON
    Stream(Pin<Box<dyn Stream<...>>>),    // chunked: OpenAI-compatible
    StreamRaw(Pin<Box<dyn Stream<...>>>), // raw text: backend passthrough
}
  

Stream mode sends each chunk as an SSE frame of the form data: {...}\n\n and terminates the stream with data: [DONE]. StreamRaw passes the backend's SSE response through without transformation.
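The framing itself is small enough to sketch without any async machinery. The function names below are hypothetical, and the sketch assumes each payload is already a single-line JSON string, since a raw newline inside a payload would break the frame.

```rust
/// Wraps one payload in an SSE `data:` frame terminated by a blank line.
pub fn sse_frame(payload: &str) -> String {
    format!("data: {payload}\n\n")
}

/// Frames every chunk in order and appends the `[DONE]` sentinel frame.
pub fn sse_body<I, S>(chunks: I) -> String
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    let mut out = String::new();
    for chunk in chunks {
        out.push_str(&sse_frame(chunk.as_ref()));
    }
    out.push_str(&sse_frame("[DONE]"));
    out
}
```

In a real handler the frames would be yielded one at a time as the backend produces them rather than concatenated; the concatenated form is used here only to make the wire format visible.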


Implementation

What Worked

The following scope was verified to work in the prototype.

  • /v1/chat/completions — OpenAI-compatible streaming and non-streaming responses
  • /api/chat — Ollama compatibility with internal DTO conversion
  • /v1/models — model listing from backends (both OpenAI and Ollama formats)
  • Multi-backend routing — automatic client selection by model name prefix
  • SSE streaming — parsing SSE lines from reqwest byte streams and relaying them
  • Structured logging — tracing spans per handler recording model name, stream flag, and message count
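The SSE relay item above hides one practical detail: byte chunks from the backend do not line up with SSE line boundaries, so the parser has to buffer. The sketch below is a simplified stand-in for that logic, not the prototype's code. `SseLineParser` is a hypothetical name, and it assumes one `data:` line per event, which is how OpenAI-style streams are laid out.

```rust
/// Incremental parser: feed arbitrary byte chunks, get complete `data:` payloads.
#[derive(Default)]
pub struct SseLineParser {
    buf: Vec<u8>,
}

impl SseLineParser {
    /// Appends a chunk and returns the payload of every line completed so far.
    pub fn feed(&mut self, chunk: &[u8]) -> Vec<String> {
        self.buf.extend_from_slice(chunk);
        let mut payloads = Vec::new();

        // Only whole lines are decoded, so a multi-byte UTF-8 character
        // split across two chunks is never decoded halfway.
        while let Some(pos) = self.buf.iter().position(|&b| b == b'\n') {
            let raw: Vec<u8> = self.buf.drain(..=pos).collect();
            let text = String::from_utf8_lossy(&raw);
            let line = text.trim_end_matches(|c: char| c == '\r' || c == '\n');

            // Blank lines (event separators) and non-data fields are skipped.
            if let Some(data) = line.strip_prefix("data:") {
                let data = data.strip_prefix(' ').unwrap_or(data);
                payloads.push(data.to_string());
            }
        }
        payloads
    }
}
```

A caller would feed each chunk from the byte stream into `feed` and relay or transform the returned payloads, stopping when it sees `[DONE]`.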

What Remained Stubbed

The following had complete trait definitions and design specs but stub implementations only.

| Feature | Status | Design Position |
| --- | --- | --- |
| RAG context building | StubRagService (returns fixed string) | Qdrant + PostgreSQL pgvector search |
| Embeddings | StubEmbeddingsService (fixed response) | ONNX model inference |
| Workflow pipeline | ChatRoute::Workflow stub | Dagster oneshot job launch |
| NATS event relay | async-nats in dependencies but not wired | evt.chat.{trace_id} publish |
| Authentication | ApiKeyAuthorizer defined but not connected | handler returns Ok(()) |
| PostgreSQL idempotency log | sqlx in dependencies but not wired | idempotency_log / completions_cache |

Error Format

Errors follow the OpenAI JSON format.

pub enum ApiError {
    Unauthorized,       // 401
    Forbidden,          // 403
    BadRequest(String), // 400 -> validation_error
    NotFound(String),   // 404 -> validation_error
    Backend(String),    // 502 -> backend_error
    Internal(String),   // 500 -> proxy_error
}
// -> { "error": { "message": "...", "type": "...", "code": N } }
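The mapping from variant to status, type, and JSON body could be sketched as below. The status codes and the validation_error, backend_error, and proxy_error types come from the enum above. The `auth_error` type and the fixed messages for 401/403 are assumptions, since the article does not specify them, and the hand-written JSON escaping stands in for serde_json to keep the sketch dependency-free.

```rust
#[derive(Debug)]
pub enum ApiError {
    Unauthorized,
    Forbidden,
    BadRequest(String),
    NotFound(String),
    Backend(String),
    Internal(String),
}

impl ApiError {
    pub fn status(&self) -> u16 {
        match self {
            ApiError::Unauthorized => 401,
            ApiError::Forbidden => 403,
            ApiError::BadRequest(_) => 400,
            ApiError::NotFound(_) => 404,
            ApiError::Backend(_) => 502,
            ApiError::Internal(_) => 500,
        }
    }

    pub fn error_type(&self) -> &'static str {
        match self {
            // Assumption: no type is given for 401/403 in the article.
            ApiError::Unauthorized | ApiError::Forbidden => "auth_error",
            ApiError::BadRequest(_) | ApiError::NotFound(_) => "validation_error",
            ApiError::Backend(_) => "backend_error",
            ApiError::Internal(_) => "proxy_error",
        }
    }

    pub fn message(&self) -> &str {
        match self {
            ApiError::Unauthorized => "unauthorized", // assumed wording
            ApiError::Forbidden => "forbidden",       // assumed wording
            ApiError::BadRequest(m)
            | ApiError::NotFound(m)
            | ApiError::Backend(m)
            | ApiError::Internal(m) => m.as_str(),
        }
    }

    /// Renders { "error": { "message", "type", "code" } } as compact JSON.
    pub fn to_json(&self) -> String {
        format!(
            "{{\"error\":{{\"message\":\"{}\",\"type\":\"{}\",\"code\":{}}}}}",
            escape_json(self.message()),
            self.error_type(),
            self.status()
        )
    }
}

/// Minimal JSON string escaping for quotes, backslashes, and control characters.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}
```

In an axum handler the same three accessors would feed an `IntoResponse` implementation, with `status()` becoming the HTTP status and `to_json()` the body.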
  

Docker Compose Setup

services:
  proxy:    # Rust proxy :8080
    environment:
      LLM_BASE_URL: http://host.docker.internal:14434  # vLLM/Ollama
      DATABASE_URL: postgres://postgres:postgres@db:5432/openai
      QDRANT_URL: http://qdrant:6333
    depends_on: [db, qdrant]
  qdrant:   # Vector DB :6333
  db:       # PostgreSQL 18 :5432
  

The Dockerfile is multi-stage, with dependency caching: a Rust 1.79 build stage and a Debian bookworm-slim runtime stage.


NATS + Dagster Design Spec (Not Implemented)

The following design was completed as a specification but not implemented in Rust. After migrating to Go, this spec became the basis for the production implementation.

Runtime Flow

1. Client -> /v1/chat/completions (stream=true)
2. Rust allocates trace_id, starts SSE
3. systemd / Quadlet launches dagster-<job>@{trace_id} as oneshot
4. Dagster ops run LLMs and tools in parallel, publish to evt.chat.{trace_id}
5. Rust subscribes -> converts to OpenAI chunks -> streams via SSE
6. On completion, UPSERT to PostgreSQL / Qdrant
  

Event Schema

{"type":"role","role":"assistant"}
{"type":"token","text":"...","task":"A"}
{"type":"tool_call","name":"search","arguments":"{...}"}
{"type":"usage","usage":{...}}
{"type":"finished","reason":"stop","winner":"llama3.1-8b"}
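Step 5 of the runtime flow converts these events into OpenAI chunks. A rough sketch of that conversion for three of the event types is shown here. The `ChatEvent` enum, the function name, and the choice to drop `task` and `winner` from the client-facing chunk are assumptions; tool_call and usage events are left out to keep the example short. The chunk layout follows the OpenAI `chat.completion.chunk` object.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum ChatEvent {
    Role(String),
    Token { text: String, task: String },
    Finished { reason: String, winner: String },
}

/// Minimal JSON string escaping (stand-in for serde_json in this sketch).
fn esc(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds one OpenAI-style streaming chunk from an internal event.
pub fn to_openai_chunk(id: &str, model: &str, created: u64, ev: &ChatEvent) -> String {
    // Each event contributes either a delta or a finish_reason.
    let (delta, finish) = match ev {
        ChatEvent::Role(role) => (format!("{{\"role\":\"{}\"}}", esc(role)), "null".to_string()),
        ChatEvent::Token { text, .. } => {
            (format!("{{\"content\":\"{}\"}}", esc(text)), "null".to_string())
        }
        ChatEvent::Finished { reason, .. } => ("{}".to_string(), format!("\"{}\"", esc(reason))),
    };
    format!(
        "{{\"id\":\"{}\",\"object\":\"chat.completion.chunk\",\"created\":{},\"model\":\"{}\",\"choices\":[{{\"index\":0,\"delta\":{},\"finish_reason\":{}}}]}}",
        esc(id),
        created,
        esc(model),
        delta,
        finish
    )
}
```

Each returned string would then be wrapped in a `data:` frame, with `data: [DONE]` sent after the chunk produced from the finished event.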
  

Idempotency Design

  • trace_id: session identifier
  • req_id = sha256(model + messages + params): request identifier
  • All side effects through UPSERTs or unique constraints
  • finished emitted exactly once after artifacts are committed
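The rules above can be illustrated with an in-memory model. This is only a sketch: `DefaultHasher` is used as a stand-in for sha256 so the example needs no external crate (its output is not stable across Rust releases, so it would not be suitable for persisted keys), and the `CompletionStore` type with its HashMap and HashSet stands in for PostgreSQL tables with unique constraints.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

/// Derives a request identifier from model, messages, and params.
/// Stand-in for sha256(model + messages + params); NOT a cryptographic hash.
pub fn req_id(model: &str, messages: &[(&str, &str)], params: &str) -> String {
    let mut h = DefaultHasher::new();
    model.hash(&mut h);
    messages.hash(&mut h);
    params.hash(&mut h);
    format!("{:016x}", h.finish())
}

/// In-memory model of the idempotent write path (hypothetical type).
#[derive(Default)]
pub struct CompletionStore {
    rows: HashMap<String, String>, // req_id -> completion; the key acts as a unique constraint
    finished: HashSet<String>,     // trace_ids whose finished event was already emitted
}

impl CompletionStore {
    /// Upserts the artifact first, then reports whether `finished` should be
    /// emitted. Returns true only on the first commit for a given trace_id.
    pub fn commit(&mut self, trace_id: &str, req_id: &str, completion: &str) -> bool {
        self.rows.insert(req_id.to_string(), completion.to_string());
        self.finished.insert(trace_id.to_string())
    }

    pub fn row_count(&self) -> usize {
        self.rows.len()
    }
}
```

A retried job that calls `commit` again with the same identifiers overwrites the same row and gets `false` back, so the caller can skip emitting a second finished event.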

Staged Migration to JetStream

The original plan was to start with NATS Core for simplicity and migrate to JetStream only when durable intake and replay became necessary. After the Go migration, JetStream was adopted from the start.


The Decision to Migrate to Go

After the Rust prototype was running, the following assessment led to the migration decision.

Async Context Management Cost

SSE relay, NATS subscription, and PostgreSQL/Qdrant writes run concurrently within the same request scope. In Rust, wiring these together requires Pin<Box<dyn Stream>>, tokio::select!, and lifetime management that adds significant code volume unrelated to the design intent.

Go’s goroutine plus channel model maps directly onto this pattern. Each async task launches as a goroutine, results merge through channels, and the control flow reads linearly.
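For readers who think in Rust, the fan-in shape described here can be approximated with standard-library threads and an mpsc channel standing in for goroutines and Go channels. This is only an analogy, not the tokio-based prototype code or the Go implementation, and the `Event` type and `run_request` function are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

#[derive(Debug, PartialEq)]
pub enum Event {
    Token(String),          // produced by the event subscription task
    Persisted(&'static str), // produced by a storage write task
}

/// Runs two concurrent producers and merges their results through one channel.
pub fn run_request(tokens: Vec<String>) -> Vec<Event> {
    let (tx, rx) = mpsc::channel();

    // Task 1: relay tokens as they arrive.
    let tx_tokens = tx.clone();
    let relay = thread::spawn(move || {
        for t in tokens {
            tx_tokens.send(Event::Token(t)).unwrap();
        }
    });

    // Task 2: report a completed write.
    let tx_store = tx.clone();
    let store = thread::spawn(move || {
        tx_store.send(Event::Persisted("postgres")).unwrap();
    });

    // Drop the original sender so the receive loop ends when both tasks finish.
    drop(tx);

    // The consumer reads linearly from a single merged channel.
    let events: Vec<Event> = rx.iter().collect();
    relay.join().unwrap();
    store.join().unwrap();
    events
}
```

The order between the two producers is not deterministic, but tokens from the same producer keep their order, which is the same guarantee a single Go channel fed by multiple goroutines gives.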

Runtime Overhead

The bottleneck in this use case is LLM inference I/O wait, not proxy-layer CPU work. Rust’s zero-cost abstractions do not provide a meaningful advantage here.

Design Asset Reuse

The type contracts and design specs from Rust translated directly to Go.

| Rust | Go |
| --- | --- |
| trait ChatService | interface ChatService |
| ChatCompletionRequest struct | ChatCompletionRequest struct |
| ChatRoute enum | ChatRoute const |
| ApiError enum | ApiError type + HTTP status mapping |
| Layered architecture | internal/transport, internal/domain, internal/infra |

The Rust type definitions converted almost one-to-one into Go struct definitions. Routing logic, error format, and extension header specs carried over unchanged.

What Changed After Migration

| Area | Rust Prototype | Go Production |
| --- | --- | --- |
| NATS | Stubbed (async-nats not wired) | JetStream publish (fire-and-forget) |
| Dagster | Design spec only | daemon sensor + asset materialization |
| RAG | StubRagService | Knowledge Service (embed -> pgvector ANN -> rerank) |
| Auth | Not connected | RequestContext middleware with tracking IDs |
| Telemetry | tracing logs only | Vector (Rust) -> Prometheus + Loki + Grafana |
| Host topology | Single-host Docker Compose | 3-host (storage / desktop / compute) |
| Reranker | Design concept only | multi-bert-inference (Rust + ONNX Runtime) gRPC integration |

Caveats

  • The Rust prototype code remains in the openai-api-proxy repository but has not been maintained since the Go migration
  • NATS, Dagster, and RAG design specs evolved during the Go implementation (NATS Core to JetStream, oneshot to sensor-driven, Qdrant to pgvector + ColBERT rerank)
  • The Ollama-compatible endpoint was replaced by the Anthropic Messages API in the Go version

Verification

  • OpenAI-compatible endpoint confirmed working for both streaming and non-streaming responses
  • Ollama-compatible endpoint DTO conversion verified
  • Multi-backend routing by model name prefix confirmed
  • Docker Compose startup and connectivity for proxy + PostgreSQL + Qdrant verified

Next Steps

The Rust prototype served its purpose: locking down the design specification and type contract. That objective was achieved.

The Go production implementation now runs as a 3-host local AI platform with NATS JetStream event relay, a Dagster sensor that pull-subscribes and materializes assets, and RAG orchestration that combines pgvector ANN search with ColBERT reranking.

The full design record is in Designing an AI Orchestration Platform with Go, NATS, and Dagster.