How I Designed an OpenAI-Compatible Proxy in Rust (axum) and Why I Moved to Go

Design and prototype implementation of an OpenAI/Ollama-compatible proxy in Rust (axum), the NATS + Dagster integration specification it produced, and the decision to migrate to Go when async context management became the dominant cost.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

Conclusion

To unify the entry point of a local LLM platform behind an OpenAI-compatible API, I designed and prototyped a proxy in Rust (axum). Roughly 1,600 lines of Rust delivered a working OpenAI/Ollama-compatible multi-backend proxy with SSE streaming, model-name-based backend routing, and a layered architecture.

However, the broader vision — NATS event relay, Dagster oneshot jobs, RAG orchestration, Qdrant semantic cache — remained stubbed. Managing async contexts for simultaneous SSE relay, NATS subscription, and PostgreSQL/Qdrant writes in Rust was verbose enough that the implementation cost of wiring everything together outweighed the runtime benefits. That led to the decision to migrate to Go (Gin).

The Rust prototype did not produce a production system; it produced a locked-down design specification and type contract. The trait definitions and data structures served directly as the spec for the Go rewrite. Go’s goroutines and channels mapped cleanly onto the same architecture, and the Go implementation moved much faster because the design questions were already settled.

This article is the prequel to the full design record of the Go-based AI orchestration platform.

Prerequisites

Language: Rust 2024 edition
Framework: axum 0.8, tokio 1.48
Serialization: serde / serde_json
HTTP client: reqwest 0.12
Messaging (design only): async-nats 0.45 (stubbed)
Data stores (design only): sqlx 0.8 (PostgreSQL), qdrant-client 1.16 (stubbed)
Logging: tracing / tracing-subscriber
Containers: Docker (proxy + PostgreSQL 18 + Qdrant)
Bind address: 0.0.0.0:8080

Design Decisions

Layered Architecture

Following Clean Architecture dependency rules, the system was split into HTTP, Domain, Service, and Infrastructure layers.

Client (CLI / IDE / API consumer)
    |
    v
HTTP Layer [bk/http/]
    |-- routes.rs         endpoint registration
    |-- handlers_chat.rs  OpenAI/Ollama chat handlers
    |-- handlers_embeddings.rs  embeddings handler
    |-- error.rs          OpenAI-compatible error mapping
    |
    v
Domain Layer [bk/domain/]
    |-- chat.rs     ChatCompletionRequest/Response, ChatRoute (Direct/Rag/Workflow)
    |-- rag.rs      RagParams, EmbeddingsRequest/Response
    |-- workflow.rs JobRunId
    |
    v
Service Layer [bk/services/]
    |-- ChatService trait     -> HttpChatService (prod) / StubChatService (dev)
    |-- EmbeddingsService trait -> StubEmbeddingsService
    |-- RagService trait      -> StubRagService
    |
    v
Infrastructure Layer [bk/infra/]
    |-- llm.rs          HttpClient (backend LLM calls)
    |-- qdrant_client.rs QdrantClient (vector search)
    |-- auth.rs         ApiKeyAuthorizer

The Domain layer has zero external dependencies. Adding an LLM backend or swapping a data store is an Infra-layer change only. This structure survived the Go migration intact.
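As a rough illustration of that dependency direction, the sketch below shows a handler that depends only on a ChatService trait, with a stub implementation behind it. It is synchronous and std-only for brevity, whereas the prototype runs on tokio. The field names, the complete method, and handle_chat are assumptions made for this example, not the prototype's actual code.

```rust
// Domain layer: plain data types with no framework dependencies.
#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub role: String,
    pub content: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChatCompletionRequest {
    pub model: String,
    pub messages: Vec<ChatMessage>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChatCompletionResponse {
    pub model: String,
    pub content: String,
}

// Service layer: the HTTP layer depends on this trait, never on a concrete backend.
pub trait ChatService {
    fn complete(&self, req: &ChatCompletionRequest) -> Result<ChatCompletionResponse, String>;
}

// Development stub: echoes the last message instead of calling an LLM.
pub struct StubChatService;

impl ChatService for StubChatService {
    fn complete(&self, req: &ChatCompletionRequest) -> Result<ChatCompletionResponse, String> {
        let last = req
            .messages
            .last()
            .ok_or_else(|| "messages must not be empty".to_string())?;
        Ok(ChatCompletionResponse {
            model: req.model.clone(),
            content: format!("stub reply to: {}", last.content),
        })
    }
}

// HTTP-layer stand-in (hypothetical): accepts any ChatService, so replacing the
// stub with a real client does not change this function.
pub fn handle_chat(svc: &dyn ChatService, req: ChatCompletionRequest) -> Result<String, String> {
    svc.complete(&req).map(|resp| resp.content)
}
```

Calling handle_chat(&StubChatService, req) exercises the whole path without a backend. A production implementation would be passed in the same way.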

Dual OpenAI/Ollama Compatibility

The proxy needed to serve both OpenAI and Ollama clients through a single entry point.

Method	Path	Format
POST	`/v1/chat/completions`	OpenAI compatible
POST	`/v1/embeddings`	OpenAI compatible
GET	`/v1/models`	OpenAI compatible
POST	`/api/chat`	Ollama compatible
GET	`/api/tags`	Ollama compatible
GET	`/health`	Health check
GET	`/ready`	Readiness

The Ollama handler converts between Ollama-native DTOs and the internal ChatService interface. Internally, both paths converge on the same service.
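The conversion step might look roughly like the following. The struct and field names are illustrative assumptions, not the prototype's DTOs. The one detail taken from Ollama's documented behavior is that /api/chat streams unless stream is explicitly set to false.

```rust
// Internal request type shared by both entry points (fields are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub role: String,
    pub content: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChatCompletionRequest {
    pub model: String,
    pub messages: Vec<ChatMessage>,
    pub stream: bool,
}

// Hypothetical Ollama-side DTOs for POST /api/chat.
#[derive(Debug, Clone)]
pub struct OllamaMessage {
    pub role: String,
    pub content: String,
}

#[derive(Debug, Clone)]
pub struct OllamaChatRequest {
    pub model: String,
    pub messages: Vec<OllamaMessage>,
    // None means the client omitted the field.
    pub stream: Option<bool>,
}

impl From<OllamaChatRequest> for ChatCompletionRequest {
    fn from(req: OllamaChatRequest) -> Self {
        ChatCompletionRequest {
            model: req.model,
            messages: req
                .messages
                .into_iter()
                .map(|m| ChatMessage { role: m.role, content: m.content })
                .collect(),
            // Ollama streams by default when the field is absent.
            stream: req.stream.unwrap_or(true),
        }
    }
}
```

With a From implementation in place, the Ollama handler can call into() and hand the result to the same service as the OpenAI handler.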

Multi-Backend Routing

The proxy selects a backend automatically based on model name prefix.

// HttpChatService
fn pick_client(&self, model: &str) -> &HttpClient {
    // model prefix -> route-specific client, fallback -> default client
}

A default LLM client and additional route-specific clients are configured at startup. If no prefix matches, the request falls back to the default. Keeping this logic in the Service layer prevents routing changes from propagating into the HTTP layer.
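One way to fill in pick_client is sketched below. The routes field, the constructor, and the longest-prefix-wins rule are assumptions for illustration. The source only states that a prefix match selects a route-specific client and that anything else falls back to the default.

```rust
// Stand-in for the backend HTTP client; only the base URL matters here.
#[derive(Debug, PartialEq)]
pub struct HttpClient {
    pub base_url: String,
}

pub struct HttpChatService {
    default_client: HttpClient,
    // (model-name prefix, client) pairs registered at startup.
    routes: Vec<(String, HttpClient)>,
}

impl HttpChatService {
    pub fn new(default_client: HttpClient, routes: Vec<(String, HttpClient)>) -> Self {
        Self { default_client, routes }
    }

    // Pick the client whose prefix matches the model name. The longest
    // matching prefix wins (an assumed rule), so "llama3" can override "llama".
    pub fn pick_client(&self, model: &str) -> &HttpClient {
        self.routes
            .iter()
            .filter(|(prefix, _)| model.starts_with(prefix.as_str()))
            .max_by_key(|(prefix, _)| prefix.len())
            .map(|(_, client)| client)
            .unwrap_or(&self.default_client)
    }
}
```

Because the lookup returns a borrowed client, the handler never needs to know how many backends exist.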

ChatRoute Branching

Custom headers on the request explicitly select the processing path.

pub enum ChatRoute {
    Direct,    // forward to backend LLM
    Rag,       // vector search + context injection + LLM
    Workflow,  // delegate to Dagster pipeline
}

Additional extension headers: x-workspace (RAG workspace ID), x-pipeline (workflow pipeline ID), x-correlation-id (request tracing). This means a single /v1/chat/completions endpoint can express direct inference, RAG, and workflow execution.
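Header-driven branching of this kind could be sketched as follows. The source does not name the header that selects the route, so x-route is a hypothetical name. Defaulting to Direct when it is absent and rejecting unknown values are also assumptions. Headers are modeled as a plain map with lowercase keys to keep the example free of framework types.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum ChatRoute {
    Direct,
    Rag,
    Workflow,
}

#[derive(Debug, Clone, PartialEq)]
pub struct RouteContext {
    pub route: ChatRoute,
    pub workspace: Option<String>,      // x-workspace
    pub pipeline: Option<String>,       // x-pipeline
    pub correlation_id: Option<String>, // x-correlation-id
}

// "x-route" is a hypothetical selector header; the other three names come
// from the text above.
pub fn route_from_headers(headers: &HashMap<String, String>) -> Result<RouteContext, String> {
    let raw = headers.get("x-route").map(|v| v.trim().to_ascii_lowercase());
    let route = match raw.as_deref() {
        // Assumed default: no selector header means plain forwarding.
        None | Some("direct") => ChatRoute::Direct,
        Some("rag") => ChatRoute::Rag,
        Some("workflow") => ChatRoute::Workflow,
        Some(other) => return Err(format!("unknown route: {}", other)),
    };
    Ok(RouteContext {
        route,
        workspace: headers.get("x-workspace").cloned(),
        pipeline: headers.get("x-pipeline").cloned(),
        correlation_id: headers.get("x-correlation-id").cloned(),
    })
}
```

The handler can then match on RouteContext::route once and pass the remaining fields to whichever service handles that path.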

SSE Streaming

The service layer returns one of three response types.

pub enum ChatServiceResponse {
    Once(ChatCompletionResponse),       // non-stream: single JSON
    Stream(Pin<Box<dyn Stream<...>>>),  // chunked: OpenAI-compatible
    StreamRaw(Pin<Box<dyn Stream<...>>>), // raw text: backend passthrough
}

Stream mode sends SSE frames of the form data: {...}\n\n and terminates with a final data: [DONE] frame. StreamRaw passes the backend SSE response through without transformation.
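The wire framing for Stream mode is small enough to show directly. The helper names below are assumptions. The sketch also assumes each payload is already serialized as single-line JSON, since a raw newline inside a payload would break the frame.

```rust
// Wrap one serialized JSON payload in an SSE data frame: "data: <json>\n\n".
pub fn sse_frame(json: &str) -> String {
    format!("data: {}\n\n", json)
}

// Terminator frame used by OpenAI-compatible streams.
pub fn sse_done() -> &'static str {
    "data: [DONE]\n\n"
}

// Build the full SSE body from a sequence of chunk payloads.
pub fn sse_body(chunks: &[&str]) -> String {
    let mut out = String::new();
    for chunk in chunks {
        out.push_str(&sse_frame(chunk));
    }
    out.push_str(sse_done());
    out
}
```

In a real handler the frames would be yielded one at a time from a stream rather than concatenated, but the bytes on the wire are the same.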

Implementation

What Worked

The prototype was verified to work for the following scope.

/v1/chat/completions — OpenAI-compatible streaming and non-streaming responses
/api/chat — Ollama compatibility with internal DTO conversion
/v1/models — model listing from backends (both OpenAI and Ollama formats)
Multi-backend routing — automatic client selection by model name prefix
SSE streaming — parsing SSE lines from reqwest byte streams and relaying them
Structured logging — tracing spans per handler recording model name, stream flag, and message count
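The SSE relay item hides one practical detail: network chunks do not align with line boundaries, so a parser has to buffer partial lines. A minimal std-only sketch of that buffering follows. The type and method names are assumptions, and in the prototype the chunks would come from a reqwest byte stream rather than from direct calls.

```rust
// Incremental SSE reader: bytes arrive in arbitrary chunks, so a partial line
// stays buffered until its terminating '\n' arrives.
#[derive(Default)]
pub struct SseLineBuffer {
    buf: Vec<u8>,
}

#[derive(Debug, PartialEq)]
pub enum SseEvent {
    Data(String), // payload of a "data:" line
    Done,         // the "[DONE]" terminator
}

impl SseLineBuffer {
    pub fn new() -> Self {
        Self::default()
    }

    // Feed one network chunk and return every event completed by it.
    pub fn push(&mut self, chunk: &[u8]) -> Vec<SseEvent> {
        self.buf.extend_from_slice(chunk);
        let mut events = Vec::new();
        while let Some(pos) = self.buf.iter().position(|&b| b == b'\n') {
            // Remove the complete line (including '\n') from the buffer.
            let line: Vec<u8> = self.buf.drain(..=pos).collect();
            // Decoding only complete lines avoids splitting multi-byte UTF-8.
            let text = String::from_utf8_lossy(&line);
            let text = text.trim_end_matches(|c: char| c == '\n' || c == '\r');
            // Blank separators, comments, and other fields are skipped.
            if let Some(payload) = text.strip_prefix("data:") {
                // The SSE format allows one optional space after the colon.
                let payload = payload.strip_prefix(' ').unwrap_or(payload);
                if payload == "[DONE]" {
                    events.push(SseEvent::Done);
                } else {
                    events.push(SseEvent::Data(payload.to_string()));
                }
            }
        }
        events
    }
}
```

Feeding a frame split across two chunks returns nothing for the first call and the completed event for the second.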

What Remained Stubbed

The following had complete trait definitions and design specs but stub implementations only.

Feature	Status	Planned Design
RAG context building	StubRagService (returns fixed string)	Qdrant + PostgreSQL pgvector search
Embeddings	StubEmbeddingsService (fixed response)	ONNX model inference
Workflow pipeline	ChatRoute::Workflow stub	Dagster oneshot job launch
NATS event relay	async-nats in dependencies but not wired	`evt.chat.{trace_id}` publish
Authentication	ApiKeyAuthorizer defined but not connected	handler returns `Ok(())`
PostgreSQL idempotency log	sqlx in dependencies but not wired	idempotency_log / completions_cache

Error Format

Errors follow the OpenAI JSON format.

pub enum ApiError {
    Unauthorized,      // 401
    Forbidden,         // 403
    BadRequest(String), // 400 -> validation_error
    NotFound(String),  // 404 -> validation_error
    Backend(String),   // 502 -> backend_error
    Internal(String),  // 500 -> proxy_error
}
// -> { "error": { "message": "...", "type": "...", "code": N } }
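A sketch of how the enum could map onto status codes and the JSON body is shown below. The status codes and the three type strings follow the comments in the enum. The auth_error type and the fixed messages for Unauthorized and Forbidden are assumptions, since the source does not list them. JSON is built by hand only to keep the example std-only; the prototype lists serde_json as a dependency.

```rust
#[derive(Debug)]
pub enum ApiError {
    Unauthorized,
    Forbidden,
    BadRequest(String),
    NotFound(String),
    Backend(String),
    Internal(String),
}

impl ApiError {
    // HTTP status for each variant, as annotated in the enum definition.
    pub fn status(&self) -> u16 {
        match self {
            ApiError::Unauthorized => 401,
            ApiError::Forbidden => 403,
            ApiError::BadRequest(_) => 400,
            ApiError::NotFound(_) => 404,
            ApiError::Backend(_) => 502,
            ApiError::Internal(_) => 500,
        }
    }

    pub fn error_type(&self) -> &'static str {
        match self {
            // Assumed value: the source gives no type for these two variants.
            ApiError::Unauthorized | ApiError::Forbidden => "auth_error",
            ApiError::BadRequest(_) | ApiError::NotFound(_) => "validation_error",
            ApiError::Backend(_) => "backend_error",
            ApiError::Internal(_) => "proxy_error",
        }
    }

    fn message(&self) -> &str {
        match self {
            ApiError::Unauthorized => "unauthorized", // assumed message
            ApiError::Forbidden => "forbidden",       // assumed message
            ApiError::BadRequest(m)
            | ApiError::NotFound(m)
            | ApiError::Backend(m)
            | ApiError::Internal(m) => m.as_str(),
        }
    }

    // Render { "error": { "message", "type", "code" } } as compact JSON.
    pub fn to_json(&self) -> String {
        format!(
            "{{\"error\":{{\"message\":\"{}\",\"type\":\"{}\",\"code\":{}}}}}",
            escape_json(self.message()),
            self.error_type(),
            self.status()
        )
    }
}

// Minimal JSON string escaping for the sketch.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}
```

Keeping status and type on the enum itself means a handler can return ApiError and leave the response shape to a single conversion point.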

Docker Compose Setup

services:
  proxy:    # Rust proxy :8080
    environment:
      LLM_BASE_URL: http://host.docker.internal:14434  # vLLM/Ollama
      DATABASE_URL: postgres://postgres:postgres@db:5432/openai
      QDRANT_URL: http://qdrant:6333
    depends_on: [db, qdrant]
  qdrant:   # Vector DB :6333
  db:       # PostgreSQL 18 :5432

Multi-stage Dockerfile: a Rust 1.79 build stage with dependency caching, and a Debian bookworm-slim runtime stage.

NATS + Dagster Design Spec (Not Implemented)

The following design was completed as a specification but not implemented in Rust. After migrating to Go, this spec became the basis for the production implementation.

Runtime Flow

1. Client -> /v1/chat/completions(stream=true)
2. Rust allocates trace_id, starts SSE
3. systemd / Quadlet launches dagster-<job>@{trace_id} as oneshot
4. Dagster ops run LLMs and tools in parallel, publish to evt.chat.{trace_id}
5. Rust subscribes -> converts to OpenAI chunks -> streams via SSE
6. On completion, UPSERT to PostgreSQL / Qdrant

Event Schema

{"type":"role","role":"assistant"}
{"type":"token","text":"...","task":"A"}
{"type":"tool_call","name":"search","arguments":"{...}"}
{"type":"usage","usage":{...}}
{"type":"finished","reason":"stop","winner":"llama3.1-8b"}
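Step 5 of the runtime flow, converting these events into OpenAI chunks, might look like the sketch below. It covers only the role, token, and finished events; tool_call and usage are omitted because their full payloads are not spelled out here. The function name, its parameters, and the hand-built JSON are assumptions, and the chunk layout follows the publicly documented chat.completion.chunk shape.

```rust
// Subset of the event schema above, as a typed enum.
#[derive(Debug, Clone, PartialEq)]
pub enum ChatEvent {
    Role { role: String },
    Token { text: String, task: String },
    Finished { reason: String, winner: String },
}

// Minimal JSON string escaping for the sketch.
fn escape_json(s: &str) -> String {
    let mut out = String::new();
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

// Convert one event into the JSON payload of an OpenAI-style streaming chunk.
pub fn to_openai_chunk(id: &str, model: &str, created: u64, ev: &ChatEvent) -> String {
    // Each event becomes a delta object plus a finish_reason (null until the end).
    let (delta, finish) = match ev {
        ChatEvent::Role { role } => (
            format!("{{\"role\":\"{}\"}}", escape_json(role)),
            "null".to_string(),
        ),
        ChatEvent::Token { text, .. } => (
            format!("{{\"content\":\"{}\"}}", escape_json(text)),
            "null".to_string(),
        ),
        ChatEvent::Finished { reason, .. } => (
            "{}".to_string(),
            format!("\"{}\"", escape_json(reason)),
        ),
    };
    format!(
        "{{\"id\":\"{}\",\"object\":\"chat.completion.chunk\",\"created\":{},\"model\":\"{}\",\"choices\":[{{\"index\":0,\"delta\":{},\"finish_reason\":{}}}]}}",
        escape_json(id),
        created,
        escape_json(model),
        delta,
        finish
    )
}
```

Internal fields such as task and winner are dropped at this boundary, so clients see only standard OpenAI fields.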

Idempotency Design

trace_id: session identifier
req_id = sha256(model + messages + params): request identifier
All side effects through UPSERTs or unique constraints
finished emitted exactly once after artifacts are committed
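The idempotency rules can be illustrated with an in-memory stand-in. This is a sketch under loud assumptions: std's DefaultHasher replaces SHA-256 only to avoid external crates (it is not a cryptographic hash and must not be used for persistent IDs), a HashMap plays the role of a table with a unique constraint on req_id, and all type and function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

// Stand-in for req_id = sha256(model + messages + params).
// DefaultHasher is NOT SHA-256; it only keeps this sketch dependency-free.
pub fn req_id(model: &str, messages: &[&str], params: &str) -> u64 {
    let mut h = DefaultHasher::new();
    model.hash(&mut h);
    messages.hash(&mut h);
    params.hash(&mut h);
    h.finish()
}

// In-memory stand-in for tables with unique constraints.
#[derive(Default)]
pub struct CompletionStore {
    rows: HashMap<u64, String>,
    finished: HashSet<String>,
}

impl CompletionStore {
    // UPSERT: writing the same req_id twice leaves exactly one row.
    pub fn upsert(&mut self, id: u64, artifact: &str) {
        self.rows.insert(id, artifact.to_string());
    }

    // Returns true only the first time a trace is marked finished, so a
    // retried job cannot cause a second "finished" event for that trace_id.
    pub fn mark_finished(&mut self, trace_id: &str) -> bool {
        self.finished.insert(trace_id.to_string())
    }

    pub fn row_count(&self) -> usize {
        self.rows.len()
    }
}
```

A caller would commit artifacts with upsert first and publish finished only when mark_finished returns true, which matches the ordering in the list above.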

Staged Migration to JetStream

The original plan was to start with NATS Core for simplicity and migrate to JetStream only when durable intake and replay became necessary. After the Go migration, JetStream was adopted from the start.

The Decision to Migrate to Go

After the Rust prototype was running, the following assessment led to the migration decision.

Async Context Management Cost

SSE relay, NATS subscription, and PostgreSQL/Qdrant writes run concurrently within the same request scope. In Rust, wiring these together requires Pin<Box<dyn Stream>>, tokio::select!, and explicit lifetime management, all of which add significant code volume unrelated to the design intent.

Go’s goroutine plus channel model maps directly onto this pattern. Each async task launches as a goroutine, results merge through channels, and the control flow reads linearly.
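The fan-in shape described here (independent tasks merging results onto one channel) is sketched below. To stay in one language it uses Rust's std threads and mpsc channels as a stand-in for goroutines and channels. It is neither the Go production code nor the tokio-based prototype, and the Event type and run_request function are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

#[derive(Debug, PartialEq)]
pub enum Event {
    Token(String),
    Stored(&'static str),
}

// Run an event source and a storage write as independent tasks and merge
// their results on a single channel.
pub fn run_request(tokens: Vec<String>) -> Vec<Event> {
    let (tx, rx) = mpsc::channel();

    // Task 1: stand-in for the NATS subscription producing tokens.
    let tx_tokens = tx.clone();
    let producer = thread::spawn(move || {
        for t in tokens {
            tx_tokens.send(Event::Token(t)).unwrap();
        }
    });

    // Task 2: stand-in for the PostgreSQL/Qdrant write.
    let tx_store = tx.clone();
    let writer = thread::spawn(move || {
        tx_store.send(Event::Stored("postgres")).unwrap();
    });

    // Drop the original sender so the receive loop ends once both tasks finish.
    drop(tx);

    // The relay loop reads linearly from the merged channel.
    let events: Vec<Event> = rx.iter().collect();
    producer.join().unwrap();
    writer.join().unwrap();
    events
}
```

Tokens keep their relative order because they come from one sender, while the position of the Stored event relative to them is not deterministic.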

Runtime Overhead

The bottleneck in this use case is LLM inference I/O wait, not proxy-layer CPU work. Rust’s zero-cost abstractions do not provide a meaningful advantage here.

Design Asset Reuse

The type contracts and design specs from Rust translated directly to Go.

Rust	Go
trait ChatService	interface ChatService
ChatCompletionRequest struct	ChatCompletionRequest struct
ChatRoute enum	ChatRoute const
ApiError enum	ApiError type + HTTP status mapping
Layered architecture	internal/transport, internal/domain, internal/infra

The Rust type definitions converted almost one-to-one into Go struct definitions. Routing logic, error format, and extension header specs carried over unchanged.

What Changed After Migration

Area	Rust Prototype	Go Production
NATS	Stubbed (async-nats not wired)	JetStream publish (fire-and-forget)
Dagster	Design spec only	daemon sensor + asset materialization
RAG	StubRagService	Knowledge Service (embed -> pgvector ANN -> rerank)
Auth	Not connected	RequestContext middleware with tracking IDs
Telemetry	tracing logs only	Vector (Rust) -> Prometheus + Loki + Grafana
Host topology	Single-host Docker Compose	3-host (storage / desktop / compute)
Reranker	Design concept only	multi-bert-inference (Rust + ONNX Runtime) gRPC integration

Caveats

The Rust prototype code remains in the openai-api-proxy repository but has not been maintained since the Go migration
The NATS, Dagster, and RAG design specs evolved during the Go implementation (NATS Core to JetStream, oneshot to sensor-driven, Qdrant to pgvector + ColBERT rerank)
The Ollama-compatible endpoint was replaced by the Anthropic Messages API in the Go version

Verification

OpenAI-compatible endpoint confirmed working for both streaming and non-streaming responses
Ollama-compatible endpoint DTO conversion verified
Multi-backend routing by model name prefix confirmed
Docker Compose startup and connectivity for proxy + PostgreSQL + Qdrant verified

Next Steps

The Rust prototype served its purpose: locking down the design specification and type contract. That objective was achieved.

The Go production implementation now runs NATS JetStream event relay, Dagster sensor-based pull subscription with asset materialization, and RAG orchestration using pgvector ANN search with ColBERT reranking, all on a 3-host local AI platform.

The full design record is in Designing an AI Orchestration Platform with Go, NATS, and Dagster.
