How I Designed an OpenAI-Compatible Proxy in Rust (axum) and Why I Moved to Go
Design and prototype implementation of an OpenAI/Ollama-compatible proxy in Rust (axum), the NATS + Dagster integration specification it produced, and the decision to migrate to Go when async context management became the dominant cost.
Conclusion
To unify the entry point of a local LLM platform behind an OpenAI-compatible API, I designed and prototyped a proxy in Rust (axum). Roughly 1,600 lines of Rust delivered a working OpenAI/Ollama-compatible multi-backend proxy with SSE streaming, model-name-based backend routing, and a layered architecture.
However, the broader vision — NATS event relay, Dagster oneshot jobs, RAG orchestration, Qdrant semantic cache — remained stubbed. Managing async contexts for simultaneous SSE relay, NATS subscription, and PostgreSQL/Qdrant writes in Rust was verbose enough that the implementation cost of wiring everything together outweighed the runtime benefits. That led to the decision to migrate to Go (Gin).
What the Rust prototype produced was not a production system but a locked-down design specification and type contract. The trait definitions and data structures served directly as the spec for the Go rewrite. Go's goroutines and channels mapped cleanly onto the same architecture, and the Go implementation moved much faster because the design questions were already settled.
This article is the prequel to the full design record of the Go-based AI orchestration platform.
Prerequisites
- Language: Rust 2024 edition
- Framework: axum 0.8, tokio 1.48
- Serialization: serde / serde_json
- HTTP client: reqwest 0.12
- Messaging (design only): async-nats 0.45 (stubbed)
- Data stores (design only): sqlx 0.8 (PostgreSQL), qdrant-client 1.16 (stubbed)
- Logging: tracing / tracing-subscriber
- Containers: Docker (proxy + PostgreSQL 18 + Qdrant)
- Bind address: 0.0.0.0:8080
Design Decisions
Layered Architecture
Following Clean Architecture dependency rules, the system was split into HTTP, Domain, Service, and Infrastructure layers.
Client (CLI / IDE / API consumer)
|
v
HTTP Layer [bk/http/]
|-- routes.rs endpoint registration
|-- handlers_chat.rs OpenAI/Ollama chat handlers
|-- handlers_embeddings.rs embeddings handler
|-- error.rs OpenAI-compatible error mapping
|
v
Domain Layer [bk/domain/]
|-- chat.rs ChatCompletionRequest/Response, ChatRoute (Direct/Rag/Workflow)
|-- rag.rs RagParams, EmbeddingsRequest/Response
|-- workflow.rs JobRunId
|
v
Service Layer [bk/services/]
|-- ChatService trait -> HttpChatService (prod) / StubChatService (dev)
|-- EmbeddingsService trait -> StubEmbeddingsService
|-- RagService trait -> StubRagService
|
v
Infrastructure Layer [bk/infra/]
|-- llm.rs HttpClient (backend LLM calls)
|-- qdrant_client.rs QdrantClient (vector search)
|-- auth.rs ApiKeyAuthorizer
The Domain layer has zero external dependencies. Adding an LLM backend or swapping a data store is an Infra-layer change only. This structure survived the Go migration intact.
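To make the layering concrete, here is a minimal sketch of how a handler can depend only on a service trait. It is not the prototype's code: the struct fields, the synchronous `complete` method, and the `handle_chat` function are simplified assumptions (the real service is presumably async and carries more fields).

```rust
// Hypothetical, simplified domain types; the real structs carry more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct ChatCompletionRequest {
    pub model: String,
    pub messages: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChatCompletionResponse {
    pub model: String,
    pub content: String,
}

// The contract the HTTP layer depends on (synchronous here for brevity).
pub trait ChatService {
    fn complete(&self, req: &ChatCompletionRequest) -> Result<ChatCompletionResponse, String>;
}

// Development stub: returns a fixed reply without touching any backend.
pub struct StubChatService;

impl ChatService for StubChatService {
    fn complete(&self, req: &ChatCompletionRequest) -> Result<ChatCompletionResponse, String> {
        Ok(ChatCompletionResponse {
            model: req.model.clone(),
            content: "stub reply".to_string(),
        })
    }
}

// Handler-like function (hypothetical) that knows the trait, not the implementation.
pub fn handle_chat(
    service: &dyn ChatService,
    req: ChatCompletionRequest,
) -> Result<ChatCompletionResponse, String> {
    if req.messages.is_empty() {
        return Err("messages must not be empty".to_string());
    }
    service.complete(&req)
}
```

Swapping `StubChatService` for a backend-calling implementation changes nothing in `handle_chat`, which is the property the layering is meant to protect.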
Dual OpenAI/Ollama Compatibility
The proxy needed to serve both OpenAI and Ollama clients through a single entry point.
| Method | Path | Format |
|---|---|---|
| POST | /v1/chat/completions | OpenAI compatible |
| POST | /v1/embeddings | OpenAI compatible |
| GET | /v1/models | OpenAI compatible |
| POST | /api/chat | Ollama compatible |
| GET | /api/tags | Ollama compatible |
| GET | /health | Health check |
| GET | /ready | Readiness |
The Ollama handler converts between Ollama-native DTOs and the internal ChatService interface. Internally, both paths converge on the same service.
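The DTO conversion can be sketched as a `From` implementation. The field sets below are reduced and the struct layouts are assumptions, not the prototype's definitions. The one behavioral detail worth showing is the `stream` default: as far as I understand Ollama's API, `/api/chat` streams unless `stream` is set to false, whereas OpenAI clients must opt in.

```rust
// Hypothetical, simplified DTOs; field sets are reduced for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

// Subset of an Ollama /api/chat request body.
#[derive(Debug, Clone)]
pub struct OllamaChatRequest {
    pub model: String,
    pub messages: Vec<Message>,
    pub stream: Option<bool>, // omitted by many Ollama clients
}

// Subset of the internal request consumed by the shared chat service.
#[derive(Debug, Clone, PartialEq)]
pub struct ChatCompletionRequest {
    pub model: String,
    pub messages: Vec<Message>,
    pub stream: bool,
}

impl From<OllamaChatRequest> for ChatCompletionRequest {
    fn from(req: OllamaChatRequest) -> Self {
        ChatCompletionRequest {
            model: req.model,
            messages: req.messages,
            // Assumed Ollama default: stream when the field is absent.
            stream: req.stream.unwrap_or(true),
        }
    }
}
```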
Multi-Backend Routing
The proxy selects a backend automatically based on model name prefix.
// HttpChatService
fn pick_client(&self, model: &str) -> &HttpClient {
// model prefix -> route-specific client, fallback -> default client
}
A default LLM client plus additional route-specific clients are configured at startup. If no prefix matches, the request falls back to the default. Keeping this logic in the Service layer prevents routing changes from propagating into the HTTP layer.
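One way the prefix lookup could be filled in is shown below. This is a sketch under assumptions: the `HttpClient` stand-in holds only a base URL, the `routes` vector and `new` constructor are hypothetical, and the longest-prefix-wins rule is my choice for resolving overlapping prefixes rather than something the prototype is known to do.

```rust
// Hypothetical stand-in for the proxy's HTTP client; only the base URL matters here.
#[derive(Debug, PartialEq)]
pub struct HttpClient {
    pub base_url: String,
}

pub struct HttpChatService {
    default_client: HttpClient,
    // (model-name prefix, client) pairs, configured at startup.
    routes: Vec<(String, HttpClient)>,
}

impl HttpChatService {
    pub fn new(default_client: HttpClient, routes: Vec<(String, HttpClient)>) -> Self {
        Self { default_client, routes }
    }

    // Pick the client whose prefix matches the model name. The longest
    // matching prefix wins (assumed rule); no match falls back to the default.
    pub fn pick_client(&self, model: &str) -> &HttpClient {
        self.routes
            .iter()
            .filter(|(prefix, _)| model.starts_with(prefix.as_str()))
            .max_by_key(|(prefix, _)| prefix.len())
            .map(|(_, client)| client)
            .unwrap_or(&self.default_client)
    }
}
```

With routes for `gpt-` and `gpt-4`, a request for `gpt-4o` would go to the second client, and a model such as `llama3.1-8b` would fall through to the default.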
ChatRoute Branching
Custom headers on the request explicitly select the processing path.
pub enum ChatRoute {
Direct, // forward to backend LLM
Rag, // vector search + context injection + LLM
Workflow, // delegate to Dagster pipeline
}
Additional extension headers: x-workspace (RAG workspace ID), x-pipeline (workflow pipeline ID), x-correlation-id (request tracing). This means a single /v1/chat/completions endpoint can express direct inference, RAG, and workflow execution.
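Header-based selection might look like the following sketch. The selecting header is not named in this article, so `x-route` is a hypothetical name, and the rules that a missing header means Direct and that Rag and Workflow require `x-workspace` and `x-pipeline` respectively are assumptions made for illustration.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChatRoute {
    Direct,
    Rag,
    Workflow,
}

// Hypothetical header name: the selecting header is not named in this article.
const ROUTE_HEADER: &str = "x-route";

// Resolve the processing path from request headers (keys assumed lowercase).
pub fn resolve_route(headers: &HashMap<String, String>) -> Result<ChatRoute, String> {
    let requested = match headers.get(ROUTE_HEADER) {
        None => return Ok(ChatRoute::Direct), // assumed default
        Some(value) => value.trim().to_ascii_lowercase(),
    };
    match requested.as_str() {
        "direct" => Ok(ChatRoute::Direct),
        // Assumption: RAG needs a workspace and Workflow needs a pipeline.
        "rag" if headers.contains_key("x-workspace") => Ok(ChatRoute::Rag),
        "rag" => Err("x-workspace is required for the rag route".to_string()),
        "workflow" if headers.contains_key("x-pipeline") => Ok(ChatRoute::Workflow),
        "workflow" => Err("x-pipeline is required for the workflow route".to_string()),
        other => Err(format!("unknown route: {other}")),
    }
}
```

Rejecting unknown values instead of silently falling back keeps a typo in a header from turning a RAG request into plain inference.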
SSE Streaming
The service layer returns one of three response types.
pub enum ChatServiceResponse {
Once(ChatCompletionResponse), // non-stream: single JSON
Stream(Pin<Box<dyn Stream<...>>>), // chunked: OpenAI-compatible
StreamRaw(Pin<Box<dyn Stream<...>>>), // raw text: backend passthrough
}
Stream mode sends each chunk as a data: {...}\n\n SSE frame and terminates with a final data: [DONE] frame. StreamRaw passes through the backend SSE response without transformation.
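The framing itself is small enough to show in full. This is an illustrative sketch with hypothetical helper names, not the prototype's streaming code, and it assumes each payload is compact JSON with no raw newlines.

```rust
// Wrap one JSON payload in an SSE data frame: "data: <payload>\n\n".
// The payload must not contain raw newlines (compact JSON does not).
pub fn sse_frame(json_payload: &str) -> String {
    format!("data: {json_payload}\n\n")
}

// Terminator frame that OpenAI-compatible clients treat as end of stream.
pub fn sse_done() -> String {
    "data: [DONE]\n\n".to_string()
}

// Frame a sequence of chunk payloads and append the terminator.
pub fn frame_stream<'a, I>(chunks: I) -> String
where
    I: IntoIterator<Item = &'a str>,
{
    let mut out = String::new();
    for chunk in chunks {
        out.push_str(&sse_frame(chunk));
    }
    out.push_str(&sse_done());
    out
}
```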
Implementation
What Worked
The prototype was verified to work for the following scope.
- /v1/chat/completions: OpenAI-compatible streaming and non-streaming responses
- /api/chat: Ollama compatibility with internal DTO conversion
- /v1/models: model listing from backends (both OpenAI and Ollama formats)
- Multi-backend routing: automatic client selection by model name prefix
- SSE streaming: parsing SSE lines from reqwest byte streams and relaying them
- Structured logging: tracing spans per handler recording model name, stream flag, and message count
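The SSE line parsing mentioned in the list has one subtle point: network chunks do not respect line boundaries, so bytes must be buffered until a newline arrives. The sketch below shows that idea with a hypothetical `SseLineBuffer` type; it is a simplified stand-in for whatever the prototype does with reqwest's byte stream, and it ignores SSE fields other than `data:`.

```rust
// Accumulates raw bytes and yields the payload of each complete "data:" line.
#[derive(Default)]
pub struct SseLineBuffer {
    buf: Vec<u8>,
}

impl SseLineBuffer {
    // Feed one network chunk; returns the payloads of lines it completed.
    pub fn push(&mut self, chunk: &[u8]) -> Vec<String> {
        self.buf.extend_from_slice(chunk);
        let mut payloads = Vec::new();
        while let Some(pos) = self.buf.iter().position(|&b| b == b'\n') {
            // Take the line, including its '\n', out of the buffer.
            let raw: Vec<u8> = self.buf.drain(..=pos).collect();
            // Decoding whole lines only, so a multi-byte character split
            // across chunks is never decoded half-way.
            let text = String::from_utf8_lossy(&raw);
            let line = text.trim_end_matches(|c: char| c == '\n' || c == '\r');
            if let Some(rest) = line.strip_prefix("data:") {
                payloads.push(rest.trim_start().to_string());
            }
        }
        payloads
    }
}
```

Blank separator lines and comment lines produce no payload, and an incomplete trailing line stays in the buffer until the next chunk.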
What Remained Stubbed
The following had complete trait definitions and design specs but stub implementations only.
| Feature | Status | Design Position |
|---|---|---|
| RAG context building | StubRagService (returns fixed string) | Qdrant + PostgreSQL pgvector search |
| Embeddings | StubEmbeddingsService (fixed response) | ONNX model inference |
| Workflow pipeline | ChatRoute::Workflow stub | Dagster oneshot job launch |
| NATS event relay | async-nats in dependencies but not wired | evt.chat.{trace_id} publish |
| Authentication | ApiKeyAuthorizer defined but not connected | handler returns Ok(()) |
| PostgreSQL idempotency log | sqlx in dependencies but not wired | idempotency_log / completions_cache |
Error Format
Errors follow the OpenAI JSON format.
pub enum ApiError {
Unauthorized, // 401
Forbidden, // 403
BadRequest(String), // 400 -> validation_error
NotFound(String), // 404 -> validation_error
Backend(String), // 502 -> backend_error
Internal(String), // 500 -> proxy_error
}
// -> { "error": { "message": "...", "type": "...", "code": N } }
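The status and type mapping in the comments can be written out as methods. This sketch repeats the enum so it stands alone. The `auth_error` label for 401/403 and the fixed messages for those variants are assumptions, since the article gives no type for them, and the hand-rolled JSON escaping stands in for serde_json to keep the example dependency-free.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum ApiError {
    Unauthorized,
    Forbidden,
    BadRequest(String),
    NotFound(String),
    Backend(String),
    Internal(String),
}

impl ApiError {
    pub fn status(&self) -> u16 {
        match self {
            ApiError::Unauthorized => 401,
            ApiError::Forbidden => 403,
            ApiError::BadRequest(_) => 400,
            ApiError::NotFound(_) => 404,
            ApiError::Backend(_) => 502,
            ApiError::Internal(_) => 500,
        }
    }

    pub fn error_type(&self) -> &'static str {
        match self {
            // Assumed label: the article gives no type for 401/403.
            ApiError::Unauthorized | ApiError::Forbidden => "auth_error",
            ApiError::BadRequest(_) | ApiError::NotFound(_) => "validation_error",
            ApiError::Backend(_) => "backend_error",
            ApiError::Internal(_) => "proxy_error",
        }
    }

    pub fn message(&self) -> &str {
        match self {
            ApiError::Unauthorized => "unauthorized", // assumed wording
            ApiError::Forbidden => "forbidden",       // assumed wording
            ApiError::BadRequest(m)
            | ApiError::NotFound(m)
            | ApiError::Backend(m)
            | ApiError::Internal(m) => m.as_str(),
        }
    }

    // Render { "error": { "message", "type", "code" } } as compact JSON.
    pub fn to_json(&self) -> String {
        format!(
            "{{\"error\":{{\"message\":\"{}\",\"type\":\"{}\",\"code\":{}}}}}",
            escape_json(self.message()),
            self.error_type(),
            self.status()
        )
    }
}

// Minimal JSON string escaping (quotes, backslashes, control characters).
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}
```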
Docker Compose Setup
services:
  proxy: # Rust proxy :8080
    environment:
      LLM_BASE_URL: http://host.docker.internal:14434 # vLLM/Ollama
      DATABASE_URL: postgres://postgres:postgres@db:5432/openai
      QDRANT_URL: http://qdrant:6333
    depends_on: [db, qdrant]
  qdrant: # Vector DB :6333
  db: # PostgreSQL 18 :5432
The Dockerfile is multi-stage: a Rust 1.79 build stage with dependency caching, followed by a Debian bookworm-slim runtime stage.
NATS + Dagster Design Spec (Not Implemented)
The following design was completed as a specification but not implemented in Rust. After migrating to Go, this spec became the basis for the production implementation.
Runtime Flow
1. Client -> /v1/chat/completions(stream=true)
2. Rust allocates trace_id, starts SSE
3. systemd / Quadlet launches dagster-<job>@{trace_id} as oneshot
4. Dagster ops run LLMs and tools in parallel, publish to evt.chat.{trace_id}
5. Rust subscribes -> converts to OpenAI chunks -> streams via SSE
6. On completion, UPSERT to PostgreSQL / Qdrant
Event Schema
{"type":"role","role":"assistant"}
{"type":"token","text":"...","task":"A"}
{"type":"tool_call","name":"search","arguments":"{...}"}
{"type":"usage","usage":{...}}
{"type":"finished","reason":"stop","winner":"llama3.1-8b"}
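Step 5 of the runtime flow, converting events into OpenAI chunks, can be sketched as a pure function. Because this part was never implemented in Rust, everything here is hypothetical: the `ChatEvent` enum covers only three of the five event types, the `winner` and `task` fields are dropped, and the chunk JSON omits the `id`, `created`, and `model` fields that a full OpenAI chunk carries.

```rust
// Subset of the events published on evt.chat.{trace_id}
// (tool_call and usage are omitted for brevity).
#[derive(Debug, Clone, PartialEq)]
pub enum ChatEvent {
    Role { role: String },
    Token { text: String },
    Finished { reason: String },
}

// NATS subject for one chat session, following evt.chat.{trace_id}.
pub fn subject_for(trace_id: &str) -> String {
    format!("evt.chat.{trace_id}")
}

// Convert one event into the SSE frames the client should receive.
pub fn to_sse_frames(event: &ChatEvent) -> Vec<String> {
    // Abbreviated chunk shape; a real chunk also has id, created, and model.
    let chunk = |delta: String, finish: String| {
        format!(
            "data: {{\"object\":\"chat.completion.chunk\",\"choices\":[{{\"index\":0,\"delta\":{delta},\"finish_reason\":{finish}}}]}}\n\n"
        )
    };
    match event {
        ChatEvent::Role { role } => {
            vec![chunk(format!("{{\"role\":\"{}\"}}", escape(role)), "null".to_string())]
        }
        ChatEvent::Token { text } => {
            vec![chunk(format!("{{\"content\":\"{}\"}}", escape(text)), "null".to_string())]
        }
        // The final chunk carries finish_reason, then the stream terminator.
        ChatEvent::Finished { reason } => vec![
            chunk("{}".to_string(), format!("\"{}\"", escape(reason))),
            "data: [DONE]\n\n".to_string(),
        ],
    }
}

// Minimal JSON string escaping (quotes, backslashes, control characters).
fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}
```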
Idempotency Design
- trace_id: session identifier
- req_id = sha256(model + messages + params): request identifier
- All side effects go through UPSERTs or unique constraints
- finished is emitted exactly once, after artifacts are committed
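The idempotency rules above can be illustrated with an in-memory stand-in. This sketch makes two substitutions that should be read as assumptions: a `HashMap` plays the role of a table with a unique constraint on req_id, and the SHA-256 step is left out because the Rust standard library has no SHA-256 (a crate would hash the key material produced here). The function and type names are hypothetical.

```rust
use std::collections::HashMap;

// Build unambiguous key material for req_id. Each part is length-prefixed so
// ("ab", "c") and ("a", "bc") cannot produce the same input to the hash.
pub fn req_key_material(model: &str, messages: &[&str], params: &str) -> String {
    let mut out = String::new();
    let parts = std::iter::once(model)
        .chain(messages.iter().copied())
        .chain(std::iter::once(params));
    for part in parts {
        out.push_str(&format!("{}:{};", part.len(), part));
    }
    out
}

// In-memory stand-in for a table with a unique constraint on req_id.
#[derive(Default)]
pub struct IdempotencyLog {
    committed: HashMap<String, String>,
}

impl IdempotencyLog {
    // Behaves like INSERT ... ON CONFLICT DO NOTHING: returns true only for
    // the first commit of a req_id, the one moment `finished` may be emitted.
    pub fn commit(&mut self, req_id: &str, artifact: &str) -> bool {
        if self.committed.contains_key(req_id) {
            return false;
        }
        self.committed.insert(req_id.to_string(), artifact.to_string());
        true
    }
}
```

A retry of the same request hits the existing row, `commit` returns false, and the caller skips the duplicate `finished` event.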
Staged Migration to JetStream
The original plan was to start with NATS Core for simplicity and migrate to JetStream only when durable intake and replay became necessary. After the Go migration, JetStream was adopted from the start.
The Decision to Migrate to Go
After the Rust prototype was running, the following assessment led to the migration decision.
Async Context Management Cost
SSE relay, NATS subscription, and PostgreSQL/Qdrant writes run concurrently within the same request scope. In Rust, wiring these together requires Pin<Box<dyn Stream>>, tokio::select!, and lifetime management that adds significant code volume unrelated to the design intent.
Go’s goroutine plus channel model maps directly onto this pattern. Each async task launches as a goroutine, results merge through channels, and the control flow reads linearly.
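The fan-in shape described here (independent tasks, one merge point, a linear receive loop) can be shown without an async runtime. The sketch below uses OS threads and `std::sync::mpsc` purely as a stand-in. It is neither the tokio code from the prototype nor the Go implementation, and the `Update` type and `run_request` function are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

#[derive(Debug, PartialEq)]
pub enum Update {
    Token(String),
    Persisted(&'static str),
}

// Run two producers concurrently and merge their results through one channel.
pub fn run_request(tokens: Vec<String>) -> Vec<Update> {
    let (tx, rx) = mpsc::channel();

    // Stand-in for the event subscription: forwards tokens as they arrive.
    let tx_tokens = tx.clone();
    let relay = thread::spawn(move || {
        for t in tokens {
            let _ = tx_tokens.send(Update::Token(t));
        }
    });

    // Stand-in for the PostgreSQL/Qdrant write that runs alongside the relay.
    let tx_store = tx.clone();
    let store = thread::spawn(move || {
        let _ = tx_store.send(Update::Persisted("completions_cache"));
    });

    // Drop the original sender so the receive loop ends when both workers finish.
    drop(tx);

    // The merge point reads linearly, whatever order the workers run in.
    let merged: Vec<Update> = rx.iter().collect();
    relay.join().unwrap();
    store.join().unwrap();
    merged
}
```

The async version of this in Rust replaces threads with tasks and the blocking receive loop with stream polling, which is where the pinning and select machinery mentioned above comes in.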
Runtime Overhead
The bottleneck in this use case is LLM inference I/O wait, not proxy-layer CPU work. Rust's zero-cost abstractions do not provide a meaningful advantage here.
Design Asset Reuse
The type contracts and design specs from Rust translated directly to Go.
| Rust | Go |
|---|---|
| trait ChatService | interface ChatService |
| ChatCompletionRequest struct | ChatCompletionRequest struct |
| ChatRoute enum | ChatRoute const |
| ApiError enum | ApiError type + HTTP status mapping |
| Layered architecture | internal/transport, internal/domain, internal/infra |
The Rust type definitions converted almost one-to-one into Go struct definitions. Routing logic, error format, and extension header specs carried over unchanged.
What Changed After Migration
| Area | Rust Prototype | Go Production |
|---|---|---|
| NATS | Stubbed (async-nats not wired) | JetStream publish (fire-and-forget) |
| Dagster | Design spec only | daemon sensor + asset materialization |
| RAG | StubRagService | Knowledge Service (embed -> pgvector ANN -> rerank) |
| Auth | Not connected | RequestContext middleware with tracking IDs |
| Telemetry | tracing logs only | Vector (Rust) -> Prometheus + Loki + Grafana |
| Host topology | Single-host Docker Compose | 3-host (storage / desktop / compute) |
| Reranker | Design concept only | multi-bert-inference (Rust + ONNX Runtime) gRPC integration |
Caveats
- The Rust prototype code remains in the openai-api-proxy repository but is not maintained after the Go migration
- NATS, Dagster, and RAG design specs evolved during the Go implementation (NATS Core to JetStream, oneshot to sensor-driven, Qdrant to pgvector + ColBERT rerank)
- The Ollama-compatible endpoint was replaced by Anthropic Messages API in the Go version
Verification
- OpenAI-compatible endpoint confirmed working for both streaming and non-streaming responses
- Ollama-compatible endpoint DTO conversion verified
- Multi-backend routing by model name prefix confirmed
- Docker Compose startup and connectivity for proxy + PostgreSQL + Qdrant verified
Next Steps
The Rust prototype served its purpose: locking down the design specification and type contract. That objective was achieved.
The Go production implementation now runs NATS JetStream event relay, Dagster sensor-based pull subscribe with asset materialization, pgvector ANN plus ColBERT rerank RAG orchestration, and a 3-host local AI platform.
The full design record is in Designing an AI Orchestration Platform with Go, NATS, and Dagster.
