Introduction

I already had a rough direction for a local LLM stack with an OpenAI-compatible entry point, but the embedding and rerank layer was still too fuzzy. In this note, I narrowed the scope on purpose and fixed the design for just those two services: embed and rerank, implemented in Rust with Axum, ort, and tokenizers.

The point was not to add yet another embedding endpoint. I wanted a design that lets me precompute document-side vectors, reuse them over a LAN through a key -> bin cache, and keep retrieval and reranking connected without shipping large payloads around. That is why I biased the design toward ONNX/CPU first. It gives me a clean boundary that still fits the wider proxy architecture I had been sketching in related notes.

Background and Motivation

In the broader local pipeline, I was already thinking in terms of CPU and GPU specialization. The CPU side handles prompt shaping, embeddings, and task normalization. The GPU side focuses on final generation. Once I looked at the system that way, embedding and reranking stopped looking like incidental features and started looking like shared infrastructure.

Reranking especially forces a more careful design than a plain embedding API. A single-vector interface is straightforward. A ColBERT-style reranker is not, because it works with token-level vectors. If I leave the granularity vague, I will end up with unstable boundaries between cache design, gRPC payloads, retrieval output, and model execution. That is why I made one hard rule up front: transport should mostly carry keys and small metadata, while the document token vectors live in cache.

Design Approach

The first decision was to split the specification into two explicit lanes: embed (one 256d vector per text) and rerank (ColBERT-style 64d token vectors scored with MaxSim). Both are designed around ONNX/CPU execution and a shared Rust implementation stack. The wider proxy and routing layer still matters, but this note deliberately focuses on the reusable retrieval core underneath it.

The other key decision was to precompute document-side vectors while computing query-side vectors on demand. That division gives me better throughput characteristics and a cleaner cache story. Documents change relatively rarely, so I can version and reuse them. Queries are ephemeral, so they are better handled inline per request.

Specification Details

Goals

The goals are simple and intentionally separate. The embed service returns a single vector for nearest-neighbor retrieval. The rerank service refines an already shortlisted candidate set using ColBERT-style late interaction. Keeping those responsibilities apart should make both tuning and debugging much easier later.

Embed API

POST /embed accepts texts: string[] and returns 256-dimensional vectors. The processing sequence is fixed as tokenization through tokenizers, ONNX inference through ort, pooling according to model conventions, truncation to 256 dimensions, and L2 normalization.

I made truncation and normalization explicit because I want predictable retrieval behavior. Even if the model naturally emits something like a 1024-dimensional vector, a stable 256-dimensional contract is easier to index and reason about. That said, this is one of the areas where I do not want to improvise. If pooling and normalization drift from the model card guidance, retrieval quality will drift too. So the design assumes I will validate that behavior against a Python reference first.
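
The post-processing half of that sequence can be sketched without touching ort or tokenizers at all. This is a minimal sketch, not the service code: mean pooling over the attention mask is an assumption on my part (the model card decides the real pooling rule), and the function names are hypothetical.

```rust
/// Mean-pool token embeddings over non-padding positions.
/// `hidden` is [n_tokens][hidden_dim]; `mask` holds 1 for real tokens, 0 for padding.
/// ASSUMPTION: mean pooling. The model card may call for CLS pooling instead.
fn mean_pool(hidden: &[Vec<f32>], mask: &[u8]) -> Vec<f32> {
    let dim = hidden.first().map_or(0, |row| row.len());
    let mut pooled = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (row, &m) in hidden.iter().zip(mask) {
        if m != 0 {
            for (acc, &v) in pooled.iter_mut().zip(row) {
                *acc += v;
            }
            count += 1.0;
        }
    }
    if count > 0.0 {
        for acc in pooled.iter_mut() {
            *acc /= count;
        }
    }
    pooled
}

/// Keep the first `dim` components, then L2-normalize what is left.
/// Normalizing after truncation keeps the returned vector at unit length.
fn truncate_and_normalize(mut v: Vec<f32>, dim: usize) -> Vec<f32> {
    v.truncate(dim);
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
    v
}
```

In the service, the output of the ONNX session would feed mean_pool, and truncate_and_normalize(pooled, 256) would produce the vector returned by /embed. Comparing these two functions against the Python reference on a handful of texts is the cheapest way to catch pooling drift early.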

Rerank API

POST /rerank accepts a query: string plus candidates: Candidate[]. A candidate can be represented either as a resolved doc_key or as enough metadata to derive the key, such as doc_id + seq + chunk_id. The response returns aligned scores: float32[] and a sorted order: int[].

The important part is how the vectors are sourced. Query token vectors (Tq x 64) are generated on the fly. Document token vectors (Td x 64) are fetched from cache. MaxSim then scores each candidate by taking the best matching document token for each query token and summing those maxima. That lets me preserve the ColBERT interaction model without bloating the wire protocol.
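
The scoring rule itself is small enough to write down. The sketch below assumes token vectors are already decoded into rows of f32 (token-major) and that inner product is the similarity; the function names are hypothetical and an empty document simply scores 0.

```rust
/// Inner product of two equal-length token vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// MaxSim: for each query token take the best-matching document token,
/// then sum those maxima. `query` is Tq x dim, `doc` is Td x dim.
fn maxsim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    if doc.is_empty() {
        return 0.0; // assumption: an empty document contributes no score
    }
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| dot(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}

/// Candidate indices sorted by descending score (the `order` array in the response).
fn sorted_order(scores: &[f32]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..scores.len()).collect();
    order.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    order
}
```

The handler would call maxsim once per candidate to fill scores, then sorted_order(&scores) to fill order, so both arrays stay aligned with the request's candidate list.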

Models and File Layout

For embeddings, I anchored the design on lightonai/modernbert-embed-large/onnx/*. The production deployment uses the INT8 variant (onnx/model_int8.onnx, roughly 17MB). In this deployment the coarser INT8 quantization did not hurt retrieval precision and, if anything, appeared to help it, and latency settled at around 10ms. The small INT8 model also fits the goal of running many inference calls in parallel.

The tokenizer comes from the same repo, ideally tokenizer.json. I called that out because tokenizer mismatches are a quiet failure mode. If special tokens or normalization behavior differ from the reference path, the embedding output may look valid while retrieval quality degrades.

For reranking, I assumed mixedbread-ai/mxbai-edge-colbert-v0-32m. But I also wrote down the practical risk: if that repo does not include ONNX assets, I cannot drive it directly from Rust through ort. In that case the fastest path is to use lightonai/mxbai-edge-colbert-v0-32m-onnx instead of spending time on my own export pipeline.

Cache Design (LAN-based, key/bin)

The cache key is built from a versioned document identity. I used doc_id + content_seq from PostgreSQL as the base, and then extended it with the variables that actually affect the vector payload:

  doc_id|content_seq|chunk_id|model_id|tokenizer_hash|max_len|chunk_ver|dtype|layout
  

That material is hashed with blake3, optionally shortened, and prefixed as a document key. I wanted content_seq because it only advances when the source document changes. I wanted chunk_ver because chunking strategy changes are effectively a full cache invalidation event. Putting both into the key keeps invalidation mechanical instead of heuristic.
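
As a rough illustration of the key material, here is a std-only sketch. The struct name, field types, and sample values are all assumptions; only the field order comes from the layout above. The blake3 step is left as a comment so the example stays dependency-free.

```rust
/// Everything that changes the cached vector payload, in key order.
/// Field types are hypothetical; the spec only fixes the names and order.
struct DocKeyParts<'a> {
    doc_id: &'a str,
    content_seq: u64,
    chunk_id: u32,
    model_id: &'a str,
    tokenizer_hash: &'a str,
    max_len: u32,
    chunk_ver: u32,
    dtype: &'a str,
    layout: &'a str,
}

impl DocKeyParts<'_> {
    /// Join the fields with `|` in the fixed order from the spec.
    /// Assumes no field value contains `|` itself.
    fn material(&self) -> String {
        format!(
            "{}|{}|{}|{}|{}|{}|{}|{}|{}",
            self.doc_id,
            self.content_seq,
            self.chunk_id,
            self.model_id,
            self.tokenizer_hash,
            self.max_len,
            self.chunk_ver,
            self.dtype,
            self.layout
        )
    }
}

// In the real service the material would then be hashed, for example with
// blake3::hash(material.as_bytes()) from the blake3 crate, optionally
// shortened, and given a document-key prefix.
```

Because chunk_ver and content_seq sit inside the material, bumping either one yields a different key, and stale entries are never read again rather than being hunted down.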

I set the implementation order to start with fp32 (per-row float32 vectors), then move to bitpack (sign encoding + u64 array) later. It is tempting to jump directly to the most compact representation, but that tends to slow down validation. Starting with a straightforward format is a better way to get the service boundary correct first.

The binary header is intentionally fixed: dtype, dim (=64), n_tokens, scales_present, layout (token-major), and version. I called out token-major layout because MaxSim performance and decoding logic both depend on agreeing about how tokens are laid out in memory.
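
One possible encoding of that header, as a sketch only: the note fixes which fields exist, not their widths, so the 10-byte little-endian layout, the numeric codes for dtype and layout, and the names VecHeader and HEADER_LEN are all assumptions.

```rust
/// Fixed-size header placed before the token vectors in a cache blob.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct VecHeader {
    version: u8,
    dtype: u8,            // assumed codes: 0 = fp32, 1 = bitpack
    layout: u8,           // assumed code: 0 = token-major
    scales_present: bool, // whether per-token scales follow the header
    dim: u16,             // 64 for the ColBERT token vectors
    n_tokens: u32,
}

/// Assumed on-disk size: 4 single-byte fields + u16 + u32.
const HEADER_LEN: usize = 10;

impl VecHeader {
    /// Serialize with little-endian integers.
    fn encode(&self) -> [u8; HEADER_LEN] {
        let mut out = [0u8; HEADER_LEN];
        out[0] = self.version;
        out[1] = self.dtype;
        out[2] = self.layout;
        out[3] = self.scales_present as u8;
        out[4..6].copy_from_slice(&self.dim.to_le_bytes());
        out[6..10].copy_from_slice(&self.n_tokens.to_le_bytes());
        out
    }

    /// Parse a header from the front of a blob; None if the blob is too short.
    fn decode(bytes: &[u8]) -> Option<VecHeader> {
        if bytes.len() < HEADER_LEN {
            return None;
        }
        Some(VecHeader {
            version: bytes[0],
            dtype: bytes[1],
            layout: bytes[2],
            scales_present: bytes[3] != 0,
            dim: u16::from_le_bytes([bytes[4], bytes[5]]),
            n_tokens: u32::from_le_bytes([bytes[6], bytes[7], bytes[8], bytes[9]]),
        })
    }

    /// Expected size in bytes of an fp32 token-major payload after the header.
    fn fp32_payload_len(&self) -> usize {
        self.n_tokens as usize * self.dim as usize * 4
    }
}
```

A decoder can compare fp32_payload_len() with the remaining blob length before reading any rows, which turns a layout disagreement into an immediate error instead of silently wrong scores.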

gRPC Design (Binary Vec Direct Transmission)

The original plan was to transport only doc_keys and pull vectors from cache. During implementation, the design pivoted to eliminate the DB dependency entirely and send pre-embedded 256d vectors directly over the wire instead.

The implementation defines a MaxSimSearch gRPC RPC that receives its candidates as a map<string, VecF32> of key to 256d vector. At 256 floats x 4 bytes = 1024 bytes per vector, each candidate costs roughly 1KB. Server-side filtering by threshold, top_k, and top_p reduces the result set, and the response returns parallel arrays of keys and scores.

  message MaxSimRequest {
    map<string, VecF32> candidates = 1;  // key -> 256d vec (100-200 typical)
    VecF32 query_vec = 2;                // query vector (256d)
    float threshold = 3;                 // min score to include
    int32 top_k = 4;                     // max results
    float top_p = 5;                     // cumulative-score cutoff
  }
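
The server-side filtering can be sketched as follows. This is one possible reading of the three fields, not the actual handler: I assume threshold is applied first, then top_k, and that top_p stops once the kept results account for that share of the remaining total score (which only makes sense for non-negative scores). The function name is hypothetical.

```rust
/// Filter scored candidates and return parallel arrays of keys and scores.
/// ASSUMED order of operations: threshold -> sort descending -> top_k -> top_p.
fn filter_scored(
    mut scored: Vec<(String, f32)>,
    threshold: f32,
    top_k: usize,
    top_p: f32,
) -> (Vec<String>, Vec<f32>) {
    // Drop everything below the minimum score.
    scored.retain(|(_, s)| *s >= threshold);
    // Best candidates first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    // Hard cap on the number of results.
    scored.truncate(top_k);

    // Cumulative cutoff: stop once the kept share of the total reaches top_p.
    let total: f32 = scored.iter().map(|(_, s)| *s).sum();
    let mut keys = Vec::new();
    let mut scores = Vec::new();
    let mut cumulative = 0.0f32;
    for (key, score) in scored {
        keys.push(key);
        scores.push(score);
        cumulative += score;
        if total > 0.0 && cumulative / total >= top_p {
            break;
        }
    }
    (keys, scores)
}
```

With a single 256d vector per candidate, the score fed into this function is just the inner product of query_vec with each candidate vector, so the whole RPC is a dot-product pass followed by this filter.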
  

Normally, shipping vectors over RPC is something to avoid. But the following conditions made it a rational choice:

  • Communication is confined to the local LAN
  • Each candidate is roughly 1KB; 100-200 candidates stay within 100-200KB
  • Removing DB dependency makes the embed/rerank service fully stateless
  • Retrieval quality was verified to be sufficient at this precision level

As a result, the cache layer and PostgreSQL content_seq management became unnecessary, and the stateless service is much easier to run in parallel.

The HTTP-side /v1/rerank offers a separate text-based interface, accepting query text + candidate keys and running ONNX inference internally. gRPC is the fast path for upstream services (agent-gateway) that already hold pre-embedded vectors and only need MaxSim scoring.

Rust Implementation Requirements

On the Rust side, Axum is the outward-facing REST layer. Internal gRPC is optional, not foundational. tokenizers should load tokenizer.json and match the reference handling of special tokens and normalization. ort should start with intra_threads = num_cpus, optimization_level = Level3, and the CPU execution provider.

For MaxSim, I kept inner product as the default similarity metric and explicitly left room for a trait-based swap to bitpack + Hamming approximation later. That separation matters because I want to test faster approximations without rewriting request handling or cache decoding.
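
A sketch of what that swap point could look like. The trait name TokenSim, the two implementations, and the choice of one u64 per 64d token are assumptions for illustration; the Hamming variant maps sign agreement to a value in [-1, 1] so it can stand in for a cosine-like score.

```rust
/// Pluggable token-to-token similarity, so MaxSim does not care how tokens are encoded.
trait TokenSim {
    type Token;
    fn sim(&self, q: &Self::Token, d: &Self::Token) -> f32;
}

/// Default metric: inner product over fp32 token vectors.
struct InnerProduct;

impl TokenSim for InnerProduct {
    type Token = Vec<f32>;
    fn sim(&self, q: &Vec<f32>, d: &Vec<f32>) -> f32 {
        q.iter().zip(d).map(|(x, y)| x * y).sum()
    }
}

/// Sign-bit approximation: each 64d token is packed into one u64.
struct SignHamming;

impl TokenSim for SignHamming {
    type Token = u64;
    fn sim(&self, q: &u64, d: &u64) -> f32 {
        // XOR marks the dimensions whose signs disagree.
        let mismatches = (q ^ d).count_ones() as f32;
        // 64 agreements -> 1.0, 64 disagreements -> -1.0.
        1.0 - 2.0 * mismatches / 64.0
    }
}

/// Pack the sign of each component into a u64 (bit i is set when v[i] >= 0).
fn pack_signs(v: &[f32; 64]) -> u64 {
    v.iter()
        .enumerate()
        .fold(0u64, |bits, (i, &x)| if x >= 0.0 { bits | (1u64 << i) } else { bits })
}

/// MaxSim over any similarity: sum over query tokens of the best document match.
fn maxsim_with<S: TokenSim>(sim: &S, query: &[S::Token], doc: &[S::Token]) -> f32 {
    if doc.is_empty() {
        return 0.0;
    }
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| sim.sim(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}
```

Request handling only ever calls maxsim_with, so moving from InnerProduct to SignHamming changes the decoded token type and nothing else. Whether the sign approximation ranks well enough is something I would measure against the fp32 scores before relying on it.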

Implementation Order

I also fixed the implementation sequence. First, get /embed working with pool -> truncate256 -> normalize. Then confirm whether the mixedbread rerank model has ONNX available, and if not, pull the ONNX variant and finish /rerank. After that, add the PostgreSQL content_seq rule, build the document precompute plus cache population path, and only then introduce gRPC in its minimal key-carrying form.

That order is deliberate. If I add transport too early, I end up changing API shape, cache format, and RPC contracts all at once. Locking down the local boundary first is safer.

Caveats and Unknowns

The biggest uncertainty is the ONNX availability for the rerank model. Without it, the Rust-side ort path stalls immediately. The second risk is embedding fidelity: pooling and normalization have to match the intended model behavior closely enough that retrieval quality remains stable. The third is cache invalidation. Changing max_len or stride invalidates the document vector cache, so chunk_ver cannot be optional metadata; it has to be part of the key.

Results

What this note accomplished was not code, but constraint-setting. I now have a much cleaner separation between retrieval embeddings and reranking, and I have a concrete path for fitting both inside the larger OpenAI-compatible proxy stack without guessing at payload boundaries later.

The original plan was to move only keys across service boundaries, so that retrieval, reranking, cache locality, and PostgreSQL versioning all hung off a single key scheme. During implementation, however, I pivoted to sending pre-embedded 256d vectors directly over gRPC and dropped the DB dependency entirely. At roughly 1KB per candidate, 100-200 candidates stay within 100-200KB, which is entirely reasonable over a local LAN. The cache layer and PostgreSQL content_seq management became unnecessary, and the stateless service is much easier to run in parallel.

Next Steps

The embed side has settled on an INT8 model (roughly 17MB) with latency around 10ms, which leaves enough headroom to run requests in parallel. The rerank side adopted lightonai/mxbai-edge-colbert-v0-32m-onnx as the ONNX distribution, allowing Rust to drive it directly through ort.

The remaining design question is whether /embed and /rerank should be exposed as public APIs or kept behind the OpenAI-compatible proxy as internal components. That choice will affect authentication, monitoring, and how tightly they couple to the rest of the workflow stack.

Still, the important part is already in place: I now know what crosses RPC boundaries and what Rust is responsible for owning. The decision to eliminate DB dependency and go fully stateless turned out to be the right call as implementation progressed.