voracle Dev Log vol.2 -- Deploying the Research Pipeline and Overhauling the ONNX Inference Engine
A week of stabilizing voracle’s research pipeline: fixing a multibyte character panic in Rust, migrating the ONNX inference engine to Qwen3, and redesigning the vault structure.
About This Article
The previous article covered voracle’s design and initial implementation – ONNX embedding + ColBERT reranking, the MCP server, and the distil command. This is the continuation: what happened when I actually started using the research pipeline in daily work, and the week it took to stabilize it.
The Design Philosophy: An External Memory That Gets Smarter on Its Own
The original motivation for building voracle was fuzzy semantic search – being able to find things with vague “that thing, you know” queries. But once I built the research pipeline, a bigger picture started to emerge.
When you look something up during development, the usual flow is: search in a browser, read, close the tab. That knowledge gets buried in browser history and never surfaces again. voracle’s research command connects that process to the vault. Web content retrieved via the Brave Search API is automatically converted to Markdown, summarized, and saved to the vault.
What this means in practice is that your interests accumulate without conscious effort. You don’t need to proactively run CLI commands to explore – knowledge from your daily development research flows into your Obsidian vault, your external memory, without you even thinking about it. Then during the development cycle, AI agents search the vault and come back with “you looked into this before, remember?” Knowledge starts circulating.
To support this design philosophy, I spent late March through early April stabilizing the research pipeline.
Structuring the Vault – Limiting the Index to Finalized Knowledge
First, I reorganized the vault directory structure. I wanted voracle to only index finalized knowledge. Having in-progress drafts and system files mixed in would create search noise.
obsidian/
├── articles/ # My articles
│ ├── _drafts/ # Rough drafts
│ └── _processed/ # Archived after publication
├── web-research/ # research command output
│ └── {category}/note_id.md
├── tasks/ # Task management
├── thoughts/ # Interest notes
└── works/ # Work logs
└── development/
├── _claude/
├── _codex/
└── _gemini/
The exclusion rule is simple. Anything with a _ or . prefix is excluded from indexing. _drafts/, _processed/, .search_result.jsonl – all intermediate artifacts get caught by this rule.
# Check vault status
voracle -status
vaults: 1
obsidian: 847 notes, 2,341 chunks
excluded: _drafts, _processed, _claude, _codex, _gemini
index: .voracle.usearch (f16 HNSW, 2341 vectors)
store: .voracle.db (SQLite)
With this design, the search scope is limited to finished articles, web knowledge collected via research, and my own notes. Nothing half-baked makes it in.
Deploying Research in Production – Researching TurboQuant
On the same day I finished structuring the vault, I deployed the research command on an actual research task: investigating Google’s TurboQuant and local LLM quantization techniques.
voracle research "TurboQuant|local LLM"
The pipeline works as follows:
- Extract keywords from the query
- Search via Brave Search API in both Japanese and English (Web + News)
- Deduplicate URLs into a list
- Fetch each URL and convert to Markdown with
html-to-markdown-rs - Score by content quality and novelty
- Save the top N results as Markdown files in
web-research/ - Auto-generate summaries and frontmatter (topic, category, difficulty)
The result was 61,813 characters, roughly 26,540 tokens. It exceeded Claude Code’s tool result size limit so it had to be read from a file, but the research itself worked properly.
The dual-language search is what makes a difference here. It searches for “TurboQuant quantization” in English while simultaneously querying for the equivalent in Japanese. Counter-language keyword selection uses maxsim matching against a topic configuration file, so information missed in one language gets picked up by the other.
Note: In practice, Japanese search results rarely turned up useful content, while English-based searches consistently found better material that hadn’t been translated. I adjusted the scoring to pull most articles from English sources.
A Multibyte Character Panic in Rust – Things Break in Production
Same day, another hit. I ran research with a broader query.
voracle research "TurboQuant|ColBERT LLM|Modern BERT"
It panicked.
thread 'main' panicked at src/infra/inference/lfm.rs:237:49:
byte index 2000 is not a char boundary; it is inside '━' (bytes 1998..2001)
The cause was immediately obvious. The text truncation before summary generation (written by Claude) was slicing at a byte position:
// BAD: byte position 2000 might land in the middle of a character
let truncated = if text.len() > 2000 { &text[..2000] } else { text };
text.len() in Rust returns the byte count. The box-drawing character ━ (U+2501) occupies 3 bytes in UTF-8 (1998..2001), and byte position 2000 landed right in the middle of it. Web content fetched via Brave Search routinely contains box-drawing characters, symbols, and Japanese text, so this was a bug waiting to happen.
The fix uses floor_char_boundary:
let truncated = if text.len() > 2000 {
let boundary = text.floor_char_boundary(2000);
&text[..boundary]
} else {
text
};
floor_char_boundary returns the nearest character boundary at or below the specified byte position. If position 2000 is mid-character, it backs up to 1998. This API was stabilized in Rust 1.73, essentially made for this exact use case.
Added a test as well:
#[test]
fn truncate_multibyte_boundary() {
// '━' (U+2501) is 3 bytes in UTF-8 (0xE2 0x94 0x81)
let text = "a".repeat(1998) + "━━━";
assert_eq!(text.len(), 2007); // 1998 + 9
let boundary = text.floor_char_boundary(2000);
assert_eq!(boundary, 1998);
let truncated = &text[..boundary];
assert!(truncated.is_char_boundary(truncated.len()));
}
This kind of bug never shows up with English-only test data. You only find it when you actually feed in Japanese content. That’s the value of real-world deployment.
Migrating the ONNX Inference Engine to Qwen3
Another significant change. voracle’s summarize / keyword extraction / frontmatter generation had been running on LiquidAI’s LFM2.5-350M-ONNX, but I switched to onnx-community/Qwen3.5-0.8B-ONNX due to licensing constraints.
Selection criteria:
- Apache 2.0 license, no issues with commercial use
- 0.8B parameters keeps inference load practical
- Available in ONNX format (INT8/FP32) for direct loading
I kept the infra/inference/ module structure and swapped out the LFM2.5 components for Qwen3.5-0.8B:
src/infra/inference/
embed.rs # Dense vector embeddings (pplx-embed-v1-0.6b -- unchanged)
colbert.rs # ColBERT MaxSim reranking (unchanged)
lfm.rs # LFM2.5 -> replaced with Qwen3.5-0.8B
ONNX Runtime is pinned at ort = "=2.0.0-rc.12". This is a shared policy across multi-bert-inference and edge-retrieval – the ort crate can introduce breaking changes between RC versions, so I pin to a version that’s been verified to work.
# Cargo.toml
[dependencies]
ort = "=2.0.0-rc.12"
tokenizers = "0.21"
The tokenizer is loaded from tokenizer.json via the tokenizers crate. Qwen3 models don’t use SentencePiece but their own tokenizer format, which the HuggingFace tokenizers crate handles natively.
# Model layout
ls ~/.local/share/voracle/models/
qwen3.5-0.8b-onnx/
model.onnx
model_quantized.onnx
tokenizer.json
config.json
pplx-embed-v1-0.6b/
model.onnx
tokenizer.json
mxbai-edge-colbert/
model.onnx
tokenizer.json
Summary quality from the research pipeline after the switch is at least on par with LFM2.5, honestly not bad at all. Japanese content summaries in particular felt improved.
Stabilizing the Research Pipeline
I reviewed the entire import logic of the research pipeline and added missing tests.
Full Pipeline Flow
To lay it out again, the research command runs in this order:
Query input
-> Keyword extraction (Qwen3.5-0.8B)
-> Brave Search (primary language + counter language)
-> URL dedup + domain strike exclusion, fetch HTML
-> tree-sitter-html content extraction + boilerplate removal
-> html-to-markdown-rs Markdown conversion
-> Scoring -> select top N
-> vault writer saves to web-research/
-> Summary + frontmatter generation (Qwen3.5-0.8B)
Domain strikes automatically exclude domains that repeatedly return paywall or empty content. After 3 failures, a domain gets excluded from Brave Search queries via -site:. It’s managed at the domain level – no point looking at individual pages when the whole domain is the problem.
# Check strike status
voracle admin strikes
# Restore a falsely flagged domain
voracle admin allow example.com
Adding Tests
I filled the testing gaps found during review:
writer.rs: collision handling when writing to vault with existing filesresolver.rs: edge cases for_/.prefix exclusion rulesextraction/: handling of empty content and error responses
#[test]
fn writer_avoids_overwriting_existing_note() {
let dir = tempdir().unwrap();
let writer = VaultWriter::new(dir.path());
// Write the same URL twice
writer.write_research_note(&entry_a).unwrap();
writer.write_research_note(&entry_a_updated).unwrap();
// Second write creates a separate file (no overwrite)
let files: Vec<_> = fs::read_dir(dir.path()).unwrap().collect();
assert_eq!(files.len(), 2);
}
#[test]
fn resolver_excludes_underscore_prefix() {
let dir = tempdir().unwrap();
fs::create_dir_all(dir.path().join("_drafts")).unwrap();
fs::write(dir.path().join("_drafts/note.md"), "# test").unwrap();
fs::write(dir.path().join("visible.md"), "# test").unwrap();
let resolver = VaultResolver::new(dir.path());
let notes = resolver.scan();
assert_eq!(notes.len(), 1);
assert_eq!(notes[0].file_name(), "visible.md");
}
Integrating MCP Presets into agent-gateway
With the vault and research pipeline stable, I added voracle to agent-gateway’s MCP presets.
agent-gateway is a gateway for LLM agents to use MCP tools, managing server configurations for each tool through preset definitions:
// internal/agent/mcp/presets.go
func PresetVoracle(command string) ServerConfig {
if command == "" {
command = "voracle"
}
return ServerConfig{
Name: "voracle",
Command: command,
Args: []string{"--server"},
}
}
func PresetOkitegami(command string) ServerConfig {
if command == "" {
command = "okitegami"
}
return ServerConfig{
Name: "okitegami",
Command: command,
Args: []string{"--server"},
}
}
This lets agents on agent-gateway call voracle directly. Agents search the vault, run research as needed, and accumulate results back into the vault. The last piece connecting the knowledge circulation loop – where humans don’t even need to be conscious of it – fell into place.
voracle Today
The actual coding here was about half a day’s work. But the design refinements came from ideas that popped up during daily walks, gradually reflected one at a time. It feels less like something I sat down and wrote, more like something I grew out of daily life.
Here’s the current command set:

The commands I use most in daily work are grep and research. grep is semantic search – it pulls up chunks that are semantically close, not exact matches:
[email protected] ~ % voracle grep "turbo quant"
/Users/ksh3/Development/obsidian/web-research/2026/04/09/technology/2029b5389eccfe38.md:16 [How to test TurboQuant] (4.2379)
Verify correctness and measure real-world impact before deploying TurboQuant to production.
/Users/ksh3/Development/obsidian/web-research/2026/04/09/technology/2029b5389eccfe38.md:5 [Triton + vLLM (serving workloads)] (4.2284)
Advanced
Engineers building inference serving pipelines who want to test TurboQuant in a vLLM-like environment
/Users/ksh3/Development/obsidian/web-research/2026/04/09/technology/2029b5389eccfe38.md:2 [How to use TurboQuant] (4.2177)
A practical guide for developers who want to try TurboQuant KV cache compression. Covers available implementations, s...
Knowledge gathered via research and my own articles are all searchable from the same index:
[email protected] ~ % voracle grep "GLM-5.1"
/Users/ksh3/Development/obsidian/articles/llm/glm-4-7-flash-cpu-hybrid-gpu-benchmark_en.md:19 [NVFP4 + vLLM Operational Evaluation] (6.9704)
In addition to the IQ5_K benchmarks above, GLM-4.7-Flash-NVFP4 was also evaluated on vLLM for operational suitability.
/Users/ksh3/Development/obsidian/articles/llm/glm-4-7-flash-cpu-hybrid-gpu-benchmark_ja.md:33 [共通変数] (6.8849)
IMG=compute.home.arpa/ik_llama-cpu:latest
MO=/mnt/data/hf/hub/models--ubergarm--GLM-4.7-Flash-GGUF
/Users/ksh3/Development/obsidian/web-research/2026/04/08/technology/5ee34b17c638fe00.md:40 [GLM-5 - GlmMoeDsa] (6.7509)
The zAI team launches GLM-5, and introduces it as such:
> GLM-5, targeting complex systems engineering and long-horizon agentic tasks.
The domain strikes have been growing too. Domains that can’t deliver content behind paywalls get excluded automatically:
[email protected] ~ % voracle admin blocklist
Exclude Domains (strike >= 3 = excluded):
[X] github.com 40 (paywall)
[X] www.ponkotsu.dev 11 (paywall)
[X] medium.com 7 (paywall)
[X] dev.to 4 (paywall)
[X] machinelearningmastery.com 3 (paywall)
...
The research command also works as an MCP tool, so Claude Code and agent-gateway agents call it during development. Things I research and things AI researches all end up in the same vault. Being able to reference external memory without consuming context is a big deal – lately, whenever I want to look something up, my first instinct is to run voracle grep.
Most of the refinements I think of during walks are small things – “I want to reduce this kind of noise,” “I want this search pattern to rank higher” – and I implement them in about 30 minutes after getting home. Once that cycle starts turning, the tool itself begins to fit your thinking patterns.
