Response Vocabulary Design Swings Small LLM Accuracy by 15 Points: Experiment Log from pathfinder
Simply changing the metadata fields included in the responses of the MCP server pathfinder caused a small LLM’s path resolution accuracy to swing between 52 and 67 out of 70. Experiments with maybe_exts, alike, and exist_similar yielded three response vocabulary design principles: conditional appearance, small cardinality, and direct actionability.
Conclusion
Changing a single metadata field in an MCP server’s response can swing a small LLM’s benchmark accuracy by as much as 15 points (from 67/70 down to 52/70).
When maybe_exts (a list of candidate extensions sharing the same stem) was removed from pathfinder’s path resolution response, the score degraded from 67/70 to 52–63/70. Restoring it recovered the score to 65/70. Adding alike: N (count of same-stem files) caused the LLM to enter an infinite loop, and returning exist_similar: bool on every response expanded the variance to 10 points.
Ultimately, the following design principles proved effective for MCP response vocabulary targeting small LLMs:
- Conditional appearance — Return a field only in ambiguous cases. Absence implies certainty
- Small cardinality — Lists of 2–3 elements. A number like 273 is dangerous
- Directly actionable — Concrete instructions like “extensions to try.” Not abstract similarity
- Minimal response length — Fields returned on every response accumulate in context and consume attention bandwidth
Background
pathfinder is a custom MCP server that combines ColBERT MaxSim-based semantic reranking with lexical scoring for file path resolution. The benchmark LLM was qwen3.5-35b-a3b (Q5_K_M quantization), and path resolution accuracy was measured repeatedly across a 70-test benchmark.
The starting point for this work was a score degradation to 52–63/70 after partially reverting changes from the earlier lexical optimization (lexiopt4) that had achieved a best score of 67/70.
Investigating Score Degradation After Revert
Three benchmark result files (lexiopt4-1, 4-2, 4-3) were compared. Against the pre-revert lexiopt4 (67/70) and lexiopt5 (61/70), the post-revert scores were 63, 52, and 57, an 11-point spread.
Key degradation points:
| Category | Pre-revert | Post-revert |
|---|---|---|
| G: Retry operations | 3/3 | 0/3 (total failure in Runs 2, 3) |
| L: Test variant/config | 4/4 | 0–2/4 |
| D: Wrong extensions | 7/7 | 6/7 (Test 31 consistently failing) |
The initial suspicion was the code revert itself. Four features should have been lost in the revert: segment_overlap, dir_ext_counts, the warmup improvements, and the tool_name addition to tool_retry_with_resolve.
Nothing Had Been Reverted
Grepping the actual code confirmed all four features were present in the current codebase. Nothing had been reverted.
So what caused the score degradation? A git diff between the commit at lexiopt4 benchmark time (9192d6b) and current HEAD (d6c9dc8) was examined.
Discovering maybe_exts
The diff revealed a single change — the removal of the maybe_exts field.
maybe_exts returns a list of extensions for candidates sharing the same stem (filename minus extension) as the best resolution result. For example, when resolving config.go, if config.rs is also a candidate, it returns maybe_exts: ["go", "rs"].
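The aggregation described above can be sketched roughly as follows. This is a minimal illustration rather than pathfinder's actual resolve() logic: the function signature, the use of plain path strings, and the choice to return an empty list when only one extension exists are all assumptions made for the example.

```rust
use std::collections::BTreeSet;
use std::path::Path;

/// Hypothetical helper (not pathfinder's real code): collect the extensions
/// of every candidate whose stem matches the stem of the best candidate.
/// Returns an empty list when there is nothing to disambiguate.
fn maybe_exts(best: &str, candidates: &[&str]) -> Vec<String> {
    let best_stem = match Path::new(best).file_stem() {
        Some(stem) => stem,
        None => return Vec::new(),
    };

    // BTreeSet deduplicates and yields the extensions in sorted order.
    let exts: BTreeSet<String> = candidates
        .iter()
        .map(|c| Path::new(*c))
        .filter(|p| p.file_stem() == Some(best_stem))
        .filter_map(|p| p.extension())
        .map(|e| e.to_string_lossy().into_owned())
        .collect();

    // A single extension means the stem is unambiguous: report nothing.
    if exts.len() < 2 {
        Vec::new()
    } else {
        exts.into_iter().collect()
    }
}
```

With candidates src/config.go, lib/config.rs, and src/main.go, resolving src/config.go yields ["go", "rs"], while resolving src/main.go yields an empty list.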
This field existed in 4 locations:
- PathResolveOut struct
- Aggregation logic in resolve()
- tool_path_resolve response
- Tool description
Restoring maybe_exts and Verifying the Effect
All 4 locations were restored, build and tests passed, and the benchmark was run.
Result: 65/70 — recovered to the baseline variance range.
| Category | opt4 (67) | Post-revert (52-63) | Restored (65) |
|---|---|---|---|
| G: Retry | 3/3 | 0-3/3 | 3/3 |
| K: Cross-lang | 4/6 | 4/6 | 5/6 |
| L: Config | 0-4/4 | 0-2/4 | 2/2 |
A few tokens of metadata (maybe_exts: ["rs","go"]) were helping the LLM make extension decisions, producing a score difference of over 10 points.
Further Response Vocabulary Experiments
“Hint text — small, effective vocabulary choices determine accuracy.” With this insight, additional signals were tested.
Adding alike: N (Same-Stem File Count)
Where maybe_exts communicates what is similar, alike was designed to communicate how many are similar as a numeric value.
The result was unexpected. There were 273 files with the stem config, producing a response of alike: 273. The LLM reacted to this large number by entering an excessive exploration mode, failed at list_dir API calls, then fell into an infinite loop trying to verify counts during summary creation. It repeatedly recalculated “59 PASS, 1 FAIL” dozens of times and completely skipped Tests 61–70.
Switching to exist_similar: bool
If numbers are dangerous, use a boolean. exist_similar: true/false was returned on every response.
Three benchmark runs scored 66, 60, and 56, a 10-point spread. lexiopt6-1 reached K=6/6 and L=4/4, the best results recorded for those categories, but lexiopt6-3 collapsed to G=0/3. Variance increased compared to the maybe_exts-only configuration (stable at 65/70).
The field was returned on every response on the hypothesis that “false might improve stability by reinforcing certainty,” but this backfired.
Reverting to maybe_exts Only
exist_similar and alike were removed, returning to maybe_exts only.
Discovering the Principles
Three principles emerged from the experimental results.
Context Pollution Is Proportional to Response Length
Fields returned on every test accumulate in context. The candidates list (path, score, why, root for top-K) is 50–80 tokens per entry; at top-5 across 70 tests, that is roughly 17,500–28,000 accumulated tokens. exist_similar: false is a single field, but its 70 appearances also accumulate and consume attention bandwidth.
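The accumulation can be checked with simple arithmetic. The helper below is only an illustrative sketch; the function name is hypothetical, and the per-entry token counts are the 50–80 estimate quoted above, not measured values.

```rust
/// Tokens a field adds to the context over a whole benchmark run:
/// tokens per entry, times entries per response, times the number of
/// responses in which the field appears.
fn accumulated_tokens(tokens_per_entry: u32, entries_per_response: u32, responses: u32) -> u32 {
    tokens_per_entry * entries_per_response * responses
}
```

For the candidates list, accumulated_tokens(50, 5, 70) gives 17,500 and accumulated_tokens(80, 5, 70) gives 28,000. The same formula shows why a conditional field is cheap: its responses argument is the number of ambiguous cases, not 70.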
maybe_exts is stable because it does not appear in most tests, where the solution is unambiguous. It appears only in genuinely ambiguous cases, so its context pollution is near zero.
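One way to get this conditional behavior is to omit the field entirely at render time. The sketch below is an assumption-heavy illustration: PathResolveOut is the struct name mentioned in this log, but its fields, the render function, and the hand-built JSON-like output are hypothetical and do not reflect pathfinder's real serialization.

```rust
/// Hypothetical response shape; only the struct name and the maybe_exts
/// field name come from the article, the rest is assumed.
struct PathResolveOut {
    path: String,
    /// Empty when the resolution is unambiguous.
    maybe_exts: Vec<String>,
}

/// Render a compact JSON-like response, emitting maybe_exts only when there
/// are at least two extensions to choose between.
/// Note: string escaping is omitted to keep the sketch short.
fn render(out: &PathResolveOut) -> String {
    let mut body = format!("{{\"path\":\"{}\"", out.path);
    if out.maybe_exts.len() >= 2 {
        let exts: Vec<String> = out
            .maybe_exts
            .iter()
            .map(|e| format!("\"{}\"", e))
            .collect();
        body.push_str(&format!(",\"maybe_exts\":[{}]", exts.join(",")));
    }
    body.push('}');
    body
}
```

In the unambiguous case the response carries only the path, so the absence of the field itself signals certainty, in contrast to an always-on exist_similar: false that costs tokens on every call.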
Primacy-Recency Cascade
Small LLMs are strongly biased toward the beginning and end of context. Mid-sequence tests (G: Tests 46–48) sit in the “attention valley.” As always-on fields accumulate, the context appears to cross a critical threshold mid-sequence; accuracy collapses there, and the failure cascades into the end section. The pattern of G and L degrading together supports this.
More Is Not Better
More information does not necessarily improve accuracy. alike: 273 was amplified as an anxiety signal, and exist_similar: false became always-on noise. The most effective approach was maybe_exts — conditionally appearing, small, and directly actionable.
Task-Based Benchmark Expansion
10 tasks covering categories that consistently failed in the 70-test benchmark (F: intent, K: cross-lang, L: config disambiguation) were added, expanding the task-based tests (test_prompt_tasks.md) from 30 to 40.
Added Tasks 31–40:
- Go vs Rust client disambiguation (K category)
- Hand-written vs generated disambiguation (K category)
- Dev vs prod Kubernetes overlay (L category)
- VPC vs ECS Terraform module (L category)
- Go config vs Rust config (D category)
- Deep OAuth callback, Postgres vs PostgREST, vendor exclusion, GraphQL schema vs generated, reranker entry point
Benchmark result: a perfect 40/40 with 42 tool calls. In task-based tests the LLM naturally has an intent to pass along, so pathfinder’s use of intent_text is more effective.
A baseline version test (test_prompt_tasks_baseline.md) solving the same tasks using only filesystem MCP tools was also created.
Benchmark Score Progression
| Configuration | Score | Variance |
|---|---|---|
| With maybe_exts (lexiopt4) | 67/70 | baseline |
| Without maybe_exts (revert) | 52-63/70 | 11 points |
| maybe_exts restored (latest) | 65/70 | baseline range |
| alike: N added | collapse (loop) | — |
| exist_similar: bool added | 56-66/70 | 10 points |
| maybe_exts only (final) | 65/70 | stable |
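The “Variance” column above is the spread between the best and worst run (highest score minus lowest), not a statistical variance. A minimal sketch of that calculation, with a hypothetical helper name:

```rust
/// Spread of a set of benchmark scores: highest minus lowest.
/// Returns None when no scores are given.
fn spread(scores: &[u32]) -> Option<u32> {
    let max = scores.iter().max()?;
    let min = scores.iter().min()?;
    Some(max - min)
}
```

Applied to the runs reported in this log, spread(&[63, 52, 57]) is Some(11) for the configuration without maybe_exts, and spread(&[66, 60, 56]) is Some(10) for exist_similar.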
Code Changes Summary
- Restored maybe_exts (PathResolveOut, resolve(), tool_path_resolve, tool description)
- Added and removed alike: usize
- Added and removed exist_similar: bool
- Expanded task-based tests from 30 to 40
- Created filesystem baseline task tests
