Conclusion

Changing a single metadata field in an MCP server’s response can swing a small LLM’s benchmark score by as much as 15 points out of 70.

When maybe_exts (a list of candidate extensions sharing the same stem) was removed from pathfinder’s path resolution response, the score degraded from 67/70 to 52–63/70. Restoring it recovered the score to 65/70. Adding alike: N (count of same-stem files) caused the LLM to enter an infinite loop, and returning exist_similar: bool on every response expanded the variance to 10 points.

Ultimately, the following design principles proved effective for MCP response vocabulary targeting small LLMs:

  • Conditional appearance — Return a field only in ambiguous cases. Absence implies certainty
  • Small cardinality — Lists of 2–3 elements. A number like 273 is dangerous
  • Directly actionable — Concrete instructions like “extensions to try.” Not abstract similarity
  • Minimal response length — Fields returned on every response accumulate in context and consume attention bandwidth

Background

pathfinder is a custom MCP server that combines ColBERT MaxSim-based semantic reranking with lexical scoring for file path resolution. The benchmark LLM was qwen3.5-35b-a3b (Q5_K_M quantization), and its path resolution accuracy was measured repeatedly across 70 tests.

The starting point for this work was a score degradation to 52–63/70 after partially reverting changes from the earlier lexical optimization (lexiopt4) that had achieved a best score of 67/70.


Investigating Score Degradation After Revert

Three benchmark result files (lexiopt4-1, 4-2, 4-3) were compared. Against the pre-revert lexiopt4 (67/70) and lexiopt5 (61/70), the post-revert scores were 63, 52, 57 — with significant variance.

Key degradation points:

| Category | Pre-revert | Post-revert |
|---|---|---|
| G: Retry operations | 3/3 | 0/3 (total failure in Runs 2, 3) |
| L: Test variant/config | 4/4 | 0–2/4 |
| D: Wrong extensions | 7/7 | 6/7 (Test 31 consistently failing) |

The initial suspicion was the code revert itself. Four features should have been lost in the revert: segment_overlap, dir_ext_counts, the warmup improvements, and the tool_name addition to tool_retry_with_resolve.


Nothing Had Been Reverted

Grepping the actual code confirmed all four features were present in the current codebase. Nothing had been reverted.

So what caused the score degradation? A git diff between the commit at lexiopt4 benchmark time (9192d6b) and current HEAD (d6c9dc8) was examined.


Discovering maybe_exts

The diff revealed a single change — the removal of the maybe_exts field.

maybe_exts returns a list of extensions for candidates sharing the same stem (filename minus extension) as the best resolution result. For example, when resolving config.go, if config.rs is also a candidate, it returns maybe_exts: ["go", "rs"].
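
The aggregation described above can be sketched roughly as follows. This is a minimal illustration in Python, not pathfinder's actual code: the function name maybe_exts_for, the plain list-of-paths input, and the choice to return None when there is only one extension are all assumptions made for the example.

```python
from pathlib import PurePosixPath
from typing import Optional


def maybe_exts_for(best: str, candidates: list[str]) -> Optional[list[str]]:
    """Collect extensions of candidates that share the best result's stem.

    Hypothetical helper: returns None when there is nothing to disambiguate,
    so the caller can leave the field out of the response entirely.
    """
    best_path = PurePosixPath(best)
    exts = [best_path.suffix.lstrip(".")]  # the best result's extension comes first
    for cand in candidates:
        p = PurePosixPath(cand)
        ext = p.suffix.lstrip(".")
        # Same stem, a real extension, and not already listed.
        if p.stem == best_path.stem and ext and ext not in exts:
            exts.append(ext)
    return exts if len(exts) > 1 else None


print(maybe_exts_for("src/config.go", ["src/config.go", "lib/config.rs", "src/main.go"]))
print(maybe_exts_for("src/main.go", ["src/main.go", "lib/config.rs"]))
```

The first call prints ['go', 'rs'] because config.rs shares the stem config; the second prints None because no other candidate shares the stem main.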

This field existed in 4 locations:

  1. PathResolveOut struct
  2. Aggregation logic in resolve()
  3. tool_path_resolve response
  4. Tool description

Restoring maybe_exts and Verifying the Effect

All 4 locations were restored, build and tests passed, and the benchmark was run.

Result: 65/70 — recovered to the baseline variance range.

| Category | opt4 (67) | Post-revert (52-63) | Restored (65) |
|---|---|---|---|
| G: Retry | 3/3 | 0-3/3 | 3/3 |
| K: Cross-lang | 4/6 | 4/6 | 5/6 |
| L: Config | 0-4/4 | 0-2/4 | 2/2 |

A few tokens of metadata (maybe_exts: ["rs","go"]) were helping the LLM make extension decisions, producing a score difference of over 10 points.


Further Response Vocabulary Experiments

The takeaway so far was that small, well-chosen hint vocabulary determines accuracy. With this insight, additional signals were tested.

Adding alike: N (Same-Stem File Count)

Where maybe_exts communicates what is similar, alike was designed to communicate how many are similar as a numeric value.

The result was unexpected. There were 273 files with the stem config, producing a response of alike: 273. The LLM reacted to this large number by entering an excessive exploration mode, failed at list_dir API calls, then fell into an infinite loop trying to verify counts while writing the summary. It recalculated “59 PASS, 1 FAIL” dozens of times and skipped Tests 61–70 entirely.

Switching to exist_similar: bool

If numbers are dangerous, use a boolean. exist_similar: true/false was returned on every response.

Three benchmark runs scored 66, 60, and 56, so the spread widened to 10 points. lexiopt6-1 achieved K=6/6 and L=4/4, the best results recorded for those categories, but lexiopt6-3 collapsed to G=0/3. The spread was wider than with the maybe_exts-only configuration (stable at 65/70).

The always-on return was based on the hypothesis that an explicit false might improve stability by reinforcing certainty. It backfired.

Reverting to maybe_exts Only

exist_similar and alike were removed, returning to maybe_exts only.


Discovering the Principles

Three principles emerged from the experimental results.

Context Pollution Is Proportional to Response Length

Fields returned on every test accumulate in context. Each entry in the candidates list (path, score, why, root for top-K) costs 50–80 tokens, so returning the top 5 on each of 70 tests accumulates roughly 17,500–28,000 tokens. exist_similar: false is a single field, but 70 appearances still accumulate and consume attention bandwidth.
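
The accumulation can be estimated with simple arithmetic. The sketch below uses only the figures quoted above and assumes exactly one resolve response per test, which is a simplification; real runs with retries would accumulate more.

```python
# Rough accumulation estimate using the figures quoted in the text.
TOKENS_PER_ENTRY = (50, 80)  # low and high estimate per candidate entry
TOP_K = 5                    # candidates returned per response
NUM_TESTS = 70               # assumption: one resolve response per test

low, high = (t * TOP_K * NUM_TESTS for t in TOKENS_PER_ENTRY)
print(f"candidates list: {low:,} to {high:,} tokens over the run")
```

This prints a range of 17,500 to 28,000 tokens, which is why even a single extra always-on field is not free once it is repeated 70 times on top of that.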

maybe_exts is stable because it does not appear in most tests, where the solution is unambiguous. It appears only in genuinely ambiguous cases, so its context pollution is near zero.
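
A minimal sketch of conditional emission, assuming a dictionary-shaped response with a path key. The function name build_response and the field layout are illustrative only; the real tool_path_resolve response carries more fields than this.

```python
import json
from typing import Optional


def build_response(path: str, maybe_exts: Optional[list[str]] = None) -> dict:
    """Build a resolve response; the field layout here is illustrative only."""
    response = {"path": path}
    if maybe_exts and len(maybe_exts) > 1:
        # Only ambiguous results carry the hint; absence implies certainty.
        response["maybe_exts"] = maybe_exts
    return response


unambiguous = build_response("src/main.go")
ambiguous = build_response("src/config.go", ["go", "rs"])
print(json.dumps(unambiguous))
print(json.dumps(ambiguous))
```

The unambiguous response contains no maybe_exts key at all, rather than an empty list or a false flag, so it adds nothing to the context in the common case.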

Primacy-Recency Cascade

Small LLMs are strongly biased toward the beginning and end of context. Mid-sequence tests (G: Tests 46–48) sit in the “attention valley.” When always-on fields accumulate in context, accuracy on these mid-sequence tests collapses once the accumulated context crosses a critical threshold, and the failure then cascades into the end section. The pattern of G and L degrading together supports this.

More Is Not Better

More information does not necessarily improve accuracy. alike: 273 acted as an anxiety signal that triggered excessive exploration, and exist_similar: false became always-on noise. The most effective signal was maybe_exts: it appears conditionally, stays small, and is directly actionable.


Task-Based Benchmark Expansion

10 tasks covering categories that consistently failed in the 70-test benchmark (F: intent, K: cross-lang, L: config disambiguation) were added, expanding the task-based tests (test_prompt_tasks.md) from 30 to 40.

Added Tasks 31–40:

  • Go vs Rust client disambiguation (K category)
  • Hand-written vs generated disambiguation (K category)
  • Dev vs prod Kubernetes overlay (L category)
  • VPC vs ECS Terraform module (L category)
  • Go config vs Rust config (D category)
  • Deep OAuth callback, Postgres vs PostgREST, vendor exclusion, GraphQL schema vs generated, reranker entry point

Benchmark result: a perfect 40/40 in 42 tool calls. In task-based tests the LLM naturally has an intent to express, so pathfinder’s use of intent_text is more effective.

A baseline version test (test_prompt_tasks_baseline.md) solving the same tasks using only filesystem MCP tools was also created.


Benchmark Score Progression

| Configuration | Score | Variance |
|---|---|---|
| With maybe_exts (lexiopt4) | 67/70 | baseline |
| Without maybe_exts (revert) | 52-63/70 | 11 points |
| maybe_exts restored (latest) | 65/70 | baseline range |
| alike: N added | collapse (loop) | |
| exist_similar: bool added | 56-66/70 | 10 points |
| maybe_exts only (final) | 65/70 | stable |
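
The “variance” figures in this table match the best-minus-worst spread across runs rather than a statistical variance. The short check below recomputes them from the per-run scores reported earlier; the helper name spread is chosen here for illustration.

```python
def spread(scores: list[int]) -> int:
    """Best-minus-worst score across runs (what the text calls 'variance')."""
    return max(scores) - min(scores)


# Per-run scores as reported in the sections above.
runs = {
    "without maybe_exts": [63, 52, 57],
    "exist_similar: bool": [66, 60, 56],
}
for name, scores in runs.items():
    print(f"{name}: {min(scores)}-{max(scores)}/70, spread {spread(scores)} points")
```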

Code Changes Summary

  • Restored maybe_exts (PathResolveOut, resolve(), tool_path_resolve, tool description)
  • Added and removed alike: usize
  • Added and removed exist_similar: bool
  • Expanded task-based tests from 30 to 40
  • Created filesystem baseline task tests