Building an AST-Based Codebase Analyzer for Local LLM Context: ctree Design and pathfinder/Serena Integration
I built ctree, a Rust AST-based on-demand codebase analyzer for the languages I commonly use, to reduce local LLM context consumption. It provides a shallow codebase overview via hash-based snapshots, complementing pathfinder for path accuracy and Serena for symbol-level precision.
Conclusion
When running a coding agent on a local LLM, context consumption efficiency determines practical usability. Symbol analysis tools like Serena are precise, but loading all symbols upfront causes context to explode.
I built ctree, an AST-based on-demand codebase analyzer in Rust, targeting only the languages I commonly use. It manages codebase-wide snapshots with hash-based tracking, following implementation changes and dependency changes in compact form. The implementer can adjust scope arbitrarily based on their current focus.
The core design is to get shallow hints from ctree first, then raise resolution with Serena’s find_symbol. Combined with pathfinder for file operation accuracy, this saves context and improves tool precision at the same time.
Given the per-token cost of GPU context processing, software-side performance improvements are highly cost-effective.

Motivation: Complementing Serena’s Context Cost
Serena provides precise symbol-level analysis, but it is not suited for “first, get a shallow overview of the entire codebase.” Loading all symbols inflates context and pressures the limited context windows of local LLMs.
What was needed was a “shallow overview” layer placed before Serena.
| Layer | Tool | Role |
|---|---|---|
| File operation accuracy | pathfinder | Recovers from path resolution failures, prevents context bloat |
| Shallow codebase overview | ctree | Provides implementation and dependency changes compactly via hash-based snapshots |
| Precise symbol analysis | Serena | Retrieves specific symbol definitions and references via find_symbol |
ctree was designed alongside pathfinder. pathfinder ensures file path accuracy, ctree provides structural hints about the codebase, and Serena delivers final resolution. All three operate as independent MCP tools.
CLI
$ ctree --help
Generate annotated tree + symbols + compact ctx
Usage: ctree [OPTIONS] [COMMAND]
Commands:
init
reset
check
clean
help Print this message or the help of the given subcommand(s)
Options:
--mcp Run as MCP stdio server
--config <CONFIG> Config file path (default: .ctree.toml if exists)
--root <ROOT> Scan root path [default: .]
--syntax <SYNTAX> [possible values: go, rust, python, typescript,
csharp, dart, lua, awk, shell, kotlin, swift,
markdown, html]
--include <INCLUDE> Include globs
--exclude <EXCLUDE> Exclude globs [default: **/tmp/**,**/testdata/**,
**/.git/**,**/vendor/**]
--strong <STRONG> Strong scope globs
--weak <WEAK> Weak scope globs
--sw <SW> Max annotation words per strong file line [default: 12]
--ww <WW> Max annotation words per weak/symbol-only file line [default: 3]
--reasoning <REASONING> Reasoning level [possible values: high, medium, low]
--template <TEMPLATE> Output template [possible values: plain, hugo, jinja]
-h, --help Print help
Design: AST-Based On-Demand Analysis
Supported Languages
I implemented tree-sitter based AST parsers for the languages I regularly use.
Go, Rust, Python, TypeScript/JavaScript, Dart, Kotlin, Lua, Shell, Awk, C#, Swift, Markdown, HTML
This covers not just application code but also shell scripts and documentation. In practice, looking only at the primary language rarely gives a complete picture of a repository.
strong / weak Scope
Rather than analyzing the entire codebase uniformly, files are separated into strong (detailed) and weak (abbreviated) scopes.
[ctree]
watch = "rust"
root = "."
include = ["src/**/*.rs", "tests/**/*.rs"]
exclude = ["**/tmp/**", "**/testdata/**", "**/.git/**", "**/vendor/**", "**/target/**"]
strong = ["src/**"]
weak = ["tests/**"]
sw = 24
ww = 5
reasoning = "medium"
Files in strong get detailed symbol summaries; weak files get lightweight overviews only. When focus shifts, changing this configuration switches the scope.
This separation is the core design decision in ctree. Instead of carrying the entire codebase at full weight, the implementer actively decides where to invest detail.
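The strong/weak split can be pictured with a small classifier. This is a minimal sketch, not ctree's implementation: the `Scope` enum, `classify`, and `word_budget` are hypothetical names, only patterns of the form `dir/**` are handled, and the rule that strong wins over weak when both match is an assumption.

```rust
// Hypothetical sketch: classify a path as strong, weak, or outside both scopes.
// Only "dir/**" patterns are supported; patterns such as "src/**/*.rs"
// would need a real glob matcher.

#[derive(Debug, PartialEq, Clone, Copy)]
enum Scope {
    Strong,
    Weak,
    Outside,
}

/// Returns true when `path` lies under the directory named by a "dir/**" pattern.
/// A pattern without the "/**" suffix must match the path exactly.
fn matches_dir_glob(pattern: &str, path: &str) -> bool {
    match pattern.strip_suffix("/**") {
        Some(dir) => path.starts_with(dir) && path[dir.len()..].starts_with('/'),
        None => pattern == path,
    }
}

/// Strong patterns are checked first (assumption: strong takes precedence).
fn classify(path: &str, strong: &[&str], weak: &[&str]) -> Scope {
    if strong.iter().any(|p| matches_dir_glob(p, path)) {
        Scope::Strong
    } else if weak.iter().any(|p| matches_dir_glob(p, path)) {
        Scope::Weak
    } else {
        Scope::Outside
    }
}

/// Annotation word budget per line, mirroring the `sw` / `ww` settings.
/// Returning 0 for files outside both scopes is an assumption of this sketch.
fn word_budget(scope: Scope, sw: usize, ww: usize) -> usize {
    match scope {
        Scope::Strong => sw,
        Scope::Weak => ww,
        Scope::Outside => 0,
    }
}

fn main() {
    let strong = ["src/**"];
    let weak = ["tests/**"];
    for path in ["src/mcp.rs", "tests/cli.rs", "README.md"] {
        let scope = classify(path, &strong, &weak);
        println!("{}: {:?} (budget {})", path, scope, word_budget(scope, 24, 5));
    }
}
```

Shifting focus then amounts to editing the two pattern lists; the classifier itself does not change.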
reasoning Preset
Summary detail level is controlled by a single field: reasoning = "high" | "medium" | "low".
| reasoning | Use case |
|---|---|
| high | Small projects, situations requiring detailed analysis |
| medium | Normal development work (default) |
| low | Large monorepos, minimizing token consumption |
Hash-Based Snapshots
A deterministic hash is generated from scope settings (strong/weak patterns + reasoning level) to isolate snapshot directories.
.ctree/
rust/
e83adb52/ ← hash of scope configuration
snapshots/
.baseline.txt
rev/
0001.txt
0002.txt
Previous snapshots are preserved when switching scopes. Switching back restores the earlier snapshots without regeneration, which makes branch switches and focus changes low-cost.
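One way to derive such a directory name is to hash a canonical string built from the scope settings. The sketch below assumes FNV-1a (32-bit) purely because it is small and stable across runs; the hash function ctree actually uses is not stated, and `scope_hash` and `snapshot_dir` are hypothetical helper names.

```rust
/// FNV-1a, 32-bit. Used here only as a stand-in for a deterministic hash.
fn fnv1a32(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5; // FNV offset basis
    for &b in bytes {
        hash ^= u32::from(b);
        hash = hash.wrapping_mul(0x0100_0193); // FNV prime
    }
    hash
}

/// Builds a canonical string from the scope settings and hashes it.
/// Patterns are sorted first so that reordering them in the config does not
/// change the result (an assumption of this sketch).
fn scope_hash(strong: &[&str], weak: &[&str], reasoning: &str) -> String {
    let mut strong: Vec<&str> = strong.to_vec();
    let mut weak: Vec<&str> = weak.to_vec();
    strong.sort_unstable();
    weak.sort_unstable();
    let canonical = format!(
        "strong={}\nweak={}\nreasoning={}",
        strong.join(","),
        weak.join(","),
        reasoning
    );
    format!("{:08x}", fnv1a32(canonical.as_bytes()))
}

/// Hypothetical helper mirroring the directory layout shown above.
fn snapshot_dir(syntax: &str, hash: &str) -> String {
    format!(".ctree/{}/{}/snapshots", syntax, hash)
}

fn main() {
    let hash = scope_hash(&["src/**"], &["tests/**"], "medium");
    println!("{}", snapshot_dir("rust", &hash));
}
```

Because the hash depends only on the scope settings, returning to an earlier configuration yields the same directory name, which is what lets old snapshots be reused instead of regenerated.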
MCP Tools
ctree exposes 5 MCP tools.
| Tool | Role |
|---|---|
| check | Generate/update snapshots. Returns the latest diff |
| get_baseline | Returns annotated file tree + symbol summaries for the entire codebase |
| get_revs | Returns revision history (what changed) |
| get_text | Resolves hashes to symbol or dependency text |
| get_depends | Explores dependency relationships between symbols |
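Since ctree runs as an MCP stdio server, a client invokes these tools with JSON-RPC 2.0 `tools/call` requests. The sketch below builds such a request for get_text by hand. The `name`/`arguments` envelope follows the MCP specification, but the `hashes` argument name is an assumption, since ctree's tool schemas are not shown here; `json_escape` and `tools_call_request` are illustrative helpers.

```rust
/// Escapes the characters that must not appear raw inside a JSON string.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds a single-line JSON-RPC request for an MCP tool call.
/// The "hashes" argument is hypothetical; check the tool's real input schema.
fn tools_call_request(id: u64, tool: &str, hashes: &[&str]) -> String {
    let list = hashes
        .iter()
        .map(|h| format!("\"{}\"", json_escape(h)))
        .collect::<Vec<_>>()
        .join(",");
    format!(
        "{{\"jsonrpc\":\"2.0\",\"id\":{},\"method\":\"tools/call\",\"params\":{{\"name\":\"{}\",\"arguments\":{{\"hashes\":[{}]}}}}}}",
        id,
        json_escape(tool),
        list
    )
}

fn main() {
    // Over the stdio transport, each message is written as one line.
    println!("{}", tools_call_request(1, "get_text", &["abc123"]));
}
```

In practice an MCP client library would build and send this message; the sketch only shows what goes over the wire.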
Actual Output
ctree check — Snapshot Generation and Diff Detection
$ ctree check --syntax rust
frequency=always
action=generate
latest_before=0001
latest_after=0001
(no new rev)
When no changes are detected, it returns (no new rev) immediately. When changes are found, a new revision is generated with hashes of changed symbols.
First-time generation output:
$ ctree check --syntax rust
frequency=always
action=generate
latest_before=
latest_after=0001
rev_file=.ctree/rust/{scope_hash}/snapshots/rev/0001.txt
hashes={hash1},{hash2},{hash3},...
{hash1}
source=symbol kind=module name=... scope=strong path=src/.../main.rs line=2
mod ...;
--{hash1}
{hash2}
source=symbol kind=function name=... scope=strong path=src/.../mcp.rs line=164
fn ...(args: ...) -> Result<...> {
--{hash2}
Each hash is a symbol identifier — pass it to get_text to retrieve the full body.
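The record layout in the output above (a hash line, a key=value metadata line, body lines, then a closing `--{hash}` line) is simple enough to parse line by line. The following is a minimal sketch that assumes exactly that layout; the `Record` struct and `parse_records` function are hypothetical and not part of ctree.

```rust
use std::collections::HashMap;

/// One symbol record from `ctree check` output (hypothetical representation).
#[derive(Debug, Default)]
struct Record {
    hash: String,
    fields: HashMap<String, String>, // e.g. kind, scope, path, line
    body: Vec<String>,               // signature lines between metadata and terminator
}

/// Parses records delimited by `{hash}` ... `--{hash}`, using the `hashes=`
/// line to learn which lines open a record.
fn parse_records(output: &str) -> Vec<Record> {
    let mut records = Vec::new();
    let mut known: Vec<&str> = Vec::new();
    let mut current: Option<Record> = None;

    for line in output.lines() {
        match current.take() {
            Some(mut rec) => {
                if line.strip_prefix("--") == Some(rec.hash.as_str()) {
                    records.push(rec); // terminator reached, record complete
                } else {
                    let first_line = rec.fields.is_empty() && rec.body.is_empty();
                    if first_line && line.starts_with("source=") {
                        for pair in line.split_whitespace() {
                            if let Some((k, v)) = pair.split_once('=') {
                                rec.fields.insert(k.to_string(), v.to_string());
                            }
                        }
                    } else {
                        rec.body.push(line.to_string());
                    }
                    current = Some(rec);
                }
            }
            None => {
                if let Some(list) = line.strip_prefix("hashes=") {
                    known = list.split(',').filter(|h| !h.is_empty()).collect();
                } else if known.contains(&line) {
                    current = Some(Record { hash: line.to_string(), ..Default::default() });
                }
            }
        }
    }
    records
}

fn main() {
    let out = "hashes=aa11\naa11\nsource=symbol kind=function name=run scope=strong path=src/mcp.rs line=164\nfn run() {\n--aa11\n";
    for rec in parse_records(out) {
        println!("{} {:?} {:?}", rec.hash, rec.fields.get("path"), rec.body);
    }
}
```

A client could use the parsed `scope` and `path` fields to decide which hashes are worth passing to get_text, rather than fetching every changed symbol.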
get_baseline — Shallow Codebase Overview
get_baseline returns an annotated file tree and symbol summaries, not full source code. The baseline also includes an estimated token count. From this overview you identify symbols of interest and retrieve their definitions via Serena’s find_symbol.
The detailed output format is not disclosed, as ctree is integrated into our in-house LLM pipeline infrastructure.
get_revs — Revision Diffs
Returns compact revision diffs expressing symbol additions/removals and dependency additions/removals. Pass hashes to get_text to retrieve full symbol bodies or dependency details.
The detailed output format is not disclosed.
Three-Layer Integration: pathfinder + ctree + Serena
pathfinder — Recovering from Path Resolution Failures
Even when the LLM mistypes a directory name, pathfinder resolves the request to the correct path.
path_resolve:
check_path: "content/ja/docs/tech/infrastrcture/podman-quadlet-systemd-ubuntu.md"
^^^^^^^^^^^ typo
→ resolved: "content/ja/docs/tech/infrastructure/podman-quadlet-systemd-ubuntu.md"
$ pathfinder --help
pathfinder — semantic path finder & MCP resolution server
USAGE
pathfinder [OPTIONS] Interactive semantic directory finder (default).
pathfinder --mcp [OPTIONS] Start as an MCP server.
FINDER OPTIONS
--include-builds Include build/artifact dirs (target, dist, …).
MCP OPTIONS
--root <PATH> Add a project root directory to watch and index.
GENERAL OPTIONS
-h, --help Print this help message and exit.
-V, --version Print version, model, and PCA config.
MCP TOOLS
1. path_resolve Resolve a failed file path to the best match.
2. tool_retry_with_resolve Resolve + retry the operation in one call.
3. roots_list Return configured root directories.
4. reindex_paths Force a full index rebuild.
ENVIRONMENT VARIABLES
PF_MCP_INFERENCE Inference mode: "general" (default) or "code".
Models (both INT8 quantized):
general → mxbai-edge-colbert (17M, 48-dim)
code → lateon-code-edge (17M, 48-dim)
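The typo recovery in the path_resolve example can be illustrated with a much simpler technique. According to its help text, pathfinder ranks candidates with embedding models; the sketch below deliberately substitutes plain Levenshtein edit distance to show the idea of mapping a failed path to its nearest indexed path. The `resolve` function and its `max_distance` threshold are hypothetical.

```rust
/// Classic Levenshtein distance over Unicode scalar values, using two rows.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost) // substitution or match
                .min(prev[j + 1] + 1)      // deletion from `a`
                .min(curr[j] + 1);         // insertion into `a`
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Returns the indexed path closest to `requested`, if it is within
/// `max_distance` edits. This stands in for pathfinder's semantic ranking.
fn resolve<'a>(requested: &str, indexed: &[&'a str], max_distance: usize) -> Option<&'a str> {
    indexed
        .iter()
        .copied()
        .map(|p| (levenshtein(requested, p), p))
        .filter(|(d, _)| *d <= max_distance)
        .min_by_key(|(d, _)| *d)
        .map(|(_, p)| p)
}

fn main() {
    let indexed = ["docs/infrastructure/a.md", "docs/network/a.md"];
    println!("{:?}", resolve("docs/infrastrcture/a.md", &indexed, 3));
}
```

Edit distance only catches character-level slips. An embedding-based resolver can presumably also recover from wrong but semantically related names, which is a reason to prefer it in an MCP setting.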

Expected Workflow
1. pathfinder → resolve LLM path typos (ENOENT → tool_retry_with_resolve)
2. ctree check → update snapshots, get hashes of changed symbols
3. get_baseline → get a shallow overview of the codebase
4. Identify symbols of interest → drill down with Serena's find_symbol
5. get_revs + get_text → get details of changed symbols
The key point of this architecture is that each layer operates as an independent MCP tool, acquiring information progressively. Instead of loading everything upfront, start with a shallow overview and drill down as needed.
GPU context processing is expensive. For local LLMs operating within 8K–32K context windows, improving context efficiency on the software side translates directly to cost reduction.
Caveats
- ctree is an AST parser limited to languages I commonly use, not a general-purpose static analysis tool
- The detailed output formats of get_baseline and get_revs are not disclosed, as ctree is integrated into our in-house LLM pipeline infrastructure
- Integration of ML inference (ONNX, ColBERT, etc.) was considered and rejected. The strong/weak scope control already solves the monorepo context explosion problem, so the deterministic design was maintained
- Compliant with MCP protocol (2024-11-05). Works with Claude Code, Zed Editor, and other MCP clients
