Fixing aichat Function Calling Hangs in a Symlink Environment
How a symlink-based llm-functions layout caused a fork bomb in aichat’s function calling, and how a real-path-aware PATH sanitizer fixed it
Background
aichat is a Rust-based LLM CLI client that supports shell-script-based function calling through its llm-functions submodule. In my fork, I registered shell scripts wrapping system tools like jq, rg, tokei, and dust as llm-functions/tools/*.sh, allowing the LLM to invoke them.
Under this scheme, argc build generates symlinks under llm-functions/bin/ for each tool name. For example, bin/jq → ../scripts/run-tool.sh — all tool binaries point to a shared runner script. run-tool.sh determines the tool name from $0, executes the corresponding tools/*.sh, and writes the result to $LLM_OUTPUT.
However, when invoking tools via function calling, responses never came back. Normal chat responses worked fine, but shell tool calls hung indefinitely. Specifically, [DEBUG] tool start: tokei would appear in the logs, but [DEBUG] tool done never followed.
My fork environment had one unusual aspect: the llm-functions directory was not a submodule inside the aichat repository but a symlink to a separately cloned repository (/Users/ksh3/Development/aichat/llm-functions → /Users/ksh3/Development/llm-functions). This environment-specific layout turned out to be the core of the problem.
Investigating the Rust Side — a Dead End
The initial hypothesis was that a recent modification on the fork side had introduced a regression.
I examined diffs in src/function.rs and src/utils/command.rs, but the working tree contained only committed changes with no obvious regression. A commit message reading “mistake refactor” in git log looked suspicious but was not the direct cause.
I verified the build environment with cargo run -- --info, confirming function_calling: true was enabled, functions_dir pointed to the correct path, and both functions.json and tools.txt existed. Running ./Argcfile.sh test-function-calling failed with Error: Unknown role '%functions%', which was a role configuration issue unrelated to the hang.
At this point, the Rust-side investigation was a dead end.
Clues from the Reproduction Log
The actual reproduction log looked like this:
Call ctree_check {}
[DEBUG] tool start: ctree_check
[DEBUG] tool done: ctree_check exit=1
Error: Tool call exit with 1
pilot> run tokei
Call tokei {}
[DEBUG] tool start: tokei
ctree_check was an MCP bridge tool — it exited with an error but did not hang. In contrast, tokei went through llm-functions/tools/tokei.sh as a shell tool, and the response never returned after tool start.
This difference was critical. MCP bridge tools go through run-mcp-tool.sh, while shell tools go through run-tool.sh. The problem was isolated to the run-tool.sh execution path.
Discovering the Fork Bomb
Running bin/tokei '{}' </dev/null directly produced “error: invalid JSON data” and exited immediately. The tool itself worked, but the hang could not be reproduced this way.
The decisive evidence came from ps -Ao 'pid,ppid,command'. Hundreds of processes were running, all following the same pattern:
bash /Users/ksh3/Development/aichat/llm-functions/bin/jq -r def escape_shell_word:...
Each process was a child of the previous one, forming an infinite recursion chain. bin/jq is a symlink to run-tool.sh, but run-tool.sh internally calls the jq command to parse JSON arguments. When llm-functions/bin/ sits at the front of PATH, the shell resolves jq not to the system /usr/bin/jq but to the wrapper bin/jq. That wrapper launches run-tool.sh again, which calls jq again, ad infinitum. This fork bomb exhausted process slots, causing tool calls to hang.
The reproduction conditions were:
llm-functions/bin/is in PATH (aichat adds it automatically when executing tools)- The JSON parsing in
run-tool.shcallsjqbefore PATH cleanup has run - This affects all shell tools, not just those whose names collide with system binaries (
run-tool.shis the common entry point for every tool, and thejqcall for argument parsing runs before any tool-specific code)
First Fix Attempt — Insufficient
The first fix was straightforward. run-tool.sh already had a PATH cleanup line — export PATH="${PATH//"$root_dir/bin:"/}" — but it was placed after the jq call. I moved it before the jq call and also added the same cleanup to run-agent.sh, which was missing it entirely.
However, this fix was insufficient. In my fork environment, llm-functions is a symlink, so root_dir/bin resolves to the real path /Users/ksh3/Development/llm-functions/bin, while PATH contains the symlink-based path /Users/ksh3/Development/aichat/llm-functions/bin. Bash string substitution ${PATH//"$root_dir/bin:"/} does not match because the two paths differ. This was the mechanism behind the environment-specific failure. With a standard git submodule layout, the paths would match and the problem would not occur.
Resolution with sanitize_path
The final fix introduced a sanitize_path() function that:
- Resolves the real path of
root_dir/binusingcd ... && pwd -P - Compares each PATH entry against both the literal path and the resolved real path (
pwd -P) - Excludes any entry that matches either one
This function was placed in both run-tool.sh and run-agent.sh, before the first jq invocation. The old string-substitution line was removed.
Verification tests:
get_current_time '{}'— worked correctlytokei '{"path":"src"}'— no hang, worked correctlyjq '{"filter":".a","input":"{\"a\":1}"}'— worked correctly even when the tool name collides directly with a system binarypgrep -f 'escape_shell_word'— confirmed zero zombie processes
Post-Fix Rust-Side Cleanup
Although the problem was resolved on the shell script side, I also cleaned up changes made to the fork’s Rust code during the investigation. Reviewing the diff against the develop branch:
src/function.rsdebugeprintln!statements ([DEBUG] tool start/done) — added during debugging, unnecessary, removedsrc/utils/command.rsrun_command_no_stdin()— a function that passesStdio::null()to stdin, preventing tools from hanging while waiting for stdin input. This was a useful improvement and was kept.
Summary
The root cause was that run-tool.sh’s PATH cleanup logic did not handle symlink-based llm-functions layouts. run-tool.sh already had logic to remove bin/ from PATH, but it relied on string matching, which fails when symlinks cause the real path and the literal path to differ. With a standard submodule layout the paths would match, so this problem only manifested in my environment-specific configuration.
The fix came down to two things: ensuring bin/ is removed from PATH before jq is first called, and using pwd -P real-path comparison so the cleanup works correctly even in symlink environments.
