Background

aichat is a Rust-based LLM CLI client that supports shell-script-based function calling through its llm-functions submodule. In my fork, I registered shell scripts wrapping system tools like jq, rg, tokei, and dust as llm-functions/tools/*.sh, allowing the LLM to invoke them.

Under this scheme, argc build generates symlinks under llm-functions/bin/ for each tool name. For example, bin/jq../scripts/run-tool.sh — all tool binaries point to a shared runner script. run-tool.sh determines the tool name from $0, executes the corresponding tools/*.sh, and writes the result to $LLM_OUTPUT.

However, when invoking tools via function calling, responses never came back. Normal chat responses worked fine, but shell tool calls hung indefinitely. Specifically, [DEBUG] tool start: tokei would appear in the logs, but [DEBUG] tool done never followed.

My fork environment had one unusual aspect: the llm-functions directory was not a submodule inside the aichat repository but a symlink to a separately cloned repository (/Users/ksh3/Development/aichat/llm-functions/Users/ksh3/Development/llm-functions). This environment-specific layout turned out to be the core of the problem.

Investigating the Rust Side — a Dead End

The initial hypothesis was that a recent modification on the fork side had introduced a regression.

I examined diffs in src/function.rs and src/utils/command.rs, but the working tree contained only committed changes with no obvious regression. A commit message reading “mistake refactor” in git log looked suspicious but was not the direct cause.

I verified the build environment with cargo run -- --info, confirming function_calling: true was enabled, functions_dir pointed to the correct path, and both functions.json and tools.txt existed. Running ./Argcfile.sh test-function-calling failed with Error: Unknown role '%functions%', which was a role configuration issue unrelated to the hang.

At this point, the Rust-side investigation was a dead end.

Clues from the Reproduction Log

The actual reproduction log looked like this:

  Call ctree_check {}
[DEBUG] tool start: ctree_check
[DEBUG] tool done: ctree_check exit=1
Error: Tool call exit with 1

pilot> run tokei
Call tokei {}
[DEBUG] tool start: tokei
  

ctree_check was an MCP bridge tool — it exited with an error but did not hang. In contrast, tokei went through llm-functions/tools/tokei.sh as a shell tool, and the response never returned after tool start.

This difference was critical. MCP bridge tools go through run-mcp-tool.sh, while shell tools go through run-tool.sh. The problem was isolated to the run-tool.sh execution path.

Discovering the Fork Bomb

Running bin/tokei '{}' </dev/null directly produced “error: invalid JSON data” and exited immediately. The tool itself worked, but the hang could not be reproduced this way.

The decisive evidence came from ps -Ao 'pid,ppid,command'. Hundreds of processes were running, all following the same pattern:

  bash /Users/ksh3/Development/aichat/llm-functions/bin/jq -r def escape_shell_word:...
  

Each process was a child of the previous one, forming an infinite recursion chain. bin/jq is a symlink to run-tool.sh, but run-tool.sh internally calls the jq command to parse JSON arguments. When llm-functions/bin/ sits at the front of PATH, the shell resolves jq not to the system /usr/bin/jq but to the wrapper bin/jq. That wrapper launches run-tool.sh again, which calls jq again, ad infinitum. This fork bomb exhausted process slots, causing tool calls to hang.

The reproduction conditions were:

  • llm-functions/bin/ is in PATH (aichat adds it automatically when executing tools)
  • The JSON parsing in run-tool.sh calls jq before PATH cleanup has run
  • This affects all shell tools, not just those whose names collide with system binaries (run-tool.sh is the common entry point for every tool, and the jq call for argument parsing runs before any tool-specific code)

First Fix Attempt — Insufficient

The first fix was straightforward. run-tool.sh already had a PATH cleanup line — export PATH="${PATH//"$root_dir/bin:"/}" — but it was placed after the jq call. I moved it before the jq call and also added the same cleanup to run-agent.sh, which was missing it entirely.

However, this fix was insufficient. In my fork environment, llm-functions is a symlink, so root_dir/bin resolves to the real path /Users/ksh3/Development/llm-functions/bin, while PATH contains the symlink-based path /Users/ksh3/Development/aichat/llm-functions/bin. Bash string substitution ${PATH//"$root_dir/bin:"/} does not match because the two paths differ. This was the mechanism behind the environment-specific failure. With a standard git submodule layout, the paths would match and the problem would not occur.

Resolution with sanitize_path

The final fix introduced a sanitize_path() function that:

  1. Resolves the real path of root_dir/bin using cd ... && pwd -P
  2. Compares each PATH entry against both the literal path and the resolved real path (pwd -P)
  3. Excludes any entry that matches either one

This function was placed in both run-tool.sh and run-agent.sh, before the first jq invocation. The old string-substitution line was removed.

Verification tests:

  • get_current_time '{}' — worked correctly
  • tokei '{"path":"src"}' — no hang, worked correctly
  • jq '{"filter":".a","input":"{\"a\":1}"}' — worked correctly even when the tool name collides directly with a system binary
  • pgrep -f 'escape_shell_word' — confirmed zero zombie processes

Post-Fix Rust-Side Cleanup

Although the problem was resolved on the shell script side, I also cleaned up changes made to the fork’s Rust code during the investigation. Reviewing the diff against the develop branch:

  • src/function.rs debug eprintln! statements ([DEBUG] tool start/done) — added during debugging, unnecessary, removed
  • src/utils/command.rs run_command_no_stdin() — a function that passes Stdio::null() to stdin, preventing tools from hanging while waiting for stdin input. This was a useful improvement and was kept.

Summary

The root cause was that run-tool.sh’s PATH cleanup logic did not handle symlink-based llm-functions layouts. run-tool.sh already had logic to remove bin/ from PATH, but it relied on string matching, which fails when symlinks cause the real path and the literal path to differ. With a standard submodule layout the paths would match, so this problem only manifested in my environment-specific configuration.

The fix came down to two things: ensuring bin/ is removed from PATH before jq is first called, and using pwd -P real-path comparison so the cleanup works correctly even in symlink environments.