Writing MCP tool descriptions that LLMs actually follow

When an LLM has a tool that can call five downstream services in one round-trip, you expect it to use that tool. In practice it often doesn’t - it calls each service one at a time, waiting for a full LLM generation between each call. For five tools that’s five generations instead of one. The tool description was technically correct, but it wasn’t driving the behavior.

We ran into this with Code Mode. The execute_code tool lets an LLM write JavaScript that runs in a WASM sandbox and calls multiple MCP tools in a single execution. The old description explained what the tool did. LLMs read it and then called tools individually anyway. We rewrote the description and saw measurable improvement. This post is about the techniques that worked.

The cost of doing nothing

Each individual tool call is a complete LLM round-trip: generate the call, execute it, send the result back, wait for the next generation. In an agentic workflow that touches several tools, that cost is paid once per call, so it isn’t a minor inefficiency.

With chaining - 1 LLM round-trip:

graph LR
  L5[LLM writes script] -->|execute_code| SB["sandbox: search + fetch_doc + summarize"]
  SB -->|all results| L6[LLM final answer]

  style L5 fill:#1a1a24,stroke:#8b5cf6,color:#e2e8f0
  style L6 fill:#1a1a24,stroke:#8b5cf6,color:#e2e8f0
  style SB fill:#1a1a24,stroke:#22c55e,color:#e2e8f0

Without chaining - 3 LLM round-trips:

graph LR
  L1[LLM generates] -->|tool call| T1[search]
  T1 -->|result| L2[LLM thinks]
  L2 -->|tool call| T2[fetch_doc]
  T2 -->|result| L3[LLM thinks]
  L3 -->|tool call| T3[summarize]
  T3 -->|result| L4[LLM final answer]

  style L1 fill:#1a1a24,stroke:#8b5cf6,color:#e2e8f0
  style L2 fill:#1a1a24,stroke:#8b5cf6,color:#e2e8f0
  style L3 fill:#1a1a24,stroke:#8b5cf6,color:#e2e8f0
  style L4 fill:#1a1a24,stroke:#8b5cf6,color:#e2e8f0
  style T1 fill:#1a1a24,stroke:#ef4444,color:#e2e8f0
  style T2 fill:#1a1a24,stroke:#ef4444,color:#e2e8f0
  style T3 fill:#1a1a24,stroke:#ef4444,color:#e2e8f0
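
To make the single-round-trip path concrete, here is a minimal sketch of the kind of script an LLM might pass to execute_code for the search, fetch_doc, summarize chain. The tools object below is a local stand-in so the sketch runs on its own in Node.js; the demo server alias, the argument names, and the return shapes are assumptions for illustration, not the real sandbox API.

```javascript
// Hypothetical stand-in for the sandbox's `tools` object so this runs on its own.
// The "demo" alias, argument names, and return shapes are assumptions.
const tools = {
  demo: {
    search: async ({ q }) => ({ hits: [{ id: "doc-1", title: `About ${q}` }] }),
    fetch_doc: async ({ id }) => ({ id, text: "Full text of " + id }),
    summarize: async ({ text }) => ({ summary: text.slice(0, 12) }),
  },
};

// Three dependent tool calls in one script: intermediate results stay in the
// script, and only the final value goes back to the model.
async function chainedLookup(topic) {
  const found = await tools.demo.search({ q: topic });
  const doc = await tools.demo.fetch_doc({ id: found.hits[0].id });
  const { summary } = await tools.demo.summarize({ text: doc.text });
  return { title: found.hits[0].title, summary };
}

chainedLookup("MCP").then((result) => console.log(result));
```

Each await here is an ordinary function call inside the script, so none of the intermediate values require a new LLM generation.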

Before and after

The original execute_code description read like documentation:

Execute JavaScript code in a sandboxed WASM runtime with MCP tools available as async functions. Tools are accessible via tools.serverAlias.toolName(args). Use await for tool calls. Return a value to send it back as the result.

Everything in it is accurate, but it doesn’t tell the LLM when to use the tool, how to sequence discovery before execution, what to do when tool calls fail, or why chaining matters. An LLM reading this description treats execute_code as one option among several roughly equivalent ones.

The new description shipped in v0.0.17 (April 29, 2026) is structured differently. It uses labeled sections that give the LLM distinct signal types: preference, process, rules, and patterns. Each section does a specific job.

Five techniques that worked

1. ALL-CAPS labels as visual anchors

The new description opens with:

STRONG PREFERENCE: Whenever a task requires more than one downstream MCP tool call, use execute_code. Two or more downstream calls = use execute_code. One downstream call = call it directly.

In our testing, LLMs give all-caps text more weight than the surrounding prose. This isn’t a documented behavior in any model card, but it’s reproducible - the same instruction in sentence case gets followed less reliably than in caps. The label STRONG PREFERENCE states the nature of the instruction before the instruction itself arrives. WORKFLOW, RULES, PATTERNS, and NOTE follow the same pattern throughout the description.

The label tells the LLM how to classify what it’s about to read. That classification happens before interpretation, so the content lands in the right mental category.
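
One way to keep the labels consistent is to assemble the description from labeled sections in code rather than hand-editing one long string. The sketch below is an assumption about how that could look, not how the shipped description is built; buildDescription is a hypothetical helper, and the two section bodies are the sentences quoted in this post.

```javascript
// Hypothetical helper: joins labeled sections into one description string.
// Labels are upper-cased so every section opens with the same kind of anchor.
function buildDescription(sections) {
  return sections
    .map(({ label, body }) => `${label.toUpperCase()}: ${body.trim()}`)
    .join("\n\n");
}

const description = buildDescription([
  {
    label: "strong preference",
    body:
      "Two or more downstream calls = use execute_code. " +
      "One downstream call = call it directly.",
  },
  {
    label: "workflow",
    body:
      "Always call search_tools() first to find what's available, " +
      "then call execute_code with what you found. Do not skip discovery.",
  },
]);

console.log(description);
```

Keeping the sections as data also makes it easy to reorder them or to test that the preference section always comes first.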

2. Numeric rules that remove interpretation room

Vague guidance like “prefer to batch tool calls” leaves the LLM deciding what “prefer” means in each situation. The new description replaces that with a threshold:

Two or more downstream calls = use execute_code. One downstream call = call it directly.

The LLM doesn’t have to weigh tradeoffs. It counts. Two is the boundary. This kind of concrete numeric rule generalizes well because the LLM can apply it to situations that don’t pattern-match anything in training data.

3. Working code snippets for pattern matching

The PATTERNS section includes two code examples the LLM can match against:

// Fan-out: run calls in parallel
const [a, b, c] = await Promise.allSettled([
  tools.search.query({ q: "topic A" }),
  tools.search.query({ q: "topic B" }),
  tools.docs.fetch({ id: "abc" })
])

// Continue-on-partial-failure
const results = []
for (const id of ids) {
  try {
    results.push(await tools.db.get({ id }))
  } catch (e) {
    console.log("skipped", id, e.message)
  }
}

The key is that these are complete, runnable patterns - not pseudocode. The LLM doesn’t need to invent the error handling approach or the parallelism strategy. It pattern-matches and adapts. Asking an LLM to invent patterns from a prose description produces inconsistent results. Giving it concrete patterns to adapt produces consistent ones.

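The description also notes that an “unhandled rejection aborts the run”. LLMs frequently write Promise.all when they mean Promise.allSettled. With Promise.all, a single failing tool call rejects the whole batch, and if nothing catches that rejection the script aborts. One failing tool call should not cancel the whole script.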
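
The difference between the two combinators is easy to see with stubbed calls. This is a minimal sketch that runs in plain Node.js; ok and failing are hypothetical stand-ins for tool calls, not real MCP tools.

```javascript
// Hypothetical tool stubs: one succeeds, one always fails.
const ok = async () => ({ value: 1 });
const failing = async () => {
  throw new Error("upstream timeout");
};

// Promise.all rejects as soon as any call fails; the successful result is lost.
async function withAll() {
  try {
    return await Promise.all([ok(), failing()]);
  } catch (e) {
    return "rejected: " + e.message;
  }
}

// Promise.allSettled always resolves and reports each outcome separately.
async function withAllSettled() {
  const settled = await Promise.allSettled([ok(), failing()]);
  return settled.map((s) =>
    s.status === "fulfilled" ? s.value : { error: s.reason.message }
  );
}

withAll().then((r) => console.log(r));
withAllSettled().then((r) => console.log(r));
```

Note that Promise.allSettled resolves to settlement objects ({ status, value } or { status, reason }), so a script has to unwrap each entry before using the result.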

4. The reason included, not just the rule

The description explains the cost model explicitly:

Calling tools individually means one round-trip each. Calling them inside execute_code means one round-trip total, regardless of how many tools the script calls.

This sounds redundant if you’re thinking of tool descriptions as an instruction set. But LLMs that understand why a rule exists apply it to novel situations that don’t match the original pattern. An LLM that only knows “two or more calls = use execute_code” might still call tools sequentially when the task is framed differently. An LLM that understands “sequential calls are expensive because of per-call LLM overhead” will chain them in situations the description never explicitly anticipated.

The reasoning is the generalization mechanism.

5. Cross-linking between related tool descriptions

The WORKFLOW section in execute_code says:

Always call search_tools() first to find what’s available, then call execute_code with what you found. Do not skip discovery.

And the search_tools description says:

After searching, call execute_code with the tools you found. Calling tools individually after searching wastes round-trips; chain them inside execute_code instead.

Neither description works as well alone. The execute_code description tells the LLM to start with discovery. The search_tools description closes the loop by pointing back to execution. An LLM reading both descriptions gets a consistent workflow from two independent angles, which reinforces the behavior more than either one could on its own.
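
Cross-links are easy to break when one description is edited without the other, so it can help to check them mechanically. The sketch below is one possible approach under stated assumptions: the name and description fields follow the MCP tool definition shape, the description strings are only the excerpts quoted above rather than the full shipped text, and missingCrossLinks is a hypothetical lint helper.

```javascript
// Two tool definitions whose descriptions point at each other.
// The texts are the excerpts quoted in this post, not the full descriptions.
const toolDefs = [
  {
    name: "execute_code",
    description:
      "WORKFLOW: Always call search_tools() first to find what's available, " +
      "then call execute_code with what you found. Do not skip discovery.",
  },
  {
    name: "search_tools",
    description:
      "After searching, call execute_code with the tools you found. " +
      "Calling tools individually after searching wastes round-trips; " +
      "chain them inside execute_code instead.",
  },
];

// Hypothetical lint: for each pair, both descriptions must name the other tool.
function missingCrossLinks(defs, pairs) {
  const byName = new Map(defs.map((d) => [d.name, d.description]));
  const problems = [];
  for (const [a, b] of pairs) {
    if (!(byName.get(a) ?? "").includes(b)) problems.push(`${a} does not mention ${b}`);
    if (!(byName.get(b) ?? "").includes(a)) problems.push(`${b} does not mention ${a}`);
  }
  return problems;
}

console.log(missingCrossLinks(toolDefs, [["execute_code", "search_tools"]]));
```

A substring check is deliberately crude; it only confirms that each description names its partner, not that the instruction is worded well.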

💡 The discovery step matters

A common failure mode is LLMs calling execute_code without knowing the exact tool names available. The WORKFLOW section enforces search_tools() before execute_code. The two-step pattern - discover, then act - prevents “I’ll guess the tool name” hallucinations inside the sandbox.

Technique summary

Technique	Example	Why it works
ALL-CAPS labels	STRONG PREFERENCE, WORKFLOW, RULES	Visual anchors - LLM weighs them higher than prose
Numeric rules	“2+ calls = use execute_code”	No interpretation room, counts instead of weighs
Working code snippets	`Promise.allSettled(...)`, try/catch loop	Pattern matching, not invention
Reasoning included	“one round-trip total vs one each”	LLM generalizes to novel situations
Cross-links between tools	search_tools points to execute_code and back	Two descriptions reinforce the same behavior

One more thing: the NOTE section

The description includes a section explaining what Promise<any> means in the TypeScript declarations the LLM receives at discovery time:

NOTE: Tool return types are declared as Promise<any> because the actual shape is only known at runtime. Use console.log to inspect what a tool returns before building logic on top of it. console.log output appears in the execution result.

This matters because LLMs frequently write code that assumes a specific return shape (result.items[0].title) without knowing whether the tool actually returns that. The NOTE section teaches the right debugging behavior - inspect first, then build - rather than hoping the LLM guesses correctly.

The console.log output appearing in the result is also non-obvious behavior worth stating explicitly. LLMs don’t naturally reach for console.log in a tool-call context the way they would in a Node.js script. Telling them it works, and that the output is visible, makes them use it.
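
The inspect-first habit looks roughly like this in a script. It is a minimal sketch that runs in plain Node.js: the tools object is a hypothetical stand-in, and the results and name fields are invented to show a return shape that differs from the one the script's author might have assumed.

```javascript
// Hypothetical stand-in for a tool whose return shape is not known in advance.
const tools = {
  search: {
    query: async ({ q }) => ({ results: [{ name: "Guide to " + q }] }),
  },
};

async function firstTitle(q) {
  const res = await tools.search.query({ q });
  // Inspect first: in the sandbox this output would appear in the execution result.
  console.log("search.query returned:", JSON.stringify(res));
  // Then build defensively instead of assuming res.items[0].title exists.
  const first = (res.items ?? res.results ?? [])[0];
  return first?.title ?? first?.name ?? null;
}

firstTitle("MCP").then((title) => console.log(title));
```

A script that went straight to res.items[0].title would throw here, because this stub returns results rather than items.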

There’s a companion post on how Code Mode learns tool return types at discovery time that goes deeper on the TypeScript declaration injection.

Applies to any MCP tool description

None of these techniques are VoidLLM-specific. If you’re writing tool descriptions for your own MCP server, the same patterns apply:

Lead with the most important behavioral preference, labeled in caps
Use concrete thresholds instead of vague preferences
Include working code where the tool takes code input
Explain the cost model, not just the rule
If you have a tool that should always precede another, say so in both descriptions

Tool descriptions are the primary interface between your API design and LLM behavior. In our experience, the LLM follows what’s in the description more reliably than what’s in a system prompt, because the description sits right next to the tool in the context window when the model decides whether to call it.

The techniques above landed in VoidLLM and VoidMCP v0.0.17 on April 29, 2026. The full Code Mode description is in the VoidMCP repo if you want to read the complete version.

Related posts

Code Mode: Let AI Agents Write Scripts, Not Chat

How Code Mode learns what tools return

Connect Cursor, Windsurf, and Claude Code to VoidLLM

MCP Gateway: Give Your AI Tools a Single Front Door
