TJ Zhang

Describing Tools for AI Agents

The narrative of modern AI agents usually focuses on two things: better prompts and better tools. We spent months on both — building a 30-command Linear CLI, testing it exhaustively with zero-context Claude instances, iterating on discoverability and error handling. We thought that was the work.

But the most significant breakthrough came from a different direction: how we described the tool to the agent.

We're in the CLI camp

Mario Zechner recently made the case that CLI tools beat Model Context Protocol (MCP) for agent use. The numbers are stark: MCP tool definitions for Linear cost 6,471 tokens to cover 33 tools, while the same operations through a 30-command CLI use 130 tokens of output. That's a 50x difference before the agent even executes anything.
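As a quick sanity check on the "50x" figure, the ratio can be recomputed from the two token counts quoted above. The per-tool average is derived here for illustration and is not a number from the original comparison.

```python
# Token counts quoted in the paragraph above.
mcp_tokens = 6471   # MCP tool definitions for Linear, covering 33 tools
cli_tokens = 130    # CLI output covering the same operations

ratio = mcp_tokens / cli_tokens
mcp_tokens_per_tool = mcp_tokens / 33  # derived average, not a quoted figure

print(f"MCP / CLI ratio: {ratio:.1f}x")
print(f"MCP tokens per tool: {mcp_tokens_per_tool:.0f}")
```

The ratio comes out just under 50, so "50x" is a fair rounding.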

We built our linear-cli as an experiment in that hypothesis. The CLI offers complete parity with the official Linear MCP server for agent workflows, tested against dozens of real scenarios. But the journey taught us something beyond "CLI good, MCP heavy" — it revealed that the description format itself is a lever that dominates both.

The tool-side plateau

Our first implementation gave us a baseline: 42% first-try accuracy when we threw zero-context GPT-4 at 19 task scenarios. We iterated aggressively: hidden aliases for common misspellings (--status → --state), error messages with actionable hints, porcelain commands that map to natural language verbs. Six iteration cycles later, we hit 74%.
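To make the "hidden alias" idea concrete, here is a minimal sketch using Python's argparse. It is not the actual linear-cli implementation (whose language and argument parser are not described here); the `prog` name and help text are assumptions. The alias is accepted by the parser but left out of the help output, so it catches the wrong guess without advertising a second spelling.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser for an "issue list"-style command.
    parser = argparse.ArgumentParser(prog="linear-issue-list")
    parser.add_argument("--state", dest="state", help="filter by workflow state")
    # Hidden alias: stores into the same destination, omitted from --help.
    parser.add_argument("--status", dest="state", help=argparse.SUPPRESS)
    return parser

args = build_parser().parse_args(["--status", "done"])
print(args.state)  # done
```

An agent that guesses `--status` still succeeds, while `--help` keeps teaching the canonical `--state`.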

Then we plateaued. The remaining failures weren't fixable with better tool design. Agents would invent subcommands that didn't exist (comment create instead of issue comment), guess wrong nesting (treating assignee as a top-level command), or miss flag names entirely. These aren't mistakes a better error message can fix — they're structural unknowns. The model literally cannot infer that in your CLI, the comment command lives under issue.

We'd been trying to solve a description problem with a tool problem. More aliases, more flexible parsing, more helpful errors — all showed diminishing returns. The plateau was telling us something: you can't out-engineer a description gap.

The prediction test

So we asked a different question: what if we gave the model better information instead?

We batched 18 task scenarios in a single prompt and asked Claude to predict the exact commands it would invoke — no execution, no feedback loop, just first guesses. The test cost $0.05 and took 6 seconds. We ran it five times, progressively adding context:
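The shape of such a prediction test can be sketched in a few lines. This is an illustrative harness rather than the one we ran: the function names, the prompt wording, and the two sample scenarios are all assumptions, and the model call itself is replaced by a hard-coded reply so the sketch stays self-contained.

```python
def build_prompt(description: str, scenarios: list[str]) -> str:
    """Batch every scenario into one prompt asking for one command per task."""
    lines = []
    if description:
        lines += ["Tool description:", description, ""]
    lines.append("For each task, reply with the exact command you would run, one per line.")
    lines += [f"{i}. {task}" for i, task in enumerate(scenarios, start=1)]
    return "\n".join(lines)

def first_try_accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of scenarios whose predicted command matches exactly,
    ignoring differences in whitespace only."""
    def norm(s: str) -> str:
        return " ".join(s.split())
    hits = sum(norm(p) == norm(e) for p, e in zip(predicted, expected))
    return hits / len(expected)

# Hypothetical scenarios and expected commands, for illustration only.
scenarios = ["Close issue ENG-12", "Comment 'done' on issue ENG-12"]
expected = ["linear issue close ENG-12", "linear issue comment ENG-12 done"]

prompt = build_prompt("", scenarios)  # corresponds to the "None" description row

# Stand-in for a model reply; the second guess invents a subcommand.
predicted = ["linear issue close ENG-12", "linear comment create ENG-12 done"]
print(first_try_accuracy(predicted, expected))  # 0.5
```

Because nothing is executed, the only variable between runs is the `description` string, which is what makes the comparison across description types cheap.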

| Description type | Tokens | First-try accuracy |
|---|---|---|
| None | 0 | ~40% |
| "Follows gh/kubectl conventions" | ~25 | ~50% |
| Entity + verb list (no flags) | ~55 | ~80% |
| Full man-page synopsis | ~250 | ~100% |
| EBNF grammar | ~120 | ~100% |

The jump from 40% to 100% happened in the description, not in the tool. The CLI stayed identical. The only thing we changed was how we told the model what it could do.

The efficiency is striking: EBNF achieved the same 100% accuracy as a 250-token synopsis using half the tokens. The format that would be opaque to most humans — Backus-Naur notation — was maximally legible to the model.

Why EBNF works

CLI commands require exact symbols. Not --status, but --state. Not comment create, but issue comment. Not on-track, but onTrack. Human language is forgiving; parsers are not. Models have strong domain priors — they know what "close an issue" means semantically — but probabilistic inference doesn't reliably produce the exact syntax a parser accepts.

EBNF bridges this gap. The grammar pins down three things that prose cannot:

Structure. (close|start|reopen) ID makes the grouping unambiguous. These three verbs take an ID, and they're mutually exclusive.

Vocabulary. Semantic names activate the model's existing knowledge. close evokes the concept of issue resolution. onTrack is self-explanatory in the context of project management. A bare token like P requires explanation; P = urgent|high|medium|low|none is self-documenting.

Repetition factoring. Instead of listing issue view ID, issue close ID, issue start ID, EBNF lets you write issue (view|close|start) ID once. Fewer tokens, same information, clearer structure.
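The factoring claim can be checked mechanically: expanding the grouped form should give back exactly the enumerated commands. The helper below is a small illustrative sketch (the name `expand` is ours) and only handles flat, non-nested `(a|b|c)` groups, which is all this example needs.

```python
import itertools
import re

def expand(pattern: str) -> list[str]:
    """Expand flat (a|b|c) groups into every concrete command they stand for."""
    # re.split with one capture group alternates: literal, group, literal, ...
    parts = re.split(r"\(([^()]*)\)", pattern)
    choices = [p.split("|") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]

factored = "issue (view|close|start) ID"
print(expand(factored))
# ['issue view ID', 'issue close ID', 'issue start ID']
```

The factored string carries the same three commands in fewer characters than the enumerated list, and the saving grows with every verb that shares the same argument shape.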

Here's the grammar for a 30-command CLI in 120 tokens:

P = urgent|high|medium|low|none
H = onTrack|atRisk|offTrack

linear (
  issue (list --team K [--state S.. --assignee N|--mine --priority P]
       | (view|branch) ID | (close|start|reopen) ID
       | assign ID USER | comment ID BODY
       | create --team K --title S [--description S --priority P --assignee N])
| project (list [--team K] | (view|update) N [--description S]
         | create --name S --team K
         | post N --body S [--health H]
         | milestone create N --name S --date DATE)
| team (list | view K)
| user (list | view N | me)
| inbox [--unread])
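One way to see what the grammar pins down is to transcribe its entity and verb structure into a lookup table and check candidate commands against it. This is a hand-written sketch, not part of linear-cli: `COMMANDS` and `path_is_valid` are hypothetical names, and the check covers only the command path, not flags or argument values.

```python
import shlex

# Entity -> allowed verbs, transcribed by hand from the grammar above.
COMMANDS = {
    "issue": {"list", "view", "branch", "close", "start", "reopen",
              "assign", "comment", "create"},
    "project": {"list", "view", "update", "create", "post", "milestone"},
    "team": {"list", "view"},
    "user": {"list", "view", "me"},
    "inbox": set(),  # takes no verb, only an optional flag
}

def path_is_valid(command: str) -> bool:
    """Check that a command uses an entity/verb pair the grammar allows."""
    tokens = shlex.split(command)
    if len(tokens) < 2 or tokens[0] != "linear":
        return False
    verbs = COMMANDS.get(tokens[1])
    if verbs is None:
        return False          # unknown top-level entity
    if not verbs:
        return True           # entity with no verbs, such as inbox
    return len(tokens) > 2 and tokens[2] in verbs

print(path_is_valid("linear issue comment ENG-1 'looks good'"))  # True
print(path_is_valid("linear comment create ENG-1 hi"))           # False
print(path_is_valid("linear assignee ENG-1 bob"))                # False
```

The two rejected commands are the failure modes described earlier: an invented subcommand and a wrongly promoted top-level command.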

Models encounter EBNF constantly in their training data: language specifications, RFCs, PL textbooks. It's native territory. They parse it fluently while simultaneously leveraging domain expertise. The format combines formal precision with semantic familiarity.

The key principle: prompt the delta from priors. The model already knows issue lifecycle semantics, team hierarchies, natural command structure. You don't need to explain those. You only provide structure and vocabulary — the things genuinely unique to your tool.

What this means

We went in thinking the work was building a better tool. It turned out the highest-leverage move was describing the tool in notation the model already understands.

This finding generalizes. Any tool that exposes a structured command interface (CLI, API, DSL) can be described to an agent with a compact grammar instead of verbose JSON schemas or prose documentation. The 50x token savings compound: fewer tokens per tool means more tools fit in context, which in turn means more capable agents and fewer round-trips.

The principle — let the model's existing knowledge do the work — extends beyond description formats to every surface you expose to an agent. Semantic names over opaque identifiers. Conventional patterns over novel interfaces. Compact structure over verbose explanation.

Stop explaining what the model already knows. Pin down the exact symbols it can't infer. EBNF does both in 120 tokens.


linear-cli is open source. Mario Zechner's MCP comparison is what started us down this path.