The tool soup problem: why ReAct agents fail in production

Most ReAct posts are about agents with three tools. The real failures start showing up around eight.

ReAct (Yao et al., 2022) is everywhere now. So is Reflexion. So is Toolformer. If you’ve read those papers, you’ve seen the same setup: an agent with a small set of tools (search, calculator, maybe a code runner), iterating thought-action-observation until done. The benchmarks look great. The papers are clean.

Then you go build one in production and you start adding tools. By tool eight or nine, your agent stops working well. Not in a single dramatic way. It just gets quietly worse.

That’s the tool soup problem. Nobody writes about it because the papers don’t measure it. I want to.

What I actually built

At LastDraft, I built a ReAct-style agent for drafting government bid documents. It has four tools:

  1. search_tenders: pgvector lookup over the active tender database.
  2. vault_rag: RAG over the company’s document vault.
  3. past_bids: retrieval over previously submitted bids.
  4. company_metadata: pulls structured fields about the company.

Four tools. That’s it. The agent autonomously chains these in whatever order it needs to draft a legally compliant bid document.
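
For concreteness, here is roughly the shape of those definitions, sketched as OpenAI-style function-calling schemas. The parameter names and description text are illustrative, not the exact production versions.

```python
# Sketch of the four tool definitions in OpenAI-style function-calling
# schema. Descriptions and parameters are illustrative.
TOOLS = [
    {
        "name": "search_tenders",
        "description": "Semantic search (pgvector) over active tenders. "
                       "Use to find tender requirements, deadlines, and criteria.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "vault_rag",
        "description": "Retrieve passages from the company's document vault.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "past_bids",
        "description": "Retrieve previously submitted bids similar to a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "company_metadata",
        "description": "Return structured company fields as a single record.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
]
```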

Four sounds small. It is. That was deliberate. I want to tell you what I tried before settling there.

What goes wrong as you add tools

In the prototype phase, I had more. Different scopes, different indexes, different “find things” tools that all sounded similar in their descriptions. Here’s what I saw.

1. Description overlap

When two tools sound similar in their descriptions, the agent picks between them somewhat at random. The actual scope of each tool was clear to me when I wrote them. It was not clear to the model.

You can fix this by writing better descriptions. You’ll improve. You’ll never improve enough. Two tools that do similar things will keep confusing the agent. The fix is to collapse them into one tool with a parameter, or to split them so cleanly that there’s no ambiguity left.
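
A hypothetical before-and-after, to make the collapse-into-one-tool fix concrete. These two tools are invented for illustration, not from my actual set:

```python
# Before: two overlapping "find things" tools the model keeps confusing.
#   search_contracts(query)  "Search contract documents"
#   search_invoices(query)   "Search invoice documents"

# After: one tool, with the distinction the model kept fumbling promoted
# to a required enum parameter. The hard choice (which tool?) becomes an
# easy one (which scope?).
MERGED_TOOL = {
    "name": "search_documents",
    "description": "Search company documents. scope must be 'contracts' or 'invoices'.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "scope": {"type": "string", "enum": ["contracts", "invoices"]},
        },
        "required": ["query", "scope"],
    },
}
```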

2. Context dilution

Every step, the agent sees every tool’s description in the prompt. With eight tools at a few hundred tokens each, that’s a few thousand tokens of tool soup before any actual reasoning happens. Ironically, the more tools you give the agent, the worse its decisions get, because the tool definitions drown out the immediate task.
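
You can measure this overhead directly. A rough sketch using tiktoken, assuming your tool schemas live in a list like the one above; exact serialization varies by provider, so treat the number as an estimate:

```python
import json
import tiktoken

# Estimate how many prompt tokens the tool definitions cost on every
# single step, before the task itself appears.
enc = tiktoken.get_encoding("cl100k_base")

def tool_overhead(tools: list[dict]) -> int:
    # Serialize each schema roughly the way it lands in the prompt.
    return sum(len(enc.encode(json.dumps(t))) for t in tools)

# Eight tools at ~300 tokens each is ~2,400 tokens of soup per step.
```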

This is also why “just give the LLM access to all your APIs” doesn’t work. People try it. It produces a confused agent and a stack of failed traces.

3. Loop traps

When the agent picks the wrong tool, it gets a vague observation back. It thinks for a moment, then picks the same wrong tool again with slightly different arguments. Sometimes a third time. ReAct papers measure success at fixed step counts. Production agents have step budgets. Losing four steps to a loop trap is a real budget hit.

Loop traps are usually the consequence of description overlap plus a slightly underspecified user query. You can’t fully prevent them. You can detect them and break out.
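
Detection can be as dumb as fingerprinting each tool call and counting repeats. A minimal sketch; the threshold and the fingerprint granularity are judgment calls:

```python
import json
from collections import Counter

class LoopTrapDetector:
    """Break out of the ReAct loop when the agent repeats a call."""

    def __init__(self, max_repeats: int = 2):
        self.calls: Counter = Counter()
        self.max_repeats = max_repeats

    def should_break(self, tool_name: str, args: dict) -> bool:
        # Exact fingerprint of (tool, args). Making it fuzzier, e.g.
        # dropping a top_k argument before hashing, also catches the
        # "same tool, slightly different arguments" retries.
        key = (tool_name, json.dumps(args, sort_keys=True))
        self.calls[key] += 1
        return self.calls[key] > self.max_repeats
```

One option when it fires is to inject an observation telling the model that this call already returned nothing for those arguments, which nudges the next thought somewhere else.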

4. Schema confusion

Tools with many optional parameters are worse than tools with few required ones. If your tool signature is search(query, filter_type=None, top_k=None, namespace=None, threshold=None), the agent will routinely fill in nonsense. Or skip useful params it should set. Or invent params that don’t exist.

The model hallucinating tool params at a high rate is your tool’s fault, not the model’s.
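
The before-and-after for that signature. The tightened version is a sketch; which knobs you fix server-side will differ:

```python
# Before: one required parameter and four optional knobs. The model
# routinely fills the knobs with nonsense, skips them, or invents more.
def search(query, filter_type=None, top_k=None, namespace=None, threshold=None):
    ...

# After: every remaining parameter required, every knob the model
# shouldn't be negotiating fixed inside the tool.
def search_tight(query: str, namespace: str) -> list[str]:
    # top_k and threshold are constants inside the tool; filter_type
    # is folded into namespace.
    ...
```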

What actually works

Here are the four things that moved the needle for me. None of them are clever.

Cap tool count. Seriously.

For most agents, five to seven tools is the cap. If you need more, you probably need a different architecture. Either you merge tools, or your “agent” should actually be a fixed pipeline with a small agentic step inside.

I went from a bigger prototype set down to four. Same model, same prompt template, fewer tools, noticeably better behavior. Tool selection got more accurate. Step counts dropped. End-to-end success rate climbed.

Two-stage routing for genuinely large tool sets

If you really do need fifteen tools, don’t expose all fifteen to a single ReAct loop. First, a small classifier step decides which category of tool the next step needs. Then only the tools in that category are exposed to the actual ReAct loop.

This is just coarse-to-fine routing applied to tool selection. It works because the model only has to discriminate within a category, not across the whole soup.
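
A sketch of the shape, with classify and run_react_step standing in for your own LLM calls, and the category map invented for illustration:

```python
from typing import Callable

# Hypothetical category map: tool names grouped by what they're for.
CATEGORIES: dict[str, list[str]] = {
    "document_retrieval": ["vault_rag", "past_bids", "search_tenders"],
    "structured_lookup": ["company_metadata"],
}

def routed_step(
    task: str,
    classify: Callable[[str, list[str]], str],        # cheap classifier call
    run_react_step: Callable[[str, list[str]], str],  # the real ReAct step
) -> str:
    # Stage 1: the classifier sees only category names, never the soup.
    category = classify(task, list(CATEGORIES))
    # Stage 2: the ReAct step discriminates within one category only.
    return run_react_step(task, CATEGORIES[category])
```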

Treat tool descriptions as a hot eval surface

Your tool descriptions are prompts. They behave like prompts. They drift like prompts. Test them like prompts.

Keep a small set of ambiguous queries and run them through the tool-selection step regularly. If two tools tie often on the same query, that’s a signal. Either rewrite the descriptions or merge the tools.
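
A minimal harness, where select_tool is whatever single call performs your tool selection; the queries and expected answers here are made-up examples:

```python
# Ambiguous queries paired with the tool that should win. Run this on
# every description change; the entries are invented for illustration.
AMBIGUOUS_QUERIES = [
    ("how did we answer this requirement last time", "past_bids"),
    ("what certifications does the company hold", "company_metadata"),
]

def eval_tool_selection(select_tool, trials: int = 5) -> None:
    for query, expected in AMBIGUOUS_QUERIES:
        picks = [select_tool(query) for _ in range(trials)]
        hit_rate = picks.count(expected) / trials
        if hit_rate < 1.0:
            # Repeated misses on the same query usually mean two
            # descriptions overlap: rewrite them or merge the tools.
            print(f"{query!r}: expected {expected}, got {picks}")
```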

Most teams write tool descriptions once and never look at them again. This is wild, because they are load-bearing.

Know when to bail out of ReAct

ReAct is for when the path through the problem is genuinely unknown step to step. If you can write down the steps in order, you don’t need an agent. You need a workflow with one or two LLM calls in it.

A lot of “agents” in production are ReAct loops solving deterministic problems. A fixed three-step pipeline would be faster, cheaper, and more reliable. Teams use ReAct anyway because it feels modern. That isn’t a good reason.
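
For contrast, here is what that workflow shape looks like when the path is known. fetch, draft, and review stand in for a deterministic lookup and two LLM calls:

```python
from typing import Callable

def fixed_pipeline(
    tender_id: str,
    fetch: Callable[[str], str],        # deterministic database lookup
    draft: Callable[[str], str],        # LLM call 1
    review: Callable[[str, str], str],  # LLM call 2
) -> str:
    # The path is written down, so there is no thought-action-observation
    # loop, no tool selection, and no loop traps to detect.
    tender = fetch(tender_id)
    first_pass = draft(tender)
    return review(first_pass, tender)
```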

For the BidDraftingAgent, the path actually is unknown step to step. Different tenders need different sequences of lookups. ReAct earns its keep. For other flows in the same codebase, I use fixed pipelines with one LLM call. Same model, same infra, different shapes.

The closing claim

The literature on tool-using agents is mostly written from the perspective of the model. It assumes the tools are correct and asks how well the model uses them.

In production, the bottleneck is the other direction. Most of the time, the model is fine and the tools are the problem. Bad descriptions, overlapping scopes, parameter sprawl, too many of them.

If your agent is failing, look at the tools first. Treat them as an interface design problem, not a model problem. You’ll fix more bugs that way than by switching to the next model.