# Reliable Agents: Fixing Duplicate Tool Actions Before Scale
Agents are different from chat completions because they do things. They create tickets, update records, call payment tools, write summaries, send emails, and trigger workflows.
That makes them useful. It also means a reliability bug is no longer just a bad answer. It can be a duplicated booking, a stale CRM update, or a workflow that runs twice.
This anonymized composite incident shows why production agents need runtime controls before they scale.
To make such incidents diagnosable, agent traces must include model spans, tool spans, retry events, validation results, and policy decisions.
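One way to make that concrete is a single trace record per step. This is a minimal sketch of the shape such a record might take; the field names and `StepTrace` type are illustrative, not an actual SkyAIApp schema:

```python
from dataclasses import dataclass, field

@dataclass
class StepTrace:
    """One span in an agent run: a model call, tool call, retry, or policy decision."""
    run_id: str
    step_id: str
    kind: str      # "model" | "tool" | "retry" | "validation" | "policy"
    name: str      # e.g. "pricing_tool", "carrier.book"
    outcome: str   # "ok" | "timeout" | "schema_error" | "denied"
    attempt: int = 1
    detail: dict = field(default_factory=dict)

# A full run is then an ordered list of spans tied together by one run_id.
run = [
    StepTrace("run-42", "s1", "model", "plan", "ok"),
    StepTrace("run-42", "s2", "tool", "carrier.book", "timeout"),
    StepTrace("run-42", "s2", "retry", "carrier.book", "ok", attempt=2),
]
```

With that shape, "reconstruct one full agent run" becomes a filter on `run_id` rather than a hunt across four log systems.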
## The incident
A logistics workflow agent handled shipment requests from account managers. The flow looked straightforward:
- Read the customer request.
- Call a pricing tool.
- Book a shipment with a carrier API.
- Send a confirmation email.
- Write the final status back to the CRM.
In staging, everything passed. In production, the carrier API occasionally timed out after accepting a booking. The agent interpreted the timeout as failure and retried the booking step. Some shipments were booked twice.
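The failure mode above can be reproduced in a few lines. This is a sketch with hypothetical names (`FakeCarrier`, `naive_book`), not the team's actual code: the carrier commits the booking, the response is lost, and a blind retry books again.

```python
class FakeCarrier:
    """Accepts every booking, but the first call 'times out' after committing."""
    def __init__(self):
        self.bookings = []
        self.calls = 0

    def book(self, shipment_id):
        self.calls += 1
        self.bookings.append(shipment_id)   # side effect happens first
        if self.calls == 1:
            raise TimeoutError("response lost after booking was accepted")
        return {"shipment_id": shipment_id}

def naive_book(carrier, shipment_id, max_attempts=2):
    """Blind retry: treats a timeout as 'nothing happened'."""
    for _ in range(max_attempts):
        try:
            return carrier.book(shipment_id)
        except TimeoutError:
            continue
    raise RuntimeError("booking failed")

carrier = FakeCarrier()
naive_book(carrier, "SHIP-001")
print(len(carrier.bookings))  # 2 — one shipment, two bookings
```

The retry is locally reasonable; the bug is that a timeout is ambiguous, and the runtime treated it as a guaranteed failure.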
| Signal | Incident window |
|---|---|
| Affected requests | 0.8% of shipment runs |
| Carrier API p95 latency | Rose from 730ms to 2.6s |
| Duplicate side effects | Duplicate bookings and confirmation emails |
| Mean time to identify cause | More than 3 hours |
The team had logs, but they were scattered across the app, the model provider, the carrier adapter, and the email service. Nobody could quickly reconstruct one full agent run.
## What changed
The fix was not "use a better model." The model was not the root cause. The runtime contract was.
SkyAIApp introduced four controls:
| Control | Why it mattered |
|---|---|
| Idempotency keys | Retried booking calls could be safely deduplicated |
| Bounded retries | The runtime stopped retry loops before they became cost and side-effect problems |
| Structured tool schemas | Malformed carrier responses were caught before the next step |
| End-to-end traces | Engineers could replay a run and see every model, tool, retry, and decision |
Each tool call received a stable runId, stepId, and external idempotency key. The carrier adapter stored that key before making the provider request, so a late success and a retry could resolve to one booking.
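That contract can be sketched as follows, assuming the carrier deduplicates on an idempotency key (as many payment and booking APIs do). The names here are hypothetical, and a real adapter would persist the key durably, not in memory:

```python
import uuid

class FakeCarrierAPI:
    """Simulates a carrier that deduplicates bookings on an idempotency key."""
    def __init__(self):
        self._by_key = {}

    def book(self, shipment_id, idempotency_key):
        if idempotency_key not in self._by_key:
            self._by_key[idempotency_key] = {
                "booking_id": str(uuid.uuid4()),
                "shipment_id": shipment_id,
            }
        return self._by_key[idempotency_key]

def book_step(carrier, run_id, step_id, shipment_id, pending_keys):
    """Derive a stable key from run and step, record it before the call,
    so a retry of this step resolves to the original booking."""
    key = f"{run_id}:{step_id}"
    pending_keys.add(key)   # persisted before the request in a real adapter
    return carrier.book(shipment_id, idempotency_key=key)

carrier = FakeCarrierAPI()
pending = set()
first = book_step(carrier, "run-42", "book", "SHIP-001", pending)
retry = book_step(carrier, "run-42", "book", "SHIP-001", pending)  # same key
assert first["booking_id"] == retry["booking_id"]   # one booking, not two
```

Because the key is derived from `runId` and `stepId` rather than generated fresh per attempt, a retry is indistinguishable from the original request on the carrier side.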
## Runtime policy
The team added a simple runtime policy:
- Read-only tools can retry once immediately, then back off.
- Side-effect tools require idempotency and have no blind retry.
- Tool output must pass schema validation before the agent can continue.
- If retries exceed 2% of runs in a 15-minute window, alert and degrade to human review.
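The policy above can be sketched as a small table plus a rolling retry budget. This is one possible shape under the stated rules, not the team's implementation; the names and thresholds are illustrative:

```python
import time

# Per-tool-family policy mirroring the rules above.
POLICY = {
    "read_only":   {"max_retries": 1, "backoff_s": 0.5, "needs_idempotency": False},
    "side_effect": {"max_retries": 1, "backoff_s": 0.0, "needs_idempotency": True},
}

class RetryBudget:
    """Alerts when retries exceed a fraction of calls in a rolling window."""
    def __init__(self, threshold=0.02, window_s=900):
        self.threshold, self.window_s = threshold, window_s
        self.events = []   # (timestamp, was_retry)

    def record(self, was_retry):
        now = time.monotonic()
        self.events.append((now, was_retry))
        self.events = [(t, r) for t, r in self.events if now - t <= self.window_s]

    def over_budget(self):
        if not self.events:
            return False
        retries = sum(1 for _, r in self.events if r)
        return retries / len(self.events) > self.threshold

def call_tool(tool, family, budget, idempotency_key=None):
    """Apply the family's policy: refuse un-keyed side effects, bound retries."""
    policy = POLICY[family]
    if policy["needs_idempotency"] and idempotency_key is None:
        raise ValueError("side-effect tool called without an idempotency key")
    for attempt in range(policy["max_retries"] + 1):
        budget.record(was_retry=attempt > 0)
        try:
            return tool()
        except TimeoutError:
            if attempt == policy["max_retries"]:
                raise
            time.sleep(policy["backoff_s"])
```

Keeping the policy in data rather than scattered `try/except` blocks is what makes it visible in traces: each call can log which family and budget applied.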
That policy was visible in traces, which made production reviews much less subjective.
## Results after rollout
| Metric | Before | After |
|---|---|---|
| Duplicate bookings | 0.8% of runs | Effectively zero |
| Agent run success rate | 97.8% | 99.4% |
| P95 run duration | 6.8s | 5.1s |
| MTTR for failed runs | More than 3 hours | Under 20 minutes |
The improvement came from making failure legible and bounded. The agent still encountered slow APIs and malformed data, but those failures no longer turned into silent side effects.
## The checklist we recommend
Before an agent touches production systems, require:
- Idempotency for every state-changing tool.
- Explicit timeout and retry policy per tool family.
- JSON schema validation for tool inputs and outputs.
- Trace IDs that connect model spans, tool spans, and application logs.
- A human-review path for high-risk or repeated failures.
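For the validation item, even a dependency-free check of a tool's output shape catches malformed responses before the agent continues. A minimal sketch, with a hypothetical booking schema:

```python
# Expected shape of a carrier booking response (illustrative fields).
BOOKING_SCHEMA = {
    "booking_id": str,
    "shipment_id": str,
    "price": float,
}

def validate_tool_output(payload, schema):
    """Reject malformed tool output before it reaches the next step."""
    if not isinstance(payload, dict):
        raise ValueError("tool output must be an object")
    for name, expected in schema.items():
        if name not in payload:
            raise ValueError(f"missing field: {name}")
        if not isinstance(payload[name], expected):
            raise ValueError(f"bad type for {name}")
    return payload

validate_tool_output(
    {"booking_id": "BK-1", "shipment_id": "SHIP-001", "price": 120.0},
    BOOKING_SCHEMA,
)  # passes; a malformed response raises instead of propagating
```

In production you would likely reach for a JSON Schema library rather than hand-rolled checks, but the placement is what matters: validation sits at the boundary, before the next side effect.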
Agents can be powerful product infrastructure, but only when the runtime assumes things will fail. SkyAIApp is designed around that assumption: trace first, retry carefully, validate every boundary, and keep side effects controlled.