# Reliable Agents: Fixing Duplicate Tool Actions Before Scale
Agents are different from chat completions because they do things. They create tickets, update records, call payment tools, write summaries, send emails, and trigger workflows.
That makes them useful. It also means a reliability bug is no longer just a bad answer. It can be a duplicated booking, a stale CRM update, or a workflow that runs twice.
This anonymized composite incident shows why production agents need runtime controls before they scale.
To make such incidents diagnosable, agent traces must include model spans, tool spans, retry events, validation results, and policy decisions.
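One way to make that concrete is a single trace record per step. This is a minimal sketch of the shape such a record might take; the field names and `StepTrace` type are illustrative, not an actual SkyAIApp schema:

```python
from dataclasses import dataclass, field

@dataclass
class StepTrace:
    """One span in an agent run: a model call, tool call, retry, or policy decision."""
    run_id: str
    step_id: str
    kind: str      # "model" | "tool" | "retry" | "validation" | "policy"
    name: str      # e.g. "pricing_tool", "carrier.book"
    outcome: str   # "ok" | "timeout" | "schema_error" | "denied"
    attempt: int = 1
    detail: dict = field(default_factory=dict)

# A full run is then an ordered list of spans tied together by one run_id.
run = [
    StepTrace("run-42", "s1", "model", "plan", "ok"),
    StepTrace("run-42", "s2", "tool", "carrier.book", "timeout"),
    StepTrace("run-42", "s2", "retry", "carrier.book", "ok", attempt=2),
]
```

With that shape, "reconstruct one full agent run" becomes a filter on `run_id` rather than a hunt across four log systems.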
## The incident
A logistics workflow agent handled shipment requests from account managers. The flow looked straightforward:
- Read the customer request.
- Call a pricing tool.
- Book a shipment with a carrier API.
- Send a confirmation email.
- Write the final status back to the CRM.
In staging, everything passed. In production, the carrier API occasionally timed out after accepting a booking. The agent interpreted the timeout as failure and retried the booking step. Some shipments were booked twice.
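The failure mode above can be reproduced in a few lines. This is a sketch with hypothetical names (`FakeCarrier`, `naive_book`), not the team's actual code: the carrier commits the booking, the response is lost, and a blind retry books again.

```python
class FakeCarrier:
    """Accepts every booking, but the first call 'times out' after committing."""
    def __init__(self):
        self.bookings = []
        self.calls = 0

    def book(self, shipment_id):
        self.calls += 1
        self.bookings.append(shipment_id)   # side effect happens first
        if self.calls == 1:
            raise TimeoutError("response lost after booking was accepted")
        return {"shipment_id": shipment_id}

def naive_book(carrier, shipment_id, max_attempts=2):
    """Blind retry: treats a timeout as 'nothing happened'."""
    for _ in range(max_attempts):
        try:
            return carrier.book(shipment_id)
        except TimeoutError:
            continue
    raise RuntimeError("booking failed")

carrier = FakeCarrier()
naive_book(carrier, "SHIP-001")
print(len(carrier.bookings))  # 2 — one shipment, two bookings
```

The retry is locally reasonable; the bug is that a timeout is ambiguous, and the runtime treated it as a guaranteed failure.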
| Signal | Incident window |
|---|---|
| Affected requests | 0.8% of shipment runs |
| Carrier API p95 latency | Rose from 730ms to 2.6s |
| Duplicate side effects | Duplicate bookings and confirmation emails |
| Mean time to identify cause | More than 3 hours |
The team had logs, but they were scattered across the app, the model provider, the carrier adapter, and the email service. Nobody could quickly reconstruct one full agent run.
## What changed
The fix was not "use a better model." The model was not the root cause. The runtime contract was.
SkyAIApp introduced four controls:
| Control | Why it mattered |
|---|---|
| Idempotency keys | Retried booking calls could be safely deduplicated |
| Bounded retries | The runtime stopped retry loops before they became cost and side-effect problems |
| Structured tool schemas | Malformed carrier responses were caught before the next step |
| End-to-end traces | Engineers could replay a run and see every model, tool, retry, and decision |
Each tool call received a stable runId, stepId, and external idempotency key. The carrier adapter stored that key before making the provider request, so a late success and a retry could resolve to one booking.
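That contract can be sketched as follows, assuming the carrier deduplicates on an idempotency key (as many payment and booking APIs do). The names here are hypothetical, and a real adapter would persist the key durably, not in memory:

```python
import uuid

class FakeCarrierAPI:
    """Simulates a carrier that deduplicates bookings on an idempotency key."""
    def __init__(self):
        self._by_key = {}

    def book(self, shipment_id, idempotency_key):
        if idempotency_key not in self._by_key:
            self._by_key[idempotency_key] = {
                "booking_id": str(uuid.uuid4()),
                "shipment_id": shipment_id,
            }
        return self._by_key[idempotency_key]

def book_step(carrier, run_id, step_id, shipment_id, pending_keys):
    """Derive a stable key from run and step, record it before the call,
    so a retry of this step resolves to the original booking."""
    key = f"{run_id}:{step_id}"
    pending_keys.add(key)   # persisted before the request in a real adapter
    return carrier.book(shipment_id, idempotency_key=key)

carrier = FakeCarrierAPI()
pending = set()
first = book_step(carrier, "run-42", "book", "SHIP-001", pending)
retry = book_step(carrier, "run-42", "book", "SHIP-001", pending)  # same key
assert first["booking_id"] == retry["booking_id"]   # one booking, not two
```

Because the key is derived from `runId` and `stepId` rather than generated fresh per attempt, a retry is indistinguishable from the original request on the carrier side.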
## Runtime policy
The team added a simple runtime policy:
- Read-only tools can retry once immediately, then back off.
- Side-effect tools require idempotency and have no blind retry.
- Tool output must pass schema validation before the agent can continue.
- If retries exceed 2% of runs in a 15-minute window, alert and degrade to human review.
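The policy above can be sketched as a small table plus a rolling retry budget. This is one possible shape under the stated rules, not the team's implementation; the names and thresholds are illustrative:

```python
import time

# Per-tool-family policy mirroring the rules above.
POLICY = {
    "read_only":   {"max_retries": 1, "backoff_s": 0.5, "needs_idempotency": False},
    "side_effect": {"max_retries": 1, "backoff_s": 0.0, "needs_idempotency": True},
}

class RetryBudget:
    """Alerts when retries exceed a fraction of calls in a rolling window."""
    def __init__(self, threshold=0.02, window_s=900):
        self.threshold, self.window_s = threshold, window_s
        self.events = []   # (timestamp, was_retry)

    def record(self, was_retry):
        now = time.monotonic()
        self.events.append((now, was_retry))
        self.events = [(t, r) for t, r in self.events if now - t <= self.window_s]

    def over_budget(self):
        if not self.events:
            return False
        retries = sum(1 for _, r in self.events if r)
        return retries / len(self.events) > self.threshold

def call_tool(tool, family, budget, idempotency_key=None):
    """Apply the family's policy: refuse un-keyed side effects, bound retries."""
    policy = POLICY[family]
    if policy["needs_idempotency"] and idempotency_key is None:
        raise ValueError("side-effect tool called without an idempotency key")
    for attempt in range(policy["max_retries"] + 1):
        budget.record(was_retry=attempt > 0)
        try:
            return tool()
        except TimeoutError:
            if attempt == policy["max_retries"]:
                raise
            time.sleep(policy["backoff_s"])
```

Keeping the policy in data rather than scattered `try/except` blocks is what makes it visible in traces: each call can log which family and budget applied.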
That policy was visible in traces, which made production reviews much less subjective.
## Results after rollout
| Metric | Before | After |
|---|---|---|
| Duplicate bookings | 0.8% of runs | Effectively zero |
| Agent run success rate | 97.8% | 99.4% |
| P95 run duration | 6.8s | 5.1s |
| MTTR for failed runs | More than 3 hours | Under 20 minutes |
The improvement came from making failure legible and bounded. The agent still encountered slow APIs and malformed data, but those failures no longer turned into silent side effects.
## The checklist we recommend
Before an agent touches production systems, require:
- Idempotency for every state-changing tool.
- Explicit timeout and retry policy per tool family.
- JSON schema validation for tool inputs and outputs.
- Trace IDs that connect model spans, tool spans, and application logs.
- A human-review path for high-risk or repeated failures.
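For the validation item, even a dependency-free check of a tool's output shape catches malformed responses before the agent continues. A minimal sketch, with a hypothetical booking schema:

```python
# Expected shape of a carrier booking response (illustrative fields).
BOOKING_SCHEMA = {
    "booking_id": str,
    "shipment_id": str,
    "price": float,
}

def validate_tool_output(payload, schema):
    """Reject malformed tool output before it reaches the next step."""
    if not isinstance(payload, dict):
        raise ValueError("tool output must be an object")
    for name, expected in schema.items():
        if name not in payload:
            raise ValueError(f"missing field: {name}")
        if not isinstance(payload[name], expected):
            raise ValueError(f"bad type for {name}")
    return payload

validate_tool_output(
    {"booking_id": "BK-1", "shipment_id": "SHIP-001", "price": 120.0},
    BOOKING_SCHEMA,
)  # passes; a malformed response raises instead of propagating
```

In production you would likely reach for a JSON Schema library rather than hand-rolled checks, but the placement is what matters: validation sits at the boundary, before the next side effect.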
Agents can be powerful product infrastructure, but only when the runtime assumes things will fail. SkyAIApp is designed around that assumption: trace first, retry carefully, validate every boundary, and keep side effects controlled.