Reliable Agents: Fixing Duplicate Tool Actions Before Scale

3/27/2026

Agents are different from chat completions because they do things. They create tickets, update records, call payment tools, write summaries, send emails, and trigger workflows.

That makes them useful. It also means a reliability bug is no longer just a bad answer. It can be a duplicated booking, a stale CRM update, or a workflow that runs twice.

This anonymized composite incident shows why production agents need runtime controls before they scale.

[Figure: Agent trace and retry flow]

Agent traces must include model spans, tool spans, retry events, validation results, and policy decisions.
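One way to make those five record types concrete is a single event shape that every component emits. The sketch below is illustrative only: the `TraceEvent` class, its field names, the `kind` values, and `reconstruct_run` are hypothetical and are not taken from SkyAIApp.

```python
from dataclasses import dataclass, field
from typing import Any
import time

# Hypothetical event kinds, mirroring the five record types named above.
EVENT_KINDS = {"model_span", "tool_span", "retry", "validation", "policy_decision"}


@dataclass
class TraceEvent:
    run_id: str   # shared by every event in one agent run
    step_id: str  # the step of the run that produced the event
    kind: str     # one of EVENT_KINDS
    name: str     # e.g. the tool or model name
    attrs: dict[str, Any] = field(default_factory=dict)
    ts: float = field(default_factory=time.time)

    def __post_init__(self) -> None:
        if self.kind not in EVENT_KINDS:
            raise ValueError(f"unknown event kind: {self.kind}")


def reconstruct_run(events: list[TraceEvent], run_id: str) -> list[TraceEvent]:
    """Return one run's events in time order, regardless of which service emitted them."""
    return sorted((e for e in events if e.run_id == run_id), key=lambda e: e.ts)
```

Because every event carries the same `run_id`, a single filter and sort is enough to rebuild one run from events collected across several services.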

The incident

A logistics workflow agent handled shipment requests from account managers. The flow looked straightforward:

  1. Read the customer request.
  2. Call a pricing tool.
  3. Book a shipment with a carrier API.
  4. Send a confirmation email.
  5. Write the final status back to the CRM.

In staging, everything passed. In production, the carrier API occasionally timed out after accepting a booking. The agent interpreted the timeout as failure and retried the booking step. Some shipments were booked twice.

| Signal | Incident window |
| --- | --- |
| Affected requests | 0.8% of shipment runs |
| Carrier API p95 latency | Rose from 730 ms to 2.6 s |
| Duplicate side effects | Duplicate bookings and confirmation emails |
| Mean time to identify cause | More than 3 hours |

The team had logs, but they were scattered across the app, the model provider, the carrier adapter, and the email service. Nobody could quickly reconstruct one full agent run.

What changed

The fix was not "use a better model." The model was not the root cause. The runtime contract was.

SkyAIApp introduced four controls:

| Control | Why it mattered |
| --- | --- |
| Idempotency keys | Retried booking calls could be safely deduplicated |
| Bounded retries | The runtime stopped retry loops before they became cost and side-effect problems |
| Structured tool schemas | Malformed carrier responses were caught before the next step |
| End-to-end traces | Engineers could replay a run and see every model, tool, retry, and decision |

Each tool call received a stable runId, stepId, and external idempotency key. The carrier adapter stored that key before making the provider request, so a late success and a retry could resolve to one booking.
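The store-the-key-first pattern can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's actual adapter: `idempotency_key`, `FakeCarrier`, and `CarrierAdapter` are hypothetical names, the store is an in-memory dict where a real adapter would need a durable one, and the carrier is assumed to deduplicate requests that carry the same key.

```python
import hashlib


def idempotency_key(run_id: str, step_id: str) -> str:
    """Derive a key that is identical on every retry of the same step."""
    return hashlib.sha256(f"{run_id}:{step_id}".encode()).hexdigest()


class FakeCarrier:
    """Stand-in provider: deduplicates on key and can time out after accepting."""

    def __init__(self, timeouts_after_accept: int = 0) -> None:
        self.bookings: dict[str, str] = {}  # idempotency key -> booking id
        self._timeouts_left = timeouts_after_accept

    def book(self, key: str, shipment: dict) -> str:
        # The booking is accepted (or found) before any simulated timeout.
        booking_id = self.bookings.setdefault(key, f"BK-{len(self.bookings) + 1}")
        if self._timeouts_left > 0:
            self._timeouts_left -= 1
            raise TimeoutError("accepted, but the response was lost")
        return booking_id


class CarrierAdapter:
    def __init__(self, carrier: FakeCarrier) -> None:
        self.carrier = carrier
        self.attempts: dict[str, dict] = {}  # key -> record, written BEFORE the request

    def book(self, run_id: str, step_id: str, shipment: dict) -> str:
        key = idempotency_key(run_id, step_id)
        record = self.attempts.setdefault(key, {"status": "pending", "booking_id": None})
        if record["status"] == "confirmed":
            return record["booking_id"]  # an earlier attempt already succeeded
        # New or still pending: send (or resend) with the same key.
        booking_id = self.carrier.book(key, shipment)
        record.update(status="confirmed", booking_id=booking_id)
        return booking_id


# Usage: the first attempt times out after the carrier accepted the booking.
carrier = FakeCarrier(timeouts_after_accept=1)
adapter = CarrierAdapter(carrier)
try:
    adapter.book("run-1", "step-3", {"sku": "pallet"})
except TimeoutError:
    booking = adapter.book("run-1", "step-3", {"sku": "pallet"})  # same key, same booking
```

In this sketch the retry reuses the key derived from `run-1` and `step-3`, so the late success and the retry resolve to a single entry in `carrier.bookings`.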

Runtime policy

The team added a simple runtime policy:

  • Read-only tools can retry once immediately, then back off.
  • Side-effect tools require an idempotency key and are never retried blindly.
  • Tool output must pass schema validation before the agent can continue.
  • If the retry rate exceeds the 2% budget in a 15-minute window, alert and degrade to human review.
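The retry rules above can be sketched as a small policy table plus a sliding-window budget. Everything here is a hypothetical illustration: `RetryPolicy`, `POLICIES`, `call_tool`, and `RetryBudget` are invented names, the attempt counts and backoff delays are placeholder values, and the 2% figure is assumed to mean retries as a share of all recorded tool calls, which the policy itself does not spell out.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int             # total attempts, including the first
    backoff_s: tuple[float, ...]  # wait before retry 1, retry 2, ...
    requires_idempotency: bool


# Placeholder values: read-only tools retry once immediately, then back off.
POLICIES = {
    "read_only": RetryPolicy(3, (0.0, 0.5), requires_idempotency=False),
    "side_effect": RetryPolicy(2, (1.0,), requires_idempotency=True),
}


def call_tool(kind: str, fn: Callable[[], T], *,
              idempotency_key: Optional[str] = None,
              sleep: Callable[[float], None] = time.sleep) -> T:
    policy = POLICIES[kind]
    if policy.requires_idempotency and idempotency_key is None:
        raise ValueError("side-effect tools must be called with an idempotency key")
    for attempt in range(policy.max_attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == policy.max_attempts - 1:
                raise  # attempts are exhausted; surface the failure
            sleep(policy.backoff_s[attempt])
    raise AssertionError("unreachable")


class RetryBudget:
    """Retry rate over a sliding window: retries / recorded calls."""

    def __init__(self, limit: float = 0.02, window_s: float = 900.0) -> None:
        self.limit, self.window_s = limit, window_s
        self._events: deque[tuple[float, bool]] = deque()  # (timestamp, was_retry)

    def record(self, now: float, was_retry: bool) -> None:
        self._events.append((now, was_retry))
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()

    def exceeded(self) -> bool:
        if not self._events:
            return False
        retries = sum(1 for _, was_retry in self._events if was_retry)
        return retries / len(self._events) > self.limit
```

A caller would record each attempt in `RetryBudget` and, once `exceeded()` returns true, raise an alert and route new runs to human review instead of continuing to retry.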

That policy was visible in traces, which made production reviews much less subjective.

Results after rollout

| Metric | Before | After |
| --- | --- | --- |
| Duplicate bookings | 0.8% of runs | Effectively zero |
| Agent run success rate | 97.8% | 99.4% |
| p95 run duration | 6.8 s | 5.1 s |
| MTTR for failed runs | More than 3 hours | Under 20 minutes |

The improvement came from making failure legible and bounded. The agent still encountered slow APIs and malformed data, but those failures no longer turned into silent side effects.

The checklist we recommend

Before an agent touches production systems, require:

  • Idempotency for every state-changing tool.
  • Explicit timeout and retry policy per tool family.
  • JSON schema validation for tool inputs and outputs.
  • Trace IDs that connect model spans, tool spans, and application logs.
  • A human-review path for high-risk or repeated failures.
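As an illustration of the validation item in this checklist, the sketch below checks a tool result against a declared shape before the agent is allowed to continue. It is a simplified stand-in rather than real JSON Schema validation: the `BOOKING_RESPONSE_SCHEMA` fields, `ToolOutputError`, and `validate_output` are hypothetical, and a production system would more likely use a full JSON Schema validator.

```python
from typing import Any

# Hypothetical shape of a carrier booking response: field name -> required type.
BOOKING_RESPONSE_SCHEMA: dict[str, type] = {
    "booking_id": str,
    "status": str,
    "price_cents": int,
}


class ToolOutputError(Exception):
    """Raised when a tool result does not match its declared shape."""


def validate_output(payload: Any, schema: dict[str, type]) -> dict[str, Any]:
    """Return the payload unchanged if it matches the schema, else raise."""
    if not isinstance(payload, dict):
        raise ToolOutputError(f"expected an object, got {type(payload).__name__}")
    problems = []
    for name, expected in schema.items():
        if name not in payload:
            problems.append(f"missing field '{name}'")
        elif type(payload[name]) is not expected:  # strict: bool is not accepted as int
            actual = type(payload[name]).__name__
            problems.append(f"field '{name}' should be {expected.__name__}, got {actual}")
    if problems:
        raise ToolOutputError("; ".join(problems))
    return payload
```

The runtime would call `validate_output` on every tool result and treat `ToolOutputError` as a failed step, so a malformed response stops the run instead of flowing into the next tool call.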

Agents can be powerful product infrastructure, but only when the runtime assumes things will fail. SkyAIApp is designed around that assumption: trace first, retry carefully, validate every boundary, and keep side effects controlled.