Production Routing Playbook: A Support Copilot at 40 QPS
Most AI support copilots start the same way: a product team picks one strong model, writes a careful prompt, connects the help center, and gets a great demo.
The trouble starts after launch. Real support traffic is not one workload. It is password resets, refund policy questions, order status lookups, angry long-form complaints, screenshots, multilingual follow-ups, and occasional high-risk edge cases. Sending all of that through the same model is simple, but it is rarely economical or reliable.
This is an anonymized composite scenario based on common production patterns we see when teams move from prototype to controlled AI operations.
Production routing separates these concerns: request intent, policy, model choice, caching, fallback, and traceability each become an explicit, inspectable decision instead of one implicit model call.
The starting point
The team ran a support copilot inside a consumer marketplace. Traffic was steady on weekdays and spiky after promotions.
| Signal | Before routing |
|---|---|
| Daily AI requests | 18k to 24k |
| Peak traffic | 40 QPS |
| Primary model | One frontier model for every request |
| Cost per 1k requests | $4.18 |
| P95 latency | 1.64s |
| Successful response rate | 98.1% |
| Repeatable FAQ share | About 36% |
The model quality was strong, but three problems kept showing up in reviews:
- Simple FAQ traffic was paying premium-model prices.
- Provider latency spikes caused full retries, which made peak cost worse.
- Engineers could not explain why a given answer used a costly model.
The team did not want to reduce quality. They wanted a control plane that could spend more only when the request deserved it.
The routing policy
They introduced three pools in SkyAIApp:
| Pool | Used for | Policy |
|---|---|---|
| Cost pool | Password resets, return windows, shipping cutoffs, plan limits | Cheapest model that passes answer-format checks |
| Balanced pool | Normal support conversation with moderate context | Best score across latency, cost, and historical success |
| Quality pool | Refund disputes, account recovery, multi-step tool planning | Strong model with stricter validation and fallback |
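Written down as configuration, a policy like this stays small enough to review in a pull request. The sketch below is illustrative Python; the model names, field names, and fallback targets are assumptions, not SkyAIApp's actual schema.

```python
# Hypothetical configuration for the three pools. All names here are
# illustrative assumptions, not SkyAIApp's actual schema.
ROUTING_POLICY = {
    "cost": {
        "candidates": ["small-fast-a", "small-fast-b"],  # cheapest model first
        "validators": ["answer_format"],                 # must pass before serving
        "on_failure": "balanced",                        # escalate, don't retry blindly
    },
    "balanced": {
        "candidates": ["mid-tier-a", "mid-tier-b"],
        "rank_by": ["latency_p95", "cost", "historical_success"],
        "on_failure": "quality",
    },
    "quality": {
        "candidates": ["frontier-a"],
        "validators": ["answer_format", "policy_grounding"],
        "on_failure": "human_handoff",                   # never silently degrade
    },
}
```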
The routing signal was intentionally boring: intent label, customer tier, retrieval confidence, prompt size, tool requirement, locale, and safety flag. The first version used lightweight classifiers and embeddings rather than a large new ML project.
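Mapping those signals to a pool can then be an ordinary, testable function rather than logic buried in an SDK. The sketch below assumes the signal names, intent labels, and thresholds; real values would live in the versioned policy.

```python
from dataclasses import dataclass

@dataclass
class RouteSignal:
    # The seven signals from the text; names and types are illustrative.
    intent: str
    customer_tier: str
    retrieval_confidence: float   # 0.0 to 1.0
    prompt_tokens: int
    needs_tools: bool
    locale: str
    safety_flagged: bool

def choose_pool(s: RouteSignal) -> str:
    """Map a request's signals to a pool. Thresholds and intent labels
    are made up for illustration."""
    if s.safety_flagged or s.intent in {"refund.dispute", "account.recovery"}:
        return "quality"
    if s.needs_tools:
        return "quality"          # multi-step tool planning stays on the strong pool
    if s.intent.startswith(("faq.", "docs.")) and s.retrieval_confidence >= 0.8:
        return "cost"             # high-confidence FAQ traffic goes cheap
    if s.prompt_tokens > 4000:
        return "balanced"         # large contexts skip the cost pool
    return "balanced"
```

The order of the checks encodes the policy: safety and high-risk intents are decided before any cost optimization gets a vote.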
Here is the important part: the team treated routing as versioned product infrastructure, not as a hidden SDK choice. Each policy version had an owner, rollout percentage, and evaluation report.
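One lightweight way to honor a rollout percentage is deterministic bucketing, so a given conversation always sees the same policy version and evaluation reports stay comparable. A sketch, with assumed version names:

```python
import hashlib

def policy_version_for(request_id: str, rollout_pct: float = 10.0) -> str:
    """Deterministically assign a request to the candidate policy version.

    The version names and 10% default are illustrative assumptions.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return "routing_policy_v8" if bucket < rollout_pct * 100 else "routing_policy_v7"
```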
Cache with guardrails
Semantic caching was enabled only for low-risk intent families:

- `faq.policy`
- `faq.shipping`
- `faq.plan_limits`
- `docs.how_to`
Cache keys included locale, policy version, public help-center revision, and safety labels. Anything involving account-specific data, payment status, or a live tool call bypassed cache automatically.
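A minimal sketch of the key derivation and bypass check, with assumed parameter names (`question_embedding_id` stands in for whatever semantic-match identifier the cache produces):

```python
import hashlib

# Only these low-risk intent families are eligible for the semantic cache.
CACHEABLE_INTENTS = {"faq.policy", "faq.shipping", "faq.plan_limits", "docs.how_to"}

def cache_key(intent: str, question_embedding_id: str, locale: str,
              policy_version: str, help_center_rev: str, safety_label: str) -> str | None:
    """Return a cache key, or None when the request must bypass the cache.

    Parameter names are assumptions; the point is that every input that can
    change the right answer is part of the key."""
    if intent not in CACHEABLE_INTENTS:
        return None  # account data, payments, and live tool calls never hit cache
    raw = "|".join([intent, question_embedding_id, locale,
                    policy_version, help_center_rev, safety_label])
    return hashlib.sha256(raw.encode()).hexdigest()
```

Because the policy version and help-center revision are part of the key, publishing new docs or rolling out a new policy invalidates stale entries automatically.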
That kept the cache useful without creating stale or unsafe answers.
Results after 21 days
| Metric | Before | After |
|---|---|---|
| Cost per 1k requests | $4.18 | $2.96 |
| P95 latency | 1.64s | 1.08s |
| Successful response rate | 98.1% | 99.3% |
| FAQ cache hit rate | 0% | 32% |
| Requests using quality pool | 100% | 9% |
The biggest lesson was not "use cheaper models." It was "route with evidence." Simple intents were faster and cheaper on smaller models, while complex cases still had budget to use the strongest pool.
Why this matters
Without routing, every product decision becomes a model decision. Want faster support? Change the model. Need lower cost? Change the model. Need safer account recovery? Change the model again.
That does not scale.
SkyAIApp lets teams move those choices into policies:
- Product can define which journeys deserve quality-first handling.
- Engineering can keep fallback and retry behavior consistent.
- Finance can see unit cost by intent, tenant, and policy version.
- Support leaders can review quality without reading raw provider logs.
For a production support copilot, the router becomes the layer that keeps quality, latency, and cost in the same conversation.