Production Routing Playbook: A Support Copilot at 40 QPS

5/6/2026

Most AI support copilots start the same way: a product team picks one strong model, writes a careful prompt, connects the help center, and gets a great demo.

The trouble starts after launch. Real support traffic is not one workload. It is password resets, refund policy questions, order status lookups, angry long-form complaints, screenshots, multilingual follow-ups, and occasional high-risk edge cases. Sending all of that through the same model is simple, but it is rarely economical or reliable.

This is an anonymized composite scenario based on common production patterns we see when teams move from prototype to controlled AI operations.

Multi-model routing architecture

Production routing separates request intent, policy, model choice, cache, fallback, and traceability.

The starting point

The team ran a support copilot inside a consumer marketplace. Traffic was steady on weekdays and spiky after promotions.

Signal                      Before routing
Daily AI requests           18k to 24k
Peak traffic                40 QPS
Primary model               One frontier model for every request
Cost per 1k requests        $4.18
P95 latency                 1.64s
Successful response rate    98.1%
Repeatable FAQ share        About 36%

The model quality was strong, but three problems kept showing up in reviews:

  • Simple FAQ traffic was paying premium-model prices.
  • Provider latency spikes caused full retries, which made peak cost worse.
  • Engineers could not explain why a given answer used a costly model.

The team did not want to reduce quality. They wanted a control plane that could spend more only when the request deserved it.

The routing policy

They introduced three pools in SkyAIApp:

  • Cost pool: password resets, return windows, shipping cutoffs, and plan limits. Policy: cheapest model that passes answer-format checks.
  • Balanced pool: normal support conversation with moderate context. Policy: best score across latency, cost, and historical success.
  • Quality pool: refund disputes, account recovery, and multi-step tool planning. Policy: strong model with stricter validation and fallback.
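The quality pool's "stricter validation and fallback" can be sketched as a small loop: call a strong model, validate the answer against tighter checks, and move to a fallback model on timeout or validation failure instead of retrying the same provider. The model names and the validation rules below are illustrative assumptions, not details from the source.

```python
def validate(answer: str) -> bool:
    # Stricter checks for high-risk intents. The required "[cite]" marker
    # is an illustrative rule, standing in for real format validation.
    return bool(answer) and "[cite]" in answer

def call_with_fallback(prompt, models, call):
    """Try each model in order; return the first answer that validates.

    `call(model, prompt)` is an assumed provider-call hook; a TimeoutError
    moves straight to the fallback model rather than retrying in place.
    """
    for model in models:
        try:
            answer = call(model, prompt)
        except TimeoutError:
            continue  # provider latency spike: fall back, do not re-retry
        if validate(answer):
            return model, answer
    raise RuntimeError("all models in the quality pool failed validation")
```

Skipping the in-place retry is what keeps a provider latency spike from multiplying peak cost, which was one of the original complaints.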

The routing signal was intentionally boring: intent label, customer tier, retrieval confidence, prompt size, tool requirement, locale, and safety flag. The first version used lightweight classifiers and embeddings rather than a large new ML project.
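A minimal sketch of that signal and the pool decision might look like the following. The field names, intent labels, and the 0.8 confidence threshold are assumptions for illustration; the source names only the signal families.

```python
from dataclasses import dataclass

@dataclass
class RoutingSignal:
    intent: str                  # e.g. "faq.shipping", "refund.dispute"
    customer_tier: str           # e.g. "free", "pro"
    retrieval_confidence: float  # 0.0 to 1.0 from the retrieval layer
    prompt_tokens: int
    needs_tools: bool
    locale: str
    safety_flag: bool

# Illustrative set of intents that always get the quality pool.
QUALITY_INTENTS = {"refund.dispute", "account.recovery"}

def choose_pool(sig: RoutingSignal) -> str:
    # High-risk signals win first; confident FAQ traffic goes cheap;
    # everything else lands in the balanced pool.
    if sig.safety_flag or sig.needs_tools or sig.intent in QUALITY_INTENTS:
        return "quality"
    if sig.intent.startswith("faq.") and sig.retrieval_confidence >= 0.8:
        return "cost"
    return "balanced"
```

The point of the boring signal is that every branch here is explainable: an engineer can say exactly why a request used a costly model.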

Here is the important part: the team treated routing as versioned product infrastructure, not as a hidden SDK choice. Each policy version had an owner, rollout percentage, and evaluation report.
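Treating the policy as versioned infrastructure can be as simple as a record with an owner, a rollout percentage, and an evaluation report, plus a deterministic bucket so a given request always sees the same version during a partial rollout. This is a sketch under assumed names, not SkyAIApp's actual schema.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class PolicyVersion:
    version: str       # e.g. "routing-v3"
    owner: str         # accountable team or person
    rollout_pct: int   # 0..100 share of traffic on this version
    eval_report: str   # ID or link of the evaluation report

def in_rollout(request_id: str, policy: PolicyVersion) -> bool:
    # Stable bucket in [0, 100) derived from request ID and version,
    # so rollout membership is deterministic and auditable.
    digest = hashlib.sha256(f"{policy.version}:{request_id}".encode()).digest()
    bucket = digest[0] * 256 + digest[1]
    return (bucket % 100) < policy.rollout_pct
```

Hashing on the version string means a new policy version reshuffles the buckets, so no single tenant is always the guinea pig.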

Cache with guardrails

Semantic cache was enabled only for low-risk intent families:

  • faq.policy
  • faq.shipping
  • faq.plan_limits
  • docs.how_to

Cache keys included locale, policy version, public help-center revision, and safety labels. Anything involving account-specific data, payment status, or a live tool call bypassed cache automatically.
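The key construction and the bypass rule can be sketched together: cacheable intent families get a key that mixes in locale, policy version, help-center revision, and safety labels; anything account-specific or tool-using returns no key at all. Parameter names and the embedding-ID stand-in are illustrative assumptions.

```python
import hashlib
import json

# Intent families from the allowlist above; everything else bypasses cache.
CACHEABLE_INTENTS = {"faq.policy", "faq.shipping", "faq.plan_limits", "docs.how_to"}

def cache_key(intent, matched_question_id, locale, policy_version,
              help_center_rev, safety_labels, uses_account_data, uses_tools):
    """Return a cache key, or None to bypass the cache entirely."""
    if intent not in CACHEABLE_INTENTS or uses_account_data or uses_tools:
        return None  # account data, payments, and live tool calls never cache
    payload = json.dumps({
        "intent": intent,
        "q": matched_question_id,   # ID of the semantically matched question
        "locale": locale,
        "policy": policy_version,
        "help_rev": help_center_rev,
        "safety": sorted(safety_labels),
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

Because the help-center revision is part of the key, publishing new documentation invalidates stale answers automatically instead of requiring a manual cache flush.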

That kept the cache useful without creating stale or unsafe answers.

Results after 21 days

Metric                         Before    After
Cost per 1k requests           $4.18     $2.96
P95 latency                    1.64s     1.08s
Successful response rate       98.1%     99.3%
FAQ cache hit rate             0%        32%
Requests using quality pool    100%      9%

The biggest lesson was not "use cheaper models." It was "route with evidence." Simple intents were faster and cheaper on smaller models, while complex cases still had budget to use the strongest pool.

Why this matters

Without routing, every product decision becomes a model decision. Want faster support? Change the model. Need lower cost? Change the model. Need safer account recovery? Change the model again.

That does not scale.

SkyAIApp lets teams move those choices into policies:

  • Product can define which journeys deserve quality-first handling.
  • Engineering can keep fallback and retry behavior consistent.
  • Finance can see unit cost by intent, tenant, and policy version.
  • Support leaders can review quality without reading raw provider logs.

For a production support copilot, the router becomes the layer that keeps quality, latency, and cost in the same conversation.