Production Routing Playbook: A Support Copilot at 40 QPS

5/6/2026

Most AI support copilots start the same way: a product team picks one strong model, writes a careful prompt, connects the help center, and gets a great demo.

The trouble starts after launch. Real support traffic is not one workload. It is password resets, refund policy questions, order status lookups, angry long-form complaints, screenshots, multilingual follow-ups, and occasional high-risk edge cases. Sending all of that through the same model is simple, but it is rarely economical or reliable.

This is an anonymized composite scenario based on common production patterns we see when teams move from prototype to controlled AI operations.

Multi-model routing architecture

Production routing separates request intent, policy, model choice, cache, fallback, and traceability.

The starting point

The team ran a support copilot inside a consumer marketplace. Traffic was steady on weekdays and spiky after promotions.

Signal                      Before routing
Daily AI requests           18k to 24k
Peak traffic                40 QPS
Primary model               One frontier model for every request
Cost per 1k requests        $4.18
P95 latency                 1.64s
Successful response rate    98.1%
Repeatable FAQ share        About 36%

The model quality was strong, but three problems kept showing up in reviews:

  • Simple FAQ traffic was paying premium-model prices.
  • Provider latency spikes caused full retries, which made peak cost worse.
  • Engineers could not explain why a given answer used a costly model.

The team did not want to reduce quality. They wanted a control plane that could spend more only when the request deserved it.

The routing policy

They introduced three pools in SkyAIApp:

  • Cost pool: password resets, return windows, shipping cutoffs, and plan limits. Policy: cheapest model that passes answer-format checks.
  • Balanced pool: normal support conversation with moderate context. Policy: best score across latency, cost, and historical success.
  • Quality pool: refund disputes, account recovery, and multi-step tool planning. Policy: strong model with stricter validation and fallback.
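The quality pool's "stricter validation and fallback" can be sketched as a small loop: call a strong model, validate the answer against tighter checks, and move to a fallback model on timeout or validation failure instead of retrying the same provider. The model names and the validation rules below are illustrative assumptions, not details from the source.

```python
def validate(answer: str) -> bool:
    # Stricter checks for high-risk intents. The required "[cite]" marker
    # is an illustrative rule, standing in for real format validation.
    return bool(answer) and "[cite]" in answer

def call_with_fallback(prompt, models, call):
    """Try each model in order; return the first answer that validates.

    `call(model, prompt)` is an assumed provider-call hook; a TimeoutError
    moves straight to the fallback model rather than retrying in place.
    """
    for model in models:
        try:
            answer = call(model, prompt)
        except TimeoutError:
            continue  # provider latency spike: fall back, do not re-retry
        if validate(answer):
            return model, answer
    raise RuntimeError("all models in the quality pool failed validation")
```

Skipping the in-place retry is what keeps a provider latency spike from multiplying peak cost, which was one of the original complaints.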

The routing signal was intentionally boring: intent label, customer tier, retrieval confidence, prompt size, tool requirement, locale, and safety flag. The first version used lightweight classifiers and embeddings rather than a large new ML project.
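A minimal sketch of that signal and the pool decision might look like the following. The field names, intent labels, and the 0.8 confidence threshold are assumptions for illustration; the source names only the signal families.

```python
from dataclasses import dataclass

@dataclass
class RoutingSignal:
    intent: str                  # e.g. "faq.shipping", "refund.dispute"
    customer_tier: str           # e.g. "free", "pro"
    retrieval_confidence: float  # 0.0 to 1.0 from the retrieval layer
    prompt_tokens: int
    needs_tools: bool
    locale: str
    safety_flag: bool

# Illustrative set of intents that always get the quality pool.
QUALITY_INTENTS = {"refund.dispute", "account.recovery"}

def choose_pool(sig: RoutingSignal) -> str:
    # High-risk signals win first; confident FAQ traffic goes cheap;
    # everything else lands in the balanced pool.
    if sig.safety_flag or sig.needs_tools or sig.intent in QUALITY_INTENTS:
        return "quality"
    if sig.intent.startswith("faq.") and sig.retrieval_confidence >= 0.8:
        return "cost"
    return "balanced"
```

The point of the boring signal is that every branch here is explainable: an engineer can say exactly why a request used a costly model.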

Here is the important part: the team treated routing as versioned product infrastructure, not as a hidden SDK choice. Each policy version had an owner, rollout percentage, and evaluation report.
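Treating the policy as versioned infrastructure can be as simple as a record with an owner, a rollout percentage, and an evaluation report, plus a deterministic bucket so a given request always sees the same version during a partial rollout. This is a sketch under assumed names, not SkyAIApp's actual schema.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class PolicyVersion:
    version: str       # e.g. "routing-v3"
    owner: str         # accountable team or person
    rollout_pct: int   # 0..100 share of traffic on this version
    eval_report: str   # ID or link of the evaluation report

def in_rollout(request_id: str, policy: PolicyVersion) -> bool:
    # Stable bucket in [0, 100) derived from request ID and version,
    # so rollout membership is deterministic and auditable.
    digest = hashlib.sha256(f"{policy.version}:{request_id}".encode()).digest()
    bucket = digest[0] * 256 + digest[1]
    return (bucket % 100) < policy.rollout_pct
```

Hashing on the version string means a new policy version reshuffles the buckets, so no single tenant is always the guinea pig.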

Cache with guardrails

Semantic cache was enabled only for low-risk intent families:

  • faq.policy
  • faq.shipping
  • faq.plan_limits
  • docs.how_to

Cache keys included locale, policy version, public help-center revision, and safety labels. Anything involving account-specific data, payment status, or a live tool call bypassed cache automatically.
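The key construction and the bypass rule can be sketched together: cacheable intent families get a key that mixes in locale, policy version, help-center revision, and safety labels; anything account-specific or tool-using returns no key at all. Parameter names and the embedding-ID stand-in are illustrative assumptions.

```python
import hashlib
import json

# Intent families from the allowlist above; everything else bypasses cache.
CACHEABLE_INTENTS = {"faq.policy", "faq.shipping", "faq.plan_limits", "docs.how_to"}

def cache_key(intent, matched_question_id, locale, policy_version,
              help_center_rev, safety_labels, uses_account_data, uses_tools):
    """Return a cache key, or None to bypass the cache entirely."""
    if intent not in CACHEABLE_INTENTS or uses_account_data or uses_tools:
        return None  # account data, payments, and live tool calls never cache
    payload = json.dumps({
        "intent": intent,
        "q": matched_question_id,   # ID of the semantically matched question
        "locale": locale,
        "policy": policy_version,
        "help_rev": help_center_rev,
        "safety": sorted(safety_labels),
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

Because the help-center revision is part of the key, publishing new documentation invalidates stale answers automatically instead of requiring a manual cache flush.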

That kept the cache useful without creating stale or unsafe answers.

Results after 21 days

Metric                         Before    After
Cost per 1k requests           $4.18     $2.96
P95 latency                    1.64s     1.08s
Successful response rate       98.1%     99.3%
FAQ cache hit rate             0%        32%
Requests using quality pool    100%      9%

The biggest lesson was not "use cheaper models." It was "route with evidence." Simple intents were faster and cheaper on smaller models, while complex cases still had budget to use the strongest pool.

Why this matters

Without routing, every product decision becomes a model decision. Want faster support? Change the model. Need lower cost? Change the model. Need safer account recovery? Change the model again.

That does not scale.

SkyAIApp lets teams move those choices into policies:

  • Product can define which journeys deserve quality-first handling.
  • Engineering can keep fallback and retry behavior consistent.
  • Finance can see unit cost by intent, tenant, and policy version.
  • Support leaders can review quality without reading raw provider logs.

For a production support copilot, the router becomes the layer that keeps quality, latency, and cost in the same conversation.