Architecture deep-dive

How the router actually decides

After reading this page, you can answer: how does SkyAIApp represent candidate models? How is scoring computed? What is the cache lookup algorithm? When do fallbacks fire? How do traces form a tree? And why we picked each of these designs.

Request lifecycle

Request lifecycle architecture
How one request is governed, routed, executed, and audited.

Step 01 (entry): Edge ingress
TLS, WAF, API key, tenant identity, and rate limits run before the request enters the control plane. (TLS 1.3, WAF, key-tier limits)

Step 02 (normalize): Normalize request
OpenAI-compatible calls, native route requests, and metadata are normalized into one auditable envelope. (JSON Schema, prompt fingerprint, policy resolve)

Step 03 (protect): Input guardrails
Injection, PII, tenant boundary, and residency checks run before spending on a model call. (PII, region tags, tool scopes)

Step 04 (fast path): Semantic cache lookup
Namespace, embedding similarity, and TTL determine cache hits; a hit returns immediately and skips the model call. (HNSW, similarity gate, TTL)

Step 05 (model pool): Candidate filtering
Model pools are filtered by context window, modality, region, RBAC, budget, and provider health; a filtering sketch follows this lifecycle overview. (budget cap, context fit, provider health)

Step 06 (decision): Score and rank
Strategy weights normalize cost, latency, quality, and reliability to produce the primary and the fallback chain. (cost, latency, quality score)

Step 07 (execute): Execute primary or fallback
Timeouts, 5xx responses, safety hits, or budget changes trigger a re-checked fallback handoff. (HTTP/2, timeout, fallback)

Step 08 (response): Assemble response
Model output, routing metadata, guardrail results, and the trace ID are assembled before returning to the app. (trace_id, route reason, usage metadata)

Steps 01 through 08 are the synchronous path that blocks the response; the cache fast path, fallback handoff, and async observability writes branch off as side paths. Every step writes into the same trace tree.
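To make step 05 concrete, here is a minimal filtering sketch. The Candidate and RequestCtx types and every field name are illustrative assumptions, not the actual SkyAIApp types; only the filter criteria come from the step description above.

// Illustrative types; field names are assumptions, not the real schema.
#[derive(Clone)]
struct Candidate {
    model: String,
    provider: String,
    context_window: u32,
    supports_vision: bool,
    regions: Vec<String>,
    est_cost_usd: f64,
    healthy: bool,
}

struct RequestCtx {
    prompt_tokens: u32,
    needs_vision: bool,
    region: String,
    remaining_budget_usd: f64,
    allowed_models: Vec<String>, // resolved from RBAC policy
}

// Step 05: drop every candidate that cannot serve this request at all.
fn filter_candidates(pool: &[Candidate], ctx: &RequestCtx) -> Vec<Candidate> {
    pool.iter()
        .filter(|c| c.context_window >= ctx.prompt_tokens)      // context fit
        .filter(|c| !ctx.needs_vision || c.supports_vision)     // modality
        .filter(|c| c.regions.contains(&ctx.region))            // residency
        .filter(|c| ctx.allowed_models.contains(&c.model))      // RBAC
        .filter(|c| c.est_cost_usd <= ctx.remaining_budget_usd) // budget cap
        .filter(|c| c.healthy)                                  // provider health
        .cloned()
        .collect()
}

Filtering is a hard gate: anything it removes can never be chosen, which is why it only encodes requirements, not preferences. Preferences belong to scoring (step 06).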

Why this shape

Why is cache lookup before candidate filter?

A cache hit is the free path: skipping the model call skips the most expensive step. By looking up first, a hit stops at step 04 and never runs candidate filtering, scoring, or the model call.

Why is budget checked both in filter and fallback?

The filter excludes candidates that are obviously over budget. After a fallback fires, we re-account for the cost already spent: a candidate that was borderline before may now exceed the remaining budget.
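A hedged sketch of that re-check; the Fallback type, its field names, and the "estimated cost vs. remaining budget" rule are illustrative assumptions:

#[derive(Clone)]
struct Fallback {
    model: String,
    est_cost_usd: f64,
}

// After the primary attempt fails, re-check the chain against the budget that
// is actually left, not the budget the original filter saw before any spend.
fn next_affordable_fallback(
    chain: &[Fallback],
    budget_usd: f64,
    spent_so_far_usd: f64,
) -> Option<Fallback> {
    let remaining = budget_usd - spent_so_far_usd;
    chain.iter().find(|f| f.est_cost_usd <= remaining).cloned()
}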

Why score with 1/normalized rather than w * cost?

Cost spans roughly three orders of magnitude ($0.0001–$0.05 per request). Multiplying raw cost by a weight lets the extremes dominate. Normalizing to [0, 1] and then inverting makes "cheap but slow" and "expensive but fast" genuinely comparable under the balanced strategy.
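A worked sketch of that idea under the balanced strategy. Everything here is illustrative: the field names, the weights, and the exact inversion (written as 1 − normalized; the shipped formula may differ).

struct Metrics {
    cost_usd: f64,    // spans ~3 orders of magnitude across the pool
    latency_ms: f64,  // e.g. p50 latency from recent traffic
    quality: f64,     // already scaled to [0, 1]
    reliability: f64, // already scaled to [0, 1]
}

struct Weights {
    cost: f64,
    latency: f64,
    quality: f64,
    reliability: f64,
}

// Min-max normalize a value into [0, 1] relative to the candidate pool.
fn norm(v: f64, min: f64, max: f64) -> f64 {
    if (max - min).abs() < f64::EPSILON { 0.0 } else { (v - min) / (max - min) }
}

// Higher is better. Cost and latency are normalized first, then inverted, so
// "cheap" and "fast" land on the same 0..1 scale as quality and reliability
// instead of letting a $0.05 outlier dominate the sum.
fn score(m: &Metrics, w: &Weights, lo: &Metrics, hi: &Metrics) -> f64 {
    w.cost * (1.0 - norm(m.cost_usd, lo.cost_usd, hi.cost_usd))
        + w.latency * (1.0 - norm(m.latency_ms, lo.latency_ms, hi.latency_ms))
        + w.quality * m.quality
        + w.reliability * m.reliability
}

With balanced weights (e.g. 0.25 each, as an illustration), a cheap-but-slow candidate and an expensive-but-fast one can end up with comparable scores, which is exactly the property the raw w * cost form lacks.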

Why tie-break by provider diversity?

If primary + secondary are both from OpenAI, an OpenAI outage takes down both. Diversity tie-break ensures the fallback chain provides genuine resilience.
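One possible shape of the tie-break, assuming the candidates are already sorted by score; the Ranked type and the epsilon rule are illustrative:

#[derive(Clone)]
struct Ranked {
    model: String,
    provider: String,
    score: f64,
}

// Pick the secondary: prefer the best candidate from a different provider than
// the primary, as long as it scores within `epsilon` of the best remaining
// candidate; otherwise fall back to plain score order.
fn pick_secondary(ranked_desc: &[Ranked], epsilon: f64) -> Option<Ranked> {
    let primary = ranked_desc.first()?;
    let rest = &ranked_desc[1..];
    let best_rest = rest.first()?;
    rest.iter()
        .find(|c| c.provider != primary.provider && best_rest.score - c.score <= epsilon)
        .or(Some(best_rest))
        .cloned()
}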

Why is trace writing async?

Zero impact on response latency. Traces go through a fire-and-forget queue with eventual consistency. The cost: P99 write latency is ~2 s, so a request made "just now" may briefly not show up in dashboards.
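A minimal sketch of the fire-and-forget hand-off using a plain channel and a background thread; the event shape and the batching behavior of the real pipeline are assumptions:

use std::sync::mpsc;
use std::thread;

// Illustrative trace event; real spans carry more fields (parent IDs, attributes).
struct TraceEvent {
    trace_id: String,
    span_name: String,
    duration_ms: f64,
}

// The request path only pays for an in-memory send; a background worker owns
// the slow write to the trace store, which is why a trace can lag the response.
fn spawn_trace_writer() -> mpsc::Sender<TraceEvent> {
    let (tx, rx) = mpsc::channel::<TraceEvent>();
    thread::spawn(move || {
        for event in rx {
            // Placeholder for the real batched write to the trace store.
            let _ = event;
        }
    });
    tx
}

A call site does nothing more than `let _ = traces.send(event);` and moves on: if the send fails, the event is dropped, never the response.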

Semantic cache internals

The cache is a namespaced vector index (HNSW) plus an S3-compatible content store. The lookup flow:

use std::time::{Duration, SystemTime};

fn cache_lookup(ns: &str, prompt: &str, min_similarity: f32, ttl: Duration)
    -> Option<CachedResponse>
{
    // 1. Embed the prompt with a small embedder (~5ms p50)
    let q_vec: [f32; 384] = embed_small(prompt);

    // 2. ANN search in the namespaced HNSW index;
    //    k = 8 candidates are returned with cosine similarities
    let candidates = hnsw.search(ns, &q_vec, 8);

    // 3. Keep candidates above the similarity threshold and inside the TTL,
    //    then take the most similar one
    let now = SystemTime::now();
    let best = candidates
        .iter()
        .filter(|c| c.similarity >= min_similarity)
        .filter(|c| now.duration_since(c.created_at).map_or(false, |age| age <= ttl))
        .max_by(|a, b| a.similarity.partial_cmp(&b.similarity).unwrap());

    // 4. Fetch the response body from the content store
    //    (kept separate from the index for cost reasons)
    best.and_then(|hit| {
        let body = content_store.get(&hit.content_key)?;
        Some(CachedResponse {
            output: body,
            similarity: hit.similarity,
            stored_at: hit.created_at,
        })
    })
}

Why a small embedder rather than OpenAI's ada-002: the lookup cost must not exceed the cost it saves. We use a distilled MPNet (384-dim); a lookup costs ~5 ms and roughly $0.00001.
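A hypothetical call site for the function above; the namespace string, threshold, TTL, and the RouteDecision outcome type are made up for illustration (in practice they come from the resolved policy):

// Hypothetical outcome type for illustration only.
enum RouteDecision {
    ServeFromCache(CachedResponse),
    ContinueToFilter,
}

fn after_input_guardrails(prompt: &str) -> RouteDecision {
    match cache_lookup("tenant_acme:chat", prompt, 0.92, Duration::from_secs(3600)) {
        Some(hit) => RouteDecision::ServeFromCache(hit), // skip steps 05–07
        None => RouteDecision::ContinueToFilter,         // proceed to step 05
    }
}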

How traces form a tree

Each trace is a span tree. The router creates a root span; each internal step (cache, filter, score, execute) is a child span. For agent runs, each agent step nests as a child span, and tool calls inside an agent step become grandchildren.

tr_01JFGYZ7K8M2N3P4Q5R6S7T8U9   (root: route, 1820ms)
├─ ingress           (3ms)
├─ normalize         (1ms)
├─ cache.lookup      (5ms)   miss
├─ filter            (1ms)   12 candidates
├─ score             (2ms)   gpt-5.5-pro wins
├─ execute.primary   (1810ms)
│  └─ http.openai    (1808ms) 200 OK
│     ├─ tcp.connect (45ms)
│     ├─ tls.handshake (60ms)
│     └─ stream.read (1700ms)
├─ guardrail.output  (1ms)   PII clean
├─ cache.write       (4ms)   stored
└─ billing.emit      (0.4ms)
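One way such a tree could be represented; the struct and field names below are illustrative, not the actual trace schema:

// A span owns its children, so the root span for a request owns the whole tree
// shown above (ingress, cache, execute, and any nested tool-call spans).
struct Span {
    name: String,
    duration_ms: f64,
    attributes: Vec<(String, String)>, // e.g. ("cache", "miss"), ("status", "200 OK")
    children: Vec<Span>,
}

impl Span {
    // Depth-first walk, e.g. to render the indented view above.
    fn walk(&self, depth: usize, visit: &mut impl FnMut(usize, &Span)) {
        visit(depth, self);
        for child in &self.children {
            child.walk(depth + 1, visit);
        }
    }
}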

Multi-region active-active

The router runs active-active in three regions (us-east, eu-west, ap-southeast). GeoDNS routes each request to the nearest region. Policies, cache, and usage data use a write-to-one, read-from-many topology: writes go to us-east and are asynchronously replicated globally (eventual consistency ≤ 30 s). The result: a single-region outage does not affect global availability, and policy changes propagate globally within 30 s.
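A minimal sketch of that read/write split; the Region enum and the routing helpers are illustrative, not the actual replication code:

#[derive(Clone, Copy, PartialEq)]
enum Region {
    UsEast,      // single write region for policies, cache metadata, usage
    EuWest,
    ApSoutheast,
}

const WRITE_REGION: Region = Region::UsEast;

// Reads never leave the local region; every region holds an async replica.
fn region_for_read(local: Region) -> Region {
    local
}

// Writes are forwarded to the primary region and replicated out within ~30 s.
fn region_for_write(_local: Region) -> Region {
    WRITE_REGION
}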

Two-sided guardrails

Guardrails run on both input and output: input checks cover prompt injection, PII, and topic blocks; output checks cover PII leakage, hallucination suppression, and content moderation. Every hit becomes a guardrail span in the trace; DPOs can export them with one click.
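A hedged sketch of the two-sided shape; the check names, the GuardrailHit type, and the placeholder detectors are illustrative assumptions:

enum Action {
    Block,
    Redact,
    Flag,
}

// Every hit also becomes a guardrail span in the trace.
struct GuardrailHit {
    check: &'static str, // e.g. "prompt_injection", "pii_output"
    action: Action,
}

// Input side: runs at step 03, before any model spend.
fn check_input(prompt: &str) -> Vec<GuardrailHit> {
    let mut hits = Vec::new();
    if looks_like_injection(prompt) {
        hits.push(GuardrailHit { check: "prompt_injection", action: Action::Block });
    }
    if contains_pii(prompt) {
        hits.push(GuardrailHit { check: "pii_input", action: Action::Redact });
    }
    hits
}

// Output side: runs on the model response before assembly at step 08.
fn check_output(completion: &str) -> Vec<GuardrailHit> {
    let mut hits = Vec::new();
    if contains_pii(completion) {
        hits.push(GuardrailHit { check: "pii_output", action: Action::Redact });
    }
    if fails_moderation(completion) {
        hits.push(GuardrailHit { check: "moderation", action: Action::Flag });
    }
    hits
}

// Placeholder detectors; the real checks are rule- and model-based.
fn looks_like_injection(_s: &str) -> bool { false }
fn contains_pii(_s: &str) -> bool { false }
fn fails_moderation(_s: &str) -> bool { false }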

