How SkyAIApp thinks
Six concepts explain the whole platform: the Goal × Strategy decision model, the routing pipeline, the agent runtime, guardrails, the semantic cache, and the trace that makes it all debuggable. Once these click, everything else is detail.
TL;DR
SkyAIApp turns 'calling an LLM' from a single API call into a governed decision flow. You declare intent (goal + strategy + budget); SkyAIApp picks the model, hits the cache, runs fallbacks, traces everything, and meters spend — consolidating plumbing that every team would otherwise rebuild themselves.
Any performance figures in these docs are expected ranges under tuned baselines. Actuals depend on your prompt distribution, model mix, and cache hit rate. Audited production benchmarks ship alongside the public beta.
Smart routing
The router is the heart of SkyAIApp. For each request, it picks the 'best' model in single-digit milliseconds. 'Best' is defined by the goal × strategy × budget you declare — not by hard-coded rules.
The decision pipeline (5 stages)
1. Request normalization — Parse messages, tools, and metadata into the internal representation. Compute the prompt fingerprint for cache lookup.
2. Cache lookup — If cache=true, query the semantic vector store. On hit, return immediately and skip the model call.
3. Candidate filtering — Filter the 50+ model pool by budget, modality, and policy constraints.
4. Score & rank — Score candidates across (cost, quality, latency); weights come from the strategy.
5. Execute + fallback — Call the top-ranked model; on timeout/failure, walk the fallback chain. Every event is written to the trace.
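Stage 1's fingerprint can be sketched as a hash over normalized messages. The normalization rule here (trim whitespace, canonical JSON) is an assumption for illustration; SkyAIApp's actual normalization is internal.

```typescript
import { createHash } from "node:crypto";

// Sketch only: assumes normalization means keeping role/content with
// trimmed whitespace, then hashing the canonical JSON of the result.
type Msg = { role: string; content: string };

function promptFingerprint(messages: Msg[]): string {
  const normalized = messages.map((m) => ({
    role: m.role,
    content: m.content.trim(), // trailing/leading whitespace never changes the key
  }));
  return createHash("sha256").update(JSON.stringify(normalized)).digest("hex");
}
```

With a fingerprint like this, trivially different prompts (an extra trailing space) map to the same cache key before the semantic layer is ever consulted.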
Full internals: Architecture deep-dive.
Minimal example
// One call covers normalization, cache, candidate selection, ranking, execute, fallback.
const res = await sky.route({
  goal: "cost", // "cost" | "quality" | "stability"
  strategy: "balanced", // "balanced" | "cost-optimized" | "quality-first"
  messages: [{ role: "user", content: "Summarize..." }],
  // Constraints the router will respect:
  budget: { maxCostUsd: 0.01 },
  fallback: { models: ["claude-haiku-4.5", "gpt-5.5-mini"], maxRetries: 2 },
  cache: true,
  timeoutMs: 10_000,
});
console.log(res.routing.selectedModel); // "claude-haiku-4.5"
console.log(res.routing.decisionReason); // human-readable explanation
console.log(res.traceId); // open in console for full span tree

Goal — what you want
Goal is the high-level intent. It picks the primary axis the router optimizes against. The three goals are mutually exclusive — if you want 'cheap and fast', that's a strategy concern.
goal: "cost"
Cost — minimize spend
Use when
- Data labeling, bulk classification, prompt engineering, internal scripts
Avoid when
- User-facing critical paths — users hate slow/wrong more than expensive
Typical pick
- Typical primary: gpt-5.5-mini · claude-haiku-4.5 · deepseek-v4
goal: "quality"
Quality — best output wins
Use when
- Code generation, deep reasoning, research analysis, contract review
Avoid when
- High-QPS simple classification (cost spirals)
Typical pick
- Typical primary: gpt-5.5-pro · claude-opus-4.7 · gemini-3.1-pro
goal: "stability"
Stability — survival first
Use when
- Financial support, medical consult, SLA-sensitive flows, compliance gateways
Avoid when
- A/B testing new models — stability biases toward conservative candidates
Typical pick
- Biases toward 'in-production ≥ 90 days · provider SLA ≥ 99.9%' candidates
Strategy — what you'll trade
Strategy sets the weights. Under the same goal, different strategies weigh (cost, quality, latency) differently when ranking candidates.
| Strategy | cost | quality | latency | Best for |
|---|---|---|---|---|
| balanced | ⬤⬤⬤ | ⬤⬤⬤ | ⬤⬤⬤ | Default / most production calls |
| cost-optimized | ⬤⬤⬤⬤⬤ | ⬤⬤ | ⬤⬤⬤ | Batch jobs / offline pipelines |
| quality-first | ⬤ | ⬤⬤⬤⬤⬤ | ⬤⬤ | Code generation / deep reasoning |
| latency-optimized | ⬤⬤⬤ | ⬤⬤⬤ | ⬤⬤⬤⬤⬤ | Real-time chat / streaming UI |
💡 If unsure: default to balanced. Monitor traces for a week, then tune based on observed bottlenecks. Full strategy selection guide →
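The weight rows above translate mechanically into a scoring function. A minimal sketch, where the weight values and candidate fields are assumptions, not the router's real internals:

```typescript
// Illustrative only: weight values mirror the dot counts in the table above,
// and each candidate field is a normalized score in [0, 1], higher = better.
type Weights = { cost: number; quality: number; latency: number };

const STRATEGY_WEIGHTS: Record<string, Weights> = {
  "balanced":          { cost: 3, quality: 3, latency: 3 },
  "cost-optimized":    { cost: 5, quality: 2, latency: 3 },
  "quality-first":     { cost: 1, quality: 5, latency: 2 },
  "latency-optimized": { cost: 3, quality: 3, latency: 5 },
};

type Candidate = { model: string; cost: number; quality: number; latency: number };

// Rank candidates by the weighted sum the chosen strategy implies.
function rank(candidates: Candidate[], strategy: string): Candidate[] {
  const w = STRATEGY_WEIGHTS[strategy];
  const score = (c: Candidate) =>
    w.cost * c.cost + w.quality * c.quality + w.latency * c.latency;
  return [...candidates].sort((a, b) => score(b) - score(a));
}
```

The point of the sketch: the same candidate pool produces different winners purely by swapping the weight vector, which is all a strategy is.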
Agent Runtime
The agent runtime wraps 'call LLM + call tool + handle failures + maintain state across steps' into one primitive. Think of it as upgrading the while-loop you'd write yourself into a production-grade construct with timeouts, retries, sandboxing, and traces.
Key features
Tool calling
MCP protocol + built-in web/code/calculator; custom tools via JSON schema.
Multi-step reasoning
Plan → execute → reflect → converge. Per-step timeouts and retries.
Sandbox execution
Code tools run isolated; network/FS access authorized by scope.
State management
Working memory shared across steps; serializable for resume-on-failure.
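Conceptually, the runtime is the loop you would otherwise hand-write. A minimal sketch with invented Step and tool shapes; the real runtime adds per-step timeouts, budget caps, sandboxing, and tracing on top of this skeleton:

```typescript
// Invented shapes for illustration, not SkyAIApp's actual types.
type Step = { action: "tool" | "finish"; tool?: string; input?: unknown; output?: unknown };

async function runAgent(
  plan: (memory: Step[]) => Promise<Step>,            // stand-in for the LLM planner
  tools: Record<string, (input: unknown) => Promise<unknown>>,
  maxSteps = 10,
): Promise<Step[]> {
  const memory: Step[] = [];                          // working memory shared across steps
  for (let i = 0; i < maxSteps; i++) {                // hard upper bound on steps
    const step = await plan(memory);                  // plan the next action from state so far
    if (step.action === "finish") {
      memory.push(step);
      return memory;
    }
    step.output = await tools[step.tool!](step.input); // execute the chosen tool
    memory.push(step);                                 // result feeds the next planning call
  }
  throw new Error("maxSteps exceeded");               // runaway-agent protection
}
```

Because `memory` is plain data, it can be serialized between steps, which is what makes resume-on-failure possible.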
const agent = sky.createAgent({
  // Built-in + custom tools. Each is sandboxed unless you opt out.
  tools: [
    "web_search",
    "calculator",
    "code_exec",
    {
      name: "lookup_invoice",
      description: "Fetch invoice by ID from internal billing system.",
      parameters: { type: "object", properties: { id: { type: "string" } }, required: ["id"] },
      handler: async ({ id }) => billing.find(id),
    },
  ],
  maxSteps: 10, // hard upper bound — protects against runaway agents
  perStepTimeoutMs: 30_000, // each tool/LLM call gets 30s max
  totalBudgetUsd: 0.50, // cumulative cost cap across all steps
  onStep: (step) => console.log(`[${step.number}] ${step.action} → ${step.tool ?? "llm"}`),
});
const result = await agent.run({
  task: "Find this month's overdue invoices and email a polite reminder to each.",
});
console.log(result.output); // final agent response
console.log(result.steps.length); // how many steps it actually took
console.log(result.totalCostUsd); // cumulative spend
console.log(result.traceId); // open in console for full span tree

Guardrails — built-in defenses
Guardrails intercept on both sides of the model call: they detect and redact PII, filter unsafe content, block prompt injection, and write audit logs. The default config covers ~80% of cases; the rest can be customized as policies in the console.
PII detection & redaction
On by default. Hits are logged in the trace (entity type, position, confidence). Modes: redact / hash / block.
Coverage: Email, phone, SSN, credit card, passport, IP, medical IDs — 30+ entity types
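A toy redactor shows the shape of redact mode. The two patterns and the `[ENTITY]` replacement format are illustrative assumptions; the production detector covers 30+ entity types and logs every hit to the trace.

```typescript
// Toy patterns for two of the 30+ entity types (email, US-style phone).
// Real detection is broader and confidence-scored; this only shows the mechanics.
const PII_PATTERNS: Record<string, RegExp> = {
  email: /[\w.+-]+@[\w-]+\.[\w.]+/g,
  phone: /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/g,
};

function redact(text: string): string {
  let out = text;
  for (const [entity, pattern] of Object.entries(PII_PATTERNS)) {
    out = out.replace(pattern, `[${entity.toUpperCase()}]`); // redact mode: replace in place
  }
  return out;
}
```

The hash and block modes differ only in what replaces the match (a stable hash, or rejecting the request outright).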
Content moderation
Runs on both input and output. The detector is decoupled from the model — switching providers doesn't break consistency.
Coverage: Hate, violence, sexual, self-harm, illegal activity — NIST AI RMF compatible labels
Prompt injection defense
Inserts boundary delimiters around system/tool-result content; optional LLM-as-judge second-pass validation.
Coverage: blocks ignore-previous, role-swap, and system-prompt-extraction patterns by default
Semantic caching
Traditional caches do exact-key matches — an extra space and you miss. Semantic cache looks up by embedding similarity: 'same meaning' counts as a hit. Two benefits: hit rate goes up sharply, and rephrased prompts still benefit.
✗ Traditional exact cache
✓ "What is AI?"
✗ "What's artificial intelligence?"
✗ "Explain AI to me"
✗ "What is AI? " (extra space)
✓ SkyAIApp semantic cache
✓ "What is AI?"
✓ "What's artificial intelligence?"
✓ "Explain AI to me"
✓ "What is AI? " (extra space)
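Behind a semantic-cache hit is an embedding-similarity test against a threshold. A toy version, using a bag-of-words stand-in for a real embedding model: it catches the whitespace variant above, while real embeddings also catch full paraphrases.

```typescript
// Toy embedding: word counts. Production systems use a learned embedding
// model; only the cosine-vs-threshold decision is the same.
function embed(text: string): Map<string, number> {
  const v = new Map<string, number>();
  for (const w of text.toLowerCase().match(/[a-z]+/g) ?? []) {
    v.set(w, (v.get(w) ?? 0) + 1);
  }
  return v;
}

function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [w, x] of a) { dot += x * (b.get(w) ?? 0); na += x * x; }
  for (const x of b.values()) nb += x * x;
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// A hit means similarity meets the configured threshold (e.g. 0.92).
const isHit = (query: string, cached: string, threshold = 0.92) =>
  cosine(embed(query), embed(cached)) >= threshold;
```

Lowering the threshold raises the hit rate but also the risk of returning a cached answer to a question with a genuinely different meaning.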
When NOT to enable cache
- When freshness matters: creative writing, code completion, personalized recommendations
- Prompts with dynamic variables (username, timestamp) — pollutes the cache or always misses
- When tuning a new prompt — cache will mask the actual change
const res = await sky.route({
  goal: "cost",
  messages: [...],
  cache: {
    enabled: true,
    similarity: 0.92, // threshold: lower = more hits, more risk of wrong-meaning hits
    ttlSeconds: 60 * 60 * 24, // 24h — tune by content freshness needs
    namespace: "tenant_acme_summarize", // segregate caches per tenant + workflow
  },
});
if (res.routing.cacheHit) {
  console.log("Cache hit — saved", res.routing.savedCostUsd, "USD");
}

Trace — what makes it debuggable
Every request emits a trace_id. Search it in the console and you'll see the decision tree (candidates with scores, policy matches), cache lookups, the actual model call (with tokens + ms), fallback events, guardrail hits, and the final output. This is what separates SkyAIApp from 'just calling OpenAI' — you stop being blind to the why.
A trace includes
- Decision tree with per-candidate scores
- Policy version + matched rules
- Cache lookup (hit/miss + similarity)
- Model HTTP call (status, ms, tokens)
- Fallback trigger + reason
- Guardrail hits (entities, confidence)
- Agent steps, tool calls, sub-traces
Export options
- OpenTelemetry compatible (OTLP gRPC/HTTP)
- Direct integrations: Datadog / Honeycomb / Grafana Tempo
- Configurable sampling rate (cost trade-off)
- PII fields redacted before export
- Retention: 7d default / 90d on Enterprise
// Inspect a trace programmatically
const trace = await sky.traces.get("tr_01JFGYZ7K8M2N3P4Q5R6S7T8U9");
console.log(trace.routing.decisionReason);
// → "balanced + cost: claude-haiku-4.5 wins on $/quality vs alternatives in this band."
console.log(trace.spans.length); // total span count
console.log(trace.summary.totalCostUsd); // total cost (incl. retries + agent steps)
console.log(trace.summary.cacheHitRate); // cache hit rate inside this trace
// Walk the span tree
for (const span of trace.spans) {
  console.log(`[${span.startedAtMs}] ${span.name} (${span.durationMs}ms) → ${span.status}`);
}

Putting it all together
This mental model covers 95% of the situations you'll hit:
// 1. You declare intent.
const res = await sky.route({
  goal: "quality", // what you want
  strategy: "quality-first", // what you'll trade
  budget: { maxCostUsd: 0.05 }, // hard limits
  // 2. Optional: bring your own decisions.
  fallback: { models: [...] },
  cache: { similarity: 0.92 },
  metadata: { tenant: "acme" },
  // 3. The actual work.
  messages: [...],
});
// 4. The router runs the pipeline:
// normalize → cache → filter → score → execute (+ fallback).
// 5. Every step is in the trace.
// Open res.traceId in console for the full decision tree.
console.log(res.routing.selectedModel, res.routing.decisionReason);
// 6. If the workload is multi-step / tool-heavy, swap route() for createAgent().
// Same goal/strategy/budget contracts apply.

Recommended next: Architecture deep-dive (internals of the router) or the strategy selection guide (which strategy fits which workload).
Next steps
Concepts down — now go build or go deeper.