How SkyAIApp thinks
Six concepts explain the whole platform: the Goal × Strategy decision model, the routing pipeline, the agent runtime, guardrails, the semantic cache, and the trace that makes it all debuggable. Once these click, everything else is detail.
TL;DR
SkyAIApp turns 'calling an LLM' from a single API call into a governed decision flow. You declare intent (goal + strategy + budget); SkyAIApp picks the model, hits the cache, runs fallbacks, traces everything, and meters spend — consolidating plumbing that every team would otherwise rebuild themselves.
Any performance figures in these docs are expected ranges under tuned baselines. Actuals depend on your prompt distribution, model mix, and cache hit rate. Audited production benchmarks ship alongside the public beta.
Smart routing
The router is the heart of SkyAIApp. For each request, it picks the 'best' model in single-digit milliseconds. 'Best' is defined by the goal × strategy × budget you declare — not by hard-coded rules.
The decision pipeline (5 stages)
1. Request normalization — Parse messages, tools, and metadata into the internal representation. Compute the prompt fingerprint for cache lookup.
2. Cache lookup — If cache=true, query the semantic vector store. On hit, return immediately and skip the model call.
3. Candidate filtering — Filter the 50+ model pool by budget, modality, and policy constraints.
4. Score & rank — Score candidates across (cost, quality, latency); weights come from the strategy.
5. Execute + fallback — Call the top-ranked model; on timeout/failure, walk the fallback chain. Every event is written to the trace.
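Stage 1's fingerprint can be sketched as a hash over normalized messages. The normalization rule here (trim whitespace, canonical JSON) is an assumption for illustration; SkyAIApp's actual normalization is internal.

```typescript
import { createHash } from "node:crypto";

// Sketch only: assumes normalization means keeping role/content with
// trimmed whitespace, then hashing the canonical JSON of the result.
type Msg = { role: string; content: string };

function promptFingerprint(messages: Msg[]): string {
  const normalized = messages.map((m) => ({
    role: m.role,
    content: m.content.trim(), // trailing/leading whitespace never changes the key
  }));
  return createHash("sha256").update(JSON.stringify(normalized)).digest("hex");
}
```

With a fingerprint like this, trivially different prompts (an extra trailing space) map to the same cache key before the semantic layer is ever consulted.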
Full internals: Architecture deep-dive.
Minimal example
// One call covers normalization, cache, candidate selection, ranking, execute, fallback.
const res = await sky.route({
  goal: "cost", // "cost" | "quality" | "stability"
  strategy: "balanced", // "balanced" | "cost-optimized" | "quality-first"
  messages: [{ role: "user", content: "Summarize..." }],
  // Constraints the router will respect:
  budget: { maxCostUsd: 0.01 },
  fallback: { models: ["claude-haiku-4.5", "gpt-5.5-mini"], maxRetries: 2 },
  cache: true,
  timeoutMs: 10_000,
});
console.log(res.routing.selectedModel); // "claude-haiku-4.5"
console.log(res.routing.decisionReason); // human-readable explanation
console.log(res.traceId); // open in console for full span tree

Goal — what you want
Goal is the high-level intent. It picks the primary axis the router optimizes against. The three goals are mutually exclusive — if you want 'cheap and fast', that's a strategy concern.
goal: "cost"
Cost — minimize spend
Use when
- Data labeling, bulk classification, prompt engineering, internal scripts
Avoid when
- User-facing critical paths — users hate slow/wrong more than expensive
Typical pick
- Typical primary: gpt-5.5-mini · claude-haiku-4.5 · deepseek-v4
goal: "quality"
Quality — best output wins
Use when
- Code generation, deep reasoning, research analysis, contract review
Avoid when
- High-QPS simple classification (cost spirals)
Typical pick
- Typical primary: gpt-5.5-pro · claude-opus-4.7 · gemini-3.1-pro
goal: "stability"
Stability — survival first
Use when
- Financial support, medical consult, SLA-sensitive flows, compliance gateways
Avoid when
- A/B testing new models — stability biases toward conservative candidates
Typical pick
- Biases toward 'in-production ≥ 90 days · provider SLA ≥ 99.9%' candidates
Strategy — what you'll trade
Strategy sets the weights. Under the same goal, different strategies weigh (cost, quality, latency) differently when ranking candidates.
| Strategy | cost | quality | latency | Best for |
|---|---|---|---|---|
| balanced | ⬤⬤⬤ | ⬤⬤⬤ | ⬤⬤⬤ | Default / most production calls |
| cost-optimized | ⬤⬤⬤⬤⬤ | ⬤⬤ | ⬤⬤⬤ | Batch jobs / offline pipelines |
| quality-first | ⬤ | ⬤⬤⬤⬤⬤ | ⬤⬤ | Code generation / deep reasoning |
| latency-optimized | ⬤⬤⬤ | ⬤⬤⬤ | ⬤⬤⬤⬤⬤ | Real-time chat / streaming UI |
💡 If unsure: default to balanced. Monitor traces for a week, then tune based on observed bottlenecks. Full strategy selection guide →
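The weight rows above translate mechanically into a scoring function. A minimal sketch, where the weight values and candidate fields are assumptions, not the router's real internals:

```typescript
// Illustrative only: weight values mirror the dot counts in the table above,
// and each candidate field is a normalized score in [0, 1], higher = better.
type Weights = { cost: number; quality: number; latency: number };

const STRATEGY_WEIGHTS: Record<string, Weights> = {
  "balanced":          { cost: 3, quality: 3, latency: 3 },
  "cost-optimized":    { cost: 5, quality: 2, latency: 3 },
  "quality-first":     { cost: 1, quality: 5, latency: 2 },
  "latency-optimized": { cost: 3, quality: 3, latency: 5 },
};

type Candidate = { model: string; cost: number; quality: number; latency: number };

// Rank candidates by the weighted sum the chosen strategy implies.
function rank(candidates: Candidate[], strategy: string): Candidate[] {
  const w = STRATEGY_WEIGHTS[strategy];
  const score = (c: Candidate) =>
    w.cost * c.cost + w.quality * c.quality + w.latency * c.latency;
  return [...candidates].sort((a, b) => score(b) - score(a));
}
```

The point of the sketch: the same candidate pool produces different winners purely by swapping the weight vector, which is all a strategy is.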
Agent Runtime
The agent runtime wraps 'call LLM + call tool + handle failures + maintain state across steps' into one primitive. Think of it as upgrading the while-loop you'd write yourself into a production-grade construct with timeouts, retries, sandboxing, and traces.
Key features
Tool calling
MCP protocol + built-in web/code/calculator; custom tools via JSON schema.
Multi-step reasoning
Plan → execute → reflect → converge. Per-step timeouts and retries.
Sandbox execution
Code tools run isolated; network/FS access authorized by scope.
State management
Working memory shared across steps; serializable for resume-on-failure.
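Conceptually, the runtime is the loop you would otherwise hand-write. A minimal sketch with invented Step and tool shapes; the real runtime adds per-step timeouts, budget caps, sandboxing, and tracing on top of this skeleton:

```typescript
// Invented shapes for illustration, not SkyAIApp's actual types.
type Step = { action: "tool" | "finish"; tool?: string; input?: unknown; output?: unknown };

async function runAgent(
  plan: (memory: Step[]) => Promise<Step>,            // stand-in for the LLM planner
  tools: Record<string, (input: unknown) => Promise<unknown>>,
  maxSteps = 10,
): Promise<Step[]> {
  const memory: Step[] = [];                          // working memory shared across steps
  for (let i = 0; i < maxSteps; i++) {                // hard upper bound on steps
    const step = await plan(memory);                  // plan the next action from state so far
    if (step.action === "finish") {
      memory.push(step);
      return memory;
    }
    step.output = await tools[step.tool!](step.input); // execute the chosen tool
    memory.push(step);                                 // result feeds the next planning call
  }
  throw new Error("maxSteps exceeded");               // runaway-agent protection
}
```

Because `memory` is plain data, it can be serialized between steps, which is what makes resume-on-failure possible.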
const agent = sky.createAgent({
  // Built-in + custom tools. Each is sandboxed unless you opt out.
  tools: [
    "web_search",
    "calculator",
    "code_exec",
    {
      name: "lookup_invoice",
      description: "Fetch invoice by ID from internal billing system.",
      parameters: { type: "object", properties: { id: { type: "string" } }, required: ["id"] },
      handler: async ({ id }) => billing.find(id),
    },
  ],
  maxSteps: 10, // hard upper bound — protects against runaway agents
  perStepTimeoutMs: 30_000, // each tool/LLM call gets 30s max
  totalBudgetUsd: 0.50, // cumulative cost cap across all steps
  onStep: (step) => console.log(`[${step.number}] ${step.action} → ${step.tool ?? "llm"}`),
});
const result = await agent.run({
  task: "Find this month's overdue invoices and email a polite reminder to each.",
});
console.log(result.output); // final agent response
console.log(result.steps.length); // how many steps it actually took
console.log(result.totalCostUsd); // cumulative spend
console.log(result.traceId); // open in console for full span tree

Guardrails — built-in defenses
Guardrails intercept on both sides of the model call: they detect and redact PII, filter unsafe content, block prompt injection, and write audit logs. The default config covers ~80% of cases; the rest can be customized as policies in the console.
PII detection & redaction
On by default. Hits are logged in the trace (entity type, position, confidence). Modes: redact / hash / block.
Coverage: Email, phone, SSN, credit card, passport, IP, medical IDs — 30+ entity types
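A toy redactor shows the shape of redact mode. The two patterns and the `[ENTITY]` replacement format are illustrative assumptions; the production detector covers 30+ entity types and logs every hit to the trace.

```typescript
// Toy patterns for two of the 30+ entity types (email, US-style phone).
// Real detection is broader and confidence-scored; this only shows the mechanics.
const PII_PATTERNS: Record<string, RegExp> = {
  email: /[\w.+-]+@[\w-]+\.[\w.]+/g,
  phone: /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/g,
};

function redact(text: string): string {
  let out = text;
  for (const [entity, pattern] of Object.entries(PII_PATTERNS)) {
    out = out.replace(pattern, `[${entity.toUpperCase()}]`); // redact mode: replace in place
  }
  return out;
}
```

The hash and block modes differ only in what replaces the match (a stable hash, or rejecting the request outright).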
Content moderation
Runs on both input and output. The detector is decoupled from the model — switching providers doesn't break consistency.
Coverage: Hate, violence, sexual, self-harm, illegal activity — NIST AI RMF compatible labels
Prompt injection defense
Inserts boundary delimiters around system/tool-result content; optional LLM-as-judge second-pass validation.
Coverage: blocks ignore-previous, role-swap, and system-prompt-extraction patterns by default
Semantic caching
Traditional caches do exact-key matches — an extra space and you miss. Semantic cache looks up by embedding similarity: 'same meaning' counts as a hit. Two benefits: hit rate goes up sharply, and rephrased prompts still benefit.
✗ Traditional exact cache
✓ "What is AI?"
✗ "What's artificial intelligence?"
✗ "Explain AI to me"
✗ "What is AI? " (extra space)
✓ SkyAIApp semantic cache
✓ "What is AI?"
✓ "What's artificial intelligence?"
✓ "Explain AI to me"
✓ "What is AI? " (extra space)
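Behind a semantic-cache hit is an embedding-similarity test against a threshold. A toy version, using a bag-of-words stand-in for a real embedding model: it catches the whitespace variant above, while real embeddings also catch full paraphrases.

```typescript
// Toy embedding: word counts. Production systems use a learned embedding
// model; only the cosine-vs-threshold decision is the same.
function embed(text: string): Map<string, number> {
  const v = new Map<string, number>();
  for (const w of text.toLowerCase().match(/[a-z]+/g) ?? []) {
    v.set(w, (v.get(w) ?? 0) + 1);
  }
  return v;
}

function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [w, x] of a) { dot += x * (b.get(w) ?? 0); na += x * x; }
  for (const x of b.values()) nb += x * x;
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// A hit means similarity meets the configured threshold (e.g. 0.92).
const isHit = (query: string, cached: string, threshold = 0.92) =>
  cosine(embed(query), embed(cached)) >= threshold;
```

Lowering the threshold raises the hit rate but also the risk of returning a cached answer to a question with a genuinely different meaning.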
When NOT to enable cache
- When freshness matters: creative writing, code completion, personalized recommendations
- Prompts with dynamic variables (username, timestamp) — pollutes the cache or always misses
- When tuning a new prompt — cache will mask the actual change
const res = await sky.route({
  goal: "cost",
  messages: [...],
  cache: {
    enabled: true,
    similarity: 0.92, // threshold: lower = more hits, more risk of wrong-meaning hits
    ttlSeconds: 60 * 60 * 24, // 24h — tune by content freshness needs
    namespace: "tenant_acme_summarize", // segregate caches per tenant + workflow
  },
});
if (res.routing.cacheHit) {
  console.log("Cache hit — saved", res.routing.savedCostUsd, "USD");
}

Trace — what makes it debuggable
Every request emits a trace_id. Search it in the console and you'll see the decision tree (candidates with scores, policy matches), cache lookups, the actual model call (with tokens + ms), fallback events, guardrail hits, and the final output. This is what separates SkyAIApp from 'just calling OpenAI' — you stop being blind to the why.
A trace includes
- Decision tree with per-candidate scores
- Policy version + matched rules
- Cache lookup (hit/miss + similarity)
- Model HTTP call (status, ms, tokens)
- Fallback trigger + reason
- Guardrail hits (entities, confidence)
- Agent steps, tool calls, sub-traces
Export options
- OpenTelemetry compatible (OTLP gRPC/HTTP)
- Direct integrations: Datadog / Honeycomb / Grafana Tempo
- Configurable sampling rate (cost trade-off)
- PII fields redacted before export
- Retention: 7d default / 90d on Enterprise
// Inspect a trace programmatically
const trace = await sky.traces.get("tr_01JFGYZ7K8M2N3P4Q5R6S7T8U9");
console.log(trace.routing.decisionReason);
// → "balanced + cost: claude-haiku-4.5 wins on $/quality vs alternatives in this band."
console.log(trace.spans.length); // total span count
console.log(trace.summary.totalCostUsd); // total cost (incl. retries + agent steps)
console.log(trace.summary.cacheHitRate); // cache hit rate inside this trace
// Walk the span tree
for (const span of trace.spans) {
  console.log(`[${span.startedAtMs}] ${span.name} (${span.durationMs}ms) → ${span.status}`);
}

Putting it all together
This mental model covers 95% of the situations you'll hit:
// 1. You declare intent.
const res = await sky.route({
  goal: "quality", // what you want
  strategy: "quality-first", // what you'll trade
  budget: { maxCostUsd: 0.05 }, // hard limits
  // 2. Optional: bring your own decisions.
  fallback: { models: [...] },
  cache: { similarity: 0.92 },
  metadata: { tenant: "acme" },
  // 3. The actual work.
  messages: [...],
});
// 4. The router runs the pipeline:
// normalize → cache → filter → score → execute (+ fallback).
// 5. Every step is in the trace.
// Open res.traceId in console for the full decision tree.
console.log(res.routing.selectedModel, res.routing.decisionReason);
// 6. If the workload is multi-step / tool-heavy, swap route() for createAgent().
// Same goal/strategy/budget contracts apply.

Recommended next: Architecture deep-dive (internals of the router) or the strategy selection guide (which strategy fits which workload).
Next steps
Concepts down — now go build or go deeper.