AI FinOps: How a Docs Assistant Found 27% Unit Cost Savings
AI cost reviews often start too late. By the time finance asks why the monthly provider bill doubled, engineering is already under pressure to cut spend quickly.
The better pattern is to define unit economics before the traffic curve bends. That means answering four questions every week:
- What would this workload cost on a single-model baseline?
- Which requests were routed to cheaper or faster pools?
- How much cost did cache remove?
- Did any saving harm quality, latency, or reliability?
This anonymized scenario follows a B2B SaaS team running an AI documentation assistant for admins, support agents, and customer success managers.
A useful savings report compares the actual routed workload against a stable single-model baseline.
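Concretely, that comparison can be computed from per-request logs. The sketch below is a minimal version, assuming a small record per request (pool, token counts, cache hit, quality pass) and illustrative per-million-token prices; the field names and rates are placeholders, not SkyAIApp's schema or any provider's actual pricing.

```python
from dataclasses import dataclass

# Illustrative per-1M-token prices; substitute your provider's real rates.
PRICE_PER_M = {
    "baseline": {"in": 1.10, "out": 5.00},   # the single high-quality model
    "quality":  {"in": 1.10, "out": 5.00},
    "balanced": {"in": 0.40, "out": 1.60},
    "cost":     {"in": 0.10, "out": 0.40},
}

@dataclass
class Request:
    pool: str            # "quality" | "balanced" | "cost"
    prompt_tokens: int
    output_tokens: int
    cache_hit: bool      # answered from the semantic cache, no model call
    quality_pass: bool   # stayed above the team's acceptance threshold

def cost(pool: str, prompt_tokens: int, output_tokens: int) -> float:
    p = PRICE_PER_M[pool]
    return (prompt_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000

def weekly_report(requests: list[Request]) -> dict:
    baseline = sum(cost("baseline", r.prompt_tokens, r.output_tokens) for r in requests)
    routed = sum(0.0 if r.cache_hit else cost(r.pool, r.prompt_tokens, r.output_tokens)
                 for r in requests)
    held = sum(r.quality_pass for r in requests) / max(len(requests), 1)
    return {
        "baseline_cost_usd": round(baseline, 2),                              # question 1
        "routed_cost_usd": round(routed, 2),                                  # questions 2 and 3
        "net_savings_pct": round(100 * (baseline - routed) / max(baseline, 1e-9), 1),
        "quality_hold_pct": round(100 * held, 1),                             # question 4
    }
```

The report answers the four weekly questions in one pass, which is the artifact the rest of this scenario is built on.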
The workload
The assistant answered product configuration questions, summarized release notes, and generated short implementation snippets from internal docs.
| Signal | Value |
|---|---|
| Weekly requests | About 96k |
| Average prompt size | 1,450 tokens |
| Average output size | 420 tokens |
| Main users | Support, success, solutions engineers |
| Baseline model | One high-quality model for all traffic |
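A back-of-the-envelope baseline follows directly from those volumes. The rates below are assumed, chosen only so the result lands near the $3.71 per 1k requests reported later; substitute real provider pricing.

```python
WEEKLY_REQUESTS = 96_000
PROMPT_TOKENS, OUTPUT_TOKENS = 1_450, 420
IN_PRICE_PER_M, OUT_PRICE_PER_M = 1.10, 5.00   # assumed $/1M tokens

per_request = (PROMPT_TOKENS * IN_PRICE_PER_M + OUTPUT_TOKENS * OUT_PRICE_PER_M) / 1_000_000
print(f"cost per request:     ${per_request:.4f}")                     # ~$0.0037
print(f"cost per 1k requests: ${per_request * 1_000:.2f}")             # ~$3.70
print(f"weekly baseline cost: ${per_request * WEEKLY_REQUESTS:,.2f}")  # ~$355
```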
The team had a classic concern: documentation answers must be correct. A bad answer could waste an engineer's time or lead to a bad customer recommendation. So the goal was not aggressive cost cutting but controlled cost reduction with quality gates.
The baseline
SkyAIApp replayed a sampled week against a single-model policy to create a baseline. That baseline did not change during the first optimization cycle, which made the savings credible.
The team then compared production traffic against the baseline using three buckets:
| Bucket | What it measures |
|---|---|
| Routing mix | Savings from sending simpler tasks to cheaper pools |
| Cache wins | Savings from semantic reuse of repeated doc questions |
| Fallback efficiency | Savings from avoiding long retries and duplicate calls |
The dashboard also tracked "quality hold": the share of evaluated answers that stayed above the team's acceptance threshold.
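One way to produce those buckets is to attribute each request's saving at log time. The sketch below uses the same illustrative prices as the baseline sketch; the `avoided_calls` field (retries or duplicate calls the fallback policy suppressed) is an assumption about what the logs record, not a SkyAIApp field.

```python
from collections import defaultdict

# Same illustrative prices as the baseline sketch: (input, output) $/1M tokens.
PRICE_PER_M = {"baseline": (1.10, 5.00), "quality": (1.10, 5.00),
               "balanced": (0.40, 1.60), "cost": (0.10, 0.40)}

def call_cost(pool: str, prompt_tokens: int, output_tokens: int) -> float:
    p_in, p_out = PRICE_PER_M[pool]
    return (prompt_tokens * p_in + output_tokens * p_out) / 1_000_000

def attribute_savings(requests: list[dict]) -> dict:
    """Split (baseline - routed) spend into the three report buckets.

    Each request is a dict with assumed fields: pool, prompt_tokens,
    output_tokens, cache_hit, and avoided_calls (retries or duplicate
    calls the fallback policy suppressed).
    """
    buckets = defaultdict(float)
    for r in requests:
        base = call_cost("baseline", r["prompt_tokens"], r["output_tokens"])
        if r["cache_hit"]:
            buckets["cache_wins"] += base                    # answered without a model call
        else:
            routed = call_cost(r["pool"], r["prompt_tokens"], r["output_tokens"])
            buckets["routing_mix"] += base - routed          # cheaper pool than the baseline
            buckets["fallback_efficiency"] += r["avoided_calls"] * routed
    return dict(buckets)
```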
Policy changes
The first production policy had four rules:
- Route doc navigation and short factual answers to the cost pool.
- Route migration guidance and code snippets to the balanced pool.
- Route low retrieval-confidence requests to the quality pool.
- Disable cache whenever the docs revision or permission scope changed.
That last rule mattered. Many AI cost projects fail because cache gets treated as a universal accelerator. For documentation, cache must respect version and access boundaries.
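A minimal sketch of the four rules, assuming intent labels, a retrieval-confidence threshold, and a cache key layout that are placeholders rather than SkyAIApp's actual configuration. Folding the docs revision and permission scope into the key has the same effect as disabling cache on change: old entries simply stop matching. (The exact-match hash keeps the sketch short; a semantic cache would gate candidate lookups on revision and scope instead.)

```python
import hashlib

LOW_CONFIDENCE = 0.55   # assumed threshold, not the team's actual setting

def route(intent: str, retrieval_confidence: float) -> str:
    if retrieval_confidence < LOW_CONFIDENCE:
        return "quality"                                  # rule 3: uncertain retrieval
    if intent in {"doc_navigation", "short_factual"}:
        return "cost"                                     # rule 1
    if intent in {"migration_guidance", "code_snippet"}:
        return "balanced"                                 # rule 2
    return "balanced"                                     # default for unlisted intents

def cache_key(question: str, docs_revision: str, permission_scope: str) -> str:
    # Rule 4: docs revision and permission scope are part of the key, so a docs
    # deploy or a scope change means stale or out-of-scope entries never match.
    raw = f"{docs_revision}|{permission_scope}|{question.strip().lower()}"
    return hashlib.sha256(raw.encode()).hexdigest()
```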
Results after 30 days
| Metric | Baseline | Routed policy |
|---|---|---|
| Cost per 1k requests | $3.71 | $2.70 |
| Net savings | 0% | 27.2% |
| P95 latency | 1.38s | 1.01s |
| Successful response rate | 98.7% | 99.2% |
| Semantic cache hit rate | 0% | 24% |
| Quality hold | 100% target | 99.4% |
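The headline number follows directly from the unit costs in the table:

```python
baseline_per_1k, routed_per_1k = 3.71, 2.70                # from the table above
savings_pct = 100 * (baseline_per_1k - routed_per_1k) / baseline_per_1k
print(f"net savings: {savings_pct:.1f}%")                  # 27.2%

weekly_savings = (baseline_per_1k - routed_per_1k) * 96    # ~96k requests/week
print(f"weekly savings at this volume: ~${weekly_savings:.2f}")  # ~$97
```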
The quality hold dip was reviewed manually. Most misses came from doc pages that had recently changed but were not yet indexed. The fix was operational, not model-related: tighter indexing alerts and cache invalidation on docs deploys.
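One hedged way to wire that fix: bump the docs revision on every deploy (so cache keys scoped to the old revision stop matching) and alert when the search index lags the deploy beyond a budget. The hook and the 15-minute budget below are assumptions, not an existing SkyAIApp API.

```python
import time

INDEX_LAG_BUDGET_S = 15 * 60   # assumed budget for the index to catch up

def on_docs_deploy(state: dict, new_revision: str) -> None:
    # Called by the docs publishing pipeline after each deploy.
    state["docs_revision"] = new_revision      # new cache keys from this point on
    state["deployed_at"] = time.time()

def check_index_lag(state: dict, indexed_revision: str, alert) -> None:
    # Indexing alert: page the on-call if retrieval is still serving the old
    # docs revision past the budget, instead of silently answering from stale pages.
    lag = time.time() - state["deployed_at"]
    if indexed_revision != state["docs_revision"] and lag > INDEX_LAG_BUDGET_S:
        alert(f"docs index is {lag:.0f}s behind deploy {state['docs_revision']}")
```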
What finance liked
The finance team did not need model names or prompt theory. They needed a repeatable explanation:
- Baseline cost if nothing changed.
- Actual routed cost.
- Savings by component.
- Quality and reliability checks proving the savings were not cosmetic.
Because SkyAIApp reported unit cost by policy version, finance could see that the savings persisted after rollout, not just during a cherry-picked test window.
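Reporting by policy version is a small aggregation if each billing row carries the version that served it. The field names below are placeholders for whatever the export actually contains.

```python
from collections import defaultdict

def unit_cost_by_policy(rows: list[dict]) -> dict:
    """rows: billing records with assumed fields week, policy_version, cost_usd, requests."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        key = (row["week"], row["policy_version"])
        totals[key][0] += row["cost_usd"]
        totals[key][1] += row["requests"]
    # Cost per 1k requests per (week, policy_version): savings should persist
    # across weeks after rollout, not just in the test window.
    return {key: round(1_000 * c / max(n, 1), 2) for key, (c, n) in sorted(totals.items())}
```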
The operating cadence
The team settled into a weekly AI FinOps review:
- Product reviews top intent families and quality misses.
- Engineering reviews cache invalidation, fallback events, and latency drift.
- Finance reviews cost per tenant and cost per 1k requests.
- Leadership reviews whether usage growth is improving or weakening gross margin.
That cadence changed the conversation. AI spend stopped being an unpredictable invoice and became an operating metric.
For production AI apps, FinOps is not a spreadsheet after the fact. It belongs in the runtime, next to routing policy, cache, tracing, and evaluations.