AI FinOps: How a Docs Assistant Found 27% Unit Cost Savings

3/5/2026

AI cost reviews often start too late. By the time finance asks why the monthly provider bill doubled, engineering is already under pressure to cut spend quickly.

The better pattern is to define unit economics before the traffic curve bends. That means answering four questions every week:

  • What would this workload cost on a single-model baseline?
  • Which requests were routed to cheaper or faster pools?
  • How much cost did cache remove?
  • Did any saving harm quality, latency, or reliability?
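The four questions above can be answered from per-request records. This is a minimal sketch, assuming a hypothetical record shape (`tokens`, `price_per_1k_tokens`, `cache_hit`, `quality_ok`); it is not SkyAIApp's actual API.

```python
def weekly_unit_economics(requests, baseline_price_per_1k_tokens):
    """Summarize one week: baseline cost, routed cost, cache savings, quality.

    Each request dict carries: tokens, price_per_1k_tokens (the pool it was
    actually routed to), cache_hit, and quality_ok (passed evaluation).
    """
    # Q1: what would this workload cost on a single-model baseline?
    baseline_cost = sum(
        r["tokens"] / 1000 * baseline_price_per_1k_tokens for r in requests
    )
    # Q2: actual cost after routing (cache hits cost nothing at the provider)
    actual_cost = sum(
        0.0 if r["cache_hit"] else r["tokens"] / 1000 * r["price_per_1k_tokens"]
        for r in requests
    )
    # Q3: how much cost did cache remove?
    cache_savings = sum(
        r["tokens"] / 1000 * r["price_per_1k_tokens"]
        for r in requests if r["cache_hit"]
    )
    # Q4: did any saving harm quality?
    quality_hold = sum(r["quality_ok"] for r in requests) / len(requests)
    return {
        "baseline_cost": round(baseline_cost, 2),
        "actual_cost": round(actual_cost, 2),
        "cache_savings": round(cache_savings, 2),
        "quality_hold": round(quality_hold, 4),
    }
```

Running this every week against the same baseline price makes the savings number comparable across policy versions.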

This anonymized scenario follows a B2B SaaS team running an AI documentation assistant for admins, support agents, and customer success managers.

[Figure: example AI FinOps savings report chart]

A useful savings report compares the actual routed workload against a stable single-model baseline.

The workload

The assistant answered product configuration questions, summarized release notes, and generated short implementation snippets from internal docs.

  Signal                Value
  --------------------  --------------------------------------
  Weekly requests       About 96k
  Average prompt size   1,450 tokens
  Average output size   420 tokens
  Main users            Support, success, solutions engineers
  Baseline model        One high-quality model for all traffic

The team had a classic concern: documentation answers must be correct. A bad answer could waste an engineer's time or create a bad customer recommendation. So the goal was not aggressive cost cutting. It was controlled cost reduction with quality gates.

The baseline

SkyAIApp replayed a sampled week against a single-model policy to create a baseline. That baseline did not change during the first optimization cycle, which made the savings credible.
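A baseline like this is just the sampled week priced against one model's rate card. The sketch below uses made-up per-token rates, so its output will not match the report's $3.71 figure; substitute your provider's actual prices.

```python
# Assumed single-model rate card (illustrative, not a real provider's prices)
BASELINE_PRICE = {"input_per_1k": 0.0010, "output_per_1k": 0.0050}

def baseline_cost_per_1k_requests(avg_prompt_tokens, avg_output_tokens,
                                  price=BASELINE_PRICE):
    """Price the average request on the baseline model, scaled to 1k requests."""
    per_request = (
        avg_prompt_tokens / 1000 * price["input_per_1k"]
        + avg_output_tokens / 1000 * price["output_per_1k"]
    )
    return per_request * 1000

# Using the workload numbers from the table above (1,450 in / 420 out)
unit_cost = baseline_cost_per_1k_requests(1450, 420)
```

Freezing this number for the whole optimization cycle, as the team did, is what keeps later savings claims honest.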

The team then compared production traffic against the baseline using three buckets:

  Bucket               What it measures
  -------------------  ------------------------------------------------------
  Routing mix          Savings from sending simpler tasks to cheaper pools
  Cache wins           Savings from semantic reuse of repeated doc questions
  Fallback efficiency  Savings from avoiding long retries and duplicate calls

The dashboard also tracked "quality hold": the share of evaluated answers that stayed above the team's acceptance threshold.
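One way to sketch the bucket attribution and the quality-hold metric, assuming hypothetical per-request records (the field names here are invented for illustration):

```python
def savings_by_bucket(records, baseline_unit_cost):
    """Attribute baseline-vs-actual savings to routing, cache, and fallback."""
    buckets = {"routing_mix": 0.0, "cache_wins": 0.0, "fallback_efficiency": 0.0}
    for r in records:
        if r["cache_hit"]:
            # the whole baseline call was avoided
            buckets["cache_wins"] += baseline_unit_cost
        else:
            # cheaper pool vs. what the baseline model would have charged
            buckets["routing_mix"] += baseline_unit_cost - r["actual_cost"]
        # retries and duplicate calls the fallback policy avoided
        buckets["fallback_efficiency"] += r.get("retry_cost_avoided", 0.0)
    return buckets

def quality_hold(evaluated):
    """Share of evaluated answers at or above the acceptance threshold."""
    return sum(e["score"] >= e["threshold"] for e in evaluated) / len(evaluated)
```

Keeping the attribution per-request, rather than as one aggregate delta, is what lets the dashboard show which bucket a regression came from.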

Policy changes

The first production policy had four rules:

  1. Route doc navigation and short factual answers to the cost pool.
  2. Route migration guidance and code snippets to the balanced pool.
  3. Route low retrieval-confidence requests to the quality pool.
  4. Disable cache whenever the docs revision or permission scope changed.

That last rule mattered. Many AI cost projects fail because cache gets treated as a universal accelerator. For documentation, cache must respect version and access boundaries.
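The four rules can be sketched as a single routing function. The intent labels, pool names, and confidence threshold below are assumptions for illustration, not SkyAIApp's actual policy API.

```python
def route(intent, retrieval_confidence, docs_revision_changed, scope_changed):
    """Return (pool, use_cache) for one request under the four policy rules."""
    # Rule 4: never serve cached answers across a docs revision or
    # permission-scope boundary
    use_cache = not (docs_revision_changed or scope_changed)
    # Rule 3: low retrieval confidence escalates to the quality pool,
    # regardless of intent
    if retrieval_confidence < 0.6:  # threshold is an assumed value
        return "quality", use_cache
    # Rule 1: simple lookups go to the cost pool
    if intent in {"doc_navigation", "short_factual"}:
        return "cost", use_cache
    # Rule 2: migration guidance and code snippets go to the balanced pool
    if intent in {"migration_guidance", "code_snippet"}:
        return "balanced", use_cache
    # Unrecognized intents default to the safest pool
    return "quality", use_cache
```

Note that the escalation rule is checked first: a cheap intent with weak retrieval still gets the quality pool.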

Results after 30 days

  Metric                    Baseline     Routed policy
  ------------------------  -----------  -------------
  Cost per 1k requests      $3.71        $2.70
  Net savings               0%           27.2%
  P95 latency               1.38s        1.01s
  Successful response rate  98.7%        99.2%
  Semantic cache hit rate   0%           24%
  Quality hold              100% target  99.4%
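The headline number is verifiable from the cost-per-1k rows alone:

```python
# Cross-check: net savings implied by the cost-per-1k-requests rows
baseline, routed = 3.71, 2.70
savings_pct = (baseline - routed) / baseline * 100
# rounds to 27.2, matching the reported net savings
```

A report where the components do not reproduce the headline is the first thing a finance reviewer will catch.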

The quality hold dip was reviewed manually. Most misses came from doc pages that had recently changed but were not yet indexed. The fix was operational, not model-related: tighter indexing alerts and cache invalidation on docs deploys.
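One way to make that invalidation automatic is to fold the docs revision and permission scope into the cache key itself, so a docs deploy orphans stale entries without an explicit purge. A minimal sketch, with invented field names:

```python
import hashlib

def cache_key(question, docs_revision, permission_scope):
    """Semantic-cache key scoped to a docs revision and a permission scope."""
    # Cheap normalization so trivially rephrased whitespace/case still hits
    normalized = " ".join(question.lower().split())
    raw = f"{docs_revision}|{permission_scope}|{normalized}"
    return hashlib.sha256(raw.encode()).hexdigest()
```

Bumping `docs_revision` on every docs deploy means no reader can ever receive an answer cached against an older version of the documentation or a broader permission scope.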

What finance liked

The finance team did not need model names or prompt theory. They needed a repeatable explanation:

  • Baseline cost if nothing changed.
  • Actual routed cost.
  • Savings by component.
  • Quality and reliability checks proving the savings were not cosmetic.

Because SkyAIApp reported unit cost by policy version, finance could see that the savings persisted after rollout, not just during a cherry-picked test window.

The operating cadence

The team settled into a weekly AI FinOps review:

  • Product reviews top intent families and quality misses.
  • Engineering reviews cache invalidation, fallback events, and latency drift.
  • Finance reviews cost per tenant and cost per 1k requests.
  • Leadership reviews whether usage growth is improving or weakening gross margin.

That cadence changed the conversation. AI spend stopped being an unpredictable invoice and became an operating metric.

For production AI apps, FinOps is not a spreadsheet after the fact. It belongs in the runtime, next to routing policy, cache, tracing, and evaluations.