RAG Quality Gates: Reducing Unsupported Answers in Knowledge Apps
RAG systems fail in subtle ways. Retrieval can return the wrong document. The model can answer beyond the evidence. A newly updated policy can conflict with cached context. And a confident but unsupported answer is the most dangerous failure mode of all.
Production teams do not need a philosophical debate about hallucinations. They need a measurable quality system.
This anonymized composite scenario follows a knowledge-base assistant used by sales engineers and support teams.
Quality gates combine retrieval confidence, document freshness, answerability, citations, evals, and fallback paths.
The problem
The assistant looked good in early testing, but production traffic exposed three failure modes:
- It answered questions where retrieval confidence was low.
- It mixed old and new policy documents after weekly releases.
- It gave confident summaries without enough citations.
The team sampled answers weekly and found that 8.7% were unsupported: not necessarily wrong, but not adequately grounded in retrieved evidence.
That rate was too high for customer-facing workflows.
The quality-gate design
SkyAIApp helped the team separate RAG quality into gates:
| Gate | Decision |
|---|---|
| Retrieval confidence | Is the retrieved context strong enough to answer? |
| Document freshness | Does the context match the latest indexed revision? |
| Answerability | Should the assistant answer, ask a follow-up, or refuse? |
| Citation coverage | Are key claims linked to retrieved sources? |
| Eval score | Does the answer pass task-specific checks? |
The goal was not to make every answer longer. The goal was to make unsupported answers harder to ship.
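As a rough sketch, the gates can run as a short-circuiting chain: the first gate that fails decides the fallback action. The thresholds, field names, and actions below are illustrative assumptions, not SkyAIApp's API:

```python
from dataclasses import dataclass

# Illustrative thresholds -- tuned per workload, not SkyAIApp defaults.
MIN_RETRIEVAL_SCORE = 0.62
MIN_CITATION_COVERAGE = 0.80
MIN_EVAL_SCORE = 0.75

@dataclass
class Chunk:
    doc_id: str
    revision: str         # revision this chunk was indexed from
    latest_revision: str  # current revision of the source document
    score: float          # retrieval similarity score

@dataclass
class GateDecision:
    ship: bool
    gate: str    # first gate that failed, or "all_passed"
    action: str  # "answer", "clarify", "refuse", or "escalate"

def run_quality_gates(chunks: list[Chunk],
                      answerable: bool,
                      citation_coverage: float,
                      eval_score: float) -> GateDecision:
    """Run the five gates in order; the first failure decides the action."""
    # Gate 1: retrieval confidence -- is the context strong enough?
    if not chunks or max(c.score for c in chunks) < MIN_RETRIEVAL_SCORE:
        return GateDecision(False, "retrieval_confidence", "escalate")
    # Gate 2: document freshness -- does context match the latest revision?
    if any(c.revision != c.latest_revision for c in chunks):
        return GateDecision(False, "document_freshness", "refuse")
    # Gate 3: answerability -- answer, ask a follow-up, or refuse?
    if not answerable:
        return GateDecision(False, "answerability", "clarify")
    # Gate 4: citation coverage -- are key claims linked to sources?
    if citation_coverage < MIN_CITATION_COVERAGE:
        return GateDecision(False, "citation_coverage", "escalate")
    # Gate 5: eval score -- does the answer pass task-specific checks?
    if eval_score < MIN_EVAL_SCORE:
        return GateDecision(False, "eval_score", "refuse")
    return GateDecision(True, "all_passed", "answer")
```

Ordering matters: the cheap checks (retrieval score, revision match) run before anything that requires generating or grading an answer.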
Routing by confidence
The team used retrieval confidence as a routing signal:
- High confidence and a simple question: cost pool.
- Medium confidence: balanced pool with citation checks.
- Low confidence or conflicting docs: quality pool or human handoff.
This made model choice depend on evidence, not just prompt length.
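A minimal sketch of that routing, with assumed confidence bands and pool names:

```python
def route_by_confidence(retrieval_score: float,
                        is_simple: bool,
                        has_conflicting_docs: bool) -> str:
    """Pick a model pool from retrieval evidence. Band edges are assumed."""
    if has_conflicting_docs or retrieval_score < 0.40:
        # Low confidence: strongest models, or hand off to a human.
        return "quality_pool_or_human_handoff"
    if retrieval_score >= 0.75 and is_simple:
        # High confidence on a simple question: cheapest viable path.
        return "cost_pool"
    # Everything in between: balanced pool plus citation checks.
    return "balanced_pool_with_citation_checks"
```

The band edges here are placeholders; in practice they would be tuned against the weekly answer samples.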
Cache with document versions
Caching was useful, but only after the team tied cache keys to document revision and permission scope. A cached answer from last week's policy page could not be reused after a docs deploy.
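One way to express that constraint is to fold revision and permission scope into the key itself, so a docs deploy or a scope change misses the cache automatically. A minimal sketch, with assumed field names:

```python
import hashlib

def cache_key(query: str,
              doc_ids: list[str],
              revisions: dict[str, str],
              permission_scope: str) -> str:
    """Cache key that invalidates on docs deploys and permission changes.

    `revisions` maps doc_id -> indexed revision; all names are illustrative.
    """
    parts = [
        query.strip().lower(),
        permission_scope,
        # Sort so the key is stable regardless of retrieval order.
        *(f"{d}@{revisions[d]}" for d in sorted(doc_ids)),
    ]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()
```

With this scheme, a docs deploy bumps the revision and stale entries simply stop being hits; nothing needs explicit eviction.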
SkyAIApp traces made it obvious when an answer came from cache, which policy version allowed it, and which document revision was used.
Results after two release cycles
| Metric | Before | After |
|---|---|---|
| Unsupported answer rate | 8.7% | 2.4% |
| Answers with adequate citations | 71% | 94% |
| Low-confidence answers escalated | 12% | 38% |
| P95 latency | 1.52s | 1.21s |
| Cost per 1k requests | $3.44 | $2.81 |
Latency and cost improved because high-confidence simple questions could use smaller models and cache safely. Quality improved because weak evidence no longer flowed through the same path as strong evidence.
What changed operationally
The team added a weekly RAG review:
- Top unsupported-answer categories.
- Docs revisions that caused cache invalidation.
- Routing policy changes by confidence band.
- Eval failures by tenant and locale.
- Human-handoff volume.
That review gave product, support, and engineering one shared quality language.
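Most of those review sections can be produced by simple aggregation over exported traces. A sketch, assuming a hypothetical flat export shape rather than SkyAIApp's actual trace schema:

```python
from collections import Counter

def weekly_review(traces: list[dict]) -> dict:
    """Aggregate a week of trace records into review sections.

    Each trace dict is an assumed flat shape, e.g.:
    {"gate_failed": "citation_coverage", "tenant": "acme",
     "locale": "en-US", "handoff": False}
    """
    return {
        "unsupported_by_category": Counter(
            t["gate_failed"] for t in traces if t.get("gate_failed")),
        "eval_failures_by_tenant_locale": Counter(
            (t["tenant"], t["locale"]) for t in traces
            if t.get("gate_failed") == "eval_score"),
        "handoff_volume": sum(1 for t in traces if t.get("handoff")),
    }
```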
RAG quality is not solved by one better prompt. It is solved by a system that knows when to answer, when to route up, when to cite, and when to stop.