RAG Quality Gates: Reducing Unsupported Answers in Knowledge Apps

4/18/2026

RAG systems fail in subtle ways. Retrieval can return the wrong document. The model can answer beyond the evidence. A newly updated policy can conflict with cached context. A confident answer can be the most dangerous failure mode.

Production teams do not need a philosophical debate about hallucinations. They need a measurable quality system.

This anonymized composite scenario follows a knowledge-base assistant used by sales engineers and support teams.

RAG quality gates and eval routing

Quality gates combine retrieval confidence, answerability, citations, evals, and fallback paths.

The problem

The assistant looked good in early testing, but production traffic exposed three failure modes:

  • It answered questions where retrieval confidence was low.
  • It mixed old and new policy documents after weekly releases.
  • It gave confident summaries without enough citations.

The team sampled answers weekly and found that 8.7% were unsupported: not necessarily wrong, but not adequately grounded in retrieved evidence.

That rate was too high for customer-facing workflows.

The quality-gate design

SkyAIApp helped the team separate RAG quality into gates:

  Gate                    Decision
  Retrieval confidence    Is the retrieved context strong enough to answer?
  Document freshness      Does the context match the latest indexed revision?
  Answerability           Should the assistant answer, ask a follow-up, or refuse?
  Citation coverage       Are key claims linked to retrieved sources?
  Eval score              Does the answer pass task-specific checks?
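The gates above can be sketched as a single decision function. This is a minimal illustration, not the team's implementation: the type names, field names, and thresholds (0.4, 0.7, 0.8) are all hypothetical placeholders that a real system would tune per task.

```python
from dataclasses import dataclass

@dataclass
class GateInput:
    retrieval_confidence: float  # 0..1 retrieval/rerank score (illustrative scale)
    doc_revision: str            # revision of the retrieved context
    latest_revision: str         # latest indexed revision of the same document
    citation_coverage: float     # fraction of key claims linked to a source
    eval_score: float            # task-specific eval result, 0..1

def apply_gates(g: GateInput) -> str:
    """Return 'answer', 'clarify', or 'refuse'. Thresholds are placeholders."""
    if g.doc_revision != g.latest_revision:
        return "refuse"   # stale context: re-retrieve before answering
    if g.retrieval_confidence < 0.4:
        return "refuse"   # evidence too weak to answer at all
    if g.retrieval_confidence < 0.7:
        return "clarify"  # ask a follow-up instead of guessing
    if g.citation_coverage < 0.8 or g.eval_score < 0.7:
        return "refuse"   # strong retrieval but an inadequately grounded answer
    return "answer"
```

The point of collapsing the gates into one function is that every answer path goes through the same checks, so an unsupported answer has to fail past all of them to ship.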

The goal was not to make every answer longer. The goal was to make unsupported answers harder to ship.

Routing by confidence

The team used retrieval confidence as a routing signal:

  • High confidence and simple question: cost pool.
  • Medium confidence: balanced pool with citation checks.
  • Low confidence or conflicting docs: quality pool or human handoff.

This made model choice depend on evidence, not just prompt length.
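A minimal sketch of that routing logic, assuming a normalized 0..1 retrieval confidence; the pool names mirror the list above, but the band boundaries (0.4, 0.75) and the conflict flag are illustrative assumptions, not values from the case study.

```python
def route(confidence: float, is_simple: bool, has_conflicts: bool) -> str:
    """Pick a model pool from retrieval evidence, not prompt length.
    Confidence bands are illustrative placeholders."""
    if has_conflicts or confidence < 0.4:
        return "quality_or_human"  # weak or conflicting evidence: escalate
    if confidence >= 0.75 and is_simple:
        return "cost"              # strong evidence, simple question: cheap model
    return "balanced"              # medium band: mid-tier model + citation checks
```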

Cache with document versions

Caching was useful, but only after the team tied cache keys to document revision and permission scope. A cached answer from last week's policy page could not be reused after a docs deploy.
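One way to implement revision- and scope-aware keys is to hash the question together with the document revision and the caller's permission scope, so a docs deploy or a scope change naturally misses the cache. A sketch under those assumptions (the function name and key layout are hypothetical):

```python
import hashlib

def cache_key(question: str, doc_revision: str, permission_scope: str) -> str:
    """Build a cache key that changes whenever the underlying document
    revision or the caller's permission scope changes."""
    # \x1f (unit separator) keeps field boundaries unambiguous in the hash input
    raw = f"{question}\x1f{doc_revision}\x1f{permission_scope}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

With keys like this, stale entries do not need explicit invalidation after a deploy: the new revision simply produces a new key and the old entry ages out.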

SkyAIApp traces made it obvious when an answer came from cache, which policy version allowed it, and which document revision was used.

Results after two release cycles

  Metric                             Before    After
  Unsupported answer rate            8.7%      2.4%
  Answers with adequate citations    71%       94%
  Low-confidence answers escalated   12%       38%
  P95 latency                        1.52s     1.21s
  Cost per 1k requests               $3.44     $2.81

Latency and cost improved because high-confidence simple questions could use smaller models and cache safely. Quality improved because weak evidence no longer flowed through the same path as strong evidence.

What changed operationally

The team added a weekly RAG review:

  • top unsupported answer categories
  • docs revisions that caused cache invalidation
  • routing policy changes by confidence band
  • eval failures by tenant and locale
  • human-handoff volume

That review gave product, support, and engineering one shared quality language.

RAG quality is not solved by one better prompt. It is solved by a system that knows when to answer, when to route up, when to cite, and when to stop.