A chatbot that confidently invents a refund policy is worse than one that simply says “I don’t know.” For anything customer- or compliance-facing, a grounded non-answer beats a fluent wrong one. The good news: hallucination is an engineering problem, and engineering problems have engineering solutions.
Why naive RAG still hallucinates
Retrieval-augmented generation (RAG) is meant to fix this by feeding the model your documents. But a first-cut pipeline still drifts when:
- retrieval returns loosely-related chunks, so the model fills the gaps from its training data;
- the prompt never forces the model to answer only from the supplied context;
- there’s no signal for “the answer isn’t in the documents,” so it guesses anyway.
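The last gap is the easiest to illustrate. Here is a minimal sketch of an explicit "not in the documents" signal, assuming the retriever returns a similarity score between 0 and 1 for each chunk. The names `Chunk`, `has_enough_evidence` and `answer_or_refuse`, and the 0.75 threshold, are hypothetical and would need tuning against real data:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # assumed: retriever similarity in [0, 1], higher is closer


def has_enough_evidence(chunks: list[Chunk], min_score: float = 0.75) -> bool:
    """Return True only if at least one chunk clears the relevance bar."""
    return any(c.score >= min_score for c in chunks)


def answer_or_refuse(question: str, chunks: list[Chunk]) -> str:
    if not has_enough_evidence(chunks):
        # Explicit refusal path, so the model is never asked to guess.
        return "I don't know: the documents don't cover this."
    # Placeholder: a real pipeline would call the model with the chunks here.
    return f"[answer to {question!r} grounded in {len(chunks)} chunk(s)]"
```

A score threshold is only one possible signal. It is cheap to add, though, and it turns "guess anyway" into a decision the pipeline makes on purpose.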
How we keep answers grounded
We treat a production assistant like any other system that has to stay correct under load:
- Better retrieval, not just more. Hybrid keyword + semantic search, re-ranking and content-aware chunking beat throwing a bigger model at the problem.
- Strict grounding prompts. The model answers only from retrieved context, cites its sources, and is told to say it cannot answer when the context is insufficient.
- Guardrails. Output is screened for unsupported claims, policy violations and PII before it reaches a user.
- Human-in-the-loop. High-stakes cases such as refunds, medical questions and legal questions go to a person for review, because a wrong answer there is expensive.
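The first two points can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any specific production pipeline: reciprocal rank fusion is one common way to merge keyword and semantic rankings and is swapped in here as an example, and the function names, the `k = 60` constant and the prompt wording are all hypothetical.

```python
from collections import defaultdict

REFUSAL = "I don't know based on the provided documents."


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly by several retrievers accumulate more score.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score; ties are broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def build_grounded_prompt(question: str, passages: dict[str, str]) -> str:
    """Build a prompt that forbids answering from outside the context."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages.items())
    return (
        "Answer using ONLY the context below. "
        "Cite the [id] of every passage you rely on.\n"
        f"If the context does not contain the answer, reply exactly: {REFUSAL}\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Usage: fuse a keyword ranking with a semantic ranking.
keyword_hits = ["a", "b", "c"]
semantic_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Fixing the refusal to an exact string is a deliberate choice in this sketch: it gives guardrails and evaluation code something unambiguous to detect.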
Prove it before launch
None of this is trustworthy unless it’s measured. We build an evaluation harness with a labelled question set and score every change for accuracy, groundedness and “refusal-when-unknown.” A model or prompt update only ships when the numbers hold or improve — the same discipline you’d apply to a payments change.
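As a rough sketch of what such a release gate could look like, assume each labelled question has already been judged on four booleans. The `Result` fields, metric keys and `may_ship` function are hypothetical names, and how each answer gets judged (by a human or by an automated grader) is outside this example.

```python
from dataclasses import dataclass

METRICS = ("accuracy", "groundedness", "refusal_when_unknown")


@dataclass
class Result:
    correct: bool     # answer matched the label
    grounded: bool    # every claim traced back to a retrieved passage
    answerable: bool  # label says the documents contain the answer
    refused: bool     # assistant declined to answer


def _rate(hits: int, total: int) -> float:
    # An empty bucket scores 0.0, so a missing test category cannot pass silently.
    return hits / total if total else 0.0


def score(results: list[Result]) -> dict[str, float]:
    answerable = [r for r in results if r.answerable]
    unknown = [r for r in results if not r.answerable]
    answered = [r for r in results if not r.refused]
    return {
        "accuracy": _rate(sum(r.correct for r in answerable), len(answerable)),
        "groundedness": _rate(sum(r.grounded for r in answered), len(answered)),
        "refusal_when_unknown": _rate(sum(r.refused for r in unknown), len(unknown)),
    }


def may_ship(baseline: dict[str, float], candidate: dict[str, float],
             tolerance: float = 0.0) -> bool:
    """Ship only if every metric holds or improves against the baseline."""
    return all(candidate[m] >= baseline[m] - tolerance for m in METRICS)
```

Scoring refusals separately on the unanswerable questions matters: an assistant that answers everything can score well on accuracy while failing exactly the cases where it should have said "I don't know."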
The takeaway
You don’t make an LLM trustworthy by hoping. You make it trustworthy with retrieval you’ve tuned, prompts that forbid guessing, guardrails that catch the rest, and an evaluation suite that proves it — before a single customer sees it.









