
RAG Works in Theory. Here's Why It Fails in Production.

March 2, 2026

You had a working demo. The retrieval-augmented generation pipeline pulled the right docs, the model gave clean answers, and the stakeholder demo went great. Then it shipped. And within a week, users started getting confident, well-formatted answers that were completely wrong.

This is the most common trajectory for RAG systems. Not that they don't work — they work beautifully in controlled conditions. The problem is that production is not a controlled condition.

So what actually goes wrong?

The retrieval looks right but isn't

Most RAG failures aren't generation failures. They're retrieval failures wearing a generation mask. The language model is doing exactly what you asked — synthesizing an answer from the context it was given. The issue is that the context was wrong, and the model has no reliable way to know that.

This is where the concept of grounding matters. A grounded answer is one that's actually supported by the retrieved evidence. But "supported" is doing a lot of heavy lifting in that sentence. If your retrieval pipeline returns a chunk that's topically adjacent but factually irrelevant, the model will still ground its answer in that chunk. It'll look grounded. It won't be.
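One cheap way to surface this failure is a lexical grounding check: measure how much of the answer's vocabulary actually appears in the retrieved chunks. This is a crude proxy, not a real entailment test, and the function name and word-length cutoff below are illustrative choices, not a standard API:

```python
def grounding_score(answer_sentence: str, chunks: list[str]) -> float:
    """Fraction of the answer's content words (here: words longer than
    3 chars) that appear in at least one retrieved chunk. A low score
    flags answers the evidence may not actually support."""
    words = {w for w in answer_sentence.lower().split() if len(w) > 3}
    if not words:
        return 0.0
    chunk_words = set()
    for chunk in chunks:
        chunk_words |= set(chunk.lower().split())
    return len(words & chunk_words) / len(words)

chunks = ["Refunds are issued within 14 days of purchase."]
score = grounding_score("refunds arrive within fourteen days", chunks)
# "fourteen" never matches "14" -- lexical checks miss paraphrase,
# which is exactly why this is a tripwire, not a guarantee.
```

A production system would use an NLI model or an LLM-as-judge for this, but even a tripwire this crude catches the worst cases of answering from the wrong context.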

The instinct is to blame the model for hallucination. But the model didn't hallucinate in the traditional sense — it faithfully reflected bad input. You gave it the wrong documents and it gave you the wrong answer. That's not a model problem. That's a search problem.

Why your chunks are lying to you

Chunking is the first place production RAG systems silently break. In the notebook, you probably used fixed-size chunks — 500 tokens, maybe 1000 — and it worked fine because your test documents were clean, well-structured, and short.

Production documents are none of those things. They're 80-page PDFs with headers that lie about their content. They're Confluence pages where the critical detail is in paragraph four of a section titled "Miscellaneous." They're Slack thread exports where the answer is split across three messages from two people.

When you chunk these documents at fixed intervals, you're cutting across semantic boundaries. A chunk might contain the tail end of one topic and the beginning of another. The embedding for that chunk will be a meaningless average of two unrelated ideas. Your semantic search will return it for queries related to either topic, and it'll be wrong for both.

The question isn't "what chunk size should I use?" The question is "does each chunk actually represent a coherent idea?" That's a fundamentally different problem, and it's why people end up exploring sentence-level chunking, semantic chunking, or document-aware splitting that respects headings and paragraph boundaries.
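A minimal sketch of what "respecting boundaries" means in practice: split on paragraph breaks first, then pack whole paragraphs into chunks, never cutting one in half. The function name and the 800-character budget are illustrative, not from any particular library:

```python
import re

def chunk_by_structure(doc: str, max_chars: int = 800) -> list[str]:
    """Split on blank lines (paragraph boundaries), then greedily pack
    paragraphs into chunks without ever cutting a paragraph in half."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)   # flush before the budget is blown
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Real document-aware splitters also key on headings, list items, and table boundaries, but the principle is the same: the unit of packing is a semantic unit, not a token count.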

But even good chunking can't save you if the next layer fails.

Embeddings don't mean what you think they mean

An embedding maps text to a point in high-dimensional vector space. Points that are close together are supposed to be "similar." And they usually are — similar in some sense. The problem is that the sense of similarity your embedding model learned during training may not match the sense of similarity your use case requires.

Consider a legal RAG system. A user asks: "Can I terminate the contract early?" Your embedding model returns chunks that are semantically similar — they mention contracts, termination, early exit. But the top result is from a different contract type entirely. The embedding doesn't know which contract the user is asking about. It matched on meaning, not on specificity.

This is the dense retrieval problem in a nutshell. Dense retrieval (embedding-based) is excellent at capturing semantic relationships but terrible at exact matching. If the user asks about "Section 4.2(b)" and that string appears verbatim in one document, a sparse retrieval method like BM25 will find it instantly. Your embedding model might not even rank it in the top 20.

The practical answer is almost always hybrid retrieval — combining dense and sparse methods so you catch both semantic similarity and keyword precision. But the deeper lesson is this: a high similarity score is not evidence of relevance. It's a hypothesis. And in production, you need something to test that hypothesis.
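One common way to combine the two result lists is reciprocal rank fusion (RRF), which needs only the rankings, not comparable scores. A minimal sketch, assuming your dense and sparse retrievers each return an ordered list of document IDs:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs. A doc ranked r-th in one
    list contributes 1/(k + r) to its fused score, so docs surfaced by
    both dense and sparse search rise to the top. k=60 is the constant
    from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d7"]   # embedding search results
sparse = ["d1", "d9", "d3"]   # BM25 results
fused = reciprocal_rank_fusion([dense, sparse])
# "d1" wins: it appears high in both lists.
```

RRF's appeal is that it sidesteps the question of how to normalize cosine similarities against BM25 scores; the ranks are the only signal it uses.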

Reranking: the safety net you're probably skipping

This is where reranking enters the picture. Your initial retrieval — whether dense, sparse, or hybrid — casts a wide net. It returns the top 20 or 50 candidates. A reranker then looks at each candidate in the context of the original query and re-scores them.

The difference is architectural. A bi-encoder (what your embedding model is) encodes the query and the document separately, then compares them. It's fast but shallow. A cross-encoder (what most rerankers are) encodes the query and the document together, allowing deep interaction between them. It's slower but dramatically more accurate.

Most production RAG pipelines that work well have a reranking step. Most RAG pipelines that were built in a sprint and shipped fast don't. If you're seeing retrieval results that are topically correct but not actually answering the question, adding a reranker is likely the highest-leverage fix available to you.
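The reranking step itself is a small amount of orchestration. The sketch below keeps the scorer pluggable; in production `score_fn` would wrap a cross-encoder model (e.g. a sentence-transformers `CrossEncoder`), and the word-overlap stand-in here exists only so the example is self-contained:

```python
def rerank(query: str, candidates: list[str], score_fn, top_k: int = 5) -> list[str]:
    """Re-score each retrieval candidate against the query and keep
    the best. score_fn is any callable (query, doc) -> float; a real
    pipeline would plug in a cross-encoder here."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc),
                  reverse=True)[:top_k]

# Stand-in scorer: word overlap. A cross-encoder replaces this.
def overlap_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

docs = ["contract termination clauses", "employee onboarding guide"]
best = rerank("terminate the contract early", docs, overlap_score, top_k=1)
```

The wide-net-then-rerank pattern also keeps latency manageable: the expensive cross-encoder only ever sees the 20-50 candidates the cheap first stage surfaced.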

But retrieval quality is only half the battle.

Stuffing the context window is not a strategy

Once you have your retrieved chunks, you need to fit them into the model's context window along with your system prompt, the user's query, and any conversation history. The temptation is to stuff as much retrieved content as possible — more context means better answers, right?

Not really. Language models exhibit a well-documented "lost in the middle" effect. Information at the beginning and end of the context gets more attention than information in the middle. If your most relevant chunk lands in position 4 out of 8, the model may effectively ignore it in favor of less relevant chunks that happen to be first or last.
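One mitigation is to reorder the chunks after retrieval so the strongest evidence sits at the edges of the context, where attention is highest. A minimal sketch (the function name is illustrative; some frameworks ship a similar "long context reorder" transform):

```python
def reorder_for_attention(chunks_by_relevance: list[str]) -> list[str]:
    """Given chunks sorted best-first, interleave them so the most
    relevant land at the start and end of the context and the weakest
    sink into the middle, countering 'lost in the middle'."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["c1", "c2", "c3", "c4", "c5"]   # c1 = most relevant
# -> ["c1", "c3", "c5", "c4", "c2"]: best chunk first,
#    second-best last, weakest in the middle.
```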

There's also the noise problem. Every irrelevant chunk you include doesn't just waste tokens — it actively competes with relevant chunks for the model's attention. Three highly relevant chunks will almost always produce better answers than three relevant chunks buried in seven irrelevant ones.

This is where the retrieval and generation stages need to talk to each other. Your retrieval pipeline shouldn't just return results — it should return confident results, and your generation prompt should be designed to handle cases where retrieval confidence is low. Sometimes the right answer is "I don't have enough information to answer that," but most RAG systems are never given that option.
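Giving the model that option can be as simple as gating prompt assembly on retrieval confidence. A sketch, assuming your retriever returns (chunk, score) pairs; the 0.5 threshold and the prompt wording are illustrative and would need tuning per system:

```python
def build_prompt(query: str, results: list[tuple[str, float]],
                 min_score: float = 0.5) -> str:
    """Include only chunks above a relevance threshold. If nothing
    passes, instruct the model to decline rather than improvise
    from weak context."""
    relevant = [text for text, score in results if score >= min_score]
    if not relevant:
        return (f"Question: {query}\n"
                "No sufficiently relevant context was retrieved. "
                "Reply that you do not have enough information to answer.")
    context = "\n---\n".join(relevant)
    return ("Answer using ONLY the context below. If the context does "
            "not contain the answer, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")
```

The threshold is the interface between the two stages: the retriever's scores finally carry a consequence instead of being thrown away after ranking.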

The context rot problem nobody talks about

Here's a failure mode that only appears over time: context rot. Your documents change. Policies get updated, product features ship, teams reorganize. But your vector database still contains embeddings of the old versions. A user asks about the current refund policy and gets last quarter's version, delivered with full confidence.

This isn't a hypothetical — it's the default behavior of every RAG system that doesn't have an explicit refresh strategy. And "re-embed everything nightly" is not a strategy. It's a cron job that ignores the question of which documents changed and whether the changes are semantically significant enough to alter retrieval behavior.

The more mature approach involves tracking document versions, updating only changed chunks, and — critically — validating that the updated embeddings still retrieve correctly for your known-good test queries. This is retrieval regression testing, and almost nobody does it.
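The "update only changed chunks" part can be driven by content hashing: store a hash alongside each embedding, and at refresh time re-embed only the chunks whose hash moved. A minimal sketch with hypothetical function and variable names:

```python
import hashlib

def find_stale_chunks(indexed: dict[str, str],
                      current_chunks: dict[str, str]):
    """indexed maps chunk_id -> sha256 hex stored at embed time;
    current_chunks maps chunk_id -> latest text. Returns chunk IDs
    to re-embed, IDs to delete, and the fresh hash map to store."""
    stale, fresh_hashes = [], {}
    for chunk_id, text in current_chunks.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        fresh_hashes[chunk_id] = digest
        if indexed.get(chunk_id) != digest:
            stale.append(chunk_id)       # changed, or brand new
    removed = [cid for cid in indexed if cid not in current_chunks]
    return stale, removed, fresh_hashes
```

This answers "which documents changed" mechanically; whether a change is semantically significant enough to shift retrieval is what the known-good test queries are for.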

When similarity search isn't enough

Some queries can't be answered by finding similar text. "What's the relationship between Project Alpha and the Q3 budget?" requires understanding connections between entities across multiple documents. No single chunk contains that answer. No embedding similarity will find it.

This is the space where knowledge graphs start to matter. A knowledge graph represents entities and their relationships explicitly — not as text to be searched, but as structured data to be traversed. When a user asks a relational question, you're not looking for similar text; you're looking for connected nodes.

The hybrid approach — sometimes called GraphRAG — combines vector retrieval for content questions with graph traversal for relational questions. The query routing layer decides which retrieval strategy (or combination) to use based on the nature of the question. It adds complexity, but for domains where relationships matter as much as content, it's the difference between a system that works and one that confidently fails.
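At its simplest, the routing layer can be a keyword heuristic; mature systems typically use an LLM classifier or entity extraction instead. A deliberately naive sketch, with cue phrases chosen for illustration:

```python
RELATIONAL_CUES = ("relationship between", "connected to", "depends on",
                   "who owns", "linked to")

def route_query(query: str) -> str:
    """Naive router: relational phrasing goes to graph traversal,
    everything else to vector search. A real router would classify
    with a model rather than match cue phrases."""
    q = query.lower()
    if any(cue in q for cue in RELATIONAL_CUES):
        return "graph"
    return "vector"
```

Even this crude split captures the core insight: "find me text about X" and "how is X connected to Y" are different retrieval problems and deserve different machinery.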

What a production-ready RAG system actually looks like

The systems that survive production share a few properties. They don't treat retrieval as a solved problem — they instrument it, measure it, and iterate on it independently of the generation layer. They have chunking strategies that respect document structure instead of applying arbitrary token limits. They use hybrid retrieval to cover both semantic and lexical matching. They rerank before generation. They manage context windows deliberately rather than stuffing them. And they have some mechanism — even a crude one — for detecting when their indexed knowledge is stale.

None of this is exotic. All of it is skipped in the first version because the demo worked and the sprint ended.

The gap between a RAG demo and a RAG product isn't a model upgrade or a bigger context window. It's the engineering discipline to treat retrieval as seriously as you treat the model itself. The model is the easy part. The retrieval is where the actual work lives.

