>_TheQuery

Retrieval-Augmented Generation

NLP

An architecture pattern that reduces LLM hallucination by retrieving relevant documents from an external knowledge base and including them as context before generating a response.

Retrieval-Augmented Generation (RAG) addresses the fundamental limitation that LLMs can only rely on knowledge memorized during training. Instead of asking the model to answer from memory, RAG first retrieves relevant documents from an external database, then augments the prompt with this retrieved context, and finally generates an answer grounded in the retrieved information.
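The augment step above can be sketched as a simple prompt builder. This is a minimal illustration, not a standard template: the function name, instruction wording, and citation-style numbering are all assumptions, and real systems vary widely in how they format retrieved context.

```python
# Sketch of the "augment" step: prepend retrieved passages to the user's
# question so the LLM answers from the supplied context, not from memory.
# The prompt wording here is illustrative, not a fixed standard.
def build_rag_prompt(question, retrieved_docs):
    # Number each passage so the model can cite sources as [1], [2], ...
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(retrieved_docs))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "When was the policy updated?",
    ["The refund policy was last updated in March 2025."],
)
```

The resulting string is what gets sent to the LLM in place of the bare question, grounding the generation in the retrieved text.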

A typical RAG pipeline consists of: a document store (vector database or search index) containing the knowledge base, an embedding model that converts both queries and documents into vectors, a retrieval step that finds the top-k most similar documents using cosine similarity, and an LLM that generates answers given the query plus retrieved context. This architecture is powerful because knowledge can be updated by simply adding documents to the store, without retraining the model.
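The retrieval step described above can be sketched end to end. To keep the example self-contained, a toy bag-of-words count over a tiny fixed vocabulary stands in for a real embedding model, and the document store is a plain list; a production system would use a learned embedding model and a vector database, as the paragraph notes.

```python
import math

# Toy "embedding": word counts over a fixed vocabulary. This is a
# stand-in for a real embedding model, chosen only to keep the sketch
# self-contained.
VOCAB = ["rag", "retrieval", "llm", "vector", "training", "database"]

def embed(text):
    words = text.lower().split()
    return [words.count(w) for w in VOCAB]

def cosine(a, b):
    # Cosine similarity: dot product of the vectors divided by the
    # product of their magnitudes; 0.0 if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    # Rank every document by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "rag combines retrieval with an llm",
    "a vector database stores document embeddings",
    "training a model from scratch is expensive",
]
top = retrieve("how does rag use retrieval and an llm", docs, k=2)
```

Updating the knowledge base is just appending to `docs`; no model weights change, which is the architectural advantage the paragraph describes.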

RAG is preferred over fine-tuning when knowledge changes frequently, when you need to cite sources, or when compute resources are limited. However, RAG does not eliminate hallucination entirely: if retrieval fails to surface relevant documents, the LLM may still fabricate an answer. Production RAG systems therefore need fallback logic, relevance thresholds, and monitoring. Often the best approach combines RAG (for up-to-date factual grounding) with fine-tuning (for domain-specific style and reasoning).
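The relevance-threshold fallback mentioned above can be sketched as a small gate in front of generation. Everything here is an assumption for illustration: the function names, the `(document, score)` pair shape, and the 0.75 threshold, which in practice would be tuned on real retrieval data.

```python
# Sketch of a relevance-threshold fallback, assuming retrieval returns
# (document, similarity_score) pairs. If nothing clears the threshold,
# the system declines rather than letting the LLM guess. The 0.75
# threshold is illustrative, not a recommended value.
FALLBACK = "I couldn't find relevant information to answer that."

def answer_or_fallback(scored_docs, threshold=0.75):
    relevant = [(doc, s) for doc, s in scored_docs if s >= threshold]
    if not relevant:
        # Decline instead of generating from irrelevant context.
        return None, FALLBACK
    return relevant, "PROCEED_TO_GENERATION"

docs, action = answer_or_fallback([("doc a", 0.91), ("doc b", 0.42)])
```

In a fuller system the `PROCEED_TO_GENERATION` branch would build the augmented prompt from `relevant` and call the LLM, and the fallback rate would be one of the metrics the monitoring the paragraph mentions tracks.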

Last updated: February 22, 2026