RAG (Retrieval-Augmented Generation)
Retrieval-Augmented Generation (RAG) is a technique that combines a retrieval step with a generative language model to produce responses grounded in external knowledge.
RAG was introduced by Lewis et al. at Facebook AI in 2020 as a method to improve the factual accuracy and relevance of text generation. A standard generative model relies solely on knowledge stored in its parameters, which can lead to hallucinations or outdated information. RAG addresses this by first retrieving relevant documents or passages from a knowledge source, such as a database or the web, and then conditioning the language model on both the query and the retrieved context to generate a response.
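For reference, Lewis et al. formalize this conditioning by treating the retrieved passage $z$ as a latent variable. In their RAG-Sequence variant, with input $x$, output sequence $y = (y_1, \dots, y_N)$, retriever $p_\eta$ and generator $p_\theta$, the output probability is marginalized over the top-$k$ retrieved passages. The retriever scores each passage by the inner product of a document embedding $\mathbf{d}(z)$ and a query embedding $\mathbf{q}(x)$, both produced by BERT-based encoders in the paper:

```latex
p_{\text{RAG-Seq}}(y \mid x) \approx \sum_{z \in \operatorname{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

p_\eta(z \mid x) \propto \exp\big(\mathbf{d}(z)^{\top} \mathbf{q}(x)\big)
```

The paper's RAG-Token variant instead marginalizes over passages separately at each output token, so different tokens can draw on different passages.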
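The retrieval component typically uses either sparse retrieval, which matches query terms lexically (as in TF-IDF or BM25), or dense retrieval, which runs a nearest-neighbor search over learned embeddings and can match on meaning even when the query and document share few words. The retrieved passages are then combined with the query, in the simplest case by concatenation, and fed into the generator, which is often a pre-trained sequence-to-sequence transformer such as BART or T5. The generator produces text informed by the retrieved evidence, allowing the system to incorporate up-to-date or domain-specific information without retraining the model. The retriever and generator can be trained jointly or built as separate modules; in Lewis et al.'s setup, the query encoder and the BART generator are fine-tuned end-to-end while the document encoder and its index are kept fixed.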
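To make the sparse side and the concatenation step concrete, here is a minimal sketch of Okapi BM25 ranking over an in-memory corpus, followed by joining the top passages with the query into one generator input. It assumes a naive lowercase tokenizer; `BM25Index` and `build_input` are hypothetical names, and `k1=1.5`, `b=0.75` are common defaults rather than values from the source. The single-prompt layout is one common convention, not the exact input format of Lewis et al., who pair the input with each retrieved passage separately.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive tokenizer (assumption): split on whitespace, strip punctuation, lowercase.
    return [t.strip(".,?!").lower() for t in text.split() if t.strip(".,?!")]


class BM25Index:
    """Minimal Okapi BM25 index over an in-memory list of passages (hypothetical)."""

    def __init__(self, passages: list[str], k1: float = 1.5, b: float = 0.75):
        self.passages = passages
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(p)) for p in passages]  # term frequencies
        self.doc_len = [sum(d.values()) for d in self.docs]
        self.avgdl = sum(self.doc_len) / len(self.docs)
        # Document frequency: number of passages containing each term.
        self.df = Counter(term for d in self.docs for term in d)

    def idf(self, term: str) -> float:
        n, df = len(self.docs), self.df.get(term, 0)
        # A common non-negative IDF variant.
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def score(self, query: str, i: int) -> float:
        d, dl = self.docs[i], self.doc_len[i]
        total = 0.0
        for term in tokenize(query):
            tf = d.get(term, 0)
            if tf == 0:
                continue
            # Term-frequency saturation (k1) and length normalization (b).
            norm = tf + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf(term) * tf * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int = 2) -> list[str]:
        ranked = sorted(range(len(self.docs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return [self.passages[i] for i in ranked[:k]]


def build_input(query: str, passages: list[str]) -> str:
    # Concatenate numbered passages and the query into a single generator input.
    context = "\n".join(f"[{n}] {p}" for n, p in enumerate(passages, 1))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    passages = [
        "BART is a denoising sequence-to-sequence transformer.",
        "BM25 ranks documents by term frequency and inverse document frequency.",
        "The Eiffel Tower is in Paris.",
    ]
    index = BM25Index(passages)
    query = "Where is the Eiffel Tower?"
    print(build_input(query, index.top_k(query, k=1)))
```

A dense retriever would replace `score` with a similarity between learned embeddings; the concatenation step stays the same.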
RAG has been applied in various domains, including question answering, fact verification, and dialogue systems. Grounding the output in external sources can reduce the risk of generating incorrect information, and the system can adapt to new information by simply updating the retrieval index. However, output quality depends on the relevance and accuracy of the retrieved documents, and the retrieval step adds latency and computational cost.
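The point that updating the index is enough can be shown with a toy example: new knowledge is added by appending a passage, and the next query retrieves it without any change to a model. `OverlapIndex` and its word-overlap scoring are hypothetical stand-ins for a real vector index.

```python
from dataclasses import dataclass, field
from typing import Optional


def words(text: str) -> set[str]:
    # Lowercased word set; a stand-in (assumption) for a real retriever's encoder.
    return {w.strip(".,?!").lower() for w in text.split()} - {""}


@dataclass
class OverlapIndex:
    """Hypothetical toy index that ranks passages by the number of shared words."""
    passages: list[str] = field(default_factory=list)

    def add(self, passage: str) -> None:
        # New knowledge enters through the index; no generator weights change.
        self.passages.append(passage)

    def search(self, query: str) -> Optional[str]:
        q = words(query)
        best = max(self.passages, key=lambda p: len(q & words(p)), default=None)
        # Treat zero overlap as "nothing relevant found".
        if best is None or not (q & words(best)):
            return None
        return best


if __name__ == "__main__":
    index = OverlapIndex(["The 2022 release added feature A."])
    print(index.search("What did the 2024 release add?"))  # stale 2022 passage
    index.add("The 2024 release added feature B.")
    print(index.search("What did the 2024 release add?"))  # new 2024 passage
```

Before the update, the retriever returns the stale 2022 passage, which also illustrates the caveat above: the generator can only be as current and accurate as what retrieval hands it.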
Why it matters
RAG matters because it can make language models more reliable by anchoring their outputs in verifiable external knowledge. This can reduce hallucinations and supports applications where factual accuracy is critical, such as customer support, medical advice, or legal document analysis. It also lets models access information beyond their training-data cutoff, making them more adaptable to rapidly changing domains without frequent retraining.
First appeared
Lewis et al., Facebook AI, 2020.
FAQ
How does it work?
RAG works by first retrieving relevant documents from a knowledge source based on the input query. These documents are then combined with the query and passed to a generative language model, which produces a response that is informed by the retrieved context. The retrieval is typically performed using vector similarity search, and the generator is a pre-trained transformer model.
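The three steps in this answer (retrieve, combine, generate) fit in a few lines. This sketch assumes an in-memory corpus, a hypothetical `embed` function that builds a hashed bag-of-words vector as a stand-in for a learned dense encoder, and a caller-supplied `generate` callable in place of a real language model; `rag_answer` is likewise a hypothetical name.

```python
import hashlib
import math
from typing import Callable

DIM = 1024  # size of the toy embedding space (arbitrary choice)


def embed(text: str) -> list[float]:
    """Hypothetical stand-in encoder: hashed bag of words, L2-normalized.

    A real RAG retriever would use a learned dense encoder instead.
    """
    vec = [0.0] * DIM
    for raw in text.lower().split():
        word = raw.strip(".,?!")
        if word:
            # sha256 gives a stable bucket across runs (Python's hash() is salted).
            bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
            vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    # Both vectors are unit length, so the dot product equals cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def rag_answer(query: str, corpus: list[str],
               generate: Callable[[str], str], k: int = 2) -> str:
    # 1. Retrieve: rank passages by vector similarity to the query.
    q = embed(query)
    ranked = sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)
    # 2. Combine: place the top-k passages in front of the query.
    context = "\n".join(ranked[:k])
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    # 3. Generate: the generator sees both the query and the retrieved context.
    return generate(prompt)


if __name__ == "__main__":
    corpus = [
        "RAG was introduced by Lewis et al. in 2020.",
        "BM25 is a sparse ranking function.",
        "Paris is the capital of France.",
    ]
    # Echo "generator" for demonstration; a real system would call a model here.
    print(rag_answer("Who introduced RAG?", corpus, generate=lambda p: p, k=1))
```

A production system would precompute passage embeddings once and use an approximate nearest-neighbor index rather than re-embedding and sorting the whole corpus per query.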
What are the main advantages of RAG over standard generation?
RAG can improve factual accuracy and reduce hallucinations by grounding responses in external knowledge. It also lets the system incorporate new or domain-specific information without retraining, simply by updating the retrieval index. This makes it more flexible, and often more cost-effective than retraining or fine-tuning, for applications that require up-to-date or specialized knowledge.
What are the limitations of RAG?
RAG introduces additional latency due to the retrieval step, and its output quality depends on the relevance and accuracy of the retrieved documents. It also requires a well-maintained knowledge source and can be less effective if the retrieval fails to find pertinent information. Additionally, the system may still generate incorrect responses if the retrieved context is misleading or incomplete.
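One way to detect the "retrieval found nothing pertinent" case is to check retrieval scores before generating. The sketch below assumes a retriever that returns `(similarity, passage)` pairs; `answer_or_abstain` and the `min_score` threshold are hypothetical, and a real threshold would need tuning per corpus and retriever.

```python
from typing import Callable, Optional


def answer_or_abstain(
    query: str,
    scored_passages: list[tuple[float, str]],
    generate: Callable[[str], str],
    min_score: float = 0.3,  # illustrative threshold (assumption)
) -> Optional[str]:
    """Hypothetical guard: skip generation when retrieval looks unreliable."""
    # Keep only passages whose similarity clears the threshold.
    relevant = [p for s, p in scored_passages if s >= min_score]
    if not relevant:
        # Nothing pertinent was retrieved: abstain instead of letting the
        # generator fall back on its parametric (possibly outdated) knowledge.
        return None
    context = "\n".join(relevant)
    return generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")


if __name__ == "__main__":
    print(answer_or_abstain("q", [(0.1, "weak match")], generate=lambda p: "answer"))
    print(answer_or_abstain("q", [(0.8, "strong match")], generate=lambda p: "answer"))
```

A score threshold only catches low-similarity retrievals; it cannot tell when a passage is on-topic but misleading or incomplete, so that failure mode remains.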