Retrieval-Augmented Generation, or RAG, is the pattern of retrieving relevant content from a knowledge source and feeding it into a language model so the model generates a grounded answer. Instead of relying on what the model memorized during training, RAG gives it specific source material at runtime. The result is responses that can cite real documents, incorporate information that postdates the training cutoff, and stay grounded in your organization's specific data.
The basic flow is simple. The user asks a question. The system embeds the question and searches a vector database (or hybrid search system) for the most relevant chunks of source material. Those chunks get inserted into the prompt along with the question. The model generates an answer based on the provided context. The answer can include citations pointing back to the sources.
RAG matters because it solves two problems vanilla LLM use cannot. It handles knowledge the model does not have (your product docs, your internal wiki, today's news) without retraining. And it grounds the model in real sources, substantially reducing hallucinations on topics the sources cover. Together these make RAG the workhorse pattern of enterprise AI in 2025 and 2026.
The pipeline has four main stages. First, ingest source content into a knowledge store. Documents get chunked into smaller pieces, each chunk gets embedded into a vector, and the vectors are stored in a database optimized for similarity search.
Second, embed the user's query into the same vector space. Indexing and querying must use the same embedding model: vectors produced by different models occupy different spaces and are not comparable.
Third, search the vector database for chunks similar to the query. The result is typically the top 5 to 20 chunks ranked by cosine similarity. Hybrid search adds keyword matching alongside semantic search to catch cases where exact terms matter.
Fourth, feed the retrieved chunks plus the query to the LLM with a prompt that instructs the model to answer using only the provided context. The model generates the response, often including citations.
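Concretely, the whole loop fits in a page of Python. The sketch below is illustrative, not production code: it uses OpenAI's embedding and chat APIs as one concrete provider choice, an in-memory list in place of a real vector database, and naive fixed-size chunking. The model names and prompt wording are assumptions, not a fixed recipe.

```python
# Minimal end-to-end RAG sketch. OpenAI's APIs stand in for whichever
# embedding and chat providers you use; the in-memory `index` list
# stands in for a real vector database.
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
index: list[dict] = []  # each entry: {"doc_id", "text", "vector"}

def embed_texts(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def ingest(doc_id: str, text: str, max_chars: int = 1500) -> None:
    # Stage 1: naive fixed-size chunking, then embed and store.
    pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    for piece, vector in zip(pieces, embed_texts(pieces)):
        index.append({"doc_id": doc_id, "text": piece, "vector": vector})

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query: str, k: int = 5) -> list[dict]:
    # Stages 2 and 3: embed the query, rank chunks by cosine similarity.
    qvec = embed_texts([query])[0]
    return sorted(index, key=lambda c: cosine(qvec, c["vector"]), reverse=True)[:k]

def answer(query: str) -> str:
    # Stage 4: inject the retrieved chunks into the prompt and generate.
    chunks = retrieve(query)
    context = "\n\n".join(f"[{c['doc_id']}] {c['text']}" for c in chunks)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Answer using only the provided context. Cite sources by "
                "[doc_id]. If the context does not contain the answer, say so.")},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```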
Production RAG systems add layers. Reranking takes the top 50 retrieved chunks and uses a more expensive model to score which 5 to 10 are actually relevant. Query rewriting transforms the user's phrasing into something more retrievable. Chunk preprocessing applies summarization or expansion. Post-processing validates that citations point to real sources.
The model can only answer well if it gets good context. Retrieve the wrong chunks and you get confidently wrong answers that look authoritative. Most RAG quality issues trace back to retrieval problems rather than model problems.
Chunking strategy matters more than people expect. Splitting a document at arbitrary character counts can break semantic units mid-sentence or mid-thought. Better strategies preserve paragraphs, sections, or other natural boundaries. Some teams use hierarchical chunking that retrieves the small chunk and includes its parent section as additional context.
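A boundary-preserving chunker is only a few lines. The sketch below packs whole paragraphs into chunks and carries the last paragraph forward as overlap; it budgets by characters rather than tokens to stay dependency-free, and a single oversized paragraph simply passes through whole.

```python
def chunk_by_paragraph(text: str, max_chars: int = 2000, overlap: int = 1) -> list[str]:
    # Split on blank lines so chunks never break mid-paragraph.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the trailing paragraph(s) into the next chunk as overlap.
            current = current[-overlap:] if overlap else []
            size = sum(len(p) for p in current)
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```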
Embedding quality affects which chunks rank as similar. Domain-specific embeddings (or fine-tuned ones) often outperform generic models for specialized content. Cohere, Voyage, and OpenAI all offer strong embedding APIs.
Hybrid search combining keyword and semantic search catches cases where pure semantic search misses. Acronyms, names, code identifiers, and specific terms benefit from keyword matching alongside semantic similarity.
Reranking dramatically improves precision when the initial retrieval set is large. A reranker model takes the top 50 results and reorders them based on relevance to the query. The cost is a single additional model call per query for substantial quality gains.
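A minimal reranking sketch using an open-source cross-encoder from the sentence-transformers library; hosted rerank APIs such as Cohere Rerank follow the same shape. The model name is one reasonable default, not a requirement, and the chunks are assumed to be dicts shaped like those in the earlier pipeline sketch.

```python
from sentence_transformers import CrossEncoder

# A small cross-encoder scores (query, passage) pairs jointly,
# which is more accurate than comparing precomputed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[dict], keep: int = 10) -> list[dict]:
    scores = reranker.predict([(query, c["text"]) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]
```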
The simplest pattern: a single vector database, a generic embedding model, basic top-k retrieval, and a prompt template that injects the chunks. Works for many use cases and ships in days.
The mid-complexity pattern adds reranking, hybrid search, and metadata filtering. The user's query is rewritten to match the index style. Results are reranked. Metadata filters scope the search to relevant document subsets, as in the sketch below. This handles harder retrieval problems and broader corpora.
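Metadata filtering can be as simple as pre-filtering candidates before similarity ranking. This sketch reuses `index`, `embed_texts`, and `cosine` from the earlier pipeline sketch and assumes each chunk carries a hypothetical `doc_type` field; a real vector database would apply the filter inside the index instead.

```python
def retrieve_filtered(query: str, doc_type: str, k: int = 5) -> list[dict]:
    # Pre-filter by metadata, then rank only the surviving candidates.
    # `doc_type` is an illustrative metadata field, not a standard one.
    qvec = embed_texts([query])[0]
    candidates = [c for c in index if c.get("doc_type") == doc_type]
    return sorted(candidates, key=lambda c: cosine(qvec, c["vector"]), reverse=True)[:k]
```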
The advanced pattern uses multiple specialized indexes (one per document type or domain), routing queries to the right index, multi-step retrieval (retrieve, then retrieve again based on what was found), and agentic retrieval where the model decides what to search for next. This handles the hardest cases but adds significant complexity.
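A toy sketch of the agentic loop, reusing `retrieve` and `client` from the earlier pipeline sketch. The stop protocol (a literal `SEARCH:` prefix) is one ad-hoc convention chosen for illustration; real systems typically use tool calling instead.

```python
def agentic_answer(question: str, max_steps: int = 3) -> str:
    # The model inspects partial results and proposes follow-up searches
    # until it judges the gathered context sufficient.
    gathered: list[dict] = []
    query = question
    text = ""
    for _ in range(max_steps):
        gathered.extend(retrieve(query))
        context = "\n\n".join(f"[{c['doc_id']}] {c['text']}" for c in gathered)
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": (
                f"Context:\n{context}\n\nQuestion: {question}\n\n"
                "If the context is sufficient, answer with citations. Otherwise "
                "reply exactly 'SEARCH: <new query>' naming what is missing.")}],
        )
        text = resp.choices[0].message.content
        if not text.startswith("SEARCH:"):
            return text
        query = text.removeprefix("SEARCH:").strip()
    return text  # best effort after exhausting the step budget
```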
Most teams start simple and add layers as they hit real quality limits. Premature complexity usually hurts more than it helps.
RAG shines when answers need to come from a specific knowledge source: product documentation, internal wikis, customer records, research papers, legal contracts. It also shines when information needs to be current beyond the model's training cutoff: today's prices, recent events, latest procedures.
RAG is unnecessary when the model already knows what you need. General questions about programming, biology, or history rarely benefit from retrieval; the model has learned enough during training. Adding retrieval wastes tokens and adds latency without improving quality.
RAG struggles with synthesis across many documents. If the answer requires reasoning across hundreds of chunks, retrieval cannot pull all of them into context. Some workflows are better served by document-level summarization first, then RAG over summaries.
RAG also struggles when retrieval cannot find the answer. If your knowledge source genuinely does not contain the answer, the model will either hallucinate or correctly say "I do not know." Which outcome you get depends on prompt design and validation.
RAG retrieves information at runtime and feeds it to the model. Fine-tuning bakes information into the model's weights through additional training. RAG is faster to update (just add documents to the index), more transparent (you can see which sources were used), and avoids the maintenance burden of retraining as the base model evolves. Fine-tuning produces faster inference and can teach style or behavior that prompting struggles with. Most teams should start with RAG and only fine-tune when they hit specific limits prompting cannot solve.
For getting started, Postgres with pgvector is hard to beat: you already know Postgres, it is free, and it scales to millions of vectors. For larger scale or specialized needs, Pinecone is the most production-mature managed option, Weaviate is strong for hybrid search, and Qdrant offers an attractive open-source middle ground. The choice matters less than people think for most workloads under tens of millions of vectors.
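A minimal pgvector sketch via psycopg, assuming the extension is installed in your Postgres instance and reusing `embed_texts` from the earlier pipeline sketch; the table layout and names are illustrative.

```python
import psycopg  # psycopg 3

conn = psycopg.connect("dbname=rag", autocommit=True)
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id        bigserial PRIMARY KEY,
            doc_id    text,
            body      text,
            embedding vector(1536)  -- must match your embedding model's dimension
        )
    """)
    # <=> is pgvector's cosine-distance operator, so ascending order
    # means most similar first.
    qvec = embed_texts(["how do I reset my password?"])[0]
    vec_literal = "[" + ",".join(str(x) for x in qvec) + "]"
    cur.execute(
        "SELECT doc_id, body FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5",
        (vec_literal,),
    )
    rows = cur.fetchall()
```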
Typical chunk sizes are 200 to 800 tokens, depending on the content. Smaller chunks improve precision but lose context; larger chunks preserve context but reduce retrieval precision. Many teams use 300 to 500 tokens with overlap (50 to 100 tokens) between adjacent chunks. The right size depends on your content and use case; experiment with your evaluation set.
For general use, OpenAI's text-embedding-3-large, Cohere's embed-english-v3.0, and Voyage AI's voyage-3 are strong defaults. For domain-specific content (legal, medical, code), specialized embeddings often work better. Some teams fine-tune embeddings on their domain for best results. The choice has real impact on retrieval quality and is worth measuring on your specific corpus.
Evaluation happens at two layers. Retrieval quality measures whether the right chunks come back: precision, recall, and mean reciprocal rank against a labeled test set. Generation quality measures whether the model answers correctly given the retrieved chunks: factual accuracy, citation correctness, completeness. Tools like Ragas and DeepEval automate both. Most teams underinvest in retrieval evaluation and over-focus on generation, missing the upstream cause of many quality issues.
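The retrieval-side metrics are simple enough to compute by hand. A sketch, assuming each labeled test case pairs a query with the set of chunk ids a human judged relevant:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the relevant chunks that appear in the top k results.
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # 1/rank of the first relevant hit; 0 if nothing relevant came back.
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

# Mean reciprocal rank over a labeled set, where `run(q)` is your retriever:
# mrr = sum(reciprocal_rank(run(q), rel) for q, rel in test_set) / len(test_set)
```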
Hybrid search combines semantic search (finding chunks similar in meaning) with keyword search (finding chunks with matching terms). Pure semantic search misses cases where exact strings matter (product SKUs, specific names, code identifiers). Pure keyword search misses cases where wording differs but meaning matches. Combining both, often with a weighted score, captures the strengths of each. For production systems with diverse content, hybrid search regularly outperforms either approach alone.
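One common way to combine the two result lists (among several) is Reciprocal Rank Fusion, which needs only the two rank orderings and no score normalization:

```python
def rrf_fuse(keyword_ranked: list[str], semantic_ranked: list[str], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each id scores the sum of 1 / (k + rank)
    # across the lists it appears in; k = 60 is the conventional default.
    scores: dict[str, float] = {}
    for ranked in (keyword_ranked, semantic_ranked):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```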
For long or structured documents, several strategies help. Hierarchical chunking retrieves a small chunk plus its parent section. Document summaries get indexed alongside chunks so retrieval can find the right document first. Long-context models can handle entire documents in some cases, eliminating chunking. The right approach depends on document length, structure, and use case. Many teams combine strategies: a summary index for routing, a chunk index for precise answers.

Most production RAG systems also require the model to cite sources. The pattern is to include source IDs or URLs in the retrieved chunks, instruct the model to cite them when generating the response, and validate that the cited sources actually appear in the retrieved context. Citations let users verify answers and build trust. They also catch hallucination when the model cites something not in the retrieval results.
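A sketch of the validation step, assuming citations are rendered as bracketed ids like [doc_42] that match the ids injected into the context:

```python
import re

def invalid_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    # Return cited ids that were never retrieved: a cheap hallucination check.
    # The [word-characters] citation format is an assumption of this sketch.
    cited = set(re.findall(r"\[([\w-]+)\]", answer))
    return sorted(cited - retrieved_ids)
```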
RAG extends beyond plain text with some additional machinery. Image and video content can be embedded with multimodal models and retrieved alongside text. Tables can be embedded as text representations or with table-specific encoders. Code benefits from code-specific embeddings. Each modality adds engineering work, but the basic pattern (embed, retrieve, generate with context) extends naturally.
Two trends matter. Long-context models (1M+ token windows) reduce the need for retrieval in some cases by letting models see entire document sets at once. Cost is the trade-off; long context is expensive per call. Agentic retrieval, where the model decides what to search for next based on partial results, handles harder synthesis tasks than basic top-k retrieval. The category will likely fragment into simple RAG (still dominant for most use cases), long-context approaches (for cases where retrieval is costly to design), and agentic retrieval (for complex multi-hop reasoning).