RAG on AWS: A Production Architecture from Ingest to Response

Five Stages, Many Service Choices

Building RAG on AWS involves more architectural decisions than the demo tutorials suggest. The five stages from document ingest to user response each have AWS services that fit and services that do not. Choosing right at each stage matters because retrieval quality determines RAG quality, and retrieval quality is mostly about engineering choices made stage by stage.

AWS published its reference architecture for RAG on Bedrock in 2024 (AWS, "Knowledge Bases for Amazon Bedrock," 2024), but the reference is one of several legitimate paths through the AWS service catalog. The right path depends on workload characteristics. For some workloads, Knowledge Bases is the right answer. For others, more custom architectures fit better.

Five stages organize the decisions. Each stage has its own service options, its own design considerations, and its own operational implications.

Stage One: Document Ingest

The first stage moves source documents into AWS in a form ready for processing. The choices here affect everything downstream.

Document sources vary: SharePoint, S3 buckets already containing documents, content management systems, databases with attached files, third-party APIs, customer-uploaded documents. Each source has its own access pattern, refresh cadence, and security profile.

The AWS services that handle this stage include Amazon S3 (the eventual landing zone for most ingest patterns), AWS DataSync for replicating from on-premises file systems, AWS Glue for ETL on structured document metadata, and Amazon AppFlow for SaaS source integration. For ingest from databases, AWS DMS or native CDC connectors are common.

The stage's output is documents in S3 with metadata (source, classification, ingest timestamp, ownership) in a consistent format. The metadata becomes the foundation for governance, lineage, and operational visibility.
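
A consistent metadata format can be as simple as a small JSON record stored next to each document. The following is a minimal sketch, assuming a sidecar-object convention; the field names, the allowed classification values, and the `.meta.json` suffix are illustrative assumptions, not an AWS requirement.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Hypothetical classification levels; adjust to your own governance scheme.
ALLOWED_CLASSIFICATIONS = {"public", "internal", "restricted"}


@dataclass(frozen=True)
class IngestMetadata:
    source: str          # e.g. "sharepoint", "s3", "cms"
    classification: str  # one of ALLOWED_CLASSIFICATIONS
    owner: str           # team or person accountable for the document
    ingested_at: str     # ISO 8601 timestamp in UTC


def build_metadata(source, classification, owner, now=None):
    """Validate inputs and stamp the record with an ingest time."""
    if classification not in ALLOWED_CLASSIFICATIONS:
        raise ValueError(f"unknown classification: {classification!r}")
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return IngestMetadata(source, classification, owner, timestamp)


def sidecar_key(document_key):
    """Object key for the metadata record (assumed naming convention)."""
    return document_key + ".meta.json"


def to_sidecar_json(meta):
    """Serialize with sorted keys so records diff cleanly."""
    return json.dumps(asdict(meta), sort_keys=True)
```

Rejecting unknown classification values at ingest keeps later security filters from silently matching nothing.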

Common mistakes at this stage include skipping metadata capture (which makes downstream debugging difficult), inadequate handling of document updates (creating ingest pipelines that work for initial load and break for ongoing operation), and weak access control on the landing zone (creating security exposure).

Stage Two: Processing and Chunking

The second stage processes documents into chunks suitable for retrieval. The decisions at this stage have outsized impact on retrieval quality.

Document parsing varies by format. PDFs need extraction with structure preservation; HTML needs cleaning; structured documents (DOCX, XLSX) need format-specific handlers. Amazon Textract handles PDFs and images. Custom parsing in Lambda handles other formats. Third-party tools (Unstructured.io, LlamaIndex parsers) sometimes outperform native AWS options for specific document types.

Chunking strategy affects retrieval quality more than most teams realize. Fixed-size chunking is the default and the most common. Semantic chunking (chunks aligned with paragraph or section boundaries) usually outperforms fixed-size. Hierarchical chunking (multiple chunk sizes per document) outperforms single-size for many workloads.
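
The difference between the first two strategies is easiest to see side by side. This is a minimal sketch, assuming character-based sizes and paragraphs separated by blank lines; the function names and default sizes are illustrative, and a production pipeline would more likely count tokens than characters.

```python
def fixed_size_chunks(text, size=200, overlap=40):
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


def paragraph_chunks(text, max_chars=200):
    """Pack whole paragraphs into chunks of at most `max_chars` characters.

    A single paragraph longer than `max_chars` is kept whole rather than cut.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Fixed-size windows can cut a sentence in half; the paragraph-aligned version never splits inside a paragraph, at the cost of uneven chunk sizes.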

The processing typically runs in Lambda for small documents, AWS Glue for batch processing of large corpora, or Step Functions for orchestrating multi-stage processing pipelines. The choice depends on document volume and processing complexity.

The stage's output is chunks with metadata in a form ready for embedding. Each chunk has its source document, location within source, and any structural metadata that supports filtering during retrieval.

Stage Three: Embedding and Indexing

The third stage produces embeddings for the chunks and stores them in a vector database.

The embedding model choice matters. Amazon Bedrock provides Titan and Cohere embedding models. Self-hosted embedding models on SageMaker (BGE, E5, others) sometimes outperform managed options for specific domains. The choice trades managed-service convenience against domain optimization.

The vector store choice has several AWS options. Amazon OpenSearch Service with vector search support is the AWS-native heavy option. Amazon Aurora with the pgvector extension fits when PostgreSQL is already in the stack. Pinecone and Weaviate, while not AWS-native, are available to AWS-hosted workloads and offer mature features. Each has different cost, performance, and operational characteristics.

The indexing pipeline coordinates embedding generation, vector storage, and metadata management. For batch workloads, Glue or Step Functions orchestrate this. For continuous workloads, Lambda triggered from S3 events handles incremental indexing.
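
For the continuous case, the Lambda function's first job is to turn an S3 event notification into a list of index actions. The sketch below relies on the standard S3 notification shape (`Records`, `eventName`, URL-encoded object keys); the action names and the idea of handing the plan to a separate chunk, embed, and index step are assumptions made for illustration.

```python
from urllib.parse import unquote_plus


def plan_index_actions(event):
    """Map S3 event records to ("upsert" | "delete", bucket, key) tuples."""
    actions = []
    for record in event.get("Records", []):
        event_name = record.get("eventName", "")
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in notifications, with "+" for spaces.
        key = unquote_plus(record["s3"]["object"]["key"])
        if event_name.startswith("ObjectCreated:"):
            actions.append(("upsert", bucket, key))
        elif event_name.startswith("ObjectRemoved:"):
            actions.append(("delete", bucket, key))
    return actions


def handler(event, context):
    """Lambda entry point: plan the work, then hand it off.

    The hand-off (chunk, embed, write to the vector store) is left out here
    because it depends on the services chosen in the earlier stages.
    """
    actions = plan_index_actions(event)
    return {"planned": len(actions)}
```

Treating removals as first-class events from the start avoids an index that only ever grows.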

The stage's output is a queryable vector index with associated metadata. The output's quality determines what retrieval can return.

Stage Four: Retrieval

The fourth stage handles retrieval at query time. Given a user query, return the most relevant chunks.

The retrieval logic typically combines several techniques. Embedding-based semantic search finds chunks semantically similar to the query. Keyword-based BM25 finds chunks with literal term matches. The two are typically combined through hybrid search, which merges both result lists into a single ranking (for example, with reciprocal rank fusion).
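
One common way to merge a semantic ranking and a keyword ranking is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as one illustrative option; the constant `k=60` is a widely used default and the chunk IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each appearance of a chunk at 1-based position `rank` adds
    1 / (k + rank) to its score; chunks are returned by descending score,
    with ties broken by chunk ID for deterministic output.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


semantic_hits = ["c1", "c2", "c3"]   # from the vector index
keyword_hits = ["c3", "c1", "c4"]    # from BM25
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

RRF uses only rank positions, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.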

Reranking adds another step where a smaller model (Cohere Rerank, Voyage Rerank, or open-source rerankers on SageMaker) reorders the candidate chunks for final selection. Reranking typically improves retrieval quality 15-30 percent over pure embedding search.

The retrieval runs in Lambda for most workloads or in ECS/Fargate for workloads requiring sustained throughput. The latency budget for retrieval is typically 100-300ms; the architecture has to fit within it.

Filtering by metadata happens during retrieval. Customer-specific knowledge bases retrieve only from that customer's chunks. Time-sensitive workloads filter by ingest date. Security-sensitive workloads filter by classification. The metadata captured in earlier stages becomes the filtering basis.
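
The three filters described above reduce to a predicate over chunk metadata. This is a minimal in-memory sketch with hypothetical field names; in practice the same conditions are usually pushed into the vector store query so that filtered-out chunks never become candidates.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant_id: str
    classification: str
    ingested_on: date


def filter_chunks(chunks, tenant_id, allowed_classifications, not_before=None):
    """Keep chunks for one tenant, within the caller's clearance,
    and (optionally) ingested on or after `not_before`."""
    return [
        c for c in chunks
        if c.tenant_id == tenant_id
        and c.classification in allowed_classifications
        and (not_before is None or c.ingested_on >= not_before)
    ]
```

A call such as `filter_chunks(candidates, "acme", {"public", "internal"}, date(2024, 1, 1))` returns only that tenant's recent, non-restricted chunks.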

Stage Five: Generation and Response

The fifth stage assembles retrieved context with the user query and produces the response.

Bedrock provides the model invocation, with Anthropic's Claude models and other Bedrock-hosted models as options. The prompt construction includes the user query, the retrieved chunks (typically 5-10 chunks formatted for the model), and any system prompt establishing tone, format, and behavior constraints.
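
Prompt construction along these lines can be sketched as a pure function that returns the system prompt and the user message. The wording of the system prompt, the chunk dictionary keys (`source`, `text`), and the default of eight chunks are assumptions for illustration; the model call itself is omitted.

```python
SYSTEM_PROMPT = (
    "Answer using only the numbered context passages. "
    "Cite passages with markers like [1]. "
    "If the context does not contain the answer, say so."
)


def build_prompt(query, chunks, max_chunks=8):
    """Return (system_prompt, user_message) for a RAG request.

    `chunks` is a list of dicts with "source" and "text" keys, already
    ordered by relevance; only the first `max_chunks` are included.
    """
    selected = chunks[:max_chunks]
    context = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(selected, start=1)
    )
    user_message = f"Context:\n{context}\n\nQuestion: {query}"
    return SYSTEM_PROMPT, user_message
```

Numbering the passages in the prompt gives the model a stable identifier to cite and gives the application something to check against afterwards.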

Citation handling is part of generation design. The model is prompted to cite retrieved chunks. The response is parsed for citations. Bad citation hygiene is a common production issue; structured prompting that requires explicit citation markers helps.
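
Parsing the response for citations can start as a simple check that every marker points at a chunk that was actually supplied. This sketch assumes numeric markers of the form `[n]`, numbered from 1 in the order the chunks appeared in the prompt; that marker format is an assumption, not something Bedrock enforces.

```python
import re

CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def check_citations(response, num_chunks):
    """Classify citation markers found in a model response.

    Returns the valid chunk numbers, the out-of-range numbers, and a flag
    for responses that cite nothing at all.
    """
    cited = [int(n) for n in CITATION_PATTERN.findall(response)]
    valid = sorted({n for n in cited if 1 <= n <= num_chunks})
    invalid = sorted({n for n in cited if not 1 <= n <= num_chunks})
    return {"valid": valid, "invalid": invalid, "uncited": not cited}


report = check_citations("X is Y [1]. Z follows [4].", num_chunks=3)
```

Out-of-range markers and uncited responses are both worth logging, since each points at a different prompt or retrieval problem.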

Streaming response generation reduces perceived latency for users. Bedrock supports streaming responses. The application layer has to be designed for streaming from the start; retrofitting it later is harder.

Post-processing handles formatting, structured output validation, and any guardrails on the response (content filtering, PII detection in generated text).

The stage's output is the user-facing response with appropriate citation and formatting.

What This Costs at Scale

A production RAG system on AWS at moderate scale (1M document corpus, 10K queries per day) typically costs $20K-$80K per month including embedding generation, vector storage, retrieval compute, and generation. The breakdown varies by workload but is roughly 30 percent vector storage and retrieval, 50 percent generation (model inference), and 20 percent supporting infrastructure (Lambda, processing, observability).
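
The figures above imply a per-query cost that is worth making explicit. The arithmetic below uses only the article's numbers plus one assumption, a 30-day month; it is a back-of-envelope sketch, not a pricing model.

```python
# Approximate cost shares quoted in the text.
SHARES = {
    "vector storage and retrieval": 0.30,
    "generation": 0.50,
    "supporting infrastructure": 0.20,
}


def breakdown(monthly_total):
    """Split a monthly bill (USD) across the three cost categories."""
    return {name: monthly_total * share for name, share in SHARES.items()}


def cost_per_query(monthly_total, queries_per_day=10_000, days_per_month=30):
    """Average all-in cost per query, assuming a 30-day month."""
    return monthly_total / (queries_per_day * days_per_month)


low, high = cost_per_query(20_000), cost_per_query(80_000)
```

At 10K queries per day the quoted range works out to roughly $0.07 to $0.27 per query, about half of which is model inference.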

The engineering investment to build the architecture typically requires 2-4 senior engineers for 2-3 months, plus ongoing 20-40 percent of one engineer for sustained operation. The investment compounds: the second RAG system on the established infrastructure costs much less than the first.

What Logiciel Does Here

Logiciel works with engineering teams building production RAG on AWS where the demo-stage architecture has not scaled. The work is typically structured around the five-stage assessment followed by service selection and architecture for each stage appropriate to the workload's specific requirements.

The RAG at Production Scale framework covers the four-layer retrieval quality stack. The AI Data Pipelines framework covers the broader pipeline considerations that ingest and processing depend on.

A 30-minute working session is enough to assess your current RAG architecture against the five-stage reference.

Frequently Asked Questions

Should I use Bedrock Knowledge Bases or build custom?

Use Knowledge Bases for workloads that fit its abstractions (standard document types, supported vector stores, no custom retrieval logic). Build custom when the workload requires a specific chunking strategy, hybrid search, reranking customization, or integration with non-AWS components.

Which vector store on AWS is best?

It depends on the workload. OpenSearch fits when you need full-text search alongside vector search. Aurora pgvector fits when PostgreSQL is already in the stack. Pinecone or Weaviate (not AWS-native but usable from AWS) fit specific workload patterns. The choice depends on scale, query patterns, and existing infrastructure.

How do I handle document updates and deletions?

Through change data capture (CDC) patterns at the ingest stage that propagate changes through processing, embedding, and indexing. Updates require re-embedding affected chunks. Deletions require removing chunks from the index and any caches. The update path is often less well-designed than the initial load path.
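
One way to decide which chunks need re-embedding and which need deleting is to compare content hashes between the index and the latest version of the documents. This is a minimal sketch; the chunk ID format and the idea of storing a hash alongside each indexed chunk are assumptions.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(indexed_hashes, current_chunks):
    """Compare the index against the latest chunks.

    indexed_hashes: chunk_id -> hash currently stored in the index
    current_chunks: chunk_id -> chunk text from the latest document versions
    Returns (to_embed, to_delete) as sorted lists of chunk IDs.
    """
    current_hashes = {cid: content_hash(text) for cid, text in current_chunks.items()}
    # New chunks and chunks whose text changed both need a fresh embedding.
    to_embed = sorted(
        cid for cid, digest in current_hashes.items()
        if indexed_hashes.get(cid) != digest
    )
    # Chunks that no longer exist must leave the index (and any caches).
    to_delete = sorted(cid for cid in indexed_hashes if cid not in current_hashes)
    return to_embed, to_delete
```

Unchanged chunks are skipped entirely, which keeps embedding cost proportional to what actually changed.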

How do I measure RAG quality?

Through eval frameworks that test retrieval (does the right chunk come back) separately from generation (does the response use the chunk correctly). A common setup combines RAGAS, the open-source RAG eval framework, with internal eval sets covering the failure modes typical for the workload.
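
The retrieval half of that evaluation can be measured with two standard metrics, hit rate at k and mean reciprocal rank. This sketch assumes an internal eval set with exactly one expected chunk per query; it does not use RAGAS, and the query and chunk IDs are made up.

```python
def hit_rate_at_k(results, expected, k=5):
    """Fraction of queries whose expected chunk appears in the top k results."""
    if not expected:
        raise ValueError("eval set is empty")
    hits = sum(
        1 for query, chunk_id in expected.items()
        if chunk_id in results.get(query, [])[:k]
    )
    return hits / len(expected)


def mean_reciprocal_rank(results, expected):
    """Average of 1 / rank of the expected chunk (0 when it is missing)."""
    if not expected:
        raise ValueError("eval set is empty")
    total = 0.0
    for query, chunk_id in expected.items():
        ranked = results.get(query, [])
        if chunk_id in ranked:
            total += 1.0 / (ranked.index(chunk_id) + 1)
    return total / len(expected)


expected = {"q1": "c1", "q2": "c9"}                     # labeled eval set
results = {"q1": ["c3", "c1"], "q2": ["c2", "c4"]}      # retriever output
```

Tracking these numbers per release makes it clear whether a quality regression came from retrieval or from generation.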

What about multi-tenant RAG?

Through metadata filtering at retrieval. Each tenant's chunks are tagged with a tenant ID; queries include a tenant filter; retrieval returns only that tenant's chunks. The architecture supports many tenants on shared infrastructure with clean isolation.

Sources:
- AWS, "Knowledge Bases for Amazon Bedrock," 2024
- AWS, "Bedrock Documentation," 2024
