Retrieval-Augmented Generation (RAG) gives LLMs access to your company data instead of generating answers from stale training data. To build an enterprise RAG pipeline: (1) ingest and chunk documents, (2) embed chunks and store them in a vector database, (3) retrieve relevant chunks at query time using similarity search, (4) pass them to an LLM as grounded context. Production complexity — latency, accuracy, access control, multi-source retrieval, and compliance — is where most teams get stuck.
Most enterprises start their AI journey here, and for good reason. Years of accumulated knowledge sit scattered across Confluence, SharePoint, Google Drive, Salesforce, internal wikis, and shared drives. Getting an AI to search all of it and answer employee questions accurately is an obvious first win. The global RAG market is projected to grow from $1.96 billion in 2025 to over $40 billion by 2035 (ResearchAndMarkets, 2025), reflecting how central this pattern has become.
This guide covers the complete architecture, the production challenges most teams underestimate, and — critically — the gap that surfaces once your RAG pipeline works: the distance between finding information and completing the work that information enables.
RAG pipeline architecture: the 4-step foundation
Every RAG system has the same core flow:
1. Ingest your documents. Split them into chunks. Convert each chunk into a vector embedding — a numerical representation of the text's meaning.
2. Store those embeddings in a vector database alongside the original text chunks. This is your knowledge base.
3. When a user asks a question, convert the question into an embedding using the same model. Search the vector database for the most similar chunks.
4. Pass the retrieved chunks as context to an LLM along with the question. The LLM generates an answer grounded in the retrieved content.
Four steps. In a prototype on clean sample data, you can have this working in an afternoon. The distance between that prototype and a production enterprise system is where the real engineering happens.
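The four steps above can be sketched end to end in a few dozen lines. This is a toy illustration, not a production design: the `embed` function is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the LLM call itself is left out. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(document: str) -> list[str]:
    """Step 1: split on blank lines (a deliberately naive chunker)."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def build_index(documents: list[str]) -> list[tuple[Counter, str]]:
    """Steps 1-2: embed each chunk and store it next to the original text."""
    return [(embed(c), c) for doc in documents for c in chunk(doc)]


def retrieve(index: list[tuple[Counter, str]], question: str, k: int = 2) -> list[str]:
    """Step 3: embed the question the same way and rank chunks by similarity."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Step 4: assemble the context for the LLM (the model call is not shown)."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


docs = [
    "Expense reports are due on the fifth of each month.\n\n"
    "Travel must be booked through the portal.",
    "The VPN client is required for remote access.",
]
index = build_index(docs)
top = retrieve(index, "When are expense reports due?", k=1)
prompt = build_prompt("When are expense reports due?", top)
```

Swapping `embed` for a real model and the list for a vector store leaves the shape of the flow unchanged, which is why the prototype is quick and the production work lies elsewhere.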
Step 1: Document ingestion — chunking, preprocessing, and metadata
This is where most teams underestimate the work.
Enterprise documents are not clean text files. They are PDFs with tables, images, and mixed formatting. PowerPoint decks with charts. Confluence pages with embedded videos. Salesforce records with nested objects. Excel spreadsheets with formulas. Scanned documents with OCR artifacts.
What you need to handle:
- Format diversity. PDFs, Word docs, PowerPoint, Excel, HTML, Markdown, email (EML/MSG), images, scanned documents. Each requires different parsing logic. A PDF table extracted as plain text loses its structure. A chart in a PowerPoint deck becomes an image, not data.
- Chunking strategy. How you split documents directly affects retrieval quality. Chunks that are too large contain irrelevant information that dilutes the answer. Chunks that are too small lose context. Fixed-size chunking (splitting every N tokens), semantic chunking (splitting by meaning), and hierarchical chunking (maintaining parent-child document relationships) each produce different retrieval outcomes. There is no universally correct chunk size — it depends on your document types and query patterns.
- Metadata extraction. The chunk text alone is not enough. You need to know where it came from (document, page, section), when it was created, who owns it, what department it belongs to, and what access permissions it carries. This metadata drives filtering, access control, and source citation.
- Incremental updates. Enterprise knowledge changes daily. New documents get added. Existing ones get updated. Old ones get archived. Your pipeline needs to handle these updates without re-processing the entire corpus — otherwise your index drifts from reality.
- Quality control. Duplicate documents, outdated versions, conflicting information across sources. If your knowledge base contains three versions of the same policy (one current, two outdated), the retriever will surface the wrong one. Deduplication and version management are not optional.
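As a minimal sketch of the chunking, metadata, and deduplication points above, the following splits text into fixed-size overlapping windows, attaches source and permission metadata to each chunk, and drops exact duplicates by content hash. It counts whitespace-separated words rather than model tokens, which is a simplifying assumption, and the `Chunk` fields are illustrative rather than a recommended schema.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str                # document the chunk came from
    position: int              # order of the chunk within the document
    allowed_groups: frozenset  # groups permitted to read the source
    content_hash: str          # used for deduplication and change detection


def chunk_words(text: str, source: str, allowed_groups, size: int = 50, overlap: int = 10) -> list[Chunk]:
    """Fixed-size chunking over words, with `overlap` words shared between neighbours."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for position, start in enumerate(range(0, len(words), step)):
        window = " ".join(words[start:start + size])
        digest = hashlib.sha256(window.encode("utf-8")).hexdigest()
        chunks.append(Chunk(window, source, position, frozenset(allowed_groups), digest))
        if start + size >= len(words):
            break  # the last window already reached the end of the document
    return chunks


def deduplicate(chunks: list[Chunk]) -> list[Chunk]:
    """Keep the first occurrence of each distinct chunk text."""
    seen = set()
    unique = []
    for c in chunks:
        if c.content_hash not in seen:
            seen.add(c.content_hash)
            unique.append(c)
    return unique


sample_text = " ".join(f"w{i}" for i in range(120))
sample_chunks = chunk_words(sample_text, "policy.pdf", {"hr"}, size=50, overlap=10)
```

The same content hash can also drive incremental updates: if a re-ingested chunk hashes to a value already in the index, it does not need to be re-embedded. Exact hashing only catches verbatim duplicates; near-duplicate and version detection need more than this.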
Framework approaches:
- Haystack (deepset) provides document processors — converters, cleaners, splitters — as composable pipeline components.
- LlamaIndex offers sophisticated data connectors via LlamaHub and indexing strategies built to handle complex multi-format data structures.
- LangChain includes document loaders and text splitters with broad source coverage, though less specialized for complex document processing than Haystack or LlamaIndex.
Step 2: Embedding documents and building the vector index
Once documents are processed and chunked, each chunk is converted into a vector embedding and stored for retrieval.
Key decisions:
- Embedding model. The model determines the quality of your semantic search. The MTEB (Massive Text Embedding Benchmark) leaderboard is the standard reference for comparing embedding models. As of early 2026, Google's Gemini Embedding 001 leads the English MTEB benchmark with an average score of 68.32, while open-weight models from NVIDIA and Qwen are competitive alternatives. Cohere's Embed v3 and open-source models (BGE, E5) perform well for domain-specific enterprise retrieval. Multilingual requirements, document length, and domain specificity all factor into the decision. Because switching models later means re-embedding the entire corpus, the model you choose affects every retrieval query for as long as the index lives, which makes this a high-stakes selection.
- Vector database. The main enterprise options differ on performance, operational model, and cost:
  - Pinecone: fully managed, straightforward to operate, strong enterprise SLAs
  - Weaviate: AI-native with built-in vectorization, available managed or self-hosted
  - Qdrant: open-source, Rust-based, with strong filtering and benchmark-leading latency (around 20ms p95 in independent benchmarks, vs. 30ms for Weaviate and 50ms for Pinecone under equivalent load)
  - pgvector: PostgreSQL extension, keeps you in existing infrastructure
  - Elasticsearch: a fit if you already operate it; built-in hybrid search support
  - Chroma: lightweight, suited to development and smaller deployments
  Published benchmarks, including Qdrant's own benchmark suite and third-party comparisons, consistently show sub-100ms query latency across all major databases for typical RAG workloads, which is fast enough for user-facing applications. At scale above 50 million vectors, architecture and infrastructure choices start to matter more than raw algorithm performance.
- Hybrid search. Pure vector search misses exact-match queries ("What is the policy number for EMEA-2024-037?"). Hybrid search combines vector similarity with traditional keyword search (BM25) and consistently outperforms either method alone, particularly for enterprise knowledge bases where both semantic understanding and precise term matching matter. Elasticsearch and Weaviate support hybrid search natively. Other databases require building the hybrid retrieval layer explicitly.
- Dimensionality and cost. Higher-dimension embeddings capture more semantic nuance but cost more to store and query. At enterprise scale (millions of documents, thousands of daily queries), embedding model choice and dimension settings have measurable infrastructure cost implications.
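Where the database does not offer hybrid search natively, the two result lists have to be merged in application code. One common way to do that is reciprocal rank fusion (RRF), sketched below. RRF is an assumption here, not something this guide prescribes; the constant `k=60` is the value conventionally used with it, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one.

    Each document scores 1 / (k + rank) for every list it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["doc_a", "doc_b", "doc_c"]   # from similarity search
keyword_hits = ["doc_d", "doc_b", "doc_a"]  # from BM25 keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF only needs ranks, not comparable scores, which is convenient because BM25 scores and cosine similarities live on different scales.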
Step 3: Query-time retrieval — similarity search, re-ranking, and filtering
Retrieval is where a basic RAG system becomes a good one. The question is not simply "find similar chunks." It is "find the right information to answer this specific question, for this specific user, from this specific set of permitted documents."
What separates good retrieval from basic retrieval:
- Re-ranking. The initial vector search returns the top N candidates by approximate similarity. A re-ranker — Cohere Rerank, cross-encoder models, or Haystack's built-in rankers — re-scores those candidates based on deeper relevance analysis. Re-ranking is one of the highest-impact improvements available to a RAG pipeline and is now considered standard practice in production enterprise deployments.
- Query transformation. The user's natural language question is rarely the optimal search query. Techniques like HyDE (Hypothetical Document Embeddings, where the model generates a hypothetical answer and searches for documents similar to that), query decomposition (breaking complex questions into sub-queries), and query expansion improve retrieval by reformulating the search before it hits the index.
- Metadata filtering. Enterprise retrieval is not purely about semantic relevance. Access control (can this user see this document?), recency (is this the current version?), and source priority (prefer the official policy over the informal Slack discussion about it) all require metadata filtering at retrieval time. This is not a configuration option — it is a core architectural requirement.
- Multi-step retrieval. Some questions require information from multiple sources that cannot be retrieved in a single query. "What is our churn rate and how does it compare to our Q4 target?" requires data from the analytics system and the quarterly plan. Multi-hop retrieval strategies — retrieve, synthesize intermediate context, retrieve again — handle this, but they add latency and complexity.
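The filtering and re-ranking steps above can be sketched as a two-stage function: filter on metadata first, shortlist by the cheap vector score, then re-score the shortlist with a more expensive relevance function. The `overlap_score` scorer is a toy stand-in for a cross-encoder or hosted re-ranker, and the `Candidate` fields and sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    doc_id: str
    text: str
    vector_score: float  # approximate similarity from the first-stage search
    is_current: bool     # metadata flag: False for superseded versions


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a re-ranker: fraction of query terms found in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def retrieve_and_rerank(
    query: str,
    candidates: list[Candidate],
    scorer: Callable[[str, str], float] = overlap_score,
    first_stage_n: int = 3,
    final_k: int = 2,
) -> list[Candidate]:
    # Metadata filter: outdated versions never reach the ranking stages.
    current = [c for c in candidates if c.is_current]
    # Stage 1: cheap shortlist by vector score.
    shortlist = sorted(current, key=lambda c: c.vector_score, reverse=True)[:first_stage_n]
    # Stage 2: re-score only the shortlist with the more expensive scorer.
    reranked = sorted(shortlist, key=lambda c: scorer(query, c.text), reverse=True)
    return reranked[:final_k]


candidates = [
    Candidate("old", "parental leave policy france 2019", 0.95, is_current=False),
    Candidate("general", "general leave overview", 0.90, is_current=True),
    Candidate("fr", "parental leave policy for france employees", 0.80, is_current=True),
    Candidate("vpn", "vpn setup guide", 0.10, is_current=True),
]
results = retrieve_and_rerank("parental leave policy france", candidates)
```

In this example the superseded document has the highest vector score but is filtered out, and the re-ranker promotes the France-specific policy above the more generic overview.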
Haystack's pipeline-first architecture is particularly strong here. Each retrieval step (embedding lookup, filtering, re-ranking) is an explicit pipeline component you can tune and evaluate independently. For teams that need precise engineering control over retrieval quality, this composability is a real advantage.
Step 4: LLM generation with retrieved context
The retrieved chunks go to an LLM as context. The LLM generates an answer grounded in that content.
What matters at the generation layer:
- Prompt engineering. How you structure the context and question for the LLM affects answer quality substantially. System prompts that instruct the model to answer only from provided context, cite sources explicitly, and acknowledge uncertainty when information is insufficient are the baseline for production use. These instructions directly reduce hallucination rates.
- Model selection. GPT-4o, Claude 3.5 Sonnet, Cohere Command R+, Llama 3.1, Mistral Large, and Gemini 1.5 Pro each have different strengths for generation quality, instruction-following, and long-context handling. Enterprise requirements around data privacy, deployment location, and cost often narrow the practical options before technical performance comes into play.
- Context window management. Enterprise questions may surface 20 or more relevant chunks. Not all of them fit within the model's context window. Selection, ordering, and compression of retrieved context affect both answer quality and per-query cost.
- Citation and traceability. Enterprise users need to verify answers. The system must surface which documents informed each response, with links back to the source. In regulated industries, this is not a UX nicety — it is a compliance requirement.
- Guardrails. The LLM should answer from the retrieved context and acknowledge when it lacks sufficient information. Implementing reliable guardrails — using evaluation frameworks like RAGAS, TruLens, or DeepEval — requires deliberate engineering. Out-of-the-box LLM behavior without guardrails will produce hallucinations in production.
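A minimal sketch of the prompt, context-budget, and citation points above: pack the most relevant chunks into a fixed budget, number them so the model can cite them, and instruct it to stay within the context. The budget is counted in words as a rough proxy for tokens, the system prompt wording is illustrative, and `build_grounded_prompt` is a hypothetical helper, not any framework's API.

```python
def build_grounded_prompt(question: str, chunks: list[tuple[str, str]], max_context_words: int = 200):
    """Build a prompt from (source, text) chunks ordered most relevant first.

    Returns the prompt and the list of sources actually included,
    so the application can show citations next to the answer.
    """
    selected = []
    used = 0
    for source, text in chunks:
        cost = len(text.split())
        if used + cost > max_context_words:
            continue  # skip chunks that would overflow; a smaller one later may still fit
        used += cost
        selected.append((source, text))

    context = "\n".join(
        f"[{i}] ({source}) {text}" for i, (source, text) in enumerate(selected, start=1)
    )
    system = (
        "Answer only from the numbered context below. Cite sources as [n]. "
        "If the context is insufficient, say you do not know."
    )
    prompt = f"{system}\n\nContext:\n{context}\n\nQuestion: {question}"
    return prompt, [source for source, _ in selected]


example_chunks = [
    ("a.pdf", "one two three"),
    ("b.pdf", "four five six seven"),
    ("c.pdf", "eight"),
]
example_prompt, example_sources = build_grounded_prompt("What?", example_chunks, max_context_words=4)
```

Returning the included sources separately keeps citation display independent of whatever the model chooses to write.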
RAG production challenges: what breaks between demo and deployment
A prototype on controlled data is not a production enterprise system. Here is what breaks at scale.
Security and document-level access control
Enterprise documents carry different access levels. HR policies are company-wide. Salary bands are leadership-only. Customer contracts are team-specific. A RAG system that retrieves based purely on semantic relevance will surface restricted documents to unauthorized users.
You need document-level access control enforced at retrieval time, synchronized with your identity provider (Active Directory, Okta), and maintained as permissions change. This is not a toggle in your vector database configuration. It is an engineering project that touches your identity system, your metadata schema, and every document in your corpus.
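To make the retrieval-time check concrete, here is a minimal sketch of group-based filtering with a default-deny rule. It assumes each indexed chunk carries the set of groups allowed to read its source document and that the user's groups have already been resolved from the identity provider; both the schema and the group names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    doc_id: str
    text: str
    allowed_groups: frozenset  # copied from the source system's permissions at index time


def visible_chunks(chunks: list[IndexedChunk], user_groups) -> list[IndexedChunk]:
    """Return only chunks the user may read.

    Default deny: a chunk with no allowed groups is visible to nobody.
    """
    groups = frozenset(user_groups)
    return [c for c in chunks if c.allowed_groups & groups]


corpus = [
    IndexedChunk("handbook", "Company-wide HR policies.", frozenset({"all-staff"})),
    IndexedChunk("salary-bands", "Leadership-only salary bands.", frozenset({"leadership"})),
    IndexedChunk("orphan", "Document with no permissions recorded.", frozenset()),
]
engineer_view = visible_chunks(corpus, {"all-staff", "engineering"})
leader_view = visible_chunks(corpus, {"all-staff", "leadership"})
```

The filter has to run before results reach the LLM, and the `allowed_groups` values are only as good as the last permission sync, which is why keeping them current is the larger part of the work.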
Evaluation and quality monitoring
In production, with thousands of queries daily, you cannot manually review every response. A production-grade evaluation system requires:
- Automated evaluation pipelines measuring retrieval precision and generation faithfulness (frameworks like RAGAS provide metrics for context precision, context recall, answer relevance, and faithfulness)
- Ground truth test sets for regression testing when you change models, chunking strategies, or retrieval parameters
- User feedback mechanisms that capture "this answer was wrong" signals and route them into evaluation
- Monitoring for retrieval drift — answer quality degrading as the knowledge base changes faster than the index keeps up
Building this evaluation infrastructure is a separate engineering project from building the retrieval pipeline itself.
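As a small illustration of regression testing against a ground truth set, the sketch below computes mean precision@k and recall@k for a retriever. These are plain retrieval metrics written from scratch, not the RAGAS API, and the test questions, document IDs, and `fake_retriever` are invented for the example.

```python
from typing import Callable


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if d in relevant) / len(top)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def evaluate(test_set: list[tuple[str, set[str]]], retriever: Callable[[str], list[str]], k: int = 3) -> dict:
    """Average both metrics over a ground truth test set of (question, relevant IDs)."""
    precisions, recalls = [], []
    for question, relevant in test_set:
        retrieved = retriever(question)
        precisions.append(precision_at_k(retrieved, relevant, k))
        recalls.append(recall_at_k(retrieved, relevant, k))
    n = len(test_set)
    return {
        "precision_at_k": sum(precisions) / n if n else 0.0,
        "recall_at_k": sum(recalls) / n if n else 0.0,
    }


ground_truth = [
    ("What is the travel policy?", {"a", "c"}),
    ("How do I reset my VPN token?", {"z", "w"}),
]
canned_results = {
    "What is the travel policy?": ["a", "b", "c"],
    "How do I reset my VPN token?": ["x", "y", "z"],
}


def fake_retriever(question: str) -> list[str]:
    return canned_results[question]


report = evaluate(ground_truth, fake_retriever, k=3)
```

Running the same test set before and after a change to the embedding model, chunking strategy, or retrieval parameters gives a direct signal of whether retrieval got better or worse. Generation metrics such as faithfulness need an additional judging step that this sketch does not cover.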
Multi-system knowledge integration
Enterprise knowledge does not live in one place. Confluence holds policies and procedures. Salesforce holds customer data. Jira holds project status. SharePoint holds documents. Slack holds decisions. Google Drive holds presentations. A useful RAG system retrieves across all of these, maintains freshness, resolves conflicts, and handles format differences.
Each integration carries its own authentication model, data transformation requirements, incremental sync logic, error handling, and schema change management. At 10 systems, this is its own platform engineering workstream.
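One way to keep that per-system work contained is to put every source behind the same small interface and track a sync cursor per connector. The sketch below is an assumption about how such a layer could look, not a description of any particular product: `Connector`, `Change`, and `sync` are hypothetical names, and a real connector would also handle authentication, paging, and errors.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Protocol


@dataclass(frozen=True)
class Change:
    doc_id: str
    updated_at: int      # e.g. epoch seconds reported by the source system
    text: Optional[str]  # None means the document was deleted or archived


class Connector(Protocol):
    name: str

    def changes_since(self, cursor: int) -> Iterable[Change]: ...


class InMemoryConnector:
    """Test double standing in for a real source such as a wiki or CRM."""

    def __init__(self, name: str, changes: list[Change]):
        self.name = name
        self.changes = changes

    def changes_since(self, cursor: int) -> list[Change]:
        return [c for c in self.changes if c.updated_at > cursor]


def sync(connectors: Iterable[Connector], index: dict, cursors: dict) -> None:
    """Apply only the changes each source reports since its last cursor."""
    for connector in connectors:
        cursor = cursors.get(connector.name, 0)
        for change in sorted(connector.changes_since(cursor), key=lambda c: c.updated_at):
            key = f"{connector.name}:{change.doc_id}"
            if change.text is None:
                index.pop(key, None)      # remove deleted documents from the index
            else:
                index[key] = change.text  # insert or overwrite with the new version
            cursor = max(cursor, change.updated_at)
        cursors[connector.name] = cursor


wiki = InMemoryConnector("wiki", [Change("p1", 1, "v1"), Change("p2", 2, "draft")])
doc_index: dict = {}
sync_cursors: dict = {}
sync([wiki], doc_index, sync_cursors)
```

A second call to `sync` with unchanged sources does no work, which is the property that keeps the index from needing a full re-processing pass.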
Scalability and infrastructure cost
A prototype with 1,000 documents and 10 users is cheap to run. An enterprise system with 10 million documents, 50,000 users, and thousands of queries per hour is a different infrastructure profile. Embedding computation, vector storage, re-ranking inference, LLM generation, and data freshness pipelines all scale together. According to analysis of enterprise RAG deployments, governance and compliance infrastructure alone typically adds 20–30% to baseline infrastructure costs in regulated deployments. Architecture decisions made for the prototype — embedding model, chunk size, database choice — directly determine production cost.
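A back-of-the-envelope calculation shows how prototype choices carry into production cost. The sketch below estimates raw vector storage only; the figures of 20 chunks per document and 1,024 dimensions are illustrative assumptions, and the result excludes index overhead, replicas, metadata, and the original text.

```python
def raw_vector_storage_gb(
    num_documents: int,
    chunks_per_document: int,
    dimensions: int,
    bytes_per_value: int = 4,  # 4 bytes for float32 components
) -> float:
    """Raw size of the embedding vectors in decimal gigabytes."""
    total_bytes = num_documents * chunks_per_document * dimensions * bytes_per_value
    return total_bytes / 1e9


# 10 million documents, assuming 20 chunks each and 1,024-dimension float32 vectors.
full_size = raw_vector_storage_gb(10_000_000, 20, 1024)
# The same corpus with 512-dimension vectors needs half the space.
half_size = raw_vector_storage_gb(10_000_000, 20, 512)
```

Under these assumptions the raw vectors alone come to roughly 819 GB at 1,024 dimensions and about 410 GB at 512, before any of the other cost drivers listed above.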
Governance and compliance
For regulated industries and public companies: Where does the data go? Which models process it? Can you audit every query and response? Is PII handled correctly? Are access controls provable to auditors? Does the system meet SOC 2, ISO 27001, GDPR, HIPAA, or SOX requirements?
Governance is not a feature you add after the system is built. It is an architectural concern that shapes decisions at every layer, from document ingestion (what gets indexed and when) to response generation (what gets logged and retained).
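As one small example of what "logged and retained" can mean at the generation layer, the sketch below builds an audit record per query. The field names are hypothetical, the email regex is a deliberately simple illustration of PII redaction rather than a complete solution, and what must actually be retained depends on your compliance requirements.

```python
import hashlib
import json
import re
from datetime import datetime, timezone

# Simple pattern for email addresses; real PII detection needs far more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Replace email addresses before the text is written to the log."""
    return EMAIL.sub("[EMAIL]", text)


def audit_record(user_id: str, question: str, source_ids, answer: str, model: str, now=None) -> str:
    """Return one JSON line describing who asked what, and what the answer was based on."""
    now = now or datetime.now(timezone.utc)
    record = {
        "timestamp": now.isoformat(),
        "user_id": user_id,
        "question": redact(question),
        "source_ids": list(source_ids),  # which documents informed the answer
        "answer_sha256": hashlib.sha256(answer.encode("utf-8")).hexdigest(),
        "model": model,
    }
    return json.dumps(record, sort_keys=True)


sample_line = audit_record(
    "u1",
    "Email jane.doe@example.com the policy",
    ["d1"],
    "The policy is attached.",
    "example-model",
    now=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
```

Storing a hash of the answer instead of the full text is one possible trade-off between traceability and data retention; some regimes will require the full response instead.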
Beyond RAG: from information retrieval to autonomous action
Here is what becomes apparent once a RAG pipeline works in production.
Suppose you have built a production-grade enterprise RAG system. Security, evaluation, multi-source retrieval, scale, and compliance are all handled. Employees ask questions and receive grounded, accurate answers from your enterprise knowledge.
Now what?
An employee asks: "What is the onboarding process for a new enterprise customer in France?"
The RAG system retrieves the relevant documentation and generates a comprehensive answer. The employee reads it. And then manually:
- Checks the CRM for the customer's current status
- Validates their billing information against the ERP
- Confirms product compatibility based on their contract terms
- Identifies which team handles French enterprise accounts
- Creates the onboarding ticket in the correct format
- Sends the welcome sequence from the right template
- Schedules the kickoff call with the assigned account manager
- Updates the CRM with the onboarding status
The RAG system handled step zero: finding the information. Steps 1 through 8 are the actual work. They span five systems, require judgment calls at steps 3 and 4, and represent 30 minutes of manual effort per customer.
This pattern surfaces reliably once RAG works. The retrieval was not the bottleneck. The workflow around the retrieval is.
Multiply this across:
- Hundreds of customer onboardings per month
- Thousands of support tickets requiring cross-system resolution
- Sales teams researching thousands of accounts across dozens of data sources
- Compliance teams monitoring regulatory changes across multiple jurisdictions
- HR teams processing employee requests that touch four internal systems
The ROI of finding information faster is real but bounded. The ROI of completing the entire workflow autonomously is a different order of magnitude.
RAG vs. autonomous agents: which path for your enterprise?
Once you see this gap, the decision simplifies to two paths.
Path 1: Build the workflow layer on top of your RAG pipeline
Extend your RAG system. Add custom integrations with your CRM, ERP, ticketing system, and communication tools. Build decision logic that validates data against business rules. Implement exception handling for edge cases. Add escalation flows for human judgment. Build audit trails and compliance controls for every action the system takes.
This is technically feasible. Some engineering teams do it. But the RAG project that was already a significant engineering investment becomes a full workflow automation platform. Retrieval was step one. You are now building steps two through ten.
The engineering cost of this path is why Lambda, a leading AI infrastructure company with world-class engineers, chose not to take it. Their CTO evaluated the build option and concluded the opportunity cost was too high. Every engineering hour spent on internal workflow automation was an hour not spent on their core product.
Path 2: Use a platform where RAG is one capability within complete workflow automation
This is what Nexus is. The platform supports both real-time RAG (connecting agents to live data from CRMs, ERPs, databases) and stored RAG (uploading documents from Confluence, Google Drive, SharePoint for vectorized knowledge retrieval). RAG is one step in agents that complete entire workflows end to end.
A Nexus agent handling customer onboarding does not just retrieve the process documentation. It collects customer information, validates against CRM and billing systems, checks product compatibility, routes edge cases to the right team, escalates complex issues with full context, updates multiple systems, and sends confirmations — autonomously, with every decision logged and traceable.
What Path 2 looks like in practice:
- Orange Group (multi-billion euro telecom): Business team built autonomous customer onboarding agents. 4-week deployment. 50% conversion improvement. 90% autonomous resolution. 100% team adoption. Business teams own the agents, not engineering.
- Lambda (leading AI infrastructure company): Agents monitor thousands of accounts, synthesize intelligence from dozens of data sources, and surface significant pipeline opportunity. Over 24,000 research hours added annually. Built by a non-engineer in days.
- European telecom: Deployed agents across support operations. 40% support volume freed across millions of interactions. Forward Deployed Engineers identified use cases and handled integration complexity.
The difference between Path 1 and Path 2 is not only speed. It is who does the work. Path 1 requires your engineering team to build and maintain everything. Path 2 lets business teams deploy agents — supported by Forward Deployed Engineers — while your engineers focus on your core product.
How to choose the right RAG approach for your enterprise
If retrieval IS the product — you are building a search tool, a document QA system, or a knowledge base as a product feature — a RAG framework is the right choice. You need engineering control over every retrieval step because retrieval quality is your product quality. Haystack is strong for pipeline-first RAG. LlamaIndex handles complex multi-format data well. LangChain offers breadth if your requirements extend beyond retrieval.
If retrieval is one step in an employee-facing workflow — customer support, sales intelligence, compliance monitoring, onboarding automation — the framework approach means building retrieval and then building everything else. Retrieval is roughly 20% of the total work. The workflow layer is 80%. That is where cost and time accumulate.
If you do not have a dedicated AI engineering team — or your engineers are fully allocated to your core product — frameworks create a dependency on resources you do not have. The business teams that need the AI cannot build it. The engineers who could build it have other priorities.
For the second and third scenarios, Nexus provides RAG as a built-in capability within agents that complete full workflows. 4,000+ native integrations connect agents to the systems those workflows touch. Forward Deployed Engineers handle the integration complexity your team should not have to carry. Business teams own the result.
Frequently asked questions
What is RAG (Retrieval-Augmented Generation) and why do enterprises use it?
RAG is the practice of retrieving relevant documents from your data sources and passing them as context to an LLM before it generates a response. This grounds the LLM's output in your actual business data, reducing hallucinations. Enterprises use it to build knowledge assistants, internal Q&A systems, and document search tools without fine-tuning models. It is the most widely deployed pattern in enterprise AI because it works with existing content and does not require retraining.
What is the difference between a basic RAG demo and a production enterprise RAG pipeline?
A demo works in an afternoon on clean, limited sample data. Production requires: document ingestion handling PDFs, Office documents, and web content at scale with metadata extraction; real-time or near-real-time indexing as content changes; retrieval latency under 200ms for interactive use; hybrid search combining vector similarity and keyword matching; document-level access control enforced at retrieval time; evaluation pipelines measuring answer quality continuously; and audit trails for compliance. Each of these is its own engineering workstream.
Which vector databases are most used in enterprise RAG pipelines?
The main options: Pinecone (managed, straightforward operations), Weaviate (open-source with built-in vectorization), Qdrant (open-source, high-performance Rust implementation with benchmark-leading latency), pgvector (PostgreSQL extension for teams with existing Postgres infrastructure), and Elasticsearch (for teams already running it, with native hybrid search). Independent benchmarks show Qdrant leading on raw query latency at scale. Below 50 million vectors, all major databases deliver sub-100ms query times adequate for production RAG. Database choice ultimately depends on operational model (managed vs. self-hosted), existing infrastructure, filtering requirements, and compliance constraints.
When should I use RAG versus fine-tuning an LLM?
RAG is the right choice when your information changes frequently (new policies, updated documentation), when you need citations so users can verify answers, or when data privacy requires keeping content out of model weights. That covers most enterprise knowledge use cases. Fine-tuning is the right choice when you need to teach a consistent response style or domain-specific reasoning pattern, when you want to reduce per-query latency by baking knowledge into weights, or when your retrieved context would consistently exceed the context window. Most enterprises should start with RAG: it is faster to implement, easier to update, and more controllable.
What frameworks help build enterprise RAG pipelines?
LangChain provides RAG chain abstractions and broad tool integrations. LlamaIndex specializes in RAG data infrastructure — document loaders, index types, and query engines. Haystack (deepset) focuses on production RAG pipelines with composable pipeline architecture and strong enterprise features. For teams that want RAG as part of a complete workflow automation platform — rather than as a standalone retrieval system — Nexus includes both real-time and stored RAG within agents that complete end-to-end business processes.
Worth exploring?
If your team is building enterprise RAG and beginning to see the gap between "AI that finds information" and "AI that completes the work," it is worth a conversation about what the full workflow looks like.
Every Nexus engagement starts with a 3-month proof of concept tied to measurable business outcomes. Forward Deployed Engineers embed with your team from day one. You see the results before committing. You can exit anytime.
100% of clients who started a POC converted to an annual contract.
See the full Nexus vs Haystack comparison.