RAG in Production: What Actually Works
Every RAG tutorial follows the same script: split documents, embed chunks, store in a vector database, retrieve top-k, send to LLM. Done.
It works in demos. It falls apart in production.
I built BidScribe — an AI-powered tool that generates RFP responses from a company's knowledge base. It's a RAG system that real businesses depend on for real proposals. Here's what I learned building it.
Chunking: Where Most RAG Systems Fail
The default advice is "split on 500-1000 tokens with some overlap." This is fine for blog posts. It's terrible for structured business documents.
The Problem
RFP responses aren't articles. They're structured answers — often with headers, sub-questions, tables, and bullet points that form a logical unit. Naive chunking splits these apart, and your retrieval returns fragments that lack context.
What Actually Works
Semantic chunking over fixed-size. I chunk by logical units — a complete answer, a section, a coherent paragraph group. The chunks vary in size from 200 to 2000 tokens. That's fine. Uniform chunk size is a false goal.
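Roughly, that looks like this. (A simplified sketch: it assumes documents are already parsed into sections that carry their header path, and the names and size ceiling are illustrative, not BidScribe's actual pipeline.)

interface ParsedSection {
  headerPath: string[];   // e.g. ["Security", "Data Protection"]
  question?: string;      // present for Q&A-style source content
  body: string;
}

interface LogicalChunk {
  headerPath: string[];
  question?: string;
  content: string;
}

// Rough ceiling in characters (~2000 tokens). A ceiling, not a target size.
const MAX_CHARS = 8000;

function chunkByLogicalUnits(sections: ParsedSection[]): LogicalChunk[] {
  const chunks: LogicalChunk[] = [];
  for (const s of sections) {
    if (s.body.length <= MAX_CHARS) {
      // A complete answer or section stays together, whatever its size.
      chunks.push({ headerPath: s.headerPath, question: s.question, content: s.body });
      continue;
    }
    // Oversized sections are split on paragraph boundaries, never mid-paragraph.
    let buffer = '';
    for (const para of s.body.split(/\n\s*\n/)) {
      if (buffer && buffer.length + para.length > MAX_CHARS) {
        chunks.push({ headerPath: s.headerPath, question: s.question, content: buffer.trim() });
        buffer = '';
      }
      buffer += para + '\n\n';
    }
    if (buffer.trim()) {
      chunks.push({ headerPath: s.headerPath, question: s.question, content: buffer.trim() });
    }
  }
  return chunks;
}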
Preserve metadata aggressively. Every chunk carries:
- Source document title
- Section header hierarchy
- Original question (if it was a Q&A pair)
- Date and version
- Tags/categories
This metadata isn't just for filtering — it's injected into the prompt alongside the chunk content. An answer about "data security practices" means very different things depending on whether it came from a healthcare RFP or a financial services RFP.
Overlapping context windows. Instead of overlapping tokens between chunks, I prepend parent context. Each chunk knows its place in the document hierarchy. Think of it as breadcrumbs: Document Title > Section > Subsection > This Chunk.
interface Chunk {
  id: string;
  content: string;
  embedding: number[];
  metadata: {
    sourceDocument: string;
    sectionPath: string[]; // ["Security", "Data Protection", "Encryption"]
    originalQuestion?: string;
    documentDate: string;
    tags: string[];
  };
}
Embeddings: The Model Matters Less Than You Think
I started with OpenAI's text-embedding-3-large. Tried text-embedding-3-small. Tested Cohere's embed v3. Ran benchmarks on my actual data.
The results? Within 5-8% of each other for my use case.
What Actually Matters
Your preprocessing pipeline matters 10x more than your embedding model. Clean, well-structured text with good metadata produces better embeddings than messy text through a "better" model.
What I do before embedding:
- Strip formatting artifacts (HTML tags, markdown remnants, weird Unicode)
- Normalize whitespace and structure
- Prepend the section path as natural language: "In the context of Security > Data Protection > Encryption:"
- Include the original question when available
That last one is huge. If someone asks "How do you handle data encryption at rest?" and your knowledge base has the answer filed under "Storage Security Protocols," the question-enriched embedding bridges that vocabulary gap.
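To make that concrete, here is a simplified sketch of how the embedded text can be assembled, reusing the Chunk shape above. The helper name is illustrative:

function buildEmbeddingInput(chunk: Chunk): string {
  const parts: string[] = [];
  // Section path as natural language, so the hierarchy is part of the vector.
  if (chunk.metadata.sectionPath.length > 0) {
    parts.push(`In the context of ${chunk.metadata.sectionPath.join(' > ')}:`);
  }
  // The original question, when available, bridges vocabulary gaps.
  if (chunk.metadata.originalQuestion) {
    parts.push(`Question: ${chunk.metadata.originalQuestion}`);
  }
  // The cleaned chunk body: formatting artifacts stripped, whitespace normalized.
  parts.push(chunk.content.replace(/\s+/g, ' ').trim());
  return parts.join('\n');
}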
Dimensionality
I use 1536 dimensions. I tried reducing to 512 with text-embedding-3-small. Retrieval quality dropped noticeably for nuanced queries. For straightforward keyword-like queries, it was fine. I kept 1536 because my data isn't big enough for the cost savings to matter, and the tail queries are where RAG either shines or fails.
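For reference, the text-embedding-3 models accept a dimensions parameter, so the comparison is a one-line change. A sketch with the OpenAI Node SDK:

import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const { data } = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'In the context of Security > Data Protection > Encryption: ...',
  dimensions: 512, // omit to get the model's default 1536
});

const embedding: number[] = data[0].embedding;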
Retrieval: Hybrid or Go Home
Pure vector similarity search isn't enough. Here's why.
The Keyword Problem
User asks: "What is your SOC 2 compliance status?"
Vector search returns chunks about "security certifications," "audit processes," and "compliance frameworks." All semantically related. None specifically mention SOC 2.
Meanwhile, there's a chunk that says "We achieved SOC 2 Type II certification in March 2024" — but it's embedded in a broader section about company milestones, so it's not the top vector match.
Hybrid Search
BidScribe uses hybrid retrieval:
-- Simplified version of what actually runs
SELECT
  chunks.id,
  chunks.content,
  chunks.metadata,
  -- Vector similarity score
  1 - (chunks.embedding <=> query_embedding) AS vector_score,
  -- Full-text search score
  ts_rank(chunks.fts, plainto_tsquery('english', query_text)) AS text_score
FROM chunks
WHERE
  -- Pre-filter by metadata when possible
  chunks.workspace_id = $1
ORDER BY
  (0.7 * (1 - (chunks.embedding <=> query_embedding))) +
  (0.3 * ts_rank(chunks.fts, plainto_tsquery('english', query_text)))
  DESC
LIMIT 20;
The 0.7/0.3 weighting came from testing against real queries. Vector search handles the "what is this about" question well. Full-text search catches the specific terms and acronyms that matter.
Supabase makes this surprisingly easy — pgvector for embeddings and PostgreSQL's built-in full-text search in the same query. No separate search infrastructure needed.
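On the application side, a query like this typically lives in a Postgres function and gets called through supabase-js. A sketch, where hybrid_search is a stand-in name for that function:

import { createClient } from '@supabase/supabase-js';

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

async function retrieveCandidates(workspaceId: string, query: string, queryEmbedding: number[]) {
  // 'hybrid_search' is a stand-in name for a Postgres function wrapping the SQL above.
  const { data, error } = await supabase.rpc('hybrid_search', {
    workspace_id: workspaceId,
    query_text: query,
    query_embedding: queryEmbedding, // serialized and cast to vector on the Postgres side
    match_limit: 20,
  });
  if (error) throw error;
  return data;
}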
Re-ranking
Top-20 from hybrid search go through a re-ranking step. I use the LLM itself for this — send it the query and candidate chunks, ask it to rank by relevance. Expensive? Yes. Worth it? Absolutely.
The re-ranker catches things the embedding model misses. It understands that a chunk about "annual penetration testing results" is highly relevant to a question about "security assessment procedures" even when the vector similarity is mediocre.
async function rerankChunks(query: string, chunks: Chunk[]): Promise<Chunk[]> {
  const response = await llm.chat({
    messages: [{
      role: 'system',
      content: 'Rank these text chunks by relevance to the query. Return chunk IDs in order.'
    }, {
      role: 'user',
      content: `Query: ${query}\n\nChunks:\n${chunks.map(c =>
        `[${c.id}]: ${c.content.slice(0, 500)}`
      ).join('\n\n')}`
    }],
    model: 'gpt-4o-mini', // Fast and cheap enough for re-ranking
  });

  // Parse ranked IDs and reorder
  return parseRankedIds(response).map(id => chunks.find(c => c.id === id)!);
}
The Stuff Nobody Talks About
Staleness
Knowledge bases go stale. Last year's security policy isn't this year's. BidScribe handles this with:
- Date-aware retrieval. Recent chunks get a slight boost (sketched below).
- Version tracking. When a document is re-uploaded, old chunks are soft-deleted, not removed. This prevents broken references and lets you audit changes.
- Confidence signals. If the newest relevant chunk is over 12 months old, the UI flags it: "This answer may be outdated."
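A simplified sketch of the boost and the staleness check (the decay rate and cutoffs here are illustrative, not BidScribe's exact values):

// Illustrative decay: up to ~5% boost for new content, fading to zero over two years.
const MS_PER_MONTH = 1000 * 60 * 60 * 24 * 30;

function ageInMonths(documentDate: string, now = new Date()): number {
  return (now.getTime() - new Date(documentDate).getTime()) / MS_PER_MONTH;
}

function applyRecencyBoost(score: number, documentDate: string): number {
  const boost = 0.05 * Math.max(0, 1 - ageInMonths(documentDate) / 24);
  return score * (1 + boost);
}

function isPossiblyOutdated(documentDate: string): boolean {
  return ageInMonths(documentDate) > 12; // drives the "may be outdated" flag
}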
When RAG Says "I Don't Know"
This is harder than it sounds. The default LLM behavior is to hallucinate something plausible. The worst failure mode isn't a wrong answer — it's a confident wrong answer in a proposal your client will read.
My approach:
- Retrieval threshold. If the best chunk scores below 0.6 similarity, flag it (see the sketch below).
- Explicit instruction. The system prompt says: "If the retrieved context doesn't contain enough information to answer confidently, say so. Never fabricate details."
- Source attribution. Every generated answer links back to its source chunks. Users can verify.
- Draft mode. BidScribe never auto-submits. Everything is a draft that a human reviews and edits.
That last point is philosophical as much as technical. RAG systems should augment human judgment, not replace it. Especially when the output goes to a client.
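Putting the first three points together, the prompt assembly looks roughly like this. A sketch: the score scale, cutoff handling, and exact wording are illustrative:

const MIN_SIMILARITY = 0.6; // below this, the answer gets a low-confidence flag

function buildGenerationPrompt(query: string, chunks: Chunk[], bestScore: number): string {
  // Source attribution travels with each chunk so the answer can cite it.
  const context = chunks
    .map(c => `[Source: ${c.metadata.sourceDocument} > ${c.metadata.sectionPath.join(' > ')}]\n${c.content}`)
    .join('\n\n---\n\n');

  return [
    "If the retrieved context doesn't contain enough information to answer confidently, say so. Never fabricate details.",
    bestScore < MIN_SIMILARITY
      ? 'Note: the retrieved context is only weakly related to this question.'
      : '',
    `Context:\n${context}`,
    `Question: ${query}`,
  ].filter(Boolean).join('\n\n');
}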
Multi-tenancy
Each BidScribe workspace has its own knowledge base. Row Level Security in Supabase handles isolation — one workspace can never retrieve another's chunks. This is non-negotiable and it's why I chose Supabase over rolling my own vector store.
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;

CREATE POLICY "Users can only access their workspace chunks"
  ON chunks FOR SELECT
  USING (workspace_id IN (
    SELECT workspace_id FROM workspace_members
    WHERE user_id = auth.uid()
  ));
Cost Management
RAG at scale gets expensive. Three things that keep costs sane:
- Cache aggressively. Same question with same knowledge base? Return the cached answer. I cache at the query-embedding level with a similarity threshold — "close enough" queries hit the cache (sketched after this list).
- Tiered models. Re-ranking uses gpt-4o-mini. Final generation uses gpt-4o. Don't use your most expensive model for every step.
- Smart context windows. Don't stuff 20 chunks into the prompt. After re-ranking, I typically send 3-5 chunks. More context ≠ better answers. It usually means more noise.
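Here is roughly what that embedding-level cache looks like. A sketch with an in-memory store and an illustrative 0.95 threshold:

interface CachedAnswer {
  workspaceId: string;      // cache hits only count within the same knowledge base
  queryEmbedding: number[];
  answer: string;
}

const CACHE_SIMILARITY = 0.95; // "close enough" to reuse the answer

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function findCachedAnswer(
  workspaceId: string,
  queryEmbedding: number[],
  cache: CachedAnswer[]
): string | null {
  for (const entry of cache) {
    if (entry.workspaceId !== workspaceId) continue;
    if (cosineSimilarity(queryEmbedding, entry.queryEmbedding) >= CACHE_SIMILARITY) {
      return entry.answer;
    }
  }
  return null;
}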
What I'd Do Differently
Start with evaluation earlier. I built the system, then figured out how to measure quality. Should've been the other way around. Now I have a test set of 200+ query-answer pairs that I run against every change.
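A bare-bones version of that loop, with the scoring function left open (keyword overlap, an LLM judge, or a human rubric):

interface EvalCase {
  query: string;
  expectedAnswer: string;
}

// scoreAnswer is whatever judgment you trust; it should return a value between 0 and 1.
async function runEvalSet(
  cases: EvalCase[],
  generateAnswer: (query: string) => Promise<string>,
  scoreAnswer: (expected: string, actual: string) => Promise<number>
): Promise<number> {
  let total = 0;
  for (const c of cases) {
    const actual = await generateAnswer(c.query);
    total += await scoreAnswer(c.expectedAnswer, actual);
  }
  return total / cases.length; // average score, tracked across every change
}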
Invest in the ingestion pipeline. I underestimated how much time I'd spend on document parsing. PDFs are a nightmare. Tables are worse. Build robust parsing from day one.
Don't over-engineer the vector store. I considered Pinecone, Weaviate, Qdrant. Supabase with pgvector handles my scale (tens of thousands of chunks) without breaking a sweat. Start simple. Migrate when you actually need to.
The Bottom Line
RAG in production is 20% retrieval algorithm and 80% everything else — data quality, chunking strategy, metadata, evaluation, caching, and knowing when the system should shut up instead of guessing.
The LLM is the easy part. The hard part is everything that happens before the prompt.