I Built a Production-Grade RAG Framework With Hybrid Search, Contextual Chunking, and Zero Cloud Lock-In
How ragpipe combines BM25 + dense vectors, Anthropic's contextual retrieval, HyDE query expansion, and Ollama-first local inference into a modular Python framework
Every RAG tutorial on the internet follows the same pattern: take a document, split it into 512-token chunks, embed with OpenAI, dump into a vector database, retrieve the top-5 by cosine similarity, and feed them into GPT-4. Ship it. Call it "production-ready."
I've built RAG systems that handle 21,000+ incident records, 1,000+ AUTOSAR XML configurations, and enterprise document repositories with hundreds of PDFs. And I can tell you: that tutorial pattern fails in production in predictable, frustrating ways.
So I built ragpipe — a modular Python framework that implements the advanced retrieval patterns that actually work. Hybrid search with BM25 + dense vectors. Anthropic's contextual chunking. HyDE query expansion. Cross-encoder reranking. And the entire pipeline can run locally with Ollama for $0/month.
GitHub: github.com/shivamongit/ragpipe
Why Naive RAG Fails
Before I explain what ragpipe does, let me explain the problem it solves. Naive RAG has four predictable failure modes:
1. Chunking Destroys Context
A document says: "Revenue for Q3 was $4.2 billion, up 12% year-over-year." Your chunker splits this into two pieces. The embedding for "$4.2 billion" has no idea it's about Q3 revenue. When a user asks "What was Q3 revenue?", the retriever finds the wrong chunk because the number and the quarter ended up in different pieces.
2. Embedding Similarity ≠ Relevance
"How do I reset my password?" and "Password reset policy" have high cosine similarity. But one is a user question and the other is an admin policy document. Dense embeddings capture meaning, but they miss intent. And they completely miss exact keywords — searching for "error code E4021" returns nothing if the embedding doesn't capture that specific token.
3. Top-K Is Crude
Retrieving the 5 most similar chunks isn't the same as retrieving the 5 most useful chunks. Chunk #6 might be the one that actually answers the question, while chunks #1-5 are all from the same paragraph saying slightly different things about a tangentially related topic.
4. Raw Queries Are Bad Search Queries
Users type questions: "Why is our API slow?" The documents contain: "Connection pool exhaustion under high concurrency causes increased latency in the /api/v2 endpoints." These have low embedding similarity despite being a perfect match. The question's vocabulary doesn't match the document's vocabulary.
ragpipe addresses each of these with a dedicated module.
The Architecture
ragpipe's pipeline has 7 pluggable stages:
Documents → Loader → Chunker → Embedder → Retriever → Reranker → Generator → Answer
                                              ↑
                                       Query Expansion
                              (HyDE / Multi-Query / Step-Back)
Every stage is an abstract base class. You can swap any component without touching the rest. The framework ships with production-ready implementations for each stage.
Hybrid Search: Dense + Sparse + RRF
This is the single biggest improvement over naive RAG. Instead of using only dense vector search, ragpipe combines two retrieval strategies:
Dense retrieval (FAISS or NumPy) captures semantic similarity. "automobile" matches "car."
Sparse retrieval (BM25) captures exact keywords. "error code E4021" matches "E4021."
The HybridRetriever fuses both using Reciprocal Rank Fusion (RRF):
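RRF_score(chunk) = Σ_r weight_r / (k + rank_r(chunk))
Here the sum runs over the retrievers, rank_r(chunk) is the chunk's position in retriever r's result list, and k is a smoothing constant. A chunk ranked #1 in dense and #3 in BM25 scores higher than one ranked #2 in both. The beauty of RRF is that it works on ranks, not raw scores, so you don't need to normalize BM25's unbounded scores against cosine similarity's [-1, 1] range.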
from ragpipe.retrievers import NumpyRetriever, BM25Retriever, HybridRetriever

retriever = HybridRetriever(
    dense_retriever=NumpyRetriever(),
    sparse_retriever=BM25Retriever(),
    dense_weight=0.6,   # favor semantic
    sparse_weight=0.4,  # but still catch keywords
)
In my testing across three production datasets, hybrid search improved recall@10 by 15-25% compared to dense-only retrieval. The biggest gains were on queries with specific identifiers, error codes, and proper nouns — exactly the queries where dense search fails silently.
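To make the fusion step concrete, here is a minimal standalone sketch of weighted RRF over two ranked lists of chunk ids. It is not ragpipe's internal code: the function name rrf_fuse is hypothetical, and k=60 is an assumed default (the value commonly used for RRF), since the constant ragpipe uses isn't stated above.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, weights, k=60):
    """Fuse ranked lists of chunk ids with weighted Reciprocal Rank Fusion.

    ranked_lists: one list of chunk ids per retriever, best result first.
    weights: one weight per retriever (e.g. 0.6 for dense, 0.4 for sparse).
    k: smoothing constant; 60 is an assumed default, not taken from ragpipe.
    """
    scores = defaultdict(float)
    for ids, weight in zip(ranked_lists, weights):
        for rank, chunk_id in enumerate(ids, start=1):
            scores[chunk_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by id to keep output stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_hits = ["a", "b", "x"]   # "a" is #1 in dense, "b" is #2
sparse_hits = ["y", "b", "a"]  # "a" is #3 in BM25, "b" is #2
fused = rrf_fuse([dense_hits, sparse_hits], weights=[0.6, 0.4])
for chunk_id, score in fused:
    print(f"{chunk_id}: {score:.5f}")
```

In this toy run, "a" (ranked #1 and #3) edges out "b" (ranked #2 in both), and chunks seen by only one retriever land at the bottom.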
The BM25 Implementation
I implemented BM25 from scratch using pure Python — no external dependencies. The Okapi BM25 scoring function:
score(q, d) = Σ IDF(qi) · (tf(qi, d) · (k1 + 1)) / (tf(qi, d) + k1 · (1 - b + b · |d|/avgdl))
Where k1 controls term frequency saturation (default 1.5) and b controls length normalization (default 0.75). The implementation uses a simple whitespace tokenizer with lowercasing — good enough for English text, and you can extend it for other languages.
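As an illustration of the scoring function above, here is a small self-contained BM25 sketch in pure Python. It is a simplified stand-in, not ragpipe's BM25Retriever: the class name TinyBM25 is hypothetical, and the IDF variant (log(1 + (N - df + 0.5) / (df + 0.5)), which stays non-negative) is an assumption because the formula above doesn't spell out which IDF is used.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenizer with lowercasing, as described above.
    return text.lower().split()


class TinyBM25:
    """Minimal Okapi BM25 over an in-memory, non-empty list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_tfs = [Counter(tokens) for tokens in self.doc_tokens]
        self.avgdl = sum(len(t) for t in self.doc_tokens) / len(docs)
        # Document frequency: number of documents containing each term.
        df = Counter()
        for tf in self.doc_tfs:
            df.update(tf.keys())
        n = len(docs)
        # Assumed IDF variant; always >= 0, even for very common terms.
        self.idf = {
            term: math.log(1 + (n - freq + 0.5) / (freq + 0.5))
            for term, freq in df.items()
        }

    def score(self, query, index):
        tf = self.doc_tfs[index]
        doc_len = len(self.doc_tokens[index])
        norm = self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)
        total = 0.0
        for term in tokenize(query):
            freq = tf.get(term, 0)
            if freq:
                total += self.idf[term] * freq * (self.k1 + 1) / (freq + norm)
        return total

    def search(self, query, top_k=3):
        scored = [(i, self.score(query, i)) for i in range(len(self.doc_tokens))]
        matches = [(i, s) for i, s in scored if s > 0]
        return sorted(matches, key=lambda pair: -pair[1])[:top_k]


docs = [
    "Request failed with error code E4021 during login",
    "Password reset policy for administrators",
    "Connection pool exhaustion causes latency",
]
index = TinyBM25(docs)
print(index.search("error code E4021"))
```

Only the first document contains any of the query tokens, so it is the only hit; that is the exact-match behavior dense retrieval can miss.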
Contextual Chunking: Anthropic's Secret Weapon
This was the most impactful research finding I incorporated. Anthropic published a technique called contextual retrieval where you prepend each chunk with LLM-generated context about where it fits in the document.
Before:
Revenue was $4.2 billion, up 12% year-over-year.
After:
This section covers Q3 2025 financial results from Acme Corp's annual report, specifically discussing top-line revenue growth. Revenue was $4.2 billion, up 12% year-over-year.
The context prefix makes the chunk dramatically easier to retrieve. Anthropic reported 35% fewer top-20 retrieval failures with contextual embeddings alone, and 49% fewer when they are combined with contextual BM25.
ragpipe's ContextualChunker wraps any base chunker and calls an LLM to generate context:
from ragpipe.chunkers import ContextualChunker, RecursiveChunker

def my_llm(prompt: str) -> str:
    # Call your LLM of choice
    ...

chunker = ContextualChunker(
    base_chunker=RecursiveChunker(chunk_size=512),
    context_generator=my_llm,
    doc_preview_chars=3000,
)
The LLM sees the first 3000 characters of the document plus the specific chunk, and generates a 2-3 sentence context. Yes, this costs more at ingestion time. But you ingest once and query thousands of times — the accuracy improvement pays for itself immediately.
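For intuition, the wrapping step can be sketched in a few lines. This is an assumption about how such a wrapper might work, not ragpipe's actual ContextualChunker: the helper names build_context_prompt and contextualize and the prompt wording are hypothetical, and the chunks here are plain strings.

```python
from typing import Callable, List


def build_context_prompt(doc_text: str, chunk_text: str,
                         doc_preview_chars: int = 3000) -> str:
    """Build the prompt: a document preview plus the chunk to situate."""
    preview = doc_text[:doc_preview_chars]
    return (
        "<document>\n" + preview + "\n</document>\n"
        "<chunk>\n" + chunk_text + "\n</chunk>\n"
        "Write 2-3 sentences situating this chunk within the document "
        "to improve search retrieval. Answer with the context only."
    )


def contextualize(doc_text: str, chunks: List[str],
                  context_generator: Callable[[str], str],
                  doc_preview_chars: int = 3000) -> List[str]:
    """Prepend LLM-generated context to each chunk before embedding."""
    enriched = []
    for chunk in chunks:
        prompt = build_context_prompt(doc_text, chunk, doc_preview_chars)
        context = context_generator(prompt).strip()
        # Fall back to the raw chunk if the LLM returns nothing.
        enriched.append(f"{context} {chunk}" if context else chunk)
    return enriched


def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM call so the sketch runs offline.
    return "This section covers Q3 results from the annual report."


print(contextualize("Acme Corp annual report ...",
                    ["Revenue was $4.2 billion."], fake_llm))
```

One LLM call per chunk is what makes ingestion slower; the enriched text is what gets embedded and indexed.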
Query Expansion: HyDE, Multi-Query, Step-Back
Raw user queries are terrible search queries. ragpipe includes three expansion strategies:
HyDE (Hypothetical Document Embeddings)
Instead of searching for the question, generate a hypothetical answer and search for documents similar to that answer.
from ragpipe.query import HyDEExpander
expander = HyDEExpander(generate_fn=my_llm)
queries = expander.expand("Why is our API slow?")
# → ["Why is our API slow?",
# "API slowness is commonly caused by connection pool exhaustion,
# N+1 database queries, missing indexes, or network latency..."]
The hypothetical answer uses document-language vocabulary ("connection pool exhaustion", "N+1 queries"), which matches documents much better than the question's vocabulary.
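The idea fits in one small function. The sketch below is a hypothetical standalone version, not ragpipe's HyDEExpander; the function name hyde_expand and the prompt wording are assumptions.

```python
from typing import Callable, List


def hyde_expand(query: str, generate_fn: Callable[[str], str]) -> List[str]:
    """Return the original query plus one hypothetical answer passage."""
    prompt = (
        "Write a short passage that could appear in technical documentation "
        f"and that answers the question below.\n\nQuestion: {query}\nPassage:"
    )
    hypothetical = generate_fn(prompt).strip()
    # Keep the original query first so the user's exact terms still reach
    # the retriever; skip the passage if the LLM returned nothing.
    return [query, hypothetical] if hypothetical else [query]


def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM call so the sketch runs offline.
    return "Connection pool exhaustion under load increases API latency."


print(hyde_expand("Why is our API slow?", fake_llm))
```

The passage does not need to be factually right. It only needs to sound like the documents, because it is embedded and used as a search key, never shown to the user.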
Multi-Query Expansion
Generate N diverse reformulations:
from ragpipe.query import MultiQueryExpander
expander = MultiQueryExpander(generate_fn=my_llm, n_queries=3)
queries = expander.expand("Why is our API slow?")
# → ["Why is our API slow?",
# "API performance debugging techniques",
# "Common causes of backend latency",
# "Slow response time troubleshooting"]
Each query is searched independently, and results are deduplicated. This covers terminology gaps.
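The search-and-deduplicate step can be sketched as follows. This is a hypothetical helper, not ragpipe's code: it assumes a search_fn that returns (chunk_id, score) pairs and keeps the best score per chunk, which is only meaningful when all queries go through the same retriever so the scores are comparable.

```python
from typing import Callable, Dict, List, Tuple


def search_all(queries: List[str],
               search_fn: Callable[[str], List[Tuple[str, float]]],
               top_k: int = 5) -> List[Tuple[str, float]]:
    """Search each query independently and keep the best score per chunk."""
    best: Dict[str, float] = {}
    for query in queries:
        for chunk_id, score in search_fn(query):
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score
    # Highest score first; ties are broken by id to keep output stable.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Toy stand-in for a retriever: canned results per query.
canned = {
    "Why is our API slow?": [("c1", 0.62), ("c2", 0.55)],
    "Common causes of backend latency": [("c2", 0.81), ("c3", 0.70)],
}
print(search_all(list(canned), lambda q: canned[q]))
```

Chunk c2 shows up for both queries but appears once in the output, with its higher score.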
Step-Back Prompting
For very specific questions, retrieve broader background first:
from ragpipe.query import StepBackExpander
expander = StepBackExpander(generate_fn=my_llm)
queries = expander.expand("Why does IndexFlatIP need L2 normalization?")
# → ["Why does IndexFlatIP need L2 normalization?",
# "How do FAISS index types handle distance metrics?"]
The broader query retrieves context that helps answer the specific question.
Ollama-First: Run the Entire Pipeline for Free
Every component in ragpipe that requires ML inference has an Ollama-powered variant:
from ragpipe.embedders import OllamaEmbedder
from ragpipe.generators import OllamaGenerator
embedder = OllamaEmbedder(model="nomic-embed-text") # 768d, local, free
generator = OllamaGenerator(model="gemma4") # local, free
The OllamaEmbedder calls Ollama's /api/embed endpoint. The OllamaGenerator calls /api/chat. Both use pure Python urllib — no extra dependencies.
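A dependency-free call to the embedding endpoint might look like the sketch below. It is not ragpipe's OllamaEmbedder: the function names are hypothetical, and it assumes Ollama is listening on its default local address, http://localhost:11434.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed: Ollama's default local address


def build_embed_request(texts, model="nomic-embed-text", base_url=OLLAMA_URL):
    """Build the POST request for Ollama's /api/embed endpoint."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, model="nomic-embed-text", base_url=OLLAMA_URL, timeout=60):
    """Send the request and return one embedding vector per input text."""
    request = build_embed_request(texts, model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["embeddings"]


# Building the request needs no running server; calling embed() does.
request = build_embed_request(["What is FAISS?"])
print(request.get_method(), request.full_url)
```

Calling embed(["What is FAISS?"]) requires a running Ollama server with the model already pulled.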
This means you can run the entire ragpipe pipeline — chunking, embedding, retrieval, reranking, generation — on a laptop with zero API costs and zero data leaving your machine. For enterprise environments with data sensitivity requirements, this is a game changer.
9 Evaluation Metrics Built In
You can't improve what you don't measure. ragpipe includes 9 evaluation metrics with no external eval framework needed:
Retrieval metrics:
- hit_rate: did we find at least one relevant chunk?
- mrr: how early is the first relevant chunk?
- precision_at_k: what fraction of top-K are relevant?
- recall_at_k: what fraction of relevant docs are in top-K?
- ndcg_at_k: rank-weighted quality (relevant docs ranked higher score more)
- map_at_k: average precision across all relevant positions
- context_precision: RAGAS-style weighted precision
Generation metrics:
- rouge_l: longest common subsequence overlap with reference answer
- faithfulness_score: n-gram overlap between answer and source chunks
from ragpipe.evaluation import hit_rate, mrr, ndcg_at_k, map_at_k, rouge_l
results = pipe.retrieve("What is FAISS?", top_k=10)
relevant = {"doc_id_1", "doc_id_2"}
print(f"Hit Rate: {hit_rate(results, relevant)}")
print(f"NDCG@5: {ndcg_at_k(results, relevant, k=5):.3f}")
print(f"MAP@5: {map_at_k(results, relevant, k=5):.3f}")
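To show what a few of these metrics compute, here are standalone reference versions that work on plain lists of document ids. They are textbook definitions with binary relevance, not ragpipe's implementations, whose functions take retrieval results as in the snippet above; the simple_ prefix marks the names as hypothetical.

```python
import math


def simple_hit_rate(ids, relevant):
    """1.0 if any retrieved id is relevant, else 0.0."""
    return 1.0 if any(i in relevant for i in ids) else 0.0


def simple_mrr(ids, relevant):
    """Reciprocal rank of the first relevant id (0.0 if none is found)."""
    for rank, doc_id in enumerate(ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def simple_recall_at_k(ids, relevant, k):
    """Fraction of the relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ids[:k]) & set(relevant)) / len(relevant)


def simple_ndcg_at_k(ids, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal if ideal else 0.0


retrieved = ["d3", "d1", "d4", "d2"]
relevant = {"d1", "d2"}
print(simple_hit_rate(retrieved, relevant))
print(simple_mrr(retrieved, relevant))
print(simple_recall_at_k(retrieved, relevant, k=3))
print(round(simple_ndcg_at_k(retrieved, relevant, k=4), 3))
```

In this example the first relevant document sits at rank 2, so MRR is 0.5, and only one of the two relevant documents is in the top 3, so recall@3 is 0.5.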
Putting It All Together
Here's a complete production pipeline in about 20 lines:
from ragpipe import Document, Pipeline
from ragpipe.chunkers import ContextualChunker, RecursiveChunker
from ragpipe.embedders import OllamaEmbedder
from ragpipe.retrievers import NumpyRetriever, BM25Retriever, HybridRetriever
from ragpipe.generators import OllamaGenerator

embedder = OllamaEmbedder(model="nomic-embed-text")

pipe = Pipeline(
    chunker=ContextualChunker(
        base_chunker=RecursiveChunker(chunk_size=512, overlap=64),
        context_generator=lambda p: OllamaGenerator(model="gemma4").generate(p, []).answer,
    ),
    embedder=embedder,
    retriever=HybridRetriever(
        dense_retriever=NumpyRetriever(),
        sparse_retriever=BM25Retriever(),
    ),
    generator=OllamaGenerator(model="gemma4:26b"),
)

pipe.ingest([Document(content="...", metadata={"source": "report.pdf"})])
result = pipe.query("What are the key findings?")
This pipeline uses:
- Contextual chunking for maximum retrieval accuracy
- Hybrid search for both semantic and keyword matching
- Ollama for embeddings and generation (free, local)
- Recursive splitting to preserve document structure
No API keys. No cloud costs. No data leaving your machine.
What's Under the Hood
The full project structure:
ragpipe/
├── ragpipe/
│   ├── core.py          # Document, Chunk, Pipeline orchestrator
│   ├── chunkers/        # Token, Recursive, Semantic, Contextual
│   ├── embedders/       # Ollama, SentenceTransformers, OpenAI
│   ├── retrievers/      # FAISS, NumPy, BM25, Hybrid (RRF)
│   ├── rerankers/       # CrossEncoder
│   ├── generators/      # Ollama, OpenAI
│   ├── query/           # HyDE, MultiQuery, StepBack expansion
│   ├── evaluation/      # 9 metrics
│   └── loaders/         # PDF, DOCX, TXT, Directory
├── tests/               # 54 tests, all passing
├── ARCHITECTURE.md      # Detailed system documentation
└── pyproject.toml       # MIT license, Python 3.10+
Everything is typed, tested, and documented. 54 tests covering all components run in under 0.3 seconds.
Design Philosophy
A few principles guided the design:
1. No framework lock-in. Every component is a base class. Extend BaseEmbedder to add Cohere. Extend BaseRetriever to add Pinecone. The framework doesn't care what's behind the interface.
2. Optional dependencies. Core ragpipe needs only numpy and tiktoken (~5 MB). FAISS, OpenAI, sentence-transformers are opt-in via extras. If you're running Ollama locally, the core install is all you need.
3. Ollama-first. I designed the local providers first, not as afterthoughts. OllamaEmbedder and OllamaGenerator use pure urllib — zero extra dependencies.
4. Hybrid by default. I recommend HybridRetriever for every new project. Dense-only retrieval silently fails on exact keywords. BM25-only misses semantics. The combination catches both.
5. Evaluation is not optional. You ship what you measure. 9 built-in metrics mean you can evaluate retrieval quality without installing RAGAS or another eval framework.
What I Learned Building This
BM25 Still Matters
The AI community has a bias toward dense retrieval. "Embeddings understand meaning!" Yes, but BM25 understands tokens. When a user searches for a specific error code, product SKU, or person's name, BM25 finds it instantly. Dense search might not.
In my AUTOSAR configuration search system (1,000+ XML parameters), adding BM25 alongside FAISS improved recall on parameter-name queries from 72% to 94%.
Contextual Chunking Is Worth the Cost
I was skeptical about calling an LLM for every single chunk at ingestion time. Doesn't that defeat the purpose of local inference?
For a 100-page PDF that produces 200 chunks, generating context with Gemma 4 takes about 3 minutes on an M3 MacBook. You do it once. Then every query for the lifetime of that index benefits from dramatically better retrieval. The amortized cost is essentially zero.
Query Expansion Is Underused
HyDE is the single most underrated technique in RAG. The core insight is simple: a hypothetical answer uses the same vocabulary as the documents, while a question uses different vocabulary. Searching by hypothetical answer matches better.
In my incident resolution RAG system, HyDE improved hit@5 from 81% to 93% on our evaluation set. That's 12 percentage points from one line of code.
Get Started
pip install ragpipe
# Pull Ollama models (optional, for local inference)
ollama pull nomic-embed-text
ollama pull gemma4
Full documentation in the README and ARCHITECTURE.md.
ragpipe is MIT-licensed and open source. If you're building RAG systems that need to work reliably in production — not just demo well in a notebook — give it a try.
