# NoRag v2

Document question-answering without a vector database: two explicit LLM pipelines over plain Markdown indexes.

NoRag is a document Q&A system with no embedding layer. Where classical RAG transforms documents into opaque mathematical vectors, NoRag keeps everything in plain, readable Markdown and delegates routing to the contextual intelligence of modern LLMs.
## Data layout

On disk, a NoRag corpus is just a folder of plain Markdown files:

```
data/
├── index.md                ← document catalog (doc_id, sections, keywords)
├── index_system_prompt.md  ← agent catalog (agent_id, capabilities, system prompt)
└── documents/
    ├── contract_saas.md    ← full text, split by ## section_id headers
    └── ...
```

## L1 pipeline

L1 is the core NoRag primitive. Every request is resolved in exactly two LLM calls, no matter how complex the question:
```
Call 1 — Router (SLM) → Storage read → Call 2 — Answer (LLM) → L1Result
```
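The Router's entire view of the corpus is index.md. Its exact layout is not prescribed here, so treat the following as a hypothetical sketch: `contract_saas` comes from the data layout above and the fields echo the catalog description (doc_id, sections, keywords), but the specific section and keyword values are invented.

```markdown
<!-- hypothetical index.md entry; section and keyword values are illustrative -->
## contract_saas
- sections: sla_guarantees, termination, pricing
- keywords: SLA, uptime, penalties, SaaS subscription
```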
Example request and response:

```
POST /query
{
  "question": "What SLA is guaranteed in the SaaS contract?",
  "mode": "L1"
}

→

{
  "answer": "The SaaS contract guarantees 99.9% uptime ...",
  "citations": [{"doc_id": "contract_saas", "section": "sla_guarantees"}],
  "agent_id": "juriste_conformite",
  "tokens": {"router": 1820, "answer": 3410}
}
```

## Multi_L pipeline

Multi_L fans out N independent L1 layers, runs them concurrently, then synthesizes the results through an Aggregator LLM. Use it for multi-perspective analysis, question decomposition, or corpus partitioning:
```
Planner (SLM) → N × L1 in parallel → Aggregator (LLM)
```
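A minimal sketch of that fan-out in Python, assuming hypothetical `l1_run` and `aggregate` helpers; the real logic lives in core/l1_engine.py and core/multi_l_engine.py, and the actual signatures may differ:

```python
import asyncio

# Hypothetical sketch of the Multi_L fan-out; names and signatures are
# illustrative, not NoRag's actual API (see core/multi_l_engine.py).

async def l1_run(question: str, agent_id: str) -> dict:
    """One independent L1 layer: Router call, storage read, Answer call."""
    return {"agent_id": agent_id, "answer": f"[{agent_id}] answer to: {question}"}

async def aggregate(question: str, results: list[dict]) -> str:
    """Final Aggregator LLM call that synthesizes the N partial answers."""
    return "\n".join(r["answer"] for r in results)

async def multi_l(question: str, plan: list[dict]) -> str:
    # Fan out: every L1 layer in the plan runs concurrently.
    results = await asyncio.gather(
        *(l1_run(step["question"], step["agent_id"]) for step in plan)
    )
    return await aggregate(question, results)

demo_plan = [  # agent ids are illustrative
    {"agent_id": "juriste_conformite", "question": "Analyze risk across all contracts"},
    {"agent_id": "analyste_financier", "question": "Analyze risk across all contracts"},
]
print(asyncio.run(multi_l("Analyze risk across all contracts", demo_plan)))
```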
Typical fan-out patterns (a hypothetical Planner output is sketched below):

- **Multi-perspective analysis**: same question, different agents. Cross-perspective analysis in one response.
- **Question decomposition**: the question is split into independent sub-questions, each routed separately.
- **Corpus partitioning**: same question, different agents, different document scopes.

In the general case, the Planner freely combines agents, sub-questions, and index scopes.
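For illustration, a question-decomposition plan might come back from the Planner looking like this. The schema and the sub-questions are assumptions, not NoRag's documented format; the real contract lives in core/prompts/planner.md:

```python
# Hypothetical Planner output; schema, agent ids, and sub-questions are illustrative.
plan = {
    "layers": [
        # One question split into two independent sub-questions, routed separately.
        {"agent_id": "juriste_conformite", "question": "Which contracts carry SLA penalties?"},
        {"agent_id": "juriste_conformite", "question": "Which contracts allow early termination?"},
    ]
}
```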
## NoRag vs. classical RAG

| Criterion | RAG (classical) | NoRag |
|---|---|---|
| Routing method | Cosine similarity on vectors | LLM semantic routing on Markdown index |
| Infrastructure | Vector DB (Pinecone, Weaviate, pgvector…) | Plain Markdown files |
| Embeddings | ✗ Required (cost per token) | ✓ None |
| Human readability | ✗ Opaque vectors | ✓ 100% — any text editor |
| Context given to LLM | Arbitrary top-K chunks | Complete named sections |
| Citations | Approximate (chunk offset) | ✓ Exact [doc_id, section] |
| Session memory | Must be implemented separately | ✓ Native via index files |
| Auditability | Debug requires ML expertise | ✓ Read the index, see the routing |
| Scalability | ✓ Billions of documents | Up to ~500 documents |
| Setup complexity | 2–6 weeks | 2–4 hours (or 5 min with no-code mode) |
| Vendor lock-in | High (Pinecone, Weaviate migration is costly) | ✓ None — plain .md files |
## Cost comparison

| Item | NoRag | RAG (classical) |
|---|---|---|
| Infrastructure | $0 local / ~$25/mo Supabase Pro | $70–500/mo (Pinecone, Weaviate) |
| Embedding API | $0 — no embeddings | $0.02–0.13 per 1M tokens |
| Initial dev time | 2–4 hours | 2–6 weeks |
| Required skills | Basic Python (or zero — no-code mode) | ML + DevOps + advanced Python |
| Team onboarding | < 1 day | 1–4 weeks |
Per-query breakdown:

| Pipeline | Tokens | Est. cost |
|---|---|---|
| NoRag — Router call (index ~5K + question) | ~6K input | $0.0009 |
| NoRag — Answer call (sections ~10K + question) | ~12K in + 1K out | $0.0020 |
| NoRag total (Gemini Flash) | ~18K tokens | ~$0.003 |
| RAG — Embedding query | ~50 tokens | $0.000001 |
| RAG — Vector search (Pinecone) | — | ~$0.0001 |
| RAG — Top-5 chunks + answer (GPT-4o) | ~3K in + 500 out | $0.0135 |
| RAG total (GPT-4o + Pinecone) | ~3K LLM tokens | ~$0.014 |
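The NoRag figure is simple arithmetic over the two calls. A back-of-envelope version in Python, where the per-token prices are assumptions in the Gemini Flash price class rather than quoted rates (substitute your provider's current pricing):

```python
# Rough cost model behind the NoRag rows above. PRICE_IN / PRICE_OUT are
# assumed order-of-magnitude rates, not official Gemini Flash pricing.
PRICE_IN = 0.15 / 1_000_000   # assumed $ per input token
PRICE_OUT = 0.60 / 1_000_000  # assumed $ per output token

def call_cost(tokens_in: int, tokens_out: int = 0) -> float:
    """Cost of one LLM call at the assumed rates."""
    return tokens_in * PRICE_IN + tokens_out * PRICE_OUT

router = call_cost(6_000)          # index (~5K tokens) + question
answer = call_cost(12_000, 1_000)  # selected sections (~10K) + question
print(f"L1 query ≈ ${router + answer:.4f}")  # ≈ $0.003, the table's order of magnitude
```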
Projected over 12 months:

| Cost item | NoRag (Gemini Flash) | RAG (GPT-4o + Pinecone) |
|---|---|---|
| Setup | ~$2 | $500–2,000 (dev time) |
| Infrastructure / year | $0–300 | $840–6,000 |
| LLM requests / year | $36 | $168 |
| Maintenance / year | $60 | $500–1,500 |
| Total 12 months | ~$100–400 | ~$2,000–10,000 |
## When to use NoRag

| Use case | Why NoRag |
|---|---|
| Private library (10–500 docs) | Readable index, instant setup, precise citations |
| Internal knowledge base | Full control, auditability, no vendor lock-in |
| Prototype / MVP | No-code mode operational in 5 minutes |
| Critical technical docs | Section-level citations, guaranteed traceability |
| Non-technical team | Markdown index editable in Notion or Excel |
| Limited budget | No Vector DB, no separate embedding API |
| Cross-document questions | LLM intelligently combines multiple documents |
| Compliance / auditability | Every routing decision is explainable |
## When to use classical RAG

| Use case | Why RAG |
|---|---|
| Massive corpus (1,000+ docs) | Vector search scales; LLM index becomes impractical |
| Real-time updates | New documents indexed in seconds |
| Multimodal content | Image/audio/video embeddings (CLIP, Whisper…) |
| Experienced ML team | Stack well understood, monitoring in place |
| Consumer-scale app | Scalability critical — millions of users |
## Quick start

```bash
git clone https://github.com/supergmax/NoRag
cd NoRag
pip install -e ".[dev]"
export GEMINI_API_KEY=your_key
uvicorn api.main:app --reload
```

```bash
# L1 query
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the SLA terms?", "mode": "L1"}'

# Multi_L preset A
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "Analyze risk across all contracts", "mode": "MultiL", "preset": "A"}'
```

Slash commands:

```
/norag What are the SLA terms in the SaaS contract?
/norag multi_l A Analyze our contracts from all angles
/norag list
/norag agents
```
## Project layout

```
NoRag/
├── core/
│   ├── config.py           ← ROUTER_MODEL, ANSWER_MODEL, AGGREGATOR_MODEL
│   ├── storage.py          ← read_index(), read_document_sections()
│   ├── llm_client.py       ← generate(), generate_json()
│   ├── l1_engine.py        ← L1Engine.run()
│   ├── multi_l_engine.py   ← MultiLEngine.run()
│   ├── indexer.py          ← Indexer.ingest()
│   └── prompts/
│       ├── router.md
│       ├── planner.md
│       └── aggregator.md
├── api/
│   ├── main.py             ← FastAPI app, create_app()
│   └── schemas.py          ← Pydantic v2 models
├── data/
│   ├── index.md
│   ├── index_system_prompt.md
│   └── documents/
└── tests/                  ← 28 tests, all green
```
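Given that layout, a programmatic L1 query could look roughly like the sketch below. Only the module path and the `L1Engine.run()` name come from the tree above; the constructor arguments and the shape of the return value are assumptions:

```python
# Hypothetical usage sketch; L1Engine and run() exist per the project
# layout, but this exact signature and return shape are assumed.
from core.l1_engine import L1Engine

engine = L1Engine()  # assumed to read data/index.md via core.storage
result = engine.run("What SLA is guaranteed in the SaaS contract?")
print(result)  # expected to carry answer, citations, agent_id, token counts
```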