
NoRag v2

Technical Blueprint

Document question-answering without a vector database. Two explicit LLM pipelines over plain Markdown indexes.

What is NoRag?

NoRag is a document Q&A system with no embedding layer. Where classical RAG transforms documents into opaque embedding vectors, NoRag keeps everything in plain, readable Markdown and delegates routing to the contextual intelligence of modern LLMs.

Index layout

data/
├── index.md                ← document catalog (doc_id, sections, keywords)
├── index_system_prompt.md  ← agent catalog (agent_id, capabilities, system prompt)
└── documents/
    ├── contract_saas.md    ← full text, split by ## section_id headers
    └── ...
index.md is the routing brain — it lists every document with its sections and keywords, compact enough to fit in a single LLM context window.
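
For illustration, an index.md entry might look like this (a sketch only; beyond contract_saas and sla_guarantees, the section and keyword values are invented):

```markdown
## contract_saas
sections: sla_guarantees, penalties, termination, pricing
keywords: SLA, uptime, penalty, renewal, data protection
```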

The L1 Pipeline — 2 calls, deterministic

L1 is the core NoRag primitive. Every request is resolved in exactly two LLM calls, no matter how complex the question.

Call 1 — Router (SLM)

Reads index.md + index_system_prompt.md + the user question. Returns structured JSON: { agent_id, documents: [{ doc_id, sections }], reasoning }. A small, fast model handles this cheaply.
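
An illustrative Router response for the SLA question used later in this section (the reasoning text is invented):

```json
{
  "agent_id": "juriste_conformite",
  "documents": [
    { "doc_id": "contract_saas", "sections": ["sla_guarantees"] }
  ],
  "reasoning": "Uptime guarantees are a legal/compliance topic covered by the SaaS contract."
}
```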

Storage read

The engine extracts the selected ## section_id blocks from document files. No similarity search — pure regex on known section headers.
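
A minimal sketch of that extraction, assuming sections are delimited by ## section_id headers (the function name and exact pattern are illustrative, not the project's storage.py code):

```python
import re

def extract_sections(markdown: str, wanted: set[str]) -> dict[str, str]:
    """Return the body of each requested '## section_id' block."""
    sections: dict[str, str] = {}
    # A section runs from its '## header' line to the next '## ' header or end of file.
    for m in re.finditer(r"^## (\S+)\n(.*?)(?=^## |\Z)", markdown, re.M | re.S):
        section_id, body = m.group(1), m.group(2).strip()
        if section_id in wanted:
            sections[section_id] = body
    return sections
```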

Call 2 — Answer (LLM)

Receives the agent system prompt + full extracted sections + the question. Responds with mandatory citations [doc_id, section_id]. A capable model handles this.
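
For illustration only, the second call's input could be assembled like this (a hypothetical helper; NoRag's real prompt template is not shown in this document):

```python
def build_answer_input(agent_prompt: str,
                       sections: dict[str, dict[str, str]],
                       question: str) -> str:
    """Concatenate the agent system prompt, every extracted section
    (labeled so the model can cite [doc_id, section_id]), and the question."""
    blocks = [f"[{doc_id}, {section_id}]\n{text}"
              for doc_id, doc in sections.items()
              for section_id, text in doc.items()]
    return agent_prompt + "\n\n" + "\n\n".join(blocks) + f"\n\nQuestion: {question}"
```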

L1Result

{ answer, citations, agent_id, tokens: { router, answer } }. Every citation is traceable to an exact section in an exact document.
POST /query
{
  "question": "What SLA is guaranteed in the SaaS contract?",
  "mode": "L1"
}

→ {
  "answer": "The SaaS contract guarantees 99.9% uptime ...",
  "citations": [{"doc_id": "contract_saas", "section": "sla_guarantees"}],
  "agent_id": "juriste_conformite",
  "tokens": {"router": 1820, "answer": 3410}
}
The Router reads both index files in a single call, selecting both the right documents and the right agent simultaneously — keeping the total at exactly 2 LLM calls.

The Multi_L Pipeline — parallel, then synthesized

Multi_L fans out N independent L1 layers, runs them concurrently, then synthesizes through an Aggregator LLM. Use it for multi-perspective analysis, question decomposition, or corpus partitioning.

Planner (SLM)

Reads index.md + question + preset → emits a JSON array of layer plans: [{ agent_id, index_scope, sub_question }]. Capped at multil_max_layers (default 3).
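
An illustrative preset-A plan for the risk question from the Quick Start (the second agent_id and the index_scope values are invented):

```json
[
  { "agent_id": "juriste_conformite", "index_scope": "all", "sub_question": "Analyze risk across all contracts" },
  { "agent_id": "analyste_financier", "index_scope": "all", "sub_question": "Analyze risk across all contracts" }
]
```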

N × L1 in parallel

Each layer plan spawns a full L1 run (Router + Answer) with its own agent, scoped index, and sub-question. All layers run concurrently with asyncio.gather. Each has a configurable timeout.
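
A minimal sketch of the fan-out, assuming the L1Engine.run() coroutine listed in the project structure (its keyword arguments and the timeout handling are assumptions):

```python
import asyncio

async def run_layers(l1_engine, plans: list[dict], timeout_s: float = 60.0):
    """Fan out one L1 run per layer plan, concurrently, each with its own timeout."""
    async def run_one(plan: dict):
        # l1_engine.run()'s exact signature is assumed for illustration.
        return await asyncio.wait_for(
            l1_engine.run(question=plan["sub_question"],
                          agent_id=plan["agent_id"],
                          index_scope=plan["index_scope"]),
            timeout=timeout_s,
        )

    # return_exceptions=True: a timed-out layer surfaces as an exception
    # in the results instead of cancelling its siblings.
    return await asyncio.gather(*(run_one(p) for p in plans), return_exceptions=True)
```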

Aggregator (LLM)

Receives all layer answers. Writes a unified synthesis that explicitly names contradictions, preserves all citations, and notes which layer produced which insight.

Four presets

Preset A: Multi-Agent

Same question, different agents. Cross-perspective analysis in one response.

Preset B: Decomposition

Split the question into independent sub-questions, each routed separately.

Preset C: Multi-Corpus

Same question, different agents, different document scopes.

Preset D: Hybrid / Auto

The Planner freely combines agents, sub-questions, and index scopes.

NoRag vs RAG — Technical Comparison

| Criterion | RAG (classical) | NoRag |
| --- | --- | --- |
| Routing method | Cosine similarity on vectors | LLM semantic routing on Markdown index |
| Infrastructure | Vector DB (Pinecone, Weaviate, pgvector…) | Plain Markdown files |
| Embeddings | ✗ Required (cost per token) | ✓ None |
| Human readability | ✗ Opaque vectors | ✓ 100%, any text editor |
| Context given to LLM | Arbitrary top-K chunks | Complete named sections |
| Citations | Approximate (chunk offset) | ✓ Exact [doc_id, section] |
| Session memory | Must be implemented separately | ✓ Native via index files |
| Auditability | Debugging requires ML expertise | ✓ Read the index, see the routing |
| Scalability | ✓ Billions of documents | Up to ~500 documents |
| Setup complexity | 2–6 weeks | 2–4 hours (or 5 min with no-code mode) |
| Vendor lock-in | High (Pinecone/Weaviate migration is costly) | ✓ None, plain .md files |

Cost Analysis

Setup costs

| Item | NoRag | RAG (classical) |
| --- | --- | --- |
| Infrastructure | $0 local / ~$25/mo Supabase Pro | $70–500/mo (Pinecone, Weaviate) |
| Embedding API | $0 (no embeddings) | $0.02–0.13 per 1M tokens |
| Initial dev time | 2–4 hours | 2–6 weeks |
| Required skills | Basic Python (or none with no-code mode) | ML + DevOps + advanced Python |
| Team onboarding | < 1 day | 1–4 weeks |

Per-request cost comparison

| Pipeline | Tokens | Est. cost |
| --- | --- | --- |
| NoRag: Router call (index ~5K + question) | ~6K input | $0.0009 |
| NoRag: Answer call (sections ~10K + question) | ~12K in + 1K out | $0.0020 |
| NoRag total (Gemini Flash) | ~18K tokens | ~$0.003 |
| RAG: Embedding query | ~50 tokens | $0.000001 |
| RAG: Vector search (Pinecone) | n/a | ~$0.0001 |
| RAG: Top-5 chunks + answer (GPT-4o) | ~3K in + 500 out | $0.0135 |
| RAG total (GPT-4o + Pinecone) | ~3K LLM tokens | ~$0.014 |
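
A quick sanity check of the per-request ratio, using only the figures in the table above:

```python
norag = 0.0009 + 0.0020               # Router + Answer (Gemini Flash)
rag = 0.000001 + 0.0001 + 0.0135      # embedding + vector search + GPT-4o answer
print(f"{rag / norag:.1f}x")          # ≈ 4.7x, matching the "4–5× cheaper" claim below
```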

12-month TCO — 1,000 req/month, 100 documents

| Scenario | NoRag (Gemini Flash) | RAG (GPT-4o + Pinecone) |
| --- | --- | --- |
| Setup | ~$2 | $500–2,000 (dev time) |
| Infrastructure / year | $0–300 | $840–6,000 |
| LLM requests / year | $36 | $168 |
| Maintenance / year | $60 | $500–1,500 |
| Total, 12 months | ~$100–400 | ~$2,000–10,000 |

NoRag with Gemini Flash is 4–5× cheaper per request than a typical RAG pipeline on GPT-4o, even though it consumes more tokens, because Gemini Flash is 10–20× cheaper per token.

When to Use NoRag

Choose NoRag when:

| Use case | Why NoRag |
| --- | --- |
| Private library (10–500 docs) | Readable index, instant setup, precise citations |
| Internal knowledge base | Full control, auditability, no vendor lock-in |
| Prototype / MVP | No-code mode operational in 5 minutes |
| Critical technical docs | Section-level citations, guaranteed traceability |
| Non-technical team | Markdown index editable in Notion or Excel |
| Limited budget | No vector DB, no separate embedding API |
| Cross-document questions | LLM intelligently combines multiple documents |
| Compliance / auditability | Every routing decision is explainable |

Choose RAG when:

| Use case | Why RAG |
| --- | --- |
| Massive corpus (1,000+ docs) | Vector search scales; an LLM-read index becomes impractical |
| Real-time updates | New documents indexed in seconds |
| Multimodal content | Image/audio/video embeddings (CLIP, Whisper…) |
| Experienced ML team | Stack well understood, monitoring in place |
| Consumer-scale app | Scalability is critical at millions of users |

Quick Start

Mode A — API (FastAPI)

git clone https://github.com/supergmax/NoRag
cd NoRag
pip install -e ".[dev]"
export GEMINI_API_KEY=your_key

uvicorn api.main:app --reload

# L1 query
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the SLA terms?", "mode": "L1"}'

# Multi_L preset A
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "Analyze risk across all contracts", "mode": "MultiL", "preset": "A"}'

Mode B — Claude Code skill

/norag What are the SLA terms in the SaaS contract?
/norag multi_l A Analyze our contracts from all angles
/norag list
/norag agents

Project structure

NoRag/
├── core/
│   ├── config.py          ← ROUTER_MODEL, ANSWER_MODEL, AGGREGATOR_MODEL
│   ├── storage.py         ← read_index(), read_document_sections()
│   ├── llm_client.py      ← generate(), generate_json()
│   ├── l1_engine.py       ← L1Engine.run()
│   ├── multi_l_engine.py  ← MultiLEngine.run()
│   ├── indexer.py         ← Indexer.ingest()
│   └── prompts/
│       ├── router.md
│       ├── planner.md
│       └── aggregator.md
├── api/
│   ├── main.py            ← FastAPI app, create_app()
│   └── schemas.py         ← Pydantic v2 models
├── data/
│   ├── index.md
│   ├── index_system_prompt.md
│   └── documents/
└── tests/                 ← 28 tests, all green
NoRag v2 · github.com/supergmax/NoRag · 2026