
NoRag v2

Technical Blueprint

Document question-answering without a vector database. Two explicit LLM pipelines over plain Markdown indexes.

What is NoRag?

NoRag is a document Q&A system with no embedding layer. Where classical RAG transforms documents into opaque embedding vectors, NoRag keeps everything in plain, readable Markdown and delegates routing to the contextual intelligence of modern LLMs.

Index layout

data/
├── index.md                ← document catalog (doc_id, sections, keywords)
├── index_system_prompt.md  ← agent catalog (agent_id, capabilities, system prompt)
└── documents/
    ├── contract_saas.md    ← full text, split by ## section_id headers
    └── ...
index.md is the routing brain — it lists every document with its sections and keywords, compact enough to fit in a single LLM context window.
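
For illustration, an index.md entry might look like this (a sketch only; beyond contract_saas and sla_guarantees, the section and keyword values are invented):

```markdown
## contract_saas
sections: sla_guarantees, penalties, termination, pricing
keywords: SLA, uptime, penalty, renewal, data protection
```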

The L1 Pipeline — 2 calls, deterministic

L1 is the core NoRag primitive. Every request is resolved in exactly two LLM calls, no matter how complex the question.

Call 1 — Router (SLM)

Reads index.md + index_system_prompt.md + the user question. Returns structured JSON: { agent_id, documents: [{ doc_id, sections }], reasoning }. A small, fast model handles this cheaply.
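
An illustrative Router response for the SLA question used later in this section (the reasoning text is invented):

```json
{
  "agent_id": "juriste_conformite",
  "documents": [
    { "doc_id": "contract_saas", "sections": ["sla_guarantees"] }
  ],
  "reasoning": "Uptime guarantees are a legal/compliance topic covered by the SaaS contract."
}
```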

Storage read

The engine extracts the selected ## section_id blocks from document files. No similarity search — pure regex on known section headers.
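
A minimal sketch of that extraction, assuming sections are delimited by ## section_id headers (the function name and exact pattern are illustrative, not the project's storage.py code):

```python
import re

def extract_sections(markdown: str, wanted: set[str]) -> dict[str, str]:
    """Return the body of each requested '## section_id' block."""
    sections: dict[str, str] = {}
    # A section runs from its '## header' line to the next '## ' header or end of file.
    for m in re.finditer(r"^## (\S+)\n(.*?)(?=^## |\Z)", markdown, re.M | re.S):
        section_id, body = m.group(1), m.group(2).strip()
        if section_id in wanted:
            sections[section_id] = body
    return sections
```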

Call 2 — Answer (LLM)

Receives the agent system prompt + full extracted sections + the question. Responds with mandatory citations [doc_id, section_id]. A capable model handles this.
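
For illustration only, the second call's input could be assembled like this (a hypothetical helper; NoRag's real prompt template is not shown in this document):

```python
def build_answer_input(agent_prompt: str,
                       sections: dict[str, dict[str, str]],
                       question: str) -> str:
    """Concatenate the agent system prompt, every extracted section
    (labeled so the model can cite [doc_id, section_id]), and the question."""
    blocks = [f"[{doc_id}, {section_id}]\n{text}"
              for doc_id, doc in sections.items()
              for section_id, text in doc.items()]
    return agent_prompt + "\n\n" + "\n\n".join(blocks) + f"\n\nQuestion: {question}"
```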

L1Result

{ answer, citations, agent_id, tokens: { router, answer } }. Every citation is traceable to an exact section in an exact document.
POST /query
{
  "question": "What SLA is guaranteed in the SaaS contract?",
  "mode": "L1"
}

→ {
  "answer": "The SaaS contract guarantees 99.9% uptime ...",
  "citations": [{"doc_id": "contract_saas", "section": "sla_guarantees"}],
  "agent_id": "juriste_conformite",
  "tokens": {"router": 1820, "answer": 3410}
}
The Router reads both index files in a single call, selecting both the right documents and the right agent simultaneously — keeping the total at exactly 2 LLM calls.

The Multi_L Pipeline — parallel, then synthesized

Multi_L fans out N independent L1 layers, runs them concurrently, then synthesizes through an Aggregator LLM. Use it for multi-perspective analysis, question decomposition, or corpus partitioning.

Planner (SLM)

Reads index.md + question + preset → emits a JSON array of layer plans: [{ agent_id, index_scope, sub_question }]. Capped at multil_max_layers (default 3).
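
An illustrative preset-A plan for the risk question from the Quick Start (the second agent_id and the index_scope values are invented):

```json
[
  { "agent_id": "juriste_conformite", "index_scope": "all", "sub_question": "Analyze risk across all contracts" },
  { "agent_id": "analyste_financier", "index_scope": "all", "sub_question": "Analyze risk across all contracts" }
]
```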

N × L1 in parallel

Each layer plan spawns a full L1 run (Router + Answer) with its own agent, scoped index, and sub-question. All layers run concurrently with asyncio.gather. Each has a configurable timeout.
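
A minimal sketch of the fan-out, assuming the L1Engine.run() coroutine listed in the project structure (its keyword arguments and the timeout handling are assumptions):

```python
import asyncio

async def run_layers(l1_engine, plans: list[dict], timeout_s: float = 60.0):
    """Fan out one L1 run per layer plan, concurrently, each with its own timeout."""
    async def run_one(plan: dict):
        # l1_engine.run()'s exact signature is assumed for illustration.
        return await asyncio.wait_for(
            l1_engine.run(question=plan["sub_question"],
                          agent_id=plan["agent_id"],
                          index_scope=plan["index_scope"]),
            timeout=timeout_s,
        )

    # return_exceptions=True: a timed-out layer surfaces as an exception
    # in the results instead of cancelling its siblings.
    return await asyncio.gather(*(run_one(p) for p in plans), return_exceptions=True)
```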

Aggregator (LLM)

Receives all layer answers. Writes a unified synthesis that explicitly names contradictions, preserves all citations, and notes which layer produced which insight.

Four presets

Preset A: Multi-Agent

Same question, different agents. Cross-perspective analysis in one response.

Preset B: Decomposition

Split the question into independent sub-questions, each routed separately.

Preset C: Multi-Corpus

Same question, different agents, different document scopes.

Preset D: Hybrid / Auto

The Planner freely combines agents, sub-questions, and index scopes.

NoRag vs RAG — Technical Comparison

| Criterion | RAG (classical) | NoRag |
| --- | --- | --- |
| Routing method | Cosine similarity on vectors | LLM semantic routing on Markdown index |
| Infrastructure | Vector DB (Pinecone, Weaviate, pgvector…) | Plain Markdown files |
| Embeddings | ✗ Required (cost per token) | ✓ None |
| Human readability | ✗ Opaque vectors | ✓ 100%, any text editor |
| Context given to LLM | Arbitrary top-K chunks | Complete named sections |
| Citations | Approximate (chunk offset) | ✓ Exact [doc_id, section] |
| Session memory | Must be implemented separately | ✓ Native via index files |
| Auditability | Debugging requires ML expertise | ✓ Read the index, see the routing |
| Scalability | ✓ Billions of documents | Up to ~500 documents |
| Setup complexity | 2–6 weeks | 2–4 hours (or 5 min with no-code mode) |
| Vendor lock-in | High (Pinecone/Weaviate migration is costly) | ✓ None, plain .md files |

Cost Analysis

Setup costs

| Item | NoRag | RAG (classical) |
| --- | --- | --- |
| Infrastructure | $0 local / ~$25/mo Supabase Pro | $70–500/mo (Pinecone, Weaviate) |
| Embedding API | $0 (no embeddings) | $0.02–0.13 per 1M tokens |
| Initial dev time | 2–4 hours | 2–6 weeks |
| Required skills | Basic Python (or none with no-code mode) | ML + DevOps + advanced Python |
| Team onboarding | < 1 day | 1–4 weeks |

Per-request cost comparison

| Pipeline | Tokens | Est. cost |
| --- | --- | --- |
| NoRag: Router call (index ~5K + question) | ~6K input | $0.0009 |
| NoRag: Answer call (sections ~10K + question) | ~12K in + 1K out | $0.0020 |
| NoRag total (Gemini Flash) | ~18K tokens | ~$0.003 |
| RAG: Embedding query | ~50 tokens | $0.000001 |
| RAG: Vector search (Pinecone) | n/a | ~$0.0001 |
| RAG: Top-5 chunks + answer (GPT-4o) | ~3K in + 500 out | $0.0135 |
| RAG total (GPT-4o + Pinecone) | ~3K LLM tokens | ~$0.014 |
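
A quick sanity check of the per-request ratio, using only the figures in the table above:

```python
norag = 0.0009 + 0.0020               # Router + Answer (Gemini Flash)
rag = 0.000001 + 0.0001 + 0.0135      # embedding + vector search + GPT-4o answer
print(f"{rag / norag:.1f}x")          # ≈ 4.7x, matching the "4–5× cheaper" claim below
```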

12-month TCO — 1,000 req/month, 100 documents

| Scenario | NoRag (Gemini Flash) | RAG (GPT-4o + Pinecone) |
| --- | --- | --- |
| Setup | ~$2 | $500–2,000 (dev time) |
| Infrastructure / year | $0–300 | $840–6,000 |
| LLM requests / year | $36 | $168 |
| Maintenance / year | $60 | $500–1,500 |
| Total, 12 months | ~$100–400 | ~$2,000–10,000 |

NoRag with Gemini Flash is 4–5× cheaper per request than a typical RAG pipeline on GPT-4o, even though it consumes more tokens, because Gemini Flash is 10–20× cheaper per token.

When to Use NoRag

Choose NoRag when:

| Use case | Why NoRag |
| --- | --- |
| Private library (10–500 docs) | Readable index, instant setup, precise citations |
| Internal knowledge base | Full control, auditability, no vendor lock-in |
| Prototype / MVP | No-code mode operational in 5 minutes |
| Critical technical docs | Section-level citations, guaranteed traceability |
| Non-technical team | Markdown index editable in Notion or Excel |
| Limited budget | No vector DB, no separate embedding API |
| Cross-document questions | LLM intelligently combines multiple documents |
| Compliance / auditability | Every routing decision is explainable |

Choose RAG when:

| Use case | Why RAG |
| --- | --- |
| Massive corpus (1,000+ docs) | Vector search scales; an LLM-read index becomes impractical |
| Real-time updates | New documents indexed in seconds |
| Multimodal content | Image/audio/video embeddings (CLIP, Whisper…) |
| Experienced ML team | Stack well understood, monitoring in place |
| Consumer-scale app | Scalability is critical at millions of users |

Quick Start

Mode A — API (FastAPI)

git clone https://github.com/supergmax/NoRag
cd NoRag
pip install -e ".[dev]"
export GEMINI_API_KEY=your_key

uvicorn api.main:app --reload

# L1 query
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the SLA terms?", "mode": "L1"}'

# Multi_L preset A
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "Analyze risk across all contracts", "mode": "MultiL", "preset": "A"}'

Mode B — Claude Code skill

/norag What are the SLA terms in the SaaS contract?
/norag multi_l A Analyze our contracts from all angles
/norag list
/norag agents

Project structure

NoRag/
├── core/
│   ├── config.py          ← ROUTER_MODEL, ANSWER_MODEL, AGGREGATOR_MODEL
│   ├── storage.py         ← read_index(), read_document_sections()
│   ├── llm_client.py      ← generate(), generate_json()
│   ├── l1_engine.py       ← L1Engine.run()
│   ├── multi_l_engine.py  ← MultiLEngine.run()
│   ├── indexer.py         ← Indexer.ingest()
│   └── prompts/
│       ├── router.md
│       ├── planner.md
│       └── aggregator.md
├── api/
│   ├── main.py            ← FastAPI app, create_app()
│   └── schemas.py         ← Pydantic v2 models
├── data/
│   ├── index.md
│   ├── index_system_prompt.md
│   └── documents/
└── tests/                 ← 28 tests, all green
NoRag v2 · github.com/supergmax/NoRag · 2026