docmd-search splits the work into two distinct phases: a heavy build-time pipeline that runs on Node.js, and a lightweight search-time runtime that runs in the browser using only arithmetic.
Architecture overview
┌─────────────────────────────────────────────────────────┐
│  BUILD TIME (Node.js)                                   │
│                                                         │
│  Crawl → Chunk → Embed (ONNX) → Quantize → Compress     │
│            │                       │                    │
│            └───────────────────────┘                    │
│                        │                                │
│      Engine Adapter (Rust → JS → built-in)              │
│                        │                                │
│                        ▼                                │
│  .docmd-search/                                         │
│  ├── manifest.json                                      │
│  ├── batches/000.json, 000.bin                          │
│  └── navigation.json                                    │
└─────────────────────────────────────────────────────────┘
                             │
                      deploy / serve
                             │
┌─────────────────────────────────────────────────────────┐
│  SEARCH TIME (Browser, <3KB)                            │
│                                                         │
│  Load manifest → Load batch 000 → Search immediately    │
│                → Background-load remaining batches      │
│                → Keyword scoring + Cosine similarity    │
│                → Ranked results                         │
└─────────────────────────────────────────────────────────┘
Engine adapter
docmd-search uses an engine adapter (src/engine.ts) that automatically selects the best available backend for CPU-bound tasks like chunking and quantization:
| Priority | Engine | When used |
|---|---|---|
| 1 | Rust ⚡ | @docmd/engine-rust installed and binary available |
| 2 | JS ◆ | @docmd/engine-js installed (docmd is present) |
| 3 | Built-in ◇ | Always available - no external dependencies |
docmd-search does not require docmd or its engines. When running standalone (npx docmd-search ./docs), the built-in fallback handles everything. When running inside a docmd project, the adapter picks up the Rust engine if it is installed and uses it to accelerate chunking and quantization automatically.
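The priority order in the table above amounts to a "first available wins" fallback chain. The following is a minimal sketch of that idea only: the type names, the `isAvailable` probes, and `selectEngine` are hypothetical and are not taken from src/engine.ts.

```typescript
// Hypothetical sketch: names and probe logic are illustrative,
// not docmd-search's actual adapter code.
type EngineName = "rust" | "js" | "builtin";

interface EngineCandidate {
  name: EngineName;
  // Returns true when this backend can be used in the current environment.
  isAvailable: () => boolean;
}

// Walk the candidates in priority order and return the first usable one.
function selectEngine(candidates: EngineCandidate[]): EngineName {
  for (const candidate of candidates) {
    try {
      if (candidate.isAvailable()) return candidate.name;
    } catch {
      // A probe that throws (for example, a missing native binary)
      // is treated as "not available" and the next candidate is tried.
    }
  }
  // The built-in engine has no external dependencies, so it is the last resort.
  return "builtin";
}

// Example: the Rust probe fails, the JS engine is present.
const chosen = selectEngine([
  { name: "rust", isAvailable: () => { throw new Error("binary not found"); } },
  { name: "js", isAvailable: () => true },
  { name: "builtin", isAvailable: () => true },
]);
```

Treating a throwing probe the same as an unavailable engine keeps a broken optional dependency from failing the whole build.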
Tasks delegated to the engine
| Task | Purpose |
|---|---|
| search:chunk | Split text into overlapping chunks by heading + word count |
| search:quantize | Float32[] → Int8[] per-vector quantization |
| search:cosine | Batch cosine similarity scoring (for search) |
ONNX inference (the actual embedding generation) stays in Node.js - it uses onnxruntime-node which is itself a native addon. The engine handles the pure-math tasks that benefit from Rust’s speed.
Build-time pipeline
1. Crawl
The crawler walks your directory and discovers files matching the include patterns while respecting exclude patterns. Default file types: .md, .txt, .html.
For incremental indexing, the crawler compares each file’s modification time and size against the stored manifest. Unchanged files are skipped entirely.
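The incremental check can be sketched as a comparison between what the crawler sees on disk and the per-file records stored in the manifest. The `mtime` and `size` fields mirror the manifest example in this document; the function and type names are hypothetical.

```typescript
// Hypothetical sketch of the "skip unchanged files" check.
interface FileRecord {
  mtime: number; // modification time, as stored in manifest.json
  size: number;  // file size in bytes
}

interface CrawledFile extends FileRecord {
  path: string;
}

// A file needs re-indexing when it is new, or when its mtime or size differs
// from the record stored by the previous run.
function needsReindex(current: FileRecord, stored: FileRecord | undefined): boolean {
  if (stored === undefined) return true;
  return current.mtime !== stored.mtime || current.size !== stored.size;
}

// Return the paths the pipeline still has to process.
function changedFiles(
  crawled: CrawledFile[],
  manifestFiles: Record<string, FileRecord>,
): string[] {
  return crawled
    .filter((file) => needsReindex(file, manifestFiles[file.path]))
    .map((file) => file.path);
}
```

For example, given a manifest that records docs/index.md and docs/guide.md, a crawl that finds an edited docs/guide.md and a new docs/faq.md would return only those two paths, and docs/index.md would be skipped.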
2. Chunk
Each file is split into chunks using heading-aware boundaries. The chunking is delegated to the engine adapter (Rust when available, otherwise built-in JS):
- Markdown headings (#, ##, ###, etc.) create natural chunk boundaries
- Chunks respect the configured chunkSize (in tokens, default: 256)
- Adjacent chunks share chunkOverlap tokens (default: 32) to prevent information loss at boundaries
- Each chunk retains its source file path, heading context, and byte range
# Installation Guide ← chunk boundary
## Prerequisites ← chunk boundary
You need Node.js 18+...
Make sure npm is installed...
## Quick Start ← chunk boundary
Run the following command...
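A minimal sketch of heading-aware chunking with overlap follows. It makes several simplifying assumptions that may differ from the real engine: tokens are approximated by whitespace-separated words, source paths and byte ranges are omitted, and a heading with no body text produces no chunk. `chunkMarkdown` and `Chunk` are hypothetical names.

```typescript
// Hypothetical sketch: approximates tokens with whitespace-separated words.
interface Chunk {
  heading: string; // nearest preceding heading
  text: string;    // chunk body
}

function chunkMarkdown(markdown: string, chunkSize = 256, chunkOverlap = 32): Chunk[] {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const headingPattern = /^#{1,6}\s+/;
  const chunks: Chunk[] = [];
  let heading = "";
  let words: string[] = [];

  // Emit the words collected under the current heading as overlapping windows.
  const flush = (): void => {
    const step = chunkSize - chunkOverlap;
    for (let start = 0; start < words.length; start += step) {
      chunks.push({ heading, text: words.slice(start, start + chunkSize).join(" ") });
      if (start + chunkSize >= words.length) break; // last window reached the end
    }
    words = [];
  };

  for (const line of markdown.split("\n")) {
    if (headingPattern.test(line)) {
      flush(); // a heading always closes the current chunk
      heading = line.replace(headingPattern, "").trim();
    } else {
      words.push(...line.split(/\s+/).filter((w) => w.length > 0));
    }
  }
  flush();
  return chunks;
}
```

With `chunkSize = 4` and `chunkOverlap = 1`, the text `a b c d e f g` becomes two chunks, `a b c d` and `d e f g`, so the word at the boundary appears in both.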
3. Embed
Each chunk’s text is fed through an ONNX Runtime model to produce a dense vector embedding - a fixed-length array of floating-point numbers that captures the chunk’s semantic meaning.
ONNX Runtime runs models locally without Python, CUDA, or cloud APIs. The models are downloaded once and cached at ~/.docmd-search/models/. No data ever leaves your machine.
Models run in Int8-quantized form (q8) by default - the model_quantized.onnx variant from HuggingFace. This is ~4× smaller than full-precision (fp32) and 2-3× faster at inference time, with negligible quality loss. The default all-MiniLM-L6-v2 model is ~23 MB in this form.
ONNX Runtime is configured to use all available CPU cores automatically. The thread count is set to the physical CPU count so ORT’s internal scheduler can select the optimal parallelism for the machine - this gives a further 2-4× speedup over the default single-threaded path.
| Configuration | Throughput | Notes |
|---|---|---|
| fp32, default threading | ~18 chunks/s | Original baseline |
| q8, default threading | ~55 chunks/s | q8 dtype only |
| q8, full CPU threading | ~2000+ chunks/s | Current default |
The default model is trained on English text. For multilingual documentation (Chinese, German, French, etc.) switch to a multilingual model in your config. See Model selection.
4. Quantize
Raw embeddings are Float32 arrays (e.g., 384 dimensions × 4 bytes = 1,536 bytes per chunk). Quantization compresses them to Int8 (1 byte per dimension), reducing size by 75% with negligible impact on search quality.
Float32: [0.234, -0.891, 0.045, ...] → Int8: [30, -114, 6, ...]
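One common way to implement this step is symmetric per-vector scaling, where the largest absolute component maps to 127. The sketch below assumes that scheme; the engine's actual scale factor and rounding are not documented here, and `quantizeInt8` and `dequantize` are hypothetical names. The fourth component of the example vector (0.992) is made up for illustration.

```typescript
// Hypothetical sketch of per-vector Float32 -> Int8 quantization.
function quantizeInt8(vector: Float32Array): { values: Int8Array; scale: number } {
  // Find the largest magnitude so it can be mapped to the Int8 limit (127).
  let maxAbs = 0;
  vector.forEach((v) => {
    maxAbs = Math.max(maxAbs, Math.abs(v));
  });
  const scale = maxAbs === 0 ? 1 : 127 / maxAbs;

  const values = new Int8Array(vector.length);
  vector.forEach((v, i) => {
    values[i] = Math.round(v * scale); // always within [-127, 127]
  });
  return { values, scale };
}

// Approximate inverse: recover floats from the stored bytes and scale.
function dequantize(values: Int8Array, scale: number): Float32Array {
  const out = new Float32Array(values.length);
  values.forEach((v, i) => {
    out[i] = v / scale;
  });
  return out;
}

const example = quantizeInt8(new Float32Array([0.234, -0.891, 0.045, 0.992]));
// example.values holds [30, -114, 6, 127]: 4 bytes instead of 16.
```

Cosine similarity is unaffected by a positive per-vector scale, so the scale only needs to be stored if the original magnitudes must be recovered.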
5. Compress
For larger indexes, additional compression kicks in automatically:
| Chunk count | Compression | Ratio | Description |
|---|---|---|---|
| ≤ 100 | None | 1:1 | Raw Int8 vectors, no overhead |
| 101-1000 | Ternary | ~12:1 | Vectors reduced to {-1, 0, +1} values |
| > 1000 | Product Quantization | ~24:1 | Codebook-based, highest compression |
You don’t need to configure compression. The indexer selects the optimal strategy based on the number of chunks.
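The selection rule in the table can be sketched as a simple threshold function. The label `"ternary"` matches the `compression` field shown in the batch format; `"none"` and `"pq"` are hypothetical labels, and the `ternarize` helper with its `threshold` parameter is only one possible way to reduce Int8 values to {-1, 0, +1}.

```typescript
// Hypothetical sketch of automatic compression selection.
type Compression = "none" | "ternary" | "pq";

// Thresholds taken from the table above.
function chooseCompression(chunkCount: number): Compression {
  if (chunkCount <= 100) return "none";
  if (chunkCount <= 1000) return "ternary";
  return "pq";
}

// Illustrative ternary reduction: components with small magnitude become 0,
// the rest keep only their sign. The threshold value is an assumption.
function ternarize(values: Int8Array, threshold: number): Int8Array {
  return values.map((v) => (v > threshold ? 1 : v < -threshold ? -1 : 0));
}

const strategy = chooseCompression(1247);                         // "pq"
const ternary = ternarize(new Int8Array([30, -114, 6, 127]), 16); // [1, -1, 0, 1]
```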
6. Save (multi-batch)
Chunks and vectors are saved in batches:
.docmd-search/
├── manifest.json # Index metadata, batch list, file records
├── batch-000.json # First 500 chunks + vectors
├── batch-001.json # Next 500 chunks + vectors
├── ...
└── navigation.json # Auto-generated nav tree from file structure
Each batch is independently loadable. The manifest tracks which files are indexed, their modification times, and the batch structure. This enables:
- Progressive loading - search from batch 0, load rest in background
- Incremental updates - only rebuild batches containing changed files
- Resumable indexing - interrupted runs resume from the last complete batch
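Splitting into fixed-size batches can be sketched as follows. The batch size of 500 and the `batch-NNN.json` naming come from the listing above; `splitIntoBatches` and `batchFileName` are hypothetical helper names.

```typescript
// Hypothetical sketch of batch splitting and file naming.
const BATCH_SIZE = 500; // per the directory listing above

function splitIntoBatches<T>(items: T[], batchSize: number = BATCH_SIZE): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Zero-pad the batch id to three digits: 0 -> "batch-000.json".
function batchFileName(batchId: number): string {
  let id = String(batchId);
  while (id.length < 3) id = "0" + id;
  return `batch-${id}.json`;
}

// Example: 1247 chunks produce batches of 500, 500, and 247.
const items: number[] = [];
for (let i = 0; i < 1247; i++) items.push(i);
const sizes = splitIntoBatches(items).map((batch) => batch.length);
```

Because each batch covers a fixed slice of the chunk list, a changed file only invalidates the batches that contain its chunks.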
Search-time runtime
The browser client is under 3KB gzipped. It contains no model weights - only arithmetic for keyword matching and vector comparison.
Loading strategy
1. Fetch manifest - the client loads manifest.json to learn how many batches exist and the vector dimensions.
2. Load batch 0 - the first batch loads and search becomes available immediately.
3. Background-load remaining - using requestIdleCallback (or setTimeout as a fallback), the remaining batches load without blocking the UI. Search results improve as more content becomes available.
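The loading strategy can be sketched with the network and scheduling functions injected, which keeps the example independent of any particular browser API typing. Everything here is an assumption for illustration: `loadIndex`, `FetchJson`, `Schedule`, the batch URL layout (which follows the `batch-NNN.json` naming used in the Save step), and the choice to load background batches one at a time.

```typescript
// Hypothetical sketch of progressive batch loading.
interface ManifestLike {
  batchCount: number; // field name taken from the manifest example
}
type FetchJson = (url: string) => Promise<unknown>;
type Schedule = (task: () => void) => void;

function batchUrl(baseUrl: string, batchId: number): string {
  let id = String(batchId);
  while (id.length < 3) id = "0" + id;
  return `${baseUrl}/batch-${id}.json`;
}

async function loadIndex(
  baseUrl: string,
  fetchJson: FetchJson,
  schedule: Schedule,
  onBatch: (batchId: number, batch: unknown) => void,
): Promise<void> {
  const manifest = (await fetchJson(`${baseUrl}/manifest.json`)) as ManifestLike;
  if (manifest.batchCount < 1) return;

  // Batch 0 is awaited so search can start as soon as it arrives.
  onBatch(0, await fetchJson(batchUrl(baseUrl, 0)));

  // Remaining batches are queued one at a time through the scheduler.
  const loadNext = (id: number): void => {
    if (id >= manifest.batchCount) return;
    schedule(() => {
      fetchJson(batchUrl(baseUrl, id))
        .then((batch) => {
          onBatch(id, batch);
          loadNext(id + 1);
        })
        .catch(() => {
          // On failure, search keeps working with the batches already loaded.
        });
    });
  };
  loadNext(1);
}
```

In a browser, `fetchJson` could wrap `fetch(url).then((r) => r.json())`, and `schedule` could call `requestIdleCallback` when it exists and fall back to `setTimeout` otherwise.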
Hybrid scoring
Each search query produces results using a two-phase scoring algorithm:
Phase 1 - Keyword matching (BM25-like)
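The query is split into terms. Each chunk is scored by how often each term appears, with BM25-style term-frequency saturation: every additional occurrence adds less, and a single term can never contribute more than 1, so long chunks that repeat a term many times cannot dominate the ranking: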
keywordScore = Σ count(term) / (count(term) + 1.5)
Phase 2 - Vector reranking
The top keyword result’s pre-built vector is used as the query vector. All candidate results are reranked by cosine similarity:
finalScore = keywordScore × 0.6 + cosineSimilarity × 0.4
The browser never runs a neural network. The “query vector” is approximated from the best keyword match’s pre-built vector. This keeps the runtime at pure arithmetic - no WASM, no model download, no GPU.
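The two phases can be sketched end to end as below. The saturation constant 1.5 and the 0.6 / 0.4 weights come from the formulas above; the tokenization (lowercasing, splitting on non-word characters), the decision to drop chunks with no keyword hit, and all names are assumptions made for illustration.

```typescript
// Hypothetical sketch of hybrid keyword + vector scoring.
interface Candidate {
  text: string;
  vector: Int8Array; // pre-built at index time
}

// Phase 1: saturated term-frequency score, summed over query terms.
function keywordScore(query: string, text: string): number {
  const words = text.toLowerCase().split(/\W+/);
  const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  let score = 0;
  for (const term of terms) {
    const count = words.filter((w) => w === term).length;
    score += count / (count + 1.5);
  }
  return score;
}

function cosine(a: Int8Array, b: Int8Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Phase 2: rerank keyword hits against the best hit's vector.
function search(query: string, candidates: Candidate[]): { index: number; score: number }[] {
  const hits = candidates
    .map((c, index) => ({ index, vector: c.vector, keyword: keywordScore(query, c.text) }))
    .filter((hit) => hit.keyword > 0);
  if (hits.length === 0) return [];

  // The top keyword match's vector stands in for the query vector.
  const best = hits.reduce((a, b) => (b.keyword > a.keyword ? b : a));
  return hits
    .map((hit) => ({
      index: hit.index,
      score: hit.keyword * 0.6 + cosine(hit.vector, best.vector) * 0.4,
    }))
    .sort((a, b) => b.score - a.score);
}
```

Note that the top keyword match always has a cosine similarity of 1 with itself, so reranking mainly reorders the remaining candidates by how close they are to it.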
Index format
manifest.json
{
  "version": 3,
  "model": "Xenova/all-MiniLM-L6-v2",
  "dimensions": 384,
  "status": "complete",
  "totalChunks": 1247,
  "batchCount": 3,
  "files": {
    "docs/index.md": { "mtime": 1714500000, "size": 2048 },
    "docs/guide.md": { "mtime": 1714500100, "size": 4096 }
  }
}
batch-NNN.json
{
  "batchId": 0,
  "dimensions": 384,
  "compression": "ternary",
  "vectorCount": 500,
  "chunks": [
    {
      "file": "docs/index.md",
      "heading": "Getting Started",
      "text": "Run docmd-search...",
      "range": [0, 256]
    }
  ],
  "vectors": "<base64-encoded compressed vectors>"
}