Corpora (RAG)
A corpus is a collection of documents Kenaz has indexed and made retrievable to the model. When you attach a corpus to a session, the model can ask "find me the 5 most relevant chunks for this query" without dumping the whole document set into context.
Use corpora when you have:
- A set of internal docs the model should know about (engineering notes, runbooks, contracts).
- A codebase too large to paste into a turn.
- Reference material you want the model to cite from rather than hallucinate.
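Under the hood, "find the 5 most relevant chunks" is a nearest-neighbor search over chunk embeddings. A minimal sketch of that idea, using toy 2-D vectors in place of real embedding-model output (illustrative only, not Kenaz's actual index code):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=5):
    """index: list of (chunk_text, embedding) pairs; returns the k best chunks."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]

# Toy 2-D embeddings standing in for real model output.
index = [("runbook: restart the API", [1.0, 0.1]),
         ("contract: renewal terms",  [0.0, 1.0]),
         ("notes: API rate limits",   [0.9, 0.2])]
print(top_k([1.0, 0.0], index, k=2))
# → ['runbook: restart the API', 'notes: API rate limits']
```

Only the chunks that score highest enter the model's context; the rest of the corpus stays on disk.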
Creating a corpus
Corpora view → New corpus.
- Pick a name and a description.
- Add sources. Three source types:
  - Local directory — Kenaz walks the directory, reads supported formats, and indexes them.
  - Single file — drop in a PDF, Markdown, or code file.
  - Web URL — Kenaz fetches and indexes a single page (one-shot, no recursion).
- Pick an embedding provider. Embeddings come from the same providers as chat — Anthropic, OpenAI, Bedrock, Ollama, OpenRouter — but the model is different (smaller, faster, cheaper). Kenaz suggests a default per provider (`text-embedding-3-large` on OpenAI, `voyage-3` via OpenRouter, `nomic-embed-text` via Ollama, …).
- Click Build. Kenaz extracts text, chunks it, embeds each chunk, and writes the index to `$XDG_DATA_HOME/kenaz-harness/corpora/<id>/`.
Builds run in the background. The corpus is queryable as soon as the first chunks are embedded; the indicator turns green when fully built.
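The build pipeline (extract, chunk, embed, persist) can be sketched as follows. The chunking parameters, the `embed` callable, and the `index.json` layout are illustrative assumptions, not Kenaz's real implementation:

```python
import json
import pathlib

def chunk(text, size=1000, overlap=100):
    """Split text into overlapping character windows (a common RAG default)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def build_corpus(files, embed, out_dir):
    """embed: callable(str) -> list[float], backed by the chosen provider."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    records = []
    for path in files:
        text = pathlib.Path(path).read_text(errors="ignore")
        for i, piece in enumerate(chunk(text)):
            records.append({"source": str(path), "chunk": i,
                            "text": piece, "embedding": embed(piece)})
    (out / "index.json").write_text(json.dumps(records))
    return len(records)
```

Because each chunk is embedded independently, partial results can be written as they arrive, which is why the corpus is queryable before the build finishes.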
Supported document formats
- Plaintext — `.md`, `.txt`, `.rst`, `.org`
- Code — every common extension; chunked along function/class boundaries when a tree-sitter parser is available
- PDF — text-extractable PDFs; OCR is not run automatically
- HTML — fetched URLs and `.html` files are stripped to readable text
- Office documents — `.docx`, `.pptx`, `.xlsx` via local conversion
Binary formats Kenaz doesn't understand are skipped with a log line.
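A minimal classifier for the table above might look like this (the extension sets are illustrative; Kenaz's real format table may differ):

```python
import pathlib

TEXT_EXTS = {".md", ".txt", ".rst", ".org"}
CODE_EXTS = {".py", ".rs", ".go", ".ts", ".c", ".java"}  # "every common extension" in practice
OFFICE_EXTS = {".docx", ".pptx", ".xlsx"}

def classify(path):
    """Map a file to an extraction strategy, or None to skip it with a log line."""
    ext = pathlib.Path(path).suffix.lower()
    if ext in TEXT_EXTS:
        return "plaintext"
    if ext in CODE_EXTS:
        return "code"    # chunk along function/class boundaries if tree-sitter can parse it
    if ext == ".pdf":
        return "pdf"     # text extraction only; no automatic OCR
    if ext in {".html", ".htm"}:
        return "html"    # strip to readable text
    if ext in OFFICE_EXTS:
        return "office"  # local conversion
    return None
```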
Attaching to a session
In the chat header → Corpora dropdown → check the corpora you want available. Multiple corpora can be active at once.
When a corpus is attached, the model gets a corpus.search tool that takes a query and returns the top-K chunks with their source filenames. The model decides when to use it — typically once at the start of a turn before generating a final answer.
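Concretely, the tool the model sees might be declared roughly like this. The exact schema Kenaz registers is not documented here; this sketch follows common tool-calling conventions:

```python
# Hypothetical shape of the corpus.search tool declaration.
corpus_search_tool = {
    "name": "corpus.search",
    "description": "Return the top-K most relevant chunks for a query.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "k": {"type": "integer", "default": 5},
        },
        "required": ["query"],
    },
}

# A result entry pairs chunk text with its source filename:
example_result = [{"source": "runbooks/api.md", "text": "To restart the API, run..."}]
```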
Updating a corpus
- Re-index a single file — Corpora view → corpus → file → ⋯ → Re-embed. Useful when a runbook changed.
- Re-index everything — corpus → ⋯ → Rebuild. Wipes and re-embeds. Cheap on small corpora, slow on large ones.
- Watch a directory — Source → ⋯ → Watch. Kenaz re-indexes files when their mtime changes. Off by default.
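The Watch behavior reduces to an mtime poll. A sketch, assuming a simple map of recorded mtimes (not Kenaz's actual watcher):

```python
import os

def scan(paths, seen_mtimes):
    """Return files whose mtime advanced since the last scan; update the map in place."""
    stale = []
    for p in paths:
        m = os.stat(p).st_mtime
        if m > seen_mtimes.get(p, 0.0):
            stale.append(p)
            seen_mtimes[p] = m
    return stale
```

Each file returned by `scan` would then be re-chunked and re-embedded, the same path as a manual Re-embed.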
Privacy
- Documents are read locally and embedded by whichever embedding provider you configured for the corpus. The full text of each chunk is sent to that provider.
- The resulting embeddings (vectors) and chunk text are stored locally; never uploaded.
- A `corpus.search` tool call sends only the query (a few words or a sentence) to the model — not the corpus contents. Once the model picks chunks to read, those chunk contents go into the next turn's context as the model continues.
- The corpus index sits at `$XDG_DATA_HOME/kenaz-harness/corpora/<id>/`. Delete that directory, or use the UI's Delete corpus action, to remove it.
Cost
Embedding cost is roughly proportional to total document length. Per-million-token rates as of writing:
- OpenAI `text-embedding-3-large` — $0.13 / 1M tokens
- Voyage AI (via OpenRouter) — $0.18 / 1M tokens
- Bedrock Titan Embeddings — varies by region
- Ollama — free (local)
A 5-megabyte Markdown corpus is roughly 1.2M tokens — under $0.25 on the paid providers above.
Recurring queries are free — embeddings are computed once at build time and reused on every search.
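The estimate above can be checked with back-of-envelope arithmetic, using roughly 4 characters per token as a heuristic (rates from the list above; the dictionary keys are illustrative):

```python
RATE_PER_M_TOKENS = {
    "openai/text-embedding-3-large": 0.13,
    "openrouter/voyage": 0.18,
}

def embed_cost(size_bytes, model, chars_per_token=4):
    """Estimated one-time build cost in dollars for a plaintext corpus."""
    tokens = size_bytes / chars_per_token
    return tokens / 1_000_000 * RATE_PER_M_TOKENS[model]

# 5 MB of Markdown ~ 1.25M tokens ~ $0.16 at OpenAI's rate.
print(f"${embed_cost(5_000_000, 'openai/text-embedding-3-large'):.2f}")  # → $0.16
```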