Why do AI coding agents pick stale or outdated docs?
Because retrieval ranks by similarity, not freshness. When several files mention the same topic — a current runbook, an old postmortem, a duplicate wiki page — the agent has no signal for which is authoritative, so it reads whichever is the closest text match. A stale or duplicate copy can rank high and be read as the source of truth. The fix is a freshness signal: mark each result canonical, stale, or duplicate and return the one canonical doc, so the stale copies never reach the agent's context.
Why similarity search surfaces the wrong copy
Embedding search answers "which text is most similar to the query?", not "which doc is current?" An old incident write-up and the current runbook can be almost identical in wording, so the stale one scores just as high. Worse, near-duplicate copies (a wiki page and its README twin) both rank, so the agent spends tokens reading the same content twice and still has to guess which copy to trust.
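To make the failure concrete, here is a minimal sketch in which word-count cosine similarity stands in for embedding similarity. The file paths and doc text are invented for illustration, and a real embedding model would produce different numbers, but the point carries over: the score depends only on the text, so a stale copy with the same wording ties with the current one.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity over word counts (a simple stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical repo contents: the runbook is current, the postmortem is an
# old copy with the same wording, and the onboarding page is unrelated.
docs = {
    "runbooks/db-failover.md": "restart the database replica then promote the standby",
    "postmortems/2021-outage.md": "restart the database replica then promote the standby",
    "wiki/onboarding.md": "how to request a laptop and set up your editor",
}

query = "how do I promote the standby database replica"

# Score every doc against the query; nothing here knows which doc is current.
scores = {path: cosine(query, text) for path, text in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for path in ranked:
    print(f"{scores[path]:.2f}  {path}")
```

The runbook and the postmortem receive the same score, so which one the agent reads first comes down to tie-breaking, not to which one is authoritative.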
What actually stops it
- A freshness marker per result — canonical, stale, or duplicate — so the agent never reasons over an outdated copy.
- Return one canonical doc, not a ranked list, so there's nothing to mis-rank.
- Skip duplicates instead of sending the agent two copies of the same thing.
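The three points above can be sketched as a small filter over search results. This is an illustration under stated assumptions, not trovex's actual API: the `Result` type, its field names, and `pick_canonical` are all hypothetical, and it assumes each result has already been labeled canonical, stale, or duplicate by some earlier step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    """One search hit (hypothetical shape, not a real library type)."""
    path: str
    line: int
    score: float
    freshness: str  # assumed to be "canonical", "stale", or "duplicate"


def pick_canonical(results: list[Result]) -> Optional[Result]:
    """Return the best-scoring canonical result, or None if there is none.

    Stale and duplicate hits are dropped before selection, so they never
    reach the agent's context regardless of how high they scored.
    """
    canonical = [r for r in results if r.freshness == "canonical"]
    return max(canonical, key=lambda r: r.score, default=None)


# Invented example data: the stale postmortem outscores the current runbook.
results = [
    Result("postmortems/2021-outage.md", 12, 0.93, "stale"),
    Result("runbooks/db-failover.md", 40, 0.91, "canonical"),
    Result("wiki/db-failover.md", 8, 0.91, "duplicate"),
]

best = pick_canonical(results)
if best is not None:
    print(f"{best.path}:{best.line} [{best.freshness}]")
```

Returning a single result rather than a filtered list is a deliberate choice in this sketch: the agent receives one pointer and has no ranking left to second-guess.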
How trovex handles it
trovex indexes your repo's markdown and answers a query with the single current doc that addresses it — a path:line pointer with a freshness marker — so stale and duplicate copies are skipped, not ranked. Same answers, about 60% fewer tokens per lookup. It runs locally (SQLite + ONNX, no cloud or keys).
Skip the stale copies.
trovex is in private beta — request access and serve your agents the one current doc per query.
Open source (AGPL-3.0 core, MIT CLIs). Local-first — your docs never leave your machine.