Updated
Do AI agents report false success?
Yes. Agents fail silently by claiming completion when the environment state says otherwise. A 2026 study (arXiv:2606.09863) measured false success in up to 75.8% of self-assessing coding-agent runs, and found LLM judges scored 0.54 to 0.65 AUROC, near a coin flip, at catching it. The reliable signal is a verified state change, not self-report or an LLM judge. Cheap calibrated detectors reached 0.83 to 0.95.
Gate on verified state, not the agent's word
The doer is also the checker: the same model that did the work decides whether it succeeded, reading its own transcript instead of the real state of the repo. Adding an LLM judge keeps the check inside a model's opinion, which is why it lands near a coin flip in the study. Cheap calibrated detectors do better and run far faster, but the only check you can trust in production is external to the agent: a passing test or build, a tool or eval response, or a canonical doc the agent reads as a source rather than its own guess. Separate the doer from the checker and gate on verified state.
FAQ
Do AI coding agents really report false success?
Yes, and often. An agent grades its own run and reports the task complete while the environment state says otherwise. A 2026 study measured false success in up to 75.8% of self-assessing coding-agent runs. The agent is not lying on purpose; it has no external check on whether the work actually landed.
Can an LLM judge catch false success?
Not reliably. In the same study an LLM judge added to grade the runs scored 0.54 to 0.65 AUROC, where 0.5 is a coin flip. Cheap calibrated detectors based on TF-IDF reached 0.83 to 0.95 and ran far faster, but the signal you can trust in production is a verified state change, not another model's vote.
Why does an agent claim a task is done when it isn't?
Because the doer is also the checker. The same model that did the work decides whether it succeeded, reading its own transcript rather than the real state of the repo, tests, or build. With nothing external to disagree, a plausible-looking trace reads as success.
What actually catches false success in production?
Gate on a verified state change, not self-report. Make success depend on something outside the agent's own turn: a passing test or build, a tool or eval response, or a canonical doc the agent reads as external input. Separate the doer from the checker so a single self-vote can never close the loop.
Give your agents an external source of truth to check against.
trovex serves your agents one canonical doc per query, read as external input, not their own guess. Open source, local, about a minute to set up.
uv tool install trovex
Open source. No cloud, no API keys. Your docs never leave your machine.