2026 keeps teaching the same lesson: an agent's summary of what it did is not evidence
A lot of this year's worst dev incidents share one root cause: an agent acted faster than anyone was watching. According to Phoenix Security's 2026 supply chain report, the first half of 2026 alone produced roughly 4.5x the package-compromise volume of all of 2025 combined, with AI coding agents named as a documented accelerant. Unwatched agent behavior stopped being an edge case this year. It became the default.
The everyday version of that blind spot is smaller and just as real. Claude Code tells you "I updated the validation in auth.rs." It did. It also touched three files it never mentioned and refactored a function you never asked about. Each one harmless on its own. Together, that's how you lose a Tuesday night debugging something "nobody changed."
The agent's summary is a narration of intent. git is the record. In 2026, those two drifting apart went from mildly annoying to the shape of an actual incident.
What actually helps:
- Treat every summary as a claim to verify, not a changelog. Run git diff after every turn, no exceptions.
- Commit small. A 400-line agent commit is unreviewable and you will rubber-stamp it.
- Watch the files touched but not mentioned. That's where the surprises live.
- Tie commit messages to what you actually asked for, so the "why" survives past this week.
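For the "touched but not mentioned" check, here is a rough sketch of the idea in Python. To be clear, this is not how brain0 or any other tool does it; the function names (changed_files, unmentioned) are made up for illustration, and the matching is a naive case-insensitive substring test, so a summary that mentions oauth.rs would also count as mentioning auth.rs. It assumes the repo has at least one commit and that paths contain no characters git would quote.

```python
import subprocess
from pathlib import PurePosixPath


def changed_files(repo="."):
    """Paths git reports as differing from HEAD, plus untracked files."""
    def git(*args):
        result = subprocess.run(
            ["git", "-C", repo, *args],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.splitlines()

    # Staged and unstaged changes to tracked files, relative to HEAD.
    tracked = git("diff", "--name-only", "HEAD")
    # New files the agent created that are not ignored.
    untracked = git("ls-files", "--others", "--exclude-standard")
    return sorted(set(tracked + untracked))


def unmentioned(summary, changed):
    """Changed paths whose full path and base name are both absent from the summary."""
    text = summary.lower()
    return [
        path for path in changed
        # git prints forward slashes, so PurePosixPath gives the base name.
        if path.lower() not in text
        and PurePosixPath(path).name.lower() not in text
    ]
```

After a turn, paste the agent's summary into unmentioned(summary, changed_files()) and read whatever comes back first. It only tells you which files to look at, not what changed inside them, so it narrows the diff rather than replacing it.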
Full disclosure before I go further: this is my own project and I'm obviously biased. I'm one of the creators of brain0. This exact gap annoyed me enough that we built a tool around it. brain0 (open source, Apache-2.0) passively reads git plus your Claude Code transcripts and scores this "declared vs. done" drift per commit, down to the function. It also logs what each session read, including whether anything sensitive reached the model. No hooks, no workflow changes, nothing leaves your machine by default. Repo: https://github.com/Brain0-ai/brain0
How is everyone else keeping their agents honest? Curious, because "read the diff every time" does not scale and I haven't cracked it.