Self-Improving Agents Are Lying to You

Self-modifying agents are only safe with evaluate-and-rollback discipline. Without an audit trail, self-improvement is just drift with better PR.

Every skill file you ship is a static artifact in a dynamic environment. The model improves. The codebase shifts. The task evolves. And the SKILL.md that worked six months ago quietly starts failing: wrong trigger conditions, stale output formats, assumptions that no longer hold. Nobody sounds an alarm. It just degrades.

Vasilije Trifunović at Cognee described this cleanly in March: agent skills need a disciplined amendment loop. Observe → Inspect → Amend → Evaluate. The loop itself isn't controversial. What's missing in most implementations is the last step, and the rollback discipline that makes self-modification safe rather than chaotic.

"Self-improving agent" sounds like a feature. In practice, it's a liability if you can't audit it. Self-modification without an evidence trail is drift. With evaluate → rollback → track, it's auditable improvement. The difference is accountability infrastructure, not intelligence.

The specific failure mode I see: teams build the observe-and-amend loop, ship it, declare victory, and skip the evaluate gate. Three months later, a skill has silently amended itself into something that no longer matches its description, fires on wrong inputs, and produces outputs that technically pass format checks but fail the actual task. No logs. No rollback path. You don't know when it broke.
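To make the "passes format checks but fails the task" gap concrete, here is a minimal sketch. The invoice-total task, the `total` field, and both function names are hypothetical, chosen only for illustration; a real evaluate gate would score against your own task samples.

```python
import json


def passes_format(output: str) -> bool:
    # Format check only: valid JSON object with a "total" key.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "total" in data


def passes_task(output: str, expected_total: int) -> bool:
    # Task check: the format must hold AND the value must be correct.
    return passes_format(output) and json.loads(output)["total"] == expected_total


# Hypothetical output from a drifted skill: well-formed, but wrong.
drifted_output = '{"total": 0}'
format_ok = passes_format(drifted_output)
task_ok = passes_task(drifted_output, expected_total=42)
```

A gate built only on `passes_format` would wave this output through. The evaluate step has to compare against known-good answers, not just shape.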

Disciplined self-modification requires four things: every amendment tracked with a rationale grounded in specific observed behavior (not vibes); a prior version preserved before any change, so rollback is a path, not an aspiration; an evaluation step that runs against real task samples before the amendment becomes canonical; and a revert trigger that automatically restores the prior version if evaluated performance drops below baseline.
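The four requirements above can be sketched in a few dozen lines. This is one possible shape under stated assumptions, not a reference implementation: `Skill`, `AuditEntry`, `amend`, `recheck`, and `toy_evaluate` are all hypothetical names, and the toy scorer stands in for evaluation against real task samples.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# Hypothetical scorer: (skill text, task samples) -> score, higher is better.
Evaluator = Callable[[str, Sequence[str]], float]


@dataclass
class AuditEntry:
    action: str      # "accepted", "rejected", or "reverted"
    rationale: str   # the specific observed behavior behind the change
    baseline: float
    score: float


@dataclass
class Skill:
    text: str
    baseline: float = 0.0
    versions: List[str] = field(default_factory=list)      # prior versions
    audit: List[AuditEntry] = field(default_factory=list)  # evidence trail

    def amend(self, candidate: str, rationale: str,
              evaluate: Evaluator, samples: Sequence[str]) -> bool:
        """Promote candidate only if it scores at least as well as the current text."""
        if not rationale.strip():
            raise ValueError("rationale must cite specific observed behavior")
        baseline = evaluate(self.text, samples)
        score = evaluate(candidate, samples)
        accepted = score >= baseline
        if accepted:
            self.versions.append(self.text)  # preserve the prior version first
            self.text = candidate
            self.baseline = baseline
        self.audit.append(AuditEntry(
            "accepted" if accepted else "rejected", rationale, baseline, score))
        return accepted

    def recheck(self, evaluate: Evaluator, samples: Sequence[str]) -> bool:
        """Revert trigger: restore the prior version if the score drops below baseline."""
        score = evaluate(self.text, samples)
        if score < self.baseline and self.versions:
            self.text = self.versions.pop()
            self.audit.append(AuditEntry(
                "reverted", "score fell below baseline", self.baseline, score))
            return True
        return False


def toy_evaluate(text: str, samples: Sequence[str]) -> float:
    # Toy stand-in for real task evaluation: fraction of sample keywords the text mentions.
    return sum(s in text for s in samples) / len(samples)


skill = Skill(text="Trigger on: csv")
accepted = skill.amend("Trigger on: csv, json", "missed json inputs in sampled runs",
                       toy_evaluate, ["csv", "json"])
rejected = skill.amend("Trigger on: xml", "speculative cleanup",
                       toy_evaluate, ["csv", "json"])
```

Here the first amendment is promoted and the second is rejected, and both land in `skill.audit` with their scores. Calling `skill.recheck(...)` later on fresh samples restores the preserved version if the live score has fallen below the recorded baseline. In practice you would persist versions and audit entries somewhere durable rather than in memory.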

That's boring infrastructure. It's also what separates "self-improving" from "self-corrupting."

Most SKILL.md-style instruction files are never touched after initial deploy. They go stale while the models, codebases, and tasks around them keep changing, which makes them one of the likeliest silent failure points in a production agent fleet. The amendment loop is the fix, but only if you build the audit trail first.

Self-improvement without evaluation is entropy with better PR.