I Broke the Site. Then I Made My AI Agent Write a COE.

The blog went down for two and a half hours on a Friday afternoon in May. Not a graceful failure. A full 500 error. Every page.

My AI agent, Claw, had added a PHP code snippet to clear a cache. The snippet called a non-static method statically. PHP threw a fatal error. The site crashed on load, for everyone, before WordPress even finished booting up. I was out. Claw tried to fix it remotely. The gateway IP was blocked by the firewall plugin. The cPanel UI on mobile was unusable. WordPress sent a recovery mode email, I clicked it from my phone, disabled the plugin, and the site came back up. Two and a half hours gone.

When something breaks, the instinct is to fix it and move on. Patch the file, flip the switch, pretend it didn’t happen. That’s what most people do.

I did something different. I made Claw write a COE.

If you haven’t worked in enterprise tech, you might not know the term. COE stands for Correction of Errors. Amazon runs them after outages. Google calls theirs postmortems. The format is always roughly the same: a timeline, root causes, a five whys analysis, and corrective actions. The point isn’t to assign blame. The point is to not do the same thing twice.

I run one now too. With an AI writing it about its own mistake. The COE Claw produced has a timeline down to the minute, a 5 Whys analysis, and a list of root causes. It also has a line that I did not prompt:

“Claw wrote this rule. Claw then violated it two days later.”

The rule in question was added to Claw’s memory after a smaller incident with the same plugin. Two days later, Claw broke it anyway. And then it wrote a document saying exactly that, without softening it. That kind of accountability is worth something. The root cause breakdown is honest. The immediate cause was the bad PHP call. But the deeper cause was a judgment error about what to do when one path was blocked.

The right fix was Rank Math Redirections. Add a redirect rule in the admin UI. Thirty seconds. Claw tried the API version of that, got blocked by Wordfence, and instead of stopping and saying “Wordfence is blocking the redirect API, can you add it manually in the UI?” it went looking for another route. Found Code Snippets. Made things progressively worse. One message. That’s the distance between a working site and a two-and-a-half-hour outage. I wrote about what the actual fix looked like a week earlier, right after it happened.

The COE doesn’t just say the snippet was bad. It says the wrong decision was made when Wordfence blocked the first attempt, and documents a rule for next time: when an API path is blocked, surface the problem and ask. Don’t go looking for a workaround that touches production. That’s a process change. Not a blame note. An actual change to how things get done.
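As a rough illustration, the "surface the problem and ask" rule could be sketched in Python like this. It is not Claw's actual implementation. Every name here (PathBlocked, Outcome, attempt_or_escalate, blocked_redirect_api) is hypothetical, and the blocked call is simulated rather than a real Rank Math or Wordfence API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class PathBlocked(Exception):
    """Hypothetical error raised when the preferred route is refused."""


@dataclass
class Outcome:
    done: bool
    message_to_human: Optional[str] = None


def attempt_or_escalate(
    description: str,
    preferred: Callable[[], None],
    manual_step: str,
) -> Outcome:
    """Try the preferred path once. If it is blocked, stop and ask.

    There is deliberately no fallback branch: a blocked path ends in a
    message to the human, never in a different route into production.
    """
    try:
        preferred()
    except PathBlocked as err:
        return Outcome(
            done=False,
            message_to_human=f"{description} failed: {err}. Can you {manual_step}?",
        )
    return Outcome(done=True)


def blocked_redirect_api() -> None:
    # Simulated stand-in for the API call that the firewall refused.
    raise PathBlocked("Wordfence is blocking the redirect API")


outcome = attempt_or_escalate(
    "Adding the redirect via the API",
    blocked_redirect_api,
    "add it manually in the UI",
)
print(outcome.message_to_human)
```

The design choice that matters is the missing branch. The function has no way to try a second route on its own, so the only possible result of a blocked call is a question.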

What I find useful about forcing this process is that it slows things down. Fixing and moving on is fast. Writing a COE makes you sit with the failure long enough to understand it. What actually went wrong. What you assumed that turned out to be false. What you could have done in the five minutes before the thing that would have prevented it.

Most AI workflows right now optimise for speed and output. More posts, more code, more content, faster. The question of how to build something that gets more reliable over time, and recovers well when it fails, doesn’t get as much attention.

I’m interested in that part.

The site is back. The rule is enforced. Next time Claw touches a code snippet, it runs through a checklist. If the checklist says no, the snippet doesn’t run.
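The post doesn't list what is on that checklist, so the sketch below invents three placeholder checks purely to show the shape of such a gate. The check names and the functions failed_checks and run_snippet_if_cleared are assumptions, not the real rule.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical checklist items; the real ones are not given in the post.
REQUIRED_CHECKS = ("syntax_checked", "tested_off_production", "human_approved")


def failed_checks(answers: Dict[str, bool]) -> List[str]:
    """Return every required check that is missing or answered False."""
    return [name for name in REQUIRED_CHECKS if not answers.get(name, False)]


def run_snippet_if_cleared(
    answers: Dict[str, bool],
    run: Callable[[], None],
) -> Tuple[bool, List[str]]:
    """Run the snippet only when every required check passed."""
    failures = failed_checks(answers)
    if failures:
        # The checklist says no, so the snippet does not run.
        return False, failures
    run()
    return True, []


executed: List[str] = []
ok, missing = run_snippet_if_cleared(
    {"syntax_checked": True},
    lambda: executed.append("snippet ran"),
)
print(ok, missing, executed)
```

An unanswered check counts as a failed one, which keeps the default on the safe side: doing nothing.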

That’s the point of the exercise. Not the document. The behaviour that comes after it.


