Today, I needed to work through a substantial project with a lot of drudgery (checking an entire 1M+ LOC codebase for an HTTP API for patterns that could cause state leakage between requests if we made a specific change to the request handling infrastructure). This involved a mix of things that are easy to do programmatically and things that require intelligent judgement, and it had a fairly objective desired artifact (a list of all the places where state could leak, and a failing functional test demonstrating that leakage for each one).
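For concreteness, here's a minimal sketch of the kind of pattern I was hunting for and the kind of failing test that demonstrates it. This is illustrative only: the handler, the names, and the module layout are hypothetical, not taken from the actual codebase. The core shape is mutable module-level state that survives from one request to the next once requests share a long-lived worker.

```python
# Illustrative sketch only -- names and structure are hypothetical, not from the real codebase.

# handler.py -- the kind of pattern that leaks state between requests:
# a module-level dict that is harmless when each request gets a fresh process,
# but leaks once the request handling infrastructure reuses a long-lived worker.
_seen_headers = {}  # module-level mutable state, shared across requests


def handle_request(request_id, headers):
    # Merges this request's headers into the shared dict instead of a per-request copy.
    _seen_headers.update(headers)
    return {"id": request_id, "headers": dict(_seen_headers)}


# test_state_leak.py -- failing functional test demonstrating the leak.
def test_second_request_does_not_see_first_requests_headers():
    first = handle_request("req-1", {"x-user": "alice"})
    second = handle_request("req-2", {"x-trace": "abc123"})
    # Fails: the second response contains "x-user" carried over from the first request.
    assert "x-user" not in second["headers"]


if __name__ == "__main__":
    try:
        test_second_request_does_not_see_first_requests_headers()
    except AssertionError:
        print("state leaked between requests (test fails, as intended)")
```

The judgement-heavy part is that most real instances aren't this obvious: the shared state hides behind caches, lazy initializers, or objects whose lifetime only becomes per-process rather than per-request after the infrastructure change.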
I decided to do the John Henry thing - I set up Claude Code (in a container with --dangerously-skip-permissions) in one worktree with a detailed description of the project and the task, and then in a separate worktree I set off with my favorite text editor and without the help of any AI tooling more advanced than Copilot.
I finished about 4 hours later, despite fairly frequent interruptions to provide clarification and further instructions to Claude. Claude is now approaching the 7-hour / 100M-token mark and still has not finished, though it has declared multiple times now that it has succeeded at the task and that the codebase is safe for this migration (it's not).
I'm honestly pretty shocked, because this task seemed like a pretty much perfect fit for a coding agent, and it's one that doesn't require all that much codebase-specific context. I went into this expecting to lose - I was trying to quantify how much coding agents can help with obnoxious, repetitive maintenance tasks, making it cheaper to do maintenance work that might otherwise have been deferred. But I guess that's not the post I'm writing today (which is a bummer; I had a whole outline planned out and everything).
Likely this situation will change by next year, but for now I suppose the machine cannot yet replace even the more repetitive parts of my job. Perhaps things are different in ML land, but I kind of doubt it.