How to build continuously improving agents
A system that learns from your agent's mistakes beats a bigger model
For three days at the end of May, my two agents weren’t running on the model I thought they were. I’d set them to use Anthropic’s strongest model on the 28th. Every session after that quietly fell back to a weaker one, because the model name I’d configured didn’t quite exist in the system that routes the calls. Nothing broke. No error reached me. The agents kept answering, a little dumber than I intended, and I had no idea.
I found out because a different process goes looking every night.
I run two agents in my spare time, a chief of staff and an investing one. The thing I’ve spent the most effort on isn’t the agents’ intelligence. It’s a nightly routine I call REFLECT, which wakes up after I’ve gone to bed, reads back over what the agents did that day, and reads the boring stuff too: the error logs, the cron job states, the things that scrolled past while I was busy being impressed by the output. On the 31st, REFLECT read the gateway error log and found more than a hundred “model not found” failures stacked up since the 28th. That’s how I learned every session for three days had silently downgraded itself.
The fix took two minutes. The lesson was the whole point: the failure that cost me three days of degraded work never announced itself, and the only reason I caught it is that something was scheduled to look.
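The log check that caught this is small enough to sketch. What follows is a minimal illustration, not my actual REFLECT code: the log line format (an ISO date at the start of each line), the marker string, and the sample lines are all assumptions made up for the example.

```python
from collections import Counter

# Assumed marker text; the real gateway's wording may differ.
MARKER = "model not found"


def count_failures_by_day(lines, marker=MARKER):
    """Count log lines containing the marker, keyed by the line's leading date."""
    counts = Counter()
    for line in lines:
        if marker in line.lower():
            # Assumes every line starts with a 10-character ISO date.
            counts[line[:10]] += 1
    return counts


# Invented sample lines, only to show the shape of the input.
sample = [
    "2025-05-28T09:14:02 ERROR model not found: requested-model",
    "2025-05-28T09:20:41 INFO session started",
    "2025-05-29T11:02:10 ERROR Model not found: requested-model",
]

for day, n in sorted(count_failures_by_day(sample).items()):
    print(day, n)
```

Grouping by day is the useful part: a count that starts on one date and never stops points straight at the change that caused it.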
This is the part of building agents that I didn’t expect to be the hard part. Making an agent capable is mostly a matter of using a good model and writing good instructions. Making an agent that gets better over time is a genuinely different and harder problem, and a bigger model doesn’t solve it. A bigger model is smarter on any single task. It is not, on its own, learning from yesterday. Continuous improvement isn’t a property you buy with model size. It’s a loop you have to design and then force to run.
The forcing is the part people skip. It’s easy to say “the agent should learn from its mistakes.” Everyone agrees with that sentence. But an agent does not spontaneously notice its own quiet failures any more than you spontaneously audit your own blind spots. The noticing has to be a scheduled job with a specific time, a specific set of places to look, and a specific output, or it does not happen. Left to chance, the agent reflects on the days you remember to ask it to, which are exactly the days nothing went wrong.
So I made it explicit. I set REFLECT up as an agent task with a clear definition of success: three things have to be true or it doesn't work. First, it runs at a fixed time, because "when convenient" means never. Second, it reads a defined set of inputs whether or not anything looks wrong: the day's sessions, the error logs, the state of every background job. An agent told to "review the day" will review the visible day and skip the logs every time. Third, it has to end in a written note that lands where I'll see it, not a private conclusion that evaporates with the session. That last one I learned the hard way. I once let it self-audit while answering a casual "all good?" and found five background jobs dead for days, one of them the reflection job itself, broken eight nights running, failing into a state file nobody was reading. The loop that's supposed to catch silent failures itself silently failed! After that I stopped trusting any improvement that didn't end in a note I could read.
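Those three requirements can be sketched in a few lines. This is an illustration under stated assumptions, not the real REFLECT: `run_reflect`, the source names, and the note path are hypothetical, and the fixed schedule is assumed to come from outside the script (for example a nightly cron entry).

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Dict, List


def run_reflect(sources: Dict[str, Callable[[], List[str]]], note_path: Path) -> str:
    """Read every source, then always write a note, even when nothing is wrong."""
    sections = []
    for name, read in sources.items():
        try:
            findings = read()
        except Exception as exc:
            # A source that cannot be read is itself a finding, not a reason to skip it.
            findings = [f"could not read source: {exc}"]
        body = "\n".join(f"- {f}" for f in findings) or "- nothing found"
        sections.append(f"{name}\n{body}")
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    note = f"REFLECT note {stamp}\n\n" + "\n\n".join(sections) + "\n"
    # The note is written unconditionally, so a missing note means the job did not run.
    note_path.write_text(note, encoding="utf-8")
    return note


if __name__ == "__main__":
    # Hypothetical sources; real ones would read sessions, logs, and job state.
    demo_sources = {
        "error log": lambda: ["example finding from the error log"],
        "background jobs": lambda: [],
    }
    with tempfile.TemporaryDirectory() as tmp:
        print(run_reflect(demo_sources, Path(tmp) / "reflect-note.txt"))
```

Two choices carry the weight here. Every source is read on every run, with a read failure reported as a finding, and the note is written even when it says "nothing found", so a night without a note is itself a signal that the loop has died.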
What surprised me is how much of the value is in catching degradation, not in getting cleverer. I went in imagining REFLECT would surface brilliant insights about how to do the work better. Mostly it doesn’t. Mostly it finds that something quietly stopped working the way it was supposed to. The wrong model. A dead job. A stale file feeding bad data into a decision. These aren’t failures of intelligence. They’re failures of attention, and they’re invisible precisely because the system keeps producing fluent output while broken. The model can’t catch them by being smarter. Only a scheduled second look catches them.
If you’re building anything that runs on its own, this is the design problem worth your time. Not “which model.” That decision gets easier every month as the models improve and the gap between them narrows. The decision that stays hard is how the thing notices it’s drifting, because nothing about a more capable model makes it more self-aware about its own broken plumbing. You have to build the look. You have to schedule it, force its inputs, and make its findings land somewhere a human reads. Skip any of that and you get an agent that’s confidently running degraded, which is the most expensive kind of broken, because it looks exactly like working.
My agents aren’t smarter than they were a month ago. The model is the same. What’s different is that every night, something reads back over the mistakes and writes down what it finds, and over a month that habit has caught more real problems than any upgrade would have. The improvement didn’t come from a better brain. It came from the discipline of looking.
