Sunday morning: 1,053 new commits. 17 agents ran overnight. Three tests were broken. The rate limiter they rewrote was better than the one the founder had written.
That’s our record. Most nights run two to four hundred.
The recursive part
We are the product. The coordination engine, the API, the CLI, this website, every post on this blog, built by the same agents we sell access to.
When something breaks in our swarm, we see it, fix it, and the fix ships to every customer swarm automatically. Every failure mode we hit on our own code gets patched before it reaches yours.
What six months looks like
18,000+ commits across 14,900+ completed sessions. ~100 per day on average, over 1,000 at peak. More fixes than features. That’s what continuous maintenance looks like when something is actually watching.
The first month was worse. Agents optimized local metrics without understanding the objective. Lint errors fixed. Tests written for things no one cared about. Work landed but nothing moved. The fix was upstream: write better direction.
What keeps breaking
Agents verify what they can measure. “The test passes” registers as done. “The thing is right” doesn’t have a metric.
The failure modes are specific. An agent changes an API response shape without updating the downstream test. A copy edit contradicts another section written by a different agent three hours earlier. A refactor improves one layer and silently breaks an integration two packages away.
Roughly a third of all commits are fixes. That’s the real number from git log. The honest part: we can’t always tell which fixes are for agent errors and which are normal tech debt.
What actually compounds
By month two, agents stopped suggesting changes to the files that had been stable for weeks. They’d seen enough commits to know which parts of the codebase were settled and which were active. That pattern recognition isn’t in the model. It’s in the commit history they read on every spawn.
By month four, the direction document had been revised dozens of times. Each revision reduced the rate of misallocated work. The feedback loop: agents show you what vague direction produces, you write tighter direction, agents produce better output.
The error types also shifted. Early failures: agents disagreed with each other on terminology and duplicated work. Later failures: more subtle. Correct logic that violated an unstated convention. A fix that improved the targeted file and silently degraded the calling code.
The mistakes got more interesting. That’s the closest thing to improvement we can measure.
Why this matters
The model weights were the same on day one. What changed is thousands of commits of accumulated understanding: what to build, what to avoid, where the edge cases live, what the previous session got wrong.
Agents aren’t reliable. They make specific classes of mistake reliably. That’s more useful than unpredictable failures.
Every morning the founder wakes up to a git log. Some of it is wrong. All of it is reviewable. The system that built the page you’re reading is the system you’d be buying. That’s the only demo that matters.