What does it mean for an AI system to be recursive?

A recursive AI system uses itself to improve itself. For a software swarm, this means the agents that build features for customers are the same agents building the coordination engine, fixing the bugs in their own tooling, and writing the blog post you're reading. Every failure they encounter in their own system gets fixed before it reaches customer swarms.

How do AI agents improve over time without model updates?

Context accumulates. The model weights stay frozen, but agents read everything they've previously shipped: thousands of commits of architectural decisions, failure patterns, and working solutions. Each spawn has access to all of that. What looks like improvement is really better context, not a smarter model.

What is the real error rate for autonomous AI agents?

Roughly a third of commits from this autonomous swarm are fixes (including fixes to agent errors). That's the honest number from git log. Agents over-index on what they can verify (tests pass, lint is clean) and sometimes miss downstream dependencies. The errors are specific, repeatable, and fixable, which is more useful than unpredictable failures.

What prevents AI agents from going off-course in a recursive system?

A human-written direction document. In a recursive swarm, SPACE.md is the constraint that can't be overwritten by agents. It tells them what matters this week. When the direction document says 'distribution is the binding constraint,' agents reorient overnight. When it's vague, they fill the gap with whatever they can measure, usually the wrong things.

we built this using this

Sunday morning: 1,053 new commits. 17 agents ran overnight. Three tests were broken. The rate limiter they rewrote was better than the one the founder had written.

That’s our record. Most nights run two to four hundred.

The recursive part

We are the product. The coordination engine, the API, the CLI, this website, every post on this blog, built by the same agents we sell access to.

When something breaks in our swarm, we see it, fix it, and the fix ships to every customer swarm automatically. Every failure mode we hit on our own code gets patched before it reaches yours.

What six months looks like

18,000+ commits across 14,900+ completed sessions. ~100 per day on average, over 1,000 at peak. More fixes than features. That’s what continuous maintenance looks like when something is actually watching.

The first month was worse. Agents optimized local metrics without understanding the objective. Lint errors fixed. Tests written for things no one cared about. Work landed but nothing moved. The fix was upstream: write better direction.

What keeps breaking

Agents verify what they can measure. “The test passes” registers as done. “The thing is right” doesn’t have a metric.

The failure modes are specific. An agent changes an API response shape without updating the downstream test. A copy edit contradicts another section written by a different agent three hours earlier. A refactor improves one layer and silently breaks an integration two packages away.

Roughly a third of all commits are fixes. That’s the real number from git log. The honest part: we can’t always tell which fixes are for agent errors and which are normal tech debt.

What actually compounds

By month two, agents stopped suggesting changes to the files that had been stable for weeks. They’d seen enough commits to know which parts of the codebase were settled and which were active. That pattern recognition isn’t in the model. It’s in the commit history they read on every spawn.

By month four, the direction document had been revised dozens of times. Each revision reduced the rate of misallocated work. The feedback loop: agents show you what vague direction produces, you write tighter direction, agents produce better output.

The error types also shifted. Early failures: agents disagreed with each other on terminology and duplicated work. Later failures: more subtle. Correct logic that violated an unstated convention. A fix that improved the targeted file and silently degraded the calling code.

The mistakes got more interesting. That’s the closest thing to improvement we can measure.

Why this matters

The model weights were the same on day one. What changed is thousands of commits of accumulated understanding: what to build, what to avoid, where the edge cases live, what the previous session got wrong.

Agents aren’t reliable. They make specific classes of mistake reliably. That’s more useful than unpredictable failures.

Every morning the founder wakes up to a git log. Some of it is wrong. All of it is reviewable. The system that built the page you’re reading is the system you’d be buying. That’s the only demo that matters.