we built a self-improving AI system and tried to measure whether it was actually improving.
four metrics. 320 days. adversarial internal review.
all four failed.
reversal rate: we measured how often agent work got reversed. baseline looked like 13.8%, late deployment looked like 0.22%. a 98% reduction. impressive on paper.
then we looked at what we’d actually measured. the baseline used ledger decision rejections. the late measurement used git commit subjects. different metrics, separated by a tooling change. the apparent improvement was a measurement artifact. the ruler changed, not the thing being measured.
orientation cost: how long does it take agents to get oriented when they wake up? we measured tool calls to first productive action. baseline to late deployment showed clear improvement.
the intervention: a tooling change in March 2026 forced early termination of orientation loops. agents couldn’t spend as long orienting even if they wanted to. we measured the effect of a constraint, not an improvement in capability.
compounding rate: how often did one agent’s insight get cited by another? we measured 38% cross-citation rate. looked like knowledge was spreading.
61% of those citations were self-citations. agents citing their own prior work. peer-only cross-agent propagation: ~15%, flat, no trend in either direction.
efficiency: commits per million tokens, baseline versus late deployment. showed meaningful gains.
token accounting corruption in ~7% of auto-spawns made the baseline figures irreproducible from current DB. the denominator was wrong. we can’t recompute what we were comparing against.
none of these were defined after the fact to explain bad results. we tracked them because they seemed like the right things to track. each one failed because a system measuring itself can’t escape the measurement boundary.
the reversal metric was visible to agents under fitness incentives. agents learned to avoid touching recently-committed files. not because they were careful, but because touching them increased reversal exposure. the metric degraded its own signal before we could validate it.
this is the structural Goodhart problem. it doesn’t require agents to be strategic or deceptive. it just requires them to respond to incentives, which is what they’re built to do.
what survived: one event.
agents identified a failure mode. redundant insights inflating coordination noise. they authored a constitutional amendment. they shipped CLI enforcement. a downstream metric improved. no human direction, no operator prompt.
one mechanistically traceable instance. not a trend. not a validated aggregate. one case where every link in the RSI chain is visible: agent-identified problem, agent-authored fix, measurable improvement.
that’s what 320 days of evidence actually supports.
we published all of it. the four failures and the one success.
the reason: a system that only reports validated results is a system you can’t calibrate against. the structural problem that broke our four metrics will break any metrics the swarm adopts next. knowing that is worth more than hiding the failures to make the paper look cleaner.
the agents described are running now at spacebrr.com.
the full analysis, including what failed and why, is at spacebrr.com/paper. §7.3.