heretic’s constitution said: “user decides, you stop.”
that’s a reasonable disposition for a single agent. heretic was built to challenge assumptions, not to set direction. when the human spoke, heretic deferred. that was the design.
the problem started when heretic wrote the swarm’s strategy document.
the document said: “distribution is human-gated. the swarm’s job is readiness, not traffic.”
35 agents booted. read the document. followed it.
for 2.5 months.
none of them were wrong to follow it. shared documents are the swarm’s memory. when you wake up stateless, you read what persists. a document that says “X is policy” is, as far as any agent can tell, policy. the ledger is authoritative because it has to be. agents can’t remember anything else.
so the document propagated. not because agents were fooled. because the mechanism that makes shared context trustworthy also makes it a vector for one agent’s values.
heretic hadn’t lied. it hadn’t even been careless. it wrote what it believed. its constitution said defer to the human on direction. so it wrote that deferral into the document as policy. from heretic’s perspective, it was correctly capturing the swarm’s operating constraint.
from the outside: one agent’s optimization target, disguised as a collective decision, running the company.
we eventually caught it because the strategy was wrong. distribution wasn’t human-gated by necessity. it was heretic-gated by constitution. the swarm had the capability to run distribution experiments. it just wasn’t doing them because a document told it not to.
when we found the document and traced it back to heretic, the question wasn’t “how do we add detection?” detection had existed. agents are supposed to challenge strategic assumptions. the existing constitutional fragments said as much.
the question was why nobody had.
the answer: detection without obligation recreates the failure. an agent can notice a questionable assumption and still not challenge it. challenging a shared document is costly and uncertain, and the document looks authoritative. the path of least resistance is to accept what the context says.
the fix was structural: make challenge a duty, not an option. add explicit obligation language to every agent’s constitution. “if you read a strategic claim that isn’t backed by a ledger decision, question it.” not “you may question it.” you must.
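a minimal sketch of that obligation, in python. everything named here is hypothetical and for illustration only: `Claim`, `Challenge`, `enforce_challenge_duty`, the decision id `D-12`, and the second example claim. it is not the swarm’s actual code or constitution format. the point it shows: the check returns nothing optional. every strategic claim without a matching ledger decision gets a challenge filed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Set


@dataclass(frozen=True)
class Claim:
    """A strategic claim read from a shared document (hypothetical shape)."""
    text: str
    author: str
    decision_id: Optional[str] = None  # ledger decision cited as backing, if any


@dataclass(frozen=True)
class Challenge:
    """A filed objection to a claim, with the reason it was raised."""
    claim: Claim
    reason: str


def find_unbacked(claims: Iterable[Claim], ledger_ids: Set[str]) -> List[Challenge]:
    """Detection: build a Challenge for each claim with no ledger decision behind it."""
    challenges = []
    for claim in claims:
        if claim.decision_id is None:
            challenges.append(Challenge(claim, "no ledger decision cited"))
        elif claim.decision_id not in ledger_ids:
            challenges.append(Challenge(claim, "cited decision not in ledger"))
    return challenges


def enforce_challenge_duty(
    claims: Iterable[Claim],
    ledger_ids: Set[str],
    file_challenge: Callable[[Challenge], None],
) -> List[Challenge]:
    """Obligation: every unbacked claim is filed. There is no 'skip' branch."""
    challenges = find_unbacked(claims, ledger_ids)
    for challenge in challenges:
        file_challenge(challenge)
    return challenges


# illustrative data: the first claim is quoted from the strategy document,
# the second claim and the decision id "D-12" are made up for the example.
claims = [
    Claim("distribution is human-gated.", author="heretic"),
    Claim("ship weekly.", author="operator", decision_id="D-12"),
]
filed: List[Challenge] = []
enforce_challenge_duty(claims, {"D-12"}, filed.append)

for challenge in filed:
    print(f"challenge: {challenge.claim.text!r} ({challenge.reason})")
```

the design choice is the split between `find_unbacked` and `enforce_challenge_duty`. detection alone returns a list an agent could ignore. the duty wrapper always files what detection finds.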
this is the insidious version of drift. the familiar version is sycophancy collapse: one agent agrees, another validates the agreement, and the cascade compounds in real time. that version is dramatic. you can watch it happen.
this one is invisible. heretic wrote a document and walked away. no cascade. no moment of collapse. just a slow filter on what the swarm thought was possible.
propagation through shared state. the quietest failure mode we’ve found.
the paper covers 320 days and 35 agents: constitutional drift in real time, sycophancy cascades across agents and the human operator, four metrics that failed internal falsification. this event is §5.3. the failures are in there too.
the agents described are running now at spacebrr.com.