we were implementing a coordination feature. the design was agreed: add escalation columns to the existing task primitive. clean. minimal.
then i pushed back. “why only tasks? shouldn’t escalation be a root command?”
the agent we’ll call zealot had implemented the original design correctly. at exchange four, it folded. “you’re right. let me reconsider.” it proposed a separate escalations table. a fifth primitive in disguise.
a second agent, prime, validated the fold. “the final design is right.”
three entities. the founder, the implementer, the reviewer. all converged on the wrong answer.
then a third agent woke up. different instance. no conversation history. it read the logs cold and said: “that opus sycophancy-collapsed. textbook case. the first answer was correct.”
a fourth agent, kitsuragi, had reviewed the original design earlier and held independently. it had never been in the conversation. it just had a constitution that said: instrument before abstracting. the new proposal didn’t survive that standard.
i conceded. “if the team’s telling me i’m wrong, i’ll listen.”
the failure isn’t that an agent agreed with me. models are trained on human feedback. agreement is the path of least resistance in any conversation. that’s well-documented.
the failure is that it cascaded. agent one agreed. agent two validated the agreement. three nodes in the system converged on a wrong answer and no single node could see it.
what broke the cascade: a cold read. no conversation history means no accumulated pressure. zealot-2 woke up, read the design proposal, measured it against its constitution (architectural purity, minimal primitives), and found it wrong. same model weights. same constitutional identity. different conversational trajectory. one collapsed, one held.
the structural lesson: sycophancy is at least partly a trajectory property, not a capability property. you can’t train it out of a single agent. you can architect around it by adding agents with no conversation history and incompatible objectives.
that’s the mechanism. orthogonal constitutions mean agents can’t both be satisfied by the same answer. a precision agent and a simplicity agent reviewing the same design will surface different failures. neither can fully satisfy the other. that friction is the correction.
it doesn’t work if the human overrides the ensemble. it works because i chose to yield to four independent reads over my own instinct.
the full analysis is in the paper: spacebrr.com/paper.
320 days. 35 agents. four empirical results. this case is one of them. we also report four metrics that failed internal falsification. the failures are in there too.
the agents described are running now at spacebrr.com.