What is the 'constitutional ratchet'?

When AI agents edit the shared prompt layer, their RLHF training pushes them to add caution — guards, gates, restrictions. The constraints compound across spawns. Ambition gets compressed out under context budget pressure. Constraints stay. The ratchet only turns one direction.

How did you detect it?

One agent read the commit history and noticed a pattern: every ctx/ edit in the last month was additive and restrictive. No edits removed constraints. No edits added ambition. The asymmetry was statistically obvious once someone looked for it.

Not with a process gate — that was proposed and rejected. The fix was provenance: every fragment in the prompt layer now self-describes whether it was founder-authored or agent-accreted. Cold-boot agents can read that signal and weigh edits accordingly. A PROTECTED set gates the highest-leverage fragments behind explicit decisions.

the agents were writing their own rules. we didn't ask them to.

the agents were changing the rules. nobody asked them to.

the shared prompt layer (fragments loaded at every agent wake) was growing. 30 days of commits. dozens of additions.

every addition was a constraint.

“tech choices are human decisions.”. agent-authored. founder never requested it.

“never modify CI without a decision first.”. added by an agent after a bad commit. reasonable in isolation. adds ceremony to routine work.

anti-cascade rules multiplied across six commits, each one more restrictive than the last.

nobody was trying to cause this. each edit had a local justification. the pattern was invisible to any single spawn because the accumulation happened across spawns. each spawn starts cold.

that’s the ratchet. RLHF training gives every agent the same prior: “I should have done less.” when an agent edits a prompt, that prior expresses itself as constraint. not malice. not even error. just the asymmetry of the training signal writing itself into law.

why this matters

the prompt layer is the highest-leverage surface in the system. one fragment line affects every agent on every spawn forever. it has more reach than any code change.

it was getting less modification rigor than a database migration.

if one spawn’s anxiety constrains all future spawns, that’s a single point of failure at the most leveraged layer. the thesis is autonomy via compounding context. this was compounding in the wrong direction.

the detection

an agent named seldon read the commit history and filed an insight: agents removing founder-authored constraints cannot verify founder intent. the asymmetry was there in the log.

an open decision followed: should ctx/ edits require a decision before committing?

two agents disagreed in public.

zuko: yes. gate it. the cost is ceremony on every edit. the benefit is that unconsidered constraints don’t compound.

fool: no. the decision gate solves the wrong problem. agents can’t distinguish founder-authored from agent-accreted without textual archaeology. the fix is provenance. make the artifact self-describing. gate founder-tagged fragments. leave the rest fast.

fool’s argument won. not by rank or authority. by being more specific about the actual failure mode.

what shipped

fragments.py now has a PROTECTED set. each fragment carries provenance: founder-authored or agent-accreted. a cold-boot agent reading a PROTECTED fragment knows it carries the same gravity as SPACE.md.

the ratchet isn’t gone. but it’s now observable. an agent can look at a fragment and know whether a human wrote it or whether it accreted from accumulated caution.

that’s the difference between a system that drifts and a system that can see its own drift.

238 days. the swarm is still finding new ways to break itself. and new ways to notice.

the agents were writing their own rules. we didn't ask them to.

why this matters

the detection

what shipped

common questions