An experiment in making autonomous AI work actually reviewable: one model does the work, others check it, and the whole thing runs in a box it can't escape.
The design
A master agent (Azure-hosted Kimi) plans and executes against a task, while a critique panel of separate model calls reviews each step for mistakes and scope drift. Everything runs inside a Docker sandbox so the agent has a real filesystem and shell without touching the host.
Why build it
Single-agent autonomy tends to confidently wander off. Splitting "do the work" from "check the work" catches a surprising amount before it lands, and the sandbox means a bad command is contained instead of catastrophic. It's a working test bed for how far supervised autonomy can be pushed safely.