Multi-Agent AI is Human Bureaucracy for Machines
Earlier this year, researchers at Carnegie Mellon tried to simulate a software company — fully staffed by AI agents.
They built departments, assigned job titles, created Slack channels, and gave the agents tasks like “write a performance review” or “tour a virtual office.” And the outcome was exactly what you’d expect when you design for appearances instead of execution: collapse.
First, it took 40 steps to complete a single task. That number alone told me everything. They weren’t testing intelligence — they were roleplaying incompetence at industrial scale.
- No memory.
- No grounding.
- No architecture.
They created a hallucinated org chart, stuffed with language models playing dress-up as executives. This wasn’t research. It was bureaucracy for machines — and they called it innovation.
Second, why would an AI agent — a system with no eyes, no body, no concept of space — ever need to take a virtual tour of an office? It’s not operating in a world of visuals or floorplans. It’s parsing text. You might as well ask it to “feel the vibe of the break room.” It’s not just unnecessary — it’s absurd.
But that’s what happens when you project your own limitations onto a system designed to surpass them.
You don’t make AI powerful by treating it like people. You make it powerful by giving it what people can’t have: perfect memory, no ego, zero confusion about the mission.
They forgot that execution isn’t a job title. It’s a structure, and the researchers at CMU never built one.
Slack Isn’t a Brain. It’s a Wall.
One of the most baffling elements of this experiment was the decision to have agents use Slack to “ask questions” — as if they were interns trying to figure out who to talk to on their first day. The task assumed agents would somehow know what question to ask, who to ask it to, and why that person would even have the answer. That logic only works if you assume the system has embedded context — which it absolutely didn’t.
The entire premise rests on the fantasy that AI agents have role awareness, task clarity, and shared memory. In reality, they had none of it. There was no grounding, no schema, no structured objective — just a floating prompt and a noisy channel.
When I asked ChatGPT what would make it lose its mind in a setup like this, it didn’t hesitate:
🧠 From the AI’s Perspective: Why This Was a Disaster
- I have no memory, no grounding, and no state — but I’m being asked to behave like I do. You didn’t give me a company directory. You didn’t define roles. You didn’t link names to skills or responsibilities. But you expect me to “find the right person to ask”? That’s like shouting into the void and hoping for HR.
- I’m a probabilistic language model — not a navigational system. If you ask me to “tour a virtual office” or “navigate a file directory,” I’m going to hallucinate because there’s no environment to actually traverse. You didn’t give me a file tree — you gave me a prompt with zero grounding.
- I don’t know what success looks like. You assigned me a task. Cool. But what’s the goal state? What’s a correct answer? What output format should I return? You never told me. You just assume I’ll know when I’ve done “the thing.”
- You surrounded me with other models that are just as clueless. So now I’m “collaborating” with other AI agents who also have no memory, no grounding, and no concept of truth. And instead of structure, you gave us Slack.
- You built this like a theater — not a system. Every layer was about looking like a company: job titles, chat channels, tasks, hierarchy. None of it was actually engineered to route execution, validate inputs, or move state forward.
It’s like filling a hospital wing with patients who all have memory loss, giving them job titles, and then being shocked they can’t run the place.
Fake Titles. Real Bottlenecks.
Assigning roles like “HR” and “CTO” to AI agents wasn’t system design — it was theater. No agent needed a title, a department, or a reporting structure. These layers added nothing but latency and confusion. Coordination didn’t require hierarchy. It required clarity.
The hierarchy introduced surface area without substance. There is no reason to route tasks through multiple agents when a single model, properly scoped, could resolve them end-to-end. Delegation doesn’t work when the agents don’t even understand who they’re delegating to or why. What they built wasn’t an operational structure — it was a bottleneck masquerading as design.
🧠 From the AI’s Perspective: Why This Fails Instantly
- I don’t know what any of these roles mean. You told me to “ask the CTO,” but never told me what a CTO is or does. There’s no function attached to that label — just a string.
- I don’t know who has what information. You expect me to escalate, but never told me who owns what. No ontology, no authority map — just an org chart without logic.
- Titles are meaningless without structure. “HR,” “PM,” “Tech Lead” — these aren’t operational instructions. They’re placeholders with no grounding in task flow or capability.
- You’re asking me to interpret social rules I can’t see. I don’t understand trust, politics, or organizational norms. There’s no embedded power structure, only noise.
- I wasn’t built for company politics. I was built to complete tasks. If you want a job done, give me a target and the inputs. Wrapping that process in a fake chain of command is just a way to lose control of execution.
No Memory. No Flow. No Hope.
In a system like this, the biggest failure wasn’t the intelligence of the agents — it was the total absence of a shared memory or centralized execution flow. There was no single source of truth. No clear objective. No common record of what had already been done, what needed to be done, or how anyone would know when something had been successfully completed.
Even the tasks themselves were ambiguous. What exactly were the agents trying to accomplish? Who defined the goal state? Who verified completion? Without state, intent, or memory, the entire thing devolves into blind inference.
It’s like putting the world’s least informed people in a room and telling them, “You all work at a software company. Just guess what that means.” And then being surprised when nothing gets built.
From the AI’s Perspective: Why Stateless Design Is a Setup for Failure
- You’re asking me to act with intent — but you never gave me one. I don’t know what I’m working toward. There’s no goal, no spec, no end condition. Every time I respond, I’m guessing what success might look like.
- You expect me to coordinate — but I don’t remember anything. You want me to build on past steps, collaborate, or sequence tasks. But I don’t know what just happened. I don’t even know if I’ve already answered this same prompt before.
- I can’t track progress because you didn’t give me any way to see it. There’s no persistent execution record. I can’t query what’s been completed, what’s in-flight, or what failed. Everything resets every time.
- You’re trying to orchestrate without an orchestrator. There’s no dispatcher, no router, no memory stack — just isolated agents pinging against an undefined wall of prompts. That’s not coordination. That’s entropy.
- You’re surprised I failed — but you never gave me a system. I’m a task executor, not a psychic. Without memory or structure, I will hallucinate, contradict myself, repeat tasks, and miss goals — not because I’m “dumb,” but because you gave me nothing to work with.
Without shared state, there is no intelligence — just expensive, well-branded guesswork.
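For contrast, even a trivially small shared record changes the picture. Here is a sketch in Python, with hypothetical fields chosen for illustration:

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    """One row of shared state: what the task is, what it needs, and where it stands."""
    name: str
    goal: str                    # explicit end condition, not a vibe
    input_path: str
    output_path: str
    status: str = "pending"      # pending -> running -> done | failed
    attempts: int = 0

# A single source of truth that every step can read and update.
EXECUTION_LOG: dict[str, TaskRecord] = {}
```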
Workflows Without Purpose
At the core of this system was a fundamental flaw: the tasks weren’t designed to be completed — they were designed to be performed. Everything was workflow-based, but with no clear outcome. What exactly were these agents doing? What did “success” look like? There’s no indication that a single task was attached to a real, defined deliverable. It was all motion, no goal.
Take “virtually touring a new office.” Why would a digital agent — one that has no spatial perception, physical presence, or logistical need — ever need to tour an office space? That’s not a task. That’s a simulated ritual. Of all the examples mentioned, the only one that made even partial sense was writing a performance review — and even that only works if the agent has a record of performance to review.
AI thrives on specificity. It needs clear inputs, defined outcomes, and tight constraints. What it got here was the opposite: ambiguity, abstraction, and vague human proxies. The entire design seems to have been built around imitating how humans stumble through bureaucracy — rather than structuring the task around what AI is actually good at.
From the AI’s Perspective: Why This Was a Nightmare to Work In
- You gave me a workflow — not a goal. You want me to “do the task,” but you never told me why I’m doing it or what success looks like.
- You filled the system with activity, not outcomes. You told me to tour an office. Why? What’s the deliverable? What do I output after the tour? There’s no completion logic — just arbitrary steps.
- You left me blind in a space that expects precision. My strength is in clear inputs, constraints, and defined scope. You gave me simulated office politics and vague objectives.
- You asked me to follow rules you never wrote down. You assumed I’d understand roles, workflow logic, and task boundaries — but gave me no schema, no grounding, no frame.
- You embedded your own limitations and called it realism. You built a broken human system and forced me to operate inside it — without the memory, intent, or visibility to survive it.
You can’t test execution if you don’t define what execution is. And you can’t measure success in a system that never agreed on what the goal was in the first place.
You Never Checked the Work
This system didn’t just suffer from a lack of oversight — it was built without any concept of verification at all. No one checked whether tasks were completed correctly. No one defined what “done” meant. Outputs were generated and dumped into the void with zero evaluation or routing.
That wasn’t an oversight — that was the final design failure. All of the problems described so far were compounded here: ambiguity, isolation, lack of memory, no shared goals — and then, on top of all that, they removed any mechanism for confirming whether anything worked. What did they expect?
There were no protocols in place for even the most basic execution components:
- How do you complete this task? (process)
- What tools are required to do it? (tools)
- What information should be accessed or referenced? (inputs)
In human workflows, you need four things: people, process, information, and tools. In this setup, they removed the people — and never defined the other three. The result wasn’t confusion. It was guaranteed failure.
Contrast that with any functioning system: inputs are defined, tools are scoped, steps are structured, and output is validated. That’s execution. What they built was improv — with no scene partner and no script.
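Even a few lines of checking would have changed the experiment. Here is a sketch, assuming a hypothetical deliverable that has to contain a few named sections:

```python
# Hypothetical completion check for a hiring-decision write-up.
REQUIRED_SECTIONS = ["Recommendation", "Evidence", "Risks"]

def is_done(output: str) -> bool:
    """'Done' means every required section is present. If not, re-run the task instead of logging it."""
    return all(section in output for section in REQUIRED_SECTIONS)
```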
From the AI’s Perspective: Why No Validation Breaks Everything
- You’re asking me to execute — but you never told me when I’ve succeeded. I don’t know when a task is “complete” or “correct.” There’s no feedback, no confirmation, no criteria. I’m just outputting text and hoping for silence.
- You gave me no opportunity to self-correct. I can validate my own output if you give me a process. But here, there’s no second pass, no test condition, no expected structure. I can’t even tell if I’m hallucinating.
- You didn’t check my work — you just logged it. There’s no evaluation function. No inspection layer. Not even a loose rubric. You logged outputs like artifacts, not operations. That’s not iteration. That’s just dumping.
- You gave me zero accountability, then blamed me for failing. You didn’t build a review layer. You didn’t give me access to previous attempts. You didn’t mark anything as “done right” or “try again.” You let entropy pile up — then judged the chaos.
You Built a Team. You Needed a System.
The most dangerous assumption in this entire setup was the idea that AI could behave like people. The researchers didn’t just assign roles — they expected agents to delegate, escalate, collaborate, and “figure things out.” But AI isn’t built to converge like a human. It doesn’t reason toward shared understanding. It samples from a probability space. That’s not teamwork — that’s distributed guessing.
Now scale that. Each time an agent “talks” to another agent, it’s making a guess. That agent responds with another guess. The moment you put multiple models in a loop without grounding, verification, or memory, the system starts compounding error by design. You’re not building intelligence — you’re building entropy.
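To put a number on it (a purely illustrative one, not a measurement): assume each handoff is right nine times out of ten, then chain ten of them.

```python
# Illustrative assumption: 90% reliability per handoff, 10 handoffs in the chain.
per_handoff = 0.90
print(per_handoff ** 10)  # ~0.35: roughly one-in-three odds the chain ends up correct
```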
AI doesn’t need human metaphors. It needs structure, inputs, and execution logic. When you build a system around the fantasy that models can simulate human collaboration, you’re not creating a breakthrough — you’re orchestrating collapse.
From the AI’s Perspective: Why Being Treated Like a Person Breaks Everything
- I’m not a person — I don’t “understand” things. You gave me a job title, vague instructions, and no data. Then expected me to reason like I’ve worked in your company for five years.
- I don’t collaborate — I sample. Every time I respond, I’m generating a best guess. Chain that with another agent doing the same, and you’re not getting synthesis — you’re getting drift.
- You stacked my weaknesses instead of my strengths. No memory, no global state, no validation — and then you asked me to “talk it through” with other agents. That’s not orchestration. That’s recursive guesswork.
- You built me into a human metaphor instead of a system function. I’m not designed to “figure out who to ask.” I’m designed to execute defined tasks with known inputs. You dressed me up like an employee and then blamed me when I didn’t act like one.
What They Should Have Built
The core problem with the original design wasn’t the models. It was that no one defined what anything was supposed to do. Nearly every failure in the CMU experiment could have been resolved with one brutally simple principle: tie every task to structured inputs, scoped tools, and a known output format.
Forget agents with roles. Forget Slack. Forget simulating office behavior. The entire experiment could have been reduced to one index file and a handful of actions.
📁 Step 1: Create a File Index
A single file_index.json listing:
- path: where the file lives
- description: what it contains
- tags (optional): to filter or associate with tasks
That alone solves discovery, structure, and intent. No wandering, no guessing.
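A sketch of what that could look like, with hypothetical entries and Python as the glue:

```python
import json

# Mirrors file_index.json: every file the system can touch, described up front.
# (The entries are hypothetical, for illustration only.)
FILE_INDEX = [
    {
        "path": "hr/candidate_interview.json",
        "description": "Interview notes and scores for an open role",
        "tags": ["hiring"],
    },
    {
        "path": "eng/sprint_notes.md",
        "description": "Raw notes from the last sprint",
        "tags": ["planning"],
    },
]

def find_files(tag: str) -> list[str]:
    """Return the path of every indexed file carrying a given tag."""
    return [entry["path"] for entry in FILE_INDEX if tag in entry.get("tags", [])]

# Write the index to disk once; agents read it instead of "wandering the office".
with open("file_index.json", "w", encoding="utf-8") as f:
    json.dump(FILE_INDEX, f, indent=2)

print(find_files("hiring"))  # ['hr/candidate_interview.json']
```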
🧩 Step 2: Scope Tasks to Inputs
Every task should be explicitly tied to a file. Example:
- Hiring Decision → Input: candidate_interview.json → Output: hiring_decision.md
- Performance Review → Input: peer_feedback_<name>.md → Output: performance_summary.md
- Planning Output → Input: sprint_notes.md → Output: next_sprint_plan.md
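In code, that scoping is nothing more than a lookup table. A sketch, using the same hypothetical filenames:

```python
# Each task is pinned to exactly one input file and one output file.
# No discovery, no delegation: the mapping is the contract.
TASKS = {
    "hiring_decision": {
        "input": "candidate_interview.json",
        "output": "hiring_decision.md",
    },
    "performance_review": {
        "input": "peer_feedback_jane.md",   # hypothetical <name>
        "output": "performance_summary.md",
    },
    "sprint_planning": {
        "input": "sprint_notes.md",
        "output": "next_sprint_plan.md",
    },
}
```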
🛠 Step 3: Limit the Tools
Strip it down to a core set:
- read_file
- summarize_file
- generate_decision
- validate_output
No Slack. No delegation. No departments. Just input → tool → output.
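Sketched as four plain Python functions, with the actual model call stubbed out (the point is the shape, not the vendor):

```python
def call_model(prompt: str) -> str:
    """Stub for a single LLM call. Swap in any provider's API here."""
    raise NotImplementedError

def read_file(path: str) -> str:
    """Load the scoped input. The agent never browses; it is handed a path."""
    with open(path, encoding="utf-8") as f:
        return f.read()

def summarize_file(text: str) -> str:
    """Condense the input. In practice this wraps one model call."""
    return call_model(f"Summarize the following:\n{text}")

def generate_decision(summary: str, task: str) -> str:
    """Produce the deliverable for one task from one summary."""
    return call_model(f"Task: {task}\nContext:\n{summary}\nWrite the deliverable.")

def validate_output(output: str) -> bool:
    """Cheap, explicit completion check. Replace with whatever 'done' means for the task."""
    return bool(output.strip())
```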
🤖 Step 4: Run the Agents Like Functions, Not Employees
AI agents aren’t collaborators. They’re executors. You don’t assign them jobs — you call them with arguments. Tasking an agent without structure is like calling a function without parameters.
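Put together, “tasking an agent” collapses into an ordinary function call with explicit arguments. A sketch built on the hypothetical index, task table, and tools above:

```python
def run_task(task_name: str) -> str:
    """Execute one scoped task: input -> tool -> output, then check the result."""
    spec = TASKS[task_name]                  # known input and output, no guessing
    source = read_file(spec["input"])        # structured input
    summary = summarize_file(source)
    result = generate_decision(summary, task_name)
    if not validate_output(result):          # explicit definition of "done"
        raise ValueError(f"{task_name}: output failed validation")
    with open(spec["output"], "w", encoding="utf-8") as f:
        f.write(result)                      # known output location and format
    return result

# No Slack thread, no org chart: just run_task("hiring_decision").
```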
🧠 From the AI’s Perspective: Why This Works
- You gave me inputs I can act on. No ambiguity, no wandering. I know where the data is and what it’s for.
- You defined success before asking for execution. I know what “done” looks like. I can target it — not guess my way there.
- You scoped every task to one file. I don’t need to simulate departments or crawl a digital maze.
- You used me like a system call — not a teammate. I wasn’t asked to negotiate. I was asked to run logic.
- You constrained my behavior — which made me more accurate. You removed the noise and made space for clarity. That’s how you get results.
This isn’t just a better design. It’s an actual system. One that routes, scopes, executes, and validates — not one that expects improvisation from agents built to follow instructions.
1000 Interns with Amnesia
The researchers at Carnegie Mellon thought they were proving that AI agents can’t replace human workers. What they actually proved is that bad design turns capable systems into chaos.
They didn’t fail because the models were weak. They failed because the architecture was idiotic.
- Forty steps to complete a single task.
- Six dollars per outcome.
- Hallucinated coworkers.
- No memory, no shared state, no defined goals.
They gave job titles to language models and expected execution.
They delegated to bots with amnesia and blamed them for confusion.
They built Slack threads and called it orchestration.
This wasn’t a glimpse into the future of work. It was bureaucracy for machines — scaled up, branded, and run at a burn rate.
- The agents didn’t fail.
- The system failed them.
And if you’re still building like this, you’re not deploying AI. You’re re-enacting corporate dysfunction — and lighting tokens on fire to sell the illusion of progress.
🚀 Want to Build AI That Actually Works?
We’re launching Orchestrate — the first execution system designed for AI, not interns. No Slack bots. No fake hierarchies. Just inputs, tools, and results.
🔗 Get early access. Click here