THE SIGNAL
You've been using AI wrong. One agent, one chat, one task at a time. That's a bicycle. What OpenAI and Stripe just showed us is a factory floor.
OpenAI shipped an internal product with zero human-written code. A million lines, 1,500 pull requests, all from agents. Stripe's "Minions" merge over a thousand PRs per week. Unattended. While engineers sleep.
The trick isn't a smarter model. It's the system around it: task graphs, orchestration, sandboxes, and constraints that keep agents on rails. Here's how to build that system yourself.
THE SHIFT

Here's the mental model. Imagine a software team, but every team member is an agent.
Someone needs to remember what's happening. That's Beads. It's a task tracker that lives inside your git repo. Not Jira, not Notion. A .beads/ folder with JSONL files that agents can read and update directly. Tasks form a graph: this blocks that, this depends on that.
An agent can ask "what's ready for me to work on?" and get an answer instantly. When tasks are done, old ones get compressed so the memory stays clean.
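To make that concrete, here's a minimal sketch in Python. The JSONL schema is hypothetical (the real Beads format may differ); the point is that "what's ready for me?" is just a query over a dependency graph stored as plain lines in git.

```python
import json
from io import StringIO

# Hypothetical task records in the spirit of a .beads/ JSONL file;
# the actual Beads schema may differ.
TASKS = StringIO("""\
{"id": "t1", "title": "Add auth middleware", "status": "closed", "deps": []}
{"id": "t2", "title": "Wire login endpoint", "status": "open", "deps": ["t1"]}
{"id": "t3", "title": "Rate-limit the API", "status": "open", "deps": ["t2"]}
""")

tasks = [json.loads(line) for line in TASKS]
closed = {t["id"] for t in tasks if t["status"] == "closed"}

# "What's ready?" = open tasks whose dependencies are all closed.
ready = [t["id"] for t in tasks
         if t["status"] == "open" and all(d in closed for d in t["deps"])]
print(ready)  # → ['t2']; t3 stays blocked until t2 closes
```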
Someone needs to decide who does what. That's Composio Agent Orchestrator. It splits the work into two layers: a Planner that breaks big goals into steps, and Executors that actually do the work.
The Planner looks at your Beads task graph and says "these three tasks can run in parallel, this one waits."
The Executors spin up coding agents, each in its own git branch, its own workspace. If something fails, the orchestrator retries or escalates. It doesn't just crash and lose everything.
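A toy version of that split, in Python. `plan_batches` plays the Planner: it never touches code, it only groups tasks into waves that can run in parallel. `execute` is a stand-in for handing a task to a coding agent on its own branch. Both names are illustrative, not Composio's actual API.

```python
def plan_batches(deps):
    """Planner: group tasks into parallel waves (topological levels).
    `deps` maps task -> set of prerequisite tasks."""
    done, batches = set(), []
    remaining = dict(deps)
    while remaining:
        wave = [t for t, d in remaining.items() if d <= done]
        if not wave:
            raise ValueError("cycle in task graph")
        batches.append(sorted(wave))
        done.update(wave)
        for t in wave:
            del remaining[t]
    return batches

def execute(task):
    # Executor stand-in: would spin up an agent in its own branch/workspace.
    return f"agent/{task}: done"

deps = {"schema": set(), "api": {"schema"}, "ui": {"schema"}, "e2e": {"api", "ui"}}
for wave in plan_batches(deps):
    print(wave, [execute(t) for t in wave])  # each wave could run concurrently
```

The planner emits `["schema"]`, then `["api", "ui"]`, then `["e2e"]`: the parallelism falls out of the graph, not out of anyone's judgment.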
Someone needs to design the workspace. That's the Harness pattern from OpenAI. Their big insight: agents don't struggle because they're dumb. They struggle because the environment is messy. So you make the repo the entire world the agent can see.
Architecture rules? In the repo.
Quality standards? Enforced by linters.
Documentation? Structured files agents can actually parse, not a 500-line instruction dump nobody reads.
Someone needs to define the workflow. That's the Minions pattern from Stripe. They call them "blueprints": deterministic sequences that alternate between fixed steps and agent calls.
- Set up environment.
- Run agent.
- Run tests.
- Check results.
- If pass, open PR.
- If fail, retry with context.
The agent only gets creative where creativity is needed. Everything else is predictable code.
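The steps above can be sketched in a few lines of Python. `run_agent`, `run_tests`, and `open_pr` are hypothetical hooks you'd wire to your own tooling; everything outside the agent call is plain deterministic code.

```python
# Sketch of a Minions-style blueprint: fixed steps wrapping one agent call.
# run_agent, run_tests, and open_pr are hypothetical hooks, not a real API.
def run_blueprint(task, run_agent, run_tests, open_pr, max_retries=3):
    context = ""
    for _ in range(max_retries):
        diff = run_agent(task, context)   # the only creative step
        ok, output = run_tests(diff)      # deterministic gate
        if ok:
            return open_pr(diff)          # pass -> open PR
        context = output                  # fail -> retry with error output
    raise RuntimeError(f"{task}: exhausted {max_retries} attempts")

# Demo with fake hooks: the "agent" fails once, then passes.
attempts = []
def fake_agent(task, ctx):
    attempts.append(ctx)
    return f"diff-v{len(attempts)}"
def fake_tests(diff):
    return (diff == "diff-v2", "AssertionError: expected 200, got 500")
def fake_pr(diff):
    return f"PR opened for {diff}"

result = run_blueprint("fix-login", fake_agent, fake_tests, fake_pr)
print(result)  # → PR opened for diff-v2
```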
HOW IT FLOWS

Here's the actual pipeline, start to finish.
1. You write a spec. A structured doc checked into the repo. What you want built, what constraints matter, what "done" looks like. You're the architect now.
2. The spec becomes tasks. Either you or a planning agent converts that spec into Beads. Epics, tasks, subtasks. Dependencies mapped. Priorities set. All versioned in git alongside the code.
3. Composio plans the work. The orchestrator reads the task graph, figures out what can run in parallel, and spins up agents. Each agent gets its own branch, its own isolated workspace, and only the tools it needs for that specific task. Not every tool. Just the relevant ones. This keeps the agent focused.
4. Agents execute in sandboxes. Each coding agent follows a blueprint: provision environment, write code, run tests, validate against linters. If tests pass, open a PR. If they fail, the agent gets the error output and tries again. All of this happens without you watching.
5. Constraints catch mistakes. Your repo has custom linters, structural tests, architecture rules. "This layer can't import from that layer." "Every API endpoint needs input validation." "Logs must use structured format." Agents follow these because they're enforced in CI, not because they read a guideline doc.
6. Cleanup runs in the background. Dedicated maintenance agents scan for code that drifted from your patterns. They open small refactoring PRs. Old Beads tasks get compressed. Documentation gets updated. This is continuous garbage collection for your codebase.
7. You review what matters. PRs land in your queue. Most take under a minute to approve because the constraints already caught the obvious stuff. You spend your time on architecture decisions, user feedback, and steering the system.
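A rule like the "this layer can't import from that layer" check in step 5 can be as small as an AST walk run in CI. A minimal sketch, assuming a repo where each top-level package is a layer; the `domain`/`web` names and the `FORBIDDEN` table are illustrative.

```python
import ast

# Illustrative policy: layer -> layers it must not import from.
FORBIDDEN = {"domain": {"web"}}

def layer_violations(layer, source):
    """Return imports in `source` that break `layer`'s boundary."""
    banned = FORBIDDEN.get(layer, set())
    bad = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        bad += [n for n in names if n.split(".")[0] in banned]
    return bad

print(layer_violations("domain", "from web.views import render\nimport os"))
# → ['web.views']; importing os is fine
```

Wire this into CI as a failing check and agents obey it on every PR, whether or not they ever read a guideline doc.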
THE THREE RULES THAT MAKE IT WORK

Study how OpenAI and Stripe built these systems and three patterns show up everywhere.
Rule 1: Design the environment, not the prompts. Both teams stopped trying to write perfect prompts and started building better workspaces. Repository structure, tool access, linter rules, documentation format. Get these right and even a mediocre prompt produces good work. Get these wrong and the best prompt in the world won't save you.
Rule 2: Separate thinking from doing. Mixing "figure out the plan" and "write the code" in one agent loop creates greedy, short-sighted decisions. The planner plans. The executor executes. The planner never touches code. The executor never questions the plan. Clean separation.
Rule 3: Encode taste as code. Every time an agent produces something you don't like, don't just fix it. Ask: can I write a linter rule, a test, or a structural check that catches this automatically? Over time, your preferences become mechanical constraints. The system gets better without you repeating yourself.
WHAT NOT TO DO
Don't write a giant instruction file. OpenAI tried a massive AGENTS.md and it failed. Too long, too stale, too many conflicting rules. Instead: a short table of contents (100 lines max) that points to specific docs. Progressive disclosure. Agents start small, look up details as needed.
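What that table of contents might look like, as a hypothetical sketch (the file names are illustrative):

```markdown
# AGENTS.md — a short index, not an instruction dump

- Architecture rules: docs/architecture.md
- Running tests: docs/testing.md
- API conventions: docs/api-conventions.md
- CI gates and linters: ci/README.md
```

The agent reads ten lines up front and opens the detailed doc only when the task calls for it.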
Don't let agents roam free. "Here's the whole codebase, figure it out" is a recipe for nonsense. Scope each agent to a specific task, specific files, specific tools. Walls matter more than brains.
Don't skip the task graph. Chat history is not a project management tool. If you can't answer "what's done, what's blocked, what's next?" from a structured query, your system will fall apart at scale.
Don't wait for better models. Stripe and OpenAI both said the same thing: the bottleneck was never model quality. It was always the environment. Better workflows beat better weights.
START SMALL
Copy-paste this to plan your first swarm pipeline:
You're a software architect designing an agent swarm pipeline.
My project: [DESCRIBE YOUR PROJECT]
My stack: [YOUR TECH STACK]
My team size: [NUMBER]
Design a practical swarm pipeline using these 4 components:
1. Beads (git-backed task graph)
2. Composio (orchestrator with planner/executor split)
3. Harness patterns (agent-legible repo design)
4. Blueprint workflows (deterministic steps wrapping agent calls)
For each component, give me:
- What to set up first
- One concrete example using my project
- The biggest mistake to avoid
Keep it actionable. I want to start building today.
If you try this and it maps out a real pipeline, reply "worked" so I know to send more systems-level stuff.
BOTTOM LINE
One agent is a tool. A swarm is a team.
The gap between "I use AI to help me code" and "I run a fleet of agents that ship software" is not model quality. It's infrastructure. Task graphs. Orchestration. Sandboxes. Constraints.
OpenAI built a product with zero human-written code. Stripe merges a thousand agent PRs per week. These aren't experiments. They're production systems.
The tools to build your own version exist today. Beads for memory. Composio for orchestration. Harness patterns for environment design. Blueprint workflows for reliability.
Stop chatting with one agent. Start running a factory.
Until next week,
@speedy_devv


