- Stripe Minions — 1,300 PRs per Week via Slack Emoji Workflow
- Ramp Inspect — 30% of Merged PRs from an Agent
- Coinbase Cloudbot — Agent Councils + Auto-Merge
- Spotify Honk — From Phone-Described Requirements to Merged PR
- Common Architectural Patterns
- Other Companies Doing the Same
- The Big Picture
- References
🌏 中文版
In early 2026, a pattern began to emerge: elite engineering teams across Silicon Valley were independently building their own AI coding agents. Not Copilot-style autocomplete, but genuine end-to-end automation — from a single Slack message to a production-ready PR, with no human touching the keyboard.
This article first deep-dives into the approaches of four representative companies — Stripe, Ramp, Coinbase, and Spotify — then expands to cover the full landscape across Google, Meta, Amazon, Uber, and more than a dozen others, along with the common architectural patterns they’ve converged on.
Stripe Minions — 1,300 PRs per Week via Slack Emoji Workflow
Stripe’s Minions is the most publicly documented internal coding agent to date, built by engineer Steve Kaliski’s team and shared with technical details in February 2026.
Trigger Mechanism
An engineer adds a specific emoji reaction to any task-describing message in Slack (e.g., :create-minion-payserver:), and a bot confirms the Minion has been launched. The original Slack message becomes the agent’s prompt directly.
It also supports triggers via CLI, web interface, and automated systems (e.g., flaky test detectors). But the most common path is Slack.
Five-Layer Pipeline
The Minions architecture breaks down into five layers:
1. Invocation — Slack emoji / CLI / Web / automated system triggers
2. Devbox — Isolated VM, ~10 sec startup, pre-loaded with Stripe codebase and services
3. Toolshed MCP — Centralized MCP Server managing ~500 internal tools
4. Agent Loop — Blueprint architecture (alternating deterministic + agentic nodes)
5. Output — Lint → CI (up to 2 rounds) → Open PR for human review
Blueprint Architecture
This is the core design of Minions. A Blueprint is an orchestration template that chains together two fundamentally different types of steps:
- Deterministic Nodes: Fixed, predictable operations — git push, linting, CI execution, format checks
- Agentic Nodes: LLM-driven reasoning and code generation
The two alternate in a feedback loop: AI generates code → deterministic node verifies it compiles → AI proposes a refactor → tests run to confirm nothing breaks. Rather than relying on AI to be correct every time, deterministic checkpoints catch errors.
A typical Blueprint flow looks like this:
Slack trigger (deterministic) → clone repo + env setup (deterministic)
→ understand task + plan implementation (agentic) → write code (agentic)
→ run linter (deterministic) → push branch (deterministic)
→ fix CI failures (agentic, up to 2 attempts) → push final version (deterministic)
→ PR ready for review
The CI fix cap of 2 attempts is deliberate — if the LLM can’t fix it in two tries, a third won’t help either; it’s just burning compute. At that point, the system flags the task for human takeover.
Different task types (dependency updates, API migrations, test generation, documentation) have specialized Blueprints, and the orchestration layer automatically routes to the right one.
Toolshed MCP Server
Stripe has over 500 internal tools, but feeding all of them to the AI causes token paralysis. Toolshed is a centralized MCP (Model Context Protocol) Server that curates a subset of roughly 15 of the most relevant tools based on the task type, giving the agent precise, high-density context from the start.
Devbox Isolation Environment
Each Minion runs in an isolated AWS EC2 VM (Devbox) — identical specs to what human engineers use, pre-loaded with Stripe’s full source code, warmed Bazel cache, and type-checking cache.
Stripe pre-provisions a warm pool, pulling a machine from the pool on trigger, enabling startup in ~10 seconds. No internet access, no production access, no real user data — fully sandboxed. The blast radius of any error is contained within a single disposable VM.
Another detail: as the agent navigates the filesystem, directory-scoped rule files are automatically appended — rather than dumping an entire global context in at once, guidance is dynamically provided based on the current directory. This prevents context window overflow.
Key Metrics
- 1,300+ PRs merged per week (roughly 260+ per day)
- All PRs contain zero human-written code
- Every PR still requires human code review
- The underlying code supports Stripe’s $1 trillion+ in annual payment volume
Origin and Design Philosophy
Minions’ core agent is an internal fork of Block’s open-source Goose. The key modification was stripping out everything designed for human use — interruptibility, confirmation dialogs, human-triggered commands — and replacing them with a fully unattended one-shot mode.
Steve Kaliski calls this pattern “pair prompting” — a new form of pair programming. His core insight: “A good human developer experience also produces good AI agent results.” The infrastructure that makes human developers efficient (devbox, tooling, CI) equally makes agents efficient.
Stripe’s architectural philosophy can be distilled into one sentence: “The walls matter more than the model” — the guardrails, infrastructure, and constraints around the agent matter more than which LLM you use. Devbox infrastructure, 3 million tests, 500 MCP tools — this is not something a startup can replicate overnight.
Ramp Inspect — 30% of Merged PRs from an Agent
Ramp is one of Silicon Valley’s fastest-growing corporate spend management platforms. Their internal coding agent is called Inspect.
Technical Architecture
Inspect is built on OpenCode (an open-source AI coding CLI tool), paired with Modal cloud containers for isolated sandbox environments. Each task executes in an independent container that can run tests, lint, and type checks, ensuring generated code passes basic quality gates before submission.
Trigger Mechanism
Primarily triggered via Slack — engineers describe tasks in a Slack channel, Inspect spins up a container, executes the work, and posts back a PR link in Slack when done. CLI is also supported.
Use Cases
- Bug fixes
- Small feature implementations
- Refactoring and code migrations
- Test writing
- Boilerplate generation
All agent-produced PRs still require human review. Inspect is positioned as augmentation, not a replacement for human judgment.
Key Metrics
- Approximately 30% of merged PRs in frontend and backend repos are produced by Inspect
- Team adoption is extremely high, with most engineers using it daily
- Adoption rate exceeded the team’s expectations
Visual Verification
Inspect integrates visual DOM verification — it doesn’t just check whether the code runs, but can verify the correctness of UI changes through DOM snapshots. This is particularly valuable for frontend tasks.
Coinbase Cloudbot — Agent Councils + Auto-Merge
Coinbase’s internal coding agent is called Cloudbot. Its biggest differentiators are the agent council mechanism and auto-merge capability.
Agent Councils
Cloudbot doesn’t operate as a single agent working alone. It uses a multi-agent “council” architecture — one agent writes code, while others serve as reviewers and validators, completing a round of internal review before any human gets involved.
This ensemble/consensus mechanism reduces the risk of a single LLM making mistakes and gives the system confidence to auto-merge under specific conditions.
Auto-Merge
Unlike the other three companies, Cloudbot can automatically merge PRs when all CI tests pass + the agent council review is positive, without requiring human intervention. Human developers only need to manually review complex cases.
This is a bold design choice — removing humans from the loop and placing full trust in automated quality gates.
Trigger Mechanism
Triggered via Slack commands or PR comments, primarily handling mechanical tasks like dependency upgrades, code migrations, boilerplate, and test generation.
Built from Scratch
Unlike Stripe and Ramp, which each modified open-source tools (Goose and OpenCode respectively), Coinbase’s Cloudbot is entirely custom-built — including the agent council, auto-merge pipeline, and internal architecture comprehension capabilities.
Spotify Honk — From Phone-Described Requirements to Merged PR
Spotify’s internal coding agent is called Honk, with complete technical details shared across three Spotify Engineering Blog posts (November-December 2025).
Origin
Spotify started building its Fleet Management framework back in 2022 for applying code changes in bulk across hundreds of repos. In July 2025, they integrated the Claude Agent SDK into this framework, and Honk was born.
Before this, Spotify had tried building agents in-house, but found that homegrown solutions “required overly rigid instructions and got stuck on complex multi-step edits.” Switching to Claude Code, they found a declarative prompt style — telling the agent what outcome you want rather than step-by-step instructions — worked much better.
Workflow
Claude Code reads the codebase, understands the architecture, writes the implementation, runs tests, pushes the new version, and finally notifies engineers via Slack. Engineers can review directly on their phones and merge to production if everything looks good.
Migration prompts are version-controlled in Git, and Spotify’s internal orchestration system handles triggering the Claude Code agent.
Three-Layer Quality Assurance
In their Part 3 blog post, Spotify detailed three failure modes they worry about most:
- Agent doesn’t produce a PR — low impact, just retry
- PR passes CI but the functionality is wrong — the most serious, erodes team trust
- Unpredictable output
The solution is a verification loop: the agent generates changes → runs formatter/linter/build/test → on failure, feeds error messages back into the loop, with verifier and judge mechanisms guiding the agent toward correctness.
Primary Use Case
Honk’s killer application is large-scale code migration — deprecated API migrations across hundreds of repos. To date, it has completed roughly 50 migrations.
CTO Gustav Soderstrom told analysts:
Spotify’s best developers haven’t written a single line of code since December 2025.
Key Metrics
- 1,500+ agent PRs merged cumulatively
- Currently merging 1,000 PRs every 10 days
- Migration tasks save 60-90% of time
- Built on Claude Code + Claude Agent SDK
Common Architectural Patterns
LangChain founder Harrison Chase observed that Stripe, Ramp, and Coinbase independently developed yet converged on remarkably similar architectures. In March 2026, he released Open SWE — an open-source framework that abstracts these common patterns.
Here are the core design choices across all four companies:
1. Isolated Cloud Sandboxes
Every agent task runs in an isolated container or VM with no access to production, no internet access (Stripe), or access only to a specific scope of resources. This is the foundation of trust.
2. Slack-First Triggers
All four companies use Slack as their primary trigger entry point. Engineers don’t need to switch tools — they issue commands right where they already communicate.
3. Curated Tool Sets
Rather than feeding agents every internal tool, they dynamically curate a small, precise subset based on the task type. Stripe’s Toolshed manages ~500 tools but only serves ~15 at a time.
4. Context Injection
Rich context is injected from sources like Linear issues, GitHub PRs, and Slack threads, giving the agent a full understanding of the task’s background.
5. Sub-Agent Orchestration
Complex tasks are split across multiple sub-agents working together, rather than a single agent handling everything.
Side-by-Side Comparison
| Feature | Stripe Minions | Ramp Inspect | Coinbase Cloudbot | Spotify Honk |
|---|---|---|---|---|
| Base | Goose fork | OpenCode | Custom-built | Claude Code + Agent SDK |
| Trigger | Slack emoji | Slack / CLI | Slack / PR comment | Natural language description |
| Sandbox | Isolated VM | Modal container | Cloud sandbox | Background environment |
| Review | Human required | Human required | Agent council + auto-merge | Human required |
| Weekly PRs | 1,300+ | ~30% of all PRs | Not disclosed | 1,000/10 days |
| Differentiator | Blueprint architecture | Visual DOM verification | Auto-merge | Verification loop + migration optimization |
Other Companies Doing the Same
It’s not just the four above. Here are other large companies with publicly available information:
Google — Agent Smith
Google’s internal coding agent Agent Smith was responsible for 25%+ of new production code by Q3 2024 (Sundar Pichai, earnings call), surpassing 30% in Q1 2025. It takes high-level task descriptions, breaks them into subtasks, writes code across multiple files, runs tests, and iterates until the PR is ready for human review. After its official launch in early 2026, it became so popular that Google had to throttle internal access.
On the external product side, Google launched Antigravity — an agent-first IDE that supports orchestrating multiple parallel agents across different workspaces simultaneously.
Meta — DevMate + Multi-Agent System
Meta’s approach is the most aggressive: DevMate isn’t a single agent but an agent network — comprising Planner, Researcher, Builder, Reviewer, Negotiator, and other roles working together to complete tasks.
The metrics are staggering: DevMate ultimately produces 50% of code changes. Since early 2025, per-engineer output has increased 30%, with heavy users seeing 80% YoY improvement. The H1 2026 internal target is for 65% of engineers to produce 75%+ of their code with AI.
Amazon — Q Developer
Amazon used Q Developer’s code transformation feature to migrate 30,000 Java applications from Java 8/11 to Java 17. CEO Andy Jassy revealed in an earnings call: it saved 4,500 developer-years of effort and $260 million. The average upgrade time per application dropped from ~50 person-days to a few hours, with 79% of auto-generated code reviews accepted directly.
Uber — Minions + Shepherd + uReview
Uber’s agent system comprises three roles: Minions (task agent), Shepherd (migration agent), and uReview (code review agent). uReview analyzes 90%+ of ~65,000 weekly code diffs, with a median review time of just 4 minutes, and 65% of AI comments adopted (higher than the 51% rate for human reviewers). By March 2026, 84% of developers were agentic coding users.
Goldman Sachs — Devin Deployment
Goldman Sachs was the first major bank to deploy Devin (Cognition) (July 2025), scaling from hundreds to 12,000 developers. Primarily used to migrate internal code to newer language versions. Reported 3-4x productivity gains.
Walmart — WIBEY
Walmart’s developer agent WIBEY is one of four “super agents” that saved approximately 4 million developer hours in 2024-2025. Built on Walmart’s proprietary Element ML platform, it is currently being refactored into an agent orchestration architecture.
Industry Landscape
| Company | Tool | Key Metrics |
|---|---|---|
| Agent Smith | 30%+ production code | |
| Meta | DevMate | 50% code changes, multi-agent network |
| Amazon | Q Developer | 4,500 developer-years, $260M saved |
| Uber | Minions/Shepherd/uReview | 84% developer adoption, 90% diffs auto-reviewed |
| Goldman Sachs | Devin | First bank deployment, 12,000 developers |
| Walmart | WIBEY | 4 million hours saved |
| Shopify | Cursor/Claude Code | 3,000 licenses, AI included in performance reviews |
| Block | Goose (open source) | 27,000 GitHub stars, base for Stripe Minions |
| Apple | Xcode Intelligence | Claude integration, agentic coding |
| Airbnb | Internal platform | 97% tech debt migration success rate |
The Big Picture
From the deep analysis of four companies to the full industry landscape, the conclusion is clear: AI coding agents are no longer experiments — they are production infrastructure.
The core trade-offs are evident:
- Speed vs. Control: Coinbase chose auto-merge for maximum velocity; the other three retained human review as a last line of defense
- Custom-built vs. Open-source base: Coinbase is fully custom, Stripe forks Goose, Ramp uses OpenCode, Spotify uses the Claude SDK — there’s no single right answer; it depends on the existing tech stack and internal requirements
- General-purpose vs. Specialized: All systems started with “well-defined, mechanical tasks” (migrations, dependency upgrades, bug fixes), then gradually expanded to more complex scenarios
For teams looking to build similar systems, LangChain’s Open SWE framework is a starting point — it packages the architectural patterns that Stripe, Ramp, and Coinbase independently converged on into an out-of-the-box open-source solution.
For most teams, the question worth asking right now is: How much of your engineering team’s work could actually be replaced by a single Slack message?
References
- Stripe Dev Blog: Minions — Stripe’s one-shot, end-to-end coding agents (Part 1)
- Stripe Dev Blog: Minions — Part 2
- Lenny’s Newsletter: How Stripe built “minions” — Steve Kaliski
- ByteByteGo: How Stripe’s Minions Ship 1,300 PRs a Week
- InfoQ: Stripe Engineers Deploy Minions
- MindStudio: Stripe Minions Blueprint Architecture
- Anup.io: Stripe’s coding agents — the walls matter more than the model
- SitePoint: Deconstructing Stripe’s Minions — One-Shot Agents at Scale
- InfoQ: Ramp Builds Internal Coding Agent That Powers 30% of Pull Requests
- DevOps.com: Open SWE Captures the Architecture That Stripe, Coinbase and Ramp Built Independently
- Spotify Engineering: 1,500+ PRs Later — Spotify’s Background Coding Agent (Part 1)
- Spotify Engineering: Context Engineering — Background Coding Agents (Part 2)
- Spotify Engineering: Feedback Loops — Background Coding Agents (Part 3)
- TechCrunch: Spotify says its best developers haven’t written a line of code since December
- Anthropic Customer Story: Spotify
- GitHub: langchain-ai/open-swe
- LangChain Blog: Open SWE — An Open-Source Framework for Internal Coding Agents
- Harrison Chase on X: Internal cloud coding agents
- ChatPRD: Stripe’s AI Minions Ship 1300 PRs Weekly from a Slack Emoji
- Anthropic: 2026 Agentic Coding Trends Report
- Fortune: Over 25% of Google’s code written by AI
- Google Developers Blog: Build with Google Antigravity
- LinearB: How Meta Built Agentic Infrastructure
- Engineering at Meta: Ranking Engineer Agent
- Amazon CEO Andy Jassy: Q Developer saves 4,500 developer-years
- Pragmatic Engineer: How Uber uses AI for development
- Uber Blog: uReview — Scalable GenAI for Code Review
- CNBC: Goldman Sachs pilots autonomous coder Devin
- Walmart Tech: From Models to Agents — WIBEY
- Pragmatic Engineer: AI Tooling for Software Engineers in 2026
- Block Open Source: Introducing Goose
- GitHub: block/goose
Loading...