Skip to content

AI-Powered E2E Testing: How canary, Stagehand, Magnitude, and Shortest Each Solve the Problem

Jun 7, 2026 1 min
TL;DR AI agents running tests are non-reproducible; hand-written Playwright is hard to maintain. Four tools that emerged in 2024-2025 each tackle this dilemma with very different design philosophies.

🌏 中文版

E2E testing has long been stuck between two extremes: let an AI agent run tests and every result is non-deterministic, every run burns tokens, and failures leave you with no clear trace; write Playwright scripts by hand and selectors rot alongside every UI change, making maintenance costlier than shipping features. A wave of tools emerged in 2024-2025 to break this binary — each with a fundamentally different approach.

canary: AI Runs It, Then Hands You a Script

canary is a “QA harness built for Claude Code,” and its core design decision is clear: while the AI agent runs QA, it simultaneously generates a replayable Playwright script.

On the first run, a Claude Code agent navigates, interacts with, and validates a real browser. When it finishes, it produces a report.html, a complete Playwright script, a Playwright trace, a network HAR, console logs, and a video recording. The next time you need to run the same flow, just execute the script — no LLM call required.

The architecture consists of four components:

  • canary (orchestrator CLI): the primary user interface for recording QA sessions and aggregating reports
  • canary-browser (engine CLI): single-shot browser automation without recording or reports, suited for quick scripts
  • canary-daemon: a long-running Node process that holds a Playwright instance and a QuickJS WASM sandbox, handling IPC via named pipes
  • canary-viewer (Astro + React): a local session browser for searching, filtering, and replaying all recordings

Scripts execute inside a QuickJS WASM sandbox — not Node.js — so there is no require, import, fs, or process. Security isolation is built in. Capture is enabled by default (trace, video, HAR, console logs), with flags like --no-trace to turn individual features off.

canary also ships as a Claude Code plugin, a Cursor plugin, and a Codex plugin, all sharing a single skills/ directory — no need to maintain three separate configurations for different agents.

Compared to the other tools, canary takes the clearest stance: the non-reproducibility of AI-run tests is the problem, and the fix is “use AI the first time, use the script after that.” The trade-off is the added overhead of a daemon process and the learning curve of the QuickJS environment.

Stagehand: An AI Language Layer on Top of Playwright

Stagehand is the most mature tool in this space, with over 22,000 GitHub stars and 700,000+ weekly npm downloads as of 2026. It positions itself as an “SDK for browser agents” — not a test framework. That distinction matters.

It adds four AI primitives on top of Playwright:

// Natural-language interactions — no selectors needed
await stagehand.page.act("click on the Acme Circles T-Shirt");

// Structured data extraction
const product = await stagehand.page.extract({
  instruction: "get the product name and price",
  schema: z.object({ name: z.string(), price: z.string() }),
});

// Understand what actions are available on the page
const actions = await stagehand.page.observe("what can I do on this page?");

// High-level task execution
await stagehand.agent().execute("add the shirt to cart and go to checkout");

Auto-caching is Stagehand’s standout design feature: results from act/extract/observe are cached server-side, so cache hits require no LLM call. When the UI changes and the cache misses, AI automatically rediscovers the new selector. Self-healing is baked in.

That said, Stagehand has no assertions and no pass/fail semantics — it’s an automation SDK, not a test framework. To use it for QA you need to build assertion logic yourself. The auto-caching feature has known issues (issue #1767), so verify cache hit behavior before relying on it in production. Multiple providers are supported — OpenAI, Anthropic, Google — via the Vercel AI SDK.

Shortest: Minimal Interface, One Sentence Per Test

Shortest takes the most radical philosophical stance: a test is a single natural-language sentence.

shortest("login with valid credentials");
shortest("ensure the response contains only active users", req.fetch({
  url: "/users",
  method: "GET",
  params: new URLSearchParams({ active: "true" }),
}));

Under the hood it uses Anthropic Claude, supports lifecycle hooks (beforeEach/afterAll), API testing, and test chaining (spread operator for composing multi-step flows). Currently at 5,600+ stars.

The strength of Shortest is near-zero boilerplate — the test description is the test, with no selector fragility. The weakness is that there’s no explicit script output, every run incurs inference cost, and failures produce only a natural-language error description, making debugging limited. It’s a good fit for rapid acceptance tests, but not for scenarios requiring precise assertions or high-frequency CI runs.

Magnitude: Pixels Only, No DOM

Magnitude debuted in April 2025 via a Show HN post (179 points) and takes a purely visual approach — no DOM, no accessibility tree, only screenshots.

The core is a two-agent division of labor:

  • Planner (large LLM, officially recommended: Claude Sonnet 4): understands test intent and converts natural-language descriptions into an execution plan
  • Executor (Moondream 2B, a small VLM): follows the plan and outputs pixel coordinates for clicks, inputs, and scrolls

Plans can be saved, so subsequent runs only invoke the Executor — no large LLM call needed — which eliminates most inference cost. If the Executor fails, control falls back to the Planner for re-planning. Magnitude achieves 94% on the WebVoyager benchmark.

The purely visual approach is nearly immune to dynamic class names, shadow DOM, and iframe content, since the DOM is never parsed. The trade-off is that the Executor (Moondream) has limited comprehension of complex pages, and certain visual tasks may be less reliable than DOM-based methods.

Playwright’s Own AI Integration

Playwright officially added AI support in three directions during 2025-2026:

Playwright MCP (Model Context Protocol server): lets AI agents control a browser through accessibility snapshots. Tools like Claude Code and Cursor can plug directly into Playwright MCP to run browser tasks.

Playwright CLI: a token-efficient command-line interface designed for Claude Code and GitHub Copilot, useful when balancing browser operations against context budget in large codebases.

Playwright Test Agents (three agents):

  • Planner Agent: explores the app and produces a test plan
  • Generator Agent: converts the plan into executable Playwright tests
  • Healer Agent: when a test fails, automatically identifies broken locators and repairs them

The three agents can be used independently or chained into a plan → generate → run → heal loop.

Comparison

ToolStarsDesign PhilosophyOutputs a ScriptBest Fit
canaryEmergingAI runs QA, then produces a replayable script✅ Playwright scriptClaude Code integration, full trace recording
Stagehand22k+Playwright + AI language layer, auto self-healing❌ (automation SDK, not a test framework)Production browser automation
Shortest5.6k+One natural-language sentence = one testQuick acceptance tests, low-frequency runs
MagnitudeEmergingPure-visual VLM with Planner + Executor split❌ (plans are saveable, but not code)Complex UIs, dynamic class name environments
Playwright AgentsOfficialFramework-level integration, plan/generate/heal✅ (Generator Agent)Teams already using Playwright

The Bottom Line

Which tool to pick depends on which dimension matters most to you.

If reproducibility is non-negotiable — CI must replay, failures must be traceable, inference can’t happen on every run — canary or Playwright Test Agents are the more honest choices. The former emphasizes Claude Code integration and complete session recording; the latter suits teams already invested in Playwright.

If you need a fast automation SDK rather than a test framework, Stagehand is the most mature option. Its self-healing saves enormous selector maintenance overhead, but you’ll need to build your own assertion logic — it’s not a drop-in QA tool.

Shortest’s minimal interface is a great fit for early-stage acceptance tests. Token cost scales linearly with test frequency, so do the budget math before wiring it into CI.

Magnitude’s purely visual approach is the most differentiated — it has the greatest advantage in environments where the DOM is unreliable — but Moondream 2B’s visual understanding is still maturing. For now it’s better suited to exploration than production.

References