#ai-agent

49 posts
ai deep-dive

Loop Engineering: When AI No Longer Needs You to Write Prompts

Loop Engineering is the practice of designing systems that automatically prompt AI agents, rather than prompting them manually. Boris Cherny runs hundreds of agents, Addy Osmani coined the term, and Blake Crosley identified verification cost as the real bottleneck — this article covers primary sources, the five building blocks, applicability boundaries, and criticisms.

tech deep-dive

Choosing a Browser MCP: CDP, Playwright MCP, or Puppeteer MCP?

@playwright/mcp uses an accessibility tree instead of screenshots, cutting token cost by 10–50x — the best default for AI agents doing web automation. Puppeteer MCP fits screenshot-heavy tasks. Direct CDP via MCP is for low-level tooling or domains that Playwright/Puppeteer don't expose.

tech deep-dive

Chrome DevTools MCP: An MCP Server Built on CDP

Chrome DevTools MCP wraps Chrome DevTools Protocol (CDP) as an MCP server, giving AI agents direct access to 40+ CDP Domains including Profiler, HeapProfiler, and Security that Playwright and Puppeteer MCP don't expose — at the cost of having to implement MCP tool definitions and auto-wait logic yourself.

tech deep-dive

@playwright/mcp: Microsoft's Official Browser Automation MCP Server

@playwright/mcp defaults to an accessibility tree (browser_snapshot) instead of screenshots, cutting token consumption by 90%+. Combined with Playwright's native auto-wait, it's the best starting point for AI agents doing web automation.

tech deep-dive

@modelcontextprotocol/server-puppeteer: The Official Puppeteer MCP Server

server-puppeteer is the Puppeteer wrapper in the official MCP servers monorepo — seven lean tools built around screenshots and evaluate. Token cost is significantly higher than @playwright/mcp per interaction, but it fits well when the screenshot itself is the deliverable or custom JS execution is the core need.

ai deep-dive

The Skill Management Revolution for LLM Agents: A Complete Landscape of Skill Lifecycle from Voyager to MUSE-Autoskill

MUSE-Autoskill (2026) introduces a five-stage skill lifecycle framework. Self-created skills achieve 60.35% (+7.16%) on SkillsBench overall, and an impressive 87.94% on tasks where skill generation succeeds — surpassing the human-authored skill ceiling. This post synthesizes six arXiv papers to map the full landscape of skill evolution research.

ai deep-dive

How to Rigorously Compare Before and After Agent Changes: From Golden Sets to Statistical Testing

Even with temperature=0, LLM outputs can still fluctuate by up to 15% in practice. To rigorously compare agent changes, you need a frozen golden set, at least 3 runs per query (averaged), blind LLM-as-judge evaluation (pairwise preference flip rates reach 35%), and paired statistical tests, not a single run of each version judged by feel.
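As a rough illustration of the "averaged runs plus paired test" idea, here is a minimal sketch using only the Python standard library. It uses a sign-flip permutation test as one possible paired test; the function names, the scores, and the choice of test are assumptions for illustration, not the method from the post.

```python
import random
from statistics import mean

def mean_per_query(runs):
    """Average repeated runs: runs[i][j] is the score of query j in run i."""
    return [mean(scores) for scores in zip(*runs)]

def paired_permutation_test(before, after, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on per-query score differences.

    Returns (mean difference, approximate p-value).
    """
    if len(before) != len(after):
        raise ValueError("before and after must score the same queries")
    diffs = [a - b for a, b in zip(after, before)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, each difference is equally likely to be + or -.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return mean(diffs), (hits + 1) / (n_resamples + 1)

# Hypothetical judge scores: 3 runs x 4 golden-set queries per version.
baseline_runs = [[0.6, 0.7, 0.5, 0.8], [0.5, 0.7, 0.6, 0.8], [0.6, 0.6, 0.5, 0.9]]
candidate_runs = [[0.7, 0.8, 0.6, 0.8], [0.7, 0.7, 0.7, 0.9], [0.6, 0.8, 0.6, 0.9]]
delta, p_value = paired_permutation_test(
    mean_per_query(baseline_runs), mean_per_query(candidate_runs)
)
print(f"mean delta={delta:.3f}, p={p_value:.3f}")
```

With only four queries, the smallest two-sided p-value a sign-flip test can reach is 2/16 = 0.125, which is one reason a real golden set needs many more queries than this toy example.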

ai deep-dive

Agent Observability: From OTel Traces to Catching Hallucinations, Tool Misuse, and Infinite Loops

The industry has converged on using OpenTelemetry GenAI semantic conventions to turn every LLM call and tool call into a span. Detecting the three major failure modes then splits into three tracks: faithfulness + semantic entropy for hallucinations, framework-level symbolic guardrails for tool misuse, and max steps + action hash deduplication for infinite loops — all wired into a Final / Trajectory / Single-step three-layer evaluation framework.
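The "max steps + action hash deduplication" guard for infinite loops can be sketched roughly as follows. The `LoopGuard` class, its thresholds, and the returned reason strings are hypothetical names chosen for this illustration; a real agent framework would wire the check into its own tool-dispatch path.

```python
import hashlib
import json

class LoopGuard:
    """Stops an agent run when it exceeds a step budget or repeats an action."""

    def __init__(self, max_steps=20, max_repeats=2):
        self.max_steps = max_steps      # hard cap on tool calls per run
        self.max_repeats = max_repeats  # identical calls allowed before stopping
        self.steps = 0
        self.seen = {}                  # action hash -> times seen

    def check(self, tool_name, arguments):
        """Return None if the action may proceed, otherwise a reason string."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "max_steps_exceeded"
        # sort_keys makes the hash independent of argument ordering.
        payload = json.dumps({"tool": tool_name, "args": arguments}, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        self.seen[digest] = self.seen.get(digest, 0) + 1
        if self.seen[digest] > self.max_repeats:
            return "repeated_action"
        return None

guard = LoopGuard(max_steps=10, max_repeats=2)
for _ in range(4):
    reason = guard.check("search", {"query": "same query"})
    if reason is not None:
        print("stopping run:", reason)
        break
```

Hashing the exact tool name and arguments only catches literal repeats; near-duplicate actions with slightly different arguments would need a looser similarity check, which this sketch does not attempt.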

ai deep-dive

Resource Rationality for Agents: Optimal Decisions Across Tokens, Tool Calls, and Latency

Agent decision-making under resource constraints is bounded rationality reborn: Rational Metareasoning uses VOC rewards to save 20-37% of tokens, BATS proves that adding budget without budget awareness is futile, FrugalGPT cascades cut costs by up to 98%, and Speculative Actions reduce latency by 20%. The three constraints ultimately converge into a single Pareto curve, and the overarching trend is moving from humans tuning knobs to models making resource-rational decisions on their own.

ai deep-dive

The Single Crack in Agent Security: From Prompt Injection to Trust Boundaries to Multi-Agent Worms

Three seemingly distinct agent security problems — tool output injection, trust boundaries, malicious agents — share the same root cause: LLMs flatten instructions and data into a single token stream, making them architecturally unable to distinguish between the two. Understand this through-line and you can trace every attack from EchoLeak (CVE-2025-32711, zero-click) to the Morris II AI worm, and see why 'making the model behave' doesn't work — only architectural constraints (six design patterns, CaMeL) do.

ai deep-dive

How Agents Decide Whether to Retrieve, What to Retrieve, and How to Merge: Three Decision Layers of Agentic RAG

Traditional RAG is a fixed pipeline of 'retrieve then answer.' Agentic RAG splits retrieval into three decision layers: when to retrieve (FLARE uses token probabilities; Adaptive-RAG uses a complexity classifier), what to retrieve (HyDE / RAG-Fusion / decomposition / Step-back), and how to fuse (RRF with k=60, then cross-encoder rerank, then compression; Anthropic measured a 67% reduction in failure rate). Key counter-intuitive insight: unnecessary retrieval hurts quality, so 'deciding not to retrieve' is a first-class capability.
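For readers unfamiliar with the fusion step, Reciprocal Rank Fusion scores each document as the sum of 1/(k + rank) over every ranked list it appears in. The sketch below shows only that step with k=60; the document IDs and retriever outputs are made up, and the reranking and compression stages are not shown.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from three retrievers (e.g. BM25, dense, query rewrite).
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "c", "a"],
    ["b", "a", "d"],
])
print(fused)
```

Because only ranks are used, the lists can come from retrievers whose raw scores are not comparable, which is the usual reason to fuse this way before handing the top candidates to a reranker.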

ai deep-dive

Stop Hand-Tuning Prompts: From GEPA to Tool Descriptions, Automating Agent Behavior Optimization

Automatic prompt optimization (APO) has evolved from APE/OPRO to GEPA, which replaces sparse rewards with linguistic reflection and beats GRPO by ~6pp with 4-35x fewer rollouts. Meanwhile, tool descriptions are the overlooked prompt: small wording changes can shift tool selection rates by 10x, and Anthropic's experiments show that tool descriptions rewritten by Claude itself outperform those written by human experts. These two lines are converging: eval-driven automatic optimization is eating hand-tuned prompts.

ai deep-dive

How to Build a Deep Research Agent: Multi-Turn Search Planning, Conflict Resolution, and Verifiable Conclusions

An autonomous research agent = four controllable stages: planning (decompose into sub-questions), retrieval loop (search -> read -> reflect on gaps -> search again), evidence arbitration (>=2 independent sources, typed conflict handling), and verifiable output (sentence-level citations + independent verification pass). Two approaches: training-based uses RL to learn end-to-end when to search (Search-R1 +41%); orchestration-based uses orchestrator-worker division of labor (Anthropic internal eval +90.2%, at ~15x token cost).

ai deep-dive

Machine Theory of Mind: How Agents Infer Other Agents' Intentions, Knowledge, and Goals

Inferring another's beliefs/goals/intentions from observed behavior is called Machine Theory of Mind. Three lineages: symbolic BDI, Bayesian inverse planning, and deep learning ToMnet. The biggest controversy in the LLM era is that GPT-4 still trails humans by >10 points on ToMBench — are high scores genuine reasoning or statistical shortcuts?

ai deep-dive

Multi-Agent Error Propagation and Recovery: Borrowing Thirty Years of Weapons from Distributed Systems

At 99% accuracy per step over 100 steps, the error-free completion rate drops to roughly 36.6% (0.99^100 ≈ 0.366): error compounding is a structural problem, not something prompt tuning can fix. Distributed systems' supervisor trees, bulkheads, circuit breakers, sagas, and durable execution can be mapped almost one-to-one into agent orchestration. But LLMs introduce a failure class that traditional systems never had, namely semantic errors that don't crash, which require Inspector agents (recovering 96.4%) and redundancy voting (MAKER: one million steps with zero errors) to address.
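The compounding arithmetic is easy to check. The sketch below assumes each step succeeds independently with the same probability, which is a simplification; the helper names are hypothetical.

```python
def completion_rate(step_accuracy, steps):
    """Probability that every step succeeds, assuming independent steps."""
    return step_accuracy ** steps

def required_step_accuracy(target_rate, steps):
    """Per-step accuracy needed to reach a target end-to-end success rate."""
    return target_rate ** (1.0 / steps)

print(f"99% per step, 100 steps: {completion_rate(0.99, 100):.3f}")
print(f"per-step accuracy for 90% over 100 steps: {required_step_accuracy(0.90, 100):.5f}")
```

Running the inverse calculation shows why per-step tuning alone struggles: even a 90% end-to-end target over 100 steps needs per-step accuracy above 99.8%, which is where recovery and redundancy mechanisms come in.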

ai deep-dive

How to Pick the Right Tool from Hundreds: The Collapse Curve of Tool Selection and Engineering Solutions

As the tool count grows, selection accuracy doesn't degrade gracefully, it collapses: going from 4 to 51 tools drops accuracy from 43% to 2%, and going from 10 to 100+ drops it from 78% to 13.62%. The root fix is to stop stuffing every tool into context at once: Anthropic's Tool Search Tool uses deferred loading plus retrieval to cut 85% of tokens, pushing Opus 4.5 accuracy from 79.5% to 88.1%. Description quality has a conditional payoff: negligible in simple scenarios, but correctness rises from 44% to 50% in multi-tool chaining.

tech debug

LLM Agent Tool Descriptions Determine Tool Selection: Three Bug Fixes

Rewriting tool descriptions from soft suggestions to hard rules (whitelist + consequence explanation) eliminated the LLM's incorrect tool selection; adding skip_signal=True fixed vector store double-indexing.

ai

Using AI Agents to Operate Video Generation Tools: A HyperFrames, HeyGen, and Runway Integration Guide

AI agents can operate video generation tools through three approaches — Skills, MCP Connectors, and direct APIs. Choosing the right integration method matters more than choosing the right tool.

ai

OpenAI's Codex Secure Deployment Strategy: Sandboxing, Auto-review, and Enterprise Governance

In May 2026, OpenAI published its internal Codex deployment practices: sandboxes define technical boundaries, approval policies determine when to pause, Auto-review delegates approval decisions to a sub-agent instead of a human, and Managed configuration lets enterprise admins enforce policies top-down. The core philosophy: zero friction for low-risk actions, mandatory review for high-risk ones.

ai

Claude, Codex, and Gemini Are All in the Browser Now: Comparing Three AI Agent Approaches in Chrome

Anthropic builds an extension, OpenAI builds its own browser, Google welds AI directly into Chrome — three completely different approaches. Here's a comparison of the current landscape, key differences, and a selection guide.

ai

15 Walls for Building Your Own Auto-Dev Agent: Concrete Lessons from Stripe Minions

Stripe Minions says 'The walls matter more than the model,' but the case studies from four Silicon Valley companies never explained how to actually build those walls. This post breaks down the 15 walls we implemented in the daodao auto-dev agent: what each wall prevents, where the files live, and what the tradeoffs are. Tier 1 is mandatory, Tier 2 strengthens governance, Tier 3 is serious governance.

ai

What Is an Auto-Dev Agent? An Intro to daodao's Automated Development System

A PM checks a task card in Notion → the system syncs it to a GitHub issue → writes a plan → writes code → opens a PR for human review. This post explains what the system does, what it doesn't do, and why it's feasible now — written for people who don't write code.

ai

Step-by-Step: Build a Notion → PR Auto-Dev Agent — A Reproducible Version of the daodao Pipeline

Build a Notion task → GitHub issue → spec PR → code PR auto-dev agent from scratch. Using the daodao case as a template, this guide walks through every step — what to do, what to verify, and how to handle problems. Notion DB schema → bin/ scaffold → two Claude Code routines → cloud env vars → staging tests.

ai

From Plan to PR: Building daodao's Auto-Dev Agent in Practice

5 rounds of consensus to write the plan, then team mode with 5 workers running 12 tasks in parallel — with plenty of pitfalls along the way. Writing it down for my future self and anyone else trying the same thing.

ai deep-dive

goose: Open-Source, Cross-Platform, LLM-Agnostic Local AI Agent

goose is an open-source AI Agent maintained by the Linux Foundation's AAIF, supporting 15+ LLM providers and 70+ MCP extensions, built with Rust as a Desktop App + CLI + API. It positions itself as a vendor-neutral, self-hostable alternative to Claude Code.

tech project

DeerFlow: ByteDance's Open-Source Super Agent Harness for Long-Running Research Tasks

DeerFlow is ByteDance's open-source Super Agent Harness built on Python 3.12 + LangGraph. It orchestrates long-running tasks through sandboxes, long-term memory, sub-agents, skills, and a messaging gateway. It hit #1 on GitHub Trending in February 2026, now surpassing 63,000 stars, with support for Telegram/Slack/Feishu, Claude Code integration, and multiple search backends.

ai guide

A Book Written by AI Itself, Teaching You How to Build Software with AI

Encyclopedia of Agentic Coding Patterns catalogues 190 patterns to help you make the right software decisions in the age of AI-written code — and the book itself is autonomously written and maintained by an AI agent.

ai guide

GitHub Copilot Coding Agent: Assign an Issue to AI and Let It Open the PR

GitHub Copilot Coding Agent lets you assign an Issue to Copilot, which then automatically creates a branch, writes code, runs CI, and opens a PR — all inside a cloud sandbox. The key to success is setting up AGENTS.md; without it, the agent tends to go off track. Best suited for well-defined medium-sized tasks; requires Pro+ (1,500 premium requests/month) or Enterprise plan.

product project

quidproquo Blog Improvement Roadmap: Content, Technical Debt, RAG Design, and Harness Infrastructure

Using my own 30+ RAG/Agent posts to audit the blog itself, I identified a prioritized improvement list spanning content quality, site tech, RAG design fixes, harness infrastructure, and AI agent applications — no phases, just priorities.

ai guide

Autoreason: Teaching LLMs When to Stop Self-Refining

Autoreason replaces the traditional critique-and-revise loop with a competitive multi-version evaluation mechanism (A/B/AB + blind Borda count), solving three structural problems in LLM self-refinement: prompt bias, scope creep, and lack of restraint.
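To make the Borda count part concrete, here is a minimal sketch of tallying ranked ballots over three candidate versions labelled A, B, and AB. The ballots are invented and the scoring convention (n-1 points for first place down to 0 for last) is one common variant; this is not Autoreason's actual implementation, and the blind relabelling of candidates before judging is not shown.

```python
def borda_count(ballots):
    """Tally ranked ballots: first place earns n-1 points, last place earns 0."""
    scores = {}
    for ballot in ballots:
        n = len(ballot)
        for position, candidate in enumerate(ballot):
            scores[candidate] = scores.get(candidate, 0) + (n - 1 - position)
    return scores

def borda_winner(ballots):
    """Return the top-scoring candidate, breaking ties alphabetically."""
    scores = borda_count(ballots)
    return min(scores, key=lambda c: (-scores[c], c))

# Hypothetical rankings from three judges, best version first.
ballots = [
    ["AB", "A", "B"],
    ["A", "AB", "B"],
    ["AB", "B", "A"],
]
print(borda_count(ballots), borda_winner(ballots))
```

A positional tally like this rewards versions that are consistently ranked near the top, rather than whichever version a single judge happens to prefer.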

ai guide

Claude Managed Agents: Letting Anthropic Handle the Agent Shell and Sandbox

Claude Managed Agents is a beta service launched by Anthropic on 2026/04/08 that provides an agent harness plus cloud container sandbox, billed per token plus $0.08/session-hour. It suits long-running async tasks and is worth exploring if you don't want to build your own agent loop and sandbox.

ai guide

Agent Skills: A Skill Framework That Makes AI Agents Work Like Senior Engineers

Agent Skills is Addy Osmani's open-source collection of 19 production-grade engineering skills that drive AI agents to follow senior engineering discipline through /spec → /plan → /build → /test → /review → /ship commands, instead of cutting corners.

ai guide

Hermes Agent: Nous Research's Self-Improving AI Agent

Hermes Agent is an open-source self-improving AI agent by Nous Research. It features persistent memory, skill learning, 40+ tools, multi-platform gateways, and support for 200+ model providers, and it serves as the official successor to OpenClaw.

ai guide AI Agent in Practice

From Stripe to Meta: How Silicon Valley's Top Companies Replace Keyboards with AI Agents

Top Silicon Valley companies are independently building internal AI coding agents that automate everything from a Slack message to a merged PR. This article deep-dives into architectures from Stripe, Ramp, Coinbase, and Spotify, then expands to cover Google, Meta, Amazon, Uber, Goldman Sachs, Walmart, and more.

tech guide

Where Should AI Agent Global Skills Live? The Division of Labor Between .claude, Codex Skills, and AGENTS.md

Skill paths are almost always runtime-specific. AGENTS.md is the reliable way to share rules across agents. Put personal reusable capabilities in each agent's supported global directory; put project workflows inside the repo.

ai guide

Ticketing Is Dead — Review Is the New Planning

When AI agents can turn intent into a PR in minutes, the bottleneck in software engineering flips from 'planning what to do' to 'evaluating whether the output is correct.' Artifacts of the ticketing era — sprints, story points, backlog grooming — are collapsing to zero, replaced by review as the core practice.

ai guide AI Agent in Practice

Anthropic's Harness Design: Making AI Agents Work Like Engineers

The same model produces dramatically different results under different harness designs. Anthropic uses a dual-agent architecture, cross-session state files, and a GAN-inspired generator-evaluator loop to let Claude autonomously complete hours-long software development tasks.

ai guide AI Agent in Practice

From Prompt to Harness: The Three Evolutions of AI Engineering

AI engineering has gone through three phases: Prompt Engineering (write better instructions) → Context Engineering (feed the right information) → Harness Engineering (design the entire working environment). Each evolution doesn't replace the previous one — it operates at a higher level of abstraction.

ai guide

Phil Schmid: Why Agent Harness Is the Most Important Thing in 2026

The model is the CPU, the harness is the operating system, and the agent is the application. No matter how powerful a model is, without a good harness it's just a demo. Phil Schmid argues that harness is the most critical infrastructure in AI engineering for 2026.

tech guide

Complete Guide to Bypassing Cloudflare Anti-Bot for AI Agents: From Debugging to Building an MCP Server

Standard Playwright gets blocked by Cloudflare. Both playwright-extra + stealth and nodriver can bypass it. The final step is wrapping the solution into an MCP server so AI agents can use it automatically.

Claude Code Global Skills Not Found in New Sessions? Understanding Skill Discovery and How to Debug It

Global skills live in ~/.claude/skills/, but they go missing in new sessions or the Desktop App? The problem usually isn't a missing file — it's that the skill descriptions aren't being loaded into context. This post clarifies the CLI vs Desktop App differences, the role of settings.json, and the most reliable fix.

A One-Person Full-Stack Team: AI-Driven Development Workflow from OpenSpec to Auto-Deploy

Use OpenSpec to break requirements into engineering tasks, Claude Code to implement them, hooks to auto-format and protect, local review before committing, three AI reviewers running in parallel on PR, and auto-deploy after merge. This entire workflow lets one person maintain quality across six sub-projects.

Claude Code Hooks: A Complete Guide to Event-Driven AI Control

Hooks are Claude Code's event system. They trigger shell commands, HTTP requests, or LLM evaluations automatically before/after tool execution, when a prompt is submitted, or when a task ends. Use them to block dangerous operations, run automated reviews, inject context, or write audit logs.

Claude Code Skills: A Complete Guide to Turning Repetitive Workflows into Single Commands

A Skill is an SOP written for AI. Define the steps in a Markdown file and Claude follows them. No coding required, no frameworks to learn — just write down what an experienced person would do.

Claude Code's Three-Layer Quality Defense: Hooks, Skills, and Instruction Files

Hooks are automated safety nets (blocking bad commits), Skills are interactive workflows (running checks + auto-fixing), and instruction files (CLAUDE.md / AGENTS.md) are behavioral guidelines. Each layer operates independently, but together they enable an AI agent to automatically run lint, typecheck, and build checks before every commit.

ai guide AI Agent in Practice

Context Engineering: Why Your AI Agent's Problem Is Information, Not the Model

Context Engineering is the core concept that replaced Prompt Engineering in 2025: the focus shifted from 'how to ask' to 'what information to provide.' Delivering the right information at the right time into the context window is more effective than upgrading to a stronger model. This post covers the definition, four key strategies, practical techniques, and common failure modes.

tech guide

Turning a Scraper Script into an MCP Server for Claude to Use Directly

Wrap a local Python script into an MCP Server using FastMCP so Claude Code can call it directly — no more manually running pipelines.

ai guide

The Three Core Pillars of AI Agents: Context, Cognition, Action

An AI agent is not a black box — it is built from three layers: what it knows (Context), how it thinks (Cognition), and what it can do (Action). Understanding these three layers is the key to grasping why agents are sometimes brilliant and sometimes go off the rails, and how to design a truly effective agent system.

tech guide

Ghostty vs cmux: A Guide to Choosing Your Modern Terminal

Ghostty is a fast, native, general-purpose terminal emulator. cmux is a terminal built on top of Ghostty, specifically designed for AI coding agents. They're not competitors — they operate at different layers.