#ai-agent

49 posts

ai deep-dive Jun 20, 2026

Loop Engineering: When AI No Longer Needs You to Write Prompts

Loop Engineering is the practice of designing systems that automatically prompt AI agents, rather than prompting them manually. Boris Cherny runs hundreds of agents, Addy Osmani coined the term, and Blake Crosley identified verification cost as the real bottleneck — this article covers primary sources, the five building blocks, applicability boundaries, and criticisms.

#loop-engineering #ai-agent #claude-code #prompt-engineering #harness-engineering #agentic-coding

tech deep-dive Jun 20, 2026

Choosing a Browser MCP: CDP, Playwright MCP, or Puppeteer MCP?

@playwright/mcp uses an accessibility tree instead of screenshots, cutting token cost by 10–50x — the best default for AI agents doing web automation. Puppeteer MCP fits screenshot-heavy tasks. Direct CDP via MCP is for low-level tooling or domains that Playwright/Puppeteer don't expose.

#mcp #browser-automation #playwright #puppeteer #cdp #ai-agent #developer-tools

tech deep-dive Jun 20, 2026

Chrome DevTools MCP: An MCP Server Built on CDP

Chrome DevTools MCP wraps Chrome DevTools Protocol (CDP) as an MCP server, giving AI agents direct access to 40+ CDP Domains including Profiler, HeapProfiler, and Security that Playwright and Puppeteer MCP don't expose — at the cost of having to implement MCP tool definitions and auto-wait logic yourself.

#chrome #cdp #mcp #browser-automation #debugging #devtools #ai-agent

tech deep-dive Jun 20, 2026

@playwright/mcp: Microsoft's Official Browser Automation MCP Server

@playwright/mcp defaults to an accessibility tree (browser_snapshot) instead of screenshots, cutting token consumption by 90%+. Combined with Playwright's native auto-wait, it's the best starting point for AI agents doing web automation.

#playwright #mcp #browser-automation #ai-agent #e2e-testing #developer-tools

tech deep-dive Jun 20, 2026

@modelcontextprotocol/server-puppeteer: The Official Puppeteer MCP Server

server-puppeteer is the Puppeteer wrapper in the official MCP servers monorepo — seven lean tools built around screenshots and evaluate. Token cost is significantly higher than @playwright/mcp per interaction, but it fits well when the screenshot itself is the deliverable or custom JS execution is the core need.

#puppeteer #mcp #browser-automation #ai-agent #developer-tools #chrome

ai deep-dive Jun 6, 2026

The Skill Management Revolution for LLM Agents: A Complete Landscape of Skill Lifecycle from Voyager to MUSE-Autoskill

MUSE-Autoskill (2026) introduces a five-stage skill lifecycle framework. Self-created skills achieve 60.35% (+7.16%) on SkillsBench overall, and an impressive 87.94% on tasks where skill generation succeeds — surpassing the human-authored skill ceiling. This post synthesizes six arXiv papers to map the full landscape of skill evolution research.

#agent-skills #ai-agent #llm #self-refinement #memory #arxiv #paper-review

ai deep-dive Jun 4, 2026

How to Rigorously Compare Before and After Agent Changes: From Golden Sets to Statistical Testing

Even with temperature=0, LLM outputs can still fluctuate by up to 15% in practice. To rigorously compare agent changes, you need a frozen golden set, at least 3 runs per query (averaged), blind LLM-as-judge evaluation (pairwise preference flip rates reach 35%), and paired statistical tests, not a single run of each version judged by feel.

#evaluation #rag #llm-judge #ab-testing #ai-agent #llm
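
As a rough illustration of the "average the runs, then run a paired test" step, here is a minimal sketch using an exact sign-flip permutation test. The function names and the choice of this particular test are assumptions for illustration, not something the post prescribes; it also assumes the golden set is small enough (roughly 20 queries or fewer) to enumerate every sign pattern.

```python
from itertools import product


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def per_query_means(runs):
    """runs[r][q] is the score of query q in run r; returns one averaged score per query."""
    return [mean(column) for column in zip(*runs)]


def paired_permutation_p(before, after):
    """Two-sided exact sign-flip permutation test on per-query score differences.

    Under the null hypothesis (the change has no effect), each per-query
    difference is equally likely to be positive or negative, so we enumerate
    every sign assignment and count how often the mean difference is at least
    as extreme as the observed one.
    """
    diffs = [a - b for b, a in zip(before, after)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        permuted = abs(mean(s * d for s, d in zip(signs, diffs)))
        if permuted >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / total
```

Feed it the averaged scores of the old and new agent on the same frozen queries; a small p-value suggests the difference is unlikely to be run-to-run noise.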

ai deep-dive Jun 4, 2026

Agent Observability: From OTel Traces to Catching Hallucinations, Tool Misuse, and Infinite Loops

The industry has converged on using OpenTelemetry GenAI semantic conventions to turn every LLM call and tool call into a span. Detecting the three major failure modes then splits into three tracks: faithfulness + semantic entropy for hallucinations, framework-level symbolic guardrails for tool misuse, and max steps + action hash deduplication for infinite loops — all wired into a Final / Trajectory / Single-step three-layer evaluation framework.

#observability #ai-agent #tool-use #llm #opentelemetry
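
The "max steps + action hash deduplication" idea for infinite loops can be sketched in a few lines. `LoopGuard`, its thresholds, and the reason strings are hypothetical names chosen for this example; it assumes tool arguments are JSON-serializable.

```python
import hashlib
import json


class LoopGuard:
    """Stops an agent run that exceeds a step budget or keeps repeating one action."""

    def __init__(self, max_steps=20, max_repeats=2):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = {}  # action hash -> number of times seen

    def check(self, tool_name, arguments):
        """Return None if the action may proceed, otherwise a short reason string."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "max_steps_exceeded"
        # Canonical JSON (sorted keys) so argument order does not change the hash.
        payload = json.dumps({"tool": tool_name, "args": arguments}, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        self.seen[digest] = self.seen.get(digest, 0) + 1
        if self.seen[digest] > self.max_repeats:
            return "repeated_action"
        return None
```

Calling `check` before each tool call and attaching the returned reason to the current span is one way to make loop terminations visible in traces.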

ai deep-dive Jun 4, 2026

Resource Rationality for Agents: Optimal Decisions Across Tokens, Tool Calls, and Latency

Agent decision-making under resource constraints is bounded rationality reborn: Rational Metareasoning uses VOC rewards to save 20-37% of tokens, BATS proves that adding budget without budget awareness is futile, FrugalGPT cascades cut costs by up to 98%, and Speculative Actions reduce latency by 20%. The three constraints ultimately converge into a single Pareto curve, and the overarching trend is moving from humans tuning knobs to models making resource-rational decisions on their own.

#ai-agent #reasoning #test-time-compute #llm #cost-optimization
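
A cascade in the FrugalGPT spirit can be sketched as "try the cheapest model first, escalate only if the answer is not accepted". This is a simplified sketch, not FrugalGPT's implementation: the `accept` callback stands in for its learned scorer, and the tier names and costs are hypothetical.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]
Tier = Tuple[str, Model, float]  # (name, model, cost per call)


def cascade(query: str, tiers: List[Tier], accept: Callable[[str, str], bool]):
    """Query tiers cheapest-first and stop at the first accepted answer.

    Returns (tier name, answer, total cost spent). If no tier is accepted,
    the answer from the last (most expensive) tier is returned.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    spent = 0.0
    name, answer = "", ""
    for name, model, cost in tiers:
        answer = model(query)
        spent += cost
        if accept(query, answer):
            break
    return name, answer, spent
```

The savings come entirely from how often the cheap tiers are accepted, so the quality of the acceptance check matters more than the cascade loop itself.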

ai deep-dive Jun 4, 2026

The Single Crack in Agent Security: From Prompt Injection to Trust Boundaries to Multi-Agent Worms

Three seemingly distinct agent security problems — tool output injection, trust boundaries, malicious agents — share the same root cause: LLMs flatten instructions and data into a single token stream, making them architecturally unable to distinguish between the two. Understand this through-line and you can trace every attack from EchoLeak (CVE-2025-32711, zero-click) to the Morris II AI worm, and see why 'making the model behave' doesn't work — only architectural constraints (six design patterns, CaMeL) do.

#security #ai-agent #prompt-injection #multi-agent #llm

ai deep-dive Jun 4, 2026

How Agents Decide Whether to Retrieve, What to Retrieve, and How to Merge: Three Decision Layers of Agentic RAG

Traditional RAG is a fixed pipeline of 'retrieve then answer.' Agentic RAG splits retrieval into three decision layers: when to retrieve (FLARE uses token probabilities; Adaptive-RAG uses a complexity classifier), what to retrieve (HyDE / RAG-Fusion / decomposition / Step-back), and how to fuse (RRF with k=60, then cross-encoder rerank, then compression; Anthropic measured a 67% reduction in failure rate). Key counter-intuitive insight: unnecessary retrieval hurts quality, so 'deciding not to retrieve' is a first-class capability.

#rag #agentic-rag #retrieval #ai-agent #llm
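
For the fusion layer, Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over every ranked list it appears in. A minimal sketch with k=60 follows; the function name and the sample document IDs are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list, best first.

    Each document receives 1 / (k + rank) from every list that contains it,
    where rank starts at 1 for the top result.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Example: results from two retrievers (say, keyword and vector search).
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
# fused == ["b", "a", "c"]
```

Because only ranks are used, the lists can come from retrievers whose raw scores are not comparable; the fused list would then go to the reranker.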

ai deep-dive Jun 4, 2026

Stop Hand-Tuning Prompts: From GEPA to Tool Descriptions, Automating Agent Behavior Optimization

Automatic prompt optimization (APO) has evolved from APE/OPRO to GEPA, which replaces sparse rewards with linguistic reflection and beats GRPO by ~6pp with 4-35x fewer rollouts. Meanwhile, tool descriptions are the overlooked prompt: small wording changes can shift tool selection rates by 10x, and Anthropic's experiments show that tool descriptions rewritten by Claude itself outperform those written by human experts. These two lines are converging: eval-driven automatic optimization is eating hand-tuned prompts.

#prompt-engineering #tool-use #ai-agent #llm #optimization

ai deep-dive Jun 4, 2026

How to Build a Deep Research Agent: Multi-Turn Search Planning, Conflict Resolution, and Verifiable Conclusions

An autonomous research agent = four controllable stages: planning (decompose into sub-questions), retrieval loop (search -> read -> reflect on gaps -> search again), evidence arbitration (>=2 independent sources, typed conflict handling), and verifiable output (sentence-level citations + independent verification pass). Two approaches: training-based uses RL to learn end-to-end when to search (Search-R1 +41%); orchestration-based uses orchestrator-worker division of labor (Anthropic internal eval +90.2%, at ~15x token cost).

#deep-research #ai-agent #multi-agent #retrieval #llm

ai deep-dive Jun 4, 2026

Machine Theory of Mind: How Agents Infer Other Agents' Intentions, Knowledge, and Goals

Machine Theory of Mind is the task of inferring another agent's beliefs, goals, and intentions from its observed behavior. Three lineages: symbolic BDI, Bayesian inverse planning, and deep learning ToMnet. GPT-4 still trails humans by more than 10 points on ToMBench, and the biggest controversy in the LLM era is whether high scores reflect genuine reasoning or statistical shortcuts.

#theory-of-mind #multi-agent #ai-agent #llm #reasoning

ai deep-dive Jun 4, 2026

Multi-Agent Error Propagation and Recovery: Borrowing Thirty Years of Weapons from Distributed Systems

At 99% accuracy per step over 100 steps, the error-free completion rate drops to just 36% -- error compounding is a structural problem, not something prompt tuning can fix. Distributed systems' supervisor trees, bulkheads, circuit breakers, sagas, and durable execution can be mapped almost one-to-one into agent orchestration. But LLMs introduce a failure class that traditional systems never had -- semantic errors that don't crash -- which require Inspector agents (recovering 96.4%) and redundancy voting (MAKER: one million steps with zero errors) to address.

#multi-agent #ai-agent #fault-tolerance #orchestration #llm
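
The compounding figure is plain arithmetic: if every step succeeds independently with probability p, an n-step run finishes error-free with probability p^n. A small sketch, assuming independent steps (real agent errors are often correlated) and using hypothetical function names:

```python
import math


def error_free_rate(per_step_accuracy, steps):
    """Probability that all steps succeed, assuming independent steps."""
    return per_step_accuracy ** steps


def max_steps_for(per_step_accuracy, target_rate):
    """Largest step count whose error-free rate is still at least target_rate."""
    return math.floor(math.log(target_rate) / math.log(per_step_accuracy))


rate = error_free_rate(0.99, 100)    # about 0.366
half_life = max_steps_for(0.99, 0.5)  # 68 steps before success falls below 50%
```

At 99% per-step accuracy, the run is already more likely to fail than succeed after 68 steps, which is why the fix has to be structural rather than a prompt tweak.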

ai deep-dive Jun 4, 2026

How to Pick the Right Tool from Hundreds: The Collapse Curve of Tool Selection and Engineering Solutions

As the tool count grows, selection accuracy collapses rather than degrading gracefully: going from 4 to 51 tools drops it from 43% to 2%, and going from 10 to 100+ drops it from 78% to 13.62%. The root fix is to stop stuffing every tool into context at once: Anthropic's Tool Search Tool uses deferred loading plus retrieval to cut 85% of tokens, pushing Opus 4.5 accuracy from 79.5% to 88.1%. Description quality has a conditional payoff: negligible in simple scenarios, but correctness rises from 44% to 50% in multi-tool chaining.

#tool-use #ai-agent #mcp #llm #context-engineering

tech debug May 18, 2026

LLM Agent Tool Descriptions Determine Tool Selection: Three Bug Fixes

Rewriting tool descriptions from soft suggestions to hard rules (whitelist + consequence explanation) eliminated the LLM's incorrect tool selection; adding skip_signal=True fixed vector store double-indexing.

#ai-agent #rag #llm #prompt-engineering #django #python

ai May 10, 2026

Using AI Agents to Operate Video Generation Tools: A HyperFrames, HeyGen, and Runway Integration Guide

AI agents can operate video generation tools through three approaches — Skills, MCP Connectors, and direct APIs. Choosing the right integration method matters more than choosing the right tool.

#ai-agent #video-generation #hyperframes #heygen #mcp #claude-code #cursor

ai May 10, 2026

OpenAI's Codex Secure Deployment Strategy: Sandboxing, Auto-review, and Enterprise Governance

In May 2026, OpenAI published its internal Codex deployment practices: sandboxes define technical boundaries, approval policies determine when to pause, Auto-review delegates approval decisions to a sub-agent instead of a human, and Managed configuration lets enterprise admins enforce policies top-down. The core philosophy: zero friction for low-risk actions, mandatory review for high-risk ones.

#openai #codex #ai-agent #security #sandbox #enterprise

ai May 9, 2026

Claude, Codex, and Gemini Are All in the Browser Now: Comparing Three AI Agent Approaches in Chrome

Anthropic builds an extension, OpenAI builds its own browser, Google welds AI directly into Chrome — three completely different approaches. Here's a comparison of the current landscape, key differences, and a selection guide.

#ai-agent #chrome-extension #claude #codex #chatgpt-atlas #gemini #browser-agent

ai May 9, 2026

15 Walls for Building Your Own Auto-Dev Agent: Concrete Lessons from Stripe Minions

Stripe Minions says 'The walls matter more than the model,' but the case studies from four Silicon Valley companies never explained how to actually build those walls. This post breaks down the 15 walls we implemented in the daodao auto-dev agent: what each wall prevents, where the files live, and what the tradeoffs are. Tier 1 is mandatory, Tier 2 strengthens governance, Tier 3 is serious governance.

#ai-agent #claude-code #guardrails #allowlist #verification-loop #token-budget #test-first #defense-in-depth #pre-commit #sub-agent-council

ai May 9, 2026

What Is an Auto-Dev Agent? An Intro to daodao's Automated Development System

A PM checks a task card in Notion → the system syncs it to a GitHub issue → writes a plan → writes code → opens a PR for human review. This post explains what the system does, what it doesn't do, and why it's feasible now — written for people who don't write code.

#ai-agent #auto-dev-agent #product #automation-overview #non-engineer #notion #github #pipeline

ai May 9, 2026

Step-by-Step: Build a Notion → PR Auto-Dev Agent — A Reproducible Version of the daodao Pipeline

Build a Notion task → GitHub issue → spec PR → code PR auto-dev agent from scratch. Using the daodao case as a template, this guide walks through every step — what to do, what to verify, and how to handle problems. Notion DB schema → bin/ scaffold → two Claude Code routines → cloud env vars → staging tests.

#ai-agent #claude-code #tutorial #notion-sync #openspec #pipeline-automation #auto-dev-agent #routine #cloud-environment #github-automation

ai May 9, 2026

From Plan to PR: Building daodao's Auto-Dev Agent in Practice

5 rounds of consensus to write the plan, then team mode with 5 workers running 12 tasks in parallel — with plenty of pitfalls along the way. Writing it down for my future self and anyone else trying the same thing.

#ai-agent #claude-code #multi-agent #consensus-planning #auto-dev-agent #notion-sync #openspec #pipeline-automation #internal-coding-agent #defense-in-depth

ai deep-dive May 2, 2026

goose: Open-Source, Cross-Platform, LLM-Agnostic Local AI Agent

goose is an open-source AI Agent maintained by the Linux Foundation's AAIF, supporting 15+ LLM providers and 70+ MCP extensions, built with Rust as a Desktop App + CLI + API. It positions itself as a vendor-neutral, self-hostable alternative to Claude Code.

#goose #ai-agent #open-source #mcp #rust #linux-foundation #aaif #claude-code #cli #desktop-app

tech project Apr 21, 2026

DeerFlow: ByteDance's Open-Source Super Agent Harness for Long-Running Research Tasks

DeerFlow is ByteDance's open-source Super Agent Harness built on Python 3.12 + LangGraph. It orchestrates long-running tasks through sandboxes, long-term memory, sub-agents, skills, and a messaging gateway. It hit #1 on GitHub Trending in February 2026 and has since passed 63,000 stars. It supports Telegram/Slack/Feishu, Claude Code integration, and multiple search backends.

#deer-flow #bytedance #agent #langgraph #langchain #ai-agent #open-source #harness

ai guide Apr 18, 2026

A Book Written by AI Itself, Teaching You How to Build Software with AI

Encyclopedia of Agentic Coding Patterns catalogues 190 patterns to help you make the right software decisions in the age of AI-written code — and the book itself is autonomously written and maintained by an AI agent.

#agentic-coding #design-patterns #llm #ai-agent #software-engineering #claude-code

ai guide Apr 18, 2026

GitHub Copilot Coding Agent: Assign an Issue to AI and Let It Open the PR

GitHub Copilot Coding Agent lets you assign an Issue to Copilot, which then automatically creates a branch, writes code, runs CI, and opens a PR — all inside a cloud sandbox. The key to success is setting up AGENTS.md; without it, the agent tends to go off track. Best suited for well-defined medium-sized tasks; requires Pro+ (1,500 premium requests/month) or Enterprise plan.

#github #copilot #coding-agent #ai-agent #github-actions #sandbox #pr-automation

product project Apr 18, 2026

quidproquo Blog Improvement Roadmap: Content, Technical Debt, RAG Design, and Harness Infrastructure

Using my own 30+ RAG/Agent posts to audit the blog itself, I identified a prioritized improvement list spanning content quality, site tech, RAG design fixes, harness infrastructure, and AI agent applications — no phases, just priorities.

#quidproquo #rag #ai-agent #harness-engineering #context-engineering #blog #product-design

ai guide Apr 17, 2026

Autoreason: Teaching LLMs When to Stop Self-Refining

Autoreason replaces the traditional critique-and-revise loop with a competitive multi-version evaluation mechanism (A/B/AB + blind Borda count), solving three structural problems in LLM self-refinement: prompt bias, scope creep, and lack of restraint.

#autoreason #nous-research #self-refinement #llm #borda-count #iterative-reasoning #ai-agent

ai guide Apr 12, 2026

Claude Managed Agents: Letting Anthropic Handle the Agent Shell and Sandbox

Claude Managed Agents is a beta service launched by Anthropic on 2026/04/08 that provides an agent harness plus cloud container sandbox, billed per token plus $0.08/session-hour. It suits long-running async tasks and is worth exploring if you don't want to build your own agent loop and sandbox.

#claude #managed-agents #anthropic #ai-agent #sandbox #serverless #beta

ai guide Apr 10, 2026

Agent Skills: A Skill Framework That Makes AI Agents Work Like Senior Engineers

Agent Skills is Addy Osmani's open-source collection of 19 production-grade engineering skills that drive AI agents to follow senior engineering discipline through /spec → /plan → /build → /test → /review → /ship commands, instead of cutting corners.

#agent-skills #ai-agent #harness-engineering #claude-code #cursor #gemini-cli #development-workflow

ai guide Apr 5, 2026

Hermes Agent: Nous Research's Self-Improving AI Agent

Hermes Agent is an open-source, self-improving AI agent from Nous Research. It features persistent memory, skill learning, 40+ tools, multi-platform gateways, and support for 200+ model providers, and it serves as the official successor to OpenClaw.

#hermes-agent #nous-research #ai-agent #self-improving #gateway #multi-platform #openclaw

ai guide AI Agent in Practice Apr 4, 2026

From Stripe to Meta: How Silicon Valley's Top Companies Replace Keyboards with AI Agents

Top Silicon Valley companies are independently building internal AI coding agents that automate everything from a Slack message to a merged PR. This article deep-dives into architectures from Stripe, Ramp, Coinbase, and Spotify, then expands to cover Google, Meta, Amazon, Uber, Goldman Sachs, Walmart, and more.

#ai-agent #coding-agents #stripe-minions #agentic-coding #developer-tools #automation #meta #google #uber #amazon

tech guide Apr 2, 2026

Where Should AI Agent Global Skills Live? The Division of Labor Between .claude, Codex Skills, and AGENTS.md

Skill paths are almost always runtime-specific. AGENTS.md is the reliable way to share rules across agents. Put personal reusable capabilities in each agent's supported global directory; put project workflows inside the repo.

#ai-agent #skills #claude-code #codex #agents-md #developer-tools

ai guide Mar 30, 2026

Ticketing Is Dead — Review Is the New Planning

When AI agents can turn intent into a PR in minutes, the bottleneck in software engineering flips from 'planning what to do' to 'evaluating whether the output is correct.' Artifacts of the ticketing era — sprints, story points, backlog grooming — are collapsing to zero, replaced by review as the core practice.

#code-review #software-engineering #ai-agent #adr #developer-workflow #ticketing

ai guide AI Agent in Practice Mar 28, 2026

Anthropic's Harness Design: Making AI Agents Work Like Engineers

The same model produces dramatically different results under different harness designs. Anthropic uses a dual-agent architecture, cross-session state files, and a GAN-inspired generator-evaluator loop to let Claude autonomously complete hours-long software development tasks.

#harness-design #ai-agent #anthropic #claude #multi-agent #long-running-agents #agent-sdk

ai guide AI Agent in Practice Mar 28, 2026

From Prompt to Harness: The Three Evolutions of AI Engineering

AI engineering has gone through three phases: Prompt Engineering (write better instructions) → Context Engineering (feed the right information) → Harness Engineering (design the entire working environment). Each evolution doesn't replace the previous one — it operates at a higher level of abstraction.

#harness-engineering #prompt-engineering #context-engineering #ai-agent #agentic-ai

ai guide Mar 28, 2026

Phil Schmid: Why Agent Harness Is the Most Important Thing in 2026

The model is the CPU, the harness is the operating system, and the agent is the application. No matter how powerful a model is, without a good harness it's just a demo. Phil Schmid argues that harness is the most critical infrastructure in AI engineering for 2026.

#harness-engineering #ai-agent #agent-harness #model-drift #benchmarks #claude-code

tech guide Mar 28, 2026

Complete Guide to Bypassing Cloudflare Anti-Bot for AI Agents: From Debugging to Building an MCP Server

Standard Playwright gets blocked by Cloudflare. Both playwright-extra + stealth and nodriver can bypass it. The final step is wrapping the solution into an MCP server so AI agents can use it automatically.

#cloudflare #anti-bot #playwright #nodriver #stealth #mcp #ai-agent #web-scraping

tech debug Claude Code Automation Guide Mar 27, 2026

Claude Code Global Skills Not Found in New Sessions? Understanding Skill Discovery and How to Debug It

Global skills live in ~/.claude/skills/, but they go missing in new sessions or the Desktop App? The problem usually isn't a missing file — it's that the skill descriptions aren't being loaded into context. This post clarifies the CLI vs Desktop App differences, the role of settings.json, and the most reliable fix.

#claude-code #skills #ai-agent #dx #troubleshooting #settings

tech guide Claude Code Automation Guide Mar 27, 2026

A One-Person Full-Stack Team: AI-Driven Development Workflow from OpenSpec to Auto-Deploy

Use OpenSpec to break requirements into engineering tasks, Claude Code to implement them, hooks to auto-format and protect, local review before committing, three AI reviewers running in parallel on PR, and auto-deploy after merge. This entire workflow lets one person maintain quality across six sub-projects.

#claude-code #openspec #ai-agent #ci-cd #code-review #dx #monorepo #github-actions

tech guide Claude Code Automation Guide Mar 27, 2026

Claude Code Hooks: A Complete Guide to Event-Driven AI Control

Hooks are Claude Code's event system. They trigger shell commands, HTTP requests, or LLM evaluations automatically before/after tool execution, when a prompt is submitted, or when a task ends. Use them to block dangerous operations, run automated reviews, inject context, or write audit logs.

#claude-code #hooks #ai-agent #automation #dx #event-driven

tech guide Claude Code Automation Guide Mar 27, 2026

Claude Code Skills: A Complete Guide to Turning Repetitive Workflows into Single Commands

A Skill is an SOP written for AI. Define the steps in a Markdown file and Claude follows them. No coding required, no frameworks to learn — just write down what an experienced person would do.

#claude-code #skill #ai-agent #dx #automation #workflow #agent-skills

tech guide Claude Code Automation Guide Mar 26, 2026

Claude Code's Three-Layer Quality Defense: Hooks, Skills, and Instruction Files

Hooks are automated safety nets (blocking bad commits), Skills are interactive workflows (running checks + auto-fixing), and instruction files (CLAUDE.md / AGENTS.md) are behavioral guidelines. Each layer operates independently, but together they enable an AI agent to automatically run lint, typecheck, and build checks before every commit.

#claude-code #ai-agent #dx #ci-cd #code-quality #claude-md #agents-md

ai guide AI Agent in Practice Mar 24, 2026

Context Engineering: Why Your AI Agent's Problem Is Information, Not the Model

Context Engineering is the core concept that replaced Prompt Engineering in 2025: the focus shifted from 'how to ask' to 'what information to provide.' Delivering the right information at the right time into the context window is more effective than upgrading to a stronger model. This post covers the definition, four key strategies, practical techniques, and common failure modes.

#context-engineering #prompt-engineering #ai-agent #rag #memory #agentic-ai

tech guide Mar 20, 2026

Turning a Scraper Script into an MCP Server for Claude to Use Directly

Wrap a local Python script into an MCP Server using FastMCP so Claude Code can call it directly — no more manually running pipelines.

#mcp #claude #python #fastmcp #ai-agent

ai guide Mar 17, 2026

The Three Core Pillars of AI Agents: Context, Cognition, Action

An AI agent is not a black box — it is built from three layers: what it knows (Context), how it thinks (Cognition), and what it can do (Action). Understanding these three layers is the key to grasping why agents are sometimes brilliant and sometimes go off the rails, and how to design a truly effective agent system.

#ai-agent #context-engineering #llm #reasoning #ReAct #agentic-ai #memory #mcp

tech guide Mar 14, 2026

Ghostty vs cmux: A Guide to Choosing Your Modern Terminal

Ghostty is a fast, native, general-purpose terminal emulator. cmux is a terminal built on top of Ghostty, specifically designed for AI coding agents. They're not competitors — they operate at different layers.

#ghostty #cmux #terminal #macos #ai-agent