From Claude Code to Pi: Controlling Autonomous Coding Agents with the Pi SDK
By Xiaoyi Zhu
The hardest part of AI coding agents isn't making them smart. It's making them stop.
I haven't written a blog post in a while. Not because I stopped building things, but because the landscape shifted under my feet. Since Claude Opus 4.5 and the wave of frontier models that followed, many of the problems I used to write about (prompt engineering tricks, context window compression hacks, creative workarounds for model limitations) became significantly easier to solve. Not irrelevant, but no longer the bottleneck.
The models got better. The tooling got better. MCP servers, skills systems, and agent frameworks filled gaps that used to require hand-rolled solutions. Problems that once demanded research and experimentation now have off-the-shelf answers. The newer models are trained on the known limitations, so they sidestep pitfalls their predecessors fell into. And developer tools like MCP and skills extend the foundational model's capabilities so broadly that many tough problems can simply be delegated.
So I found myself spending most of my time learning and using coding agents. Specifically, Claude Code. It became my primary tool for months. The built-in features are excellent, and with community skills like Superpowers, Claude Code is genuinely the best tool out there for tackling the hardest, most complicated tasks as an individual developer. I still use it daily.
Then my team introduced me to Pi.
First Impressions: Pi Felt Like a Downgrade
I'll be honest. When I first tried Pi, I thought Claude Code was just better. And for individual use, it is.
Pi is an open-source, minimal terminal coding agent built by Mario Zechner. You might have heard of it as the agent brain powering OpenClaw, which embeds the Pi SDK to run its AI agent capabilities. The philosophy is "primitives, not features." It ships with only four built-in tools (read, write, edit, bash), a simple system prompt, and that's about it. Compared to Claude Code's rich built-in toolset and polished experience, Pi felt bare.
But as I kept using it, I started noticing something. Pi's minimalism wasn't a limitation. It was a design choice. Pi supports custom extensions and a skills system that let you add whatever capabilities you need. And more importantly, underneath the terminal agent sits the Pi SDK, which exposes full programmatic control over the agent loop.
That's when it clicked. Pi isn't trying to be the best tool for an individual developer. It's trying to be the best foundation for building your own agent systems.
Pi vs. Claude Code
| Dimension | Pi | Claude Code |
|---|---|---|
| Philosophy | Minimal core + extensible | Opinionated, batteries-included |
| Source | Open source (MIT) | Closed source |
| Tools | 4 built-in + custom extensions and skills | Rich built-in toolset |
| Extensibility | Extensions, Skills, Prompt Templates | Limited customization |
| SDK | Full programmatic API (createAgentSession, steer(), abort(), subscribe()) | CLI-focused, limited programmatic control |
| Observability | Event stream with tool-level granularity | Limited visibility into internals |
| Best for | Programmatic control, custom agent loops | Individual developer productivity |
Claude Code is a polished product. It's opinionated, and those opinions are mostly right for interactive use. But its closed-source nature means you work within its boundaries. You can't take full control of the agent loop, intercept tool results, or inject context mid-session programmatically.
Pi gives you all of that. The SDK provides createAgentSession() to spin up a fully programmable agent loop, session.steer() to inject instructions mid-turn, session.abort() for hard stops, and subscribe() for observing every event. But the most important piece is wrappable tool execution. Each tool returned by createCodingTools() has an execute() function you can intercept. You can run the original tool, inspect the result, modify it before the LLM sees it, and make control decisions based on patterns in the agent's behavior.
That's what makes precise control over the agent loop possible. Not prompts, not instructions, but actual programmatic hooks into every step of what the agent does.
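As a rough sketch, the wrap-and-intercept pattern looks like this. The `Tool` shape below is a simplified stand-in for what `createCodingTools()` returns, not the Pi SDK's actual types:

```typescript
// Simplified stand-in types; the real Pi SDK shapes may differ.
type ToolResult = { output: string };
type Tool = {
  name: string;
  execute: (input: unknown) => Promise<ToolResult>;
};

// Wrap a tool so the harness can observe each result and rewrite it
// before the LLM ever sees it.
function wrapTool(
  tool: Tool,
  onResult: (toolName: string, result: ToolResult) => ToolResult,
): Tool {
  return {
    ...tool,
    async execute(input) {
      const result = await tool.execute(input); // run the original tool
      return onResult(tool.name, result);       // inspect / modify the result
    },
  };
}
```

Everything in the control layers described below hangs off this one hook: the harness sees every tool result before the model does.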
From Individual Tool to Agent at Scale
Around the same time I was exploring Pi, I read Stripe's blog posts about Minions: fully unattended coding agents that one-shot tasks from a Slack message to a PR that passes CI, with no human interaction in between. Over a thousand merged PRs per week, all AI-authored, human-reviewed.
The industry calls this pattern an "AI agent harness" or "AI coding harness." To me it's simpler than that: building a complete control system around the agent. Think of it like an assembly line. Before anything runs, you define:
- Inputs: What data the agent can access, what context gets fed into the initial prompt
- Tools: What the agent is allowed to do (and not do)
- Runtime control: When to keep going, when to intervene, when to stop
- End states: What counts as "done" (post a diagnosis note, create a PR, escalate to a human)
- Boundaries: How long it can run, what happens when it times out
Every station, every input, every exit path is defined before the first piece moves.
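The assembly-line framing can be written down as a plain config object. The shape and field names below are mine, purely illustrative, not a Pi SDK type:

```typescript
// Hypothetical harness definition; every field is decided in code
// before the agent runs, not left to the LLM.
interface HarnessSpec {
  inputs: { repos: string[]; contextFiles: string[] };    // what the agent can see
  allowedTools: string[];                                  // what it may do
  budget: { wallClockMs: number };                         // how long it may run
  endStates: Array<"diagnosis_note" | "pr" | "escalate">;  // what "done" means
  onTimeout: "post_fallback_note";                         // defined exit path
}

const ticketDiagnosis: HarnessSpec = {
  inputs: { repos: ["service-api"], contextFiles: ["ticket.md"] },
  allowedTools: ["read", "write", "edit", "bash"],
  budget: { wallClockMs: 40 * 60 * 1000 }, // 40-minute wall-clock budget
  endStates: ["diagnosis_note", "pr", "escalate"],
  onTimeout: "post_fallback_note",
};
```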
And I realized: that's exactly what the Pi SDK is designed for. So I decided to build a ticket diagnosis agent using it to explore the full potential.
What I Built with the Pi SDK
Inspired by Stripe's Minions architecture, I built a ticket diagnosis agent using the Pi SDK as the core component, with Claude Code and Pi as my coding partners along the way. Fully unattended: it takes an engineering ticket, investigates the relevant codebase, and produces either a diagnosis note or a fix PR. No human in the loop between trigger and output.
How It Works
The flow is conceptually similar to Stripe's blueprint pattern:
Ticket Created / Triaged
↓
┌────────────────────────────────────────────────────────┐
│ ORCHESTRATOR (deterministic orchestration) │
│ │
│ 1. Pre-fetch ticket metadata, notes, attachments │
│ 2. Clone relevant repositories │
│ 3. Discover and load knowledge base │
│ 4. Build agent session with wrapped tools + skills │
│ 5. Send initial prompt with pre-fetched context │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ AGENT LOOP (LLM + tools, autonomous) │ │
│ │ │ │
│ │ LLM reasons → calls tools → sees results │ │
│ │ (budget footer injected when needed) │ │
│ │ │ │
│ │ Orchestrator monitors for: │ │
│ │ • Budget regime (time-based: HIGH→CRITICAL) │ │
│ │ • Stuck patterns (empty search, repeated reads) │ │
│ │ • Post-steer compliance │ │
│ │ │ │
│ │ Interventions: │ │
│ │ • Passive: budget footer nudges │ │
│ │ • Active: session.steer() interruption │ │
│ │ • Terminal: session.abort() + fallback note │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ VERIFY → ITERATE (if agent proposes a fix) │ │
│ │ │ │
│ │ 1. Run syntax check / tests / linters │ │
│ │ 2. Agent reads git diff, reviews own changes │ │
│ │ 3. If failures → feed back as context → retry │ │
│ │ 4. Only proceed if evidence supports the fix │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ 6. Agent posts diagnosis note OR creates PR │
│ 7. Log session transcript for review │
└────────────────────────────────────────────────────────┘
Three key properties, inspired by Stripe's approach:
- One-shot: A single autonomous session from trigger to output. The agent can iterate internally (search, read, retry), but there's no back-and-forth with a human during the run.
- Fully unattended: No human interaction after the trigger. The agent runs to completion or gets stopped by the orchestrator.
- Deterministic end states: Every possible exit path is defined in code, not left to the LLM to decide. The agent either posts a diagnosis note, creates a PR, or gets stopped by the orchestrator, which posts a fallback note. There's no random exit, no silent failure, no "the agent just stopped." Every run produces a predictable, useful artifact.
The Hard Part: Controlling an Autonomous Agent
Building the agent was straightforward. Controlling it was not. Here's the core tension: an LLM that's "too autonomous" is indistinguishable from one that's stuck.
The Problem I Hit
Early runs revealed a pattern I now think of as "budget blindness." The agent would receive an investigation budget in its initial prompt, then proceed to ignore it completely. Once it got deep enough into the conversation, the initial budget warning had faded from the model's attention window. In one case, the agent ran dozens of consecutive searches against wrong directories, all returning empty results, burning through its budget while finding nothing.
The root causes were obvious in retrospect:
- Budget state was invisible to the LLM. I had budget tracking in place, but it was only visible to me as the developer. The LLM itself only knows what's in its conversation context. If the budget state isn't part of that context, the agent has no idea it exists.
- No stuck detection. The agent could loop through dozens of empty searches with no mechanism to recognize the pattern.
- Prompt-only enforcement doesn't work. Phase gates written in the skill file are just instructions. The LLM overrides them when it feels "close to finding it."
This maps to a broader industry pattern. Research from Google's Budget-Aware Tool-Use paper, Stripe's Minions team, and various SWE-agent (software engineering agent) harness designs all converge on the same insight: prompt-based budget enforcement degrades after ~15 turns. The only reliable mechanism is injecting budget state into every tool result.
The Solution: Three Control Layers
This is where the Pi SDK earned its keep. I implemented a layered control system using its primitives, drawing from patterns documented in Google's BATS framework, Stripe's blueprint architecture, and various SWE-agent harness designs.
Layer 1: Budget Footer Injection (Passive)
Tool results get a budget footer appended before the LLM sees them. This uses Pi's tool wrapping: intercept tool.execute(), run the original, then modify the result.
I originally injected the footer on every single call, but quickly realized there's no point telling the agent "Time: 3m / 40m | Regime: HIGH" when it just started. That's noise, not signal. So I updated it to only inject the budget footer once the regime hits MEDIUM or above. During HIGH regime, the agent just works without distraction. Once time pressure becomes relevant, the footer kicks in:
───
Time: 22m / 40m elapsed | Regime: MEDIUM
Consider posting your hypothesis if you have one.
───
The regime system uses wall-clock time, not call counts (which are a poor proxy for actual investigation time):
| Regime | Elapsed | Behavior |
|---|---|---|
| HIGH | 0–50% | No footer injected, agent works freely |
| MEDIUM | 50–75% | Gentle nudge to consider posting |
| LOW | 75–90% | Direct instruction to post findings |
| CRITICAL | 90%+ | Hard stop instruction |
Once the regime crosses into MEDIUM, the LLM sees this footer on every tool call. It can't forget. It can't miss it. Budget state is pushed to the LLM, not pulled by it.
Layer 2: Stuck Pattern Watchdog (Active)
Independent of budget, the harness tracks behavioral patterns:
- 3 consecutive empty bash results (searches returning no output) → stuck
- Same file read 3+ times → stuck
- Same directory searched 3+ times with no new results → stuck
When a stuck pattern is detected, the harness fires session.steer(), Pi's mid-turn interruption primitive. This injects a message into the conversation after the current tool completes, redirecting the agent. The steer message typically tells the agent to stop searching, summarize whatever it has found so far, and output its results even if incomplete.
The critical design decision: steer, don't abort. Steering gives the agent a chance to post useful findings. Aborting kills the session and potentially loses everything the agent has learned.
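The watchdog itself is simple bookkeeping over tool results. A sketch with the thresholds above (the class shape is mine, not part of the Pi SDK):

```typescript
// Tracks behavioral stuck patterns across tool calls.
class StuckWatchdog {
  private emptyBashStreak = 0;
  private readCounts = new Map<string, number>();

  // Returns true when a stuck pattern is detected and a steer should fire.
  observe(tool: string, input: string, output: string): boolean {
    if (tool === "bash") {
      // Count consecutive empty results; any non-empty output resets the streak.
      this.emptyBashStreak = output.trim() === "" ? this.emptyBashStreak + 1 : 0;
      if (this.emptyBashStreak >= 3) return true; // 3 empty searches in a row
    }
    if (tool === "read") {
      const n = (this.readCounts.get(input) ?? 0) + 1;
      this.readCounts.set(input, n);
      if (n >= 3) return true; // same file read 3+ times
    }
    return false;
  }
}
```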
Layer 3: Hard Abort with Fallback (Terminal)
If the agent ignores the steer and continues investigating (5+ investigation-like tool calls post-steer), the harness escalates to session.abort(). But before aborting, it posts a fallback note containing what the agent searched and any partial findings. The session always produces a useful artifact, even on a forced stop.
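Post-steer compliance is also just counting. A sketch of the escalation trigger, under the "5+ investigation-like calls" rule (names are mine):

```typescript
// Tools that count as "still investigating" after a steer.
const INVESTIGATION_TOOLS = new Set(["bash", "read"]);

class SteerCompliance {
  private steered = false;
  private postSteerCalls = 0;

  markSteered(): void {
    this.steered = true;
    this.postSteerCalls = 0;
  }

  // Returns true when the harness should call session.abort()
  // and post the fallback note.
  observe(tool: string): boolean {
    if (!this.steered) return false;
    if (INVESTIGATION_TOOLS.has(tool)) this.postSteerCalls++;
    return this.postSteerCalls >= 5; // steer ignored 5 times → abort
  }
}
```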
Three layers, progressive escalation:
Passive footer ──→ Active steer ──→ Terminal abort
(every call) (stuck detected) (steer ignored)
"You see your "Stop and post "Forced stop,
budget state" findings now" fallback posted"
AI Harness vs. Vibe Coding
This experience made me think a lot about a distinction that keeps coming up: agent harness vs. vibe coding.
| Dimension | Agent Harness | Vibe Coding |
|---|---|---|
| Scale | Fleet of agents, many tasks in parallel | One developer, one session |
| Control | Precise, mechanical, auditable | Conversational, ad-hoc |
| Failure mode | Must produce output even on failure | Developer notices and retries |
| Metaphor | Assembly line: standardized, repeatable | Skilled artisan: flexible, creative |
| Optimization | Throughput, consistency, cost per task | Speed, developer experience |
Vibe coding is powerful for exploration, prototyping, and individual productivity. But when you need to run agents across hundreds of tickets per week with consistent quality and auditability, you need an assembly line. Deterministic orchestration with constrained agent autonomy.
Stripe's Minions produce 1,300+ PRs per week. That's not vibe coding. That's manufacturing. The blueprint architecture exists precisely because you can't scale "just let the agent figure it out."
What I Learned
1. The LLM Cannot See Its Own Budget
This is the single most important lesson. Budget warnings in logs, in the initial prompt, in skill files: the LLM doesn't reliably see any of these after enough turns. The only mechanism that works is injecting budget state into every tool result. Google's Budget-Aware Tool-Use research confirms this, showing 30%+ cost reduction when the LLM can see its own budget state on every call.
2. Stuck Detection Is More Important Than Budget Limits
The agent that burned through dozens of empty searches wasn't ignoring its budget. It genuinely believed it was making progress. A budget limit would have eventually stopped it, but stuck detection catches the problem earlier and allows for graceful recovery. The right signal isn't "how many calls have you made." It's "are your recent calls producing new information?"
3. Steer Before You Abort
session.steer() is the Pi SDK's killer feature for anyone building agent control systems. It lets you interrupt an agent mid-turn, inject new instructions, and redirect behavior without losing the conversation context. Aborting is a last resort because it kills the session and potentially loses useful partial findings. The pattern is: nudge, then steer, then abort. Escalate only when softer interventions fail.
4. Prompt-Only Enforcement Has a Half-Life
Skill files, system prompts, initial instructions. They all decay in effectiveness as conversation length grows. This is well-documented in attention research: early-context instructions lose influence after 15-20 turns. Mechanical enforcement (tool wrapping, stuck detection, hard timeouts) is the only reliable backstop.
5. Fully Autonomous Agents Are Harder Than They Look
Zero mid-session feedback is the default for unattended agents. Budget state is invisible. There's no stuck detection. The only mechanical control is a hard abort, and as I found out, even that can have bugs that prevent it from functioning. Building reliable autonomous agents requires thinking about failure modes that simply don't exist in interactive use.
6. Never Trust the Agent's Self-Assessed Confidence
Controlling the agent loop solves the "when to stop" problem. But there's a second problem I ran into: the agent stops at the right time, but its conclusions are wrong.
My agent was designed to create a PR when it had "high confidence" in its diagnosis. The problem: that confidence was entirely self-assessed. METR's research found that ~50% of PRs that pass automated tests would NOT be merged by real maintainers. Academic research consistently shows LLM confidence scores are poorly calibrated: models assign 90%+ confidence to outputs that are only ~50% accurate.
Asking the agent "are you sure?" doesn't help either. ICLR 2024 research showed that LLMs cannot reliably self-correct reasoning without external feedback. The CRITIC framework confirmed that tool-interactive verification (running tests, reading diffs, checking build output) is the critical differentiator. Self-reflection alone produces no improvement and sometimes makes things worse.
The fix had two parts. First, I replaced the binary "confident → PR, else → note" decision with a much narrower gate. Most fixes should result in a diagnosis note with a suggested fix, not a PR. Only the most obvious, mechanically verifiable fixes (typos, missing null checks, wrong variable names) should get auto-PRed. Second, I added a verify→iterate loop before any PR gets created: run syntax checks, run targeted tests, have the agent read its own git diff and review the changes. If anything fails, feed it back as context and retry. This is the same pattern used by platforms like Devin and OpenAI Codex. Evidence from tools, not self-assessment from the model.
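The decision logic behind that gate can be sketched as a pure function: tool-based evidence and fix type decide the outcome, never the model's self-reported confidence. The shapes and names below are mine, illustrative only:

```typescript
type CheckResult = { syntaxOk: boolean; testsOk: boolean };
// "mechanical" = typo, missing null check, wrong variable name;
// anything else is "substantive" and never auto-PRed.
type FixKind = "mechanical" | "substantive";
type Decision = "create_pr" | "retry_with_feedback" | "post_note";

function gate(
  check: CheckResult,
  kind: FixKind,
  attempt: number,
  maxAttempts = 3,
): Decision {
  const passed = check.syntaxOk && check.testsOk;
  if (!passed) {
    // Feed failures back as context and let the agent try again,
    // up to a fixed retry budget.
    return attempt < maxAttempts ? "retry_with_feedback" : "post_note";
  }
  // Even a passing fix only auto-PRs when it's mechanically verifiable.
  return kind === "mechanical" ? "create_pr" : "post_note";
}
```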
7. The SDK Is What Makes It Work
Tool wrapping, steer(), abort(), subscribe(), the event stream. These are simple primitives individually, but they compose into a full control system over the agent loop. Claude Code doesn't expose these because it doesn't need to for its use case. Pi's SDK exposes them because the whole point is to let developers build their own control layer. Everything I built, the budget tracker, the stuck watchdog, the progressive escalation, sits on top of these SDK primitives. Without that level of access to the agent loop, none of it would be possible.
Best Practices for Agent Harness Development
Do
- Inject budget/state into every tool result. The LLM must see its constraints on every call.
- Implement stuck detection based on behavioral patterns, not just call counts.
- Use progressive escalation: passive nudge → active steer → hard abort.
- Always produce an artifact, even on failure. Fallback notes beat silent timeouts.
- Log every tool call with full input/output for post-mortem analysis.
- Require tool-based evidence (test output, diffs, build results) before high-stakes actions like creating PRs.
Don't
- Trust the agent's self-assessed confidence for gating decisions. Use evidence from tools instead.
- Rely on prompt-only enforcement for long-running sessions. It decays.
- Use call counts as the primary budget mechanism. Wall-clock time is a better proxy.
- Abort without attempting graceful shutdown first. steer() before abort().
- Assume the agent will self-regulate. Mechanical enforcement is the only guarantee.
- Build without observability. You need to understand what the agent did and why.
Wrapping Up
The shift from using coding agents to controlling them programmatically is a bigger leap than I expected. The hard problem isn't making agents smart, it's making them governable. An agent that ignores its budget, loops through empty searches, or creates PRs based on self-assessed confidence is worse than no agent at all. Budget visibility, stuck detection, verification loops, graceful failure. None of these come for free. You have to build them.
The Pi SDK gave me the primitives to do that. Its minimalism felt like a limitation at first, but turned out to be exactly the right foundation for building a control system around the agent loop. Claude Code remains my go-to for individual work. For controlling an autonomous agent that no one's watching, Pi was the right tool. Different problems, different tools.