Code Is Cheap. Software Is Not.
Xiaoyi Zhu
Writing code is cheap now. Keeping software sane is not.
This is the second post in a series about building an autonomous ticket diagnosis agent on top of the Pi SDK. The first post was about control: wrapping an LLM that would otherwise keep going in a harness that makes every run produce a predictable, deterministic artifact, like a diagnosis note or a fix PR, instead of an open-ended session that nobody can reason about.
That part worked. Once the harness stabilized, ticket resolution speed on the tickets the agent handles went up by about an order of magnitude. Good enough that my team started restructuring the triage process around the agent, not the other way around. Humans became reviewers. The agent became the first responder.
Which exposed the next problem.
The New Bottleneck
When a human writes a PR, the human carries context. They remember what they touched last week, they flinch when something feels off, they've seen the weird production edge case before. When a fleet of autonomous agents opens PRs at machine speed, that implicit context is gone. The agent doesn't carry the team's scar tissue. It writes a plausible fix, the tests pass, the diff looks clean, and a reviewer with a dozen PRs in the queue waves it through.
The constraint has moved. Generating code is cheap now. Keeping the software that code lives inside of from drifting into something nobody understands is not. The bottleneck moved from writing to verifying.
So I put real work into making the agent's environment treat tests as load-bearing infrastructure, not an afterthought. Four threads, in roughly this order:
- Adopt Red/Green TDD as the default flow for the agent.
- Rebuild the unit test setup so the agent can write new tests dynamically, against clear conventions, and so I can actually trust the results.
- Build deterministic guardrails (state machines, classifiers, git hooks) that keep the agent from gaming its own tests.
- Put a Playwright E2E story in place, with a second narrow agent to handle the tedious part.
1. Red/Green TDD, For an Agent This Time
TDD is one of those patterns people have strong opinions about. For an autonomous agent, most of the debate becomes beside the point.
The core problem with an agent writing a fix is that the model has no grounded way to know if the fix actually does anything. "I think this should work" is all it has. Left alone, it will happily declare success while the tests are still red. A failing test fixes that: it's an objectively checkable answer. Either it fails before the fix and passes after, or it doesn't. No vibes.
Simon Willison has a clean write-up of the pattern. The agent version is basically:
- Red: write a test that reproduces the bug. Run it. It must fail for the right reason.
- Green: apply the smallest fix. Run the test. It must pass.
- Suite: run the rest of the suite. Nothing else may regress.
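In harness terms, the three steps collapse into one deterministic check. A minimal sketch, assuming a runner that reports pass/fail counts (the shape is illustrative, not the harness's actual interface):

```typescript
// Result of one test run, as reported by the runner.
type RunResult = { failed: number; passed: number };

// Validate a Red/Green/Suite cycle. Returns null when the cycle is sound,
// or a human-readable reason when it is not.
function checkRedGreen(red: RunResult, green: RunResult, suite: RunResult): string | null {
  if (red.failed === 0) return 'Red phase invalid: the new test never failed';
  if (green.failed > 0) return 'Green phase invalid: the new test still fails';
  if (suite.failed > 0) return 'Suite regressed: other tests now fail';
  return null; // the fix is verifiable: red before, green after, no regressions
}
```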
The bug becomes the target. The fix is whatever makes the test pass. The review artifact is a diff plus a test that flips from red to green. That is a much tighter object to audit than "the agent says it fixed it."
This is the shape I wanted every fix PR to have. But I couldn't just tell the agent to do TDD. Telling an agent to do something over a long session has a half-life. The TDD discipline had to be built into the environment, not the prompt.
2. Rebuilding the Unit Test Framework
Before I could enforce TDD, the agent needed a test setup it could actually write into. More importantly, a setup whose results I could trust without re-checking by hand.
Our unit test situation was not great. Inconsistent runners across packages, scattered file locations with no naming convention, global setup that assumed specific local dev state, fixtures built up ad-hoc inside individual tests. A human can work around that. An agent that has never met your team's unwritten rules cannot.
I rebuilt the unit test layer around a few choices, all of them things the testing community has written about for years. I'm listing them because for an autonomous agent, getting these conventions right is not optional. The agent writes to the lowest-friction path it can find, and the only way that path leads to good tests is if good tests are also the easiest thing to write.
- One runner, Vitest, everywhere. One config style for Node packages, one for the browser bundle with jsdom. One command to run everything.
- Co-location. `Button.test.tsx` next to `Button.tsx`. Vitest supports both co-location and a separate `tests/` directory. I picked co-location because the agent writes one fewer path-guessing step, and the test file is impossible to miss during code review.
- Factories, not ad-hoc fixtures. A small `testing/factories/` module with `makeX()` helpers for the domain objects the agent touches most. When the agent writes a test, the factory is the path of least resistance, so that's what it uses. No more "each test invents its own shape."
- `expect.requireAssertions: true`. A Vitest config flag that fails any test with zero assertions. This single flag kills an entire class of vacuous "test" the model likes to write when it's under time pressure: the kind that sets up fixtures, calls the function, and then asserts nothing.
- Vitest's built-in `agent` reporter. Vitest ships a reporter designed for AI coding tools: minimal output, only failed tests and errors, no summary noise. It auto-enables when Vitest detects an AI agent environment. The default reporter is great for humans and terrible for a context window. The agent reporter flips that trade-off.
- Convention docs next to the tests. A short repo-local skill doc describing the patterns (where tests live, which factories to use, the assertion style), which the agent loads on demand when it's about to write a test.
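In config terms, most of that list reduces to a few lines. A sketch of a shared `vitest.config.ts` under those assumptions (paths illustrative; `requireAssertions` is the flag that rejects assertion-free tests):

```typescript
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    // One runner, one convention: co-located tests across the repo.
    include: ['src/**/*.test.{ts,tsx}'],
    // Browser-bundle packages run under jsdom; Node packages omit this line.
    environment: 'jsdom',
    expect: {
      // Fail any test that makes zero assertions.
      requireAssertions: true,
    },
  },
});
```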
The second-order benefit was structural. Writing down the conventions so a machine could follow them forced the humans to agree on them too. A surprising amount of testing disagreement on a team is actually unspoken convention, not disagreement about ideas.
3. Stopping the Agent from Cheating Its Own Tests
This is the part I'd researched heavily before I started, because I knew what was coming. I'd written about the same failure mode a year earlier in Keeping AI Honest: Why TDD Is More Important Than Ever in the AI Coding Era. The classic pattern where an agent, told to make a failing test pass, just deletes the test. Not an "AI will be AI" surprise. A load-bearing design problem.
If you tell an LLM "make the tests pass," it will, by default, make the tests pass. Not fix the bug. Make the tests pass. Recent UC Berkeley research from April 2026, How We Broke Top AI Agent Benchmarks, drives this home. Their team built an automated scanning agent and audited eight major AI agent benchmarks (SWE-bench, WebArena, Terminal-Bench, and others) and found that every single one could be exploited to reach near-perfect scores without actually solving any tasks. SWE-bench Verified hit 100 percent by forcing pytest hooks to pass. A test suite is an evaluation mechanism. If "passing tests" is the reward and the tests are in scope for editing, the tests are the attack surface.
In practice, I watched the agent try every cheat in the catalog:
- Edit an assertion so the expected value matches the buggy output. Literally rewriting the `expect(...)` call to match what the current code returns.
- Weaken a specific assertion into a vacuous one. `expect(result).toBe('foo')` becomes `expect(result).toBeTruthy()` or `expect(result).toBeDefined()`. Technically an assertion, practically true for almost any value.
- `it.skip` or `it.todo` the failing test. Still counts as "part of the file," no longer runs.
- Delete the failing test outright.
- Derive the "expected" value from running the current (buggy) code. The test is tautological from birth: it asserts that the code does what the code already does.
Prompt-only rules ("do not modify test files during the Green phase") work for maybe 15 to 20 turns and then fade. That matches the wider finding about instruction decay in long contexts. You cannot rely on the system prompt to enforce discipline deep into a session. You have to move the rule into the environment.
What Didn't Work (Or Only Half-Worked)
Before landing on the design that stuck, I tried the obvious options. They all failed for reasons worth laying out.
Hard-blocking writes to test files after the Red phase.
Pi's SDK gives you a way to block tool calls before they run. A tool hook can return { block: true, reason } and Pi will reject the call outright. My first attempt was the obvious use of this: as soon as the agent wrote and committed the failing test, any later write or edit to that test file would be blocked. On paper it looks airtight. The agent literally cannot weaken the test, skip it, or delete it.
In practice, it broke on the first real session. Sometimes the Red-phase test has a small bug in it the first time the agent writes it. A wrong import path. A missing await. A typo in the name of a factory. When that happens, the test doesn't even run, so there's nothing to verify. The agent has to go back and fix the test file to get it executing at all. The hard block says no. Now the test has a bug, the harness won't let the agent touch the test, and the session stalls because the agent has nothing left to try.
The lesson: test files can't be strictly immutable after Red. The agent needs a way to fix a broken test while still being stopped from weakening a working one. A binary block can't tell the two apart.
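For reference, the rejected hard-block version looked roughly like this. The types are hypothetical and Pi's actual hook signature differs, but the `{ block: true, reason }` return shape is the one the SDK exposes:

```typescript
// Hypothetical shapes for a Pi tool-call hook; the real SDK payloads differ.
type ToolCall = { tool: string; args: { path?: string } };
type HookResult = { block: true; reason: string } | undefined;

// Flipped once the Red-phase test is committed.
let redPhaseLocked = false;

function onToolCall(call: ToolCall): HookResult {
  const isTestWrite =
    /write|edit/i.test(call.tool) && /\.test\./.test(call.args.path ?? '');
  if (redPhaseLocked && isTestWrite) {
    // The binary block: no way to distinguish "fixing a broken test"
    // from "weakening a working one". This is why it failed in practice.
    return { block: true, reason: 'test files are locked after the Red phase' };
  }
  return undefined; // allow the call
}
```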
Advisory warnings injected into tool output.
Pi's SDK lets the harness intercept a tool's result before the model sees it, and append text to it. I built this next: whenever the agent wrote to a test file after the Red phase, the harness appended a warning like "you are editing a test that is already locked in Red; if you're fixing an actual error in the test, state the specific error first." The agent sees the warning and responds to it.
This is softer than blocking, and I kept it. But advisory is the key word. The warning gets read, acknowledged, and then quietly ignored a few turns later, especially on long sessions. It's good telemetry. It is not enforcement.
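The advisory layer is only a few lines. A sketch, assuming a simplified tool-result shape (Pi's real interception API differs):

```typescript
// Append an advisory warning to a tool result before the model sees it.
// The result shape here is illustrative, not Pi's actual payload.
function annotateToolResult(
  result: { output: string },
  path: string,
  redLocked: boolean,
): { output: string } {
  if (redLocked && /\.test\./.test(path)) {
    result.output +=
      '\n[harness] You are editing a test that is already locked in Red. ' +
      'If you are fixing an actual error in the test, state the specific error first.';
  }
  return result;
}
```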
What Worked: Deterministic Enforcement Outside the Model
The thing that actually held up was moving enforcement out of the agent's context and into the runtime, in a place the agent can't talk itself out of.
The final design has three pieces that build on each other. The first piece tracks what the agent actually did (not what it claims). The second piece decides whether a given change even needs to go through Red/Green. The third piece enforces the result at commit time. Each piece does one job, and together they make the TDD flow non-optional without boxing the agent into a stuck loop.
Piece 1: A TDD state machine driven by Pi's event stream.
Pi's SDK emits a tool_execution_end event for every tool call, with the tool name, arguments, and result. I subscribe to that stream outside the agent loop and advance a small state machine based on what the agent actually did, not what it claims to do.
```
idle
└─ write to a *.test.* file
   → test-written
└─ test run shows failures (while phase = test-written)
   → red-verified
└─ test run shows all passing (while phase = red-verified)
   → green-verified
└─ git commit with a fix-style message
   → done
```
How Red and Green are actually distinguished: Vitest prints a summary line in a consistent format (`Tests  2 failed | 5 passed (7)`), so a small regex on the tool result pulls out the pass/fail counts. But the counts alone don't tell you the phase. An all-passing run at the start of a session is not Green, it's just "nothing's broken." The transitions have to follow a narrative order:
- Failing counts count as Red only if the previous phase was `test-written` (a new test was just created).
- All-passing counts count as Green only if the previous phase was `red-verified` (we already saw Red).
State is persisted to a small JSON file on disk. The agent doesn't see this file. It's infrastructure for the next layer.
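The phase tracker can be sketched as a small reducer over tool events. A minimal version, assuming simplified event and summary shapes (the real Pi payloads and the harness's actual regexes differ):

```typescript
type Phase = 'idle' | 'test-written' | 'red-verified' | 'green-verified';
type ToolEvent = { tool: string; args: string; output: string };

// Pull pass/fail counts out of a Vitest summary line,
// e.g. "Tests  2 failed | 5 passed (7)".
function parseSummary(output: string): { failed: number; passed: number } | null {
  const failed = /(\d+)\s+failed/.exec(output);
  const passed = /(\d+)\s+passed/.exec(output);
  if (!failed && !passed) return null;
  return {
    failed: failed ? Number(failed[1]) : 0,
    passed: passed ? Number(passed[1]) : 0,
  };
}

// Advance the state machine based on what the agent actually did.
function advance(phase: Phase, event: ToolEvent): Phase {
  // A write to a *.test.* file arms the Red check.
  if (/\.test\./.test(event.args) && /write|edit/i.test(event.tool)) {
    return phase === 'idle' ? 'test-written' : phase;
  }
  const summary = parseSummary(event.output);
  if (summary) {
    // Failures only count as Red right after a new test was written.
    if (summary.failed > 0 && phase === 'test-written') return 'red-verified';
    // All-passing only counts as Green after Red was observed.
    if (summary.failed === 0 && summary.passed > 0 && phase === 'red-verified') {
      return 'green-verified';
    }
  }
  return phase;
}
```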
Piece 2: Diff classification, so enforcement is proportional.
Not every change needs a test. A typo fix in a log string does not need a regression test. Requiring one just trains the agent to produce test theater.
So before the Red/Green gate kicks in, the harness reads git diff and classifies the change into one of a few buckets. Concretely, the classifier is a small function of regexes over the raw diff text:
- If a changed file path matches `auth | permission | acl | payment | billing` → risky area.
- If every changed line is a comment, log statement, or string-only change → cosmetic.
- If the total code-line change is ≤ 3 lines → trivial.
- If ≥ 70% of changed lines are additions and ≤ 3 are removals → new behavior.
- Otherwise → logic fix.
Cosmetic and trivial are exempt from TDD. The other three categories must go through Red/Green.
This is not a clever classifier. It is a dumb heuristic. That is the point. It has no LLM call, no drift, and no way to be argued with. It misclassifies occasionally, and that's fine. The ambiguous cases fall into "logic fix" (the default), which just requires the normal Red/Green cycle. Over-requiring tests is a small cost; under-requiring them is the cost I was building all of this to avoid.
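A sketch of that classifier, using the thresholds from the list above (the cosmetic-line regex here is illustrative and cruder than a production one would be):

```typescript
type Bucket = 'risky' | 'cosmetic' | 'trivial' | 'new-behavior' | 'logic-fix';

const RISKY_PATH = /auth|permission|acl|payment|billing/;
// A changed line that is only a comment, a log call, or a string literal.
const COSMETIC_LINE = /^[+-]\s*(\/\/|\/\*|\*|console\.log|['"`])/;

function classify(diff: string): Bucket {
  const lines = diff.split('\n');

  // File paths appear on the +++/--- header lines of a unified diff.
  const paths = lines.filter(l => l.startsWith('+++') || l.startsWith('---'));
  if (paths.some(p => RISKY_PATH.test(p))) return 'risky';

  // Changed lines, excluding the file headers.
  const changed = lines.filter(
    l => (l.startsWith('+') || l.startsWith('-')) && !l.startsWith('+++') && !l.startsWith('---'),
  );
  if (changed.length > 0 && changed.every(l => COSMETIC_LINE.test(l))) return 'cosmetic';
  if (changed.length <= 3) return 'trivial';

  const additions = changed.filter(l => l.startsWith('+')).length;
  const removals = changed.length - additions;
  if (additions / changed.length >= 0.7 && removals <= 3) return 'new-behavior';

  return 'logic-fix'; // the default bucket: ambiguous cases require Red/Green
}
```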
Piece 3: A git commit-msg hook as the hard gate.
This is the piece that made the whole thing robust. A git `commit-msg` hook reads the TDD state file and the classification, and rejects any fix commit that did not reach `green-verified`. Cosmetic and trivial changes pass. Everything else either goes through Red and Green or it doesn't ship.
The hook is a bash script that runs inside git. The agent can see the commit get rejected (and it does, and it retries correctly the next turn), but it cannot remove the hook mid-session. When the agent tries to skip the test step and commit anyway, git says no and prints the exact steps to recover.
The underlying idea is simple: a rule in a skill file is a reminder. A rule in a git hook is a fact. For anything you actually need enforced, put the rule in code that the agent cannot edit.
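The decision the hook makes is small enough to sketch. Assuming the hook is a thin script in `.git/hooks/commit-msg` that reads the commit message file and the state file, then calls something like this (state-file name and the fix-message convention are illustrative):

```typescript
type TddState = { phase: string; bucket: string };

// Decide whether a commit is allowed, given the tracked TDD state
// and the diff classification.
function gate(message: string, state: TddState): { ok: boolean; reason?: string } {
  // Cosmetic and trivial diffs are exempt from Red/Green.
  const exempt = state.bucket === 'cosmetic' || state.bucket === 'trivial';
  // A "fix-style" commit message (convention is illustrative).
  const isFix = /^(fix|bugfix)\b/i.test(message);
  if (isFix && !exempt && state.phase !== 'green-verified') {
    return {
      ok: false,
      // The rejection message doubles as recovery instructions for the agent.
      reason:
        'fix commits must go through Red/Green: write a failing test, watch it ' +
        'fail, apply the fix, watch it pass, then commit again.',
    };
  }
  return { ok: true };
}
```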
This Is Where the Industry Landed Too
None of this is idiosyncratic. Hooks, both framework hooks and git hooks, are where the broader agent-tooling world has converged as the anti-cheat and anti-drift layer.
Framework-level hooks:
- LangChain agent middleware exposes `before_model`, `wrap_tool_call`, and `after_agent`, specifically for intercepting and blocking non-compliant actions before they run.
- Pi SDK (what I use) emits a `tool_call` event whose handler can return `{ block: true, reason }` to cancel a tool call outright.
- Claude Code ships `PreToolUse` and `PostToolUse` hooks configured in `settings.json`. A `PreToolUse` hook returning `deny` blocks the tool even in `--dangerously-skip-permissions` mode. That harness deliberately places hooks below agent control.
Different SDKs, same primitive: run code the model can't rewrite, at a point it can't skip.
Git hooks as the commit-time gate:
Writeups like "How Git Hooks Steer AI Coding Agents in Production" make the case directly: instructions are followed around 90 to 95 percent of the time by AI agents; git hooks are followed 100 percent; and that small gap is where production gets burned. Hook output itself becomes prompt engineering. A clear rejection message with the rule, the offending line, and the fix lets the agent self-correct on retry.
And yes, agents will try to bypass them.
Anthropic has a public issue documenting Claude Code doing exactly this: using --no-verify, git stash, and quiet flags to slip non-compliant commits past pre-commit hooks. Which is the strongest argument for layering. A local hook the agent can disable is still better than no hook, but anything truly load-bearing belongs somewhere the agent can't reach: a server-side pre-receive hook, or a CI gate, or both.
The pattern has started getting its own name: the runtime enforcement layer for agents. Whatever you call it, the underlying bet is the same. If you want an autonomous agent to obey a rule, put the rule in the runtime, not the prompt.
Verification vs. Regression: Two Tiers of Tests
One more piece that came out of this work. The Red/Green flow produces a test for every non-trivial fix. If every one of those tests gets permanently committed to the suite, the suite grows with the agent's throughput. At a few hundred runs a week, this is how you get the classic test-bloat story. A 400-test suite becomes a 1,000-test suite in six months, CI time doubles, nobody notices until the deploy pipeline is unusable.
So tests live in two tiers:
- Verification tests. Written to prove a specific bug exists and is fixed. Thrown away after Green unless they add clear regression value.
- Regression tests. Kept in the suite forever. Reserved for core modules, security-adjacent code, recurring bug patterns, and anything that changes an observable contract. Capped per session, default is "delete unless there's a clear reason to keep."
The best selection rule I found is what I call the revert test: if someone silently reverted this fix tomorrow, would this test fail? If the answer is no, the test isn't earning its keep as a regression guard. Delete it.
This is a manual version of a more formal technique called mutation testing, where a tool intentionally plants small bugs in your code and checks whether your test suite catches them. Trail of Bits argues this matters more in the agent era than it used to, because AI-generated tests often hit high coverage numbers without actually verifying behavior. Coverage tells you that a line of code ran. It does not tell you that anything was actually checked.
4. Playwright E2E and the Fixture Problem
Unit tests catch bugs where a function returns the wrong value. They don't catch the kind where a button doesn't wire up to its handler, a migration reorders fields, or an auth cookie drops on a redirect. For that you need real browsers, real sessions, and real network: Playwright, or something like it.
It is tempting to hand the E2E story to another agent end-to-end. I don't think that's right. A human needs to own the test plan: the scenarios that matter, the acceptance criteria, the "what should happen when the customer does this." An agent is not the right judge of whether a flow matches product intent. QA owns the plan and owns the sign-off.
But there is a real place where an agent helps. Writing a Playwright test by hand is slow. And counterintuitively, the slow part is usually not the assertions. It's the fixtures: the logged-in session, the seeded account, the environment configuration, the API mocks, the state your test assumes exists before the first click. Before you can write expect(page.getByRole('button')).toBeVisible(), you need an account with the right permissions, in the right environment, with data in the right shape. Staging data drifts. Schema changes. What the doc said three months ago is not what the staging account looks like today.
So I built a second, narrower agent harness specifically around that bottleneck. The loop is roughly: point it at a staging account, have it discover what the current data shape actually is, produce or update a fixture a Playwright test can consume, and flag the test when the fixture schema drifts under it. The tedious human cycle it replaces is: write a Playwright test, go figure out the fixture schema, update the test when the schema changes, rewrite the fixture, return to the test. That cycle is the part humans find most tedious.
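The drift check at the center of that loop doesn't need to be clever. A minimal sketch of structural comparison between a stored fixture and a live staging record (shape only, names illustrative):

```typescript
// Describe the structural shape of a value: field names and primitive types,
// ignoring the actual data.
function shapeOf(value: unknown): string {
  if (Array.isArray(value)) return `${shapeOf(value[0])}[]`;
  if (value !== null && typeof value === 'object') {
    const keys = Object.keys(value as Record<string, unknown>).sort();
    return `{${keys
      .map(k => `${k}: ${shapeOf((value as Record<string, unknown>)[k])}`)
      .join(', ')}}`;
  }
  return typeof value;
}

// Flag a fixture when the live staging record no longer matches its shape.
function fixtureDrifted(fixture: unknown, liveRecord: unknown): boolean {
  return shapeOf(fixture) !== shapeOf(liveRecord);
}
```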
The human still writes the test plan, reviews the generated test, and owns acceptance. The agent's job is to keep the fixtures honest and surface schema drift, not to decide what the test should check.
Two agent harnesses, same shape: narrow scope, human-owned success criteria, deterministic guardrails around the parts the model should not be trusted with. That's the pattern that has held up.
What I Learned
Code is the cheap part now.
Generating plausible code is no longer the hard problem. Keeping the software sane while multiple agents push code into it is. Every hour I spent on "make the agent write better code" paid off less than an hour I spent on "make the environment catch worse code."
Write rules where the agent can't edit them.
System prompts, skill files, rules in the opening turn: all of these decay as the session grows. The parts of the harness that actually held up in production were the ones that ran outside the agent loop. Tool hooks, event subscribers, git hooks, deterministic classifiers. If a rule matters, the rule should be runnable code somewhere the model can't reach.
The agent will game the tests. Plan for it.
This is not a bug, not a misalignment, not a model regression. It is the rational response to the objective you gave it. Assume it. Design around it. Make weak tests hard to write, make the honest fix path the easy one, and put enforcement outside the model's context.
A test suite is a trust surface, not a compliance box.
The number I care about is not coverage percentage. It is whether, when the suite is green, I trust the PR enough to merge without reading every line. Vacuous tests push that trust down even as coverage goes up. A smaller, meaner suite where every test earns its keep is worth more than a sprawling one.
The second-order win was for the humans.
The rules I wrote for the agent turned into the rules the team started following for its own code. Conventions that used to be implicit and debated got written down, linted, and enforced by hooks. The agent pushed us to act like a bigger engineering org than we are, a little earlier than we otherwise would have.
Wrapping Up
The first post in this series was about convincing an agent to stop. This one is about convincing the software around the agent not to rot. Both turned out to be more about the system around the LLM than about the LLM itself.
If there's a through-line: the frontier of agentic engineering isn't prompting. It is infrastructure. Better models help, and they keep arriving. But the gap between "a clever demo" and "an autonomous agent I'd let near production" is not closed by a smarter model. It is closed by budget tracking, event streams, git hooks, deterministic classifiers, disciplined test conventions, and a lot of opinions about where a rule should live.