Building a Safety Net That Can Grow Itself

Author: Xiaoyi Zhu

The hard part was not generating test code. It was deciding what kind of safety net each change needed.

This post continues the same thread, but it is not only about autonomous coding agents. Autonomous and non-autonomous agents both need stronger QA coverage. If agents help us move faster, the safety net has to improve too; otherwise speed just makes us less confident. The first post was about controlling an unattended coding agent with the Pi SDK. The second was about what happens when code becomes cheap and verification becomes the real bottleneck.

This post is about the next bottleneck after that: QA.

Once the coding agent started producing useful fixes, the shape of the problem changed again. The team could get to a plausible fix much faster than before. But every fix still needed someone to answer the same human questions:

  • Does this affect a user flow?
  • Is there already a test for that flow?
  • If a test exists, is it still correct?
  • If no test exists, is this worth adding to the suite?
  • Does staging have the data needed to test it?

That work is slow. It is also hard to parallelize. One person can review many small PRs, but one person cannot manually click through every changed flow forever.

So I started building a workflow for a safety net that can grow itself.

The core idea is simple: do not wait for someone to say, "please write a test for this PR." A scheduled process watches code changes, gathers related context, and decides whether the change should become a QA task.

In this post, a QA task is just a plain markdown file. It says what changed, why it may need coverage, what data or setup is required, and what should happen next. If the task is worth doing, later stages prepare staging data and write or update API/E2E tests behind a PR. A run note is just the short log for one scheduled run.

The final shape became three stages:

Monitor  ->  Setup  ->  Write

That split ended up being the important part.


The First Temptation: One Process Does Everything

The obvious version is one big workflow:

new commit lands
  -> read diff
  -> inspect app/API behavior
  -> write test
  -> run test
  -> open PR

But this quickly turns out to be a bad design.

That one-process design has too many jobs. It has to:

  • decide whether the change matters
  • understand product intent
  • check existing coverage
  • prepare data
  • use a staging account safely
  • author page objects or API helpers
  • write API or E2E tests
  • run verification
  • decide whether failure means bad test, stale fixture, missing setup, or real product regression

Those are different kinds of work. They fail in different ways, and each failure needs a different response.

  • If monitoring fails, I want a run note and no task.
  • If setup fails, I want the task blocked with a clear reason.
  • If writing fails, I want the attempted test, commands, and error output.
  • If product behavior still fails in staging, I do not want a red test PR. I want the task to say: expected behavior still fails, do not author yet.

One big workflow tends to blur all of that. It keeps going because the next tool call might work. It turns missing setup into brittle test code. It turns unclear product intent into weak assertions. It turns staging drift into flaky tests.

The lesson from the earlier harness work applied again: split the workflow around the places where you need deterministic control.


Human Policy Comes First

One more thing had to come before proactive automation: human policy. The QA team still has to teach the workflow what a good test means:

  • Which product areas matter most?
  • Which flows belong in E2E?
  • Which changes need API coverage instead?
  • Which changes should be skipped?

These rules are not fixed once and forgotten. They are a dynamic process inside the QA team. As we see which tests catch real regressions and which ones only add noise, the rubric changes. The goal is not to remove human direction. The goal is to move human direction upfront, so the workflow can act proactively inside those boundaries. Without that ongoing feedback, the workflow can still move fast, but it may create the wrong kind of work: tests for low-value areas, tasks that should have been skipped, or PRs that add coverage without improving confidence.


The Shape That Worked Better

The better design is a small pipeline with separate responsibilities.

1. Monitor
   Watch app and backend branches.
   Collect related context.
   Classify changes.
   Create a QA task only when coverage is useful.

2. Setup
   Take one QA task that still needs staging prep.
   Verify or create safe staging fixtures.
   Mark it ready for writing only with evidence.

3. Write
   Take one ready QA task.
   Update page objects, API helpers, and tests.
   Run lint + 3x verification.
   Open PR to test repo.

Each stage reads and updates the same markdown task. That file is the source of truth. Context bundles, design notes, tickets, and run logs are evidence. They help, but they should not be the only place where status lives. The task itself needs to say: current state, reason, evidence, and next step.

That sounds like a small detail. It is not. If the explanation only lives in a chat transcript or scheduler log, the next run has to rediscover the context before doing any useful work. If the task says what happened, why it happened, and what evidence supports it, the next stage can pick it up without guessing.

The markdown task becomes the handoff between stages.
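
As a rough illustration of that handoff, here is a minimal sketch of a task that carries its own state, reason, evidence, and next step, and renders to markdown. The class name, field names, file layout, and the `needs-setup` status are assumptions made for this example; the post does not prescribe a schema.

```python
from dataclasses import dataclass, field


@dataclass
class QaTask:
    """Hypothetical in-memory form of one markdown QA task."""

    title: str
    status: str  # e.g. "needs-setup" (assumed name), "ready-to-write", "blocked"
    reason: str
    evidence: list[str] = field(default_factory=list)
    next_step: str = ""

    def to_markdown(self) -> str:
        """Render the task so the next stage can read it without other context."""
        lines = [
            f"# {self.title}",
            "",
            f"status: {self.status}",
            f"reason: {self.reason}",
            "",
            "evidence:",
        ]
        lines += [f"  - {item}" for item in self.evidence]
        lines += ["", f"next step: {self.next_step}"]
        return "\n".join(lines) + "\n"


# Illustrative values only.
task = QaTask(
    title="Dashboard filter change",
    status="needs-setup",
    reason="filter logic changed in a core user flow",
    evidence=["commit touches the filter handler", "no existing spec covers it"],
    next_step="setup verifies a dashboard fixture exists in staging",
)
rendered = task.to_markdown()
```

The point of the shape is that nothing a later stage needs lives only in a log or transcript: status, reason, and evidence travel with the file.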


Stage 1: Monitor

The monitor stage does not write tests. That rule matters.

Its job is to watch app and backend branches and ask: did something land that should change the QA safety net?

At a high level, each monitor run does this:

fetch source branches
compare last seen commit to current commit
inspect changed files and commit messages
collect related context from tickets, agent notes, enhancement docs, and design notes
classify affected area
check existing tests and page objects
apply QA-owned API/E2E suitability rules
create or update one QA task if useful
write run note
advance last-seen pointer only after success
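
The last step is the one that makes retries safe, so it is worth sketching. The version below is a minimal illustration, assuming the pointer lives in a small JSON state file and that the real work (context gathering, classification, task creation) is passed in as a callable; `run_monitor`, `process_range`, and the file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_monitor(
    state_file: Path,
    current_commit: str,
    process_range: Callable[[Optional[str], str], None],
) -> bool:
    """Process commits after the last-seen pointer; advance it only on success."""
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    last_seen = state.get("last_seen")
    if last_seen == current_commit:
        return False  # nothing new landed since the previous run

    # If this raises, the pointer below is never written, so the next
    # scheduled run retries the same commit range.
    process_range(last_seen, current_commit)

    state["last_seen"] = current_commit
    tmp = state_file.with_name(state_file.name + ".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(state_file)  # swap in the new pointer in one step
    return True


# Usage with a throwaway state file and a stand-in for the real work.
state_path = Path(tempfile.mkdtemp()) / "monitor_state.json"
seen_ranges = []
first = run_monitor(state_path, "abc123", lambda old, new: seen_ranges.append((old, new)))
second = run_monitor(state_path, "abc123", lambda old, new: seen_ranges.append((old, new)))
```

Here the first call processes the range and records the pointer, and the second call is a no-op because nothing new has landed.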

The monitor is intentionally conservative. Most commits should not become QA tasks. It also does not look only at the diff. It collects related context first: linked tickets, notes from other agents, enhancement or design notes, previous run notes, and any existing QA memory around the product area. That context helps it decide whether the change is real product behavior, a regression risk, or noise.

Docs-only changes, formatting-only changes, test-only changes, build config noise, tiny refactors with no user-visible behavior: skip. A monitor that creates work for every commit creates noise. The writer gets buried. Humans stop trusting the task list.

The monitor should create work when a change touches something worth protecting:

  • core user flows
  • login/session behavior
  • navigation and route access
  • create/edit/save paths
  • filtering, sorting, search, pagination
  • API contracts and permission behavior
  • areas with recurring regressions
  • code that affects both frontend and backend behavior

This is also where the API/E2E split happens.

Some changes are browser-flow changes. Those become E2E candidates. Some are backend behavior or data contract changes. Those become API test candidates. Some need both. For a legacy monolith, API tests are not secondary; they are often the fastest way to protect permissions, data shape, and backend behavior. The important part is that this decision is made before test writing starts, not halfway through a confused writing session.
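
To make the skip rules and the API/E2E decision concrete, here is a deliberately simplified sketch that looks only at changed file paths. The directory names and suffixes are hypothetical placeholders for a QA-owned rubric, and a real monitor would also weigh tickets, notes, and existing coverage as described above.

```python
from pathlib import PurePosixPath

# Hypothetical rubric values; in practice these belong in a QA-editable document.
SKIP_SUFFIXES = {".md", ".txt"}
SKIP_DIRS = {"docs", "tests", ".github"}
FRONTEND_DIRS = {"web", "app"}
BACKEND_DIRS = {"api", "server"}


def top_dir(path: str) -> str:
    """Return the first path component, or an empty string for an empty path."""
    parts = PurePosixPath(path).parts
    return parts[0] if parts else ""


def is_noise(path: str) -> bool:
    """Docs-only and test-only paths never justify a QA task on their own."""
    return PurePosixPath(path).suffix in SKIP_SUFFIXES or top_dir(path) in SKIP_DIRS


def suggest_test_level(changed_files: list[str]) -> str:
    """Return 'skip', 'e2e', 'api', or 'both' for a set of changed paths."""
    touched = {top_dir(f) for f in changed_files if not is_noise(f)}
    touches_frontend = bool(touched & FRONTEND_DIRS)
    touches_backend = bool(touched & BACKEND_DIRS)
    if touches_frontend and touches_backend:
        return "both"
    if touches_frontend:
        return "e2e"
    if touches_backend:
        return "api"
    return "skip"


level = suggest_test_level(["api/permissions.py", "README.md"])
```

The default is `skip`: a change has to match something worth protecting before it becomes work, which keeps the monitor conservative.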

Why monitor creates tasks, not tests

I had to resist giving the monitor more power. It already sees the diff. It can inspect existing tests. It can guess the target behavior. Why not let it author the test immediately?

Because the monitor usually does not know if staging can support the test.

A useful API or E2E test often needs a specific account state: a role, a dashboard, a form, a field, a record, an automation, an API baseline, or a permission. If that foundation is missing, the writer has two bad options:

  1. Write a brittle test against whatever data happens to exist.
  2. Spend half its run trying to build setup while also writing test code.

Both are bad. So the monitor only creates a QA task. It describes intent, affected area, suggested test level, related context, and required fixtures. Then it hands off to setup.


Stage 2: Setup

The setup stage is the plainest part of the system and probably the most important.

A fixture here just means the stable data or configuration a test needs before it can run. Most API/E2E test failures are not clever. They are boring:

  • test user cannot reach route
  • required fixture does not exist
  • feature flag is off
  • expected dashboard is missing
  • record shape changed
  • API fixture uses old IDs
  • staging data drifted
  • product bug still reproduces

If the writer discovers these problems late, it tends to compensate in the test. It adds waits. It uses existing row text. It narrows the assertion. It skips the hard path. The test passes, but trust goes down.

There are usually two reasons API/E2E tests are slow to build. First, the test needs the right fixture or staging data to run against. Second, the test itself needs writing, tuning, and repeated verification. Setup owns the first source of slowness. The writer owns the second.

So setup gets its own stage.

Setup takes a QA task that still needs staging prep and checks whether the staging environment can support it. It is allowed to make safe changes only inside a dedicated QA staging account. Before any mutation-capable step runs, it checks hard gates:

  • environment must be staging
  • base URL must be staging
  • target account must be the dedicated QA account
  • auto-approval must be explicitly enabled for that account

If any gate fails, setup stops.
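
A minimal sketch of those gates, assuming the run context is available as plain values. The host name, account ID, and field names are hypothetical; the idea is only that every gate is checked before any mutation and that a non-empty result means stop.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical values; real ones would come from locked-down configuration.
STAGING_HOST = "staging.example.com"
QA_ACCOUNT_ID = "qa-dedicated"


@dataclass(frozen=True)
class RunContext:
    environment: str
    base_url: str
    account_id: str
    auto_approve_accounts: frozenset[str]


def failed_gates(ctx: RunContext) -> list[str]:
    """Return every gate that failed; an empty list means mutations may proceed."""
    failures = []
    if ctx.environment != "staging":
        failures.append("environment is not staging")
    # Compare the parsed host exactly so look-alike URLs do not pass.
    if urlparse(ctx.base_url).hostname != STAGING_HOST:
        failures.append("base URL is not staging")
    if ctx.account_id != QA_ACCOUNT_ID:
        failures.append("target account is not the dedicated QA account")
    if ctx.account_id not in ctx.auto_approve_accounts:
        failures.append("auto-approval is not enabled for this account")
    return failures


ctx = RunContext(
    environment="staging",
    base_url="https://staging.example.com/app",
    account_id="qa-dedicated",
    auto_approve_accounts=frozenset({"qa-dedicated"}),
)
gate_failures = failed_gates(ctx)
```

Returning all failures rather than the first one makes the blocked reason in the task more useful to whoever reads it next.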

Then it probes what already exists. If needed, it creates or updates the foundation data: fields, views, dashboards, records, roles, automation fixtures, or API baseline values. For API tests, setup can update a small fixture file from real staging evidence. For E2E tests, setup can verify that the route and visible state exist before the writer starts.

Setup does not write final tests. It only prepares the ground and records evidence.

A good setup outcome looks like this:

task status: ready-to-write
foundation setup: verified
proof:
  - route exists
  - required field exists
  - stable test record exists or can be created by test
  - API operation returns expected baseline shape
next step:
  - writer can author API/E2E test

A bad setup outcome should be explicit too:

task status: blocked
category: missing-test-target
reason:
  - target dashboard not present in shared staging account
  - no safe setup recipe exists yet
next step:
  - human QA decides whether to add fixture manually or skip coverage

This is much better than letting writing fail five times and leave behind a pile of partial files.


Stage 3: Write

Only after monitor says "this is worth testing" and setup says "the target is ready" does writing start.

The writer processes one ready QA task at a time. It creates an isolated workspace, branches from the test repo's staging branch, reads existing tests and page objects, and then decides whether to extend an existing spec or add a new one.

The writer has a few non-negotiable rules:

  • only change the test repo; product source repos are read-only context
  • branch from the test repo's staging branch, open PRs back to staging, and never target main
  • prefer updating existing coverage over adding duplicate files
  • for E2E, page-object first; for API, shared helper first
  • follow repo conventions: semantic locators for E2E, meaningful contract assertions for API, no hard waits, no brittle CSS/XPath unless documented, and no committed auth state, reports, screenshots, or secrets
  • verify three consecutive passes before PR

The structure rule matters for both sides. Page objects and API helpers are shared wrappers around repeated actions. They keep generated tests from turning into one-off scripts.

For E2E, the writer searches page objects first. If reusable behavior exists, use it. If not, extend the page object before writing the test. The test body should read like a user flow, not a pile of selectors.

For API tests, the equivalent rule is shared helpers and fixtures first. API tests should verify status, errors, data shape, and permission contracts. They should not be one-off request blobs copied into separate files.
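
As one possible shape for such a shared helper, the sketch below checks a status code and the type of each required field. It is an assumption-laden illustration: the helper name is hypothetical, and the response is passed in as a plain status and dict so the example does not depend on any particular HTTP client or test framework.

```python
def assert_contract(
    status: int,
    body: dict,
    *,
    expected_status: int,
    required: dict[str, type],
) -> None:
    """Shared helper: check the status code and the type of each required field."""
    assert status == expected_status, f"expected status {expected_status}, got {status}"
    for name, expected_type in required.items():
        assert name in body, f"missing field: {name}"
        assert isinstance(body[name], expected_type), (
            f"field {name!r} should be {expected_type.__name__}, "
            f"got {type(body[name]).__name__}"
        )


# Usage with a stand-in response; a real test would get these from its API client.
fake_status = 200
fake_body = {"id": "rec_1", "name": "Quarterly report", "tags": ["finance"]}
assert_contract(
    fake_status,
    fake_body,
    expected_status=200,
    required={"id": str, "name": str, "tags": list},
)
```

A permission test can reuse the same helper with `expected_status=403` and an error-shaped `required` mapping, which keeps individual test files short.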


The Feedback Loop

The whole system is just a feedback loop around staging.

source change lands
  -> monitor gathers related context and asks whether QA coverage should change
  -> setup checks whether staging can support that coverage
  -> writer creates or updates API/E2E tests
  -> verification runs against staging
  -> PR updates test suite
  -> future monitor runs use that suite as evidence

There is another smaller loop inside the writer:

write or update test
  -> run lint
  -> run targeted test
  -> if it fails, fix once with evidence
  -> run 3x
  -> if all pass, PR
  -> if target is wrong, block task instead of forcing a test
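
The inner loop can be sketched as a small function over injected checks. This is an illustration under stated assumptions: `lint`, `run_test`, and `fix_once` are hypothetical callables (in practice they might wrap the repo's lint and test commands), and the sketch does not model the "target is wrong" decision, which needs evidence beyond a pass/fail signal.

```python
from typing import Callable

Check = Callable[[], bool]  # returns True when the check passes


def writer_loop(
    lint: Check,
    run_test: Check,
    fix_once: Callable[[], None],
    passes_required: int = 3,
) -> str:
    """Return 'open-pr' only after lint and consecutive passing runs."""
    if not lint():
        return "failed"
    if not run_test():
        fix_once()  # a single evidence-based fix attempt, never a retry spiral
    # Every one of these runs must pass; the first failure ends the attempt.
    for _ in range(passes_required):
        if not run_test():
            return "failed"
    return "open-pr"


# Usage with scripted results: the targeted run fails, one fix is applied,
# then three consecutive runs pass.
results = iter([False, True, True, True])
fixes = []
decision = writer_loop(
    lint=lambda: True,
    run_test=lambda: next(results),
    fix_once=lambda: fixes.append("adjusted locator"),
)
```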

That last branch matters. The writer needs permission to stop.

Earlier designs often failed because every QA task was treated as "make a PR." But many QA tasks should not become PRs yet. Sometimes the expected behavior is not actually fixed in staging. Sometimes the target is missing. Sometimes the test would require a custom tenant state that is not safe to create automatically. Sometimes there is already coverage and no change is needed.

A blocked task with a good reason is a successful outcome. It prevents a bad test from entering the suite.


The Hardest Part: Task Quality

I expected the hard part to be setting up a reliable scheduled workflow. It was not.

The hard part was deciding what deserves a test.

The writer can produce a page-load test for almost anything. It can click a button and assert a title. It can produce green output. That does not mean the suite got better.

The solution was to move the quality decision before test writing. Monitor uses product context and the QA-owned rubric to decide whether a change deserves coverage. Setup proves the target and fixture data are ready in staging. Writing only starts after those two checks. If either check fails, the outcome is skipped or blocked, not a weak PR.

That is what each stage protects:

  • Monitor avoids noisy or low-value tasks.
  • Setup avoids unsafe or missing fixtures.
  • Writing avoids shallow assertions and duplicate coverage.

This is the same lesson as the TDD guardrails, at a larger scale: a test suite is a trust surface. Every weak test spends trust. Every flaky test spends more. If automation can add tests faster than humans can review them, quality control has to move earlier in the pipeline.


Lessons So Far

1. Split stages by failure mode.

Monitor, setup, and writing fail differently. Keeping those stages separate makes failures legible. It also lets each stage stop without corrupting the next one.

2. Make markdown tasks the source of truth.

Logs are not enough. Chat transcripts are not enough. A QA task should carry intent, evidence, status, and outcome reason. The next run should not need to reconstruct everything.

3. Setup is not secondary.

Fixture and account drift are where API/E2E suites die. A setup stage that verifies staging before authoring is not overhead. It is what keeps the writer honest.

4. Give each stage valid non-PR outcomes.

Completed, blocked, failed, no-op. These are all real outcomes. If PR is the only success state, the writer will force PR-shaped answers onto non-PR problems.
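
One way to make those outcomes first-class is to enumerate them and require a reason for anything that is not a completion. The names below mirror the four outcomes listed above; the function and its output format are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    COMPLETED = "completed"
    BLOCKED = "blocked"
    FAILED = "failed"
    NO_OP = "no-op"


def record_outcome(outcome: Outcome, reason: str = "") -> str:
    """Format a status line; every non-completed outcome must explain itself."""
    if outcome is not Outcome.COMPLETED and not reason.strip():
        raise ValueError(f"{outcome.value} outcome requires a reason")
    suffix = f" ({reason})" if reason else ""
    return f"outcome: {outcome.value}{suffix}"


status_line = record_outcome(Outcome.BLOCKED, "target dashboard missing in staging")
```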

5. Keep policy editable by QA.

Human instruction matters a lot here. The QA team has to set the high-level principle for what counts as a good test, what product areas deserve coverage, and what should stay out of scope. That tuning is ongoing. The suitability rubric should be a document QA can edit, not a pile of hardcoded product decisions in the scheduler.

6. Verification has to be boring.

Lint. Typecheck. Targeted run. Three consecutive passes. PR. No drama. No self-judged confidence. Evidence from tools only.


Wrapping Up

The first blog post was about making an agent stop. The second blog post was about making it prove a fix with evidence. This post asks a different question: can the QA safety net grow proactively before someone asks for a test, without filling the suite with shallow tests?

That changed how I think about the problem. A safety net that grows itself is not only test generation. It is the system around test generation:

watch changes
choose useful work
prepare safe state
write focused tests
verify hard
leave an audit trail

I am still learning what works here, and a lot of it comes from trying a small version, watching where it breaks, and adjusting. So far, the pattern that has held up is narrower than one giant autonomous process: clear stages, clear handoffs, deterministic gates, human-owned policy, and boring verification.

Code is cheap now. Tests are also becoming cheap. Trust is not.