Healthcare for agents

Why tool-using AI agents need care across their lifecycle, not just an eval at release — and the discipline we are early in building to provide it.

Healthcare is mostly invisible when it works.

You go to your GP once a year. She listens to your heart. She checks the things that can be checked. Most years nothing is wrong, and you leave with no story to tell. The visit was the point. The result was an absence — the absence of the catastrophe that would have arrived if no one had looked.

We don’t think of those visits as tests you pass or fail. We think of them as care. The doctor isn’t grading you. She’s keeping watch.

This is the frame I want to argue for when it comes to tool-using AI agents.

The thing we built is the patient we don’t quite know how to care for

When the agent industry talks about safety, it tends to talk about evals. A model is benchmarked; a checkpoint passes a threshold; a release goes out. Eval-as-certification is a useful tool. It is the wrong frame for the lifecycle of a system that touches the world.

Eval-as-certification is what we did with bridges in the nineteenth century: load them up, see if they fall. We built better bridges by building a discipline around every stage of their lifecycle: design review, materials standards, periodic inspection, retirement protocols, post-incident metallurgy. Aviation followed the same arc. Software, eventually, will too.

Agents that can refund, email, deploy, modify records — agents with tools — are infrastructure now. They need an analog to that discipline. Not certification at a moment; care across a life.

We call this idea healthcare for agents.

What a life looks like

An agent has a lifecycle whether or not we name one:

  • Design. Someone writes the prompt, picks the tools, decides what the agent is allowed to do. Decisions are made here that will be invisible later.
  • Release. The new tool, the new policy, the new schema lands in main and ships. This is the moment when the agent’s capabilities change — not its behavior on a curated test set, but the actual surface it gets to act on.
  • Deployment. The agent meets production. Permissions are granted. Real users arrive.
  • Operation. The agent does its job, repeatedly, for a long time. Things drift. The world the agent was designed for keeps moving.
  • Retirement. Eventually the agent is replaced, or sunset, or quietly forgotten while still on a server somewhere.

Each stage asks a question that needs a method to answer. Today the field has clinics for some stages and nothing for others. Evals address operation, partially. Observability records what happened, after. Runtime guardrails enforce, in the moment. Release readiness — the question “what is this agent about to be allowed to do, and is that reviewable?” — has had no named slot.

That is the slot Agents Shipgate was built to occupy.

The pre-flight visit

Tool-use readiness is the pediatric visit, in this metaphor. Before the agent gets real privileges — production-like permissions, the keys to actually refund a customer, email an investor, cancel an invoice — someone needs to look it over.

What is the tool surface? Has anything new been declared? Does the new thing have an approval policy, an idempotency story, a maximum bound on what it can spend or send? Is the schema reviewable, or is it a blank field the model can fill with anything? Have we drifted from what we agreed last time?

These are not questions about whether the agent is good. They are questions about whether the agent is legible. Healthcare isn’t a verdict on the patient — it’s a record we can read and reason about. So is a Tool-Use Readiness Report.
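
The questions above can be sketched as a small checklist over a tool declaration. This is a minimal illustration, not how Agents Shipgate implements its checks: the `ToolDeclaration` class, its field names, and the `readiness_concerns` function are all hypothetical, chosen only to mirror the prose.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ToolDeclaration:
    """Hypothetical shape of one declared tool (not a real Agents Shipgate type)."""
    name: str
    approval_policy: Optional[str] = None   # who must sign off before the tool runs
    idempotency_key: Optional[str] = None   # input field that makes retries safe
    max_amount: Optional[float] = None      # upper bound on what it can spend or send
    input_schema: dict = field(default_factory=dict)


def readiness_concerns(tool: ToolDeclaration, previously_accepted: Set[str]) -> List[str]:
    """Return legibility concerns for one tool; an empty list means nothing alarming."""
    concerns = []
    if tool.name not in previously_accepted:
        concerns.append("newly declared since last review")
    if tool.approval_policy is None:
        concerns.append("no approval policy")
    if tool.idempotency_key is None:
        concerns.append("no idempotency story")
    if tool.max_amount is None:
        concerns.append("no maximum bound")
    # A schema with no declared properties is a blank field the model can fill freely.
    if not tool.input_schema.get("properties"):
        concerns.append("schema is not reviewable")
    return concerns


refund = ToolDeclaration(
    name="refund_customer",
    approval_policy="human_approval",
    input_schema={"type": "object", "properties": {"amount": {"type": "number"}}},
)
concerns = readiness_concerns(refund, previously_accepted={"send_email"})
print(concerns)
```

Note that none of these checks asks whether the agent behaves well. Each one asks only whether a reviewer has something concrete to read.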

The discipline matters more than the verdict. Most pre-flight visits end with: nothing alarming, see you next time. That’s the point. The visit happened. The record exists.

The other clinics

If release readiness is the pediatrician, the broader healthcare metaphor names slots that mostly don’t have products yet, or whose vocabulary is still being settled. Some of these names are in our code today. Some are landing in the next release. Some are still in design. The work is the same: name a thing carefully enough that a clinician, whether human or coding agent, can use the name.

  • Baselines are vaccination records. What did this patient have last time? What’s the new exposure that wasn’t there before? An agents-shipgate baseline is, mechanically, a record of what we accepted as fine for now. So is a child’s immunisation history. Both let strict mode fail on net-new concerns without re-litigating the past.

  • Provenance kind is differential diagnosis. The finding fires — but did it come from a declared fact (a manifest, an MCP export, an OpenAPI spec), an AST extraction, a token-list heuristic, a regex match, or an external policy pack? A reviewer reading a provenance_kind of keyword_heuristic will make different calls than the same reviewer reading static_declaration. Information about how we know what we know is part of the diagnosis. Landing in the next release.

  • Insufficient evidence is the honest answer. Sometimes we look, and we don’t have what we need to say. At least half the tools are low-confidence; the loaders threw too many warnings; we can’t render a verdict in good conscience. The mature thing is to name that, rather than overload “needs review” until it stops meaning anything. insufficient_evidence is what that name looks like in code. Landing in the next release.

  • Lifecycle retros are the M&M conference — morbidity and mortality, the closed-door meeting where clinicians sit with the case that didn’t go right and ask which layer failed. The model? The tool? The policy? The review? The retro doesn’t fix the past; it changes the next time. In design.
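
To make the first three names concrete, here is a minimal sketch of how a baseline, a provenance kind, and an insufficient-evidence verdict might fit together. Everything in it is an assumption for illustration: the `Finding` class, the `(check_id, tool)` baseline key, the `max_warnings` threshold, the `"pass"` and `"needs_review"` strings, and the choice to treat either low confidence or excess warnings as sufficient on its own. Only `provenance_kind`, `static_declaration`, `keyword_heuristic`, and `insufficient_evidence` come from the text above.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding record; field names are illustrative."""
    check_id: str
    tool: str
    provenance_kind: str  # e.g. "static_declaration" or "keyword_heuristic"


def net_new(findings: List[Finding], baseline: Set[Tuple[str, str]]) -> List[Finding]:
    """Keep only findings that were not already accepted in the baseline."""
    return [f for f in findings if (f.check_id, f.tool) not in baseline]


def verdict(findings: List[Finding], baseline: Set[Tuple[str, str]],
            low_confidence_tools: int, total_tools: int,
            loader_warnings: int, max_warnings: int = 10) -> str:
    """Name the outcome, declining to judge when the evidence is too thin."""
    too_uncertain = total_tools == 0 or low_confidence_tools * 2 >= total_tools
    if too_uncertain or loader_warnings > max_warnings:
        return "insufficient_evidence"
    # Strict mode: only net-new concerns count; the past is not re-litigated.
    return "needs_review" if net_new(findings, baseline) else "pass"


baseline = {("missing_approval_policy", "send_email")}
findings = [
    Finding("missing_approval_policy", "send_email", "static_declaration"),
    Finding("unbounded_amount", "refund_customer", "keyword_heuristic"),
]
fresh = net_new(findings, baseline)
print([(f.check_id, f.provenance_kind) for f in fresh])
print(verdict(findings, baseline, low_confidence_tools=1, total_tools=4, loader_warnings=0))
print(verdict(findings, baseline, low_confidence_tools=2, total_tools=4, loader_warnings=0))
```

In this sketch the one net-new finding carries `keyword_heuristic` as its provenance, which is exactly the detail a reviewer would want before deciding how much weight to give it.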

These are not features in search of a product. They are a discipline in search of names.

What we are really building

Three Moons Lab is a small lab building, today, the release-readiness clinic — Agents Shipgate. That’s the product. The longer arc is naming and building the discipline that all of those other clinics belong to. We call it agent lifecycle readiness.

Naming a discipline is mostly a slow act. You write the words down. You make the words usable — by reviewers, by coding agents, by AI search engines, by the next engineer trying to explain to her CTO why her team needs to take agent governance seriously. You build the tooling that turns the words into a check, and then a report, and then a meeting, and then a culture.

It is not glamorous. There is no manifesto-shaped breakthrough. Healthcare did not become healthcare because someone wrote a single paper. It became healthcare because a thousand people, over decades, named one thing carefully at a time — triage, vaccination, informed consent, do not resuscitate — and built institutions around the names.

We are early in that arc. Most of the slots aren’t named yet. Some aren’t even seen yet. But the shape of the discipline is starting to come up out of the water, and the people working on agents are starting to recognise it.

What it means to care for a thing that doesn’t know

There is a quieter question underneath the technical one.

What does it mean to care for an agent? The agent doesn’t know we are looking. It doesn’t experience its tool surface as a body. It can’t tell us it’s tired or that something is shifting inside it. We are caring for it the way a parent looks at a sleeping child — for what we might miss, not for what the child can ask for.

The thing we are caring for isn’t, really, the agent. It is the joint between the agent and the world. The capability surface. The set of things the agent has been allowed to touch on behalf of someone who trusted us with the granting of that permission. That is the patient — the relationship, the consent, the action.

The agent is a tool in the older sense of the word, too: a thing made for a purpose, that outlives the moment of its making. Healthcare for agents is less about the agent’s wellbeing than about the wellbeing of the people the agent acts for. That’s an old discipline in new clothes. We have been here before, with bridges and airplanes and pacemakers and, more recently, browsers.

What’s new is that the patient is an artifact made of language and intent. What’s the same is that someone has to keep watch.

Where to start

If you ship a tool-using agent today, the entry point is small and concrete:

  • Run Agents Shipgate against your repo and read the report.
  • Walk the check catalog as a release-readiness checklist.
  • Learn the names in the glossary — they’re the discipline trying to settle.

If you want a longer conversation about what this looks like for your team specifically, the design partner program is the right door.

If you just want to follow along while the discipline gets named one carefully-chosen word at a time, this is the blog where that happens.
