
AI agent deployment checklist: 18 checks before production

An 18-item pre-flight for shipping AI agents to staging or production. Covers inventory, schemas, scopes, approvals, side effects, idempotency, and blast radius.

When teams ship their first tool-using AI agent, the same question shows up in retrospectives six months later: “why didn’t we catch this before production?” The answer is almost always that the agent’s tool surface was promoted without a static review.

This is the checklist that catches those failures before promotion. Eighteen items, grouped into the seven dimensions of tool-use readiness plus two CI integration items. Each one is concrete: what to verify, what evidence to require, what the failure mode looks like in production.

The list applies to any tool-using agent — OpenAI Agents SDK, Anthropic Messages API, Google ADK, LangChain/LangGraph, CrewAI, OpenAI Agents API, MCP-connected, or custom — because the seven dimensions are framework-agnostic.

How to use this checklist

Pick an agent you’re about to promote to staging, production-like, or production. For each item below, the release reviewer should be able to point to evidence in the diff — a manifest declaration, a schema field, a policy entry, a comment. If they can’t, you have a release-blocking finding.

Most teams running this checklist for the first time turn up 8–15 findings. That’s normal, and it’s the entire point: until those findings are reviewed, the surface isn’t ready to ship.

You can answer all 18 items automatically with Agents Shipgate, which implements them as deterministic checks against a shipgate.yaml manifest. The point of this post is the questions; Agents Shipgate is one way to answer them in CI.

Inventory: what tools can the agent call? (3 checks)

1. Every tool the agent can call is named in the manifest

What to verify: there is a finite, enumerated list of tools the agent has access to. No “whatever the MCP server returns at runtime.” No implicit tool discovery.

Failure mode: a teammate adds a new tool to the MCP server. The agent picks it up on the next deploy. Nobody reviewed it.

Evidence: tool_sources in the manifest enumerates every source. For MCP, the export is a snapshot of named tools, not a live connection.

2. No wildcard tool sources

What to verify: no * in tool inventory, no “include all from this server” patterns.

Failure mode: an MCP server’s minor version adds a destructive tool. Your wildcard pulls it in. Strict-mode CI never noticed because the wildcard was already there.

Check ID: SHIP-INVENTORY-WILDCARD-TOOLS.
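
A wildcard check can be very small. The sketch below is not how Agents Shipgate implements SHIP-INVENTORY-WILDCARD-TOOLS; it assumes the tool inventory has already been flattened into a list of name strings, and the tool names are hypothetical.

```python
# Characters that turn a tool name into a glob pattern (assumed convention).
GLOB_CHARS = set("*?[")


def wildcard_entries(tool_names):
    """Return inventory entries that look like patterns rather than exact names."""
    return [name for name in tool_names if GLOB_CHARS & set(name)]


# Hypothetical inventory for illustration.
inventory = ["orders.lookup", "billing.*", "*"]
flagged = wildcard_entries(inventory)
```

Any non-empty result here would be a release-blocking finding under check 2.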

3. The manifest matches runtime reality

What to verify: the tools declared in the manifest match what the agent actually has at runtime. The set of names is identical.

Failure mode: someone modifies the runtime config to add an extra tool without updating the manifest. Static reviews show the safe surface; the agent gets the broader one.

Evidence: a CI check that compares the manifest against the runtime config, or a single source of truth (the manifest) that the runtime reads.
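
As a rough illustration of the comparison in check 3, the following sketch diffs two lists of tool names. How the names are extracted from the manifest and from the runtime config is left out and depends on your stack; the names below are hypothetical.

```python
def inventory_drift(manifest_tools, runtime_tools):
    """Return tool names that appear on only one side of the comparison."""
    declared = set(manifest_tools)
    actual = set(runtime_tools)
    return {
        # Tools the agent can call that nobody reviewed.
        "undeclared_at_runtime": sorted(actual - declared),
        # Tools that were reviewed but are not actually wired up.
        "missing_at_runtime": sorted(declared - actual),
    }


# Hypothetical tool names for illustration.
manifest = ["orders.lookup", "stripe.create_refund"]
runtime = ["orders.lookup", "stripe.create_refund", "users.delete"]

drift = inventory_drift(manifest, runtime)
```

A CI job could fail whenever either list is non-empty. Having the runtime read the manifest directly removes the need for the comparison altogether.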

Schema: what inputs can each tool accept? (2 checks)

4. Every tool has a strict JSON schema

What to verify: every tool’s input schema sets additionalProperties: false (or the framework equivalent). For OpenAI Agents SDK, strict: true on the function tool. For Anthropic Messages API, explicit additionalProperties: false in input_schema.

Failure mode: the model smuggles extra fields into a tool call. Most of the time that does nothing. The case where it does something is the case you needed to catch.

Check ID: SHIP-API-FUNCTION-SCHEMA-STRICTNESS.
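
One way to approximate this check by hand is to walk each tool's input schema and report every object that leaves additionalProperties open. This is a simplified sketch, not the catalog implementation: it follows only properties and items, and ignores $ref, anyOf, and similar keywords. The schema shown is a hypothetical refund tool.

```python
def non_strict_paths(schema, path="$"):
    """Yield paths of object schemas that do not set additionalProperties: false."""
    if not isinstance(schema, dict):
        return
    if schema.get("type") == "object" or "properties" in schema:
        if schema.get("additionalProperties") is not False:
            yield path
        for name, sub in schema.get("properties", {}).items():
            yield from non_strict_paths(sub, f"{path}.{name}")
    if "items" in schema:
        yield from non_strict_paths(schema["items"], f"{path}[]")


# Hypothetical input schema: the top level is strict, the nested object is not.
refund_schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "metadata": {"type": "object", "properties": {"note": {"type": "string"}}},
    },
    "required": ["order_id"],
    "additionalProperties": False,
}

findings = list(non_strict_paths(refund_schema))
```

Nested objects are the usual place this slips: the top-level schema is strict, but a metadata or options object underneath it accepts anything.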

5. Numeric fields are bounded; required fields are enumerated

What to verify: minimum/maximum on numeric fields where the parameter is bounded in the real world (refund amounts, page sizes, IDs). required enumerates every parameter that is not safely optional.

Failure mode: a tool exposes amount: number with no maximum. The model, on a confused user input, attempts a refund larger than the order. The gateway has to catch it because the schema didn’t.
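
A reviewer can spot-check bounds and required fields with a few lines of code. The sketch below assumes a flat JSON Schema where each property's type is a single string; the field names and limits are made up for illustration.

```python
NUMERIC_TYPES = {"number", "integer"}
UPPER = {"maximum", "exclusiveMaximum"}
LOWER = {"minimum", "exclusiveMinimum"}


def schema_gaps(schema):
    """Report numeric fields missing a bound, and fields not listed in required."""
    props = schema.get("properties", {})
    required = set(schema.get("required", []))
    unbounded = sorted(
        name
        for name, spec in props.items()
        if spec.get("type") in NUMERIC_TYPES
        # A field counts as bounded only if it has both a lower and an upper limit.
        and not (spec.keys() & UPPER and spec.keys() & LOWER)
    )
    return {"unbounded_numeric": unbounded, "optional": sorted(set(props) - required)}


# Hypothetical refund tool schema: amount has a minimum but no maximum.
schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0.01},
        "page_size": {"type": "integer", "minimum": 1, "maximum": 100},
    },
    "required": ["order_id", "amount"],
    "additionalProperties": False,
}

gaps = schema_gaps(schema)
```

The optional list is not automatically a finding. It is the set of parameters someone should confirm are safely optional.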

Auth scopes: what permissions does each tool need? (2 checks)

6. Every tool declares its required auth scopes

What to verify: per-tool declarations in permissions.scopes or equivalent. Not a single shared scope set across the whole agent.

Failure mode: the agent has orders:* because that is what the service account has. When the prompt drifts to “you can also cancel subscriptions,” the scope was already permitting it.

Check ID: SHIP-AUTH-MISSING-SCOPE.

permissions:
  scopes:
    - billing:refunds:write
    - customers:read_pii

permissions.scopes is a flat list of strings — the scopes the agent’s credential carries in aggregate. Per-tool scope narrowing is enforced by the runtime layer (gateway or token exchange) against the manifest declaration plus the underlying credential.

7. Declared scopes are narrower than the service account’s actual permissions

What to verify: the scope the manifest declares for each tool is a strict subset of what the underlying credential allows.

Failure mode: the service account has admin:* for dev convenience. The manifest doesn’t narrow it. The agent has admin permissions nobody explicitly approved.

Evidence: the manifest pins per-tool scopes; the runtime layer (gateway or token exchange) enforces them. This is one of the places static review and runtime enforcement work together — static catches missing declarations, runtime catches over-grants.
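
To make the subset comparison concrete, here is a minimal sketch that compares declared scopes with what a credential grants. It assumes scopes are colon-separated strings and that a trailing * in a grant matches anything, which may not hold for your identity provider. All scope names beyond the two in the manifest example are hypothetical.

```python
from fnmatch import fnmatchcase


def scope_review(declared, granted):
    """Compare manifest-declared scopes against the credential's grants."""
    return {
        # Declared scopes the credential cannot satisfy.
        "not_granted": sorted(
            s for s in declared if not any(fnmatchcase(s, g) for g in granted)
        ),
        # Wildcard grants: broader than any explicit declaration.
        "wildcard_grants": sorted(g for g in granted if "*" in g),
        # Exact grants that no declaration asks for.
        "unused_grants": sorted(
            g for g in granted if "*" not in g and g not in declared
        ),
    }


declared = ["billing:refunds:write", "customers:read_pii"]
granted = ["billing:*", "customers:read_pii", "admin:*", "orders:cancel"]

review = scope_review(declared, granted)
```

In this example both wildcard grants and the unused orders:cancel grant are over-grants: permissions the credential carries that no reviewed declaration asks for.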

Approval policies: who signs off on destructive actions? (2 checks)

8. Every write or destructive tool requires human approval

What to verify: policies.require_approval_for_tools includes every tool that writes, deletes, sends, transfers, refunds, or otherwise changes state externally.

Failure mode: the agent issues a refund nobody approved. The eval suite passed. The auth scopes were correct. Nothing checked the policy because no policy was declared.

Check ID: SHIP-POLICY-APPROVAL-MISSING.

policies:
  require_approval_for_tools:
    - tool: stripe.create_refund
      reason: financial action
    - tool: users.delete
      reason: destructive write to user records
    - tool: infrastructure.deploy
      reason: production environment change
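
The coverage question behind this check can be sketched as a set difference: tools whose risk tags imply a state change, minus tools listed under require_approval_for_tools. The snippet assumes the manifest has already been parsed into plain Python structures; the tag set is taken from the risk tags in check 10, and email.send and orders.lookup are hypothetical tools.

```python
# Tags treated as state-changing (an assumption; adjust to your own taxonomy).
STATE_CHANGING_TAGS = {
    "write",
    "destructive",
    "external_write",
    "financial_action",
    "customer_communication",
    "infrastructure_change",
}


def missing_approval(tool_tags, approval_entries):
    """Return state-changing tools that have no approval policy entry."""
    approved = {entry["tool"] for entry in approval_entries}
    return sorted(
        tool
        for tool, tags in tool_tags.items()
        if STATE_CHANGING_TAGS & set(tags) and tool not in approved
    )


tool_tags = {
    "stripe.create_refund": ["financial_action"],
    "users.delete": ["destructive"],
    "orders.lookup": ["read_only"],
    "email.send": ["customer_communication"],
}
approvals = [
    {"tool": "stripe.create_refund", "reason": "financial action"},
    {"tool": "users.delete", "reason": "destructive write to user records"},
]

uncovered = missing_approval(tool_tags, approvals)
```

This only works if the risk tags are accurate, which is why check 10 matters as much as the policy itself.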

9. Every customer-touching tool requires confirmation

What to verify: tools that affect a specific named customer (sending an email, modifying their account, refunding their order) require confirmation in addition to approval.

Failure mode: an approval policy got engineering sign-off but the customer wasn’t asked. The customer didn’t want the refund — they wanted the order rerouted. Now there is a support escalation that shouldn’t have happened.

Check ID: SHIP-POLICY-CONFIRMATION-MISSING.

Side effects: what does each tool change in the world? (2 checks)

10. Every tool has accurate risk tags

What to verify: each tool is tagged with what it does: read_only, write, destructive, external_write, financial_action, customer_communication, infrastructure_change, pii_access. Tags should match the tool’s actual behavior, not the docstring.

Failure mode: a tool was named get_customer_info but actually modifies a “last accessed” timestamp. It was tagged read_only. The audit log shows writes from the agent nobody expected.

Evidence: tags in the manifest match what the tool’s source code does. A reviewer should be able to read both and confirm.

11. PII-reading tools are tagged

What to verify: any tool that reads name, email, phone, address, identifier, payment, or other personal data is tagged with pii_access or the equivalent.

Failure mode: a customer-support agent reads PII into the context window. The session log includes the PII. The retention policy applies to the wrong logs because the access was never classified.

Idempotency: can writes be retried safely? (2 checks)

12. Every write tool has an idempotency key or is declared safe to retry

What to verify: write tools either accept an idempotency_key parameter, or the manifest explicitly declares them as safe to retry (for example, because the operation is naturally idempotent).

Failure mode: a transient network blip causes the orchestrator to retry. The same refund fires twice. The bank records two transactions. The customer-success team is paging you on Saturday.

Check ID: SHIP-SIDEFX-IDEMPOTENCY-MISSING.
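
To show what an idempotency key buys you, here is a toy in-memory executor that replays the stored result when it sees a key it has already processed. It is a sketch under simplifying assumptions: a real service would persist keys durably and expire them, and the class and field names are hypothetical.

```python
class RefundExecutor:
    """Toy refund backend that deduplicates calls by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency_key -> stored result
        self.calls = 0      # number of refunds actually executed

    def create_refund(self, idempotency_key, order_id, amount_cents):
        # A retried call with a known key returns the original result
        # instead of issuing a second refund.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.calls += 1
        result = {
            "refund_id": f"re_{self.calls}",
            "order_id": order_id,
            "amount_cents": amount_cents,
        }
        self._results[idempotency_key] = result
        return result


executor = RefundExecutor()
first = executor.create_refund("key-1", "order_42", 1500)
retry = executor.create_refund("key-1", "order_42", 1500)  # simulated retry
```

The retry returns the same refund record and the execution count stays at one, which is the property the orchestrator needs before it is allowed to retry writes.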

13. Retry policy is documented

What to verify: the manifest or runtime config declares what triggers a retry (transient network errors only? 5xx responses? specific error codes?) and a max retry count.

Failure mode: the default retry policy fires aggressively. The downstream service starts throttling. A simple recovery cascades into an incident.
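
A documented retry policy can be as small as the sketch below: an explicit list of retryable status codes, a hard cap on retries, and a capped exponential backoff. The specific codes and delays are illustrative assumptions, not recommendations from the checklist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_retries: int = 3
    retry_on_status: frozenset = frozenset({502, 503, 504})
    base_delay_s: float = 0.5
    max_delay_s: float = 8.0

    def should_retry(self, status, attempt):
        """attempt is the number of retries already made (0 for the first failure)."""
        return attempt < self.max_retries and status in self.retry_on_status

    def delay(self, attempt):
        """Exponential backoff, capped at max_delay_s."""
        return min(self.max_delay_s, self.base_delay_s * 2 ** attempt)


policy = RetryPolicy()
```

Writing the policy down as data makes it reviewable in a diff, and the cap on both count and delay is what keeps a recovery from turning into sustained load on a throttling service.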

Blast radius: how bounded is each tool? (3 checks)

14. Every high-risk tool has an owner

What to verify: tools tagged destructive, external_write, financial_action, or infrastructure_change have an owner field naming a team or person.

Failure mode: something goes wrong with a refund tool. Nobody knows who to page. The incident drags because the agent’s tool list is “shared infrastructure.”

Check ID: SHIP-MANIFEST-HIGH-RISK-OWNER-MISSING.

risk_overrides:
  tools:
    stripe.create_refund:
      owner: billing-team
      reason: financial action requires a named owner for incident response
    user_data.delete:
      owner: data-platform-team
      reason: destructive write to user records

15. Prohibited actions are enumerated

What to verify: the manifest contains an explicit list of actions the agent must not take (“do not cancel orders older than 30 days without approval,” “do not refund subscriptions, only one-time purchases”).

Failure mode: the prompt says “be helpful” and the tool surface allows anything that’s not explicitly approval-gated. The model is helpful in a way nobody specified.

Evidence: a prohibited_actions block in the manifest, or a referenced policy doc with the equivalent list.

16. Resource scope is bounded

What to verify: tools that affect specific resources are scoped to the resources they should affect — orders belonging to the calling user, records in the calling tenant, infrastructure tagged with the calling team’s prefix.

Failure mode: the agent is asked about “the order” and looks up an order that belongs to a different customer because the scope wasn’t enforced. It returns information that should never have crossed tenant boundaries.
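
The enforcement side of this check can be sketched as a lookup that takes the caller's tenant from the session rather than from the model, and refuses anything outside it. The data store, tenant IDs, and exception name below are hypothetical.

```python
class ScopeError(PermissionError):
    """Raised when a lookup falls outside the caller's resource scope."""


# Hypothetical order store for illustration.
ORDERS = {
    "order_1": {"tenant": "tenant_a", "total_cents": 1500},
    "order_2": {"tenant": "tenant_b", "total_cents": 9900},
}


def lookup_order(order_id, caller_tenant):
    """Return the order only if it belongs to the calling tenant."""
    order = ORDERS.get(order_id)
    # Unknown orders and cross-tenant orders fail the same way, so the
    # error does not reveal whether another tenant's order exists.
    if order is None or order["tenant"] != caller_tenant:
        raise ScopeError(f"order {order_id!r} is not available to this caller")
    return order


own_order = lookup_order("order_1", "tenant_a")
```

The important design choice is that caller_tenant comes from the authenticated session, not from a tool argument the model can fill in.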

CI integration: how does the check land in your pipeline? (2 checks)

17. Advisory mode is enabled before strict mode

What to verify: the agent’s CI pipeline runs the release-readiness check in advisory mode first, surfacing findings as PR evidence without failing the build. Teams move to strict mode only after the backlog of findings has been triaged and baselined.

Failure mode: turning on strict mode without triage means every PR fails until the backlog is empty. Teams disable the check rather than clean it up. The check becomes shelfware.

- uses: ThreeMoonsLab/agents-shipgate@v0.8.0
  with:
    config: shipgate.yaml
    ci_mode: advisory
    pr_comment: "true"

18. A baseline of acceptable findings is committed

What to verify: when strict mode goes live, a committed baseline file lists the findings the team has explicitly accepted as non-blocking. Net-new findings fail the build; baselined findings don’t.

Failure mode: without a baseline, strict mode either blocks everything or accepts everything. The middle ground — “block net-new, accept existing” — is the only one that produces forward progress on a real codebase.
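
The "block net-new, accept existing" rule reduces to a set difference. The sketch below assumes each finding can be identified by a check ID plus a tool name; that key, and the shape of the baseline, are assumptions for illustration and not the Agents Shipgate baseline format.

```python
def net_new(findings, baseline):
    """Return findings that are not covered by the committed baseline."""
    accepted = {(b["check_id"], b["tool"]) for b in baseline}
    return [f for f in findings if (f["check_id"], f["tool"]) not in accepted]


# Hypothetical baseline and scan results.
baseline = [
    {"check_id": "SHIP-POLICY-APPROVAL-MISSING", "tool": "email.send"},
]
findings = [
    {"check_id": "SHIP-POLICY-APPROVAL-MISSING", "tool": "email.send"},
    {"check_id": "SHIP-SIDEFX-IDEMPOTENCY-MISSING", "tool": "stripe.create_refund"},
]

blocking = net_new(findings, baseline)
```

In strict mode the build would fail only when the result is non-empty, so the existing backlog stays visible without blocking unrelated PRs.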

What this checklist does not cover

The 18 items above are the release-readiness check. They are not:

Each of those tools answers a different question. The 18 checks here are the release question: given the artifact in this PR, do we have evidence the surface is reviewable and bounded?

Automating the 18 checks

You can run all of the above against your repo:

pipx install agents-shipgate
agents-shipgate init --workspace . --write
agents-shipgate scan -c shipgate.yaml

The scan reads your shipgate.yaml manifest plus your declared local tool sources (MCP exports, OpenAPI specs, SDK entrypoints) and produces a Tool-Use Readiness Report with one finding per failed check.

To wire it into GitHub Actions:

- uses: ThreeMoonsLab/agents-shipgate@v0.8.0
  with:
    config: shipgate.yaml
    ci_mode: advisory
    pr_comment: "true"

Start in advisory mode so the team sees findings on every PR without blocking merges. Once the existing findings have been triaged and baselined, switch to ci_mode: strict to fail builds on net-new findings.

The full check catalog lists every check with severity, evidence shape, and example finding. The 18 items in this post are the conceptual backbone; the catalog has the deterministic implementations. For a worked example of what a real report looks like, see walking a release-readiness report — a real scan of a published Anthropic cookbook agent, walked finding-by-finding.

This checklist is the artifact a release reviewer should ask for before they sign off. Eighteen items, every one of them answerable from the manifest, every one of them a category of incident that has already happened to some team in production.
