# "This run died at step 7 — is it safe to rerun?" The replay contract

Yes, it is safe to rerun a failed Tidebase run **if** every step that performs external writes declares an idempotency key. Tidebase makes that condition explicit instead of leaving it to tribal knowledge: each step records a *resume contract* stating what side effects it has, how it can be replayed, and what its checkpoint guarantees.

```typescript
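// Assumes `run`, `userId`, and `sendWelcomeEmail` come from the surrounding code.
await run.step(
  'send-email',
  {
    input: { userId },
    sideEffects: ['email.send'],           // external writes this step performs
    idempotencyKey: `welcome:${userId}`,   // stable per user, so a replay reuses the same key
    replay: 'auto',                        // 'manual' parks a failure in manual_review
    checkpointInvariant: 'provider accepted the message id',
    verifiedBy: 'email provider response'
  },
  () => sendWelcomeEmail(userId)
)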
```

Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.

## How failures are classified

When a step exhausts its retries, Tidebase classifies the failure using the contract:

| Classification | Meaning | When |
|---|---|---|
| `failed_retryable` | Safe to re-invoke; SDK retries remain | transient failures |
| `manual_review` | A human must decide before rerun | the step has side effects but **no idempotency key**, or declared `replay: 'manual'` |
| `failed` | Hard failure | non-retryable errors |

Read-only and idempotency-keyed steps are `safe_replay`; unkeyed external writes park in `manual_review` instead of silently double-firing on the next retry. This is the answer to "will rerunning double-charge the customer?" — if it could, the run *stops and tells you* rather than guessing.
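To make the table concrete, here is a minimal sketch of how a classifier over the contract could look. The `ResumeContract` shape, the `classifyFailure` function, and the `errorIsTransient` flag are all hypothetical names for illustration, not Tidebase's actual API. The table does not say which row wins when several apply, so the precedence (replay safety is checked before the error type) is an assumption.

```typescript
// Hypothetical shapes for illustration; not Tidebase's actual types.
interface ResumeContract {
  sideEffects: string[];
  idempotencyKey?: string;
  replay: 'auto' | 'manual';
}

type Classification = 'failed_retryable' | 'manual_review' | 'failed';

function classifyFailure(
  contract: ResumeContract,
  errorIsTransient: boolean
): Classification {
  // A step with side effects but no idempotency key could double-fire on rerun.
  const unkeyedWrite =
    contract.sideEffects.length > 0 && contract.idempotencyKey === undefined;

  // Assumption: the replay-safety check takes precedence over the error type.
  if (unkeyedWrite || contract.replay === 'manual') return 'manual_review';

  // Replay-safe steps: transient errors may be retried, anything else is a hard failure.
  return errorIsTransient ? 'failed_retryable' : 'failed';
}

const keyed: ResumeContract = {
  sideEffects: ['email.send'],
  idempotencyKey: 'welcome:u_123',
  replay: 'auto',
};
const unkeyed: ResumeContract = { sideEffects: ['email.send'], replay: 'auto' };

console.log(classifyFailure(keyed, true));   // failed_retryable
console.log(classifyFailure(unkeyed, true)); // manual_review
```

The point of the sketch is that the decision depends only on data stored with the step, so the same answer can be recomputed later without reading logs.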

## What this is honestly not

Resume contracts do **not** make external systems exactly-once — only your idempotency keys (enforced by the external system) can do that. What Tidebase changes is that the resume decision is explicit, stored with the step, and visible in Studio, instead of hidden in logs and custom retry flags.
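The division of labor can be illustrated with a toy provider. The sketch below assumes an external system that deduplicates on an idempotency key; `FakeEmailProvider` is an in-memory stand-in invented for this example and does not model any real provider's API.

```typescript
// Hypothetical in-memory stand-in for an external email provider that
// deduplicates on an idempotency key. Not a real provider API.
class FakeEmailProvider {
  private accepted = new Map<string, string>(); // idempotency key -> message id
  private nextId = 1;
  readonly sent: string[] = []; // recipients that were actually delivered to

  send(idempotencyKey: string, to: string): string {
    const existing = this.accepted.get(idempotencyKey);
    if (existing !== undefined) {
      // Same key seen before: return the original message id, deliver nothing.
      return existing;
    }
    const messageId = `msg-${this.nextId++}`;
    this.accepted.set(idempotencyKey, messageId);
    this.sent.push(to);
    return messageId;
  }
}

const provider = new FakeEmailProvider();
const userId = 'u_123';

// First attempt, then a replay of the same step after a crash.
const firstId = provider.send(`welcome:${userId}`, 'user@example.com');
const replayId = provider.send(`welcome:${userId}`, 'user@example.com');

console.log(firstId === replayId);  // true
console.log(provider.sent.length);  // 1
```

If the provider ignored the key, the replay would deliver a second email, and nothing on the caller's side could prevent it. That is why an unkeyed write is routed to `manual_review` rather than retried.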

## Guarantees enforced underneath

- Completed steps replay from storage and never re-execute, including when a run resumes through the recovery webhook after a crash.
- Step and run leases are mutually exclusive and fenced — zombie workers cannot write back stale results.
- Input-hash drift on replay is rejected before it can corrupt a run.
- Gates resolve exactly once and replay their decision on resume.

Each of these is asserted by an invariant test against real Postgres, including concurrency probes.
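As a rough illustration of the first and third guarantees, the sketch below keeps completed steps in an in-memory map, returns the stored result on replay, and rejects a replay whose input has changed. `CheckpointStore` is a hypothetical name, the map stands in for Postgres, and `JSON.stringify` stands in for a real canonical input hash; Tidebase's actual storage and hashing are not described on this page.

```typescript
// Hypothetical in-memory checkpoint store, for illustration only.
interface Checkpoint {
  inputHash: string;
  result: unknown;
}

class CheckpointStore {
  private completed = new Map<string, Checkpoint>();

  step<T>(name: string, input: unknown, fn: () => T): T {
    // Assumption: JSON.stringify as a stand-in for a canonical input hash.
    const inputHash = JSON.stringify(input);
    const prior = this.completed.get(name);

    if (prior !== undefined) {
      if (prior.inputHash !== inputHash) {
        // Replaying with different input would corrupt the run, so refuse.
        throw new Error(`input drift on step "${name}"`);
      }
      return prior.result as T; // replay from storage; fn is not called
    }

    const result = fn();
    this.completed.set(name, { inputHash, result });
    return result;
  }
}

const store = new CheckpointStore();
let executions = 0;

const sendOnce = () =>
  store.step('send-email', { userId: 'u_123' }, () => {
    executions += 1;
    return 'accepted';
  });

sendOnce(); // executes the step body
sendOnce(); // replays the stored result
console.log(executions); // 1
```

Calling `store.step('send-email', { userId: 'u_999' }, ...)` afterwards would throw, because the stored hash no longer matches the input.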

See also: [How to resume a failed run](how-to-resume-a-failed-ai-agent-run.md) · [Human approval gates](human-approval-gates-for-ai-agents.md)