# How to resume a failed AI agent run

To resume a failed agent run with Tidebase, re-invoke the same workflow function with the same `runId`. Completed steps return their checkpointed results from Postgres without re-executing; the workflow continues at the first incomplete step.

```typescript
import { Tidebase } from '@tidebase/sdk'

const tide = new Tidebase()

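// `runId` is a stable ID you choose and persist; reuse the same value on
// every invocation of this run.
// makePlan, fetchSources, and writeReport are your own functions.
//
// First invocation: the process dies after fetch-sources completes.
// Second invocation with the same runId: plan and fetch-sources
// replay from checkpoints, write-report runs for the first time.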
await tide.run('generate-report', { runId }, async (run, input) => {
  const plan = await run.step('plan', () => makePlan(input))
  const sources = await run.step('fetch-sources', () => fetchSources(plan))
  return run.step('write-report', () => writeReport(sources))
})
```

Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.

## What replay guarantees

- **Completed steps never repeat.** Their results are returned from the checkpoint store, including across process crashes and machine restarts.
- **Two workers can't grab the same run.** Run and step leases are mutually exclusive and fenced, so a zombie worker that wakes up late cannot write back stale results.
- **Stale checkpoints are rejected.** Each step records an input hash; if the input changed since the checkpoint was written, replay fails loudly instead of silently reusing a wrong result.
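
To make the first and third guarantees concrete, here is a toy in-memory model of step checkpointing. It is a sketch under stated assumptions, not Tidebase's implementation: the real store is Postgres, the real API is async, and `runStep`, `hashInput`, and the SHA-256-over-JSON hashing scheme are hypothetical names and choices made for illustration.

```typescript
import { createHash } from 'node:crypto'

// Toy model only. The checkpoint shape and hashing scheme are assumptions.
type Checkpoint = { inputHash: string; result: unknown }

// Hypothetical helper: hash a JSON-serializable step input.
// JSON.stringify is sensitive to key order; a real system would canonicalize.
function hashInput(input: unknown): string {
  return createHash('sha256').update(JSON.stringify(input)).digest('hex')
}

// Run a step once per store. On replay, return the saved result,
// or fail loudly if the input no longer matches the checkpoint.
function runStep<I, T>(
  store: Map<string, Checkpoint>,
  name: string,
  input: I,
  fn: (input: I) => T
): T {
  const inputHash = hashInput(input)
  const saved = store.get(name)
  if (saved) {
    if (saved.inputHash !== inputHash) {
      throw new Error(`stale checkpoint for step "${name}": input changed`)
    }
    return saved.result as T // replay: fn is not called
  }
  const result = fn(input)
  store.set(name, { inputHash, result })
  return result
}

const store = new Map<string, Checkpoint>()
let planCalls = 0
const makePlan = (topic: string): string => {
  planCalls += 1
  return `plan for ${topic}`
}

// First invocation: 'plan' executes and is checkpointed.
runStep(store, 'plan', 'checkpoints', makePlan)
// Re-invocation against the same store: 'plan' replays without executing.
runStep(store, 'plan', 'checkpoints', makePlan)
console.log(planCalls) // 1
```

Calling `runStep(store, 'plan', 'another-topic', makePlan)` against the same store throws instead of returning the old plan, which is the "fails loudly" behavior described above.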

## Who re-invokes the workflow?

Tidebase never executes your code, but since v0.5 it owns *triggering*. After a failure, re-invocation can come from:

1. **Tidebase queues** — a failed queue run with attempts remaining goes back to `queued` with backoff and is re-claimed by `tide.work()` (or re-dispatched over a signed push webhook). Worker death is handled the same way: the lease expires and the reconciler requeues the run.
2. **A recovery webhook** — see below; the reconciler fires it automatically for stalled runs.
3. **Your own queue, cron, or a retry button** in your product UI or Studio — re-invocation by `runId` is always safe.

You get "completed steps never repeat" in every one of these paths.
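
The "back to `queued` with backoff" behavior in the first path can be pictured with a small sketch. This guide does not specify Tidebase's backoff policy, so the capped exponential schedule, the `backoffMs` and `afterFailure` names, and the default values below are all assumptions chosen for illustration.

```typescript
// Hypothetical schedule: exponential backoff with a cap.
// Tidebase's actual policy may differ.
function backoffMs(attempt: number, baseMs = 1_000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1))
}

type QueueState = { status: 'queued'; delayMs: number } | { status: 'failed' }

// Decide what happens to a queue run after its Nth attempt fails.
function afterFailure(attempt: number, maxAttempts: number): QueueState {
  return attempt < maxAttempts
    ? { status: 'queued', delayMs: backoffMs(attempt) }
    : { status: 'failed' }
}

console.log(afterFailure(1, 3)) // { status: 'queued', delayMs: 1000 }
console.log(afterFailure(3, 3)) // { status: 'failed' }
```

Because completed steps replay from checkpoints, each requeued attempt only pays for the steps that have not finished yet.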

## Recovery webhooks

Tidebase can call back into your app when a run fails or stalls:

```typescript
const run = await tide.runs.create('generate-report', {
  input: { topic: 'checkpoints' },
  recoveryWebhook: 'https://your-app.example.com/api/tidebase'
})
```

When the run fails, Tidebase POSTs a `run.resume` payload to that URL and records every delivery attempt (delivery status, HTTP status, response body). If `TIDEBASE_WEBHOOK_SECRET` is set on both the server and the SDK, payloads are HMAC-signed, the signature is sent as `x-tidebase-signature`, and the SDK rejects unsigned or tampered payloads.
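
The SDK performs signature verification for you. For readers who want to see what such a check involves, here is a minimal sketch using Node's built-in `crypto` module. The details are assumptions, not confirmed by this guide: HMAC-SHA256 over the raw request body, a hex-encoded signature, and the payload fields in the usage lines are all hypothetical, as are the `sign` and `verifySignature` names.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto'

// Assumption: HMAC-SHA256 over the raw body, hex-encoded.
function sign(rawBody: string, secret: string): string {
  return createHmac('sha256', secret).update(rawBody).digest('hex')
}

function verifySignature(
  rawBody: string,
  signature: string | undefined,
  secret: string
): boolean {
  if (!signature) return false // reject unsigned payloads
  const expected = Buffer.from(sign(rawBody, secret), 'hex')
  const received = Buffer.from(signature, 'hex')
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received)
}

const secret = 'example-secret'
// Hypothetical payload shape, for illustration only.
const body = JSON.stringify({ type: 'run.resume', runId: 'run_123' })
const signature = sign(body, secret)

console.log(verifySignature(body, signature, secret))       // true
console.log(verifySignature(body + ' ', signature, secret)) // false (tampered)
console.log(verifySignature(body, undefined, secret))       // false (unsigned)
```

Verify against the raw request bytes rather than a re-serialized object, since re-serializing JSON can change whitespace or key order and break the comparison.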

## When a step itself failed (not the process)

A step that threw is classified by its [replay contract](replay-contract-is-it-safe-to-rerun.md):

- `failed_retryable` — SDK retries remain; safe to re-invoke.
- `manual_review` — the step has external side effects without an idempotency key, or declared manual replay. A human decides.
- `failed` — hard failure.

This classification is the difference between Tidebase and a hand-rolled `status` column: the resume decision is explicit and stored, not buried in logs.
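
Because the classification is stored, recovery code can branch on it directly. The sketch below maps the three statuses above to a next action; the `ResumeAction` values and the `resumeAction` function are hypothetical names for illustration, and how you read a step's status from Tidebase is not shown here.

```typescript
// The three failure classifications described above.
type StepFailure = 'failed_retryable' | 'manual_review' | 'failed'

// Hypothetical actions your recovery code might take.
type ResumeAction = 're-invoke' | 'escalate-to-human' | 'stop'

function resumeAction(status: StepFailure): ResumeAction {
  switch (status) {
    case 'failed_retryable':
      return 're-invoke' // SDK retries remain; same runId is safe
    case 'manual_review':
      return 'escalate-to-human' // side effects without an idempotency key
    case 'failed':
      return 'stop' // hard failure
  }
}

console.log(resumeAction('failed_retryable')) // re-invoke
console.log(resumeAction('manual_review'))    // escalate-to-human
```

The exhaustive `switch` means the compiler flags this function if a new classification is ever added to `StepFailure` without a matching branch.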

See also: [Quickstart](quickstart.md) · [Tidebase vs Temporal](compare/tidebase-vs-temporal.md)