
How to resume a failed AI agent run

To resume a failed agent run with Tidebase, re-invoke the same workflow function with the same runId. Completed steps return their checkpointed results from Postgres without re-executing; the workflow continues at the first incomplete step.

import { Tidebase } from '@tidebase/sdk'

const tide = new Tidebase()

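// runId must be the same value on every invocation of this run.
// makePlan, fetchSources, and writeReport are your own functions.
// First invocation: the process dies after fetch-sources completes.
// Second invocation with the same runId: plan and fetch-sources
// replay from checkpoints; write-report runs for the first time.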
await tide.run('generate-report', { runId }, async (run, input) => {
  const plan = await run.step('plan', () => makePlan(input))
  const sources = await run.step('fetch-sources', () => fetchSources(plan))
  return run.step('write-report', () => writeReport(sources))
})

Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.

What replay guarantees

  • Completed steps never repeat. Their results are returned from the checkpoint store, including across process crashes and machine restarts.
  • Two workers can’t grab the same run. Run and step leases are mutually exclusive and fenced, so a zombie worker that wakes up late cannot write back stale results.
  • Stale checkpoints are rejected. Each step records an input hash; if the input changed since the checkpoint was written, replay fails loudly instead of silently reusing a wrong result.
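
The first and third guarantees above can be illustrated with a small in-memory sketch. This is not Tidebase's implementation: the `checkpoints` map, `hashInput`, and `step` are hypothetical names standing in for the Postgres-backed store, and the step function is synchronous only to keep the example short.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical in-memory stand-in for the checkpoint store (not Tidebase's schema).
type Checkpoint = { inputHash: string; result: unknown }
const checkpoints = new Map<string, Checkpoint>()

// Hash the step input so a later replay can detect that it changed.
function hashInput(input: unknown): string {
  const serialized = JSON.stringify(input) ?? 'null'
  return createHash('sha256').update(serialized).digest('hex')
}

// Execute a step at most once per (runId, name); afterwards return the stored result.
function step<T>(runId: string, name: string, input: unknown, fn: () => T): T {
  const key = `${runId}:${name}`
  const inputHash = hashInput(input)
  const saved = checkpoints.get(key)
  if (saved) {
    if (saved.inputHash !== inputHash) {
      // Fail loudly instead of reusing a result computed from different input.
      throw new Error(`stale checkpoint for ${key}: input changed`)
    }
    return saved.result as T // replayed; fn is not called
  }
  const result = fn()
  checkpoints.set(key, { inputHash, result })
  return result
}

// The same step invoked twice with the same runId and input executes once.
let calls = 0
const makeOutline = () => { calls += 1; return 'outline' }
const first = step('run-1', 'plan', { topic: 'checkpoints' }, makeOutline)
const second = step('run-1', 'plan', { topic: 'checkpoints' }, makeOutline)
```

Calling `step('run-1', 'plan', ...)` again with a different input throws rather than returning the stored result, which is the behaviour the third bullet describes.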

Who re-invokes the workflow?

Tidebase never executes your code, but since v0.5 it owns triggering. After a failure, re-invocation can come from:

  1. Tidebase queues — a failed queue run with attempts remaining goes back to queued with backoff and is re-claimed by tide.work() (or re-dispatched over a signed push webhook). Worker death is handled the same way: the lease expires and the reconciler requeues the run.
  2. A recovery webhook — see below; the reconciler fires it automatically for stalled runs.
  3. Your own queue, cron, or a retry button in your product UI or Studio — re-invocation by runId is always safe.

You get “completed steps never repeat” in every one of these paths.

Recovery webhooks

Tidebase can call back into your app when a run fails:

const run = await tide.runs.create('generate-report', {
  input: { topic: 'checkpoints' },
  recoveryWebhook: 'https://your-app.example.com/api/tidebase'
})

When the run fails, Tidebase POSTs a run.resume payload to that URL and records every delivery attempt (delivery status, HTTP status, response body). If TIDEBASE_WEBHOOK_SECRET is set on both the server and the SDK, payloads are HMAC-signed, with the signature sent in the x-tidebase-signature header, and the SDK rejects unsigned or tampered payloads.
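
The SDK performs the signature check for you. For readers who want to see what such a check generally involves, here is a minimal sketch using Node's built-in crypto module. It assumes a hex-encoded HMAC-SHA256 over the raw request body; the actual algorithm, encoding, and payload fields used by Tidebase are not specified here, and `sign`, `verifySignature`, and the example payload are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto'

// Assumption: hex-encoded HMAC-SHA256 of the raw body. Confirm against the Tidebase docs.
function sign(rawBody: string, secret: string): string {
  return createHmac('sha256', secret).update(rawBody).digest('hex')
}

// Returns true only when a signature is present and matches the body.
function verifySignature(
  rawBody: string,
  signature: string | undefined,
  secret: string
): boolean {
  if (!signature) return false // unsigned payloads are rejected
  const expected = Buffer.from(sign(rawBody, secret), 'hex')
  const received = Buffer.from(signature, 'hex')
  // timingSafeEqual throws when lengths differ, so compare lengths first.
  if (expected.length !== received.length) return false
  return timingSafeEqual(expected, received)
}

// Hypothetical payload shape, for illustration only.
const secret = 'example-secret'
const body = JSON.stringify({ type: 'run.resume', runId: 'run-1' })
const goodSignature = sign(body, secret)
```

The comparison uses `timingSafeEqual` rather than `===` so that the time taken does not reveal how many leading characters of a forged signature were correct.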

When a step itself fails (not the process)

A step that threw is classified by its replay contract:

  • failed_retryable — SDK retries remain; safe to re-invoke.
  • manual_review — the step has external side effects without an idempotency key, or declared manual replay. A human decides.
  • failed — hard failure.

This classification is the difference between Tidebase and a hand-rolled status column: the resume decision is explicit and stored, not buried in logs.
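
Because the classification is stored, calling code can branch on it directly. The sketch below maps the three statuses listed above to a follow-up action; the status strings come from this page, while `ResumeAction` and `decideResume` are hypothetical names rather than part of the SDK.

```typescript
// Status names are taken from the list above; the action names are hypothetical.
type StepStatus = 'failed_retryable' | 'manual_review' | 'failed'
type ResumeAction = 're-invoke' | 'ask-human' | 'stop'

// Decide what to do with a run based on how its failed step was classified.
function decideResume(status: StepStatus): ResumeAction {
  switch (status) {
    case 'failed_retryable':
      return 're-invoke' // retries remain, so re-invoking by runId is safe
    case 'manual_review':
      return 'ask-human' // side effects without an idempotency key, or manual replay declared
    case 'failed':
      return 'stop' // hard failure
  }
}
```

Typing the status as a union means the compiler flags any status that the switch does not handle.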

See also: Quickstart · Tidebase vs Temporal