# How to checkpoint AI agent workflows in Postgres

To checkpoint an AI agent workflow in Postgres with Tidebase, wrap each meaningful unit of work in a named `run.step()`. Each step's result is stored as a checkpoint in your own Postgres database the moment it completes, and replayed from storage if the run is re-invoked.

```typescript
import { Tidebase } from '@tidebase/sdk'

const tide = new Tidebase()

// `runId`, `llm`, and `searchTools` come from your application code.
await tide.run('research-agent', { runId }, async (run, input) => {
  // Each named step is checkpointed when it completes and replayed on re-invocation.
  const plan = await run.step('plan', () => llm.plan(input.question))
  const docs = await run.step('search', () => searchTools.gather(plan))
  const draft = await run.step('draft', () => llm.write(docs))

  // Publish live state for the UI.
  await run.state.set({ status: 'reviewing', progress: 0.9 })

  return run.step('finalize', () => llm.polish(draft))
})
```

Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.
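
To make the replay behaviour concrete, here is a minimal in-memory sketch of the idea. It is not Tidebase's implementation: `CheckpointStore`, `runStep`, and `attempt` are hypothetical names, a `Map` stands in for the Postgres table, and steps are synchronous for brevity.

```typescript
// Illustrative only: an in-memory stand-in for the checkpoint table.
// `CheckpointStore` and `runStep` are hypothetical names, not Tidebase APIs.
type CheckpointStore = Map<string, unknown>

// Run `fn` at most once per (runId, step name); later calls return the stored result.
function runStep<T>(store: CheckpointStore, runId: string, name: string, fn: () => T): T {
  const key = `${runId}:${name}`
  if (store.has(key)) {
    return store.get(key) as T // replayed from storage, `fn` is not called
  }
  const result = fn()
  store.set(key, result) // checkpoint written as soon as the step completes
  return result
}

const store: CheckpointStore = new Map()
const calls: string[] = [] // records which step bodies actually executed

// One invocation of the workflow; `failDraft` simulates a crash in the third step.
function attempt(failDraft: boolean): string {
  const plan = runStep(store, 'run-1', 'plan', () => {
    calls.push('plan')
    return 'outline'
  })
  const docs = runStep(store, 'run-1', 'search', () => {
    calls.push('search')
    return [plan, 'doc-a']
  })
  return runStep(store, 'run-1', 'draft', () => {
    calls.push('draft')
    if (failDraft) throw new Error('model timeout')
    return docs.join(' + ')
  })
}

let firstError = ''
try {
  attempt(true) // first attempt fails inside 'draft'
} catch (e) {
  firstError = (e as Error).message
}
const result = attempt(false) // re-invocation with the same run id
```

On the second attempt only `draft` executes again; `plan` and `search` return their stored results, so the expensive calls behind them are not repeated.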

## Why not hand-roll a status column?

Every multi-step agent grows the same plumbing: a `status` column, checkpoint JSON blobs, retry flags, progress streaming, and a prayer that rerunning a failed run doesn't double-charge a customer. The plumbing is easy to write. What is hard is getting these right under contention:

- **Leases.** Two workers picking up the same run must be mutually exclusive, and a zombie worker must be fenced from writing stale results.
- **Input-hash checking.** A checkpoint written for different inputs must not be silently reused.
- **Event logs.** The log must stay ordered and gap-free under concurrent writers.
- **Approval gates.** A gate must resolve exactly once, even across a resume.

Tidebase's test suite asserts these invariants against real Postgres with concurrency probes. The invariants are the product.
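
Two of the invariants above, fencing a zombie worker and input-hash checking, can be sketched in memory. This is an assumption-laden illustration, not how Tidebase stores data: `RunRecord`, `hashInput`, and the method names are hypothetical, the lease is modelled as a simple fencing token that increases on each takeover, and FNV-1a stands in for a real cryptographic hash.

```typescript
// Hypothetical in-memory model; a real implementation would keep this in Postgres rows.
interface Checkpoint {
  inputHash: string
  result: unknown
}

// Stand-in for a cryptographic hash: 32-bit FNV-1a over the JSON encoding.
// Assumes inputs serialize with a stable key order.
function hashInput(input: unknown): string {
  const text = JSON.stringify(input)
  let h = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i)
    h = Math.imul(h, 0x01000193) >>> 0
  }
  return h.toString(16)
}

class RunRecord {
  private leaseToken = 0
  private checkpoints = new Map<string, Checkpoint>()

  // Each acquisition gets a strictly larger token, so earlier holders become stale.
  acquireLease(): number {
    this.leaseToken += 1
    return this.leaseToken
  }

  // Reject writes from a worker whose lease has been superseded.
  writeCheckpoint(token: number, step: string, input: unknown, result: unknown): boolean {
    if (token !== this.leaseToken) return false
    this.checkpoints.set(step, { inputHash: hashInput(input), result })
    return true
  }

  // Reuse a checkpoint only if it was written for the same input.
  readCheckpoint(step: string, input: unknown): unknown {
    const cp = this.checkpoints.get(step)
    return cp !== undefined && cp.inputHash === hashInput(input) ? cp.result : undefined
  }
}

const run = new RunRecord()
const zombieToken = run.acquireLease() // worker A takes the run, then stalls
const liveToken = run.acquireLease() // worker B takes over after A's lease lapses

const staleWrite = run.writeCheckpoint(zombieToken, 'plan', { q: 'a' }, 'old plan')
const freshWrite = run.writeCheckpoint(liveToken, 'plan', { q: 'a' }, 'new plan')
const hit = run.readCheckpoint('plan', { q: 'a' })
const miss = run.readCheckpoint('plan', { q: 'b' })
```

Worker A's late write is rejected because its token is no longer current, and the lookup for a different input misses instead of silently reusing the stored plan. In Postgres the token comparison and the write would need to happen atomically, which is where the contention problems live.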

## What gets stored

All in Postgres, all self-hosted: runs and attempts, named step checkpoints, input hashes, resume contracts, live run state, versioned state streams, labeled snapshots, parent/child run edges, append-only events, gates and decisions, recovery attempts, and usage records.

## Granularity: what should be a step?

A step is a unit you'd be happy to *not repeat*: an LLM call, a batch of tool calls, an external write. Cheap, pure computation doesn't need a step. External writes should declare an [idempotency key in their resume contract](replay-contract-is-it-safe-to-rerun.md) so a retry can't double-fire them.
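
As a rough illustration of why an idempotency key makes an external write safe to retry, the sketch below models the receiving side. `FakePaymentApi` and `idempotencyKey` are hypothetical and do not show Tidebase's resume-contract syntax; the only assumption is that the external service deduplicates requests by key.

```typescript
// Hypothetical external service that honours idempotency keys.
class FakePaymentApi {
  private seen = new Map<string, string>()
  chargeCount = 0 // number of charges actually performed

  charge(key: string, amountCents: number): string {
    const prior = this.seen.get(key)
    if (prior !== undefined) return prior // retry: return the original receipt
    this.chargeCount += 1
    const receipt = `receipt-${this.chargeCount}-${amountCents}`
    this.seen.set(key, receipt)
    return receipt
  }
}

// Derive the key from stable identifiers so every retry of a step sends the same one.
function idempotencyKey(runId: string, step: string): string {
  return `${runId}:${step}`
}

const api = new FakePaymentApi()
const key = idempotencyKey('run-1', 'charge-customer')

const first = api.charge(key, 4200)
// Suppose the worker crashed before the step's checkpoint was saved, so the step reruns.
const retry = api.charge(key, 4200)
```

Both calls return the same receipt and the customer is charged once. The key has to be derived from something that stays the same across attempts, such as the run id and step name, rather than a value generated fresh on each attempt.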

## Live state for your UI

`run.state.set()` / `run.state.patch()` update live run state your product UI can subscribe to via SSE (`GET /runs/:runId/events`) — progress bars and status text without custom socket plumbing. Each update also appends to a versioned history, which is what makes [snapshots, forking, and time travel](fork-and-time-travel-agent-runs.md) fall out of the same model.
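
The relationship between live state and its versioned history can be pictured with a small in-memory model. `StateStream` and its methods are hypothetical, and the semantics shown here (`set` replaces the state, `patch` shallow-merges into it) are an assumption for illustration rather than documented Tidebase behaviour.

```typescript
type State = Record<string, unknown>

// Hypothetical in-memory model of a versioned state stream.
class StateStream {
  private history: State[] = [{}] // version 0 is the empty state

  get version(): number {
    return this.history.length - 1
  }

  get current(): State {
    return this.history[this.version]
  }

  // Replace the whole state; returns the new version number.
  set(next: State): number {
    this.history.push({ ...next })
    return this.version
  }

  // Assumed semantics: shallow-merge the given fields into the current state.
  patch(fields: State): number {
    this.history.push({ ...this.current, ...fields })
    return this.version
  }

  // Read the state as of an earlier version.
  at(version: number): State | undefined {
    return this.history[version]
  }
}

const stream = new StateStream()
const v1 = stream.set({ status: 'drafting', progress: 0.5 })
const v2 = stream.patch({ progress: 0.9 })
const earlier = stream.at(v1)
```

Because every update appends a new version instead of overwriting the old one, reading the state at an earlier version is a plain lookup, which is the property snapshots and forking build on.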

See also: [Quickstart](quickstart.md) · [How to resume a failed run](how-to-resume-a-failed-ai-agent-run.md)