How to checkpoint AI agent workflows in Postgres
To checkpoint an AI agent workflow in Postgres with Tidebase, wrap each meaningful unit of work in a named run.step(). Each step’s result is stored as a checkpoint in your own Postgres database the moment it completes, and replayed from storage if the run is re-invoked.
import { Tidebase } from '@tidebase/sdk'

const tide = new Tidebase()

// runId, llm, and searchTools are assumed to be defined by your application.
await tide.run('research-agent', { runId }, async (run, input) => {
  // Each named step is checkpointed as soon as it completes.
  const plan = await run.step('plan', () => llm.plan(input.question))
  const docs = await run.step('search', () => searchTools.gather(plan))
  const draft = await run.step('draft', () => llm.write(docs))
  // Publish live state for the UI before the final step.
  await run.state.set({ status: 'reviewing', progress: 0.9 })
  return run.step('finalize', () => llm.polish(draft))
})
Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.
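To make the replay behavior concrete, here is a minimal in-memory sketch of the idea. It is not Tidebase's implementation: MemoryRun is a hypothetical name, and a Map stands in for the Postgres checkpoint table.

```typescript
// Illustrative sketch only: a Map stands in for the Postgres checkpoint table.
// MemoryRun is a hypothetical name, not part of @tidebase/sdk.
type Checkpoints = Map<string, unknown>

class MemoryRun {
  constructor(private checkpoints: Checkpoints) {}

  // Run fn once per step name; later invocations replay the stored result.
  async step<T>(name: string, fn: () => Promise<T> | T): Promise<T> {
    if (this.checkpoints.has(name)) {
      return this.checkpoints.get(name) as T
    }
    const result = await fn()
    this.checkpoints.set(name, result) // stored only after fn succeeds
    return result
  }
}

// A two-step workflow whose second step can be made to fail.
async function workflow(run: MemoryRun, calls: string[], failDraft: boolean): Promise<string> {
  const plan = await run.step('plan', () => {
    calls.push('plan')
    return 'outline'
  })
  return run.step('draft', () => {
    calls.push('draft')
    if (failDraft) throw new Error('model timeout')
    return `draft of ${plan}`
  })
}

async function demo(): Promise<{ calls: string[]; result: string }> {
  const store: Checkpoints = new Map()
  const calls: string[] = []
  // Attempt 1 completes 'plan' and then fails inside 'draft'.
  await workflow(new MemoryRun(store), calls, true).catch(() => undefined)
  // Attempt 2 shares the store, so 'plan' is replayed and only 'draft' executes.
  const result = await workflow(new MemoryRun(store), calls, false)
  return { calls, result }
}
```

Across both attempts the 'plan' function body executes once and the 'draft' body twice, which is the property that keeps a resumed run from repeating paid LLM calls.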
Why not hand-roll a status column?
Every multi-step agent grows the same plumbing: a status column, checkpoint JSON blobs, retry flags, progress streaming, and a prayer that rerunning a failed run doesn’t double-charge a customer. The plumbing is easy to write. The hard part is getting these invariants right under contention:
- Leases. Two workers picking up the same run must be mutually exclusive, and a zombie worker must be fenced from writing stale results.
- Input-hash checking. A checkpoint written for different inputs must not be silently reused.
- Ordered, gap-free event logs under concurrent writers.
- Exactly-once approval gates that survive a resume.
Tidebase’s test suite asserts these invariants against real Postgres with concurrency probes — they are the product.
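Two of these invariants, lease fencing and input-hash checking, can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not Tidebase's code: LeaseTable, CheckpointTable, and fnv1a are hypothetical names, and in Postgres the token check and the write would need to happen atomically in one transaction.

```typescript
// Illustrative sketch only: in-memory stand-ins for rows that would live in Postgres.
// LeaseTable, CheckpointTable, and fnv1a are hypothetical names.

// 32-bit FNV-1a over UTF-16 code units keeps the sketch dependency-free.
// A real system would use a cryptographic hash such as SHA-256.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash.toString(16)
}

class LeaseTable {
  private holder = new Map<string, { worker: string; token: number; expiresAt: number }>()

  // Grant the lease if it is free or expired; every grant gets a higher fencing token.
  acquire(runId: string, worker: string, now: number, ttlMs: number): number | null {
    const current = this.holder.get(runId)
    if (current && current.expiresAt > now) return null
    const token = (current?.token ?? 0) + 1
    this.holder.set(runId, { worker, token, expiresAt: now + ttlMs })
    return token
  }

  // True only for the most recently granted token.
  isCurrent(runId: string, token: number): boolean {
    return this.holder.get(runId)?.token === token
  }
}

class CheckpointTable {
  private rows = new Map<string, { inputHash: string; result: string }>()
  constructor(private leases: LeaseTable) {}

  // Reject writes from a worker whose lease has been superseded (a "zombie").
  write(runId: string, step: string, input: unknown, result: string, token: number): boolean {
    if (!this.leases.isCurrent(runId, token)) return false
    // JSON.stringify is key-order sensitive; a real system would canonicalize first.
    this.rows.set(`${runId}/${step}`, { inputHash: fnv1a(JSON.stringify(input)), result })
    return true
  }

  // Reuse a checkpoint only when it was written for the same input.
  read(runId: string, step: string, input: unknown): string | undefined {
    const row = this.rows.get(`${runId}/${step}`)
    if (row === undefined) return undefined
    if (row.inputHash !== fnv1a(JSON.stringify(input))) return undefined
    return row.result
  }
}

const leases = new LeaseTable()
const checkpoints = new CheckpointTable(leases)
const tokenA = leases.acquire('run-1', 'worker-a', 0, 1000)    // granted
const blocked = leases.acquire('run-1', 'worker-b', 500, 1000) // refused: lease still held
const tokenB = leases.acquire('run-1', 'worker-b', 2000, 1000) // granted after expiry
```

If worker-a wakes up after its lease expired and tries to write with tokenA, the write is refused because tokenB is now current. A checkpoint written for one question is likewise not returned when the step is read with a different question.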
What gets stored
All in Postgres, all self-hosted: runs and attempts, named step checkpoints, input hashes, resume contracts, live run state, versioned state streams, labeled snapshots, parent/child run edges, append-only events, gates and decisions, recovery attempts, and usage records.
Granularity: what should be a step?
A step is a unit of work you would rather not repeat: an LLM call, a batch of tool calls, an external write. Cheap, pure computation doesn’t need a step. External writes should declare an idempotency key in their resume contract so a retry can’t double-fire them.
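The idempotency-key idea can be sketched as follows. How a resume contract is declared in the SDK is not shown here, so this example only illustrates the mechanism: PaymentApi is a hypothetical in-memory stand-in for an external service that deduplicates on a key, and the key format is an assumption.

```typescript
// Illustrative sketch only: PaymentApi is a hypothetical stand-in for an
// external service that deduplicates requests by idempotency key.
class PaymentApi {
  private seen = new Map<string, string>()
  charges = 0

  charge(idempotencyKey: string, amountCents: number): string {
    const previous = this.seen.get(idempotencyKey)
    if (previous !== undefined) return previous // retry: return the original receipt
    this.charges += 1
    const receipt = `receipt-${this.charges}-${amountCents}`
    this.seen.set(idempotencyKey, receipt)
    return receipt
  }
}

// Derive the key from stable identifiers, never from a timestamp or random
// value, so that a retried step produces the same key.
function idempotencyKey(runId: string, stepName: string): string {
  return `${runId}:${stepName}`
}

const api = new PaymentApi()
const key = idempotencyKey('run-42', 'charge-customer')
const first = api.charge(key, 1999)
// Simulate a retry, e.g. the worker crashed before the checkpoint was saved.
const retry = api.charge(key, 1999)
```

Both calls return the same receipt and the customer is charged once, which is why the key has to be derived from the run and step rather than generated fresh on each attempt.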
Live state for your UI
run.state.set() / run.state.patch() update live run state your product UI can subscribe to via SSE (GET /runs/:runId/events) — progress bars and status text without custom socket plumbing. Each update also appends to a versioned history, which is what makes snapshots, forking, and time travel fall out of the same model.
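The versioned history is what ties these features together, and a small in-memory model shows why. This is a sketch, not the SDK: VersionedState is a hypothetical class, and it assumes set replaces the whole state while patch shallow-merges, which may differ from Tidebase's actual semantics.

```typescript
// Illustrative sketch only: VersionedState is hypothetical and keeps history in memory.
// Assumption: set replaces the whole state, patch shallow-merges into it.
type State = Record<string, unknown>

class VersionedState {
  private history: State[] = [{}] // version 0 is the empty state

  get version(): number {
    return this.history.length - 1
  }

  get current(): State {
    return this.history[this.version] ?? {}
  }

  // Every update appends a new version instead of overwriting the old one.
  set(next: State): number {
    this.history.push({ ...next })
    return this.version
  }

  patch(partial: State): number {
    return this.set({ ...this.current, ...partial })
  }

  // Time travel: read the state as of an earlier version.
  at(version: number): State | undefined {
    return this.history[version]
  }

  // Fork: start an independent history that shares everything up to `version`.
  fork(version: number): VersionedState {
    const branch = new VersionedState()
    branch.history = this.history.slice(0, version + 1).map((s) => ({ ...s }))
    return branch
  }
}

const state = new VersionedState()
state.set({ status: 'drafting', progress: 0.5 }) // becomes version 1
state.patch({ progress: 0.9 })                   // becomes version 2
const earlier = state.at(1)
const branch = state.fork(1)
branch.patch({ status: 'retrying' })             // does not affect `state`
```

Because no update destroys an earlier version, reading an old version, labeling it as a snapshot, and branching from it are all lookups on the same append-only list.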
See also: Quickstart · How to resume a failed run