How to checkpoint AI agent workflows in Postgres

To checkpoint an AI agent workflow in Postgres with Tidebase, wrap each meaningful unit of work in a named run.step(). Each step’s result is stored as a checkpoint in your own Postgres database the moment it completes. If the run is re-invoked, completed steps are replayed from storage instead of executing again.

import { Tidebase } from '@tidebase/sdk'

// runId, llm, and searchTools are assumed to be defined by your application.
const tide = new Tidebase()

await tide.run('research-agent', { runId }, async (run, input) => {
  // Each named step is checkpointed when it completes.
  const plan = await run.step('plan', () => llm.plan(input.question))
  const docs = await run.step('search', () => searchTools.gather(plan))
  const draft = await run.step('draft', () => llm.write(docs))

  // Publish live state for the UI.
  await run.state.set({ status: 'reviewing', progress: 0.9 })

  return run.step('finalize', () => llm.polish(draft))
})

Tidebase is an open-source checkpoint layer for AI agents. Wrap your steps, and failed runs resume from the last safe point, with checkpoints kept in your own Postgres and without moving execution into a new runtime.
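To make the replay behavior concrete, here is a minimal in-memory sketch of step memoization. It is not the Tidebase implementation: a Map stands in for the Postgres checkpoint table, and createRun and CheckpointStore are hypothetical names used only for illustration.

```typescript
// Illustrative only: a Map stands in for a Postgres checkpoint table,
// and createRun is a hypothetical helper, not part of the Tidebase SDK.
type CheckpointStore = Map<string, unknown>

function createRun(store: CheckpointStore, runId: string) {
  return {
    async step<T>(name: string, fn: () => T | Promise<T>): Promise<T> {
      const key = `${runId}:${name}`
      // Replay: a completed step returns its stored result without running fn.
      if (store.has(key)) return store.get(key) as T
      // First execution: run the work, then record the result before returning.
      // If fn throws, nothing is stored, so the step runs again on the next attempt.
      const result = await fn()
      store.set(key, result)
      return result
    },
  }
}
```

Re-invoking the same run ID against the same store skips every step that already completed and executes only the ones that did not.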

Why not hand-roll a status column?

Every multi-step agent grows the same plumbing: a status column, checkpoint JSON blobs, retry flags, progress streaming, and a prayer that rerunning a failed run doesn’t double-charge a customer. The plumbing is easy to write. What is hard is getting the following right under contention:

Leases. Two workers picking up the same run must be mutually exclusive, and a zombie worker must be fenced from writing stale results.
Input-hash checking. A checkpoint written for different inputs must not be silently reused.
Event ordering. The event log must stay ordered and gap-free under concurrent writers.
Approval gates. A gate must resolve exactly once, even when the run resumes.
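As an illustration of the input-hash item above, the sketch below hashes a step's input with SHA-256 and reuses a checkpoint only when the hashes match. It assumes JSON-serializable inputs; stableStringify, hashInput, reuseIfSameInput, and the Checkpoint shape are hypothetical names, not Tidebase APIs.

```typescript
import { createHash } from 'node:crypto'

// Serialize with sorted object keys so logically equal inputs hash identically.
// Assumes the value is JSON-serializable (no undefined, functions, or cycles).
function stableStringify(value: unknown): string {
  if (value === null || typeof value !== 'object') return JSON.stringify(value)
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`
  const obj = value as Record<string, unknown>
  const entries = Object.keys(obj)
    .sort()
    .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
  return `{${entries.join(',')}}`
}

function hashInput(input: unknown): string {
  return createHash('sha256').update(stableStringify(input)).digest('hex')
}

// Hypothetical checkpoint row: the result plus the hash of the input that produced it.
interface Checkpoint {
  inputHash: string
  result: unknown
}

// Return the stored result only if it was produced from the same input.
function reuseIfSameInput(
  cp: Checkpoint | undefined,
  input: unknown,
): { hit: true; result: unknown } | { hit: false } {
  if (cp !== undefined && cp.inputHash === hashInput(input)) {
    return { hit: true, result: cp.result }
  }
  return { hit: false }
}
```

Sorting keys before hashing matters because two objects with the same fields in a different order would otherwise look like different inputs and force a needless re-run.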

Tidebase’s test suite asserts these invariants against real Postgres with concurrency probes — they are the product.
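To show the kind of invariant such a probe might check, here is a small in-memory model of a lease with a fencing token. It is a sketch under stated assumptions, not Tidebase's schema or test code: LeaseTable, acquire, and write are hypothetical, and in Postgres the token comparison would more plausibly live in the WHERE clause of the write itself.

```typescript
// Hypothetical in-memory model of one run's lease. Time is passed in
// explicitly so the behavior is deterministic.
interface Lease {
  owner: string
  token: number
  expiresAt: number
}

class LeaseTable {
  private lease: Lease | undefined
  private nextToken = 1
  readonly results: string[] = []

  // Grant the lease if it is free or expired. Every grant gets a larger token.
  acquire(owner: string, now: number, ttlMs: number): number | undefined {
    if (this.lease !== undefined && this.lease.expiresAt > now) return undefined
    const token = this.nextToken++
    this.lease = { owner, token, expiresAt: now + ttlMs }
    return token
  }

  // Accept a write only from the holder of the most recently granted token.
  // A worker whose lease was taken over holds a stale token and is rejected.
  write(token: number, value: string): boolean {
    if (this.lease === undefined || this.lease.token !== token) return false
    this.results.push(value)
    return true
  }
}
```

In this model, a worker that stalls past its lease expiry and wakes up after another worker has taken over still holds its old token, so its late write is refused rather than overwriting newer results.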

What gets stored

All in Postgres, all self-hosted: runs and attempts, named step checkpoints, input hashes, resume contracts, live run state, versioned state streams, labeled snapshots, parent/child run edges, append-only events, gates and decisions, recovery attempts, and usage records.

Granularity: what should be a step?

A step is a unit of work you would rather not repeat: an LLM call, a batch of tool calls, an external write. Cheap, pure computation doesn’t need a step. External writes should declare an idempotency key in their resume contract so that a retry can’t double-fire them.
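The idempotency-key idea can be sketched independently of any SDK. The example below assumes a key derived from the run ID and step name and a downstream API that deduplicates on that key. FakePaymentApi and idempotencyKey are hypothetical, and how Tidebase's resume contract actually declares the key is not shown here.

```typescript
// Hypothetical downstream service that deduplicates requests by idempotency key.
class FakePaymentApi {
  private seen = new Map<string, string>()
  chargeCount = 0
  totalCents = 0

  charge(key: string, amountCents: number): string {
    // A repeated key returns the original charge ID and has no further effect.
    const existing = this.seen.get(key)
    if (existing !== undefined) return existing
    this.chargeCount += 1
    this.totalCents += amountCents
    const id = `ch_${this.chargeCount}`
    this.seen.set(key, id)
    return id
  }
}

// A key that is stable across retries of the same step in the same run.
const idempotencyKey = (runId: string, stepName: string): string =>
  `${runId}:${stepName}`
```

Because the key depends only on the run and the step, a retry after a crash sends the same key and the customer is charged once.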

Live state for your UI

run.state.set() / run.state.patch() update live run state your product UI can subscribe to via SSE (GET /runs/:runId/events) — progress bars and status text without custom socket plumbing. Each update also appends to a versioned history, which is what makes snapshots, forking, and time travel fall out of the same model.
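A rough in-memory model shows why a versioned history makes snapshots and time travel cheap. This is an assumption-laden sketch, not the Tidebase data model: VersionedState, get, and at are hypothetical names, and set is assumed to replace the whole state while patch merges shallowly.

```typescript
type State = Record<string, unknown>

// Hypothetical model: every update appends an immutable version to a history.
class VersionedState {
  private current: State = {}
  private history: State[] = []

  // Replace the whole state. Returns the new version number (1-based).
  set(next: State): number {
    return this.commit({ ...next })
  }

  // Shallow-merge into the current state. Returns the new version number.
  patch(partial: State): number {
    return this.commit({ ...this.current, ...partial })
  }

  get(): State {
    return this.current
  }

  // Read the state as it was at a given version, or undefined if out of range.
  at(version: number): State | undefined {
    return this.history[version - 1]
  }

  private commit(next: State): number {
    this.current = next
    this.history.push(next)
    return this.history.length
  }
}
```

With every version retained, a labeled snapshot is just a name pointing at a version number, and a fork is a new history that starts from the state returned by at().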