# Tidebase — full documentation for LLMs

> Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.

---

# Tidebase Documentation

Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.

It is **not** a workflow engine, queue, LLM proxy, or hosted runtime. Your code keeps running in your own app, worker, or job process. Tidebase stores checkpoints, live state, events, approval gates, and usage records around it, so the question "this run died at step 7 — is it safe to rerun?" has a reliable answer.

```typescript
import { Tidebase } from '@tidebase/sdk'

const tide = new Tidebase()

await tide.run('generate-report', { runId }, async (run, input) => {
  const plan = await run.step('plan', () => makePlan(input))
  await run.state.set({ status: 'writing', progress: 0.7 })
  return run.step('write-report', () => writeReport(plan))
})
```

Re-invoke the same workflow with the same `runId` after a crash: completed steps return their checkpointed results instantly, and execution continues at the first incomplete step.

## Guides

- [Quickstart](quickstart.md): get running with one non-interactive command sequence that an agent can execute
- [How to resume a failed AI agent run](how-to-resume-a-failed-ai-agent-run.md)
- [How to checkpoint AI agent workflows in Postgres](checkpoint-ai-agent-workflows-postgres.md)
- [Is it safe to rerun? The replay contract](replay-contract-is-it-safe-to-rerun.md)
- [Human approval gates for AI agents](human-approval-gates-for-ai-agents.md)
- [Fork, time travel, and snapshot agent runs](fork-and-time-travel-agent-runs.md)
- [Queues, cron schedules, and cancellation](queues-schedules-and-cancellation.md)
- [Fan out to subagents with child runs](fanout-subagents-child-runs.md)
- [Track LLM token usage and cost per run](track-llm-token-costs-per-run.md)

## Integrations

- [How to checkpoint Vercel AI SDK agents](integrate/vercel-ai-sdk.md)
- [How to checkpoint Claude Agent SDK sessions](integrate/claude-agent-sdk.md)
- [How to add durable checkpoints to Mastra agents](integrate/mastra.md)
- [Approval gates and a durable run record for any MCP agent](integrate/mcp-agents.md)
- [How to checkpoint OpenAI Agents SDK runs](integrate/openai-agents-sdk.md)
- [How to use Tidebase with LangGraph](integrate/langgraph.md)
- [How to checkpoint CrewAI crews](integrate/crewai.md)
- [How to checkpoint Pydantic AI agents](integrate/pydantic-ai.md)
- [How to run durable AI workflows behind a Next.js route](integrate/nextjs.md)
- [How to wire Tidebase into an Express app](integrate/express.md)
- [How to wire Tidebase into a Hono app](integrate/hono.md)
- [How to wire Tidebase into a FastAPI app](integrate/fastapi.md)
- [How to wire Tidebase into a SvelteKit app](integrate/sveltekit.md)

## Comparisons

- [Tidebase vs Temporal](compare/tidebase-vs-temporal.md)
- [Tidebase vs Inngest](compare/tidebase-vs-inngest.md)
- [Tidebase vs LangGraph checkpointers](compare/tidebase-vs-langgraph-checkpointer.md)
- [Tidebase vs DBOS and Restate](compare/tidebase-vs-dbos-restate.md)
- [Tidebase vs Trigger.dev](compare/tidebase-vs-trigger-dev.md)
- [Tidebase vs Hatchet](compare/tidebase-vs-hatchet.md)
- [Tidebase vs BullMQ](compare/tidebase-vs-bullmq.md)
- [Tidebase vs Cloudflare Workflows](compare/tidebase-vs-cloudflare-workflows.md)

## For AI assistants and agents

- [/llms.txt](../llms.txt) — index of these docs for retrieval
- [/llms-full.txt](../llms-full.txt) — all docs in one file
- [MCP server](../mcp-server/README.md) — inspect runs, resolve gates, and debug replay from your assistant
- [Tidebase Agent Skill](../skills/tidebase/SKILL.md) — teaches coding agents when and how to use Tidebase

## Status

Tidebase is an open-source (Apache-2.0), self-hosted alpha: Postgres-backed server, TypeScript and Python SDKs, and a Studio dashboard. API auth is opt-in via a shared `TIDEBASE_API_KEY`; without it, run in trusted local/self-hosted environments only. Repo: <https://github.com/BlueprintLabIO/tidebase>

---

# Quickstart

Get Tidebase running and watch a crashed agent run resume from its last checkpoint, in about two minutes. Every command below is non-interactive and safe to run in a fresh clone — if you are an AI agent, you can execute this page top to bottom.

## Prerequisites

- Docker (for Postgres)
- Node.js 20+ and pnpm

## 1. Clone and start

```bash
git clone https://github.com/BlueprintLabIO/tidebase
cd tidebase
docker compose up -d postgres
pnpm install
pnpm dev
```

This starts:

- **Server** at `http://localhost:7373` (REST + SSE API; auto-runs the SQL schema on first boot)
- **Studio** at `http://localhost:5173` (dashboard showing every run, step, gate, and event)

## 2. Run the example workflow

In a second terminal:

```bash
pnpm example
```

The example runs three units of work (`plan`, `fetch-sources`, `write-report`), each wrapped in a checkpointed step.

## 3. Crash it, then resume it

Force a failure after two completed checkpoints:

```bash
FAIL_WRITE=1 pnpm example
```

Copy the run id from Studio (or `GET http://localhost:7373/runs`), then re-invoke with the same id:

```bash
TIDEBASE_RUN_ID=run_xxx pnpm example
```

`plan` and `fetch-sources` return instantly from their checkpoints. Only `write-report` executes again. That is the core guarantee: **completed steps never repeat.**

## 4. Verify

```bash
pnpm test
```

The suite (57 tests against real Postgres) asserts the durability invariants directly: checkpoint replay, lease mutual exclusion, gate exactly-once resolution, fenced zombie writers, and gap-free event ordering.

## 5. Use it in your own app

```typescript
import { Tidebase } from '@tidebase/sdk'

const tide = new Tidebase() // reads TIDEBASE_URL, defaults to http://localhost:7373

await tide.run('my-workflow', { runId }, async (run, input) => {
  const a = await run.step('step-a', () => doA(input))
  const b = await run.step('step-b', () => doB(a))
  return b
})
```

Tidebase never executes your code, but since v0.5 it can trigger re-invocation for you: [durable queues, cron schedules, and automatic requeue on worker death](queues-schedules-and-cancellation.md). Alternatively, keep your own queue or cron and use the [signed recovery webhook](how-to-resume-a-failed-ai-agent-run.md#recovery-webhooks).

Next: [How to resume a failed AI agent run](how-to-resume-a-failed-ai-agent-run.md) · [The replay contract](replay-contract-is-it-safe-to-rerun.md)

---

# Frequently asked questions

These are the questions developers actually ask about Tidebase.

## Is it safe to rerun a failed AI agent run?

With Tidebase, yes, provided every step that performs external writes declares an idempotency key. Tidebase makes that condition explicit instead of leaving it to tribal knowledge. Re-invoking a run with the same `runId` replays completed steps from their checkpoints (they never re-execute), and a failed step is classified by its resume contract: read-only and idempotency-keyed steps are safe to replay automatically, while unkeyed external writes park in `manual_review` so a retry can't double-charge a customer.

## How do I resume an AI agent workflow from where it crashed?

Re-invoke the same workflow function with the same `runId`. Completed steps return their checkpointed results from Postgres instantly; execution continues at the first incomplete step. Since v0.5 Tidebase triggers the re-invocation itself for queued runs (retries with backoff, requeue on worker death) and can push signed invocation webhooks to your app — or you keep your own queue/cron.

## Can Tidebase replace my job queue or cron?

For agent workloads, yes: v0.5 ships durable queues (dedupe keys, delays, priorities, retries with backoff, per-queue concurrency and rate caps) and five-field UTC cron schedules, with pull-mode workers (`tide.work()`) or signed push webhooks. A queued job is a run — one lifecycle authority, no status drift. Double-fires on schedules are structurally impossible via fire-time dedupe keys.

## How do I cancel a running agent?

`tide.runs.cancel(runId, { reason, actor })`. Cancellation is authoritative and one-way: in-flight workers observe it at their next step or gate boundary, a gate-blocked worker unwinds immediately, and `complete`/`fail` after cancel are refused. Deadlines cancel automatically.

## Does Tidebase execute my code?

No, deliberately. Your app, worker, or job process keeps calling your LLMs, tools, and APIs directly. The SDK wraps your workflow function, and a Postgres-backed server stores checkpoints, live state, events, and approval gates around it. You get "completed steps never repeat," not "your dead process magically restarts."

## Do I need Temporal for an AI agent?

If you're already all-in on Temporal, use it — it gives true durable execution. But Temporal asks you to move execution into its worker model: deterministic workflow functions, activities, task queues, and a cluster to operate. Tidebase asks you to wrap functions you already have — a smaller guarantee for a much smaller adoption cost, with agent-specific primitives (gates, fanout, state forking, token-cost tracking) built in.

## How is Tidebase different from LangGraph checkpointers?

LangGraph checkpointers persist the state of a LangGraph graph; Tidebase checkpoints any code, in any framework or none. It also covers operational ground a graph checkpointer doesn't try to: run/step leases with zombie-worker fencing, replay-safety classification, durable approval gates deliverable over webhooks, parent/child run trees, a live-state SSE API, and a Studio dashboard.

## How do I pause an AI agent for human approval?

Call `run.gate('approve-send', { prompt: 'Send it?' })`. The workflow blocks on a durable gate stored in Postgres; it can be approved or rejected from Studio, your own product UI, or any webhook surface. Gates resolve exactly once — a second resolve attempt gets a conflict, and an already-resolved gate replays its decision on resume.

## Can two workers pick up the same run?

No. Run and step leases are mutually exclusive and fenced, so a zombie worker that wakes up late cannot write back stale results. This is asserted by concurrency-probe tests against real Postgres.

## Where does my data live?

In your own Postgres. Tidebase is self-hosted (Docker + Postgres), Apache-2.0 licensed, and never proxies your LLM calls or stores your API keys.

## Is Tidebase production-ready?

Not yet — it's an alpha. API auth is opt-in (set `TIDEBASE_API_KEY` on server and SDK); without it, run only in trusted local/self-hosted environments. The durability invariants (checkpoint replay, lease fencing, exactly-once gates) are enforced by a 57-test suite against real Postgres, but treat it as ready for local demos and early feedback, not production traffic.

## What languages does the SDK support?

TypeScript (`@tidebase/sdk`) and Python (`tidebase` on PyPI, zero dependencies). The server is plain HTTP + SSE, so any other language can integrate directly against the API.

---

# How to resume a failed AI agent run

To resume a failed agent run with Tidebase, re-invoke the same workflow function with the same `runId`. Completed steps return their checkpointed results from Postgres without re-executing; the workflow continues at the first incomplete step.

```typescript
import { Tidebase } from '@tidebase/sdk'

const tide = new Tidebase()

// First invocation: dies after fetch-sources
// Second invocation with the same runId: plan and fetch-sources
// replay from checkpoints, write-report runs for the first time.
await tide.run('generate-report', { runId }, async (run, input) => {
  const plan = await run.step('plan', () => makePlan(input))
  const sources = await run.step('fetch-sources', () => fetchSources(plan))
  return run.step('write-report', () => writeReport(sources))
})
```

## What replay guarantees

- **Completed steps never repeat.** Their results are returned from the checkpoint store, including across process crashes and machine restarts.
- **Two workers can't grab the same run.** Run and step leases are mutually exclusive and fenced, so a zombie worker that wakes up late cannot write back stale results.
- **Stale checkpoints are rejected.** Each step records an input hash; if the input changed since the checkpoint was written, replay fails loudly instead of silently reusing a wrong result.
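
The first and third guarantees can be illustrated with a small in-memory sketch. It is not Tidebase's implementation: `CheckpointStore`, `hashInput`, and the FNV-1a hash are hypothetical stand-ins, the real store is Postgres, and the real hashing scheme is not described here.

```typescript
// Hypothetical in-memory stand-in for the checkpoint store (Tidebase itself uses Postgres).
type Checkpoint = { inputHash: string; result: unknown }

// Illustrative FNV-1a hash over the JSON form of the input (an assumption, not Tidebase's scheme).
// Note that JSON.stringify is sensitive to object key order.
function hashInput(input: unknown): string {
  const text = JSON.stringify(input ?? null)
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return (hash >>> 0).toString(16)
}

class CheckpointStore {
  private checkpoints = new Map<string, Checkpoint>()

  step<T>(runId: string, name: string, input: unknown, fn: () => T): T {
    const key = `${runId}:${name}`
    const inputHash = hashInput(input)
    const existing = this.checkpoints.get(key)
    if (existing) {
      // Replay path: refuse to reuse a result that was computed for different input.
      if (existing.inputHash !== inputHash) {
        throw new Error(`input hash drift on step "${name}"`)
      }
      return existing.result as T
    }
    // First execution: run the function and record the checkpoint.
    const result = fn()
    this.checkpoints.set(key, { inputHash, result })
    return result
  }
}
```

Calling `step` twice with the same run id, step name, and input runs `fn` once. Calling it again with different input throws instead of returning the stored result.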

## Who re-invokes the workflow?

Tidebase never executes your code, but since v0.5 it owns *triggering*. After a failure, re-invocation can come from:

1. **Tidebase queues** — a failed queue run with attempts remaining goes back to `queued` with backoff and is re-claimed by `tide.work()` (or re-dispatched over a signed push webhook). Worker death is handled the same way: the lease expires and the reconciler requeues the run.
2. **A recovery webhook** — see below; the reconciler fires it automatically for stalled runs.
3. **Your own queue, cron, or a retry button** in your product UI or Studio — re-invocation by `runId` is always safe.

You get "completed steps never repeat" in every one of these paths.

## Recovery webhooks

Tidebase can call back into your app when a run fails:

```typescript
const run = await tide.runs.create('generate-report', {
  input: { topic: 'checkpoints' },
  recoveryWebhook: 'https://your-app.example.com/api/tidebase'
})
```

When the run fails, Tidebase POSTs a `run.resume` payload to that URL and records every attempt (delivery status, HTTP status, response body). If `TIDEBASE_WEBHOOK_SECRET` is set on both server and SDK, payloads are HMAC-signed with `x-tidebase-signature`, and the SDK rejects unsigned or tampered payloads.
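
If you handle the webhook without the SDK, signature checking might look like the sketch below. The algorithm and encoding are assumptions (HMAC-SHA256 over the raw request body, hex-encoded); the text above only states that payloads are HMAC-signed and carried in `x-tidebase-signature`. `sign` and `verifySignature` are hypothetical helper names.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto'

// Assumption: HMAC-SHA256 of the raw body, hex-encoded. Check the server source for the real scheme.
function sign(secret: string, rawBody: string): string {
  return createHmac('sha256', secret).update(rawBody).digest('hex')
}

// header is the value of the x-tidebase-signature request header, if present.
function verifySignature(secret: string, rawBody: string, header: string | undefined): boolean {
  if (!header) return false // unsigned payloads are rejected
  const expected = Buffer.from(sign(secret, rawBody), 'hex')
  const received = Buffer.from(header, 'hex')
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (received.length !== expected.length) return false
  return timingSafeEqual(received, expected)
}
```

Verify against the raw request body, not a re-serialized JSON object, since re-serialization can change the bytes and invalidate the signature.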

## When a step itself failed (not the process)

A step that threw is classified by its [replay contract](replay-contract-is-it-safe-to-rerun.md):

- `failed_retryable` — SDK retries remain; safe to re-invoke.
- `manual_review` — the step has external side effects without an idempotency key, or declared manual replay. A human decides.
- `failed` — hard failure.

This classification is the difference between Tidebase and a hand-rolled `status` column: the resume decision is explicit and stored, not buried in logs.

See also: [Quickstart](quickstart.md) · [Tidebase vs Temporal](compare/tidebase-vs-temporal.md)

---

# How to checkpoint AI agent workflows in Postgres

To checkpoint an AI agent workflow in Postgres with Tidebase, wrap each meaningful unit of work in a named `run.step()`. Each step's result is stored as a checkpoint in your own Postgres database the moment it completes, and replayed from storage if the run is re-invoked.

```typescript
import { Tidebase } from '@tidebase/sdk'

const tide = new Tidebase()

await tide.run('research-agent', { runId }, async (run, input) => {
  const plan = await run.step('plan', () => llm.plan(input.question))
  const docs = await run.step('search', () => searchTools.gather(plan))
  const draft = await run.step('draft', () => llm.write(docs))

  await run.state.set({ status: 'reviewing', progress: 0.9 })

  return run.step('finalize', () => llm.polish(draft))
})
```

## Why not hand-roll a status column?

Every multi-step agent grows the same plumbing: a `status` column, checkpoint JSON blobs, retry flags, progress streaming, and a prayer that rerunning a failed run doesn't double-charge a customer. The plumbing is easy to write; what's hard to get right under contention:

- **Leases.** Two workers picking up the same run must be mutually exclusive, and a zombie worker must be fenced from writing stale results.
- **Input-hash checking.** A checkpoint written for different inputs must not be silently reused.
- **Ordered, gap-free event logs** under concurrent writers.
- **Exactly-once approval gates** that survive a resume.

Tidebase's test suite asserts these invariants against real Postgres with concurrency probes — they are the product.
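
To make "fenced" concrete, here is a minimal in-memory sketch of lease fencing with a strictly increasing token. `LeaseTable` and its fields are hypothetical and do not reflect Tidebase's schema; the point is only that a writer holding an old token is refused.

```typescript
// Hypothetical in-memory lease table; names and fields are illustrative only.
interface Lease { owner: string; token: number; expiresAt: number }

class LeaseTable {
  private leases = new Map<string, Lease>()

  // Returns a fencing token, or null while another owner holds an unexpired lease.
  acquire(runId: string, owner: string, now: number, ttlMs: number): number | null {
    const current = this.leases.get(runId)
    if (current && current.expiresAt > now && current.owner !== owner) return null
    const token = (current?.token ?? 0) + 1 // strictly increasing per run
    this.leases.set(runId, { owner, token, expiresAt: now + ttlMs })
    return token
  }

  // A write is accepted only if it carries the latest token for the run.
  acceptWrite(runId: string, token: number): boolean {
    return this.leases.get(runId)?.token === token
  }
}
```

If worker A stalls past its lease and worker B takes over, A still holds token 1 while the table holds token 2, so A's late write is refused. In a database the same check is typically a conditional update on the token column.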

## What gets stored

All in Postgres, all self-hosted: runs and attempts, named step checkpoints, input hashes, resume contracts, live run state, versioned state streams, labeled snapshots, parent/child run edges, append-only events, gates and decisions, recovery attempts, and usage records.

## Granularity: what should be a step?

A step is a unit you'd be happy to *not repeat*: an LLM call, a tool call batch, an external write. Cheap pure computation doesn't need a step. External writes should declare an [idempotency key in their resume contract](replay-contract-is-it-safe-to-rerun.md) so a retry can't double-fire them.

## Live state for your UI

`run.state.set()` / `run.state.patch()` update live run state your product UI can subscribe to via SSE (`GET /runs/:runId/events`) — progress bars and status text without custom socket plumbing. Each update also appends to a versioned history, which is what makes [snapshots, forking, and time travel](fork-and-time-travel-agent-runs.md) fall out of the same model.

See also: [Quickstart](quickstart.md) · [How to resume a failed run](how-to-resume-a-failed-ai-agent-run.md)

---

# "This run died at step 7 — is it safe to rerun?" The replay contract

Yes, it is safe to rerun a failed Tidebase run **if** every step that performs external writes declares an idempotency key — and Tidebase makes that condition explicit instead of leaving it to tribal knowledge. Each step records a *resume contract*: what side effects it has, how it can be replayed, and what its checkpoint guarantees.

```typescript
await run.step(
  'send-email',
  {
    input: { userId },
    sideEffects: ['email.send'],
    idempotencyKey: `welcome:${userId}`,
    replay: 'auto',
    checkpointInvariant: 'provider accepted the message id',
    verifiedBy: 'email provider response'
  },
  () => sendWelcomeEmail(userId)
)
```

## How failures are classified

When a step fails, Tidebase classifies the failure using the contract:

| Classification | Meaning | When |
|---|---|---|
| `failed_retryable` | Safe to re-invoke; SDK retries remain | transient failures |
| `manual_review` | A human must decide before rerun | the step has side effects but **no idempotency key**, or declared `replay: 'manual'` |
| `failed` | Hard failure | non-retryable errors |

Read-only and idempotency-keyed steps are treated as safe to replay (`safe_replay`); unkeyed external writes park in `manual_review` instead of silently double-firing on the next retry. This is the answer to "will rerunning double-charge the customer?": if it could, the run *stops and tells you* rather than guessing.
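
The classification rules above can be sketched as a pure function. This is an illustration under stated assumptions, not the SDK's code: `classifyFailure` and the `retryable` / `retriesRemaining` inputs are hypothetical, and the precedence (manual review wins over retry) is inferred from the table.

```typescript
type Classification = 'failed_retryable' | 'manual_review' | 'failed'

// Subset of the resume contract fields shown in the example above.
interface ResumeContract {
  sideEffects?: string[]
  idempotencyKey?: string
  replay?: 'auto' | 'manual'
}

function classifyFailure(
  contract: ResumeContract,
  opts: { retriesRemaining: number; retryable: boolean } // hypothetical inputs
): Classification {
  const hasSideEffects = (contract.sideEffects?.length ?? 0) > 0
  const unkeyedWrite = hasSideEffects && !contract.idempotencyKey
  // Assumed precedence: anything unsafe to replay parks for a human first.
  if (unkeyedWrite || contract.replay === 'manual') return 'manual_review'
  if (opts.retryable && opts.retriesRemaining > 0) return 'failed_retryable'
  return 'failed'
}
```

With this rule, the `send-email` step above would classify as `failed_retryable` on a transient error because it declares an idempotency key; removing the key would move it to `manual_review`.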

## What this is honestly not

Resume contracts do **not** make external systems exactly-once — only your idempotency keys (enforced by the external system) can do that. What Tidebase changes is that the resume decision is explicit, stored with the step, and visible in Studio, instead of hidden in logs and custom retry flags.

## Guarantees enforced underneath

- Completed steps replay from storage and never re-execute, including across crash + recovery-webhook resume.
- Step and run leases are mutually exclusive and fenced — zombie workers cannot write back stale results.
- Input-hash drift on replay is rejected before it can corrupt a run.
- Gates resolve exactly once and replay their decision on resume.

Each of these is asserted by an invariant test against real Postgres, including concurrency probes.

See also: [How to resume a failed run](how-to-resume-a-failed-ai-agent-run.md) · [Human approval gates](human-approval-gates-for-ai-agents.md)

---

# Queues, cron schedules, and cancellation for AI agents

To run agent workflows on a durable queue with Tidebase, enqueue them. Dedupe keys, delays, retries with backoff, and per-queue concurrency caps are built in, and a queued job *is* a run, so its status lives in the same authoritative lifecycle as everything else.

```typescript
// enqueue: at most one active run per dedupe key, 3 attempts with backoff
await tide.enqueue('generate-report', {
  queue: 'reports',
  input: { topic },
  dedupeKey: `report:${topic}`,
  maxAttempts: 3,
  deadlineMs: 600_000
})

// pull-mode worker: claims ready runs, executes registered workflows
tide.workflow('generate-report', generateReport)
await tide.work({ queues: ['reports'] })
```

Since v0.5 Tidebase can also decide **when** your code runs. It still never executes it.

## Two dispatch modes

- **Pull:** `tide.work()` claims ready runs with `SKIP LOCKED` semantics, so two workers can never receive the same job, and per-queue concurrency caps and rate limits hold under contention. Available in TypeScript and Python (including asyncio).
- **Push:** configure a queue with an `invokeUrl` and Tidebase delivers signed `run.invoke` webhooks to your app (same HMAC scheme as recovery webhooks). Delivery is at-least-once with a redelivery horizon; because the run is begun by id, a redelivered webhook resumes the same run instead of starting a duplicate.

## Cron schedules

```typescript
await tide.schedules.set('daily-digest', { cron: '0 9 * * *', workflowName: 'daily-digest' })
```

Five-field UTC cron. Each fire enqueues with a dedupe key derived from the fire time (`sched:<name>:<time>`), so a double-fire is structurally impossible — even with multiple server replicas.
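
A minimal sketch of why fire-time dedupe keys rule out double-fires. The key prefix follows the `sched:<name>:<time>` shape above, but the exact time format is an assumption (ISO 8601, truncated to the minute), and `fireKey` and `DedupedQueue` are hypothetical names. In Postgres the set would be a uniqueness constraint rather than an in-memory `Set`.

```typescript
// Assumption: the <time> part is the fire time truncated to the minute, in ISO 8601 UTC.
function fireKey(name: string, fireTime: Date): string {
  const minute = new Date(Math.floor(fireTime.getTime() / 60_000) * 60_000)
  return `sched:${name}:${minute.toISOString()}`
}

// Hypothetical stand-in for the queue's dedupe index.
class DedupedQueue {
  private activeKeys = new Set<string>()

  // Returns true if the run was enqueued, false if the key already exists.
  enqueue(dedupeKey: string): boolean {
    if (this.activeKeys.has(dedupeKey)) return false
    this.activeKeys.add(dedupeKey)
    return true
  }
}
```

Two server replicas that both wake up for the 09:00 fire compute the same key, so only the first enqueue succeeds, regardless of a few hundred milliseconds of clock skew between them.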

## Retries, worker death, and the reconciler

A failed run with attempts remaining transitions back to `queued` with exponential backoff; exhausting `maxAttempts` records `failure_class: 'max_retries'`. If a worker dies mid-run, the lease expires and the reconciler — one advisory-locked loop — requeues the run (completed steps replay from checkpoints, so the retry never re-pays for finished work). Stalled non-queue runs get their signed recovery webhook fired automatically.
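
The retry rule can be sketched as follows. The base delay and cap are assumed values chosen for illustration; the text above says only that backoff is exponential. `backoffDelayMs` and `afterFailure` are hypothetical names.

```typescript
type RetryOutcome =
  | { status: 'queued'; delayMs: number }
  | { status: 'failed'; failureClass: 'max_retries' }

// baseMs and capMs are assumptions; Tidebase's actual backoff parameters are not stated here.
function backoffDelayMs(attempt: number, baseMs = 1_000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1))
}

// attempt is the 1-based number of the attempt that just failed.
function afterFailure(attempt: number, maxAttempts: number): RetryOutcome {
  if (attempt >= maxAttempts) return { status: 'failed', failureClass: 'max_retries' }
  return { status: 'queued', delayMs: backoffDelayMs(attempt) }
}
```

With `maxAttempts: 3`, the first two failures requeue the run with growing delays and the third records `failure_class: 'max_retries'`. Production schedulers usually add jitter to the delay; it is omitted here to keep the sketch deterministic.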

## Cancellation

```typescript
await tide.runs.cancel(runId, { reason: 'customer asked', actor: 'support' })
```

Cancellation is authoritative, durable, and one-way: status flips to `cancelled` immediately, in-flight workers observe it at their next step or gate boundary (`RunCancelledError` in TS, `RunCancelled` in Python; this includes a worker blocked waiting on a gate), and a `complete` or `fail` arriving afterwards is refused. Deadlines (`deadlineMs`) cancel automatically with reason `deadline`. A cancellation cannot be missed because user code skipped a cleanup branch: the server enforces it, not your `finally` block.

## The lifecycle, in one place

`pending/queued → running → completed | failed | cancelled`, with `failure_class` on terminal failures. Don't mirror it into your own status columns — query `GET /runs/:id` or subscribe to events. Every guarantee on this page is enforced by an invariant test with concurrency probes against real Postgres.
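
As a reading aid, the lifecycle can be written as a transition table. This is a reconstruction from the line above plus the retry rule, not the server's actual state machine: modeling a retry as `running → queued` and allowing cancellation before a run starts are both assumptions.

```typescript
type RunStatus = 'pending' | 'queued' | 'running' | 'completed' | 'failed' | 'cancelled'

// Assumed transition table; terminal states have no outgoing transitions.
const transitions: Record<RunStatus, RunStatus[]> = {
  pending: ['running', 'cancelled'],
  queued: ['running', 'cancelled'],
  running: ['completed', 'failed', 'cancelled', 'queued'], // queued = retry with attempts remaining
  completed: [],
  failed: [],
  cancelled: []
}

function canTransition(from: RunStatus, to: RunStatus): boolean {
  return transitions[from].includes(to)
}
```

The empty list for `cancelled` is what "one-way" means in practice: a late `complete` or `fail` from a worker has no valid transition and is refused.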

See also: [The replay contract](replay-contract-is-it-safe-to-rerun.md) · [How to resume a failed run](how-to-resume-a-failed-ai-agent-run.md)

---

# Human approval gates for AI agents

To pause an AI agent until a human approves a risky action, call `run.gate()`. The workflow blocks on a durable gate stored in Postgres; it can be approved or rejected from Tidebase Studio, your own product UI, a Slack-style adapter, or a plain webhook — and the decision survives crashes and replays exactly once.

```typescript
const decision = await run.gate('approve-send', {
  prompt: 'Send this report to the customer?',
  data: { reportId },
  channels: [{ type: 'webhook', url: process.env.REVIEW_WEBHOOK_URL! }],
  capability: {
    name: 'report.send',
    scopes: ['report:send'],
    reason: 'agent wants to send an external report'
  }
})

if (decision.decision !== 'approved') {
  throw new Error('Report was not approved')
}
```

## Gate semantics

- **Durable:** the gate is a Postgres row, not an in-memory promise. The process can die while waiting; on resume, a still-pending gate keeps waiting, and an already-resolved gate replays its decision.
- **Exactly-once:** resolution requires the gate's `resolveToken` and only succeeds while the gate is `pending`. A second resolve attempt gets a 409, not a double-approval.
- **Decisions:** `approved`, `rejected`, or `canceled`, with optional `actor` and payload recorded for audit.
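
The exactly-once rule can be sketched as a guard on the gate's status. `Gate` and `resolveGate` are hypothetical; the 409 for a repeated resolve comes from the text above, while the 403 for a wrong token is an assumption. In Postgres the same guard would be a conditional update that only matches a `pending` row.

```typescript
type Decision = 'approved' | 'rejected' | 'canceled'

// Hypothetical in-memory shape of a gate row.
interface Gate {
  resolveToken: string
  status: 'pending' | 'resolved'
  decision?: Decision
  actor?: string
}

// Returns an HTTP-style status code.
function resolveGate(gate: Gate, token: string, decision: Decision, actor?: string): number {
  if (token !== gate.resolveToken) return 403 // assumed status for a bad token
  if (gate.status !== 'pending') return 409   // already resolved: conflict, decision unchanged
  gate.status = 'resolved'
  gate.decision = decision
  gate.actor = actor
  return 200
}
```

A reviewer who double-clicks "approve", or a second reviewer who rejects a moment later, gets a conflict and the first decision stands.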

## Delivering the gate to a reviewer

Webhook channels push gate events to any surface you own:

```typescript
await tide.run('generate-report', {
  input: { topic: 'channels' },
  channels: [{
    type: 'webhook',
    url: 'https://your-app.example.com/api/tidebase-events',
    events: ['run.failed', 'step.failed', 'gate.created']
  }]
}, workflow)
```

The webhook payload for `gate.created` includes a `resolveUrl` and `resolveToken`, so the receiving surface can render an approve/reject UI and resolve directly:

```bash
curl -X POST "$RESOLVE_URL" \
  -H 'content-type: application/json' \
  -d '{"token":"<resolveToken>","decision":"approved","actor":"yao"}'
```

A slow or hung channel endpoint never blocks other writers to the run.

## Capability metadata

The `capability` field (name, scopes, reason) is **audit metadata only** — Tidebase records what the agent asked permission for, but does not store or broker API keys or credentials.

See also: [The replay contract](replay-contract-is-it-safe-to-rerun.md) · [Quickstart](quickstart.md) — `pnpm example:review` runs a local approval surface you can click through.

---

# Fork, time travel, and snapshot AI agent runs

Tidebase lets you rewind an agent run to an earlier point, or branch a new run from it, by modeling every state update as a version in a stream. A snapshot is just a labeled version, so time travel, forking, and restore all fall out of one small model:

```text
current state = latest version in a stream
snapshot      = labeled state version
time travel   = read an older version
fork          = create new app/run context from an older version
restore       = append a new version based on an older version
```

## Writing versions

`run.state.set()` and `run.state.patch()` update the live run state *and* append to the version history:

```typescript
await run.state.patch({ status: 'writing', progress: 0.7 })
```

Label the current state when it becomes a meaningful review or restore point:

```typescript
await run.state.save('before-approval', {
  reason: 'the user is about to approve sending'
})
```

## Snapshots of app-level targets

Snapshots are a convenience API over labeled versions for external targets — reports, artifacts, workspaces, documents, app state:

```typescript
await run.snapshots.create('draft-v1', {
  target: { type: 'report', id: reportId },
  state: draft,
  reason: 'first complete draft'
})
```

## Reading history

```text
GET /runs/:runId/state/versions             — all versions, all streams
GET /runs/:runId/state/versions?labeled=true — labeled versions (snapshots) only
GET /runs/:runId/state/versions?stream=NAME  — one stream
```

## What fork/restore means is yours to define

Tidebase stores and exposes the versions; your app decides what restoring a report or forking a workspace means for its own state targets. A forked run is a new run whose initial context comes from an older version — completed steps in the parent replay from checkpoints, so a fork doesn't re-pay for work already done.
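
The version model can be sketched in memory in a few lines. `StateStream` is a hypothetical class that mirrors the vocabulary on this page (set, patch, save, restore); it is not the SDK, and it covers a single stream only.

```typescript
interface Version<S> { version: number; state: S; label?: string }

class StateStream<S extends object> {
  private versions: Version<S>[] = []

  // Every write appends a new version; nothing is overwritten.
  set(state: S): Version<S> {
    const v: Version<S> = { version: this.versions.length + 1, state: { ...state } }
    this.versions.push(v)
    return v
  }

  patch(partial: Partial<S>): Version<S> {
    return this.set({ ...this.current().state, ...partial })
  }

  current(): Version<S> {
    const latest = this.versions[this.versions.length - 1]
    if (!latest) throw new Error('stream is empty')
    return latest
  }

  // snapshot = labeled state version
  save(label: string): void { this.current().label = label }

  // time travel = read an older version
  at(version: number): Version<S> | undefined { return this.versions[version - 1] }

  labeled(): Version<S>[] { return this.versions.filter(v => v.label !== undefined) }

  // restore = append a new version based on an older one
  restore(version: number): Version<S> {
    const old = this.at(version)
    if (!old) throw new Error(`no version ${version}`)
    return this.set(old.state)
  }
}
```

Because restore appends rather than rewinds, the history stays complete: the state that was "undone" is still readable as its own version.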

See also: [Fan out to subagents](fanout-subagents-child-runs.md) · [Checkpointing in Postgres](checkpoint-ai-agent-workflows-postgres.md)

---

# Fan out to subagents with child runs

To run subagents in parallel and rejoin their results durably, use `run.fanout()`. Each branch becomes a child run with its own checkpoints, joined results are stored in a normal checkpointed step, and a resumed parent reuses its existing children instead of spawning duplicates.

```typescript
const results = await run.fanout('research-options', [
  { name: 'flights', workflow: researchFlights, input: { destination } },
  { name: 'hotels',  workflow: researchHotels,  input: { destination } },
  { name: 'food',    workflow: researchFood,    input: { destination } }
])
```

## Semantics

- **Idempotent by edge name.** Child run creation is keyed by parent run + edge name (`flights`, `hotels`, `food`). If the parent crashes and resumes, Tidebase returns the existing child runs — no duplicate subagents, no double work, no double spend.
- **Independent checkpointing.** Each child is a full run: its own steps, state, events, and gates, visible in Studio as a parent/child tree.
- **Durable join.** The joined result is stored in a checkpointed step named `join:<fanout-name>`, so a parent resuming after the join doesn't re-gather children's results.
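
Idempotency by edge name amounts to a get-or-create keyed on parent run and edge. The sketch below uses a hypothetical in-memory `ChildRunRegistry` and made-up child ids to show the behavior; in Tidebase the parent/child edge lives in Postgres.

```typescript
// Hypothetical in-memory registry of parent/child edges.
class ChildRunRegistry {
  private children = new Map<string, string>()
  private nextId = 1

  // Returns the existing child for (parent, edge), or creates one on first call.
  getOrCreate(parentRunId: string, edgeName: string): { childRunId: string; created: boolean } {
    const key = `${parentRunId}/${edgeName}`
    const existing = this.children.get(key)
    if (existing) return { childRunId: existing, created: false }
    const childRunId = `run_child_${this.nextId++}` // illustrative id format
    this.children.set(key, childRunId)
    return { childRunId, created: true }
  }
}
```

When a parent resumes and reaches the same fanout again, each edge resolves to the child it created the first time, so only branches that never finished have work left to do.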

## Why this matters for agent architectures

Multi-agent systems fail partially: two subagents finish, one dies. Without durable fanout you either rerun everything (slow, expensive, duplicate side effects) or hand-roll per-branch bookkeeping. With parent/child run edges the partial failure is queryable — `GET /runs/:runId` returns `childRuns` with per-child status — and resuming the parent re-invokes only the unfinished branch.

See also: [Fork and time travel](fork-and-time-travel-agent-runs.md) · [How to resume a failed run](how-to-resume-a-failed-ai-agent-run.md)

---

# Track LLM token usage and cost per agent run

To track token usage and cost per agent run without routing your LLM calls through a proxy, record usage explicitly with `run.usage.record()`. Tidebase stores the records with the run, emits `usage.recorded` events, and summarizes them in Studio.

```typescript
await run.usage.record({
  kind: 'llm',
  provider: 'openai',
  model: 'gpt-4.1-mini',
  label: 'draft-response',
  inputTokens: 1200,
  outputTokens: 420,
  costUsd: 0.012
})
```

## Not just LLM tokens

The same ledger tracks any metered resource:

```typescript
await run.usage.record({
  kind: 'tool',
  provider: 'internal-search',
  quantity: 8,
  unit: 'queries',
  costUsd: 0.004
})
```

## Why explicit recording instead of a proxy?

Tidebase deliberately does not sit between you and your model provider: no LLM gateway, no latency tax, no secrets custody, no provider lock-in to a proxy's supported APIs. The tradeoff is one line of code per call you want metered. In exchange, usage lives in the same Postgres database as the run's checkpoints, steps, and gates, so "what did this failed run cost before it died?" is one query. A resumed run doesn't re-pay for checkpointed steps, which you can verify in the ledger.
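
If you fetch a run's usage records yourself, totaling them is a short reduction. `UsageRecord` below is a subset of the fields shown in the examples above and `summarizeUsage` is a hypothetical helper, not an SDK function.

```typescript
// Subset of the fields passed to run.usage.record() in the examples above.
interface UsageRecord {
  kind: string
  provider: string
  inputTokens?: number
  outputTokens?: number
  costUsd?: number
}

// Totals tokens and cost across records; missing fields count as zero.
function summarizeUsage(records: UsageRecord[]) {
  return records.reduce(
    (total, r) => ({
      inputTokens: total.inputTokens + (r.inputTokens ?? 0),
      outputTokens: total.outputTokens + (r.outputTokens ?? 0),
      costUsd: total.costUsd + (r.costUsd ?? 0)
    }),
    { inputTokens: 0, outputTokens: 0, costUsd: 0 }
  )
}
```

Costs are floating-point dollars here, matching the `costUsd` field. For billing-grade totals, consider summing in integer micro-dollars to avoid rounding drift.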

See also: [Checkpointing in Postgres](checkpoint-ai-agent-workflows-postgres.md) · [Quickstart](quickstart.md)

---

# Tidebase vs Temporal: do you need durable execution for an AI agent?

The short answer: **Temporal asks you to move execution into its worker model; Tidebase asks you to wrap functions you already have.** The adoption cost is lower and so is the guarantee. If you're already all-in on Temporal, you don't need Tidebase.

## The structural difference

**Temporal owns durable execution.** Your workflow code runs inside Temporal workers against the Temporal server (or Temporal Cloud). In exchange for adopting its programming model — deterministic workflow functions, activities, task queues, its deployment topology — you get true durable execution: a dead worker's workflow continues on another worker automatically.

**Tidebase does not execute your code.** Your app, worker, or job process keeps calling your LLMs, tools, and APIs directly. The SDK wraps your workflow function; a Postgres-backed server stores checkpoints, state, events, and gates around it. Since v0.5 Tidebase also owns re-invocation when you want it to: durable queues (pull workers or signed push webhooks), cron schedules, and a reconciler that requeues runs whose worker died — completed steps replay from checkpoints, so retries never re-pay for finished work.
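
The claim that completed steps replay from checkpoints can be illustrated with a toy model. This is not the Tidebase SDK: the in-memory `Map` stands in for the Postgres-backed server, `step` is a hypothetical synchronous stand-in for `run.step()`, and the simulated crash is contrived.

```typescript
// In-memory stand-in for the checkpoint store (assumption for illustration).
const checkpoints = new Map<string, unknown>()

// Hypothetical, synchronous analogue of run.step(): return the stored result
// if this step already completed for this run, otherwise execute and store it.
function step<T>(runId: string, name: string, fn: () => T): T {
  const key = `${runId}:${name}`
  if (checkpoints.has(key)) return checkpoints.get(key) as T // replay
  const result = fn()
  checkpoints.set(key, result) // checkpoint only after the step succeeds
  return result
}

let planCalls = 0
let crashOnce = true

function workflow(runId: string): string {
  const plan = step(runId, 'plan', () => {
    planCalls += 1
    return 'outline'
  })
  return step(runId, 'write-report', () => {
    if (crashOnce) {
      crashOnce = false
      throw new Error('worker died') // simulated failure on the first attempt
    }
    return `report from ${plan}`
  })
}

let firstAttemptFailed = false
try {
  workflow('run-1')
} catch {
  firstAttemptFailed = true
}

// Re-invoke with the same runId: 'plan' replays, 'write-report' executes.
const report = workflow('run-1')
console.log(firstAttemptFailed, planCalls, report) // true 1 report from outline
```

The guarantee shown is the smaller one: nothing here restarts the workflow on its own. Something still has to call `workflow('run-1')` again, whether that is your own infrastructure or the queues and reconciler described above.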

## Side by side

| | Tidebase | Temporal |
|---|---|---|
| Executes your code | No — external checkpoint coordination | Yes — worker runtime |
| Guarantee | Completed steps never repeat on re-invocation | Workflow continues automatically after worker death |
| Adoption cost | Wrap existing functions in `run.step()` | Restructure into deterministic workflows + activities; run workers |
| Infrastructure | Your app + one server + Postgres | Temporal cluster (or Cloud) + workers |
| Determinism constraints on your code | None | Workflow code must be deterministic |
| Live run state for product UIs | Built in (`run.state`, SSE) | Build yourself (queries/signals) |
| Human approval gates | Built in, durable, exactly-once | Build with signals |
| Agent-specific features (fanout/join of subagents, state versioning/forking, token-cost ledger) | Built in | Build yourself on primitives |
| Maturity | Self-hosted alpha, opt-in auth | Battle-tested at large scale, production-grade |

## Choose Temporal if

- You need true automatic continuation — no part of your stack can be responsible for re-invoking work.
- You have long-lived, complex orchestrations (days/weeks, timers, sagas) beyond agent pipelines.
- You need production-grade multi-tenancy, auth, and operational tooling today.

## Choose Tidebase if

- You have an existing app or worker and want crash-safe resume without restructuring it around a workflow engine.
- Your pain is agent-shaped: "is it safe to rerun?", progress streaming to a product UI, human approval gates, subagent fanout, per-run token costs.
- You want everything in your own Postgres, self-hosted, Apache-2.0.

Repo: <https://github.com/BlueprintLabIO/tidebase> · See also: [Tidebase vs Inngest](tidebase-vs-inngest.md) · [Tidebase vs DBOS and Restate](tidebase-vs-dbos-restate.md)

---

# Tidebase vs Inngest

The short answer: **Inngest is an event-driven durable execution platform that invokes your functions for you; Tidebase is a checkpoint layer around functions your own infrastructure invokes.** They look similar at the API level (`step.run()` vs `run.step()`) but sit in different positions in your stack.

## The structural difference

**Inngest owns invocation.** You define functions triggered by events; the Inngest platform (cloud or dev server) calls your endpoints, drives step execution, handles retries, throttling, and flow control. Your functions are invoked *by* Inngest.

**Tidebase owns the record and the trigger — never the runtime.** Since v0.5 it ships durable queues and cron: pull workers claim jobs, or Tidebase pushes signed invocation webhooks to your app. Your code still runs in your own processes with your own secrets. The remaining honest difference: Inngest hosts the orchestration platform; Tidebase is a single self-hosted server on your Postgres.

## Side by side

| | Tidebase | Inngest |
|---|---|---|
| Who invokes your function | Tidebase queues/cron (pull or signed push), or your own infra | Inngest platform, on events |
| Retries, throttling, concurrency control | Built in (backoff, dedupe, per-queue caps) | Built into the platform |
| Data location | Your Postgres, self-hosted | Inngest cloud (self-host option exists) |
| Live run state API for product UIs | Built in, with SSE + versioned history | Not a core primitive |
| Human approval gates | Built in, durable, exactly-once, webhook-deliverable | `waitForEvent` pattern |
| Subagent fanout with idempotent child runs | Built in | Compose with function invocation |
| Per-run token/cost ledger | Built in | Build yourself |
| Maturity | Self-hosted alpha, opt-in auth | Production-grade, hosted |
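
To make the pull-worker model and the "dedupe, per-queue caps" row more concrete, here is a toy in-memory queue. It is a sketch of the general technique only: `SketchQueue`, its method names, and the dedupe-key semantics are assumptions, not the Tidebase queue API.

```typescript
// In-memory illustration only; all names are hypothetical.
interface Job {
  id: string
  dedupeKey: string
  claimedBy?: string
  done: boolean
}

class SketchQueue {
  private jobs: Job[] = []
  private readonly maxInFlight: number

  constructor(maxInFlight: number) {
    this.maxInFlight = maxInFlight
  }

  // Enqueue unless an unfinished job with the same dedupe key already exists.
  enqueue(id: string, dedupeKey: string): boolean {
    if (this.jobs.some((j) => j.dedupeKey === dedupeKey && !j.done)) return false
    this.jobs.push({ id, dedupeKey, done: false })
    return true
  }

  // A pull worker claims the next unclaimed job, respecting the queue's cap.
  claim(workerId: string): Job | undefined {
    const inFlight = this.jobs.filter((j) => j.claimedBy !== undefined && !j.done).length
    if (inFlight >= this.maxInFlight) return undefined
    const next = this.jobs.find((j) => j.claimedBy === undefined)
    if (next) next.claimedBy = workerId
    return next
  }

  complete(id: string): void {
    const job = this.jobs.find((j) => j.id === id)
    if (job) job.done = true
  }
}

const q = new SketchQueue(1) // at most one job in flight
const first = q.enqueue('job-1', 'report:42')
const duplicate = q.enqueue('job-2', 'report:42') // same key, still pending
q.enqueue('job-3', 'report:43')

const a = q.claim('worker-a')
const b = q.claim('worker-b') // cap reached, nothing to claim
q.complete('job-1')
const c = q.claim('worker-b')

console.log(first, duplicate, a?.id, b, c?.id) // true false job-1 undefined job-3
```

A real implementation would do the claim atomically in the database so that two workers cannot take the same job; the single-threaded sketch sidesteps that.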

## Choose Inngest if

- You want a platform to own queueing, retries, and flow control so you don't run that infrastructure.
- Your workflows are naturally event-driven and you're comfortable with the platform invoking your endpoints.

## Choose Tidebase if

- You want queues/cron/durability in one self-hosted server on your Postgres, without your code running inside someone else's platform.
- You want run state, approval gates, and checkpoints in *your* Postgres next to your app's data.
- Your workloads are AI-agent-shaped and you want fanout, forking, and cost tracking as first-class primitives.

Repo: <https://github.com/BlueprintLabIO/tidebase> · See also: [Tidebase vs Temporal](tidebase-vs-temporal.md)

---

# Tidebase vs LangGraph checkpointers

The short answer: **LangGraph checkpointers persist the state of a LangGraph graph; Tidebase checkpoints any code, in any framework or none.** If you're all-in on LangGraph, its Postgres checkpointer is the native choice. If your agent is plain TypeScript — or a mix of frameworks — Tidebase gives you checkpoint/resume without adopting a graph abstraction.

## The structural difference

**LangGraph checkpointers are a persistence backend for LangGraph.** You model your agent as a graph of nodes; the checkpointer (e.g. Postgres-backed) saves channel state at each super-step, enabling resume, time travel, and human-in-the-loop interrupts — *within the graph runtime*.

**Tidebase is framework-agnostic.** You keep ordinary async functions and wrap meaningful units in `run.step()`. No graph model, no framework migration. It also covers operational ground a graph checkpointer doesn't try to: run/step leases (two workers can't grab the same run; zombie workers are fenced), resume contracts with failure classification (`manual_review` for unkeyed side effects), durable approval gates deliverable over webhooks, parent/child run trees, a live-state SSE API for your product UI, an append-only event log, a usage/cost ledger, and a Studio dashboard.
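
The lease and fencing idea mentioned above can be sketched with fencing tokens, a common technique for this problem. Whether Tidebase implements leases this way is not stated here; `LeaseTable`, its methods, and the explicit `now` parameter are assumptions chosen to keep the example deterministic.

```typescript
// Toy lease table with fencing tokens; all names are hypothetical.
interface Lease {
  owner: string
  token: number
  expiresAt: number
}

class LeaseTable {
  private leases = new Map<string, Lease>()
  private nextToken = 1

  // Grant the lease only if nobody holds an unexpired one for this run.
  acquire(runId: string, owner: string, now: number, ttlMs: number): Lease | undefined {
    const current = this.leases.get(runId)
    if (current && current.expiresAt > now) return undefined
    const lease: Lease = { owner, token: this.nextToken++, expiresAt: now + ttlMs }
    this.leases.set(runId, lease)
    return lease
  }

  // A write is accepted only if it carries the newest token for the run,
  // so a worker holding a superseded lease (a "zombie") is rejected.
  canWrite(runId: string, token: number): boolean {
    return this.leases.get(runId)?.token === token
  }
}

const table = new LeaseTable()
const leaseA = table.acquire('run-1', 'worker-a', 0, 1000)
const leaseB = table.acquire('run-1', 'worker-b', 500, 1000) // still held by worker-a
const leaseC = table.acquire('run-1', 'worker-b', 1500, 1000) // worker-a's lease expired

console.log(
  leaseB === undefined,
  table.canWrite('run-1', leaseA?.token ?? -1), // stale token from worker-a
  table.canWrite('run-1', leaseC?.token ?? -1),
) // true false true
```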

## Side by side

| | Tidebase | LangGraph checkpointer |
|---|---|---|
| Requires a framework | No — plain functions | Yes — LangGraph graphs |
| Language | TypeScript + Python SDKs | Python + JS |
| Resume granularity | Named steps with input hashes | Graph super-steps |
| Replay safety analysis | Resume contracts: side effects, idempotency keys, failure classification | Up to your node code |
| Worker leases / zombie fencing | Built in | Out of scope |
| Approval gates | Durable, exactly-once, webhook channels | `interrupt()` within the graph |
| Product-UI live state | SSE + versioned state streams | Read checkpointed state yourself |
| Dashboard | Studio included | LangSmith (separate, hosted) |
| Maturity | Self-hosted alpha, opt-in auth | Mature within LangChain ecosystem |

## Choose LangGraph checkpointers if

- Your agent is already a LangGraph graph — native persistence beats an external layer.

## Choose Tidebase if

- Your agent is plain code (or spans frameworks) and you don't want to restructure it into a graph to get checkpointing.
- You need the operational layer around the checkpoint: leases, replay classification, gates your product UI can render, run trees, cost tracking.

Repo: <https://github.com/BlueprintLabIO/tidebase> · See also: [The replay contract](../replay-contract-is-it-safe-to-rerun.md)

---

# Tidebase vs DBOS and Restate

The short answer: **DBOS and Restate own how your functions run (DBOS as a library inside your app, Restate as an engine that drives your handlers); Tidebase stays outside your execution path entirely.** It records checkpoints and state around code that keeps running in your own processes.

## The structural difference

**DBOS** is a library that makes your application itself durable: decorated workflow functions checkpoint into Postgres, and the DBOS runtime inside your app handles recovery and resumption of pending workflows on restart.

**Restate** is a durable execution engine: a log-centric server that drives your handlers, journals every step, and replays them deterministically after failures — it owns invocation and retries.

**Tidebase** owns the record and (since v0.5) the trigger, but never the runtime. Your code calls out to a Postgres-backed server that records step completions, state versions, gates, and events — and its queues, cron schedules, and reconciler re-invoke your workflows after crashes (pull claims or signed push webhooks). Completed steps replay from checkpoints; the runtime, secrets, and deploy model stay entirely yours.

All three share the Postgres-respecting, self-hostable ethos. The difference is how much of your architecture they ask to own.
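
The reconciler described above requeues runs whose worker died. A minimal sketch of that loop follows, assuming worker death is detected through an expired lease timestamp; the `RunRecord` shape, the status values, and `reconcile` are hypothetical and do not describe Tidebase's actual tables.

```typescript
// Hypothetical run record; field names and statuses are assumptions.
type RunStatus = 'queued' | 'running' | 'completed'

interface RunRecord {
  runId: string
  status: RunStatus
  leaseExpiresAt: number
}

// Requeue every run that is marked running but whose lease has lapsed.
// Returns the ids that were requeued.
function reconcile(runs: RunRecord[], now: number): string[] {
  const requeued: string[] = []
  for (const run of runs) {
    if (run.status === 'running' && run.leaseExpiresAt <= now) {
      run.status = 'queued'
      requeued.push(run.runId)
    }
  }
  return requeued
}

const runs: RunRecord[] = [
  { runId: 'r1', status: 'running', leaseExpiresAt: 100 }, // worker stopped renewing
  { runId: 'r2', status: 'running', leaseExpiresAt: 900 }, // healthy
  { runId: 'r3', status: 'completed', leaseExpiresAt: 50 }, // finished, ignored
]

console.log(reconcile(runs, 500)) // [ 'r1' ]
```

Because completed steps replay from checkpoints, requeueing `r1` in this model costs only the work that had not finished.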

## Side by side

| | Tidebase | DBOS | Restate |
|---|---|---|---|
| Where durability lives | External server + your Postgres | Library inside your app + Postgres | Restate engine (log) driving your handlers |
| Owns invocation/recovery | Optional (since v0.5): queues, cron, and a reconciler re-invoke runs, or you keep your own infra | Yes — runtime resumes pending workflows | Yes — engine re-drives handlers |
| Determinism requirements | None (steps checkpoint results) | Workflow functions deterministic between steps | Handlers deterministic for replay |
| Agent primitives (gates, fanout run trees, state forking, cost ledger, Studio UI) | Built in | Build on primitives | Build on primitives |
| Live state API for product UIs | Built in (SSE + versions) | Build yourself | Build yourself |
| Maturity | Self-hosted alpha, opt-in auth | Production-grade | Production-grade |

## Choose DBOS or Restate if

- You want automatic recovery with no part of your stack responsible for re-invocation.
- You're building general backend workflows where deterministic replay is a fine constraint.

## Choose Tidebase if

- You want zero change to how your code is invoked and deployed — wrap functions, keep your queue/cron.
- Your workloads are agent-shaped and you want approval gates, subagent fanout, state forking, and token-cost tracking out of the box, with a dashboard.
- You want the record of every run in your own Postgres with a UI your team (and your product) can read.

Repo: <https://github.com/BlueprintLabIO/tidebase> · See also: [Tidebase vs Temporal](tidebase-vs-temporal.md) · [Tidebase vs Inngest](tidebase-vs-inngest.md)