# Tidebase — full documentation for LLMs

> Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.

---

# Tidebase Documentation

Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.

It is **not** a workflow engine, queue, LLM proxy, or hosted runtime. Your code keeps running in your own app, worker, or job process. Tidebase stores checkpoints, live state, events, approval gates, and usage records around it, so the question "this run died at step 7 — is it safe to rerun?" has a reliable answer.

```typescript
import { Tidebase } from '@tidebase/sdk'

const tide = new Tidebase()

await tide.run('generate-report', { runId }, async (run, input) => {
  const plan = await run.step('plan', () => makePlan(input))
  await run.state.set({ status: 'writing', progress: 0.7 })
  return run.step('write-report', () => writeReport(plan))
})
```

Re-invoke the same workflow with the same `runId` after a crash: completed steps return their checkpointed results instantly, and execution continues at the first incomplete step.

## Guides

- [Quickstart](quickstart.md) — running in one command sequence, agent-executable
- [How to resume a failed AI agent run](how-to-resume-a-failed-ai-agent-run.md)
- [How to checkpoint AI agent workflows in Postgres](checkpoint-ai-agent-workflows-postgres.md)
- [Is it safe to rerun?
The replay contract](replay-contract-is-it-safe-to-rerun.md)
- [Human approval gates for AI agents](human-approval-gates-for-ai-agents.md)
- [Fork, time travel, and snapshot agent runs](fork-and-time-travel-agent-runs.md)
- [Queues, cron schedules, and cancellation](queues-schedules-and-cancellation.md)
- [Fan out to subagents with child runs](fanout-subagents-child-runs.md)
- [Track LLM token usage and cost per run](track-llm-token-costs-per-run.md)

## Integrations

- [How to checkpoint Vercel AI SDK agents](integrate/vercel-ai-sdk.md)
- [How to checkpoint Claude Agent SDK sessions](integrate/claude-agent-sdk.md)
- [How to add durable checkpoints to Mastra agents](integrate/mastra.md)
- [Approval gates and a durable run record for any MCP agent](integrate/mcp-agents.md)
- [How to checkpoint OpenAI Agents SDK runs](integrate/openai-agents-sdk.md)
- [How to use Tidebase with LangGraph](integrate/langgraph.md)
- [How to checkpoint CrewAI crews](integrate/crewai.md)
- [How to checkpoint Pydantic AI agents](integrate/pydantic-ai.md)
- [How to run durable AI workflows behind a Next.js route](integrate/nextjs.md)
- [How to wire Tidebase into an Express app](integrate/express.md)
- [How to wire Tidebase into a Hono app](integrate/hono.md)
- [How to wire Tidebase into a FastAPI app](integrate/fastapi.md)
- [How to wire Tidebase into a SvelteKit app](integrate/sveltekit.md)

## Comparisons

- [Tidebase vs Temporal](compare/tidebase-vs-temporal.md)
- [Tidebase vs Inngest](compare/tidebase-vs-inngest.md)
- [Tidebase vs LangGraph checkpointers](compare/tidebase-vs-langgraph-checkpointer.md)
- [Tidebase vs DBOS and Restate](compare/tidebase-vs-dbos-restate.md)
- [Tidebase vs Trigger.dev](compare/tidebase-vs-trigger-dev.md)
- [Tidebase vs Hatchet](compare/tidebase-vs-hatchet.md)
- [Tidebase vs BullMQ](compare/tidebase-vs-bullmq.md)
- [Tidebase vs Cloudflare Workflows](compare/tidebase-vs-cloudflare-workflows.md)

## For AI assistants and agents

- [/llms.txt](../llms.txt) — index of
these docs for retrieval
- [/llms-full.txt](../llms-full.txt) — all docs in one file
- [MCP server](../mcp-server/README.md) — inspect runs, resolve gates, and debug replay from your assistant
- [Tidebase Agent Skill](../skills/tidebase/SKILL.md) — teaches coding agents when and how to use Tidebase

## Status

Tidebase is an open-source (Apache-2.0), self-hosted alpha: Postgres-backed server, TypeScript and Python SDKs, and a Studio dashboard. API auth is opt-in via a shared `TIDEBASE_API_KEY`; without it, run in trusted local/self-hosted environments only.

Repo:

---

# Quickstart

Get Tidebase running and watch a crashed agent run resume from its last checkpoint, in about two minutes. Every command below is non-interactive and safe to run in a fresh clone — if you are an AI agent, you can execute this page top to bottom.

## Prerequisites

- Docker (for Postgres)
- Node.js 20+ and pnpm

## 1. Clone and start

```bash
git clone https://github.com/BlueprintLabIO/tidebase
cd tidebase
docker compose up -d postgres
pnpm install
pnpm dev
```

This starts:

- **Server** at `http://localhost:7373` (REST + SSE API; auto-runs the SQL schema on first boot)
- **Studio** at `http://localhost:5173` (dashboard showing every run, step, gate, and event)

## 2. Run the example workflow

In a second terminal:

```bash
pnpm example
```

The example runs three checkpointed steps: `plan`, `fetch-sources`, and `write-report`.

## 3. Crash it, then resume it

Force a failure after two completed checkpoints:

```bash
FAIL_WRITE=1 pnpm example
```

Copy the run id from Studio (or `GET http://localhost:7373/runs`), then re-invoke with the same id:

```bash
TIDEBASE_RUN_ID=run_xxx pnpm example
```

`plan` and `fetch-sources` return instantly from their checkpoints. Only `write-report` executes again. That is the core guarantee: **completed steps never repeat.**

## 4. Verify

```bash
pnpm test
```

The suite (57 tests against real Postgres) asserts the durability invariants directly: checkpoint replay, lease mutual exclusion, gate exactly-once resolution, fenced zombie writers, and gap-free event ordering.

## 5. Use it in your own app

```typescript
import { Tidebase } from '@tidebase/sdk'

const tide = new Tidebase() // reads TIDEBASE_URL, defaults to http://localhost:7373

await tide.run('my-workflow', { runId }, async (run, input) => {
  const a = await run.step('step-a', () => doA(input))
  const b = await run.step('step-b', () => doB(a))
  return b
})
```

Tidebase never executes your code — but since v0.5 it can re-invoke it for you: [durable queues, cron schedules, and automatic requeue on worker death](queues-schedules-and-cancellation.md). Or keep your own queue/cron and use the [signed recovery webhook](how-to-resume-a-failed-ai-agent-run.md#recovery-webhooks).

Next: [How to resume a failed AI agent run](how-to-resume-a-failed-ai-agent-run.md) · [The replay contract](replay-contract-is-it-safe-to-rerun.md)

---

# Frequently asked questions

Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime. These are the questions developers actually ask.

## Is it safe to rerun a failed AI agent run?

With Tidebase, yes — if every step that performs external writes declares an idempotency key, and Tidebase makes that condition explicit instead of leaving it to tribal knowledge. Re-invoking a run with the same runId replays completed steps from their checkpoints (they never re-execute), and a failed step is classified by its resume contract: read-only and idempotency-keyed steps are safe to replay automatically, while unkeyed external writes park in `manual_review` so a retry can't double-charge a customer.

## How do I resume an AI agent workflow from where it crashed?


Re-invoke the same workflow function with the same `runId`. Completed steps return their checkpointed results from Postgres instantly; execution continues at the first incomplete step. Since v0.5 Tidebase triggers the re-invocation itself for queued runs (retries with backoff, requeue on worker death) and can push signed invocation webhooks to your app — or you keep your own queue/cron.

## Can Tidebase replace my job queue or cron?

For agent workloads, yes: v0.5 ships durable queues (dedupe keys, delays, priorities, retries with backoff, per-queue concurrency and rate caps) and five-field UTC cron schedules, with pull-mode workers (`tide.work()`) or signed push webhooks. A queued job is a run — one lifecycle authority, no status drift. Double-fires on schedules are structurally impossible via fire-time dedupe keys.

## How do I cancel a running agent?

Call `tide.runs.cancel(runId, { reason, actor })`. Cancellation is authoritative and one-way: in-flight workers observe it at their next step or gate boundary, a gate-blocked worker unwinds immediately, and `complete`/`fail` after cancel are refused. Deadlines cancel automatically.

## Does Tidebase execute my code?

No, deliberately. Your app, worker, or job process keeps calling your LLMs, tools, and APIs directly. The SDK wraps your workflow function, and a Postgres-backed server stores checkpoints, live state, events, and approval gates around it. You get "completed steps never repeat," not "your dead process magically restarts."

## Do I need Temporal for an AI agent?

If you're already all-in on Temporal, use it — it gives true durable execution. But Temporal asks you to move execution into its worker model: deterministic workflow functions, activities, task queues, and a cluster to operate. Tidebase asks you to wrap functions you already have — a smaller guarantee for a much smaller adoption cost, with agent-specific primitives (gates, fanout, state forking, token-cost tracking) built in.

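
To make "wrap functions you already have" concrete, here is a toy in-memory model of the checkpoint idea: a step's result is stored under its run id and step name, and a re-invocation returns the stored result instead of calling the function again. It is a sketch for intuition only. `CheckpointStore`, `makeRun`, and the synchronous `step` are hypothetical names, not the Tidebase SDK, which is asynchronous and persists to Postgres with leases and input hashes.

```typescript
// Toy model only: an in-memory stand-in for a checkpoint store.
// None of these names come from the Tidebase SDK.
type CheckpointStore = Map<string, unknown>

function makeRun(store: CheckpointStore, runId: string) {
  return {
    // Run `fn` once per (runId, step name); later calls replay the stored result.
    step<T>(name: string, fn: () => T): T {
      const key = `${runId}:${name}`
      if (store.has(key)) return store.get(key) as T
      const result = fn()
      store.set(key, result) // checkpoint only after the step completes
      return result
    },
  }
}

const store: CheckpointStore = new Map()
const calls: string[] = [] // records which step bodies actually executed
let failWrite = true

// A two-step workflow whose last step fails on the first attempt.
function generateReport(runId: string): string {
  const run = makeRun(store, runId)
  const plan = run.step('plan', () => {
    calls.push('plan')
    return 'outline'
  })
  return run.step('write-report', () => {
    calls.push('write-report')
    if (failWrite) throw new Error('simulated crash')
    return `report from ${plan}`
  })
}

let firstAttemptFailed = false
try {
  generateReport('run_1')
} catch {
  firstAttemptFailed = true
}

failWrite = false
// Same run id: 'plan' replays from the store, only 'write-report' runs again.
const report = generateReport('run_1')
```

In this model the `plan` body executes once across both attempts, while `write-report` executes twice because its first attempt never reached the checkpoint write.
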

## How is Tidebase different from LangGraph checkpointers?

LangGraph checkpointers persist the state of a LangGraph graph; Tidebase checkpoints any code, in any framework or none. It also covers operational ground a graph checkpointer doesn't try to: run/step leases with zombie-worker fencing, replay-safety classification, durable approval gates deliverable over webhooks, parent/child run trees, a live-state SSE API, and a Studio dashboard.

## How do I pause an AI agent for human approval?

Call `run.gate('approve-send', { prompt: 'Send it?' })`. The workflow blocks on a durable gate stored in Postgres; it can be approved or rejected from Studio, your own product UI, or any webhook surface. Gates resolve exactly once — a second resolve attempt gets a conflict, and an already-resolved gate replays its decision on resume.

## Can two workers pick up the same run?

No. Run and step leases are mutually exclusive and fenced, so a zombie worker that wakes up late cannot write back stale results. This is asserted by concurrency-probe tests against real Postgres.

## Where does my data live?

In your own Postgres. Tidebase is self-hosted (Docker + Postgres), Apache-2.0 licensed, and never proxies your LLM calls or stores your API keys.

## Is Tidebase production-ready?

Not yet — it's an alpha. API auth is opt-in (set `TIDEBASE_API_KEY` on server and SDK); without it, run only in trusted local/self-hosted environments. The durability invariants (checkpoint replay, lease fencing, exactly-once gates) are enforced by a 57-test suite against real Postgres, but treat it as ready for local demos and early feedback, not production traffic.

## What languages does the SDK support?

TypeScript (`@tidebase/sdk`) and Python (`tidebase` on PyPI, zero dependencies). The server is plain HTTP + SSE, so any other language can integrate directly against the API.

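
Because the API is plain HTTP + SSE, a client in another language mostly needs an HTTP library and a parser for standard Server-Sent Events frames. The sketch below parses the text of an SSE stream into events. The `event:`/`data:`/`id:` frame format is the standard one, but the event name and JSON payload in the sample are invented for illustration, since the actual Tidebase event schema is not shown here.

```typescript
// Minimal parser for the standard SSE wire format (blank-line-delimited frames).
// The event names and payloads Tidebase actually emits are not assumed here.
interface SseEvent {
  event: string // "message" when the frame has no `event:` field
  data: string
  id: string | undefined
}

function parseSse(text: string): SseEvent[] {
  const events: SseEvent[] = []
  // Normalize CRLF, then split into frames on blank lines.
  for (const frame of text.replace(/\r\n/g, '\n').split('\n\n')) {
    let event = 'message'
    let id: string | undefined
    const dataLines: string[] = []
    for (const line of frame.split('\n')) {
      if (line === '' || line.startsWith(':')) continue // blank or comment line
      const colon = line.indexOf(':')
      const field = colon === -1 ? line : line.slice(0, colon)
      let value = colon === -1 ? '' : line.slice(colon + 1)
      if (value.startsWith(' ')) value = value.slice(1) // strip one leading space
      if (field === 'event') event = value
      else if (field === 'data') dataLines.push(value)
      else if (field === 'id') id = value
    }
    // Frames without any data field are not dispatched.
    if (dataLines.length > 0) events.push({ event, data: dataLines.join('\n'), id })
  }
  return events
}

// Hypothetical frames, e.g. as read from GET /runs/:runId/events.
const sample =
  'event: state\ndata: {"status":"writing","progress":0.7}\n\n' +
  ': keep-alive\n\n' +
  'data: plain message\n\n'

const parsed = parseSse(sample)
```

A real client would feed this parser incrementally from a streaming HTTP response and buffer partial frames; the sketch assumes the full text is already in memory.
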

---

# How to resume a failed AI agent run

To resume a failed agent run with Tidebase, re-invoke the same workflow function with the same `runId`. Completed steps return their checkpointed results from Postgres without re-executing; the workflow continues at the first incomplete step.

```typescript
import { Tidebase } from '@tidebase/sdk'

const tide = new Tidebase()

// First invocation: dies after fetch-sources
// Second invocation with the same runId: plan and fetch-sources
// replay from checkpoints, write-report runs for the first time.
await tide.run('generate-report', { runId }, async (run, input) => {
  const plan = await run.step('plan', () => makePlan(input))
  const sources = await run.step('fetch-sources', () => fetchSources(plan))
  return run.step('write-report', () => writeReport(sources))
})
```

Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.

## What replay guarantees

- **Completed steps never repeat.** Their results are returned from the checkpoint store, including across process crashes and machine restarts.
- **Two workers can't grab the same run.** Run and step leases are mutually exclusive and fenced, so a zombie worker that wakes up late cannot write back stale results.
- **Stale checkpoints are rejected.** Each step records an input hash; if the input changed since the checkpoint was written, replay fails loudly instead of silently reusing a wrong result.

## Who re-invokes the workflow?

Tidebase never executes your code, but since v0.5 it owns *triggering*. After a failure, re-invocation can come from:

1. **Tidebase queues** — a failed queue run with attempts remaining goes back to `queued` with backoff and is re-claimed by `tide.work()` (or re-dispatched over a signed push webhook). Worker death is handled the same way: the lease expires and the reconciler requeues the run.
2. **A recovery webhook** — see below; the reconciler fires it automatically for stalled runs.
3. **Your own queue, cron, or a retry button** in your product UI or Studio — re-invocation by `runId` is always safe.

You get "completed steps never repeat" in every one of these paths.

## Recovery webhooks

Tidebase can call back into your app when a run fails:

```typescript
const run = await tide.runs.create('generate-report', {
  input: { topic: 'checkpoints' },
  recoveryWebhook: 'https://your-app.example.com/api/tidebase'
})
```

When the run fails, Tidebase POSTs a `run.resume` payload to that URL and records every attempt (delivery status, HTTP status, response body). If `TIDEBASE_WEBHOOK_SECRET` is set on both server and SDK, payloads are HMAC-signed with `x-tidebase-signature`, and the SDK rejects unsigned or tampered payloads.

## When a step itself failed (not the process)

A step that threw is classified by its [replay contract](replay-contract-is-it-safe-to-rerun.md):

- `failed_retryable` — SDK retries remain; safe to re-invoke.
- `manual_review` — the step has external side effects without an idempotency key, or declared manual replay. A human decides.
- `failed` — hard failure.

This classification is the difference between Tidebase and a hand-rolled `status` column: the resume decision is explicit and stored, not buried in logs.

See also: [Quickstart](quickstart.md) · [Tidebase vs Temporal](compare/tidebase-vs-temporal.md)

---

# How to checkpoint AI agent workflows in Postgres

To checkpoint an AI agent workflow in Postgres with Tidebase, wrap each meaningful unit of work in a named `run.step()`. Each step's result is stored as a checkpoint in your own Postgres database the moment it completes, and replayed from storage if the run is re-invoked.


```typescript
import { Tidebase } from '@tidebase/sdk'

const tide = new Tidebase()

await tide.run('research-agent', { runId }, async (run, input) => {
  const plan = await run.step('plan', () => llm.plan(input.question))
  const docs = await run.step('search', () => searchTools.gather(plan))
  const draft = await run.step('draft', () => llm.write(docs))
  await run.state.set({ status: 'reviewing', progress: 0.9 })
  return run.step('finalize', () => llm.polish(draft))
})
```

Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.

## Why not hand-roll a status column?

Every multi-step agent grows the same plumbing: a `status` column, checkpoint JSON blobs, retry flags, progress streaming, and a prayer that rerunning a failed run doesn't double-charge a customer. The plumbing is easy to write; these parts are hard to get right under contention:

- **Leases.** Two workers picking up the same run must be mutually exclusive, and a zombie worker must be fenced from writing stale results.
- **Input-hash checking.** A checkpoint written for different inputs must not be silently reused.
- **Ordered, gap-free event logs** under concurrent writers.
- **Exactly-once approval gates** that survive a resume.

Tidebase's test suite asserts these invariants against real Postgres with concurrency probes — they are the product.

## What gets stored

All in Postgres, all self-hosted: runs and attempts, named step checkpoints, input hashes, resume contracts, live run state, versioned state streams, labeled snapshots, parent/child run edges, append-only events, gates and decisions, recovery attempts, and usage records.

## Granularity: what should be a step?

A step is a unit you'd be happy to *not repeat*: an LLM call, a tool call batch, an external write. Cheap pure computation doesn't need a step.
External writes should declare an [idempotency key in their resume contract](replay-contract-is-it-safe-to-rerun.md) so a retry can't double-fire them.

## Live state for your UI

`run.state.set()` / `run.state.patch()` update live run state your product UI can subscribe to via SSE (`GET /runs/:runId/events`) — progress bars and status text without custom socket plumbing. Each update also appends to a versioned history, which is what makes [snapshots, forking, and time travel](fork-and-time-travel-agent-runs.md) fall out of the same model.

See also: [Quickstart](quickstart.md) · [How to resume a failed run](how-to-resume-a-failed-ai-agent-run.md)

---

# "This run died at step 7 — is it safe to rerun?" The replay contract

Yes, it is safe to rerun a failed Tidebase run **if** every step that performs external writes declares an idempotency key — and Tidebase makes that condition explicit instead of leaving it to tribal knowledge. Each step records a *resume contract*: what side effects it has, how it can be replayed, and what its checkpoint guarantees.

```typescript
await run.step(
  'send-email',
  {
    input: { userId },
    sideEffects: ['email.send'],
    idempotencyKey: `welcome:${userId}`,
    replay: 'auto',
    checkpointInvariant: 'provider accepted the message id',
    verifiedBy: 'email provider response'
  },
  () => sendWelcomeEmail(userId)
)
```

Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.

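
Why the `idempotencyKey` in the contract above matters can be shown with a toy external system. The sketch below uses a hypothetical in-memory `FakeEmailProvider` that deduplicates sends by key, so replaying the step after a crash does not deliver a second email. It models the external system's behavior only; it is not Tidebase code, and a real provider's idempotency API will differ.

```typescript
// Hypothetical stand-in for an external email API that honors idempotency keys.
class FakeEmailProvider {
  private accepted = new Map<string, string>() // idempotency key -> message id
  outbox: string[] = [] // recipients of emails that were actually sent

  send(to: string, idempotencyKey: string): string {
    const existing = this.accepted.get(idempotencyKey)
    if (existing !== undefined) return existing // duplicate request: no second send
    this.outbox.push(to)
    const messageId = `msg_${this.outbox.length}`
    this.accepted.set(idempotencyKey, messageId)
    return messageId
  }
}

const provider = new FakeEmailProvider()
const userId = 'user_42'
const key = `welcome:${userId}`

// First attempt: the provider accepts the email, then the process dies
// before the step result is checkpointed.
const firstId = provider.send(userId, key)

// Replay of the same step with the same key: the provider returns the
// original message id instead of sending again.
const replayId = provider.send(userId, key)
```

Without the key, the second call would be indistinguishable from a new request, which is the situation the `manual_review` classification exists to catch.
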

## How failures are classified

When a step exhausts its retries, Tidebase classifies the failure using the contract:

| Classification | Meaning | When |
|---|---|---|
| `failed_retryable` | Safe to re-invoke; SDK retries remain | transient failures |
| `manual_review` | A human must decide before rerun | the step has side effects but **no idempotency key**, or declared `replay: 'manual'` |
| `failed` | Hard failure | non-retryable errors |

Read-only and idempotency-keyed steps are `safe_replay`; unkeyed external writes park in `manual_review` instead of silently double-firing on the next retry. This is the answer to "will rerunning double-charge the customer?" — if it could, the run *stops and tells you* rather than guessing.

## What this is honestly not

Resume contracts do **not** make external systems exactly-once — only your idempotency keys (enforced by the external system) can do that. What Tidebase changes is that the resume decision is explicit, stored with the step, and visible in Studio, instead of hidden in logs and custom retry flags.

## Guarantees enforced underneath

- Completed steps replay from storage and never re-execute, including across crash + recovery-webhook resume.
- Step and run leases are mutually exclusive and fenced — zombie workers cannot write back stale results.
- Input-hash drift on replay is rejected before it can corrupt a run.
- Gates resolve exactly once and replay their decision on resume.

Each of these is asserted by an invariant test against real Postgres, including concurrency probes.

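
The classification table earlier on this page can be read as a small decision function. The sketch below is one plausible reading of that table under stated assumptions: `StepContract`, `FailureInfo`, and `classifyFailure` are hypothetical names, the precedence of `manual_review` over the other outcomes is a guess, and the real server may weigh these inputs differently or use additional ones.

```typescript
type Classification = 'failed_retryable' | 'manual_review' | 'failed'

// Hypothetical shape mirroring the contract fields shown in the step example.
interface StepContract {
  sideEffects: string[] // e.g. ['email.send']; empty means read-only
  idempotencyKey?: string
  replay?: 'auto' | 'manual'
}

// Assumption: the caller already knows whether the error was transient.
interface FailureInfo {
  retryable: boolean
  retriesRemaining: number
}

function classifyFailure(contract: StepContract, failure: FailureInfo): Classification {
  const unkeyedWrite = contract.sideEffects.length > 0 && !contract.idempotencyKey
  // Unkeyed external writes and manual-replay steps wait for a human.
  if (unkeyedWrite || contract.replay === 'manual') return 'manual_review'
  if (failure.retryable && failure.retriesRemaining > 0) return 'failed_retryable'
  return 'failed'
}

const keyedEmail = classifyFailure(
  { sideEffects: ['email.send'], idempotencyKey: 'welcome:user_42', replay: 'auto' },
  { retryable: true, retriesRemaining: 2 }
)
const unkeyedCharge = classifyFailure(
  { sideEffects: ['payment.charge'] },
  { retryable: true, retriesRemaining: 2 }
)
const readOnlyHardFailure = classifyFailure(
  { sideEffects: [] },
  { retryable: false, retriesRemaining: 0 }
)
```

Under these assumptions the keyed email step is retryable, the unkeyed charge parks for review even though the error was transient, and the read-only step with a non-retryable error is a hard failure.
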

See also: [How to resume a failed run](how-to-resume-a-failed-ai-agent-run.md) · [Human approval gates](human-approval-gates-for-ai-agents.md)

---

# Queues, cron schedules, and cancellation for AI agents

To run agent workflows on a durable queue with Tidebase, enqueue them — dedupe keys, delays, retries with backoff, and per-queue concurrency caps are built in, and a queued job IS a run, so its status lives in the same authoritative lifecycle as everything else.

```typescript
// enqueue: at most one active run per dedupe key, 3 attempts with backoff
await tide.enqueue('generate-report', {
  queue: 'reports',
  input: { topic },
  dedupeKey: `report:${topic}`,
  maxAttempts: 3,
  deadlineMs: 600_000
})

// pull-mode worker: claims ready runs, executes registered workflows
tide.workflow('generate-report', generateReport)
await tide.work({ queues: ['reports'] })
```

Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime. Since v0.5 it can also decide **when** your code runs — it still never executes it.

## Two dispatch modes

- **Pull** — `tide.work()` claims ready runs with `SKIP LOCKED` semantics: two workers can never receive the same job, and per-queue concurrency caps and rate limits hold under contention. Available in TypeScript and Python (incl. asyncio).
- **Push** — configure a queue with an `invokeUrl` and Tidebase delivers signed `run.invoke` webhooks to your app (same HMAC as recovery webhooks). At-least-once with a redelivery horizon; beginning the run by id makes redelivery safe.

## Cron schedules

```typescript
await tide.schedules.set('daily-digest', {
  cron: '0 9 * * *',
  workflowName: 'daily-digest'
})
```

Five-field UTC cron. Each fire enqueues with a dedupe key derived from the fire time (`sched::