# How to checkpoint CrewAI crews

To make a CrewAI pipeline durable with Tidebase, wrap each `crew.kickoff(...)` in a checkpointed step. If a multi-crew pipeline dies between crews, the resumed run replays the finished crews from Postgres (no agent re-runs, no tokens re-billed), and human approval can sit between crews as a durable gate.

```python
from crewai import Agent, Crew, Task
from tidebase import Tidebase

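tide = Tidebase()  # reads TIDEBASE_URL, default http://localhost:7373

# build_research_crew(), build_writing_crew(), and cms are application code
# (not shown); the builders return a crewai.Crew assembled from Agent and Task.
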
def pipeline(run, input):
    research = run.step(
        "research-crew",
        lambda: build_research_crew().kickoff(inputs={"topic": input["topic"]}).raw,
        input={"topic": input["topic"]},
    )

    draft = run.step(
        "writing-crew",
        lambda: build_writing_crew().kickoff(inputs={"findings": research}).raw,
        input={"findings": research},
    )

    decision = run.gate("approve-publish", "Publish this draft?", data={"preview": draft[:200]})
    if not decision.approved:
        return {"published": False}

    run.step(
        "publish",
        lambda: cms.publish(draft),
        input={"draft": draft},
        side_effects=["cms"],
        idempotency_key=f"publish-{run.run_id}",
    )
    return {"published": True}

# run_id and topic come from the caller; reuse the same run_id to resume a run.
tide.run("content-pipeline", pipeline, run_id=run_id, input={"topic": topic})
```

Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.

The honest tradeoff: Tidebase does not execute your code — after a crash, something (a Tidebase queue worker, a recovery webhook handler, your own cron or retry button) must re-invoke the pipeline with the same `run_id`. Tidebase's guarantee is that doing so is safe: completed crews replay, the gate's decision replays, and the publish never fires twice.
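As a minimal sketch of the "your own retry" option, the loop below re-invokes a pipeline with the same `run_id` until it returns or the attempts run out. `invoke_until_done`, `start_run`, and the backoff values are hypothetical names chosen for illustration and are not part of Tidebase; `start_run` stands in for whatever function in your app calls `tide.run(...)`.

```python
import time
from typing import Any, Callable


def invoke_until_done(
    start_run: Callable[[str], Any],
    run_id: str,
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
) -> Any:
    """Re-invoke a pipeline with the SAME run_id until it returns.

    start_run is a hypothetical callable that wraps your own
    tide.run("content-pipeline", pipeline, run_id=run_id, input=...) call.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # The run_id never changes between attempts, so finished
            # steps can replay instead of executing again.
            return start_run(run_id)
        except Exception as exc:  # narrow this to the errors you expect
            last_error = exc
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)
    raise RuntimeError(
        f"run {run_id} failed after {max_attempts} attempts"
    ) from last_error
```

An in-process loop like this only covers exceptions. If the whole process dies, an external trigger (a worker, a webhook handler, cron) still has to make the call.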

## Step granularity: one step per kickoff

A `crew.kickoff()` is the natural checkpoint unit — it's expensive, nondeterministic, and runs to completion. Checkpoint its **output** (`.raw`, or `.json_dict` for structured outputs — pick something JSON-serializable, not the whole result object), and pass the crew's inputs as the step `input` so a changed topic invalidates the stale checkpoint loudly instead of silently replaying.
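One way to make the "pick something JSON-serializable" rule explicit is a small helper that selects the payload and fails inside the step if it cannot be serialized. This is a sketch, not a Tidebase or CrewAI API: `checkpoint_payload` is a hypothetical name, and it assumes only that the kickoff result exposes `.raw` and `.json_dict` as described above. The `SimpleNamespace` objects are stand-ins for real results.

```python
import json
from types import SimpleNamespace
from typing import Any


def checkpoint_payload(result: Any) -> Any:
    """Pick a JSON-serializable value from a kickoff result.

    Prefers result.json_dict when it is set, otherwise falls back to
    result.raw. Raises if the chosen value cannot be serialized.
    """
    structured = getattr(result, "json_dict", None)
    payload = structured if structured is not None else result.raw
    json.dumps(payload)  # raises TypeError for non-serializable values
    return payload


# Stand-in objects shaped like a kickoff result, for illustration only.
text_result = SimpleNamespace(raw="Three findings about durable agents.", json_dict=None)
structured_result = SimpleNamespace(raw='{"score": 7}', json_dict={"score": 7})

print(checkpoint_payload(text_result))
print(checkpoint_payload(structured_result))
```

In the pipeline, the step's lambda would return `checkpoint_payload(crew.kickoff(...))` in place of `.raw`.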

If the process dies *mid*-kickoff, that crew re-runs from the start on resume — there is no finished checkpoint to replay. CrewAI agents that use tools with external side effects (post to an API, send email) deserve the same treatment as the `publish` step: wrap the tool's body in a `run.step` with `side_effects` and an `idempotency_key`, so a re-run crew can't double-fire what already happened. The classification rules are in [the replay contract](../replay-contract-is-it-safe-to-rerun.md).
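For a tool-level step, the idempotency key has to be the same on a re-run and different for a different call. A minimal sketch, assuming the tool's arguments are JSON-serializable: derive the key from the run ID, the tool name, and a hash of the arguments. `tool_idempotency_key` is a hypothetical helper, not part of Tidebase.

```python
import hashlib
import json
from typing import Any, Mapping


def tool_idempotency_key(run_id: str, tool_name: str, args: Mapping[str, Any]) -> str:
    """Build a deterministic key from the run, the tool, and its arguments."""
    # sort_keys makes the digest independent of dict ordering.
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{tool_name}-{run_id}-{digest}"


# Same call, different key order: the keys match.
key_a = tool_idempotency_key("run-42", "send-email", {"to": "a@example.com", "subject": "Hi"})
key_b = tool_idempotency_key("run-42", "send-email", {"subject": "Hi", "to": "a@example.com"})
# Different recipient: the key differs.
key_c = tool_idempotency_key("run-42", "send-email", {"to": "b@example.com", "subject": "Hi"})
```

Inside the tool body, the result would be passed as `idempotency_key=` to `run.step(...)`, in the same shape as the `publish` step. If an agent may legitimately call the tool twice with identical arguments, add a distinguishing value to `args` so the second call is not treated as a repeat.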

## A durable gate between crews

The `approve-publish` gate parks the run in Postgres until a human decides — from Studio, a webhook channel into Slack, or your own UI — and the decision is exactly-once and recorded with the actor. This is the natural place for editorial review in content pipelines: the expensive crews are checkpointed behind you, so the run can wait hours without holding any compute. See [human approval gates](../human-approval-gates-for-ai-agents.md).

## Recording cost per run

CrewAI exposes token usage after a kickoff via the crew's `usage_metrics` attribute. Record it inside the step function, right after the kickoff, so that a replayed step (which does not re-run the function) does not double-count:

```python
# crew is the Crew instance whose kickoff() just returned
metrics = crew.usage_metrics
run.usage.record(
    kind="llm",
    provider="openai",
    label="research-crew",
    input_tokens=metrics.prompt_tokens,
    output_tokens=metrics.completion_tokens,
)
```

Across many runs this gives you per-pipeline cost as queryable Postgres rows — see [tracking LLM token costs per run](../track-llm-token-costs-per-run.md).
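To keep the recording inside the step, the kickoff and the usage call can live in one function that the step's lambda invokes. The sketch below assumes only the calls shown on this page (`crew.kickoff`, `crew.usage_metrics`, `run.usage.record`); `kickoff_and_record` is a hypothetical helper name, and the `provider` default simply mirrors the snippet above.

```python
def kickoff_and_record(run, crew, label, inputs, provider="openai"):
    """Run one crew and record its token usage.

    Meant to be called from inside a run.step function, so the usage row
    is written only when the crew actually executes, not on replay.
    """
    result = crew.kickoff(inputs=inputs)
    metrics = crew.usage_metrics  # populated by the kickoff above
    run.usage.record(
        kind="llm",
        provider=provider,
        label=label,
        input_tokens=metrics.prompt_tokens,
        output_tokens=metrics.completion_tokens,
    )
    # Return only the serializable output, which becomes the checkpoint.
    return result.raw


# Hypothetical use inside the pipeline from the first example:
# research = run.step(
#     "research-crew",
#     lambda: kickoff_and_record(
#         run, build_research_crew(), "research-crew", {"topic": input["topic"]}
#     ),
#     input={"topic": input["topic"]},
# )
```

Holding the crew in a variable matters here: the inline `build_research_crew().kickoff(...)` form discards the `Crew` instance, so its `usage_metrics` would not be reachable afterwards.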

## What Tidebase does not do here

- **It does not orchestrate the crew.** Task order, delegation, and agent collaboration stay entirely in CrewAI; Tidebase checkpoints at the boundaries you choose.
- **It does not replace CrewAI's memory.** Crew memory is conversational context; Tidebase is the durable run record around it.
- **It does not enforce auth by default.** Tidebase is a self-hosted alpha with opt-in auth; set `TIDEBASE_API_KEY` before exposing the server beyond localhost.

Repo: <https://github.com/BlueprintLabIO/tidebase> · See also: [How to resume a failed AI agent run](../how-to-resume-a-failed-ai-agent-run.md)