How to checkpoint CrewAI crews

To make a CrewAI pipeline durable with Tidebase, wrap each crew.kickoff(...) in a checkpointed step. A multi-crew pipeline that dies between crews resumes with finished crews replaying from Postgres — no agent re-runs, no tokens re-billed — and human approval can sit between crews as a durable gate.

from crewai import Agent, Crew, Task
from tidebase import Tidebase

tide = Tidebase()  # reads TIDEBASE_URL, default http://localhost:7373

# build_research_crew, build_writing_crew, and cms are your own application code (not shown).
def pipeline(run, input):
    research = run.step(
        "research-crew",
        lambda: build_research_crew().kickoff(inputs={"topic": input["topic"]}).raw,
        input={"topic": input["topic"]},
    )

    draft = run.step(
        "writing-crew",
        lambda: build_writing_crew().kickoff(inputs={"findings": research}).raw,
        input={"findings": research},
    )

    decision = run.gate("approve-publish", "Publish this draft?", data={"preview": draft[:200]})
    if not decision.approved:
        return {"published": False}

    run.step(
        "publish",
        lambda: cms.publish(draft),
        input={"draft": draft},
        side_effects=["cms"],
        idempotency_key=f"publish-{run.run_id}",
    )
    return {"published": True}

# run_id and topic come from the caller; pass the same run_id again to resume a failed run.
tide.run("content-pipeline", pipeline, run_id=run_id, input={"topic": topic})

Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.

The honest tradeoff: Tidebase does not execute your code — after a crash, something (a Tidebase queue worker, a recovery webhook handler, your own cron or retry button) must re-invoke the pipeline with the same run_id. Tidebase’s guarantee is that doing so is safe: completed crews replay, the gate’s decision replays, and the publish never fires twice.
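The "something must re-invoke" part can be as small as a retry loop. The sketch below is one hypothetical shape for it, not a Tidebase API: resume_until_done, invoke, max_attempts, and backoff_seconds are names made up for illustration. The only thing it relies on from the text above is that every attempt reuses the same run_id.

```python
import time
from typing import Any, Callable, Optional


def resume_until_done(
    invoke: Callable[[str], Any],
    run_id: str,
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> Any:
    """Re-invoke the same run until it returns or the attempts run out.

    `invoke` is a hypothetical callable that starts or resumes the
    pipeline for a given run_id and raises if the run fails.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Same run_id on every attempt, so finished steps can replay.
            return invoke(run_id)
        except Exception as exc:  # broad on purpose: any crash triggers a retry
            last_error = exc
            if attempt < max_attempts:
                # Linear backoff between attempts.
                time.sleep(backoff_seconds * attempt)
    raise RuntimeError(
        f"run {run_id} failed after {max_attempts} attempts"
    ) from last_error
```

With the pipeline above, invoke would presumably be something like lambda rid: tide.run("content-pipeline", pipeline, run_id=rid, input={"topic": topic}). In production a queue worker or cron job usually plays this role instead of an in-process loop, since the process that crashed cannot retry itself.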

Step granularity: one step per kickoff

A crew.kickoff() is the natural checkpoint unit — it’s expensive, nondeterministic, and runs to completion. Checkpoint its output (.raw, or .json_dict for structured outputs — pick something JSON-serializable, not the whole result object), and pass the crew’s inputs as the step input so a changed topic invalidates the stale checkpoint loudly instead of silently replaying.
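One way to pick the checkpoint value is a small helper that prefers the structured output and falls back to the raw string. This is a sketch under the assumption that the kickoff result exposes .json_dict (None when the crew has no structured output) and .raw, as described above; checkpoint_value is a hypothetical name, not part of CrewAI or Tidebase.

```python
import json
from typing import Any


def checkpoint_value(result: Any) -> Any:
    """Return a JSON-serializable view of a crew result.

    Assumes `result` has a `.raw` string and, optionally, a `.json_dict`
    that is None when no structured output was configured.
    """
    structured = getattr(result, "json_dict", None)
    value = structured if structured is not None else result.raw
    # Fail at checkpoint time, not at replay time, if the value is not JSON.
    json.dumps(value)
    return value
```

Inside a step this would replace the trailing .raw, for example lambda: checkpoint_value(build_research_crew().kickoff(inputs={"topic": input["topic"]})). The early json.dumps call is a design choice: a TypeError raised inside the step is easier to trace than a serialization failure deeper in the checkpoint layer.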

If the process dies mid-kickoff, that crew re-runs from the start on resume — there is no finished checkpoint to replay. CrewAI agents that use tools with external side effects (post to an API, send email) deserve the same treatment as the publish step: wrap the tool’s body in a run.step with side_effects and an idempotency_key, so a re-run crew can’t double-fire what already happened. The classification rules are in the replay contract.
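For tool calls, the idempotency key has to be the same when a re-run crew repeats the same action. The sketch below shows one possible scheme: derive the key from the run ID, the tool name, and a hash of the canonicalized arguments. The function name and the key format are assumptions for illustration; Tidebase does not prescribe them.

```python
import hashlib
import json
from typing import Any, Mapping


def tool_idempotency_key(run_id: str, tool_name: str, args: Mapping[str, Any]) -> str:
    """Build a deterministic key for one tool call within one run."""
    # Sort keys so that argument order does not change the key.
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{tool_name}-{run_id}-{digest}"
```

The result would be passed as idempotency_key= to the run.step that wraps the tool body, alongside side_effects, the same way the publish step does. One caveat: an agent may phrase free-text arguments differently on a re-run, which changes the hash, so it is safer to key on stable business identifiers (a recipient ID, an order number) than on generated text.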

A durable gate between crews

The approve-publish gate parks the run in Postgres until a human decides — from Studio, a webhook channel into Slack, or your own UI — and the decision is exactly-once and recorded with the actor. This is the natural place for editorial review in content pipelines: the expensive crews are checkpointed behind you, so the run can wait hours without holding any compute. See human approval gates.

Recording cost per run

CrewAI exposes token usage after a kickoff via the crew’s usage metrics. Record it inside the step so a replay doesn’t double-count:

# Call this inside the step's function, after crew.kickoff() has returned.
metrics = crew.usage_metrics
run.usage.record(
    kind="llm",
    provider="openai",
    label="research-crew",
    input_tokens=metrics.prompt_tokens,
    output_tokens=metrics.completion_tokens,
)

Across many runs this gives you per-pipeline cost as queryable Postgres rows — see tracking LLM token costs per run.

What Tidebase does not do here

It does not orchestrate the crew. Task order, delegation, and agent collaboration stay entirely in CrewAI; Tidebase checkpoints at the boundaries you choose.
It does not replace CrewAI’s memory. Crew memory is conversational context; Tidebase is the durable run record around it.
It is alpha software, and auth is opt-in. Tidebase is a self-hosted alpha; set TIDEBASE_API_KEY before exposing the server beyond localhost.

Repo: https://github.com/BlueprintLabIO/tidebase · See also: How to resume a failed AI agent run