How to checkpoint OpenAI Agents SDK runs

To make an OpenAI Agents SDK pipeline durable with Tidebase, wrap each Runner.run(...) call in a checkpointed step with the async Python SDK (tidebase.aio), and wrap function-tool bodies that hit external APIs in their own steps. Re-invoking with the same run_id after a crash replays finished agent runs from Postgres instead of re-running them.

from agents import Agent, Runner, function_tool
from tidebase.aio import AsyncTidebase

tide = AsyncTidebase()  # reads TIDEBASE_URL, default http://localhost:7373

researcher = Agent(
    name="researcher",
    instructions="Research the topic and return key findings.",
)
writer = Agent(
    name="writer",
    instructions="Turn findings into a short publishable post.",
)

async def pipeline(run, input):
    findings = await run.step(
        "research",
        lambda: Runner.run(researcher, f"Research: {input['topic']}"),
        input={"topic": input["topic"]},
    )

    draft = await run.step(
        "draft",
        lambda: Runner.run(writer, f"Write a post from:\n{findings.final_output}"),
        input={"findings": findings.final_output},
    )

    await run.state_set({"phase": "done"})
    return draft.final_output

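async def main(run_id: str, topic: str):
    # Re-invoke with the same run_id after a crash to resume from the last checkpoint.
    return await tide.run("research-pipeline", pipeline, run_id=run_id, input={"topic": topic})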

Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.

The honest tradeoff: Tidebase does not execute your code — a Tidebase queue worker, recovery webhook handler, or your own retry must re-invoke the pipeline after a failure. And a replayed step returns the recorded agent output from Postgres; it does not re-run the agent. That is the point — a crash after research finished means research never runs (or bills) again — but the replayed output reflects what the model said the first time.
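To make the replay behaviour concrete, here is a minimal sketch that uses an in-memory stand-in instead of Tidebase. `ToyRun`, `CHECKPOINTS`, and `invoke_with_retry` are hypothetical names invented for this illustration; the real SDK persists to Postgres and its internals may differ. The draft step crashes once, an outer retry re-invokes the pipeline with the same run id, and the research step is replayed rather than executed again.

```python
import asyncio

# Hypothetical in-memory stand-in for a checkpoint store. Tidebase itself
# persists to Postgres; ToyRun and CHECKPOINTS are illustrative names only.
CHECKPOINTS: dict[tuple[str, str], object] = {}
calls = {"research": 0, "draft": 0}


class ToyRun:
    def __init__(self, run_id: str):
        self.run_id = run_id

    async def step(self, name, fn):
        key = (self.run_id, name)
        if key in CHECKPOINTS:       # finished earlier: replay, do not re-run
            return CHECKPOINTS[key]
        result = await fn()          # first execution
        CHECKPOINTS[key] = result    # recorded only after the step succeeds
        return result


async def research():
    calls["research"] += 1
    return "findings"


async def draft():
    calls["draft"] += 1
    if calls["draft"] == 1:
        raise RuntimeError("simulated crash during draft")
    return "post"


async def pipeline(run):
    findings = await run.step("research", research)
    post = await run.step("draft", draft)
    return f"{post} from {findings}"


async def invoke_with_retry(run_id: str, attempts: int = 3):
    # Something outside the checkpoint layer has to call the pipeline again.
    for _ in range(attempts):
        try:
            return await pipeline(ToyRun(run_id))
        except RuntimeError:
            continue
    raise RuntimeError("gave up")


result = asyncio.run(invoke_with_retry("run-1"))
```

In this toy, research executes once and draft executes twice, which is the shape of the tradeoff described above: the retry is yours to trigger, and the first step's recorded output is reused as-is.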

One step per `Runner.run`

Each Runner.run(...) is the expensive, nondeterministic unit — the SDK may execute many model turns, handoffs, and tool calls inside it. Pass the prompt material as the step’s input: Tidebase hashes it, so a changed prompt rejects the stale checkpoint loudly instead of replaying an answer to a question you no longer asked. The result objects serialize with the step; final_output is what you’ll usually thread forward.
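The input-hash check can be pictured with a small stand-alone sketch. This is not Tidebase's implementation: the hashing scheme (SHA-256 over canonical JSON), `ToyStepStore`, and `StaleCheckpointError` are assumptions made up for illustration. It only shows the idea that a recorded step is replayed when the input matches and rejected when it does not.

```python
import hashlib
import json


class StaleCheckpointError(Exception):
    """Raised when a recorded step's input hash no longer matches."""


def input_hash(value) -> str:
    # Canonical JSON so that key order does not change the hash.
    payload = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyStepStore:
    """Hypothetical in-memory store: step name -> (input hash, output)."""

    def __init__(self):
        self._records = {}

    def run_step(self, name, fn, input):
        digest = input_hash(input)
        if name in self._records:
            recorded_digest, output = self._records[name]
            if recorded_digest != digest:
                raise StaleCheckpointError(f"input changed for step {name!r}")
            return output            # same input: replay the recorded output
        output = fn()
        self._records[name] = (digest, output)
        return output


store = ToyStepStore()
first = store.run_step("research", lambda: "findings about owls", {"topic": "owls"})
replayed = store.run_step("research", lambda: "never called", {"topic": "owls"})

try:
    store.run_step("research", lambda: "never called", {"topic": "bats"})
    rejected = False
except StaleCheckpointError:
    rejected = True
```

The practical consequence is that anything that shapes the prompt belongs in `input`; a value left out of it cannot invalidate the checkpoint when it changes.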

Note that the step function returns a coroutine rather than awaiting it; tidebase.aio awaits it on the event loop, so lambda: Runner.run(...) is correct as written.
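If the lambda pattern looks odd, the following sketch shows the general mechanism in plain asyncio. `toy_step` and `fake_runner_run` are hypothetical stand-ins, not part of tidebase.aio or the Agents SDK, and the plain-value branch is a property of this toy only. The step wrapper calls the function, receives a coroutine object, and awaits it itself.

```python
import asyncio
import inspect


async def toy_step(fn):
    # Call the step function; if it hands back an awaitable, await it here.
    result = fn()
    if inspect.isawaitable(result):
        result = await result
    return result


async def fake_runner_run(prompt: str) -> str:
    # Stand-in for an async call such as Runner.run(...).
    await asyncio.sleep(0)
    return f"answer to: {prompt}"


async def main():
    # The lambda does not await; it only creates the coroutine.
    from_coroutine = await toy_step(lambda: fake_runner_run("owls"))
    from_plain_value = await toy_step(lambda: "plain value")
    return from_coroutine, from_plain_value


outputs = asyncio.run(main())
```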

Tools with side effects get their own steps

If a @function_tool writes to the outside world, the crash you care about happens inside Runner.run — and on resume the whole agent run re-executes (it never completed, so there is no checkpoint). Inner tool steps are what prevent the re-run from double-firing:

def make_tools(run):
    @function_tool
    async def create_ticket(title: str, body: str) -> str:
        return await run.step(
            f"create-ticket:{title}",
            lambda: ticket_api.create(title, body),
            input={"title": title, "body": body},
            side_effects=["ticketing-api"],
            idempotency_key=f"ticket-{run.run_id}-{title}",
        )
    return [create_ticket]

Build the tools inside the workflow so they close over run, and pass them to the Agent. A side-effecting step that fails without an idempotency key parks as manual_review instead of being blindly retried — the exact rules are in the replay contract.
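As a rough illustration of what a run-scoped idempotency key buys in general, the sketch below uses a hypothetical `ToyTicketApi` that deduplicates on the key. It assumes a downstream service that honors idempotency keys; how Tidebase itself uses the key is not shown here. The key has the same shape as in the tool above, so it is stable when the same run is re-invoked and different across runs.

```python
class ToyTicketApi:
    """Hypothetical downstream API that honors idempotency keys."""

    def __init__(self):
        self._by_key = {}
        self.created = 0

    def create(self, title: str, body: str, idempotency_key: str) -> str:
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]  # same request seen before
        self.created += 1
        ticket_id = f"TICKET-{self.created}"
        self._by_key[idempotency_key] = ticket_id
        return ticket_id


def ticket_key(run_id: str, title: str) -> str:
    # Stable across re-invocations of the same run, different across runs.
    return f"ticket-{run_id}-{title}"


api = ToyTicketApi()
first = api.create("Login broken", "Cannot sign in", ticket_key("run-1", "Login broken"))
retry = api.create("Login broken", "Cannot sign in", ticket_key("run-1", "Login broken"))
other_run = api.create("Login broken", "Cannot sign in", ticket_key("run-2", "Login broken"))
```

One design point this makes visible: because the key is derived from the title, two tickets with the same title in one run collapse into one. If that is not what you want, include another distinguishing value in the key.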

Record usage per run

The Agents SDK exposes token usage on the run result’s context (result.context_wrapper.usage). Record it inside the step so replays don’t double-count:

usage = result.context_wrapper.usage  # result is the value returned by Runner.run(...)
await run.usage_record(
    kind="llm",
    provider="openai",
    model="gpt-4.1-mini",
    input_tokens=usage.input_tokens,
    output_tokens=usage.output_tokens,
)

See tracking LLM token costs per run.
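Why recording inside the step avoids double-counting can be sketched with a toy replay loop. `toy_step`, `usage_log`, and the token numbers are invented for illustration and do not reflect Tidebase's storage. Because a replayed step skips its function body entirely, any usage recording placed in that body is skipped as well.

```python
import asyncio

usage_log = []    # stand-in for recorded usage rows
checkpoints = {}  # stand-in for stored step outputs


async def toy_step(name, fn):
    if name in checkpoints:  # replay: fn is skipped entirely
        return checkpoints[name]
    checkpoints[name] = await fn()
    return checkpoints[name]


async def research_with_usage():
    output, input_tokens, output_tokens = "findings", 120, 45  # made-up numbers
    # Recording happens inside the step body, so it runs only when the
    # model call itself runs.
    usage_log.append(
        {"step": "research", "input_tokens": input_tokens, "output_tokens": output_tokens}
    )
    return output


async def main():
    first = await toy_step("research", research_with_usage)
    replayed = await toy_step("research", research_with_usage)
    return first, replayed


results = asyncio.run(main())
total_input_tokens = sum(row["input_tokens"] for row in usage_log)
```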

What Tidebase does not do here

It does not own the agent loop. Handoffs, guardrails, and tool orchestration stay in the Agents SDK; Tidebase checkpoints around Runner.run boundaries.
It does not proxy OpenAI calls. Your keys, your network path.
It is a self-hosted alpha with opt-in auth. Set TIDEBASE_API_KEY before exposing the server beyond localhost.

Repo: https://github.com/BlueprintLabIO/tidebase · See also: How to resume a failed AI agent run
