How to checkpoint OpenAI Agents SDK runs

To make an OpenAI Agents SDK pipeline durable with Tidebase, wrap each Runner.run(...) call in a checkpointed step with the async Python SDK (tidebase.aio), and wrap function-tool bodies that hit external APIs in their own steps. Re-invoking with the same run_id after a crash replays finished agent runs from Postgres instead of re-running them.

from agents import Agent, Runner, function_tool
from tidebase.aio import AsyncTidebase

tide = AsyncTidebase()  # reads TIDEBASE_URL, default http://localhost:7373

researcher = Agent(
    name="researcher",
    instructions="Research the topic and return key findings.",
)
writer = Agent(
    name="writer",
    instructions="Turn findings into a short publishable post.",
)

async def pipeline(run, input):
    findings = await run.step(
        "research",
        lambda: Runner.run(researcher, f"Research: {input['topic']}"),
        input={"topic": input["topic"]},
    )

    draft = await run.step(
        "draft",
        lambda: Runner.run(writer, f"Write a post from:\n{findings.final_output}"),
        input={"findings": findings.final_output},
    )

    await run.state_set({"phase": "done"})
    return draft.final_output

# Call this from async code. Reuse the same run_id after a crash to resume the run.
result = await tide.run("research-pipeline", pipeline, run_id=run_id, input={"topic": topic})

Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.

The honest tradeoff: Tidebase does not execute your code — a Tidebase queue worker, recovery webhook handler, or your own retry must re-invoke the pipeline after a failure. And a replayed step returns the recorded agent output from Postgres; it does not re-run the agent. That is the point — a crash after research finished means research never runs (or bills) again — but the replayed output reflects what the model said the first time.
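The "your own retry" option can be as small as a loop that calls the pipeline again with an unchanged run_id. The following is a minimal sketch, not part of Tidebase: the helper name `invoke_until_done`, its parameters, and the linear backoff are all assumptions, and `invoke` is a hypothetical async callable that wraps the tide.run(...) call shown above.

```python
import asyncio
import uuid


async def invoke_until_done(invoke, run_id=None, attempts=3, delay_s=1.0):
    """Call invoke(run_id) until it succeeds, reusing one run_id throughout.

    Reusing the run_id is what lets finished steps replay from their
    checkpoints instead of running (and billing) again.
    """
    run_id = run_id or str(uuid.uuid4())
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return await invoke(run_id)
        except Exception as exc:  # narrow this to the errors you consider retryable
            last_error = exc
            if attempt < attempts:
                await asyncio.sleep(delay_s * attempt)  # simple linear backoff
    raise RuntimeError(f"run {run_id} failed after {attempts} attempts") from last_error


# Hypothetical usage, from async code:
#   result = await invoke_until_done(
#       lambda rid: tide.run("research-pipeline", pipeline, run_id=rid, input={"topic": topic})
#   )
```

An in-process loop like this only covers failures the process survives. A hard crash still needs an external re-invocation, such as the queue worker or recovery webhook handler mentioned above.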

One step per Runner.run

Each Runner.run(...) is the expensive, nondeterministic unit — the SDK may execute many model turns, handoffs, and tool calls inside it. Pass the prompt material as the step’s input: Tidebase hashes it, so a changed prompt rejects the stale checkpoint loudly instead of replaying an answer to a question you no longer asked. The result objects serialize with the step; final_output is what you’ll usually thread forward.
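To picture the input check, here is a toy in-memory model of a checkpoint keyed by step name and guarded by a hash of the step input. It is an illustration only and not Tidebase's implementation: the `CheckpointStore` and `StaleCheckpointError` names, the dict standing in for Postgres, and the choice of SHA-256 over canonical JSON are all assumptions.

```python
import hashlib
import json


class StaleCheckpointError(Exception):
    """Raised when a step is replayed with different input than was recorded."""


def input_hash(step_input):
    # Canonical JSON (sorted keys, fixed separators) so equal inputs hash equally.
    payload = json.dumps(step_input, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CheckpointStore:
    """Toy stand-in for the checkpoint table; a dict instead of Postgres."""

    def __init__(self):
        self._rows = {}  # step name -> (input hash, recorded output)

    def step(self, name, fn, step_input):
        digest = input_hash(step_input)
        if name in self._rows:
            recorded_hash, output = self._rows[name]
            if recorded_hash != digest:
                # The prompt material changed, so the old answer no longer applies.
                raise StaleCheckpointError(f"input for step {name!r} changed")
            return output  # replay: fn is not called again
        output = fn()
        self._rows[name] = (digest, output)
        return output
```

In this model, calling `step("research", ...)` twice with `{"topic": "tides"}` runs the function once, while a later call with `{"topic": "moons"}` raises instead of returning the old findings.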

Note that the step function returns the coroutine without awaiting it. tidebase.aio awaits it on the event loop, so lambda: Runner.run(...) is correct as written.
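The pattern of a plain callable that hands back a coroutine can be sketched in isolation. This is a guess at the general shape, not the tidebase.aio source: `call_step_fn` and `fake_agent_run` are hypothetical names, and accepting non-awaitable return values is an assumption.

```python
import asyncio
import inspect


async def call_step_fn(fn):
    """Call a zero-argument step function and await its result if needed."""
    result = fn()
    if inspect.isawaitable(result):
        result = await result  # a lambda around an async call lands here
    return result


async def fake_agent_run(prompt):
    # Hypothetical stand-in for Runner.run(...), which is also a coroutine function.
    await asyncio.sleep(0)
    return f"answer to {prompt}"
```

Calling `call_step_fn(lambda: fake_agent_run("q"))` creates the coroutine only when the step actually executes, which is why a replayed step never starts the agent at all.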

Tools with side effects get their own steps

If a @function_tool writes to the outside world, the crash you care about happens inside Runner.run — and on resume the whole agent run re-executes (it never completed, so there is no checkpoint). Inner tool steps are what prevent the re-run from double-firing:

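def make_tools(run):
    # Defined inside a factory so the tool closes over the current run.
    @function_tool
    async def create_ticket(title: str, body: str) -> str:
        # ticket_api is your own client for the ticketing system
        # (not part of Tidebase or the Agents SDK).
        return await run.step(
            f"create-ticket:{title}",
            lambda: ticket_api.create(title, body),
            input={"title": title, "body": body},
            side_effects=["ticketing-api"],
            idempotency_key=f"ticket-{run.run_id}-{title}",
        )
    return [create_ticket]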

Build the tools inside the workflow so they close over run, and pass them to the Agent. A side-effecting step that fails without an idempotency key parks as manual_review instead of being blindly retried — the exact rules are in the replay contract.
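The double-fire scenario is easier to see in a simulation. The sketch below uses an in-memory `FakeRun` in place of a real Tidebase run and an `agent_run` function in place of Runner.run. Both are hypothetical stand-ins, and the replay-by-step-name behavior is a simplification of the real replay contract.

```python
import asyncio


class FakeRun:
    """In-memory stand-in for a Tidebase run: completed steps replay by name."""

    def __init__(self):
        self.completed = {}

    async def step(self, name, fn):
        if name in self.completed:
            return self.completed[name]  # replay the recorded result
        result = await fn()
        self.completed[name] = result  # checkpoint only after fn finishes
        return result


tickets_created = []


async def create_ticket(run, title):
    async def call_api():
        tickets_created.append(title)  # the external side effect
        return f"TICKET-{len(tickets_created)}"

    return await run.step(f"create-ticket:{title}", call_api)


async def agent_run(run, crash):
    # Stands in for Runner.run: calls the tool, then may die before finishing.
    ticket_id = await create_ticket(run, "Broken login")
    if crash:
        raise RuntimeError("process died mid-run")
    return f"Filed {ticket_id}"


async def demo():
    run = FakeRun()
    try:
        await run.step("triage", lambda: agent_run(run, crash=True))
    except RuntimeError:
        pass  # the outer step never completed, so it has no checkpoint
    # Re-invocation: the outer step re-executes, the inner tool step replays.
    return await run.step("triage", lambda: agent_run(run, crash=False))
```

After `asyncio.run(demo())`, the agent body has executed twice but `tickets_created` holds a single entry, because the second pass hit the inner step's checkpoint.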

Record usage per run

The Agents SDK exposes token usage on the run result’s context (result.context_wrapper.usage). Record it inside the step so replays don’t double-count:

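# `result` is the value returned by Runner.run(...) inside this step.
usage = result.context_wrapper.usage

await run.usage_record(
    kind="llm",
    provider="openai",
    model="gpt-4.1-mini",
    input_tokens=usage.input_tokens,
    output_tokens=usage.output_tokens,
)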

See tracking LLM token costs per run.
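One way to keep the recording inside the step is to make the step body a small function that runs the agent and records usage before returning. This is a sketch under assumptions: the helper name `run_agent_and_record_usage` and its `run_agent` parameter are hypothetical, while the `usage_record` arguments mirror the call shown above.

```python
async def run_agent_and_record_usage(run, run_agent, model):
    """Step body: run the agent once and record its token usage.

    `run_agent` is a hypothetical zero-argument callable such as
    `lambda: Runner.run(researcher, prompt)`. Because the recording happens
    in the same step body, a replayed step skips both the agent call and
    the usage record.
    """
    result = await run_agent()
    usage = result.context_wrapper.usage
    await run.usage_record(
        kind="llm",
        provider="openai",
        model=model,
        input_tokens=usage.input_tokens,
        output_tokens=usage.output_tokens,
    )
    return result


# Hypothetical usage inside the pipeline:
#   findings = await run.step(
#       "research",
#       lambda: run_agent_and_record_usage(
#           run, lambda: Runner.run(researcher, prompt), "gpt-4.1-mini"
#       ),
#       input={"topic": input["topic"]},
#   )
```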

What Tidebase does not do here

  • It does not own the agent loop. Handoffs, guardrails, and tool orchestration stay in the Agents SDK; Tidebase checkpoints around Runner.run boundaries.
  • It does not proxy OpenAI calls. Your keys, your network path.
  • It does not enforce auth by default. Tidebase is a self-hosted alpha with opt-in auth: set TIDEBASE_API_KEY before exposing the server beyond localhost.

Repo: https://github.com/BlueprintLabIO/tidebase · See also: How to resume a failed AI agent run