How to checkpoint OpenAI Agents SDK runs
To make an OpenAI Agents SDK pipeline durable with Tidebase, wrap each Runner.run(...) call in a checkpointed step with the async Python SDK (tidebase.aio), and wrap function-tool bodies that hit external APIs in their own steps. Re-invoking with the same run_id after a crash replays finished agent runs from Postgres instead of re-running them.
from agents import Agent, Runner, function_tool
from tidebase.aio import AsyncTidebase
tide = AsyncTidebase() # reads TIDEBASE_URL, default http://localhost:7373
researcher = Agent(
name="researcher",
instructions="Research the topic and return key findings.",
)
writer = Agent(
name="writer",
instructions="Turn findings into a short publishable post.",
)
async def pipeline(run, input):
findings = await run.step(
"research",
lambda: Runner.run(researcher, f"Research: {input['topic']}"),
input={"topic": input["topic"]},
)
draft = await run.step(
"draft",
lambda: Runner.run(writer, f"Write a post from:\n{findings.final_output}"),
input={"findings": findings.final_output},
)
await run.state_set({"phase": "done"})
return draft.final_output
result = await tide.run("research-pipeline", pipeline, run_id=run_id, input={"topic": topic})
Tidebase is an open-source checkpoint layer for AI agents: wrap your steps, and failed runs resume from the last safe point — in your own Postgres, without moving execution into a new runtime.
The honest tradeoff: Tidebase does not execute your code — a Tidebase queue worker, recovery webhook handler, or your own retry must re-invoke the pipeline after a failure. And a replayed step returns the recorded agent output from Postgres; it does not re-run the agent. That is the point — a crash after research finished means research never runs (or bills) again — but the replayed output reflects what the model said the first time.
One step per Runner.run
Each Runner.run(...) is the expensive, nondeterministic unit — the SDK may execute many model turns, handoffs, and tool calls inside it. Pass the prompt material as the step’s input: Tidebase hashes it, so a changed prompt rejects the stale checkpoint loudly instead of replaying an answer to a question you no longer asked. The result objects serialize with the step; final_output is what you’ll usually thread forward.
Note the step fn returns the coroutine — tidebase.aio awaits it on the event loop, so lambda: Runner.run(...) is correct as written.
Tools with side effects get their own steps
If a @function_tool writes to the outside world, the crash you care about happens inside Runner.run — and on resume the whole agent run re-executes (it never completed, so there is no checkpoint). Inner tool steps are what prevent the re-run from double-firing:
def make_tools(run):
@function_tool
async def create_ticket(title: str, body: str) -> str:
return await run.step(
f"create-ticket:{title}",
lambda: ticket_api.create(title, body),
input={"title": title, "body": body},
side_effects=["ticketing-api"],
idempotency_key=f"ticket-{run.run_id}-{title}",
)
return [create_ticket]
Build the tools inside the workflow so they close over run, and pass them to the Agent. A side-effecting step that fails without an idempotency key parks as manual_review instead of being blindly retried — the exact rules are in the replay contract.
Record usage per run
The Agents SDK exposes token usage on the run result’s context (result.context_wrapper.usage). Record it inside the step so replays don’t double-count:
await run.usage_record(
kind="llm",
provider="openai",
model="gpt-4.1-mini",
input_tokens=usage.input_tokens,
output_tokens=usage.output_tokens,
)
See tracking LLM token costs per run.
What Tidebase does not do here
- It does not own the agent loop. Handoffs, guardrails, and tool orchestration stay in the Agents SDK; Tidebase checkpoints around
Runner.runboundaries. - It does not proxy OpenAI calls. Your keys, your network path.
- Alpha, opt-in auth. Self-hosted alpha — set
TIDEBASE_API_KEYbefore exposing the server beyond localhost.
Repo: https://github.com/BlueprintLabIO/tidebase · See also: How to resume a failed AI agent run