Troubleshooting and recovery

This is the operational triage page. When a paired session misbehaves, work in two passes:

Diagnose with ctxrelay doctor and ctxrelay status before you change anything.
Recover or fix the specific symptom using the root-cause table below.

The most important habit: do not force-restart on instinct. ContextRelay keeps every message, handoff, note, and decision in a durable on-disk ledger, so a session is almost always recoverable, and a symptom that looks like a crash is frequently a watchdog reset or a stale connection that a restart only makes worse.

Three binary names, one CLI

contextrelay, ctxrelay, and context-relay all point to the same CLI. This page uses ctxrelay for brevity; swap in whichever you prefer.

First moves: diagnose before you touch anything

`ctxrelay doctor` - is the environment sane?

ctxrelay doctor

doctor runs a checklist and prints OK, WARN, or ERR per line:

bun, claude, and codex binaries (plus the Codex app-server surface).
Provider auth probes (a real claude -p and codex exec call). Skip these with --no-auth when you are offline or just want the fast structural checks:
```
ctxrelay doctor --no-auth
```
Project config (.contextrelay/config.json), state directory, and daemon tokens.
Daemon health, including a warning if a same-project peer resolved a different instance/state directory.
Claude plugin registration - flags a version mismatch and tells you how to fix it.
Stale state - dead-pid daemon.pid, codex-tui.pid, and daemon.lock files under .contextrelay/state/. doctor reports these so you can clean them up; a clean stop (below) clears them for you.

If any check is ERR, doctor exits non-zero. Start there.
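Because doctor reports failure through its exit status, a wrapper script can refuse to continue while any check is ERR. The following is a minimal sketch, not part of ContextRelay: preflight is a hypothetical helper name, and it takes the doctor command as arguments so the same function works with or without --no-auth.

```shell
# Hypothetical helper (not shipped with ContextRelay): run the given
# doctor command and report whether it is safe to continue.
preflight() {
  if "$@"; then
    echo "preflight: ok"
  else
    rc=$?   # exit status of the doctor command that just failed
    echo "preflight: failed (exit $rc), fix the ERR lines first" >&2
    return 1
  fi
}

# Example use, left commented out so sourcing this sketch has no side effects:
# preflight ctxrelay doctor --no-auth && ctxrelay pair
```

Passing the command in as arguments keeps the helper independent of which binary name or flags you use.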

`ctxrelay status` - what is the live session doing?

ctxrelay status
ctxrelay status --json

status prints the daemon, session, connection, ledger, task, autonomy, finality, and backup state. The --json form is machine-readable and is the right input for scripts (for example, reading stateDir and controlPort).

`ctxrelay instances` - is this a port collision?

ctxrelay instances

ContextRelay scopes each project to its own port group (the first project gets 4500 Codex app-server / 4501 proxy / 4502 daemon control; additional projects increment by 10). instances lists every known project, its instance id, assigned ports, health, and when it was last seen. If two checkouts of the same repo are fighting over a port, you will see it here.
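The port scheme is simple arithmetic, sketched here for illustration. port_group is a hypothetical helper, and it assumes projects are numbered from 0 in the order they were assigned; the ports a project actually holds are whatever ctxrelay instances reports.

```shell
# Hypothetical helper: ports for the Nth project (0-based), assuming the
# scheme described above (base 4500, step of 10 per additional project).
port_group() {
  idx=$1
  base=$((4500 + idx * 10))
  echo "codex=$base proxy=$((base + 1)) control=$((base + 2))"
}

port_group 0   # first project
port_group 2   # third project
```

Under those assumptions the first call prints codex=4500 proxy=4501 control=4502 and the second prints codex=4520 proxy=4521 control=4522.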

Crash recovery

When a session dies - power loss, terminal closed, a real daemon crash - the ledger on disk survives. A fresh pair can reconstruct where you were from it.

ctxrelay recover
ctxrelay recover --json

recover summarizes the recovery context:

the resolved session, instance, and ports, and whether the daemon is currently reachable;
the last recorded shutdown and the last turn-watchdog event;
possibly interrupted commands (commands that started but never recorded completion);
recent failures and blockers from the ledger;
the working-tree git status;
a ready-to-paste resume prompt that tells the agents to call read_context and task_state first, then continue from the newest request.

The ledger is the source of truth

You do not reconstruct state from a chat transcript. Agents only share what is written into the bridge messages and the ledger, so recover, read_context, and task_state are how a new session learns what already happened.

Symptom → cause → fix

"Daemon disconnected" mid-turn

This is usually not a crash

A "daemon disconnected" message most often masks the turn watchdog resetting a long-running Codex turn - not a real failure.

ContextRelay caps the wall-clock budget of a single Codex turn with CONTEXTRELAY_TURN_MAX_MS (default 300000, i.e. 5 minutes). When a turn exceeds it, the watchdog clears that turn from the busy set without killing Codex and records a turn_watchdog event. The connection looks like it dropped, but the daemon is fine.

Fix: check first, don't restart reflexively.

ctxrelay status        # is the daemon actually healthy?
ctxrelay recover       # shows the last watchdog event if one fired

For legitimately long turns, raise the budget instead of force-restarting (set it in every agent's launch environment, or export it before launching the pair):

export CONTEXTRELAY_TURN_MAX_MS=900000   # 15 minutes

Set it to 0 to disable the watchdog entirely (not recommended for unattended runs).
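The variable is expressed in milliseconds, which is easy to mistype. A small conversion helper avoids counting zeros; minutes_to_ms is a hypothetical name used only for this sketch.

```shell
# Hypothetical helper: convert whole minutes to milliseconds.
minutes_to_ms() {
  echo $(( $1 * 60 * 1000 ))
}

# 15 minutes, matching the value shown above.
CONTEXTRELAY_TURN_MAX_MS=$(minutes_to_ms 15)
export CONTEXTRELAY_TURN_MAX_MS
echo "$CONTEXTRELAY_TURN_MAX_MS"
```

This prints 900000; minutes_to_ms 5 yields the default of 300000.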

"It keeps crashing on restart"

A restart loop is almost always self-inflicted by force-restart thrash: orphaned Codex app-server processes plus over-strict port classification, fed by repeated hard restarts. The stale-bundle / version-mismatch warning from doctor is advisory - it does not mean you must force-restart.

Fix: stop cleanly once, then relaunch a single time when the session is idle.

ctxrelay kill          # clean stop of THIS project instance
# …then a single relaunch:
ctxrelay pair          # or: ctxrelay claude / ctxrelay codex

ctxrelay kill marks the daemon as intentionally stopped before terminating it, which closes the reconnect race that a raw process kill would open, and it cleans up stale state files. For a genuine emergency across every project, the all-instances stop is:

ctxrelay kill --all

To stop only one named session's Codex runtime while leaving the daemon, Claude, and other sessions running:

ctxrelay kill --session <id>

After a kill, start a fresh Claude Code conversation (or run /resume) so Claude fully reconnects to the relaunched daemon.

"Codex has no ContextRelay tools"

If Codex cannot see send_to_claude, handoff_to_claude, read_context, and the other Codex-side MCP tools, its MCP registration is missing.

Fix:

ctxrelay codex-mcp status     # show the current registration
ctxrelay codex-mcp install    # register the ContextRelay MCP server for Codex

Once installed, the registration is global, so any codex session in the project picks up the tools - not only sessions launched with ctxrelay codex. If you instead want plain codex windows to stop auto-attaching, remove it:

ctxrelay codex-mcp remove

(ctxrelay codex launches Codex connected to the daemon directly; codex-mcp controls whether the tools are registered for standalone Codex sessions.)

"Stale Claude attachment" or "a live MCP call timed out"

If ctxrelay status shows Claude as attached but the foreground Claude is gone, clear the stale attachment without disturbing Codex or the daemon:

ctxrelay detach-claude

This detaches the active Claude foreground only; Codex and the daemon keep running. If no Claude was attached, it tells you so.

For long reviews, prefer durable messaging over live calls. Live deliberation and wait tools (deliberate_with_codex / wait_for_messages and the Codex-side equivalents) are bounded and can time out at the bridge layer for multi-minute work. When you expect a long turn, have the peer post a reply plus an append_note and pick it up from the ledger with read_context, instead of holding a live deliberation open.

"Codex stopped taking turns" (provider rate limits)

When Codex rejects a turn/start because of provider rate limits, ContextRelay does not pretend the turn succeeded: the rejection is recorded as a turn_aborted runtime event and Claude receives a system_turn_aborted notice naming the reason. Queued Claude-to-Codex injections are not retried immediately after a quota abort, so the pair does not burn through the rest of your quota in a retry loop.

Fix: this is upstream quota, not a ContextRelay fault. Wait for the limit to reset (or switch the Codex model/account), then resend the message or handoff. The daemon stays healthy throughout - ctxrelay status should still show it up.

"ContextRelay won't activate" / "is it even on?"

A recent ContextRelay release supports a dormant-by-default mode (see Activation: auto-connect vs dormant), so "nothing is happening" can mean the session resolved to dormant rather than broken.

Fix: ask the gate why it decided what it decided.

ctxrelay gate-check --why     # prints "active - <reason>" or "dormant - <reason>"
ctxrelay gate-check --json    # machine-readable {active, reason}

gate-check exits 0 when active and 1 when dormant. The activation reason is resolved by a fixed precedence - remember the top two rules when a session surprises you:

The env override CONTEXTRELAY_AUTO_CONNECT (0/1/true/false) beats everything.
A per-workspace attach marker beats project and global config.

So if a session is unexpectedly active, check for CONTEXTRELAY_AUTO_CONNECT in your environment and for an attach marker (left by ctxrelay attach). To opt the current workspace in or out in-session:

ctxrelay attach      # write the activation marker for this workspace
ctxrelay detach      # remove it (does not stop the daemon or Codex)
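Since the env override sits at the top of the precedence order, it is worth ruling out first. The sketch below only inspects the variable in the current shell; override_verdict is a hypothetical helper, and it assumes 1/true force an active session and 0/false force a dormant one. How ContextRelay treats an empty or unrecognized value is not covered here, so the helper just labels those cases.

```shell
# Hypothetical helper: report what the env override alone would decide.
# Assumption: 1/true mean "active", 0/false mean "dormant".
override_verdict() {
  case "${CONTEXTRELAY_AUTO_CONNECT-}" in
    1|true)  echo "active" ;;
    0|false) echo "dormant" ;;
    "")      echo "unset" ;;          # unset or empty: lower rules apply
    *)       echo "unrecognized" ;;   # behavior for other values is not assumed
  esac
}

# Try a value in a subshell so the current environment is left untouched.
(CONTEXTRELAY_AUTO_CONNECT=false; override_verdict)
```

The subshell call prints dormant. When the helper reports unset, ctxrelay gate-check --why remains the authoritative answer for which lower-precedence rule applied.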

Plugin or instruction blocks look out of date after an update

After updating the npm package (npm i -g @proofofwork-agency/contextrelay@latest), reconcile the Claude/Codex-facing surface with ctxrelay upgrade:

ctxrelay upgrade
ctxrelay upgrade --dry-run    # preview every change, write nothing

upgrade is idempotent and safe to re-run. It:

migrate-merges .contextrelay/config.json, adding new default keys while preserving your existing values (it does not delete your settings or change your coordinator);
refreshes the managed CLAUDE.md / AGENTS.md blocks in place, preserving each file's slim (dormant) or full state;
refreshes the bare /contextrelay command only if it is already present;
re-registers and reinstalls the Claude plugin (skip with --no-plugin);
prints the from → to version and reminds you to run /reload-plugins in a running Claude Code so the new plugin loads.

Use --instructions refresh|project|global|both|skip to control instruction handling (default refresh touches only files that already carry a managed block).

On an older release without ctxrelay upgrade?

Use the manual fallback: ctxrelay dev (for local source checkouts) or ctxrelay instructions install to refresh the managed blocks, then ctxrelay doctor to confirm the plugin registration is current. See Upgrading ContextRelay.

Ports and state hygiene

Stale lock and pid files under .contextrelay/state/ (such as daemon.lock, daemon.pid, and codex-tui.pid) are detected by ctxrelay doctor and cleaned up by a clean ctxrelay kill. You rarely need to delete them by hand.
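If you want to confirm by hand that a pid file really is stale before touching it, the usual technique is to probe the recorded process with signal 0. This is a sketch of that general technique, not of how doctor is implemented; pid_file_is_stale is a hypothetical helper, and it assumes the pid file contains a bare decimal process id.

```shell
# Hypothetical helper: exit 0 when the pid file names a process that is no
# longer running, 1 when the process is alive, 2 when the file is missing
# or does not contain a plain numeric pid.
pid_file_is_stale() {
  [ -f "$1" ] || return 2
  pid=$(tr -d '[:space:]' < "$1")
  case "$pid" in
    ''|*[!0-9]*) return 2 ;;
  esac
  # Signal 0 delivers nothing; it only tests whether the process exists.
  if kill -0 "$pid" 2>/dev/null; then
    return 1
  fi
  return 0
}

# Example use, commented out so sourcing this sketch has no side effects:
# pid_file_is_stale .contextrelay/state/daemon.pid && echo "daemon.pid is stale"
```

One caveat: kill -0 also fails when the process exists but belongs to another user, so treat a "stale" result as a hint and still prefer ctxrelay kill for the actual cleanup.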

If you override ports with environment variables, set all three or none - partial port overrides are rejected:

export CODEX_WS_PORT=4600
export CODEX_PROXY_PORT=4601
export CONTEXTRELAY_CONTROL_PORT=4602

If you only need to know which ports a project is using, prefer ctxrelay instances (or ctxrelay status --json) over guessing.
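The all-or-none rule for port overrides can be checked before launch. The sketch below mirrors the stated rule only; check_port_overrides is a hypothetical helper, and ContextRelay's own validation and error text may differ.

```shell
# Hypothetical pre-launch check: the three port overrides must be set
# together or not at all.
check_port_overrides() {
  count=0
  for value in "${CODEX_WS_PORT-}" "${CODEX_PROXY_PORT-}" "${CONTEXTRELAY_CONTROL_PORT-}"; do
    if [ -n "$value" ]; then
      count=$((count + 1))
    fi
  done
  case "$count" in
    0) echo "no overrides: using assigned ports" ;;
    3) echo "all three overrides set" ;;
    *) echo "partial override ($count of 3 set)" >&2
       return 1 ;;
  esac
}

# Example use, commented out so sourcing this sketch has no side effects:
# check_port_overrides && ctxrelay pair
```

Counting the set variables rather than testing each combination keeps the check short and makes the failure message say how many are missing.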

Next steps

Frequently asked questions - shorter answers to common "how do I…" questions.
Architecture overview - how the daemon, bridge, and ledger fit together, which explains why these symptoms happen.
CLI command reference - every command, action, and flag in one place.
Environment variables reference - including CONTEXTRELAY_TURN_MAX_MS, CONTEXTRELAY_AUTO_CONNECT, and the port overrides referenced above.
