Exchange Outage Recovery Playbook
This playbook outlines actionable steps to follow when a connected exchange experiences an outage. It prioritizes protection of arbitrage positions and capital, quick detection and containment, safe reconnection procedures, and communication workflows. Use this guide to reduce financial and operational risk during downtime events.
Overview
Exchange outages can be caused by infrastructure failures, DDoS attacks, software bugs, or maintenance. For arbitrage operators the primary risks are stalled orders, stuck positions, price divergence, and liquidity asymmetry across venues. This section describes detection signals and initial containment goals.
Immediate Actions (Step-by-step)
Detect & Confirm
Confirm outage via multiple signals: API errors, order execution failures, websocket disconnects, third-party monitors, and community reports. Avoid single-source assumptions.
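As a minimal sketch of the "multiple signals" rule, the check below declares an outage only when at least two distinct sources agree. The `Signal` type, the source names, and the `min_sources` threshold are illustrative assumptions, not part of any exchange API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One health observation from a single source (hypothetical type)."""
    name: str        # e.g. "rest_errors", "ws_disconnect", "status_page"
    unhealthy: bool  # True if this source currently indicates a problem


def outage_confirmed(signals: list[Signal], min_sources: int = 2) -> bool:
    """Return True only when at least `min_sources` distinct sources are unhealthy.

    Sources are de-duplicated by name so that one noisy feed reporting
    twice does not count as independent confirmation.
    """
    unhealthy_sources = {s.name for s in signals if s.unhealthy}
    return len(unhealthy_sources) >= min_sources
```

A single failing REST call would not trip this check on its own; it takes a second source, such as a websocket disconnect or a third-party monitor, to confirm.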
Contain & Isolate
Halt new order placements to the affected exchange, cancel pending orders where the API still accepts cancellations, and move risk-sensitive processes to read-only mode. Prevent automated strategies from sending further orders to the venue while it is down.
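One way to sketch containment is a per-venue gate that every strategy consults before sending an order, combined with a best-effort cancel pass. The names `VenueGate`, `contain`, and the `cancel_order` callable are hypothetical. The sketch assumes your own exchange client raises an exception when a cancel fails.

```python
from enum import Enum
from typing import Callable, Iterable


class VenueMode(Enum):
    ACTIVE = "active"
    READ_ONLY = "read_only"


class VenueGate:
    """Tracks which venues may receive new orders (hypothetical helper)."""

    def __init__(self) -> None:
        self._modes: dict[str, VenueMode] = {}

    def isolate(self, venue: str) -> None:
        self._modes[venue] = VenueMode.READ_ONLY

    def can_place_order(self, venue: str) -> bool:
        # Venues never marked are treated as active.
        return self._modes.get(venue, VenueMode.ACTIVE) is VenueMode.ACTIVE


def contain(
    gate: VenueGate,
    venue: str,
    open_order_ids: Iterable[str],
    cancel_order: Callable[[str, str], None],
) -> list[str]:
    """Isolate the venue first, then try to cancel each open order.

    Returns the order IDs whose cancellation failed so they can be
    reconciled after the venue recovers.
    """
    gate.isolate(venue)  # block new orders before anything else
    uncancelled: list[str] = []
    for order_id in open_order_ids:
        try:
            cancel_order(venue, order_id)
        except Exception:
            uncancelled.append(order_id)
    return uncancelled
```

Isolating before cancelling means a strategy cannot slip in a new order while the cancel pass is still running.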
Protect Positions
If you can safely hedge exposure on other venues, do so conservatively. Prioritize reducing directional risk and preserving collateral. Avoid aggressive liquidation attempts during partial connectivity.
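A conservative hedge can be sketched as a partial, capped offset of net exposure. The `hedge_fraction` and `max_clip` defaults below are placeholder assumptions; real values depend on your risk limits and on the liquidity available at the hedging venue.

```python
def hedge_order_size(
    net_exposure: float,
    hedge_fraction: float = 0.5,
    max_clip: float = 1.0,
) -> float:
    """Return a signed hedge size in base-asset units.

    The hedge takes the opposite sign to the exposure, covers only
    `hedge_fraction` of it, and is capped at `max_clip` per order so a
    stale position estimate cannot trigger an oversized trade.
    """
    if not 0.0 <= hedge_fraction <= 1.0:
        raise ValueError("hedge_fraction must be between 0 and 1")
    if max_clip <= 0:
        raise ValueError("max_clip must be positive")
    target = -net_exposure * hedge_fraction
    return max(-max_clip, min(max_clip, target))
```

For example, a long exposure of 3.0 units with the defaults yields a sell of 1.0 unit: half the exposure would be 1.5, but the clip limits each order to 1.0.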
Risk Mitigation Patterns
Pre-authorized Emergency Trades
Maintain small, pre-authorized trade plans and API keys used only for emergency hedging so you can act fast without manual approvals.
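A pre-authorized plan can be expressed as a small set of hard limits that the emergency path checks before sending anything. This is a sketch under assumed names (`EmergencyPlan`, `authorize`) and assumed limit types (symbol allow-list, per-trade cap, total budget); it does not model how the dedicated API keys are stored or scoped.

```python
from dataclasses import dataclass


@dataclass
class EmergencyPlan:
    """Limits agreed in advance for emergency hedging (hypothetical)."""
    allowed_symbols: frozenset[str]
    max_notional_per_trade: float
    max_total_notional: float
    used_notional: float = 0.0

    def authorize(self, symbol: str, notional: float) -> bool:
        """Approve a trade only if it fits every pre-agreed limit.

        On approval the notional is added to the running total, so the
        plan's budget is consumed across the whole incident.
        """
        if symbol not in self.allowed_symbols:
            return False
        if notional <= 0 or notional > self.max_notional_per_trade:
            return False
        if self.used_notional + notional > self.max_total_notional:
            return False
        self.used_notional += notional
        return True
```

Anything outside the plan falls back to manual approval, which keeps the fast path small and easy to audit.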
Cross-exchange Liquidity Pools
Keep liquidity across multiple venues to deploy hedges and reduce single-exchange dependency. Monitor funding rates and funding reset windows during outages.
Circuit Breakers & Throttles
Implement automated circuit breakers that pause strategies after repeated failures, with a controlled restart procedure once systems are verified.
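The pattern above could look like the following sketch: the breaker opens after a run of consecutive failures and closes again only when a cooldown has elapsed and a verification step has passed. The thresholds and the `verified` flag are assumptions; what counts as "verified" is left to your own health checks.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Pauses a strategy after repeated consecutive failures."""

    def __init__(
        self,
        max_failures: int = 3,
        cooldown_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def record_success(self) -> None:
        # A success resets the consecutive-failure count.
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures and self._opened_at is None:
            self._opened_at = self._clock()  # open the breaker

    def allow(self) -> bool:
        """True while the breaker is closed and trading may proceed."""
        return self._opened_at is None

    def try_reset(self, verified: bool) -> bool:
        """Controlled restart: close only after cooldown AND verification."""
        if self._opened_at is None:
            return True
        if verified and self._clock() - self._opened_at >= self.cooldown_s:
            self._failures = 0
            self._opened_at = None
            return True
        return False
```

The breaker never closes on its own; `try_reset` has to be called explicitly, which keeps the restart a deliberate step rather than a timer side effect.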
Reconnection & Recovery Checklist
1. Verify Exchange Status: Check official status pages, API health endpoints, and multiple independent monitors.
2. Test Non-destructive Calls: Poll market data and account read endpoints before any write actions.
3. Resume Slowly: Re-enable order placement with tight throttles and small sizes; validate fills on small test trades.
4. Reconcile Balances: Recompute on-chain/venue balances and ensure no orphaned positions or stuck orders remain.
5. Monitor: Intensify monitoring for the next 24–72 hours for residual instability or reconciliation issues.
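The reconciliation step in the checklist can be sketched as two comparisons between your internal books and what the venue reports after recovery. The function names and the dictionary and set inputs are assumptions; fetching the venue-side data is left to your own client.

```python
def reconcile_balances(
    local: dict[str, float],
    venue: dict[str, float],
    tolerance: float = 1e-8,
) -> dict[str, float]:
    """Return per-asset differences (venue minus local) above `tolerance`.

    Assets missing on either side are treated as a zero balance, so an
    asset that appears only at the venue shows up as a positive diff.
    """
    diffs: dict[str, float] = {}
    for asset in sorted(set(local) | set(venue)):
        delta = venue.get(asset, 0.0) - local.get(asset, 0.0)
        if abs(delta) > tolerance:
            diffs[asset] = delta
    return diffs


def find_order_mismatches(
    local_open: set[str],
    venue_open: set[str],
) -> tuple[set[str], set[str]]:
    """Return (orphaned, missing) order IDs.

    orphaned: open at the venue but unknown locally.
    missing:  tracked locally as open but no longer open at the venue
              (possibly filled or cancelled during the outage).
    """
    return venue_open - local_open, local_open - venue_open
```

An empty result from both functions is a reasonable precondition for lifting throttles; any non-empty result should be investigated before sizes are increased.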
Communication & Monitoring
Alert stakeholders, log timelines, and keep a public-facing status channel updated. Use detailed post-mortems to refine playbooks and automate parts of the recovery process.
Prepare Your Ops
Build automated monitors, store emergency runbooks where the team can reach them during an incident, and keep cross-exchange liquidity ready. Start from an operational checklist and integrate it with your team's alerts.
Conclusion
A tested outage recovery playbook minimizes surprises and financial loss. Routine drills, clear isolation rules, and conservative reconnection policies make the difference between a recoverable event and a catastrophic one.