Incident Response Runbook
Overview
This runbook defines procedures for handling incidents in the trading system. Speed and accuracy are critical: when money is at risk, every second counts.
Incident Severity Levels
| Level | Description | Response Time | Examples |
|---|---|---|---|
| P1 | Capital at immediate risk | < 5 minutes | Runaway position, system failure during trade |
| P2 | Trading impaired | < 30 minutes | Can't open new positions, delayed signals |
| P3 | Degraded performance | < 4 hours | Higher latency, partial data |
| P4 | Minor issue | Next business day | Cosmetic bugs, logging gaps |
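The response targets in the table can be encoded so that monitoring or paging code can check whether an incident is overdue. The following is a minimal sketch, not part of the existing codebase: the names `Severity`, `RESPONSE_TARGET`, and `is_overdue` are hypothetical, and P4 is left without a fixed target because "next business day" depends on a trading calendar that is not modeled here.

```python
from datetime import timedelta
from enum import Enum
from typing import Dict, Optional


class Severity(Enum):
    P1 = "Capital at immediate risk"
    P2 = "Trading impaired"
    P3 = "Degraded performance"
    P4 = "Minor issue"


# Response-time targets from the severity table. None means
# "next business day", which this sketch does not try to compute.
RESPONSE_TARGET: Dict[Severity, Optional[timedelta]] = {
    Severity.P1: timedelta(minutes=5),
    Severity.P2: timedelta(minutes=30),
    Severity.P3: timedelta(hours=4),
    Severity.P4: None,
}


def is_overdue(severity: Severity, elapsed: timedelta) -> bool:
    """Return True if the response target for this severity has been missed."""
    target = RESPONSE_TARGET[severity]
    # The table uses strict "<" targets, so reaching the limit counts as a miss.
    return target is not None and elapsed >= target
```

For example, `is_overdue(Severity.P2, timedelta(minutes=45))` reports a missed target, while any P4 incident is never flagged by this check.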
Shutdown Triggers
Automatic Shutdown (Kill Switch)
| Trigger | Threshold | Action |
|---|---|---|
| Daily loss | > 3% | Close all, halt trading |
| Total drawdown | > 6% | Close all, halt trading |
| Connection lost | > 60 seconds | Close all, halt trading |
| Error rate | > 10 errors/minute | Pause new entries |
| Price anomaly | Spread > 5x normal | Pause new entries |
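The trigger table maps naturally onto a single evaluation function. The sketch below is illustrative only: `RiskSnapshot`, `Action`, and `evaluate_triggers` are hypothetical names, and how each metric is measured (for example, what counts as the "normal" spread) is an assumption left to the real monitoring code.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = 0
    PAUSE_NEW_ENTRIES = 1
    CLOSE_ALL_AND_HALT = 2


@dataclass
class RiskSnapshot:
    daily_loss_pct: float        # today's loss as % of equity (positive = loss)
    total_drawdown_pct: float    # drawdown from peak equity, in %
    seconds_disconnected: float  # time since the connection was lost
    errors_per_minute: float
    spread_ratio: float          # current spread divided by normal spread


def evaluate_triggers(s: RiskSnapshot) -> Action:
    """Return the most severe action required by the thresholds in the table."""
    # Hard triggers: close everything and halt trading.
    if (s.daily_loss_pct > 3.0
            or s.total_drawdown_pct > 6.0
            or s.seconds_disconnected > 60):
        return Action.CLOSE_ALL_AND_HALT
    # Soft triggers: keep existing positions but stop opening new ones.
    if s.errors_per_minute > 10 or s.spread_ratio > 5.0:
        return Action.PAUSE_NEW_ENTRIES
    return Action.NONE
```

Checking the hard triggers first means a snapshot that breaches both a hard and a soft threshold always results in a full shutdown.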
Manual Shutdown Triggers
Execute manual shutdown when:
- Unexpected market event (flash crash, halt)
- System behaving erratically
- Suspicious activity detected
- Unable to diagnose issue quickly
- Founder/Risk manager orders shutdown
Shutdown Command
# Emergency shutdown - all environments
python -m execution.emergency_shutdown --all --reason "description"
# Or MT5 specifically
python -m execution.mt5.kill_switch --activate --reason "description"
# Or via API
curl -X POST https://api.edgefactory.local/emergency/shutdown \
  -H "Authorization: Bearer ${EMERGENCY_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"reason": "description", "scope": "all"}'
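If the API route needs to be triggered from Python (for example, from a monitoring script), the same request can be built with the standard library. This is a sketch under assumptions: the function name `build_shutdown_request` is hypothetical, and the JSON content type is assumed rather than confirmed by the API. The function only builds the request so it can be inspected or tested without touching the live endpoint.

```python
import json
import urllib.request

SHUTDOWN_URL = "https://api.edgefactory.local/emergency/shutdown"


def build_shutdown_request(reason: str, token: str,
                           scope: str = "all") -> urllib.request.Request:
    """Build (but do not send) the emergency shutdown request."""
    body = json.dumps({"reason": reason, "scope": scope}).encode("utf-8")
    return urllib.request.Request(
        SHUTDOWN_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            # Assumption: the endpoint expects a JSON body.
            "Content-Type": "application/json",
        },
    )


# To actually send it (only during a real incident or a planned drill):
#   urllib.request.urlopen(build_shutdown_request("reason", token), timeout=10)
```

Separating request construction from sending keeps the dangerous step explicit and makes the payload easy to verify in a drill.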
Incident Response Procedures
P1: Capital at Immediate Risk
┌─────────────────────────────────────────────────────────────────┐
│           P1 RESPONSE TIMELINE (H:MM from detection)            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   0:00 ─── INCIDENT DETECTED                                    │
│     │                                                           │
│   0:01 ─── ACTIVATE KILL SWITCH (automatic or manual)           │
│     │      └── All positions closed                             │
│     │      └── All pending orders cancelled                     │
│     │      └── New trading halted                               │
│     │                                                           │
│   0:05 ─── ASSESS SITUATION                                     │
│     │      └── Current P&L                                      │
│     │      └── Open positions (should be zero)                  │
│     │      └── System status                                    │
│     │                                                           │
│   0:15 ─── NOTIFY STAKEHOLDERS                                  │
│     │      └── Founder (all channels)                           │
│     │      └── Log incident                                     │
│     │                                                           │
│   1:00 ─── BEGIN ROOT CAUSE ANALYSIS                            │
│     │                                                           │
│   4:00 ─── INITIAL FINDINGS DOCUMENTED                          │
│     │                                                           │
│  24:00 ─── POST-MORTEM COMPLETE                                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
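During an incident it helps to turn the timeline offsets into wall-clock deadlines. The following sketch reads the timeline as hours and minutes after detection (so 0:15 is fifteen minutes and 24:00 is twenty-four hours); `P1_MILESTONES` and `p1_deadlines` are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Milestones from the P1 timeline, as offsets from the moment of detection.
P1_MILESTONES = [
    (timedelta(minutes=1), "Activate kill switch"),
    (timedelta(minutes=5), "Assess situation"),
    (timedelta(minutes=15), "Notify stakeholders"),
    (timedelta(hours=1), "Begin root cause analysis"),
    (timedelta(hours=4), "Initial findings documented"),
    (timedelta(hours=24), "Post-mortem complete"),
]


def p1_deadlines(detected_at: datetime):
    """Return (deadline, milestone) pairs for an incident detected at the given time."""
    return [(detected_at + offset, label) for offset, label in P1_MILESTONES]


if __name__ == "__main__":
    for deadline, label in p1_deadlines(datetime.now(timezone.utc)):
        print(f"{deadline:%Y-%m-%d %H:%M} UTC  {label}")
```

Passing a timezone-aware UTC timestamp keeps the deadlines consistent with the UTC times used in the communication templates.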
P1 Checklist
## P1 Incident Response Checklist
### Immediate (0-5 minutes)
- [ ] Kill switch activated
- [ ] Positions verified closed
- [ ] Pending orders cancelled
- [ ] Screenshot current state
- [ ] Note exact time of incident
### Assessment (5-15 minutes)
- [ ] Current P&L calculated
- [ ] Affected accounts identified
- [ ] Error logs captured
- [ ] Market conditions noted
### Communication (15-30 minutes)
- [ ] Founder notified
- [ ] Incident logged in system
- [ ] External parties notified (if applicable)
### Investigation (30+ minutes)
- [ ] Root cause identified
- [ ] Fix or workaround determined
- [ ] Resume decision made
P2: Trading Impaired
1. ASSESS (0-5 min)
   - What specifically is not working?
   - Are open positions at risk?
   - Can we continue with degraded functionality?
2. MITIGATE (5-15 min)
   - If positions at risk: close them manually
   - If signals not generating: document and continue manually
   - If execution failing: switch to backup or manual
3. DIAGNOSE (15-60 min)
   - Check logs for errors
   - Test individual components
   - Identify root cause
4. FIX OR ESCALATE (60+ min)
   - If fixable: implement fix, test, resume
   - If not fixable: escalate to P1 if risk increases
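The MITIGATE step above is a small decision table, which can be sketched as a function that returns the steps that apply. The function name `p2_mitigations` and its three boolean inputs are hypothetical; how each condition is detected is left to the operator or to existing monitoring.

```python
from typing import List


def p2_mitigations(positions_at_risk: bool,
                   signals_generating: bool,
                   execution_working: bool) -> List[str]:
    """Return the mitigation steps from the P2 procedure that apply right now."""
    steps = []
    if positions_at_risk:
        steps.append("Close at-risk positions manually")
    if not signals_generating:
        steps.append("Document the outage and continue manually")
    if not execution_working:
        steps.append("Switch to backup or manual execution")
    return steps
```

An empty list means none of the listed mitigations apply and the responder can move straight to diagnosis.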
P3: Degraded Performance
1. DOCUMENT
   - Specific symptoms
   - Affected components
   - Impact on trading
2. ASSESS URGENCY
   - Is this getting worse?
   - Can we wait until non-trading hours?
3. SCHEDULE FIX
   - Create ticket
   - Plan fix for next maintenance window
   - Monitor for escalation
Rollback Procedures
Strategy Rollback
# 1. Stop current strategy signals
python -m execution.disable_strategy --strategy=ICT_QM
# 2. Identify previous version
git log --oneline strategies/pine_v6/ict_qm.pine | head -5
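# 3. Revert to previous version
#    (HEAD~1 restores the file as of the previous repository commit; if that
#    commit did not touch this file, substitute the hash identified in step 2)
git checkout HEAD~1 -- strategies/pine_v6/ict_qm.pine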
# 4. Deploy reverted version to TradingView
# (Manual: copy/paste to TradingView Pine Editor)
# 5. Re-enable with old version
python -m execution.enable_strategy --strategy=ICT_QM --version=previous
# 6. Verify signals
python -m execution.verify_signals --strategy=ICT_QM
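Step 2 lists the most recent commits that touched the strategy file, newest first, so the version to restore is the second entry. The sketch below picks that hash out of `git log --oneline` output; `previous_commit` is a hypothetical helper, and the hashes in the usage example are made up for illustration.

```python
from typing import Optional


def previous_commit(oneline_log: str) -> Optional[str]:
    """Return the hash of the second-newest commit in `git log --oneline` output."""
    # Each non-empty line looks like "<abbreviated-hash> <subject>".
    hashes = [line.split(maxsplit=1)[0]
              for line in oneline_log.splitlines() if line.strip()]
    # With fewer than two commits there is no previous version to roll back to.
    return hashes[1] if len(hashes) >= 2 else None


if __name__ == "__main__":
    sample = "a1b2c3d Tighten entry filter\n9f8e7d6 Initial strategy\n"  # illustrative
    print(previous_commit(sample))
```

Using the hash returned here in the checkout step avoids relying on `HEAD~1`, which refers to the previous commit in the repository rather than the previous change to this particular file.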
System Rollback
# 1. Identify last stable deployment
git tag --list 'v*' --sort=-creatordate | head -5
# 2. Checkout stable version (substitute the tag identified in step 1)
git checkout v1.2.2
# 3. Restart services
sudo systemctl restart edge-factory-api
sudo systemctl restart edge-factory-webhook
# 4. Verify health
curl http://localhost:8000/health
# 5. Test execution (paper mode)
python -m execution.test_pipeline --mode=paper
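Services may take a few seconds to come back after a restart, so a single health request in step 4 can fail spuriously. A minimal polling sketch follows; `wait_until_healthy` and `http_health_check` are hypothetical helpers, and treating HTTP 200 as "healthy" is an assumption about the `/health` endpoint.

```python
import time
import urllib.request
from typing import Callable


def wait_until_healthy(check: Callable[[], bool],
                       attempts: int = 10,
                       delay_s: float = 3.0) -> bool:
    """Call `check` until it returns True or the attempts run out."""
    for attempt in range(attempts):
        try:
            if check():
                return True
        except OSError:
            # Covers refused connections and urllib's URLError/HTTPError.
            pass
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False


def http_health_check(url: str = "http://localhost:8000/health") -> bool:
    """Assumption: the health endpoint answers 200 when the service is ready."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.status == 200


# Usage after restarting services:
#   if not wait_until_healthy(http_health_check):
#       raise SystemExit("Service did not become healthy; do not resume trading")
```

Taking the check as a callable keeps the retry logic independent of HTTP, so the same loop can wrap other readiness probes.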
Database Rollback (If Applicable)
# 1. Stop services
sudo systemctl stop edge-factory-api
# 2. Restore from backup
cp /backups/edge_factory_$(date -d "yesterday" +%Y%m%d).db /data/edge_factory.db
# 3. Restart services
sudo systemctl start edge-factory-api
# 4. Verify data integrity
python -m src.data.verify_integrity
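The `date -d "yesterday"` form in step 2 is specific to GNU `date`. If a portable way to locate the backup is needed, or if the file's existence should be confirmed before services are stopped, the path can be computed in Python. This is a sketch: `backup_path` is a hypothetical helper that assumes the `edge_factory_YYYYMMDD.db` naming shown above.

```python
from datetime import date, timedelta
from pathlib import Path


def backup_path(today: date, backup_dir: str = "/backups") -> Path:
    """Return the path of yesterday's backup, named edge_factory_YYYYMMDD.db."""
    yesterday = today - timedelta(days=1)
    return Path(backup_dir) / f"edge_factory_{yesterday:%Y%m%d}.db"


if __name__ == "__main__":
    candidate = backup_path(date.today())
    if candidate.is_file() and candidate.stat().st_size > 0:
        print(f"Backup found: {candidate}")
    else:
        print(f"Backup missing or empty: {candidate}")
```

Running the check before step 1 avoids stopping the API only to discover that there is nothing to restore from.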
Communication Templates
P1 Alert (Immediate)
🚨 P1 INCIDENT: [Brief Description]
Time: [HH:MM UTC]
Status: Kill switch ACTIVATED
Positions: CLOSED
Impact: [Description]
Investigating. Updates every 15 minutes.
P1 Update
📢 P1 UPDATE: [Brief Description]
Time: [HH:MM UTC]
Status: [Investigating/Mitigating/Resolved]
Root cause: [Known/Unknown/Suspected: X]
ETA to resolution: [Time or Unknown]
Next update: [Time]
P1 Resolution
✅ P1 RESOLVED: [Brief Description]
Time: [HH:MM UTC]
Duration: [X minutes/hours]
Root cause: [Description]
Resolution: [What was done]
Post-mortem: [Link or "Scheduled for DATE"]
Trading: [Resumed/Still halted - reason]
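Filling in templates by hand under pressure invites mistakes, so the initial alert can be rendered by a small function. The sketch below covers only the P1 alert template; `format_p1_alert` is a hypothetical name, and it expects a timezone-aware timestamp so the conversion to UTC is unambiguous.

```python
from datetime import datetime, timezone


def format_p1_alert(description: str, impact: str, at: datetime) -> str:
    """Render the P1 alert template with the time expressed in UTC."""
    stamp = at.astimezone(timezone.utc).strftime("%H:%M UTC")
    return "\n".join([
        f"🚨 P1 INCIDENT: {description}",
        f"Time: {stamp}",
        "Status: Kill switch ACTIVATED",
        "Positions: CLOSED",
        f"Impact: {impact}",
        "Investigating. Updates every 15 minutes.",
    ])


if __name__ == "__main__":
    print(format_p1_alert("Runaway position", "Under assessment",
                          datetime.now(timezone.utc)))
```

The fixed status lines mirror the template, so they should only be sent once the kill switch and position closure have actually been verified.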
Post-Mortem Template
# Incident Post-Mortem: [INCIDENT_ID]
## Summary
- **Date**: YYYY-MM-DD
- **Duration**: X hours Y minutes
- **Severity**: P1/P2/P3/P4
- **Impact**: [Description of impact]
## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | Incident began |
| HH:MM | Detected by [method] |
| HH:MM | Response initiated |
| HH:MM | [Key milestone] |
| HH:MM | Resolution confirmed |
## Root Cause
[Detailed description of what caused the incident]
## Impact Assessment
- **Financial**: $X loss / No financial impact
- **Positions affected**: X
- **Trades missed**: X
- **Data loss**: None / [Description]
## What Went Well
1. [Something that worked]
2. [Another thing]
## What Went Poorly
1. [Something that didn't work]
2. [Another thing]
## Action Items
| ID | Action | Owner | Due Date | Status |
|----|--------|-------|----------|--------|
| 1 | [Action] | [Name] | YYYY-MM-DD | Open |
| 2 | [Action] | [Name] | YYYY-MM-DD | Open |
## Lessons Learned
1. [Lesson]
2. [Lesson]
## Prevention
[What changes will prevent this from happening again]
---
Post-mortem author: [Name]
Post-mortem date: YYYY-MM-DD
Review date: YYYY-MM-DD
Incident Log
Maintain an ongoing log at docs/runbooks/incident_log.md:
# Incident Log
| ID | Date | Severity | Summary | Duration | Resolution |
|----|------|----------|---------|----------|------------|
| INC-2024-001 | 2024-01-15 | P2 | Webhook timeout | 45 min | Increased timeout |
| INC-2024-002 | 2024-01-20 | P3 | Slow data feed | 2 hours | Provider issue |
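Incident IDs in the log follow an `INC-YYYY-NNN` pattern. The sketch below derives the next ID from the existing log text and formats a new table row; both helper names are hypothetical, and restarting the counter at 001 each year is an assumption inferred from the two sample entries.

```python
import re


def next_incident_id(log_text: str, year: int) -> str:
    """Return the next INC-YYYY-NNN identifier for the given year."""
    numbers = [int(n) for n in re.findall(rf"INC-{year}-(\d{{3}})", log_text)]
    next_number = max(numbers, default=0) + 1
    return f"INC-{year}-{next_number:03d}"


def format_log_row(incident_id: str, date: str, severity: str,
                   summary: str, duration: str, resolution: str) -> str:
    """Render one Markdown table row in the incident log's column order."""
    cells = [incident_id, date, severity, summary, duration, resolution]
    return "| " + " | ".join(cells) + " |"
```

Appending the string from `format_log_row` to the end of the log file keeps the table intact, provided no cell contains a `|` character.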
Emergency Contacts
# config/emergency_contacts.yaml
contacts:
  founder:
    name: Kelvin Onwudinjo
    phone: "+353XXXXXXXXX"
    email: "kelvin@example.com"
    escalation_level: 1
  broker_support:
    name: "[Broker] Support"
    phone: "[Number]"
    hours: "24/7"
  exchange_support:
    name: "[Exchange] Support"
    url: "[Support URL]"
Next 7 Days: Action Plan
- Set up emergency notification system (Discord/Telegram)
- Configure automatic kill switch triggers
- Create incident response drill schedule
- Initialize incident log at docs/runbooks/incident_log.md
- Test rollback procedure in paper environment
- Document emergency contacts
- Create monitoring dashboard for incident detection