Incident Response Runbook

Overview

This runbook defines procedures for handling incidents in the trading system. Speed and accuracy are critical: when money is at risk, every second counts.

Incident Severity Levels

| Level | Description | Response Time | Examples |
|-------|-------------|---------------|----------|
| P1 | Capital at immediate risk | < 5 minutes | Runaway position, system failure during trade |
| P2 | Trading impaired | < 30 minutes | Can't open new positions, delayed signals |
| P3 | Degraded performance | < 4 hours | Higher latency, partial data |
| P4 | Minor issue | Next business day | Cosmetic bugs, logging gaps |
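
The response-time targets above can be turned into concrete deadlines. The following is a minimal sketch, not part of the existing codebase: the names `RESPONSE_TARGETS` and `response_deadline` are hypothetical, and P4 ("next business day") is represented as `None` because it has no fixed duration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Response-time targets taken from the severity table.
# P4 ("next business day") has no fixed duration, so it maps to None.
RESPONSE_TARGETS = {
    "P1": timedelta(minutes=5),
    "P2": timedelta(minutes=30),
    "P3": timedelta(hours=4),
    "P4": None,
}


def response_deadline(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Return the latest time a response should begin, or None for P4."""
    target = RESPONSE_TARGETS[severity]  # raises KeyError for unknown levels
    return None if target is None else detected_at + target


if __name__ == "__main__":
    detected = datetime.now(timezone.utc)
    print("P1 response due by:", response_deadline("P1", detected))
```

An unknown severity raises `KeyError` rather than falling back to a default, so a typo in the level cannot silently relax a deadline.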

Shutdown Triggers

Automatic Shutdown (Kill Switch)

| Trigger | Threshold | Action |
|---------|-----------|--------|
| Daily loss | > 3% | Close all, halt trading |
| Total drawdown | > 6% | Close all, halt trading |
| Connection lost | > 60 seconds | Close all, halt trading |
| Error rate | > 10 errors/minute | Pause new entries |
| Price anomaly | Spread > 5x normal | Pause new entries |
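
The trigger table can be expressed as a single evaluation function. This is an illustrative sketch only: `RiskSnapshot`, its field names, and the returned action strings are assumptions made for the example, not the actual kill switch implementation. The thresholds are the ones listed in the table.

```python
from dataclasses import dataclass


@dataclass
class RiskSnapshot:
    """Hypothetical point-in-time view of the metrics the triggers use."""
    daily_loss_pct: float        # loss today, as a positive percentage
    total_drawdown_pct: float    # drawdown from peak, as a positive percentage
    seconds_disconnected: float  # time since the broker connection was lost
    errors_per_minute: float
    spread_ratio: float          # current spread divided by normal spread


def kill_switch_action(s: RiskSnapshot) -> str:
    """Map a snapshot to the most severe action any trigger requires."""
    # Hard triggers: close all positions and halt trading.
    if (s.daily_loss_pct > 3.0
            or s.total_drawdown_pct > 6.0
            or s.seconds_disconnected > 60):
        return "CLOSE_ALL_AND_HALT"
    # Soft triggers: keep existing positions but stop opening new ones.
    if s.errors_per_minute > 10 or s.spread_ratio > 5.0:
        return "PAUSE_NEW_ENTRIES"
    return "CONTINUE"
```

Hard triggers are checked first so that a snapshot breaching both a hard and a soft threshold always results in a full shutdown.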

Manual Shutdown Triggers

Execute a manual shutdown when:

- An unexpected market event occurs (flash crash, trading halt)
- The system is behaving erratically
- Suspicious activity is detected
- The issue cannot be diagnosed quickly
- The founder or risk manager orders a shutdown

Shutdown Command

# Emergency shutdown - all environments
python -m execution.emergency_shutdown --all --reason "description"

# Or MT5 specifically
python -m execution.mt5.kill_switch --activate --reason "description"

# Or via API
curl -X POST https://api.edgefactory.local/emergency/shutdown \
  -H "Authorization: Bearer ${EMERGENCY_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"reason": "description", "scope": "all"}'
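
If the API call needs to be issued from a script rather than from a shell, the same request can be built with the Python standard library. This is a sketch under assumptions: it reuses the URL, token variable, and JSON body from the curl example, assumes the endpoint expects a JSON content type, and the function name `build_shutdown_request` is hypothetical.

```python
import json
import os
import urllib.request
from typing import Optional

SHUTDOWN_URL = "https://api.edgefactory.local/emergency/shutdown"


def build_shutdown_request(reason: str, scope: str = "all",
                           token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) the emergency shutdown POST request."""
    if token is None:
        # Same environment variable the curl example relies on.
        token = os.environ["EMERGENCY_TOKEN"]
    body = json.dumps({"reason": reason, "scope": scope}).encode("utf-8")
    return urllib.request.Request(
        SHUTDOWN_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumed to be required
        },
    )


def send_shutdown(reason: str, scope: str = "all") -> int:
    """Send the request and return the HTTP status code."""
    request = build_shutdown_request(reason, scope)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Building the request separately from sending it makes the payload and headers easy to inspect in a drill without triggering a real shutdown.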

Incident Response Procedures

P1: Capital at Immediate Risk

┌─────────────────────────────────────────────────────────────────┐
│                    P1 RESPONSE TIMELINE                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  0:00 ─── INCIDENT DETECTED                                    │
│    │                                                            │
│  0:01 ─── ACTIVATE KILL SWITCH (automatic or manual)           │
│    │      └── All positions closed                             │
│    │      └── All pending orders cancelled                     │
│    │      └── New trading halted                               │
│    │                                                            │
│  0:05 ─── ASSESS SITUATION                                     │
│    │      └── Current P&L                                      │
│    │      └── Open positions (should be zero)                  │
│    │      └── System status                                    │
│    │                                                            │
│  0:15 ─── NOTIFY STAKEHOLDERS                                  │
│    │      └── Founder (all channels)                           │
│    │      └── Log incident                                     │
│    │                                                            │
│  1:00 ─── BEGIN ROOT CAUSE ANALYSIS                            │
│    │                                                            │
│  4:00 ─── INITIAL FINDINGS DOCUMENTED                          │
│    │                                                            │
│ 24:00 ─── POST-MORTEM COMPLETE                                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
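
The timeline above can be converted into wall-clock deadlines once the detection time is known. The sketch below assumes the offsets are read as hours:minutes after detection (so 0:15 is fifteen minutes and 24:00 is twenty-four hours). The names `P1_MILESTONES`, `p1_schedule`, and `overdue_steps` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Offsets from detection, read as H:MM from the P1 response timeline.
P1_MILESTONES = [
    (timedelta(minutes=1), "Activate kill switch"),
    (timedelta(minutes=5), "Assess situation"),
    (timedelta(minutes=15), "Notify stakeholders"),
    (timedelta(hours=1), "Begin root cause analysis"),
    (timedelta(hours=4), "Initial findings documented"),
    (timedelta(hours=24), "Post-mortem complete"),
]


def p1_schedule(detected_at: datetime) -> list:
    """Return (due_time, step) pairs for a P1 detected at `detected_at`."""
    return [(detected_at + offset, step) for offset, step in P1_MILESTONES]


def overdue_steps(detected_at: datetime, now: datetime, completed: set) -> list:
    """List steps whose due time has passed and that are not yet completed."""
    return [step for due, step in p1_schedule(detected_at)
            if due <= now and step not in completed]


if __name__ == "__main__":
    start = datetime.now(timezone.utc)
    for due, step in p1_schedule(start):
        print(f"{due:%H:%M} UTC  {step}")
```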

P1 Checklist

## P1 Incident Response Checklist

### Immediate (0-5 minutes)
- [ ] Kill switch activated
- [ ] Positions verified closed
- [ ] Pending orders cancelled
- [ ] Screenshot current state
- [ ] Note exact time of incident

### Assessment (5-15 minutes)
- [ ] Current P&L calculated
- [ ] Affected accounts identified
- [ ] Error logs captured
- [ ] Market conditions noted

### Communication (15-30 minutes)
- [ ] Founder notified
- [ ] Incident logged in system
- [ ] External parties notified (if applicable)

### Investigation (30+ minutes)
- [ ] Root cause identified
- [ ] Fix or workaround determined
- [ ] Resume decision made

P2: Trading Impaired

1. ASSESS (0-5 min)
   - What specifically is not working?
   - Are open positions at risk?
   - Can we continue with degraded functionality?

2. MITIGATE (5-15 min)
   - If positions at risk: close them manually
   - If signals not generating: document and continue manually
   - If execution failing: switch to backup or manual

3. DIAGNOSE (15-60 min)
   - Check logs for errors
   - Test individual components
   - Identify root cause

4. FIX OR ESCALATE (60+ min)
   - If fixable: implement fix, test, resume
   - If not fixable: keep mitigations in place; escalate to P1 if risk increases

P3: Degraded Performance

1. DOCUMENT
   - Specific symptoms
   - Affected components
   - Impact on trading

2. ASSESS URGENCY
   - Is this getting worse?
   - Can we wait until non-trading hours?

3. SCHEDULE FIX
   - Create ticket
   - Plan fix for next maintenance window
   - Monitor for escalation

Rollback Procedures

Strategy Rollback

# 1. Stop current strategy signals
python -m execution.disable_strategy --strategy=ICT_QM

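# 2. Identify previous version (the second entry is the prior revision of this file)
git log --oneline -5 -- strategies/pine_v6/ict_qm.pine

# 3. Revert to that revision (HEAD~1 is only correct if the latest commit changed this file)
git checkout <previous-commit> -- strategies/pine_v6/ict_qm.pine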

# 4. Deploy reverted version to TradingView
# (Manual: copy/paste to TradingView Pine Editor)

# 5. Re-enable with old version
python -m execution.enable_strategy --strategy=ICT_QM --version=previous

# 6. Verify signals
python -m execution.verify_signals --strategy=ICT_QM
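
Identifying the revision to roll back to can be scripted. Note that `HEAD~1` is the previous commit of the whole repository, which matches the previous version of the strategy file only when the most recent commit changed that file. The sketch below instead asks git for the last two commits that touched the path. The function names are hypothetical, and the script assumes it runs from inside the repository.

```python
import subprocess


def previous_revision(log_output: str) -> str:
    """Pick the second-newest commit hash from `git log --format=%H` output."""
    hashes = [line.strip() for line in log_output.splitlines() if line.strip()]
    if len(hashes) < 2:
        raise ValueError("file has no earlier revision to roll back to")
    return hashes[1]


def find_previous_revision(path: str) -> str:
    """Return the hash of the commit that last changed `path` before the current one."""
    result = subprocess.run(
        ["git", "log", "--format=%H", "-n", "2", "--", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return previous_revision(result.stdout)


if __name__ == "__main__":
    target = "strategies/pine_v6/ict_qm.pine"
    commit = find_previous_revision(target)
    # Print the command instead of running it, so the operator confirms it.
    print(f"git checkout {commit} -- {target}")
```

Printing the checkout command rather than executing it keeps a human in the loop for the actual revert.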

System Rollback

# 1. Identify last stable deployment
git tag --list 'v*' --sort=-creatordate | head -5

# 2. Checkout stable version (use the tag identified in step 1; v1.2.2 is an example)
git checkout v1.2.2

# 3. Restart services
sudo systemctl restart edge-factory-api
sudo systemctl restart edge-factory-webhook

# 4. Verify health
curl http://localhost:8000/health

# 5. Test execution (paper mode)
python -m execution.test_pipeline --mode=paper
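
Services may take a few seconds to come back after a restart, so a single health request can fail even when the rollback succeeded. A minimal polling sketch follows. It assumes the health endpoint returns HTTP 200 when the service is ready, and the function names and retry defaults are illustrative.

```python
import time
import urllib.request


def _http_status(url: str) -> int:
    """Fetch `url` and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.status


def wait_until_healthy(url: str = "http://localhost:8000/health",
                       attempts: int = 10,
                       delay_s: float = 2.0,
                       fetch=None,
                       sleep=time.sleep) -> bool:
    """Poll the health endpoint until it returns 200 or attempts run out."""
    fetch = fetch or _http_status
    for attempt in range(attempts):
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            # Covers refused connections, timeouts, and urllib HTTP errors.
            pass
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


if __name__ == "__main__":
    print("healthy" if wait_until_healthy() else "NOT healthy: do not resume trading")
```

The `fetch` and `sleep` parameters exist so the loop can be exercised in a drill without a running service.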

Database Rollback (If Applicable)

# 1. Stop services
sudo systemctl stop edge-factory-api

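# 2. Restore from backup (keep a copy of the current file first; the .pre_rollback name is an example)
cp /data/edge_factory.db /data/edge_factory.db.pre_rollback
cp /backups/edge_factory_$(date -d "yesterday" +%Y%m%d).db /data/edge_factory.db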

# 3. Restart services
sudo systemctl start edge-factory-api

# 4. Verify data integrity
python -m src.data.verify_integrity
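
The restore step can be made safer by checking that yesterday's backup exists and by preserving the current database before overwriting it. The following is a sketch under assumptions: it follows the `edge_factory_YYYYMMDD.db` naming used in the command above, the `.pre_rollback` suffix and function names are hypothetical, and services are expected to be stopped before it runs.

```python
import shutil
from datetime import date, timedelta
from pathlib import Path


def backup_path(backup_dir: Path, day: date) -> Path:
    """Return the expected backup file for `day` (edge_factory_YYYYMMDD.db)."""
    return backup_dir / f"edge_factory_{day:%Y%m%d}.db"


def restore_backup(backup_dir: Path, live_db: Path, today: date) -> Path:
    """Restore yesterday's backup over `live_db`, keeping a copy of the old file."""
    source = backup_path(backup_dir, today - timedelta(days=1))
    if not source.is_file():
        raise FileNotFoundError(f"no backup found at {source}")
    if live_db.exists():
        # Preserve the current database for later analysis.
        shutil.copy2(live_db, live_db.with_name(live_db.name + ".pre_rollback"))
    shutil.copy2(source, live_db)
    return source


if __name__ == "__main__":
    used = restore_backup(Path("/backups"), Path("/data/edge_factory.db"), date.today())
    print(f"restored from {used}")
```

Failing loudly when the backup is missing avoids the case where a plain copy aborts halfway through an incident and the operator has to work out why.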

Communication Templates

P1 Alert (Immediate)

🚨 P1 INCIDENT: [Brief Description]

Time: [HH:MM UTC]
Status: Kill switch ACTIVATED
Positions: CLOSED
Impact: [Description]

Investigating. Updates every 15 minutes.
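
Filling the alert template by hand costs time during a P1. A small formatter, sketched below, produces the same text from three inputs. The function name `format_p1_alert` is hypothetical, and the sketch assumes the kill switch has already been activated and positions closed, as the template states.

```python
from datetime import datetime, timezone

P1_ALERT_TEMPLATE = (
    "🚨 P1 INCIDENT: {summary}\n"
    "\n"
    "Time: {time:%H:%M} UTC\n"
    "Status: Kill switch ACTIVATED\n"
    "Positions: CLOSED\n"
    "Impact: {impact}\n"
    "\n"
    "Investigating. Updates every 15 minutes."
)


def format_p1_alert(summary: str, impact: str, detected_at: datetime) -> str:
    """Render the P1 alert message with the detection time expressed in UTC."""
    if detected_at.tzinfo is None:
        raise ValueError("detected_at must be timezone-aware")
    return P1_ALERT_TEMPLATE.format(
        summary=summary,
        impact=impact,
        time=detected_at.astimezone(timezone.utc),
    )


if __name__ == "__main__":
    print(format_p1_alert("[Brief Description]", "[Description]",
                          datetime.now(timezone.utc)))
```

Rejecting naive datetimes prevents a local-time timestamp from being labelled as UTC in the alert.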

P1 Update

📢 P1 UPDATE: [Brief Description]

Time: [HH:MM UTC]
Status: [Investigating/Mitigating/Resolved]
Root cause: [Known/Unknown/Suspected: X]
ETA to resolution: [Time or Unknown]
Next update: [Time]

P1 Resolution

✅ P1 RESOLVED: [Brief Description]

Time: [HH:MM UTC]
Duration: [X minutes/hours]
Root cause: [Description]
Resolution: [What was done]
Post-mortem: [Link or "Scheduled for DATE"]
Trading: [Resumed/Still halted - reason]

Post-Mortem Template

# Incident Post-Mortem: [INCIDENT_ID]

## Summary
- **Date**: YYYY-MM-DD
- **Duration**: X hours Y minutes
- **Severity**: P1/P2/P3/P4
- **Impact**: [Description of impact]

## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | Incident began |
| HH:MM | Detected by [method] |
| HH:MM | Response initiated |
| HH:MM | [Key milestone] |
| HH:MM | Resolution confirmed |

## Root Cause
[Detailed description of what caused the incident]

## Impact Assessment
- **Financial**: $X loss / No financial impact
- **Positions affected**: X
- **Trades missed**: X
- **Data loss**: None / [Description]

## What Went Well
1. [Something that worked]
2. [Another thing]

## What Went Poorly
1. [Something that didn't work]
2. [Another thing]

## Action Items
| ID | Action | Owner | Due Date | Status |
|----|--------|-------|----------|--------|
| 1 | [Action] | [Name] | YYYY-MM-DD | Open |
| 2 | [Action] | [Name] | YYYY-MM-DD | Open |

## Lessons Learned
1. [Lesson]
2. [Lesson]

## Prevention
[What changes will prevent this from happening again]

---
Post-mortem author: [Name]
Post-mortem date: YYYY-MM-DD
Review date: YYYY-MM-DD

Incident Log

Maintain an ongoing log at docs/runbooks/incident_log.md:

# Incident Log

| ID | Date | Severity | Summary | Duration | Resolution |
|----|------|----------|---------|----------|------------|
| INC-2024-001 | 2024-01-15 | P2 | Webhook timeout | 45 min | Increased timeout |
| INC-2024-002 | 2024-01-20 | P3 | Slow data feed | 2 hours | Provider issue |
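
New entries need the next free ID in the `INC-YYYY-NNN` scheme shown above. The sketch below derives it from the existing log text and formats a table row. It assumes numbering restarts each year, and the function names are hypothetical.

```python
import re


def next_incident_id(log_text: str, year: int) -> str:
    """Return the next INC-YYYY-NNN identifier for `year` given the current log."""
    numbers = [int(n) for n in re.findall(rf"INC-{year}-(\d{{3,}})", log_text)]
    return f"INC-{year}-{max(numbers, default=0) + 1:03d}"


def format_log_row(incident_id: str, date: str, severity: str,
                   summary: str, duration: str, resolution: str) -> str:
    """Format one row matching the incident log table columns."""
    return (f"| {incident_id} | {date} | {severity} | {summary} "
            f"| {duration} | {resolution} |")


if __name__ == "__main__":
    from pathlib import Path

    log_file = Path("docs/runbooks/incident_log.md")
    text = log_file.read_text() if log_file.exists() else ""
    # Only prints the ID; appending the row is left to the operator.
    print(next_incident_id(text, 2024))
```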

Emergency Contacts

# config/emergency_contacts.yaml
contacts:
  founder:
    name: Kelvin Onwudinjo
    phone: "+353XXXXXXXXX"
    email: "kelvin@example.com"
    escalation_level: 1

  broker_support:
    name: "[Broker] Support"
    phone: "[Number]"
    hours: "24/7"

  exchange_support:
    name: "[Exchange] Support"
    url: "[Support URL]"

Next 7 Days: Action Plan

- Set up emergency notification system (Discord/Telegram)
- Configure automatic kill switch triggers
- Create incident response drill schedule
- Initialize incident log at docs/runbooks/incident_log.md
- Test rollback procedure in paper environment
- Document emergency contacts
- Create monitoring dashboard for incident detection