
Incident Response Runbook

Overview

This runbook defines procedures for handling incidents in the trading system. Speed and accuracy are both critical: when money is at risk, every second counts.

Incident Severity Levels

| Level | Description | Response Time | Examples |
|-------|-------------|---------------|----------|
| P1 | Capital at immediate risk | < 5 minutes | Runaway position, system failure during trade |
| P2 | Trading impaired | < 30 minutes | Can't open new positions, delayed signals |
| P3 | Degraded performance | < 4 hours | Higher latency, partial data |
| P4 | Minor issue | Next business day | Cosmetic bugs, logging gaps |

Shutdown Triggers

Automatic Shutdown (Kill Switch)

| Trigger | Threshold | Action |
|---------|-----------|--------|
| Daily loss | > 3% | Close all, halt trading |
| Total drawdown | > 6% | Close all, halt trading |
| Connection lost | > 60 seconds | Close all, halt trading |
| Error rate | > 10 errors/minute | Pause new entries |
| Price anomaly | Spread > 5x normal | Pause new entries |
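
The thresholds above can be expressed as a small decision function. The following is a minimal sketch, not the project's actual kill switch: `Metrics`, `Action`, and `evaluate_triggers` are hypothetical names introduced for illustration, and it assumes losses are reported as positive percentages of account equity and that every threshold is a strict "greater than".

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    PAUSE_NEW_ENTRIES = "pause_new_entries"
    CLOSE_ALL_AND_HALT = "close_all_and_halt"


@dataclass(frozen=True)
class Metrics:
    """Snapshot of monitored values (hypothetical structure)."""
    daily_loss_pct: float        # loss today, as a positive % of equity
    total_drawdown_pct: float    # peak-to-trough drawdown, as a positive %
    seconds_disconnected: float  # time since the connection dropped
    errors_per_minute: float
    spread_ratio: float          # current spread divided by normal spread


def evaluate_triggers(m: Metrics) -> tuple[Action, list[str]]:
    """Return the most severe required action and the names of all fired triggers."""
    halt_checks = {
        "daily_loss": m.daily_loss_pct > 3.0,
        "total_drawdown": m.total_drawdown_pct > 6.0,
        "connection_lost": m.seconds_disconnected > 60.0,
    }
    pause_checks = {
        "error_rate": m.errors_per_minute > 10.0,
        "price_anomaly": m.spread_ratio > 5.0,
    }
    halted = [name for name, hit in halt_checks.items() if hit]
    paused = [name for name, hit in pause_checks.items() if hit]
    if halted:
        # A halt-level trigger outranks any pause-level trigger.
        return Action.CLOSE_ALL_AND_HALT, halted + paused
    if paused:
        return Action.PAUSE_NEW_ENTRIES, paused
    return Action.NONE, []
```

Returning the list of fired triggers alongside the action gives the incident log a ready-made reason string.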

Manual Shutdown Triggers

Execute manual shutdown when:

- Unexpected market event (flash crash, halt)
- System behaving erratically
- Suspicious activity detected
- Unable to diagnose issue quickly
- Founder/Risk manager orders shutdown

Shutdown Command

# Emergency shutdown - all environments
python -m execution.emergency_shutdown --all --reason "description"

# Or MT5 specifically
python -m execution.mt5.kill_switch --activate --reason "description"

# Or via API
curl -X POST https://api.edgefactory.local/emergency/shutdown \
  -H "Authorization: Bearer ${EMERGENCY_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"reason": "description", "scope": "all"}'
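
Where shelling out to curl is not convenient, the same API request can be built with Python's standard library. This is a sketch under assumptions: `build_shutdown_request` and `send_shutdown` are hypothetical helpers, the token is read from the `EMERGENCY_TOKEN` environment variable as in the curl example, and the JSON `Content-Type` header is a guess at what the endpoint expects.

```python
from __future__ import annotations

import json
import os
import urllib.request

SHUTDOWN_URL = "https://api.edgefactory.local/emergency/shutdown"


def build_shutdown_request(
    reason: str, scope: str = "all", token: str | None = None
) -> urllib.request.Request:
    """Build (but do not send) the emergency shutdown POST request."""
    token = token or os.environ["EMERGENCY_TOKEN"]
    body = json.dumps({"reason": reason, "scope": scope}).encode("utf-8")
    return urllib.request.Request(
        SHUTDOWN_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumed requirement
        },
    )


def send_shutdown(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating request construction from sending makes it possible to check the payload in a drill without triggering a real shutdown.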

Incident Response Procedures

P1: Capital at Immediate Risk

┌─────────────────────────────────────────────────────────────────┐
│                    P1 RESPONSE TIMELINE                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  0:00 ─── INCIDENT DETECTED                                    │
│    │                                                            │
│  0:01 ─── ACTIVATE KILL SWITCH (automatic or manual)           │
│    │      └── All positions closed                             │
│    │      └── All pending orders cancelled                     │
│    │      └── New trading halted                               │
│    │                                                            │
│  0:05 ─── ASSESS SITUATION                                     │
│    │      └── Current P&L                                      │
│    │      └── Open positions (should be zero)                  │
│    │      └── System status                                    │
│    │                                                            │
│  0:15 ─── NOTIFY STAKEHOLDERS                                  │
│    │      └── Founder (all channels)                           │
│    │      └── Log incident                                     │
│    │                                                            │
│  1:00 ─── BEGIN ROOT CAUSE ANALYSIS                            │
│    │                                                            │
│  4:00 ─── INITIAL FINDINGS DOCUMENTED                          │
│    │                                                            │
│ 24:00 ─── POST-MORTEM COMPLETE                                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
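
During an incident it helps to turn the relative offsets in the timeline into concrete clock times. The sketch below assumes the offsets are written as hours:minutes after detection (so 0:15 is fifteen minutes and 24:00 is twenty-four hours); `P1_MILESTONES` and `p1_schedule` are hypothetical names.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

# Offsets from detection, copied from the P1 response timeline.
P1_MILESTONES = [
    (timedelta(minutes=0), "Incident detected"),
    (timedelta(minutes=1), "Activate kill switch"),
    (timedelta(minutes=5), "Assess situation"),
    (timedelta(minutes=15), "Notify stakeholders"),
    (timedelta(hours=1), "Begin root cause analysis"),
    (timedelta(hours=4), "Initial findings documented"),
    (timedelta(hours=24), "Post-mortem complete"),
]


def p1_schedule(detected_at: datetime) -> list[tuple[datetime, str]]:
    """Return each milestone paired with its absolute target time."""
    return [(detected_at + offset, label) for offset, label in P1_MILESTONES]


def format_schedule(detected_at: datetime) -> str:
    """Render the schedule as one line per milestone, in UTC."""
    lines = []
    for due, label in p1_schedule(detected_at.astimezone(timezone.utc)):
        lines.append(f"{due:%Y-%m-%d %H:%M} UTC  {label}")
    return "\n".join(lines)
```

Calling `format_schedule(datetime.now(timezone.utc))` at detection time gives a list that can be pasted into the incident channel.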

P1 Checklist

## P1 Incident Response Checklist

### Immediate (0-5 minutes)
- [ ] Kill switch activated
- [ ] Positions verified closed
- [ ] Pending orders cancelled
- [ ] Screenshot current state
- [ ] Note exact time of incident

### Assessment (5-15 minutes)
- [ ] Current P&L calculated
- [ ] Affected accounts identified
- [ ] Error logs captured
- [ ] Market conditions noted

### Communication (15-30 minutes)
- [ ] Founder notified
- [ ] Incident logged in system
- [ ] External parties notified (if applicable)

### Investigation (30+ minutes)
- [ ] Root cause identified
- [ ] Fix or workaround determined
- [ ] Resume decision made

P2: Trading Impaired

1. ASSESS (0-5 min)
   - What specifically is not working?
   - Are open positions at risk?
   - Can we continue with degraded functionality?

2. MITIGATE (5-15 min)
   - If positions at risk: close them manually
   - If signals not generating: document and continue manually
   - If execution failing: switch to backup or manual

3. DIAGNOSE (15-60 min)
   - Check logs for errors
   - Test individual components
   - Identify root cause

4. FIX OR ESCALATE (60+ min)
   - If fixable: implement fix, test, resume
   - If not fixable: keep the mitigation in place and escalate to P1 if risk increases

P3: Degraded Performance

1. DOCUMENT
   - Specific symptoms
   - Affected components
   - Impact on trading

2. ASSESS URGENCY
   - Is this getting worse?
   - Can we wait until non-trading hours?

3. SCHEDULE FIX
   - Create ticket
   - Plan fix for next maintenance window
   - Monitor for escalation

Rollback Procedures

Strategy Rollback

# 1. Stop current strategy signals
python -m execution.disable_strategy --strategy=ICT_QM

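# 2. Identify previous version (the second commit listed precedes the current one)
git log --oneline -- strategies/pine_v6/ict_qm.pine | head -5

# 3. Revert to previous version (replace <commit> with the hash from step 2;
#    HEAD~1 is only correct if the latest commit is the one that changed this file)
git checkout <commit> -- strategies/pine_v6/ict_qm.pine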

# 4. Deploy reverted version to TradingView
# (Manual: copy/paste to TradingView Pine Editor)

# 5. Re-enable with old version
python -m execution.enable_strategy --strategy=ICT_QM --version=previous

# 6. Verify signals
python -m execution.verify_signals --strategy=ICT_QM

System Rollback

# 1. Identify last stable deployment
git tag --list 'v*' --sort=-creatordate | head -5

# 2. Checkout stable version (replace v1.2.2 with the tag identified in step 1)
git checkout v1.2.2

# 3. Restart services
sudo systemctl restart edge-factory-api
sudo systemctl restart edge-factory-webhook

# 4. Verify health
curl http://localhost:8000/health

# 5. Test execution (paper mode)
python -m execution.test_pipeline --mode=paper
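
Services may take a few seconds to come up after a restart, so a single health request can fail spuriously. The following sketch polls the health endpoint from step 4 until it answers; `http_probe` and `wait_until_healthy` are hypothetical helpers, and treating HTTP 200 as "healthy" is an assumption about the endpoint.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:8000/health"


def http_probe(url: str = HEALTH_URL, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def wait_until_healthy(
    probe: Callable[[], bool] = http_probe,
    attempts: int = 10,
    delay: float = 2.0,
) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

If `wait_until_healthy()` returns False, treat the rollback as failed and do not proceed to the paper-mode test.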

Database Rollback (If Applicable)

# 1. Stop services
sudo systemctl stop edge-factory-api

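# 2. Restore from backup (keep a copy of the current database for investigation)
cp /data/edge_factory.db /data/edge_factory.db.pre_rollback
cp /backups/edge_factory_$(date -d "yesterday" +%Y%m%d).db /data/edge_factory.db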

# 3. Restart services
sudo systemctl start edge-factory-api

# 4. Verify data integrity
python -m src.data.verify_integrity
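
The restore step relies on GNU `date -d`, which is not available on every platform, and `cp` will happily run even when yesterday's backup is missing. A portable sketch of the same step in Python follows; `backup_path` and `restore_backup` are hypothetical helpers, the file naming follows the `edge_factory_YYYYMMDD.db` pattern used above, and services are assumed to be stopped before it runs.

```python
from __future__ import annotations

import shutil
from datetime import date, timedelta
from pathlib import Path


def backup_path(backup_dir: Path, day: date) -> Path:
    """Path of the backup taken on `day`, e.g. edge_factory_20240114.db."""
    return backup_dir / f"edge_factory_{day:%Y%m%d}.db"


def restore_backup(backup_dir: Path, live_db: Path, today: date | None = None) -> Path:
    """Restore yesterday's backup over live_db and return the backup used.

    The current database is preserved next to it with a .pre_rollback suffix
    so it remains available for root cause analysis.
    """
    today = today or date.today()
    source = backup_path(backup_dir, today - timedelta(days=1))
    if not source.is_file():
        raise FileNotFoundError(f"No backup found at {source}")
    if live_db.exists():
        shutil.copy2(live_db, live_db.with_name(live_db.name + ".pre_rollback"))
    shutil.copy2(source, live_db)
    return source
```

A typical call would be `restore_backup(Path("/backups"), Path("/data/edge_factory.db"))`, followed by the integrity check in step 4.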

Communication Templates

P1 Alert (Immediate)

🚨 P1 INCIDENT: [Brief Description]

Time: [HH:MM UTC]
Status: Kill switch ACTIVATED
Positions: CLOSED
Impact: [Description]

Investigating. Updates every 15 minutes.
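
Filling the template by hand under pressure invites mistakes, so the alert can be rendered from a few fields. This is an illustrative sketch only: `render_p1_alert` is a hypothetical helper, and it assumes the kill switch and position status lines have already been verified before the message is sent.

```python
from __future__ import annotations

from datetime import datetime, timezone

P1_ALERT_TEMPLATE = (
    "🚨 P1 INCIDENT: {description}\n"
    "\n"
    "Time: {time:%H:%M} UTC\n"
    "Status: Kill switch ACTIVATED\n"
    "Positions: CLOSED\n"
    "Impact: {impact}\n"
    "\n"
    "Investigating. Updates every 15 minutes."
)


def render_p1_alert(description: str, impact: str, at: datetime | None = None) -> str:
    """Render the immediate P1 alert; `at` should be timezone-aware."""
    at = (at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return P1_ALERT_TEMPLATE.format(description=description, time=at, impact=impact)
```

The returned string can be posted unchanged to whichever notification channel is in use.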

P1 Update

📢 P1 UPDATE: [Brief Description]

Time: [HH:MM UTC]
Status: [Investigating/Mitigating/Resolved]
Root cause: [Known/Unknown/Suspected: X]
ETA to resolution: [Time or Unknown]
Next update: [Time]

P1 Resolution

✅ P1 RESOLVED: [Brief Description]

Time: [HH:MM UTC]
Duration: [X minutes/hours]
Root cause: [Description]
Resolution: [What was done]
Post-mortem: [Link or "Scheduled for DATE"]
Trading: [Resumed/Still halted - reason]

Post-Mortem Template

# Incident Post-Mortem: [INCIDENT_ID]

## Summary
- **Date**: YYYY-MM-DD
- **Duration**: X hours Y minutes
- **Severity**: P1/P2/P3/P4
- **Impact**: [Description of impact]

## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | Incident began |
| HH:MM | Detected by [method] |
| HH:MM | Response initiated |
| HH:MM | [Key milestone] |
| HH:MM | Resolution confirmed |

## Root Cause
[Detailed description of what caused the incident]

## Impact Assessment
- **Financial**: $X loss / No financial impact
- **Positions affected**: X
- **Trades missed**: X
- **Data loss**: None / [Description]

## What Went Well
1. [Something that worked]
2. [Another thing]

## What Went Poorly
1. [Something that didn't work]
2. [Another thing]

## Action Items
| ID | Action | Owner | Due Date | Status |
|----|--------|-------|----------|--------|
| 1 | [Action] | [Name] | YYYY-MM-DD | Open |
| 2 | [Action] | [Name] | YYYY-MM-DD | Open |

## Lessons Learned
1. [Lesson]
2. [Lesson]

## Prevention
[What changes will prevent this from happening again]

---
Post-mortem author: [Name]
Post-mortem date: YYYY-MM-DD
Review date: YYYY-MM-DD

Incident Log

Maintain an ongoing log at docs/runbooks/incident_log.md:

# Incident Log

| ID | Date | Severity | Summary | Duration | Resolution |
|----|------|----------|---------|----------|------------|
| INC-2024-001 | 2024-01-15 | P2 | Webhook timeout | 45 min | Increased timeout |
| INC-2024-002 | 2024-01-20 | P3 | Slow data feed | 2 hours | Provider issue |
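
New entries need the next free ID and a row in the same column order. The sketch below derives both from the existing log text; `next_incident_id` and `format_row` are hypothetical helpers, and restarting the counter at 001 each calendar year is an assumption inferred from the `INC-YYYY-NNN` pattern.

```python
import re
from datetime import date

ID_PATTERN = re.compile(r"INC-(\d{4})-(\d{3})")


def next_incident_id(log_text: str, today: date) -> str:
    """Return the next ID for this year, e.g. INC-2024-003."""
    numbers = [
        int(number)
        for year, number in ID_PATTERN.findall(log_text)
        if int(year) == today.year
    ]
    return f"INC-{today.year}-{max(numbers, default=0) + 1:03d}"


def format_row(
    incident_id: str, day: date, severity: str,
    summary: str, duration: str, resolution: str,
) -> str:
    """Format one table row in the incident log's column order."""
    return (
        f"| {incident_id} | {day.isoformat()} | {severity} "
        f"| {summary} | {duration} | {resolution} |"
    )
```

Appending the result of `format_row` to the file keeps the table valid as long as the free-text fields contain no `|` characters.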

Emergency Contacts

# config/emergency_contacts.yaml
contacts:
  founder:
    name: Kelvin Onwudinjo
    phone: "+353XXXXXXXXX"
    email: "kelvin@example.com"
    escalation_level: 1

  broker_support:
    name: "[Broker] Support"
    phone: "[Number]"
    hours: "24/7"

  exchange_support:
    name: "[Exchange] Support"
    url: "[Support URL]"

Next 7 Days: Action Plan

  • Set up emergency notification system (Discord/Telegram)
  • Configure automatic kill switch triggers
  • Create incident response drill schedule
  • Initialize incident log at docs/runbooks/incident_log.md
  • Test rollback procedure in paper environment
  • Document emergency contacts
  • Create monitoring dashboard for incident detection