Monitoring Specification
Overview
Monitoring operates at three levels:
1. Real-time: Trade-by-trade, position-level alerts
2. Session: Daily/weekly performance review
3. Strategic: Long-term drift and decay detection
Live KPIs
Real-Time Metrics (Updated Per Trade)
| Metric | Calculation | Alert Threshold |
| Daily P&L | Sum of closed + unrealized | < -2% warning, < -3% critical |
| Open P&L | Current mark-to-market | < -1.5% per position |
| Daily Trades | Count of trades today | > 10 (overtrading) |
| Win Streak | Consecutive wins | > 5 (review for luck) |
| Loss Streak | Consecutive losses | > 3 (pause and review) |
| Largest Loss | Single trade loss | > 1.5% of account |
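The per-trade thresholds in the table above could be checked along these lines. This is a minimal sketch, not the system's implementation: the function names `current_streaks` and `realtime_alerts`, the alert identifiers, and the representation of trades as a list of P&L values in percent of account are all assumptions made for illustration.

```python
from typing import List, Tuple


def current_streaks(trade_pnls: List[float]) -> Tuple[int, int]:
    """Return (win_streak, loss_streak) counted back from the latest trade.

    Assumption: P&L > 0 is a win, P&L < 0 is a loss, and a breakeven
    trade (exactly 0) ends either streak.
    """
    wins = losses = 0
    for pnl in reversed(trade_pnls):
        if pnl > 0 and losses == 0:
            wins += 1
        elif pnl < 0 and wins == 0:
            losses += 1
        else:
            break
    return wins, losses


def realtime_alerts(trade_pnls_pct: List[float], daily_pnl_pct: float) -> List[str]:
    """Apply the real-time thresholds from the table to today's trades."""
    alerts = []
    if daily_pnl_pct < -3.0:
        alerts.append("daily_pnl_critical")
    elif daily_pnl_pct < -2.0:
        alerts.append("daily_pnl_warning")
    if len(trade_pnls_pct) > 10:
        alerts.append("overtrading")
    wins, losses = current_streaks(trade_pnls_pct)
    if wins > 5:
        alerts.append("win_streak")
    if losses > 3:
        alerts.append("loss_streak")
    # Largest single-trade loss, in percent of account.
    if trade_pnls_pct and min(trade_pnls_pct) < -1.5:
        alerts.append("largest_loss")
    return alerts
```

Calling `realtime_alerts` after every closed trade keeps the checks aligned with the "Updated Per Trade" cadence; the open P&L check per position is left out because it needs live mark-to-market data.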
Session Metrics (Updated Hourly/Daily)
| Metric | Calculation | Warning | Critical |
| Rolling Sharpe (7d) | Annualized from daily returns | < 0.5 | < 0 |
| Rolling Win Rate (20 trades) | Wins / Total | < 40% | < 30% |
| Rolling R:R (20 trades) | Avg Win / Avg Loss | < 1.2 | < 1.0 |
| Drawdown (current) | (Peak - Current) / Peak | > 5% | > 8% |
| Profit Factor (MTD) | Gross Profit / Gross Loss | < 1.3 | < 1.0 |
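The calculations in the table above can be sketched with the standard library alone. The function names are hypothetical, the risk-free rate is taken as zero, and annualization uses 252 periods per year; that last figure is an assumption and should be changed for markets that trade every calendar day.

```python
import math
import statistics
from typing import Sequence


def annualized_sharpe(daily_returns: Sequence[float],
                      periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio from daily returns (zero risk-free rate)."""
    if len(daily_returns) < 2:
        return float("nan")
    stdev = statistics.stdev(daily_returns)  # sample standard deviation
    if stdev == 0:
        return float("nan")  # undefined for a constant return series
    return statistics.mean(daily_returns) / stdev * math.sqrt(periods_per_year)


def current_drawdown(equity_curve: Sequence[float]) -> float:
    """(Peak - Current) / Peak, as a fraction of the highest equity seen."""
    peak = max(equity_curve)
    return (peak - equity_curve[-1]) / peak


def profit_factor(trade_pnls: Sequence[float]) -> float:
    """Gross profit / gross loss; infinite when there are no losing trades."""
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)
    return gross_profit / gross_loss if gross_loss > 0 else float("inf")


def reward_risk(trade_pnls: Sequence[float]) -> float:
    """Average win / average loss over the supplied trades."""
    wins = [p for p in trade_pnls if p > 0]
    losses = [-p for p in trade_pnls if p < 0]
    if not wins or not losses:
        return float("nan")
    return statistics.mean(wins) / statistics.mean(losses)
```

For the rolling variants, pass only the last 7 daily returns or the last 20 trades. Note that a 7-day Sharpe rests on very few observations and will be noisy.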
Portfolio-Level Metrics
| Metric | Target | Check Frequency |
| Total Exposure | < 80% of capital | Real-time |
| Strategy Correlation | < 0.5 pairwise | Daily |
| Sector Concentration | < 40% single asset class | Daily |
| Net Delta | Balanced unless intentional | Real-time |
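As an illustration of the exposure and concentration targets above, the following sketch flags breaches for a list of positions. It assumes that exposure means gross notional (absolute values summed) and that concentration is measured against capital rather than against total exposure; the spec does not state either, and `portfolio_breaches` is a hypothetical name.

```python
from typing import Dict, List, Tuple

# One position: (asset_class, signed notional in account currency).
Position = Tuple[str, float]


def portfolio_breaches(positions: List[Position], capital: float) -> List[str]:
    """Flag total exposure >= 80% of capital and any asset class >= 40%."""
    breaches = []

    gross = sum(abs(notional) for _, notional in positions)
    if gross >= 0.80 * capital:
        breaches.append("total_exposure")

    # Sum gross notional per asset class.
    by_class: Dict[str, float] = {}
    for asset_class, notional in positions:
        by_class[asset_class] = by_class.get(asset_class, 0.0) + abs(notional)

    for asset_class, exposure in sorted(by_class.items()):
        if exposure >= 0.40 * capital:
            breaches.append(f"concentration:{asset_class}")
    return breaches
```

Strategy correlation and net delta are omitted here because they need return histories and instrument-level deltas that this sketch does not model.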
Execution Quality Metrics
| Metric | Expected | Warning | Critical |
| Signal-to-fill latency | < 2s | > 5s | > 10s |
| Slippage vs model | ±1 tick | > 2 ticks | > 5 ticks |
| Fill rate | > 95% | < 90% | < 80% |
| Rejected orders | < 2% | > 5% | > 10% |
| Partial fills | < 5% | > 10% | > 20% |
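The warning and critical bands above mix metrics where higher is worse (latency, slippage) with one where lower is worse (fill rate). A possible way to classify both with one function is sketched below. The metric keys are hypothetical names, slippage is assumed to be passed as an absolute number of ticks, and values between the expected range and the warning threshold are treated as "ok" because the table does not define a level for them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    warning: float
    critical: float
    higher_is_worse: bool = True


# Thresholds copied from the table; the keys are illustrative.
BANDS = {
    "latency_s": Band(warning=5, critical=10),
    "slippage_ticks": Band(warning=2, critical=5),
    "fill_rate_pct": Band(warning=90, critical=80, higher_is_worse=False),
    "rejected_pct": Band(warning=5, critical=10),
    "partial_fill_pct": Band(warning=10, critical=20),
}


def classify(metric: str, value: float) -> str:
    """Return 'critical', 'warning', or 'ok' for one execution metric."""
    band = BANDS[metric]
    # Flip the sign when lower values are worse, so that one comparison
    # direction covers both kinds of metric.
    sign = 1 if band.higher_is_worse else -1
    if sign * value > sign * band.critical:
        return "critical"
    if sign * value > sign * band.warning:
        return "warning"
    return "ok"
```

For example, `classify("fill_rate_pct", 85)` falls in the warning band because 85% is below 90% but not below 80%.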
Drift and Decay Detection
What is Strategy Decay?
Strategy decay occurs when a previously profitable strategy loses edge over time due to:
- Market regime change
- Increased competition (alpha crowding)
- Structural market changes
- Parameter staleness
Detection Methods
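1. Performance Drift
```python
def detect_performance_drift(returns, lookback=60, threshold=0.5):
    """
    Compare recent performance to historical baseline.
    Alert if recent Sharpe < threshold × historical Sharpe.

    Assumes calculate_sharpe() and Alert are defined elsewhere in the project.
    """
    if len(returns) <= lookback:
        return None  # no baseline period available yet
    historical_sharpe = calculate_sharpe(returns[:-lookback])
    recent_sharpe = calculate_sharpe(returns[-lookback:])
    if historical_sharpe <= 0:
        return None  # the ratio is only meaningful against a positive baseline
    degradation = recent_sharpe / historical_sharpe
    if degradation < threshold:
        return Alert(
            level='warning',
            message=f'Performance degraded to {degradation:.0%} of baseline'
        )
    return None
```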
2. Regime Change Detection
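```python
def detect_regime_change(prices, volatility_window=20):
    """
    Classify the volatility regime (low/normal/high) by comparing recent
    rolling volatility to full-sample volatility.

    Only the volatility regime is implemented here; trend (trending/ranging)
    and correlation (correlated/decorrelated) regimes are not covered.
    Assumes `prices` is a pandas Series and Regime is an enum defined elsewhere.
    """
    returns = prices.pct_change()
    current_vol = returns.rolling(volatility_window).std().iloc[-1]
    historical_vol = returns.std()
    # A NaN ratio (too little data or flat prices) falls through to NORMAL
    vol_ratio = current_vol / historical_vol
    if vol_ratio > 2.0:
        return Regime.HIGH_VOL
    elif vol_ratio < 0.5:
        return Regime.LOW_VOL
    else:
        return Regime.NORMAL
```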
3. Win Rate Decay
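```python
def detect_win_rate_decay(trades, baseline_win_rate, window=50):
    """
    Alert if rolling win rate drops significantly below baseline.

    Assumes `trades` is a pandas DataFrame with a boolean `is_win` column.
    """
    if len(trades) < window:
        return None  # wait for a full window before judging decay
    recent_win_rate = trades['is_win'].iloc[-window:].mean()
    if recent_win_rate < baseline_win_rate * 0.7:  # 30% relative degradation
        return Alert(
            level='critical',
            message=f'Win rate dropped to {recent_win_rate:.0%} vs {baseline_win_rate:.0%} baseline'
        )
    return None
```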
4. Distribution Shift (KL Divergence)
```python
import numpy as np
from scipy.stats import entropy


def detect_distribution_shift(returns, reference_returns, threshold=0.5):
    """
    Compare return distribution to reference period.
    High KL divergence indicates regime change.
    """
    # Bin both samples on the same grid; returns beyond ±5% fall outside
    # the bins and are ignored by np.histogram
    bins = np.linspace(-0.05, 0.05, 50)
    # The small constant keeps empty bins from producing log(0) or division by zero
    p = np.histogram(returns, bins=bins, density=True)[0] + 1e-10
    q = np.histogram(reference_returns, bins=bins, density=True)[0] + 1e-10
    kl_div = entropy(p, q)  # KL(p || q); entropy() normalizes p and q to sum to 1
    if kl_div > threshold:
        return Alert(
            level='warning',
            message=f'Return distribution shifted (KL={kl_div:.2f})'
        )
    return None
```
Decay Response Protocol
| Signal | Threshold | Action |
| 30-day Sharpe < 0.3 | Trigger | Flag for review |
| 60-day Sharpe < 0 | Trigger | Reduce position size 50% |
| Win rate < 30% (50 trades) | Trigger | Pause strategy |
| Max DD > backtest max × 1.5 | Trigger | Pause strategy |
| Regime change detected | Trigger | Review parameter fit |
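The protocol above maps cleanly onto a small rule function. The sketch below returns the actions triggered by a set of inputs; the function name, parameter names, and action identifiers are hypothetical, and win rate and drawdowns are assumed to be passed as fractions (0.30 for 30%).

```python
from typing import List


def decay_actions(
    sharpe_30d: float,
    sharpe_60d: float,
    win_rate_50: float,
    max_drawdown: float,
    backtest_max_drawdown: float,
    regime_changed: bool,
) -> List[str]:
    """Return the protocol actions triggered by the current readings."""
    actions = []
    if sharpe_30d < 0.3:
        actions.append("flag_for_review")
    if sharpe_60d < 0:
        actions.append("reduce_position_size_50pct")
    # Either condition pauses the strategy; report the action only once.
    if win_rate_50 < 0.30 or max_drawdown > 1.5 * backtest_max_drawdown:
        actions.append("pause_strategy")
    if regime_changed:
        actions.append("review_parameter_fit")
    return actions
```

Returning every triggered action, rather than only the most severe one, leaves the decision about precedence (for example, whether a pause makes a size reduction moot) to the caller.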
Alert System
Alert Levels
| Level | Response Time | Notification Channel | Auto-Action |
| Info | Next review | Log only | None |
| Warning | < 1 hour | Discord/Telegram | None |
| Critical | Immediate | All channels + SMS | Reduce exposure |
| Emergency | Immediate | All + phone call | Halt trading |
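One way to encode the level table is a routing map from level to channels and auto-action, as sketched below. The channel identifiers (`sms`, `phone`), the action names, and the `dispatch` function are assumptions for illustration; actual delivery is delegated to caller-supplied sender functions.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional, Tuple


class Level(str, Enum):
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"
    EMERGENCY = "emergency"


# Channels and auto-action per level, following the table.
ROUTES: Dict[Level, Tuple[List[str], Optional[str]]] = {
    Level.INFO: (["log"], None),
    Level.WARNING: (["log", "discord", "telegram"], None),
    Level.CRITICAL: (["log", "discord", "telegram", "sms"], "reduce_exposure"),
    Level.EMERGENCY: (["log", "discord", "telegram", "sms", "phone"], "halt_trading"),
}


def dispatch(level: Level, message: str,
             senders: Dict[str, Callable[[str], None]]) -> Optional[str]:
    """Send `message` on each channel configured for `level`.

    Returns the auto-action to run, or None. Channels with no entry in
    `senders` are skipped silently, which is a simplification.
    """
    channels, action = ROUTES[level]
    for channel in channels:
        send = senders.get(channel)
        if send is not None:
            send(f"[{level.value}] {message}")
    return action
```

Keeping the auto-action as a return value, rather than executing it inside `dispatch`, separates notification from trading-side effects so each can be tested on its own.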
Alert Configuration
```yaml
# config/alerts.yaml
channels:
  discord:
    webhook_url: ${DISCORD_WEBHOOK}
    levels: [warning, critical, emergency]
  telegram:
    bot_token: ${TELEGRAM_BOT_TOKEN}
    chat_id: ${TELEGRAM_CHAT_ID}
    levels: [warning, critical, emergency]
  log:
    path: logs/alerts.log
    levels: [info, warning, critical, emergency]
alerts:
  daily_loss:
    metric: daily_pnl_pct
    warning: -2.0
    critical: -3.0
    message: "Daily P&L alert: {value}%"
  drawdown:
    metric: current_drawdown_pct
    warning: -5.0
    critical: -8.0
    message: "Drawdown alert: {value}%"
  loss_streak:
    metric: consecutive_losses
    warning: 3
    critical: 5
    message: "Loss streak: {value} consecutive losses"
```
Example alert payload:
```json
{
  "timestamp": "2024-01-15T14:30:00Z",
  "level": "warning",
  "strategy": "ICT_QM_1H",
  "metric": "daily_pnl_pct",
  "value": -2.3,
  "threshold": -2.0,
  "message": "Daily P&L alert: -2.3%",
  "action_required": "Review positions",
  "dashboard_link": "https://dashboard.example.com/strategy/ICT_QM_1H"
}
```
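To connect the alert rules to the payload above, the following sketch evaluates one metric value against a rule and builds the payload. It is illustrative only: the rules are copied into a Python dict so the example does not depend on a YAML parser, the `evaluate` function is hypothetical, comparisons are assumed to be strict, and the direction of each rule is inferred from whether `critical` is below or above `warning`. The `action_required` and `dashboard_link` fields are omitted because the configuration does not define where they come from.

```python
from datetime import datetime, timezone
from typing import Optional

RULES = {
    "daily_loss": {
        "metric": "daily_pnl_pct", "warning": -2.0, "critical": -3.0,
        "message": "Daily P&L alert: {value}%",
    },
    "loss_streak": {
        "metric": "consecutive_losses", "warning": 3, "critical": 5,
        "message": "Loss streak: {value} consecutive losses",
    },
}


def evaluate(rule: dict, value: float, strategy: str,
             now: Optional[datetime] = None) -> Optional[dict]:
    """Return an alert payload if `value` breaches the rule, else None.

    `now` is assumed to be a UTC datetime; it defaults to the current time.
    """
    # Assumption: critical < warning means lower values are worse (losses);
    # otherwise higher values are worse (streak counts).
    lower_is_worse = rule["critical"] < rule["warning"]

    def breached(threshold: float) -> bool:
        return value < threshold if lower_is_worse else value > threshold

    for level in ("critical", "warning"):  # most severe level first
        if breached(rule[level]):
            now = now or datetime.now(timezone.utc)
            return {
                "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
                "level": level,
                "strategy": strategy,
                "metric": rule["metric"],
                "value": value,
                "threshold": rule[level],
                "message": rule["message"].format(value=value),
            }
    return None
```

With strict comparisons a loss streak of exactly 3 does not alert, which matches the "> 3" wording in the real-time metrics table. If the configured thresholds are meant to be inclusive, change `<` and `>` to `<=` and `>=`.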
Incident Response
Incident Classification
| Severity | Description | Examples |
| P1 | Capital at immediate risk | Runaway position, system down during open trade |
| P2 | Trading impaired | Can't open new positions, delayed signals |
| P3 | Degraded performance | Higher latency, partial data |
| P4 | Minor issue | Cosmetic, logging gaps |
Response Procedures
P1: Capital at Risk
1. IMMEDIATE: Kill all open orders
2. IMMEDIATE: Flatten all positions (market orders)
3. WITHIN 5 MIN: Notify founder via all channels
4. WITHIN 15 MIN: Document current state
5. WITHIN 1 HOUR: Root cause analysis started
6. WITHIN 24 HOURS: Post-mortem document
P2: Trading Impaired
1. WITHIN 5 MIN: Pause new signal generation
2. WITHIN 15 MIN: Assess impact on open positions
3. WITHIN 30 MIN: Implement workaround or escalate to P1
4. WITHIN 4 HOURS: Root cause identified
5. WITHIN 24 HOURS: Fix deployed or documented workaround
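The P1 and P2 procedures are effectively checklists with deadlines, so they can be tracked mechanically. The sketch below turns each procedure into concrete due times and lists overdue steps. It assumes deadlines are measured from the moment the incident is detected, which the procedures do not state explicitly; `deadlines` and `overdue` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Set, Tuple

Step = Tuple[timedelta, str]

P1_STEPS: List[Step] = [
    (timedelta(0), "Kill all open orders"),
    (timedelta(0), "Flatten all positions (market orders)"),
    (timedelta(minutes=5), "Notify founder via all channels"),
    (timedelta(minutes=15), "Document current state"),
    (timedelta(hours=1), "Root cause analysis started"),
    (timedelta(hours=24), "Post-mortem document"),
]

P2_STEPS: List[Step] = [
    (timedelta(minutes=5), "Pause new signal generation"),
    (timedelta(minutes=15), "Assess impact on open positions"),
    (timedelta(minutes=30), "Implement workaround or escalate to P1"),
    (timedelta(hours=4), "Root cause identified"),
    (timedelta(hours=24), "Fix deployed or documented workaround"),
]


def deadlines(detected_at: datetime, steps: List[Step]) -> List[Tuple[datetime, str]]:
    """Attach an absolute due time to each step of a procedure."""
    return [(detected_at + offset, action) for offset, action in steps]


def overdue(detected_at: datetime, steps: List[Step],
            now: datetime, done: Set[str]) -> List[str]:
    """Steps whose deadline has passed and that are not marked done."""
    return [action for due, action in deadlines(detected_at, steps)
            if due < now and action not in done]
```

The due times produced by `deadlines` can also seed the Timeline section of the post-incident report.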
Runbook Pointers
| Scenario | Runbook Location |
| Broker API down | docs/runbooks/broker_outage.md |
| TradingView webhook failure | docs/runbooks/webhook_failure.md |
| Position size mismatch | docs/runbooks/position_reconcile.md |
| Strategy generating bad signals | docs/runbooks/signal_validation.md |
| Data feed issues | docs/runbooks/data_issues.md |
| General incident response | docs/runbooks/incident_response.md |
Post-Incident Review Template
```markdown
# Incident Report: [INCIDENT_ID]
## Summary
- Date/Time: YYYY-MM-DD HH:MM UTC
- Duration: X hours
- Severity: P1/P2/P3/P4
- Impact: [Description of impact]
## Timeline
- HH:MM - Event detected
- HH:MM - Response initiated
- HH:MM - Mitigation applied
- HH:MM - Resolution confirmed
## Root Cause
[Description of root cause]
## Resolution
[What was done to fix it]
## Lessons Learned
- [Lesson 1]
- [Lesson 2]
## Action Items
- [ ] [Action 1] - Owner - Due date
- [ ] [Action 2] - Owner - Due date
```
Dashboard Requirements
Minimum Viable Dashboard
| Panel | Content | Update Frequency |
| P&L Summary | Today, MTD, YTD | Real-time |
| Open Positions | Symbol, size, unrealized P&L | Real-time |
| Drawdown Gauge | Current DD vs max allowed | Real-time |
| Recent Trades | Last 10 trades with P&L | Per trade |
| Alert Feed | Last 24h alerts | Per alert |
| Strategy Health | Per-strategy Sharpe, win rate | Hourly |
Dashboard Tech Stack
| Component | Technology | Rationale |
| Backend | Python + FastAPI | Existing stack |
| Database | SQLite | Simple, portable |
| Frontend | Streamlit or Grafana | Rapid development |
| Hosting | Local or VPS | Data stays private |
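As a rough illustration of how SQLite could back one of the dashboard panels, the sketch below stores alerts and reads back the last 24 hours for the Alert Feed. The table name, column names, and function names are assumptions, not part of this spec. Timestamps are stored as ISO 8601 text and are assumed to be UTC-aware, so that text comparison matches chronological order.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the alerts table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS alerts ("
        "ts TEXT NOT NULL, level TEXT NOT NULL, message TEXT NOT NULL)"
    )
    return conn


def record_alert(conn: sqlite3.Connection, ts: datetime,
                 level: str, message: str) -> None:
    """Insert one alert; `ts` must be a timezone-aware UTC datetime."""
    conn.execute(
        "INSERT INTO alerts (ts, level, message) VALUES (?, ?, ?)",
        (ts.isoformat(timespec="seconds"), level, message),
    )
    conn.commit()


def alert_feed(conn: sqlite3.Connection, now: datetime,
               hours: int = 24) -> List[Tuple[str, str, str]]:
    """Return alerts from the last `hours` hours, newest first."""
    cutoff = (now - timedelta(hours=hours)).isoformat(timespec="seconds")
    rows = conn.execute(
        "SELECT ts, level, message FROM alerts WHERE ts >= ? ORDER BY ts DESC",
        (cutoff,),
    )
    return rows.fetchall()
```

A FastAPI endpoint or a Streamlit panel could call `alert_feed` directly; passing a file path instead of `:memory:` to `open_db` persists the data between runs.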
Next 7 Days: Action Plan