Monitoring Specification

Overview

Monitoring operates at three levels:

1. Real-time: Trade-by-trade, position-level alerts
2. Session: Daily/weekly performance review
3. Strategic: Long-term drift and decay detection

Live KPIs

Real-Time Metrics (Updated Per Trade)

Metric	Calculation	Alert Threshold
Daily P&L	Sum of closed + unrealized	< -2% warning, < -3% critical
Open P&L	Current mark-to-market	< -1.5% per position
Daily Trades	Count of trades today	> 10 (overtrading)
Win Streak	Consecutive wins	> 5 (review for luck)
Loss Streak	Consecutive losses	> 3 (pause and review)
Largest Loss	Single trade loss	> 1.5% of account
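
The per-trade thresholds above could be checked roughly as follows. This is a minimal sketch, assuming each closed trade is represented by its P&L as a percentage of the account and that a zero-P&L trade counts as a non-win. The names `current_streak` and `check_realtime` are illustrative, and streaks are computed over the same list as the daily trade count for brevity.

```python
from typing import List, Tuple


def current_streak(trade_pnls: List[float]) -> Tuple[str, int]:
    """Return ('win' | 'loss' | 'none', length) for the streak ending at the latest trade."""
    if not trade_pnls:
        return ("none", 0)
    last_is_win = trade_pnls[-1] > 0  # assumption: breakeven counts as a non-win
    length = 0
    for pnl in reversed(trade_pnls):
        if (pnl > 0) != last_is_win:
            break
        length += 1
    return ("win" if last_is_win else "loss", length)


def check_realtime(trade_pnls_pct: List[float]) -> List[str]:
    """Apply the streak, trade-count and largest-loss thresholds from the table."""
    alerts = []
    kind, length = current_streak(trade_pnls_pct)
    if kind == "loss" and length > 3:
        alerts.append("loss streak > 3: pause and review")
    if kind == "win" and length > 5:
        alerts.append("win streak > 5: review for luck")
    if len(trade_pnls_pct) > 10:
        alerts.append("more than 10 trades today: possible overtrading")
    if trade_pnls_pct and min(trade_pnls_pct) < -1.5:
        alerts.append("single trade loss > 1.5% of account")
    return alerts
```

Calling `check_realtime` after every fill keeps the check stateless; the caller only has to supply the ordered list of closed-trade results.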

Session Metrics (Updated Hourly/Daily)

Metric	Calculation	Warning	Critical
Rolling Sharpe (7d)	Annualized from daily returns	< 0.5	< 0
Rolling Win Rate (20 trades)	Wins / Total	< 40%	< 30%
Rolling R:R (20 trades)	Avg Win / Avg Loss	< 1.2	< 1.0
Drawdown (current)	(Peak - Current) / Peak	> 5%	> 8%
Profit Factor (MTD)	Gross Profit / Gross Loss	< 1.3	< 1.0
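
The calculations in this table can be sketched with the standard library alone. The code below is one possible reading, not a reference implementation: it assumes a zero risk-free rate and 252 trading periods per year for annualization (a 24/7 market would use a different factor). `calculate_sharpe` is named to match the helper that the drift detection code calls without defining; the other function names are hypothetical.

```python
import math
import statistics
from typing import Sequence


def calculate_sharpe(daily_returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio from daily returns (risk-free rate assumed to be zero)."""
    if len(daily_returns) < 2:
        return 0.0
    stdev = statistics.stdev(daily_returns)
    if stdev == 0:
        return 0.0
    return statistics.mean(daily_returns) / stdev * math.sqrt(periods_per_year)


def current_drawdown(equity_curve: Sequence[float]) -> float:
    """(Peak - Current) / Peak, as a fraction of the highest equity seen so far."""
    peak = max(equity_curve)
    return (peak - equity_curve[-1]) / peak


def profit_factor(trade_pnls: Sequence[float]) -> float:
    """Gross profit divided by gross loss; infinite when there are no losing trades."""
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)
    return math.inf if gross_loss == 0 else gross_profit / gross_loss


def reward_to_risk(trade_pnls: Sequence[float]) -> float:
    """Average win divided by average loss (R:R); NaN if either side is empty."""
    wins = [p for p in trade_pnls if p > 0]
    losses = [-p for p in trade_pnls if p < 0]
    if not wins or not losses:
        return math.nan
    return statistics.mean(wins) / statistics.mean(losses)
```

For the rolling variants, pass only the last 7 daily returns or the last 20 trades to the corresponding function.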

Portfolio-Level Metrics

Metric	Target	Check Frequency
Total Exposure	< 80% of capital	Real-time
Strategy Correlation	< 0.5 pairwise	Daily
Sector Concentration	< 40% single asset class	Daily
Net Delta	Balanced unless intentional	Real-time
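
A sketch of how the exposure, concentration and correlation targets might be checked. It assumes concentration is measured against total capital (the table does not say whether the base is capital or gross exposure) and that correlation is a plain Pearson coefficient on aligned return series. All function names and breach labels are hypothetical.

```python
import itertools
import math
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length return series (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(
    returns_by_strategy: Dict[str, Sequence[float]], limit: float = 0.5
) -> List[Tuple[str, str, float]]:
    """Strategy pairs whose correlation is at or above the limit."""
    breaches = []
    for a, b in itertools.combinations(sorted(returns_by_strategy), 2):
        rho = pearson(returns_by_strategy[a], returns_by_strategy[b])
        if rho >= limit:
            breaches.append((a, b, rho))
    return breaches


def exposure_breaches(
    notional_by_class: Dict[str, float],
    capital: float,
    max_total: float = 0.80,
    max_single: float = 0.40,
) -> List[str]:
    """Check total exposure and single-asset-class concentration against capital."""
    breaches = []
    total = sum(abs(v) for v in notional_by_class.values())
    if total / capital >= max_total:
        breaches.append("total_exposure")
    for asset_class, notional in sorted(notional_by_class.items()):
        if abs(notional) / capital >= max_single:
            breaches.append(f"concentration:{asset_class}")
    return breaches
```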

Execution Quality Metrics

Metric	Expected	Warning	Critical
Signal-to-fill latency	< 2s	> 5s	> 10s
Slippage vs model	±1 tick	> 2 ticks	> 5 ticks
Fill rate	> 95%	< 90%	< 80%
Rejected orders	< 2%	> 5%	> 10%
Partial fills	< 5%	> 10%	> 20%
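
The warning and critical columns above map naturally onto a small lookup table. In this sketch the metric keys are hypothetical, rates are expressed as fractions, and slippage is assumed to be passed as an absolute number of ticks. Fill rate is the only metric where lower values are worse.

```python
from typing import Dict, Tuple

# (warning, critical, higher_is_worse) per metric, taken from the table above.
EXECUTION_THRESHOLDS: Dict[str, Tuple[float, float, bool]] = {
    "latency_s": (5.0, 10.0, True),
    "slippage_ticks": (2.0, 5.0, True),
    "fill_rate": (0.90, 0.80, False),
    "rejected_rate": (0.05, 0.10, True),
    "partial_fill_rate": (0.10, 0.20, True),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning' or 'critical' for an execution quality reading."""
    warning, critical, higher_is_worse = EXECUTION_THRESHOLDS[metric]
    if higher_is_worse:
        if value > critical:
            return "critical"
        if value > warning:
            return "warning"
    else:
        if value < critical:
            return "critical"
        if value < warning:
            return "warning"
    return "ok"
```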

Drift and Decay Detection

What is Strategy Decay?

Strategy decay occurs when a previously profitable strategy loses edge over time due to:

- Market regime change
- Increased competition (alpha crowding)
- Structural market changes
- Parameter staleness

Detection Methods

1. Rolling Performance Degradation

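def detect_performance_drift(returns, lookback=60, threshold=0.5):
    """
    Compare recent performance to the historical baseline.

    Alert if recent Sharpe < threshold × historical Sharpe.
    Returns None when there is no alert or the baseline is unusable.
    """
    # Require at least one lookback window of history before the recent window
    if len(returns) < 2 * lookback:
        return None

    historical_sharpe = calculate_sharpe(returns[:-lookback])
    recent_sharpe = calculate_sharpe(returns[-lookback:])

    # The ratio is only meaningful against a positive baseline
    if historical_sharpe <= 0:
        return None

    degradation = recent_sharpe / historical_sharpe
    if degradation < threshold:
        return Alert(
            level='warning',
            message=f'Performance degraded to {degradation:.0%} of baseline'
        )
    return None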
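
The detection functions in this section return `Alert` and `Regime` objects that this document does not define. One possible shape for them is shown below. The field names follow the way the functions construct and return these objects; the string values of the enum are an assumption.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class Alert:
    """Result returned by a detection function when a threshold is breached."""
    level: str    # 'info', 'warning', 'critical' or 'emergency'
    message: str


class Regime(Enum):
    """Volatility regimes returned by the regime detection function."""
    LOW_VOL = "low_vol"
    NORMAL = "normal"
    HIGH_VOL = "high_vol"
```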

2. Regime Change Detection

def detect_regime_change(prices, volatility_window=20):
    """
    Detect significant regime shifts. Regime dimensions to consider:
    - Volatility regime (low/medium/high)
    - Trend regime (trending/ranging)
    - Correlation regime (correlated/decorrelated)

    Only the volatility regime is implemented below: recent rolling
    volatility is compared with full-sample volatility.
    """
    returns = prices.pct_change().dropna()
    current_vol = returns.rolling(volatility_window).std().iloc[-1]
    historical_vol = returns.std()

    # NaN (too little data) fails both comparisons and falls through to NORMAL
    vol_ratio = current_vol / historical_vol

    if vol_ratio > 2.0:
        return Regime.HIGH_VOL
    elif vol_ratio < 0.5:
        return Regime.LOW_VOL
    else:
        return Regime.NORMAL

3. Win Rate Decay

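def detect_win_rate_decay(trades, baseline_win_rate, window=50):
    """
    Alert if the rolling win rate drops significantly below baseline.

    `trades` is expected to be a pandas DataFrame with a boolean
    `is_win` column, ordered oldest first.
    """
    if len(trades) < window:
        return None  # not enough trades for a full window

    recent_win_rate = trades['is_win'].iloc[-window:].mean()

    if recent_win_rate < baseline_win_rate * 0.7:  # more than 30% relative degradation
        return Alert(
            level='critical',
            message=f'Win rate dropped to {recent_win_rate:.0%} vs {baseline_win_rate:.0%} baseline'
        )
    return None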

4. Distribution Shift (KL Divergence)

def detect_distribution_shift(returns, reference_returns, threshold=0.5):
    """
    Compare the return distribution to a reference period.
    A high KL divergence suggests a regime change.
    """
    import numpy as np
    from scipy.stats import entropy

    # Bin both samples on a shared grid; returns outside ±5% are ignored
    bins = np.linspace(-0.05, 0.05, 50)
    # The small constant keeps empty bins from producing log(0) or division by zero
    p = np.histogram(returns, bins=bins, density=True)[0] + 1e-10
    q = np.histogram(reference_returns, bins=bins, density=True)[0] + 1e-10

    # entropy(p, q) normalizes both inputs and returns KL(p || q)
    kl_div = entropy(p, q)

    if kl_div > threshold:
        return Alert(
            level='warning',
            message=f'Return distribution shifted (KL={kl_div:.2f})'
        )
    return None
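
For reference, `scipy.stats.entropy(p, q)` normalizes both inputs to sum to 1 and returns the KL divergence, the sum over bins of p_i · ln(p_i / q_i). The dependency-free sketch below reproduces that computation on raw histogram values, mainly to show that the measure is asymmetric and requires every reference bin to be non-zero. The function name `kl_divergence` is illustrative.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two histograms over the same bins.

    Both inputs are normalized first. Every q bin must be positive,
    which is why the detection code adds a small constant to each bin.
    """
    total_p, total_q = sum(p), sum(q)
    kl = 0.0
    for p_i, q_i in zip(p, q):
        p_i, q_i = p_i / total_p, q_i / total_q
        if p_i > 0:  # bins with zero mass in p contribute nothing
            kl += p_i * math.log(p_i / q_i)
    return kl
```

Because KL(p || q) differs from KL(q || p), the order of the recent and reference samples matters when choosing a threshold.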

Decay Response Protocol

Trigger	Action
30-day Sharpe < 0.3	Flag for review
60-day Sharpe < 0	Reduce position size 50%
Win rate < 30% (50 trades)	Pause strategy
Max DD > backtest max × 1.5	Pause strategy
Regime change detected	Review parameter fit
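
The protocol above can be expressed as a pure function from a health snapshot to a list of actions. This is a sketch only: the `StrategyHealth` fields and the action identifiers are hypothetical names, drawdowns are assumed to be positive fractions, and the win rate is assumed to be measured over the last 50 trades.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StrategyHealth:
    sharpe_30d: float
    sharpe_60d: float
    win_rate_50: float            # fraction of wins over the last 50 trades
    max_drawdown: float           # live max drawdown, positive fraction
    backtest_max_drawdown: float  # backtest max drawdown, positive fraction
    regime_changed: bool


def decay_actions(h: StrategyHealth) -> List[str]:
    """Return the protocol actions triggered by the current health snapshot."""
    actions = []
    if h.sharpe_30d < 0.3:
        actions.append("flag_for_review")
    if h.sharpe_60d < 0:
        actions.append("reduce_position_size_50pct")
    if h.win_rate_50 < 0.30 or h.max_drawdown > 1.5 * h.backtest_max_drawdown:
        actions.append("pause_strategy")
    if h.regime_changed:
        actions.append("review_parameter_fit")
    return actions
```

Keeping the function free of side effects makes it easy to replay against historical snapshots before wiring it to live position sizing.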

Alert System

Alert Levels

Level	Response Time	Notification Channel	Auto-Action
Info	Next review	Log only	None
Warning	< 1 hour	Discord/Telegram	None
Critical	Immediate	All channels + SMS	Reduce exposure
Emergency	Immediate	All + phone call	Halt trading
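
One way to encode the level table is an ordered enum plus a routing map. The channel names and action identifiers below are placeholders, and the sketch assumes that every level at or above warning is also written to the log.

```python
from enum import IntEnum
from typing import Dict, List, Optional, Tuple


class Level(IntEnum):
    INFO = 0
    WARNING = 1
    CRITICAL = 2
    EMERGENCY = 3


# Channel names and action identifiers are illustrative placeholders.
ROUTING: Dict[Level, Tuple[List[str], Optional[str]]] = {
    Level.INFO: (["log"], None),
    Level.WARNING: (["log", "discord", "telegram"], None),
    Level.CRITICAL: (["log", "discord", "telegram", "sms"], "reduce_exposure"),
    Level.EMERGENCY: (["log", "discord", "telegram", "sms", "phone_call"], "halt_trading"),
}


def route(level: Level) -> Tuple[List[str], Optional[str]]:
    """Return (channels to notify, automatic action or None) for an alert level."""
    return ROUTING[level]
```

Using `IntEnum` lets callers compare severities directly, for example `Level.CRITICAL > Level.WARNING`.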

Alert Configuration

# config/alerts.yaml
channels:
  discord:
    webhook_url: ${DISCORD_WEBHOOK}
    levels: [warning, critical, emergency]

  telegram:
    bot_token: ${TELEGRAM_BOT_TOKEN}
    chat_id: ${TELEGRAM_CHAT_ID}
    levels: [warning, critical, emergency]

  log:
    path: logs/alerts.log
    levels: [info, warning, critical, emergency]

alerts:
  daily_loss:
    metric: daily_pnl_pct
    warning: -2.0
    critical: -3.0
    message: "Daily P&L alert: {value}%"

  drawdown:
    # Assumes drawdown is reported as a negative percent (a 5% drawdown is -5.0)
    metric: current_drawdown_pct
    warning: -5.0
    critical: -8.0
    message: "Drawdown alert: {value}%"

  loss_streak:
    metric: consecutive_losses
    warning: 3
    critical: 5
    message: "Loss streak: {value} consecutive losses"
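
A sketch of how the `alerts` rules might be evaluated once loaded. To stay dependency-free, the rules are written as a Python dict mirroring the YAML above rather than parsed from the file. The config does not state a comparison direction, so this example infers it: when `critical` is below `warning`, lower values are treated as worse. That inference and the name `evaluate` are assumptions.

```python
from typing import Optional

# Mirrors the `alerts` section of config/alerts.yaml.
ALERTS = {
    "daily_loss": {"metric": "daily_pnl_pct", "warning": -2.0, "critical": -3.0,
                   "message": "Daily P&L alert: {value}%"},
    "drawdown": {"metric": "current_drawdown_pct", "warning": -5.0, "critical": -8.0,
                 "message": "Drawdown alert: {value}%"},
    "loss_streak": {"metric": "consecutive_losses", "warning": 3, "critical": 5,
                    "message": "Loss streak: {value} consecutive losses"},
}


def evaluate(rule: dict, value: float) -> Optional[dict]:
    """Return the most severe breached level for a rule, or None if within limits."""
    lower_is_worse = rule["critical"] < rule["warning"]  # inferred direction
    for level in ("critical", "warning"):
        threshold = rule[level]
        breached = value < threshold if lower_is_worse else value > threshold
        if breached:
            return {
                "level": level,
                "metric": rule["metric"],
                "value": value,
                "threshold": threshold,
                "message": rule["message"].format(value=value),
            }
    return None
```

The comparisons are strict, matching the `<` and `>` thresholds in the KPI tables, so a loss streak of exactly 3 does not alert here.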

Alert Message Format

{
  "timestamp": "2024-01-15T14:30:00Z",
  "level": "warning",
  "strategy": "ICT_QM_1H",
  "metric": "daily_pnl_pct",
  "value": -2.3,
  "threshold": -2.0,
  "message": "Daily P&L alert: -2.3%",
  "action_required": "Review positions",
  "dashboard_link": "https://dashboard.example.com/strategy/ICT_QM_1H"
}
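
A payload in this format could be assembled as follows. This is a sketch: the function name `build_alert` and its parameters are hypothetical, and the dashboard base URL is passed in by the caller rather than hard-coded.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_alert(
    level: str,
    strategy: str,
    metric: str,
    value: float,
    threshold: float,
    message: str,
    action_required: str,
    dashboard_base: str,
    now: Optional[datetime] = None,
) -> str:
    """Serialize an alert to JSON using the field names shown above."""
    # `now` should be timezone-aware; it is converted to UTC before formatting.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    payload = {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "strategy": strategy,
        "metric": metric,
        "value": value,
        "threshold": threshold,
        "message": message,
        "action_required": action_required,
        "dashboard_link": f"{dashboard_base}/strategy/{strategy}",
    }
    return json.dumps(payload, indent=2)
```

Passing `now` explicitly keeps the function deterministic in tests; production callers can omit it.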

Incident Response

Incident Classification

Severity	Description	Examples
P1	Capital at immediate risk	Runaway position, system down during open trade
P2	Trading impaired	Can't open new positions, delayed signals
P3	Degraded performance	Higher latency, partial data
P4	Minor issue	Cosmetic, logging gaps

Response Procedures

P1: Capital at Risk

1. IMMEDIATE: Kill all open orders
2. IMMEDIATE: Flatten all positions (market orders)
3. WITHIN 5 MIN: Notify founder via all channels
4. WITHIN 15 MIN: Document current state
5. WITHIN 1 HOUR: Root cause analysis started
6. WITHIN 24 HOURS: Post-mortem document

P2: Trading Impaired

1. WITHIN 5 MIN: Pause new signal generation
2. WITHIN 15 MIN: Assess impact on open positions
3. WITHIN 30 MIN: Implement workaround or escalate to P1
4. WITHIN 4 HOURS: Root cause identified
5. WITHIN 24 HOURS: Fix deployed or documented workaround
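
The P1 and P2 procedures are essentially lists of deadlines measured from the moment of detection. The sketch below turns them into concrete due times and reports overdue steps. "IMMEDIATE" is modeled as a zero offset, and the helper names `deadlines` and `overdue` are illustrative.

```python
from datetime import datetime, timedelta
from typing import List, Set, Tuple

Step = Tuple[timedelta, str]

P1_STEPS: List[Step] = [
    (timedelta(0), "Kill all open orders"),
    (timedelta(0), "Flatten all positions (market orders)"),
    (timedelta(minutes=5), "Notify founder via all channels"),
    (timedelta(minutes=15), "Document current state"),
    (timedelta(hours=1), "Root cause analysis started"),
    (timedelta(hours=24), "Post-mortem document"),
]

P2_STEPS: List[Step] = [
    (timedelta(minutes=5), "Pause new signal generation"),
    (timedelta(minutes=15), "Assess impact on open positions"),
    (timedelta(minutes=30), "Implement workaround or escalate to P1"),
    (timedelta(hours=4), "Root cause identified"),
    (timedelta(hours=24), "Fix deployed or documented workaround"),
]


def deadlines(detected_at: datetime, steps: List[Step]) -> List[Tuple[datetime, str]]:
    """Convert relative offsets into absolute due times."""
    return [(detected_at + offset, action) for offset, action in steps]


def overdue(detected_at: datetime, steps: List[Step], done: Set[str], now: datetime) -> List[str]:
    """Steps whose deadline has passed and that are not yet marked done."""
    return [
        action
        for due, action in deadlines(detected_at, steps)
        if action not in done and now > due
    ]
```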

Runbook Pointers

Scenario	Runbook Location
Broker API down	`docs/runbooks/broker_outage.md`
TradingView webhook failure	`docs/runbooks/webhook_failure.md`
Position size mismatch	`docs/runbooks/position_reconcile.md`
Strategy generating bad signals	`docs/runbooks/signal_validation.md`
Data feed issues	`docs/runbooks/data_issues.md`
General incident response	`docs/runbooks/incident_response.md`

Post-Incident Review Template

# Incident Report: [INCIDENT_ID]

## Summary
- Date/Time: YYYY-MM-DD HH:MM UTC
- Duration: X hours
- Severity: P1/P2/P3/P4
- Impact: [Description of impact]

## Timeline
- HH:MM - Event detected
- HH:MM - Response initiated
- HH:MM - Mitigation applied
- HH:MM - Resolution confirmed

## Root Cause
[Description of root cause]

## Resolution
[What was done to fix it]

## Lessons Learned
- [Lesson 1]
- [Lesson 2]

## Action Items
- [ ] [Action 1] - Owner - Due date
- [ ] [Action 2] - Owner - Due date

Dashboard Requirements

Minimum Viable Dashboard

Panel	Content	Update Frequency
P&L Summary	Today, MTD, YTD	Real-time
Open Positions	Symbol, size, unrealized P&L	Real-time
Drawdown Gauge	Current DD vs max allowed	Real-time
Recent Trades	Last 10 trades with P&L	Per trade
Alert Feed	Last 24h alerts	Per alert
Strategy Health	Per-strategy Sharpe, win rate	Hourly

Dashboard Tech Stack

Component	Technology	Rationale
Backend	Python + FastAPI	Existing stack
Database	SQLite	Simple, portable
Frontend	Streamlit or Grafana	Rapid development
Hosting	Local or VPS	Data stays private
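
With SQLite as the store, basic P&L tracking needs little more than one table and an aggregate query. The sketch below uses Python's built-in `sqlite3` module. The schema, column names and helper functions are assumptions for illustration, and timestamps are assumed to be ISO 8601 strings in UTC.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS trades (
    id INTEGER PRIMARY KEY,
    closed_at TEXT NOT NULL,
    strategy TEXT NOT NULL,
    symbol TEXT NOT NULL,
    pnl REAL NOT NULL
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure the trades table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_trade(conn: sqlite3.Connection, closed_at: str, strategy: str, symbol: str, pnl: float) -> None:
    """Insert one closed trade; the `with` block commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO trades (closed_at, strategy, symbol, pnl) VALUES (?, ?, ?, ?)",
            (closed_at, strategy, symbol, pnl),
        )


def daily_pnl(conn: sqlite3.Connection, day: str) -> float:
    """Sum of realized P&L for trades closed on `day` (format YYYY-MM-DD)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(pnl), 0.0) FROM trades WHERE substr(closed_at, 1, 10) = ?",
        (day,),
    ).fetchone()
    return row[0]
```

The same table can back the P&L Summary and Recent Trades panels; MTD and YTD totals only change the date prefix being matched.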

Next 7 Days: Action Plan

Create docs/runbooks/incident_response.md template
Set up Discord/Telegram webhook for alerts
Implement basic P&L tracking in SQLite
Create drift detection prototype (src/monitoring/drift.py)
Define alert thresholds in config/alerts.yaml
Build minimal Streamlit dashboard (P&L + positions)
Test alert delivery end-to-end