Monitoring Specification

Overview

Monitoring operates at three levels:

1. Real-time: Trade-by-trade, position-level alerts
2. Session: Daily/weekly performance review
3. Strategic: Long-term drift and decay detection

Live KPIs

Real-Time Metrics (Updated Per Trade)

| Metric | Calculation | Alert Threshold |
|---|---|---|
| Daily P&L | Sum of closed + unrealized | < -2% warning, < -3% critical |
| Open P&L | Current mark-to-market | < -1.5% per position |
| Daily Trades | Count of trades today | > 10 (overtrading) |
| Win Streak | Consecutive wins | > 5 (review for luck) |
| Loss Streak | Consecutive losses | > 3 (pause and review) |
| Largest Loss | Single trade loss | > 1.5% of account |
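
The streak rules in the table can be derived from the ordered list of closed-trade P&L values. The following is a minimal sketch rather than a reference implementation: `current_streak` and `streak_alert` are hypothetical names, and treating a breakeven trade as the end of a streak is an assumption.

```python
from typing import Optional, Sequence


def current_streak(trade_pnls: Sequence[float]) -> int:
    """Return the current streak: positive for wins, negative for losses.

    Trades are ordered oldest first. A breakeven trade (P&L == 0) ends
    the streak, which is an assumption of this sketch.
    """
    streak = 0
    for pnl in reversed(trade_pnls):
        if pnl > 0 and streak >= 0:
            streak += 1
        elif pnl < 0 and streak <= 0:
            streak -= 1
        else:
            break
    return streak


def streak_alert(trade_pnls: Sequence[float],
                 max_wins: int = 5,
                 max_losses: int = 3) -> Optional[str]:
    """Apply the table's thresholds: > 5 wins or > 3 losses in a row."""
    streak = current_streak(trade_pnls)
    if streak > max_wins:
        return "win_streak_review"
    if -streak > max_losses:
        return "loss_streak_pause"
    return None
```

Encoding the direction in the sign keeps a single counter for both rules, so a win streak and a loss streak can never be reported at the same time.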

Session Metrics (Updated Hourly/Daily)

| Metric | Calculation | Warning | Critical |
|---|---|---|---|
| Rolling Sharpe (7d) | Annualized from daily returns | < 0.5 | < 0 |
| Rolling Win Rate (20 trades) | Wins / Total | < 40% | < 30% |
| Rolling R:R (20 trades) | Avg Win / Avg Loss | < 1.2 | < 1.0 |
| Drawdown (current) | (Peak - Current) / Peak | > 5% | > 8% |
| Profit Factor (MTD) | Gross Profit / Gross Loss | < 1.3 | < 1.0 |
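
The trade-based formulas in this table translate directly into code. The sketch below uses only the standard library. The function names are illustrative, and so are the choices to count a breakeven trade as a non-win and to report an infinite profit factor when there are no losing trades.

```python
import math
from statistics import mean


def rolling_win_rate(trade_pnls, window=20):
    """Wins / Total over the last `window` trades."""
    recent = trade_pnls[-window:]
    if not recent:
        return 0.0
    return sum(1 for p in recent if p > 0) / len(recent)


def reward_to_risk(trade_pnls, window=20):
    """Avg Win / Avg Loss over the last `window` trades, or None if undefined."""
    recent = trade_pnls[-window:]
    wins = [p for p in recent if p > 0]
    losses = [-p for p in recent if p < 0]
    if not wins or not losses:
        return None
    return mean(wins) / mean(losses)


def current_drawdown(equity_curve):
    """(Peak - Current) / Peak, as a positive fraction of the peak."""
    peak = max(equity_curve)
    return (peak - equity_curve[-1]) / peak


def profit_factor(trade_pnls):
    """Gross Profit / Gross Loss; infinite when there are no losing trades."""
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)
    return math.inf if gross_loss == 0 else gross_profit / gross_loss
```

For example, an equity curve of `[100, 110, 99]` has a peak of 110 and a current drawdown of 11 / 110 = 10%, which is past the 8% critical level.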

Portfolio-Level Metrics

| Metric | Target | Check Frequency |
|---|---|---|
| Total Exposure | < 80% of capital | Real-time |
| Strategy Correlation | < 0.5 pairwise | Daily |
| Sector Concentration | < 40% single asset class | Daily |
| Net Delta | Balanced unless intentional | Real-time |
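
As an illustration of the exposure and correlation checks, here is a small standard-library sketch. It assumes exposure is measured as gross notional over capital and that only positive correlation at or above the limit is a concern. Neither assumption is stated in the table, and the function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean


def total_exposure(position_notionals, capital):
    """Gross exposure as a fraction of capital (long and short both count)."""
    return sum(abs(n) for n in position_notionals) / capital


def pearson(x, y):
    """Pearson correlation of two equal-length return series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(returns_by_strategy, limit=0.5):
    """Return (name_a, name_b, r) for every pair with correlation >= limit."""
    flagged = []
    for (a, ra), (b, rb) in combinations(returns_by_strategy.items(), 2):
        r = pearson(ra, rb)
        if r >= limit:
            flagged.append((a, b, round(r, 2)))
    return flagged
```

Passing `{"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 1, 3, 2]}` flags only the A/B pair, since the two series move in lockstep while C is negatively correlated with both.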

Execution Quality Metrics

| Metric | Expected | Warning | Critical |
|---|---|---|---|
| Signal-to-fill latency | < 2s | > 5s | > 10s |
| Slippage vs model | ±1 tick | > 2 ticks | > 5 ticks |
| Fill rate | > 95% | < 90% | < 80% |
| Rejected orders | < 2% | > 5% | > 10% |
| Partial fills | < 5% | > 10% | > 20% |
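
One way to apply these bands is to express each observation in the table's unit and grade it against the warning and critical levels. The sketch below is hedged: `slippage_ticks` and `grade` are hypothetical helpers, and measuring slippage as adverse movement only (a worse fill than the model price) is an assumption.

```python
def slippage_ticks(model_price, fill_price, tick_size, side):
    """Adverse slippage in ticks; negative values mean price improvement."""
    if side == "buy":
        diff = fill_price - model_price   # paying more is adverse
    else:
        diff = model_price - fill_price   # receiving less is adverse
    return diff / tick_size


def grade(value, warning, critical, higher_is_worse=True):
    """Return 'ok', 'warning', or 'critical' using strict comparisons."""
    if not higher_is_worse:
        # Flip signs so the same comparisons work for metrics like fill rate
        value, warning, critical = -value, -warning, -critical
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


# A buy filled three ticks above the model price is in the warning band
ticks = slippage_ticks(100.00, 100.75, 0.25, "buy")
slippage_status = grade(ticks, warning=2, critical=5)

# Fill rate is a lower-is-worse metric: 85% sits between 90% and 80%
fill_status = grade(0.85, warning=0.90, critical=0.80, higher_is_worse=False)
```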

Drift and Decay Detection

What is Strategy Decay?

Strategy decay occurs when a previously profitable strategy loses edge over time due to:

- Market regime change
- Increased competition (alpha crowding)
- Structural market changes
- Parameter staleness

Detection Methods

1. Rolling Performance Degradation

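def detect_performance_drift(returns, lookback=60, threshold=0.5):
    """
    Compare recent performance to historical baseline.

    Alert if recent Sharpe < threshold × historical Sharpe.
    Returns None when no alert fires or no usable baseline exists.
    """
    # Without data older than the lookback window there is no baseline
    if len(returns) <= lookback:
        return None

    historical_sharpe = calculate_sharpe(returns[:-lookback])
    recent_sharpe = calculate_sharpe(returns[-lookback:])

    # The ratio is only meaningful against a positive baseline
    if historical_sharpe <= 0:
        return None

    degradation = recent_sharpe / historical_sharpe
    if degradation < threshold:
        return Alert(
            level='warning',
            message=f'Performance degraded to {degradation:.0%} of baseline'
        )
    return None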
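
The function above relies on two names that this specification does not define, `calculate_sharpe` and `Alert`. One possible shape for them is sketched below. This assumes a zero risk-free rate and 252 trading periods per year, both of which are assumptions to adjust for the market being traded.

```python
import math
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Alert:
    """Minimal alert record; the real structure may carry more fields."""
    level: str
    message: str


def calculate_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns (risk-free rate = 0)."""
    returns = list(returns)
    if len(returns) < 2:
        return 0.0
    sd = stdev(returns)
    if sd == 0:
        return 0.0  # avoid dividing by zero on a flat return series
    return mean(returns) / sd * math.sqrt(periods_per_year)


# Synthetic data: a profitable history followed by a losing recent window
history = [0.002, -0.001] * 60
recent = [0.0005, -0.001] * 30

historical_sharpe = calculate_sharpe(history)
recent_sharpe = calculate_sharpe(recent)
degradation = recent_sharpe / historical_sharpe  # negative here, so well below 0.5
```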

2. Regime Change Detection

def detect_regime_change(prices, volatility_window=20):
    """
    Classify the current volatility regime (low/normal/high) by comparing
    recent rolling volatility with full-sample volatility.

    Trend (trending/ranging) and correlation (correlated/decorrelated)
    regimes are not covered by this check.
    """
    current_vol = prices.pct_change().rolling(volatility_window).std().iloc[-1]
    historical_vol = prices.pct_change().std()

    vol_ratio = current_vol / historical_vol

    if vol_ratio > 2.0:
        return Regime.HIGH_VOL
    elif vol_ratio < 0.5:
        return Regime.LOW_VOL
    else:
        return Regime.NORMAL
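
`Regime` is not defined in this specification. A minimal sketch of the enum and of the ratio classification follows. It also makes one pitfall explicit: with fewer than `volatility_window` prices the rolling standard deviation is NaN, every comparison against NaN is false, and the function above would silently report a normal regime. Returning `None` in that case is an assumption of this sketch.

```python
import math
from enum import Enum


class Regime(Enum):
    LOW_VOL = "low_vol"
    NORMAL = "normal"
    HIGH_VOL = "high_vol"


def classify_vol_ratio(vol_ratio, high=2.0, low=0.5):
    """Map current/historical volatility ratio to a regime.

    Returns None when the ratio is NaN (not enough data to decide).
    """
    if math.isnan(vol_ratio):
        return None
    if vol_ratio > high:
        return Regime.HIGH_VOL
    if vol_ratio < low:
        return Regime.LOW_VOL
    return Regime.NORMAL
```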

3. Win Rate Decay

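def detect_win_rate_decay(trades, baseline_win_rate, window=50):
    """
    Alert if rolling win rate drops significantly below baseline.

    Expects a DataFrame with a boolean 'is_win' column, oldest trade first.
    """
    if len(trades) < window:
        return None  # too few trades for a stable estimate

    recent_win_rate = trades['is_win'].iloc[-window:].mean()

    if recent_win_rate < baseline_win_rate * 0.7:  # 30% relative degradation
        return Alert(
            level='critical',
            message=f'Win rate dropped to {recent_win_rate:.0%} vs {baseline_win_rate:.0%} baseline'
        )
    return None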

4. Distribution Shift (KL Divergence)

import numpy as np
from scipy.stats import entropy


def detect_distribution_shift(returns, reference_returns, threshold=0.5):
    """
    Compare return distribution to reference period.
    High KL divergence indicates the distribution has shifted.
    """
    # Bin both samples on the same grid. Clip first so that moves beyond
    # ±5% land in the outer bins instead of being silently dropped.
    bins = np.linspace(-0.05, 0.05, 50)
    p = np.histogram(np.clip(returns, -0.05, 0.05), bins=bins, density=True)[0] + 1e-10
    q = np.histogram(np.clip(reference_returns, -0.05, 0.05), bins=bins, density=True)[0] + 1e-10

    # entropy(p, q) normalizes both inputs and returns KL(p || q)
    kl_div = entropy(p, q)

    if kl_div > threshold:
        return Alert(
            level='warning',
            message=f'Return distribution shifted (KL={kl_div:.2f})'
        )
    return None
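
To make the quantity being thresholded concrete, the following standard-library sketch computes the same KL divergence that `scipy.stats.entropy(p, q)` returns for two histograms: both inputs are normalized to sum to 1, then `sum(p * log(p / q))` is taken with the natural logarithm. The function name and the smoothing constant are illustrative.

```python
import math


def kl_divergence(p_counts, q_counts, eps=1e-10):
    """KL(p || q) in nats for two histograms over the same bins."""
    p = [c + eps for c in p_counts]   # smoothing avoids log(0) and division by 0
    q = [c + eps for c in q_counts]
    p_total, q_total = sum(p), sum(q)
    return sum(
        (a / p_total) * math.log((a / p_total) / (b / q_total))
        for a, b in zip(p, q)
    )


same = kl_divergence([5, 5], [5, 5])          # identical shapes: ~0
forward = kl_divergence([5, 5], [9, 1])       # KL(p || q)
backward = kl_divergence([9, 1], [5, 5])      # KL(q || p), a different value
```

KL divergence is not symmetric, so which sample is passed first matters. In the detector above the recent returns are `p` and the reference period is `q`.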

Decay Response Protocol

| Signal | Threshold | Action |
|---|---|---|
| 30-day Sharpe | < 0.3 | Flag for review |
| 60-day Sharpe | < 0 | Reduce position size by 50% |
| Win rate (50 trades) | < 30% | Pause strategy |
| Max drawdown | > 1.5 × backtest max | Pause strategy |
| Regime change | Detected | Review parameter fit |
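
The protocol can be encoded as a single function that returns the most severe triggered action. This is a sketch under assumptions: the severity ordering (pause above reduce above review above flag) is not stated in the table, and all names are hypothetical.

```python
from enum import IntEnum


class DecayAction(IntEnum):
    NONE = 0
    FLAG_FOR_REVIEW = 1
    REVIEW_PARAMETER_FIT = 2
    REDUCE_SIZE_50 = 3
    PAUSE_STRATEGY = 4


def decay_response(sharpe_30d, sharpe_60d, win_rate_50,
                   max_dd, backtest_max_dd, regime_changed):
    """Return the most severe action triggered by the decay signals.

    Drawdowns are positive fractions (0.10 means 10%).
    """
    triggered = [DecayAction.NONE]
    if sharpe_30d < 0.3:
        triggered.append(DecayAction.FLAG_FOR_REVIEW)
    if regime_changed:
        triggered.append(DecayAction.REVIEW_PARAMETER_FIT)
    if sharpe_60d < 0:
        triggered.append(DecayAction.REDUCE_SIZE_50)
    if win_rate_50 < 0.30 or max_dd > 1.5 * backtest_max_dd:
        triggered.append(DecayAction.PAUSE_STRATEGY)
    return max(triggered)
```

Taking the maximum means a strategy that breaches several thresholds at once gets only the strongest response, which avoids cutting size on a strategy that is about to be paused anyway.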

Alert System

Alert Levels

| Level | Response Time | Notification Channel | Auto-Action |
|---|---|---|---|
| Info | Next review | Log only | None |
| Warning | < 1 hour | Discord/Telegram | None |
| Critical | Immediate | All channels + SMS | Reduce exposure |
| Emergency | Immediate | All + phone call | Halt trading |
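
A possible way to encode this table is a lookup from level to channels and auto-action. In the sketch below the channel keys `sms` and `phone`, the action strings, and the reading of "All + phone call" as including SMS are assumptions. Every level is also written to the log, matching the `log` channel in the configuration.

```python
from enum import IntEnum


class Level(IntEnum):
    INFO = 0
    WARNING = 1
    CRITICAL = 2
    EMERGENCY = 3


# level -> (notification channels, automatic action or None)
ROUTES = {
    Level.INFO: (["log"], None),
    Level.WARNING: (["log", "discord", "telegram"], None),
    Level.CRITICAL: (["log", "discord", "telegram", "sms"], "reduce_exposure"),
    Level.EMERGENCY: (["log", "discord", "telegram", "sms", "phone"], "halt_trading"),
}


def route(level):
    """Return the channels to notify and the auto-action for a level."""
    channels, auto_action = ROUTES[level]
    return {"channels": list(channels), "auto_action": auto_action}


def escalate(levels):
    """When several alerts fire together, act on the most severe one."""
    return max(levels, default=Level.INFO)
```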

Alert Configuration

# config/alerts.yaml
channels:
  discord:
    webhook_url: ${DISCORD_WEBHOOK}
    levels: [warning, critical, emergency]

  telegram:
    bot_token: ${TELEGRAM_BOT_TOKEN}
    chat_id: ${TELEGRAM_CHAT_ID}
    levels: [warning, critical, emergency]

  log:
    path: logs/alerts.log
    levels: [info, warning, critical, emergency]

alerts:
  daily_loss:
    metric: daily_pnl_pct
    warning: -2.0
    critical: -3.0
    message: "Daily P&L alert: {value}%"

  drawdown:
    metric: current_drawdown_pct
    # Negative values: -5.0 means equity is 5% below its peak
    warning: -5.0
    critical: -8.0
    message: "Drawdown alert: {value}%"

  loss_streak:
    metric: consecutive_losses
    warning: 3
    critical: 5
    message: "Loss streak: {value} consecutive losses"
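
The configuration does not say whether a metric alerts when it falls below or rises above its thresholds. One option, sketched here as an assumption rather than the intended design, is to infer the direction from the ordering of `warning` and `critical`. The `ALERTS` dict mirrors the `alerts:` section above instead of loading the YAML file, and comparisons are strict to match the tables (a streak of exactly 3 does not fire).

```python
ALERTS = {
    "daily_loss": {"metric": "daily_pnl_pct", "warning": -2.0, "critical": -3.0,
                   "message": "Daily P&L alert: {value}%"},
    "drawdown": {"metric": "current_drawdown_pct", "warning": -5.0, "critical": -8.0,
                 "message": "Drawdown alert: {value}%"},
    "loss_streak": {"metric": "consecutive_losses", "warning": 3, "critical": 5,
                    "message": "Loss streak: {value} consecutive losses"},
}


def classify(value, warning, critical):
    """Return 'critical', 'warning', or None for a metric value."""
    if critical < warning:          # lower is worse (P&L, signed drawdown)
        if value < critical:
            return "critical"
        if value < warning:
            return "warning"
    else:                           # higher is worse (loss streak)
        if value > critical:
            return "critical"
        if value > warning:
            return "warning"
    return None


def evaluate(metrics):
    """Return (alert_name, level, message) for every rule that fires."""
    fired = []
    for name, rule in ALERTS.items():
        value = metrics.get(rule["metric"])
        if value is None:
            continue  # metric not reported this cycle
        level = classify(value, rule["warning"], rule["critical"])
        if level:
            fired.append((name, level, rule["message"].format(value=value)))
    return fired
```

Calling `evaluate({"daily_pnl_pct": -2.3, "consecutive_losses": 6})` yields a warning for the daily loss and a critical alert for the loss streak.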

Alert Message Format

{
  "timestamp": "2024-01-15T14:30:00Z",
  "level": "warning",
  "strategy": "ICT_QM_1H",
  "metric": "daily_pnl_pct",
  "value": -2.3,
  "threshold": -2.0,
  "message": "Daily P&L alert: -2.3%",
  "action_required": "Review positions",
  "dashboard_link": "https://dashboard.example.com/strategy/ICT_QM_1H"
}
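
A payload in this format could be assembled as follows. The sketch assumes the dashboard URL pattern shown in the example, and `build_alert` is a hypothetical helper. `now` must be a timezone-aware datetime so that the trailing `Z` is accurate.

```python
import json
from datetime import datetime, timezone


def build_alert(level, strategy, metric, value, threshold,
                message, action_required, now=None):
    """Build an alert payload matching the message format above."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "strategy": strategy,
        "metric": metric,
        "value": value,
        "threshold": threshold,
        "message": message,
        "action_required": action_required,
        "dashboard_link": f"https://dashboard.example.com/strategy/{strategy}",
    }


payload = build_alert(
    level="warning",
    strategy="ICT_QM_1H",
    metric="daily_pnl_pct",
    value=-2.3,
    threshold=-2.0,
    message="Daily P&L alert: -2.3%",
    action_required="Review positions",
    now=datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc),
)
body = json.dumps(payload, indent=2)  # ready to send to a webhook
```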

Incident Response

Incident Classification

| Severity | Description | Examples |
|---|---|---|
| P1 | Capital at immediate risk | Runaway position, system down during open trade |
| P2 | Trading impaired | Can't open new positions, delayed signals |
| P3 | Degraded performance | Higher latency, partial data |
| P4 | Minor issue | Cosmetic, logging gaps |

Response Procedures

P1: Capital at Risk

1. IMMEDIATE: Kill all open orders
2. IMMEDIATE: Flatten all positions (market orders)
3. WITHIN 5 MIN: Notify founder via all channels
4. WITHIN 15 MIN: Document current state
5. WITHIN 1 HOUR: Root cause analysis started
6. WITHIN 24 HOURS: Post-mortem document

P2: Trading Impaired

1. WITHIN 5 MIN: Pause new signal generation
2. WITHIN 15 MIN: Assess impact on open positions
3. WITHIN 30 MIN: Implement workaround or escalate to P1
4. WITHIN 4 HOURS: Root cause identified
5. WITHIN 24 HOURS: Fix deployed or documented workaround

Runbook Pointers

| Scenario | Runbook Location |
|---|---|
| Broker API down | docs/runbooks/broker_outage.md |
| TradingView webhook failure | docs/runbooks/webhook_failure.md |
| Position size mismatch | docs/runbooks/position_reconcile.md |
| Strategy generating bad signals | docs/runbooks/signal_validation.md |
| Data feed issues | docs/runbooks/data_issues.md |
| General incident response | docs/runbooks/incident_response.md |

Post-Incident Review Template

# Incident Report: [INCIDENT_ID]

## Summary
- Date/Time: YYYY-MM-DD HH:MM UTC
- Duration: X hours
- Severity: P1/P2/P3/P4
- Impact: [Description of impact]

## Timeline
- HH:MM - Event detected
- HH:MM - Response initiated
- HH:MM - Mitigation applied
- HH:MM - Resolution confirmed

## Root Cause
[Description of root cause]

## Resolution
[What was done to fix it]

## Lessons Learned
- [Lesson 1]
- [Lesson 2]

## Action Items
- [ ] [Action 1] - Owner - Due date
- [ ] [Action 2] - Owner - Due date

Dashboard Requirements

Minimum Viable Dashboard

| Panel | Content | Update Frequency |
|---|---|---|
| P&L Summary | Today, MTD, YTD | Real-time |
| Open Positions | Symbol, size, unrealized P&L | Real-time |
| Drawdown Gauge | Current DD vs max allowed | Real-time |
| Recent Trades | Last 10 trades with P&L | Per trade |
| Alert Feed | Last 24h alerts | Per alert |
| Strategy Health | Per-strategy Sharpe, win rate | Hourly |

Dashboard Tech Stack

| Component | Technology | Rationale |
|---|---|---|
| Backend | Python + FastAPI | Existing stack |
| Database | SQLite | Simple, portable |
| Frontend | Streamlit or Grafana | Rapid development |
| Hosting | Local or VPS | Data stays private |
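
To show how little is needed for the SQLite layer, here is a minimal sketch of closed-trade P&L tracking using Python's built-in `sqlite3` module. The table name, column names, and the `ES` symbol are placeholders rather than a settled schema, and unrealized P&L is out of scope for this example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS trades (
    id        INTEGER PRIMARY KEY,
    strategy  TEXT NOT NULL,
    symbol    TEXT NOT NULL,
    closed_at TEXT NOT NULL,   -- ISO 8601 UTC timestamp
    pnl       REAL NOT NULL
);
"""


def daily_pnl(conn, day):
    """Sum of closed-trade P&L for a UTC date given as 'YYYY-MM-DD'."""
    row = conn.execute(
        "SELECT COALESCE(SUM(pnl), 0.0) FROM trades "
        "WHERE substr(closed_at, 1, 10) = ?",
        (day,),
    ).fetchone()
    return row[0]


# In-memory database for illustration; a file path would be used in practice
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO trades (strategy, symbol, closed_at, pnl) VALUES (?, ?, ?, ?)",
    [
        ("ICT_QM_1H", "ES", "2024-01-15T14:30:00Z", 120.0),
        ("ICT_QM_1H", "ES", "2024-01-15T16:05:00Z", -45.0),
        ("ICT_QM_1H", "ES", "2024-01-16T09:40:00Z", 30.0),
    ],
)
conn.commit()

today_pnl = daily_pnl(conn, "2024-01-15")
```

Storing timestamps as ISO 8601 text keeps the file portable and lets the date filter work with a plain string prefix.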

Next 7 Days: Action Plan

  • Create docs/runbooks/incident_response.md template
  • Set up Discord/Telegram webhook for alerts
  • Implement basic P&L tracking in SQLite
  • Create drift detection prototype (src/monitoring/drift.py)
  • Define alert thresholds in config/alerts.yaml
  • Build minimal Streamlit dashboard (P&L + positions)
  • Test alert delivery end-to-end