ML Patterns Playbook¶
Pattern Catalog¶
Pattern 1: Regime-Conditional Strategy Selection¶
Problem: No single strategy performs well across all market conditions.
Solution: Train regime classifier, then select/weight strategies per regime.
class RegimeConditionalStrategy:
    def __init__(self, strategies, regime_model):
        self.strategies = strategies  # dict: regime -> strategy
        self.regime_model = regime_model

    def generate_signal(self, data):
        # 1. Classify current regime
        regime = self.regime_model.predict(data.features)
        # 2. Select strategy for regime
        if regime in self.strategies:
            return self.strategies[regime].generate_signal(data)
        else:
            return None  # No signal in unknown regime
When to Use:

- Strategy shows regime-dependent performance
- Sufficient history in each regime
- Regimes are identifiable in real time
Anti-Pattern: Overfitting regime boundaries to maximize backtest.
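A toy sketch of the selection flow above, with hypothetical stub classes (`ConstantRegimeModel` and `Echo` are illustrative stand-ins, not part of any real codebase):

```python
class ConstantRegimeModel:
    """Stand-in regime classifier that always predicts one regime."""
    def __init__(self, regime):
        self.regime = regime

    def predict(self, features):
        return self.regime

class Echo:
    """Stand-in strategy that always emits the same signal."""
    def __init__(self, signal):
        self.signal = signal

    def generate_signal(self, data):
        return self.signal

strategies = {'trending': Echo('long'), 'ranging': Echo('flat')}

def generate_signal(regime_model, features):
    regime = regime_model.predict(features)
    strat = strategies.get(regime)
    return strat.generate_signal(features) if strat else None

print(generate_signal(ConstantRegimeModel('trending'), None))  # long
print(generate_signal(ConstantRegimeModel('crisis'), None))    # None: unknown regime, no trade
```

Returning `None` for an unmapped regime is the conservative default: standing aside in conditions the classifier was never trained to handle.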
Pattern 2: Probability-Weighted Sizing¶
Problem: Equal sizing ignores signal quality variation.
Solution: Scale position size by predicted success probability.
import numpy as np

def probability_weighted_size(base_size, probability, min_prob=0.5, max_prob=0.7):
    """
    Scale position size by signal probability.

    probability: model's P(success) estimate.
    Returns base_size times a multiplier in [0.5, 1.5],
    or 0 when probability is below min_prob.
    """
    if probability < min_prob:
        return 0  # Don't trade
    # Linear scaling between min and max
    scale = (probability - min_prob) / (max_prob - min_prob)
    scale = np.clip(scale, 0, 1)
    # Map to size multiplier [0.5, 1.5]
    multiplier = 0.5 + scale * 1.0
    return base_size * multiplier
When to Use:

- Well-calibrated probability model
- Sufficient trade frequency to realize the statistical edge
- Risk limits accommodate size variation
Anti-Pattern: Over-concentrating on "high probability" signals that are overfit.
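A quick worked example of the mapping, assuming the defaults above: p = 0.65 sits 75% of the way from min_prob to max_prob, so the multiplier is 0.5 + 0.75 = 1.25.

```python
import numpy as np

def probability_weighted_size(base_size, probability, min_prob=0.5, max_prob=0.7):
    if probability < min_prob:
        return 0  # below the floor: don't trade
    scale = np.clip((probability - min_prob) / (max_prob - min_prob), 0, 1)
    return base_size * (0.5 + scale)  # multiplier in [0.5, 1.5]

print(round(probability_weighted_size(100, 0.65), 2))  # 125.0
print(probability_weighted_size(100, 0.45))            # 0
print(round(probability_weighted_size(100, 0.90), 2))  # 150.0 (scale capped at 1)
```

Probabilities above max_prob are capped at the 1.5x multiplier rather than extrapolated, which limits the damage when the model is overconfident.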
Pattern 3: Ensemble with Disagreement Filter¶
Problem: Single model has blind spots.
Solution: Use multiple models, only trade when they agree.
class EnsembleFilter:
    def __init__(self, models, agreement_threshold=0.6):
        self.models = models
        self.agreement_threshold = agreement_threshold

    def should_trade(self, features):
        # Assumes each model returns a binary prediction (1 = long, 0 = short)
        predictions = [m.predict(features) for m in self.models]
        agreement = sum(predictions) / len(predictions)
        if agreement >= self.agreement_threshold:
            return {'trade': True, 'direction': 'long', 'confidence': agreement}
        elif agreement <= (1 - self.agreement_threshold):
            return {'trade': True, 'direction': 'short', 'confidence': 1 - agreement}
        else:
            return {'trade': False, 'reason': 'insufficient_agreement'}
When to Use:

- Models trained on different feature sets
- Models use different algorithms
- Historical disagreement predicts losses
Anti-Pattern: Using correlated models that fail together.
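A minimal agreement check with toy binary models (`Up` and `Down` are illustrative stand-ins): two of three models predicting long gives agreement 0.67, clearing the 0.6 threshold.

```python
class Up:
    def predict(self, features):
        return 1  # always predicts long

class Down:
    def predict(self, features):
        return 0  # always predicts short

def should_trade(models, features, threshold=0.6):
    predictions = [m.predict(features) for m in models]
    agreement = sum(predictions) / len(predictions)
    if agreement >= threshold:
        return ('long', agreement)
    if agreement <= 1 - threshold:
        return ('short', 1 - agreement)
    return (None, agreement)  # insufficient agreement: stand aside

print(should_trade([Up(), Up(), Down()], None))  # 2 of 3 agree: trade long
print(should_trade([Up(), Down()], None))        # split vote: no trade
```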
Pattern 4: Volatility-Scaled Features¶
Problem: Raw price features behave differently across volatility regimes.
Solution: Normalize features by volatility.
import pandas as pd

def volatility_scaled_features(df, vol_window=20):
    """
    Scale price-based features by rolling volatility.
    Assumes df has 'close', 'high', 'low', and 'sma_20' columns
    and a calculate_atr helper.
    """
    atr = calculate_atr(df, vol_window)
    features = {
        # Raw: distance from MA (varies with vol)
        # Scaled: same metric in ATR units (stable)
        'price_vs_ma': (df['close'] - df['sma_20']) / atr,
        # Raw: absolute range
        # Scaled: range as multiple of typical range
        'range_ratio': (df['high'] - df['low']) / atr,
        # Raw: absolute move
        # Scaled: move significance
        'return_atr': df['close'].pct_change() / (atr / df['close'])
    }
    return pd.DataFrame(features)
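The `calculate_atr` helper referenced above is not defined in this playbook; a minimal sketch, assuming OHLC columns and a simple rolling-mean ATR (production code might prefer Wilder smoothing):

```python
import pandas as pd

def calculate_atr(df, window=20):
    """Average True Range: rolling mean of the true range."""
    prev_close = df['close'].shift(1)
    true_range = pd.concat([
        df['high'] - df['low'],           # intraday range
        (df['high'] - prev_close).abs(),  # gap up from prior close
        (df['low'] - prev_close).abs(),   # gap down from prior close
    ], axis=1).max(axis=1)
    return true_range.rolling(window).mean()
```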
When to Use: Always for price-based features in ML models.
Anti-Pattern: Mixing scaled and unscaled features without understanding how each behaves across volatility regimes.
Pattern 5: Online Learning with Decay¶
Problem: Market dynamics change; static models go stale.
Solution: Continuously update model with exponential decay on old data.
import numpy as np

class OnlineLearner:
    def __init__(self, base_model, decay_factor=0.99):
        self.model = base_model
        self.decay = decay_factor
        self.sample_weights = []
        self.X_history = None
        self.y_history = None

    def partial_fit(self, X_new, y_new):
        # Decay weights of previously seen samples
        self.sample_weights = [w * self.decay for w in self.sample_weights]
        # Add new samples with weight 1.0
        self.sample_weights.extend([1.0] * len(X_new))
        # Accumulate history and retrain with weighted samples
        if self.X_history is None:
            self.X_history, self.y_history = np.asarray(X_new), np.asarray(y_new)
        else:
            self.X_history = np.vstack([self.X_history, X_new])
            self.y_history = np.concatenate([self.y_history, y_new])
        self.model.fit(self.X_history, self.y_history, sample_weight=self.sample_weights)
When to Use:

- Stable feature set
- Regular arrival of new data
- Gradual regime changes

Anti-Pattern: Decaying too fast (overfits to recent data) or too slow (model goes stale).
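One way to reason about the decay choice: with decay factor d, a sample's weight halves every ln(0.5)/ln(d) updates, so d = 0.99 gives a half-life of roughly 69 updates and d = 0.999 roughly 693. A small arithmetic check:

```python
import math

def effective_half_life(decay):
    """Updates until a sample's weight falls to half its initial value."""
    return math.log(0.5) / math.log(decay)

print(round(effective_half_life(0.99)))   # 69
print(round(effective_half_life(0.999)))  # 693
```

Matching this half-life to the expected pace of regime change is the practical way to pick the decay factor.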
Failure Modes & How We Detect Them¶
Failure Mode 1: Lookahead Bias¶
What It Is: Using future information in features or labels.
Symptoms:

- Backtest Sharpe > 3 (unrealistic)
- Sharp performance drop in paper/live trading
- Features improve when you add "future" bars
Detection:
class LookaheadError(Exception):
    pass

def test_lookahead(strategy, data):
    """
    Test for lookahead by running on truncated data.
    """
    signals_full = strategy.generate_signals(data)
    signals_partial = strategy.generate_signals(data[:-100])
    # Signals for the overlapping period should be identical
    overlap = signals_partial.index
    mismatch = (signals_full.loc[overlap] != signals_partial).sum()
    if mismatch > 0:
        raise LookaheadError(f"Lookahead detected: {mismatch} signals changed")
Prevention:

- Strict point-in-time feature computation
- Purged cross-validation
- Code review checklist
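A simplified sketch of the purged cross-validation idea: walk-forward splits that drop an embargo of bars between train and test so labels near the boundary cannot leak. Full purging would also remove training samples whose label windows overlap the test set; `purged_walkforward_splits` is an illustrative name, not an existing utility.

```python
import numpy as np

def purged_walkforward_splits(n_samples, n_folds=5, embargo=10):
    """Yield (train, test) index arrays with an embargo gap purged
    between the end of training data and the start of the test fold."""
    fold = n_samples // n_folds
    for k in range(1, n_folds):
        test_start = k * fold
        train = np.arange(max(0, test_start - embargo))
        test = np.arange(test_start, min(test_start + fold, n_samples))
        yield train, test

for train, test in purged_walkforward_splits(100):
    # The last training bar always sits more than `embargo` bars before the test fold
    assert test[0] - train[-1] > 10
```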
Failure Mode 2: Label Leakage¶
What It Is: Target variable information leaks into features.
Symptoms:

- Near-perfect training accuracy
- Model relies heavily on one feature
- Feature importance shows a suspicious pattern
Detection:
from sklearn.metrics import roc_auc_score

class LeakageError(Exception):
    pass

def test_label_leakage(X, y, feature_names):
    """
    Check if any single feature predicts the target too well.
    """
    leaky_features = []
    for i, name in enumerate(feature_names):
        auc = roc_auc_score(y, X[:, i])
        if auc > 0.8 or auc < 0.2:  # Suspiciously predictive
            leaky_features.append((name, auc))
    if leaky_features:
        raise LeakageError(f"Potential label leakage: {leaky_features}")
Prevention:

- Compute labels after features
- Never derive features from trade outcomes
- Document the feature-label temporal relationship
Failure Mode 3: Overfitting¶
What It Is: Model memorizes noise instead of learning patterns.
Symptoms:

- Large gap between train and test performance
- Performance degrades with more data
- Model is highly complex (many parameters)
Detection:
class OverfitError(Exception):
    pass

def test_overfitting(model, X_train, y_train, X_test, y_test, max_gap=0.15):
    """
    Check the train vs. test performance gap.
    """
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    gap = train_score - test_score
    if gap > max_gap:
        raise OverfitError(
            f"Overfit detected: train={train_score:.3f}, "
            f"test={test_score:.3f}, gap={gap:.3f}"
        )
    return {'train': train_score, 'test': test_score, 'gap': gap}
Prevention:

- Regularization (L1/L2, dropout)
- Early stopping
- Simpler models first
- Cross-validation
Failure Mode 4: Distribution Shift¶
What It Is: Test/live data differs from training data.
Symptoms:

- Sudden performance drop
- Feature distributions change
- Model confidence decreases
Detection:
import numpy as np

class DistributionShiftWarning(Exception):
    pass

def detect_distribution_shift(X_train, X_live, threshold=0.25):
    """
    Detect feature distribution shift using PSI
    (common rule of thumb: PSI > 0.25 signals a major shift).
    """
    shifts = {}
    for i in range(X_train.shape[1]):
        psi = population_stability_index(X_train[:, i], X_live[:, i])
        if psi > threshold:
            shifts[i] = psi
    if shifts:
        raise DistributionShiftWarning(f"Feature drift detected: {shifts}")

def population_stability_index(expected, actual, bins=10):
    """Calculate PSI between two distributions using shared bin edges."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid log(0)
    expected_pct = np.clip(expected_pct, 0.001, None)
    actual_pct = np.clip(actual_pct, 0.001, None)
    psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
    return psi
Prevention:

- Monitor feature distributions
- Retrain on a rolling window
- Regime-aware training
Failure Mode 5: Calibration Drift¶
What It Is: Predicted probabilities no longer match actual frequencies.
Symptoms:

- "60% probability" signals win only 45% of the time
- Confidence-based sizing underperforms
- Reliability diagram shows deviation
Detection:
def monitor_calibration(predictions, outcomes, window=100):
    """
    Monitor rolling calibration.
    """
    results = []
    for i in range(window, len(predictions)):
        window_preds = predictions[i-window:i]
        window_outcomes = outcomes[i-window:i]
        # Bin predictions
        bins = [0.5, 0.55, 0.6, 0.65, 0.7, 1.0]
        for j in range(len(bins) - 1):
            mask = (window_preds >= bins[j]) & (window_preds < bins[j+1])
            if mask.sum() > 10:
                expected = window_preds[mask].mean()
                actual = window_outcomes[mask].mean()
                error = abs(expected - actual)
                if error > 0.1:
                    results.append({
                        'bar': i,
                        'bin': f'{bins[j]}-{bins[j+1]}',
                        'expected': expected,
                        'actual': actual,
                        'error': error
                    })
    return results
Prevention:

- Regular recalibration
- Post-hoc isotonic regression
- Monitor calibration metrics
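As an illustration of post-hoc recalibration, a crude binned stand-in for isotonic regression (the synthetic data and `bin_recalibrate` are hypothetical; in practice sklearn's `IsotonicRegression` fitted on held-out predictions does this properly):

```python
import numpy as np

def bin_recalibrate(raw_preds, outcomes, bins=10):
    """Replace each prediction with the empirical win rate of its
    probability bin: a crude stand-in for isotonic regression."""
    edges = np.linspace(0, 1, bins + 1)
    idx = np.clip(np.digitize(raw_preds, edges) - 1, 0, bins - 1)
    bin_rate = np.array([
        outcomes[idx == b].mean() if (idx == b).any() else np.nan
        for b in range(bins)
    ])
    return bin_rate[idx]

# Synthetic drifted model: claims ~0.65 on average, actually wins ~0.55
rng = np.random.default_rng(0)
raw = rng.uniform(0.5, 0.8, 5000)
hits = (rng.uniform(0, 1, 5000) < raw - 0.1).astype(int)

calibrated = bin_recalibrate(raw, hits)
print(raw.mean() - hits.mean())              # raw predictions overestimate the win rate
print(abs(calibrated.mean() - hits.mean()))  # calibrated predictions track it closely
```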
Failure Mode 6: Correlated Model Failures¶
What It Is: Multiple models fail simultaneously.
Symptoms:

- Ensemble underperforms during specific periods
- Drawdowns cluster across models
- Correlation spikes during losses
Detection:
import numpy as np

def detect_correlated_failures(model_returns, threshold=0.7):
    """
    Check whether model failures are correlated.
    model_returns: DataFrame with one column of returns per model.
    """
    # Focus on losing periods
    losing_mask = model_returns.sum(axis=1) < 0
    if losing_mask.sum() < 20:
        return {'correlated': False, 'reason': 'insufficient_losses'}
    losing_returns = model_returns[losing_mask]
    correlation = losing_returns.corr()
    # Check off-diagonal correlations
    mask = ~np.eye(correlation.shape[0], dtype=bool)
    max_corr = correlation.values[mask].max()
    if max_corr > threshold:
        return {
            'correlated': True,
            'max_correlation': max_corr,
            'action': 'diversify_models'
        }
    return {'correlated': False, 'max_correlation': max_corr}
Prevention:

- Train on different feature sets
- Use different algorithms
- Explicit decorrelation objective
Detection Pipeline¶
Automated Checks (Run Daily)¶
# ci/ml_health_checks.yaml
checks:
  - name: feature_drift
    script: python -m src.monitoring.feature_drift
    threshold: psi < 0.25
    action_on_fail: alert

  - name: calibration
    script: python -m src.monitoring.calibration_check
    threshold: ece < 0.1
    action_on_fail: flag_review

  - name: performance_degradation
    script: python -m src.monitoring.model_performance
    threshold: sharpe_30d > 0.3
    action_on_fail: alert

  - name: lookahead_test
    script: pytest tests/test_bias.py::test_no_lookahead
    action_on_fail: block_deploy
Manual Review Triggers¶
| Condition | Trigger Review |
|---|---|
| 30-day Sharpe < 0.5 | Weekly review |
| Any kill trigger hit | Immediate review |
| New model deployment | Pre-deploy review |
| Regime change detected | Strategy review |
Next 7 Days: Action Plan¶
- Implement src/monitoring/feature_drift.py with PSI calculation
- Create tests/test_bias.py with lookahead and leakage tests
- Build calibration monitoring in src/monitoring/calibration_check.py
- Set up model card template in models/template/model_card.yaml
- Configure CI pipeline for daily ML health checks
- Document feature store schema in docs/feature_store.md
- Create first regime detection prototype (rule-based)