ML Patterns Playbook¶
Pattern Catalog¶
Pattern 1: Regime-Conditional Strategy Selection¶
Problem: No single strategy performs well across all market conditions.
Solution: Train regime classifier, then select/weight strategies per regime.
class RegimeConditionalStrategy:
    def __init__(self, strategies, regime_model):
        self.strategies = strategies  # dict: regime -> strategy
        self.regime_model = regime_model

    def generate_signal(self, data):
        # 1. Classify current regime
        regime = self.regime_model.predict(data.features)
        # 2. Select strategy for regime
        if regime in self.strategies:
            return self.strategies[regime].generate_signal(data)
        else:
            return None  # No signal in unknown regime
When to Use:

- Strategy shows regime-dependent performance
- Sufficient history in each regime
- Regimes are identifiable in real time
Anti-Pattern: Overfitting regime boundaries to maximize backtest.
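A toy sketch of the selection flow above, with hypothetical stub classes (`ConstantRegimeModel` and `Echo` are illustrative stand-ins, not part of any real codebase):

```python
class ConstantRegimeModel:
    """Stand-in regime classifier that always predicts one regime."""
    def __init__(self, regime):
        self.regime = regime

    def predict(self, features):
        return self.regime

class Echo:
    """Stand-in strategy that always emits the same signal."""
    def __init__(self, signal):
        self.signal = signal

    def generate_signal(self, data):
        return self.signal

strategies = {'trending': Echo('long'), 'ranging': Echo('flat')}

def generate_signal(regime_model, features):
    regime = regime_model.predict(features)
    strat = strategies.get(regime)
    return strat.generate_signal(features) if strat else None

print(generate_signal(ConstantRegimeModel('trending'), None))  # long
print(generate_signal(ConstantRegimeModel('crisis'), None))    # None: unknown regime, no trade
```

Returning `None` for an unmapped regime is the conservative default: standing aside in conditions the classifier was never trained to handle.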
Pattern 2: Probability-Weighted Sizing¶
Problem: Equal sizing ignores signal quality variation.
Solution: Scale position size by predicted success probability.
import numpy as np

def probability_weighted_size(base_size, probability, min_prob=0.5, max_prob=0.7):
    """
    Scale position size by signal probability.

    probability: model's P(success) estimate.
    Returns base_size times a multiplier in [0.5, 1.5],
    or 0 when probability is below min_prob.
    """
    if probability < min_prob:
        return 0  # Don't trade
    # Linear scaling between min and max
    scale = (probability - min_prob) / (max_prob - min_prob)
    scale = np.clip(scale, 0, 1)
    # Map to size multiplier [0.5, 1.5]
    multiplier = 0.5 + scale * 1.0
    return base_size * multiplier
When to Use:

- Well-calibrated probability model
- Sufficient trade frequency to realize the statistical edge
- Risk limits accommodate size variation
Anti-Pattern: Over-concentrating on "high probability" signals that are overfit.
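A quick worked example of the mapping, assuming the defaults above: p = 0.65 sits 75% of the way from min_prob to max_prob, so the multiplier is 0.5 + 0.75 = 1.25.

```python
import numpy as np

def probability_weighted_size(base_size, probability, min_prob=0.5, max_prob=0.7):
    if probability < min_prob:
        return 0  # below the floor: don't trade
    scale = np.clip((probability - min_prob) / (max_prob - min_prob), 0, 1)
    return base_size * (0.5 + scale)  # multiplier in [0.5, 1.5]

print(round(probability_weighted_size(100, 0.65), 2))  # 125.0
print(probability_weighted_size(100, 0.45))            # 0
print(round(probability_weighted_size(100, 0.90), 2))  # 150.0 (scale capped at 1)
```

Probabilities above max_prob are capped at the 1.5x multiplier rather than extrapolated, which limits the damage when the model is overconfident.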
Pattern 3: Ensemble with Disagreement Filter¶
Problem: Single model has blind spots.
Solution: Use multiple models, only trade when they agree.
class EnsembleFilter:
    def __init__(self, models, agreement_threshold=0.6):
        self.models = models
        self.agreement_threshold = agreement_threshold

    def should_trade(self, features):
        # Assumes each model returns a binary prediction (1 = long, 0 = short)
        predictions = [m.predict(features) for m in self.models]
        agreement = sum(predictions) / len(predictions)
        if agreement >= self.agreement_threshold:
            return {'trade': True, 'direction': 'long', 'confidence': agreement}
        elif agreement <= (1 - self.agreement_threshold):
            return {'trade': True, 'direction': 'short', 'confidence': 1 - agreement}
        else:
            return {'trade': False, 'reason': 'insufficient_agreement'}
When to Use:

- Models trained on different feature sets
- Models use different algorithms
- Historical disagreement predicts losses
Anti-Pattern: Using correlated models that fail together.
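A minimal agreement check with toy binary models (`Up` and `Down` are illustrative stand-ins): two of three models predicting long gives agreement 0.67, clearing the 0.6 threshold.

```python
class Up:
    def predict(self, features):
        return 1  # always predicts long

class Down:
    def predict(self, features):
        return 0  # always predicts short

def should_trade(models, features, threshold=0.6):
    predictions = [m.predict(features) for m in models]
    agreement = sum(predictions) / len(predictions)
    if agreement >= threshold:
        return ('long', agreement)
    if agreement <= 1 - threshold:
        return ('short', 1 - agreement)
    return (None, agreement)  # insufficient agreement: stand aside

print(should_trade([Up(), Up(), Down()], None))  # 2 of 3 agree: trade long
print(should_trade([Up(), Down()], None))        # split vote: no trade
```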
Pattern 4: Volatility-Scaled Features¶
Problem: Raw price features behave differently across volatility regimes.
Solution: Normalize features by volatility.
import pandas as pd

def volatility_scaled_features(df, vol_window=20):
    """
    Scale price-based features by rolling volatility.
    Assumes df has 'close', 'high', 'low', and 'sma_20' columns
    and a calculate_atr helper.
    """
    atr = calculate_atr(df, vol_window)
    features = {
        # Raw: distance from MA (varies with vol)
        # Scaled: same metric in ATR units (stable)
        'price_vs_ma': (df['close'] - df['sma_20']) / atr,
        # Raw: absolute range
        # Scaled: range as multiple of typical range
        'range_ratio': (df['high'] - df['low']) / atr,
        # Raw: absolute move
        # Scaled: move significance
        'return_atr': df['close'].pct_change() / (atr / df['close'])
    }
    return pd.DataFrame(features)
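The `calculate_atr` helper referenced above is not defined in this playbook; a minimal sketch, assuming OHLC columns and a simple rolling-mean ATR (production code might prefer Wilder smoothing):

```python
import pandas as pd

def calculate_atr(df, window=20):
    """Average True Range: rolling mean of the true range."""
    prev_close = df['close'].shift(1)
    true_range = pd.concat([
        df['high'] - df['low'],           # intraday range
        (df['high'] - prev_close).abs(),  # gap up from prior close
        (df['low'] - prev_close).abs(),   # gap down from prior close
    ], axis=1).max(axis=1)
    return true_range.rolling(window).mean()
```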
When to Use: Always for price-based features in ML models.
Anti-Pattern: Mixing scaled and unscaled features without understanding how each behaves across volatility regimes.
Pattern 5: Online Learning with Decay¶
Problem: Market dynamics change; static models go stale.
Solution: Continuously update model with exponential decay on old data.
import numpy as np

class OnlineLearner:
    def __init__(self, base_model, decay_factor=0.99):
        self.model = base_model
        self.decay = decay_factor
        self.sample_weights = []
        self.X_history = None
        self.y_history = None

    def partial_fit(self, X_new, y_new):
        # Decay weights of previously seen samples
        self.sample_weights = [w * self.decay for w in self.sample_weights]
        # Add new samples with weight 1.0
        self.sample_weights.extend([1.0] * len(X_new))
        # Accumulate history and retrain with weighted samples
        if self.X_history is None:
            self.X_history, self.y_history = np.asarray(X_new), np.asarray(y_new)
        else:
            self.X_history = np.vstack([self.X_history, X_new])
            self.y_history = np.concatenate([self.y_history, y_new])
        self.model.fit(self.X_history, self.y_history, sample_weight=self.sample_weights)
When to Use:

- Stable feature set
- Regular arrival of new data
- Gradual regime changes

Anti-Pattern: Decaying too fast (overfits to recent data) or too slow (model goes stale).
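One way to reason about the decay choice: with decay factor d, a sample's weight halves every ln(0.5)/ln(d) updates, so d = 0.99 gives a half-life of roughly 69 updates and d = 0.999 roughly 693. A small arithmetic check:

```python
import math

def effective_half_life(decay):
    """Updates until a sample's weight falls to half its initial value."""
    return math.log(0.5) / math.log(decay)

print(round(effective_half_life(0.99)))   # 69
print(round(effective_half_life(0.999)))  # 693
```

Matching this half-life to the expected pace of regime change is the practical way to pick the decay factor.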
Failure Modes & How We Detect Them¶
Failure Mode 1: Lookahead Bias¶
What It Is: Using future information in features or labels.
Symptoms:

- Backtest Sharpe > 3 (unrealistic)
- Sharp performance drop in paper/live trading
- Features improve when you add "future" bars
Detection:
class LookaheadError(Exception):
    pass

def test_lookahead(strategy, data):
    """
    Test for lookahead by running on truncated data.
    """
    signals_full = strategy.generate_signals(data)
    signals_partial = strategy.generate_signals(data[:-100])
    # Signals for the overlapping period should be identical
    overlap = signals_partial.index
    mismatch = (signals_full.loc[overlap] != signals_partial).sum()
    if mismatch > 0:
        raise LookaheadError(f"Lookahead detected: {mismatch} signals changed")
Prevention:

- Strict point-in-time feature computation
- Purged cross-validation
- Code review checklist
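A simplified sketch of the purged cross-validation idea: walk-forward splits that drop an embargo of bars between train and test so labels near the boundary cannot leak. Full purging would also remove training samples whose label windows overlap the test set; `purged_walkforward_splits` is an illustrative name, not an existing utility.

```python
import numpy as np

def purged_walkforward_splits(n_samples, n_folds=5, embargo=10):
    """Yield (train, test) index arrays with an embargo gap purged
    between the end of training data and the start of the test fold."""
    fold = n_samples // n_folds
    for k in range(1, n_folds):
        test_start = k * fold
        train = np.arange(max(0, test_start - embargo))
        test = np.arange(test_start, min(test_start + fold, n_samples))
        yield train, test

for train, test in purged_walkforward_splits(100):
    # The last training bar always sits more than `embargo` bars before the test fold
    assert test[0] - train[-1] > 10
```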
Failure Mode 2: Label Leakage¶
What It Is: Target variable information leaks into features.
Symptoms:

- Near-perfect training accuracy
- Model relies heavily on one feature
- Feature importance shows a suspicious pattern
Detection:
from sklearn.metrics import roc_auc_score

class LeakageError(Exception):
    pass

def test_label_leakage(X, y, feature_names):
    """
    Check if any single feature predicts the target too well.
    """
    leaky_features = []
    for i, name in enumerate(feature_names):
        auc = roc_auc_score(y, X[:, i])
        if auc > 0.8 or auc < 0.2:  # Suspiciously predictive
            leaky_features.append((name, auc))
    if leaky_features:
        raise LeakageError(f"Potential label leakage: {leaky_features}")
Prevention:

- Compute labels after features
- Never derive features from trade outcomes
- Document the feature-label temporal relationship
Failure Mode 3: Overfitting¶
What It Is: Model memorizes noise instead of learning patterns.
Symptoms:

- Large gap between train and test performance
- Performance degrades with more data
- Model is highly complex (many parameters)
Detection:
class OverfitError(Exception):
    pass

def test_overfitting(model, X_train, y_train, X_test, y_test, max_gap=0.15):
    """
    Check the train vs. test performance gap.
    """
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    gap = train_score - test_score
    if gap > max_gap:
        raise OverfitError(
            f"Overfit detected: train={train_score:.3f}, "
            f"test={test_score:.3f}, gap={gap:.3f}"
        )
    return {'train': train_score, 'test': test_score, 'gap': gap}
Prevention:

- Regularization (L1/L2, dropout)
- Early stopping
- Simpler models first
- Cross-validation
Failure Mode 4: Distribution Shift¶
What It Is: Test/live data differs from training data.
Symptoms:

- Sudden performance drop
- Feature distributions change
- Model confidence decreases
Detection:
import numpy as np

class DistributionShiftWarning(Exception):
    pass

def detect_distribution_shift(X_train, X_live, threshold=0.25):
    """
    Detect feature distribution shift using PSI
    (common rule of thumb: PSI > 0.25 signals a major shift).
    """
    shifts = {}
    for i in range(X_train.shape[1]):
        psi = population_stability_index(X_train[:, i], X_live[:, i])
        if psi > threshold:
            shifts[i] = psi
    if shifts:
        raise DistributionShiftWarning(f"Feature drift detected: {shifts}")

def population_stability_index(expected, actual, bins=10):
    """Calculate PSI between two distributions using shared bin edges."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid log(0)
    expected_pct = np.clip(expected_pct, 0.001, None)
    actual_pct = np.clip(actual_pct, 0.001, None)
    psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
    return psi
Prevention:

- Monitor feature distributions
- Retrain on a rolling window
- Regime-aware training
Failure Mode 5: Calibration Drift¶
What It Is: Predicted probabilities no longer match actual frequencies.
Symptoms:

- "60% probability" signals win only 45% of the time
- Confidence-based sizing underperforms
- Reliability diagram shows deviation
Detection:
def monitor_calibration(predictions, outcomes, window=100):
    """
    Monitor rolling calibration.
    """
    results = []
    for i in range(window, len(predictions)):
        window_preds = predictions[i-window:i]
        window_outcomes = outcomes[i-window:i]
        # Bin predictions
        bins = [0.5, 0.55, 0.6, 0.65, 0.7, 1.0]
        for j in range(len(bins) - 1):
            mask = (window_preds >= bins[j]) & (window_preds < bins[j+1])
            if mask.sum() > 10:
                expected = window_preds[mask].mean()
                actual = window_outcomes[mask].mean()
                error = abs(expected - actual)
                if error > 0.1:
                    results.append({
                        'bar': i,
                        'bin': f'{bins[j]}-{bins[j+1]}',
                        'expected': expected,
                        'actual': actual,
                        'error': error
                    })
    return results
Prevention:

- Regular recalibration
- Post-hoc isotonic regression
- Monitor calibration metrics
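As an illustration of post-hoc recalibration, a crude binned stand-in for isotonic regression (the synthetic data and `bin_recalibrate` are hypothetical; in practice sklearn's `IsotonicRegression` fitted on held-out predictions does this properly):

```python
import numpy as np

def bin_recalibrate(raw_preds, outcomes, bins=10):
    """Replace each prediction with the empirical win rate of its
    probability bin: a crude stand-in for isotonic regression."""
    edges = np.linspace(0, 1, bins + 1)
    idx = np.clip(np.digitize(raw_preds, edges) - 1, 0, bins - 1)
    bin_rate = np.array([
        outcomes[idx == b].mean() if (idx == b).any() else np.nan
        for b in range(bins)
    ])
    return bin_rate[idx]

# Synthetic drifted model: claims ~0.65 on average, actually wins ~0.55
rng = np.random.default_rng(0)
raw = rng.uniform(0.5, 0.8, 5000)
hits = (rng.uniform(0, 1, 5000) < raw - 0.1).astype(int)

calibrated = bin_recalibrate(raw, hits)
print(raw.mean() - hits.mean())              # raw predictions overestimate the win rate
print(abs(calibrated.mean() - hits.mean()))  # calibrated predictions track it closely
```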
Failure Mode 6: Correlated Model Failures¶
What It Is: Multiple models fail simultaneously.
Symptoms:

- Ensemble underperforms during specific periods
- Drawdowns cluster across models
- Correlation spikes during losses
Detection:
import numpy as np

def detect_correlated_failures(model_returns, threshold=0.7):
    """
    Check whether model failures are correlated.
    model_returns: DataFrame with one column of returns per model.
    """
    # Focus on losing periods
    losing_mask = model_returns.sum(axis=1) < 0
    if losing_mask.sum() < 20:
        return {'correlated': False, 'reason': 'insufficient_losses'}
    losing_returns = model_returns[losing_mask]
    correlation = losing_returns.corr()
    # Check off-diagonal correlations
    mask = ~np.eye(correlation.shape[0], dtype=bool)
    max_corr = correlation.values[mask].max()
    if max_corr > threshold:
        return {
            'correlated': True,
            'max_correlation': max_corr,
            'action': 'diversify_models'
        }
    return {'correlated': False, 'max_correlation': max_corr}
Prevention:

- Train on different feature sets
- Use different algorithms
- Explicit decorrelation objective
Detection Pipeline¶
Automated Checks (Run Daily)¶
# ci/ml_health_checks.yaml
checks:
  - name: feature_drift
    script: python -m src.monitoring.feature_drift
    threshold: psi < 0.25
    action_on_fail: alert

  - name: calibration
    script: python -m src.monitoring.calibration_check
    threshold: ece < 0.1
    action_on_fail: flag_review

  - name: performance_degradation
    script: python -m src.monitoring.model_performance
    threshold: sharpe_30d > 0.3
    action_on_fail: alert

  - name: lookahead_test
    script: pytest tests/test_bias.py::test_no_lookahead
    action_on_fail: block_deploy
Manual Review Triggers¶
| Condition | Trigger Review |
|---|---|
| 30-day Sharpe < 0.5 | Weekly review |
| Any kill trigger hit | Immediate review |
| New model deployment | Pre-deploy review |
| Regime change detected | Strategy review |
Next 7 Days: Action Plan¶
- Implement src/monitoring/feature_drift.py with PSI calculation
- Create tests/test_bias.py with lookahead and leakage tests
- Build calibration monitoring in src/monitoring/calibration_check.py
- Set up model card template in models/template/model_card.yaml
- Configure CI pipeline for daily ML health checks
- Document feature store schema in docs/feature_store.md
- Create first regime detection prototype (rule-based)