# Machine Learning Specification

## ML Use Cases

### Overview

| Use Case | Purpose | Model Type | Update Frequency |
|---|---|---|---|
| Regime Detection | Classify market state | Clustering / HMM | Daily |
| Volatility Forecasting | Size positions, set stops | GARCH / ML | Per bar |
| Signal Filtering | Estimate P(success) | Classifier | Per signal |
| Portfolio Allocation | Optimize weights | Optimizer | Weekly |
## Use Case 1: Regime Detection

### Purpose

Classify the current market into regimes to:

- Adjust strategy parameters per regime
- Avoid trading in hostile regimes
- Set appropriate risk limits

### Regime Taxonomy

| Dimension | States | Detection Method |
|---|---|---|
| Volatility | Low / Normal / High / Extreme | ATR percentile vs 90-day rolling |
| Trend | Trending / Ranging / Choppy | ADX threshold + price vs MA |
| Liquidity | Normal / Thin / Stressed | Spread percentile + volume |
| Correlation | Decorrelated / Correlated / Crisis | Rolling pairwise correlation |
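To make the taxonomy concrete, here is a minimal sketch (a hypothetical helper, not part of the spec) of combining the four dimensions into a single regime tag, with the hostile states blocking trading as described under Purpose:

```python
# Hypothetical helper: combine the four regime dimensions into one tag,
# flagging hostile combinations where trading should pause.
HOSTILE_STATES = {'extreme', 'stressed', 'crisis'}

def composite_regime(volatility, trend, liquidity, correlation):
    states = (volatility, trend, liquidity, correlation)
    return {
        'label': '/'.join(states),
        'tradeable': HOSTILE_STATES.isdisjoint(states),
    }
```

Any one hostile dimension is enough to mark the regime non-tradeable; a finer policy (e.g. reduced size in `high` volatility) would hang off the same tag.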
### Model Options

```python
# Option A: Rule-based (transparent, no training)
def classify_vol_regime(atr, atr_90d_pct):
    if atr_90d_pct < 25:
        return 'low'
    elif atr_90d_pct < 75:
        return 'normal'
    elif atr_90d_pct < 95:
        return 'high'
    else:
        return 'extreme'

# Option B: HMM (data-driven)
from hmmlearn import hmm

model = hmm.GaussianHMM(n_components=4, covariance_type="diag")
model.fit(features)  # features: returns, vol, volume
```
### Features (Regime Detection)

| Feature | Calculation | Lookahead Safe |
|---|---|---|
| ATR_20 | 20-bar ATR | Yes |
| ATR_percentile | ATR vs 90-day rolling | Yes |
| ADX_14 | 14-bar ADX | Yes |
| Price_vs_SMA50 | (Close - SMA50) / ATR | Yes |
| Spread_percentile | Current spread vs 30-day | Yes |
| Volume_ratio | Volume / 20-bar avg | Yes |
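As an illustration of why these features are lookahead-safe, the ATR_percentile row can be computed from a trailing window only. A sketch (assuming daily bars, so 90 bars ≈ 90 days; the function name is ours):

```python
import numpy as np

def atr_percentile(atr, window=90):
    """Percentile rank of each bar's ATR within its trailing window;
    NaN until a full window of history exists (lookahead-safe)."""
    atr = np.asarray(atr, dtype=float)
    out = np.full(len(atr), np.nan)
    for t in range(window - 1, len(atr)):
        w = atr[t - window + 1 : t + 1]   # ends at the current bar
        out[t] = (w <= atr[t]).mean() * 100
    return out
```

Because the window ends at bar t, recomputing the series after new bars arrive never changes past values.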
## Use Case 2: Volatility Forecasting

### Purpose

- Position sizing (inverse vol weighting)
- Stop-loss distance calibration
- Risk budget allocation

### Model Options

| Model | Pros | Cons | Use When |
|---|---|---|---|
| GARCH(1,1) | Interpretable, fast | Symmetric | FX, Indices |
| EGARCH | Asymmetric response | More params | Equities |
| HAR-RV | Multi-scale | Needs HF data | If available |
| ML (XGBoost) | Flexible | Black box | Rich feature set |
### GARCH Implementation

```python
import numpy as np
from arch import arch_model

def forecast_volatility(returns, horizon=1):
    """
    Forecast volatility using GARCH(1,1).
    Returns annualized volatility forecast.
    """
    # Scale returns to percent for numerical stability in the optimizer
    model = arch_model(returns * 100, vol='Garch', p=1, q=1)
    fit = model.fit(disp='off')
    forecast = fit.forecast(horizon=horizon)
    # Convert daily variance (percent^2) to annualized volatility
    daily_var = forecast.variance.values[-1, 0]
    return np.sqrt(daily_var * 252) / 100

def calculate_position_size(account_value, risk_pct, entry, stop,
                            vol_forecast, vol_target=0.15):
    """
    Volatility-adjusted position sizing.
    vol_forecast: annualized volatility
    vol_target: target annualized volatility (15% default)
    """
    # Base risk per trade
    risk_amount = account_value * risk_pct
    stop_distance = abs(entry - stop)
    base_size = risk_amount / stop_distance
    # Vol adjustment (scale down in high vol)
    vol_scalar = vol_target / vol_forecast
    vol_scalar = np.clip(vol_scalar, 0.5, 1.5)  # Limit adjustment
    return base_size * vol_scalar
```
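A worked example of the sizing rule with illustrative numbers (a hypothetical FX trade, not a recommendation): a 30% vol forecast against the 15% target halves the base size, and the clip keeps the adjustment inside [0.5, 1.5].

```python
import numpy as np

# Illustrative: 100k account, 1% risk, 50-pip stop,
# vol forecast at twice the target -> scalar hits the 0.5 floor.
account_value, risk_pct = 100_000, 0.01
entry, stop = 1.1000, 1.0950
vol_forecast, vol_target = 0.30, 0.15

risk_amount = account_value * risk_pct                  # 1,000
base_size = risk_amount / abs(entry - stop)             # ~200,000 units
vol_scalar = float(np.clip(vol_target / vol_forecast, 0.5, 1.5))  # 0.5
position = base_size * vol_scalar                       # ~100,000 units
```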
## Use Case 3: Signal Filtering

### Purpose

Estimate the probability of success for each trade signal to:

- Filter low-probability setups
- Adjust position size by confidence
- Track calibration over time

### Anti-Leakage Requirements

| Requirement | Implementation |
|---|---|
| No future data | Features computed only from t-1 and earlier |
| No target leakage | Label derived after feature cutoff |
| Purged CV | Gap between train and test |
| Point-in-time features | Simulate real-time feature availability |
### Feature Engineering

```python
def create_signal_features(df, signal_bar, closed_trades):
    """
    Create features for signal filtering.
    All features use data BEFORE signal_bar only.
    """
    features = {}
    # Price action features (before signal)
    prior = df.loc[:signal_bar - 1]
    features['atr_ratio'] = prior['atr'].iloc[-1] / prior['atr'].mean()
    features['trend_strength'] = prior['adx'].iloc[-1]
    features['rsi'] = prior['rsi'].iloc[-1]
    # Session features (known at signal time)
    features['hour'] = df.loc[signal_bar, 'timestamp'].hour
    features['day_of_week'] = df.loc[signal_bar, 'timestamp'].dayofweek
    features['is_killzone'] = is_killzone(df.loc[signal_bar, 'timestamp'])
    # Recent performance (no leakage - uses closed trades only)
    features['recent_win_rate'] = get_recent_win_rate(closed_trades, lookback=20)
    return features
```
### Label Definition

```python
def create_label(entry_price, exit_price, direction):
    """
    Binary label: 1 if trade was profitable, 0 otherwise.
    CRITICAL: Label is determined AFTER trade closes.
    Never use label information in features.
    """
    if direction == 'long':
        return 1 if exit_price > entry_price else 0
    else:
        return 1 if exit_price < entry_price else 0
```
### Model Training

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.calibration import CalibratedClassifierCV

def train_signal_filter(X_train, y_train):
    """
    Train calibrated classifier for probability estimation.
    """
    base_model = GradientBoostingClassifier(
        n_estimators=100,
        max_depth=3,
        min_samples_leaf=20,
        random_state=42
    )
    # Calibrate for accurate probabilities
    model = CalibratedClassifierCV(base_model, method='isotonic', cv=5)
    model.fit(X_train, y_train)
    return model
```
### Filtering Logic

```python
import numpy as np

def should_take_signal(model, features, min_prob=0.55):
    """
    Filter signals based on predicted probability.
    features: 1-D numpy array in training column order.
    """
    prob = model.predict_proba(features.reshape(1, -1))[0, 1]
    return {
        'take_signal': prob >= min_prob,
        'probability': prob,
        'size_multiplier': np.clip((prob - 0.5) * 4, 0.5, 1.5)  # Scale size by confidence
    }
```
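To illustrate how the confidence-based sizing behaves: the multiplier is linear in probability with a floor and a cap, so every probability between the 0.55 entry threshold and 0.625 receives the 0.5x floor (an observation about the formula, not an extra rule):

```python
import numpy as np

# The size multiplier from should_take_signal, isolated:
# clip((p - 0.5) * 4, 0.5, 1.5)
def size_multiplier(prob):
    return float(np.clip((prob - 0.5) * 4, 0.5, 1.5))

# p = 0.55 -> 0.5 (floor), p = 0.75 -> 1.0, p >= 0.875 -> 1.5 (cap)
```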
## Use Case 4: Portfolio Allocation

### Purpose

Allocate capital across strategies to maximize risk-adjusted returns while respecting constraints.

### Allocation Methods

| Method | Description | When to Use |
|---|---|---|
| Equal Weight | 1/N allocation | Baseline, few strategies |
| Inverse Vol | Weight by 1/volatility | Simple vol targeting |
| Risk Parity | Equal risk contribution | Diversification priority |
| Mean-Variance | Optimize Sharpe | Sufficient history |
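The inverse-vol row is simple enough to sketch inline (the function name is ours): weight each strategy by 1/volatility, then normalize so weights sum to 1.

```python
import numpy as np

# Inverse-volatility weights: w_i proportional to 1/sigma_i, normalized.
def inverse_vol_weights(vols):
    inv = 1.0 / np.asarray(vols, dtype=float)
    return inv / inv.sum()
```

With annualized vols of 10%, 20%, 20%, the lowest-vol strategy receives half the capital.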
### Risk Parity Implementation

```python
import numpy as np
from scipy.optimize import minimize

def risk_parity_weights(cov_matrix):
    """
    Calculate risk parity weights.
    Each strategy contributes equally to portfolio risk.
    """
    n = cov_matrix.shape[0]

    def risk_contribution(weights):
        port_vol = np.sqrt(weights @ cov_matrix @ weights)
        marginal_contrib = cov_matrix @ weights
        return weights * marginal_contrib / port_vol

    def objective(weights):
        rc = risk_contribution(weights)
        target = np.ones(n) / n  # Equal risk contribution
        return np.sum((rc - target) ** 2)

    constraints = {'type': 'eq', 'fun': lambda w: np.sum(w) - 1}
    bounds = [(0.05, 0.5) for _ in range(n)]  # Min 5%, max 50% per strategy
    result = minimize(objective, np.ones(n) / n,
                      bounds=bounds, constraints=constraints)
    return result.x
```
### Constraints

| Constraint | Limit | Rationale |
|---|---|---|
| Min weight per strategy | 5% | Maintain diversification |
| Max weight per strategy | 50% | Avoid concentration |
| Max correlation exposure | 60% | Limit correlated bets |
| Turnover limit | 20%/month | Reduce transaction costs |
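One way to enforce the turnover row is a pre-trade check on proposed weight changes. The one-way turnover convention below (half the sum of absolute weight changes) is our assumption; the spec does not define how turnover is measured:

```python
import numpy as np

# One-way turnover = 0.5 * sum(|new_w - old_w|) (assumed convention),
# checked against the 20%/month limit from the table.
def turnover_ok(old_weights, new_weights, limit=0.20):
    change = np.abs(np.asarray(new_weights, dtype=float)
                    - np.asarray(old_weights, dtype=float)).sum()
    return bool(0.5 * change <= limit)
```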
## Feature Store Rules

### Time Alignment by Asset Class

| Asset Class | Feature Cutoff | Rationale |
|---|---|---|
| FX | Bar close - 1 bar | Ensure bar is fully formed |
| Crypto | Bar close - 1 bar | 24/7, use UTC alignment |
| Futures | Bar close - 1 bar | Account for settlement |
| Equities/Indices | Prior day close | Corporate actions |

### Feature Availability Matrix

| Feature Type | Available At | Lag Required |
|---|---|---|
| Price-based (OHLC) | Bar close | 1 bar |
| Volume-based | Bar close | 1 bar |
| Indicator (MA, RSI) | Bar close | 1 bar |
| Fundamental | EOD | 1 day |
| Sentiment | Variable | Case by case |
### Feature Validation Checklist
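The checklist items are not enumerated here. As one example of an automatable check, a point-in-time test (a hypothetical helper, not from the spec): a feature's value at bar t must not change when all bars after t are removed from the input.

```python
import numpy as np

# Hypothetical lookahead test: recompute the feature on history truncated
# at bar t and compare against the full-history value at bar t.
def has_lookahead(feature_fn, series, t):
    full = feature_fn(series)[t]
    truncated = feature_fn(series[: t + 1])[t]
    return not np.isclose(full, truncated, equal_nan=True)
```

A trailing moving average passes; anything built from future bars (e.g. a shift(-1)-style feature) fails, because the truncated input cannot reproduce it.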
## Validation Framework

### Purged/Embargoed Cross-Validation
```python
from sklearn.model_selection import TimeSeriesSplit

def purged_cv(X, y, n_splits=5, purge_gap=10, embargo_gap=5):
    """
    Time series CV with purge and embargo.
    purge_gap: bars removed between train and test
    embargo_gap: bars removed from the start of test after the purged train
    """
    splits = []
    tscv = TimeSeriesSplit(n_splits=n_splits)
    for train_idx, test_idx in tscv.split(X):
        # Purge: remove training bars that overlap the test window
        train_idx = train_idx[train_idx < test_idx.min() - purge_gap]
        if len(train_idx) == 0:
            continue  # guard before calling train_idx.max()
        # Embargo: remove test bars too close to the training window
        test_idx = test_idx[test_idx > train_idx.max() + embargo_gap]
        if len(test_idx) > 0:
            splits.append((train_idx, test_idx))
    return splits
```
### Walk-Forward with Regime Splits

```python
def walk_forward_regime_aware(X, y, regimes, n_windows=6):
    """
    Walk-forward validation windows with a per-window report of
    regime coverage, so windows missing regime types can be flagged.
    """
    results = []
    for i in range(n_windows):
        train_end = int(len(X) * (0.5 + i * 0.1))
        test_end = int(len(X) * (0.6 + i * 0.1))
        X_train, y_train = X[:train_end], y[:train_end]
        X_test, y_test = X[train_end:test_end], y[train_end:test_end]
        # Check regime coverage
        train_regimes = set(regimes[:train_end])
        test_regimes = set(regimes[train_end:test_end])
        results.append({
            'window': i,
            'train_regimes': train_regimes,
            'test_regimes': test_regimes,
            'regime_overlap': len(train_regimes & test_regimes)
        })
    return results
```
### Calibration Check

```python
import numpy as np
from sklearn.calibration import calibration_curve

def check_calibration(y_true, y_prob, n_bins=10):
    """
    Check if predicted probabilities are well-calibrated.
    """
    fraction_positives, mean_predicted = calibration_curve(
        y_true, y_prob, n_bins=n_bins
    )
    # Unweighted calibration error across non-empty bins
    ece = np.mean(np.abs(fraction_positives - mean_predicted))
    return {
        'expected_calibration_error': ece,
        'pass': ece < 0.1,  # <10% error is acceptable
        'bins': list(zip(mean_predicted, fraction_positives))
    }
```
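A quick numpy-only sanity demo of the binned-calibration idea above, on synthetic predictions that are calibrated by construction (so the error should land well under the 0.1 pass threshold):

```python
import numpy as np

# Synthetic predictions: outcomes drawn with probability equal to the
# prediction itself, i.e. perfectly calibrated by construction.
rng = np.random.default_rng(42)
y_prob = rng.uniform(size=20_000)
y_true = (rng.uniform(size=20_000) < y_prob).astype(int)

# 10 equal-width probability bins, mirroring the check above
bins = np.clip((y_prob * 10).astype(int), 0, 9)
ece = float(np.mean([
    abs(y_true[bins == b].mean() - y_prob[bins == b].mean())
    for b in range(10)
]))
```

With 20k samples, sampling noise keeps the per-bin gaps around a percentage point, so the check passes comfortably; a model that always predicts 0.9 on 50/50 outcomes would fail by a wide margin.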
## Model Governance

### Model Card Template

```yaml
# models/signal_filter/model_card.yaml
model_name: signal_filter_v1
version: 1.0.0
created: 2024-01-15
author: Edge Factory
description: |
  Binary classifier predicting trade success probability.
  Used to filter low-quality signals.
training_data:
  source: backtested_trades
  date_range: 2020-01-01 to 2023-12-31
  n_samples: 5000
  positive_rate: 0.52
features:
  - atr_ratio
  - trend_strength
  - rsi
  - hour
  - is_killzone
  - recent_win_rate
performance:
  auc_roc: 0.62
  calibration_error: 0.07
  oos_accuracy: 0.58
thresholds:
  min_probability: 0.55
  kill_trigger_auc: 0.52
deployment:
  status: production
  deployed_date: 2024-01-20
  monitoring_dashboard: /monitoring/signal_filter
```
### Version Control

```text
models/
├── signal_filter/
│   ├── v1.0.0/
│   │   ├── model.pkl
│   │   ├── model_card.yaml
│   │   └── validation_report.md
│   ├── v1.1.0/
│   │   └── ...
│   └── current -> v1.0.0/   # Symlink to active version
```
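Loading code can resolve the `current` symlink to discover the active version at startup; a sketch under the layout above (the function name is ours):

```python
from pathlib import Path

# Resolve models/<model_name>/current to the version directory it targets.
def active_version(models_root, model_name):
    current = Path(models_root) / model_name / "current"
    if not current.exists():
        return None
    return current.resolve().name  # e.g. 'v1.0.0'
```

Keeping the symlink as the single source of truth means deploys and rollbacks are one atomic `ln -sf` with no code change.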
### Kill Model Triggers

| Trigger | Threshold | Action |
|---|---|---|
| AUC drops below random | AUC < 0.52 | Immediate kill |
| Calibration degrades | ECE > 0.15 | Flag for review |
| Feature drift | PSI > 0.25 | Retrain required |
| 30-day degradation | Sharpe < 0.3 | Review meeting |
| Regime performance | Fails in 2+ regimes | Investigate |
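The feature-drift trigger uses the Population Stability Index. A sketch of the standard PSI calculation (the quantile binning convention is our assumption; the spec only names the 0.25 threshold):

```python
import numpy as np

# PSI compares a feature's binned distribution in production ('actual')
# against the training sample ('expected'), binned by training quantiles.
def psi(expected, actual, n_bins=10, eps=1e-6):
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    # Clip production values into the training range so all land in a bin
    a_clipped = np.clip(actual, edges[0], edges[-1])
    e = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a = np.histogram(a_clipped, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a - e) * np.log(a / e)))
```

Identical distributions give PSI near 0; a large mean shift pushes PSI well past the 0.25 retrain trigger.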
Rollback Procedure
- Identify performance degradation
- Compare current vs previous version
- If previous better:
ln -sf v0.9.0 current - Log rollback in
models/changelog.md - Investigate root cause
- Plan fix or confirm rollback is permanent