Agent Orchestration Runbook
Overview
This runbook defines how agents are orchestrated for common workflows. The Independent Auditor is a mandatory gatekeeper for all deployments.
Workflow Diagrams
Workflow 1: New Strategy Development
┌─────────────────────────────────────────────────────────────────────┐
│                        NEW STRATEGY WORKFLOW                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐         │
│  │   STRATEGY   │────▶│    QUANT     │────▶│ INDEPENDENT  │         │
│  │  DEVELOPER   │     │   ENGINEER   │     │   AUDITOR    │         │
│  └──────────────┘     └──────────────┘     └──────┬───────┘         │
│         │                    │                    │                 │
│         ▼                    ▼                    ▼                 │
│  - Hypothesis         - Backtest           - Anti-bias tests        │
│  - Pine/Python        - Features           - Full audit             │
│  - Initial test       - Walk-forward       - PASS/FAIL verdict      │
│                                                   │                 │
│                                           ┌───────┴───────┐         │
│                                           │               │         │
│                                          PASS            FAIL       │
│                                           │               │         │
│                                           ▼               ▼         │
│                                     ┌────────────┐   Return to      │
│                                     │ DEPLOYMENT │   Strategy Dev   │
│                                     │   AGENT    │   with issues    │
│                                     └────────────┘                  │
│                                           │                         │
│                                           ▼                         │
│                                     Paper → Live                    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Workflow 2: Model Training/Update
┌─────────────────────────────────────────────────────────────────────┐
│                        MODEL UPDATE WORKFLOW                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐         │
│  │     DATA     │────▶│      ML      │────▶│ INDEPENDENT  │         │
│  │   ENGINEER   │     │   ENGINEER   │     │   AUDITOR    │         │
│  └──────────────┘     └──────────────┘     └──────────────┘         │
│         │                    │                    │                 │
│         ▼                    ▼                    ▼                 │
│  - Fetch data         - Train model        - Validation check       │
│  - Validate           - Model card         - Calibration check      │
│  - Feature store      - OOS metrics        - Drift check            │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Orchestration Rules
Rule 1: Mandatory Gatekeeper
rule: gatekeeper_required
description: |
  The Independent Auditor MUST approve before ANY deployment.
  No exceptions. No bypasses.
enforcement:
  - Deployment Agent checks for audit report
  # Quoted so that "verdict: PASS" is not parsed as a nested key/value pair
  - "Audit report must have verdict: PASS"
  - Audit must be dated within 30 days
  - Audit must cover current version
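The enforcement list above could be checked by the Deployment Agent along the following lines. This is a minimal sketch, not the runbook's actual implementation: the `AuditReport` record, its field names, and the `gatekeeper_violations` helper are all hypothetical, and the 30-day window is measured from the day of the check.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_AUDIT_AGE = timedelta(days=30)  # "dated within 30 days" from Rule 1


@dataclass(frozen=True)
class AuditReport:
    """Hypothetical audit report record; the field names are assumptions."""
    verdict: str      # "PASS" or "FAIL"
    audited_on: date  # date the audit was completed
    version: str      # strategy or model version the audit covers


def gatekeeper_violations(report, current_version, today):
    """Return the Rule 1 violations; an empty list means deployment may proceed."""
    if report is None:
        return ["no audit report found"]
    problems = []
    if report.verdict != "PASS":
        problems.append(f"verdict is {report.verdict}, not PASS")
    if today - report.audited_on > MAX_AUDIT_AGE:
        problems.append("audit is older than 30 days")
    if report.version != current_version:
        problems.append("audit does not cover current version")
    return problems
```

Returning every violation rather than stopping at the first gives the requesting agent the full list to fix before re-submitting.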
Rule 2: Agent Isolation
rule: agent_isolation
description: |
  Agents cannot audit their own work.
enforcement:
  - Strategy Developer cannot run audit_strategy on own code
  - ML Engineer cannot run audit on own model
  - Auditor must be independent context/session
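One way an orchestrator might enforce isolation before dispatching an audit is a simple identity check. The sketch below assumes each agent run carries an agent name and a session identifier; both identifiers and the `can_audit` helper are hypothetical.

```python
def can_audit(auditor, author, auditor_session, author_session):
    """Rule 2: the auditor must be a different agent running in a different session."""
    return auditor != author and auditor_session != author_session
```

For example, `can_audit("independent_auditor", "strategy_developer", "sess-2", "sess-1")` is true, while any call in which the agent names or the session identifiers match is rejected.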
Rule 3: Sequential Dependencies
rule: sequential_flow
description: |
  Certain operations must complete before others begin.
dependencies:
  backtest: requires [strategy_code, validated_data]
  audit: requires [backtest_results, documentation]
  deploy_paper: requires [audit_pass]
  deploy_live: requires [paper_results, audit_pass]
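The dependency table above maps directly onto a lookup of required artifacts per operation. The following sketch uses the operation and artifact names from the rule; the `missing_prerequisites` helper itself is a hypothetical illustration.

```python
# Prerequisites per operation, copied from the sequential_flow rule.
DEPENDENCIES = {
    "backtest": {"strategy_code", "validated_data"},
    "audit": {"backtest_results", "documentation"},
    "deploy_paper": {"audit_pass"},
    "deploy_live": {"paper_results", "audit_pass"},
}


def missing_prerequisites(operation, available):
    """Return the sorted prerequisites that `operation` still lacks."""
    return sorted(DEPENDENCIES[operation] - set(available))
```

An operation may start only when the returned list is empty. For instance, `missing_prerequisites("deploy_live", {"audit_pass"})` reports that `paper_results` is still outstanding.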
Agent Communication Protocol
Message Format
# Standard inter-agent message
message:
  from: agent_name
  to: agent_name | "orchestrator"
  type: request | response | handoff
  task_id: unique_identifier
  payload:
    # Task-specific data
  status: pending | in_progress | completed | failed
  timestamp: ISO8601
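A typed wrapper can reject malformed messages before they are sent. The sketch below is one possible shape, not a prescribed implementation: the `AgentMessage` class is hypothetical, and `sender`/`recipient` stand in for `from`/`to` because `from` is a reserved word in Python.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

MESSAGE_TYPES = {"request", "response", "handoff"}
STATUSES = {"pending", "in_progress", "completed", "failed"}


@dataclass
class AgentMessage:
    sender: str     # "from" in the message format
    recipient: str  # "to" in the message format (an agent name or "orchestrator")
    type: str
    task_id: str
    payload: dict = field(default_factory=dict)
    status: str = "pending"
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self):
        # Reject values outside the enumerations listed in the message format.
        if self.type not in MESSAGE_TYPES:
            raise ValueError(f"unknown message type: {self.type!r}")
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")

    def to_dict(self):
        """Serialize using the field names from the message format."""
        return {
            "from": self.sender,
            "to": self.recipient,
            "type": self.type,
            "task_id": self.task_id,
            "payload": self.payload,
            "status": self.status,
            "timestamp": self.timestamp,
        }
```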
Handoff Protocol
# When one agent hands off to another
handoff:
  from: strategy_developer
  to: quant_engineer
  task_id: strat-001
  artifacts:
    - path: strategies/pine_v6/new_strategy.pine
      checksum: sha256:abc123
    - path: docs/strategies/new_strategy.md
      checksum: sha256:def456
  message: "Strategy ready for backtesting"
  next_steps:
    - Run full backtest
    - Walk-forward validation
    - Generate performance report
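The `checksum` values in the handoff are placeholders. Assuming the checksum is the SHA-256 digest of the file's bytes with a `sha256:` prefix, the sending agent could build each `artifacts` entry as sketched here; the `artifact_entry` helper is hypothetical.

```python
import hashlib
from pathlib import Path


def artifact_entry(path):
    """Build one `artifacts` entry: the path plus a 'sha256:<hex>' checksum of its bytes."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return {"path": str(path), "checksum": f"sha256:{digest}"}
```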
Orchestration Procedures
Procedure 1: Initiate New Strategy
# Step 1: Create strategy branch
git checkout -b feature/strat-{id}-{name}
# Step 2: Invoke Strategy Developer
# Input: Hypothesis document, data access
# Output: Strategy code, initial documentation
# Step 3: Verify deliverables
ls strategies/pine_v6/{strategy_name}.pine # Or python/
ls docs/strategies/{strategy_name}.md
# Step 4: Handoff to Quant Engineer
# Create handoff message in tracking system
Procedure 2: Request Audit
# Audit request format
audit_request:
  strategy_name: string
  version: string
  artifacts:
    strategy_code: path
    backtest_results: path
    documentation: path
    feature_code: path (if applicable)
  requestor: agent_name
  priority: normal | urgent
  deadline: YYYY-MM-DD (optional)
Procedure 3: Handle Audit Failure
# On audit failure
failure_response:
  task_id: string
  verdict: FAIL
  issues:
    - id: issue-001
      severity: critical | major | minor
      category: lookahead | leakage | overfitting | documentation | other
      description: string
      location: file:line or general
      remediation: suggested fix
  next_steps:
    - Return to originating agent
    - Fix issues in order of severity
    - Re-submit for audit
    - Critical issues block progress
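The "fix issues in order of severity" and "critical issues block progress" steps can be expressed as a small triage function. This sketch assumes each issue is a dict shaped like the `issues` entries above; the `remediation_plan` helper is hypothetical.

```python
# Lower rank means more severe; names come from the `severity` field above.
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}


def remediation_plan(issues):
    """Return (issues sorted most severe first, True if any critical issue blocks progress)."""
    ordered = sorted(issues, key=lambda issue: SEVERITY_ORDER[issue["severity"]])
    blocked = any(issue["severity"] == "critical" for issue in issues)
    return ordered, blocked
```

Because Python's `sorted` is stable, issues of equal severity keep the order in which the Auditor reported them.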
Escalation Procedures
Escalation Level 1: Agent Conflict
Trigger: Agents disagree on approach or findings.
Resolution:
1. Document both positions
2. Founder reviews within 24 hours
3. Decision logged in docs/runbooks/decisions_log.md
Escalation Level 2: Audit Disagreement
Trigger: Strategy developer disputes audit findings.
Resolution:
1. Auditor provides detailed evidence
2. Developer provides counter-evidence
3. Founder makes final determination
4. If disputed finding was valid: no penalty
5. If disputed finding was invalid: audit process reviewed
Escalation Level 3: Deployment Emergency
Trigger: Live deployment issue discovered post-launch.
Resolution:
1. Immediate: Deployment Agent executes rollback
2. Within 1 hour: Incident documented
3. Within 24 hours: Post-mortem initiated
4. Audit process reviewed for gaps
Monitoring Agent Activity
Activity Logging
All agent activities are logged to logs/agent_activity.log:
{
  "timestamp": "2024-01-15T14:30:00Z",
  "agent": "quant_engineer",
  "action": "backtest_analysis",
  "task_id": "strat-001",
  "inputs": {"strategy": "ict_qm", "data": "btcusdt_1h"},
  "outputs": {"report": "research/reports/ict_qm_backtest.md"},
  "duration_seconds": 120,
  "status": "completed"
}
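An entry like the one above could be appended with a helper such as the following. The sketch assumes the log holds one JSON object per line (JSON Lines), which the runbook does not specify; the `log_activity` function is hypothetical.

```python
import json
from datetime import datetime, timezone


def log_activity(log_path, agent, action, task_id, inputs, outputs,
                 duration_seconds, status):
    """Append one activity record to the log as a single JSON line and return it."""
    entry = {
        # UTC timestamp in the same format as the sample entry
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "agent": agent,
        "action": action,
        "task_id": task_id,
        "inputs": inputs,
        "outputs": outputs,
        "duration_seconds": duration_seconds,
        "status": status,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

Writing one object per line keeps the file appendable and lets recovery tooling scan it without loading the whole log.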
Health Checks
| Check | Frequency | Alert If |
|---|---|---|
| Agent response time | Per invocation | > 5 min |
| Audit backlog | Daily | > 3 pending |
| Failed audits | Weekly | > 50% failure rate |
| Deployment success | Per deployment | Any failure |
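The thresholds in the table can be evaluated mechanically. The sketch below checks all four in a single call for brevity, even though the table gives each check its own frequency; the `health_alerts` function and its parameter names are hypothetical.

```python
def health_alerts(response_seconds, pending_audits, audits_failed,
                  audits_total, deployment_failed):
    """Return the alerts triggered by the health-check thresholds."""
    alerts = []
    if response_seconds > 5 * 60:
        alerts.append("agent response time > 5 min")
    if pending_audits > 3:
        alerts.append("audit backlog > 3 pending")
    # Guard against division by zero when no audits ran in the period.
    if audits_total and audits_failed / audits_total > 0.5:
        alerts.append("audit failure rate > 50%")
    if deployment_failed:
        alerts.append("deployment failure")
    return alerts
```

All comparisons are strict, matching the `>` in the table: exactly 5 minutes, 3 pending audits, or a 50% failure rate does not raise an alert.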
Recovery Procedures
Procedure: Stalled Workflow
Symptoms: Task not progressing, agent unresponsive.
Steps:
1. Check agent activity log for last action
2. Identify blocking issue
3. If agent error: restart agent with context
4. If data issue: fix data, resume from last checkpoint
5. If unclear: escalate to Founder
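Step 1 can be automated by scanning the activity log for the stalled task. This sketch assumes the log stores one JSON object per line and that all timestamps share the UTC ISO 8601 format shown under Activity Logging, so they compare correctly as strings; the `last_activity` helper is hypothetical.

```python
import json


def last_activity(log_lines, task_id):
    """Return the most recent log entry for `task_id`, or None if it never logged."""
    latest = None
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if entry.get("task_id") != task_id:
            continue
        # Same-format UTC ISO 8601 strings sort chronologically.
        if latest is None or entry["timestamp"] >= latest["timestamp"]:
            latest = entry
    return latest
```

The returned entry's `agent`, `action`, and `status` fields show where the task stopped, which is the starting point for step 2.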
Procedure: Corrupted Handoff
Symptoms: Receiving agent cannot process artifacts.
Steps:
1. Verify artifact checksums
2. If mismatch: request re-generation from source agent
3. If valid but unprocessable: check format compatibility
4. Document issue for process improvement
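Steps 1 and 2 can be sketched as a check over the `artifacts` list from the handoff message. The code assumes checksums take the form `sha256:<hex digest of the file bytes>`; the `checksum_mismatches` helper is hypothetical.

```python
import hashlib
from pathlib import Path


def checksum_mismatches(artifacts):
    """Return the paths that are missing, use an unknown algorithm, or fail their checksum."""
    bad = []
    for artifact in artifacts:
        path = Path(artifact["path"])
        algorithm, _, expected = artifact["checksum"].partition(":")
        if algorithm != "sha256" or not path.is_file():
            bad.append(str(path))
            continue
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != expected:
            bad.append(str(path))
    return bad
```

Every path in the returned list needs re-generation by the source agent. An empty list means the artifacts are intact and the problem is more likely one of format compatibility (step 3).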