Offline RL Learning Quality Changelog: Causal Synthetic Action Loop
Root-cause notes for low candidate yield and the updated missing-combo learning-quality loop.
Summary
As of the 2026-06-06 check, the missing-combo offline RL loop was not blocked by runtime or upload failures. The bottleneck was learning quality: the latest 40 attempts produced 0 candidate uploads, and the best real-gate Sharpe (rolling 30d) was 0.441 against the 0.700 gate threshold.
The automation now records learning-quality findings after each attempt instead of only checking whether the process is alive. The report separates no-candidate windows, synthetic-real transfer gaps, flat policy collapse, and high-turnover low-Sharpe behavior so the next engineering action is explicit.
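The four finding types can be illustrated with a small classifier. This is a minimal sketch, not the logic in training_progress.py: the `Attempt` fields, the function names, and every threshold except the 0.700 gate are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

GATE_SHARPE = 0.700  # gate threshold from the Evidence table


@dataclass
class Attempt:
    # All field names are hypothetical; the real attempt record may differ.
    real_sharpe: float       # Sharpe on the real gate
    synthetic_sharpe: float  # Sharpe on synthetic-labeled data
    flat_ratio: float        # fraction of steps where the policy stays flat
    turnover: float          # average position change per step


def classify_attempt(
    a: Attempt,
    gap_threshold: float = 1.0,       # assumed cutoff for a synthetic-real gap
    flat_threshold: float = 0.9,      # assumed cutoff for flat policy collapse
    turnover_threshold: float = 0.5,  # assumed cutoff for high turnover
) -> List[str]:
    """Return learning-quality findings for one attempt that failed the gate."""
    if a.real_sharpe >= GATE_SHARPE:
        return []  # passed the gate, nothing to report
    findings = []
    if a.flat_ratio >= flat_threshold:
        findings.append("flat_policy_collapse")
    if a.synthetic_sharpe - a.real_sharpe >= gap_threshold:
        findings.append("synthetic_real_gap")
    if a.turnover >= turnover_threshold:
        findings.append("high_turnover_low_sharpe")
    return findings


def is_no_candidate_window(attempts: List[Attempt]) -> bool:
    """A window has no candidates when every attempt in it fails the gate."""
    return all(a.real_sharpe < GATE_SHARPE for a in attempts)
```

An attempt can carry more than one finding, which keeps the report from hiding a turnover problem behind a transfer gap.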
Evidence
| Item | Value |
|---|---|
| Analysis window | Latest 40 completed attempts |
| Candidate uploads | 0 |
| Candidate S3 model/meta pairs | 0 |
| Gate failed | 40 |
| Flat policy collapse | 12 |
| Synthetic-real Sharpe gaps | 21 |
| Best recent attempt | DOGEUSDT rebrac/tcn/v2 |
| Best rolling 30d Sharpe | 0.441 |
| Gate threshold | 0.700 |
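To put the counts in proportion, the table values can be turned into rates and a gate shortfall. The sketch below uses only the numbers from the table; the function and key names are hypothetical, and it assumes the two counts are per-attempt tallies over the same 40-attempt window.

```python
def summarize_window(attempts, flat_collapses, sharpe_gaps, best_sharpe, gate):
    """Derive simple ratios from the evidence counts (illustrative only)."""
    return {
        "flat_collapse_rate": flat_collapses / attempts,
        "sharpe_gap_rate": sharpe_gaps / attempts,
        "gate_shortfall": round(gate - best_sharpe, 3),
        "best_as_fraction_of_gate": round(best_sharpe / gate, 3),
    }


summary = summarize_window(
    attempts=40, flat_collapses=12, sharpe_gaps=21, best_sharpe=0.441, gate=0.700
)
print(f"flat collapse rate: {summary['flat_collapse_rate']:.1%}")   # 30.0%
print(f"synthetic-real gap rate: {summary['sharpe_gap_rate']:.1%}")  # 52.5%
print(f"gate shortfall: {summary['gate_shortfall']:.3f}")            # 0.259
```

Under that assumption, roughly half of the attempts show a synthetic-real gap, and the best attempt reaches about 63% of the gate threshold.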
Changes
1. Added the training_progress.py quality-report command.
2. Updated train_missing_agent_combos.sh to refresh learning_quality_changelog.md and learning_quality_summary.json after each recorded attempt.
3. Changed the default missing-combo synthetic action labelers to trend_following,mean_reversion,random.
4. Kept oracle_random_mix out of the default grid; it now requires explicit opt-in for dedicated experiments.
5. Stopped passing --offline-oracle-random-ratio unless oracle_random_mix is actually selected.
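Changes 3 to 5 amount to a conditional argument builder. The sketch below shows the idea in Python rather than the actual shell script. Only `--offline-oracle-random-ratio` and the labeler names come from this changelog; the `--synthetic-action-labelers` flag name, the function name, and the 0.5 default ratio are assumptions.

```python
from typing import List, Sequence

# Default labelers after the change; oracle_random_mix is opt-in only.
DEFAULT_LABELERS = ("trend_following", "mean_reversion", "random")


def build_labeler_args(
    labelers: Sequence[str] = DEFAULT_LABELERS,
    oracle_random_ratio: float = 0.5,  # assumed value, only used on opt-in
) -> List[str]:
    """Build CLI arguments for the synthetic action labelers."""
    # "--synthetic-action-labelers" is a hypothetical flag name.
    args = ["--synthetic-action-labelers", ",".join(labelers)]
    # Pass the oracle ratio only when the oracle labeler is actually selected.
    if "oracle_random_mix" in labelers:
        args += ["--offline-oracle-random-ratio", str(oracle_random_ratio)]
    return args
```

With the defaults, the ratio flag is omitted entirely, so a stale ratio value cannot silently influence a run that does not use the oracle labeler.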
Operating Rule
The automation must not stop at "the process is alive" or "there is no error." If no candidates are produced or real gate Sharpe stays weak, it must read the current learning_quality_changelog.md, classify the learning failure, and apply a narrow code or config improvement.
When synthetic Sharpe is high but real gate Sharpe is low, a restart is not the fix. Treat the synthetic action labeler, reward objective, turnover/action regularization, and best-checkpoint selection as the next improvement surface.
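The rule can be read as a small decision function. This is a sketch under stated assumptions: the function name, the returned action strings, and the 1.0 gap threshold are hypothetical, and treating a dead process as the only case for a restart is an assumption rather than something the rule states.

```python
def next_action(
    process_alive: bool,
    candidates: int,
    best_real_sharpe: float,
    best_synthetic_sharpe: float,
    gate: float = 0.700,
    gap_threshold: float = 1.0,  # assumed cutoff for "synthetic high, real low"
) -> str:
    """Pick the next engineering action; process health alone is not success."""
    if not process_alive:
        return "restart"  # assumption: restarts are reserved for runtime failures
    if candidates > 0 and best_real_sharpe >= gate:
        return "continue"
    if best_synthetic_sharpe - best_real_sharpe >= gap_threshold:
        # High synthetic Sharpe with weak real Sharpe: restarting will not help.
        return "improve_labeler_reward_regularization_or_checkpoint_selection"
    return "classify_failure_and_apply_narrow_fix"
```

For the window described in the summary (process alive, 0 candidates, best real Sharpe 0.441), the function never returns "restart" or "continue"; it returns one of the two improvement actions depending on the synthetic Sharpe.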
Current Run State
The Vast jetech_integrated_offline_v2 session was restarted with causal labelers.
| Item | Value |
|---|---|
| Active combo | QQQUSDT / cql / price_causal_mixer / v2 |
| Synthetic action labelers | trend_following,mean_reversion,random |
| Oracle labeler | Excluded by default |
| Progress Slack | disabled |
| Attempt/failure/error Slack | disabled |
| Candidate backtest Slack | enabled |
| Quality changelog | enabled |
Next Checks
- Check whether the causal labeler switch reduces flat ratio for cql/price_causal_mixer/v2.
- If the synthetic-real gap remains, prioritize real-gate-aligned objective tuning or conservative penalty tuning over labeler churn.
- If best recent Sharpe does not move toward 0.700, narrow the grid around near-miss combos and tune there.
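One way to pick the near-miss combos for the last check is to keep those that failed the gate by less than a fixed margin. The sketch below is illustrative: the function name, the dictionary input shape, and the 0.3 margin are assumptions, not part of the existing tooling.

```python
from typing import Dict, List, Tuple


def near_miss_combos(
    best_sharpe_by_combo: Dict[str, float],
    gate: float = 0.700,
    margin: float = 0.3,  # assumed width of the near-miss band below the gate
) -> List[Tuple[str, float]]:
    """Return combos that failed the gate by less than `margin`, best first."""
    lower = gate - margin
    hits = [
        (combo, sharpe)
        for combo, sharpe in best_sharpe_by_combo.items()
        if lower <= sharpe < gate
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# The best recent attempt from the Evidence table falls inside the assumed band.
print(near_miss_combos({"DOGEUSDT rebrac/tcn/v2": 0.441}))
```

A tighter margin shrinks the grid faster but risks excluding every combo; with a margin of 0.2 the DOGEUSDT attempt at 0.441 would already fall outside the band.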