Research
Jun 6, 2026 · 6 min read · JeTech Lab

Offline RL Learning Quality Changelog: Causal Synthetic Action Loop

Root-cause notes for low candidate yield and the updated missing-combo learning-quality loop.

Tags: Agent Model, Offline RL, Learning Quality, Price Causal Mixer

Summary

As of the 2026-06-06 check, the missing-combo offline RL loop was not blocked by runtime or upload failures. The bottleneck was learning quality: the latest 40 completed attempts produced 0 candidate uploads, and the best real gate Sharpe (rolling 30d, DOGEUSDT rebrac/tcn/v2) was 0.441, below the 0.700 gate threshold.

The automation now records learning-quality findings after each attempt instead of only checking whether the process is alive. The report separates no-candidate windows, synthetic-real transfer gaps, flat policy collapse, and high-turnover low-Sharpe behavior so the next engineering action is explicit.
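The four categories above can be sketched as a small classifier. This is only an illustration: the record fields, the finding names, and every threshold except the 0.700 gate are assumptions, not values taken from training_progress.py.

```python
from dataclasses import dataclass

GATE_SHARPE = 0.700        # gate threshold quoted in this changelog
FLAT_RATIO_LIMIT = 0.95    # assumed: share of steps where the policy stays flat
GAP_LIMIT = 1.0            # assumed: synthetic minus real Sharpe that counts as a gap
TURNOVER_LIMIT = 5.0       # assumed: turnover level treated as "high"


@dataclass
class Attempt:
    """Hypothetical per-attempt record; field names are illustrative."""
    combo: str
    synthetic_sharpe: float
    real_gate_sharpe: float
    flat_ratio: float
    turnover: float
    candidate_uploaded: bool


def classify_attempt(a: Attempt) -> list[str]:
    """Return every per-attempt learning-quality finding that applies."""
    findings = []
    if a.flat_ratio >= FLAT_RATIO_LIMIT:
        findings.append("flat_policy_collapse")
    if a.synthetic_sharpe - a.real_gate_sharpe >= GAP_LIMIT:
        findings.append("synthetic_real_gap")
    if a.turnover >= TURNOVER_LIMIT and a.real_gate_sharpe < GATE_SHARPE:
        findings.append("high_turnover_low_sharpe")
    return findings


def is_no_candidate_window(attempts: list[Attempt]) -> bool:
    """A window is a no-candidate window when no attempt uploaded a candidate."""
    return not any(a.candidate_uploaded for a in attempts)
```

An attempt can carry more than one finding, which is why the function returns a list rather than a single label.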

Evidence

| Item | Value |
| --- | --- |
| Analysis window | Latest 40 completed attempts |
| Candidate uploads | 0 |
| Candidate S3 model/meta pairs | 0 |
| Gate failed | 40 |
| Flat policy collapse | 12 |
| Synthetic-real Sharpe gaps | 21 |
| Best recent attempt | DOGEUSDT rebrac/tcn/v2 |
| Best rolling 30d Sharpe | 0.441 |
| Gate threshold | 0.700 |
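For readers unfamiliar with the gate comparison, here is a minimal sketch of a rolling 30-day Sharpe check. It assumes daily returns, a zero risk-free rate, and annualization over 365 periods; the actual gate computation in the pipeline may differ on all three points.

```python
import math
from statistics import mean, stdev

GATE_SHARPE = 0.700      # gate threshold quoted in this changelog
PERIODS_PER_YEAR = 365   # assumption: one return per calendar day


def sharpe(returns, periods_per_year=PERIODS_PER_YEAR):
    """Annualized Sharpe ratio with a zero risk-free rate (assumed)."""
    if len(returns) < 2:
        return 0.0
    sd = stdev(returns)
    if sd == 0.0:
        # No variation in returns: the ratio is undefined, report 0.0.
        return 0.0
    return mean(returns) / sd * math.sqrt(periods_per_year)


def latest_rolling_sharpe(daily_returns, window=30):
    """Sharpe over the most recent `window` daily returns."""
    return sharpe(daily_returns[-window:])


def passes_gate(daily_returns, window=30, gate=GATE_SHARPE):
    """True when the latest rolling Sharpe meets or exceeds the gate."""
    return latest_rolling_sharpe(daily_returns, window) >= gate
```

Under this definition, the best recent value of 0.441 fails the gate because 0.441 < 0.700.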

Changes

1. Added the training_progress.py quality-report command.
2. Updated train_missing_agent_combos.sh to refresh learning_quality_changelog.md and learning_quality_summary.json after each recorded attempt.
3. Changed the default missing-combo synthetic action labelers to trend_following,mean_reversion,random.
4. Kept oracle_random_mix out of the default grid; it now requires explicit opt-in for dedicated experiments.
5. Stopped passing --offline-oracle-random-ratio unless oracle_random_mix is actually selected.
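Changes 3 to 5 amount to a small argument-building rule. The sketch below shows one way to express it in Python. The function name, the `allow_oracle` switch, and the `--synthetic-action-labelers` flag are hypothetical; only `--offline-oracle-random-ratio` and the labeler names come from this changelog.

```python
DEFAULT_LABELERS = ("trend_following", "mean_reversion", "random")
ORACLE_LABELER = "oracle_random_mix"


def build_labeler_args(labelers=DEFAULT_LABELERS,
                       oracle_random_ratio=None,
                       allow_oracle=False):
    """Build CLI arguments for the synthetic action labelers.

    The oracle labeler is rejected unless the caller opts in explicitly,
    and the oracle ratio flag is emitted only when that labeler is selected.
    """
    labelers = list(labelers)
    uses_oracle = ORACLE_LABELER in labelers
    if uses_oracle and not allow_oracle:
        raise ValueError(f"{ORACLE_LABELER} requires explicit opt-in")

    # Hypothetical flag name, used here for illustration only.
    args = ["--synthetic-action-labelers", ",".join(labelers)]
    if uses_oracle and oracle_random_ratio is not None:
        args += ["--offline-oracle-random-ratio", str(oracle_random_ratio)]
    return args
```

With the defaults, a stray ratio value is silently dropped rather than forwarded, which mirrors the intent of change 5.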

Operating Rule

The automation must not stop at "the process is alive" or "there is no error." If no candidates are produced or real gate Sharpe stays weak, it must read the current learning_quality_changelog.md, classify the learning failure, and apply a narrow code or config improvement.

When synthetic Sharpe is high but real gate Sharpe is low, a restart is not the fix. Treat the synthetic action labeler, reward objective, turnover/action regularization, and best-checkpoint selection as the next improvement surface.
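The operating rule can be read as a decision function that never consults process liveness. The following is a rough sketch under assumed names; the action labels and the cutoff for "high" synthetic Sharpe are illustrative, not part of the automation.

```python
GATE_SHARPE = 0.700            # gate threshold quoted in this changelog
HIGH_SYNTHETIC_SHARPE = 1.5    # assumed cutoff for "synthetic Sharpe is high"

# Improvement surfaces named in the operating rule for the synthetic-real gap.
GAP_SURFACES = (
    "synthetic_action_labeler",
    "reward_objective",
    "turnover_action_regularization",
    "best_checkpoint_selection",
)


def next_action(candidate_uploads, best_real_gate_sharpe, best_synthetic_sharpe):
    """Pick the next step from learning-quality signals only.

    Whether the process is alive is deliberately not an input: a healthy
    process with no candidates still needs a learning-quality fix.
    """
    if candidate_uploads > 0 and best_real_gate_sharpe >= GATE_SHARPE:
        return "keep_running", ()
    if (best_synthetic_sharpe >= HIGH_SYNTHETIC_SHARPE
            and best_real_gate_sharpe < GATE_SHARPE):
        # High synthetic, weak real: a restart does not help.
        return "improve_not_restart", GAP_SURFACES
    return "classify_from_changelog", ()
```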

Current Run State

The Vast jetech_integrated_offline_v2 session was restarted with causal labelers.

| Item | Value |
| --- | --- |
| Active combo | QQQUSDT / cql / price_causal_mixer / v2 |
| Synthetic action labelers | trend_following,mean_reversion,random |
| Oracle labeler | Excluded by default |
| Progress Slack | disabled |
| Attempt/failure/error Slack | disabled |
| Candidate backtest Slack | enabled |
| Quality changelog | enabled |
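To make "causal" concrete: a causal labeler assigns the action at step t using only prices up to and including t. The toy labelers below illustrate that property. They are not the project's implementations, and the lookback, action encoding (-1 short, 0 flat, 1 long), and function names are all assumptions.

```python
import random


def trend_following(prices, t, lookback=3):
    """Follow the sign of the past `lookback`-step price change."""
    if t < lookback:
        return 0  # not enough history yet: stay flat
    change = prices[t] - prices[t - lookback]
    return (change > 0) - (change < 0)


def mean_reversion(prices, t, lookback=3):
    """Bet against the past `lookback`-step price change."""
    return -trend_following(prices, t, lookback)


def random_action(rng):
    """Uniform random action from a caller-supplied random.Random."""
    return rng.choice((-1, 0, 1))


def label_series(prices, labeler, lookback=3):
    """Label every step; each label depends only on prices[:t + 1]."""
    return [labeler(prices, t, lookback) for t in range(len(prices))]
```

Because no label reads a future price, changing later prices leaves earlier labels untouched. An oracle-style labeler would not have this property.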

Next Checks

  • Check whether the causal labeler switch reduces flat ratio for cql/price_causal_mixer/v2.
  • If the synthetic-real gap remains, prioritize real-gate-aligned objective tuning or conservative penalty tuning over labeler churn.
  • If best recent Sharpe does not move toward 0.700, narrow the grid around near-miss combos and tune there.
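The last check, narrowing the grid around near-miss combos, could look something like the sketch below. The margin, the `top_k` limit, and the `(combo, sharpe)` input shape are assumptions chosen for illustration.

```python
def near_miss_combos(attempts, gate=0.700, margin=0.30, top_k=5):
    """Return combos whose best real gate Sharpe falls just short of the gate.

    `attempts` is an iterable of (combo_name, real_gate_sharpe) pairs.
    A combo qualifies when its best Sharpe is within `margin` below the gate.
    """
    best = {}
    for combo, sharpe in attempts:
        if combo not in best or sharpe > best[combo]:
            best[combo] = sharpe

    near = [(c, s) for c, s in best.items() if gate - margin <= s < gate]
    near.sort(key=lambda item: item[1], reverse=True)
    return near[:top_k]


# Toy input: only the DOGEUSDT value (0.441) comes from this changelog,
# the other rows are made up to exercise the function.
example = [
    ("DOGEUSDT/rebrac/tcn/v2", 0.441),
    ("combo_a", 0.10),
    ("combo_a", 0.45),
    ("combo_b", 0.80),
]
narrowed = near_miss_combos(example)
```

With a 0.30 margin, the current best attempt at 0.441 would stay in the narrowed grid, while combos far below the gate drop out.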