Jun 6, 2026 · 6 min read · JeTech Lab

Offline RL Learning Quality Changelog: Causal Synthetic Action Loop

Root-cause notes for low candidate yield and the updated missing-combo learning-quality loop.

Tags: Agent Model, Offline RL, Learning Quality, Price Causal Mixer

Summary

As of the 2026-06-06 check, the missing-combo offline RL loop was not blocked by runtime failures or upload failures. The bottleneck was learning quality: the latest 40 attempts produced 0 candidate uploads, and the best real gate Sharpe was 0.441, which is 0.259 short of the 0.700 gate.

The automation now records learning-quality findings after each attempt instead of only checking whether the process is alive. The report separates no-candidate windows, synthetic-real transfer gaps, flat policy collapse, and high-turnover low-Sharpe behavior so the next engineering action is explicit.
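The separation described above can be sketched as a small classifier over per-attempt metrics. This is a minimal illustration, not the actual training_progress.py logic: the `Attempt` fields, the finding names, and every threshold except the 0.700 gate are assumptions, and the sample attempts are made-up values.

```python
from collections import Counter
from dataclasses import dataclass

GATE_SHARPE = 0.700  # gate threshold quoted in the summary


@dataclass
class Attempt:
    """Hypothetical per-attempt record; field names are assumptions."""
    combo: str
    synthetic_sharpe: float
    real_gate_sharpe: float
    flat_ratio: float   # fraction of steps where the policy stays flat
    turnover: float     # illustrative turnover measure
    candidate_uploaded: bool


def classify(a, *, gap=0.5, flat=0.9, turnover=1.0):
    """Return learning-quality findings for one attempt (thresholds are illustrative)."""
    findings = []
    if a.real_gate_sharpe < GATE_SHARPE:
        findings.append("gate_failed")
    if a.synthetic_sharpe - a.real_gate_sharpe > gap:
        findings.append("synthetic_real_gap")
    if a.flat_ratio >= flat:
        findings.append("flat_policy_collapse")
    if a.turnover >= turnover and a.real_gate_sharpe < GATE_SHARPE:
        findings.append("high_turnover_low_sharpe")
    return findings


def summarize(attempts):
    """Count findings over a window and flag a window with no candidate uploads."""
    counts = Counter(f for a in attempts for f in classify(a))
    counts["no_candidate_window"] = int(not any(a.candidate_uploaded for a in attempts))
    return dict(counts)


# Made-up attempts, one per failure pattern.
attempts = [
    Attempt("example/a", 2.10, 0.30, 0.10, 0.2, False),  # synthetic-real gap
    Attempt("example/b", 0.20, 0.05, 0.97, 0.0, False),  # flat policy collapse
    Attempt("example/c", 0.50, 0.10, 0.20, 3.5, False),  # high turnover, low Sharpe
]
summary = summarize(attempts)
```

An attempt can carry several findings at once, which is why the counts in a report of this shape do not need to sum to the window size.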

Evidence

Item	Value
Analysis window	Latest `40` completed attempts
Candidate uploads	`0`
Candidate S3 model/meta pairs	`0`
Gate failed	`40`
Flat policy collapse	`12`
Synthetic-real Sharpe gaps	`21`
Best recent attempt	`DOGEUSDT rebrac/tcn/v2`
Best rolling 30d Sharpe	`0.441`
Gate threshold	`0.700`

Changes

1. Added the training_progress.py quality-report command.
2. Updated train_missing_agent_combos.sh to refresh learning_quality_changelog.md and learning_quality_summary.json after each recorded attempt.
3. Changed the default missing-combo synthetic action labelers to trend_following,mean_reversion,random.
4. Kept oracle_random_mix out of the default grid; it now requires explicit opt-in for dedicated experiments.
5. Stopped passing --offline-oracle-random-ratio unless oracle_random_mix is actually selected.
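The labeler defaults and the conditional flag can be illustrated with a short argument builder. This is a sketch under assumptions: the real logic lives in train_missing_agent_combos.sh, and the `--synthetic-action-labelers` flag name, the `allow_oracle` switch, and the function itself are hypothetical. Only `--offline-oracle-random-ratio` and the labeler names come from the notes above.

```python
DEFAULT_LABELERS = ("trend_following", "mean_reversion", "random")
ORACLE_LABELER = "oracle_random_mix"


def build_labeler_args(labelers=DEFAULT_LABELERS, oracle_random_ratio=None,
                       allow_oracle=False):
    """Build CLI arguments for the synthetic action labelers.

    The oracle labeler is rejected unless explicitly opted in, and the
    oracle ratio flag is emitted only when that labeler is selected.
    """
    labelers = tuple(labelers)
    uses_oracle = ORACLE_LABELER in labelers
    if uses_oracle and not allow_oracle:
        raise ValueError(f"{ORACLE_LABELER} requires explicit opt-in")

    # Hypothetical flag name, used here for illustration only.
    args = ["--synthetic-action-labelers", ",".join(labelers)]
    if uses_oracle and oracle_random_ratio is not None:
        args += ["--offline-oracle-random-ratio", str(oracle_random_ratio)]
    return args


default_args = build_labeler_args()
# A stray ratio is ignored when the oracle labeler is not selected.
ignored_ratio_args = build_labeler_args(oracle_random_ratio=0.5)
oracle_args = build_labeler_args(
    labelers=DEFAULT_LABELERS + (ORACLE_LABELER,),
    oracle_random_ratio=0.5,
    allow_oracle=True,
)
```

Dropping the ratio flag when the oracle labeler is absent keeps the recorded command line honest about which labelers actually shaped the run.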

Operating Rule

The automation must not stop at "the process is alive" or "there is no error." If no candidates are produced or real gate Sharpe stays weak, it must read the current learning_quality_changelog.md, classify the learning failure, and apply a narrow code or config improvement.

When synthetic Sharpe is high but real gate Sharpe is low, a restart is not the fix. Treat the synthetic action labeler, reward objective, turnover/action regularization, and best-checkpoint selection as the next improvement surface.
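The operating rule amounts to a decision that looks past process liveness. The sketch below is one possible encoding. The `Action` names, the `decide` function, and the gap threshold are assumptions made for illustration; only the 0.700 gate comes from the report.

```python
from enum import Enum

GATE_SHARPE = 0.700  # gate threshold from the report


class Action(Enum):
    RESTART = "restart"                    # only for a dead process
    CONTINUE = "continue"                  # candidates exist and the gate is met
    TUNE_TRANSFER = "tune_transfer"        # labeler, reward, regularization, checkpoint selection
    CLASSIFY_FAILURE = "classify_failure"  # read the changelog and classify further


def decide(process_alive, candidate_uploads, best_real_sharpe,
           best_synthetic_sharpe, gap_threshold=0.5):
    """Pick the next step; a live, error-free process is not treated as success."""
    if not process_alive:
        return Action.RESTART
    if candidate_uploads > 0 and best_real_sharpe >= GATE_SHARPE:
        return Action.CONTINUE
    # High synthetic Sharpe with weak real gate Sharpe: restarting does not help.
    if best_synthetic_sharpe - best_real_sharpe > gap_threshold:
        return Action.TUNE_TRANSFER
    return Action.CLASSIFY_FAILURE


# Real Sharpe 0.441 is from the report; the synthetic value 1.5 is made up.
example = decide(True, 0, 0.441, 1.5)
```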

Current Run State

The Vast jetech_integrated_offline_v2 session was restarted with causal labelers.

Item	Value
Active combo	`QQQUSDT / cql / price_causal_mixer / v2`
Synthetic action labelers	`trend_following,mean_reversion,random`
Oracle labeler	Excluded by default
Progress Slack	disabled
Attempt/failure/error Slack	disabled
Candidate backtest Slack	enabled
Quality changelog	enabled

Next Checks

Check whether the causal labeler switch reduces flat ratio for cql/price_causal_mixer/v2.
If the synthetic-real gap remains, prioritize real-gate-aligned objective tuning or conservative penalty tuning over labeler churn.
If best recent Sharpe does not move toward 0.700, narrow the grid around near-miss combos and tune there.
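The last check, narrowing the grid around near-miss combos, could look like the sketch below. The selection rule (keep combos whose best real gate Sharpe is at least half the gate, then take the top few) is an assumption, as are the function name and all sample values except the DOGEUSDT entry from the evidence table.

```python
GATE_SHARPE = 0.700  # gate threshold from the report


def near_miss_combos(best_sharpe_by_combo, gate=GATE_SHARPE,
                     min_fraction=0.5, top_k=3):
    """Return up to top_k combos that failed the gate but came closest to it."""
    floor = gate * min_fraction
    near = [(combo, sharpe) for combo, sharpe in best_sharpe_by_combo.items()
            if floor <= sharpe < gate]
    near.sort(key=lambda item: item[1], reverse=True)
    return [combo for combo, _ in near[:top_k]]


# Only the first entry is from the report; the rest are made-up placeholders.
best = {
    "DOGEUSDT rebrac/tcn/v2": 0.441,
    "example/a": 0.12,
    "example/b": 0.39,
    "example/c": -0.20,
}
focus = near_miss_combos(best, top_k=2)
```

With a floor at half the gate (0.350), the 0.441 attempt stays in the narrowed grid while clearly weak combos drop out.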