Agent Training Note: The Gap Between Synthetic Training and Real 5Y Gates
Why synthetic-data RL training and real five-year backtest gates diverge for AI trading models, including TradeFi, crypto expansion, and rolling Sharpe.
Summary
Recent TradeFi agent training has produced fewer candidate models than expected. At first glance this can look like an algorithm issue across PPO, SAC, CrossQ, or model backbones. The stronger diagnosis is different: the training objective and the candidate registry gate are not aligned tightly enough.
Training improves risk-adjusted reward inside synthetic episodes. Candidate registration, however, requires the model to pass 30D rolling avg Sharpe >= 1.0 on a continuous five-year real-market daily backtest. This gate rewards consistency across a long real period, not only a few strong windows.
What We Observed
As of May 13, 2026 KST, the VM training results show repeated patterns.
| Observation | Interpretation |
|---|---|
| Synthetic Sharpe and Calmar can be high while the real gate is low | Good behavior in the synthetic task does not fully transfer to the real five-year gate |
| Many results have an exact gate value of 0.0 | The model often converges to a flat policy with almost no position |
| Some runs peak early after only a few thousand steps, then degrade | Selection is catching short stochastic real-eval peaks rather than stable learning |
| Performance differs strongly by symbol | SPY/QQQ/XAU behave differently from country ETFs, metals, and energy contracts |
Examples make the mismatch clear.
| Run | Synthetic best | Real 5Y 30D Avg Sharpe |
|---|---|---|
| XAGUSDT / ppo_tcn_v1 | Sharpe 4.81, Calmar 12.15 | 0.912 |
| COPPERUSDT / ppo_itransformer_v1 | Sharpe 10.97, Calmar 10.78 | 0.882 |
| SPYUSDT / ppo_itransformer_v1 | Calmar 4.11 | 0.139 |
These numbers do not prove that the algorithms are useless. They show that the synthetic task can be solved in ways that do not satisfy the real registry gate.
Problem 1: Objective-Gate Mismatch
The current reward is built from per-bar return, transaction cost, drawdown, downside risk, and selected risk-adjusted terms. That is useful for teaching a policy to survive an episode and avoid large losses.
The registry gate is more specific.
| Field | Current gate |
|---|---|
| Evaluation data | Five years of real daily market data |
| Evaluation mode | Continuous five-year strategy return series |
| Main metric | Average of 30-day rolling Sharpe |
| Pass condition | 5Y 30D Avg Sharpe >= 1.0 |
| TradeFi annualization | 252 periods/year |
The model optimizes synthetic episode reward, but it must pass a continuous real-market rolling Sharpe gate. If those two are not aligned, high synthetic scores will not reliably become valid candidates.
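To make the gate concrete, here is a minimal sketch of how an average 30-day rolling Sharpe could be computed from a daily strategy return series. The function name `rolling_sharpe_gate`, the sample standard deviation (`ddof=1`), and the choice to score zero-volatility windows as 0.0 are assumptions for illustration; the actual registry code may handle these details differently.

```python
import numpy as np


def rolling_sharpe_gate(daily_returns, window=30, periods_per_year=252, threshold=1.0):
    """Return (average rolling Sharpe, pass flag) for a daily return series.

    Hypothetical helper: the real gate implementation is not shown in this note.
    """
    r = np.asarray(daily_returns, dtype=float)
    if r.size < window:
        raise ValueError("need at least one full window of returns")

    # One row per rolling window of `window` consecutive daily returns.
    windows = np.lib.stride_tricks.sliding_window_view(r, window)
    mean = windows.mean(axis=1)
    std = windows.std(axis=1, ddof=1)

    # Assumption: a window with no volatility (e.g. a flat policy) scores 0.0.
    sharpe = np.zeros_like(mean)
    has_vol = std > 1e-12
    sharpe[has_vol] = mean[has_vol] / std[has_vol] * np.sqrt(periods_per_year)

    avg = float(sharpe.mean())
    return avg, avg >= threshold
```

Under these assumptions a policy that never takes a position produces an all-zero return series and an average of exactly 0.0, which matches the repeated 0.0 gate values seen on the VM. Because the metric averages every window across five years, a handful of strong windows cannot offset long stretches of weak or flat ones.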
Problem 2: Flat Policy Is Too Easy
If a trading agent holds no position, it earns no return, but it also avoids most transaction cost and drawdown penalties. With target exposure deadbands and rebalance deadbands, small actions are also collapsed to zero or to the existing position.
That makes a zero-position policy an attractive local optimum when the model is uncertain. On the VM, repeated 5Y 30D Avg Sharpe = 0.0 results are a strong sign that the model did not create meaningful exposure.
A flat policy may look safe, but it is not a useful operating candidate. A JeTech candidate model must produce a real signal while controlling risk.
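The deadband effect described above can be sketched as follows. The function name `apply_deadbands`, the threshold values, and the order in which the two deadbands are applied are hypothetical; they only illustrate how small actions collapse to zero or to the existing position.

```python
def apply_deadbands(target, current, exposure_deadband=0.05, rebalance_deadband=0.02):
    """Map a raw target exposure to the exposure that is actually held.

    Illustrative thresholds only; the training presets are not given in this note.
    """
    # Target exposure deadband: a weak signal is treated as "no position".
    if abs(target) < exposure_deadband:
        target = 0.0
    # Rebalance deadband: a small change keeps the existing position.
    if abs(target - current) < rebalance_deadband:
        return current
    return target
```

With these example thresholds, an uncertain policy that starts flat and emits targets between -0.05 and 0.05 never opens a position, so it pays no transaction cost and takes no drawdown penalty. That is the mechanism that makes the flat policy cheap to reach and hard to leave.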
Problem 3: Synthetic Data Is Not the Same as TradeFi Reality
For some TradeFi symbols, Binance listing history is not long enough for the maximum inference window, so the training pipeline uses yfinance-backed substitute symbols. That data has gaps for weekends, holidays, and exchange closures, unlike the continuous crypto calendar.
Even when synthetic episodes are calibrated from the recent five-year reference set and Temporal GAN outputs, they cannot fully reproduce each symbol's micro-regime. SPY, QQQ, EWY, gold, silver, copper, crude oil, and natural gas have different trend, gap, volatility clustering, and mean-reversion structures.
One shared reward function and one shared preset family will not fit every symbol equally well.
What We Changed
The goal of this change is not to inflate candidate pass rates. The goal is to identify failures more clearly, stop wasting compute on obvious flat policies, and move the TradeFi reward slightly closer to the real gate.
| Change | Purpose |
|---|---|
| Added exposure diagnostics | Record whether the model actually takes positions |
| Added flat real-eval early stop | Stop runs that remain flat across consecutive real-eval rounds |
| Added a TradeFi flat-position penalty | Make zero exposure during meaningful market moves non-free |
| Expanded gate metadata | Make registry failures easier to explain in JeTech Lab |
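A flat real-eval early stop can be sketched as a small counter over consecutive evaluation rounds. The class name `FlatEarlyStop`, the `patience` value, and the 0.99 flat-ratio threshold are assumptions for illustration, not the values used in the training script.

```python
from dataclasses import dataclass


@dataclass
class FlatEarlyStop:
    """Signal a stop after `patience` consecutive flat real-eval rounds (hypothetical)."""

    patience: int = 3
    flat_ratio_threshold: float = 0.99
    consecutive_flat: int = 0

    def update(self, flat_ratio: float) -> bool:
        """Record one real-eval round; return True when training should stop.

        `flat_ratio` is the share of bars with near-zero realized exposure.
        """
        if flat_ratio >= self.flat_ratio_threshold:
            self.consecutive_flat += 1
        else:
            # Any round with meaningful exposure resets the counter.
            self.consecutive_flat = 0
        return self.consecutive_flat >= self.patience
```

Resetting the counter on any non-flat round is a deliberate choice in this sketch: it stops only runs that stay flat, not runs that are flat occasionally while still exploring.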
The new diagnostics include:
| Metric | Meaning |
|---|---|
| avg_abs_realized_exposure | Average realized position strength |
| flat_realized_exposure_ratio | Share of bars where realized exposure was near zero |
| mean_turnover | Average position change |
| total_turnover | Total position change |
| nonzero_strategy_return_ratio | Share of bars with non-zero strategy return |
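As a rough illustration, these diagnostics could be derived from per-bar realized exposure and strategy returns as shown below. The function name `exposure_diagnostics`, the `flat_eps` tolerance for "near zero", and the assumption that the run starts from a flat position are illustrative choices, not the script's confirmed definitions.

```python
import numpy as np


def exposure_diagnostics(realized_exposure, strategy_returns, flat_eps=1e-3):
    """Compute exposure diagnostics from per-bar series (illustrative sketch)."""
    e = np.asarray(realized_exposure, dtype=float)
    r = np.asarray(strategy_returns, dtype=float)

    # Per-bar position change; assumes the run starts from zero exposure.
    turnover = np.abs(np.diff(e, prepend=0.0))

    return {
        "avg_abs_realized_exposure": float(np.abs(e).mean()),
        "flat_realized_exposure_ratio": float((np.abs(e) <= flat_eps).mean()),
        "mean_turnover": float(turnover.mean()),
        "total_turnover": float(turnover.sum()),
        "nonzero_strategy_return_ratio": float((r != 0.0).mean()),
    }
```

A run with `flat_realized_exposure_ratio` near 1.0 and `total_turnover` near 0 is a flat policy regardless of how good its synthetic Sharpe looks.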
For TradeFi training, the script now applies a small flat-position penalty by default.
flat penalty = penalty_scale * max(abs(asset_return) - return_threshold, 0) * (1 - abs(realized_exposure))

The default values are penalty_scale=0.02 and return_threshold=0.001. This does not force the agent to trade every bar. It only prevents meaningful market moves with zero exposure from being completely free.
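The penalty formula translates directly into a per-bar function. The sketch below uses the stated defaults; the function name `flat_position_penalty` and the clipping of exposure to at most 1.0 (so the penalty never turns negative for leveraged positions) are assumptions added for illustration.

```python
def flat_position_penalty(asset_return, realized_exposure,
                          penalty_scale=0.02, return_threshold=0.001):
    """Per-bar penalty for sitting out a meaningful market move (sketch)."""
    # Only the part of the move above the threshold is penalized.
    excess_move = max(abs(asset_return) - return_threshold, 0.0)
    # Assumption: exposure is clipped to [0, 1] in magnitude.
    exposure = min(abs(realized_exposure), 1.0)
    return penalty_scale * excess_move * (1.0 - exposure)
```

For a 2% bar with zero exposure, the penalty is 0.02 * (0.02 - 0.001) * 1 = 0.00038, small relative to the move itself. A fully invested position pays nothing, and so does any bar that moves less than 0.1%, which is why the term discourages a permanently flat policy without forcing a trade on every bar.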
Next Improvements
This is the first defensive layer. The deeper improvements are:
| Workstream | Description |
|---|---|
| Stronger real-window validation | Use real historical windows more directly during training-time selection |
| Symbol-specific presets | Separate deadbands and reward scales for SPY/QQQ, EWY, metals, and energy |
| Multi-seed retries | Compare multiple stochastic runs instead of judging a symbol from one seed |
| Supervised warm start | Imitate simple baselines such as momentum or volatility targeting before RL fine-tuning |
| Gate-aligned reward | Bring 30-day rolling Sharpe and exposure quality closer to the reward itself |
What This Means for Subscribers
A low candidate-generation rate does not mean JeTech Lab is accepting weak models. It mostly means the real five-year gate is filtering them out.
That said, a strict gate only works well when the training system is designed around it. This research note and the accompanying code changes are part of that alignment work. Future agent model updates will make it easier to review not only returns, but also exposure quality, flat ratio, and rolling Sharpe stability.