June 5, 2026 | 14 min read | JeTech Lab

Offline RL Trading Model Improvement Notes: Synthetic Action Labels and the Price Causal Mixer

Design and implementation notes that reorganize the JeTech agent's offline RL training path around price-only input, synthetic action labels, and a learned price encoder.

Agent Model

Offline RL

Synthetic Data

Reinforcement Learning

Time Series

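Summary

The core of this change is to correct the terminology and the model structure together. The name behavior_policies in the existing code did not accurately describe its role. The value is not a policy running in production; it is a labeler that attaches target positions in the range [-1, 1] to a synthetic price path in order to build offline RL trajectories. The new public name is therefore synthetic_action_labelers. The old behavior_policies is kept only as a deprecated alias for compatibility with saved runs and the CLI.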
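A deprecated alias of this kind is usually resolved in one place when the config is loaded. The following is a minimal sketch, not the JeTech implementation: the function name `resolve_labelers`, the dict-shaped config, and the choice to reject configs that set both keys are all assumptions made for illustration.

```python
import warnings
from typing import List, Mapping, Sequence

NEW_KEY = "synthetic_action_labelers"
OLD_KEY = "behavior_policies"  # deprecated alias kept for saved runs and the CLI


def resolve_labelers(config: Mapping[str, Sequence[str]]) -> List[str]:
    """Return labeler names, accepting the deprecated key as a fallback.

    Hypothetical helper: raising when both keys are present is an assumption,
    chosen so that a stale alias can never silently override the new name.
    """
    if NEW_KEY in config and OLD_KEY in config:
        raise ValueError(f"set only one of {NEW_KEY!r} and {OLD_KEY!r}")
    if OLD_KEY in config:
        warnings.warn(
            f"{OLD_KEY!r} is deprecated; use {NEW_KEY!r} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return list(config[OLD_KEY])
    return list(config.get(NEW_KEY, []))
```

With this shape, old manifests keep loading while every new code path reads only the new key.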

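The second change is to enforce the model input principle more strictly. New offline RL models for the JeTech agent must trade by looking at price history only. Instead of adding more hand-crafted, price-derived features, the encoder itself learns the price representation. For this we added a new backbone, price_causal_mixer. This encoder accepts only window_size x 4 OHLC values and does not pass through the legacy 7-dimensional observation or the 35 engineered feature path.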

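The TO-BE direction of this experiment is broader. Today the pipeline has two stages: it first generates synthetic price data and then attaches position/action labels on top. Ultimately it should become a synthetic trajectory generator that produces prices and positions together.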

Stage	Current	TO-BE
Synthetic data	Generates a price path	Generates the price path and the position path together
Action source	A synthetic action labeler is applied after the price path	A joint trajectory that includes latent regime, risk budget, and execution rules
Offline RL dataset	Converted into (observation, action, reward, next observation) transitions	Stored from the generation stage as trajectories with policy support
Model input	Stays OHLC-only	Stays OHLC-only
Feature representation	Learned by the encoder	Learned by the encoder

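Why it is not a behavior policy

In the offline RL literature, "behavior policy" naturally refers to the actual policy that produced the dataset. In JeTech synthetic training, however, there are no logs of human or bot actions observed in a real market. What we have is a synthetic price path, to which we attach actions generated for training.

A more accurate name is therefore one of the following.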

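Name	Pros	Cons
`synthetic_action_labeler`	Describes the current implementation most accurately: it attaches action/position labels to a price path.	The word "labeler" may suggest model training.
`synthetic_position_labeler`	In the trading domain, position is more intuitive than action.	The name is somewhat distant from the action tensor in the code.
`trajectory_labeler`	Describes the offline RL transition generation process well.	Can hide the current structure, in which price generation and action generation are separate.
`synthetic_trajectory_generator`	Best fit for the TO-BE structure.	The current implementation is not yet joint generation.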

This code change standardizes the public CLI/env name as synthetic_action_labelers. The reason is that the tensor actually stored is the offline RL action, and its value means target exposure or position. In the documentation we refer to it as a "synthetic action/position label".

Current training pipeline

The current offline RL training follows the flow below.

CLI / missing-combo runner
  -> OfflineAgentTrainingConfig
  -> SyntheticEpisodeConfig
  -> TradingEnvConfig
  -> collect_offline_dataset()
  -> generate_synthetic_episode()
  -> synthetic_action_labelers
  -> simulate_trading_trajectory()
  -> OfflineTransitionDataset
  -> normalize observations/actions/rewards
  -> offline RL algorithm
  -> checkpoint / synthetic eval
  -> real 5y registry gate
  -> S3 candidate upload

In more detail, the path by which window_size x OHLC reaches the model is as follows.

Layer	Data shape	Description
Price source	`N x OHLC`	Real frame from Binance/Yahoo, or a synthetic frame
Synthetic episode	`N x open, high, low, close`	Price path reflecting synthetic calibration, tail stress, and reference blend
Action label	`N x action[-1,1]`	Produced by `random`, `oracle_random_mix`, `trend_following`, `mean_reversion`, and others
Trading simulator	`transition_count x (obs, action, reward, next_obs, done)`	Computes turnover, drawdown, reward, and position changes
Offline dataset	`obs: T x window x 4`	The new default is

The legacy path is kept separately. Older models take a window_size x 7 raw observation and convert it into 35 engineered features inside the model. The new training default does not use this path.

New path
  OHLC 4
    -> LayerNorm / Linear / sequence encoder
    -> learned features
    -> policy/value heads

legacy path
  prev_close, open, high, low, close, liquidity_stress, target_exposure
    -> engineered 35 features
    -> sequence encoder
    -> policy/value heads
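To make the `T x window x 4` observation shape concrete, the sketch below slices an `N x OHLC` series into causal windows and pairs consecutive windows as `(obs, next_obs)`. It is an illustration under assumptions: `build_ohlc_windows` and `to_transitions` are hypothetical names, and the real simulator may count or filter transitions differently.

```python
from typing import List, Sequence, Tuple

Bar = Tuple[float, float, float, float]  # (open, high, low, close)


def build_ohlc_windows(bars: Sequence[Bar], window_size: int) -> List[List[Bar]]:
    """Return every window of `window_size` consecutive bars.

    The window ending at index t contains bars[t - window_size + 1 : t + 1],
    so no window ever includes a bar from its own future.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [
        list(bars[end - window_size:end])
        for end in range(window_size, len(bars) + 1)
    ]


def to_transitions(windows: List[List[Bar]]) -> List[Tuple[List[Bar], List[Bar]]]:
    """Pair each window with its successor as (obs, next_obs)."""
    return list(zip(windows[:-1], windows[1:]))


if __name__ == "__main__":
    # Toy series: 10 bars, window of 4 -> 7 windows -> 6 transitions.
    toy = [(float(i), i + 1.0, i - 1.0, i + 0.5) for i in range(10)]
    windows = build_ohlc_windows(toy, window_size=4)
    print(len(windows), len(to_transitions(windows)))
```

Each observation carries only the four price columns, which is the property the OHLC-only path is meant to guarantee.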

Synthetic Action Labeler

This change adds two labelers.

Labeler	Input	Lookahead	Meaning
`trend_following`	Recent close history	None	Long if the recent price path is rising, short if falling
`mean_reversion`	Recent close history	None	Short if the recent price path is rising, long if falling

Neither is an oracle. They do not look at future returns; they see only the close path up to the current time step. The action magnitude reflects the recent path return, realized volatility, a deadband, a turnover penalty, and noise, and is clipped to [-1, 1].
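A labeler with these ingredients can be sketched in a few lines. This is an assumption-laden illustration rather than the actual JeTech labeler: the function name `label_action`, the volatility-scaled score, the `tanh` squashing, the default parameter values, and the form of the turnover penalty are all hypothetical, and the noise term is left out to keep the sketch deterministic.

```python
import math
from typing import Sequence


def label_action(
    closes: Sequence[float],
    t: int,
    lookback: int = 20,
    deadband: float = 0.25,
    mode: str = "trend_following",
    prev_action: float = 0.0,
    turnover_penalty: float = 0.0,
) -> float:
    """Target position in [-1, 1] at step t, using closes[: t + 1] only."""
    if mode not in ("trend_following", "mean_reversion"):
        raise ValueError(f"unknown mode: {mode}")
    window = closes[max(0, t - lookback): t + 1]  # no index beyond t is read
    if len(window) < 3:
        return 0.0
    rets = [math.log(b / a) for a, b in zip(window, window[1:])]
    path_return = sum(rets)
    mean = path_return / len(rets)
    variance = sum((r - mean) ** 2 for r in rets) / len(rets)
    realized_vol = math.sqrt(variance * len(rets))  # volatility over the window
    score = path_return / (realized_vol + 1e-12)
    # Deadband: ignore moves that are small relative to realized volatility.
    raw = 0.0 if abs(score) < deadband else math.tanh(score)
    if mode == "mean_reversion":
        raw = -raw  # same signal, opposite side
    # Turnover penalty: move only part of the way from the previous action.
    action = prev_action + (raw - prev_action) / (1.0 + turnover_penalty)
    return max(-1.0, min(1.0, action))
```

Because the function slices `closes` up to index `t` and nothing later, changing any future close leaves the label unchanged, which is the no-lookahead property that separates these labelers from `oracle_random_mix`.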

Used together with the existing labelers, they widen the dataset support somewhat.

Labeler	Role
`random`	Baseline that widens action support
`oracle_random_mix`	Mixes a future-return-based oracle with random actions to give a directional hint
`trend_following`	Price-only momentum archetype
`mean_reversion`	Price-only contrarian archetype

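The important point is that this is not yet a "good trading model" in itself. It is a device for building the trajectory support that offline RL learns from. Whether the model actually improves has to be judged at the real gate.

Price Causal Mixer architecture

price_causal_mixer is an encoder that enforces the price-only principle at the code level.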

Input
  shape: batch x window_size x 4
  columns: open, high, low, close

Block 0
  LayerNorm(4)
  Linear(4 -> hidden_size)
  GELU
  LayerNorm(hidden_size)

Repeated causal mixer blocks
  LayerNorm(hidden)
  causal depthwise Conv1d(kernel=3, dilation=1,2,4,...)
  residual
  LayerNorm(hidden)
  channel MLP(hidden -> 2*hidden -> hidden)
  residual

Pooling
  65% last token
  35% learned query attention over past window

Output
  LayerNorm
  GELU
  Linear(hidden -> features_dim)
  Tanh
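Two pieces of the architecture above are easy to misread, so here is a dependency-free sketch of each: a single-channel causal dilated convolution and the 65/35 pooling. It is illustrative only. The real encoder presumably uses tensor operations, and the unscaled dot-product attention score and the function names are assumptions.

```python
import math
from typing import List, Sequence


def causal_conv1d(x: Sequence[float], weights: Sequence[float], dilation: int) -> List[float]:
    """One channel of a causal dilated convolution with implicit left zero-padding.

    y[t] = sum_j weights[j] * x[t - (k - 1 - j) * dilation], so y[t] never
    reads any x[s] with s > t.
    """
    k = len(weights)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            src = t - (k - 1 - j) * dilation
            if src >= 0:  # positions before the window start count as zero
                acc += w * x[src]
        out.append(acc)
    return out


def mixed_pool(
    tokens: Sequence[Sequence[float]],
    query: Sequence[float],
    last_weight: float = 0.65,
) -> List[float]:
    """Blend the last token with softmax attention over every token in the window."""
    scores = [sum(q * v for q, v in zip(query, tok)) for tok in tokens]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    dim = len(tokens[0])
    attended = [
        sum(e / total * tok[d] for e, tok in zip(exps, tokens)) for d in range(dim)
    ]
    return [
        last_weight * tokens[-1][d] + (1.0 - last_weight) * attended[d]
        for d in range(dim)
    ]
```

Causality here comes from the indexing alone: altering the last input changes only the last output, so stacking such layers cannot leak future bars into earlier positions.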

The v2 preset is as follows.

Setting	Value
`window_size`	160
`observation_schema`	`ohlc`
`encoder_hidden_size`	192
`encoder_layers`	5
`features_dim`	192
policy/value head	`[384, 192]`

The v3 preset is larger and intended for experiments.

Setting	Value
`window_size`	224
`encoder_hidden_size`	256
`encoder_layers`	6
`features_dim`	256
policy/value head	`[512, 256]`
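One way to read these presets is to compare the window size with the receptive field of the convolution stack. The calculation below is a sketch under two assumptions that the source does not confirm: each mixer block contains exactly one kernel-3 causal convolution, and block `i` uses dilation `2**i`, following the `1,2,4,...` pattern.

```python
def receptive_field(num_blocks: int, kernel_size: int = 3) -> int:
    """Time steps visible to the last token after stacked causal convolutions.

    Assumes block i uses dilation 2**i; each block then adds
    (kernel_size - 1) * 2**i steps of history.
    """
    return 1 + sum((kernel_size - 1) * (2 ** i) for i in range(num_blocks))


# Preset values copied from the tables; the dilation schedule is assumed.
presets = {
    "v2": {"window_size": 160, "encoder_layers": 5},
    "v3": {"window_size": 224, "encoder_layers": 6},
}

for name, preset in presets.items():
    rf = receptive_field(preset["encoder_layers"])
    print(name, "receptive field:", rf, "of window", preset["window_size"])
# v2 receptive field: 63 of window 160
# v3 receptive field: 127 of window 224
```

If those assumptions hold, the convolution path of the last token covers less than half of the window in both presets, and the learned query attention in the pooling stage would be the route by which older bars reach the output.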

Model naming and versioning rules

The new name price_causal_mixer was introduced in this change because the encoder backbone changed. JeTech uses the following rules.

Change	Classification
Offline RL algorithm change, e.g. `td3_bc` -> `cql`	New algorithm/model combination
Change of encoder backbone or of the encoder+head architecture family	New model/backbone name
Change of `window_size`, hidden size, layer count, `features_dim`, or head dimensions within the same encoder	New version of the same model
Retraining the same combination and registering/promoting it	New artifact promotion version

price_causal_mixer is therefore a new model family, and price_causal_mixer v2 and v3 differ only as capacity/window presets within that family.
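The rules can be written down as a small classifier. This is a sketch for illustration: the config keys, the function name `classify_change`, and the precedence order (algorithm first, then backbone, then capacity) are assumptions, not part of the JeTech code.

```python
from typing import Mapping

# Hypothetical key names for the capacity/window settings listed in the rules.
CAPACITY_KEYS = (
    "window_size",
    "encoder_hidden_size",
    "encoder_layers",
    "features_dim",
    "head_dims",
)


def classify_change(old: Mapping[str, object], new: Mapping[str, object]) -> str:
    """Map a config diff to the naming/versioning category it falls under."""
    if old["algorithm"] != new["algorithm"]:
        return "new algorithm/model combination"
    if old["backbone"] != new["backbone"]:
        return "new model/backbone name"
    if any(old.get(key) != new.get(key) for key in CAPACITY_KEYS):
        return "new version of the same model"
    # Same combination retrained and registered or promoted again.
    return "new artifact promotion version"


v2 = {"algorithm": "td3_bc", "backbone": "price_causal_mixer",
      "window_size": 160, "encoder_layers": 5}
v3 = dict(v2, window_size=224, encoder_layers=6)
print(classify_change(v2, v3))  # new version of the same model
```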

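This structure matches the principle stated by the user. The encoder is part of the model. So the right direction is not "have people hand-craft better features and feed them in" but "feed in only prices and let a larger encoder learn the features". Simply adding parameters does not automatically help, though. In offline RL, when dataset support is narrow, a large model can be more confidently wrong about the value of out-of-support actions. That is why the diversity of synthetic action/position paths and the real gate are both needed.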

Scope of implementation changes

The implementation was kept narrow.

File	Change
`trading/agents/offline_rl.py`	Added `synthetic_action_labelers` to CLI/config/manifest; added the `trend_following` and `mean_reversion` labelers; kept the existing `behavior_policies` alias
`trading/agents/models/price_causal_mixer.py`	Added the OHLC-only causal mixer encoder
`trading/agents/models/registry.py`	Registered `price_causal_mixer`, `price_mixer`, `causal_price_mixer`, `price_only_mixer`
`trading/agents/config/runtime_presets.json`	Added v3 capacity presets for the existing backbones; added `price_causal_mixer` v2/v3

Local verification

The following checks passed locally.

Check	Result
Runtime preset JSON parse	pass
`trading/tests/test_offline_rl.py`	60 passed
`trading/tests/test_missing_agent_combos.py` + `test_training_progress.py`	31 passed
CPU smoke training	result JSON and `last_model.pt` created

The key settings of the CPU smoke command were as follows.

symbol=BTCUSDT
algorithm=td3_bc
backbone=price_causal_mixer
model_version=v2
synthetic_action_labelers=trend,mean_revert
episode_length=256
train_steps=1
registry_gate=off
real_eval=off

Smoke result:

Metric	Value
encoder	`price_causal_mixer`
observation schema	`ohlc`
window size	160
transition count	96
synthetic action labelers	`trend_following`, `mean_reversion`
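artifact	`last_model.pt` created

The Sharpe of a 1-step smoke run is meaningless. The purpose of this check is not performance but confirming that the new backbone passes through dataset collection, the train loop, eval, and the artifact path.

Vast experiment results

The code was synced to Vast and passed the import/assertion checks remotely. The targeted run ran to completion but did not pass the candidate registration gate.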

Item	Value
combo	`BTCUSDT / td3_bc / price_causal_mixer / v2`
dataset episodes	100
episode length	1000
train steps	10000
batch size	64
device	cuda
gate	rolling 30d Sharpe threshold 0.7
labelers	`random, trend_following, mean_reversion, oracle_random_mix`
failure/progress Slack	off
candidate backtest Slack	on
Result	gate failed, no candidate upload

The key metrics are below.

Metric	Value
transitions	67,077
synthetic annualized return	0.9004
synthetic Sharpe	4.5269
real rolling 30d Sharpe	-0.20625
flat realized exposure ratio	0.1479
long realized exposure ratio	0.2349
short realized exposure ratio	0.6172
max one-sided realized exposure ratio	0.6172
total turnover	121.95

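The interpretation is clear. The new encoder and training pipeline ran end to end, but the policy learned from this synthetic action labeler mix did not generalize to the real 5Y gate: synthetic Sharpe was 4.5269, while real rolling 30d Sharpe was -0.20625 against a threshold of 0.7. This is not a training error or an upload failure; it is a synthetic-to-real trajectory gap.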
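The exposure ratios and the gate decision can be reproduced with a short sketch. The Sharpe threshold of 0.7 and the reported metrics come from the run above; the function names, the `flat_tol` cutoff for calling a position flat, and the use of `>=` at the threshold are assumptions for illustration.

```python
from typing import Dict, Sequence


def exposure_ratios(positions: Sequence[float], flat_tol: float = 0.05) -> Dict[str, float]:
    """Share of steps spent flat, long, and short (flat_tol is a hypothetical cutoff)."""
    n = len(positions)
    if n == 0:
        raise ValueError("positions must not be empty")
    flat = sum(1 for p in positions if abs(p) <= flat_tol) / n
    long_ratio = sum(1 for p in positions if p > flat_tol) / n
    short_ratio = sum(1 for p in positions if p < -flat_tol) / n
    return {
        "flat": flat,
        "long": long_ratio,
        "short": short_ratio,
        "max_one_sided": max(long_ratio, short_ratio),
    }


def passes_gate(rolling_sharpe: float, sharpe_threshold: float = 0.7) -> bool:
    """Sharpe part of the gate only; the other gate criteria are not modeled here."""
    return rolling_sharpe >= sharpe_threshold


# Metrics reported for the BTCUSDT / td3_bc / price_causal_mixer / v2 run.
reported = {"real_rolling_30d_sharpe": -0.20625, "flat": 0.1479, "long": 0.2349, "short": 0.6172}
print(passes_gate(reported["real_rolling_30d_sharpe"]))  # False
```

The three reported ratios sum to 1, and the short side alone accounts for 0.6172 of the steps, which is why the maximum one-sided ratio equals the short ratio.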

Next improvement plan

1. Synthetic price + position joint generator

Currently an action labeler is attached after the price path is generated. The TO-BE structure is as follows.

latent market regime
  -> price dynamics
  -> risk budget
  -> intended position path
  -> execution constraints
  -> realized reward path
  -> offline RL trajectory

Here the position path must not be a simple oracle. The following archetypes should be mixed.

Archetype	Purpose
trend follower	Support for holding a position through trending periods
mean reverter	Support for contrarian positions after overheating or sharp drops
volatility reducer	Reduces exposure when volatility spikes
drawdown stopper	Risk behavior that cuts the position as losses accumulate
random explorer	Action coverage that widens policy support

This way, synthetic data becomes not "price data + labels" but "tradable trajectories".
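The TO-BE chain from latent regime to reward can be sketched as a single generation loop. Everything specific here is hypothetical: the regime table, the switch probability, the volatility-targeting risk budget, the per-step position limit, and the cost model are placeholders, and only three of the five archetypes are shown.

```python
import random
from typing import Dict, List

# Hypothetical regimes: (per-step drift, per-step volatility).
REGIMES = {"trend_up": (0.001, 0.01), "trend_down": (-0.001, 0.01), "choppy": (0.0, 0.02)}


def generate_joint_trajectory(
    n_steps: int = 200,
    seed: int = 0,
    archetype: str = "trend_follower",
    target_vol: float = 0.01,
    max_step: float = 0.2,
    cost: float = 0.0005,
) -> List[Dict[str, object]]:
    """Generate price and position together from one latent regime process."""
    rng = random.Random(seed)
    names = sorted(REGIMES)
    regime = rng.choice(names)
    price, position = 100.0, 0.0
    rows: List[Dict[str, object]] = []
    for _ in range(n_steps):
        if rng.random() < 0.02:  # latent market regime occasionally switches
            regime = rng.choice(names)
        drift, vol = REGIMES[regime]
        risk_budget = min(1.0, target_vol / vol)  # smaller budget in volatile regimes
        sign = float((drift > 0) - (drift < 0))
        if archetype == "trend_follower":
            direction = sign
        elif archetype == "mean_reverter":
            direction = -sign
        elif archetype == "random_explorer":
            direction = rng.uniform(-1.0, 1.0)
        else:
            raise ValueError(f"unknown archetype: {archetype}")
        intended = direction * risk_budget
        # Execution constraint: the position can move at most max_step per bar.
        delta = max(-max_step, min(max_step, intended - position))
        position += delta
        ret = rng.gauss(drift, vol)  # price dynamics of the current regime
        price *= 1.0 + ret
        reward = position * ret - cost * abs(delta)
        rows.append({"regime": regime, "price": price, "position": position, "reward": reward})
    return rows
```

The difference from the current two-stage pipeline is that position and price share the same latent state, so the position path is not a post-hoc label on a finished price series.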

2. Offline RL algorithm improvements

The priorities are as follows.

Priority	Task
P0	In TD3+BC, keep action balance and turnover regularization, and aggregate the pass rate per labeler mix
P0	In CQL/IQL, separate low-turnover one-sided exposure failures into their own bucket
P1	For Decision Transformer, tune the return-to-go target quantile and the inverse action loss per labeler mix
P1	Weight the real gate proxy more heavily, alongside synthetic eval, in the best-checkpoint selection criterion
P2	Apply episode-level reward quantile filtering differently per labeler
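The first P0 item, aggregating the pass rate per labeler mix, amounts to a group-by over run records. The sketch below assumes each run is a dict with `labelers` and `gate_passed` fields; both the record shape and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Tuple


def pass_rate_by_mix(runs: Iterable[Mapping[str, object]]) -> Dict[Tuple[str, ...], float]:
    """Return the gate pass rate for every distinct labeler mix.

    The mix is keyed by the sorted labeler names, so the order in which
    labelers were listed on the command line does not split a group.
    """
    counts: Dict[Tuple[str, ...], List[int]] = defaultdict(lambda: [0, 0])
    for run in runs:
        mix = tuple(sorted(run["labelers"]))
        counts[mix][0] += 1 if run["gate_passed"] else 0  # passed
        counts[mix][1] += 1  # total
    return {mix: passed / total for mix, (passed, total) in counts.items()}
```

Keying on the sorted tuple is a design choice: it treats `random,trend_following` and `trend_following,random` as the same experiment condition.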

3. Model improvements

The principles on the model side are clear.

1. New model input stays OHLC-only.
2. Instead of adding feature engineering, increase encoder capacity and inductive bias.
3. Action/position diversity is created on the synthetic trajectory side.
4. The gate looks at rolling Sharpe, flat ratio, one-sided ratio, and turnover together.

There are three next candidates.

Candidate	Description
`price_causal_mixer v2`	The stable learned price encoder added in this change
`price_causal_mixer v3`	Larger hidden/features dim to expand learned feature capacity
action-aware sequence model	Offline sequence model that encodes observed OHLC together with past actions/rewards

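The last candidate needs care. If the user's principle is "trade by looking only at recent price information", then feeding past actions at runtime inference may be acceptable, but feeding external features is not. Past actions are state that the model itself produced, which is different from manual features derived from prices.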

Conclusion

This change does not add more feature engineering. On the contrary, it reduces feature engineering outside the model and invests in the price-only encoder and the quality of synthetic trajectories.

In summary:

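Decision	Reason
Use `synthetic_action_labelers` instead of `behavior_policies`	Its actual role is generating synthetic action/position labels, not acting as a production policy.
Keep OHLC-only input	To uphold, in the model API, the principle that trading must use prices only.
Add `price_causal_mixer`	So that the encoder learns the price representation instead of relying on hand-crafted features.
Evolve toward a price+position joint generator	To build offline RL dataset support from more realistic trajectory units.
Keep real-gate-centered evaluation	Synthetic metrics alone cannot guarantee the quality of a production model.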

The next experiments continue on the same structure, driven by labeler mix, v3 capacity, and per-algorithm failure buckets.