
A backtest that doesn't lie

Walkforward testing instead of a cherry-picked window. Per-bar funding rate. Orderbook-based slippage. A Brier score for every model in the ensemble. What you see is close to what you get live.

Why typical backtests deceive

The classic retail-backtest trap is overfitting to one period. "Parameters from 2023 gave +400% on BTC": yes, because they were tuned to exactly that period. Out of sample, the same parameters often net zero or a loss. That is not a bug, it is math: with enough parameter sweeps you can always find settings that make any strategy look profitable in hindsight.

The second trap is ignoring the funding rate. On Hyperliquid (HL) perps, funding can eat 5–12 bps per 60 days on a long. Backtests that miss this show paper profits that turn negative live.

How we do it

Walkforward

In-sample window (60 days) → parameter selection → out-of-sample window (14 days) → roll forward and repeat. The graph shows only out-of-sample results: what the strategy did on data it had not seen during parameter selection.
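The rolling scheme can be sketched as a generator of day-index ranges. This is a minimal illustration, not the production pipeline: the function name `walkforward_windows` is hypothetical, and the step size (rolling forward by one out-of-sample block) is an assumption, since the page gives only the 60-day and 14-day window lengths.

```python
from typing import Iterator, Tuple


def walkforward_windows(
    n_days: int, in_sample: int = 60, out_of_sample: int = 14
) -> Iterator[Tuple[range, range]]:
    """Yield (in_sample_days, out_of_sample_days) index ranges.

    Assumption: each roll moves forward by one out-of-sample block,
    so the out-of-sample ranges are contiguous and never overlap.
    """
    start = 0
    while start + in_sample + out_of_sample <= n_days:
        split = start + in_sample
        yield range(start, split), range(split, split + out_of_sample)
        start += out_of_sample


# Parameters would be fitted on `train` and evaluated only on `test`.
windows = list(walkforward_windows(180))
for train, test in windows[:2]:
    print(f"fit on days {train.start}-{train.stop - 1}, "
          f"test on days {test.start}-{test.stop - 1}")
```

Stitching the `test` ranges together gives the out-of-sample curve: no day in it was ever used to pick the parameters that traded it.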

Per-bar funding rate

Funding is recomputed for every 8h window and subtracted from open-position PnL. Not an average over the period: literal 8h windows.
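A minimal sketch of window-by-window funding accounting follows. The names `FundingWindow`, `funding_paid` and `net_pnl` are hypothetical, and the sample rates and prices are made up for illustration. It assumes the usual perp convention that a positive rate means longs pay shorts, and that funding is charged on position notional at the mark price.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FundingWindow:
    rate: float        # funding rate for this 8h window, e.g. 0.0001 = 1 bp
    mark_price: float  # mark price used to value the position in this window


def funding_paid(size: float, windows: List[FundingWindow]) -> float:
    """Funding paid by a position held through every window.

    size > 0 is a long, size < 0 a short. With a positive rate longs pay
    and shorts receive, so the result is positive when the position pays.
    """
    return sum(size * w.mark_price * w.rate for w in windows)


def net_pnl(price_pnl: float, size: float, windows: List[FundingWindow]) -> float:
    """Price PnL with funding subtracted window by window."""
    return price_pnl - funding_paid(size, windows)


# Illustrative values only: three 8h windows for a 1-unit long.
sample = [
    FundingWindow(rate=0.0001, mark_price=50_000.0),
    FundingWindow(rate=0.0003, mark_price=52_000.0),
    FundingWindow(rate=-0.0001, mark_price=51_000.0),
]
print(round(funding_paid(1.0, sample), 2))
print(round(net_pnl(1_000.0, 1.0, sample), 2))
```

Summing per window matters because both the rate and the mark price move: multiplying an average rate by an average price gives a different number whenever the two are correlated.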

Orderbook slippage

Large orders eat through the book. We replay historical HL L2 snapshots, so slippage depends on order size: a 10K USD order fills differently from a 100K USD order. No flat-percentage assumption.
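Size-dependent slippage can be illustrated by walking the ask side of a single snapshot. This is a sketch under stated assumptions: `fill_price` and `slippage_bps` are hypothetical names, the book is a made-up list of `(price, size)` levels sorted best first, and slippage is measured against the best ask rather than the mid price.

```python
from typing import List, Tuple

Level = Tuple[float, float]  # (price, size in base units)


def fill_price(asks: List[Level], notional_usd: float) -> float:
    """Average fill price for a market buy of `notional_usd`."""
    remaining = notional_usd
    base_filled = 0.0
    for price, size in asks:
        take = min(remaining, price * size)  # USD consumed at this level
        base_filled += take / price
        remaining -= take
        if remaining <= 0:
            break
    if remaining > 1e-9:
        raise ValueError("order exceeds visible book depth")
    return notional_usd / base_filled


def slippage_bps(asks: List[Level], notional_usd: float) -> float:
    """Slippage of the average fill versus the best ask, in basis points."""
    best_ask = asks[0][0]
    return (fill_price(asks, notional_usd) / best_ask - 1.0) * 1e4


# Made-up snapshot: 10K USD fits in the top level, 100K USD does not.
book = [(100.0, 100.0), (100.1, 400.0), (100.5, 1_000.0)]
small = slippage_bps(book, 10_000)
large = slippage_bps(book, 100_000)
print(f"10K: {small:.2f} bps, 100K: {large:.2f} bps")
```

A flat-percent model would charge both orders the same rate; walking the book charges the larger order for every level it consumes.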

Per-model Brier

Each ensemble model has a Brier calibration score computed on a rolling 14-day window (lower is better). If the score drifts above 0.25, which is what a constant 50% forecast scores on binary outcomes, the model is temporarily disabled until it recovers.
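For reference, the Brier score is the mean squared difference between forecast probabilities and realized 0/1 outcomes. The sketch below assumes binary outcomes and that the caller passes only the forecasts from the last 14 days; `brier_score` and `model_enabled` are hypothetical names, and only the 0.25 threshold comes from the text above.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def model_enabled(
    probs: Sequence[float], outcomes: Sequence[int], threshold: float = 0.25
) -> bool:
    """A model stays enabled while its rolling Brier score is not above the threshold."""
    return brier_score(probs, outcomes) <= threshold


# A constant 50% forecast sits exactly on the threshold.
print(brier_score([0.5, 0.5], [1, 0]))
# A confident but wrong model is switched off.
print(model_enabled([0.1, 0.9], [1, 0]))
```

The 0.25 cutoff is a natural choice for binary outcomes: a model scoring worse than that is less informative than always forecasting 50%.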

Equity curve and max DD

Not just Sharpe. We show the equity curve, max drawdown, Calmar (return ÷ max DD) and Sortino: the full risk profile.
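These metrics can be sketched from an equity series as follows. The function names are hypothetical, and two simplifications are assumptions: Calmar uses the total return over the window, as in the "return ÷ max DD" wording above (many definitions annualize the return first), and Sortino is left unannualized with a target return of zero.

```python
import math
from typing import Sequence


def max_drawdown(equity: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def calmar(equity: Sequence[float]) -> float:
    """Total return over the window divided by max drawdown."""
    total_return = equity[-1] / equity[0] - 1.0
    dd = max_drawdown(equity)
    return total_return / dd if dd > 0 else math.inf


def sortino(returns: Sequence[float], target: float = 0.0) -> float:
    """Mean excess return divided by downside deviation below `target`."""
    downside_sq = [min(r - target, 0.0) ** 2 for r in returns]
    downside_dev = math.sqrt(sum(downside_sq) / len(returns))
    mean_excess = sum(returns) / len(returns) - target
    return mean_excess / downside_dev if downside_dev > 0 else math.inf


equity = [100.0, 120.0, 90.0, 110.0]  # made-up equity curve
print(max_drawdown(equity), round(calmar(equity), 3))
```

In this example the strategy ends up 10% but passed through a 25% drawdown on the way, which a return figure alone would hide.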

Real HL data

180+ days of L2 orderbook snapshots, funding rates and mark prices from Hyperliquid itself. Not Binance, not an "ETH proxy": actual HL data.

What we show per vault

01

Out-of-sample equity curve

Rolling 180-day window. What the strategy earned on unseen data, that is, data not used to pick parameters.

02

Sharpe, Sortino, Calmar

Three risk-return ratios. Sharpe penalizes all volatility, Sortino penalizes only downside volatility, and Calmar measures return against max drawdown.

03

PnL distribution

Histogram of per-trade profit. It makes visible that no single best day carries the result; returns come from hundreds of small wins.

04

Per-model Brier

Each ensemble model's calibration over time. If one breaks, it is visible immediately, so there are no surprises live.

Backtest FAQ

Can I backtest with my own parameters?
Yes, on each vault page, starting in Phase 1. Right now (Phase 0) only the curator's official parameters are available. A sandbox for custom parameters lands after public release.
Where does the historical data come from?
Hyperliquid publishes L2 snapshots, funding rates and mark prices via its API. We store the raw snapshots in TimescaleDB. History goes back to HL mainnet launch (~2 years); the active window is the last 180 days.
What if the backtest shows +20% but live shows +5%?
That is normal divergence on a short window. Full convergence takes ~90 days of live trading. If the gap stays above 50% after 90 days, it signals that the backtest or the live setup is wrong, and the vault curator must investigate.
Can parameters be tuned so that the backtest "looks great"?
They can, but walkforward strictly limits the effect: in-sample data picks the parameters, out-of-sample data tests them. Overfit parameters don't survive out-of-sample, and the failure is visible immediately, so tuning for public display doesn't pay off.