I've been day trading indices (DAX, Nasdaq) profitably throughout 2025 using a specific personal strategy. At some point I got curious: if I gave this exact strategy to different LLMs as a prompt, which one would execute it best?
I couldn't find a benchmark that tested this. The academic ones focus on stock portfolios with daily rebalancing. Nothing tested LLMs on fast-paced index day trading where you need to read price action and make quick directional calls based on a defined strategy.
So I built DayTradingBench. The core idea is simple: every model receives the exact same prompt containing my trading strategy and the exact same live market data. The only variable is the model itself. This way I'm measuring pure decision-making capability — not prompt engineering.
How it works:
- 11 LLMs trade autonomously during live market hours (08:00–21:00 UTC, Mon–Fri)
- Every 15 minutes each model gets a market snapshot and must return a structured JSON decision: LONG, SHORT, or HOLD — with stop loss, take profit, confidence, and risk percentage (a rough sketch of the format is below)
- All start with the same $100k virtual balance
- Positions auto-close on SL/TP hits (checked every 10 seconds) or at session end
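For the curious, a decision payload looks roughly like this. It's a minimal sketch: the field names below are my shorthand for the fields listed above (direction, SL, TP, confidence, risk %), not the benchmark's exact schema.

```python
import json
from dataclasses import dataclass

# Illustrative schema only -- field names are shorthand, not the exact contract.
@dataclass
class Decision:
    action: str         # "LONG", "SHORT", or "HOLD"
    stop_loss: float    # price level that closes the position at a loss
    take_profit: float  # price level that closes the position at a profit
    confidence: float   # model's self-reported confidence, 0.0-1.0
    risk_pct: float     # percent of the virtual balance risked on the trade

def parse_decision(raw: str) -> Decision:
    """Parse a model's JSON reply and reject malformed actions."""
    data = json.loads(raw)
    if data["action"] not in {"LONG", "SHORT", "HOLD"}:
        raise ValueError(f"unexpected action: {data['action']!r}")
    return Decision(
        action=data["action"],
        stop_loss=float(data.get("stop_loss", 0.0)),
        take_profit=float(data.get("take_profit", 0.0)),
        confidence=float(data.get("confidence", 0.0)),
        risk_pct=float(data.get("risk_pct", 1.0)),
    )

# A reply a model might return for a Nasdaq snapshot (values made up):
print(parse_decision(
    '{"action": "LONG", "stop_loss": 24850, "take_profit": 25100, '
    '"confidence": 0.7, "risk_pct": 1.0}'
))
```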
There are two input modes: text mode (structured OHLCV data) and vision mode (candlestick chart images sent to the model). Same strategy prompt, different way of presenting the market data. This lets me compare whether models trade better reading numbers or reading charts.
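To make "same data, two presentations" concrete, here's a rough sketch of how one snapshot could be produced both ways. The use of pandas/mplfinance and the made-up bar values are assumptions for illustration, not the actual pipeline.

```python
import pandas as pd
import mplfinance as mpf  # assumption: any candlestick renderer works; mplfinance is one option

# Three made-up 15-minute Nasdaq bars, purely for illustration.
bars = pd.DataFrame(
    {
        "Open":   [24810.0, 24845.5, 24830.0],
        "High":   [24860.0, 24880.0, 24875.5],
        "Low":    [24800.5, 24820.0, 24815.0],
        "Close":  [24845.5, 24830.0, 24870.0],
        "Volume": [12450, 9800, 11020],
    },
    index=pd.to_datetime(
        ["2025-06-02 13:00", "2025-06-02 13:15", "2025-06-02 13:30"]
    ),
)

# Text mode: the same bars serialized into the prompt as a plain OHLCV table.
text_snapshot = bars.to_string()

# Vision mode: the same bars rendered as a candlestick image and attached to the prompt.
mpf.plot(bars, type="candle", volume=True, savefig="snapshot.png")
```

Both modes then feed into the exact same decision loop described above.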
The performance gap between models is much larger than I expected, even though they all receive identical instructions.
Built this as a solo dev. The system runs fully autonomously 24/5 — I mostly just watch the results now.
Would love to hear what HN thinks, especially if you've experimented with LLMs for trading.
Just reading your description, it sounds like there are two variables:
1. Prompt adherence: how well the models follow your stated strategy
2. Decision quality: how well models do on judgment calls that aren’t explicitly in the strategy
Candidly, since you haven’t shared the strategy, there’s no way for me to evaluate either (1) or (2). A model’s performance could be coming from the quality of your strategy, the model itself, or an interaction between the two, and I can’t disentangle that from what you’ve provided.
So as presented, the benchmark is basically useless to me for evaluating models (not because it’s pointless overall, but because I can’t tell what it’s actually measuring without seeing the strategy).
That's a fair point. You're right that without seeing the strategy, you can't fully disentangle what drives the differences.
But the strategy itself isn't really the point. Since every model gets the exact same prompt and the exact same market data, the only variable is the model. So relative performance differences are real regardless of what the strategy contains. If Model A consistently outperforms Model B under identical conditions, that tells you something meaningful about the model.
And honestly, that blend of prompt adherence and decision quality is how people actually use LLMs in practice. You give it instructions and context, and you care about the result.
You're right though that the strategy being private limits what outsiders can evaluate. It's something I'm thinking about.
> If Model A consistently outperforms Model B under identical conditions, that tells you something meaningful about the model.
Not really! Sorry to harp on this, but there are two ways one model could outperform another:
1) It adheres to your strategy better
2) It improvises
If the prompt was "maximize money, here's inspiration," improvising is fine. If the prompt was "implement the strategy," improvising is failure.
Right now you have a leaderboard; you don’t yet have a benchmark, because you can’t tell whether high P&L reflects correctness.
To be more specific: the prompt defines a trading philosophy and tells models what to look for in the charts. But the actual read and the decision is entirely on the model. Using your framing — it's closer to "here's inspiration, now maximize money" than "implement this exact strategy." Which means improvisation within that framework is exactly what's being measured.
But yeah, it's closer to a leaderboard right now.
Get rid of your nag screen for one
Fair point, I'll look into showing it only to EU visitors.