Building a system · Chapter 5 · 14 min read

Backtesting and the many ways it lies to you

Lookahead and survivorship bias, overfitting, in-sample versus out-of-sample, costs and slippage, and why a flawless backtest is a warning sign rather than a triumph.

Once you have a rules-based system, the obvious next step is backtesting — running it over historical data to see how it would have performed. Done with brutal honesty, backtesting is genuinely valuable: it can reveal that a beloved idea would have bled money for years, saving you from learning that with real cash. Done the way most people do it, backtesting is an elaborate machine for fooling yourself — it produces beautiful equity curves that have almost nothing to do with the future. This chapter is about the second case, because the traps are everywhere and they're subtle.

Lookahead bias: using tomorrow's news today

Lookahead bias is the most insidious error because it's so easy to commit by accident. It means letting your backtest use information that wouldn't have been available at the moment the trade was made. The classic version: testing a rule that buys based on a company's annual results 'as of' a date before those results were actually published — in reality the numbers came out weeks later, so your backtest is quietly trading on knowledge from the future. It hides in dozens of small ways: using a day's closing price to decide a trade you assume executed at that same day's open; using data that was later revised, as though the revised figure was known at the time; filtering your stock universe today based on which companies turned out to matter. Each leaks a little of the future into the past, and since the future is exactly what you're trying to predict, even a tiny leak produces spectacular, fictional results. If your backtest looks too good, suspect a lookahead leak before you celebrate.
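To make the open-versus-close leak concrete, here is a minimal Python sketch. The prices, the toy rule (go long for one day when the close rose) and the function names are all invented for illustration; they are not from any real system. The only difference between the two runs is whether the trade at a day's open uses that same day's close or the previous day's.

```python
def signals(closes):
    """True on day t if the close rose versus the previous day's close."""
    return [False] + [closes[t] > closes[t - 1] for t in range(1, len(closes))]


def total_return(opens, closes, lag):
    """Sum of open-to-close returns on the days the rule is long.

    lag=0: day t's trade uses day t's own close, which is not yet known
           at the open (a lookahead leak).
    lag=1: day t's trade uses only day t-1's close, which is known.
    """
    sig = signals(closes)
    total = 0.0
    for t in range(lag, len(closes)):
        if sig[t - lag]:
            total += (closes[t] - opens[t]) / opens[t]
    return total


# Made-up prices for six trading days.
opens = [100, 101, 99, 102, 100, 103]
closes = [101, 99, 102, 100, 103, 101]

leaky = total_return(opens, closes, lag=0)
honest = total_return(opens, closes, lag=1)
print(f"leaky:  {leaky:+.4f}")
print(f"honest: {honest:+.4f}")
```

On this made-up series the leaky run shows a profit and the honest run a loss, from the same rule on the same prices. The habit it suggests: lag every input until you can say when it was actually known.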

Survivorship bias: testing only the winners

Survivorship bias comes from the data itself. If you backtest a strategy on 'the stocks in the index today' over the last twenty years, you've made a fatal silent choice: today's index is composed of companies that survived and thrived. The ones that went bankrupt, got delisted, collapsed, or were quietly dropped from the index aren't in your dataset at all — they were removed precisely because they did badly. So you've unintentionally backtested 'buy stocks, given that we already know they don't go to zero'. Of course that looks profitable — you've excluded every catastrophe in advance. A real strategy, trading in real time, would have held some of those doomed companies, because at the time nobody knew which ones would fail. Survivorship bias makes nearly any long-only strategy look better than reality, sometimes dramatically, and it's baked into the easiest, most convenient datasets — which is exactly why so many people fall into it.
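A small Python sketch of the arithmetic, with hypothetical tickers and made-up twenty-year returns. It compares the average return of 'stocks in the index today' with the average over everything that was in the index at the start, failures included.

```python
from statistics import mean

# Hypothetical universe: total return over the test period as a fraction
# (-1.0 means the stock went to zero), plus whether it is still in the index.
universe = [
    {"ticker": "AAA", "total_return": 3.0, "in_index_today": True},
    {"ticker": "BBB", "total_return": 1.5, "in_index_today": True},
    {"ticker": "CCC", "total_return": 0.8, "in_index_today": True},
    {"ticker": "DDD", "total_return": -1.0, "in_index_today": False},  # bankrupt
    {"ticker": "EEE", "total_return": -0.9, "in_index_today": False},  # delisted
    {"ticker": "FFF", "total_return": -0.6, "in_index_today": False},  # dropped
]

# What the convenient dataset shows: only today's members.
survivors_only = mean(
    s["total_return"] for s in universe if s["in_index_today"]
)

# What a trader at the start would actually have faced: everything.
full_universe = mean(s["total_return"] for s in universe)

print(f"survivors only: {survivors_only:.2f}")
print(f"full universe:  {full_universe:.2f}")
```

The strategy here is the same in both cases (buy everything, equal weight); only the dataset changed. An honest test needs index membership as it stood on each historical date, including the names that later disappeared.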

Overfitting: the curve-fitter's downfall

Overfitting — also called curve-fitting — is the deepest trap of all, and the most seductive because it feels like diligence. It means tuning your system's parameters until it fits the historical data almost perfectly: tweaking the moving-average length, the RSI threshold, the stop distance, the exact filters, adjusting and re-running until the equity curve is breathtaking. The trouble is that with enough parameters you can fit any dataset perfectly — you've just memorised the past's random noise, not discovered a real pattern. An overfit system is exquisitely tuned to events that will never recur in exactly that form; it has learned the specific accidents of history — this particular crash on this particular day, that exact whipsaw — none of which repeat. When you run it forward on new data, it falls apart immediately, because the noise it memorised is gone and the genuine signal (if any) was drowned out by all that tuning. The more knobs you turn and the more perfectly it fits the past, the more certainly it's overfit and the more violently it will fail live.
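One way to feel this is to tune a rule on data that contains no pattern at all. The sketch below is an illustration under stated assumptions: the prices are a seeded random walk, and the moving-average rule, the range of lengths and the 700/300 split are arbitrary choices made for the example.

```python
import random


def make_noise_prices(n, seed):
    """Random-walk prices: by construction there is no pattern to find."""
    rng = random.Random(seed)
    prices = [100.0]
    for _ in range(n):
        prices.append(prices[-1] * (1 + rng.gauss(0, 0.01)))
    return prices


def ma_rule_return(prices, length):
    """Sum of next-day returns on days when the close is above the
    average of the previous `length` closes."""
    total = 0.0
    for t in range(length, len(prices) - 1):
        average = sum(prices[t - length:t]) / length
        if prices[t] > average:
            total += prices[t + 1] / prices[t] - 1
    return total


prices = make_noise_prices(1000, seed=7)
in_sample, out_sample = prices[:700], prices[700:]

# Turn the knob 58 times and keep whichever setting looked best.
scores = {n: ma_rule_return(in_sample, n) for n in range(2, 60)}
best = max(scores, key=scores.get)

print("best length:", best)
print("in-sample return:", round(scores[best], 4))
print("out-of-sample return:", round(ma_rule_return(out_sample, best), 4))
```

The winning length has the best in-sample score by construction, because it was picked as the maximum of 58 tries. Since the data is pure noise, that score says nothing about the later period. Changing the seed changes which length 'wins', which is itself the point.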

In-sample versus out-of-sample

The standard defence against overfitting is to split your history into two parts. You build and tune your system only on the in-sample period: the data you're allowed to look at and optimise against. Then you test the finished, frozen system on the out-of-sample period: data you held back and never touched while designing. If the system holds up out-of-sample, you have some reason to believe you found a real pattern rather than memorised noise. If it works in-sample and collapses out-of-sample, which is the usual outcome, the system was overfit.

The discipline only works if you're ruthlessly honest about it. The moment you peek at the out-of-sample results and go back to tweak the system, that data is contaminated — it's become in-sample, because you've now optimised against it too. The out-of-sample set is a one-shot test; you get to use it once, on a finished system. Every 'let me just adjust this and re-check' burns its value. This honesty is hard, because the whole point of the held-back data is to tell you 'no', and we're all expert at rationalising one more tweak.
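The split and the one-shot rule can be sketched in a few lines of Python. The names `chronological_split` and `OneShotHoldout` are hypothetical helpers written for this example, and the 75/25 split is an arbitrary assumption, not a recommended ratio.

```python
def chronological_split(rows, in_sample_fraction=0.75):
    """Split time-ordered rows into (in_sample, out_of_sample).

    No shuffling: the out-of-sample part is always the later data.
    """
    if not 0 < in_sample_fraction < 1:
        raise ValueError("in_sample_fraction must be between 0 and 1")
    cut = int(len(rows) * in_sample_fraction)
    return rows[:cut], rows[cut:]


class OneShotHoldout:
    """Wraps held-back data so it can be evaluated against exactly once."""

    def __init__(self, data):
        self._data = data
        self._used = False

    def evaluate(self, strategy):
        if self._used:
            raise RuntimeError("hold-out already used; it is in-sample now")
        self._used = True
        return strategy(self._data)


# Made-up daily returns, oldest first.
daily_returns = [0.01, -0.02, 0.015, 0.0, -0.01, 0.02, -0.03, 0.01]
in_sample, held_back = chronological_split(daily_returns, 0.75)

holdout = OneShotHoldout(held_back)
print("tune on:", in_sample)
print("one-shot result:", holdout.evaluate(sum))
```

A second call to `holdout.evaluate` raises an error. Code can't stop you reloading the file by hand, of course; the guard is only a reminder of the rule you've agreed to follow.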

Costs and slippage: the friction the backtest forgets

Even an honest, well-validated backtest usually overstates returns by ignoring transaction costs and slippage. Transaction costs are the tolls we covered in the markets module (brokerage, STT, exchange fees, GST, stamp duty), charged on every trade. Slippage is subtler: the gap between the price your backtest assumed you got and the price you'd actually get live, because the order book moved, the spread was wide, or your own order pushed the price. Backtests love to assume you bought exactly at the printed price; reality rarely obliges. These frictions look trivial per trade but are lethal in aggregate, especially for strategies that trade frequently. A system that trades many times a day might show a glorious gross return and a negative net return once realistic costs and slippage are subtracted: every trade pays the toll, and a thousand small tolls add up to ruin. This is why a strategy's trade frequency matters enormously: the more it trades, the more friction it must overcome just to break even, and the more a frictionless backtest flatters it. Always subtract generous, pessimistic costs before believing any result.
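A rough worked example in Python shows how the same gross result fares at two trade frequencies. Every number is an assumption chosen for round arithmetic (3 basis points of costs and 4 of slippage per trade, a 50% gross return), and the sum ignores compounding.

```python
def net_return_bps(gross_total_bps, cost_bps, slippage_bps, trades):
    """Net return in basis points: gross minus per-trade friction.

    Simple subtraction without compounding; 100 bps = 1%.
    """
    return gross_total_bps - (cost_bps + slippage_bps) * trades


GROSS_BPS = 5000   # 50% gross return in the frictionless backtest (assumed)
COST_BPS = 3       # brokerage, taxes and fees per trade (assumed)
SLIPPAGE_BPS = 4   # average slippage per trade (assumed)

frequent = net_return_bps(GROSS_BPS, COST_BPS, SLIPPAGE_BPS, trades=1000)
patient = net_return_bps(GROSS_BPS, COST_BPS, SLIPPAGE_BPS, trades=20)

print("1000 trades:", frequent / 100, "% net")  # -20.0 % net
print("  20 trades:", patient / 100, "% net")   # 48.6 % net
```

With these assumptions the frequent trader needs more than 7 basis points of gross edge per trade just to break even. Rerunning with deliberately pessimistic cost and slippage figures is a cheap way to see how much of a result is friction-dependent.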

What an honest backtest is actually for

None of these warnings makes backtesting useless; it's just widely misunderstood. Its real job is not to prove a strategy will make money. Nothing can do that, because the future isn't the past. Its real job is to disqualify bad ideas cheaply, to give you a realistic sense of how ugly the drawdowns might get so you can size and stomach them, and to set expectations honest enough that you don't abandon a sound system at its first rough patch. A backtest that says 'this would have had a brutal 30% drawdown you'd have to survive' is more valuable than one promising riches. Use it to understand risk and to kill bad ideas, not to manufacture confidence you haven't earned.
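Since drawdown is the number this section asks you to take seriously, here is a minimal Python sketch of how a maximum drawdown can be read off an equity curve. The equity values are made up, and `max_drawdown` is an illustrative helper rather than any library's function.

```python
def max_drawdown(equity):
    """Largest peak-to-trough fall, as a fraction of the peak value."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)                    # highest value so far
        worst = max(worst, (peak - value) / peak)  # fall from that high
    return worst


# Made-up account values over time. The curve ends higher than it started,
# but the path includes a fall from 120 to 84.
equity = [100, 120, 90, 110, 84, 130]
print(f"max drawdown: {max_drawdown(equity):.0%}")  # max drawdown: 30%
```

The end-to-end return here looks comfortable; the 30% figure is what you would have had to sit through to earn it.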

Key takeaways

✓ Lookahead bias leaks future information into past decisions; it doesn't crash the backtest, it makes it beautifully, falsely profitable.
✓ Survivorship bias tests only the companies that survived, silently excluding every catastrophe and flattering long-only strategies.
✓ Overfitting tunes a system to historical noise; the more parameters and the more perfect the fit, the harder it fails live.
✓ Split history into in-sample (build) and out-of-sample (test once); peeking and re-tweaking contaminates the validation and defeats the purpose.
✓ Subtract realistic transaction costs and slippage; frequent-trading systems often go from gross-profitable to net-negative. A flawless backtest is a red flag, not a trophy.

Education, not investment advice. Nothing here is a recommendation to buy or sell any security.