John Moody and Matthew Saffell
We propose to train trading systems by optimizing financial objective functions via reinforcement learning. The performance functions that we consider as value functions are profit or wealth, the Sharpe ratio, and our recently proposed differential Sharpe ratio for online learning. In Moody & Wu (1997), we presented empirical results in controlled experiments that demonstrated the advantages of reinforcement learning relative to supervised learning. Here we extend our previous work to compare Q-Learning to a reinforcement learning technique based on real-time recurrent learning (RTRL) that maximizes immediate reward. Our simulation results include a spectacular demonstration of the presence of predictability in the monthly Standard & Poor's 500 stock index for the 25-year period 1970 through 1994. Our reinforcement trader achieves a simulated out-of-sample profit of over 4000% for this period, compared to the return for a buy-and-hold strategy of about 1300% (with dividends reinvested). This superior result is achieved with substantially lower risk.
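
To make the performance functions named above concrete, the following is a minimal sketch, not the authors' implementation. It assumes a zero risk-free rate and that the differential Sharpe ratio is the first-order sensitivity of a Sharpe ratio built from exponential moving averages of returns and squared returns. The function names, the adaptation rate `eta`, and the zero initial moment estimates are illustrative assumptions.

```python
import math


def sharpe_ratio(returns):
    """Batch Sharpe ratio: mean return over its standard deviation.

    Assumes a zero risk-free rate and uses the population variance.
    """
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / n
    return mean / math.sqrt(var)


def differential_sharpe(returns, eta=0.01):
    """Online differential Sharpe ratio D_t for each return R_t.

    A and B are exponential moving estimates of the first and second
    moments of the returns. `eta` is a hypothetical adaptation rate;
    starting both estimates at zero is an assumption of this sketch.
    """
    a, b = 0.0, 0.0
    values = []
    for r in returns:
        delta_a = r - a          # Delta A_t = R_t - A_{t-1}
        delta_b = r * r - b      # Delta B_t = R_t^2 - B_{t-1}
        var = b - a * a          # variance estimate from step t-1
        if var > 0.0:
            d = (b * delta_a - 0.5 * a * delta_b) / var ** 1.5
        else:
            d = 0.0              # undefined until the variance is positive
        values.append(d)
        a += eta * delta_a       # update moment estimates after scoring R_t
        b += eta * delta_b
    return values
```

Because each value depends only on the current return and the running moment estimates, this quantity can serve as a per-step reward signal for online learning, whereas the batch Sharpe ratio needs the whole return series.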