AAAI Publications, Twenty-Third International Conference on Automated Planning and Scheduling

Font Size: 
Temporal-Difference Search in Computer Go
David Silver, Richard Sutton, Martin Mueller

Last modified: 2013-06-02


Temporal-difference (TD) learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. Monte-Carlo tree search is a recent algorithm for simulation-based search, which has been used to achieve master-level play in Go. We have introduced a new approach to high-performance planning. Our method, TD search, combines TD learning with simulation-based search. Like Monte-Carlo tree search, value estimates are updated by learning online from simulated experience. Like TD learning, it uses value function approximation and bootstrapping to efficiently generalise between related states. We applied TD search to the game of 9x9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed a vanilla Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9x9 Computer Go Server.


Reinforcement Learning; Simulation-Based Planning

Full Text: PDF