Michael Bowling and Manuela Veloso
Stochastic games are a general model of interaction between multiple agents. They have recently been the focus of a great deal of research in reinforcement learning as they are both descriptive and have a well-defined Nash equilibrium solution. Most of this recent work, although very general, has only been applied to small games with at most hundreds of states. On the other hand, there are landmark results of learning being successfully applied to specific large and complex games such as Checkers and Backgammon. In this paper we describe a scalable learning algorithm for stochastic games that combines three separate ideas from reinforcement learning into a single algorithm. These ideas are tile coding for generalization, policy gradient ascent as the basic learning method, and our previous work on the WoLF ("Win or Learn Fast") variable learning rate to encourage convergence. We apply this algorithm to the intractably sized game-theoretic card game Goofspiel, showing preliminary results of learning in self-play. We demonstrate that policy gradient ascent can learn even in this highly non-stationary problem with simultaneous learning. We also show that the WoLF principle continues to have a converging effect even in large problems with approximation and generalization.
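To make the WoLF idea concrete, the following is a minimal sketch of how a variable learning rate might be selected. It is not the algorithm described in this paper: it is tabular, it omits tile coding and the policy gradient update, and it assumes that "winning" is judged by comparing the expected value of the current policy with that of a running average policy. The names `expected_value`, `wolf_step_size`, `delta_win`, and `delta_lose`, and the numeric step sizes, are hypothetical choices made for illustration.

```python
from typing import Sequence


def expected_value(policy: Sequence[float], action_values: Sequence[float]) -> float:
    """Expected value of playing `policy` given estimated per-action values."""
    return sum(p * q for p, q in zip(policy, action_values))


def wolf_step_size(
    policy: Sequence[float],
    avg_policy: Sequence[float],
    action_values: Sequence[float],
    delta_win: float = 0.01,   # assumed small step: learn cautiously when winning
    delta_lose: float = 0.04,  # assumed larger step: learn fast when losing
) -> float:
    """Pick a learning rate following the 'Win or Learn Fast' principle.

    Assumption: the agent is 'winning' when its current policy has a higher
    expected value than its average policy under the same value estimates.
    """
    winning = expected_value(policy, action_values) > expected_value(
        avg_policy, action_values
    )
    return delta_win if winning else delta_lose


# Usage: two actions, where the first action is currently estimated to be better.
values = [1.0, 0.0]
average = [0.5, 0.5]
rate_when_winning = wolf_step_size([0.9, 0.1], average, values)
rate_when_losing = wolf_step_size([0.1, 0.9], average, values)
```

In this sketch the only design decision is that the losing step size is larger than the winning one, so an agent that is doing worse than its own historical average adapts more quickly.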