John Moody, Yufeng Liu, Matthew Saffell, and Kyoungju Youn
We investigate repeated matrix games with stochastic players as a microcosm for studying dynamic, multi-agent interactions using the Stochastic Direct Reinforcement (SDR) policy gradient algorithm. SDR is a generalization of Recurrent Reinforcement Learning (RRL) that supports stochastic policies. Unlike other RL algorithms, SDR and RRL use recurrent policy gradients to properly address the temporal credit assignment that arises from recurrent structure. Our main goals in this paper are to (1) distinguish recurrent memory from standard, non-recurrent memory for policy gradient RL, (2) compare SDR with Q-type learning methods for simple games, (3) distinguish reactive from endogenous dynamical agent behavior, and (4) explore the use of recurrent learning for interacting, dynamic agents. We find that SDR players learn much faster and hence outperform recently proposed Q-type learners for the simple game Rock, Paper, Scissors (RPS). With more complex, dynamic SDR players and opponents, we demonstrate that recurrent representations and SDR’s recurrent policy gradients yield better performance than non-recurrent players. For the Iterated Prisoner’s Dilemma, we show that non-recurrent SDR agents learn only to defect (the Nash equilibrium), while SDR agents with recurrent gradients can learn a variety of interesting behaviors, including cooperation.
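To make the setting concrete, the following is a minimal sketch of a stochastic policy gradient player for Rock, Paper, Scissors. It is not the authors' SDR algorithm (SDR's distinguishing feature is its recurrent policy gradient); it uses a plain memoryless REINFORCE-style update against a hypothetical fixed, biased opponent, purely to illustrate the kind of stochastic matrix-game policy the abstract refers to. All names and parameter choices here are illustrative assumptions.

```python
import numpy as np

# Illustrative only: memoryless stochastic policy trained with a
# REINFORCE-style gradient on Rock, Paper, Scissors. This is NOT SDR
# (no recurrence); it shows a stochastic policy adapting to an opponent.

rng = np.random.default_rng(0)

# Payoff to the row player: actions 0=Rock, 1=Paper, 2=Scissors.
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

# Hypothetical fixed opponent, biased toward Rock.
opponent_probs = np.array([0.5, 0.25, 0.25])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

theta = np.zeros(3)   # policy logits
alpha = 0.1           # learning rate (illustrative choice)

for t in range(5000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)            # sample our action
    o = rng.choice(3, p=opponent_probs)   # sample opponent action
    r = PAYOFF[a, o]
    # REINFORCE: grad of log pi(a) w.r.t. logits is onehot(a) - probs
    grad = -probs
    grad[a] += 1.0
    theta += alpha * r * grad

final = softmax(theta)
print(final)  # policy should come to favor Paper (beats Rock)
```

Against a Rock-heavy opponent the best response is Paper, and the learned distribution concentrates there. A recurrent (SDR-style) player would additionally condition on the history of play, which is what lets it exploit dynamic opponents rather than only fixed ones.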