Satinder P. Singh
Reinforcement learning (RL) has become a central paradigm for solving learning-control problems in robotics and artificial intelligence. R L researchers have focussed almost exclusively on problems where the controller has to maximize the discounted sum of payoffs. However, as emphasized by Schwartz (1993), in many problems, e.g., those for which the optimal behavior is a limit cycle, it is more natural and computationally advantageous to formulate tasks so that the controller’s objective is to maximize the average payoff received per time step. In this paper I derive new average-payoff RL algorithms as stochastic approximation methods for solving the system of equations associated with the policy evaluation and optimal control questions in average-payoff RL tasks. These algorithms are analogous to the popular td and Q-learning algorithms already developed for the discounted-payoff case. One of the algorithms clerived here is a significant variation of Schwartz’s R-learning algorithm. Preliminary empirical results are presented to validate these new algorithms.