Applying reinforcement learning to large Markov Decision Processes (MDPs) is an important issue for solving very large problems. Since exact resolution is often intractable, many approaches have been proposed to approximate the value function, or to approximate the policy directly by gradient methods. Such approaches provide a policy over the whole state space, whereas classical reinforcement learning algorithms do not guarantee that all states are explored in finite time. However, these approaches often require the parameters of the approximation functions to be defined manually. Recently, Lagoudakis introduced the problem of approximating the policy with a policy iteration algorithm that mixes a rollout algorithm with Support Vector Machines (SVMs). The work presented in this paper extends Lagoudakis' idea. We propose a new and more general formalism that combines the reinforcement learning and supervised learning formalisms. To learn an approximation of an optimal policy, we propose several combinations of reinforcement learning and supervised learning algorithms. Contrary to Lagoudakis' approach, we are not restricted to approximate policy iteration: any reinforcement learning algorithm can be used. One argument for this approach is that, in reinforcement learning, the most important result to obtain is the optimal policy, i.e., a total order of actions for each state. That is why we do not focus on the value of a state but on direct policy approximation.
Subjects: 12. Machine Learning and Discovery; 12.1 Reinforcement Learning
Submitted: Apr 9, 2007
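
The idea stated in the abstract, running a reinforcement learning algorithm and then handing its greedy actions to a supervised learner so that a policy is defined on every state, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy chain MDP, tabular Q-learning with learning rate 1 (adequate only because this toy environment is deterministic), and a 1-nearest-neighbour classifier standing in for an SVM. All function and variable names are hypothetical.

```python
import random

N_STATES = 10        # hypothetical chain 0..9; state 9 is the terminal goal
ACTIONS = (0, 1)     # 0 = move left, 1 = move right
GAMMA = 0.9          # discount factor (assumed value)


def step(state, action):
    """Deterministic chain dynamics; reward 1 when the goal is reached."""
    nxt = state + 1 if action == 1 else max(state - 1, 0)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=200, max_steps=20, seed=0):
    """Tabular Q-learning driven by a uniformly random behaviour policy.

    The learning rate is 1, so each update overwrites the old estimate.
    This shortcut is an assumption that only suits deterministic dynamics.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = rng.randrange(N_STATES - 1)      # random non-terminal start
        for _ in range(max_steps):
            a = rng.choice(ACTIONS)
            nxt, r, done = step(s, a)
            q[s][a] = r if done else r + GAMMA * max(q[nxt])
            if done:
                break
            s = nxt
    return q


def labelled_examples(q):
    """(state, greedy action) pairs, only where one action is strictly better."""
    return [(s, max(ACTIONS, key=lambda a: q[s][a]))
            for s in range(N_STATES) if q[s][0] != q[s][1]]


def nearest_neighbour_policy(examples):
    """1-nearest-neighbour classifier over the state index."""
    if not examples:
        raise ValueError("no labelled state: the goal was never reached")

    def policy(state):
        _, action = min(examples, key=lambda ex: abs(ex[0] - state))
        return action

    return policy


q_values = q_learning()
examples = labelled_examples(q_values)
policy = nearest_neighbour_policy(examples)
# The classifier yields an action for every state, including states for
# which Q-learning produced no strict preference (for example the goal).
approx_policy = [policy(s) for s in range(N_STATES)]
```

The supervised learner only ever sees state and action labels, never the values themselves, which mirrors the abstract's emphasis on approximating the policy rather than the value of a state. Any other reinforcement learning algorithm or classifier could be substituted in the two corresponding functions.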