I. Szita and A. Lorincz
In this article we propose a method that can deal with certain combinatorial reinforcement learning tasks. We demonstrate the approach in the popular Ms. Pac-Man game. We define a set of high-level observation and action modules, from which rule-based policies are constructed automatically. In these policies, actions are temporally extended, and may work concurrently. The policy of the agent is encoded by a compact decision list. The components of the list are selected from a large pool of rules, which can be either hand-crafted or generated automatically. A suitable selection of rules is learnt by the cross-entropy method, a recent global optimization algorithm that fits our framework smoothly. Cross-entropy-optimized policies perform better than our hand-crafted policy, and reach the score of average human players. We argue that learning is successful mainly because (i) policies may apply concurrent actions and thus the policy space is sufficiently rich, (ii) the search is biased towards low-complexity policies and therefore, solutions with a compact description can be found quickly if they exist.