Jürgen Schmidhuber and Jieyu Zhao
Reinforcement learning based on direct search in policy space requires few assumptions about the environment. Hence it is applicable in certain situations where most traditional reinforcement learning algorithms based on dynamic programming are not, especially in partially observable, deterministic worlds. In realistic settings, however, reliable policy evaluations are complicated by numerous sources of uncertainty, such as stochasticity in policy and environment. Given a limited life-time, how much time should a direct policy searcher spend on policy evaluations to obtain reliable statistics? Despite the fundamental nature of this question it has not received much attention yet. Our efficient approach based on the success-story algorithm (SSA) is radical in the sense that it never stops evaluating any previous policy modification except those it undoes for lack of empirical evidence that they have contributed to lifelong reward accelerations. Here we identify SSA’s fundamental advantages over traditional direct policy search (such as stochastic hill-climbing) on problems involving several sources of stochasticity and uncertainty.