Steven D. Whitehead
Reinforcement learning algorithms, when used to solve multi-stage decision problems, perform a kind of online (incremental) search to find an optimal decision policy. The time complexity of this search strongly depends upon the size and structure of the state space and upon a priori knowledge encoded in the learners initial parameter values. When a priori knowledge is not available, search is unbiased and can be excessive. Cooperative mechanisms help reduce search by providing the learner with shorter latency feed-back and auxiliary sources of experience. These mechanisms are based on the observation that in nature, intelligent agents exist in a cooperative social environment that helps structure and guide learning. Within this context, learning involves information transfer as much as it does discovery by trial-and-error. Two cooperative mechanisms are described: Learning with an External Critic (or LEC) and Learning By Watching (or LBW). The search time complexity of these algorithms, along with unbiased Q-learning, are analyzed for problem solving tasks on a restricted class of state spaces. The results indicate that while unbiased search can be expected to require time moderately exponential in the size of the state space, the LEC and LBW algorithms require at most time linear in the size of the state space and under appropriate conditions, are independent of the state space size altogether; requiring time proportional to the length of the optimal solution path. While these analytic results apply only to a restricted class of tasks, they shed light on the complexity of search in reinforcement learning in general and the utility of cooperative mechanisms for reducing search.