Alexander L. Strehl, Lihong Li, Michael L. Littman
Real-time Dynamic Programming (RTDP) is a popular algorithm for planning in a Markov Decision Process (MDP). It can also be viewed as a learning algorithm, where the agent improves the value function and policy while acting in an MDP. It has been empirically observed that an RTDP agent generally performs well when viewed this way, but past theoretical results have been limited to asymptotic convergence proofs. We show that, like true learning algorithms E^3 and R-MAX, a slightly modified version of RTDP satisfies a Probably Approximately Correct (PAC) condition (with better sample complexity bounds). In other words, we show that the number of timesteps in an infinite-length run in which the RTDP agent acts according to a non-epsilon-optimal policy from its current state is less than some polynomial (in the size of the MDP), with high probability. We also show that a randomized version of RTDP is PAC with asymptotically equal sample complexity bounds, but has much less per-step computational cost, O(ln(S)ln(SA)+ln(A)), rather than O(S+ln(A)) when we consider only the dependence on S, the number of states and A, the number of actions of the MDP.
Subjects: 12.1 Reinforcement Learning; 15.7 Search
Submitted: May 15, 2006