PAC Reinforcement Learning Bounds for RTDP and Rand-RTDP

Authors

Alexander L. Strehl

Lihong Li

Michael L. Littman

Track:

Contents

Downloads:

Download PDF

Abstract:

Real-time Dynamic Programming (RTDP) is a popular algorithm for planning in a Markov Decision Process (MDP). It can also be viewed as a learning algorithm, where the agent improves the value function and policy while acting in an MDP. It has been empirically observed that an RTDP agent generally performs well when viewed this way, but past theoretical results have been limited to asymptotic convergence proofs. We show that, like true learning algorithms E^3 and R-MAX, a slightly modified version of RTDP satisfies a Probably Approximately Correct (PAC) condition (with better sample complexity bounds). In other words, we show that the number of timesteps in an infinite-length run in which the RTDP agent acts according to a non-epsilon-optimal policy from its current state is less than some polynomial (in the size of the MDP), with high probability. We also show that a randomized version of RTDP is PAC with asymptotically equal sample complexity bounds, but has much less per-step computational cost, O(ln(S)ln(SA)+ln(A)), rather than O(S+ln(A)) when we consider only the dependence on S, the number of states and A, the number of actions of the MDP.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.