Eyal Even-Dar, Shie Mannor, and Yishay Mansour
We consider incorporating action elimination procedures into reinforcement learning algorithms. We suggest a framework that is based on learning upper and lower estimates of the value function or the Q-function and eliminating actions that are not optimal. We provide model-based and model-free variants of the elimination method. We further derive stopping conditions that guarantee that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness.
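
To make the elimination idea concrete, the following is a minimal sketch, not the algorithm as specified in the paper. It assumes that upper and lower estimates of the Q-function are already available as nested dictionaries, and it drops an action in a state when its upper estimate falls below the largest lower estimate in that state. The function names `eliminate_actions` and `should_stop`, the dictionary layout, the numeric values, and the gap-based stopping test with threshold `epsilon` are all illustrative assumptions; the stopping conditions actually derived may take a different form.

```python
def eliminate_actions(q_lower, q_upper):
    """Return, for each state, the actions that cannot yet be ruled out.

    Assumed rule: an action is eliminated when its upper estimate is
    strictly below the best lower estimate among actions in that state.
    """
    surviving = {}
    for state, lower_by_action in q_lower.items():
        # Lower estimate of the state value: best lower bound over actions.
        v_lower = max(lower_by_action.values())
        surviving[state] = [
            action
            for action in lower_by_action
            if q_upper[state][action] >= v_lower
        ]
    return surviving


def should_stop(q_lower, q_upper, surviving, epsilon):
    """Assumed stopping test: every surviving action has a gap below epsilon."""
    return all(
        q_upper[state][action] - q_lower[state][action] < epsilon
        for state, actions in surviving.items()
        for action in actions
    )


# Hypothetical estimates for a single state with three actions.
q_lower = {"s0": {"left": 0.2, "right": 0.6, "stay": 0.5}}
q_upper = {"s0": {"left": 0.5, "right": 0.9, "stay": 0.7}}

surviving = eliminate_actions(q_lower, q_upper)
print(surviving)  # "left" is dropped because 0.5 < 0.6
print(should_stop(q_lower, q_upper, surviving, epsilon=0.1))
```

In this sketch the eliminated action no longer needs to be sampled, which is the intuition behind the speedup: exploration effort is spent only on actions whose estimates still overlap with the best one.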