Leveraging Observations in Bandits: Between Risks and Benefits

Andrei Lupu; Audrey Durand; Doina Precup

doi:10.1609/aaai.v33i01.33016112

Authors

Andrei Lupu McGill University
Audrey Durand McGill University
Doina Precup McGill University

DOI:

https://doi.org/10.1609/aaai.v33i01.33016112

Abstract

Imitation learning has been widely used to speed up learning in novice agents, by allowing them to leverage existing data from experts. Allowing an agent to be influenced by external observations can benefit to the learning process, but it also puts the agent at risk of following sub-optimal behaviours. In this paper, we study this problem in the context of bandits. More specifically, we consider that an agent (learner) is interacting with a bandit-style decision task, but can also observe a target policy interacting with the same environment. The learner observes only the target’s actions, not the rewards obtained. We introduce a new bandit optimism modifier that uses conditional optimism contingent on the actions of the target in order to guide the agent’s exploration. We analyze the effect of this modification on the well-known Upper Confidence Bound algorithm by proving that it preserves a regret upper-bound of order O(lnT), even in the presence of a very poor target, and we derive the dependency of the expected regret on the general target policy. We provide empirical results showing both great benefits as well as certain limitations inherent to observational learning in the multi-armed bandit setting. Experiments are conducted using targets satisfying theoretical assumptions with high probability, thus narrowing the gap between theory and application.

Leveraging Observations in Bandits: Between Risks and Benefits

Authors

DOI:

Abstract

Downloads

Published

How to Cite

Issue

Section

Information

Developed By

Subscription