Edwin de Jong
Agents in Multi Agent Systems can coordinate their actions by communicating. We investigate a minimal form of communication, where the signals that agents send represent evaluations of the behavior of the receiving agent. Learning to act according to these signals is a typical Reinforcement Learning problem. The backpropagation neural network has been used to predict rewards that will follow an action. The first results made clear that a mechanism for balmacing between exploitation and exploration was needed. We introduce the Exploration Buckets algorithm, a method that favors both actions with high prediction errors and actions that have been ignored for some time. The algorithm’s scope is not restricted to a single learning algorithm, mad its main characterics are its insensibility to large (or even continuous) state spaces and its appropriateness for online learning; the Exploration/ Exploitation balance does not depend on properties external to the system, such as time.