Sven Koenig and Reid G. Simmons
We investigate the problem for an agent to reach one of a number of goal states by taking actions, where actions cause a deterministic state change. Initially, the topology of the state space and its size are unknown to the agent. We compare a zero-initialized version of Q-learning, that minimizes deliberation time between action executions, with other uninformed search algorithms (i.e. where the effects of actions are initially unknown). We show that the big-O worst-case complexity of every uninformed search algorithm over all domains is at least as large as the big-O worstcase complexity of Q-learning. However, learning and subsequently using a map of the state space can provide a search algorithm with an advantage over Q-learning in some (but not all) domains. Formally, we show that there exists an uninformed search algorithm that dominates Q-learning, i.e. always performs no worse, but performs strictly better in at least one case. In particular, there is at least one domain in which the number of action executions can be reduced by more than a constant factor.