Fahiem Bacchus, Craig Boutilier, Adam Grove
Markov decision processes (MDPs) are a very popular tool for decision theoretic planning (DTP), partly because of the well-developed, expressive theory that includes effective solution techniques. But the Markov assumption - that dynamics and rewards depend on the current state only, and not on history - is often inappropriate. This is especially true of rewards: we frequently wish to associate rewards with behaviors that extend over time. Of course, such reward processes can be encoded in an MDP should we have a rich enough state space (where states encode enough history). However it is often difficult to "hand craft" suitable state spaces that encode an appropriate amount of history. We consider this problem in the case where non-Markovian rewards are encoded by assigning values to formulas of a temporal logic. These formulas characterize the value of temporally extended behaviors. We argue that this allows a natural representation of many commonly encountered non-Markovian rewards. The main result is an algorithm which, given a decision process with non-Markovian rewards expressed in this manner, automatically constructs an equivalent MDP (with Markovian reward structure), allowing optimal policy construction using standard techniques.