Bhaskara Marthi, David Latham, Stuart Russell, and Carlos Guestrin
We describe a language for partially specifying policies in domains consisting of multiple subagents working together to maximize a common reward function. The language extends ALisp with constructs for concurrency and dynamic assignment of subagents to tasks. During learning, the subagents learn a distributed representation of the Q-function for this partial policy. They then coordinate at runtime to find the best joint action at each step. We give examples showing that programs in this language are natural and concise. We also describe online and batch learning algorithms for learning a linear approximation to the Q-function, which make use of the coordination structure of the problem.
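As a rough illustration of the last two ideas, the sketch below shows a linear Q-function over joint actions, greedy joint-action selection, and a single online weight update. It is a minimal sketch under stated assumptions, not the algorithm from this work: the feature function, action names, step size, and discount are all hypothetical. Exhaustive enumeration over joint actions stands in for any coordination mechanism that exploits problem structure. The update is a plain one-step Q-learning rule rather than one defined over the choice points of a partial program.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

JointAction = Tuple[str, ...]
# Hypothetical feature function: maps (state, joint action) to a feature vector.
FeatureFn = Callable[[object, JointAction], List[float]]


def q_value(weights: Sequence[float], features: FeatureFn,
            state: object, joint_action: JointAction) -> float:
    """Linear approximation: Q(s, a) = sum_k w_k * f_k(s, a)."""
    phi = features(state, joint_action)
    return sum(w * f for w, f in zip(weights, phi))


def best_joint_action(weights: Sequence[float], features: FeatureFn, state: object,
                      action_sets: Sequence[Sequence[str]]) -> JointAction:
    """Pick the joint action with the highest Q-value by brute-force enumeration.

    Assumption: enumeration is a stand-in and is exponential in the number
    of subagents, so it is only workable for very small problems.
    """
    return max(product(*action_sets),
               key=lambda a: q_value(weights, features, state, a))


def q_learning_update(weights: Sequence[float], features: FeatureFn, state: object,
                      joint_action: JointAction, reward: float, next_state: object,
                      action_sets: Sequence[Sequence[str]],
                      alpha: float = 0.1, gamma: float = 0.9) -> List[float]:
    """One online gradient step on the linear weights (simplified one-step rule)."""
    next_best = best_joint_action(weights, features, next_state, action_sets)
    target = reward + gamma * q_value(weights, features, next_state, next_best)
    td_error = target - q_value(weights, features, state, joint_action)
    phi = features(state, joint_action)
    return [w + alpha * td_error * f for w, f in zip(weights, phi)]


def example_features(state: object, joint_action: JointAction) -> List[float]:
    """Hypothetical features for two subagents; the state is ignored here."""
    a0, a1 = joint_action
    return [
        1.0 if a0 == "gather" else 0.0,  # local term for subagent 0
        1.0 if a1 == "gather" else 0.0,  # local term for subagent 1
        1.0 if a0 == a1 else 0.0,        # interaction term: both chose the same task
    ]


ACTION_SETS = [("gather", "defend"), ("gather", "defend")]
WEIGHTS = [1.0, 0.5, -2.0]

chosen = best_joint_action(WEIGHTS, example_features, None, ACTION_SETS)
updated = q_learning_update(WEIGHTS, example_features, None, chosen,
                            reward=2.0, next_state=None, action_sets=ACTION_SETS)
```

With these illustrative weights, the negative interaction term steers the two subagents toward different tasks, which is the kind of dependency that makes joint (rather than independent) action selection necessary.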