Deep Reactive Policies for Planning in Stochastic Nonlinear Domains
Recent advances in applying deep learning to planning have shown that Deep Reactive Policies (DRPs) can be powerful for fast decision-making in complex environments. However, an important limitation of current DRP-based approaches is either the need of optimal planners to be used as ground truth in a supervised learning setting or the sample complexity of high-variance policy gradient estimators, which are particularly troublesome in continuous state-action domains. In order to overcome those limitations, we introduce a framework for training DRPs in continuous stochastic spaces via gradient-based policy search. The general approach is to explicitly encode a parametric policy as a deep neural network, and to formulate the probabilistic planning problem as an optimization task in a stochastic computation graph by exploiting the re-parameterization of the transition probability densities; the optimization is then solved by leveraging gradient descent algorithms that are able to handle non-convex objective functions. We benchmark our approach against stochastic planning domains exhibiting arbitrary differentiable nonlinear transition and cost functions (e.g., Reservoir Control, HVAC and Navigation). Results show that DRPs with more than 125,000 continuous action parameters can be optimized by our approach for problems with 30 state fluents and 30 action fluents on inexpensive hardware under 6 minutes. Also, we observed a speedup of 5 orders of magnitude in the average inference time per decision step of DRPs when compared to other state-of-the-art online gradient-based planners when the same level of solution quality is required.