Padhraic Smyth, David Wolpert
Exploratory data analysis is inherently an iterative, interactive endeavor. In the context of massive data sets, however, many current data analysis algorithms do not scale well enough to permit interaction on a human time scale. In this paper, "anytime data analysis" is proposed as a general framework to enable exploratory data analysis of massive data sets. Anytime data analysis takes into account not only the quality of the model being fit but also the resources (time and memory) used to achieve that fit. The framework is discussed in some detail for interactive multivariate density estimation. Out-of-sample log-likelihood and model combination techniques (such as stacking) are used to greedily explore the data landscape. The method is applied to two significant scientific data sets, where it is shown that it can be better to combine multiple "cheap-to-construct" models than to spend the same time optimizing the parameters of a single more complex model.
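To make the idea of combining cheap models concrete, the following is a minimal sketch of one common form of stacking for density estimation; it is not taken from the paper. It assumes univariate data, a hypothetical set of cheap candidate models (one Gaussian and three fixed-bandwidth kernel density estimates), 5-fold cross-validation to obtain out-of-sample densities, and EM to choose the combination weights. All function names and parameter values are illustrative assumptions, and the data are synthetic.

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_gaussian(train):
    """Cheap model (assumed): a single Gaussian fit by maximum likelihood."""
    mu = sum(train) / len(train)
    var = sum((x - mu) ** 2 for x in train) / len(train)
    sigma = math.sqrt(var)
    return lambda x: gaussian_pdf(x, mu, sigma)


def fit_kde(bandwidth):
    """Cheap model family (assumed): Gaussian KDE with a fixed bandwidth."""
    def fit(train):
        pts = list(train)
        return lambda x: sum(gaussian_pdf(x, c, bandwidth) for c in pts) / len(pts)
    return fit


def cv_density_matrix(data, fitters, n_folds=5):
    """Row i holds each model's density at point i, fit without point i's fold."""
    rows = [None] * len(data)
    for f in range(n_folds):
        train = [x for i, x in enumerate(data) if i % n_folds != f]
        models = [fit(train) for fit in fitters]
        for i, x in enumerate(data):
            if i % n_folds == f:
                rows[i] = [m(x) for m in models]
    return rows


def stacking_weights(rows, n_iter=200):
    """EM for mixture weights that maximize the out-of-sample log-likelihood."""
    k = len(rows[0])
    w = [1.0 / k] * k  # start from uniform weights
    for _ in range(n_iter):
        totals = [0.0] * k
        for row in rows:
            # Floor the mixture density to avoid division by zero on underflow.
            mix = max(sum(wj * pj for wj, pj in zip(w, row)), 1e-300)
            for j in range(k):
                totals[j] += w[j] * row[j] / mix
        w = [t / len(rows) for t in totals]
    return w


def mean_log_lik(rows, w):
    """Average log-density of the weighted combination over the rows."""
    return sum(
        math.log(max(sum(wj * pj for wj, pj in zip(w, row)), 1e-300))
        for row in rows
    ) / len(rows)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic bimodal sample, purely illustrative.
    data = [rng.gauss(-2.0, 0.5) if rng.random() < 0.5 else rng.gauss(2.0, 1.0)
            for _ in range(300)]
    fitters = [fit_gaussian, fit_kde(0.1), fit_kde(0.5), fit_kde(2.0)]
    rows = cv_density_matrix(data, fitters)
    w = stacking_weights(rows)
    print("stacking weights:", [round(v, 3) for v in w])
    print("stacked CV log-lik:", round(mean_log_lik(rows, w), 4))
```

Because EM never decreases the likelihood, the stacked weights score at least as well as uniform weights on the cross-validated densities. An anytime variant, in the spirit of the framework described above, might additionally record the time spent fitting each candidate and stop adding models once a budget is exhausted; that extension is not shown here.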