C. Lee Giles, Steve Lawrence
Experimental results reported in the machine learning AI literature can be misleading. This paper investigates the common processes of data averaging (reporting results in terms of the mean and standard deviation of the results from multiple trials) and data snooping in the context of neural networks, one of the most popular AI machine learning models. Both of these processes can result in misleading results and inaccurate conclusions. We demonstrate how easily this can happen and propose techniques for avoiding these very important problems. For data averaging, common presentation assumes that the distribution of individual results is Gaussian. However, we investigate the distribution for common problems and find that it often does not approximate the Gaussian distribution, may not be symmetric, and may be multimodal. We show that assuming Gaussian distributions can significantly affect the interpretation of results, especially those of comparison studies. For a controlled task, we find that the distribution of performance is skewed towards better performance for smoother target functions and skewed towards worse performance for more complex target functions. We propose new guidelines for reporting performance which provide more information about the actual distribution (e.g. box-whiskers plots). For data snooping, we demonstrate that optimization of performance via experimentation with multiple parameters can lead to significance being assigned to results which are due to chance. We suggest that precise descriptions of experimental techniques can be very important to the evaluation of results, and that we need to be aware of potential data snooping biases when formulating these experimental techniques (e.g. selecting the test procedure). Additionally, it is important to only rely on appropriate statistical tests and to ensure that any assumptions made in the tests are valid (e.g. normality of the distribution).