Theodore Johnson and Tamraparni Dasu, AT&T Labs - Research
Comparing two data sets can reveal a great deal about the time-varying nature of an observed process. For example, suppose that each point in a data set represents a customer’s activity as a location in n-dimensional space. A comparison of the distribution of points in two such data sets can indicate how customer activity has changed between the observation periods. Other applications include data integrity checking: an unexpected change in a data set can indicate a problem in the data collection process. We propose a fast, inexpensive method for comparing massive, high-dimensional data sets that does not make any distributional assumptions. The method adapts the power of classical statistics for use on complex, high-dimensional data sets. We generate a map of the data set (a DataSphere) and compare data sets by comparing their DataSpheres. The DataSphere can be generated in two passes over the data set, stored in a database, and aggregated at multiple levels. We illustrate our set comparison technique with an example analysis of data sets drawn from AT&T data warehouses.