V. Seshadri and Raguram Sasisekharan, AT&T Bell Laboratories; Sholom M. Weiss, Rutgers University
Techniques for learning from data typically require the data to be in a standard form. Measurements must be encoded in a numerical format such as binary true-or-false features, numerical features, or possibly numerical codes. In addition, for classification, a clear goal for learning must be specified. While some databases can readily be arranged in standard form, many others combine numerical fields and text, with thousands of possible values for each data field and multiple instances of the same field specification. A significant portion of the effort in real-world data mining applications goes into defining, identifying, and encoding the data as suitable features. In this paper, we describe an automatic feature extraction procedure, adapted from modern text categorization techniques, that maps very large databases into manageable datasets in standard form. We describe a commercial application of this procedure to mining a collection of very large databases of home appliance service records for a major international retailer.
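To make the notion of "standard form" concrete, the following is a minimal sketch of one dictionary-based encoding in the spirit of text categorization: count how often each (field, value) pair occurs, keep the most frequent pairs as a dictionary, and encode every record as a binary vector over that dictionary. It is an illustration under stated assumptions, not the procedure described in the paper. The function names `build_dictionary` and `encode`, the `top_k` parameter, and the sample records are all hypothetical.

```python
from collections import Counter


def record_pairs(record):
    """Return the set of (field, value) pairs present in one record.

    A record maps a field name to a list of values, so the same field
    may appear with several values (assumed layout, for illustration).
    """
    return {(field, value) for field, values in record.items() for value in values}


def build_dictionary(records, top_k=3):
    """Select the top_k most frequent (field, value) pairs as features."""
    counts = Counter()
    for record in records:
        # Each pair is counted once per record, even if it repeats.
        counts.update(record_pairs(record))
    # Sort by descending count, then by pair, so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [pair for pair, _ in ranked[:top_k]]


def encode(record, dictionary):
    """Encode a record as binary true-or-false features over the dictionary."""
    present = record_pairs(record)
    return [1 if pair in present else 0 for pair in dictionary]


# Hypothetical service records; the field names and values are invented.
records = [
    {"part": ["belt", "motor"], "symptom": ["noise"]},
    {"part": ["belt"], "symptom": ["noise", "leak"]},
    {"part": ["pump"], "symptom": ["leak"]},
]

dictionary = build_dictionary(records, top_k=3)
matrix = [encode(record, dictionary) for record in records]
```

Capping the dictionary at `top_k` entries is what keeps the resulting dataset manageable when a field has thousands of possible values; rare values simply receive no feature. A real application would also need a rule for choosing the cutoff and for specifying the classification goal, neither of which this sketch addresses.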