Katharina Probst, Rayid Ghani, Marko Krema, Andrew Fano, Yan Liu
We describe an approach to extracting attribute-value pairs from product descriptions. Representing products as sets of such pairs lets us augment product databases and supports tasks in which a structured view of a product is more useful than treating it as an atomic entity, such as product recommendation, product comparison, and demand forecasting. We formulate the extraction as a classification problem and use a semi-supervised algorithm (co-EM) together with a supervised one (Naive Bayes). The extraction system requires very little initial user supervision: using unlabeled data, we automatically extract an initial seed list that serves as training data for the supervised and semi-supervised classification algorithms. Finally, the extracted attributes and values are linked to form pairs using dependency information and co-location scores. We present promising results on product descriptions in two categories of sporting goods.
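To make the classification framing concrete, the following is a minimal sketch of the supervised Naive Bayes step only: words are labeled as attribute or value from a small seed list. It is not the authors' implementation and does not cover co-EM or the pairing step. The feature scheme (`word=`, `prev=`, `next=`), the seed examples, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical seed list: (features of a word in context, label).
# The feature names and examples are illustrative, not from the paper.
SEED = [
    (["word=material", "next=aluminum"], "attribute"),
    (["word=aluminum", "prev=material"], "value"),
    (["word=color", "next=red"], "attribute"),
    (["word=red", "prev=color"], "value"),
]

def train_naive_bayes(examples, alpha=1.0):
    """Count labels and per-label features; alpha is the Laplace smoothing constant."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for feats, label in examples:
        label_counts[label] += 1
        for f in feats:
            feature_counts[label][f] += 1
            vocab.add(f)
    return label_counts, feature_counts, vocab, alpha

def classify(model, feats):
    """Return the label with the highest smoothed log-probability."""
    label_counts, feature_counts, vocab, alpha = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(feature_counts[label].values()) + alpha * len(vocab)
        for f in feats:
            if f in vocab:  # features never seen in training are ignored
                score += math.log((feature_counts[label][f] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_naive_bayes(SEED)
print(classify(model, ["word=blue", "prev=color"]))
print(classify(model, ["word=weight", "next=red"]))
```

In a semi-supervised setting, a classifier of this kind would typically be retrained on its own probabilistic labels for unlabeled words; that loop is omitted here.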
Subjects: 13. Natural Language Processing; 12. Machine Learning and Discovery
Submitted: Oct 16, 2006