Using Web Searches on Important Words to Create Background Sets for LSI Classification

Sarah Zelikovitz, Marina Kogan

The World Wide Web holds a wealth of information related to almost any text classification task. This paper presents a method for mining the web to improve text classification by creating a set of background text. Our algorithm uses the information gain criterion to create lists of important words for each class of a text categorization problem. It then searches the web on various combinations of these words to produce a set of related data. We use this background text with Latent Semantic Indexing classification to create an expanded term-by-document matrix, on which singular value decomposition is performed. We provide empirical results showing that this approach improves accuracy on unseen test examples in different domains.
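The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy documents, the hard-coded background texts (standing in for web search results), and the nearest-neighbor cosine rule used for the final decision are all assumptions made for illustration.

```python
import math
from collections import Counter

import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, word):
    """Information gain of the presence or absence of `word` about the class."""
    present = [y for d, y in zip(docs, labels) if word in d.split()]
    absent = [y for d, y in zip(docs, labels) if word not in d.split()]
    n = len(labels)
    return (entropy(labels)
            - (len(present) / n) * entropy(present)
            - (len(absent) / n) * entropy(absent))


def top_words(docs, labels, n):
    """Rank vocabulary words by information gain (candidate search terms)."""
    vocab = sorted({w for d in docs for w in d.split()})
    return sorted(vocab, key=lambda w: -information_gain(docs, labels, w))[:n]


def term_document_matrix(docs, vocab):
    """Raw term-frequency matrix: one row per word, one column per document."""
    index = {w: i for i, w in enumerate(vocab)}
    m = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            if w in index:
                m[index[w], j] += 1.0
    return m


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def lsi_classify(train_docs, train_labels, background_docs, test_doc, k=2):
    """Label `test_doc` by its nearest training document in the LSI space.

    The SVD is computed on the expanded matrix (training plus background
    documents), so background text shapes the reduced space even though
    it carries no labels.
    """
    all_docs = train_docs + background_docs
    vocab = sorted({w for d in all_docs for w in d.split()})
    X = term_document_matrix(all_docs, vocab)
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Uk = U[:, :min(k, len(s))]
    train_coords = Uk.T @ X[:, :len(train_docs)]       # training docs in LSI space
    q = Uk.T @ term_document_matrix([test_doc], vocab)[:, 0]
    sims = [cosine(q, train_coords[:, j]) for j in range(len(train_docs))]
    return train_labels[int(np.argmax(sims))]


# Toy data (hypothetical); the background texts stand in for web search results.
train_docs = ["cat dog pet", "dog pet leash",
              "stock market trade", "market trade bond"]
train_labels = ["animal", "animal", "finance", "finance"]
background_docs = ["cat kitten pet purr", "bond stock investor yield"]

print(top_words(train_docs, train_labels, 4))
print(lsi_classify(train_docs, train_labels, background_docs, "kitten purr"))
```

In this toy setup the test document shares no words with any training document. It can still be placed near one class because the background text links its words to training vocabulary through co-occurrence, which is the effect the expanded matrix is meant to capture.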

Subjects: 1.10 Information Retrieval; 12. Machine Learning and Discovery

Submitted: Jan 25, 2006

This page is copyrighted by AAAI. All rights reserved.