Scott A. Weiss, Simon Kasif, and Eric Brill
We report on our investigations into topic classification with USENET newsgroups. We frame the task as determining the newsgroup to which a new document should be posted. We train our system by forming "metadocuments" that represent each topic. We discuss our experiments with this method and provide evidence that choosing particular documents or words to use in these models degrades classification accuracy. We also describe a technique called classification-based retrieval for finding documents similar to a query document.
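As a rough illustration of the metadocument idea, the following minimal sketch merges all training posts from a newsgroup into a single term-count vector and assigns a new post to the newsgroup whose metadocument it most resembles. The raw term counts, the cosine similarity measure, and all function names here are assumptions made for illustration, not details taken from our system. The retrieval function reflects one plausible reading of classification-based retrieval (classify the query first, then rank only the posts in the predicted newsgroup) and should likewise be treated as an assumption.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase alphabetic tokens only; a deliberately simple, assumed tokenizer.
    return re.findall(r"[a-z]+", text.lower())


def build_metadocuments(labeled_posts):
    # Merge every training post of a newsgroup into one term-count vector.
    metadocs = defaultdict(Counter)
    for group, text in labeled_posts:
        metadocs[group].update(tokenize(text))
    return dict(metadocs)


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, metadocs):
    # Pick the newsgroup whose metadocument is most similar to the post.
    query = Counter(tokenize(text))
    return max(metadocs, key=lambda group: cosine(query, metadocs[group]))


def retrieve_similar(query_text, labeled_posts, metadocs, k=1):
    # Assumed reading of classification-based retrieval:
    # classify the query, then rank only posts from the predicted newsgroup.
    group = classify(query_text, metadocs)
    query = Counter(tokenize(query_text))
    candidates = [text for g, text in labeled_posts if g == group]
    candidates.sort(key=lambda t: cosine(query, Counter(tokenize(t))), reverse=True)
    return candidates[:k]


# Toy training data, invented for this example only.
posts = [
    ("rec.autos", "my car engine needs new brakes"),
    ("rec.autos", "the engine oil and tires on this car"),
    ("sci.space", "the shuttle launch reached orbit"),
    ("sci.space", "nasa plans a new orbit for the satellite"),
]
metadocs = build_metadocuments(posts)
predicted = classify("which engine oil for my car", metadocs)
neighbors = retrieve_similar("which engine oil for my car", posts, metadocs, k=1)
```

Restricting the ranking step to the predicted newsgroup keeps the candidate set small, at the cost of missing similar posts whenever the classification step is wrong.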