Hirotoshi Taira, NTT Communication Science Labs, and Masahiko Haruno, ATR Human Information Processing Research Labs
This paper investigates the effect of prior feature selection in Support Vector Machine (SVM) text categorization. The input space was gradually increased by mutual information (MI) filtering and part-of-speech (POS) filtering, which select the subset of words appropriate for SVM learning from information-theoretic and linguistic perspectives, respectively. The results common to both filtering methods are that 1) the optimal number of features differed completely among categories, and 2) the average performance across categories was best when all of the words were used. In addition, a comparison of the two experiments clarifies that 3) POS filtering consistently outperforms MI filtering, which indicates that SVMs cannot identify irrelevant parts of speech on their own. These results suggest a simple strategy: use the full set of words selected by a rough filtering technique such as part-of-speech tagging.