Sihem Belabbes, Gilles Richard
As a side effect of e-marketing strategies, the number of spam e-mails is rocketing, and so are the time and cost needed to deal with them. Spam filtering is one of the most difficult text categorization tasks, a sad consequence of spammers' dynamic efforts to escape filtering. In this paper, we investigate the use of Kolmogorov complexity theory as a backbone for spam filtering, avoiding the burden of text analysis and of updating keywords and blacklists. Exploiting the fact that a message's information content can be estimated through compression techniques, we represent an e-mail as a multi-dimensional real vector and then implement a support vector machine classifier to classify new incoming e-mails. Our first results exhibit interesting accuracy rates and emphasize the relevance of our idea.
Subjects: 12. Machine Learning and Discovery; 13. Natural Language Processing
Submitted: Feb 6, 2008
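
The compression-based representation summarized in the abstract can be illustrated with a minimal sketch. The abstract does not say how the vector coordinates are computed, so everything below is an assumption made for illustration: compressed length under zlib stands in for Kolmogorov complexity (which is not computable), each coordinate is a normalized compression distance between the e-mail and a reference message, and the names `approx_complexity`, `ncd`, and `to_vector` are hypothetical. The resulting vectors are the kind of input a support vector machine could be trained on; that step is not shown.

```python
import zlib
from typing import List, Sequence


def approx_complexity(data: bytes) -> int:
    """Upper-bound estimate of information content: compressed length in bytes.

    Assumption: zlib is used here only as a convenient compressor.
    """
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest shared content; values near 1 suggest little overlap.
    """
    cx, cy = approx_complexity(x), approx_complexity(y)
    cxy = approx_complexity(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def to_vector(message: str, references: Sequence[str]) -> List[float]:
    """Map an e-mail to a real vector, one coordinate per reference message.

    Assumption: the references are a hand-picked mix of known spam and
    known legitimate messages.
    """
    m = message.encode("utf-8")
    return [ncd(m, r.encode("utf-8")) for r in references]


if __name__ == "__main__":
    references = [
        "Congratulations! You have won a free prize. Click here now to claim "
        "your reward before this limited offer expires tonight.",
        "Hi all, the minutes from Tuesday's project meeting are attached. "
        "Please send me your comments before the end of the week.",
    ]
    incoming = (
        "You have won! Click here now to claim your free prize. "
        "This limited offer expires tonight, so do not miss your reward."
    )
    print(to_vector(incoming, references))
```

With this representation, a message that shares phrasing with a reference compresses well when concatenated with it, so the corresponding coordinate is smaller. On very short messages the compressor's fixed overhead dominates, so the distances are only meaningful for texts of reasonable length.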