Wenyuan Dai, Gui-Rong Xue, Qiang Yang, Yong Yu
A basic assumption in traditional machine learning is that the training and test data are drawn from the same distribution. This assumption may not hold in many practical situations, and we may be forced to learn a prediction model from data drawn from a different distribution. For example, labeling data in the domain of interest may be expensive, while plenty of labeled data is available in a related but different domain. In this paper, we propose a novel transfer-learning algorithm for text classification based on an EM-based Naive Bayes classifier. Our solution first estimates the initial probabilities under the distribution of a labeled data set, and then uses an EM algorithm to revise the model for the different distribution of the unlabeled test data. We show that our algorithm is effective on several pairs of domains, where the distance between the two distributions is measured using the Kullback-Leibler (KL) divergence. Moreover, the KL divergence is used to set the trade-off parameters in our algorithm. In our experiments, our algorithm outperforms traditional supervised and semi-supervised learning algorithms as the distributions of the training and test sets become increasingly different.
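The two-stage idea described in the abstract — train a Naive Bayes model on the labeled source-domain data, then run EM on the unlabeled target-domain data to adapt the parameters — can be sketched as follows. This is a minimal illustration of the general EM-based adaptation scheme, not the authors' exact algorithm: the KL-divergence-based trade-off weighting between source and target data is omitted, and all function names and the toy data are our own.

```python
import numpy as np

def train_nb(X, y, n_classes, alpha=1.0):
    """Fit a multinomial Naive Bayes model on labeled source-domain
    word-count matrix X (docs x vocab) with Laplace smoothing alpha."""
    n_docs, vocab = X.shape
    priors = np.zeros(n_classes)
    cond = np.zeros((n_classes, vocab))
    for c in range(n_classes):
        Xc = X[y == c]
        priors[c] = (len(Xc) + alpha) / (n_docs + alpha * n_classes)
        cond[c] = (Xc.sum(axis=0) + alpha) / (Xc.sum() + alpha * vocab)
    return priors, cond

def em_transfer(priors, cond, X_target, n_iter=10, alpha=1.0):
    """Revise a source-trained model toward the (unlabeled) target
    distribution: E-step computes class posteriors for target docs,
    M-step re-estimates the parameters from those soft labels."""
    n_classes, vocab = cond.shape
    for _ in range(n_iter):
        # E-step: posterior responsibilities P(c | d) for each target doc
        log_post = np.log(priors) + X_target @ np.log(cond).T
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors and word probabilities on target data
        priors = (post.sum(axis=0) + alpha) / (len(X_target) + alpha * n_classes)
        counts = post.T @ X_target  # expected class-conditional word counts
        cond = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * vocab)
    return priors, cond

# Toy example: class 0 favors words {0,1}, class 1 favors words {2,3};
# the target data follows the same class structure with shifted counts.
X_src = np.array([[5, 3, 0, 1], [4, 4, 1, 0], [0, 1, 5, 4], [1, 0, 3, 5]], float)
y_src = np.array([0, 0, 1, 1])
priors, cond = train_nb(X_src, y_src, n_classes=2)

X_tgt = np.array([[6, 2, 0, 0], [0, 0, 4, 6], [5, 4, 1, 0], [1, 1, 6, 5]], float)
priors, cond = em_transfer(priors, cond, X_tgt, n_iter=5)
preds = (np.log(priors) + X_tgt @ np.log(cond).T).argmax(axis=1)
```

Because EM is initialized from the source-trained model, the class identities stay anchored while the parameters drift toward the target distribution; in the paper's full algorithm, trade-off parameters chosen via KL divergence control how far that drift is allowed to go.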
Subjects: 12. Machine Learning and Discovery; 13. Natural Language Processing
Submitted: Apr 22, 2007