Social Media Relevance Filtering Using Perplexity-Based Positive-Unlabelled Learning
User-generated internet data, such as Twitter posts, offers data scientists a public, real-time data source that can supplement traditional data with additional insights. However, identifying the data relevant to such analyses can be time-consuming. In this paper, we introduce our Perplexity variant of the Positive-Unlabelled Learning (PPUL) framework as a means of performing social media relevance filtering. We note that this task is particularly well suited to a PU Learning approach. We demonstrate how perplexity, computed with language models, can identify candidate examples of the negative class. To learn such models, we experiment with both statistical methods and a Variational Autoencoder. Our PPUL method generally outperforms strong PU Learning baselines, which we demonstrate on five different datasets: the Hazardous Product Review dataset, two well-known social media datasets, and two real case studies in relevance filtering. All datasets have manual annotations for evaluation and, in each case, PPUL attains state-of-the-art performance, with gains of 4% to 17% over competitive baselines. We show that the PPUL framework is effective when the amount of positive annotated data is small, and that it is appropriate both for content triggered by an event and for a general topic of interest.
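
To illustrate the idea of using perplexity to find candidate negatives, the following is a minimal sketch rather than the PPUL implementation itself. It assumes a unigram language model with add-one smoothing trained on the positive examples, and it assumes that the unlabelled documents with the highest perplexity are taken as candidate negatives. The function names (`train_unigram`, `perplexity`, `candidate_negatives`), the top-k selection rule, and the toy data are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def train_unigram(docs):
    """Fit an add-one-smoothed unigram model on tokenised positive documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen tokens

    def prob(tok):
        # Add-one smoothing keeps unseen tokens from having zero probability.
        return (counts.get(tok, 0) + 1) / (total + vocab_size)

    return prob


def perplexity(prob, doc):
    """Per-token perplexity: exp of the mean negative log-probability."""
    if not doc:
        return float("inf")
    nll = -sum(math.log(prob(tok)) for tok in doc) / len(doc)
    return math.exp(nll)


def candidate_negatives(prob, unlabelled, k):
    """Return the k unlabelled documents the positive-class model finds most surprising."""
    ranked = sorted(unlabelled, key=lambda d: perplexity(prob, d), reverse=True)
    return ranked[:k]


# Toy data (assumed for illustration only).
positives = [
    "product recall after battery fire".split(),
    "battery overheated and caught fire".split(),
]
unlabelled = [
    "charger caught fire overnight".split(),
    "great pizza place downtown".split(),
]

model = train_unigram(positives)
negatives = candidate_negatives(model, unlabelled, k=1)
```

In this toy setting the off-topic document shares no vocabulary with the positive examples, so the positive-class model assigns it the highest perplexity and it is selected as the candidate negative. A stronger language model, such as the statistical models or the Variational Autoencoder mentioned above, would take the place of the unigram model in practice.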