Social Media Relevance Filtering Using Perplexity-Based Positive-Unlabelled Learning

  • Sunghwan Mac Kim, Lorica Health
  • Stephen Wan, CSIRO Data61
  • Cécile Paris, CSIRO Data61
  • Andreas Duenser, CSIRO Data61

Abstract

User-generated Internet data, such as Twitter posts, offers data scientists a public, real-time data source that can provide insights and supplement traditional data. However, identifying the data relevant to such analyses can be time-consuming. In this paper, we introduce our Perplexity variant of Positive-Unlabelled Learning (PPUL), a framework for social media relevance filtering. We note that this task is particularly well suited to a PU Learning approach. We demonstrate how perplexity, computed with language models, can identify candidate examples of the negative class. To learn such models, we experiment with both statistical methods and a Variational Autoencoder. Our PPUL method generally outperforms strong PU Learning baselines, which we demonstrate on five different data sets: the Hazardous Product Review data set, two well-known social media data sets, and two real case studies in relevance filtering. All data sets have manual annotations for evaluation, and, in each case, PPUL attains state-of-the-art performance, with gains of 4% to 17% over competitive baselines. We show that the PPUL framework is effective when the amount of positive annotated data is small, and that it is appropriate both for content triggered by an event and for content on a general topic of interest.
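
The idea of using perplexity to find candidate negatives can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the language models or selection thresholds, so this example assumes an add-one-smoothed unigram model trained on the positive examples, and a hypothetical `fraction` parameter that controls how many of the highest-perplexity unlabelled texts are kept. All function names are illustrative.

```python
import math
from collections import Counter


def train_unigram(docs):
    """Count whitespace tokens over the positive (relevant) documents."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    return counts, sum(counts.values())


def perplexity(doc, counts, total):
    """Per-token perplexity under an add-one-smoothed unigram model (assumed model)."""
    tokens = doc.lower().split()
    if not tokens:
        return float("inf")
    vocab_size = len(counts) + 1  # one extra slot reserved for unseen tokens
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return math.exp(-log_prob / len(tokens))


def candidate_negatives(positives, unlabelled, fraction=0.5):
    """Return the unlabelled texts that the positive-class model finds most surprising.

    `fraction` is a hypothetical knob, not a value taken from the paper.
    """
    counts, total = train_unigram(positives)
    ranked = sorted(
        unlabelled, key=lambda d: perplexity(d, counts, total), reverse=True
    )
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]


positives = [
    "product recall fire hazard",
    "battery fire hazard recall",
    "recall issued for fire hazard",
]
unlabelled = [
    "fire hazard recall announced",
    "great pizza for lunch today",
]
negatives = candidate_negatives(positives, unlabelled, fraction=0.5)
```

In this toy run, the text that shares little vocabulary with the positive examples receives the higher perplexity and is selected as a candidate negative; such candidates could then be paired with the positive set to train an ordinary binary classifier.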

Published
2020-05-26
How to Cite
Kim, S. M., Wan, S., Paris, C., & Duenser, A. (2020). Social Media Relevance Filtering Using Perplexity-Based Positive-Unlabelled Learning. Proceedings of the International AAAI Conference on Web and Social Media, 14(1), 370-381. Retrieved from https://www.aaai.org/ojs/index.php/ICWSM/article/view/7307