Social Media Relevance Filtering Using Perplexity-Based Positive-Unlabelled Learning

Sunghwan Mac Kim; Stephen Wan; Cécile Paris; Andreas Duenser

doi:10.1609/icwsm.v14i1.7307

Authors

Sunghwan Mac Kim, Lorica Health
Stephen Wan, CSIRO Data61
Cécile Paris, CSIRO Data61
Andreas Duenser, CSIRO Data61

Abstract

User-generated internet data, such as Twitter posts, offers data scientists a public, real-time data source that can provide insights, supplementing traditional data. However, identifying relevant data for such analyses can be time-consuming. In this paper, we introduce our framework, a Perplexity variant of Positive-Unlabelled Learning (PPUL), as a means to perform social media relevance filtering. We note that this task is particularly well suited to a PU Learning approach. We demonstrate how perplexity, computed using language models, can identify candidate examples of the negative class. To learn such models, we experiment with both statistical methods and a Variational Autoencoder. Our PPUL method generally outperforms strong PU Learning baselines, which we demonstrate on five different data sets: the Hazardous Product Review data set, two well-known social media data sets, and two real case studies in relevance filtering. All data sets have manual annotations for evaluation, and, in each case, PPUL attains state-of-the-art performance, with gains ranging from 4% to 17% over competitive baselines. We show that the PPUL framework is effective when the amount of positive annotated data is small, and that it is appropriate both for content triggered by an event and for content on a general topic of interest.
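
The idea of using perplexity to find candidate negative examples can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple add-one smoothed unigram language model trained on the positive examples (the abstract mentions statistical methods and a Variational Autoencoder without giving details here), and the function names `train_unigram`, `perplexity`, and `candidate_negatives` are hypothetical. Unlabelled documents that the positive-class model finds most surprising, that is, those with the highest perplexity, are treated as candidate negatives.

```python
import math
from collections import Counter


def train_unigram(docs):
    """Count whitespace tokens across the positive documents (assumed tokenisation)."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    return counts, sum(counts.values())


def perplexity(doc, counts, total, vocab_size):
    """Per-token perplexity of doc under an add-one smoothed unigram model."""
    tokens = doc.lower().split()
    if not tokens:
        return float("inf")
    log_prob = sum(
        math.log((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return math.exp(-log_prob / len(tokens))


def candidate_negatives(positives, unlabelled, k):
    """Return the k unlabelled documents with the highest perplexity."""
    counts, total = train_unigram(positives)
    # Include unlabelled tokens in the vocabulary so smoothing covers unseen words.
    vocab = set(counts) | {t for d in unlabelled for t in d.lower().split()}
    ranked = sorted(
        unlabelled,
        key=lambda d: perplexity(d, counts, total, len(vocab)),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    positives = [
        "flood warning issued for the river",
        "river flood closes the road",
    ]
    unlabelled = [
        "flood warning for the road",
        "new pizza place opened downtown today",
    ]
    print(candidate_negatives(positives, unlabelled, k=1))
```

In a full PU Learning pipeline, the candidate negatives selected this way would be combined with the annotated positives to train a binary relevance classifier; the choice of `k` and of the language model are assumptions that would need tuning on real data.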
