AAAI Publications, Sixth International AAAI Conference on Weblogs and Social Media

Filtering Noisy Web Data by Identifying and Leveraging Users' Contributions
Alina Mihaela Stoica

Last modified: 2012-05-20


In this paper we present several methods for collecting textual Web content and filtering out noisy data. We show that knowing which user publishes which content can help detect noise. We begin by collecting data from two forums and from Twitter. For the forums, we extract the meaningful information from each discussion (the text of the question and answers, user IDs, and dates). For the Twitter dataset, we first detect tweets with very similar texts, which helps avoid redundancy in further analysis. This also yields clusters of tweets that can be used in the same way as the forum discussions: they can be modeled as bipartite graphs. Analysis of the nodes of the resulting graphs shows that network structure and content type (noisy or relevant) are not independent, so studying the network can help filter noise.
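
The abstract does not say how similar tweets are detected or how the bipartite graphs are built, so the following is only a minimal sketch under stated assumptions rather than the paper's method. It assumes word-set Jaccard similarity with a hypothetical threshold of 0.8, a greedy single-pass clustering, and a bipartite graph that links users to the tweet clusters they contributed to. All function names and the sample tweets are illustrative.

```python
from collections import defaultdict


def tokens(text):
    """Lowercased word set; a deliberately crude normalisation (assumption)."""
    return frozenset(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_tweets(tweets, threshold=0.8):
    """Greedy single-pass clustering of (user, text) pairs.

    Each tweet joins the first cluster whose representative (its first
    tweet) is at least `threshold` similar; otherwise it starts a new one.
    The threshold value is an assumption, not taken from the paper.
    """
    clusters = []  # list of (representative_tokens, [tweet indices])
    for i, (_user, text) in enumerate(tweets):
        t = tokens(text)
        for rep, members in clusters:
            if jaccard(rep, t) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((t, [i]))
    return [members for _rep, members in clusters]


def bipartite_degrees(tweets, clusters):
    """Build the user/cluster bipartite graph and return node degrees."""
    user_to_clusters = defaultdict(set)
    cluster_to_users = defaultdict(set)
    for c, members in enumerate(clusters):
        for i in members:
            user = tweets[i][0]
            user_to_clusters[user].add(c)
            cluster_to_users[c].add(user)
    user_deg = {u: len(cs) for u, cs in user_to_clusters.items()}
    cluster_deg = {c: len(us) for c, us in cluster_to_users.items()}
    return user_deg, cluster_deg


# Made-up sample data for illustration only.
tweets = [
    ("u1", "Win a free phone now click here"),
    ("u2", "win a free phone now click here"),
    ("u3", "Interesting talk on bipartite graphs today"),
    ("u1", "Win a free phone now click here"),
]
clusters = cluster_tweets(tweets)
user_deg, cluster_deg = bipartite_degrees(tweets, clusters)
```

Degrees are only one possible structural feature of such a graph. Which node properties actually correlate with noisy content is a finding of the paper itself and is not reproduced here.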


noise filtering; bipartite graphs
