Donghui Feng, Jihie Kim, Erin Shaw, Eduard Hovy
Online discussion boards are a popular form of web-based computer-mediated communication, especially in the areas of distributed education and customer support. Automatically analyzing discussions to understand their content would enable better information assessment and assistance. This paper describes an extensive study of the relationship between individual messages and full discussion threads. We present a new approach to classifying discussions using a Rocchio-style classifier that requires little data-labeling effort. In place of a labeled data set, we employ a coarse domain ontology, automatically induced from a canonical text in a novel way, and use it to build discussion topic profiles. We describe a new classify-by-dominance strategy for classifying discussion threads and demonstrate that, in the presence of noise, it can perform better than the standard classify-as-a-whole approach, with an error rate reduction of 16.8%. This analysis of human conversation via online discussions provides a basis for the development of future information extraction and question answering techniques.
Subjects: 13. Natural Language Processing; 1.10 Information Retrieval
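
The two thread-classification strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: topic profiles are plain term-count vectors (standing in for the ontology-derived profiles), similarity is cosine similarity, and "dominance" is read as a majority vote over per-message labels. The function names, the two topics, and the toy thread are all hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (assumed representation, no weighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(vec, profiles):
    """Rocchio-style step: pick the topic whose profile is most similar."""
    return max(profiles, key=lambda topic: cosine(vec, profiles[topic]))


def classify_as_a_whole(messages, profiles):
    """Merge every message in the thread into one vector, then classify it."""
    merged = Counter()
    for message in messages:
        merged.update(vectorize(message))
    return classify(merged, profiles)


def classify_by_dominance(messages, profiles):
    """Classify each message separately; the most frequent label wins.

    Assumption: 'dominance' means a simple majority vote over messages.
    """
    votes = Counter(classify(vectorize(m), profiles) for m in messages)
    return votes.most_common(1)[0][0]


# Hypothetical topic profiles; in the paper these come from an induced ontology.
profiles = {
    "scheduling": vectorize("process scheduling scheduler cpu priority"),
    "memory": vectorize("memory paging page virtual allocation"),
}

# Hypothetical thread: two short on-topic messages and one long noisy one.
thread = [
    "how does the scheduler pick a process",
    "priority scheduling uses the cpu queue",
    "unrelated memory paging memory page virtual memory allocation page",
]

print("as a whole:  ", classify_as_a_whole(thread, profiles))
print("by dominance:", classify_by_dominance(thread, profiles))
```

In this toy thread, the single long off-topic message outweighs the two on-topic ones when the thread is merged into one vector, while the per-message vote is unaffected by message length. This only illustrates why a dominance-style strategy might be more robust to noise; it says nothing about the reported 16.8% figure.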