Semi-Supervised Sequence Modeling with Syntactic Topic Models

Wei Li

Although there has been significant previous work on semi-supervised learning for classification, there has been relatively little in sequence modeling. This paper presents an approach that leverages recent work in manifold-learning on sequences to discover word clusters from language data, including both syntactic classes and semantic topics. From unlabeled data we form a smooth, low-dimensional feature space, where each word token is projected based on its underlying role as a function or content word. We then use this projection as additional input features to a linear-chain conditional random field trained on limited labeled training data. On standard part-of-speech tagging and Chinese word segmentation data sets we show as much as 14\% error reduction due to the unlabeled data, and also statistically-significant improvements over a related semi-supervised sequence tagging method due to Miller et al.

Content Area: 12. Machine Learning

Subjects: 12. Machine Learning and Discovery; 13. Natural Language Processing

Submitted: May 10, 2005

This page is copyrighted by AAAI. All rights reserved. Your use of this site constitutes acceptance of all of AAAI's terms and conditions and privacy policy.