Unsupervised Learning Helps Supervised Neural Word Segmentation

  • Xiaobin Wang, Alibaba Group
  • Deng Cai, The Chinese University of Hong Kong
  • Linlin Li, Alibaba Group
  • Guangwei Xu, Alibaba Group
  • Hai Zhao, Shanghai Jiao Tong University
  • Luo Si, Alibaba Group


To exploit unlabeled data for further improving Chinese word segmentation, this work makes the first attempt at incorporating unsupervised segmentation information into a neural supervised segmenter. We survey various effective strategies for leveraging unsupervised information derived from abundant unlabeled data, including extending the character embedding, augmenting the word score, and applying multi-task learning. Experiments on standard data sets show that the explored strategies indeed improve the recall of out-of-vocabulary words and thus boost segmentation accuracy. Moreover, the model enhanced by the proposed methods outperforms state-of-the-art models in the closed test and shows a promising improvement trend when adopting the three different strategies with the help of a large unlabeled data set. Our thorough empirical study verifies that the proposed approach outperforms the widely used pre-training approach in making effective use of freely available, abundant unlabeled data.

AAAI Technical Track: Natural Language Processing