Unsupervised Learning Helps Supervised Neural Word Segmentation

Xiaobin Wang; Deng Cai; Linlin Li; Guangwei Xu; Hai Zhao; Luo Si

doi:10.1609/aaai.v33i01.33017200

Authors

Xiaobin Wang Alibaba Group
Deng Cai The Chinese University of Hong Kong
Linlin Li Alibaba Group
Guangwei Xu Alibaba Group
Hai Zhao Shanghai Jiao Tong University
Luo Si Alibaba Group

Abstract

To exploit unlabeled data for further improving Chinese word segmentation, this work makes the first attempt at adding unsupervised segmentation information to a neural supervised segmenter. We examine several effective strategies for leveraging unsupervised information derived from abundant unlabeled data: extending the character embedding, augmenting the word score, and applying multi-task learning. Experiments on standard data sets show that the explored strategies indeed improve the recall rate of out-of-vocabulary words and thus boost segmentation accuracy. Moreover, the model enhanced by the proposed methods outperforms state-of-the-art models in the closed test and shows a promising improvement trend when the three strategies are adopted with the help of a large unlabeled data set. Our thorough empirical study verifies that the proposed approach outperforms the widely used pre-training approach in making effective use of freely available, abundant unlabeled data.
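As a rough illustration of the first strategy, extending the character embedding, the sketch below appends one unsupervised feature to each character vector. It is not the authors' implementation. The abstract does not say which unsupervised statistic is used, so pointwise mutual information (PMI) between adjacent characters serves here as an assumed stand-in, and the names `adjacent_pmi`, `extend_embeddings`, `embed`, and `threshold` are hypothetical.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """Estimate PMI between adjacent characters from unlabeled sentences.

    PMI is an assumed stand-in for the unsupervised segmentation
    information; the source does not name the statistic it uses.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        unigrams.update(sent)                # character counts
        bigrams.update(zip(sent, sent[1:]))  # adjacent-pair counts
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(a, b):
        # Pairs never seen together get the lowest possible score.
        if bigrams[(a, b)] == 0:
            return float("-inf")
        p_ab = bigrams[(a, b)] / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        return math.log(p_ab / (p_a * p_b))

    return pmi


def extend_embeddings(sentence, embed, pmi, threshold=0.0):
    """Append a binary cohesion feature to each character embedding.

    The feature is 1.0 when the character's PMI with the next character
    exceeds `threshold` (a hypothetical cutoff), else 0.0.
    """
    extended = []
    for i, ch in enumerate(sentence):
        has_next = i + 1 < len(sentence)
        cohesive = 1.0 if has_next and pmi(ch, sentence[i + 1]) > threshold else 0.0
        extended.append(list(embed[ch]) + [cohesive])
    return extended
```

Here `embed` is assumed to be a mapping from each character to its embedding vector. In a real segmenter the extended vectors would feed the neural model in place of the plain character embeddings, and the statistic would be estimated from a large unlabeled corpus rather than a handful of sentences.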
