Ido Dagan, Sean P. Engelson
Many corpus-based methods for natural language processing are based on supervised training, requiring expensive manual annotation of training corpora. This paper investigates reducing annotation cost by selective sampling. In this approach, the learner examines many unlabeled examples and selects for labeling only those that are most informative at each stage of training. In this way it is possible to avoid redundantly annotating examples that contribute little new information. The paper first analyzes the issues that need to be addressed when constructing a selective sampling algorithm, arguing for the attractiveness of committee-based sampling methods. We then focus on selective sampling for training probabilistic classifiers, which are commonly applied to problems in statistical natural language processing. We report experimental results of applying a specific type of committee-based sampling during training of a stochastic part-of-speech tagger, and demonstrate substantially improved learning rates over sequential training using all of the text. We are currently implementing and evaluating other variants of committee-based sampling, as discussed in the paper, in order to obtain further insight on optimal design of selective sampling methods.
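As a rough illustration of the selection step summarized above, the following minimal sketch shows one way a committee of classifiers could decide which unlabeled examples to send for annotation. It is not the specific variant evaluated in the paper: the disagreement measure (entropy of the committee's votes), the threshold value, and all names (`vote_entropy`, `select_for_labeling`, `toy_committee`) are assumptions introduced here for illustration only.

```python
import math
from collections import Counter
from typing import Callable, Hashable, Iterable, List, Sequence

# Hypothetical type: a committee member maps an example to a predicted label.
Classifier = Callable[[str], Hashable]


def vote_entropy(labels: Sequence[Hashable]) -> float:
    """Entropy (in bits) of the committee's votes; 0.0 means full agreement."""
    counts = Counter(labels)
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def select_for_labeling(
    examples: Iterable[str],
    committee: Sequence[Classifier],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Return the examples on which the committee disagrees enough to be worth annotating."""
    selected = []
    for example in examples:
        votes = [member(example) for member in committee]
        if vote_entropy(votes) > threshold:
            selected.append(example)
    return selected


# Toy committee of three hand-written "taggers" (purely illustrative).
toy_committee: List[Classifier] = [
    lambda word: "NOUN",
    lambda word: "VERB" if word.endswith("s") else "NOUN",
    lambda word: "VERB" if word == "runs" else "NOUN",
]

if __name__ == "__main__":
    print(select_for_labeling(["table", "runs", "dogs"], toy_committee))
```

In this toy setting the committee agrees unanimously on "table", so it is skipped, while "runs" and "dogs" each split the vote and are selected. In a real system the committee members would be models drawn from the learner's current state rather than fixed rules, and each newly labeled example would be used to update them before the next selection.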