Cynthia A. Thompson
Most corpus-based approaches to language learning have focused on tasks for which a sufficient amount of human-labeled training data is available. However, such data is expensive to produce, and models trained on it tend to be brittle when applied to domains that differ, even in seemingly minor ways, from the training data. We claim that these difficulties can be overcome by applying semi-supervised learning techniques. Semi-supervised techniques learn from both labeled and raw data; in our case, the latter is raw text. Several researchers have applied semi-supervised techniques to language learning (Nigam et al. 2000; Blum and Mitchell 1998; Joachims 1999; Riloff and Jones 1999), but we believe that this area is not yet well explored, and definitely not well understood. We therefore present a challenge problem for semi-supervised learning: semantic role labeling and semantic relationship annotation. Semantic role labeling was introduced by Gildea and Jurafsky (2002), and we added semantic relationship annotation in Thompson, Levy, and Manning (2003). This problem is difficult for semi-supervised techniques for three reasons. First, examples can take many possible classes (the role labels). Second, sequence learning is involved. Third, the learning scenario is plagued by sparse data. We describe the role labeling problem, our learning model and its extensibility to semi-supervised learning, and some preliminary experiments.
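To make the semi-supervised setting concrete, the sketch below shows self-training, one common semi-supervised scheme: a classifier is trained on a small labeled set, then repeatedly labels its most confident raw examples and retrains on them. This is a generic illustration, not the model described in this abstract; the toy count-based classifier, the feature lists, and the role labels (`Agent`, `Theme`) are all hypothetical.

```python
from collections import Counter, defaultdict

def train(examples):
    """Count feature occurrences per class (a toy naive-Bayes-style model)."""
    counts = defaultdict(Counter)
    for feats, label in examples:
        counts[label].update(feats)
    return counts

def predict(counts, feats):
    """Return (best_label, margin) using add-one-smoothed feature scores."""
    scores = {}
    for label, c in counts.items():
        total = sum(c.values()) + 1
        score = 1.0
        for f in feats:
            score *= (c[f] + 1) / total
        scores[label] = score
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    margin = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0.0)
    return ranked[0][0], margin

def self_train(labeled, unlabeled, rounds=3, per_round=2):
    """Self-training: repeatedly add the most confidently labeled raw
    examples to the training set and retrain."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        model = train(labeled)
        scored = [(predict(model, feats), feats) for feats in pool]
        scored.sort(key=lambda x: x[0][1], reverse=True)  # most confident first
        for (label, _), feats in scored[:per_round]:
            labeled.append((feats, label))
        pool = [feats for _, feats in scored[per_round:]]
    return train(labeled)

# Hypothetical data: two labeled role-bearing constituents plus raw ones.
labeled = [(["she", "subj"], "Agent"), (["it", "obj"], "Theme")]
unlabeled = [["he", "subj"], ["book", "obj"], ["they", "subj"]]
model = self_train(labeled, unlabeled)
```

The sparse-data difficulty noted above shows up directly here: with few labeled examples, most features are unseen, so the early confidence estimates (and hence which raw examples get pulled in) rest on very thin evidence.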