David Gabay, Ziv Ben-Eliahu, Michael Elhadad
Tagged corpora are essential for training and evaluating natural language processing tools. Constructing sufficiently large manually tagged corpora is costly, even when the annotation is shallow. This article describes a simple method for automatically creating a partially tagged corpus from Wikipedia hyperlinks. We used the method to construct a corpus of Modern Hebrew, which we have made available at www.cs.bgu.ac.il/~nlpproj. The resulting corpus contains information about the correct segmentation of 523,599 non-consecutive words in 363,090 sentences. The method can also be applied to other languages in which word segmentation is difficult to determine, such as East and South-East Asian languages.
Subjects: 13. Natural Language Processing; 1.10 Information Retrieval
Submitted: May 5, 2008
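
The abstract does not spell out how hyperlinks yield segmentation information, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that editors write a prefix particle outside the link brackets in wiki markup (for example ב[[ירושלים]]), so that the link boundary marks the split between prefix and word. The names `LINK_RE` and `segmented_tokens` are hypothetical and introduced here for illustration.

```python
import re

# Assumption: a run of letters glued directly to the front of a [[link]]
# is a prefix whose boundary with the linked word is a segmentation point.
# [^\W\d_] matches Unicode letters only (no digits, underscore, punctuation).
LINK_RE = re.compile(r"(?P<prefix>[^\W\d_]*)\[\[(?P<body>[^\[\]]+)\]\]")


def segmented_tokens(wikitext):
    """Return (prefix, word) pairs for links that start in the middle of a token."""
    pairs = []
    for match in LINK_RE.finditer(wikitext):
        prefix = match.group("prefix")
        # For [[target|display]], the display text is what the reader sees.
        display = match.group("body").split("|")[-1]
        if prefix:  # links that begin a token carry no segmentation evidence here
            pairs.append((prefix, display))
    return pairs


if __name__ == "__main__":
    sample = "הוא גר ב[[ירושלים]] ליד [[הכנסת|כנסת]]"
    print(segmented_tokens(sample))
```

On the sample sentence, only the first link is reported, because the second link starts its token and so says nothing about a prefix boundary. This matches the "partially tagged" nature of the corpus: only words that happen to be linked receive segmentation information. A real pipeline would also need to handle vowel marks, nested templates, and links whose display text differs in form from the target.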