Asif Ekbal, Sivaji Bandyopadhyay
This paper presents a modified joint source-channel model that is used to transliterate a Named Entity (NE) of the source language into the target language and vice versa. As a case study, Bengali and English have been chosen as the source and target language pair. A number of alternatives to the modified joint source-channel model have also been considered. A Bengali NE is divided into Transliteration Units (TUs) with the pattern C+M, where C represents a consonant, a vowel, or a conjunct and M represents the vowel modifier, or matra. An English NE is divided into TUs with the pattern C*V*, where C represents a consonant and V represents a vowel. The system learns mappings automatically from bilingual training sets of person and location names. Aligned transliteration units, along with their contexts, are automatically derived from these bilingual training sets to generate the collocational statistics. The system also considers linguistic features in the form of possible conjuncts and diphthongs in Bengali and their corresponding representations in English. Experimental results of the 10-fold open tests demonstrate that the modified joint source-channel model performs best. For Bengali to English transliteration, it achieves a Word Agreement Ratio (WAR) of 74.4% for person names and 72.6% for location names, and a Transliteration Unit Agreement Ratio (TUAR) of 91.7% for person names and 89.3% for location names. For back transliteration, the same model achieves a WAR of 72.3% for person names and 70.5% for location names, and a TUAR of 90.8% for person names and 87.6% for location names.
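The C*V* segmentation of an English NE described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `segment_english`, the treatment of "y" as a consonant, and the lowercasing of the input are all assumptions made for illustration only.

```python
import re

# Assumption: the vowels are a, e, i, o, u and every other Latin letter
# (including "y") counts as a consonant. The paper may treat "y" differently.
TU_PATTERN = re.compile(r"[b-df-hj-np-tv-z]*[aeiou]*")


def segment_english(name):
    """Split an English name into C*V* units: a run of zero or more
    consonants followed by a run of zero or more vowels."""
    # The pattern can match the empty string (for example at the end of
    # the name or at a non-letter), so empty matches are dropped.
    return [tu for tu in TU_PATTERN.findall(name.lower()) if tu]


if __name__ == "__main__":
    for example in ["kolkata", "rabindranath", "asif"]:
        print(example, segment_english(example))
```

Under these assumptions, "kolkata" is split into "ko", "lka", "ta". A unit may lack a vowel (a trailing consonant cluster) or lack a consonant (a name-initial vowel), which is what the pattern C*V* permits.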
Subjects: 13. Natural Language Processing
Submitted: Feb 9, 2007