Combining Categorization-based and Corpus-based Approaches for CLIR

Yiming Yang, Monica Rogati, and Bryan Kisiel, Carnegie Mellon University

Whether existing concept taxonomies can help cross-lingual information retrieval (CLIR) is an open question. This paper investigates an interlingual approach that uses MeSH categories in the medical domain to index bilingual documents and queries and to measure their relevance by category-level matching. We conducted bilingual retrieval experiments on a new corpus (Springer) of medical documents and queries in English and German. We also evaluated several high-performing corpus-based learning methods and a machine translation (MT) based approach using SYSTRAN, a commercial system with strong results on CLIR benchmarks. Our results on Springer show that the categorization-based approach significantly outperformed the MT-based approach but underperformed the corpus-based methods, because category-level indexing loses detailed information. Combining the output of categorization-based retrieval and corpus-based retrieval yielded a significant performance improvement over using either alone.
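
The two ideas in the abstract, category-level matching and combining the outputs of two retrieval systems, can be illustrated with a minimal sketch. The abstract does not say how relevance is computed or how the outputs are merged, so everything here is an assumption made for illustration: cosine similarity over category-weight vectors, min-max score normalization, linear interpolation with a weight `alpha`, and the function names `cosine`, `min_max`, and `combine`. The category labels and scores are made-up examples, not data from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors (dict: category -> weight)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def min_max(scores):
    """Rescale scores to [0, 1] so two systems become comparable (assumed step)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(cat_scores, corpus_scores, alpha=0.5):
    """Linear interpolation of normalized scores (hypothetical combination rule)."""
    cat_n, cor_n = min_max(cat_scores), min_max(corpus_scores)
    docs = set(cat_n) | set(cor_n)
    return {
        d: alpha * cat_n.get(d, 0.0) + (1.0 - alpha) * cor_n.get(d, 0.0)
        for d in docs
    }


# Illustrative data: a query and two documents, each already mapped to
# category weights by some classifier (not shown), regardless of language.
query = {"Neoplasms": 0.8, "Lung Diseases": 0.6}
docs = {
    "d1": {"Neoplasms": 1.0},
    "d2": {"Heart Diseases": 1.0},
}

# Category-level matching: the query and documents meet in category space.
cat_scores = {d: cosine(query, vec) for d, vec in docs.items()}

# Scores from a corpus-based retrieval method (made-up numbers).
corpus_scores = {"d1": 0.2, "d2": 0.4}

ranking = sorted(combine(cat_scores, corpus_scores, alpha=0.75).items(),
                 key=lambda item: item[1], reverse=True)
print(ranking)
```

Because queries and documents are compared only through shared categories, no translation step is needed in this sketch. The same property shows why fine-grained term information can be lost: two documents with identical category weights receive identical scores.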

