Probabilistic Modeling for Information Retrieval with Unsupervised Training Data

Ernest P. Chan, Santiago Garcia, Salim Roukos

We apply a well-known Bayesian probabilistic model to textual information retrieval: the classification of documents according to their relevance to a query. This model was previously used with supervised training data for a fixed query. When the only available training data are noisy and unsupervised, generated from a heuristic relevance-scoring formula, two crucial adaptations are needed: (1) severe smoothing of the models built on the training data; and (2) the addition of a prior probability to the models. We show that, with these adaptations, the probabilistic model improves on the retrieval precision of the heuristic model. The experiment was performed using the TREC-5 corpus and queries, and the evaluation of the model was submitted as an official entry (ibms96b) to TREC-5.
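As a rough illustration of the two adaptations, the following Python sketch re-ranks documents with a naive-Bayes-style log-odds score. It is not the authors' implementation, and every specific in it is an assumption made for illustration: the function names, the use of the top-k heuristically scored documents as pseudo-relevant training data, linear interpolation with a background unigram model as the smoothing method, and the use of the heuristic score itself as the prior term.

```python
import math
from collections import Counter


def unigram_counts(docs):
    """Count word occurrences over a list of tokenized documents."""
    counts = Counter(word for doc in docs for word in doc)
    return counts, sum(counts.values())


def smoothed_prob(word, counts, total, background, lam):
    """Interpolate the training estimate with a background model.

    A small lam means severe smoothing: the noisy training estimate
    receives little weight (assumed smoothing scheme).
    """
    ml = counts[word] / total if total else 0.0
    return lam * ml + (1.0 - lam) * background[word]


def rerank(docs, heuristic_scores, top_k=2, lam=0.1, prior_weight=1.0):
    """Return document indices ordered by prior plus log-likelihood ratio."""
    # Background model over the whole collection, with add-one smoothing
    # so that every word has a nonzero probability.
    bg_counts, bg_total = unigram_counts(docs)
    vocab_size = len(bg_counts)
    background = {
        w: (c + 1) / (bg_total + vocab_size) for w, c in bg_counts.items()
    }

    # Unsupervised training data: the heuristic's top-k documents are
    # treated as relevant and the rest as non-relevant (assumption).
    order = sorted(range(len(docs)), key=lambda i: -heuristic_scores[i])
    rel = [docs[i] for i in order[:top_k]]
    non = [docs[i] for i in order[top_k:]]
    rel_counts, rel_total = unigram_counts(rel)
    non_counts, non_total = unigram_counts(non)

    scores = []
    for i, doc in enumerate(docs):
        # Prior term: here simply the weighted heuristic score (assumption).
        score = prior_weight * heuristic_scores[i]
        for word in doc:
            p_rel = smoothed_prob(word, rel_counts, rel_total, background, lam)
            p_non = smoothed_prob(word, non_counts, non_total, background, lam)
            score += math.log(p_rel / p_non)
        scores.append(score)
    return sorted(range(len(docs)), key=lambda i: -scores[i])
```

In this sketch, setting `lam=0.0` removes all influence of the noisy training data, so the ranking falls back to the heuristic ordering; increasing `lam` lets the trained models shift that ordering.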
