Richard K. Belew and John Hatton
As the use of machine learning techniques in IR increases, the need for a sound empirical methodology for collecting and assessing users' opinions - "relevance feedback" - becomes critical to the evaluation of system performance. In IR the typical assessment procedure relies upon the opinion of a single individual, an "expert" in the corpus’ domain of discourse. Apart from the logistical difficulties of gathering multiple opinions, whether any one "omniscient" individual is capable of providing reliable data about the appropriate set of documents to be retrieved remains a foundational issue within IR. This paper responds to such critiques with a new methodology for collecting relevance assessments that combines evidence from multiple human judges. RAVe is a suite of software routines that allow an IR experimenter to effectively collect large numbers of relevance assessments for an arbitrary document corpus. This paper sketches our assumptions about the cognitive activity of providing relevance assessments, and the design issues involved in identifying the documents to be evaluated; allocating subjects’ time to provide the most informative assessments; and aggregating multiple users’ opinions into a binary predicate of "relevant."