Bo Tang and Julia Hodges
This paper describes a novel approach to knowledge representation, learning, and reasoning in WebDoc, a system that classifies Web documents according to the Library of Congress classification system. We argue that an automatically constructed, domain-independent knowledge base is indispensable for this task. The WebDoc system builds a knowledge base, represented as a semantic network, that contains the Library of Congress subject headings and the relationships among them. By training on Web documents that have been indexed by humans and parsed with NLP tools, WebDoc modifies the semantic network and generates rules for future index generation tasks.
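
To make the idea of a semantic network that is refined by training more concrete, the following is a minimal illustrative sketch in Python. It is not WebDoc's implementation: the class `SemanticNetwork`, its methods, the `broader_term` relation label, the co-occurrence counting scheme, and the sample headings and terms are all assumptions introduced here for illustration only.

```python
from collections import defaultdict


class SemanticNetwork:
    """Toy semantic network (hypothetical): nodes are subject headings,
    edges are typed relations between headings."""

    def __init__(self):
        # heading -> relation type -> set of related headings
        self.relations = defaultdict(lambda: defaultdict(set))
        # (term, heading) -> count learned from human-indexed documents
        # (an assumed, simplified stand-in for learned knowledge)
        self.term_weights = defaultdict(int)

    def add_relation(self, source, relation, target):
        """Record a typed link from one heading to another."""
        self.relations[source][relation].add(target)

    def related(self, heading, relation):
        """Return the headings linked to `heading` by `relation`, sorted."""
        return sorted(self.relations[heading][relation])

    def train(self, document_terms, assigned_headings):
        """Strengthen links between a document's terms and the headings
        a human indexer assigned to it."""
        for term in document_terms:
            for heading in assigned_headings:
                self.term_weights[(term, heading)] += 1

    def suggest(self, document_terms):
        """Rank headings by accumulated term evidence, strongest first."""
        scores = defaultdict(int)
        for (term, heading), weight in self.term_weights.items():
            if term in document_terms:
                scores[heading] += weight
        return sorted(scores, key=lambda h: (-scores[h], h))


# Illustrative data only; not taken from the WebDoc knowledge base.
net = SemanticNetwork()
net.add_relation("Machine learning", "broader_term", "Artificial intelligence")
net.train({"neural", "network", "training"}, ["Machine learning"])
net.train({"network", "protocol"}, ["Computer networks"])
suggestions = net.suggest({"neural", "network"})
```

In this sketch, training only accumulates term-to-heading counts. The system described here goes further, modifying the network itself and generating rules, which this toy example does not attempt to reproduce.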