Stephen Soderland, Bhushan Mandhani
Recent research in open-ended information extraction from text finds relational triples of the form (arg1, relation phrase, arg2), where the relation phrase is a text string that expresses a relation between two arbitrary noun phrases. While such a relational triple is a good first step, much further work is required to turn a textual relation into a logical form that supports inference. The strings from arg1 and arg2 must be normalized, disambiguated, and mapped to a formal taxonomy. The relation phrase must likewise be normalized and mapped to a clearly defined logical relation. Some relation phrases can be mapped to a set of pre-defined relations such as Part-Of and Causes. We focus instead on arbitrary relation phrases that are discovered from text. For this, we need to automatically merge synonymous relations and discover meta-properties such as entailment. Ultimately, we want the coverage of a bottom-up approach together with the rich set of axioms associated with a top-down approach. We have begun exploratory work in "ontologizing" the output of TextRunner, an open information extraction system that finds arbitrary relational triples from text. Our test domain is 2.5 million web pages on health and nutrition, which yields relational triples such as (orange, contains, vitamin C) and (fruits, are rich in, antioxidants). We automatically disambiguate the strings arg1 and arg2, mapping them to WordNet synsets. We also learn entailments between normalized relation strings (e.g. "be rich in" entails "contain"). This enhanced ontology enables reasoning about relationships that are not seen in the corpus but can be inferred by inheritance and entailment. Further, we define ontology-based relationships between the extracted triples themselves, and experimentally show that these can be used to significantly improve probability estimation for the triples.
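The idea of inferring unseen relationships by inheritance and entailment can be illustrated with a minimal sketch. This is not the authors' implementation: the `HYPERNYM` and `ENTAILS` tables, the `ancestors` and `holds` helpers, and the sample facts are all hypothetical toy stand-ins for WordNet synsets and learned entailments, and the sketch assumes that a property stated for a general concept is inherited by its hyponyms.

```python
# Toy data only: hypothetical stand-ins for WordNet hypernym links
# and for entailments learned between normalized relation strings.
HYPERNYM = {"orange": "citrus fruit", "citrus fruit": "fruit"}  # child -> parent
ENTAILS = {"be rich in": {"contain"}}  # relation -> relations it entails


def ancestors(concept):
    """Yield each hypernym of `concept`, nearest first."""
    while concept in HYPERNYM:
        concept = HYPERNYM[concept]
        yield concept


def holds(triple, facts):
    """Return True if `triple` is stated in `facts` or follows from a stated
    fact by inheritance on arg1 and/or one step of relation entailment."""
    arg1, rel, arg2 = triple
    # arg1 may inherit a fact stated about any of its hypernyms (assumption).
    subjects = [arg1, *ancestors(arg1)]
    # `rel` is supported by itself or by any relation that entails it.
    relations = [rel] + [r for r, entailed in ENTAILS.items() if rel in entailed]
    return any((s, r, arg2) in facts for s in subjects for r in relations)


# A single extracted (and normalized) triple, used as the only known fact.
facts = {("fruit", "be rich in", "antioxidants")}
```

Under these toy assumptions, `holds(("orange", "contain", "antioxidants"), facts)` is true even though that triple never appears in `facts`: "orange" inherits from "fruit", and "be rich in" entails "contain". Entailment is applied in one direction only, so a stated "contain" fact would not support "be rich in".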
Subjects: 13. Natural Language Processing; 11.2 Ontologies
Submitted: Jan 26, 2007