T-REX: A Domain-Independent System for Automated Cultural Information Extraction

Massimiliano Albanese and V.S. Subrahmanian

RDF (Resource Description Framework) is a web standard defined by the World Wide Web Consortium. In RDF, we can define schemas of interest. For example, we can define a schema about tribes on the Pakistan-Afghanistan borderland, or a schema about violent events. An RDF instance is a set of facts that are compatible with the schema. The principal contribution of this paper is the development of a scalable system called T-REX (short for “The RDF EXtractor”) that extracts instances of a user-specified schema, independently of the domain from which the data is drawn. Using T-REX, we have successfully extracted information about various aspects of approximately 20 tribes living along the Pakistan-Afghanistan border. Moreover, we have used T-REX to successfully extract occurrences of violent events from a set of 80 news sites in approximately 50 countries. T-REX scales well: it has processed approximately 45,000 web pages per day for the last 6 months.
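
As a rough illustration of the schema and instance notions described above, the following Python sketch models facts as (subject, property, object) triples and checks them against a toy schema. All names here (SCHEMA, TYPES, locatedIn, speaks, TribeA, and so on) are hypothetical and chosen only for illustration; the sketch assumes a simplified reading of "compatible" in which each property restricts the classes of its subject and object, and it does not reflect how T-REX itself represents or extracts data.

```python
# Hypothetical schema: each property maps to the (subject class, object class)
# pair it is allowed to connect. Real RDF would use URIs and RDF Schema terms.
SCHEMA = {
    "locatedIn": ("Tribe", "Region"),
    "speaks": ("Tribe", "Language"),
}

# Hypothetical class membership for the resources mentioned in the facts.
TYPES = {
    "TribeA": "Tribe",
    "RegionX": "Region",
    "LanguageY": "Language",
}


def is_compatible(triple, schema=SCHEMA, types=TYPES):
    """Return True if one (subject, property, object) fact fits the schema."""
    subject, prop, obj = triple
    if prop not in schema:
        return False
    subject_class, object_class = schema[prop]
    return types.get(subject) == subject_class and types.get(obj) == object_class


def is_instance(facts, schema=SCHEMA, types=TYPES):
    """An instance, in this simplified sense, is a set of facts that all fit."""
    return all(is_compatible(fact, schema, types) for fact in facts)


# A small set of facts that conforms to the toy schema.
facts = {
    ("TribeA", "locatedIn", "RegionX"),
    ("TribeA", "speaks", "LanguageY"),
}
```

Under these assumptions, `is_instance(facts)` holds, while a fact such as `("RegionX", "speaks", "TribeA")` is rejected because its subject and object classes do not match the schema.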

Subjects: 7.1 Multi-Agent Systems; 12.1 Reinforcement Learning

Submitted: Jun 20, 2008

This page is copyrighted by AAAI. All rights reserved.