Ke Wang and Huiqing Liu
To formulate a meaningful query on semistructured data, such as data on the Web, that matches some of the source’s structure, we first need to discover something about how the information is represented in the source. This task is referred to as schema discovery and was recently considered for a single object. In the case of multiple objects, the task of schema discovery is to identify the typical structuring information of those objects as a whole. We motivate schema discovery in this general setting and propose a framework and an algorithm for it. We apply the framework to a real Web database, the Internet Movies Database, to discover the typical schema of the most-voted movies.