Scott B. Huffman, Catherine Baudin, and Robert A. Nado
A semistructured information space consists of multiple collections of textual documents containing fielded or tagged sections. The space can be highly heterogeneous, because each collection has its own schema, and there are no enforced keys or formats for data items across collections. Thus, structured methods like SQL cannot be easily employed, and users often must make do with only full-text search. In this paper, we describe an intermediate approach that provides structured querying for particular types of entities, such as companies, people, and skills. Entity-based retrieval is enabled by normalizing entity references in a heuristic, type-dependent manner. To organize and filter search results, entities are categorized as playing particular roles (e.g., company as client, as vendor, etc.) in particular collection types (directories, client engagement records, etc.). The approach can be used to retrieve documents and also to construct entity profiles, which are summaries of commonly sought information about an entity based on the documents' content. The approach requires only a modest amount of meta-information about the source collections, much of which is derived automatically. On a set of typical user queries in a large corporate information space, the approach produces a dramatic improvement in retrieval quality over knowledge-free methods like full-text search.
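To make the idea of type-dependent normalization and role-based categorization concrete, the following is a minimal Python sketch, not the system described in the paper. The suffix list, the name-handling rules, the `FIELD_ROLES` table, and all function and field names are hypothetical assumptions chosen for illustration; the actual heuristics and meta-information are not specified in this abstract.

```python
import re
from collections import defaultdict

# Hypothetical suffix list; the paper's actual company heuristics are not given here.
COMPANY_SUFFIXES = {"inc", "corp", "corporation", "co", "company", "ltd", "llc"}


def normalize_company(name):
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while tokens and tokens[-1] in COMPANY_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def normalize_person(name):
    """Turn 'Last, First M.' into 'first last' and drop middle names or initials."""
    name = name.strip()
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    if len(tokens) > 2:
        tokens = [tokens[0], tokens[-1]]
    return " ".join(tokens)


# One normalizer per entity type (the "type-dependent" part).
NORMALIZERS = {"company": normalize_company, "person": normalize_person}


def normalize(entity_type, text):
    return NORMALIZERS[entity_type](text)


# Assumed meta-information: (collection type, field) -> (entity type, role).
FIELD_ROLES = {
    ("engagement", "client"): ("company", "client"),
    ("engagement", "vendor"): ("company", "vendor"),
    ("directory", "name"): ("person", "employee"),
}


def build_index(documents):
    """Map each normalized entity to the (role, document id) pairs mentioning it."""
    index = defaultdict(list)
    for doc in documents:
        for field, value in doc["fields"].items():
            mapping = FIELD_ROLES.get((doc["collection_type"], field))
            if mapping is None:
                continue  # field carries no known entity reference
            entity_type, role = mapping
            key = (entity_type, normalize(entity_type, value))
            index[key].append((role, doc["id"]))
    return index


def lookup(index, entity_type, text):
    """Return (role, document id) pairs for an entity query."""
    return index.get((entity_type, normalize(entity_type, text)), [])
```

In this sketch, differently formatted references such as "Acme Corp." and "ACME Corporation, Inc." map to the same key, so a query for either one retrieves documents from both, each labeled with the role the entity plays in that document. Grouping results by role is what would allow filtering (for example, only documents where the company appears as a client).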