Buhwan Jeong, Boonserm Kulvatunyou, Nenad Ivezic, Hyunbo Cho, and Albert Jones
Ideally, e-Business application interfaces would be built from highly reusable specifications of business document standards. Since many of these specifications are poorly understood, users often create new ones or customize existing ones every time a new integration problem arises. Consequently, even though there is potential for reuse, the lack of a component discovery tool means that the cost of reuse is still prohibitively high. In this paper, we explore the potential of using similarity metrics to discover standard XML Schema documents. Our goal is to enhance reuse of XML Schema document/component standards in new integration contexts through the discovery process. We are motivated by the increasing access to application interface specifications expressed in the form of XML Schema. These specifications are created to facilitate the exchange of business documents among software applications. Reuse can reduce both the proliferation of standards and the interoperability costs. To demonstrate these potential benefits, we propose and position our research based on an experimental scenario and a novel evaluation approach to qualify alternative similarity metrics for schema discovery. The edge equality in the evaluation method provides a conservative quality measure. We review a number of fundamental approaches to developing similarity metrics, and we organize these metrics into lexical, structural, and logical categories. For each of the metrics, we discuss its relevance and potential issues in its application to the XML Schema discovery task. We conclude that each of the similarity measures has its own strengths and weaknesses, and that each is expected to yield different results in different search situations. When these measures are applied to e-Business standards, it is therefore important to use a schema discovery engine that can assign appropriate weights to the different similarity measures as the search conditions change. This is a subject of our future experimental work.