Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu, University of Munich, Germany
Both the number and the size of spatial databases are rapidly growing because of the large amount of data obtained from satellite images, X-ray crystallography or other scientific equipment. Therefore, automated knowledge discovery becomes more and more important in spatial databases. So far, most of the methods for knowledge discovery in databases (KDD) have been based on relational database systems. In this paper, we address the task of class identification in spatial databases using clustering techniques. We present an interface to the database management system (DBMS), which is crucial for the efficiency of KDD on large databases. This interface is based on a spatial access method, the R*-tree. It clusters the objects according to their spatial neighborhood and supports efficient processing of spatial queries. Furthermore, we propose a method for spatial data sampling as part of the focusing component, significantly reducing the number of objects to be clustered. Thus, we achieve a considerable speed-up for clustering in large databases. We have applied the proposed techniques to real data from a large protein database used for predicting protein-protein docking. A performance evaluation on this database indicates that clustering on large spatial databases can be performed both efficiently and effectively using our approach.