George H. John (IBM Almaden Research Center) and Brian Lent (Stanford University)
When mining large databases, data extraction and the interface between the database and the data mining algorithm become important issues. Rather than giving a mining algorithm full access to a database (by extracting to a flat file or other directly accessible data structure), we propose the SQL Interface Protocol (SIP), a framework for interaction between a mining algorithm and a database. The data continues to reside entirely within the database management system (DBMS), but the query interface to the database gives the data mining algorithm sufficient information to discover the same patterns it would have found with direct access to the data. This model of interaction brings several advantages; for example, a mining algorithm can be parallelized automatically simply by using a parallelized DBMS to answer its queries. We show how two families of mining algorithms may be implemented as "SIPpers," and we discuss related work in databases that should further enhance performance in the future.
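As a rough illustration of the interaction described above, the following sketch shows a mining routine that never reads raw rows and instead asks the DBMS for aggregate counts through SQL. This is only a minimal sketch under stated assumptions: the `weather` table, its columns, the helper name `class_counts`, the use of SQLite, and the choice of a grouped count query are all hypothetical and are not taken from the SIP definition itself.

```python
import sqlite3


def class_counts(conn, table, attribute, label):
    """Ask the DBMS for counts of each (attribute value, class label) pair.

    Hypothetical helper: the data stays in the DBMS and only the
    aggregated counts are returned to the mining code.
    """
    # Identifiers are interpolated directly; this sketch assumes they
    # come from trusted code rather than from user input.
    query = (
        f"SELECT {attribute}, {label}, COUNT(*) "
        f"FROM {table} GROUP BY {attribute}, {label}"
    )
    return {(value, cls): n for value, cls, n in conn.execute(query)}


# Toy in-memory database standing in for a large DBMS (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (outlook TEXT, play TEXT)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [
        ("sunny", "no"), ("sunny", "no"), ("sunny", "yes"),
        ("rainy", "yes"), ("rainy", "yes"), ("rainy", "no"),
    ],
)

# The mining code sees only these counts, which is enough to estimate
# quantities such as the conditional frequency of each class label.
counts = class_counts(conn, "weather", "outlook", "play")
```

Because the counting is delegated to the query engine, swapping in a parallel DBMS would, in principle, speed up this step without any change to the mining code.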