Tomasz Imielinski, Aashu Virmani, Amin Abdulghani
The main objective of DataMine is to provide an application development interface for building knowledge discovery applications on top of large databases. Current database systems were designed mainly to support business applications. The success of SQL rests on a small number of primitives that are sufficient to support the vast majority of applications today. Unfortunately, these primitives are not enough to capture the emerging family of applications dealing with so-called rule and knowledge discovery. The goal of DataMine is to take the next step in the development of DBMS and provide much-needed support for rule discovery applications.

A typical knowledge discovery application starts with rule discovery, but rules are not necessarily the end product. Examples of such tasks include:

1) Finding the "best" candidates for a marketing promotion package from a large population stored in the database; for example, the best candidates for a certain type of insurance may be those who frequent health clubs and are under 40.
2) Finding any strong rules relating age, disease, and residence area, such as "40% of heart disease cases in NJ occur in patients older than 50". (Rules are statements of the form "if condition then consequent".)
3) Finding the most distinctive features of NJ heart patients, as opposed to heart patients in other states.

Finding rules is only the first step in a knowledge discovery application. Typically, a user wants to embed the information obtained from the rules in a larger program. For instance, in target marketing applications a company may have a fixed promotion budget and can offer only a limited number of promotions. A promotion mailing application must rank the candidates for mailing and go "down the list" of the most likely candidates until all promotion offerings are taken.
To accomplish such a task we need an API for knowledge discovery applications that is integrated both with a programming language (such as C) and with a database query language (such as SQL). No commercial system or research prototype today offers such an integrated API. Most current systems offer "stand-alone" features based on tree classifiers, neural nets, or meta-pattern generators. Such systems cannot be embedded into a larger application, and each typically offers just one knowledge discovery feature. The situation is thus very similar to that of DBMS in the early sixties, when each application had to be built from scratch, without the benefit of the dedicated database primitives later provided by SQL and relational database APIs.

The objective of DataMine is to fill this gap and bring database support for knowledge discovery applications to the same level that exists today for business applications. What we offer and plan to offer can be summarized as follows:

1) An extension of SQL, called M-SQL, to generate and selectively retrieve sets of rules from a large database.
2) An embedding of M-SQL in a general host language (in the same way that SQL is embedded in C) to provide an API for Knowledge and Data Discovery applications.

Thus, just as SQL does, we support two basic modes: free-form querying and embedded querying. Free-form querying allows the user to perform interactive and exploratory data analysis, while embedded querying lets applications discover rules and then use those rules in further computations.
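
The promotion mailing example shows what it means for an application to use rules in further computation. The following is only a minimal sketch in plain Python, not M-SQL, whose syntax is not shown here. The names `Rule`, `measure_confidence`, `rank_candidates`, and `mail_promotions`, the field names, and the sample records are all hypothetical and invented for illustration. The sketch assumes that a rule's strength is its confidence, that is, the fraction of records matching the condition that also match the consequent, and that a candidate is scored by the strongest rule whose condition it satisfies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class Rule:
    """A hypothetical 'if condition then consequent' rule."""
    name: str
    condition: Callable[[Record], bool]
    consequent: Callable[[Record], bool]
    confidence: float = 0.0


def measure_confidence(rule: Rule, records: List[Record]) -> float:
    """Fraction of records matching the condition that also match the consequent."""
    matching = [r for r in records if rule.condition(r)]
    if not matching:
        return 0.0
    return sum(1 for r in matching if rule.consequent(r)) / len(matching)


def rank_candidates(candidates: List[Record], rules: List[Rule]) -> List[Record]:
    """Order candidates by the best confidence among the rules they satisfy."""
    def score(candidate: Record) -> float:
        return max(
            (rule.confidence for rule in rules if rule.condition(candidate)),
            default=0.0,
        )
    # sorted() is stable, so equally scored candidates keep their input order.
    return sorted(candidates, key=score, reverse=True)


def mail_promotions(ranked: List[Record], budget: int) -> List[Record]:
    """Go 'down the list' until the fixed number of promotions is used up."""
    return ranked[:budget]


# Invented sample data: past customers and whether they bought the insurance.
history = [
    {"age": 32, "health_club": True, "bought": True},
    {"age": 35, "health_club": True, "bought": True},
    {"age": 38, "health_club": True, "bought": False},
    {"age": 29, "health_club": True, "bought": True},
    {"age": 55, "health_club": True, "bought": False},
    {"age": 61, "health_club": True, "bought": True},
    {"age": 30, "health_club": False, "bought": False},
    {"age": 45, "health_club": False, "bought": False},
]

rules = [
    Rule("club member under 40 buys",
         lambda r: r["health_club"] and r["age"] < 40,
         lambda r: r["bought"]),
    Rule("club member 40 or older buys",
         lambda r: r["health_club"] and r["age"] >= 40,
         lambda r: r["bought"]),
]
for rule in rules:
    rule.confidence = measure_confidence(rule, history)

candidates = [
    {"name": "Ann", "age": 52, "health_club": True},
    {"name": "Bob", "age": 31, "health_club": True},
    {"name": "Cy", "age": 44, "health_club": False},
    {"name": "Dee", "age": 36, "health_club": True},
]

mailing_list = mail_promotions(rank_candidates(candidates, rules), budget=2)
print([c["name"] for c in mailing_list])  # ['Bob', 'Dee']
```

In this sketch the rules are computed in application code. In the setting described above, rule generation and retrieval would instead be delegated to the database, and the host program would only rank candidates and allocate the budget.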