Foster John Provost and Daniel N. Hennessy
Machine learning methods are becoming accepted as additions to the biologist’s data-analysis tool kit. However, scaling these techniques up to large data sets, such as those in biological and medical domains, is problematic in terms of both the required computational search effort and required memory (and the detrimental effects of excessive swapping). Our approach to tackling the problem of scaling up to large datasets is to take advantage of the ubiquitous workstation networks that are generally available in scientific and engineering environments. This paper introduces the notion of the invariant-partitioning property--that for certain evaluation criteria it is possible to partition a data set across multiple processorsuch that rule that is satisfactory over the entire data set will also be satisfactory on at least one subset. In addition, by taking advantage of cooperation through interprocess communication, it is possible to build distributed learning algorithms such that rules that are satisfactory over the entire data set will be learned. We describe a distributed learning system, CorPRL, that takes advantage of the invariant-partitioning property to learn from very large data sets, and present results demonstrating CorPRL’s effectiveness in analyzing data from two databases.