Automated Discovery of Functional Components of Proteins from Amino-Acid Sequences Based on Rough Sets and Change of Representation

Shusaku Tsumoto, Hiroshi Tanaka

One of the most important problems in rule induction is how to estimate which method is best suited to a given application domain. A method that is useful in some domains may not be useful in others, so choosing among the available methods is very difficult. To address this problem, we introduce multiple testing based on recursive iteration of resampling methods for rule induction (MULT-RECITE-R). The method consists of four procedures, organized into an inner loop and an outer loop. First, the original training samples (S0) are randomly split into new training samples (S1) and test samples (T1) using a resampling scheme. Second, S1 is split again into training samples (S2) and test samples (T2) using the same resampling scheme; rule induction methods are applied to S2, and predefined metrics are calculated on T2. This second procedure, the inner loop, is repeated 10,000 times. Third, rule induction methods are applied to S1, and the metrics calculated on T1 are compared with those calculated on T2. If the metrics derived from T2 predict those derived from T1, we count it as a success. The second and third procedures, which together form the outer loop, are iterated 10,000 times. Fourth, the overall results are interpreted, and the best method is selected if the resampling scheme performs well. To evaluate this system, we apply MULT-RECITE-R to three UCI databases. The results show that this method statistically gives the best selection of estimation methods.
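The nested procedure described above can be sketched in Python. This is only an illustrative sketch under several assumptions that do not come from the abstract: the resampling scheme is a simple random hold-out split, the metric is accuracy, the two toy induction methods (`majority_class` and `one_rule`) are hypothetical stand-ins for real rule induction methods, and a "success" is taken to mean that the method ranked best by mean accuracy on T2 is also the best on T1. The loop counts are kept small for readability.

```python
import random
from collections import Counter

def split(samples, rng, test_fraction=0.5):
    """Random hold-out split into (train, test); an assumed resampling scheme."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, test):
    """Fraction of test samples (x, y) for which model(x) equals y."""
    return sum(model(x) == y for x, y in test) / len(test)

def majority_class(train):
    """Toy method: always predict the most frequent label in the training set."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def one_rule(train):
    """Toy method: predict the most frequent label seen for each value of x."""
    by_value = {}
    for x, y in train:
        by_value.setdefault(x, Counter())[y] += 1
    table = {x: c.most_common(1)[0][0] for x, c in by_value.items()}
    default = majority_class(train)
    return lambda x: table.get(x, default(x))

def mult_recite_r(s0, methods, outer=20, inner=20, seed=0):
    """Return the fraction of outer iterations counted as a success."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(outer):
        s1, t1 = split(s0, rng)                      # procedure 1: S0 -> S1, T1
        inner_scores = {name: 0.0 for name in methods}
        for _ in range(inner):                       # procedure 2: inner loop
            s2, t2 = split(s1, rng)                  # S1 -> S2, T2
            for name, induce in methods.items():
                inner_scores[name] += accuracy(induce(s2), t2) / inner
        # procedure 3: induce on S1, evaluate on T1, compare with T2 estimates
        outer_scores = {name: accuracy(induce(s1), t1)
                        for name, induce in methods.items()}
        predicted_best = max(inner_scores, key=inner_scores.get)
        actual_best = max(outer_scores, key=outer_scores.get)
        successes += predicted_best == actual_best   # assumed success criterion
    return successes / outer                         # input to procedure 4

# Synthetic data (not from the paper): the label depends only on x.
data = [(i % 4, i % 4 >= 2) for i in range(40)]
methods = {"majority_class": majority_class, "one_rule": one_rule}
rate = mult_recite_r(data, methods)
```

Here `rate` is the proportion of outer-loop iterations in which the inner-loop estimate agreed with the held-out result. In the fourth procedure, a value like this would be used to judge whether the resampling scheme is reliable enough to base the choice of method on.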

This page is copyrighted by AAAI. All rights reserved. Your use of this site constitutes acceptance of all of AAAI's terms and conditions and privacy policy.