Marc Sebban and Jean-Christophe Janodet
With the growth of modern databases, noise tolerance has become one of the most studied topics in machine learning. Many algorithms have been proposed for dealing with noisy data in the case of numerical instances, either by filtering it out in a preprocessing step or by handling it during induction. However, the problem remains wide open when one learns from unbounded symbolic sequences, which is the aim of grammatical inference. In this paper, we propose a statistical approach for dealing with noisy data during the inference of automata by the state-merging algorithm RPNI. Our approach is based on a proportion comparison test, which relaxes the merging rule of RPNI without endangering the generalization error. Beyond this framework, we provide some useful theoretical properties of the behavior of our new version of RPNI, called RPNI*. Finally, we describe a large comparative study on several datasets.