Xingquan Zhu, Xindong Wu, and Ying Yang
Given a noisy dataset, locating erroneous instances and attributes, and ranking suspicious instances by their impact on system performance, is an interesting and important research problem. In this paper, we propose an Error Detection and Impact-sensitive instance Ranking (EDIR) mechanism to address it. Given a noisy dataset D, we first train a benchmark classifier T from D. The instances that cannot be effectively classified by T are treated as suspicious and forwarded to a subset S. For each attribute Ai, we swap the roles of Ai and the class label C, so that Ai becomes the prediction target, and train a classifier APi for Ai. Given an instance Ik in S, we use APi and the benchmark classifier T to locate the erroneous value of each attribute Ai. To rank the instances in S quantitatively, we define an impact measure based on the Information-gain Ratio (IR). We calculate IRi between attribute Ai and C, and use IRi as the impact-sensitive weight of Ai. The sum of the impact-sensitive weights of all located erroneous attributes of Ik gives its total impact value. Experimental results demonstrate the effectiveness of our strategies.
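The ranking step described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the abstract: attributes are categorical, IR is computed in the usual C4.5 style (information gain divided by split information), and the set of erroneous attributes for each suspicious instance has already been located by some earlier step and is simply passed in. The function names (`entropy`, `gain_ratio`, `rank_by_impact`) and the toy data are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(attr_values, labels):
    """Gain ratio between one categorical attribute and the class label.

    Assumed definition: information gain divided by split information.
    Returns 0.0 when the attribute takes a single value.
    """
    n = len(labels)
    groups = {}
    for a, c in zip(attr_values, labels):
        groups.setdefault(a, []).append(c)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(attr_values)
    return gain / split_info if split_info > 0 else 0.0


def rank_by_impact(rows, labels, located_errors):
    """Rank suspicious instances by total impact value.

    rows: list of tuples of attribute values (one tuple per instance).
    labels: class label of each instance.
    located_errors: dict mapping a suspicious instance index k to the set
        of attribute indices flagged as erroneous for that instance
        (assumed to come from a separate error-location step).
    Returns (instance index, impact value) pairs, highest impact first.
    """
    n_attrs = len(rows[0])
    # IR_i serves as the impact-sensitive weight of attribute A_i.
    weights = [gain_ratio([r[i] for r in rows], labels) for i in range(n_attrs)]
    scores = {k: sum(weights[i] for i in attrs) for k, attrs in located_errors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: attribute 0 determines the class, attribute 1 is unrelated to it.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["p", "p", "n", "n"]
ranking = rank_by_impact(rows, labels, {0: {1}, 2: {0}})
print(ranking)  # [(2, 1.0), (0, 0.0)]
```

In this toy case, instance 2 is ranked first because its flagged attribute carries all the class information, whereas the attribute flagged for instance 0 carries none.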