A. Ketterlin, P. Gançarski, and J. J. Korczak, LSIIT, Université Louis Pasteur, France
KDD deals with existing data, which are available in all scientific and applied domains. However, some domains with comprehensive and conclusive data have severe data security problems. To exclude the risk of reidentifying individual cases, e.g. persons or companies, access to these data is rigidly restricted, and KDD applications are often not allowed at all. In this paper, we discuss data privacy issues based on our experience with some applications of the discovery system Explora and other data analysis approaches. First, we present some example applications, classified along two dimensions that are important for the privacy discussion. Then we treat the reidentification risk and discuss anonymization methods to overcome these problems. Aggregation and synthetization methods are discussed in more detail. There is a tradeoff between reducing the reidentification risk and preserving the statistical content of the data. For some main KDD patterns, we analyse how far the statistical content of anonymized data is still sufficient. In principle, KDD needs aggregate events. Since the event space of a dataset is very large, a static precomputation of all possible events is often not viable. We propose an architectural solution: a modular KDD system that includes a separate data server, which also handles data security requirements and ensures that only dynamically aggregated data leave the server to be analysed by the discovery modules of the KDD system. Finally, some other data privacy aspects are addressed.
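The data server idea can be illustrated with a minimal sketch. It is not the Explora architecture itself: the class name `AggregateDataServer`, the `count` method, and the minimum cell size rule used to suppress small aggregates are all assumptions made for this illustration. The only point taken from the text is that the microdata stay on the server and only aggregates computed on request are released.

```python
from typing import Any, Dict, List, Optional


class AggregateDataServer:
    """Hypothetical data server: keeps microdata private, releases only aggregates."""

    def __init__(self, records: List[Dict[str, Any]], min_cell_size: int = 3):
        self._records = list(records)        # raw microdata never leave the server
        self._min_cell_size = min_cell_size  # assumed suppression threshold

    def count(self, **conditions: Any) -> Optional[int]:
        """Count records matching all conditions; return None if the cell is too small."""
        n = sum(
            all(rec.get(attr) == value for attr, value in conditions.items())
            for rec in self._records
        )
        # Small cells are suppressed because they could point to individual cases.
        return n if n >= self._min_cell_size else None


# Illustrative, made-up records; a discovery module would only ever see the counts.
records = [
    {"region": "north", "sector": "retail"},
    {"region": "north", "sector": "retail"},
    {"region": "north", "sector": "retail"},
    {"region": "south", "sector": "mining"},
]
server = AggregateDataServer(records, min_cell_size=3)
large_cell = server.count(region="north", sector="retail")  # aggregate is released
small_cell = server.count(region="south")                   # suppressed (single case)
```

Because the aggregate is computed at query time, no static table of all possible events has to be precomputed. A real server would need further safeguards, for example against combining several released counts to isolate a single case.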