AAAI Publications, Twenty-Third IAAI Conference

A Machine Learning Based System for Semi-Automatically Redacting Documents
Chad Cumby, Rayid Ghani

Last modified: 2011-08-04


Redacting text documents has traditionally been a mostly manual activity, making it expensive and prone to disclosure risks. This paper describes a semi-automated system to ensure a specified level of privacy in text data sets. Recent work has attempted to quantify the likelihood of privacy breaches for text data. We build on these notions to obstruct such breaches, framing redaction as a multi-class classification problem. Our system gives users fine-grained control over the level of privacy needed to obscure sensitive concepts present in the data. Additionally, our system is designed to respect a user-defined utility metric on the data (such as disclosure of a particular concept), which our methods try to maximize while anonymizing. We describe our redaction framework and algorithms, as well as a prototype tool built into Microsoft Word that allows enterprise users to redact documents and obscure client-specific information before sharing them internally. In addition, we present an experimental evaluation on publicly available data sets that demonstrates the effectiveness of our approach against both automated attackers and human subjects. The results show that we are able to preserve the utility of a text corpus while reducing the disclosure risk of the sensitive concept.
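
To make the classification framing concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a hypothetical table of per-word evidence scores for each sensitive class (`weights`), treats an attacker as picking the highest-scoring class, and greedily masks the words that most separate the sensitive class from its strongest competitor. The `protected` set is a crude stand-in for a utility constraint: words in it are never masked. All names, scores, and the stopping rule are assumptions made for illustration.

```python
def class_scores(words, weights):
    """Sum the (hypothetical) per-word evidence for every class."""
    return {c: sum(table.get(w, 0.0) for w in words)
            for c, table in weights.items()}


def redact(words, weights, sensitive, protected=frozenset(), mask="[REDACTED]"):
    """Greedily mask words until `sensitive` is no longer the top class.

    Assumes `weights` holds at least one class besides `sensitive`.
    Returns the (possibly partially) redacted word list; if only protected
    or uninformative words remain, it stops even though the sensitive
    class may still be ranked first.
    """
    words = list(words)
    others = [c for c in weights if c != sensitive]
    while True:
        scores = class_scores(words, weights)
        rival = max(others, key=lambda c: scores[c])
        if scores[sensitive] <= scores[rival]:
            return words  # sensitive class is no longer the clear winner
        # Margin each remaining word contributes in favor of the sensitive class.
        candidates = [
            (weights[sensitive].get(w, 0.0) - weights[rival].get(w, 0.0), i)
            for i, w in enumerate(words)
            if w != mask and w not in protected
        ]
        gain, idx = max(candidates, default=(0.0, -1))
        if gain <= 0:
            return words  # nothing left to mask that would help
        words[idx] = mask


# Toy scores for two hypothetical client classes.
weights = {
    "acme": {"rocket": 2.0, "anvil": 1.5, "invoice": 0.2},
    "globex": {"reactor": 2.0, "invoice": 0.3},
}
doc = "the rocket and anvil invoice".split()
print(" ".join(redact(doc, weights, sensitive="acme")))
```

In this toy setting, marking a word as protected trades privacy for utility: the loop may stop before the sensitive class loses its lead, which is the tension a utility-aware redaction method has to manage.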
