Eric Bloedorn, Neal J. Rothleder, David DeBarr, and Lowell Rosen
In this paper, we describe research and application of relational graph mining in IRS investigations. One key scenario in this domain is the iterative construction of models for identifying tax fraud. For example, an investigator may be interested in understanding variations in schemes involving individuals sending money off-shore. This domain lends itself naturally to a graph representation with entities and their relationships represented as node and edges, respectively. There are two critical constraints in this application which make it unsuitable for existing work on relational graph mining. First, our data set is large (20 million nodes, 20 million edges, in 500GB) and includes multiple types of entities and relationships. Second, due to both the size, and the active nature of this data, it is necessary to do the mining directly against the database. Extracting and maintaining a separate data store would be impractical and costly to maintain. We focus on describing our approach to one of the core tasks in this process: allowing the investigator to mine potentially illegal activity by iteratively suggesting and refining loosely defined scenarios. Our current methodology combines three components: (1) a graph representation language which allows flexibility for inexact matches, (2) custom data structures, combined with dynamically generated sequences of SQL queries, to perform efficient mining directly against the database, and (3) exploiting cost-based optimization information to help improve our results search. A prototype solution has been deployed and used by the IRS and has resulted in both identification of criminal activity and accolades for ease of use and efficacy.