Dan Gusfield, University of California, Davis
The next phase of human genomics will involve large-scale screens of populations for significant DNA polymorphisms, notably single nucleotide polymorphisms (SNPs). Dense human SNP maps are currently under construction. However, the utility of those maps and screens will be limited by the fact that humans are diploid, and that it is presently difficult to obtain separate data on the two copies. Hence genotype (blended) SNP data will be collected, and the desired haplotype (partitioned) data must then be (partially) inferred. A particular non-deterministic inference algorithm was proposed and studied before SNP data were available, and has more recently been applied extensively to the first available SNP data. In this paper, we ask whether there is an efficient, deterministic variant of that method that optimizes the inferences obtained. Although we have shown elsewhere that the optimization problem is NP-hard, we present here a practical approach based on (integer) linear programming. The method either returns the optimal answer together with a declaration that it is optimal, or declares that it has failed to find the optimum. The approach works quickly and correctly, finding the optimum on all simulated data tested, data that are expected to be more demanding than realistic biological data.
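
To make the distinction between genotype (blended) and haplotype (partitioned) data concrete, the following minimal Python sketch enumerates every haplotype pair that could explain a single genotype. It is an illustration only, not the inference method discussed above. The site encoding is an assumption made for this example: 0 or 1 where both copies agree, and 2 where they differ. The function names `blend` and `resolutions` are hypothetical.

```python
from itertools import product

def blend(h1, h2):
    """Return the genotype observed when two haplotypes are blended.

    Assumed encoding: a site is 0 or 1 where the copies agree, 2 where they differ.
    """
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))

def resolutions(genotype):
    """List every unordered haplotype pair that blends to `genotype`."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        # No ambiguous sites: both copies equal the genotype itself.
        h = tuple(genotype)
        return [(h, h)]
    pairs = []
    # Fix the first ambiguous site of h1 to 0 so each unordered pair appears once.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, (0,) + bits):
            h1[i] = b
            h2[i] = 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

if __name__ == "__main__":
    for h1, h2 in resolutions((0, 2, 1, 2)):
        print(h1, h2)
```

Under this encoding, a genotype with k ambiguous sites (k ≥ 1) has 2^(k-1) candidate resolutions, which is why the haplotype data cannot simply be read off the genotype data and must instead be inferred.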