A. Califano and I. Rigoutsos
A key issue in managing today’s large amounts of genetic data is the availability of efficient, accurate, and selective techniques for detecting homologies (similarities) between newly discovered and already stored sequences. A common characteristic of today’s most advanced algorithms, such as FASTA, BLAST, and BLAZE is the need to scan the contents of the entire database, in order to find one or more matches. This design decision results in either excessively long search times or, as is the case of BLAST, in a sharp trade-off between the achieved accuracy and the required amount of computation. The homology detection algorithm presented in this paper, on the other hand, is based on a probabilistic indexing framework. The algorithm requires minimal access to the database in order to determine matches. This minimal requirement is achieved by using the sequences of interest to generate a highly redundant number of very descriptive tuples; these tuples are subsequently used as indices in a table look-up paradigm. In addition to the description of the algorithm, theoretical and experimental results on the sensitivity and accuracy of the suggested approach are provided. The storage and computational requirements are described and the probability of correct matches and false alarms is derived. Sensitivity and accuracy are shown to be close to those of dynamic programming techniques. A prototype system has been implemented using the described ideas. It continues the full Swiss-Prot database rel 25 (10 MR) and the genome of E. Coli (2 MR). system is currently being expanded to include the complete Genbank database. Searches can be carried out in a few seconds on a workstation-class machine.