Sequence Comparisons via Algorithmic Mutual Information

Aleksandar Milosavljevic

One of the main problems in DNA and protein sequence comparisons is to decide whether observed similarity of two sequences should be explained by their relatedness or by mere presence of some shared internal structure, e.g., shared internal tandem repeats. The standard methods that are based on statistics or classical information theory can be used to discover either internal structure or mutual sequence similarity, but cannot take into account both. Consequently, currently used methods for sequence comparison employ "masking" techniques that simply eliminate sequences that exhibit internal repetitive structure prior to sequence comparisons. The "masking" approach precludes discovery of homologous sequences of moderate or low complexity, which abound at both DNA and protein levels. As a solution to this problem, we propose a general method that is based on algorithmic information theory and minimal length encoding. We show that algorithmic mutual information factors out the sequence similarity that is due to shared internal structure and thus enables discovery of truly related sequences. We extend the recently developed algorithmic significance method (Milosavljevic, A. and Jurka, J. 1993. Discovering simple DNA sequences by the algorithmic significance method. Computer Applications in Biosciences 9(4):407-411.) to show that significance depends exponentially on algorithmic mutual information.


This page is copyrighted by AAAI. All rights reserved. Your use of this site constitutes acceptance of all of AAAI's terms and conditions and privacy policy.