Jason T. L. Wang, Bruce A. Shapiro, Dennis Shasha, Kaizhong Zhang, Chia-Yo Chang
In this paper we present a method for discovering approximately common motifs (also known as active motifs) in multiple RNA secondary structures. The secondary structures can be represented as ordered trees (i.e., the order among siblings matters). Motifs in these trees are connected subgraphs that can differ in both substitutions and deletions/insertions. The proposed method consists of two steps: (1) find candidate motifs in a small sample of the secondary structures; (2) search all of the secondary structures to determine how frequently these motifs occur (within the allowed approximation) in the secondary structures. To reduce the running time, we develop two optimization heuristics based on sampling and pattern matching techniques. Experimental results obtained by running these algorithms on both generated data and RNA secondary structures show the good performance of the algorithms. To demonstrate the utility of our algorithms, we discuss their applications to conducting the phylogenetic study of RNA sequences obtained from GenBank.