N. A. Kolchanov, O. V. Vishnevsky, V. N. Babenko, A. E. Kel, and I. N. Shindyalov
A computer tool has been developed for revealing sets of oligonucleotides that are invariant within isofunctional families of DNA (RNA) sequences and for using these sets in the functional identification of nucleotide sequences. The tool allows one to: build vocabularies of invariant oligonucleotides for families of isofunctional nucleotide sequences; assess the significance of the vocabularies; identify nucleotide sequences with the vocabularies of invariant oligonucleotides; determine the most effective identification parameters, those that minimize first and second type errors; assess the efficiency with which individual isofunctional families are identified by their oligonucleotide vocabularies; and determine the evolutionary characteristics of the families of isofunctional sequences on which vocabulary volume depends. Using this system, we analyzed a total of 322 protein-encoding gene families and built sets of invariant oligonucleotides (oligonucleotide vocabularies) that are characteristic of gene families and subfamilies. We show that nucleotide sequences belonging to these families can be identified with the sets of invariant oligonucleotides revealed. Under the most effective identification parameters, the first type error (false negative) on control (independent) data was 10-15%, and the second type error (false positive) was just 1-2 redundant sequences per sequence being examined. The volume of a vocabulary of invariant oligonucleotides is shown to depend on the percentage of variable positions in the multiple alignment within a family.
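The general idea of a vocabulary of invariant oligonucleotides can be illustrated with a minimal Python sketch. It is not the tool described above, whose algorithm is not given here. The sketch assumes that "invariant" means "present in every sequence of the family", that oligonucleotides have a fixed length `k`, and that a sequence is assigned to a family when the fraction of vocabulary oligonucleotides it contains reaches a cutoff. The function names, the toy sequences, and the default `threshold` are all hypothetical.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k oligonucleotides occurring in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_vocabulary(family: Iterable[str], k: int) -> Set[str]:
    """Assumed definition: oligonucleotides found in every family member."""
    kmer_sets = [kmers(s, k) for s in family]
    if not kmer_sets:
        return set()
    return set.intersection(*kmer_sets)


def vocabulary_score(seq: str, vocabulary: Set[str], k: int) -> float:
    """Fraction of the vocabulary present in seq (0.0 for an empty vocabulary)."""
    if not vocabulary:
        return 0.0
    return len(vocabulary & kmers(seq, k)) / len(vocabulary)


def is_member(seq: str, vocabulary: Set[str], k: int,
              threshold: float = 0.8) -> bool:
    """Hypothetical decision rule: accept if the score reaches the threshold."""
    return vocabulary_score(seq, vocabulary, k) >= threshold


# Toy family (made-up sequences sharing the core ATGGCCA).
family = ["ATGGCCAAGT", "TTATGGCCAC", "CATGGCCAGG"]
vocab = build_vocabulary(family, k=4)
print(sorted(vocab))                       # ['ATGG', 'GCCA', 'GGCC', 'TGGC']
print(is_member("GGATGGCCATT", vocab, 4))  # True
print(is_member("ACGTACGTACGT", vocab, 4)) # False
```

In this sketch, raising `threshold` trades false positives for false negatives, which loosely corresponds to the search for identification parameters that minimize both error types.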