Identification of Human Gene Functional Regions Based on Oligonucleotide Composition

V. V. Solovyev and C. B. Lawrence

Accurate recognition of coding and intron regions within large stretches of uncharacterized genomic DNA is an unsolved problem. A database of more than 4,240,791 bp of coding and 7,790,682 bp of noncoding human sequences was extracted from GenBank to develop a function for locating coding regions in anonymous sequences. Several coding measures based on oligonucleotide preferences were tested on a control set that included 1/3 of all extracted sequences. The accuracy of separating coding from noncoding regions is 87% for 9 bp oligonucleotides on 54 bp windows and 91% on 108 bp windows. For separating coding from intron regions, the accuracy is 89-90% for 8 bp oligonucleotides on 54 bp windows and up to 95% on 108 bp windows. A joint splice site prediction scheme was developed using the preferences of octanucleotides in protein coding and intron regions together with significant triplet frequencies as a function of position near splice junctions. The accuracy of the joint scheme for predicting splice site positions on the test set was about 96-97%, which exceeds the accuracy of a previously reported splice site selection method based on a more complex artificial neural network approach. A model of splicing that uses poly-G(C) rich exon flanking sequences is suggested. A remarkable difference in oligonucleotide composition between the 5' and 3' gene regions is shown and applied in a gene structure prediction system.
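The abstract does not give the exact form of the coding measures, so the following is only a minimal sketch of one common way to turn oligonucleotide preferences into a window score: a log-odds table of k-mer frequencies in coding versus noncoding training sequences, averaged over a sliding window. The function names, the add-one pseudocount, the averaging, and the 3 bp step are all illustrative assumptions and are not taken from the paper.

```python
import math
from collections import Counter


def kmer_counts(seqs, k):
    """Count all overlapping k-mers in a collection of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def log_odds_table(coding_seqs, noncoding_seqs, k, pseudo=1.0):
    """Build log(P(kmer | coding) / P(kmer | noncoding)) for observed k-mers.

    Assumption: add-`pseudo` smoothing over a 4**k vocabulary (A, C, G, T).
    Returns the table and a default score for k-mers seen in neither set.
    """
    c = kmer_counts(coding_seqs, k)
    n = kmer_counts(noncoding_seqs, k)
    c_denom = sum(c.values()) + pseudo * 4 ** k
    n_denom = sum(n.values()) + pseudo * 4 ** k
    table = {}
    for kmer in set(c) | set(n):
        p_coding = (c[kmer] + pseudo) / c_denom
        p_noncoding = (n[kmer] + pseudo) / n_denom
        table[kmer] = math.log(p_coding / p_noncoding)
    default = math.log((pseudo / c_denom) / (pseudo / n_denom))
    return table, default


def window_score(window, table, default, k):
    """Mean log-odds of the k-mers in one window; > 0 suggests coding."""
    window = window.upper()
    n = len(window) - k + 1
    if n <= 0:
        raise ValueError("window is shorter than k")
    total = sum(table.get(window[i:i + k], default) for i in range(n))
    return total / n


def scan(seq, table, default, k, window=54, step=3):
    """Score successive windows along a sequence.

    The 54 bp window matches a size quoted in the abstract; the step
    of 3 bp is an arbitrary choice for this sketch.
    """
    return [
        (start, window_score(seq[start:start + window], table, default, k))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

In this sketch a window would be labeled coding when its score exceeds a threshold chosen on held-out data, which loosely parallels the control set of 1/3 of the sequences mentioned above. With k = 8 or 9 the table has up to 4^8 or 4^9 entries, so a training set of the size described (millions of base pairs) is needed for the frequencies to be meaningful.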

This page is copyrighted by AAAI. All rights reserved.