Tetsushi Yada and Makoto Hirosawa
We have developed a hidden Markov model (HMM) to detect the protein coding regions within one megabase of contiguous sequence data from the genome of the cyanobacterium Synechocystis sp. strain PCC6803, registered in the GenBank database as eight entries. Coding regions in each database entry were detected using an HMM whose parameters were estimated from statistics of the remaining entries. The HMM has states that model the di-codons and their frequencies within coding regions, and states that model the base content of the intergenic regions. Cross-validation showed that the HMM recognized 92.1% of the coding regions assigned in the sequence annotation. In addition, it suggested 9.t potential new coding regions longer than 90 bases. The recognition accuracy calculated at the level of individual bases was 90.7% for the coding regions and 88.1% for the intergenic regions, which corresponds to a correlation coefficient for coding region recognition of 0.784. A comparison with GeneMark showed that the HMM has, on average, the same level of prediction accuracy as GeneMark. Because the HMM can be extended to utilize information such as Shine-Dalgarno (SD) sequences, its prediction accuracy can be enhanced further. A positive correlation was observed between the prediction rate of the coding regions and the G+C content at the third position of the codon. This suggests that the prediction rate of coding regions in the cyanobacterial sequence could be enhanced by refining the present HMM into one that reflects a classification of coding regions based on G+C content.
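The base-level figures quoted above can be illustrated with a minimal sketch. It rests on assumptions the abstract does not confirm: that accuracy for each class is the fraction of annotated bases of that class that were predicted correctly, and that the correlation coefficient is the one commonly used in gene-finding evaluation, computed from the per-base confusion counts. The function name `base_level_metrics` and the label alphabet (`C` for coding, `N` for intergenic) are hypothetical and chosen only for illustration.

```python
import math


def base_level_metrics(annotated, predicted):
    """Compare two equal-length label strings base by base.

    Labels (assumed for this sketch): 'C' = coding, 'N' = intergenic.
    Returns (coding accuracy, intergenic accuracy, correlation coefficient).
    """
    if len(annotated) != len(predicted):
        raise ValueError("label strings must have equal length")

    tp = fp = tn = fn = 0
    for a, p in zip(annotated, predicted):
        if a == "C" and p == "C":
            tp += 1  # coding base predicted as coding
        elif a == "N" and p == "C":
            fp += 1  # intergenic base predicted as coding
        elif a == "N" and p == "N":
            tn += 1  # intergenic base predicted as intergenic
        elif a == "C" and p == "N":
            fn += 1  # coding base predicted as intergenic
        else:
            raise ValueError("labels must be 'C' or 'N'")

    # Fraction of annotated bases of each class that were recovered.
    coding_acc = tp / (tp + fn) if (tp + fn) else 0.0
    intergenic_acc = tn / (tn + fp) if (tn + fp) else 0.0

    # Correlation coefficient over the 2x2 confusion counts
    # (assumed definition; defined as 0.0 when the denominator vanishes).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0

    return coding_acc, intergenic_acc, cc
```

For a toy pair such as `base_level_metrics("CCCCCCNNNN", "CCCCCNNNNC")`, the counts are 5 true positives, 1 false negative, 3 true negatives, and 1 false positive. The toy input is made up and does not reproduce the values reported above, which depend on the actual proportions of coding and intergenic bases in the sequence.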