Hiroshi Mamitsuka and Naoki Abe
We describe and demonstrate the effectiveness of a method for predicting protein secondary structures, β-sheet regions in particular, using a class of stochastic tree grammars as a representational language for their amino acid sequence patterns. The family of stochastic tree grammars we use, the Stochastic Ranked Node Rewriting Grammars (SRNRG), is one of the rare families of stochastic grammars that are expressive enough to capture the kind of long-distance dependencies exhibited by the sequences of β-sheet regions, and at the same time enjoy relatively efficient processing. We applied our method to real data obtained from the HSSP database, and the results are encouraging: using an SRNRG trained on data from a particular protein, our method was able to predict the location and structure of β-sheet regions in a number of different proteins whose sequences are less than 25 per cent homologous to the training sequences. The learning algorithm we use is an extension of the Inside-Outside algorithm for stochastic context-free grammars, but with a number of significant modifications. First, we restricted the grammars used to members of the linear subclass of SRNRG, and devised simpler and faster algorithms for this subclass. Second, we reduced the alphabet size (i.e. the number of amino acids) by clustering the amino acids according to their physicochemical properties, gradually over the iterations of the learning algorithm. Finally, we parallelized our parsing algorithm to run on a highly parallel computer, a 32-processor CM-5, and were able to obtain a nearly linear speed-up. We emphasize that our prediction method already goes beyond what is possible with homology-based approaches. We also stress that our method can predict the structure as well as the location of β-sheet regions, which was not possible with previous inverse protein folding methods.
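For readers unfamiliar with the Inside-Outside algorithm mentioned above, the following is a minimal sketch of its inside pass for an ordinary stochastic context-free grammar in Chomsky normal form. It is not the SRNRG extension described in this abstract; it only illustrates the dynamic program that the extension builds on. The function name `inside_probability`, the rule dictionaries, and the toy grammar are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def inside_probability(sequence, unary_rules, binary_rules, start="S"):
    """Probability that `start` derives `sequence` under an SCFG in
    Chomsky normal form (illustrative only, not the SRNRG algorithm).

    unary_rules:  {(A, terminal): prob} for rules A -> terminal
    binary_rules: {(A, B, C): prob}     for rules A -> B C
    """
    n = len(sequence)
    if n == 0:
        return 0.0

    # inside[(i, j)][A] = probability that A derives sequence[i..j] inclusive
    inside = defaultdict(lambda: defaultdict(float))

    # Base case: spans of length 1 are covered by terminal rules.
    for i, symbol in enumerate(sequence):
        for (lhs, terminal), prob in unary_rules.items():
            if terminal == symbol:
                inside[(i, i)][lhs] += prob

    # Recursive case: combine two adjacent sub-spans with a binary rule.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point: [i..k] and [k+1..j]
                left = inside[(i, k)]
                right = inside[(k + 1, j)]
                for (lhs, b, c), prob in binary_rules.items():
                    if b in left and c in right:
                        inside[(i, j)][lhs] += prob * left[b] * right[c]

    return inside[(0, n - 1)][start]


# Hypothetical toy grammar: S -> S S (0.3) | 'a' (0.7)
toy_unary = {("S", "a"): 0.7}
toy_binary = {("S", "S", "S"): 0.3}

if __name__ == "__main__":
    for seq in ("a", "aa", "aaa"):
        print(seq, inside_probability(seq, toy_unary, toy_binary))
```

The table is filled in order of increasing span length, so every sub-span probability is available before it is needed. For a sequence of length n this costs on the order of n^3 steps times the number of binary rules, which gives a sense of why restricting the grammar class and parallelizing the parser, as the abstract describes, matter in practice.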