Michael Scharf, Reinhard Schneider, Georg Casari, Peer Bork, Alfonso Valencia, Christos Ouzounis, and Chris Sander
We present the prototype of a software system, called GeneQuiz, for large-scale biological sequence analysis. The system was designed to meet the needs that arise in computational sequence analysis and builds on our past experience with the analysis of 171 protein sequences of yeast chromosome III. We explain the cognitive challenges associated with this particular research activity and present our model of the sequence analysis process. The prototype system consists of two parts: (i) the database update and search system (driven by perl programs and rdb, a simple relational database engine also written in perl) and (ii) the visualization and browsing system (developed under C++/ET++). The principal design requirement for the first part was the complete automation of all repetitive actions: database updates, efficient sequence similarity searches and sampling of results in a uniform fashion. The user is then presented with "hit-lists" that summarize the results from heterogeneous database searches. The expert’s primary task then becomes the further analysis of the candidate entries, where the problem is to rapidly extract adequate information about the functional characteristics of the query protein. This second task is tremendously accelerated by a simple combination of the heterogeneous output into uniform relational tables and the provision of browsing mechanisms that give access to database records, sequence entries and alignment views. Indexing of molecular sequence databases provides fast retrieval of individual entries by unique identifier, as well as browsing through databases using pre-existing cross-references. The presentation here gives an overview of the architecture of the system prototype and describes our experience with its applicability in sequence analysis. The utility of GeneQuiz has already been proven during the analysis of 331 protein sequences from yeast chromosome XI and a quarter of the Mycoplasma capricolum genome, containing 314 proteins.
Further developments will allow active guidance of the user by a rule-based system. Also, dependencies will be minimized so that the system can be made publicly available.