Applications of Case-Based Reasoning in Molecular Biology
Total Page:16
File Type:pdf, Size:1020Kb
Articles Applications of Case-Based Reasoning in Molecular Biology Igor Jurisica and Janice Glasgow ■ Case-based reasoning (CBR) is a computational problems by recalling old problems and their reasoning paradigm that involves the storage and solutions and adapting these previous experi- retrieval of past experiences to solve novel prob- ences represented as cases. A case generally lems. It is an approach that is particularly relevant comprises an input problem, an output solu- in scientific domains, where there is a wealth of data but often a lack of theories or general princi- tion, and feedback in terms of an evaluation of ples. This article describes several CBR systems that the solution. CBR is founded on the premise have been developed to carry out planning, analy- that similar problems have similar solutions. sis, and prediction in the domain of molecular bi- Thus, one of the primary goals of a CBR system ology. is to find the most similar, or most relevant, cases for new input problems. The effective- ness of CBR depends on the quality and quan- tity of cases in a case base. In some domains, even a small number of cases provide good so- lutions, but in other domains, an increased number of unique cases improves problem- he domain of molecular biology can be solving capabilities of CBR systems because characterized by substantial amounts of there are more experiences to draw on. Howev- Tcomplex data, many unknowns, a lack of er, larger case bases can also decrease the effi- complete theories, and rapid evolution; rea- ciency of a system. The reader can find detailed soning is often based on experience rather descriptions of the CBR process and systems in than general knowledge. Experts remember Kolodner (1993). More recent research direc- positive experiences for possible reuse of solu- tions are presented in Leake (1996), and prac- tions; negative experiences are used to avoid tically oriented descriptions of CBR can be potentially unsuccessful outcomes. Similar to found in Bergman et al. (1999) and Watson other scientific domains, problem solving in (1997). molecular biology can benefit from systematic The remainder of this article describes sever- knowledge management using techniques al CBR systems that have been developed to from AI. Case-based reasoning (CBR) is partic- address problems in molecular biology. We be- ularly applicable to this problem domain be- gin with a description of a recent CBR system cause it (1) supports rich and evolvable repre- for planning protein-crystallization experi- sentation of experiences—problems, solutions, ments, followed by summaries of earlier CBR and feedback; (2) provides efficient and flexible systems for gene finding, knowledge discovery ways to retrieve these experiences; and (3) ap- in a sequence database, and protein- structure plies analogical reasoning to solve novel prob- determination. We conclude with a discussion lems. of issues related to the application of CBR in CBR is a paradigm that involves solving new the domain. Copyright © 2004, American Association for Artificial Intelligence. All rights reserved. 0738-4602-2004 / $2.00 SPRING 2004 85 Articles CBR and Protein Crystallization growth that are effective in many settings. For example, Jancarik and Kim (1991) proposed a One of the fundamental challenges in modern set of 48 agents that are often used during crys- molecular biology is the elucidation and un- tallization. Although advances have been made derstanding of the laws by which proteins through practical experience, a need remains adopt their three-dimensional structure. Pro- for systematic and principled studies to im- teins are involved in every biochemical process prove our deep understanding of the crystal- that maintains life in a living organism. lization process and provide a basis for the Through an increased understanding of pro- planning of successful new experiments. One tein structure, we gain insight into the func- of the difficulties in planning crystal growth ex- tions of these important molecules. Currently, periments is that “the history of experiments is the most powerful method for protein-struc- not well known, because crystal growers do not ture determination is single-crystal X-ray dif- monitor parameters” (Ducruix and Giege 1992, fraction, although new breakthroughs in nu- p. 14). One recent approach attempted to opti- clear magnetic resonance (NMR) (Kim and mize the 48-agent screen from crystallization Szyperski 2003) and in silico (Bysrtoff and Shao data on 755 different proteins (Kimber et al. 2002) approaches are growing in their impor- 2003). Not surprisingly, the study showed that tance. A crystallography experiment begins one can eliminate certain conditions (in this with a well-formed crystal that ideally diffracts case, 9) and still not lose any crystal or use only X-rays to high resolution. For proteins, this 6 conditions and still detect crystals for 338 process is often limited by the difficulty of proteins (60.6 percent). These results are en- growing crystals suitable for diffraction, which couraging. However, proteins have different is partially the result of the large number of pa- properties, and thus, the selection of more or rameters affecting the crystallization outcome less useful crystallizing conditions might never (such as purity of proteins, intrinsic physico- be universal across all proteins. In addition, chemical, biochemical, biophysical, and bio- when a particular condition produces differen- logical parameters) and the unknown correla- tial results across many proteins, it still might tions between the variation of a parameter and provide valuable information. the propensity for a given macromolecule to The BIOLOGICAL MACROMOLECULAR CRYSTALLIZA- crystallize. The CBR system described in this section ad- TION DATABASE (BMCD) (Gilliland, Tung, and Lad- dresses the problem of planning for a novel ner 2002) stores data from published crystal- protein-crystallization experiment. The poten- lization papers (positive examples of successful tial impact of such a system is far reaching: Ac- plans for growing crystals), including informa- celerating the process of crystallization implies tion about the macromolecule itself, the crys- an increased knowledge of protein structure, tallization methods used, and the crystal data. which is critical to medicine, drug design, and Attempts have been made to apply statistical enzyme studies and to a more complete under- and machine learning techniques to the BMCD. standing of fundamental molecular biology. These efforts include approaches that use clus- ter analysis (Farr, Peryman, and Samuzdi 1998), Protein Crystallization inductive learning (Hennessy et al. 1994), and Conceptually, protein crystal growth can be di- correlation analysis combined with a Bayesian vided into two phases: (1) search and (2) opti- technique (Hennessy et al. 2000) to extract mization. Approximate crystallization condi- knowledge from the database. Previous studies tions are identified during the search phase, were limited because negative results are not but the optimization phase varies these condi- reported in the database and because many tions to ultimately yield high-quality crystals. crystallization experiments are not repro- Improving these phases can lead to accelerated ducible because of an incomplete method de- protein-structure determination and function scription, missing details, or erroneous data resolution. Optimally, discovering the princi- (which is the result of often skimpy and vague ples of crystal growth will eliminate protein primary literature). Consequently, the BMCD is crystallization as a bottleneck in modern struc- not currently being used in a strongly predic- tural biology. tive fashion. The crystallization of macromolecules is cur- Significant advancement in this field has re- rently primarily empirical. Because of its unpre- sulted from the use of high-throughput robotic dictability and high irreproducibility, it has setups for the search phase of crystal growth. been considered by some to be an art rather This advancement vastly increases the number than a science (Ducruix and Giege 1992) or an of conditions that can initially be tested and al- “exact art and subtle science.” Experience alone so improves reliability and systematicity for ap- has produced experimental protocols for crystal proximating conditions for crystallization in 86 AI MAGAZINE Articles the search phase. However, it also results in proteins. If we consider only 15 possible condi- thousands of initial crystallization experiments tions, each having 15 possible values, the result carried out daily that require expert evaluation would be 4.3789e+017 possible experiments. based on visual criteria. Thus, our CBR system Assume that 2 proteins react similarly when includes an image analysis subsystem that is tested against a large set (over 1,500) of precip- used to automatically classify the initial out- itating agents in the search phase of crystalliza- comes in the search phase. This classification tion. Then, the planning strategies successfully forms the basis for the similarity measure for used for the one can profitably be applied to CBR. We incorporate knowledge discovery the other. Thus, it is important to identify a tools to assist in the optimization and the un- suitable set of precipitating agents to sort the derstanding of the protein-crystallization outcomes of reactions for a relatively large process. group of proteins. New crystallization chal- lenges are then approached by the execution