Articles Applications of Case-Based Reasoning in Molecular Biology

Igor Jurisica and Janice Glasgow

■ Case-based reasoning (CBR) is a computational problems by recalling old problems and their reasoning paradigm that involves the storage and solutions and adapting these previous experi- retrieval of past experiences to solve novel prob- ences represented as cases. A case generally lems. It is an approach that is particularly relevant comprises an input problem, an output solu- in scientific domains, where there is a wealth of data but often a lack of theories or general princi- tion, and feedback in terms of an evaluation of ples. This article describes several CBR systems that the solution. CBR is founded on the premise have been developed to carry out planning, analy- that similar problems have similar solutions. sis, and prediction in the domain of molecular bi- Thus, one of the primary goals of a CBR system ology. is to find the most similar, or most relevant, cases for new input problems. The effective- ness of CBR depends on the quality and quan- tity of cases in a case base. In some domains, even a small number of cases provide good so- lutions, but in other domains, an increased number of unique cases improves problem- he domain of molecular biology can be solving capabilities of CBR systems because characterized by substantial amounts of there are more experiences to draw on. Howev- Tcomplex data, many unknowns, a lack of er, larger case bases can also decrease the effi- complete theories, and rapid evolution; rea- ciency of a system. The reader can find detailed soning is often based on experience rather descriptions of the CBR process and systems in than general knowledge. Experts remember Kolodner (1993). More recent research direc- positive experiences for possible reuse of solu- tions are presented in Leake (1996), and prac- tions; negative experiences are used to avoid tically oriented descriptions of CBR can be potentially unsuccessful outcomes. Similar to found in Bergman et al. (1999) and Watson other scientific domains, problem solving in (1997). molecular biology can benefit from systematic The remainder of this article describes sever- knowledge management using techniques al CBR systems that have been developed to from AI. Case-based reasoning (CBR) is partic- address problems in molecular biology. We be- ularly applicable to this problem domain be- gin with a description of a recent CBR system cause it (1) supports rich and evolvable repre- for planning protein-crystallization experi- sentation of experiences—problems, solutions, ments, followed by summaries of earlier CBR and feedback; (2) provides efficient and flexible systems for gene finding, knowledge discovery ways to retrieve these experiences; and (3) ap- in a sequence database, and protein- structure plies analogical reasoning to solve novel prob- determination. We conclude with a discussion lems. of issues related to the application of CBR in CBR is a paradigm that involves solving new the domain.

Copyright © 2004, American Association for Artificial Intelligence. All rights reserved. 0738-4602-2004 / $2.00 SPRING 2004 85 Articles

CBR and Protein Crystallization growth that are effective in many settings. For example, Jancarik and Kim (1991) proposed a One of the fundamental challenges in modern set of 48 agents that are often used during crys- molecular biology is the elucidation and un- tallization. Although advances have been made derstanding of the laws by which proteins through practical experience, a need remains adopt their three-dimensional structure. Pro- for systematic and principled studies to im- teins are involved in every biochemical process prove our deep understanding of the crystal- that maintains life in a living organism. lization process and provide a basis for the Through an increased understanding of pro- planning of successful new experiments. One tein structure, we gain insight into the func- of the difficulties in planning crystal growth ex- tions of these important molecules. Currently, periments is that “the history of experiments is the most powerful method for protein-struc- not well known, because crystal growers do not ture determination is single-crystal X-ray dif- monitor parameters” (Ducruix and Giege 1992, fraction, although new breakthroughs in nu- p. 14). One recent approach attempted to opti- clear magnetic resonance (NMR) (Kim and mize the 48-agent screen from crystallization Szyperski 2003) and in silico (Bysrtoff and Shao data on 755 different proteins (Kimber et al. 2002) approaches are growing in their impor- 2003). Not surprisingly, the study showed that tance. A crystallography experiment begins one can eliminate certain conditions (in this with a well-formed crystal that ideally diffracts case, 9) and still not lose any crystal or use only X-rays to high resolution. For proteins, this 6 conditions and still detect crystals for 338 process is often limited by the difficulty of proteins (60.6 percent). These results are en- growing crystals suitable for diffraction, which couraging. However, proteins have different is partially the result of the large number of pa- properties, and thus, the selection of more or rameters affecting the crystallization outcome less useful crystallizing conditions might never (such as purity of proteins, intrinsic physico- be universal across all proteins. In addition, chemical, biochemical, biophysical, and bio- when a particular condition produces differen- logical parameters) and the unknown correla- tial results across many proteins, it still might tions between the variation of a parameter and provide valuable information. the propensity for a given macromolecule to The BIOLOGICAL MACROMOLECULAR CRYSTALLIZA- crystallize. The CBR system described in this section ad- TION DATABASE (BMCD) (Gilliland, Tung, and Lad- dresses the problem of planning for a novel ner 2002) stores data from published crystal- protein-crystallization experiment. The poten- lization papers (positive examples of successful tial impact of such a system is far reaching: Ac- plans for growing crystals), including informa- celerating the process of crystallization implies tion about the macromolecule itself, the crys- an increased knowledge of protein structure, tallization methods used, and the crystal data. which is critical to medicine, drug design, and Attempts have been made to apply statistical enzyme studies and to a more complete under- and machine learning techniques to the BMCD. standing of fundamental molecular biology. These efforts include approaches that use clus- ter analysis (Farr, Peryman, and Samuzdi 1998), Protein Crystallization inductive learning (Hennessy et al. 1994), and Conceptually, protein crystal growth can be di- correlation analysis combined with a Bayesian vided into two phases: (1) search and (2) opti- technique (Hennessy et al. 2000) to extract mization. Approximate crystallization condi- knowledge from the database. Previous studies tions are identified during the search phase, were limited because negative results are not but the optimization phase varies these condi- reported in the database and because many tions to ultimately yield high-quality crystals. crystallization experiments are not repro- Improving these phases can lead to accelerated ducible because of an incomplete method de- protein-structure determination and function scription, missing details, or erroneous data resolution. Optimally, discovering the princi- (which is the result of often skimpy and vague ples of crystal growth will eliminate protein primary literature). Consequently, the BMCD is crystallization as a bottleneck in modern struc- not currently being used in a strongly predic- tural biology. tive fashion. The crystallization of macromolecules is cur- Significant advancement in this field has re- rently primarily empirical. Because of its unpre- sulted from the use of high-throughput robotic dictability and high irreproducibility, it has setups for the search phase of crystal growth. been considered by some to be an art rather This advancement vastly increases the number than a science (Ducruix and Giege 1992) or an of conditions that can initially be tested and al- “exact art and subtle science.” Experience alone so improves reliability and systematicity for ap- has produced experimental protocols for crystal proximating conditions for crystallization in

86 AI MAGAZINE Articles the search phase. However, it also results in proteins. If we consider only 15 possible condi- thousands of initial crystallization experiments tions, each having 15 possible values, the result carried out daily that require expert evaluation would be 4.3789e+017 possible experiments. based on visual criteria. Thus, our CBR system Assume that 2 proteins react similarly when includes an image analysis subsystem that is tested against a large set (over 1,500) of precip- used to automatically classify the initial out- itating agents in the search phase of crystalliza- comes in the search phase. This classification tion. Then, the planning strategies successfully forms the basis for the similarity measure for used for the one can profitably be applied to CBR. We incorporate knowledge discovery the other. Thus, it is important to identify a tools to assist in the optimization and the un- suitable set of precipitating agents to sort the derstanding of the protein-crystallization outcomes of reactions for a relatively large process. group of proteins. New crystallization chal- lenges are then approached by the execution Case-Based Reasoning System and analysis of a set of precipitation reactions, One can view the process of crystal growth as followed by an automated identification of a planning task, where a single experiment cor- similar proteins and an analysis of the recipes responds to a simple plan, and a series of exper- used to crystallize them (that is, crystal growth iments for a given protein corresponds to a method, temperature and pH ranges, concen- more complex plan. Planning in AI generally tration of a protein, and crystallization agent). involves developing techniques for forming a Figure 1 illustrates this process. strategy of actions by choosing among alterna- Our wet-lab collaborators, George DeTitta tive actions and reasoning about their effects. and others at the Hauptman-Woodward Med- In CBR, actions are chosen based on the re- ical Research Institute in Buffalo, New York, trieval and adaptation of previously construct- have the capacity to prepare and evaluate the ed plans. results of more than 60,000 initial precipita- Our approach builds on a previously devel- tion experiments, involving 40 proteins, dur- oped computational framework for CBR called ing a single work week. These microbatch ex- TA3. This system uses a variable context, a sim- periments are being conducted under paraffin ilarity-based retrieval algorithm, and a flexible oil using robots outfitted with syringes and representation language (Jurisica and Glasgow cameras (figure 2). For each protein, crystal 2000; Jurisica, Glasgow, and Mylopoulos growth experiments are carried out using 1,536 2000). Cases corresponding to individual expe- different cocktails (solutions or reacting riences are stored in TA3 as a collection of at- agents) (Luft et al. 2001). The result of a precip- tribute-value pairs; attributes are grouped into itation experiment is an image that is automat- one or more categories to bring additional ically generated and analyzed to determine the structure to a case representation, reducing the outcome of the crystallization process (Jurisica impact of irrelevant attributes on system per- et al. 2001). Image processing involves four formance by selectively using individual cate- main steps: (1) drop recognition, (2) drop gories during case retrieval. analysis, (3) image-feature extraction, and (4) Although we would hope that most pure image classification. A standard vocabulary of proteins would crystallize readily under the ap- outcomes is utilized to describe the results of propriate conditions, finding a priori the opti- image classification: clear drop, amorphous mal mix of solution conditions is challenging. precipitate, phase separation, microcrystal, One possible approach to make this search crystal, and unknown. These outcomes, more systematic is the use of factorial design, recorded as a function of time, are the corner- which involves a set of experiments intended stone of the crystallization case base that also to identify important effects and interactions. contains chemical and physical information In crystallography, it can be used to search for about individual proteins. initial conditions and, once results are evaluat- A need for automated image analysis arises ed, to optimize the conditions. The process in- from the fact that there is no general approach volves the simultaneous combination and to quantitatively evaluate crystallization reac- modification of all conditions, which generally tion outcomes under the microscope. The ma- results in an intractable number of experi- jor weakness of existing scoring methods is the ments. tendency to confuse categories of precipitates We postulate that past experience can lead (Ducruix and Giege 1992). In our approach, us to the identification of initial conditions fa- crystallization outcomes are stored as image vorable to crystallization. Moreover, it is be- representations that are analyzed using com- lieved that solubility experiments can provide puter vision techniques (Jurisica et al. 2001) to a quantitative measure of similarity among objectively recognize different crystallization

SPRING 2004 87 Articles

PI for individual case 1 PI of a new as problem description 2 protein

Retrieval modified k–nearest neighbor retrieval

Protein- Crystallization Database Adaptation computational integration of results

Final case for a new protein Solution successful and failed crystallization recipes

Figure 1. Case-Based Planning of Protein-Crystallization Experiments. The precipitation index of a novel protein is compared to precipitation indexes of all proteins stored in the case base using a modified k–nearest-neighbor similarity matching. Crystallization plans of proteins with the most similar precipitation indexes are considered for planning a crystallization of a novel protein—successful crystallizations are used as positive planning experiences, but failed crystallization plans are utilized to poten- tially avoid negative results in the future.

outcomes and automatically extract important cases that are of minimum distance. The re- image features for further analysis. It is impor- trieval method implemented in the TA3 system tant to note that our approach produces objec- provides the user with a flexible interface for tive results and is scalable to accommodate the restricting or relaxing the similarity function to vast number of images that require analysis. retrieve fewer or more relevant cases as neces- Our current database contains over 1,500 pro- sary (Jurisica, Glasgow, and Mylopoulos 2000). teins, each being screened with 1,536 different Once similar cases have been retrieved, the conditions and photographed 6 times over a next step in CBR is adaptation, which is the period of 2 weeks. Data storage thus requires process of modifying previous solutions to ad- approximately one DVD disk for each protein dress the new problem. The most relevant re- experiment. trieved cases, along with domain knowledge, Case retrieval involves partial pattern match- are incorporated to determine appropriate pa- ing of an input case to cases in the case base. A rameters for a new set of experiments for pro- similarity function is used to determine which tein crystallization. At this stage, the system cases are most relevant to the given problem. acts, first, as an adviser to the crystallographer The precipitation index is utilized to define a to suggest possible parameters for further ex- quantitative similarity function for crystal perimentation and, second, as an evaluater of growth experiments. This index encodes the potential experiments that the user might pro- reactions from the set of initial precipitation pose. The system utilizes results from the pre- cocktails as a string containing the 1,536 initial cipitation index to suggest 80 percent of the so- crystallization outcomes (determined through lution, but 20 percent of the solution is image analysis). The distance is measured be- determined from the knowledge obtained by tween the string for a novel set of reactions and data mining the repository for general trends. those stored in the repository to extract those Although we still need to determine and vali-

88 AI MAGAZINE Articles date the most effective split, this combined ap- proach to plan adaptation attempts to resolve the problem of local similarity versus global trends. The adaptation module is being devel- oped over time as more general rules/principles are extracted from the growing case base using data-mining techniques. Once a plan (in the form of a set of experiments) has been derived and executed for a novel protein, the results are recorded as a new case that reflects this ex- perience. Cases with both positive and nega- tive outcomes are equally valuable for future decision-making processes and are also re- quired for the application of data-mining tech- niques to the case base. Currently, the system is being integrated, and preliminary validation is being conducted. The first step is image analysis, where we have processed almost 1,600 proteins to date (each screened on 1,536 different conditions). We have validated our automated image classifica- tion system, comparing it to human expert performance on 18 proteins, which resulted in an accuracy of 89 percent (receiver operating characteristic [ROC] score 0.875, precision 0.40, and recall 0.70) (Cumbaa et al. 2003). Some classification outcomes are illustrated in figure 3. In automatic image classification, we use a boosting approach by combining results from multiple techniques. Our image analysis sys- tem comprises several stages: (1) well registra- tion, (2) droplet segmentation, (3) feature ex- traction, and (4) image classification. During well registration, we locate and elim- inate the boundaries of the well. To determine the vertical-horizontal well boundaries, we find the pair of pixel columns/rows separated by the expected well width/height with the closest average pixel intensity. We incorporate a previously described registration method (Ju- risica et al. 2001). During droplet segmentation, we eliminate the edges of the drop. Currently, we use a prob- abilistic graphic model to segment the central region of the well for further analysis by distin- guishing the empty well, the inside of a droplet, and the edge of a droplet. We first di- vide the well into a grid of 1717 regions. Each region (i,j) in this grid is represented by a pair Figure 2. Crystallization Solution and Protein Pipetting Robots and XY Pho- of variables in the Bayesian net segmentation tography Stands. model, which forms a 60-component mixture The base set of robots in HWI laboratory comprises a liquid handling system of multivariate Gaussian distributions. Each pipette robot, which accommodates exchangeable banks of 1, 12, 96, or 384 sy- mixture models the local state of the well. Seg- ringes, that is utilized to deliver crystallization solution and protein into 1,536 mentation of a particular image is accom- wells (Robbins Scientific Tango, Sunnyvale, California) and a protein pipette ro- plished by inferring the most likely configura- bot with 96 syringes (Robbins Scientific Tango) and two XY stands for pho- tographing experiment results, with the capacity to digitally photograph 86,016 tion from the probability distribution, using crystallization experiments each. the local region of the image as well as the val- ues of its neighbors. We trained our segmenta-

SPRING 2004 89 Articles

XC PhP

1400 G 1200 1100

800

600

400

200 CGPPhX

0

Figure 3. Automated Image Analysis and Classification. After image segmentation, multiple classes of crystallization results can be detected. We use a boosting approach that combines multiple approaches. Here, we show images that have been classified as crystal (X), phase separation (Ph), precipitate (P), clear (C), and gel (G). The corresponding contour images have been utilized to compute the Euler number, which, in turn, has been utilized to cluster similar images. As shown in the bar graph, crystal and phase separation overlap but can be separated from clear drops and precipitates. Unfortunately, we did not have enough gel images to enable statistical analysis.

tion model on 45 hand-segmented images. Val- straight edges detected within a droplet, and 1 idation of the model was performed on a set of measuring the smoothness of the droplet con- 50 hand-segmented images containing 4,319 tent to detect precipitates and clear drops empty-well regions, 1,348 droplet-border re- (Cumbaa et al. 2003). gions, and 8,783 intradroplet regions. The During image classification, we classify each model correctly identified 96 percent of all well image, given its 23 numeric features, using lin- regions, 69 percent of all border regions, and ear discriminant analysis, into crystal, clear, and 95 percent of all intradroplet regions, for a precipitate classes (Cumbaa et al. 2003). weighted mean of 93-percent overall accuracy Results of image classification have been uti- (Cumbaa et al. 2003). lized to effectively visualize time dependency During feature extraction, we extract a min- of the crystallization process and begin the imal set of descriptive features that are essen- process of crystallization plan optimization. tial in differentiating protein-crystallization Considering that each protein crystallization outcomes. Because we approach the image- results in 9,216 images, the main goal of the vi- classification problem using an elimination sualization system is to provide easier naviga- strategy, we extract features that are indicative tion through experimental results. Different of specific subgroups of crystals (for example, size plates can easily be visualized by selecting microcrystals, microneedles, needle crystals, a protein from a list and displaying a color-cod- and larger faceted crystals), precipitates, and ed image classification across all time points clear drops. In total, we reduce each image to a (figure 4). Selecting a specific well displays vector of 23 features: 20 measuring micro-crys- cocktail information. Crystallization results are tal features, 2 measuring the presence of automatically (or interactively) selected for op-

90 AI MAGAZINE Articles

Figure 4. Visualization and Optimization of the Initial Crystallization Plan. White squares denote crystals, light grey denotes precipitates, black indicates clear drops, and dark gray indicates unknown. timization, using selected crystallizing condi- utilize the output of the image analysis to de- tions and optimization criteria (figure 5). At termine similarity between cases. this stage, the cocktail setup is optimized into 24-well plate-optimization screen (figure 6). We combine information from the successful Other Applications of and failed crystallizations in a given screen Case-Based Reasoning with historical patterns obtained from data mining to plan novel experiments. This effort Here, we summarize several other CBR systems can be described as a combination of tradition- that have been developed to address problems al crystallization optimization techniques with in molecular biology. In particular, we discuss machine learning approaches. approaches for analyzing genomic sequence Little research has been carried out in the data and predicting and determining the struc- area of AI and protein crystallization. As point- ture of proteins. ed out earlier, much of this research has fo- Sequence Analysis cused on applying statistical and machine learning to the BMCD. Recent work on data The linear subsequences of deoxyribonucleic mining applied to protein crystallization can acid (DNA) that encode proteins are called be found in the article by Buchanan and Liv- genes. A three-letter string from the alphabet A, ingston contained in this issue of AI Magazine. G, T, C (referred to as a codon) of bases from Also worth noting is work by Hennessy et al. DNA encodes 1 of the 20 amino acids that are (2000), which describes an experiment planner utilized to form a protein. One of the funda- for protein crystallization that uses a combina- mental problems in the area of sequence analy- tion of CBR and Bayesian reasoning. In partic- sis is gene finding, which involves locating the ular, they apply statistical methods to the BMCD correct groups of nucleotide triples to translate to determine the probability of success of an into the protein sequence. This problem is experimental plan. Our approach differs from complicated by noise in the data, which is this work in several fundamental ways: (1) we caused by sequencing errors. are developing a case base that includes nega- A CBR gene-finding algorithm was proposed tive, as well as positive, experiments; (2) we are by Jude Shavlik (1991). In this research, cases incorporating results from high-throughput correspond to complete genes and complete robotic experiments to determine possible proteins stored in existing biological sequence starting conditions for experimentation; (3) we databases. Similarity matching is utilized to lo- are applying image-processing techniques to cate and identify regions in a DNA sequence analyze experimental outcomes; and (4) we that encode proteins, which is done in the

SPRING 2004 91 Articles

Figure 5. Selected Experiments for the Optimization. One can identify both the time and the dependency of crystallization as well as view the details about crystallizing conditions.

Figure 6. Twenty-Four Well Plate Optimization Screen for the Selected Conditions Using Specified Optimization Criteria.

presence of frame shift errors, that is, errors re- in genes. The basis for cases in their system are sulting from breaking the proteins into seg- instances of genes found in GENBANK (Benson ments that result in different reading frames. et al. 2000), a primary repository for nucleic The proposed approach involves parsing a acid sequence data. An individual case corre- noisy sequence into discrete cases that can be sponds to an abstraction over multiple gene in- matched with entries in the case library. The stances. The prediction of regions is achieved gene-finding algorithm produces multiple, par- using a grammatical model of gene structure. tial matches and then combines a subset of Indexing of cases is based on a hierarchy of at- these into a consistent whole case. Aaronson et tributes corresponding to protein and species al. (1993) describe how CBR techniques can be similarity. Two approaches to prediction are applied to predict unknown regulatory regions proposed: (1) grammar-based CBR and (2) se-

92 AI MAGAZINE Articles quence-based CBR. In grammar-based CBR, in- quence. Zhang and Waltz (1989) describe a stances of genes are represented in the form of memory-based reasoning system to predict grammar rules, which are applied to induce these angles for a protein based on known grammars of gene classes. These are then used structures. Similar to CBR, memory-based rea- to predict the presence and location of features soning makes use of specific past experiences for a novel sequence. Sequence-based CBR re- for problem solving. Their work is based on the lies on the assumption that if there is similarity premise that if two amino acids have similar between two gene sequences, then these gene physical properties and occur in a similar envi- sequences might share similar features. Deter- ronment, then they should have similar struc- mining the location of a predicted feature in a ture (in terms of their φ and ψ angles). novel sequence is determined by aligning the Leng, Buchanan, and Nicholas (1993) devel- sequence with the similar sequence. The sys- oped a CBR architecture for predicting the sec- tem was evaluated on how well it discovered ondary structure of proteins. In this work, they features not in the feature table. It resulted in considered several different measures from the inferring 30 to 40 percent additional features biology literature for determining similar pro- that other existing extraction tools were not teins. Once these proteins are found, their ap- able to find. proach involves decomposing the novel se- Motivated by the observation of CBR in quence into smaller segments; retrieved cases “think-aloud” protocols carried out by a hu- are used to assign secondary structure to the man expert, Kettler and Darden (1993) utilized corresponding piece of the unknown structure. previous experimental plans, stored as cases, Each amino acid in the sequence is assigned a along with analogical reasoning, to plan for class (α-helix, β-strand, or coil) by applying a novel sequencing experiments. The planning weighted sum calculated from the evidence in component of their system created plans, the known structures. which are sequences of laboratory and data CBR has also been applied to determine the analysis procedures. The primary objective of a three-dimensional structure of proteins from plan is to find the amino acid sequence of a crystallographic data (Glasgow, Conklin, and protein. A case in the knowledge base captures Fortier 1993). This work in molecular scene two types of experience: (1) decision episodes, analysis concerns the automated reconstruc- which are the tasks that were chosen to be car- tion and interpretation of protein image data ried out, and (2) an experiment episode, which (in the form of a three-dimensional electron describes the experiment performed as a se- density map). Cases correspond to previously quence of decision episodes. Case retrieval in determined structures. Discovered spatial and the system consists of a simple probe and a visual concepts of a structure are used to index similarity measure that counts the number of cases. Cases are retrieved from the case base tasks that the current decision episode has in through a pattern-matching process that in- common with the stored cases. The approach is volves the comparison of unidentified features interactive, where the expert provides the sys- in a novel electron density map (derived from tem with potential hypotheses. an image-segmentation process) with motifs from known structures. This approach com- Protein-Structure Determination bines a bottom-up approach to image analysis, The structure of proteins can be analyzed at where image-processing techniques are applied varying levels of detail or complexity. The pri- to extract features from the maps, with a top- mary structure of a protein corresponds to an down approach, where CBR is used to anticipate ordered chain of amino acid residues, or a pro- what motifs are likely to occur in the image. tein sequence. Its secondary structure corre- sponds to the local conformation of the pro- Conclusions tein’s backbone. The most common secondary structures are the α-helix and the β-pleated CBR is based on the premise that problem solv- sheets. The ternary structure describes the ing involves recalling past experiences and uti- unique three-dimensional arrangement of the lizing these experiences (cases) to solve novel atoms in the polypeptide chain of the protein. problems. It is a particularly useful paradigm in Quaternary structure describes the assembly of domains that are not well understood and several individual polypeptide chains and, where it is difficult to come up with generaliza- thus, is defined only for multipolypeptide pro- tions that can be used to model the world. teins. The quantity and complexity of biological One way in which the tertiary structure of a data being generated is increasing at a rate protein can be represented is through the φ and much faster than our ability to analyze or un- ψ angles for each of the amino acids in the se- derstand it, implying the need for advanced

SPRING 2004 93 Articles

computational tools for problem solving in the ricella for the manual classification of a large domain. This article has introduced several ar- number of training and testing images. Imple- eas in which CBR has been applied to increase mentation of the presented software has been our understanding of biological data. However, done by Christian Anders Cumbaa (image a wealth of problems remains open that can analysis), Patrick Rogers (CBR), and Xin Zhang benefit from AI in general, and CBR in particu- (visualization and optimization). This research lar. Issues that need to be explored to fully rec- was supported in part by the Natural Science ognize the benefits of CBR for molecular biolo- and Engineering Research Council of Canada, gy problem solving include case representation contracts 224114 and 203833; the National In- and integrative CBR. stitutes of Health, P50 GM62413; an IBM With case representation, most of the current Shared University Research grant; and an IBM biological databases have been designed for a Faculty Partnership Award. specific single task and, thus, do not support References the full potential of machine learning and CBR Aaronson, J. S.; Juergen, H.; and Overton G. C. 1993. systems. For example, although the BMCD data- Knowledge Discovery in GENBANK. In Proceedings of base can be considered as a case base of previ- the First International Conference on Intelligent Systems ous experiences in planning crystallization ex- for Molecular Biology, eds. L. Hunter, D. Searls, and periments, no negative experiences are stored, U. Shavlik, 3–11. Menlo Park, Calif.: AAAI Press. limiting its usefulness for CBR and machine Benson, D. A.; Karsch-Mizrachi, I.; Lipman, J.; Ostell, learning. J.; Rapp, B. A.; and Wheeler, D. L. 2000. GENBANK. With integrative CBR, applying CBR tech- Nucleic Acids Research 28(1): 15–18. niques to heterogeneous and distributed data- Bergmann, R.; Breen, S.; Goker, M.; Manago, M.; and bases is increasingly required because relevant Wess, S. 1999. Developing Industrial Case-Based Rea- biological information can span multiple data- soning Applications: The INRECA Methodology. Berlin: bases, where more than one source is necessary Springer-Verlag. for problem solving. The vast amounts of ge- Bystroff, C., and Shao, Y. 2002. Fully Automated ab nomic and proteomic data being generated re- Initio Protein-Structure Prediction Using I-SITES, HMM- quire new analysis methods to combine, con- STR, and ROSETTA. Bioinformatics 18(Suppl 1): S54–S61. solidate, and interpret the information. No Cumbaa, C.; Lauricella, A.; Fehrman, N.; Veatch, C.; single database or algorithm will be successful Collins, R.; Luft, J.; DeTitta, G.; and Jurisica, I. 2003. at solving the complex analytic problems. Automatic Classification of Submicrolitre Protein- Thus, we need to integrate different tools and Crystallization Trials in 1536-Well Plates. Acta Crys- approaches, multiple sources of single types of tallographica Section D-Biological Crystallography data, and diverse data types. Similar trends D59(9): 1619–1627. have already been explored in the CBR litera- Ducruix, A., and Giege, R. 1992. Crystallization of Nu- ture, for example, McGinty and Smyth (2001) cleic Acids and Proteins. A Practical Approach. New and Leake and Sooriamurthi (2002). York: Oxford University Press. An advantage of CBR as a problem-solving Farr, R. G.; Perryman, A. L.; and Samuzdi, C. T. 1998. paradigm is that it is applicable to a wide range Reclustering the Database for Crystallization of of problems. It can be used to propose novel so- Macromolecules. Journal of Crystal Growth 183(4): 653–668. lutions or evaluate solutions to avoid potential problems. Aaronson, Juergen, and Overton Gilliland, G. L.; Tung, M.; and Ladner, J. E. 2002. The Biological Macromolecule Crystallization Database: (1993) suggest that analogical reasoning is par- Crystallization Procedures and Strategies. Acta Crys- ticularly applicable to the biological domain, tallographica Section D-Biological Crystallography partly because biological systems are often ho- D58(1): 916–920. mologous (rooted in evolution) and because bi- Glasgow, J. I.; Conklin, D.; and Fortier, S. 1993. Case- ologists often use a form of reasoning similar to Based Reasoning for Molecular Scene Analysis. In CBR, where experiments are designed and per- Case-Based Reasoning and Information Retrieval: Ex- formed based on the similarity between fea- ploring Opportunities for Technology—Papers from tures of a new system and those of known sys- the AAAI Spring Symposium, 53–62. Technical Re- tems. port SS-93-07. Menlo Park, Calif.: AAAI Press. Hennessy, D.; Buchanan, B.; Subramanian, D.; Wilkosz, P. A.; and Rosenberg, J. M. 2000. Statistical Acknowledgments Methods for the Objective Design of Screening Pro- We would like to thank our collaborators cedures for Macromolecular Crystallization. Acta Crystallographica Section D-Biological Crystallography George DeTitta and Joe Luft from the Haupt- D56 (Pt 7): 817–827. man-Woodward Medical Research Institute for Hennessy, D.; Gopalakrishnan, V.; Buchanan, B. G.; stimulating long-term collaboration. We are al- Rosenberg, J. M.; and Subramanian, D. 1994. Induc- so grateful to Nancy Fehrman and Angela Lau- tion of Rules for Biological Macromolecule Crystal-

94 AI MAGAZINE Articles lization. In Proceedings of the Second International Con- Shavlik, J. 1991. Finding Genes by Case-Based Rea- ference on Intelligent Systems for Molecular Biology, soning in the Presence of Noisy Case Boundaries. In 179–187. Menlo Park, Calif.: AAAI Press. Proceedings of the 1991 DARPA Workshop on Case- Jancarik, J., and Kim, S. H. 1991. Spare Matrix Sam- Based Reasoning, 327–338. San Francisco, Calif.: Mor- pling: A Screening Method for Crystallization of Pro- gan Kaufmann. teins. Journal of Applied Crystallography 24(409): Watson, I. D. 1997. Applying Case-Based Reasoning: 31–34. Techniques for Enterprise Systems. San Francisco, Calif.: Jurisica, I., and Glasgow, J. I. 2000. Improving Perfor- Morgan Kaufmann. mance of Case-Based Classification Using Context- Zhang, X., and Waltz, D. 1989. Protein-Structure Pre- Based Relevance. International Journal of Artificial In- diction Using Memory-Based Reasoning: A Case telligence Tools (Special Issue of IEEE ITCAI-96 Best Study of Data Exploration. In Proceedings of a Work- Papers) 6(4): 511–536. shop on Case-Based Reasoning, 351–355. San Francis- Jurisica, I.; Glasgow, J. I.; and Mylopoulos, J. 2000. co, Calif.: Morgan Kaufmann. Incremental Iterative Retrieval and Browsing for Effi- Igor Jurisica is an assistant profes- cient Conversational CBR Systems. International Jour- sor in the departments of comput- nal of Applied Intelligence 12(3): 251–268. er science and medical biophysics, Jurisica, I.; Rogers, P.; Glasgow, J. I.; Collins, R.; University of Toronto, and the De- Wolfley, J.; Luft, J.; and DeTitta, G. 2001. Improving partment of the School of Com- Objectivity and Scalability in Protein Crystallization: puting, Queen’s University. In ad- Integrating Image Analysis with Knowledge Discov- dition to his position as a scientist ery. IEEE Intelligent Systems Journal (Special Issue on at the Ontario Cancer Intelligent Systems in Biology) 16(6): 26–34. Institute/Princess Margaret Hospi- Kettler, B., and Darden, L. 1993. Protein Sequencing tal, Division of Cancer Informatics, Jirisica holds a Experiment Planning Using Analogy. In Proceedings visiting scientist position at the IBM Canada Toronto of the First International Conference on Intelligent Sys- Laboratory. He is recognized for work in computa- tems for Molecular Biology, 216–224. Menlo Park, tional biology, including representation, analysis, Calif.: AAAI Press. and visualization of high-dimensional data generat- ed by high-throughput biology experiments. His e- Kim, S., and Szyperski, T. 2003. GFT NMR, a New Ap- mail address is [email protected]. proach to Rapidly Obtain Precise High-Dimensional NMR Spectral Information. Journal of the American Janice Glasgow is a professor in Chemical Society 125(5): 1385–1393. the School of Computing at Kimber, M. S.; Vallee, F.; Houston, S.; Necakov, A.; Queen’s University, Canada, Skarina, T.; Evdokimova, E.; Beasley, S.; Christendat, where she holds a research chair in D.; Savchenko, A.; Arrowsmith, C. H.; Vedali, M.; biomedical computing. Currently, Gerstein, M.; and Edwards, A. M. 2003. Data Mining she is on sabbatical and is a senior Crystallization Databases: Knowledge-Based Ap- visiting research fellow at the In- proaches to Optimize Protein Crystal Screens. Pro- stitute of Advanced Studies, Uni- teins: Structure, Function, and Genetics 51(4): 562–568. versity of Bologna. She sits on the editorial board for several journals in AI, cognitive Kolodner, J. L. 1993. Case-Based Reasoning. San Fran- science, and bioinformatics; is a past-president of the cisco, Calif.: Morgan Kaufmann. Canadian Society for Computational Intelligence; Leake, D. B., ed. 1996. Case-Based Reasoning: Experi- and, until recently, was the vice-chair for the AI tech- ences, Lessons, and Future Directions. Menlo Park, nical committee for the International Federation for Calif.: AAAI Press. Information Processing. Her e-mail address is jan- Leake, D. B., and Sooriamurthi, R. 2002. Automati- [email protected]. cally Selecting Strategies for Multi–Case-Based Rea- soning. In Advances in Case-Based Reasoning: Proceed- ings of the Fifth European Conference on Case-Based Reasoning, 204–219. Berlin: Springer-Verlag. Leng, B.; Buchanan, B. G.; and Nicholas, H. B. 1993. Protein Secondary Structure Prediction Using Two- Level Case-Based Reasoning. In Proceedings of the First International Conference on Intelligent Systems for Mol- ecular Biology, 251–259. Menlo Park, Calif.: AAAI Press. Luft, J.; Wolfley, J.; Jurisica, I.; Glasgow, J. I.; Fortier, S.; and DeTitta, G. 2001. Macromolecular Crystalliza- tion in a High-Throughput Laboratory—The Search Phase. Journal of Crystal Growth 232(4): 591–595. McGinty, L., and Smyth, B. 2001. Collaborative Case- Based Reasoning: Applications in Personalized Route Planning. In Proceedings of the Fourth International Conference on Case-Based Reasoning, 362–376. Berlin: Springer-Verlag.

SPRING 2004 95 Articles

ISMB Proceedings

ISMB-2000—San Diego, California ISMB-96—St. Louis, Missouri Proceedings of the Eighth International Conference on In- Proceedings of the Fourth International Conference on In- telligent Systems for Molecular Biology telligent Systems for Molecular Biology Edited by , Timothy Bailey, , Michael Gribskov, Edited by David J. States, Pankaj Agarwal, , Thomas Lengauer, Ilya Shindyalov, Lynn Ten Eyck, & Helge Weissig , & Randall F. Smith ISBN 1-57735-115-0, 436 pp., index ISBN 1-57735-002-2, 274 pp., index

ISMB-99—Heidelberg, Germany ISMB-95—Cambridge, England Proceedings of the Seventh International Conference on In- Proceedings of the Third International Conference on Intel- telligent Systems for Molecular Biology ligent Systems for Molecular Biology Edited by Thomas Lengauer, Reinhard Schneider, Peer Bork, Edited by Christopher Rawlings, Dominic Clark, Russ Altman, Douglas Brutlag, Janice Glasgow, Hans-Werner Mewes, & Ralf Zimmer Lawrence Hunter, Thomas Lengauer, & ISBN 1-57735-083-9, 324 pp., index ISBN 0-929280-83-0, 427 pp., index

ISMB-98—Montréal, Quebec, Canada ISMB-94—Stanford, California Proceedings of the Sixth International Conference on Intel- Proceedings of the Second International Conference on In- ligent Systems for Molecular Biology telligent Systems for Molecular Biology Edited by Janice Glasgow, Tim Littlejohn, François Major, Richard Edited by Russ Altman, Douglas Brutlag, , Richard Lathrop, Lathrop, , & Christoph Sensen & David Searls ISBN 1-57735-053-7, 234 pp., index ISBN 0-929280-68-7, 401 pp., index

ISMB-97—Halkidiki, Greece ISMB-93—Bethesda, Maryland Proceedings of the Fifth International Conference on Intel- Proceedings of the First International Conference on Intelli- ligent Systems for Molecular Biology gent Systems for Molecular Biology Edited by Terry Gaasterland, Peter Karp, Kevin Karplus, Christos Edited by Lawrence Hunter, David Searls, & Jude Shavlik Ouzounis, , & ISBN 0-929280-47-4 468 pp., index ISBN 1-57735-022-7, 382 pp., index

Published by The AAAI Press, 445 Burgess Drive, Menlo Park, California 94025 (www.aaai.org/Press)

96 AI MAGAZINE