AI Magazine Volume 25 Number 1 (2004) (© AAAI) Articles

Editorial Introduction

AI and Bioinformatics

Janice Glasgow, Igor Jurisica, and Burkhard Rost

■ This article is an editorial introduction to the research discipline of bioinformatics and to the articles in this special issue. In particular, we address the issue of how techniques from AI can be applied to many of the open and complex problems of modern-day molecular biology.

This special issue of AI Magazine focuses on some areas of research in bioinformatics that have benefited from applying AI techniques. Undoubtedly, bioinformatics is a truly interdisciplinary field: Although some researchers continuously affect wet labs in life sciences through collaborations or provision of tools, others are rooted in the theory departments of exact sciences (physics, chemistry, or engineering) or computer sciences. This wide variety creates many different perspectives and terminologies. One result of this Babel of languages is that there is no single definition for what the subject of this young field really is. Even the name of the field varies: Bioinformatics, theoretical biology, biocomputing, or computational biology are just a few of the terms used. In fact, this lack of a precise definition is not of the type, “I recognize it when I see it”; rather, different representatives of the field have fairly different ideas about what it actually is.

Here, we do not attempt to impose any specific definition of the field. The particular collection of reviews presented constitutes a sparse sampling from the broad activities in the area. Larry Hunter (“Life and Its Molecules: A Brief Introduction”) describes some of the concepts and terms prevalent in today’s molecular biology. If you find the plethora of technical terms overwhelming, be assured that modern-day biology is far more complex than suggested by the simplified sketch presented here. In fact, researchers in life sciences live off the introduction of new concepts; the discovery of exceptions; and the addition of details that usually complicate, rather than simplify, the overall understanding of the field.

Possibly the most rapidly growing area of recent activity in bioinformatics is the analysis of microarray data. The article by Michael Molla, Michael Waddell, David Page, and Jude Shavlik (“Using Machine Learning to Design and Interpret Gene-Expression Microarrays”) introduces some background information and provides a comprehensive description of how techniques from machine learning can be used to help understand this high-dimensional and prolific gene-expression data. The authors point out that it is natural to apply machine learning to such data, but it is also challenging because of its complexity.

The term protein function is not well defined; it encompasses a wide spectrum of biological contexts in which proteins contribute to making an organism live. (Note that the term gene function is somehow a misnomer in the sense that it means “the function of the protein encoded by a particular gene.”) This intrinsic complexity of terminology makes it extremely difficult to build databases with controlled vocabularies for function. Furthermore, the vast majority of experimental data is buried in free-text publications. Mining free text, such as MEDLINE abstracts, and machine learning interpretations of controlled vocabularies constitute another area of increasing activity. Rajesh Nair and Burkhard Rost (“Annotating Protein Function through Lexical Analysis”) review a few of the recent methods that have begun influencing experimental research. They observe that to date the technically simplest tools appear to be the most successful ones and that the seemingly simplest problem (identifying the gene-protein name from a publication) constitutes one of the major bottlenecks in incorporating free-text mining systems into everyday MEDLINE searches. Ross King (“Applying Inductive Logic Programming to Functional Genomics”) reviews applications of inductive logic programming that address the problem of predicting some aspects of protein function. In particular, he reviews a method that combines the mining of controlled vocabulary with machine learning to render genomewide annotations of function.

High-throughput experiments targeting the genome have become almost a standard tool for experimental biology over the last decade (for example, large-scale sequencing, microarrays, RNAi, two-hybrid methods, mass spectrometry). In contrast, the first comprehensive attempt at realizing high-throughput experiments for proteins, structural genomics, is still in the phase of pilot projects. One goal of structural genomics is to experimentally determine a structure for each representative protein. This seemingly simple objective hides an avalanche of bottlenecks and problems. Many of these will benefit from AI-driven solutions. One such bottleneck, protein crystallization, is addressed in the final two papers. Bruce Buchanan and Gary Livingston (“Toward Automated Discovery in the Biological Sciences”) focus on the use of a novel data-mining technique to extract relationships from the data on crystal-growing experiments. Igor Jurisica and Janice Glasgow (“Applications of Case-Based Reasoning in Molecular Biology”) demonstrate how case-based reasoning can be applied to assist in the planning of such experiments. They also provide an overview of several other applications in molecular biology that have benefited from case-based reasoning.

An alternative to experimental methods for determining protein structure is the application of automated techniques for predicting structure from sequence. The paper by Claus Andersen and Soren Brunak (“Amino Acid Subalphabets Can Improve Protein Structure Prediction”) presents some novel research that illustrates an interesting application of AI geared toward learning about the relation between amino acid alphabets and protein. In particular, this work demonstrates the importance of knowledge representation in extracting and integrating information in biological databases.

A common theme among the articles is that biological data are complex, and the quantity of such data is growing at an unprecedented rate (and arguably outgrowing the central processing unit and storage capacity of computers). The problems that are faced in understanding molecular sequence, structure, and function rely on the ability to manage and understand these data. Thus, it is not surprising that AI techniques from knowledge representation, machine learning, knowledge discovery, and reasoning are at the forefront in addressing the important questions that are arising in molecular biology.

Janice Glasgow is a professor in the School of Computing at Queen’s University, Canada, where she holds a research chair in biomedical computing. Currently, she is on sabbatical and is a senior visiting research fellow at the Institute of Advanced Studies, University of Bologna. She sits on the editorial board for several top journals in AI, cognitive science, and bioinformatics; is a past president of the Canadian Society for Computational Intelligence; and until recently was the vice-chair for the AI technical committee for the International Federation for Information Processing. Her e-mail address is jan-[email protected].

Igor Jurisica is an assistant professor in the departments of computer science and medical biophysics at the University of Toronto and the School of Computing at Queen’s University. In addition to his position as a scientist at the Ontario Cancer Institute/Princess Margaret Hospital, Division of Cancer Informatics, Jurisica holds a visiting scientist position at the IBM Canada Toronto laboratory. He is recognized for his work in computational biology, including representation, analysis, and visualization of high-dimensional data generated by high-throughput biology experiments. His e-mail address is [email protected].

Burkhard Rost has been an associate professor at Columbia University since 1999. After graduating from the Institute of Theoretical Physics, Heidelberg, he was part of EMBL Heidelberg (1990–1994, 1996–1998), EBI Cambridge (1995), and LION Biosciences (1998). His research focuses on methods for predicting protein structure and function from sequence. The major goals of his research are to develop tools that can be applied in the context of analyzing entirely sequenced organisms.

Copyright © 2004, American Association for Artificial Intelligence. All rights reserved. 0738-4602-2004 / $2.00 SPRING 2004
