Mathematical Challenges from Genomics and Molecular Richard M. Karp

fundamental goal of biology is to un- algorithms and the role of combinatorics, opti- derstand how living cells . This mization, probability, statistics, pattern recognition, understanding is the foundation for all and machine learning. higher levels of explanation, including We begin by presenting the minimal information Aphysiology, , behavior, , about , , and required to and the study of populations. The field of molec- understand some of the key problems in genomics. ular biology analyzes the functioning of cells and Next we describe some of the fundamental goals of the processes of inheritance principally in terms of the molecular life sciences and the role of genomics interactions among three crucially important classes in attaining these goals. We then give a series of of macromolecules: DNA, RNA, and proteins. brief vignettes illustrating algorithmic and mathe- Proteins are the molecules that enable and execute matical questions arising in a number of specific most of the processes within a . DNA is the car- areas: sequence comparison, , rier of hereditary information in the form of genes finding, phylogeny construction, re- and directs the production of proteins. RNA is a key arrangement, associations between polymorphisms intermediary between DNA and proteins. and disease, classification and clustering of gene and are undergoing expression data, and the logic of transcriptional revolutionary changes. These changes are guided by control. An annotated bibliography provides point- a view of a cell as a collection of interrelated sub- ers to more detailed information. systems, each involving the interaction among many genes and proteins. Emphasis has shifted from the Genes, Genomes, and Proteins study of individual genes and proteins to the explo- The Double Helix ration of the entire genome of an organism and the The field of genetics began with study of networks of genes and proteins. As the level (1865), who postulated the existence of discrete of aspiration rises and the amount of available data grows by orders of magnitude, the field becomes units of information (which later came to be called increasingly dependent on mathematical modeling, genes) that govern the inheritance of individual mathematical analysis, and computation. In the characteristics in an organism. In the first half of sections that follow we give an introduction to the the twentieth century it was determined that the mathematical and computational challenges that genes are physically embodied within complex arise in this field, with an emphasis on discrete DNA macromolecules that lie within structures called which occur in every living Richard M. Karp is a member of the International Computer cell. This set the stage for the epochal discovery Science Institute (ICSI), Berkeley, and a University Professor of the structure of DNA by Watson and Crick in at the University of California, Berkeley. His e-mail address 1953. They showed that a DNA molecule is a is [email protected]. double helix consisting of two strands. Each helix

544 NOTICES OF THE AMS VOLUME 49, NUMBER 5 is a chain of bases, chemical units of four types: A, according to the , which maps succes- C, T, and G. Each base on one strand is joined by sive triplets of RNA bases to amino acids. With a hydrogen bond to a complementary base on the minor exceptions, this many-to-one function from other strand, where A is complementary to T, and the sixty-four triplets of bases to the twenty amino C is complementary to G. Thus the two strands con- acids is the same in all organisms on Earth. tain the same information. Certain segments within Regulation of these chromosomal DNA molecules contain genes, All the cells within a living organism (with the ex- which are the carriers of the genetic information ception of the sperm and egg cells) contain nearly and, in a sense to be explained later, spell the identical copies of the entire genome of the organ- names of the proteins. Thus the genetic informa- ism. Thus every cell has the information needed to tion is encoded digitally, as strings over the four- produce any that the organism can produce. letter alphabet {A, C, T, G}, much as information Nevertheless, cells differ radically in the proteins is encoded digitally in computers as strings of that they actually produce. For example, there are zeros and ones. more than 200 different human cell types, and most In humans there are forty-six chromosomes. proteins are produced in only a subset of these cell All but two of these (the sex chromosomes) occur in types. Moreover, any given cell produces different pairs of “homologous” chromosomes. Two homol- proteins at different stages within its cycle of oper- ogous chromosomes contain the same genes, but ation, and its protein production is influenced by its a gene may have several alternate forms called internal environment and by the signals impinging , and the alleles of a gene on the two chromo- upon it from other cells. somes may be different. It is clear, then, that the expression of a gene within The total content of the DNA molecules within the a cell (as measured by the abundance and level of chromosomes is called the genome of an organism. activity of the proteins it produces) is regulated by Within an organism, each cell contains a complete the environment of the cell. of a gene copy of the genome. The contains is typically regulated by proteins called transcrip- about three billion base pairs and about 35,000 genes. tion factors that bind to the DNA near the gene and Proteins enhance or inhibit the copying of the gene into RNA. Proteins are the workhorses of cells. They act as Similarly, can be regulated by proteins structural elements, catalyze chemical reactions, that bind to the ribosome. Certain post-translational regulate cellular activities, and are responsible for processes, such as the chemical modification of the communication between cells. A protein is a linear protein or the transport of protein to a particular chain of chemical units called amino acids, of which compartment in the cell can also be regulated so there are twenty common types. The function of a as to affect the activity of the protein. Thus gene protein is determined by the three-dimensional expression can be viewed as a complex network of structure into which it folds. One of the premier interactions involving genes, proteins, and RNA, as problems in science is the protein folding problem well as other factors such as temperature and the of predicting the three-dimensional structure of a presence or absence of nutrients and drugs within protein from its linear sequence of amino acids. the cell. This problem is far from being solved, although progress has been made by a variety of methods. The Goals of Genomics These range from numerical simulation of the In this section we enumerate some of the goals of physical forces exerted by the amino acids on genomics. one another to pattern recognition techniques 1. Sequence and compare the genomes of differ- which correlate motifs within the linear ent species. To sequence a genome means to sequence with structural features of a protein. determine its sequence of bases. This sequence From Genes to Proteins will, of course, vary from individual to indi- The fundamental dogma of molecular biology is that vidual, and those individual differences are of DNA codes for RNA and RNA codes for protein. Thus paramount importance in determining each the production of a protein is a two-stage process, individual’s genetic makeup, but there is with RNA playing a key role in both stages. An RNA enough agreement to justify the creation of a molecule is a single-stranded chain of chemical composite reference genome for a species. For bases of four types: A, U, C, and G. In the first stage, example, any two humans will have the same called transcription, a gene within the chromosomal complement of genes (but different alleles) DNA is copied base-by-base into RNA according to and will agree in about 999 bases out of 1,000. the correspondence A → U, C → G, T → A, G → C. The of the human genome The resulting RNA transcript of the gene is then has been a central goal of the world genomics transported within the cell to a molecular machine since 1990. Draft sequences were called a ribosome which has the function of trans- completed in February 2001, and the quest lating the RNA into protein. Translation takes place continues for a much more accurate sequence.

MAY 2002 NOTICES OF THE AMS 545 This achievement was preceded by the se- bases or amino acids), we need a definition: An quencing of many bacterial genomes, yeast, alignment of a pair of sequences x and y is a new the nematode, and, in June 1999, the fruit fly pair of sequences x and y of equal length such . The sequencing of that x is obtained from x and y is obtained from a new organism is often of value for medical, y by inserting occurrences of the special space agricultural, or environmental studies. In symbol (−). Thus, if x = acbcdb and y = abbdcdc, addition, it may be useful for comparative then one alignment of x and y is as follows: studies with related organisms. x = a - c b c - d b,  2. Identify the genes and determine the func- y = a b - b d c d c. tions of the proteins they encode. This process Given an alignment of two sequences x and y, it is essential, since without it a sequenced is natural to assess its quality (i.e., the extent to which genome is merely a meaningless jumble of A’s, it displays the similarity between the two sequences) C’s, T’s, and G’s. Genes can be identified by as a score, which is the sum of scores associated methods confined to a single genome or by with the individual columns of the alignment. The comparative methods that use information score of a column is given by a symmetric scoring about one organism to understand another function σ that maps pairs of symbols from the al- related one. phabet Σ ∪ {−} to the real numbers, where Σ is the 3. Understand gene expression. How do genes set of residues. Normally we will choose σ (a, a) > 0 and proteins act in concert to control cellular for all symbols a ∈ Σ , so that matched symbols processes? Why do different cell types express increase the score of the alignment, and σ (a, −) < 0 different genes and do so at different times? for all a ∈ Σ, so that misalignments are penalized. 4. Trace the evolutionary relationships among ex- In the case of the alphabet of amino acids, σ (a, b) isting species and their evolutionary ancestors. reflects the frequency with which amino acid a 5. Solve the protein folding problem: From the replaces amino acid b in evolutionarily related linear sequence of amino acids in a protein, sequences. determine the three-dimensional structure into The global alignment problem is to find the which it folds. optimal alignment of two strings x and y with 6. Discover associations between gene respect to a given scoring function σ. A dynamic and disease. Some diseases, such as cystic fi- programming algorithm called the Needleman- brosis and Huntington’s disease, are caused Wunsch algorithm solves this problem in a num- by a single . Others, such as heart ber of steps proportional to the product of the disease, cancer, and diabetes, are influenced lengths of the two sequences. A straightforward by both genetic and environmental factors, implementation of this algorithm requires space and the genetic component involves a combi- proportional to the product of the lengths of the nation of influences from many genes. By two sequences, but there is a refinement which, at studying the relation between genetic endow- the cost of doubling the execution time, reduces ment and disease states in a population of the space requirement to m + log n, where m and individuals, it may be possible to sort out the n are the lengths of the shorter and the longer of genetic influences on such complex diseases. the two sequences. Having completed our brief survey of the gen- A related problem is that of local alignment, in eral goals of genomics, we now turn to a number which we seek the highest score of an alignment be- of examples of specific problems in genomics. tween consecutive subsequences of x and y, where These typically involve the creation of a mathe- these subsequences may be chosen as desired. matical model, the development of an algorithm, Such an alignment is intended to reveal the extent and a mathematical analysis of the algorithm’s of local similarity between sequences that may not performance. be globally similar. This problem can be solved within the same time and space bounds as the Sequence Comparison global alignment problem, using a dynamic pro- The similarity of a newly discovered gene or protein gramming algorithm due to Smith and Waterman. to known genes or proteins is often an indication of A gap is a sequence of consecutive columns in its importance and a clue to its function. Thus, an alignment in which each symbol of one of the whenever a sequences a gene or protein, sequences (x or y) is the space symbol (−). Gaps the next step is to search the sequence databases typically correspond to insertions or deletions of for similar sequences. The BLAST (Basic Local residue sequences over the course of . Alignment Search Tool) program and its successive Because mutations causing such insertions and refinements serve this purpose and are the most im- deletions may be considered a single evolutionary portant single software tool for . event (and may be nearly as likely as the insertion In preparation for giving a measure of the or deletion of a single residue), we may wish to similarity between two sequences of residues (i.e., assign a (negative) score to a gap which is greater

546 NOTICES OF THE AMS VOLUME 49, NUMBER 5 than the sum of the (negative) scores of its columns. describes the construction of an HMM represent- The above dynamic programming algorithms can ing the statistical features of human genes. be adapted for this purpose. An HMM for a family of sequences should have the One of the most commonly occurring tasks in property that sequences in the family tend to be gen- is to search a database for erated with higher probability than other sequences sequences similar to a given sequence. BLAST is a of the same length. In view of this property, one can set of programs designed for this purpose. Ideally, judge whether any given sequence lies in the family it would be desirable to scan through the entire data- by computing the probability that the HMM generates base for high-scoring local alignments, but this it, a task that can be performed efficiently by a sim- would require a prohibitive amount of computation. ple dynamic programming algorithm. Instead, a filtering program is used to find regions In order to construct a hidden Markov model of of the database likely to have a high-scoring local a family of sequences, one needs a training set alignment with the given sequences; the full consisting of representative sequences from the local alignment algorithm is then used within family. The first step in constructing the HMM is to these regions. One principle underlying the filter- choose the set of states and the initial state and ing program is that two sequences are likely to to specify which transition probabilities and have a high-scoring local alignment only if there which emission probabilities can be nonzero. These is a reasonably long exact match between them. choices are guided by the modeler’s knowledge of Multiple Alignment the family. Given these choices, one can use the The concept of an alignment can be extended to EM-algorithm to choose the numerical values of alignments of several sequences. A multiple align- the nonzero transition probabilities and emission probabilities in order to maximize the product of ment of the sequences x1,x2,... ,xn is an n-tuple    the emission probabilities of the sequences in the (x ,x ,... ,xn) of sequences of equal length where, 1 2  training set. for i =1, 2,... ,n, the sequence xi is obtained from xi by inserting occurrences of the space symbol. Sequence Assembly Just as in the case of a pairwise alignment, the The genomes of different organisms vary greatly scoring of a multiple alignment is based on a sym- in size. There are about 3 billion base pairs in metric score function σ from (Σ ∪ {−})2 into the real the human genome, 120 million in the genome of numbers; usually we take σ (−, −) to be zero. The the fruit fly Drosophila melanogaster, and 4.7 million score of a multiple alignment is then computed in the genome of the bacterium E. coli. There is no column by column. The sum-of-pairs scoring method, magic microscope than can simply scan across a in which the score of a multiple alignment is the genome and read off the bases. Instead, genomes sum of σ (a, b) over all aligned pairs of symbol are sequenced by extracting many fragments called occurrences, is a natural choice, with the convenient reads from the genome, sequencing each of these property that the score of an alignment is the sum reads, and then computationally assembling the of the scores of its induced pairwise alignments. genome from these reads. The typical length of a For the commonly used scoring methods, the read is about 500 bases, and the total length of all problem of finding a maximum-score multiple align- the reads is typically five to eight times the length of ment of a set of sequences is NP-hard, and various the genome. The reads come from initially unknown heuristics are used in practice. locations distributed more or less randomly across Hidden Markov Models the genome. The process of sequencing a read is Multiple alignments are an important tool for ex- subject to error, but the error rate is usually low. hibiting the similarities among a set of sequences. is conceptually the sim- Hidden Markov models (HMMs) provide a more plest way to assemble a genome from a set of flexible probabilistic method of exhibiting such reads. In this method the reads are compared in similarities. An HMM is a Markov chain that sto- pairs to identify those pairs that appear to have a chastically emits an output symbol in each state. significant overlap. Then the reads are aligned in It is specified by a finite set of states, a finite set a manner consistent with as many of these over- of output symbols, an initial state, transition prob- laps as possible. Finally, the most likely genomic  abilities p(q,q ), and emission probabilities e(q,b). sequence is derived from the alignment.  Here p(q,q ) is the probability that the next state During the 1990s The Institute for Genomic Re-  is q given that the present state is q, and e(q,b) search (TIGR) used the shotgun method to sequence is the probability of emitting output symbol b in the genomes of many , of size up state q. In typical biological applications the out- to about five megabases. However, the method was put symbols are residues ( or amino not believed to be applicable to organisms, such acids), and an HMM is used to represent the sta- as Homo sapiens, having much larger genomes tistical features of a family of sequences, such as containing many repeat families. Repeat families the family of globin proteins. A subsequent section are sequences that are repeated with very little

MAY 2002 NOTICES OF THE AMS 547 variation throughout a genome. For example, the Living organisms divide into two main classes: ALU repeat family consists of nearly exact repeti- prokaryotes, such as and blue-green algae, tions of a sequence of about 280 bases covering in which the cell does not have a distinct nucleus, about 10 percent of the human genome. Repeat fam- and , in which the cells contain visibly ilies in a genome complicate the sequence assembly evident nuclei and . Gene finding within process, since matching sequences within two prokaryotes is relatively easy because each gene reads may come from distinct occurrences of a consists of a single contiguous sequence of bases. repeat sequence and therefore need not indicate In higher eukaryotes, however, a gene typically that the reads overlap. consists of two or more segments called exons that The Human , an international code for parts of a protein, separated by noncod- effort coordinated by the U.S. Department of En- ing intervening segments called introns. In the ergy and the National Institutes of Health, favors process of transcription the entire sequence of a divide and conquer approach over the shotgun exons and introns is transcribed into a pre-mRNA sequencing approach. The basic idea is to reduce transcript. Then the introns are removed and the the sequencing of the entire genome to the se- exons are spliced together to form the mRNA tran- quencing of many fragments called clones of length script that goes to a ribosome to be translated about 130,000 bases whose approximate locations into protein. Thus the task of gene identification on the genome have been determined by a process involves parsing the genomic region of a gene called physical mapping. into exons and introns. Often this parsing is not In 1998 the biologist established Cel- unique, so that the same gene can code for several era Genomics as a rival to the Human Genome Pro- different proteins. This phenomenon is called ject and set out to sequence the human genome us- alternative splicing. ing the shotgun sequencing approach. Venter was The identification of a gene and its parsing into exons and introns is based on signals in the ge- joined by the computer scientist Gene Myers, who had nomic sequence that help to identify the beginning conducted mathematical analyses and simulation of the first exon of a gene, the end of the last exon, studies indicating that the shotgun sequencing ap- and the exon-intron boundaries in between. Some proach would work provided that most of the reads of these signals derive from the of the ge- were obtained in pairs extracted from the ends of netic code. Define a codon as a triplet of DNA bases. short clones. The advantage of using paired reads Sixty-one of the sixty-four codons code for specific is that the approximate distance between the two amino acids. One of these (ATG) is also a start reads is known. This added information reduces the codon determining the start of translation. The danger of falsely inferring overlaps between reads other three codons (TAA, TAG, and TGA) are stop that are incident with different members of the same codons which terminate translation. It follows that repeat family. Celera demonstrated the feasibility the concatenation of all the exons starts with ATG of its approach by completing the sequencing of (with occasional exceptions) and ends with one of Drosophila melanogaster (the fruit fly) in March 2000. the three stop codons. In addition, each intron In February 2001 Celera and the Human Genome must start with GT and end with AG. There are also Project independently reported on their efforts important statistical tendencies concerning the to sequence the human genome. Each group had distribution of codons within exons and introns and obtained a rough draft sequence covering upwards the distribution of bases in certain positions near of 90 percent of the genome but containing numer- the exon-intron boundaries. These deterministic sig- ous gaps and inaccuracies. Both groups are contin- nals and statistical tendencies can be incorporated uing to refine their sequences. The relative merits into a hidden Markov model for generating ge- of their contrasting approaches remain a topic of nomic sequence. Given a genomic sequence, one debate, but there is no doubt about the significance can use a dynamic programming algorithm called of their achievement. the Viterbi algorithm to calculate the most likely sequence of states that would occur during the Gene Finding emission of the given sequence by the model. Each Although the sequencing of the human genome is symbol is then identified as belonging to an exon, a landmark achievement, it is not an end in itself. intron, regulatory region, etc., according to the A string of three billion A’s, C’s, T’s, and G’s is of state that the HMM resided in when the symbol was little value until the meaning hidden within it has emitted. been extracted. This requires finding the genes, Another approach to gene finding is based on determining how their expression is regulated, the principle that functioning genes tend to be and determining the functions of the proteins they preserved in evolution. Two genes in different encode. These are among the goals of the field of species are said to be orthologous if they are de- . In this section we discuss the rived from the same gene in a common ancestral first of these tasks, gene finding. species. In a pair of species that diverged from

548 NOTICES OF THE AMS VOLUME 49, NUMBER 5 one another late in evolution, such as man and Distance-Based Methods mouse, one can expect to find many orthologous Distance-based methods are based on the concept pairs of genes which exhibit a high level of se- of an additive metric. Define a weighted phylogenetic quence similarity; hence the fact that a sequence tree T as a phylogenetic tree in which a nonnega- from the human genome has a highly similar coun- tive length λ(e) is associated with each edge e. terpart in the mouse increases the likelihood that Define the distance between two species as the both sequences are genes. Thus one can enhance sum of the lengths of the edges on the path between gene finding in both man and mouse by aligning the two species in the tree. The resulting distance the two genomes to exhibit possible orthologous function is called the additive metric realized by T. pairs of genes. Distance-based methods for phylogeny con- struction are based on the following assumptions: Phylogeny Construction 1. There is a well-defined evolutionary distance The evolutionary history of a genetically related between each pair of species, and this distance group of organisms can be represented by a phylo- function is an additive metric. genetic tree. The leaves of the tree represent extant 2. The “correct” phylogenetic tree, together with species. Each internal node represents a postulated appropriately chosen edge distances, realizes speciation event in which a species divides into two this additive metric. populations that follow separate evolutionary paths 3. The evolutionary distances between the extant and become distinct species. species can be estimated from the character The construction of a phylogenetic tree for a group state data for those species. of species is typically based on observed properties This suggests the following additive metric re- of the species. Before the era of genomics these prop- construction problem: Given a distance function D erties were usually morphological characteristics defined on pairs of species, construct a tree and a set such as the presence or absence of hair, fur, or of edge distances such that the resulting additive scales or the number and type of teeth. With the metric approximates D as closely as possible (for a advent of genomics the trees are often constructed suitable measure of closeness of approximation). by computer programs based on comparison of The following is an example of a simple stochas- related DNA sequences or protein sequences in tic model of which implies the different species. that the distances between species form an additive metric. The models used in practice are similar, but An instance of the phylogeny construction prob- more complex. lem typically involves n species and m characters. For The model is specified by a weighted phyloge- each species and each character a character state is netic tree T with edge lengths λ(e). The following given. If, for example, the data consists of protein assumptions are made: sequences of a common length aligned without gap All differences between the character states of symbols, then there will be a character for each 1. species are due to random mutations. column of the alignment, and the character state will 2. Each edge e represents the transition from an be the residue in that position. The output will be a ancestral species to a new species. Indepen- rooted binary tree whose leaves are in one-to-one dently for each character, the number of correspondence with the n species. mutations during this transition has a The informal principle underlying phylogenetic Poisson distribution with mean λ(e). tree construction is that species with similar char- 3. Whenever a mutation occurs, the new charac- acter states should be close together in the tree. ter state is drawn uniformly from the set of all Different interpretations of this principle yield dif- character states (not excluding the character ferent formulations of the tree construction prob- state that existed before the mutation). lem as an optimization problem, leading to several The neighbor-joining algorithm is a widely used classes of tree construction methods. In all cases, linear-time algorithm for the additive metric recon- the resulting optimization problem is NP-hard. We struction problem. Whenever the given distance shall discuss parsimony methods, distance-based function D is an additive metric, the neighbor- methods, and maximum-likelihood methods. joining algorithm produces a weighted tree whose Parsimony Methods metric is D. The neighbor-joining algorithm also The internal nodes of a phylogenetic tree are enjoys the property of asymptotic consistency. This intended to represent ancestral species whose means that, if the data is generated according to character states cannot be observed. In parsimony the stochastic model described above (or to certain methods of tree construction the task is to construct generalizations of that model) then, as the number a tree T and an assignment A of character states of characters tends to infinity, the tree and edge to the internal nodes to minimize the sum, over all weights produced by the neighbor-joining algorithm edges of the tree, of the number of changes in will converge to the correct tree and edge weights character state along the edge. with probability one.

MAY 2002 NOTICES OF THE AMS 549 Maximum Likelihood Methods each of the n elements, and the reversal operation Maximum likelihood methods for phylogeny con- reverses the order of a sequence of consecutive struction are based on a stochastic model such as elements and the sign of each of these elements. the one described above. Let Ax be the observed It turns out that the problem of computing assignment of states for character x to the extant the reversal distance between two (unsigned) species. For any model M =(T,λ) specifying the permutations is NP-hard, but there is an elegant tree structure and the edge lengths, let Lx(M) be the quadratic-time algorithm for computing the re- probability of observing the assignment Ax, given versal distance between two signed permutations. the model M. Define L(M), the likelihood of model M, as the product of Lx(M) over all characters x. The DNA Microarrays goal is to maximize L(M) over the set of all models. In this section we describe a key technology for This problem is NP-hard, but near-optimal solutions measuring the abundances of specific DNA or can be found using a combination of the following RNA molecules within a complex mixture, and three algorithms: we describe applications of this technology to the 1. an efficient algorithm based on dynamic study of associations between polymorphisms programming for computing the likelihood and disease, to the classification and clustering of of a model M =(T,λ); genes and biological samples, and to the analysis 2. an iterative numerical algorithm for optimizing of genetic regulatory networks. the edge lengths for a given tree; i.e., computing Two oriented DNA molecules x and y are called F(T ) = maxλ L(T,λ); complementary if y can be obtained by reversing 3. a heuristic algorithm for searching in the space x and replacing each base by its complementary of trees to determine maxT F(T ). base, where the pairs (A,T) and (C,G) are comple- Genome Rearrangement mentary. There is a similar notion of complemen- In comparing closely related species such as cab- tarity between RNA and DNA. Hybridization bage and turnip or man and mouse, one often finds is the tendency of complementary, or nearly that individual genes are almost perfectly conserved, complementary, molecules to bind together. but their locations within the genome are radically Specific molecules within a complex sample of different. These differences seem to arise from global DNA or RNA can be identified by detecting their rearrangements involving the duplication, reversal, hybridization to complementary DNA probes. A or translocation of large regions within a genome. DNA microarray is a regular array of DNA probes This suggests that the distance between genomes deposited at discrete addressable spots on a solid should be measured not only by counting mutations, surface; each probe is designed to measure the but also by determining the number of large-scale abundance of a specific DNA or RNA molecule rearrangements needed to transform one genome such as the mRNA transcript of a gene. It is pos- to another. sible to manufacture DNA microarrays with tens To study these problems mathematically we of thousands of spots on a surface the size of a view a genome as a sequence of occurrences of postage stamp. genes, define a set of primitive rearrangement Here we concentrate on the applications of DNA operations, and define the distance between two microarrays, omitting all technological details about genomes as the number of such operations needed the manufacture of the arrays, the application of to pass from one genome to the other. As an DNA or RNA samples to the arrays, and the mea- example, consider the case of two genomes that surement of hybridization. It is important to note, contain the same n genes but in different orders, however, that at the present state of the art the and in which the only primitive operation is the measurements are subject to large experimental reversal of a sequence of consecutive genes. Each error. Methods of experimental design and statis- genome can be modeled as a permutation of tical analysis are being developed to extract {1, 2,... ,n} (i.e., a sequence of length n contain- meaningful results from the noisy measurements, ing each element of {1, 2,... ,n} exactly once), and but currently one can obtain only a rough estimate we are interested in the reversal distance between of the abundance of particular molecules in the the two permutations, defined as the minimum sample. number of reversal operations required to pass from one permutation to the other. To make the Associations between Polymorphisms and problem more realistic we can take into account Disease that genes are oriented objects and that the reversal The genomes of any two humans differ consider- of a segment not only reverses the order of the ably. Each of us carries different alleles (commonly genes within it, but also reverses the orientation occurring variant forms) of genes and polymor- of each gene within the segment. In this case each phisms (local variations in the sequence, typically genome can be modeled as a signed permutation, due to mutations). Of particular interest are i.e., a permutation with a sign (+ or −) attached to Single- Polymorphisms (SNPs) caused

550 NOTICES OF THE AMS VOLUME 49, NUMBER 5 by mutations at a single position. Several million expression levels of a few dozens of critical genes, commonly occurring SNPs within the human with the other genes being irrelevant, redundant, genome have been identified. It is of great interest or of lesser significance. Thus there arises the fea- to find statistical associations between a genotype ture selection problem of identifying the handful of (the variations within an individual’s genome) and genes that best distinguish the classes inherent in phenotype (observable characteristics such as eye the data. Sometimes a gene expression matrix con- color or the presence of disease). Some genetic tains local patterns, in which a subset of the genes diseases result from a single polymorphism, but exhibit consistent expression patterns within a more commonly there are many genetic variations subset of experiments. These local patterns cannot that influence susceptibility to a disease; this is be discerned through a global partitioning of the the case for atherosclerosis, diabetes, and the many experiments or of the genes but require identifi- types of cancer. Microarrays are a fundamental cation of the relevant subsets of genes and of tool for association studies because they enable experiments. Research on the feature selection an experimenter to apply DNA probes for thou- problem and on the problem of identifying local sands of different polymorphisms in a single patterns is in its infancy. experiment. The statistical problems of finding Supervised Learning subtle associations between polymorphisms and Machine learning theory casts the problem of complex diseases are currently being investigated supervised learning in the following terms. Given intensively. a set of training examples {x1,x2,...,xn} drawn d Classification and Clustering Based on from a probability distribution over R , together Microarray Data with an assignment of a class label to each train- ing example, find a rule for partitioning all of Rd Microarrays can be used to identify the genetic into classes that is consistent with the class labels changes associated with diseases, drug treatments, for the training examples and is likely to general- or stages in cellular processes such as apoptosis ize correctly, i.e., to give correct class labels for (programmed cell death) or the cycle of cell growth other points in Rd . and division. In such applications a number of There is a general principle of machine learning array experiments are performed, each of which theory which, informally stated, says that, under produces noisy measurements of the abundances some smoothness conditions on the probability of many gene transcripts (mRNAs) under a given experimental condition. The process is repeated for distribution from which the training examples are many conditions, resulting in a gene expression drawn, a rule is likely to generalize correctly if matrix in which the rows represent experiments, 1. each training example lies within the region the columns represent genes, and the entries assigned to its class and is far from the represent the mRNA levels of the different gene boundary of that region; products in the different experiments. A funda- 2. the rule is drawn from a “simple” parame- mental computational problem is to find significant trized set of candidate rules. The notion of structure within this data. The simplest kind of simplicity involves a concept called the structure would be a partition of the experiments, Vapnik-Chervonenkis dimension, which we or of the genes, into subclasses having distinct omit from this discussion. patterns of expression. One reasonably effective decision rule is the In the case of supervised learning one is given nearest neighbor rule, which assigns to each point independent information assigning a class label the same class label as the training example at to each experiment. For example, each experiment minimum Euclidean distance from it. might measure the mRNA levels in a leukemia In the case of two classes (positive examples and specimen, and a physician might label each spec- negative examples) the support vector machine imen as either an acute lymphoblastic leukemia method is often used. It consists of the following (ALL) or an acute myeloid leukemia (AML). The two stages: computational task is to construct a decision rule Mapping into feature space: Map each training that correctly predicts the class labels and can be example x to a point φ(x)=(φ1(x),φ2(x),...), expected to generalize to unknown specimens. In where the φi are called features. The point φ(x) the case of unsupervised learning the class labels is called the image of training example x. are not available, and the computational task is Maximum-margin separation: Find a hyperplane to partition the experiments into homogeneous that separates the images of the positive train- clusters on the basis of their expression data. ing examples from the images of the negative Typically the number of genes measured in training examples and, among such separating microarray experiments is in the thousands, but hyperplanes, maximizes the smallest distance the classes into which the experiments should from the image of a training example to the be partitioned can be distinguished by the hyperplane.

MAY 2002 NOTICES OF THE AMS 551 The problem of computing a maximum-margin point is not definitely assigned to a cluster, but is separation is a quadratic programming problem. assigned a probability of lying in each cluster. It turns out that the only information needed about Merging methods start with each object in a the training examples is the set of inner products cluster by itself and repeatedly combine the two φ(x)T φ(y) between pairs (x, y) of training exam- clusters that are closest together as measured, for ples. It is possible for the images of the training example, by the distance between their centers of examples to be points in an infinite-dimensional gravity. Most merging and splitting methods that Hilbert space, as long as these inner products can have been proposed are heuristic in nature, since be computed. Mercer’s Theorem characterizes those they do not aim to optimize a clearly defined functions K(x, y) which can be expressed as inner objective function. products in finite- or infinite-dimensional feature spaces. These functions are called kernels, and the The Logic of Transcriptional Control freedom to use infinite-dimensional feature spaces Cellular processes such as cell division, programmed defined through their kernels is a major advantage cell death, and responses to drugs, nutrients, and of the support vector machine approach to super- hormones are regulated by complex interactions vised learning. among large numbers of genes, proteins, and other molecules. A fundamental problem of molecular Clustering life science is to understand the nature of this reg- Clustering is the process of partitioning a set of ulation. This is a very formidable problem whose objects into subsets based on some measure of complete solution seems to entail detailed mathe- similarity (or dissimilarity) between pairs of objects. matical modeling of the abundances and spatial Ideally, objects in the same cluster should be distributions within a cell of thousands of chemi- similar, and objects in different clusters should cal species and of the interactions among them. It be dissimilar. is unlikely that this problem will be solved within Given a matrix of gene expression data, it is of this century. interest to cluster the genes and to cluster the ex- One aspect of the problem that seems amenable periments. A cluster of genes could suggest either to mathematical methods is the logic of transcrip- that the genes have a similar function in the cell or tional control. As Eric Davidson has stated, “A large that they are regulated by the same transcription part of the answer lies in the gene control circuitry factors. A cluster of experiments might arise from encoded in the DNA, its structure and its functional tissues in the same disease state or experimental organization. The regulatory interactions mandated samples from the same stage of a cellular process. in the circuitry determine whether each gene is ex- Such hypotheses about the biological origin of a pressed in every cell, throughout developmental cluster would, of course, have to be verified by space and time, and if so, at what amplitude. In further biochemical experiments. physical terms the control circuitry encoded in Each gene or experiment can be viewed as the DNA is comprised of cis-regulatory elements, an n-dimensional vector, with each coordinate i.e., the regions in the vicinity of each gene which derived from a measured expression level. The contain the specific sequence motifs at which those similarity between points can be defined as the regulatory proteins which affect its expression inner product, after scaling each vector to Euclid- bind; plus the set of genes which encode these spe- ean length 1. When experiments are being clustered cific regulatory proteins (i.e., transcription factors).” the vectors are of very high dimension, and a It appears that the transcriptional control of a preliminary feature selection step is required to gene can be described by a discrete-valued function exclude all but the most salient genes. of several discrete-valued variables. The value of the In the K-means algorithm the number of clusters function represents the level of transcription of is specified in advance, and the goal is to mini- the gene, and each input variable represents the ex- mize the sum of the distances of points from the tent to which a has attached to centers of gravity of their clusters. Locally optimal binding sites in the vicinity of the gene. The genes solutions can be obtained by an iterative compu- that code for transcription factors are themselves tation which repeats the following step: Given a set subject to transcriptional control and also need to of K clusters, compute the center of gravity of each be characterized by discrete-valued functions. A cluster; then reassign each point to the cluster regulatory network, consisting of many interacting whose center of gravity is closest to the point. genes and transcription factors, can be described Maximum likelihood methods assume a given as a collection of interrelated discrete functions number of clusters and also assume that the points and depicted by a “wiring diagram” similar to the in each cluster have a multidimensional Gaussian diagram of a digital logic circuit. distribution. The object is to choose the parameters The analysis of this control circuitry involves of the Gaussian distributions so as to maximize the biochemical analysis and genomic sequence analy- likelihood of the observed data. In these methods a sis to identify the transcription factors and the

552N OTICES OF THE AMS VOLUME 49, NUMBER 5 sequence motifs characteristic of the sites at which they bind, together with microarray experiments which measure the transcriptional response of many genes to selected perturbations of the cell. These perturbations may involve changes in environmental factors such as temperature or the presence of a nutrient or drug, or interven- tions that either disable selected genes or enhance their transcription rates. The major mathematical challenges in this area are the design of informative perturbations and the inference of the transcriptional logic from infor- mation about transcription factors, their binding sites, and the results of microarray experiments under perturbed conditions.

Bibliography In this section we provide references for the principal topics introduced in this article. Sequence Comparison [1] D. GUSFIELD, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997. Hidden Markov Models; Phylogeny Construction [2] R. DURBIN, S. EDDY, A. KROGH, and G. MITCHISON, Bio- logical Sequence Analysis, Cambridge University Press, 1998. Sequence Assembly [3] EUGENE W. MYERS et al., A whole-genome assembly of Drosophila, Science 287 (2000), 2196–2204. Gene Finding [4] S. SALZBERG, D. SEARLS, and S. KASIF (Eds.), Computa- tional Methods in Molecular Biology, Elsevier Science, 1998. Genome Rearrangement [5] P. PEVZNER, Computational Molecular Biology, MIT Press, 2000. DNA Microarrays [6] M. SCHENA (Ed.), DNA Microarrays: A Practical Approach, Oxford University Press, 1999. Supervised Learning [7] N. CRISTIANINI and J. SHAWE-TAYLOR, An Introduction to Support Vector Machines, Cambridge University Press, 1999. Clustering [8] B. EVERITT, S. LANDAU, and M. LEESE, , Edward Arnold, fourth edition, 2001. The Logic of Transcriptional Control [9] E. DAVIDSON, Genomic Regulatory Systems, Academic Press, 2001.

MAY 2002 NOTICES OF THE AMS 553