Mathematical Challenges from Genomics and Molecular Biology Richard M
Total Page:16
File Type:pdf, Size:1020Kb
Mathematical Challenges from Genomics and Molecular Biology Richard M. Karp fundamental goal of biology is to un- algorithms and the role of combinatorics, opti- derstand how living cells function. This mization, probability, statistics, pattern recognition, understanding is the foundation for all and machine learning. higher levels of explanation, including We begin by presenting the minimal information Aphysiology, anatomy, behavior, ecology, about genes, genomes, and proteins required to and the study of populations. The field of molec- understand some of the key problems in genomics. ular biology analyzes the functioning of cells and Next we describe some of the fundamental goals of the processes of inheritance principally in terms of the molecular life sciences and the role of genomics interactions among three crucially important classes in attaining these goals. We then give a series of of macromolecules: DNA, RNA, and proteins. brief vignettes illustrating algorithmic and mathe- Proteins are the molecules that enable and execute matical questions arising in a number of specific most of the processes within a cell. DNA is the car- areas: sequence comparison, sequence assembly, rier of hereditary information in the form of genes gene finding, phylogeny construction, genome re- and directs the production of proteins. RNA is a key arrangement, associations between polymorphisms intermediary between DNA and proteins. and disease, classification and clustering of gene Molecular biology and genetics are undergoing expression data, and the logic of transcriptional revolutionary changes. These changes are guided by control. An annotated bibliography provides point- a view of a cell as a collection of interrelated sub- ers to more detailed information. systems, each involving the interaction among many genes and proteins. Emphasis has shifted from the Genes, Genomes, and Proteins study of individual genes and proteins to the explo- The Double Helix ration of the entire genome of an organism and the The field of genetics began with Gregor Mendel study of networks of genes and proteins. As the level (1865), who postulated the existence of discrete of aspiration rises and the amount of available data grows by orders of magnitude, the field becomes units of information (which later came to be called increasingly dependent on mathematical modeling, genes) that govern the inheritance of individual mathematical analysis, and computation. In the characteristics in an organism. In the first half of sections that follow we give an introduction to the the twentieth century it was determined that the mathematical and computational challenges that genes are physically embodied within complex arise in this field, with an emphasis on discrete DNA macromolecules that lie within structures called chromosomes which occur in every living Richard M. Karp is a member of the International Computer cell. This set the stage for the epochal discovery Science Institute (ICSI), Berkeley, and a University Professor of the structure of DNA by Watson and Crick in at the University of California, Berkeley. His e-mail address 1953. They showed that a DNA molecule is a is [email protected]. double helix consisting of two strands. Each helix 544 NOTICES OF THE AMS VOLUME 49, NUMBER 5 is a chain of bases, chemical units of four types: A, according to the genetic code, which maps succes- C, T, and G. Each base on one strand is joined by sive triplets of RNA bases to amino acids. With a hydrogen bond to a complementary base on the minor exceptions, this many-to-one function from other strand, where A is complementary to T, and the sixty-four triplets of bases to the twenty amino C is complementary to G. Thus the two strands con- acids is the same in all organisms on Earth. tain the same information. Certain segments within Regulation of Gene Expression these chromosomal DNA molecules contain genes, All the cells within a living organism (with the ex- which are the carriers of the genetic information ception of the sperm and egg cells) contain nearly and, in a sense to be explained later, spell the identical copies of the entire genome of the organ- names of the proteins. Thus the genetic informa- ism. Thus every cell has the information needed to tion is encoded digitally, as strings over the four- produce any protein that the organism can produce. letter alphabet {A, C, T, G}, much as information Nevertheless, cells differ radically in the proteins is encoded digitally in computers as strings of that they actually produce. For example, there are zeros and ones. more than 200 different human cell types, and most In humans there are forty-six chromosomes. proteins are produced in only a subset of these cell All but two of these (the sex chromosomes) occur in types. Moreover, any given cell produces different pairs of “homologous” chromosomes. Two homol- proteins at different stages within its cycle of oper- ogous chromosomes contain the same genes, but ation, and its protein production is influenced by its a gene may have several alternate forms called internal environment and by the signals impinging alleles, and the alleles of a gene on the two chromo- upon it from other cells. somes may be different. It is clear, then, that the expression of a gene within The total content of the DNA molecules within the a cell (as measured by the abundance and level of chromosomes is called the genome of an organism. activity of the proteins it produces) is regulated by Within an organism, each cell contains a complete the environment of the cell. Transcription of a gene copy of the genome. The human genome contains is typically regulated by proteins called transcrip- about three billion base pairs and about 35,000 genes. tion factors that bind to the DNA near the gene and Proteins enhance or inhibit the copying of the gene into RNA. Proteins are the workhorses of cells. They act as Similarly, translation can be regulated by proteins structural elements, catalyze chemical reactions, that bind to the ribosome. Certain post-translational regulate cellular activities, and are responsible for processes, such as the chemical modification of the communication between cells. A protein is a linear protein or the transport of protein to a particular chain of chemical units called amino acids, of which compartment in the cell can also be regulated so there are twenty common types. The function of a as to affect the activity of the protein. Thus gene protein is determined by the three-dimensional expression can be viewed as a complex network of structure into which it folds. One of the premier interactions involving genes, proteins, and RNA, as problems in science is the protein folding problem well as other factors such as temperature and the of predicting the three-dimensional structure of a presence or absence of nutrients and drugs within protein from its linear sequence of amino acids. the cell. This problem is far from being solved, although progress has been made by a variety of methods. The Goals of Genomics These range from numerical simulation of the In this section we enumerate some of the goals of physical forces exerted by the amino acids on genomics. one another to pattern recognition techniques 1. Sequence and compare the genomes of differ- which correlate motifs within the linear amino acid ent species. To sequence a genome means to sequence with structural features of a protein. determine its sequence of bases. This sequence From Genes to Proteins will, of course, vary from individual to indi- The fundamental dogma of molecular biology is that vidual, and those individual differences are of DNA codes for RNA and RNA codes for protein. Thus paramount importance in determining each the production of a protein is a two-stage process, individual’s genetic makeup, but there is with RNA playing a key role in both stages. An RNA enough agreement to justify the creation of a molecule is a single-stranded chain of chemical composite reference genome for a species. For bases of four types: A, U, C, and G. In the first stage, example, any two humans will have the same called transcription, a gene within the chromosomal complement of genes (but different alleles) DNA is copied base-by-base into RNA according to and will agree in about 999 bases out of 1,000. the correspondence A → U, C → G, T → A, G → C. The sequencing of the human genome The resulting RNA transcript of the gene is then has been a central goal of the world genomics transported within the cell to a molecular machine community since 1990. Draft sequences were called a ribosome which has the function of trans- completed in February 2001, and the quest lating the RNA into protein. Translation takes place continues for a much more accurate sequence. MAY 2002 NOTICES OF THE AMS 545 This achievement was preceded by the se- bases or amino acids), we need a definition: An quencing of many bacterial genomes, yeast, alignment of a pair of sequences x and y is a new the nematode, and, in June 1999, the fruit fly pair of sequences x and y of equal length such Drosophila melanogaster. The sequencing of that x is obtained from x and y is obtained from a new organism is often of value for medical, y by inserting occurrences of the special space agricultural, or environmental studies. In symbol (−). Thus, if x = acbcdb and y = abbdcdc, addition, it may be useful for comparative then one alignment of x and y is as follows: studies with related organisms. x = a - c b c - d b, 2. Identify the genes and determine the func- y = a b - b d c d c. tions of the proteins they encode.