
Concepts of Genomic Biology

Author: Robert D. Locy Professor, Genomic Biology Auburn University Fall 2015

Table of Contents:

Preface
Websites of Interest
1. Introduction
1.1. What is a Gene?
1.2. What is a Genome?
1.3. What is Genomic Biology?
1.3.1. Structural Genomics
1.3.2. Comparative Genomics
1.3.3. Functional Genomics
1.4. Genomic Databases
2. The beginnings of Genomic Biology – classical genetics
2.1. Mendel & Darwin – traits are conditioned by genes
2.2. Genes are carried on chromosomes
2.3. The chromosomal theory of inheritance
2.4. Additional Complexity of Mendelian Inheritance
2.4.1. Multiple alleles
2.4.2. Incomplete dominance and co-dominance
2.4.3. Sex linked inheritance
2.4.4. Epistasis
2.4.5. Epigenetics
2.5. The Law of Independent Assortment
2.5.1. Meiosis: chromosomes assort independently
2.5.2. Mapping genes on chromosomes
2.6. Quantitative Genetics: Traits that are Continuously Variable
2.7. Population Genetics: Traits in groups of individuals
3. The beginnings of Genomic Biology – molecular genetics
3.1. DNA is the Genetic Material
3.2. Watson & Crick – The structure of DNA
3.3. Chromosome structure
3.3.1. Prokaryotic chromosome structure
3.3.2. Eukaryotic chromosome structure
3.3.3. &
3.4. DNA Replication
3.4.1. DNA replication is semiconservative
3.4.2. DNA polymerases
3.4.3. Initiation of replication
3.4.4. DNA replication is semidiscontinuous
3.4.5. DNA replication in .
3.4.6. Replicating ends of
3.5.
3.5.1. Cellular RNAs are transcribed from DNA
3.5.2. RNA polymerases catalyze transcription
3.5.3. Transcription in
3.5.4. Transcription in Prokaryotes - Polycistronic mRNAs are produced from operons
3.5.5. Beyond Operons – Modification of expression in Prokaryotes
3.5.6. Transcriptions in Eukaryotes
3.5.7. Processing primary transcripts into mature mRNA
3.6.
3.6.1. The Nature of
3.6.2. The Genetic Code
3.6.3. tRNA – The decoding molecule
3.6.4. Peptides are synthesized on Ribosomes
3.6.5. Translation initiation, elongation, and termination
3.6.6. Sorting in Eukaryotes
4. Genomic Biologist's tool kit
4.1. Restriction Endonucleases – making “sticky ends”
4.2. Cloning Vectors
4.2.1. Simple Cloning Vectors
4.2.2. Expression Vectors
4.2.3. Shuttle Vectors
4.2.4. Phage Vectors
4.2.5. Artificial Vectors
4.3. Methods for Sequence Amplification
4.3.1. Polymerase Chain Reaction
4.3.2. Cloning Recombinant DNA
4.3.3. Cloning DNA in Expression Vectors
4.3.4. Making complementary DNA (cDNA)
4.4. Methods for Sequence Amplification - Cont.
4.4.5. Cloning a cDNA Library
4.5. Genomic Libraries
4.6. DNA separation – electrophoresis
4.7. DNA sequence identification – DNA hybridization
5. Structural Genomics
5.4. Sequencing DNA molecules
• Sanger sequencing – dideoxy sequencing
• Automated capillary DNA sequencing robots
• Next generation sequencing – pyrosequencing
5.5. Genomic sequence libraries
5.6. Map-based strategies – molecular polymorphisms
5.7. Whole genome shotgun sequencing
5.8. Bioinformatics and gene identification
5.9. About sequenced
6. Comparative Genomics
6.4. Genomic variation – mutations
6.5. Genomic variation – polymorphisms
6.6. Phylogenetic trees
6.7. The tree of
7. Functional Genomics – Overview
7.4. Identification of protein structure and function
7.5. Non-protein-coding genes
7.6. Gene expression – Prokaryotes
7.7. Gene expression – Eukaryotes
7.8. Gene expression – Signal Transduction
7.9. The transcriptome – Measuring gene expression
• Northern blot
• RT-PCR
• Quantitative PCR
• Microarray
7.10. The Proteome
7.11. The Metabolome
8. Genomic Applications
8.4. Human biology
8.5. The Environment
8.6. Food & fiber production
8.7. Evolutionary biology
Epilogue

PREFACE

Prior to about 1990 few people had conceived of the idea of a genome, much less undertaken the investigation of one. However, in 1990 the Human Genome Project (HGP) was initiated as an international scientific research collaboration with the goals of: 1) determining the sequence of nucleotide bases that make up a haploid copy of human chromosomes; 2) identifying all of the genes of the human genome both physically and functionally; and 3) mapping all of the genes identified to specific human chromosomes.

The HGP remains the world's largest collaborative biological project. The project was proposed and funded by the US government through the National Institutes of Health (NIH). Planning started in 1984, and the project got underway in 1990. In 2000 President Bill Clinton announced the production of a first draft of the human genome, and in 2003 the HGP was declared essentially complete. In fact, work on gene identification, mapping, and function is ongoing even today, and has yielded a treasure trove of knowledge about the human genome as well as the genes and genomes of many other organisms, including microbes and fungi, that is revolutionizing not only health science-related research but virtually every aspect of contemporary biology.

The publicly funded project was led by Dr. Francis Collins and involved more than twenty universities and research centers in the United States, the United Kingdom, Japan, France, Germany, and China. A parallel project was conducted in the private sector, led by Dr. Craig Venter of Celera Genomics (Celera Corporation), which was formally launched in 1998. The dynamics and interaction of these 2 efforts make an interesting study of how we identify and fund science today, and of how public sector research is both in competition with and in collaboration with privately funded corporate research. The story is rich enough that several books have been written describing the HGP and Celera Genomics efforts. A couple of these are given in the book list below.

This eBook is intended to provide knowledge of the technology, achievements, and ongoing activities of what started as the HGP, but now involves much broader considerations that are shaping the future of the study of biology. In order to appreciate this information and to make it maximally useful, a brief synopsis of important concepts from classical and molecular genetics is presented. This is followed by an analysis of the technology used by genomic biologists, and a summary of the significant findings in DNA sequencing (structural genomics), sequence comparisons across a wide range of organisms (comparative genomics), and information on how genes in genomes produce their phenotype (functional genomics).

Bob Locy, December, 2014

Websites of interest

General
• EMBL (European Bioinformatics Institute)
• Genome News Network
• Human Genome Project (HGP)
• National Center for Biotechnology Information (NCBI)
• Tree of Life Web Project

Genome Databases
• CAMERA, Resource for microbial genomics and metagenomics
• Corn, the Maize Genetics and Genomics Database
• EcoCyc, E. coli K-12 database
• PATRIC, the PathoSystems Resource Integration Center
• Flybase, Drosophila melanogaster genome
• JGI Genomes of the DOE-Joint Genome Institute
• Mouse Genome Database (MGI)
• National Microbial Pathogen Data Resource
• Repbase, database for repetitive elements (transposons)
• Saccharomyces Genome Database (yeast)
• Xenbase, genome of Xenopus tropicalis and Xenopus laevis
• Wormbase, Caenorhabditis elegans database
• Zebrafish Information Network
• TAIR, The Arabidopsis Information Resource
• Rat Genome Database (RGD)
• Banana Genome Hub
• Bacterial Small Regulatory RNA Database


CHAPTER 1. INTRODUCTION

To begin our study of genomic biology we need to gain a common vocabulary with which we can move forward. At this point the definitions should come from your background knowledge equivalent to a beginning biology course. If these definitions are unfamiliar, you will need to refresh them using either the enclosed hyperlinks or a general biology or genetics textbook. As we move forward with our study of genomic biology all of these definitions will become more refined and more meaningful, so that by the end of the course you might wish to use a more thorough definition than our beginning point here.

1.1. WHAT IS A GENE?

There are really two definitions of a gene that could be given. These are:

Classical genetic definition – A gene is the unit of heredity that carries genetic information that produces the trait of an organism from one generation to the next. Subsequently, it has been demonstrated that genes reside on chromosomes, and chromosomes are passed from one generation to the next.

Molecular genetic definition – A gene is a locatable region of genomic sequence, usually associated with a chromosome, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions ultimately generating a phenotypic trait. The diagram below describes some of these terms, but for now you only need to understand that a gene can be defined in terms of specific DNA sequences residing on the DNA molecule carrying genetic information that is part of the chromosome.

Figure 1.2. The parts of the Gene in the DNA sequence. This sequence is transcribed to produce a pre-mRNA, which is processed into an mRNA that can be translated to produce a protein.

Note that the molecular definition of a gene suggests that there is a region of a chromosome on which a gene resides within something called a genome. This definition draws on the background in biology and genetics that we have learned in other courses. We will expand on this further in the remainder of the book.

1.2. WHAT IS A GENOME?

The term genome originated in the early part of the 20th century, apparently as a combination of the terms gene and chromosome. It was originally meant to indicate the sum of all of the genes on all of the chromosomes of an organism, or alternatively, the entire set of hereditary information for building, running, and maintaining an organism. As such the definition of a genome applies to all living things including:

• Viruses
• Prokaryotes
• Eukaryotes

Note that among these groups, some have a genome consisting of but a single chromosome, while others (mostly Eukaryotes) have genomes made up of multiple chromosomes. Further note that most Eukaryotic organisms have organelles such as chloroplasts and/or mitochondria that each contain genomes separate from that of the nucleus.

Once it was determined that DNA carries the genetic information that makes up genes, and that this was physically the basic substance required to produce the traits that we all have, genomes took on a definition in terms of DNA itself. What we know is that an organism's genome is made up of chromosomes, that chromosomes carry genes, and that DNA is the substance carrying the information of the genes.

1.3. WHAT IS GENOMIC BIOLOGY?

There is more to genomic biology than merely obtaining the genetic information carried in DNA molecules (the sequence of base pairs in the DNA). Other important information is required for a gene to specify a trait. For example, some information is sustained in each cellular generation at the chromosomal level, and the genome as a whole produces interactions that further determine gene function and the influence of the environment on the expression of genes.

Thus, the study of genomic biology must incorporate not just the simplicity of DNA as the information-carrying molecule, but also the myriad complex and sophisticated interactions inside biological systems that mediate the regulation of that simple genetic information.

To organize our study of the genomic universe, typically 3 subdisciplines are considered. These are: structural genomics, the techniques, strategies, and analysis of primary genomes of organisms; comparative genomics, the comparison of genes and genomes from an array of related or unrelated organisms; and functional genomics, understanding the factors that mediate the function of genes. A brief overview of these subdisciplines is given below.

1.3.1. Structural Genomics

Structural genomics (Genome Structure) may be the original definition of genomics. It involves the application of recombinant DNA technology, DNA sequencing methods, and bioinformatics to sequence, assemble, and analyze the complete structure of genomes. Advances in genomics have triggered a revolution in discovery-based research. The field includes efforts to determine the entire DNA sequence of organisms and fine-scale genetic mapping. Because structural genomics relies heavily on the sequencing of complete genomes, this area of genomics has evolved rapidly as DNA sequencing technology has evolved. A parallel emphasis on computer algorithms for assembling the shorter sequences obtained from modern DNA sequencers has allowed the rapid development of genomic sequencing and a tremendous lowering of sequencing costs.

Figure 1.3. Structural genomics involves primary DNA sequencing. This is a comparison of the output from dideoxy Sanger sequencing showing either black and white or color outputs.
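To give a feel for what "assembling shorter sequences" means, here is a minimal Python sketch of a greedy overlap merge. It is an illustration only, not the method used by any particular sequencing project: the function names, the `min_length` threshold, and the toy reads are all assumptions made for this example, and real assemblers must also handle sequencing errors, repeats, and reads from both DNA strands.

```python
def overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of reads with the longest overlap (toy example)."""
    reads = list(reads)
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(reads):
            for j, right in enumerate(reads):
                if i != j:
                    size = overlap(left, right, min_length)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps; leave the pieces as separate contigs
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three hypothetical overlapping reads from one short stretch of DNA.
contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])
print(contigs)
```

The three toy reads overlap end to end, so the sketch joins them into a single contig. With reads that share no overlap of at least `min_length` bases, it simply returns the unmerged pieces.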

1.3.2. Comparative Genomics

Comparative genomics is a subdiscipline of genomic biology in which the genomic features of different organisms are compared. Genomic features may include the DNA sequence, genes and gene order, regulatory sequences, and other genomic structural features. In this branch of genomics, whole or large parts of genomes resulting from genome sequencing projects are compared to study basic biological similarities and differences as well as evolutionary relationships between organisms.

The major principle of comparative genomics is that common features of two organisms will often be encoded within DNA that is conserved through (evolutionary) time. Therefore, comparative genomic approaches typically begin by making some form of sequence alignment of genome sequences, looking for orthologous sequences (sequences that share a common ancestry), and checking the extent to which those sequences are conserved. Based on this orthology, inferences about the molecular evolution of the genomes are made and interpreted in the context of, for example, the phenotype of the organism or the genetics of whole populations.
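One simple way to express "the extent to which sequences are conserved" is percent identity over an alignment. The Python sketch below assumes the two sequences have already been aligned by some other tool and uses "-" for gaps; the function name and the example sequences are hypothetical, and counting only gap-free columns is just one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free alignment columns in which the two sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a, aligned_b):
        if base_a == "-" or base_b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if base_a == base_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Two hypothetical aligned fragments; "-" marks a gap introduced by alignment.
print(percent_identity("ATGC-A", "ATGGTA"))
```

In this toy alignment five columns are gap-free and four of them match, so the function reports 80 percent identity.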

Figure 1.4. The interrelationships of all living organisms to each other can be investigated at the DNA sequence level. One of the results of this effort is the Tree of Life Project. This project is an outstanding example of the power of comparative genomics.

1.3.3. Functional Genomics

Functional genomics is the subdiscipline of genomic biology that focuses on how genes and genomes function. Once the genes in a genome are discovered by DNA sequencing (such as in genome sequencing projects), functional studies seek to describe gene and protein functions. Functional genomics focuses on the dynamic aspects of the genome, such as gene transcription, translation, and protein–protein interactions, as opposed to the static aspects of the genomic information such as DNA nucleotide sequence or structures.

Functional genomics attempts to answer questions about the function of DNA at the levels of genes, RNA transcripts (transcriptomics), and protein products (proteomics). A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods such as microarrays (see below) rather than a more traditional “gene-by-gene” approach.

Figure 1.5. A colorized microarray. Colors represent different levels of gene expression. Microarrays are used to examine genome-wide gene expression data.

The goal of functional genomics is to understand the relationship between an organism's genome and its phenotype. Functional genomics involves studies of natural variation in genes, RNA, and proteins over time (such as an organism's development) or space (such as its body regions), as well as studies of natural or experimental functional disruptions, such as environmental stresses, affecting genes, chromosomes, RNA, or proteins.

The promise of functional genomics is to expand and synthesize genomic and proteomic knowledge into an understanding of the dynamic properties of an organism at cellular and/or organismal levels. This would provide a more complete picture of how biological function arises from the information encoded in an organism's genome. As such, functional genomics is the major ongoing focus of genomic biology today, and although we have learned a great deal about gene function, much remains to be learned.


1.4. GENOMIC DATABASES

Much of the structural, comparative, and functional genomic information is collected into various databases. There are many international specialty database sites available, but the largest single database is the National Center for Biotechnology Information (NCBI) database at the National Library of Medicine (NLM), sponsored by the National Institutes of Health (NIH).

The use of genomic databases is critically important to the study of genomic biology. They are the main mechanism for communicating genomic information. Many such sites for your use can be found in the Websites of Interest section under Genome Databases above.

Figure 1.6. A colorized microarray. Colors represent different levels of gene expression. Microarrays are used to examine genome-wide gene expression data.


CHAPTER 2. THE BEGINNINGS OF GENOMIC BIOLOGY – CLASSICAL GENETICS

It should be clear that the beginnings of genomic biology are grounded in classical or Mendelian Genetics. Once the relationship between traits and genes was understood, the relationship between cells and genetics was investigated, leading to the discovery of chromosomes, and a quest for the substance that carried the genetic information began, culminating in the discovery of DNA. These studies constitute the contribution of classical genetics to the founding of the genomic era.

2.1. MENDEL & DARWIN – TRAITS ARE CONDITIONED BY GENES.

The idea of genomic biology begins with a consideration of what makes up genomes, specifically, what genes are. The timeline of genetics and genomics begins with the early work of Charles Darwin and Gregor Mendel, who didn’t really talk about genes per se, but who did describe the behavior of the characteristics of biological organisms, which they referred to as traits.

In 1859 Charles Darwin published his book On the Origin of Species. In this work Darwin assembled a mass of descriptive support for the concept that “traits” are stably transmitted through subsequent generations, and that organisms that have superior traits survive their natural environment to pass those traits on to the next generation. However, Darwin did not describe any mechanism for such transmission of traits to the next generation.

Experimental evidence for a mechanism explaining how traits pass to subsequent generations came in 1866 when an Austrian monk, Gregor Mendel, published his studies covering 10 years' worth of work on the mechanism of inheritance of 7 characteristics in garden peas in a paper called “Experiments in Plant Hybridization”.

In 1865 Mendel delivered two long lectures that were published in 1866 as “Experiments in Plant Hybridization.” This established what eventually became formalized as the Mendelian Laws of inheritance:

• The law of dominance. For each trait, one factor (gene) is dominant and appears as the phenotype in the first filial generation (F1). In the F2 generation the dominant trait occurs more often, in a definite 3:1 ratio. The alternative form is recessive. In Mendel's peas, tallness was dominant, shortness recessive. Therefore, three times as many F2 plants were tall as were short. This constant ratio represents the random combination of alleles during reproduction. Any combination of alleles that includes the dominant allele will express that form of the trait.

• The law of segregation. Inherited characteristics (such as stem length in Mendel's pea plants) exist in alternative forms (tallness, shortness), today known as alleles. For each characteristic, an individual possesses two paired alleles, one inherited from each parent. Correspondingly, these pairs segregate (i.e. separate) in germ cells and recombine during reproduction so that each parent transmits one allele to each offspring.

• The law of independent assortment. Specific traits operate independently of one another. A pea plant might have a stem that is tall or short, but in either case may produce white or gray seed coats.

However, the significance of Mendel’s work and his insight into the mechanism of inheritance went unrecognized until 1900, when three European scientists, Hugo de Vries, Carl Correns, and Erich von Tschermak, reached similar conclusions in their own research, though they claimed to be unaware of Mendel’s earlier theory of the 'discrete units' on which genetic material resides.

The biological entity (factor) responsible for defining traits was later termed a gene by Wilhelm Johannsen in 1909, but the biological basis for inheritance remained unknown until DNA was identified as the genetic material in the 1940s. Thus, it was early in the 20th century that the name “gene” was given to the hereditary unit described by Mendel decades earlier, and the study of genetics and genomics began in earnest.

2.2. GENES ARE CARRIED ON CHROMOSOMES.

At about the same time that genes were coming into focus as having a role in inheritance, a series of observations at the cellular level established:

• The existence of structures called chromosomes.

• Chromosomes carry genes.

The notion that Mendel’s particulate hereditary factors reside on visible structures called chromosomes was originally proposed independently by Theodor Boveri, a German scientist, and Walter Sutton, an American graduate student, in 1902 at about the same time that Mendel’s Laws of inheritance were being rediscovered. The developing theory stated:

• More than one gene is located on each chromosome. Thus, chromosomes are like a string of beads with each gene represented as a bead. Along the length of the chromosome (string of beads) there are genes for many traits, and each gene occupies a specific position on the chromosome called a locus.

• The chromosomes are passed from one generation to the next and carry genes to the next generation as they are passed.

These points were incorporated into what we now know as the Chromosomal Theory of Inheritance.

2.3. THE CHROMOSOMAL THEORY OF INHERITANCE.

In the early years of the 20th century Thomas Hunt Morgan, who was skeptical about the theories of the day concerning Mendel’s observations and the role of chromosomes in inheritance, began conducting a series of experiments using the fruit fly, Drosophila melanogaster, that ultimately convinced him of the details of inheritance, leading to what is called today the chromosomal theory of inheritance. The general tenets of this theory are given below:

• Multiple genes conditioning the cellular and organismal traits an organism possesses are passed from one cellular or organismal generation to the next on chromosomes.

• Genes for specific traits reside at specific positions on chromosomes called loci (singular locus).

• Most cells of an organism have homologous pairs of chromosomes for each chromosome found in the cell.

• The complete set of chromosomes an organism possesses is called its karyotype.

Gametes, the eukaryotic cells that pass chromosomes to the next organismal generation, contain only a haploid number of chromosomes (23 in the case of humans). Thus, gametes have only 1 chromosome from each pair found in a non-gametic cell. Chromosome numbers are constant for a species, but vary from one species to another.

• One of the chromosomes in each homologous pair comes from the maternal parent while the other chromosome in the pair comes from the paternal parent.

• Although traits are conditioned by genes at specific loci on the chromosomes, the gene at a given locus coming from each parent may not be the same. It can be either the dominant factor (according to Mendel’s law of dominance) or the recessive factor. We now call the nature of the factor (gene) at each locus an allele.

• When both the maternal and paternal homologous chromosomes contain the same allele, the organism is said to be homozygous, but if the alleles contained at the locus on the homologous chromosomes are different the organism is said to be heterozygous.

• When an organism is homozygous, if the allele it bears is the dominant allele, the organism has a homozygous dominant genotype, while a homozygous organism bearing 2 identical recessive alleles has a homozygous recessive genotype.

• The genotype that an organism possesses, in combination with environmental factors, is responsible for production of the trait that we see. This is also a definition of the phenotype of an individual, i.e. the appearance of the individual resulting from the interaction of genotype and environmental factors. Thus, an organism can demonstrate a dominant phenotype or a recessive phenotype.

The complete set of human chromosomes is shown in Figure 2.1. Humans have 22 pairs of autosomal chromosomes, plus the X and Y sex chromosomes, which are present as XY in males and XX in females. Thus, we say that there are 22 pairs of homologous autosomal chromosomes plus a pair of sex chromosomes (XX or XY) in humans. Humans have 46 chromosomes in total, which is the diploid number; the haploid number is 23.

Figure 2.1. The complete set of 23 pairs of human chromosomes is shown in the karyotype above. Note that there are 22 pairs of autosomal chromosomes, and the X and Y sex chromosome “pair”. Humans have 46 (diploid number) chromosomes in total.

What Mendel observed was the phenotype of his pea plants. From observations of phenotype he proposed a model for the genotypic behavior of his “factors” that we now know as genes. We also know that these genes reside on chromosomes, and the manner in which the chromosomes are passed to the next generation provides the basis for Mendel’s law of segregation, which directly relates the behavior of the chromosomes bearing the genes to the phenotypic behavior that Mendel observed. However, there are a number of instances where, although Mendel’s law of segregation applies, additional background is required to appreciate how Mendel’s work applies.

2.4. ADDITIONAL COMPLEXITY OF MENDELIAN INHERITANCE.

Once the simple laws of Mendel that governed inheritance had been established and related to the behavior of chromosomes, there were many examples of situations that were not fully accounted for by the simple laws. In the early 20th century there was great controversy, not just about the chromosomal theory and its relationship to inheritance of traits, but about other known examples that appeared not to be explained by Mendel and the chromosomal theory.

Resolution of these issues took decades and required careful, thorough, and well-designed experiments to provide us with an understanding of many of these situations. In fact, a few of these controversies were not fully resolved until the genomic era, and some are still being investigated today.


2.4.1. Multiple alleles

Note that for some traits more than 2 alleles exist. In this case there is a hierarchy of dominance among the multiple alleles. In any given individual, the more dominant of the 2 alleles it possesses acts as the dominant allele, while the more recessive one acts as the recessive allele. Examples of traits with multiple alleles include the ABO blood type system and the rabbit coat color example shown in Figure 2.2.

There are 4 unique alleles that have been found at the C-locus, which is one of 5 separate genetic loci that generate coat color patterns in rabbits. The hierarchy of dominance that has been observed at the C-locus suggests that the wild type “large C” allele is the “most” dominant of the alleles in the dominance hierarchy, and the “most recessive” of the alleles is the “small c” allele. A rabbit whose genotype is cc has an albino phenotype, while a rabbit with a CC genotype will be fully colored (e.g. agouti, or black that is really dark grey, as described in Figure 2.2). The second most dominant allele is the chinchilla allele (the cch-allele), and the ch-allele is intermediate in dominance between the cch-allele and the c-allele.

Figure 2.2. Phenotypic description of the alleles of the C-locus for coat color in rabbits. Note that this patterning is also found in many other animals although the names of the phenotypes may differ.
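The dominance hierarchy at the C-locus can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it treats the hierarchy as strict (the more dominant allele fully determines the phenotype), which ignores the intermediate heterozygote phenotypes discussed in the next section. The function name is hypothetical, and the label "himalayan" for the ch-allele is supplied for the example and is not named in the text above.

```python
# Alleles of the rabbit C-locus, listed from most dominant to most recessive
# (order taken from the text: C, then cch, then ch, then c).
DOMINANCE_ORDER = ["C", "cch", "ch", "c"]

# Phenotype labels; "himalayan" for ch is an assumption made for this sketch.
PHENOTYPE = {
    "C": "full color",
    "cch": "chinchilla",
    "ch": "himalayan",
    "c": "albino",
}


def coat_phenotype(allele_1, allele_2):
    """Phenotype set by whichever of the two alleles ranks higher in dominance."""
    dominant = min((allele_1, allele_2), key=DOMINANCE_ORDER.index)
    return PHENOTYPE[dominant]


for genotype in [("C", "C"), ("C", "c"), ("cch", "ch"), ("ch", "c"), ("c", "c")]:
    print(genotype, "->", coat_phenotype(*genotype))
```

Storing the alleles as an ordered list keeps the hierarchy in one place, so the rule "the more dominant allele wins" reduces to finding the allele with the smaller index.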

2.4.2. Incomplete dominance and co-dominance

It is also possible for 2 alleles to demonstrate an intermediate phenotype in the heterozygous condition. This phenomenon is referred to as incomplete dominance (similar to co-dominance), and can be observed in Figure 2.2, where the phenotype of a cchch or cchc heterozygous rabbit is distinct and intermediate between the homozygous (more) dominant cchcch and the homozygous (more) recessive chch or cc phenotypes.

Another example is given in Figure 2.3, where pure breeding (homozygous) red and white flowered plants are crossed to give rise to intermediate heterozygous pink plants. In some plants the intermediate heterozygotes appear as separate distinct patches of color. This is typical of the description of co-dominant traits, where the distinct alleles in a heterozygote are both visible. Thus, co-dominance and incomplete dominance may be a distinction without a difference.

Figure 2.3. Example of incomplete dominance in flower color of four o’clocks. A) Red flowering x White flowering yields all pink flowers; B) pink flowering x pink flowering yields 1 red : 2 pink : 1 white flowers; and C) pink flowering x white flowering yields half pink and half white flowers.
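The three crosses described for Figure 2.3 can be reproduced with a small Punnett square sketch in Python. The genotype symbols R and r and the function name are assumptions chosen for this illustration. Each parent is written as a two-letter genotype, and every pairing of one allele from each parent is counted once.

```python
from collections import Counter
from itertools import product

# With incomplete dominance, each genotype has its own phenotype.
FLOWER_COLOR = {"RR": "red", "Rr": "pink", "rr": "white"}


def cross(parent_1, parent_2):
    """Count offspring phenotypes over the four cells of a Punnett square."""
    offspring = Counter()
    for allele_1, allele_2 in product(parent_1, parent_2):
        genotype = "".join(sorted(allele_1 + allele_2))  # write 'rR' as 'Rr'
        offspring[FLOWER_COLOR[genotype]] += 1
    return dict(offspring)


print("A) red x white :", cross("RR", "rr"))
print("B) pink x pink :", cross("Rr", "Rr"))
print("C) pink x white:", cross("Rr", "rr"))
```

Cross B gives the 1 red : 2 pink : 1 white ratio from the figure. Replacing the phenotype table with one in which both RR and Rr map to the same phenotype would turn the same cross into the 3:1 ratio of simple dominance.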


2.4.3. Sex linked inheritance

Another example that differs from typical Mendelian inheritance is sex-linked inheritance. In organisms that have X and Y chromosomes, such as Drosophila and humans, the female typically has a pair of X chromosomes (XX) while the male has an X and a Y chromosome (XY). When a red-eyed female fruit fly is crossed with a white-eyed male, the result is all red-eyed progeny. This might seem like a normal autosomal inheritance pattern where red eyes are a dominant trait. However, in the reciprocal cross (a white-eyed female crossed to a red-eyed male), all females will have red eyes, and all males will have white eyes. This demonstrates that the eye-color trait in Drosophila is a sex-linked trait, and that it is conditioned by a gene located on the X chromosome.

Males contribute an X chromosome only to their daughters, as their sons must get the Y chromosome. Females contribute their X chromosomes to both males and females. This phenomenon is pictorially demonstrated using Punnett squares in Figure 2.4 below.

Figure 2.4. Demonstration of sex linked inheritance. The outcome as demonstrated in the Punnett squares above is different based on whether the male bears the dominant or recessive trait.
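The two reciprocal eye-color crosses can be worked through with a short Python sketch. The encoding is an assumption made for this example: "W" stands for an X chromosome carrying the dominant red-eye allele, "w" for an X carrying the recessive white-eye allele, and "Y" for the Y chromosome, which carries no eye-color allele. The function name is hypothetical.

```python
from itertools import product


def sex_linked_cross(mother, father):
    """Return the distinct (sex, eye color) outcomes of an X-linked cross.

    mother: tuple of her two X alleles, e.g. ("W", "w")
    father: tuple of his X allele and "Y", e.g. ("w", "Y")
    """
    outcomes = set()
    for from_mother, from_father in product(mother, father):
        sex = "male" if from_father == "Y" else "female"
        x_alleles = [a for a in (from_mother, from_father) if a != "Y"]
        color = "red" if "W" in x_alleles else "white"
        outcomes.add((sex, color))
    return sorted(outcomes)


# Red-eyed female x white-eyed male: all progeny are red-eyed.
print(sex_linked_cross(("W", "W"), ("w", "Y")))
# Reciprocal cross, white-eyed female x red-eyed male.
print(sex_linked_cross(("w", "w"), ("W", "Y")))
```

In the reciprocal cross every son receives his only X from the white-eyed mother, so the sons are white-eyed, while every daughter receives the father's red-eye X and is red-eyed, matching the pattern described above.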


2.4.4. Epistasis

Sometimes the phenotype of an organism does not reflect the actual genotype. This can be the case when one or more genes are epistatic to others. Epistatic genes modify or eliminate the phenotype of other genes so that their phenotype is not apparent. An example of an epistatic gene might be a gene for baldness. This gene would be epistatic to genes for hair color, e.g. red hair or blond hair genes.

Another example of an epistatic gene is the c-allele in rabbits given above. This allele produces a phenotypically albino, white rabbit with pink eyes in the homozygous recessive state. However, there are at least 5 additional genetic loci that condition various coat colors and patterns. Many of these other loci have multiple alleles (as does the C-locus, see above), but the rabbit will be albino if it is genotypically cc (homozygous recessive) at the C-locus. This demonstrates that the C-locus is epistatic to the other coat color loci.

2.4.5. Epigenetics

More recently, phenomena have been discovered that involve heritable changes in gene expression that are not related to actual changes in DNA sequence, but rather to chromosome structure and function. These phenomena are referred to as epigenetic inheritance, and they have emerging importance in virtually all areas of biology and medicine. We will discuss them in greater molecular detail later, but they clearly had their beginning in classical genetics.

2.5. THE LAW OF INDEPENDENT ASSORTMENT.

In his studies with garden peas Mendel observed that each of the 7 traits that he studied behaved independently of the others. The mechanism inferred from this observation involved genes (hereditary factors) assorting independently of each other. Thus, when 2 factors (genes) were involved in a cross, each of them behaved independently.

However, the chromosomal theory of inheritance contradicts this observation by suggesting that genes are linked together on chromosomes, and it further suggests that it is the chromosomes that are passed to the next generation. If this is the case, how can genes on the same chromosome assort independently?

Answering this question plagued the early development of genetics until the chromosomal theory of inheritance emerged and the idea of gene linkage for genes on the same chromosome was clearly shown by Thomas Hunt Morgan and his colleagues about a century ago.

Once it was established that it is chromosomes that assort independently, it was clear that Mendel had fortuitously chosen 7 genes on 7 different chromosomes to work with, and as a consequence Mendel's law of independent assortment did not necessarily apply to all genes, since it was the chromosomes that assorted, not the genes per se.

The question has been raised as to whether Mendel chose to work only with data that supported his theory and disregarded other data or traits that did not fit the theory he wished to present. Whether this is true or not we will never really know, but it surely does not detract from the important contribution Mendel's work has made to the science of genetics and genomic biology by establishing an important set of rules that govern the inheritance of traits.

2.5.1. Meiosis: chromosomes assort independently

The theory that allows us to explain the mapping of genes begins with an understanding of the behavior of chromosomes during meiosis. During the assorting of diploid chromosome sets, like those found in somatic cells, into haploid chromosome sets, like those found in gametes, it is possible to exchange parts of chromosomes between non-sister chromatids of homologous chromosomes.

The process of meiosis begins with a diploid cell containing 2 copies of the genome (a diploid set of chromosomes) and ends with 4 cells each containing 1 copy of the genome (a haploid set of chromosomes). In the first meiotic division (meiosis I), homologous chromosomes, each consisting of 2 sister chromatids, are separated from each other to produce 2 haploid cells with each chromosome consisting of 2 sister chromatids. As these chromosomes pair during prophase I of meiosis, the chromatids of homologous chromosomes may overlap with each other, and pieces of each chromosome are sometimes exchanged. This process is called crossing over or genetic recombination. Once this exchange has taken place and meiosis I is completed, the exchanged chromosomes become part of new, separate haploid chromosome sets in each of 2 haploid cells. Each of these cells undergoes a second meiotic division in which the sister chromatids are separated, leading to 4 cells that have a unique combination of traits, mixing the traits derived from each parent of the original individual. Since this process takes place on each chromosome of the organism, the end result is a likelihood that every gamete consists of a genome that is unique compared to the parental genomes that produced the individual. This mixing of genes at loci along the length of chromosomes contributes much to the genetic diversity required to make the process of evolution work.

Figure 2.5. The stages of meiosis I and meiosis II are shown. This involves two separate cell divisions that lead to the formation of 4 haploid cells from one diploid cell.

2.5.2. Mapping genes on chromosomes

Using Drosophila, Thomas Hunt Morgan and his students accumulated a large collection of mutants (allele pairs) for specific traits. As the collection of mutants grew, it became clear that particular sets of traits assorted together rather than independently as Mendel had found with his peas. Morgan concluded that genes for specific traits are linked together into 4 groups in Drosophila. This happened to equal the number of chromosome pairs observed in Drosophila cells in the microscope. By studying the process of meiosis as described above, it was further established that pieces of homologous chromosomes are exchanged when chromosome numbers are reduced from 2 homologous chromosomes per cell to just a single chromosomal homolog in the gametes that are fused to produce the next generation.

From this initial idea of linkage of genes into groups on chromosomes, Alfred Sturtevant, Morgan's student, was the first scientist to make genetic or linkage maps of fruit fly chromosomes. To do this Sturtevant reasoned that since pieces of homologous chromosomes can be exchanged during meiosis, the frequency of this exchange provides a measure of the relative distance between linked genes on the same chromosome. Distantly located genes recombine more frequently, while nearby genes rarely recombine and are closely linked. By measuring the frequency of crossing over between linked genes on the same chromosome, the distance between genes can be estimated, and genetic maps can be calculated and constructed.

From Morgan and Sturtevant's work, the percentage of crossing over became a chromosomal distance measurement, and the unit of crossing over became known as the centimorgan (1 centimorgan = 1% crossing over between linked genes on the same chromosome). For example, two linked genes that recombine in 12% of progeny are about 12 centimorgans apart.

Figure 2.6. Alfred Sturtevant's first genetic map of the Drosophila chromosomes.

2.6. QUANTITATIVE GENETICS: TRAITS THAT ARE CONTINUOUSLY VARIABLE.

Mendel, perhaps fortuitously, chose to work with a series of traits where he could find a pair of discrete phenotypes. However, not all phenotypes fall so cleanly into discrete classes. Above we have already looked at examples of incomplete dominance, multiple alleles, and epistasis, but for other traits phenotypes are continuously variable between 2 extremes rather than producing discrete phenotypic classes. Examples of such traits are often related to height, weight, or amounts of things.

In the human population there are no discrete height classes. Height varies from over 7 feet to under 4 feet; there are no such things as pure-breeding lines of tall people and short people similar to what Mendel developed in pea plants, and when two extremely tall individuals are mated, the progeny, though perhaps taller than average, are not all extremely tall like their parents. Traits such as tallness are often referred to as quantitative traits, and a separate branch of genetics called quantitative genetics has emerged to study and understand quantitative phenomena.

Several books have been written on the topic of quantitative inheritance, and briefer treatments of the topic can be found online. A number of additional online references on quantitative genetics are also available, but be aware that these may not be adequately edited, and they are certainly incomplete, although they do provide an overview of the area suitable for understanding the relationship of quantitative genetics to genomic biology. Also note that there are many more complex issues involved in understanding quantitative inheritance that require statistical background beyond that expected here.

Figure 2.7. Description of a quantitative locus. A gene contributes "d" average effect, but the value obtained lies between +a and -a away from d.

In classical genetics, statistical approaches have emerged that provide tools for detailed analyses of quantitative inheritance. These approaches focus on phenotypically defining 2 alleles at a putative "quantitative locus". The midpoint between the homozygotes of the 2 alleles is defined as +d, and each of the opposing homozygotes deviates phenotypically from the midpoint by +a or -a (see Figure 2.7). A heterozygote phenotype closer to the homozygous dominant (+a) than to the midpoint (+d) indicates a dominant character to that allele, and a heterozygous phenotype closer to the homozygous recessive (-a) indicates a less dominant character to the dominant allele. A measure of the heterozygote's distance from the midpoint then becomes a statistical definition of incomplete dominance for such a quantitative gene. Note that in Mendel's tall versus short pea plants the phenotype of the heterozygote is almost precisely +a, indicating 100% dominance of the tall allele over the short allele. In fact, it is even possible to have a super-dominant allele that gives a heterozygous phenotype more distant from the midpoint than +a, a phenomenon that is sometimes referred to as hybrid vigor.

In addition to statistical treatments of quantitative inheritance, it is also widely considered that quantitative inheritance results from the interaction of a number of different loci, where each of these has an effect on the final integrated outcome. This is termed polygenic inheritance.

Figure 2.8. Mapping quantitative trait loci using LOD scores. This quantitative analysis identifies quantitative trait loci (QTLs) located on various chromosomes and shows which regions of the chromosome contribute significant genes to the quantitative phenotype being investigated. The figure compares the severity of an arthritic phenotype in hip and spine by location on the chromosome (map position in cM).
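The single-locus description of dominance given above (midpoint between the homozygotes, homozygotes at +a and -a, heterozygote somewhere in between or beyond) can be turned into a small calculation. This is a sketch under stated assumptions: the function name, the plain variable names, and the phenotype values in the example are all hypothetical, and the classification thresholds are a convenience rather than a standard.

```python
import math

def dominance_summary(hom_high, het, hom_low):
    """Summarize dominance at a two-allele quantitative locus.

    hom_high, het, hom_low are the mean phenotypes (same units) of the
    high homozygote, the heterozygote, and the low homozygote.
    """
    midpoint = (hom_high + hom_low) / 2
    a = (hom_high - hom_low) / 2       # homozygotes lie at midpoint + a and midpoint - a
    if a == 0:
        raise ValueError("homozygotes do not differ; dominance is undefined")
    het_deviation = het - midpoint     # distance of the heterozygote from the midpoint
    degree = het_deviation / a         # 0 = no dominance, 1 = complete dominance

    if math.isclose(degree, 0.0, abs_tol=1e-9):
        label = "no dominance (additive)"
    elif math.isclose(abs(degree), 1.0):
        label = "complete dominance"
    elif abs(degree) < 1:
        label = "incomplete dominance"
    else:
        label = "overdominance"
    return {"midpoint": midpoint, "a": a, "degree": degree, "label": label}

if __name__ == "__main__":
    # Hypothetical plant heights in cm: tall homozygote, heterozygote, short homozygote
    print(dominance_summary(200, 200, 100))
    print(dominance_summary(200, 180, 100))
    print(dominance_summary(200, 220, 100))
```

A heterozygote that matches the tall homozygote gives a degree of 1 (the Mendelian case), while a heterozygote beyond the better homozygote gives a degree above 1, the situation the text associates with hybrid vigor.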


Polygenes are nonallelic genes at a set of loci distributed in the genome that contribute to the overall quantitative phenotype observed in the organism. Figure 2.8 shows how phenotypic data and their proximity to known marker genes allow the mapping of chromosome regions influencing quantitative phenotypes, referred to as quantitative trait loci (QTLs). The measure used in this map is the so-called LOD score, which relates phenotype to position on the chromosome. The LOD score method of locating regions of chromosomes that influence quantitative inheritance relies on having numerous closely spaced genetic markers on chromosomes.

Although the method has been available for some time, the advent of genomic techniques for identifying and mapping DNA sequence markers on chromosomes has markedly improved the accuracy and ease of identifying QTLs in genomes. Additionally, the availability of complete genome sequences makes it possible not only to identify regions of the chromosome related to phenotypes, but to actually identify the specific causally related genes.

The QTL approach has found wide application, ranging from the mapping of human disease QTLs (example in Figure 2.8), to applications in plant and animal breeding, and to applications in evolutionary and population genetics, among others.

2.7. POPULATION GENETICS.

Statistical genetic theories have also become a major consideration in the discipline of population genetics. In the context of a population, the frequency of individuals having a given genotype is related to the frequency of each allele in the breeding population. If you assume that mating in a population is random and that the population is very large, to assure that it is homogeneous, then the frequency of genotypes in the subsequent generation will be directly related to the frequency of alleles in the gamete pool that produces that generation. Thus, where there are only 2 alleles for a given locus found in the population, and p = frequency of dominant allele gametes while q = frequency of recessive allele gametes, $p + q = 1$. As is shown in Table 2.1, the frequency of homozygous dominant individuals in the population should be $p^2$ and the frequency of homozygous recessive individuals will be $q^2$. Heterozygotes should then be found at a frequency of $2pq$, and in total $p^2 + 2pq + q^2 = 1$. This is the binomial expansion of $(p + q)^2$.

If this looks familiar, recall the Punnett squares that we used to show gene segregation in the F2 generation. In that case the F1 heterozygotes produce gametes half of which carry the dominant allele and half of which carry the recessive allele, i.e. $p = q = 0.5$. Substituting these gamete allele frequencies into the binomial equation above, we get the 1:2:1 segregation ratios we expect.

However, in a population where there are both homozygotes and heterozygotes all producing gametes, p will not usually equal q, and a different equilibrium of gametes and genotypes will be established and maintained through time. This condition is called a Hardy-Weinberg equilibrium.

In order for a population to sustain a Hardy-Weinberg equilibrium, additional factors (assumptions) must be in place or the equilibrium will not be maintained. In addition to a large population and random mating within the population as discussed above, it is also necessary that there be no mutation occurring in the population and that there be no natural selection taking place for the alleles or linked genes in question. Additionally, there should be no gene flow (migration into or from the population) taking place, and the population should not have gone through a dramatic recent change in size that may have led to genetic drift in the population. It should also be noted that the equations given above relate only to diploid species. Some species found in nature are natural polyploids (having more than 2 sets of chromosomes), and the equations describing the behavior of polyploids are different from the binomial expansions described above. Other changes in the equations are also required for situations where there are more than 2 alleles found in a population.

As with quantitative genetics, the introduction of tools from genomic studies into population genetics has greatly facilitated the investigation of genes in populations, and this is particularly relevant in the investigation of the human population. Population genetic studies using molecular markers for important health-related genes are now commonplace in Public Health studies.
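The Hardy-Weinberg relationship from Section 2.7, with p and q as the frequencies of the two alleles and genotype frequencies given by the expansion of (p + q) squared, can be checked numerically. The sketch below is illustrative only: the function names, the genotype labels "AA", "Aa", and "aa", and the genotype counts in the example are hypothetical.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies for a two-allele locus, given the
    frequency p of the dominant allele (so q = 1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

def dominant_allele_frequency(n_AA, n_Aa, n_aa):
    """Estimate p from observed genotype counts in a diploid population.
    Each AA individual carries two copies of the allele, each Aa carries one."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)
    return (2 * n_AA + n_Aa) / total_alleles

if __name__ == "__main__":
    # F2-like case from the text: p = q = 0.5 gives the 1:2:1 ratio.
    print(hardy_weinberg(0.5))
    # Hypothetical population sample of 100 individuals.
    p = dominant_allele_frequency(36, 48, 16)
    print(p, hardy_weinberg(p))
```

Comparing the expected frequencies with the observed genotype proportions is one simple way to ask whether a sample is consistent with the equilibrium assumptions listed in the text.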


CHAPTER 3. THE BEGINNINGS OF GENOMIC BIOLOGY – MOLECULAR GENETICS

As the development of classical genetics proceeded from Mendel in 1866 through the early part of the 20th century, the understanding emerged that Mendel's factors that produced traits were carried on chromosomes, and that there were countless ways that the genetic information from 2 parents could assort in each generation to produce the genetic variety on which natural selection acts, as demanded by Darwin's theory of the origin of species. This gave rise to the study of gene behavior for more complex traits and to an understanding of genes in populations.

At the same time, a quest was ongoing for the material inside a cell, perhaps a subcomponent of a chromosome, that carried the genetic instructions to make organisms what they are.

3.1. DNA IS THE GENETIC MATERIAL.

In 1928, a British scientist, Frederick Griffith, published his work showing that live, rough, avirulent bacteria could be transformed by a "principle" found in dead, smooth, virulent bacteria into smooth, virulent bacteria. This meant that the bacterial traits of rough versus smooth and avirulence versus virulence were controlled by a substance that could carry the phenotype from dead to live cells.

Photograph: Frederick Griffith (1879-1941).

Griffith's observations on Pneumococcus were controversial to say the least, and inspired a spirited debate and much experimentation directed at proving whether the "transforming principle" was protein or nucleic acid, the two main components of chromosomes identified early in the 20th century, well before Griffith's experiments. This debate continued until Oswald Avery and his colleagues, Colin MacLeod and Maclyn McCarty, published their work in 1944 unequivocally showing that DNA was, in fact, Griffith's transforming principle. This completely revolutionized genetics and is considered the founding observation of molecular genetics.

Photographs: Oswald T. Avery, Colin MacLeod, Maclyn McCarty.

In 1952, more evidence supporting DNA as the genetic material resulted from the work of Alfred Hershey and Martha Chase on E. coli infected with bacteriophage T2. In their experiment, T2 proteins were labeled with the 35S radioisotope, and T2 DNA was labeled with the 32P radioisotope. Then the labeled phages were mixed separately with the E. coli host, and after a short time, phage attachment was disrupted with a kitchen blender, and the location of the label was determined. The 35S-labeled protein was found outside the infected cells, while the 32P-labeled DNA was inside the E. coli, indicating that DNA carried the information needed for viral infection.

Figure 3.1. An electron micrograph of bacteriophage T2 (left), and a sketch showing the structures present in the virus (right). The head consists of a DNA molecule surrounded by proteins, while the core, sheath, and tail fibers are all made of protein. Only the DNA molecule enters the cell.

Once it was established that DNA was the genetic material carrying the instructions for life, so to speak, attention turned to the question "How could a molecule carry genetic information?" The key to that became obvious with a detailed understanding of the structure of the DNA molecule, which was developed by two scientists at Cambridge University, James Watson and Francis Crick.

3.2. WATSON & CRICK – THE STRUCTURE OF DNA.

The basic laboratory observations that led to the formulation of a structure for DNA did not involve biologists. Rather, Erwin Chargaff, an analytical organic chemist, and the physicists Rosalind Franklin and Maurice Wilkins made the laboratory observations that led to the solution of the structure of DNA.

Chargaff determined that there were 4 different nitrogen bases found in DNA molecules: the purines, adenine (A) and guanine (G), and the pyrimidines, cytosine (C) and thymine (T). He purified DNA from a number of different sources so he could examine the quantitative relationships of A, T, G, and C. He concluded that in all DNA molecules the mole-percentage of A was nearly equal to the mole-percentage of T, while the mole-percentage of G was nearly equal to the mole-percentage of C. It follows that the mole-percentage of pyrimidine bases equals the mole-percentage of purine bases. These observations became known as Chargaff's rules.

Rosalind Franklin, a young x-ray crystallographer working in the same laboratory as Maurice Wilkins at King's College London, used a technique known as x-ray diffraction to generate images of DNA molecules that showed that DNA had a helical structure with repeating structural elements every 0.34 nm and every 3.4 nm along the axis of the molecule.

Photographs: Rosalind Franklin, Maurice Wilkins.

Figure 3.2. X-ray diffraction image of a DNA molecule showing helical structure with repeating structural elements.
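Chargaff's rules, as stated above, lend themselves to a simple numerical check on a measured base composition. The sketch below is illustrative: the function name, the tolerance of 2 mole-percent (standing in for "nearly equal"), and the composition values are assumptions for this example and are not Chargaff's measurements.

```python
def check_chargaff(mole_percent, tolerance=2.0):
    """Test a base composition (mole-percent of A, T, G, C) against
    Chargaff's rules, allowing 'nearly equal' within a tolerance."""
    a, t, g, c = (mole_percent[base] for base in "ATGC")
    return {
        "A_equals_T": abs(a - t) <= tolerance,
        "G_equals_C": abs(g - c) <= tolerance,
        "purines_equal_pyrimidines": abs((a + g) - (t + c)) <= tolerance,
    }

if __name__ == "__main__":
    # Hypothetical composition of a double-stranded DNA sample
    sample = {"A": 30.5, "T": 29.5, "G": 20.2, "C": 19.8}
    print(check_chargaff(sample))
    # A hypothetical composition that violates the rules
    skewed = {"A": 40.0, "T": 20.0, "G": 25.0, "C": 15.0}
    print(check_chargaff(skewed))
```

Note that the rules describe the composition of the whole double-stranded molecule; the amount of A relative to G is free to vary from one source of DNA to another.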

These astute observations allowed Watson and Crick to put together a 3-dimensional structure of a DNA molecule with all of these essential features. This structure was published in 1953 and immediately generated much excitement, culminating in the Nobel Prize in Physiology or Medicine awarded in 1962 to Watson, Crick, and Wilkins. Franklin had died in 1958, and the prize is not awarded posthumously.

The key elements of this structure are:

• Double helical structure – each helix is made from the alternating deoxyribose sugar and phosphate groups derived from deoxynucleotides, which are the monomeric units that are used to make up polymeric nucleic acid molecules. Each nucleotide in each chain consists of a nitrogen base of either the purine type (adenine or guanine) or the pyrimidine type (cytosine or thymine) attached to the 1'-position of the 2'-deoxyribose sugar, and a phosphate group, esterified by a phosphoester bond to the 5'-position of the sugar.

• The nucleotides are held together in sequence order along the length of the polynucleotide chain by 3'-5'-phosphodiester bonds, and the strands demonstrate a polarity, as the 5'-OH at one end of a polynucleotide strand is distinct from the 3'-OH at the other end of the strand. Often, but not always, the 5' end of the strand will have a phosphate group attached.

• The 2 polynucleotide chains of the double helix are held together by hydrogen bonds between the adenosines in one strand and the thymidines in the other strand, and between the guanosines in one strand and the cytosines in the other strand.

Figure 3.3. Watson & Crick's DNA structure. Their model consisted of a double helical structure with the sugars and phosphates making the two helices on the outside of the structure. The sugars were held together by 3'-5'-phosphodiester bonds. The bases pair on the inside of the molecule, with A always pairing with T, and G always pairing with C. This pairing leads to Chargaff's observations about bases in DNA.

Figure 3.4. Structures of purine and pyrimidine bases in DNA, and structure of the 2'-deoxyribose sugar.

Figure 3.5. The building blocks of nucleic acids are nucleotides and nucleosides. Any base together with a deoxyribose sugar forms a deoxyribonucleoside, while if the sugar is ribose a ribonucleoside is formed (not shown). Addition of a phosphate on the 5' position of the sugar forms nucleotides from nucleosides.

Figure 3.6. Base pairing between A and T involves two hydrogen bonds, and pairing between G and C involves 3 hydrogen bonds. This means that the forces holding strands together in G-C base pair-rich regions are stronger than in A-T base pair-rich regions.

Figure 3.7. Strand of DNA showing the 3',5'-phosphodiester bonds holding nucleotides together.

In order to get a uniform diameter for the molecule and have proper alignment of the nucleotide pairs in the middle of the strands, the strands must be oriented in antiparallel fashion, i.e. with the polarity of each strand of the double helix running in the opposite direction (one strand is 3' -> 5' while the other is 5' -> 3').

The truly elegant aspect of this solution to DNA structure is that it produces a spacing of 0.34 nm between adjacent nucleotide base pairs in the molecule, with 10 base pairs per complete turn of the helix. This corresponds precisely with Rosalind Franklin's x-ray diffraction measurements of repeating units of 0.34 nm and 3.4 nm, and with her measurement of 2 nm for the diameter of the double helix.

It is also noteworthy that Watson and Crick suggested that the structure they proposed provided a clear method for the two strands of the DNA molecule to duplicate and maintain the fidelity of the sequence of bases along each chain as DNA was synthesized inside a cell, thus providing a mechanism for the fidelity of information transfer from cell generation to cell generation.
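The pairing rules, the antiparallel orientation of the strands, and the helix dimensions given above can be combined into a brief sketch. The function names and the example sequence are hypothetical; the constants are the values quoted in the text (0.34 nm per base pair and 10 base pairs per turn).

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
RISE_PER_BP_NM = 0.34   # spacing between adjacent base pairs, from the text
BP_PER_TURN = 10        # base pairs per complete helical turn, from the text

def reverse_complement(strand):
    """Given one strand written 5'->3', return its partner written 5'->3'.
    Reversing reflects the antiparallel orientation of the two strands."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))

def duplex_base_counts(strand):
    """Count A, T, G, and C over both strands of the double helix."""
    both = strand.upper() + reverse_complement(strand)
    return {base: both.count(base) for base in "ATGC"}

def helix_length_nm(n_bp):
    """Length of a double helix of n_bp base pairs along its axis."""
    return n_bp * RISE_PER_BP_NM

def helical_turns(n_bp):
    """Number of complete turns made by n_bp base pairs."""
    return n_bp / BP_PER_TURN

if __name__ == "__main__":
    strand = "AAGCT"    # hypothetical sequence
    print(reverse_complement(strand))
    print(duplex_base_counts(strand))
    print(helix_length_nm(10), helical_turns(10))
```

Because every A on one strand is matched by a T on the other, and every G by a C, the counts over the duplex always satisfy A = T and G = C, which is how the pairing scheme accounts for Chargaff's rules. Ten base pairs span one turn of about 3.4 nm, matching the two repeat distances seen in the x-ray data.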

3.3. CHROMOSOME STRUCTURE.

The DNA inside a cell seldom exists as a simple, "naked" DNA molecule. Because DNA molecules are long linear molecules with an overall negative charge deriving from the phosphate groups making up the helices, positively charged ionic species within cells are attracted to these molecules. These positively charged species can be small ions such as K+ and Mg++, or they can be larger positively charged proteins and/or other larger molecular species. These ionic interactions play an important role in producing the folding and packaging that is required to keep the large linear molecule packaged inside the microscopic cell.

In the case of proteins, it is clear that positively charged proteins can interact with DNA both through general ionic interactions and in sequence-specific ways; i.e. specific proteins bind only to specific sequences of bases in the DNA strand. Thus, the types of molecular interactions that ionic substances, particularly proteins, have with DNA molecules play important roles in determining the expression of the information that is carried in the DNA molecule. It will be obvious as we proceed through our study of genomic biology that such DNA-protein interactions are as critical to describing "genetic information" as are the base sequences of the DNA molecules themselves. This was obvious well before we obtained the first genomic DNA sequences, but it has become even more apparent and significant now that we have the DNA sequences of many genomes. Thus, genomic biology is not merely the study of DNA nucleotide sequences, but involves the study of the structure of the genetic material, such as chromosomes and chromatin.

3.3.1. Prokaryotic chromosome structure

Most Prokaryotes (e.g. bacteria) have a single, circular chromosome, although some have more than one chromosome, and some have linear chromosomes rather than circular chromosomes. Certainly, the most well-studied bacterium, E. coli, has a single circular chromosome that can exist in either a relaxed or a supercoiled state.

Supercoiling involves breaking one of the 2 circular helical strands and then rotating the broken ends either in the direction of the helix (+ supercoil) or in the opposite direction of the helix (- supercoil). As supercoiling is added to the DNA molecule it becomes "tightly" coiled (see Figure 3.8), and therefore can be compacted more easily. This permits the packaging of the large DNA molecule into the relatively small cells in which it must exist and function.

Figure 3.8. An E. coli cell lysed open showing the expanse of its DNA molecule (left). Note that this entire molecule must be folded and packaged inside the cell in the picture. On the right are two electron micrographs showing circular DNA molecules either in a relaxed (top) or supercoiled (bottom) state.

Additional packaging results from the supercoiled DNA being carefully looped onto a scaffold of proteins, leading to an organized intracellular structure that can be easily accessed but also keeps DNA from twisting and being damaged during normal cellular processes.

Figure 3.9. Diagram of DNA organizational structure in prokaryotes. Supercoiled DNA is looped and attached to scaffold proteins.

3.3.2. Eukaryotic chromosome structure

In general, Eukaryotes have much larger genomes than do Archaea and other Prokaryotes. However, this relationship between genome size and the complexity of the organism does not hold as well for species within the Eukaryota. This lack of correlation between organismal complexity and genome size (called the C-value) is referred to as the C-value paradox (Table 3.1).

The C-value paradox results from great variation in the nature of DNA in different Eukaryotes. Some eukaryotes contain substantial amounts of DNA that appears to have limited or, at least, no known function, while others have a gene density in their genomes resembling that of the Prokaryotes (e.g. the yeasts and malarial parasite in Table 3.1). The majority of Eukaryotes fall somewhere in between these extremes, but are highly variable in their DNA contents. For now we need to appreciate that this variation in DNA content and type appears to have a relationship to chromosome structure. The nature of this relationship will be considered further once we learn more about DNA sequencing and examine fully sequenced genomes.

Figure 3.10. Electron micrograph showing the structure of Eukaryotic DNA. The DNA molecule is barely visible, but connects the beads of proteins that the DNA wraps around, creating the appearance of beads on a string.

In eukaryotes, there are multiple levels of chromosomal organization that we will need to consider. Observations using powerful electron microscopes demonstrated that in Eukaryotes the DNA molecules in chromosomes are organized like beads on a string. These structures have subsequently been named nucleosomes. Investigation of the nature of nucleosomes has shown that they are made from several types of basic (positively charged) proteins found in cells, called histone proteins.

The basic nucleosome consists of an octamer containing two copies each of histones H2A, H2B, H3, and H4. DNA is wrapped around these structures, producing the bead-like appearance observed in the electron microscope. Once the nucleosomes are formed, they can condense or decondense based on interaction with another histone, H1.

Figure 3.11. Nucleosomes are formed when DNA wraps around a histone complex. Nucleosomes can exist in either a more condensed or a decondensed state depending on the state of the genetic material in a cell.

During prophase of mitosis or meiosis, the nucleosome structure of chromatin further condenses into a so-called solenoid structure, which is approximately 30 nm in diameter. This solenoid form is not visible in a light microscope but can be viewed in an electron microscope. This appears to be the form DNA assumes when chromosomes condense during mitosis, but the DNA is not as accessible for use in the cell as it is during interphase, when the chromatin is decondensed.


Figure 3.13. Loop-folding of the 30 nm solenoid structure yields a packaged DNA that is visible in a light microscope in each Eukaryotic chromosome.

Figure 3.12. Condensation of chromatin leads to the careful packaging of DNA into so-called solenoid structures. These structures ultimately form chromosomes.

The solenoid structures are subsequently looped and fastened to chromosome scaffold proteins, generating a structure that is visible in a light microscope and that we know as a chromosome.

While this may seem like an elaborate structure involving several sets of structural proteins, such a structure appears to be required to allow for the appropriate assembly and assortment of the genetic material during the cell cycle in mitosis. Without this structural organization, it is likely that cellular DNA would become a hopeless tangle, and cellular reproduction would be severely hampered and would likely require too much time and effort to ultimately be successful.

3.3.3. Heterochromatin & Euchromatin.

The cell cycle affects DNA packing into chromatin, with chromatin condensing for mitosis and meiosis and then decondensing during interphase, being most dispersed at S-phase. However, cytogeneticists have observed that there can be two differently staining forms of chromatin, called euchromatin and heterochromatin. Euchromatin condenses and decondenses with the cell cycle. Euchromatin accounts for most of the active genome in dividing cells and bears most of the protein-coding DNA sequences.

Heterochromatin remains condensed throughout the cell cycle and is believed to be relatively inactive. There are two types of heterochromatin based on activity, i.e. constitutive heterochromatin, which is tightly condensed in virtually all cell types, and facultative heterochromatin, which varies between cell types and/or developmental stages.

Other methods of characterizing types of DNA suggest that there are sequences of DNA that can occur in many copies in the genome. A given sequence may be present only once in the genome, or it may occur tens of thousands of times or more. Sequences can be categorized into:

• Unique-sequence DNA, present in one or a few copies per genome.
• Moderately repetitive DNA, present in a few to $10^5$ copies per genome.
• Highly repetitive DNA, present in about $10^5$ to $10^7$ copies per genome.

Observations about repetitive DNA sequences as described above have been known for decades, and initially it was shown that Prokaryotic DNA was mostly unique-sequence DNA, with little or no repetitive sequence. However, Eukaryotes have a mix of unique and repetitive sequence types of DNA.

• Unique-sequence DNA includes most of the genes that encode proteins, and euchromatin is rich in unique-sequence DNA.
• Repetitive-sequence DNA includes the moderately and highly repeated sequences. They may be dispersed throughout the genome or clustered in tandem repeats. Heterochromatin is rich in moderately and highly repetitive DNA.
• Human DNA contains about 65% unique sequences, while unique-sequence DNA makes up a much lower percentage of the genome of organisms that have the unexpectedly large genomes (C-values) discussed earlier in this section.

3.4. DNA REPLICATION.

As Watson and Crick were solving the structure of DNA, they realized the general mechanism by which the molecule could be copied while maintaining fidelity. From that beginning, the duplication of the DNA molecules of a cell became a subject of investigation, and it led to a number of Nobel Prize awards. Moreover, understanding DNA replication was critical to the development of the technologies needed for molecular genetics and ultimately genomic biology research.

3.4.1. DNA Replication is semiconservative.

Among the earliest experiments concerning the nature of how DNA replicates were the studies of Matthew Meselson and Franklin Stahl. Meselson, while a Ph.D. student, designed an experiment that utilized so-called "heavy" isotopes of nitrogen. Heavy isotopes consist of atoms having the same number of protons as the common form of the element, but more neutrons. For example, nitrogen normally has 7 protons and 7 neutrons, giving it an atomic mass of 14 (written 14N), but it is possible to find atoms with 7 protons and 8 neutrons, having an atomic mass of 15 (written 15N). It turns out that if you grow bacterial cells on a nitrogen source enriched in 15N, the DNA molecules purified from such cells have a greater density (they are heavier). By synchronizing cells grown on 15N, shifting them to a normal 14N nitrogen source, purifying DNA after each round of DNA replication, and then determining the density of the newly made DNA molecules using density gradient centrifugation, Meselson and Stahl were able to show that the first round of DNA synthesis produced molecules having a hybrid density between light and heavy DNA, while a subsequent round of DNA replication produced light and hybrid molecules. Such a pattern of 15N labeling was consistent only with the semiconservative replication of DNA.

Figure 3.14. Diagram showing the predicted outcome of conservative, semiconservative, and dispersive DNA replication. Original strands are shown in red while newly made DNA is shown in blue.

3.4.2. DNA Polymerases.

The enzyme that replicates the DNA double helix is called DNA polymerase. The enzyme is difficult to work with because only a few copies of it are needed per cell, and then they are required only in S-phase of the cell cycle. In spite of these limitations, Arthur Kornberg won the Nobel Prize in 1959 for the first purification and characterization of an enzyme that makes DNA. Kornberg's enzyme was purified from the bacterium E. coli, and besides the enzyme, 4 additional components were required to make DNA in a test tube. These factors included a template DNA (Kornberg used E. coli DNA) and the four deoxynucleotide triphosphates (dNTPs), i.e. dATP, dGTP, dCTP, and dTTP. Note that these are the deoxy NTPs, and not the ribose-containing NTPs. The remaining requirements for DNA polymerase are magnesium ion (Mg++) and a primer single strand of DNA.

strand according to the sequence of the corresponding strand being copied. This copied strand is referred to as the template strand.

All DNA polymerases studied to date make DNA using the general principles established for Kornberg's
This primer requirement involves a single strand of DNA that will form a short double- stranded region of DNA. DNA polymerase then adds nucleotides to the free 3’-end of this primer, but Figure 3.15. Note that the template strand is read from it’s 3’-end without the primer DNA polymerase is unable to make a to its 5’-end while the antiparallel, new DNA strand is made from DNA strand. As the nucleotides are added they are the 5’-end to the 3’-end. added from the 5’-end to the growing 3’-end of the CONCEPTS OF GENOMIC BIOLOGY Page 42 enzyme, but there are significant differences between • A minimal sequence of about 245 bp required for them in other respects. For example, in E. coli there are initiation. five different DNA polymerases. Kornberg’s enzyme is • Three copies of a 13-bp AT-rich sequence. now known as DNA polymerase I, but there are also • Four copies of a 9-bp sequence. DNA polymerases II, III, IV, and V. DNA polymerases II, IV, and V are not involved in the DNA replication process, and they have specialized functions in repairing damaged DNA under specific circumstances. DNA polymerases I and III are the DNA polymerases involved in the replication of cellular DNA. Both of these DNA polymerases contain a 3’ -> 5’ activity that is involved in proof-reading the recently made DNA strand and removing any mistakes that are made. Only DNA polymerase I has a 5’ -> 3’ exonuclease activity and we will visit this function again below when the role of DNA polymerase I in DNA replication is considered. 3.4.3. Initiation of replication. (return) Replication initiates at a specific sequence in the genome that is often called an . E. coli has one origin, called oriC, where replication starts when the strands of the helix are forced apart to expose the bases, creating a replication bubble with two replication forks. Replication is usually bidirectional Figure 3.16. Initiation of DNA replication in E. from the origin using the two forks to enlarge the coli. at oriC. 
Noote the 9 and 13 bp repeats where DNA binds and activates bubble in both directions. E. coli has one origin, oriC, replicatlion throught the action of DNA . with the following properties: CONCEPTS OF GENOMIC BIOLOGY Page 43 From a series of in vitro studies it has been shown in E. 3) DNA polymerase III adds nucleotides to the 3’-end coli that the following steps are involved in initiating of the primer, synthesizing a new strand replication: complementary to the template and displacing the 1) Initiator proteins attach to oriC (E. coli’s initiator SSBs. DNA is made in opposite directions (at each protein is the DnaA protein derived from the dnaA fork) on the two template strands since DNA gene. polymerase only adds nuclotides to the free 3’- end. 2) DNA helicase (from dnaB gene) binds initiator proteins on the DNA and denatures the AT-rich 13- 4) The new strand made 5’-to-3’ in the same bp region using ATP as an energy source. direction as movement of the replication fork, i.e. 3) DNA primase (from the dnaG gene) binds helicase DNA polymerase III is continuously moving toward to form a primosome, which synthesizes a short the fork on one strand of the bubble at each fork. (5–10 nt) RNA primer. This defines the “leading strand”. On the other strand the new strand must be made in the

opposite direction as it must be made 5’ -> 3’. 3.4.4. DNA Replication is Semidiscontinuous (return) 5) This means that on this “lagging strand” primase When DNA denatures (strands separate) at the ori, must add the RNA primer very close to the replication forks are formed. DNA replication is usually replication fork, and the DNA polymerase III bidirectional, but we will consider events at just one moves away from the fork rather than toward the replication fork, but don’t forget that a similar set of fork like it was on the leading strand. events are occurring at the other replication fork in the bubble. The events occurring at each fork are: 6) The Leading strand needs only one primer and continuously makes the new DNA strand, while on 1) Single-strand DNA-binding proteins (SSBs) bind the the lagging strand a series of RNA primers are ssDNA formed by helicase, preventing required and only a limited number of DNA reannealing. nucleotides are added by DNA polymerase III 2) Primase synthesizes a primer on each template before the previously made fragment is strand. encountered. CONCEPTS OF GENOMIC BIOLOGY Page 44 ments. DNA replication is therefore semidiscon- tinuous. 8) As the bubble enlarges and DNA helicase denatures (untwists) the strands, this causes tighter winding in other parts of the circular chromosome. A protein called DNA Gyrase relieves the tension created in the molecule. 9) As accumulate on the lagging strand, DNA polymerase I binds and the 5’ -> 3’ exonuclease activity removes the RNA primers, and replaces them with DNA nucleotides.

Figure 3.17. DNA replication at a replication fork showing continuous DNA synthesis on the lower strand and discontinuous DNA synthesis on the upper strand where Okazaki fragments are dd 7) Thus, the leading strand is synthesized continuous- ly, while the lagging strand is synthesized discontin- uously in the form of shorter pieces of DNA with Figure 3.18. Removal of the RNA primers by the 5’-> 3’ exonuclease of DNA polymerase I, and replacement with DNA interspersed RNA primers called Okazaki frag- nucleotides on the lagging strand.

CONCEPTS OF GENOMIC BIOLOGY Page 45

Primer removal differs from that in prokaryotes. Pol 10) The DNA fragments lacking RNA primers are now continues extension of the newer Okazaki fragment, fastened together using an enzyme called DNA displacing the RNA and producing a flap that is removed ligase that closes the remaining gaps on the lagging by nucleases, thus allowing the Okazaki fragments to be strand. joined by DNA ligase. Other DNA polymerases replicate mitochondrial or DNA, or they are used in DNA repair. These are all similar to the prokaryotic system described in detail above.

3.4.6. Replicating ends of chromosomes. (return) Figure 3.19. DNA ligase joins an opening in a DNA strand remaking acomplete phosphodiester-linked polynucleotide chain. Replicating the ends of chromosomes in organisms without circular chromosomes presents unique problems. Removal of primers at the 5’-end of the 3.4.5. DNA replication in Eukaryotes. (return) newly made strand will produce shorter strands that cannot be extended with existing DNA polymerases, and of eukaryotic DNA replication are not as if the gap is not addressed chromosomes would become well characterized as their prokaryotic counterparts. shorter each time DNA replicates. Thus a new Fifteen DNA polymerases are known in mammalian mechanism for the completion of the ends of the cells, for example. Three DNA polymerases are used to chromosome is required. This is accomplished using the replicate nuclear DNA. Pol extends the 10-nt RNA telomerase system. primer by about 30 nt. Pol and Pol extend the RNA/DNA primers, one the leading strand and the other Most eukaryotic chromosomes have short, species- on the lagging stand, but it is not clear which specific sequences tandemly repeated at their synthesizes which. . It has been shown that chromosome lengths are maintained by telomerase, which adds repeats without using the cell’s regular replication CONCEPTS OF GENOMIC BIOLOGY Page 46 machinery. In humans, the telomere repeat sequence is Telomerase, an enzyme containing both protein and 5’-TTAGGG-3’. RNA, includes an 11-bp RNA sequence used to synthesize the new telomere repeat DNA. Using an RNA template to make DNA, telomerase functions as a reverse transcriptase called TERT (telomerase reverse transcriptase). The 3’-end of the telomerase RNA

contains the sequence 3’-CAUC, which binds the 5-- GTTAG-3’ overhang on the chromosome, positioning telomerase to complete its synthesis of the GGGTTAG telomere repeat. Additional rounds of telomerase activity lengthen the chromosome by adding telomere repeats. Ends of telomere DNA usually loop back to form a D-loop. After telomerase adds telomere sequences, chromosomal replication proceeds in the usual way. Any shortening of the chromosome ends is compensated for by the addition of the telomere repeats. Telomere length may vary, but organisms and cell types have characteristic telomere lengths, resulting from many levels of regulation of telomerase. Mutants affecting telomere length have been identified, and data shortening of telomeres eventually leads to cell death. Loss of telomerase activity results in limited rounds of before the cell death. Figure 3.20. The dilemma of how the 3’ overhangs are replicated at each end of the chromosome to duplicate a chromosome and make sister chromatids. CONCEPTS OF GENOMIC BIOLOGY Page 47
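The template-directed repeat addition described above can be sketched in a few lines of Python. This is a minimal illustration, not a model of the enzyme: the function names are hypothetical, and the full 11-nt template (written 3’ to 5’ as CAAUCCCAAUC) is an assumption, since the text gives only its 3’-end and the overhang it pairs with.

```python
# Assumed 11-nt telomerase RNA template, written 3'->5' (an assumption for
# illustration; the text specifies only the 3'-end that pairs with the overhang).
TEMPLATE_3TO5 = "CAAUCCCAAUC"

# RNA base -> complementary DNA base (reverse transcription)
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def rna_template_to_dna(rna_3to5):
    """Return the DNA (written 5'->3') that pairs with an RNA written 3'->5'."""
    return "".join(RNA_TO_DNA[base] for base in rna_3to5)


def telomerase_extend(overhang_5to3, rounds, anchor_len=5):
    """Extend a 3' overhang by repeated rounds of template-directed synthesis.

    The first `anchor_len` template bases pair with the end of the overhang;
    the rest of the template is copied into DNA and appended.
    """
    anchor_dna = rna_template_to_dna(TEMPLATE_3TO5[:anchor_len])  # GTTAG
    added_dna = rna_template_to_dna(TEMPLATE_3TO5[anchor_len:])   # GGTTAG
    for _ in range(rounds):
        if not overhang_5to3.endswith(anchor_dna):
            raise ValueError("overhang does not pair with the template anchor")
        overhang_5to3 += added_dna  # the new end pairs with the anchor again
    return overhang_5to3


if __name__ == "__main__":
    print(telomerase_extend("TTAGGGTTAG", rounds=2))
```

Each round leaves the overhang ending in GTTAG again, which is why the enzyme can translocate and repeat the cycle.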

Figure 3.21. Replication of chromosome ends using telomerase.
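Looking back at section 3.4.1, the density pattern that Meselson and Stahl observed can be reproduced with a small simulation. The sketch below assumes the cells start with fully heavy (¹⁵N) DNA and that every new strand is made from light (¹⁴N) nucleotides; the function names are hypothetical and only the semiconservative model is implemented.

```python
from collections import Counter


def replicate(duplexes):
    """One semiconservative round: each old strand templates one new light strand."""
    next_generation = []
    for strand_a, strand_b in duplexes:
        next_generation.append((strand_a, "L"))
        next_generation.append((strand_b, "L"))
    return next_generation


def density_classes(duplexes):
    """Count duplexes by density band: 0, 1, or 2 heavy strands."""
    names = {0: "light", 1: "hybrid", 2: "heavy"}
    return Counter(names[duplex.count("H")] for duplex in duplexes)


def meselson_stahl(generations):
    """Return the band counts for generation 0 through `generations`."""
    population = [("H", "H")]  # one fully 15N-labeled duplex (assumed start)
    history = [density_classes(population)]
    for _ in range(generations):
        population = replicate(population)
        history.append(density_classes(population))
    return history


if __name__ == "__main__":
    for generation, bands in enumerate(meselson_stahl(3)):
        print(generation, dict(bands))
```

Under this model the first round yields only hybrid duplexes and later rounds add light duplexes while the number of hybrids stays constant, which is the pattern the experiment reported.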

3.5. TRANSCRIPTION.

In cells, the genetic information carried in the DNA nucleotide sequence becomes functional information that gives cells their characteristics and ultimately specifies traits. This conversion of DNA sequence information into functional information begins with the synthesis of cellular RNAs from one of the two strands of the DNA sequence, a process called transcription. The mechanism by which cellular RNAs are transcribed from DNA is presented in this section, while the regulation of these processes is covered later.
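As a rough illustration of copying one DNA strand into RNA, the sketch below builds an RNA from a template strand by base pairing. The function names are hypothetical, and strand orientation is handled by convention only: the template is written 3’ to 5’ so that the RNA reads 5’ to 3’.

```python
# DNA template base -> RNA base added opposite it
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}
# DNA base -> complementary DNA base
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def template_from_coding(coding_5to3):
    """Return the template strand (written 3'->5') for a coding strand (5'->3')."""
    return "".join(DNA_COMPLEMENT[base] for base in coding_5to3.upper())


def transcribe(template_3to5):
    """Return the RNA (written 5'->3') made from a template written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3to5.upper())


if __name__ == "__main__":
    coding = "ATGGCC"
    template = template_from_coding(coding)
    print(template, transcribe(template))
```

The RNA produced has the same sequence as the non-template (coding) strand with U in place of T, which is why gene sequences are usually written as the coding strand.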

3.5.1. Cellular RNAs are transcribed from DNA

Ribosomal RNAs

The most abundant type of RNA in most cells is a structural component of the ribosome, the cellular particle on which proteins are synthesized. Since ribosomes have two subunits, a large subunit and a small subunit, they also have two major types of ribosomal RNA. These are described in detail in Table 3.2. In addition to the largest ribosomal RNAs, there are smaller ribosomal RNAs as well. Note that the size and nature of all of these ribosomal RNAs differ between Prokaryotes and Eukaryotes.

In prokaryotes the small 30S ribosomal subunit contains the 16S ribosomal RNA. The large 50S ribosomal subunit contains two rRNA species (the 5S and 23S ribosomal RNAs). Bacterial 16S, 23S, and 5S rRNA genes are typically organized as a co-transcribed unit (operon). There may be one or more copies of the operon dispersed in the genome (for example, Escherichia coli has seven). Archaea contain either a single rDNA operon or multiple copies of the operon.

In Eukaryotes, the cytoplasmic small ribosomal subunit (40S) contains an 18S rRNA, while the large ribosomal subunit (60S) contains a 28S, a 5S, and a 5.8S rRNA. As in Prokaryotes, these rRNAs are structural components of ribosomes, where they perform essential functions. In mammals, the 28S, 5.8S, and 18S rRNAs are encoded by a single nuclear transcription unit (45S). Two internal transcribed spacers separate the three rRNA species in the 45S transcript. Generally, there are many copies of the 45S rDNA organized in clusters throughout the nuclear genome. In humans, for example, roughly 300-400 repeats are distributed among five clusters. 5S rDNA is not made as part of the 45S transcript, but occurs in tandem arrays (~200-300 5S genes) interspersed in the mammalian genome independently of the 45S rDNA genes.

Mammalian mitochondria have only two mitochondrial rRNA molecules (12S and 16S) and do not contain 5S rRNA. These ribosomal RNAs are transcribed from the mitochondrial genome. This is also the case for plant mitochondrial rRNAs, although plants contain more prokaryote-like ribosomal RNAs, i.e. a 16S, a 26S, and a 5S rRNA. Plants also contain chloroplast ribosomal RNAs (16S, 23S, and 5S) produced by transcription from the chloroplast genome.

Messenger RNAs – mRNAs

All organisms (and mitochondria and chloroplasts) produce a type of RNA that codes for the amino acid sequence of proteins. This RNA is a copy of the DNA sequence of the gene and is transcribed from one of the two DNA strands of each gene. By reproducing the DNA sequence as an mRNA copy, the sequence information for the gene is faithfully maintained, allowing the generation of many gene “copies” that can be used to produce even more protein copies from each gene.

Transfer RNAs – tRNA

Transfer RNAs (tRNAs) are smaller (~90 nt) RNA molecules that are transcribed from genes scattered throughout both Prokaryotic and Eukaryotic genomes, including mitochondrial and chloroplast genomes. These molecules are the “decoding” molecules that determine which amino acids are put in proteins in the order specified by the nucleotide sequence in the mRNA. They are highly structured RNA molecules, and there is at least one, often several, tRNA for each of the twenty amino acids found in proteins. Each tRNA is processed from a transcribed precursor-tRNA molecule coded for by specific tRNA genes, and typically there is one tRNA produced per tRNA gene.

In Eukaryotes tRNA genes are scattered across all chromosomes, and there are separate sets of tRNA genes in each of the organelle genomes present in eukaryotes.

Other Non-protein-coding Transcribed RNAs

More recently, additional types of RNAs that perform vital functions in cells have been described. Most of these were described in Eukaryotes once eukaryotic genomes had been sequenced and characterized.

Small nuclear RNAs (snRNAs) are smaller RNAs (typically ~150 nt) transcribed from nuclear DNA in eukaryotic cells. snRNAs are structurally part of small nuclear ribonucleoprotein particles (snRNPs) that are involved in processing mRNAs in the nucleus of cells. Typically there are but a handful of different snRNAs made in each species, and these are highly conserved among eukaryotes.

Small nucleolar RNAs (snoRNAs) are a class of small RNA molecules that function to guide modification of other types of RNA, mostly rRNA, tRNA, and snRNA. One of the main functions of snoRNAs involves modification of the 45S ribosomal precursor so that it can be further processed to generate the 18S, 5.8S, and 28S rRNAs.

Small regulatory RNAs (srRNAs) are found in prokaryotes, where they are involved in the regulation of gene expression, but they are best known for the role they play in transcriptional, posttranscriptional, and translational control of gene expression in Eukaryotes. These molecules are an array of 20-30 nt RNAs transcribed in various ways from genes in the genomes of organisms. Note that although there are primarily two types of srRNAs, microRNAs (miRNA) and short interfering RNAs (siRNA), these types are specific to certain organisms, and there are likely thousands of genes transcribed for such srRNAs.

3.5.2. RNA polymerases catalyze transcription

RNA polymerase is the enzyme responsible for copying a DNA sequence into an RNA sequence during the process of transcription. A complex molecule composed of protein subunits, RNA polymerase controls the process by which the information stored in a molecule of DNA is copied into a molecule of cellular RNA. The detailed mechanism of how RNA polymerase works is shown in Figure 3.22.

Figure 3.22. The chemical reaction catalyzed by RNA polymerases showing both the reactants and products and the specificity of base pair addition. Note the antiparallel nature of the RNA strand to the DNA strand being transcribed. RNA polymerase makes a phosphodiester bond between the 5’-phosphate group closest to the ribose sugar and the 3’-OH on the 3’-end of the growing strand of RNA.

Multisubunit RNA polymerases exist in all species, but the number and composition of these proteins vary across taxa. For instance, bacteria contain a single type of RNA polymerase that transcribes mRNA, tRNA, and all rRNAs. Eukaryotes contain three (animals and fungi) to five (plants) distinct types of RNA polymerases. Each of these RNA polymerases transcribes different species of RNA as shown in Table 3.3.

In spite of these differences, there are striking similarities among transcriptional mechanisms for all RNA polymerases. For example, transcription is divided into three steps for both bacteria and eukaryotes: initiation, elongation, and termination. The process of elongation is highly conserved between bacteria and eukaryotes, but initiation and termination are somewhat different.

All species require a mechanism by which transcription can be regulated in order to achieve spatial and temporal changes in gene expression. Proteins that interact with the core RNA polymerase and that recognize specific sequences in the DNA mediate these initial regulatory steps during transcription initiation. However, the types and nature of these interacting proteins are quite distinct in Prokaryotes compared to Eukaryotes. This leads to a discussion of how transcription initiation at each gene locus takes place in both Prokaryotes and Eukaryotes.

3.5.3. Transcription in Prokaryotes

The bacterium Escherichia coli will be used as the model of Prokaryotic gene regulation. This model is similar to nearly all Prokaryotes.

A prokaryotic gene is a DNA sequence in the chromosome. The gene has three regions, each with a function in transcription (see Figure 3.23.). These are:
1) A promoter sequence that attracts RNA polymerase to begin transcription at a site specified by the promoter. Some genes use one strand of DNA as the template; other genes use the other strand.
2) The transcribed sequence, called the RNA-coding sequence. The sequence of this DNA corresponds with the RNA sequence of the transcript.
3) A terminator region that specifies where transcription will stop.

Figure 3.23. Prokaryotic genes all have promoter regions upstream (toward the 5’-end of the mRNA) of the protein coding gene and terminator regions downstream (toward the 3’-end of the mRNA). These regions are located at the 3’-end (promoter) and the 5’-end (terminator) of the template strand of DNA. Typically the nucleotide where RNA polymerase begins transcribing is designated the +1 nucleotide position, and sequences in the promoter are designated as (-) nt positions.
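A promoter of the kind listed above can be located computationally by scanning the coding strand for the two E. coli σ⁷⁰ consensus hexamers (5’-TTGACA-3’ near -35 and 5’-TATAAT-3’ near -10). The sketch below is a simplified illustration: the function names, the allowed mismatch count, and the 16-18 bp spacer window between the two hexamers are assumptions chosen for the example, not values taken from this text.

```python
MINUS_35 = "TTGACA"  # E. coli sigma-70 consensus near -35
MINUS_10 = "TATAAT"  # E. coli sigma-70 consensus near -10


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoters(seq, max_mismatch=1, spacer_range=(16, 18)):
    """Return (pos35, pos10, spacer) tuples, using 0-based start positions.

    `max_mismatch` applies to each hexamer separately; `spacer_range` is the
    assumed number of bases allowed between the two hexamers.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        if mismatches(seq[i:i + 6], MINUS_35) > max_mismatch:
            continue
        for spacer in range(spacer_range[0], spacer_range[1] + 1):
            j = i + 6 + spacer
            box10 = seq[j:j + 6]
            if len(box10) == 6 and mismatches(box10, MINUS_10) <= max_mismatch:
                hits.append((i, j, spacer))
    return hits


if __name__ == "__main__":
    demo = "GGC" + MINUS_35 + "A" * 17 + MINUS_10 + "GCGCGCAGG"
    print(find_promoters(demo, max_mismatch=0))
```

Real promoters deviate from the consensus, so a position-weight-matrix score is a common refinement of this exact-or-near-exact matching.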

The process of transcription initiation in E. coli is shown in Figure 3.24. The process involves two DNA sequences centered at -35 bp and -10 bp upstream from the +1 start site of transcription in the promoter region of the gene. The two consensus sequences (in E. coli) are 5’-TTGACA-3’ at the -35 region and 5’-TATAAT-3’ at the -10 region (previously known as the Pribnow box), but they can vary according to the organism and the gene within the organism. Transcription initiation requires the RNA polymerase holoenzyme (only one type is found in bacteria) to bind to the promoter DNA sequence. Holoenzyme consists of:
1) Core enzyme of RNA polymerase, containing five polypeptides (two alpha, one beta, one beta’, and one omega; written as α₂ββ’ω).

2) One of several sigma factors (σ-factors) that binds the core enzyme and confers the ability to recognize specific gene promoters.

RNA polymerase holoenzyme binds the promoter in two steps (Figure 3.24) that involve the sigma factor. First, it loosely binds to the -35 sequence of dsDNA (closed promoter complex; Figure 3.24a). Second, it binds tightly to the -10 sequence (Figure 3.24b), untwisting about 17 bp of DNA at the site. At this point RNA polymerase is in position to begin transcription (open promoter complex).

Figure 3.24. Prokaryotic (E. coli) transcription initiation. a) RNA polymerase holoenzyme is “recruited” to the promoter by a specific σ-factor (sigma factor); b) strands of the DNA are separated, exposing the template strand for copying; c) nucleotides are polymerized as RNA polymerase moves down the strand, and σ-factor leaves the complex; d) elongation continues, the newly made mRNA exits the enzyme, and the transcription “bubble” moves.

Promoters often deviate from the consensus sequences at -35 and -10, and the associated genes will show different levels of transcription, corresponding with the σ-factor’s ability to recognize their sequences. E. coli has several sigma factors with important roles in gene regulation. Each sigma factor can bind a molecule of core RNA polymerase and guide its choice of genes to transcribe, but each has a different affinity for specific promoters.

Most E. coli genes have a σ⁷⁰ promoter, and σ⁷⁰ is usually the most abundant σ-factor in the cell. σ⁷⁰ recognizes the sequence TTGACA at -35 and TATAAT at -10. Other sigma factors may be produced in response to changing conditions, and each can bind the core RNA polymerase, enabling holoenzyme to recognize different promoters. An example is σ³², which arises in response to heat shock and other forms of stress and recognizes sequences at -39 bp and -15 bp. E. coli has additional sigma factors with various roles (Table 3.4), and other bacterial species also have multiple similar and additional sigma factors.

TABLE 3.4. E. coli σ-factors and their function

σ-factor | Function
σ⁷⁰ (rpoD) = σA | the “housekeeping” sigma factor, also called the primary sigma factor; transcribes most genes in growing cells. Every cell has a “housekeeping” sigma factor.
σ¹⁹ (fecI) | the ferric citrate sigma factor; regulates the fec gene for iron transport
σ²⁴ (rpoE) | the extracytoplasmic/extreme heat stress sigma factor
σ²⁸ (rpoF) | the flagellar sigma factor
σ³² (rpoH) | the heat shock sigma factor; it is turned on when the bacteria are exposed to heat. Because of its higher expression, the factor binds the polymerase core enzyme with high probability. As a result, other heat shock proteins are expressed, which enable the cell to survive higher temperatures. Some of the enzymes expressed upon activation of σ³² are chaperones, proteases, and DNA-repair enzymes.
σ³⁸ (rpoS) | the starvation/stationary phase sigma factor
σ⁵⁴ (rpoN) | the nitrogen-limitation sigma factor

Many bacterial genes are controlled by regulatory proteins that interact with regulatory sequences near the promoter. There are two classes of regulatory proteins: activators, which stimulate transcription by facilitating RNA polymerase activity, and repressors, which inhibit transcription by decreasing RNA polymerase binding or elongation of RNA.

Once initiation is completed, RNA synthesis begins, and the sigma factor is released and reused for other initiations (Figure 3.24c). Core enzyme completes the transcript. Core enzyme untwists the DNA helix locally, allowing a small region to denature. Newly synthesized RNA forms an RNA–DNA hybrid, but most of the transcript is displaced as the DNA helix reforms (Figure 3.24d).

Terminator sequences are used to end transcription. In E. coli there are two types of transcript termination:
1) Rho-independent (ρ-independent) or type I terminators (Figure 3.25, upper) have twofold symmetry that allows a hairpin loop to form. The palindrome is followed by 4–8 U residues in the transcript, and when these sequences are transcribed, they form a stem-loop structure and cause chain termination.
2) Rho-dependent (ρ-dependent) or type II terminators (Figure 3.25, lower) require the protein ρ for termination. Rho binds to a C-rich sequence in the RNA upstream of the termination site and moves with the transcript until encountering a stalled polymerase. It then acts as a helicase, using ATP hydrolysis for energy to move along the transcript and destabilize the RNA–DNA hybrid at the termination region, terminating transcription.

Figure 3.25. Simplified schematics of the mechanisms of prokaryotic transcriptional termination. In Rho-independent termination, a terminating hairpin forms on the nascent mRNA, interacting with the NusA protein to stimulate release of the transcript from the RNA polymerase complex (top). In Rho-dependent termination, the Rho protein binds at the upstream rut site, translocates down the mRNA, and interacts with the RNA polymerase complex to stimulate release of the transcript.

3.5.4. Transcription in Prokaryotes – polycistronic mRNAs from operons

While we have considered the structure of a prokaryotic gene as having a promoter, a coding region, and a termination region (see Figure 3.23), in most cases multiple protein-coding regions are under the control of a single promoter. This genetic structure is referred to as an operon, and the mRNA transcribed from each operon is in fact an RNA capable of producing multiple peptides. This type of mRNA, typical of prokaryotes and of Eukaryotic mitochondria and chloroplasts, is referred to as a polycistronic mRNA. Thus, the proteins that bind to promoter and regulatory regions to regulate gene expression in prokaryotes regulate the production of multiple peptides simultaneously. Typically, these peptides are functionally related, e.g. the proteins required to catabolize lactose as a carbon source [lac operon] (see Figure 3.26.), or the proteins required to make the amino acid tryptophan [trp operon] (see Figure 3.27.).

The lac operon is an example of an inducible (positively regulated) operon: the repressor protein does not bind to the operator and stop transcription in the presence of the effector (lactose). The tryptophan operon is an example of a repressible (negatively regulated) operon: the repressor protein binds to the operator only in the presence of the effector molecule (tryptophan). Thus, using similar types of regulatory proteins and genes, and similar operon structure, almost any type of gene regulation can be obtained.

Additionally, it should be noted that the proteins for related critical cellular functions can be coordinately regulated as a consequence of the production of polycistronic mRNAs.

Figure 3.26. The lac operon in E. coli. Three lactose metabolism genes (lacZ, lacY, and lacA) are organized together in a cluster called the lac operon. The coordinated transcription and translation of the lac operon structural genes is controlled by a shared promoter, operator, and terminator. A lac regulator gene (lacI) with its separate promoter is found just outside the lac operon. The lacI gene produces a regulatory protein, the lac repressor protein, that binds to the “inducer”, which is lactose (or a derivative, allolactose), when it is present in a cell. The lacI protein can also bind to a region of the operon between the lac promoter and the structural genes referred to as the lac operator (lacO). In the absence of lactose (allolactose) the lacI protein binds tightly to the operator and prevents RNA polymerase from transcribing the polycistronic mRNA. When lactose binds to the lacI protein, the lacI protein cannot bind to the lacO site, and RNA polymerase proceeds to produce the polycistronic mRNA corresponding to the lacZ, lacY, and lacA genes. © 2013 Nature Education Adapted from Pierce, Benjamin. Genetics: A Conceptual Approach, 2nd ed. All rights reserved.

3.5.5. Beyond Operons - Modification of expression of prokaryotic genes

Additional regulation of operons is often used to produce further fine-tuning of transcription. This can vary with each operon in Prokaryotic genomes. A common type of additional regulation has been shown for the lac operon and many other catabolic operons.

Glucose is the preferred carbon source in E. coli. In the presence of glucose, lactose will not be utilized.
This means that if an abundant supply of glucose and lactose are both available, the lac operon will not be induced until the glucose is used up. This phenomenon is often referred to as catabolite repression, and the critical components of catabolite repression in the lac operon Figure 3.27. The tryptophan operon of E. Coli consists of five structural genes (trpE, trpD, trpC, trpB, and trpA) with a common are shown in Figure 3.28. promoter, operator, and terminator. A separate promoter regulates the trpR regulatory protein (trp repressor). Transcription of the trp When the concentration of intracellular glucose is operon produces a polycistronic mRNA that contains a leader peptide low (Figure 3.28, upper panel) the levels of the signal and coding sequences for the 5 structural genes that produce the 5 enzymes required to make tryptophan. Since tryptophan is an amino molecule cAMP are high, and cAMP binds to CAP acid required for cell growth, the trp operon is “repressed” when protein. The association between RNA polymerase and cells have access to an abundant supply of tryptophan (panel A), and becomes “derepressed” when cells are starving for tryptophan (panel promoter DNA is enhanced when the CAP-cAMP B). A) Tryptophan present, repressor bound to operator, operon complex is present. Enhanced RNA polymerase binding repressed. When complexed with tryptophan, the repressor protein binds tightly to the trp operator, thereby preventing RNA polymerase leads to a high rate of transcription (provided that the from transcribing the operon structural genes. B) Tryptophan absent, operator is free) and translation of the lac operon repressor not bond to operator, operon derepressed. In the absence of tryptophan, the free trp repressor cannot bind to the operator polycistronic mRNA. The resulting mRNA transcripts are site. 
RNA polymerase can therefore move past the operator and translated into the enzymes beta-galactosidase, transcribe the trp operon structural genes, giving the cell the capability to synthesize tryptophan. permease, and transacetylase, and these enzymes are

CONCEPTS OF GENOMIC BIOLOGY Page 58 used to break down lactose into glucose and Galactose. The latter can subsequently be converted into glucose. When the glucose concentration in the cell is high (Figure 3.28, lower panel), low concentrations of cAMP result in decreased binding of cAMP to CAP. Therefore, the cAMP-CAP complex is not bound to the bacterial DNA, and as a result, neither is RNA polymerase. This lowers the rate of transcription and polycistronic mRNA production is decreased for the lacZ, lacY, and lacA genes. The absence of these proteins reduces glucose production from lactose, leading to the use of the available glucose prior to the use of any lactose. The interaction of CAP with DNA and with cAMP directly regulates the production of mRNA. Some type of interaction of proteins with regulatory regions in the DNA mediates the phenomenon of catabolite repression in operons associated with carbon source utilization in prokaryotes. In anabolic operons (typical of amino acid synthesis), a phenomenon of additional regultation referred to as attenuation has been documented. The example most commonly considered involves the trp operon discussed above.
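As a recap of the lac operon control described above, before moving on to the trp example, the two layers of regulation (the lac repressor and the CAP-cAMP complex) can be summarized as a small truth table. The sketch below is a deliberately simplified, qualitative model: the function name and the three output labels ("off", "low", "high") are illustrative assumptions, and real cells show low basal transcription rather than a strict "off" state.

```python
def lac_operon_transcription(lactose_present: bool, glucose_present: bool) -> str:
    """Qualitative transcription level of lacZ, lacY, and lacA (toy model)."""
    # Without the inducer (lactose/allolactose), the lacI repressor sits on lacO.
    repressor_on_operator = not lactose_present
    # Low glucose -> high cAMP -> CAP-cAMP complex helps RNA polymerase bind.
    cap_camp_bound = not glucose_present

    if repressor_on_operator:
        return "off"    # operator blocked, regardless of CAP-cAMP
    if cap_camp_bound:
        return "high"   # operator free and polymerase binding enhanced
    return "low"        # operator free but no CAP-cAMP (catabolite repression)


if __name__ == "__main__":
    for lactose in (False, True):
        for glucose in (False, True):
            level = lac_operon_transcription(lactose, glucose)
            print(f"lactose={lactose!s:5} glucose={glucose!s:5} -> {level}")
```

Only the combination of lactose present and glucose absent gives the "high" level, which matches the observation that the operon is not fully induced until the glucose is used up.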

Figure 3.28. Diagram showing the major effects of low glucose (upper panel) and high glucose (lower panel) on the expression of lac operon genes. © 2013 Nature Education Adapted from Pierce, Benjamin. Genetics: A Conceptual Approach, 2nd ed. (New York: W. H. Freeman and Company), 446. All rights reserved.

The leader sequence in the polycistronic mRNA of the trp operon contains several trp codons, and can form 3 different stem-loop structures. Depending on the amount of available tryptophan, one of two structures can be produced (Figure 3.29). One structure leads to termination of transcription in the leader sequence when trp is abundant, while the other structure does not terminate transcription, and the polycistronic mRNA is produced.

Figure 3.29. Attenuation of the trp operon. The diagram at the center shows the general folding of the leader sequence of the trp polycistronic mRNA and labeling of strands. The mRNA is folded in four parallel strands connected at the bottom by two small hairpin loops between strands 1 and 2 and strands 3 and 4 and by one large hairpin loop at the top between strands 2 and 3. In the structure on the left, strands 1 and 2 and strands 3 and 4 are stabilized by base pairing. This structure terminates transcription of the trp operon in the presence of high tryptophan. In contrast, strands 2 and 3 are stabilized by base pairing in the structure on the right, which allows transcription of the trp operon to continue in the presence of low tryptophan. © 1981 Nature Publishing Group Yanofsky, C. Attenuation in the control of expression of bacterial operons. Nature 289, 753 (1981). All rights reserved.

Several amino acid synthetic operons (e.g. phenylalanine, histidine, leucine, threonine, and isoleucine-valine) demonstrate this same type of attenuation. Consequently, this mechanism is relatively widespread as a means of modulating and fine-tuning pathways for amino acid biosynthesis.

3.5.6. Transcription in Eukaryotes (return)

Although transcription in Eukaryotes follows the general principles outlined above for Prokaryotes, there are many specific details that are different. Recall that there are as many as five Eukaryotic RNA polymerases. While each of these transcribes different types of RNA, they are all multisubunit RNA polymerases that function in related ways. The mechanism of the important RNA polymerase II that produces mRNAs will be described here, but each of the 5 has similar mechanisms for initiation, elongation, and termination of transcription.

Eukaryotic mRNAs are nearly always monocistronic mRNAs with a general structure as shown in Figure 3.30. The key transcribed features are a 5’-UTR (untranslated region), a coding region, and 3’-UTR. Other nontranscribed features that are typical of mRNAs in Eukaryotic cells include a 5’-Cap structure and a poly-A tail that will be described in more detail below.

Figure 3.30. Diagram showing the elements and structure of a typical eukaryotic mRNA-producing gene. [Diagram labels: the gene (DNA, 5’ to 3’) with upstream enhancers, a promoter containing the TATA box, exons 1 to 3, and introns 1 and 2; gene transcription by RNA polymerase II gives the primary transcript; nuclear processing (5’ capping and poly-A tail addition) gives the pre-mRNA with a 7-methyl-G 5’ cap and a 3’ poly-A tail; nuclear processing (intron removal and transport to the cytoplasm) gives the final mRNA with 5’ cap, 5’ UTR, protein coding region, 3’ UTR, and 3’ poly-A tail.] Note that a primary transcript is produced which is subsequently modified by the addition of a 7-methyl guanosine (Cap), and the poly-A tail. Subsequently, introns are spliced from the transcript to make a finished mRNA ready to exit the nucleus.

Promoters in many Eukaryotes have been analyzed either by the use of directed mutations within promoter sequences or by comparative analysis of multiple genes from different organisms. These studies have revealed that there are two types of elements found in Eukaryotic promoters: core promoter elements and promoter-proximal elements.

Core promoter elements are located near the transcription start site and specify where transcription begins. Examples include:
1) The initiator element (Inr), a pyrimidine-rich sequence that spans the transcription start site;
2) The TATA box (also known as a TATA element or Goldberg–Hogness box) at -30 nt (full sequence is TATAAAA). This element aids in local DNA denaturation and sets the start point for transcription.

Promoter-proximal elements are required for high levels of transcription. They are further upstream from the start site, at positions between -50 and -200. These elements generally function in either orientation. Examples include:
1) The CAAT box, located at about -75.
2) The GC box, consensus sequence GGGCGG, located at about -90.

Various combinations of core and proximal elements are found near different genes. Promoter-proximal elements are key to understanding the rate at which transcription initiation occurs and thus the level of gene expression.

Eukaryotic transcription initiation requires assembly of RNA polymerase II and binding of general transcription factors (GTFs) on the core promoter at the TATA box (see Figure 3.31), forming a preinitiation complex (PIC). Note that the PIC is sometimes referred to simply as the transcription initiation complex. GTFs are needed for initiation by all RNA polymerases and are numbered to match their corresponding RNA polymerase and lettered in the order of discovery (e.g., TFIID was the fourth GTF discovered that works with RNA polymerase II).
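To make the positions of the core and promoter-proximal elements described in this section more concrete, the following minimal sketch scans the sequence upstream of a transcription start site for three motifs and reports where each begins relative to +1. The function name, the toy sequence, and the use of CCAAT as the CAAT-box motif are assumptions made for illustration (the text above gives explicit sequences only for the TATA box and the GC box), and exact string matching on one strand is a simplification, since real elements are degenerate and, as noted above, can function in either orientation.

```python
import re

# Motifs used for illustration. "CCAAT" for the CAAT box is an assumption;
# TATAAAA and GGGCGG are the sequences quoted in the text.
ELEMENTS = {
    "TATA box": "TATAAAA",
    "CAAT box": "CCAAT",
    "GC box": "GGGCGG",
}


def find_promoter_elements(upstream: str) -> list[tuple[str, int]]:
    """Return (element name, position) pairs for exact motif matches.

    `upstream` is the DNA sequence that ends immediately before the
    transcription start site (+1), so its last base is position -1.
    Positions refer to the first base of each match.
    """
    seq = upstream.upper()
    hits = []
    for name, motif in ELEMENTS.items():
        # A lookahead is used so that overlapping matches are also reported.
        for match in re.finditer(f"(?={motif})", seq):
            hits.append((name, match.start() - len(seq)))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    # Toy 100-nt upstream region with a GC box, CAAT box, and TATA box.
    toy = "A" * 10 + "GGGCGG" + "A" * 9 + "CCAAT" + "A" * 40 + "TATAAAA" + "A" * 23
    for name, position in find_promoter_elements(toy):
        print(f"{name:9} at {position}")
```

On the toy sequence the three motifs are reported at -90, -75, and -30, the approximate positions quoted above for the GC box, CAAT box, and TATA box.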
Figure 3.31. Eukaryotic transcription begins with the formation of a transcription preinitiation complex (PIC) on the TATA box in the promoter of the gene. The PIC is a large complex of proteins that is necessary for the transcription of protein-coding genes in eukaryotes. The preinitiation complex helps position RNA polymerase II over gene transcription start sites, denatures the DNA, and positions the DNA in the RNA polymerase II active site for transcription. The minimal PIC includes RNA polymerase II and six general transcription factors: TFIIA, TFIIB, TFIID, TFIIE, TFIIF, and TFIIH. Additional regulatory complexes (co-activators and chromatin-remodeling complexes) could also be components of the PIC.

The general transcription factors, along with other proteins forming specific PICs at a particular promoter, poise RNA polymerase to begin transcription of the gene behind the promoter. Once the PIC forms, RNA polymerase will initiate transcription. However, the rate at which transcription initiation occurs at a particular gene depends on 2 factors.

The first factor is the number and types of enhancer/silencer sequence elements found in the promoter. These sequence elements can be from 50 nt to over 1,000 nt in length. Enhancer/silencer elements must be located in cis (meaning on the same DNA molecule) to the promoter/coding sequence in order to affect the expression of a gene. Some enhancer/silencer sequences have been found that are as much as 1 megabase (1,000,000 nt) away from the transcription start site (TATA box), but most are within a few thousand bases or less of the TATA box.

The second factor regulating the rate of transcript initiation is proteins that can bind to specific enhancer or silencer sequence elements. Activators are proteins that bind to enhancer sequences.
Activator proteins also contain protein-protein interaction domains that allow them to bind to and affect the behavior of other proteins. These other proteins could be RNA polymerase itself; other general transcription factors in the PIC; or other adapter proteins that interact with the PIC (see Figure 3.32).

Figure 3.32. An activator protein binding to a promoter-proximal enhancer sequence, interacting with an adapter protein, and the PIC to enhance transcription initiation.

Repressor proteins can bind to either silencer or enhancer sequence elements in the promoter. In so doing they reverse the effect of activator proteins by either interfering with the critical protein-protein interactions of activators or by binding tightly to enhancer sequences, keeping activators from binding.

Thus, activator and repressor proteins are important in transcription regulation. They recognize promoter-proximal elements and other enhancer/silencer sequence elements found upstream of the promoter, and they are specific for groups of similarly regulated genes. These proteins mediate the rate of transcription initiation for genes that contain recognized sequence elements. The presence or absence of specific activator and repressor proteins in a specific cell, either because of cell type or because of environmental factors, can mediate the initiation of transcription.

For example, housekeeping genes (used in all cell types for basic cellular functions) have common promoter-proximal elements and are recognized by activator proteins found in all cells. Examples of genes with housekeeping functions include: actin, hexokinase, and glucose-6-phosphate dehydrogenase.

Genes expressed only in some cell types or at particular times have promoter-proximal elements recognized by activator proteins found only in specific cell types or times. Enhancers are another cis-acting element. They are required for maximal transcription of a gene.
Enhancers/silencers are usually upstream of the transcription initiation site but may also be downstream. They may modulate from a distance of thousands of base pairs away from the initiation site. Because there are similar enhancer and silencer sequences in front of several genes that are coordinately regulated, and each gene promoter has its own unique spectrum of such sequences, Eukaryotic cells can avoid the necessity of contiguous organization of genes into operons as is common in prokaryotes.

Additionally, each tissue produces a set of tissue-specific and general activator and repressor proteins, and the spectrum of these proteins can be influenced by environmental factors such as cellular surroundings, temperature, chemical environment, etc. This affords the ability of each cell to “customize” the expression of genes depending on the protein functions that are required in each cell based on cell type and cellular environment. This phenomenon is referred to as combinatorial gene regulation and is illustrated in Figure 3.33.

Figure 3.33. Combinatorial gene regulation leads to the coordinate regulation of batteries of genes in Eukaryotes. The types of enhancer and silencer sequences in front of each gene determine the level of transcription of each gene based on the activator and repressor proteins present in each cell/tissue type and the environment surrounding each cell.

Once transcription initiation has occurred, the RNA polymerase moves away from the TATA box as the transcript is elongated. This is fundamentally the work of RNA polymerase, and the other proteins of the PIC now leave the complex to be recycled to form new PICs while RNA polymerase elongates the primary transcript nucleotide chain to complete the formation of the primary transcript.

3.5.7. Processing the primary transcript into a mature mRNA (return)

As shown in Figure 3.30, the primary transcript must be processed in 3 significant ways to become a mature mRNA. This processing all takes place in the nucleus of the cell and prepares the mRNA for transport to the cytosol of the cell where it will subsequently be translated to produce a protein.

First, the primary transcript must acquire a cap at its 5’-end. The cap prepares the transcript for transport from the nucleus, provides stability to attack by nucleases in the cytoplasm, and aids in the initiation of the translation process. Structurally, a cap consists of a 7-methyl guanosine attached by 3 phosphate groups to the 5’-end of the transcript. Note that the cap is reversed compared to the RNA strand, i.e. it is attached 5’ to 5’, not 5’ to 3’ as are the other nucleotides in the transcript. The cap can be attached to the transcript during transcription before completion of the primary transcript, but it is critical to efficient transport of the mRNA from the nucleus, so it must be attached in the nucleus.

Figure 3.34. Structure of the 5’-Cap added to Eukaryotic primary RNA transcripts. The cap consists of a 7-methyl guanosine residue attached 5’ to 5’ at the 5’ end of the transcript by 3 phosphate groups (a 5’-5’ triphosphate linkage).

The second processing step occurs at the 3’-end of the transcript (Figure 3.35), and is involved in termination of transcript elongation by RNA polymerase II. Note that other eukaryotic RNA polymerases may have other mechanisms of transcript termination since they do not produce polyadenylated transcripts. The process for addition of the poly-A tail involves a complex of proteins that assembles at a poly-A addition consensus sequence (AAUAAA). The proteins involved in the cleavage step of the termination process include:
1) CPSF (cleavage and polyadenylation specificity factor).
2) CstF (cleavage stimulation factor).
3) Two cleavage factor proteins (CFI and CFII).

Following cleavage, the enzyme poly(A) polymerase (PAP) adds A nucleotides to the 3’ end of the cleaved transcript RNA, using ATP as a substrate. PAP is bound to CPSF during this process. Typically, about 200-250 A’s are added. PABII (poly-A binding protein II) binds the poly-A tail as it is produced. Upon completion of the poly-A tail, further transcription is terminated with the release of the pre-mRNA transcript from the protein complex.

Figure 3.35. The addition of a poly-A tail to the transcript terminates transcription of the pre-mRNA. The process involves 2 steps: A) cleavage of the growing primary transcript by a complex of proteins that recognize a poly-A addition signal in the transcript; B) addition of 200-300 A’s to the 3’ end of the transcript by poly(A) polymerase (PAP).

The third step in the process of producing a mature mRNA from a pre-mRNA involves removal of sequences that are found in the DNA coding sequence and pre-mRNA but are absent from the mature mRNA that is found in the cytoplasm of the cell. These removed sequences are called introns. The parts of the pre-mRNA that remain in the mature mRNA are called exons (see Figure 3.30).

The removal of introns from the primary transcript is a process referred to as splicing, and it typically involves a ribonucleoprotein (RNP) particle referred to as a spliceosome. Spliceosomes are assembled from small nuclear ribonucleoprotein particles (snRNPs) associated with pre-mRNAs. snRNAs that were previously discussed are structural parts of spliceosome RNPs. The principal snRNAs involved are U1, U2, U4, U5, and U6. Each of these snRNAs is associated with several proteins; e.g. U4 and U6 are part of the same snRNP. Others are in their own snRNPs. Each snRNP type is abundant (~10^5 copies per nucleus), consistent with the critical role that these snRNPs play in nuclear processes.

The steps of RNA splicing are outlined in Figure 3.36:
1) U1 snRNP binds the 5’ splice junction of the intron, as a result of base-pairing of the U1 snRNA to the intron RNA.
2) U2 snRNP binds by base pairing to the branch-point sequence upstream of the 3’ splice junction.
3) U4/U6 and U5 snRNPs interact and then bind the U1 and U2 snRNPs, creating a loop in the intron.
4) U4 snRNP dissociates from the complex, forming the active spliceosome.
5) The spliceosome cleaves the intron at the 5’ splice junction, freeing it from exon 1. The free 5’ end of the intron bonds to a specific nucleotide (usually A) in the branch-point sequence to form an RNA lariat.
6) The spliceosome cleaves the intron at the 3’ junction, liberating the intron lariat. Exons 1 and 2 are ligated, and the snRNPs are released.

Figure 3.36. The process of intron splicing conducted by U2-dependent spliceosomes. Note that there are other types of spliceosomes, and that there are a few introns that are spliced independent of spliceosomes. The binding of at least 5 RNP complexes containing snRNAs and proteins ultimately produces a structure that holds the transcript cleaved ends together while the intron is spliced out, producing a “lariat” structure. The exon ends of the transcript are then ligated together, producing a mature mRNA with the intron removed from the sequence.

One of the most interesting aspects of intron splicing is that there can be different transcripts created based on how introns are spliced. This is referred to as alternative splicing, and it can be used to produce different polypeptides from the same gene, as shown in Figures 3.37 and 3.38.
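The net effect of splicing on the sequence (introns removed, exons joined in order) and the simplest form of alternative splicing (skipping an exon) can be sketched with plain string slicing. The example below is only an illustration of the outcome, not of the spliceosome mechanism: the function name, the toy pre-mRNA, and the exon coordinates are all hypothetical, and the toy introns simply follow the common convention of starting with GU and ending with AG.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]],
           skip: frozenset = frozenset()) -> str:
    """Join exon segments in order, leaving out introns.

    `exons` holds (start, end) pairs, 0-based and end-exclusive.
    `skip` holds the indices of exons to omit (exon skipping).
    """
    return "".join(
        pre_mrna[start:end]
        for index, (start, end) in enumerate(exons)
        if index not in skip
    )


# Toy pre-mRNA: exon 1, intron 1, exon 2, intron 2, exon 3.
PRE_MRNA = "AUGGCC" + "GUAAGUCUAACAG" + "GAAUUU" + "GUAUGUUUACAG" + "UAA"
EXONS = [(0, 6), (19, 25), (37, 40)]

if __name__ == "__main__":
    print(splice(PRE_MRNA, EXONS))                       # all three exons joined
    print(splice(PRE_MRNA, EXONS, skip=frozenset({1})))  # exon 2 skipped
```

With all three exons retained the toy transcript becomes AUGGCCGAAUUUUAA; skipping the second exon gives the shorter AUGGCCUAA, so the same gene yields two different messages.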

Figure 3.37. A schematic representation of alternative splicing. The figure illustrates different types of alternative splicing: exon inclusion or skipping, alternative splice-site selection, mutually exclusive exons, and intron retention. For an individual pre-mRNA, different alternative exons often show different types of alternative-splicing patterns. © 2002 Nature Publishing Group Cartegni, L., Chew, S. L., & Krainer, A. R. Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nature Reviews Genetics 3, 285–298 (2002). All rights reserved.

Figure 3.38. Alternative splicing of 1 primary transcript to produce 3 different proteins.

From the above discussion it is clear that processing of a mature mRNA from the primary RNA transcript, and the transport of the mature mRNA from the nucleus to the cytoplasm, are steps that can influence the amount of translatable mRNA for a particular protein that exists in a cell. The details of the steps we have discussed have emerged from a series of original molecular genetic studies, and have been greatly embellished more recently by functional genomic studies that we will investigate further in subsequent chapters.

3.6. TRANSLATION (RETURN)

Following transcription, the genetic information carried in the nucleotide sequence of mRNA is translated into an amino acid sequence of a polypeptide or protein. This section concerns the process by which this translation of genetic information takes place.

3.6.1. The Nature of Proteins (return)

Amino acids are the monomeric building blocks used to make proteins. The generic structure of an amino acid is shown in Figure 3.39.

Figure 3.39. Generic structure of an amino acid showing the features common to all amino acids. These features include an alpha carbon that has chemically bonded to it: an amino group, a carboxyl group, a hydrogen atom, and an R-group. The 20 protein amino acids differ by the structure of their R-group.

The feature that distinguishes the 20 protein amino acids from one another is the nature of their R-groups, while the amino and carboxyl groups are utilized to make the peptide bonds that form the “backbone” of proteins (Figure 3.40). The R-groups make the 20 amino acids unique.

Figure 3.40. The peptide bond is used to fasten amino acids together in proteins. Basically, peptide bonds form between the amino group of one amino acid and the carboxyl group of another amino acid. This would produce a dipeptide.

However, there are similarities in the chemical properties imparted to each amino acid by their R-group. As we learn more about proteins and their important features, we will need to understand this similarity in amino acid functional groups. Figure 3.41 shows the four classes of amino acids: acidic, basic, polar, and nonpolar. These classes are based on R-groups having similar properties in each group.

Figure 3.41. There are 20 amino acids used in biological proteins. They are divided into subgroups according to the properties of their R groups (2 acidic, 3 basic, 9 neutral and polar, and 6 neutral and nonpolar).

3.6.2. The Genetic Code (return)

How many nucleotides are needed to specify one amino acid? A one-letter code could specify four amino acids; two letters specify 16 (4 * 4). To accommodate 20, at least three letters are needed (4 * 4 * 4 = 64). Thus we can conclude that the Genetic Code is most likely a triplet code. This has been verified experimentally and the code words for all 20 amino acids have been determined (see Figure 3.42).

Figure 3.42. Codon table, showing each of the 64 possible triplet code words. Both single letter and 3 letter abbreviations are shown for the amino acid corresponding to each code word. Note that there are 3 stop code words that do not code for any amino acid, and an initiation code word that codes for methionine (or N-formyl methionine in Prokaryotes).

Characteristics of the genetic code:
1) It is a triplet code, meaning each three-nucleotide codon in the mRNA specifies one amino acid in the polypeptide.
2) It is “punctuation” free, i.e. the mRNA is read continuously, three bases at a time, without skipping any bases.
3) It is nonoverlapping. Each nucleotide is part of only one codon and is read only once during translation.
4) It is almost universal. In nearly all organisms studied, most codons have the same amino acid meaning. Examples of minor code differences include the protozoan Tetrahymena and mitochondria of some organisms.
5) It is degenerate. Of 20 amino acids, 18 are encoded by more than one codon (see Figure 3.42). M (AUG) and W (UGG) are the exceptions; all other amino acids correspond to a set of two or more codons, e.g. F, Y, C, H, Q, N, K, D, & E all have 2 code words, I has 3, V, P, T, A, & G all have 4 code words, and L, S, & R have 6 code words. Codon sets often show a pattern in their sequences; variation at the third position is most common (result of wobble in this position, see below).
6) Wobble occurs in the anticodon. The third base in the codon is able to base-pair less specifically, because it is less constrained three-dimensionally. It wobbles, allowing a tRNA with base modification of its anticodon (e.g., the purine inosine) to recognize up to three different codons (Figure 3.43 and Table 3.5).
7) The code has start and stop signals. AUG is the usual start signal for protein synthesis and defines the open reading frame. Stop signals are codons with no corresponding tRNA, the nonsense or chain-terminating codons. There are generally three stop codons: UAG (amber), UAA (ochre), and UGA (opal).

Figure 3.43. Schematic showing normal and wobble base pairing at the 5’ end of the anticodon in the tRNA with the 3’ end of the codon in the mRNA.

3.6.3. tRNA the decoding molecule (return)

As described earlier in Section 3.5.1, transfer RNAs (tRNAs) are a type of transcribed RNA produced from a set of tRNA genes scattered throughout the genomes of eukaryotes, mitochondria, and chloroplasts. A genome would have to have at least 1 transfer RNA for each of the 20 amino acids, but in fact, because of multiple code words and the inability of wobble to make it possible for 1 tRNA to code for even 4 code words, there must be substantially more than 20 tRNAs in each genome. Typical Eukaryotic genomes contain around 100 tRNA genes, and multiple tRNAs for each amino acid.

Structurally, tRNAs have a highly organized, folded structure, and because they must all bind to the same sites on ribosomes, the overall structures of all tRNAs are quite similar. In 2 dimensions tRNAs are typically written showing a cloverleaf structure (Figure 3.44, upper left). Because of base pairing the 3’ end and the 5’ end form a stem, and there are 3 separate loops making up each of the leaves of the “clover”. The loop opposite from the stem contains the anticodon sequence, while the other two loops help the tRNA fold into a proper 3-dimensional structure (Figure 3.44, upper right). This 3-dimensional structure is an L-shaped molecule that fits into sites on the ribosome during the translation process. The amino acid is attached to the free 3’-OH group at the 3’ end of the tRNA sequence via an ester bond (acid function of the amino acid to the hydroxyl function on the ribose sugar). This 3’-terminal OH is always on an adenosine, which is preceded by two cytosines. These 3 nucleotide residues are not transcribed, but are posttranscriptionally added during processing of the primary transcript. Note all tRNAs share this 3’ feature.

Figure 3.44. Characteristics and structure of tRNAs. Upper left shows a 2-D “cloverleaf” structure typical of most tRNAs. However, this structure has additional secondary structure producing a 3-D structure similar to that shown in the upper right. An atomic space-filling model of this 3-D structure is shown in the lower left, and the detailed 3’-end of all tRNAs, showing the posttranscriptionally added CCA where the amino acid is attached, is shown in the lower right.

The enzyme that adds the amino acid to the tRNA is called an amino acyl tRNA synthetase (note that for a particular amino acid attachment, amino acyl would be replaced by the name of the amino acid, e.g. for the amino acid glycine, a glycyl tRNA synthetase attaches glycine to the 3’-OH on the adenosine at the 3’-terminus). There must be at least 1 amino acyl tRNA synthetase for the tRNAs coding for each amino acid, but there may be multiple tRNA synthetases if there are multiple tRNAs. tRNA synthetases recognize the proper amino acid to be fastened to each tRNA based on the shape of the amino acid, and they also recognize the tRNA to which the amino acid should be fastened based on the detailed shape of the tRNA. Note that tRNA synthetases do not recognize the anticodon sequence, but rather overall shape characteristics of the tRNAs. Although all tRNAs have a relatively similar shape so that they fit into ribosomes, they have enough uniqueness to their shape that only the properly shaped tRNAs are recognized by the corresponding tRNA synthetases, and thus shape recognition assures that the proper amino acid gets attached to a tRNA with the appropriate anticodon for that amino acid.

The attachment of the amino acid to the tRNA molecule carried out by the amino acyl tRNA synthetase is described in Figure 3.45. This reaction requires that the amino acid be energized by ATP, forming an amino acyl AMP enzyme-bound complex and releasing pyrophosphate (PPi). The amino acid is then transferred to a tRNA forming an ester bond, and the aminoacylated tRNA and AMP are then released from the enzyme.

Figure 3.45. The catalytic cycle of an amino acyl tRNA synthetase. The enzyme binds an amino acid and ATP (top); ATP is hydrolyzed, producing pyrophosphate and aminoacyl-AMP bound to the enzyme. This is followed by the appropriate tRNA binding to the aa-tRNA synthetase, and subsequently the formation of the amino acylated tRNA (referred to as a charged tRNA), with the subsequent liberation of the tRNA and AMP.

Functionally, amino acids are inserted into the polypeptide in the proper sequence due to the decoding function of the tRNA. This involves specific binding of each amino acid to its cognate (appropriate) tRNA, and specific base pairing between the mRNA codon and tRNA anticodon. Thus, the fidelity of the transfer of genetic information from the DNA of the gene to the amino acid sequence of the protein relies as much on the fidelity of the “coding” done by the tRNA molecules as on the fidelity of the transcription process creating the mRNA.
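Several of the properties of the genetic code and of codon-anticodon pairing described above can be checked with a few lines of code. The sketch below builds the standard codon table from a compact 64-letter string (one-letter amino acid symbols, with * for stop), counts how many code words each amino acid has, and writes out the anticodon that pairs with a codon by standard base pairing. The names CODON_TABLE, code_words, degeneracy, and anticodon are illustrative choices, and wobble pairing is deliberately not modeled.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def code_words(length: int) -> int:
    """Number of distinct code words of a given length over 4 bases."""
    return len(BASES) ** length


# How many codons specify each amino acid (or stop, shown as "*").
degeneracy = Counter(CODON_TABLE.values())


def anticodon(codon: str) -> str:
    """Anticodon (written 5' to 3') that pairs with a codon by standard
    Watson-Crick pairing; wobble is not modeled here."""
    complement = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(codon))


if __name__ == "__main__":
    print([code_words(n) for n in (1, 2, 3)])
    print(sorted(degeneracy.items(), key=lambda item: item[1]))
    print(anticodon("AUG"))
```

The counts reproduce the pattern listed in item 5: only M and W have a single code word, I has 3, L, S, and R have 6 each, and three codons are stops.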

3.6.4. Peptides Are Synthesized On Ribosomes In Both Prokaryotes and Eukaryotes (return)

Ribosomes are RNA-protein complexes that are so small that they cannot be seen with a light microscope. Although both Prokaryotes and Eukaryotes have ribosomes, there are key differences between both the RNA species and proteins found in ribosomes in the two groups of organisms. It should also be noted that mitochondria and chloroplasts in Eukaryotic cells produce bacteria-like ribosomes that produce proteins in each of these organelles.

Figure 3.46. The structure and composition of (a) Prokaryotic and (b) Eukaryotic ribosomes. Both are composed of a large ribosomal subunit and a small ribosomal subunit, although these are different sizes. The rRNAs and proteins that are a part of each of the subunits are also shown.

The details of ribosomal structure in Prokaryotes and Eukaryotes are shown in Figure 3.46; you may need to refer back to section 3.5.1 and Table 3.2 for more details. Bacterial 70S ribosomes are composed of a 50S large subunit and a 30S small subunit. Each of these contains the rRNAs and proteins shown in Figure 3.46. Note that Eukaryotic mitochondria and chloroplasts contain a genome that codes for bacteria-like rRNAs and ribosomal proteins, and thus, the prokaryotic versions of ribosomes are found in these intracellular compartments in Eukaryotes.

Eukaryotic ribosomes are significantly larger than their prokaryotic counterparts. The 80S ribosome is composed of a 60S large ribosomal subunit and a 40S small ribosomal subunit. Both of these are made up of larger rRNAs and more proteins than their prokaryotic counterparts.

The important structural features of ribosomes are created by the assembly of the large and small subunits into a functional ribosome (Figure 3.47). There are 3 tRNA binding sites created on the surface of the fully assembled ribosome.
One of these sites is for the entering aminoacyl-tRNA (A-site), a second is for the tRNA containing the growing peptide chain (P-site), and the last site is used for the tRNA to exit the ribosome after it has lost its amino acid.

Figure 3.47. An assembled functioning ribosome has pockets on the surface near the location of the mRNA that bind the decoding tRNAs. A charged tRNA (with amino acid) binds to the aminoacyl (A) site, identifying a codon sequence in the mRNA corresponding to its anticodon. Adjacent to the A-site is a peptidyl (P) site occupied by the tRNA that just received the growing peptide chain. As protein chain elongation occurs, the peptide is transferred to the amino acid bound to the tRNA in the A-site, forming a new peptide bond. The third site is the exit (E) site, where the tRNA that has just lost its peptide exits the ribosome. Once the peptide has been transferred and the E site is emptied, the ribosome translocates (moves) to put the peptidyl tRNA into the P-site and create an opportunity for a new tRNA corresponding to the codon in the A-site to enter, beginning this cycle over again.

3.6.5. Translation involves initiation, elongation, and termination

The translation process is very similar in Prokaryotes and Eukaryotes. Although different initiation, elongation, and termination factors are used and the ribosomes differ as discussed above, the genetic code is generally identical. In bacteria, transcription and translation take place simultaneously, and mRNAs are relatively short-lived. This facilitates the regulation of gene expression using processes such as attenuation. By comparison, in eukaryotes mRNAs can have more variable half-lives, from short to long-lived, are subject to a series of modifications, and must exit the nucleus to be translated. These multiple steps offer additional opportunities to regulate levels of protein production, and thereby fine-tune gene expression.

Figure 3.48. Eukaryotes and bacteria produce mRNAs somewhat differently. Bacteria use the RNA transcript as an often polycistronic mRNA without modification. Eukaryotes modify pre-mRNA into mRNA by processing. The processed mRNA then leaves the nucleus, and translation occurs in the cytoplasm.

Initiation of translation

The process of initiation of translation involves the formation of a functional ribosome at the AUG start code word on the mRNA. In both Prokaryotes and Eukaryotes, this process involves the small ribosomal subunit interacting with the mRNA at a translation start site (Figure 3.49). The initiator tRNA with methionine (N-formyl-methionine in Prokaryotes) then locates the AUG code word, and provides the opportunity for assembly of a completed ribosome with the addition of the large ribosomal subunit. The details of this basic process are relatively simple in Prokaryotes and substantially more complex in Eukaryotes. The details of Prokaryotic translation initiation are shown in Figure 3.50, and the details of Eukaryotic translation initiation are shown in Figure 3.51.

Figure 3.49. The process of initiation in both Prokaryotes and Eukaryotes is fundamentally similar. The small ribosomal subunit interacts with the mRNA, and an initiator tRNA is located at the AUG start code word. The large subunit then completes initiation by aligning to the complex and forming an active ribosome.
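A rough way to picture where initiation places the ribosome is to scan an mRNA for the first AUG and read triplets from there. The sketch below is a simplification under stated assumptions: it ignores the Shine-Dalgarno interaction and the eukaryotic scanning factors, treats the first AUG it finds as the start code word, and stops reading at the first in-frame stop codon (UAA, UAG, or UGA). The function name find_reading_frame and the example sequence are hypothetical and were chosen only for illustration.

```python
# Stop codons of the standard genetic code (none of them has a matching tRNA).
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_reading_frame(mrna):
    """Return the codons from the first AUG up to, but not including,
    the first in-frame stop codon. Returns [] when no AUG is present.

    Assumption: the first AUG is the start code word. Real initiation
    also depends on sequence context and initiation factors.
    """
    mrna = mrna.upper().replace("T", "U")  # accept DNA-style input as well
    start = mrna.find("AUG")
    if start == -1:
        return []
    codons = []
    # Step through the message three bases at a time from the start codon.
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break
        codons.append(codon)
    return codons


# Hypothetical message: a short leader, AUG, two sense codons, then UAA.
example = "GGAGGUAACAUGGCUUUCUAAGG"
print(find_reading_frame(example))
```

In this example the leader bases before the AUG are skipped, and reading ends at UAA, so only the start codon and the two codons that follow it are reported.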

Figure 3.50. Initiation of translation in prokaryotes involves the assembly of the components of the translation system, which are: the two ribosomal subunits (50S and 30S subunits); the mature mRNA to be translated; the tRNA charged with N-formylmethionine (the first amino acid in the nascent peptide); guanosine triphosphate (GTP) as a source of energy; the prokaryotic elongation factor EF-P; and the three prokaryotic initiation factors IF1, IF2, and IF3, which help the assembly of the initiation complex. Variations in the mechanism can be anticipated. The selection of an initiation site (usually an AUG codon) depends on the interaction between the 30S subunit and the mRNA template. The 30S subunit binds to the mRNA template at a purine-rich region (the Shine-Dalgarno sequence) upstream of the AUG initiation codon. The Shine-Dalgarno sequence is complementary to a pyrimidine-rich region on the 16S rRNA component of the 30S subunit. During the formation of the initiation complex, these complementary nucleotide sequences pair to form a double stranded RNA structure that binds the mRNA to the ribosome in such a way that the initiation codon is placed at the P site.

Figure 3.51. Eukaryotic translation initiation and ribosomal subunit recycling are depicted as a nine-stage process. In stage 1, ribosome recycling occurs to yield separated 40S and 60S ribosomal subunits. In stage 2, eukaryotic initiation factor 2 (eIF2), GTP, and an initiator methionine tRNA (Met-tRNAMeti) form a ternary complex called eIF2–GTP–Met-tRNAMeti. In stage 3, the 43S preinitiation complex forms. This complex includes a 40S subunit, eIF1, eIF1A, eIF3, eIF2–GTP–Met-tRNAMeti and probably eIF5. In stage 4, mRNA activation occurs, during which the mRNA cap-proximal region is unwound in an ATP-dependent manner by eIF4F with eIF4B. The mRNA loops into a circular configuration. In stage 5, the 43S preinitiation complex attaches to the unwound mRNA region. In stage 6, the 43S complex scans the 5′ UTR in a 5′ to 3′ direction. In stage 7, the initiation codon is recognized, and the 48S initiation complex forms, which switches the scanning complex to a 'closed' conformation. This leads to the displacement of eIF1 to allow eIF5-mediated hydrolysis of eIF2-bound GTP and inorganic phosphate (Pi) release. In stage 8, the 60S subunit joins the 48S complex, and there is concomitant displacement of GDP-bound eIF2 and other factors (eIF1, eIF3, eIF4B, eIF4F, and eIF5) mediated by eIF5B. In stage 9, hydrolysis of eIF5B-bound GTP occurs, and eIF1A and GDP-bound eIF5B are released from the assembled elongation-competent 80S ribosomes. Translation is a cyclical process. Following elongation, termination occurs, followed by recycling (stage 1), which generates separated ribosomal subunits, and the process begins again. Note the critical role that the 5’-CAP and the poly-A tail play in translation initiation in Eukaryotes, and note the number of GTPs and different proteins required in this process compared to the Prokaryotic process. © 2010 Nature Publishing Group. Jackson, R. J., Hellen, C. U., & Pestova, T. V. The mechanism of eukaryotic translation initiation and principles of its regulation. Nature Reviews Molecular Cell Biology 11, 113–127 (2010). All rights reserved.

Translation Elongation

Once the initiation process is completed, the peptide chain is elongated by the repeated cycling of the ribosome using a process like that shown in Figure 3.52. Note that this is the Prokaryotic version of the process. The eukaryotic version is essentially the same, differing mostly in the names and number of elongation factors.

Figure 3.52. The Prokaryotic translation elongation cycle. Though the names and numbers of elongation factors differ in Eukaryotes, the process is essentially the same as that shown here.

Termination

Termination is signaled by a stop codon (UAA, UAG, UGA) that has no corresponding tRNA (Figure 3.42). Release factors (RF) assist the ribosome in recognizing the stop codon and terminating translation. In E. coli there are 3 RFs:

1) RF1 recognizes UAA and UAG.
2) RF2 recognizes UAA and UGA.
3) RF3 stimulates termination via GTP hydrolysis.

RRF (ribosome recycling factor) binds the A site, EF-G translocates the ribosome, RRF then releases the last uncharged tRNA, and EF-G releases RRF, causing the ribosomal subunits to dissociate from the mRNA. In eukaryotes, eRF1 recognizes all three stop codons, while eRF3 stimulates termination. Ribosome recycling occurs without an equivalent of RRF.

3.6.6. Protein Sorting in Eukaryotes

Both bacteria and eukaryotes secrete proteins, although different mechanisms are involved. Eukaryotes utilize the endomembrane system to synthesize the proteins and move them to the intracellular compartment where they are intended to function, or outside the cell. This is accomplished by the rough endoplasmic reticulum and the Golgi apparatus of the cell.

As proteins destined for a specific cellular location are being made, a hydrophobic signal (leader) sequence (15–30 N-terminal amino acids) is produced.

Figure 3.53. Proteins made by ribosomes on the rough endoplasmic reticulum (RER) are placed inside the lumen of the ER at the time of their synthesis. As protein synthesis begins, the first ~20 amino acids of the protein are hydrophobic (signal peptide) and bind to a signal recognition particle (SRP). This attracts the ribosome, with its mRNA and peptide, to an SRP receptor on the surface of the RER. As protein synthesis continues, the protein is pushed through a pore to the inside (lumen) of the RER. A signal peptidase cleaves the signal peptide from the newly made peptide chain, and when translation is terminated the newly made protein is released inside the ER lumen.

As the signal sequence is produced by translation, it is bound by a signal recognition particle (SRP) composed of RNA and protein (Figure 3.53). The SRP suspends translation until the complex (containing nascent protein, ribosome, mRNA, and SRP) binds a docking protein (SRP receptor) on the ER membrane. When the complex binds the docking protein, the signal sequence is inserted into the membrane through a specifically designed pore complex, SRP is released, and translation resumes. The growing polypeptide is inserted through the membrane into the ER, in an example of cotranslational transport.

In the ER cisternal space, the signal sequence is removed by signal peptidase. The protein is usually glycosylated and then transferred to the Golgi for sorting. In eukaryotes, proteins synthesized on the rough ER (endoplasmic reticulum) are glycosylated and then transported in vesicles to the Golgi apparatus. In the Golgi, proteins are sorted based on other sequence signals, and then they are packaged in membrane vesicles for transport to their destinations.
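The idea that a signal sequence is a hydrophobic stretch at the N-terminus can be illustrated with a very small calculation. The sketch below only measures the fraction of hydrophobic residues in the first 20 amino acids of a protein. It is an illustration under stated assumptions, not a signal peptide predictor: the set of residues counted as hydrophobic, the 20-residue window (taken from the "~20 amino acids" in Figure 3.53), the function name hydrophobic_fraction, and both example sequences are hypothetical choices made for this example.

```python
# Assumption: a deliberately simple set of hydrophobic amino acids
# (one-letter codes). Real analyses use graded hydrophobicity scales.
HYDROPHOBIC = set("AVLIMFWC")


def hydrophobic_fraction(protein, window=20):
    """Fraction of hydrophobic residues in the N-terminal window."""
    n_term = protein.upper()[:window]
    if not n_term:
        return 0.0
    return sum(aa in HYDROPHOBIC for aa in n_term) / len(n_term)


# Hypothetical sequences, not taken from any real protein.
secreted_like = "MLLLAVLLAVAALLLAAAAS" + "DEKRDEKRDE"
cytosolic_like = "DEKRDEKRDEKRDEKRDEKR"

print(hydrophobic_fraction(secreted_like))
print(hydrophobic_fraction(cytosolic_like))
```

A high fraction at the N-terminus is consistent with the kind of leader sequence described above, while a low fraction is not. Features such as the cleavage site for signal peptidase are not considered here.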


CHAPTER 4. THE GENOMIC BIOLOGIST’S TOOLKIT

Genomic Biology has 3 important branches, i.e. Structural Genomics, Comparative Genomics, and Functional Genomics. The ultimate goals of these branches are, respectively: the sequencing of genes and genomes; the comparison of these sequenced genes and genomes; and an understanding of how genes and genomes work to produce the complex phenotypes of all organisms.

A set of molecular genetic technologies was, and is, critical to our ability to pursue the goals described above. The Genomic Biologist’s Toolkit provides a brief understanding of these critical tools, and how they are used in the investigation of genomes. While the techniques are intrinsically laboratory tools, the nature of what they can do and how they work can be readily studied using bioinformatic resources.

4.1. RESTRICTION ENDONUCLEASES

Restriction endonucleases (restriction enzymes) each recognize a specific DNA sequence (restriction site), and break a phosphodiester linkage between a 3’ carbon and phosphate within that sequence. Restriction enzymes are used to create DNA fragments for cloning and to analyze positions of restriction sites in cloned or genomic DNA. A specific restriction enzyme cuts DNA at the same sites in every molecule if the digestion is allowed to go to completion. Thus, this is a method whereby all copies of a genome or any other long sequence can be reproducibly cut into identical fragments.

The first three letters of the name of a restriction enzyme are derived from the genus and species of the organism from which it was isolated. Additional letters often denote the bacterial strain from which the restriction enzyme was isolated, and if multiple enzymes are isolated from the same strain, they are given Roman numerals. For example, the restriction enzyme EcoRI is the first enzyme isolated from the RY13 strain of Escherichia coli.

Bacteria produce restriction endonucleases to defend against bacteriophages (viruses), and each restriction enzyme recognizes its own specific DNA sequence where it cuts the DNA strands (see Table 4.1 & Figure 4.1). The recognition sites for a given restriction enzyme are often rare in the genome of the organism that produces it, but abundant in the genome of the bacteriophage. Also, the DNA of the host cell can be modified by methylation, which prevents the restriction enzymes of the host cell from degrading host cell DNA, while invading bacteriophage DNA is unmethylated and readily degraded.

Table 4.1. Characteristics of Some Restriction Enzymes

Many restriction sites are sequences of 4, 6, or 8 base pairs in length and have identical sequences from 5’ to 3’ on each strand. These sequences are referred to as palindromic DNA sequences. Other restriction sites are not completely symmetrical and/or differ in length from 4, 6, or 8 nucleotide pairs (Table 4.1 & Figure 4.1).

Figure 4.1. Restriction site sequences and cut locations of: a) SmaI; b) BamHI; and c) PstI.

As shown in Figure 4.1, the nature of the fragment ends produced when a restriction enzyme cuts DNA can vary. Some enzymes produce fragments where the two strands are equal in length. These are referred to as blunt ends. Other enzymes produce fragments where the two strands are unequal in length. These are referred to as either 5’ sticky ends or 3’ sticky ends. Overhanging sticky ends provide a basis for combining DNA fragments produced by the same restriction enzyme from different DNA sources. This process was the original method used to produce recombinant DNA molecules.

The application of restriction endonucleases to the cloning of DNA is further discussed in the DNA Cloning video. Note that part of this video will be discussed in detail in the next section of the Genomic Biologist’s Toolkit, but the first part of the video is a good demonstration of how restriction enzymes work and how they can be used to create recombinant DNA molecules for cloning DNA.

Note that we have previously discussed SNPs as a type of Sequence Tagged Site (STS). SNPs are single nucleotide changes in the genome sequence, so consider the effect of an SNP that happens to occur in a restriction endonuclease recognition site. The result would be the loss of a restriction site at that SNP. This site would no longer be cut by the enzyme, and thus new fragments having different sizes would be produced. This is called a Restriction Fragment Length Polymorphism (RFLP). Thus, an RFLP is an SNP that happens to occur in a restriction site in the DNA. A famous RFLP is associated with Sickle Cell Disease, and is further described in the accompanying video.
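The ideas of a palindromic recognition site, a reproducible digest, and an RFLP can be sketched in a few lines of code. The example below is a minimal illustration under stated assumptions: it handles only one strand of a linear molecule, uses the EcoRI site GAATTC (cut after the first G), and works on two short made-up sequences that differ by a single base inside one EcoRI site. The function names and both sequences are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def is_palindromic(site):
    """A site is palindromic when it reads the same 5' to 3' on both strands."""
    return site.upper() == reverse_complement(site)


def digest(seq, site, cut_offset):
    """Cut a linear sequence at every occurrence of site.

    cut_offset is the number of bases of the site that stay on the
    left-hand fragment (1 for EcoRI, which cuts G^AATTC).
    Returns the fragment lengths in order along the molecule.
    """
    seq, site = seq.upper(), site.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [bounds[i + 1] - bounds[i] for i in range(len(bounds) - 1)]


ECORI_SITE = "GAATTC"

# Hypothetical alleles: the second EcoRI site carries an SNP (A to G).
allele_1 = "A" * 10 + "GAATTC" + "C" * 20 + "GAATTC" + "T" * 15
allele_2 = "A" * 10 + "GAATTC" + "C" * 20 + "GAGTTC" + "T" * 15

print(is_palindromic(ECORI_SITE))       # the intact site is palindromic
print(digest(allele_1, ECORI_SITE, 1))  # two sites, three fragments
print(digest(allele_2, ECORI_SITE, 1))  # one site lost, two fragments
```

With the SNP present, the second site is no longer recognized, so the two right-hand fragments of the first allele appear as one longer fragment in the second. That difference in fragment lengths is what an RFLP analysis detects.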

Figure 4.2. Using the restriction enzyme EcoRI to make recombinant DNA. The procedure relies on the 5’-overhanging “sticky ends”.

An additional application of restriction enzymes involves the production of a restriction map. A restriction map shows the relative position of restriction sites for multiple restriction enzymes in a piece of linear or circular DNA. Prior to the availability of genomic sequences, restriction mapping was an important tool used to characterize cloned DNA fragments. The production of a restriction map for a circular DNA is shown in the Restriction Mapping video.

4.2. CLONING VECTORS

The process of “DNA cloning” involves a set of experimental methods in molecular biology that are used to assemble recombinant DNA molecules and to direct their replication within host organisms. The use of the word cloning refers to the fact that the method involves the replication of one molecule to produce a population of cells with identical DNA molecules. Molecular cloning generally uses DNA sequences from two different organisms: 1) the organism that is the source of the DNA to be cloned, and 2) the organism that will serve as the living host for replication of the recombinant DNA. Molecular cloning methods are central to many areas of biology, biotechnology, and medicine, including DNA sequencing.

The DNA from the host organism in a cloning experiment, often called a vector, typically has 3 things:

1) Sequences necessary to produce recombinant DNA and facilitate entry into the host organism. Typically, this can be one or more “unique” restriction sites. “Unique” in this context means that these restriction sites will permit cutting the vector at only one location. Most vectors contain unique restriction sites for a number of different restriction enzymes. This cluster of sites is called a polylinker or multiple cloning site (MCS), and can make the use of the vector much easier.

2) An origin of replication for the host organism to facilitate replication of the recombinant DNA in the host cell. Typically this sequence controls the number of copies of the vector that can be made in one cell.

3) A gene that can be expressed in the host and that provides a “selectable” marker for the presence of recombinant DNA, in order to facilitate identification of cells that contain the vector carrying recombinant DNA. Often the selectable marker gene will be a gene that makes cells resistant to a specific antibiotic or that permits cells to make an amino acid required for growth.

These are the basic requirements that all modern cloning vectors meet, but beyond these basic requirements, there can be a number of additional features that make specific vectors useful for various purposes. Thus, several types of cloning vectors have been constructed, each with different molecular properties and cloning capacities.

4.2.1. Simple Cloning Vectors

The most common vectors are used to clone recombinant DNA in bacterial cells, typically E. coli. Simple cloning vectors are constructed from plasmids common in many bacterial cells. Plasmids are circles of dsDNA (double stranded DNA), much smaller than the bacterial chromosome, that include the replication origins (ori sequence) needed for replication in bacterial cells and that naturally carry DNA between different bacteria. An example of a typical E. coli cloning vector is pUC19 (2,686 bp). The more modern version of pUC19 is pBluescript II. The features of these plasmids are shown in Figure 4.3.

Figure 4.3. The features of pUC19 and pBluescript II include: 1) High copy number in E. coli, with nearly 100 copies per cell, which provides a good yield of cloned DNA. 2) The selectable marker is ampR. 3) A cluster of unique restriction sites, called the polylinker (multiple cloning site). 4) The polylinker is part of the lacZ (β-galactosidase) gene. The plasmid will complement a lacZ- mutation, allowing the cell to become lacZ+. When DNA is cloned into the polylinker, lacZ is disrupted, preventing complementation of the lacZ- mutation from occurring. 5) X-gal is a chromogenic analog of lactose that turns blue when β-galactosidase is present and remains white in the absence of β-galactosidase, so blue-white screening can indicate which colonies contain recombinant plasmids.

More information about cloning DNA in plasmid vectors can be found in Molecular Cell Biology, 4th edition, Section 7.1, which is available from NCBI. The use of simple cloning vectors to clone recombinant DNA made via DNA restriction and overhanging sticky ends can be seen in the Steps in DNA Cloning video. The use of simple cloning to obtain a collection of clones representing all sequences that can be cut from a longer piece of DNA is called creating a clone library of sequences (see video). Libraries can be useful in several ways. One of these might be to create an expression library that makes specific proteins from each clone. This requires an expression vector.

4.2.2. Expression Vectors

Expression vectors contain all of the same elements that simple cloning vectors contain, i.e. an ori, a selectable marker, and a multiple cloning site; but the MCS is flanked by a promoter sequence and a terminator sequence that work in the host organism. This permits the cloned sequence to be transcribed and, if the vector contains a Shine-Dalgarno sequence (not shown in Figure 4.4), to be translated into a protein if there is a start and a stop code word in the sequence.

Figure 4.4. An example of a simple expression vector.

Note that Figure 4.4 illustrates how the cloned sequence can insert randomly in two orientations. However, only one of the orientations will produce a translatable mRNA. The other orientation will produce an RNA that is the complementary strand of the mRNA (called an antisense RNA). How to deal with this issue will be considered in section 4.5.

4.2.3. Shuttle Vectors

A cloning vector capable of replicating in two or more types of organism (e.g., E. coli and yeast) is called a shuttle vector. Shuttle vectors may replicate autonomously in both hosts, or integrate into the host genome.

Figure 4.5. Shuttle vectors like pRS426 can be used to move cloned DNA into 2 different organisms. In this case, the plasmid moves into E. coli and yeast. Note that the vector contains an origin of replication for yeast (yeast 2 µ ARS) and for E. coli (ori), a selectable marker gene for E. coli (ampR) and for yeast (URA3, which allows growth without uracil, unlike the yeast host strain used), and a multiple cloning site with a yeast promoter and terminator on either side. Thus, this shuttle vector can work in both E. coli and yeast.

4.2.4. Phage Vectors

Besides plasmid-based simple cloning vectors, there are a number of other vectors that are not based on plasmids. These often have specific uses that take advantage of their unique properties. Among the types of non-plasmid vectors, bacteriophage λ vectors (shown in Figure 4.6) are among the most frequently used. Phage λ vectors can be used to make expression libraries and are convenient for the selection of clones, because the bacteriophage lyses cells, releasing the contents of the cell into the medium. Thus RNAs and proteins derived from the inserted fragment can be investigated using these vectors.

Figure 4.6. Phage λ Vector.

4.2.5. Artificial Chromosome Vectors

The typical simple cloning vector will accommodate DNA fragments up to about 3,000 bp in length. However, there is often a need to clone significantly longer fragments of DNA for study. Typically, genomic DNA sequencing is easiest with the longest fragments possible. Two vector systems, i.e. BAC vectors (bacterial artificial chromosome) and YAC vectors (yeast artificial chromosome), are useful choices for cloning long DNA fragments. In BACs, fragments up to about 350 kbp (350,000 bp) can be cloned, while in YACs fragments up to 1,000,000 bp have been reported. Both of these vector systems were used in the original human genome sequencing project. However, it was found that YACs are relatively unstable, meaning that they were frequently modified, losing DNA in the process, and thus they did not have the stability shown by BACs. Consequently, BACs have emerged as the large cloning vector of choice.
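The size limits quoted above can be turned into a small lookup that suggests which vector classes could, in principle, hold an insert of a given length. This is only a sketch built on the approximate capacities stated in this section (about 3,000 bp for a simple cloning vector, about 350,000 bp for a BAC, and up to 1,000,000 bp reported for a YAC). The names VECTOR_CAPACITY_BP and vectors_for_insert are hypothetical, and a real choice would also weigh stability, host, and intended use.

```python
# Approximate upper insert sizes taken from the text of this section.
VECTOR_CAPACITY_BP = {
    "simple cloning vector": 3_000,
    "BAC": 350_000,
    "YAC": 1_000_000,
}


def vectors_for_insert(insert_bp):
    """List the vector classes whose stated capacity covers the insert."""
    return [
        name
        for name, capacity in VECTOR_CAPACITY_BP.items()
        if insert_bp <= capacity
    ]


print(vectors_for_insert(2_000))      # small insert: every class qualifies
print(vectors_for_insert(150_000))    # too long for a simple cloning vector
print(vectors_for_insert(2_000_000))  # longer than any capacity listed here
```

Capacity alone would favor YACs for the longest fragments, but as noted above their instability is the reason BACs became the usual choice for large inserts.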

4.3. METHODS OF SEQUENCE AMPLIFICAION (RETURN)

With our discussion of restriction endonucleases and cloning vectors completed. We are now ready to put these concepts together and show how specific DNA sequences can be amplified to provide specific DNA sequences for genetic and genomic studies.

4.3.1. Polymerase Chain Reaction (PCR) (RETURN) Polymerase Chain Reaction or PCR is a method by which DNA polymerase can be used to make many copies of a DNA sequence in a test tube. The technique is a valuable supplement to DNA cloning to generate specific DNA sequences for use as reagents. A description of the PCR process is given in the Polymerase Chain Reaction video. Click the link to view this video. Some additional things to note are that the Figure 4.7. Artificial Chromosome vectors. a) Shows a reaction temperature is changed using a device called a bacterial artificial chromosome (BAC) that has a thermal cycler that can rapidly change temperatures selectable marker (chloramphenicol resistance), and a MCS. However, the ori sequence is replaced by a during each cycle. The reaction mixture must have all single copy F factor origin of replication. b) Shows a necessary components for a PCR reaction including a yeast artificial chromosome, including selectable thermostable DNA polymerase like the TAQ DNA markers (TRP1 and URA3), a yeast origin of replication (ARS), and and telomere chromosome polymerase mentioned in the video. Such DNA parts. This vector will replicate in yeast cells.

CONCEPTS OF GENOMIC BIOLOGY Page 90 polymerases are obtained from organisms called 4.3.2. Cloning in a Simple Cloning Vector (RETURN) extremophiles that grow in very hot water like that DNA cloning is the for a number of genomic biology found in geysers (e.g. Old Faithful in Yellowstone experiments. Large amounts of DNA are needed for National park) or thermal vents on the floor of the analysis, sequencing, and numerous experimental ocean. The reaction also contains the deoxyNTP (deoxy approaches. As we saw above multiple copies of a nucleotide triphosphates, e.g. dATP, dGTP, dCTP, & known DNA sequence can be made and cloned using dTTP), and the primers which define each end of the PCR and a PCR vector. However, an alternative is sequence to be amplified. necessary when the sequence to be cloned is unknown DNA sequences amplified via PCR typically contain an (i.e. PCR primers cannot be determined). To introduce extra A on the 3’-end the molecule, i.e. a single this principle we will outline the steps to clone a DNA overhanging 3’-A that makes ligation of the PCR fragment of unknown sequence in a simple cloning amplified fragment into a PCR cloning vector much vector. easier (see Figure 4.6). To get multiple copies of a gene or other piece of DNA you must isolate, or ‘cut’, the DNA from its source using restriction enzymes, and then ‘paste’ it into a simple cloning vector that can be amplified in a host cell, typically E. coli. pGEM-T Easy PCR Vector (3015 bp) The four main steps in PCR DNA cloning are: pGEM-T Easy PCR VectorDNA Ligase + (3015+ bp) Step 1. DNA is purified from the donor cells using a standard DNA purification technique. pGEM-Teasy+ PCR Amplified Step 2. A chosen fragment of DNA is ‘cut’ from the PCR Amplified DNA DNA (4206 bp) (1191 bp) purified genomic DNA of the source organism using Figure 4.8. PCR Cloning vectors. Note that the vector comes linearized a restriction enzyme. with overhanging 3’-T’s. 
PCR products typically have single over- hanging A’s at their 3’-ends. This provides a convenient way of making a circular plasmid with the inserted PCR product.

Recont CONCEPTS OF GENOMIC BIOLOGY Page 91 Step 3. The piece of DNA is ‘pasted’ into a vector and the ends of the DNA are joined with the vector DNA by DNA ligase (joins Okazaki fragments) in the DNA

Figure 4.9. Insertion of restricted DNA into a simple cloning vector.

replication section. Step 4. The vector is introduced into a host cell, often a bacterium, by a process called bacterial transformation. The transformed host cells copy the vector DNA + recombinant DNA along with their own DNA, creating multiple copies of the inserted DNA. DNA that has been ‘cut’ and ‘pasted’ from an organism into a Figure 4.10. Using PCR to obtain only the forward orientation of a sequence in an expression vector. Primers are designed with a vector is called recombinant DNA. Because of this, DNA restriction site added such that they anneal at each end of the cloning is also called recombinant DNA technology. fragment of interest. Following PCR an amplified fragment will be produced with a KpnI site at the 5’ end of the intended coding Step 5. The vector DNA is isolated (or separated) sequence and a SalI site at the 3’ end. The expression vector is then opened by cutting with both KpnI and SalI. Since the KpnI from the host cells’ DNA and purified. site is closer to the promoter in the expression vector’s MCS, while the SalI site is closer to the terminator. This construct will 4.3.3. Cloning DNA in Expression Vectors (RETURN) go into the vector in the sense orientation so that a message is In section 4.2., we discussed expression vectors, and produced that makes the protein of interest rather than its showed that when a restricted DNA sequence is cloned antisense equivalent. CONCEPTS OF GENOMIC BIOLOGY Page 92 in an expression vector, it can be ligated into the vector reverse transcriptase (makes a DNA strand from an RNA in both a “forward” or a “reverse” configuration (Figure strand) is used to make a first-strand DNA copy of the 4.4). In the forward configuration the fragment is mRNA strand. positioned so that it makes an mRNA that codes for a protein, while in the reverse configuration, the DNA fragment does not make an mRNA, but makes an RNA from the opposite strand called an antisense RNA. 
It is possible using a PCR strategy to insert a DNA fragment into an expression vector such that it can only insert in the “forward” orientation. This strategy is shown in Figure 4.10.

4.3.4. Making complementary DNA (cDNA)

A double-stranded DNA copy of an mRNA is called a cDNA. Making cDNA converts a relatively labile single-stranded RNA into a relatively stable double-stranded DNA. A DNA copy of an RNA can be made with reverse transcriptase, an enzyme involved in the replication of certain viruses. The other feature of Eukaryotic mRNAs that makes producing cDNAs relatively easy is the polyA tail, as we will see below. cDNAs can be made in several ways; the method described here is a traditional one.

Step 1. Total RNA is extracted from cells using a standard technique for the organism in question.

Step 2. An oligo-dT primer is hybridized with the polyA tail of a Eukaryotic mRNA. Then an enzyme called reverse transcriptase extends the primer to make a DNA strand complementary to the mRNA.

Step 3. The RNA is then partially degraded with RNase H, leaving RNA fragments annealed at random positions along the newly made DNA strand. These RNA fragments act as primers for DNA polymerase I.

Step 4. DNA polymerase I is then used to make a complementary DNA strand and to replace the RNA primers with DNA nucleotides.

Step 5. All pieces are then ligated together using DNA ligase, completing the synthesis of a double-stranded DNA copy of the mRNA.

Figure 4.11. The process for making cDNA in a simple cloning vector.

At completion of the procedure above you will have prepared a cDNA copy of each mRNA that was present in the cells from which you extracted the RNA. If there were 10,000,000 polyA tails on 10,000,000 mRNAs, you should make 10,000,000 cDNAs. In other words, if there were 10,000 mRNAs in the preparation that coded for a given protein like myosin, but only 500 mRNAs coding for hexokinase and 10 mRNAs for tyrosyl-tRNA synthetase, you might expect the cDNA library obtained from those cells to contain 10,000, 500, and 10 cDNAs for the 3 proteins, respectively. The frequency of occurrence of each mRNA is represented by the frequency of cDNAs in the cDNA library obtained from a given set of cells. Thus, information about the frequency of occurrence of mRNAs in cells can be obtained from analysis of such a cDNA library. A similar cDNA library from different cells (e.g. different tissues, cells treated with a drug, or cells grown in a different environment) will show different levels of each cDNA, based on the mRNAs found in those cells. The frequency of mRNAs found in a tissue is considered information about the expression of a gene. Gene expression information relates directly to the function of the transcription machinery in cells, and is critical functional genomic information, as we will see in a subsequent section of the book.

In order to store and subsequently use a cDNA library, it is useful to produce a clone of each sequence in the library. Typically this involves putting the cDNAs into vectors and putting the vectors into host cells, typically E. coli, such that each cell gets a single cDNA, which is amplified in that cell and all its clones.

4.3.5. Cloning a cDNA Library

A cDNA clone library is a useful tool for identifying specific mRNAs found in a tissue and for obtaining the sequences of identified genes. To do this, a cDNA clone library is created by cloning all cDNAs into a vector and putting one vector containing an individual cDNA into each cell. These cells can then be screened to determine which clones express genes of interest.

Various types of vectors can be used to create a cDNA clone library. These include phage expression vectors, plasmid expression vectors, or shuttle vectors, depending on the intended use of the clone library. We will look at a protocol for incorporating cDNA into a plasmid expression vector, using a simple strategy. Note that kits are now available that provide everything you require and outline specific strategies for most types of vectors, should you ever need to accomplish this task.

Step 1. Prepare a cDNA library as outlined in section 4.3.4.

Step 2. Manipulate the cDNAs so that each one has a unique restriction site (one that is not cut anywhere within a cDNA) at both ends. To do this, the cDNAs are frequently methylated with a specific methyl transferase that incorporates a methyl group into particular restriction sites to protect them from the restriction enzyme that will be used later.

Step 3. A synthetic double-stranded oligonucleotide linker is then ligated to the ends of each cDNA. The linker should contain a restriction site that corresponds to a site in the MCS of the vector to be used. Blunt-end ligation is generally a low-efficiency process, but by using a high concentration of these synthetic oligonucleotides it is possible to drive the reaction to near completion.

Step 4. Digest the cDNAs (with internal sites protected and linkers attached) with the restriction enzyme to generate the appropriate overhanging sticky ends.

Step 5. Mix the digested cDNAs with the predigested vector, and add DNA ligase to make cDNA recombinant vectors.

Step 6. Transform the recombinant vectors into host cells, and grow up clones.

Figure 4.12. Procedure of inserting a cDNA into a cloning vector involving ligation of linkers on the ends of the cDNA.

Once the cDNA clone library has been constructed, a number of strategies can be used to select a specific clone that contains a gene of interest. Figure 4.13 demonstrates how this could be done if antibodies against the protein of interest are available. Figure 4.14 shows a strategy for identifying a specific clone by complementation of a yeast mutant. Note that for this technique the cDNA library was constructed in a yeast shuttle vector.

Figure 4.13. Finding a specific cDNA clone using an expression library. Following transformation of cells with the cDNA expression library, transformants with inserts (white colonies) are selected, replated, and screened with antibodies against the protein of interest. Colonies producing antigenic proteins are then tested for the presence of the protein of interest, and the cDNA insert in that clone is characterized.

Because cDNAs contain only the exons of the gene (the parts that code for proteins), a cDNA clone library can be expressed in either Prokaryotic or Eukaryotic cells. However, there are sometimes (but relatively infrequently) complex issues that keep Eukaryotic cDNAs from expressing functional proteins in Prokaryotic cells. When this occurs, the shuttle vector approach is necessary to get a functional protein produced in the library.

cDNA libraries have many uses, but comparison of cDNA sequences with the sequences of the corresponding genes is one way of demonstrating the positions of introns and exons in the genomic sequence (see Figure 4.15). By sequencing clones from a cDNA library, so-called expressed sequence tags (ESTs) are determined. The sequences of ESTs were critical to understanding functional components of genomes as they were being sequenced.
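The proportionality argument in section 4.3.4 (10,000 myosin, 500 hexokinase, and 10 tyrosyl-tRNA synthetase mRNAs giving cDNAs in the same ratio) can be checked numerically. The Python sketch below is only an illustration. It assumes every mRNA is converted to cDNA and cloned with equal efficiency, which real preparations only approximate, and the function name and the number of clones picked are hypothetical choices rather than part of any protocol.

```python
from fractions import Fraction

def expected_clone_counts(mrna_counts, clones_picked):
    """Expected clones per gene if cDNA frequency mirrors mRNA frequency.

    Idealized assumption: every mRNA is copied and cloned equally well.
    """
    total = sum(mrna_counts.values())
    return {gene: Fraction(n * clones_picked, total)
            for gene, n in mrna_counts.items()}

# mRNA counts taken from the example in the text
mrna_counts = {
    "myosin": 10_000,
    "hexokinase": 500,
    "tyrosyl-tRNA synthetase": 10,
}

# 1,051 clones is a hypothetical sample size chosen to give round numbers
for gene, expected in expected_clone_counts(mrna_counts, 1051).items():
    print(f"{gene}: about {float(expected):.1f} of 1051 clones")
```

Under these assumptions, picking 1,051 clones would be expected to yield about 1,000 myosin clones, 50 hexokinase clones, and a single tyrosyl-tRNA synthetase clone, which is why finding a cDNA for a rare mRNA requires screening many clones.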

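Steps 2 to 4 of the protocol in section 4.3.5 (protect internal sites by methylation, ligate linkers, then digest) can be illustrated with a toy string model. This is a sketch under stated assumptions: the text does not name an enzyme, so EcoRI (recognition site GAATTC, cut after the G) is used purely as an example, the linker and cDNA sequences are made up, and only the top strand is modeled.

```python
SITE = "GAATTC"       # EcoRI recognition site (enzyme chosen for illustration)
CUT_OFFSET = 1        # EcoRI cuts after the G: G^AATTC
LINKER = "GGAATTCC"   # hypothetical linker carrying one EcoRI site

def find_sites(seq, site=SITE):
    """Return the start position of every occurrence of site in seq."""
    positions, start = [], seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions

def digest(seq, protected=frozenset()):
    """Cut the top strand at every unprotected site and return the fragments."""
    cuts = [p + CUT_OFFSET for p in find_sites(seq) if p not in protected]
    edges = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(edges, edges[1:])]

cdna = "ATGGCCGAATTCTTGACCTAA"  # toy cDNA with one internal EcoRI site

# Step 2: methylation protects the sites already present in the cDNA.
internal = find_sites(cdna)
# Step 3: linkers are ligated to both ends after methylation, so they stay cuttable.
linked = LINKER + cdna + LINKER
protected = frozenset(p + len(LINKER) for p in internal)
# Step 4: digestion cuts only within the linkers when internal sites are protected.
with_protection = digest(linked, protected)
without_protection = digest(linked)

print(with_protection)
print(without_protection)
```

With protection, the middle fragment carries the whole cDNA flanked by linker remnants and begins with the AATT overhang sequence. Without protection, the same digest also cuts inside the cDNA and splits it in two, which is the outcome the methylation step is meant to prevent.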
Figure 4.14. Strategy for identifying cDNA clones for a gene of interest (ARG1) using cDNAs (high MW DNA) from an ARG1 yeast strain. Note the cDNAs need to be inserted into a yeast shuttle vector such that the ARG1 gene will be properly expressed and complement the arg1 mutant in the yeast strain used.

Figure 4.15. [Diagram with panels labeled DNA (Gene), Primary RNA Transcript, and mRNA (cDNA).]

4.4. GENOMIC LIBRARIES

A genomic clone library, or Genomic Library, is a set of cloned sequences made by cloning the entire genome of an organism or organelle. One of several ways this can be done is by cutting the genomic DNA with one or more restriction enzymes and ligating the pieces into a simple cloning vector, as shown in Figure 4.9. A limitation of simple cloning vectors is the size of DNA that can be introduced into the cell by transformation. This presents problems when you are trying to create a Genomic Library of a large genome, such as that of most Eukaryotes.

Remember that a genomic library contains all of the DNA found in the cells of the organism. If you digest organismal DNA to completion with a restriction enzyme, ligate those fragments into a plasmid vector, and transform bacterial cells, only a portion of those fragments will be represented in the final transformation products. If a gene of interest is larger than the clonable fragment length, then you will not be able to isolate that gene intact from a plasmid library.

But what can be done to increase the probability of obtaining a clone that contains the entire gene? First, you need to use a vector that can accept large fragments of DNA. Examples of these are bacteriophage and cosmid vectors, the relatively popular yeast artificial chromosome (YAC) vectors (see Figure 4.7b), and the bacterial artificial chromosome (BAC) vectors (see Figure 4.7a). While longer fragments of genomic DNA can be cloned in YAC vectors, these are less stable than the BAC vectors, making BACs the vectors most frequently used for genomic cloning.

4.4.1. Cloning in YAC Vectors

A goal of genomic sequencing is to obtain physical data about the organization of DNA in a genome. Traditionally, this data has been obtained by a technique called chromosome walking. Walking can be performed by subcloning the ends of DNA inserted in a phage λ vector or cosmid vector and screening a library for new clones that contain the end-sequences previously obtained. If this new clone overlaps a portion of the original clone, then the length of the DNA of interest is extended by the length of DNA in the second clone that is not found in the original clone. By performing these steps successive times, a long-distance map can be obtained. To clarify this concept, please view the Chromosome Walking short video.

This technique has difficulties, though. First, each step is technically slow. Second, if you use phage λ or cosmid clones, you might only extend the region of interest by 5-10 kb in each step of the walk. Finally, if any of the clones that are obtained contain repeated sequences, the subclone could lead you to another region of the genome that is not contiguous with the region of interest. This is because Eukaryotic genomes have so-called repetitive DNA interspersed throughout their genomes.

Yeast artificial chromosomes can alleviate some of these problems because of the large amount of DNA (100-1000 kb) that can be cloned. However, YACs cannot speed up each step of the walk because the subcloning and screening steps cannot be accelerated. But YACs can easily extend the region of interest by 50-100 kb, and up to as much as 500 kb, per walking cycle. Thus a long-distance map of the region can be obtained in several steps. Secondly, although repetitive regions may be 10-20 kb in length, they are rarely longer than 50 kb. Thus a YAC with 100 kb will contain some region that is single copy, which can be used for further steps in the walk.

While YACs allow the cloning of the largest fragments possible, their relative instability has allowed the more stable BACs, which bear shorter recombinant fragments, to become the vector of choice for chromosome walking and subsequent sequencing.

4.4.2. Cloning in BAC Vectors

During the Human Genome Project, researchers had to find a way to reduce the entire human genome into chunks, as it was too large to be sequenced in one go. To do this they created a store of DNA fragments called a BAC library, specifically a human genome BAC library.

BAC stands for Bacterial Artificial Chromosome. These are small pieces of bacterial DNA that can be identified and copied within a bacterial cell and act as a vector, artificially carrying recombinant DNA into the cell of a bacterium such as Escherichia coli.

In general, BAC clones carry inserts of DNA up to 300,000 bp in length. The bacteria are then grown to produce colonies that contain the same fragment of DNA in each cell of the colony. This is a BAC clone library. Individual BAC clone colonies can be stored until needed.

Making a BAC library

To make a genomic Bacterial Artificial Chromosome (BAC) library:

Step 1. Isolate the cells containing the DNA you want to store. For animals, BAC libraries are made from white blood cells.

Step 2. These isolated cells are then mixed with warm agarose, a jelly-like substance. The whole mixture is then poured into a mold and allowed to cool to produce a set of small blocks, each containing thousands of the isolated cells.

Step 3. The cells are then treated with enzymes to dissolve their cell membranes and release the DNA into the agarose gel. A restriction endonuclease is used to chop the DNA into pieces around 200,000 base pairs in length (using partial digestion, since complete digestion would produce smaller fragments).

Step 4. These blocks of gel containing chopped-up DNA are then inserted into holes in a slab of agarose gel. The DNA fragments are then separated according to size by electrophoresis.

Step 5. Fragments of a particular size class (200,000 to 300,000 bp) are selected, removed from the agarose gel, and inserted into a BAC vector using DNA ligase to join the two bits of DNA together. This produces a set of BAC clones.

Step 6. The BAC clones are added to bacterial cells, usually E. coli, and the bacteria are then spread on nutrient-rich plates that allow only the bacteria that carry BAC clones to grow. The bacteria grow rapidly, resulting in lots of bacterial cells, each containing a copy of a separate BAC clone.

Step 7. After they have grown, the bacteria are then ‘picked’ into plates of 96 or 384 wells so that each well contains a single BAC clone. The bacteria can also be copied or frozen and kept until researchers are ready to use the DNA for sequencing. A BAC library has been created.
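Two of the numbers in section 4.4 lend themselves to quick arithmetic: how many walking cycles a given distance takes at the step sizes quoted in section 4.4.1, and roughly how many BAC clones a library needs. The Python sketch below is an illustration, not part of the protocol. The 1,000 kb walking distance and the 3,000,000,000 bp genome size are assumed round figures, the function names are hypothetical, and the clone-number formula N = ln(1 - P) / ln(1 - f) is the standard library-size estimate (often attributed to Clarke and Carbon), which does not appear in the text above and assumes random, equally clonable fragments.

```python
import math

def walking_steps(distance_kb, extension_kb_per_step):
    """Minimum number of walking cycles needed to span distance_kb."""
    return math.ceil(distance_kb / extension_kb_per_step)

def clones_for_coverage(genome_bp, insert_bp, probability):
    """Clones needed for a given sequence to be present with this probability.

    Uses N = ln(1 - P) / ln(1 - f), where f is the insert size as a
    fraction of the genome (assumes random, equally clonable fragments).
    """
    f = insert_bp / genome_bp
    return math.ceil(math.log(1 - probability) / math.log(1 - f))

DISTANCE_KB = 1_000  # hypothetical 1,000 kb region to walk across
step_ranges_kb = {"phage/cosmid": (5, 10), "YAC": (50, 100)}  # from the text
for vector, (low, high) in step_ranges_kb.items():
    fewest = walking_steps(DISTANCE_KB, high)
    most = walking_steps(DISTANCE_KB, low)
    print(f"{vector}: {fewest} to {most} walking cycles")

HUMAN_GENOME_BP = 3_000_000_000  # assumed round figure, not from the text
n_clones = clones_for_coverage(HUMAN_GENOME_BP, 200_000, 0.99)
print(f"about {n_clones} BAC clones for 99% coverage")
```

With these assumed inputs, the walk takes 100 to 200 cycles with phage or cosmid clones but only 10 to 20 with YACs, and a library of 200,000 bp inserts needs on the order of 69,000 clones for a 99% chance of containing any particular sequence.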

4.5. DNA SEQUENCING

Figure 4.16. BAC Vector. Contains blue/white screening capability. Genomic DNA fragments up to 300,000 bp can be ligated into the MCS of the vector which also contains a selectable marker and an F’ single copy origin of replication.
