
been developed. An example is GenBank (http://www.ncbi.nlm.nih.gov/genbank/), which is the National Institutes of Health database of all publicly available sequence data. This global resource, with access to databases in Europe and Japan, currently contains more than 220 billion base pairs of sequence data!

Earlier (in the Exploring Genomics exercises for Chapter 7), you were introduced to the National Center for Biotechnology Information (NCBI) Genes and Disease site. Now we will use an NCBI application called BLAST, Basic Local Alignment Search Tool. BLAST is an invaluable program for searching through GenBank and other databases to find DNA- and protein-sequence similarities between cloned substances. It has many additional functions that we will explore in other exercises.

Exercise I – Introduction to BLAST

1. Access BLAST from the NCBI Web site at http://blast.ncbi.nlm.nih.gov/Blast.cgi.

2. Click on “nucleotide blast.” This feature allows you to search DNA databases to look for a similarity between a sequence you enter and other sequences in the database. Do a nucleotide search with the following sequence:

CCAGAGTCCAGCTGCTGCTCATACTACTGATACTGCTGGG

3. Imagine that this sequence is a short part of a gene you cloned in your laboratory. You want to know if this gene or others with similar sequences have been discovered. Enter this sequence into the “Enter Query Sequence” text box at the top of the page. Near the bottom of the page, under the “Program Selection” category, choose “blastn”; then click on the “BLAST” button at the bottom of the page to run the search. It may take several minutes for results to be available because BLAST is using powerful algorithms to scroll through billions of bases of sequence data! A new page will appear with the results of your search.

4. On the search results page, below the Graphic Summary you will see a category called Descriptions and a table showing significant matches to the sequence you searched with (called the query sequence). BLAST determines significant matches based on statistical measures that consider the length of the query sequence, the number of matches with sequences in the database, and other factors. Significant alignments, regions of significant similarity in the query and subject sequences, typically have E values less than 1.0.

5. The top part of the table lists matches to transcripts (mRNA sequences), and the lower part lists matches to genomic DNA sequences, in order of highest to lowest number of matches.

6. Alignments are indicated by horizontal lines. BLAST adjusts for gaps in the sequences, that is, for areas that may not align precisely because of missing bases in otherwise similar sequences. Scroll below the table to see the aligned sequences from this search, and then answer the following questions:

a. What were the top three matches to your query sequence?

b. For each alignment, BLAST also indicates the percent identity and the number of gaps in the match between the query and subject sequences. What was the percent identity for the top three matches? What percentage of each aligned sequence showed gaps indicating sequence differences?

c. Click on the links for the first matched sequence (far-right column). These will take you to a wealth of information, including the size of the sequence; the species it was derived from; a PubMed-linked chronology of research publications pertaining to this sequence; the complete sequence; and if the sequence encodes a polypeptide, the predicted amino acid sequence coded by the gene. Skim through the information presented for this gene. What is the gene’s function?

7. A BLAST search can also be done by entering the accession number for a sequence, which is a unique identifying number assigned to a sequence before it can be put into a database. For example, search with the accession number NM_007305. What did you find?

CASE STUDY Credit where credit is due

In the early 1950s, it became clear to many researchers that DNA was the cellular molecule that carries genetic information. However, an understanding of the genetic properties of DNA could only be achieved through a detailed knowledge of its structure. To this end, several laboratories began a highly competitive race to discover the three-dimensional structure of DNA, which ended when Watson and Crick published their now classic paper in 1953. Their model was based, in part, on an X-ray diffraction photograph of DNA taken by Rosalind Franklin (Figure 9.10). Two ethical issues surround this photo. First, the photo was given to Watson and Crick by Franklin’s co-worker, Maurice Wilkins, without her knowledge or consent. Second, in their paper, Watson and Crick did not credit Franklin’s contribution. The fallout from these lapses lasted for decades and raises some basic questions about ethics in science.

1. What vital clues were provided by Franklin’s work to Watson and Crick about the molecular structure of DNA?

2. Was it ethical for Wilkins to show Franklin’s unpublished photo to Watson and Crick without her consent? Would it have been more ethical for Watson and Crick to have offered Franklin co-authorship on this paper?

3. Given that these studies were conducted in the 1950s, how might gender have played a role in the fact that Rosalind Franklin did not receive due credit for her X-ray diffraction work?

See the Understanding Science: How Science Really Works Web site: “Credit and debt” (http://undsci.berkeley.edu/article/0_0_0/dna_13).

M09_KLUG8414_10_SE_C09.indd 179 16/11/18 5:13 pm 180 9 DNA Structure and Analysis

INSIGHTS AND SOLUTIONS

This chapter recounts some of the initial experimental analyses that launched the era of molecular genetics. Quite fittingly, then, our “Insights and Solutions” section shifts its emphasis from problem solving to experimental rationale and analytical thinking.

1. Based strictly on the transformation analysis of Avery, MacLeod, and McCarty, what objection might be made to the conclusion that DNA is the genetic material? What other conclusion might be considered?

Solution: Based solely on their results, we could conclude that DNA is essential for transformation. However, DNA might have been a substance that caused capsular formation by converting nonencapsulated cells directly to cells with a capsule. That is, DNA may simply have played a catalytic role in capsular synthesis, leading to cells that display smooth, type III colonies.

2. What observations argue against this objection?

Solution: First, transformed cells pass the trait on to their progeny cells, thus supporting the conclusion that DNA is responsible for heredity, not for the direct production of polysaccharide coats. Second, subsequent transformation studies over the next five years showed that other traits, such as antibiotic resistance, could be transformed. Therefore, the transforming factor has a broad general effect, not one specific to polysaccharide synthesis.

3. If RNA were the universal genetic material, how would this have affected the Avery experiment and the Hershey–Chase experiment?

Solution: In the Avery experiment, ribonuclease (RNase), rather than deoxyribonuclease (DNase), would have eliminated transformation. Had this occurred, Avery and his colleagues would have concluded that RNA was the transforming factor. Hershey and Chase would have obtained identical results, since 32P would also label RNA but not protein.

Problems and Discussion Questions

Mastering Genetics: Visit for instructor-assigned tutorials and problems.

1. HOW DO WE KNOW? In this chapter, we have focused on DNA, the molecule that stores genetic information in all living things. In particular, we discussed its structure and delved into how we analyze this molecule. Based on your knowledge of these topics, answer several fundamental questions:
(a) How were we able to determine that DNA, and not some other molecule, serves as the genetic material in bacteria, bacteriophages, and eukaryotes?
(b) How do we know that the structure of DNA is in the form of a right-handed double-helical molecule?
(c) How do we know that in DNA G pairs with C and that A pairs with T as complementary strands are formed?

2. CONCEPTS QUESTION Review the Chapter Concepts list on p. 161. Most center on DNA and RNA and their role of serving as the genetic material. Write a short essay that contrasts these molecules, including a comparison of advantages conferred by their structure that each of them has over the other in serving in this role.

3. Discuss the reasons why proteins were generally favored over DNA as the genetic material before 1940. What was the role of the tetranucleotide hypothesis in this controversy?

4. Contrast the contributions made to an understanding of transformation by Griffith and by Avery and his colleagues.

5. When Avery and his colleagues had obtained what was concluded to be the transforming factor from the IIIS virulent cells, they treated the fraction with proteases, ribonuclease, and deoxyribonuclease, followed by the assay for retention or loss of transforming ability. What were the purpose and results of these experiments? What conclusions were drawn?

6. Why were 32P and 35S chosen in the Hershey–Chase experiment? Discuss the rationale and conclusions of this experiment.

7. Does the design of the Hershey–Chase experiment distinguish between DNA and RNA as the molecule serving as the genetic material? Why or why not?

8. What observations are consistent with the conclusion that DNA serves as the genetic material in eukaryotes? List and discuss them.

9. What are the exceptions to the general rule that DNA is the genetic material in all organisms? What evidence supports these exceptions?

10. Draw the chemical structure of the three components of a nucleotide, and then link them together. What atoms are removed from the structures when the linkages are formed?

11. How are the carbon and nitrogen atoms of the sugars, purines, and pyrimidines numbered?

12. Adenine may also be named 6-amino purine. How would you name the other four nitrogenous bases, using this alternative system? (O is oxy, and CH3 is methyl.)

13. Draw the chemical structure of a dinucleotide composed of A and G. Opposite this structure, draw the dinucleotide composed of T and C in an antiparallel (or upside-down) fashion. Form the possible hydrogen bonds.

14. Describe the various characteristics of the Watson–Crick double helix model for DNA.

15. What evidence did Watson and Crick have at their disposal in 1953? What was their approach in arriving at the structure of DNA?

16. What might Watson and Crick have concluded, had Chargaff’s data from a single source indicated the following base composition?

      A    T    G    C
%    29   19   21   31

Why would this conclusion be contradictory to Wilkins and Franklin’s data?


17. How do covalent bonds differ from hydrogen bonds? Define base complementarity.

18. List three main differences between DNA and RNA.

19. What are the three major types of RNA molecules? How is each related to the concept of information flow?

20. How is the absorption of ultraviolet light by DNA and RNA important in the analysis of nucleic acids?

21. What is the physical state of DNA after being denatured by heat?

22. What is the hyperchromic effect? How is it measured? What does Tm imply?

23. Why is Tm related to base composition?

24. What is the chemical basis of molecular hybridization?

25. What did the Watson–Crick model suggest about the replication of DNA?

26. A genetics student was asked to draw the chemical structure of an adenine- and thymine-containing dinucleotide derived from DNA. His answer is shown below. The student made more than six major errors. One of them is circled, numbered 1, and explained. Find five others. Circle them, number them 2 to 6, and briefly explain each by following the example given.

[Figure: the student's drawing of the dinucleotide. Explanations: 1. Extra phosphate should not be present.]

27. A primitive eukaryote was discovered that displayed a unique nucleic acid as its genetic material. Analysis revealed the following observations:
(a) X-ray diffraction studies display a general pattern similar to DNA, but with somewhat different dimensions and more irregularity.
(b) A major hyperchromic shift is evident upon heating and monitoring UV absorption at 260 nm.
(c) Base-composition analysis reveals four bases in the following proportions:

Adenine = 8%     Hypoxanthine = 18%
Guanine = 37%    Xanthine = 37%

(d) About 75 percent of the sugars are deoxyribose, whereas 25 percent are ribose.
Attempt to solve the structure of this molecule by postulating a model that is consistent with the foregoing observations.

28. One of the most common spontaneous lesions that occurs in DNA under physiological conditions is the hydrolysis of the amino group of cytosine, converting it to uracil. What would be the effect on DNA structure if a uracil group replaced cytosine?

29. In some organisms, cytosine is methylated at carbon 5 of the pyrimidine ring after it is incorporated into DNA. If a 5-methyl cytosine is then hydrolyzed, as described in Problem 28, what base will be generated?

30. Newsdate: March 1, 2030. A unique creature has been discovered during exploration of outer space. Recently, its genetic material has been isolated and analyzed, and has been found to be similar in some ways to DNA in chemical makeup. It contains in abundance the 4-carbon sugar erythrose and a molar equivalent of phosphate groups. In addition, it contains six nitrogenous bases: adenine (A), guanine (G), thymine (T), cytosine (C), hypoxanthine (H), and xanthine (X). These bases exist in the following relative proportion:

A = T = H and C = G = X

X-ray diffraction studies have established a regularity in the molecule and a constant diameter of about 30 Å. Together, these data have suggested a model for the structure of this molecule.
(a) Propose a general model of this molecule, and briefly describe it.
(b) What base-pairing properties must exist for H and for X in the model?
(c) Given the constant diameter of 30 Å, do you think either (i) both H and X are purines or both pyrimidines, or (ii) one is a purine and one is a pyrimidine?

31. You are provided with DNA samples from two newly discovered bacterial viruses. Based on the various analytical techniques discussed in this chapter, construct a research protocol that would be useful in characterizing and contrasting the DNA of both viruses. Indicate the type of information you hope to obtain for each technique included in the protocol.

32. During electrophoresis, DNA molecules can easily be separated according to size because all DNA molecules have the same charge–mass ratio and the same shape (long rod). Would you expect RNA molecules to behave in the same manner as DNA during electrophoresis? Why or why not?

10 DNA Replication

CHAPTER CONCEPTS

■ Genetic continuity between parental and progeny cells is maintained by semiconservative replication of DNA, as predicted by the Watson–Crick model.
■ Semiconservative replication uses each strand of the parent double helix as a template, and each newly replicated double helix includes one “old” and one “new” strand of DNA.
■ DNA synthesis is a complex but orderly process, occurring under the direction of a myriad of enzymes and other proteins.
■ DNA synthesis involves the polymerization of nucleotides into polynucleotide chains.
■ DNA synthesis is similar in bacteria and eukaryotes, but more complex in eukaryotes.
■ In eukaryotes, DNA synthesis at the ends of chromosomes (telomeres) poses a special problem, overcome by a unique RNA-containing enzyme, telomerase.

[Chapter-opening photograph: Transmission electron micrograph of human DNA from a HeLa cell, illustrating a replication fork characteristic of active DNA synthesis.]

Following Watson and Crick’s proposal for the structure of DNA, scientists focused their attention on how this molecule is replicated. Replication is an essential function of the genetic material and must be executed precisely if genetic continuity between cells is to be maintained following cell division. It is an enormous, complex task. Consider for a moment that more than 3 × 10⁹ (3 billion) base pairs exist within the human genome. To duplicate faithfully the DNA of just one of these chromosomes requires a mechanism of extreme precision. Even an error rate of only 10⁻⁶ (one in a million) will still create 3000 errors (obviously an excessive number) during each replication cycle of the genome. Although it is not error free, and much of evolution would not have occurred if it were, an extremely accurate system of DNA replication has evolved in all organisms.

As Watson and Crick wrote at the end of their classic 1953 paper that announced the double-helical model of DNA, “It has not escaped our notice that the specific pairing (A-T and C-G) we have postulated immediately suggests a copying mechanism for the genetic material.” Called semiconservative replication, this mode of DNA duplication was soon to receive strong support from numerous studies of viruses, bacteria, and eukaryotes. Once the general mode of replication was clarified, research to determine the precise details of the synthesis of DNA intensified. What has since been discovered is that numerous enzymes and other proteins are needed to copy a DNA helix. Because of the complexity of the chemical events during synthesis, this subject remains an extremely active area of research.
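The error-rate arithmetic in the introduction (3 billion base pairs copied with one error per million bases) can be checked with a few lines of Python. This is only an illustrative sketch; the function and constant names are hypothetical and are not part of the chapter.

```python
def expected_errors(genome_size_bp, bases_per_error):
    """Expected number of replication errors per copy of the genome.

    genome_size_bp:  number of base pairs copied in one replication cycle
    bases_per_error: average number of bases copied per error
                     (1_000_000 corresponds to an error rate of 10^-6)
    """
    return genome_size_bp / bases_per_error


# Figures quoted in the text: about 3 x 10^9 base pairs, error rate 10^-6.
HUMAN_GENOME_BP = 3 * 10**9
BASES_PER_ERROR = 10**6

print(expected_errors(HUMAN_GENOME_BP, BASES_PER_ERROR))  # 3000.0
```

Changing `BASES_PER_ERROR` shows how quickly the expected error count falls as the copying mechanism becomes more accurate.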



In this chapter, we will discuss the general mode of replication, as well as the specific details of DNA synthesis. The research leading to such knowledge is another link in our understanding of life processes at the molecular level.

10.1 DNA Is Reproduced by Semiconservative Replication

Watson and Crick recognized that, because of the arrangement and nature of the nitrogenous bases, each strand of a DNA double helix could serve as a template for the synthesis of its complement (Figure 10.1). They proposed that, if the helix were unwound, each nucleotide along the two parent strands would have an affinity for its complementary nucleotide. As we learned in Chapter 9, the complementarity is due to the potential hydrogen bonds that can be formed. If thymidylic acid (T) were present, it would “attract” adenylic acid (A); if guanylic acid (G) were present, it would attract cytidylic acid (C); likewise, A would attract T, and C would attract G. If these nucleotides were then covalently linked into polynucleotide chains along both templates, the result would be the production of two identical double strands of DNA. Each replicated DNA molecule would consist of one “old” and one “new” strand, hence the reason for the name semiconservative replication.

Two other theoretical modes of replication are possible that also rely on the parental strands as a template (Figure 10.2). In conservative replication, complementary polynucleotide chains are synthesized as described earlier. Following synthesis, however, the two newly created strands then come together and the parental strands reassociate. The original helix is thus “conserved.”

In the second alternative mode, called dispersive replication, the parental strands are dispersed into two new double helices following replication. Hence, each strand consists of both old and new DNA. This mode would involve cleavage of the parental strands during replication. It is the most complex of the three possibilities and is therefore considered to be least likely to occur. It could not, however, be ruled out as an experimental model. Figure 10.2 shows the theoretical results of a single round of replication by each of the three different modes.

FIGURE 10.1 Generalized model of semiconservative replication of DNA. New synthesis is shown in blue.

FIGURE 10.2 Results of one round of replication of DNA for each of the three possible modes by which replication could be accomplished. [Panels: Conservative, Semiconservative, Dispersive. One round of replication; new synthesis is shown in blue.]
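The template idea behind semiconservative replication can be sketched in Python. The sketch below is a simplification under stated assumptions: a duplex is written as two strings aligned position by position, strand polarity (5′ to 3′) is ignored, and the function names and the example sequence are hypothetical.

```python
# Base-pairing rules: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesize_complement(template):
    """Build the new strand that pairs, position by position, with the template."""
    return "".join(COMPLEMENT[base] for base in template)


def replicate_semiconservatively(duplex):
    """Unwind a duplex and use each old strand as a template.

    duplex is a pair (top, bottom) of aligned strands. Each daughter duplex
    keeps one old strand and gains one newly synthesized strand.
    """
    old_top, old_bottom = duplex
    daughter_1 = (old_top, synthesize_complement(old_top))        # old top + new bottom
    daughter_2 = (synthesize_complement(old_bottom), old_bottom)  # new top + old bottom
    return [daughter_1, daughter_2]


parent = ("GATTACA", "CTAATGT")  # hypothetical example sequence
for daughter in replicate_semiconservatively(parent):
    print(daughter)
```

Because each new strand is dictated entirely by its template, both daughter duplexes carry the same base-pair sequence as the parent, which is the point of the "two identical double strands" statement above.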


[Figure 10.3 panels: E. coli is grown in 15N-labeled medium until its DNA is uniformly labeled with 15N in the nitrogenous bases (Generation 0). The 15N-labeled E. coli cells are added to 14N medium, where they replicate once (Generation I), a second time (Generation II), and a third time (Generation III). DNA is extracted and centrifuged in a gradient (direction of gravitational force shown). Bands: Generation 0, 15N/15N; Generation I, 15N/14N; Generation II, 14N/14N and 15N/14N; Generation III, 14N/14N and 15N/14N.]

FIGURE 10.3 The Meselson–Stahl experiment.

The Meselson–Stahl Experiment

In 1958, Matthew Meselson and Franklin Stahl published the results of an experiment providing strong evidence that semiconservative replication is the mode used by bacterial cells to produce new DNA molecules. They grew Escherichia coli cells for many generations in a medium that had 15NH4Cl (ammonium chloride) as the only nitrogen source. A “heavy” isotope of nitrogen, 15N contains one more neutron than the naturally occurring 14N isotope; thus, molecules containing 15N are more dense than those containing 14N. Unlike radioactive isotopes, 15N is stable. After many generations in this medium, almost all nitrogen-containing molecules in the E. coli cells, including the nitrogenous bases of DNA, contained the heavier isotope.

Critical to the success of this experiment, DNA containing 15N can be distinguished from DNA containing 14N. The experimental procedure involves the use of a technique referred to as sedimentation equilibrium centrifugation (also called buoyant density gradient centrifugation). Samples are forced by centrifugation through a density gradient of a heavy metal salt, such as cesium chloride. Molecules of DNA will reach equilibrium when their density equals the density of the gradient medium. In this case, 15N-DNA will reach this point at a position closer to the bottom of the tube than will 14N-DNA.

In this experiment (Figure 10.3), uniformly labeled 15N cells were transferred to a medium containing only 14NH4Cl. Thus, all “new” synthesis of DNA during replication contained only the “lighter” isotope of nitrogen. The time of transfer to the new medium was taken as time zero (t = 0). The E. coli cells were allowed to replicate over several generations, with cell samples removed after each replication cycle. DNA was isolated from each sample and subjected to sedimentation equilibrium centrifugation.

After one generation, the isolated DNA was present in only a single band of intermediate density, the expected result for semiconservative replication in which each replicated molecule was composed of one new 14N-strand and one old 15N-strand (Figure 10.4). This result was not consistent with the prediction of conservative replication, in which two distinct bands would occur; thus this mode may be ruled out.

After two cell divisions, DNA samples showed two density bands: one intermediate band and one lighter


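[Figure 10.4 panels: 15N-DNA replicates in 14N medium to give Generation I and then Generation II (direction of gravitational force shown). Bands: starting DNA, 15N/15N; Generation I, 15N/14N; Generation II, 14N/14N and 15N/14N.]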

FIGURE 10.4 The expected results of two generations of semiconservative replication in the Meselson–Stahl experiment.

band corresponding to the 14N position in the gradient. Similar results occurred after a third generation, except that the proportion of the lighter band increased. This was again consistent with the interpretation that replication is semiconservative.

You may have realized that a molecule exhibiting intermediate density is also consistent with dispersive replication. However, Meselson and Stahl also ruled out this mode of replication on the basis of two observations. First, after the first generation of replication in a 14N-containing medium, they isolated the hybrid molecule and heat denatured it. Recall from Chapter 9 that heating will separate a duplex into single strands. When the densities of the single strands of the hybrid were determined, they exhibited either a 15N profile or a 14N profile, but not an intermediate density. This observation is consistent with the semiconservative mode but inconsistent with the dispersive mode.

Furthermore, if replication were dispersive, all generations after t = 0 would demonstrate DNA of an intermediate density. In each generation after the first, the ratio of 15N/14N would decrease, and the hybrid band would become lighter and lighter, eventually approaching the 14N band. This result was not observed. The Meselson–Stahl experiment provided conclusive support for semiconservative replication in bacteria and tended to rule out both the conservative and dispersive modes.

ESSENTIAL POINT
In 1958, Meselson and Stahl resolved the question of which of three potential modes of replication is utilized by E. coli during the duplication of DNA in favor of semiconservative replication, showing that newly synthesized DNA consists of one old strand and one new strand. ■

NOW SOLVE THIS
10.1 In the Meselson–Stahl experiment, which of the three modes of replication could be ruled out after one round of replication? After two rounds?
HINT: This problem involves an understanding of the nature of the experiment as well as the difference between the three possible modes of replication. The key to its solution is to determine which mode will not create “hybrid” helices after one round of replication.

Semiconservative Replication in Eukaryotes

In 1957, the year before the work of Meselson and Stahl was published, J. Herbert Taylor, Philip Woods, and Walter Hughes presented evidence that semiconservative replication also occurs in eukaryotic organisms. They experimented with root tips of the broad bean Vicia faba, which are an excellent source of dividing cells. These researchers were able to monitor the process of replication by labeling DNA with 3H-thymidine, a radioactive precursor of DNA, and performing autoradiography.

Autoradiography is a common technique that, when applied cytologically, pinpoints the location of a radioisotope in a cell. In this procedure, a photographic emulsion is placed over a histological preparation containing cellular material (root tips, in this experiment), and the preparation is stored in the dark. The slide is then developed, much as photographic film is processed. Because the radioisotope emits energy, upon development the emulsion turns black at the approximate point of emission. The end result is the presence of dark spots or “grains” on the surface of the section, identifying the location of newly synthesized DNA within the cell.
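The band patterns that each mode of replication predicts for the Meselson–Stahl experiment can be worked out with a small simulation. The following is a minimal sketch, not a model of the actual centrifugation: each strand is represented only by its fraction of 15N (1.0 = fully heavy, 0.0 = fully light), the dispersive mode is idealized as an even mixing of old and new material, and all names are hypothetical.

```python
from collections import Counter

HEAVY, LIGHT = 1.0, 0.0  # fraction of 15N in a strand


def replicate(duplex, mode):
    """Return the two daughter duplexes made in 14N (light) medium."""
    a, b = duplex
    if mode == "semiconservative":
        # Each old strand pairs with a newly made light strand.
        return [(a, LIGHT), (b, LIGHT)]
    if mode == "conservative":
        # The parental duplex stays intact; the other duplex is entirely new.
        return [(a, b), (LIGHT, LIGHT)]
    if mode == "dispersive":
        # Idealization: old material is spread evenly over all four strands.
        mixed = (a + b) / 4
        return [(mixed, mixed), (mixed, mixed)]
    raise ValueError(f"unknown mode: {mode}")


def band_pattern(mode, generations):
    """Map duplex density (fraction of 15N) to the fraction of DNA in that band."""
    population = [(HEAVY, HEAVY)]  # generation 0: uniformly 15N-labeled DNA
    for _ in range(generations):
        population = [d for duplex in population for d in replicate(duplex, mode)]
    counts = Counter((a + b) / 2 for a, b in population)
    return {density: n / len(population) for density, n in counts.items()}


for mode in ("semiconservative", "conservative", "dispersive"):
    for generation in (1, 2, 3):
        print(mode, generation, band_pattern(mode, generation))
```

In this sketch a density of 0.5 corresponds to the hybrid 15N/14N band. Only the semiconservative mode gives a single hybrid band after one generation followed by hybrid and light bands afterward; the conservative mode never produces a hybrid band, and the dispersive mode gives a single band that grows lighter each generation.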


[Figure 10.5 panels: (a) Replication I in 3H-thymidine: an unlabeled chromosome replicates, and at metaphase both sister chromatids are labeled; at anaphase the chromatids migrate into separate cells. Replication II in unlabeled thymidine, seen at metaphase II: (b) no sister chromatid exchange, so only one chromatid is labeled and the other is unlabeled; (c) sister chromatid exchange, so reciprocal regions of both chromatids are labeled.]

FIGURE 10.5 The Taylor–Woods–Hughes experiment, demonstrating the semiconservative mode of replication of DNA in the root tips of Vicia faba. (a) An unlabeled chromosome proceeds through the cell cycle in the presence of 3H-thymidine. As it enters mitosis, both sister chromatids of the chromosome are labeled, as shown by autoradiography. After a second round of replication (b), this time in the absence of 3H-thymidine, only one chromatid of each chromosome is expected to be surrounded by grains. Except where a reciprocal exchange has occurred between sister chromatids (c), the expectation was upheld.

Taylor and his colleagues grew root tips for approximately one generation in the presence of the radioisotope and then placed them in unlabeled medium in which cell division continued. At the conclusion of each generation, they arrested the cultures at metaphase by adding colchicine (a chemical derived from the crocus plant that poisons the spindle fibers) and then examined the chromosomes by autoradiography. They found radioactive thymidine only in association with chromatids that contained newly synthesized DNA. Figure 10.5 illustrates the replication of a single chromosome over two division cycles, including the distribution of grains.

These results are compatible with the semiconservative mode of replication. After the first replication cycle in the presence of the isotope, both sister chromatids show radioactivity, indicating that each chromatid contains one new radioactive DNA strand and one old unlabeled strand. After the second replication cycle, which takes place in unlabeled medium, only one of the two sister chromatids of each chromosome should be radioactive because half of the parent strands are unlabeled. With only the minor exceptions of sister chromatid exchanges (discussed in Chapter 7), this result was observed.
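The labeling pattern expected in the Taylor–Woods–Hughes experiment follows from the same strand bookkeeping. As a rough sketch, assuming each chromatid is one DNA duplex whose two strands are either labeled (True) or unlabeled (False), and ignoring sister chromatid exchange, the two replication cycles can be traced as follows; the function names are hypothetical.

```python
def replicate_chromosome(duplex, label_in_medium):
    """Semiconservative replication of one chromatid-sized duplex.

    duplex is (strand_a_labeled, strand_b_labeled). Each old strand is paired
    with a new strand that is labeled only if 3H-thymidine is in the medium.
    Returns the two sister chromatids.
    """
    a, b = duplex
    return [(a, label_in_medium), (b, label_in_medium)]


def is_labeled(chromatid):
    """A chromatid shows grains if at least one of its strands is labeled."""
    return any(chromatid)


unlabeled_chromosome = (False, False)

# Replication I: in the presence of 3H-thymidine.
first_cycle = replicate_chromosome(unlabeled_chromosome, label_in_medium=True)
print([is_labeled(c) for c in first_cycle])   # [True, True]

# Replication II: one of those chromatids replicates in unlabeled medium.
second_cycle = replicate_chromosome(first_cycle[0], label_in_medium=False)
print([is_labeled(c) for c in second_cycle])  # [False, True]
```

Both sister chromatids are labeled after the first cycle, and only one of the two is labeled after the second, matching the pattern described above for chromosomes without exchanges.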


Together, the Meselson–Stahl experiment and the experiment by Taylor, Woods, and Hughes soon led to the general acceptance of the semiconservative mode of replication. Later studies with other organisms reached the same conclusion and also strongly supported Watson and Crick’s proposal for the double helix model of DNA.

ESSENTIAL POINT
Taylor, Woods, and Hughes demonstrated semiconservative replication in eukaryotes using the root tips of the broad bean as the source of dividing cells. ■

Origins, Forks, and Units of Replication

To enhance our understanding of semiconservative replication, let’s briefly consider a number of relevant issues. The first concerns the origin of replication. Where along the chromosome is DNA replication initiated? Is there only a single origin, or does DNA synthesis begin at more than one point? Is any given point of origin random, or is it located at a specific region along the chromosome? Second, once replication begins, does it proceed in a single direction or in both directions away from the origin? In other words, is replication unidirectional or bidirectional?

To address these issues, we need to introduce two terms. First, at each point along the chromosome where replication is occurring, the strands of the helix are unwound, creating what is called a replication fork (see the chapter opening photograph on p. 182). Such a fork will initially appear at the point of origin of synthesis and then move along the DNA duplex as replication proceeds. If replication is bidirectional, two such forks will be present, migrating in opposite directions away from the origin. The second term refers to the length of DNA that is replicated following one initiation event at a single origin. This is a unit referred to as the replicon.

The evidence is clear regarding the origin and direction of replication. John Cairns tracked replication in E. coli, using radioactive precursors of DNA synthesis and autoradiography. He was able to demonstrate that in E. coli there is only a single region, called oriC, where replication is initiated. The presence of only a single origin is characteristic of bacteria, which have only one circular chromosome. Since DNA synthesis in bacteriophages and bacteria originates at a single point, the entire chromosome constitutes one replicon. In E. coli, the replicon consists of the entire genome of 4.6 Mb (4.6 million base pairs).

Figure 10.6 illustrates Cairns’s interpretation of DNA replication in E. coli. This interpretation and the accompanying micrograph do not answer the question of unidirectional versus bidirectional synthesis. However, other results, derived from studies of bacteriophage lambda, have demonstrated that replication is bidirectional, moving away from oriC in both directions. Figure 10.6 therefore interprets Cairns’s work with that understanding. Bidirectional replication creates two replication forks that migrate farther and farther apart as replication proceeds. These forks eventually merge, as semiconservative replication of the entire chromosome is completed, at a termination region, called ter.

FIGURE 10.6 Bidirectional replication of the E. coli chromosome. The thin black arrows identify the advancing replication forks. [Labels: oriC, ter.]

Later in this chapter, we will see that in eukaryotes, each chromosome contains multiple points of origin.

10.2 DNA Synthesis in Bacteria Involves Five Polymerases, as Well as Other Enzymes

To say that replication is semiconservative and bidirectional describes the overall pattern of DNA duplication and the association of finished strands with one another once synthesis is completed. However, it says little about the more complex issue of how the actual synthesis of long complementary polynucleotide chains occurs on a DNA template.

M10_KLUG8414_10_SE_C10.indd 187 16/11/18 5:13 pm 188 10 DNA Replication

Like most questions in molecular biology, this one was first studied using microorganisms. Research on DNA synthesis began about the same time as the Meselson–Stahl work, and the topic is still an active area of investigation. What is most apparent in this research is the tremendous complexity of the biological synthesis of DNA.

DNA Polymerase I

Studies of the enzymology of DNA replication were first reported by Arthur Kornberg and colleagues in 1957. They isolated an enzyme from E. coli that was able to direct DNA synthesis in a cell-free (in vitro) system. The enzyme is called DNA polymerase I, because it was the first of several similar enzymes to be isolated.

Kornberg determined that there were two major requirements for in vitro DNA synthesis under the direction of DNA polymerase I: (1) all four deoxyribonucleoside triphosphates (dNTPs) and (2) template DNA. If any one of the four deoxyribonucleoside triphosphates was omitted from the reaction, no measurable synthesis occurred. If derivatives of these precursor molecules other than the triphosphate were used (nucleotides or nucleoside diphosphates), synthesis also did not occur. If no template DNA was added, synthesis of DNA occurred but was reduced greatly.

Most of the synthesis directed by Kornberg’s enzyme appeared to be exactly the type required for semiconservative replication. The reaction is summarized in Figure 10.7, which depicts the addition of a single nucleotide. The enzyme has since been shown to consist of a single polypeptide containing 928 amino acids.

The way in which each nucleotide is added to the growing chain is a function of the specificity of DNA polymerase I. As shown in Figure 10.8, the precursor dNTP contains the three phosphate groups attached to the 5′-carbon of deoxyribose. As the two terminal phosphates are cleaved during synthesis, the remaining phosphate attached to the 5′-carbon is covalently linked to the 3′-OH group of the deoxyribose to which it is added. Thus, chain elongation occurs in the 5′ to 3′ direction by the addition of one nucleotide at a time to the growing 3′-end. Each step provides a newly exposed 3′-OH group that can participate in the next addition of a nucleotide as DNA synthesis proceeds.

Having isolated DNA polymerase I and demonstrated its catalytic activity, Kornberg next sought to demonstrate the accuracy, or fidelity, with which the enzyme replicated the DNA template. Because technology for ascertaining the nucleotide sequences of the template and newly synthesized strand was not yet available in 1957, he initially had to rely on several indirect methods.

One of Kornberg’s approaches was to compare the nitrogenous base compositions of the DNA template with those of the recovered DNA product. Using several sources of DNA (phage T2, E. coli, and calf thymus), he discovered that, within experimental error, the base composition of each product agreed with the template DNA used. This suggested that the templates were replicated faithfully.

ESSENTIAL POINT Arthur Kornberg isolated the enzyme DNA polymerase I from E. coli and showed that it is capable of directing in vitro DNA synthesis, provided that a template and precursor nucleoside triphosphates were supplied. ■

DNA Polymerase II, III, IV, and V

While DNA polymerase I clearly directs the synthesis of DNA, a serious reservation about the enzyme’s true biological role was raised in 1969. Paula DeLucia and John Cairns discovered a mutant strain of E. coli that was deficient in polymerase I activity. The mutation was designated polA1. In the absence of the functional enzyme, this mutant strain of E. coli still duplicated its DNA and successfully reproduced. However, the cells were deficient in their ability to repair DNA. For example, the mutant strain is highly sensitive to ultraviolet light (UV) and radiation, both of which damage DNA and are mutagenic. Nonmutant bacteria are able to repair a great deal of UV-induced damage.

(dNMP)x + (dNMP)n + dNTP → (dNMP)x + (dNMP)n+1 + P–P   (catalyzed by DNA polymerase I; requires Mg2+)

Here dNTP stands for the deoxyribonucleoside triphosphates (dATP, dTTP, dCTP, dGTP); (dNMP)x is the DNA template and (dNMP)n is a portion of its complement; (dNMP)n+1 is the complement to the template strand extended by one nucleotide; and P–P is inorganic pyrophosphate.

FIGURE 10.7 The chemical reaction catalyzed by DNA polymerase I. During each step, a single nucleotide is added to the growing complement of the DNA template using a nucleoside triphosphate as the substrate. The release of inorganic pyrophosphate drives the reaction energetically.


[Figure 10.8 diagram: the 5′-end of the precursor (nucleoside triphosphate) attaches to the 3′-OH end of the growing chain, and two phosphates (P–P) are released.]

FIGURE 10.8 Demonstration of 5′ to 3′ synthesis of DNA.
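The reaction in Figure 10.7 and the 5′ to 3′ direction of synthesis can be sketched as a toy model in code. This is only an illustration under simplifying assumptions, not a biochemical simulation: the function name `elongate`, its arguments, and the example sequences are hypothetical. Each step adds the nucleotide complementary to the next template base to the 3′ end of the growing strand and releases one pyrophosphate.

```python
# Toy model of template-directed chain elongation, (dNMP)n -> (dNMP)n+1.
# All names and sequences here are illustrative assumptions.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def elongate(template_3to5: str, growing_5to3: str) -> tuple[str, int]:
    """Extend an existing strand along a template, one nucleotide per step.

    template_3to5: template strand written 3' -> 5'.
    growing_5to3:  existing complementary strand written 5' -> 3'; it must
                   already pair with the start of the template.
    Returns the finished strand (5' -> 3') and the number of pyrophosphates
    released (one per nucleotide added).
    """
    # The enzyme extends an existing strand; it cannot start a chain itself.
    if not growing_5to3:
        raise ValueError("an existing strand with a free 3'-OH is required")
    for base, partner in zip(growing_5to3, template_3to5):
        if COMPLEMENT[partner] != base:
            raise ValueError("existing strand is not complementary to the template")

    strand = growing_5to3
    pyrophosphates = 0
    # Each remaining template base directs the addition of one nucleotide
    # to the 3' end of the growing strand.
    for template_base in template_3to5[len(growing_5to3):]:
        strand += COMPLEMENT[template_base]
        pyrophosphates += 1
    return strand, pyrophosphates


new_strand, released = elongate("TACGGATC", "ATG")
print(new_strand, released)
```

Writing the template 3′ to 5′ and the product 5′ to 3′ keeps paired bases at the same index, so appending to the end of the string corresponds to addition at the 3′-OH end.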

These observations led to two conclusions:

1. At least one other enzyme that is responsible for replicating DNA in vivo is present in E. coli cells.

2. DNA polymerase I serves a secondary function in vivo, now believed to be critical to the maintenance of fidelity of DNA synthesis.

To date, four other unique DNA polymerases have been isolated from cells lacking polymerase I activity and from normal cells that contain polymerase I. Table 10.1 contrasts several characteristics of DNA polymerase I with DNA polymerase II and III. Although none of the three can initiate DNA synthesis on a template, all three can elongate an existing DNA strand, called a primer, and all three possess 3′ to 5′ exonuclease activity, which means that they have the potential to polymerize in one direction and then pause, reverse their direction, and excise nucleotides just added. As we will later discuss, this activity provides a capacity to proofread newly synthesized DNA and to remove and replace incorrect nucleotides.

DNA polymerase I also demonstrates 5′ to 3′ exonuclease activity. This activity allows the enzyme to excise nucleotides, starting at the end at which synthesis begins and proceeding in the same direction of synthesis. Two final observations probably explain why Kornberg isolated polymerase I and not polymerase III: polymerase I is present in greater amounts than is polymerase III, and it is also much more stable.

What then are the roles of the polymerases in vivo? Polymerase III is the enzyme responsible for the 5′ to 3′ polymerization essential to in vivo replication. Its 3′ to 5′ exonuclease activity also provides a proofreading function that is activated when it inserts an incorrect nucleotide. When this occurs, synthesis stalls and the polymerase “reverses course,” excising the incorrect nucleotide. Then, it proceeds back in the 5′ to 3′ direction, synthesizing the complement of the template strand. Polymerase I is believed to be responsible for removing the primer, as well as for the synthesis that fills gaps produced after this removal. Its exonuclease activities also allow for its participation in DNA repair. Polymerase II, as well as polymerase IV and V, are involved in various aspects of repair of DNA that has been damaged by external forces, such as ultraviolet light. Polymerase II is encoded by a gene whose transcription is activated by disruption of DNA synthesis at the replication fork.

TABLE 10.1 Properties of Bacterial DNA Polymerases I, II, and III

Properties                        I      II     III
Initiation of chain synthesis     -      -      -
5′ to 3′ polymerization           +      +      +
3′ to 5′ exonuclease activity     +      +      +
5′ to 3′ exonuclease activity     +      -      -
Molecules of polymerase/cell      400    ?      15

ESSENTIAL POINT The discovery of the polA1 mutant strain of E. coli, capable of DNA replication despite its lack of polymerase I activity, cast doubt on the enzyme’s hypothesized in vivo replicative function. Polymerase III has been identified as the enzyme responsible for DNA replication in vivo. ■
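The proofreading role of the 3′ to 5′ exonuclease activity described above can be illustrated with a short, deliberately simplified sketch. It assumes that nucleotides are offered to the enzyme in a fixed order and that every mismatch is detected and excised; the function name `synthesize_with_proofreading` and the sequences are hypothetical and chosen only for illustration.

```python
# Toy model of 3' to 5' proofreading during 5' to 3' synthesis.
# Function name, arguments, and sequences are illustrative assumptions.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesize_with_proofreading(template_3to5: str, incoming: str) -> tuple[str, int]:
    """Copy a template while excising every mismatched nucleotide.

    template_3to5: template strand written 3' -> 5'.
    incoming:      nucleotides offered to the enzyme, in order; some may be wrong.
    Returns the new strand (5' -> 3') and the number of excisions performed.
    """
    strand: list[str] = []
    excisions = 0
    supply = iter(incoming)
    for template_base in template_3to5:
        while True:
            base = next(supply, None)
            if base is None:
                raise ValueError("ran out of nucleotides before the template was copied")
            strand.append(base)            # 5' to 3' polymerization step
            if base == COMPLEMENT[template_base]:
                break                      # correct pairing: move to the next position
            strand.pop()                   # 3' to 5' exonuclease removes the mismatch
            excisions += 1
    return "".join(strand), excisions


product, removed = synthesize_with_proofreading("TACG", "AGTGC")
print(product, removed)
```

In this example the second nucleotide offered (G) does not pair with the template base A, so it is added, excised, and replaced by T before synthesis continues.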


The DNA Polymerase III Holoenzyme

We conclude this section by emphasizing the complexity of the DNA polymerase III enzyme, henceforth referred to as DNA Pol III. The active form of DNA Pol III, referred to as the holoenzyme, is made up of unique polypeptide subunits, ten of which have been identified (Table 10.2). The largest subunit, α, along with subunits ε and θ, form a complex called the core enzyme, which imparts the catalytic function to the holoenzyme. In E. coli, each holoenzyme contains two, and possibly three, core enzyme complexes. As part of each core, the α subunit is responsible for DNA synthesis along the template strands, whereas the ε subunit possesses 3′ to 5′ exonuclease capability, essential to proofreading. The need for more than one core enzyme will soon become apparent. A second group of five subunits (γ, δ, δ′, χ, and ψ) are complexed to form what is called the sliding clamp loader, which pairs with the core enzyme and facilitates the function of a critical component of the holoenzyme, called the sliding DNA clamp. The enzymatic function of the sliding clamp loader is dependent on energy generated by the hydrolysis of ATP. The sliding DNA clamp links to the core enzyme and is made up of multiple copies of the β subunit, taking on the shape of a donut, whereby it can open and shut, to encircle the unreplicated DNA helix. By doing so, and being linked to the core enzyme, the clamp leads the way during synthesis, maintaining the binding of the core enzyme to the template during polymerization of nucleotides. Thus, the length of DNA that is replicated by the core enzyme before it detaches from the template, a property referred to as processivity, is vastly increased. There is one sliding clamp per core enzyme. Finally, one τ subunit interacts with each core enzyme, linking it to the sliding clamp loader.

The DNA Pol III holoenzyme is diagrammatically illustrated in Figure 10.9. You should compare the diagram to the description of each component above. Note that we have shown the holoenzyme to contain two core enzyme complexes, although as stated above, a third one may be present. The components of the DNA Pol III holoenzyme will be referred to in the discussion that follows.

FIGURE 10.9 The components making up the DNA Pol III holoenzyme, as described in the text. While there may be three core enzyme complexes present in the holoenzyme, for simplicity, we illustrate only two.

TABLE 10.2 Subunits of the DNA Polymerase III Holoenzyme

Subunit          Function                                             Groupings
α                5′ to 3′ polymerization                              Core enzyme: elongates polynucleotide chain and proofreads
ε                3′ to 5′ exonuclease                                 Core enzyme
θ                Core assembly                                        Core enzyme
γ, δ, δ′, χ, ψ   Loads enzyme on template (serves as clamp loader)    γ complex
β                Sliding clamp structure (processivity factor)
τ                Dimerizes core complex

10.3 Many Complex Issues Must Be Resolved during DNA Replication

We have thus far established that in bacteria and viruses replication is semiconservative and bidirectional along a single replicon. We also know that synthesis is catalyzed by DNA polymerase III and occurs in the 5′ to 3′ direction. Bidirectional synthesis creates two replication forks that move in opposite directions away from the origin of synthesis. As we can see from the following list, many issues remain to be resolved in order to provide a comprehensive understanding of DNA replication:

1. The helix must undergo localized unwinding, and the resulting “open” configuration must be stabilized so that synthesis may proceed along both strands.

2. As unwinding and subsequent DNA synthesis proceed, increased coiling creates tension further down the helix, which must be reduced.

3. A primer of some sort must be synthesized so that polymerization can commence under the direction of DNA polymerase III. Surprisingly, RNA, not DNA, serves as the primer.

4. Once the RNA primers have been synthesized, DNA polymerase III begins to synthesize the DNA complement of both strands of the parent molecule. Because the two strands are antiparallel to one another, continuous synthesis in the direction that the replication fork moves is possible along only one of the two strands. On the other strand, synthesis must be discontinuous and thus involves a somewhat different process.

5. The RNA primers must be removed prior to completion of replication. The gaps that are temporarily created must be filled with DNA complementary to the template at each location.

6. The newly synthesized DNA strand that fills each temporary gap must be joined to the adjacent strand of DNA.

7. While DNA polymerases accurately insert complementary bases during replication, they are not perfect, and, occasionally, incorrect nucleotides are added to the growing strand. A proofreading mechanism that also corrects errors is an integral process during DNA synthesis.

As we consider these points, examine Figures 10.10, 10.11, and 10.12 to see how each issue is resolved. Figure 10.13 summarizes the model of DNA synthesis.

Unwinding the DNA Helix

As discussed earlier, there is a single point of origin along the circular chromosome of most bacteria and viruses at which DNA replication is initiated. This region in E. coli has been particularly well studied and is called oriC. It consists of 245 DNA base pairs and is characterized by five repeating sequences of 9 base pairs, and three repeating sequences of 13 base pairs, called 9mers and 13mers, respectively. Both 9mers and 13mers are AT-rich, which renders them relatively less stable than an average double-helical sequence of DNA, which no doubt enhances helical unwinding. A specific initiator protein, called DnaA (because it is encoded by the dnaA gene), is responsible for initiating replication by binding to a region of 9mers. This newly formed complex then undergoes a slight conformational change and associates with the region of 13mers, which causes the helix to destabilize and open up, exposing single-stranded regions of DNA (ssDNA). This step facilitates the subsequent binding of another key player in the process, a protein called DNA helicase (made up of multiple copies of the DnaB polypeptide). DNA helicase is assembled as a hexamer of subunits around one of the exposed single-stranded DNA molecules. The helicase subsequently recruits the holoenzyme to bind to the newly formed replication fork to formally initiate replication, and it then proceeds to move along the ssDNA, opening up the helix as it progresses. Helicases require energy supplied by the hydrolysis of ATP, which aids in denaturing the hydrogen bonds that stabilize the double helix.

Once the helicase has opened up the helix and ssDNA is available, base pairing must be inhibited until it can serve as a template for synthesis. This is accomplished by proteins that bind specifically to single strands of DNA, appropriately called single-stranded binding proteins (SSBs).

As unwinding proceeds, a coiling tension is created ahead of the replication fork, often producing supercoiling. In circular molecules, supercoiling may take the form of added twists and turns of the DNA, much like the coiling you can create in a rubber band by stretching it out and then twisting one end. Such supercoiling can be relaxed by DNA gyrase, a member of a larger group of enzymes referred to as DNA topoisomerases. The gyrase makes either single- or double-stranded “cuts” and also catalyzes localized movements that have the effect of “undoing” the twists and knots created during supercoiling. The strands are then resealed. These various reactions are driven by the energy released during ATP hydrolysis.

Together, the DNA, the polymerase complex, and associated enzymes make up an array of molecules that initiate DNA synthesis and are part of what we have previously called the replisome.

ESSENTIAL POINT During the initiation of DNA synthesis, the double helix unwinds, forming a replication fork at which synthesis begins. Proteins stabilize the unwound helix and assist in relaxing the coiling tension created ahead of the replication fork. ■

Initiation of DNA Synthesis Using an RNA Primer

Once a small portion of the helix is unwound, what else is needed to initiate synthesis? As we have seen, DNA polymerase III requires a primer with a free 3′-hydroxyl group in order to elongate a polynucleotide chain. Since none is available in a circular chromosome, this absence prompted researchers to investigate how the first nucleotide could be added. It is now clear that RNA serves as the primer that initiates DNA synthesis.

A short segment of RNA (about 10 to 12 ribonucleotides long), complementary to DNA, is first synthesized on the DNA template. Synthesis of the RNA is directed by a form of RNA polymerase called primase, which does not require a free 3′-end to initiate synthesis. It is to this short segment of RNA that DNA polymerase III begins to add deoxyribonucleotides, initiating DNA synthesis. A conceptual diagram of initiation on a DNA template is shown in Figure 10.10. Later, the RNA primer is clipped out and replaced with DNA. This is thought to occur under the direction of DNA polymerase I. Recognized in viruses, bacteria, and several eukaryotic organisms, RNA priming is a universal phenomenon during the initiation of DNA synthesis.


FIGURE 10.10 The initiation of DNA synthesis. A complementary RNA primer is first synthesized, to which DNA is added. All synthesis is in the 5′ to 3′ direction. Eventually, the RNA primer is replaced with DNA under the direction of DNA polymerase I.

ESSENTIAL POINT DNA synthesis is initiated at specific sites along each template strand by the enzyme primase, resulting in short segments of RNA that provide suitable 3′-ends upon which DNA polymerase III can begin polymerization. ■

Continuous and Discontinuous DNA Synthesis

We must now revisit the fact that the two strands of a double helix are antiparallel to each other; that is, one runs in the 5′ to 3′ direction, while the other has the opposite 3′ to 5′ polarity. Because DNA polymerase III synthesizes DNA in only the 5′ to 3′ direction, synthesis along an advancing replication fork occurs in one direction on one strand and in the opposite direction on the other.

As a result, as the strands unwind and the replication fork progresses down the helix (Figure 10.11), only one strand can serve as a template for continuous DNA synthesis. This newly synthesized DNA is called the leading strand. As the fork progresses, many points of initiation are necessary on the opposite DNA template, resulting in discontinuous DNA synthesis* of the lagging strand.

* Because DNA synthesis is continuous on one strand and discontinuous on the other, the term semidiscontinuous synthesis is sometimes used to describe the overall process.

FIGURE 10.11 Opposite polarity of synthesis along the two strands of DNA is necessary because they run antiparallel to one another, and because DNA polymerase III synthesizes in only one direction (5′ to 3′). On the lagging strand, synthesis must be discontinuous, resulting in the production of Okazaki fragments. On the leading strand, synthesis is continuous. RNA primers are used to initiate synthesis on both strands.

Evidence supporting the occurrence of discontinuous DNA synthesis was first provided by Tsuneko and Reiji Okazaki. They discovered that when bacteriophage DNA is replicated in E. coli, some of the newly formed DNA that is hydrogen bonded to the template strand is present as small fragments containing 1000 to 2000 nucleotides. RNA primers are part of each such fragment. These pieces, now called Okazaki fragments, are converted into longer and longer DNA strands of higher molecular weight as synthesis proceeds.

Discontinuous synthesis of DNA requires enzymes that both remove the RNA primers and unite the Okazaki fragments into the lagging strand. As we have noted, DNA polymerase I removes the primers and replaces the missing nucleotides. Joining the fragments is the work of DNA ligase, which is capable of catalyzing the formation of the phosphodiester bond that seals the nick between the discontinuously synthesized strands. The evidence that DNA ligase performs this function during DNA synthesis is strengthened by the observation of a ligase-deficient mutant strain (lig) of E. coli, in which a large number of unjoined Okazaki fragments accumulate.

Concurrent Synthesis Occurs on the Leading and Lagging Strands

Given the model just discussed, we might ask how the holoenzyme of DNA Pol III synthesizes DNA on both the leading and lagging strands. Can both strands be replicated simultaneously at the same replication fork, or are the events distinct, involving two separate copies of the enzyme? Evidence suggests that both strands are replicated simultaneously, with each strand acted upon by one of the two core enzymes


[Figure 10.12 diagram labels: lagging strand template, leading strand template, replication fork (RF), sliding clamp, Pol III core enzyme.]

FIGURE 10.12 Illustration of how concurrent DNA synthesis may be achieved on both the leading and lagging strands at a single replication fork (RF). The lagging template strand is “looped” in order to invert the physical direction of synthesis, but not the biochemical direction. The enzyme functions as a dimer, with each core enzyme achieving synthesis on one or the other strand.

that are part of the DNA Pol III holoenzyme. As Figure 10.12 illustrates, if the lagging strand template is spooled out, forming a loop, nucleotide polymerization can occur simultaneously on both template strands under the direction of the holoenzyme. After the synthesis of 1000 to 2000 nucleotides, the monomer of the enzyme on the lagging strand will encounter a completed Okazaki fragment, at which point it releases the lagging strand. A new loop of the lagging strand is spooled out, and the process is repeated. Looping inverts the orientation of the template but not the direction of actual synthesis on the lagging strand, which is always in the 5′ to 3′ direction. As mentioned above, it is believed that there is a third core enzyme associated with the DNA Pol III holoenzyme, and that it functions in the synthesis of Okazaki fragments. For simplicity, we will include only two core enzymes in this and subsequent figures.

Another important feature of the holoenzyme that facilitates synthesis at the replication fork is the donut-shaped sliding DNA clamp that surrounds the unreplicated double helix and is linked to the advancing core enzyme. This clamp prevents the core enzyme from dissociating from the template as polymerization proceeds. By doing so, the clamp is responsible for vastly increasing the processivity of the core enzyme, that is, the number of nucleotides that may be continually added prior to dissociation from the template. This function is critical to the rapid in vivo rate of DNA synthesis during replication.

ESSENTIAL POINT Concurrent DNA synthesis occurs continuously on the leading strand and discontinuously on the opposite lagging strand, resulting in short Okazaki fragments that are later joined by DNA ligase. ■

Proofreading and Error Correction Occur during DNA Replication

The immediate purpose of DNA replication is the synthesis of a new strand that is precisely complementary to the template strand at each nucleotide position. Although the action of DNA polymerases is very accurate, synthesis is not perfect and a noncomplementary nucleotide is occasionally inserted erroneously. To compensate for such inaccuracies, the DNA polymerases all possess 3′ to 5′ exonuclease activity. This property imparts the potential for them to detect and excise a mismatched nucleotide (in the 3′ to 5′ direction). Once the mismatched nucleotide is removed, 5′ to 3′ synthesis can again proceed. This process, called proofreading, increases the fidelity of synthesis by a factor of about 100. In the case of the holoenzyme form of DNA polymerase III, the epsilon (ε) subunit is directly involved in the proofreading step. In strains of E. coli with a mutation that has rendered the ε subunit nonfunctional, the error rate (the mutation rate) during DNA synthesis is increased substantially.

10.4 A Coherent Model Summarizes DNA Replication

We can now combine the various aspects of DNA replication occurring at a single replication fork into a coherent model, as shown in Figure 10.13. At the advancing fork, a helicase is unwinding the double helix. Once unwound, single-stranded binding proteins associate with the strands, preventing the re-formation of the helix. In advance of the replication fork,


[Figure 10.13 diagram labels: lagging strand template, leading strand template, RNA primer, Okazaki fragments, single-stranded binding proteins, DNA gyrase, helicase, DNA Pol III core enzymes, sliding clamp loader, sliding DNA clamp.]

FIGURE 10.13 Summary of DNA synthesis at a single replication fork. Various enzymes and proteins essential to the process are shown.

DNA gyrase functions to diminish the tension created as the helix supercoils. Each half of the dimeric polymerase is a core enzyme bound to one of the template strands by a β-subunit sliding clamp. Continuous synthesis occurs on the leading strand, while the lagging strand must loop out and around the polymerase in order for simultaneous (concurrent) synthesis to occur on both strands. Not shown in the figure, but essential to replication on the lagging strand, is the action of DNA polymerase I and DNA ligase, which together replace the RNA primers with DNA and join the Okazaki fragments, respectively.

The preceding model provides a summary of DNA synthesis against which genetic phenomena can be interpreted.

NOW SOLVE THIS
10.2 An alien organism was investigated. When DNA replication was studied, a unique feature was apparent: No Okazaki fragments were observed. Create a model of DNA that is consistent with this observation.
HINT: This problem involves an understanding of the process of DNA synthesis in bacteria, as depicted in Figure 10.12. The key to its solution is to consider why Okazaki fragments are observed during DNA synthesis and how their formation relates to DNA structure, as described in the Watson–Crick model.

10.5 Replication Is Controlled by a Variety of Genes

Much of what we know about DNA replication in viruses and bacteria is based on genetic analysis of the process. For example, we have already discussed studies involving the polA1 mutation, which revealed that DNA polymerase I is not the major enzyme responsible for replication. Many other mutations interrupt or seriously impair some aspect of replication, such as the ligase-deficient and the proofreading-deficient mutations mentioned previously. Because such mutations are lethal, genetic analysis frequently uses conditional mutations, which are expressed under one condition but not under a different condition. For example, a temperature-sensitive mutation may not be expressed at a particular permissive temperature. When mutant cells are grown at a restrictive temperature, the mutant phenotype is expressed and can be studied. By examining the effect of the loss of function associated with the mutation, the investigation of such temperature-sensitive mutants can provide insight into the product and the associated function of the normal, nonmutated gene.

TABLE 10.3 Some of the Various E. coli Genes and Their Products or Role in Replication

Gene                Product or Role
polA                DNA polymerase I
polB                DNA polymerase II
dnaE, N, Q, X, Z    DNA polymerase III subunits
dnaG                Primase
dnaA, I, P          Initiation
dnaB, C             Helicase at oriC
gyrA, B             Gyrase subunits
lig                 DNA ligase
rep                 DNA helicase
ssb                 Single-stranded binding proteins
rpoB                RNA polymerase subunit

As shown in Table 10.3, a variety of genes in E. coli specify the subunits of the DNA polymerases and encode products involved in specification of the origin of synthesis, helix


unwinding and stabilization, initiation and priming, relaxation of supercoiling, repair, and ligation. The discovery of such a large group of genes attests to the complexity of the process of replication, even in this relatively simple organism. Given the enormous quantity of DNA that must be unerringly replicated in a very brief time, this level of complexity is not unexpected. As we will see in Section 10.6, the process is even more involved and therefore more difficult to investigate in eukaryotes.
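To give a feel for the scale involved, the figures quoted in this chapter can be combined in a rough calculation. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes that both replication forks advance at a constant rate, and it derives the bacterial rate from the chapter's statement that eukaryotic polymerases work about 25 times slower, at about 2000 nucleotides per minute. The function names are hypothetical.

```python
# Back-of-the-envelope figures for one round of E. coli replication.
# Rates are derived from numbers quoted in the chapter; constant fork
# speed is a simplifying assumption.

GENOME_BP = 4_600_000                    # E. coli replicon size (4.6 Mb)
EUKARYOTIC_RATE = 2_000                  # nucleotides per minute
BACTERIAL_RATE = 25 * EUKARYOTIC_RATE    # eukaryotic rate is about 25x slower
FORKS = 2                                # bidirectional replication from oriC


def replication_minutes(genome_bp: int, forks: int, rate_per_min: int) -> float:
    """Minutes needed if every fork advances at the same constant rate."""
    return genome_bp / (forks * rate_per_min)


def okazaki_fragment_range(lagging_nt: int, shortest: int = 1_000,
                           longest: int = 2_000) -> tuple[int, int]:
    """Fewest and most fragments needed to cover the lagging-strand DNA."""
    return lagging_nt // longest, lagging_nt // shortest


minutes = replication_minutes(GENOME_BP, FORKS, BACTERIAL_RATE)
# Each fork copies half the chromosome and produces one lagging strand, so
# the two lagging strands together span about one genome length.
fewest, most = okazaki_fragment_range(GENOME_BP)
print(f"{minutes:.0f} min, {fewest}-{most} Okazaki fragments")
```

Under these assumptions, one round of replication takes about 46 minutes and requires a few thousand Okazaki fragments, each of which must be primed, extended, stripped of its primer, and ligated.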

10.6 Eukaryotic DNA Replication Is Similar to Replication in Bacteria, but Is More Complex

Eukaryotic DNA replication shares many features with replication in bacteria. In both systems, double-stranded DNA is unwound at replication origins, replication forks are formed, and bidirectional DNA synthesis creates leading and lagging strands from single-stranded DNA templates under the direction of DNA polymerase. Eukaryotic polymerases have the same fundamental requirements for DNA synthesis as do bacterial polymerases: four deoxyribonucleoside triphosphates, a template, and a primer. However, eukaryotic DNA replication is more complex, due to several features of eukaryotic DNA. For example, eukaryotic cells contain much more DNA, this DNA is complexed with nucleosomes, and eukaryotic chromosomes are linear rather than circular. In this section, we will describe some of the ways in which eukaryotes deal with this added complexity.

Initiation at Multiple Replication Origins

The most obvious difference between eukaryotic and bacterial DNA replication is that eukaryotic replication must deal with greater amounts of DNA. For example, yeast cells contain three times as much DNA, and Drosophila cells contain 40 times as much as E. coli cells. In addition, eukaryotic DNA polymerases synthesize DNA at a rate 25 times slower (about 2000 nucleotides per minute) than that in bacteria. Under these conditions, replication from a single origin on a typical eukaryotic chromosome would take days to complete. However, replication of entire eukaryotic genomes is usually accomplished in a matter of minutes to hours.

To facilitate the rapid synthesis of large quantities of DNA, eukaryotic chromosomes contain multiple replication origins. Yeast genomes contain between 250 and 400 origins, and mammalian genomes have as many as 25,000. Multiple origins are visible under the electron microscope as “replication bubbles” that form as the DNA helix opens up, each bubble providing two potential replication forks (Figure 10.14). Origins in yeast, called autonomously replicating sequences (ARSs), consist of approximately 120 base pairs containing a consensus sequence (meaning a sequence that is the same, or nearly the same, in all yeast ARSs) of 11 base pairs. Origins in mammalian cells appear to be unrelated to specific sequence motifs and may be defined more by chromatin structure over a 6- to 55-kb region.

FIGURE 10.14 An electron micrograph of a eukaryotic replicating fork demonstrating the presence of histone-protein-containing nucleosomes on both branches.

Multiple Eukaryotic DNA Polymerases

To accommodate the large number of replicons, eukaryotic cells contain many more DNA polymerase molecules than do bacterial cells. For example, a single E. coli cell contains about 15 molecules of DNA polymerase III, but a mammalian cell contains tens of thousands of DNA polymerase molecules.

Eukaryotes also utilize a larger number of different DNA polymerase types than do bacteria. The human genome contains genes that encode at least 14 different DNA polymerases, only three of which are involved in the majority of nuclear genome DNA replication.

Pol α, δ, and ε are the major forms of the enzyme involved in initiation and elongation during eukaryotic nuclear DNA synthesis, so we will concentrate our discussion on these. Two of the four subunits of the Pol α enzyme synthesize RNA primers on both the leading and lagging strands. After the RNA primer reaches a length of about 10 ribonucleotides, another subunit adds 10 to 20 complementary deoxyribonucleotides. Pol α is said to possess low processivity, a term that refers to the strength of the association between the enzyme and its substrate, and thus the length of DNA that is synthesized before the enzyme dissociates from the template. Once the primer is in place, an event known as polymerase switching occurs, whereby Pol α dissociates from the template and is replaced by Pol δ or ε. These enzymes extend the primers on opposite strands of DNA, possess much greater processivity, and exhibit 3′ to 5′ exonuclease activity, thus having the potential to proofread. Pol ε synthesizes DNA on the leading strand, and Pol δ synthesizes the lagging strand. Both Pol δ and ε participate

M10_KLUG8414_10_SE_C10.indd 195 16/11/18 5:13 pm 196 10 DNA Replication

in other DNA-synthesizing events in the cell, including several types of DNA repair and recombination. All three DNA polymerases are essential for viability.

As in bacterial DNA replication, the final stages in eukaryotic DNA replication involve replacing the RNA primers with DNA and ligating the Okazaki fragments on the lagging strand. In eukaryotes, the Okazaki fragments are about ten times smaller (100 to 150 nucleotides) than in bacteria.

Included in the remainder of DNA-replicating enzymes is Pol γ, which is found exclusively in mitochondria, synthesizing the DNA present in that organelle. Other DNA polymerases are involved in DNA repair and replication through regions of the DNA template that contain damage or distortions.

Replication through Chromatin

One of the major differences between bacterial and eukaryotic DNA is that eukaryotic DNA is complexed with DNA-binding proteins, existing in the cell as chromatin. As we will discuss later in the text (see Chapter 11), chromatin consists of regularly repeating units called nucleosomes, each of which consists of about 200 base pairs of DNA wrapped around eight histone protein molecules. Before DNA polymerases can begin synthesis, nucleosomes and other DNA-binding proteins must be stripped away or otherwise modified to allow the passage of replication proteins. As DNA synthesis proceeds, the histones and non-histone proteins must rapidly reassociate with the newly formed duplexes, reestablishing the characteristic nucleosome pattern. Electron microscopy studies, such as the one shown in Figure 10.14, show that nucleosomes form immediately after new DNA is synthesized at replication forks.

In order to re-create nucleosomal chromatin on replicated DNA, the synthesis of new histone proteins is tightly coupled to DNA synthesis during the S phase of the cell cycle. Research data suggest that nucleosomes are disrupted just ahead of the replication fork and that the preexisting histone proteins can assemble with newly synthesized histone proteins into new nucleosomes. The new nucleosomes are assembled behind the replication fork, onto the two daughter strands of DNA. The assembly of new nucleosomes is carried out by chromatin assembly factors (CAFs) that move along with the replication fork.

ESSENTIAL POINT
DNA replication in eukaryotes is more complex than replication in bacteria, using multiple replication origins, multiple forms of DNA polymerases, and factors that disrupt and assemble nucleosomal chromatin. ■

10.7 Telomeres Solve Stability and Replication Problems at Eukaryotic Chromosome Ends

A final difference between bacterial and eukaryotic DNA synthesis stems from the structural differences in their chromosomes. Unlike the closed, circular DNA of bacteria and most bacteriophages, eukaryotic chromosomes are linear. The presence of linear DNA “ends” on eukaryotic chromosomes creates two potential problems.

The first problem is that the double-stranded “ends” of DNA molecules at the termini of linear chromosomes potentially resemble the double-stranded breaks (DSBs) that can occur when a chromosome becomes fragmented internally as a result of DNA damage. Such double-stranded DNA ends are recognized by the cell’s DNA repair mechanisms that join the “loose ends” together, leading to chromosome fusions and translocations. If the ends do not fuse, they are vulnerable to degradation by nucleases. The second problem occurs during DNA replication, because DNA polymerases cannot synthesize new DNA at the tips of single-stranded 5′-ends.

To deal with these two problems, linear eukaryotic chromosomes end in distinctive sequences called telomeres, as we will describe next.

Telomere Structure and Chromosome Stability

In 1978, Elizabeth Blackburn and Joe Gall reported the presence of unexpected structures at the ends of chromosomes of the ciliated protozoan Tetrahymena. They showed that the protozoan’s chromosome ends consisted of the short sequence 5′-TTGGGG-3′, tandemly repeated from 30 to 60 times. This strand is referred to as the G-rich strand, in contrast to its complementary strand, the so-called C-rich strand, which displays the repeated sequence 5′-AACCCC-3′. Since then, researchers have discovered similar tandemly repeated DNA sequences at the ends of linear chromosomes in most eukaryotes. These repeat regions make up the chromosome’s telomeres. In humans, the telomeric sequence 5′-TTAGGG-3′ is repeated several thousand times, and telomeres can vary in length from 5 to 15 kb. In contrast, yeast telomeres are several hundred base pairs long and mouse telomeres are between 20 and 50 kb long. Since each linear chromosome ends with two DNA strands running antiparallel to one another, one strand has a 3′-ending and the other has a 5′-ending. It is the 3′-strand that is the G-rich one. This has special significance during telomere replication.

Two features of telomeric DNA help to explain how telomeres protect the ends of linear chromosomes. First, at the end of telomeres, a stretch of single-stranded DNA extends


out from the 3′ G-rich strand. This single-stranded tail varies in length between organisms. In Tetrahymena, the tail is between 12 and 16 nucleotides long, whereas in mammals, it varies between 30 and 400 nucleotides long. The 3′-ends of G-rich single-stranded tails are capable of interacting with upstream sequences within the tail, creating loop structures. The loops, called t-loops, resemble those created when you tie your shoelaces into a bow. Second, a complex of six proteins binds and stabilizes telomere t-loops, forming the shelterin complex. It is believed that t-loop structures, in combination with the shelterin proteins, close off the ends of chromosomes and make them resistant to nuclease digestion and DNA fusions. Shelterin proteins also help to recruit telomerase enzymes to telomeres during telomere replication, which we discuss next.

Telomeres and Chromosome End Replication

Now let’s consider the problem that semiconservative replication poses for the ends of double-stranded DNA molecules. As we have learned previously in this chapter, DNA replication initiates from short RNA primers, synthesized on both leading and lagging strands (Figure 10.15). Primers are necessary because DNA polymerase requires a free 3′-OH on which to initiate synthesis. After replication is completed, these RNA primers are removed. The resulting gaps within the new daughter strands are filled by DNA polymerase and sealed by ligase. These internal gaps have free 3′-OH groups available at the ends of the Okazaki fragments for DNA polymerase to initiate synthesis. The problem arises at the gaps left at the 5′-ends of the newly synthesized DNA [gaps (b) and (c) in Figure 10.15]. These gaps cannot be filled by DNA polymerase because no free 3′-OH groups are available for the initiation of synthesis.

Thus, in the situation depicted in Figure 10.15, gaps remain on newly synthesized DNA strands at each successive round of synthesis, shortening the double-stranded ends of the chromosome by the length of the RNA primer. With each round of replication, the shortening becomes more severe in each daughter cell, eventually extending beyond the telomere and potentially deleting gene-coding regions.

FIGURE 10.15 Diagram illustrating the difficulty encountered during the replication of the ends of linear chromosomes. Of the three gaps, (b) and (c) are left following synthesis on both the leading and lagging strands. [Diagram stages: lagging and leading strand templates; discontinuous and continuous synthesis from RNA primers; RNA primers removed, creating gaps (a), (b), and (c); gap (a) filled, gaps (b) and (c) unfilled.]

The solution to this so-called end-replication problem is provided by a unique eukaryotic enzyme called telomerase. Telomerase was first discovered by Elizabeth Blackburn and her graduate student, Carol Greider, in studies of Tetrahymena. As noted earlier, telomeric DNA in eukaryotes consists of many short, repeated nucleotide sequences, with the G-rich strand overhanging in the form of a single-stranded tail. In Tetrahymena the tail contains several repeats of the sequence 5′-TTGGGG-3′. As we will see, telomerase is capable of adding several more repeats of this six-nucleotide sequence to the 3′-end of the G-rich strand. Detailed investigation by Blackburn and Greider of how the Tetrahymena telomerase enzyme accomplishes this synthesis yielded an extraordinary finding. The enzyme is a ribonucleoprotein, containing within its molecular structure a short piece of RNA that is essential to its catalytic activity. The telomerase RNA component (TERC) serves as both a “guide” to proper attachment of the enzyme to the telomere and a “template” for synthesis of its DNA complement. Synthesis of DNA using RNA as a template is called reverse transcription. The telomerase reverse transcriptase (TERT) is the catalytic subunit of the telomerase enzyme. In addition to TERC and TERT, telomerase contains a number of accessory proteins. In Tetrahymena, TERC contains the sequence CAACCCCAA, within which is found the complement of the repeating telomeric DNA sequence that must be synthesized (TTGGGG).

Figure 10.16 shows a model of how researchers envision the enzyme working. Part of the TERC RNA sequence of the enzyme (shown in green) base pairs with the ending


sequence of the single-stranded overhanging DNA, while the remainder of the TERC RNA extends beyond the overhang. Next, the telomerase’s reverse transcription activity synthesizes a stretch of single-stranded DNA using the TERC RNA as a template, thus increasing the length of the 3′-tail. It is believed that the enzyme is then translocated toward the (newly formed) end of the tail, and the same events are repeated, continuing the extension process.

Once the telomere 3′-tail has been lengthened by telomerase, conventional DNA synthesis ensues. Primase lays down a primer near the end of the telomere tail; then DNA polymerase and ligase fill most of the gap [Figure 10.16(d) and (e)]. When the primer is removed, a small gap remains at the end of the telomere. However, this gap is located well beyond the original end of the chromosome, thus preventing any chromosome shortening.

Telomerase function has now been found in all eukaryotes studied, and telomeric DNA sequences have been highly conserved throughout evolution, reflecting the critical function of telomeres.

Telomeres in Disease, Aging, and Cancer

Despite the importance of maintaining telomere length for chromosome integrity, only some cell types express telomerase. In humans, these include embryonic stem cells, some types of adult stem cells, and other cell types that need to divide repeatedly such as epidermal cells and cells of the immune system. In contrast, most normal somatic cells do not express telomerase. As a result, after many cell divisions, somatic cell telomeres become seriously eroded, leading to chromosome damage that can either kill the cell or cause it to cease dividing and enter a state called senescence.

Several rare human diseases have been associated with loss of telomerase activity and abnormally short telomeres. For example, patients with the inherited form of dyskeratosis congenita have mutations in genes encoding telomerase or shelterin subunits. These mutations bring about many different symptoms that are also seen in premature aging, and patients suffer early deaths due to stem cell failure. Many studies show a correlation between telomere length or telomerase activity and common diseases such as diabetes or heart disease.

FIGURE 10.16 The predicted solution to the problem posed in Figure 10.15. The enzyme telomerase (with its TERC RNA component shown in green) directs synthesis of repeated TTGGGG sequences, resulting in the formation of an extended 3′-overhang. This facilitates DNA synthesis on the opposite strand, filling in the gap that would otherwise have been increased in length with each replication cycle.
Panels: (a) Telomerase binds to the 3′ G-rich tail, 5′-GGTTGGGGTTGGGGTTG-3′, which pairs with the TERC template 3′-CAACCCCAA-5′. (b) Telomeric DNA is synthesized on the G-rich tail, giving 5′-GGTTGGGGTTGGGGTTGGGGTT-3′. (c) Telomerase is translocated and steps (a) and (b) are repeated, giving 5′-GGTTGGGGTTGGGGTTGGGGTTGGGGTT-3′. (d) Telomerase is released; a primer is added by primase. (e) The gap is filled by DNA polymerase and sealed by ligase. (f) The primer is removed.
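The extension cycle modeled in Figure 10.16 can be sketched in a few lines of Python. This is a minimal illustration, not a description of the enzyme's kinetics: it uses the Tetrahymena TERC template sequence CAACCCCAA given in the text, written 3′ to 5′ as in the figure, and assumes that the 3′ end of the G-rich tail pairs with the start of the template before the rest of the template is copied. The function names and the `min_pairing` threshold are hypothetical and chosen only for this example.

```python
# Pairing partner (as a DNA base) for each RNA base in the template.
RNA_TO_DNA = {"A": "T", "C": "G", "G": "C", "U": "A"}

# Tetrahymena TERC template region, written 3'->5' as in Figure 10.16.
TERC_TEMPLATE = "CAACCCCAA"


def extend_once(tail, template=TERC_TEMPLATE, min_pairing=3):
    """Return the G-rich tail (5'->3') after one telomerase extension round.

    min_pairing is an illustrative assumption: the fewest tail bases that
    must pair with the template before synthesis starts.
    """
    # DNA that a complete pass along the template would specify, read 5'->3'.
    copy = "".join(RNA_TO_DNA[base] for base in template)
    # Find the longest stretch at the 3' end of the tail that pairs with the
    # start of the template, leaving at least one template base to copy.
    for k in range(len(copy) - 1, min_pairing - 1, -1):
        if tail.endswith(copy[:k]):
            return tail + copy[k:]
    raise ValueError("3' end of the tail cannot pair with the template")


def extend(tail, rounds):
    """Apply several rounds of pairing, copying, and translocation."""
    for _ in range(rounds):
        tail = extend_once(tail)
    return tail


start = "GGTTGGGGTTGGGGTTG"       # tail shown in panel (a)
after_b = extend_once(start)       # compare with panel (b)
after_c = extend_once(after_b)     # compare with panel (c)
print(after_b)
print(after_c)
```

Under these assumptions the first round adds GGGTT and the second adds GGGGTT, reproducing the tail sequences shown in panels (b) and (c). Each later round adds one more TTGGGG-phased repeat, which is the sense in which TERC acts as a reusable template.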


The connection between telomere length and aging has also been the subject of much research and speculation. When telomeres become critically short, a cell may suffer chromosome damage and enter senescence, a state in which cell division ceases and the cell undergoes metabolic changes that cause it to function less efficiently. Some scientists propose that the presence of shortened telomeres in such senescent cells is directly related to changes associated with aging. This topic is discussed in the “Genetics, Ethics, and Society” essay below.

While most somatic cells contain little if any telomerase, shorten their telomeres, and undergo senescence after multiple cell divisions, almost all human cancer cells retain telomerase activity and thus maintain telomere length through multiple cell divisions, suggesting that this may be the key to their “immortality.” Those cancer cells that do not contain telomerase use a different telomere-lengthening method called alternative lengthening of telomeres (ALT). And, in tissue cultures, cells can be transformed into cancer cells by introducing telomerase activity, as long as at least two other types of genes (proto-oncogenes and tumor-suppressor genes) are mutated or abnormally expressed. These findings do not suggest that telomerase activity is in itself sufficient to create a cancer cell. We return to the multistep nature of the genesis of cancer later (see Chapter 19).

Nevertheless, the requirement for telomerase activity in cancer cells suggests that researchers may be able to develop cancer drugs that repress tumor growth by inhibiting telomerase activity. Because most human somatic cells do not express telomerase, such a therapy might be relatively tumor specific and less toxic than many current anticancer drugs. A number of such anti-telomerase drugs are currently being developed, and some have entered phase III clinical trials.

ESSENTIAL POINT
Replication at the ends of linear chromosomes in eukaryotes poses a special problem that can be solved by the presence of telomeres and by a unique RNA-containing enzyme called telomerase. ■
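To get a feel for the scale of the end-replication problem, the arithmetic can be written out as a short Python sketch. It takes the simplified model from this section at face value: without telomerase, each round of replication shortens a chromosome end by the length of one RNA primer (about 10 nucleotides), and human telomeres range from 5 to 15 kb. The function name and the `critical_nt` parameter are hypothetical, introduced only for illustration, and the sketch ignores any other process that changes telomere length.

```python
import math


def rounds_until_eroded(telomere_nt, loss_per_round_nt, critical_nt=0):
    """Count replication rounds before a telomere shrinks to critical_nt.

    Simplified model: a fixed number of nucleotides is lost from the end
    in every round, and nothing (such as telomerase) adds any back.
    """
    if loss_per_round_nt <= 0:
        raise ValueError("loss per round must be positive")
    if telomere_nt <= critical_nt:
        return 0
    return math.ceil((telomere_nt - critical_nt) / loss_per_round_nt)


PRIMER_NT = 10  # approximate RNA primer length given in this chapter

for telomere_kb in (5, 15):  # human telomere length range given in the text
    rounds = rounds_until_eroded(telomere_kb * 1000, PRIMER_NT)
    print(f"{telomere_kb} kb telomere: {rounds} rounds")
```

Under these assumptions a 5-kb telomere would be consumed in 500 rounds and a 15-kb telomere in 1500. The per-round loss is deliberately a parameter, so the same function shows how quickly the reserve shrinks if the loss per round is larger than a single primer.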

GENETICS, ETHICS, AND SOCIETY

Telomeres: The Key to a Long Life?

We humans, like all multicellular organisms, grow old and die. As we age, our immune systems become less efficient, wound healing is impaired, and tissues lose resilience. Why do we go through these age-related declines, and can we reverse the march to mortality? Some researchers suggest that the answers to these questions may lie at the ends of our chromosomes.

Human cells, both those in our bodies and those growing in culture dishes, have a finite life span. When placed into tissue culture dishes, normal human fibroblasts become senescent, losing their ability to grow and divide after about 50 cell divisions. Eventually, they die. Although we don’t know whether cellular senescence directly causes aging in a multicellular organism, the evidence is suggestive. For example, cultured cells derived from young people undergo more divisions than those from older people; cells from short-lived species stop growing after fewer divisions than those from longer-lived species; and cells from patients with premature aging syndromes undergo fewer divisions than those from normal patients.

One of the many characteristics of aging cells involves telomeres. As described in this chapter, most mammalian somatic cells do not contain telomerase activity and telomeres shorten with each DNA replication. Some epidemiological studies show a correlation between telomere length in humans and their life spans. In addition, some common diseases such as cardiovascular disease and some lifestyle factors such as smoking, poor diets, and stress correlate with shorter than average telomere lengths. Despite these correlations, the data linking telomere length and longevity in humans are not consistent and remain the subject of scientific debate.

Telomerase activity has also been correlated with aspects of aging in multicellular organisms. In one study, investigators introduced cloned telomerase genes into normal human cells in culture. The increase in telomerase activity caused the telomeres’ lengths to increase, and the cells continued to grow past their typical senescence point. In another study, researchers created a strain of mice that was defective in the TERT subunit of telomerase. These mice developed extremely short telomeres and showed the classic symptoms of aging, including tissue atrophy, neurodegeneration, and a shortened life span. When the researchers reactivated telomerase function in these prematurely aging adult mice, tissue atrophies and neurodegeneration were reversed and their life spans increased.

Similar studies led to the conclusion that overexpression of telomerase in normal mice also increased their life spans, although it was not clear that telomere lengths were altered. These studies suggest that some of the symptoms that accompany old age in humans might be reversed by activating telomerase genes. However, some scientists still debate whether telomerase activation or telomere lengthening directly cause these effects or may simply accompany other, unknown, causative mechanisms.

YOUR TURN

Take time, individually or in groups, to answer the following questions. Investigate the references and links to help you understand some of the research and ethical questions that surround the links between telomeres and aging.

1. The connection between telomeres and aging has been of great interest to both


scientists and the public. Studies have used model organisms, cultured cells, and data from epidemiological surveys to try to determine a correlation. Although these studies suggest links between telomeres and aging, the conclusions from these studies have also been the subject of debate. How would you assess the current status of research on the link between telomeres and human aging?

Begin your exploration of current telomere research with two reviews that draw differing conclusions from the same data: Bär, C. and Blasco, M. A. (2016). Telomeres and telomerase as therapeutic targets to prevent and treat age-related diseases. F1000Res. 5:89, and Simons, M. J. P. (2015). Questioning causal involvement of telomeres in aging. Ageing Res. Rev. 24:191–196.

2. A large number of private companies now offer telomere length diagnostic tests and telomere lengthening supplements to the public. Discuss the ethics of offering such tests and supplements when some scientists argue that it is too early to sufficiently understand the mechanisms that may (or may not) link telomere length or telomerase activity with aging. What are the potential benefits and harms of these tests and treatments? Would you purchase these tests or supplements? Why or why not?

This topic is discussed in Leslie, M. (2011). Are telomere tests ready for prime time? Science 332:414–415.

CASE STUDY At loose ends

Dyskeratosis congenita (DKC) is a rare human genetic disorder affecting telomere replication. Mutations in the genes encoding the protein or RNA subunits of telomerase result in very short telomeres. DKC symptoms include bone marrow failure (reduced production of blood cells) and anemia. If symptoms are severe, a bone marrow transplant may be the only form of effective treatment. In one case, clinicians recommended that a 27-year-old woman with a dominant form of DKC undergo a bone marrow transplant to treat the disorder. Her four siblings were tested, and her 13-year-old brother was identified as the best immunologically matched donor. However, before being tested, he was emphatic that he did not want to know if he had DKC. During testing, it was discovered that he had unusually short telomeres and would most likely develop symptoms of DKC.

1. Why might mutations in genes encoding telomerase subunits lead to bone marrow failure?

2. Although the brother is an immunologically matched donor for his sister, it would be unethical for the clinicians to transplant bone marrow from the brother to the sister. Why?

3. The clinicians are faced with another ethical dilemma. How can they respect the brother’s desire to not know if he has DKC while also revealing that he is not a suitable donor for his sister? In addition, what should the clinicians tell the sister about her brother?

See Denny, C., et al. (2008). All in the family: Disclosure of “unwanted” information to an adolescent to benefit a relative. Am. J. Med. Genet. 146A(21):2719–2724.

INSIGHTS AND SOLUTIONS

1. Predict the theoretical results of conservative and dispersive replication of DNA under the conditions of the Meselson–Stahl experiment. Follow the results through two generations of replication after cells have been shifted to a 14N-containing medium, using the following sedimentation pattern.

[Figure: sedimentation pattern along a density axis, with band positions labeled 14N/14N, 15N/14N, and 15N/15N.]

Solution:

[Figure: predicted band patterns in Generation I and Generation II for conservative replication and for dispersive replication.]

2. Mutations in the dnaA gene of E. coli are lethal and can only be studied following the isolation of conditional, temperature-sensitive mutations. Such mutant strains grow nicely and replicate their DNA at the permissive temperature of 18°C, but they do not grow or replicate their DNA at the restrictive temperature of 37°C. Two observations were useful in determining the function of the DnaA protein product. First, in vitro studies using DNA templates that have unwound do not require the DnaA protein. Second, if intact cells are grown at 18°C and are then shifted to 37°C, DNA synthesis continues at this temperature until one round of replication is completed and then stops. What do these observations suggest about the role of the dnaA gene product?

Solution: At 18°C (the permissive temperature), the mutation is not expressed and DNA synthesis begins. Following the shift to the restrictive temperature, the already initiated DNA synthesis continues, but no new synthesis can begin. Because the DnaA protein is not required for synthesis of unwound DNA, these observations suggest that, in vivo, the DnaA protein plays an essential role in DNA synthesis by interacting with the intact helix and somehow facilitating the localized denaturation necessary for synthesis to proceed.
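For Problem 1, the expected band positions can also be worked out with a short Python sketch. It is a bookkeeping model only, under stated assumptions: each duplex is represented by the 15N fraction of its two strands, all newly made DNA is pure 14N, and dispersive replication is idealized as splitting the parental material evenly between both daughter duplexes. The function names and the three mode labels are hypothetical, chosen for this illustration.

```python
from collections import Counter
from fractions import Fraction


def replicate(duplexes, mode):
    """Return the daughter duplexes produced by one round in 14N medium.

    Each duplex is a pair (a, b): the 15N fraction of each of its strands.
    """
    light = Fraction(0)
    daughters = []
    for a, b in duplexes:
        if mode == "semiconservative":
            # Each old strand pairs with a newly made light strand.
            daughters += [(a, light), (b, light)]
        elif mode == "conservative":
            # The parental duplex stays intact; the copy is entirely new.
            daughters += [(a, b), (light, light)]
        elif mode == "dispersive":
            # Old material is spread evenly over all four daughter strands.
            share = (a + b) / 4
            daughters += [(share, share), (share, share)]
        else:
            raise ValueError(f"unknown mode: {mode}")
    return daughters


def band_pattern(mode, generations):
    """Map each duplex 15N fraction to the share of duplexes in that band."""
    duplexes = [(Fraction(1), Fraction(1))]  # fully 15N-labeled start
    for _ in range(generations):
        duplexes = replicate(duplexes, mode)
    counts = Counter((a + b) / 2 for a, b in duplexes)
    total = len(duplexes)
    return {density: Fraction(n, total) for density, n in counts.items()}


for mode in ("conservative", "dispersive", "semiconservative"):
    for generation in (1, 2):
        print(mode, generation, band_pattern(mode, generation))
```

A 15N fraction of 1 corresponds to the 15N/15N band, 1/2 to the 15N/14N hybrid band, and 0 to the 14N/14N band. In this model, conservative replication never produces a hybrid band (half heavy and half light after one generation, one quarter heavy and three quarters light after two), while dispersive replication gives a single band that moves from the hybrid position to a point one quarter of the way up from the light position.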


PROBLEMS AND DISCUSSION QUESTIONS

Mastering Genetics: Visit for instructor-assigned tutorials and problems.

1. HOW DO WE KNOW? In this chapter, we focused on how DNA is replicated and synthesized. In particular, we elucidated the general mechanism of replication and described how DNA is synthesized when it is copied. Based on your study of these topics, answer the following fundamental questions:
(a) What is the experimental basis for concluding that DNA replicates semiconservatively in both bacteria and eukaryotes?
(b) How was it demonstrated that DNA synthesis occurs under the direction of DNA polymerase III and not polymerase I?
(c) How do we know that in vivo DNA synthesis occurs in the 5′ to 3′ direction?
(d) How do we know that DNA synthesis is discontinuous on one of the two template strands?
(e) What observations reveal that a “telomere problem” exists during eukaryotic DNA replication, and how did we learn of the solution to this problem?

2. CONCEPT QUESTION Review the Chapter Concepts list on p. 182. These are concerned with the replication and synthesis of DNA. Write a short essay that distinguishes between the terms replication and synthesis, as applied to DNA. Which of the two is most closely allied with the field of ?

3. Compare conservative, semiconservative, and dispersive modes of DNA replication.

4. Describe the role of 15N in the Meselson–Stahl experiment.

5. Predict the results of the experiment by Taylor, Woods, and Hughes if replication were (a) conservative and (b) dispersive.

6. Reconsider Problem 30 in Chapter 9. In the model you proposed, could the molecule be replicated semiconservatively? Why? Would other modes of replication work?

7. What are the requirements for in vitro synthesis of DNA under the direction of DNA polymerase I?

8. How did Kornberg assess the fidelity of DNA polymerase I in copying a DNA template?

9. Which characteristics of DNA polymerase I raised doubts that its in vivo function is the synthesis of DNA leading to complete replication?

10. Kornberg showed that nucleotides are added to the 3′-end of each growing DNA strand. In what way does an exposed 3′-OH group participate in strand elongation?

11. What was the significance of the polA1 mutation?

12. Summarize and compare the properties of DNA polymerase I, II, and III.

13. List and describe the function of the ten subunits constituting DNA polymerase III. Distinguish between the holoenzyme and the core enzyme.

14. Distinguish between (a) unidirectional and bidirectional synthesis, and (b) continuous and discontinuous synthesis of DNA.

15. List the proteins that unwind DNA during in vivo DNA synthesis. How do they function?

16. Define and indicate the significance of (a) Okazaki fragments, (b) DNA ligase, and (c) primer RNA during DNA replication.

17. Outline the current model for DNA synthesis.

18. Why is DNA synthesis expected to be more complex in eukaryotes than in bacteria? How is DNA synthesis similar in the two types of organisms?

19. If the analysis of DNA from two different microorganisms demonstrated very similar base compositions, are the DNA sequences of the two organisms also nearly identical?

20. Several temperature-sensitive mutant strains of E. coli display the following characteristics. Predict what enzyme or function is being affected by each mutation.
(a) Newly synthesized DNA contains many mismatched base pairs.
(b) Okazaki fragments accumulate, and DNA synthesis is never completed.
(c) No initiation occurs.
(d) Synthesis is very slow.
(e) Supercoiled strands remain after replication, which is never completed.

21. Many of the gene products involved in DNA synthesis were initially defined by studying mutant E. coli strains that could not synthesize DNA.
(a) The dnaE gene encodes the α subunit of DNA polymerase III. What effect is expected from a mutation in this gene? How could the mutant strain be maintained?
(b) The dnaQ gene encodes the ε subunit of DNA polymerase. What effect is expected from a mutation in this gene?

22. Assume a hypothetical organism in which DNA replication is conservative. Design an experiment similar to that of Taylor, Woods, and Hughes that will unequivocally establish this fact. Using the format established in Figure 10.5, draw sister chromatids and illustrate the expected results depicting this mode of replication.

23. Describe the “end-replication problem” in eukaryotes. How is it resolved?

24. In 1994, telomerase activity was discovered in human cancer cell lines. Although telomerase is not active in human somatic tissue, these cells do contain the genes for telomerase proteins and telomerase RNA. Since inappropriate activation of telomerase may contribute to cancer, why do you think the genes coding for this enzyme have been maintained in the human genome throughout evolution? Are there any types of human body cells where telomerase activation would be advantageous or even necessary? Explain.

11 Chromosome Structure and DNA Sequence Organization

CHAPTER CONCEPTS

■ Genetic information in viruses, bacteria, mitochondria, and chloroplasts is most often contained in a short, circular DNA molecule, relatively free of associated proteins.
■ Eukaryotic cells, in contrast to viruses and bacteria, contain relatively large amounts of DNA organized into nucleosomes and present during most of the cell cycle as chromatin fibers.
■ Uncoiled chromatin fibers characteristic of interphase coil up and condense into chromosomes during eukaryotic cell division.
■ Eukaryotic genomes are characterized by both unique and repetitive DNA sequences.
■ Eukaryotic genomes consist mostly of noncoding DNA sequences.

[Chapter-opening image: A chromatin fiber viewed using a scanning transmission electron microscope (STEM).]

Once geneticists understood that DNA houses genetic information, it became very important to determine how DNA is organized into genes and how these basic units of genetic function are organized into chromosomes. In short, the major question had to do with how the genetic material was organized as it makes up the genome of organisms. There has been much interest in this question because knowledge of the organization of the genetic material and associated molecules is important to understanding many other areas of genetics. For example, the way in which the genetic information is stored, expressed, and regulated must be related to the molecular organization of the genetic molecule DNA.

In this chapter, we focus on the various ways DNA is organized into chromosomes. These structures have been studied using numerous techniques, instruments, and approaches, including analysis by light microscopy and electron microscopy. More recently, molecular analysis has provided significant insights into chromosome organization. In the first half of the chapter, after surveying what we know about chromosomes in viruses and bacteria, we examine the large specialized eukaryotic structures called polytene and lampbrush chromosomes. Then, in the second half, we discuss how eukaryotic chromosomes are organized at the molecular level: for example, how DNA is complexed with proteins to form chromatin and how the chromatin fibers characteristic of interphase are condensed into chromosome structures visible during mitosis and meiosis. We conclude the chapter by examining certain aspects of DNA sequence organization in eukaryotic genomes.


11.1 Viral and Bacterial Chromosomes Are Relatively Simple DNA Molecules

The chromosomes of viruses and bacteria are much less complicated than those of eukaryotes. They usually consist of a single nucleic acid molecule, unlike the multiple chromosomes comprising the genome of higher forms. Compared to eukaryotes, the chromosomes contain much less genetic information and the DNA is not as extensively bound to proteins. These characteristics have greatly simplified analysis, and we now have a fairly comprehensive view of the structure of viral and bacterial chromosomes.

The chromosomes of viruses consist of a nucleic acid molecule, either DNA or RNA, that can be either single- or double-stranded. They can exist as circular structures (closed loops), or they can take the form of linear molecules. For example, the single-stranded DNA of the φX174 bacteriophage and the double-stranded DNA of the polyoma virus are closed loops housed within the protein coat of the mature viruses. The bacteriophage lambda (λ), on the other hand, possesses a linear double-stranded DNA molecule prior to infection, which closes to form a ring upon its infection of the host cell. Still other viruses, such as the T-even series of bacteriophages, have linear double-stranded chromosomes of DNA that do not form circles inside the bacterial host. Thus, circularity is not an absolute requirement for replication in viruses.

Viral nucleic acid molecules have been seen with the electron microscope. Figure 11.1 shows a mature bacteriophage λ and its double-stranded DNA molecule in the circular configuration. One constant feature shared by viruses, bacteria, and eukaryotic cells is the ability to package an exceedingly long DNA molecule into a relatively small volume. In λ, the DNA is 17 μm long and must fit into the phage head, which is less than 0.1 μm on any side. Table 11.1 compares the length of the chromosomes of several viruses to the size of their head structure. In each case, a similar packaging feat must be accomplished. Compare the dimensions given for phage T2 with the micrograph of both the DNA and the viral particle shown in Figure 11.2. Seldom does the space available in the head of a virus exceed the chromosome volume by more than a factor of two. In many cases, almost all of the space is filled, indicating nearly perfect packing. Once packed within the head, the genetic material is functionally inert until it is released into a host cell.

Bacterial chromosomes are also relatively simple in form. They generally consist of a double-stranded DNA molecule, compacted into a structure sometimes referred to as the nucleoid. Escherichia coli, the most extensively studied bacterium, has a large circular chromosome measuring approximately 1200 μm (1.2 mm) in length that may occupy up to one-third of the volume of the cell. When the cell is gently lysed and the chromosome is released, it can be visualized under the electron microscope (Figure 11.3).

DNA in bacterial chromosomes is found to be associated with several types of DNA-binding proteins. Two, called HU and H-NS (Histone-like Nucleoid Structuring) proteins, are small but abundant in the cell and contain a high percentage of positively charged amino acids that can bond ionically to the negative charges of the phosphate groups in DNA. These proteins function to fold and bend DNA. As such, coils are created that have the effect of compacting the DNA constituting the nucleoid. Additionally, H-NS proteins, like histones in eukaryotes, have been implicated in regulating gene activity in a nonspecific way.

FIGURE 11.1 Electron micrographs of phage λ (a) and the DNA that was isolated from it (b). The chromosome is 17 μm long. Note that the phages are magnified about five times more than the DNA.


TABLE 11.1 The Genetic Material of Representative Viruses and Bacteria

Organism                  Nucleic acid type   SS or DS*   Length (μm)   Overall size of viral head or bacterium (μm)
Viruses
  φX174                   DNA                 SS          2.0           0.025 × 0.025
  Tobacco mosaic virus    RNA                 SS          3.3           0.30 × 0.02
  Phage λ                 DNA                 DS          17.0          0.07 × 0.07
  T2 phage                DNA                 DS          52.0          0.07 × 0.10
Bacteria
  Haemophilus influenzae  DNA                 DS          832.0         1.00 × 0.30
  Escherichia coli        DNA                 DS          1200.0        2.00 × 0.50

*SS = single-stranded, DS = double-stranded.
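To make the packaging problem in Table 11.1 concrete, the short Python sketch below divides each nucleic acid length by the longest dimension of the head or cell that contains it. The numbers are copied from the table; the dictionary layout and the function name `length_ratio` are illustrative choices, not part of the original text, and the ratio is only a rough one-dimensional measure rather than a volume calculation.

```python
# Values copied from Table 11.1; all lengths are in micrometers.
# Each entry: (nucleic acid length, (dimension 1, dimension 2) of head or cell).
GENOMES = {
    "phiX174": (2.0, (0.025, 0.025)),
    "Tobacco mosaic virus": (3.3, (0.30, 0.02)),
    "Phage lambda": (17.0, (0.07, 0.07)),
    "T2 phage": (52.0, (0.07, 0.10)),
    "Haemophilus influenzae": (832.0, (1.00, 0.30)),
    "Escherichia coli": (1200.0, (2.00, 0.50)),
}


def length_ratio(name):
    """Nucleic acid length divided by the longest dimension of its container."""
    length_um, dims_um = GENOMES[name]
    return length_um / max(dims_um)


if __name__ == "__main__":
    for name in GENOMES:
        print(f"{name}: nucleic acid is about {length_ratio(name):.0f} times "
              "longer than its container")
```

By this measure the T2 chromosome is roughly 520 times longer than the phage head, and the E. coli chromosome roughly 600 times longer than the cell.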

ESSENTIAL POINT
In contrast to eukaryotes, bacteriophage and bacterial chromosomes are largely devoid of associated proteins, are of much smaller size, and most often consist of circular DNA.

NOW SOLVE THIS
11.1 In bacteriophages and bacteria, the DNA is almost always organized into circular (closed loops) chromosomes. Phage λ is an exception, maintaining its DNA in a linear chromosome within the viral particle. However, as soon as this DNA is injected into a host cell, it circularizes before replication begins. What advantage exists in replicating circular DNA molecules compared to linear molecules, characteristic of eukaryotic chromosomes?
HINT: This problem involves an understanding of eukaryotic DNA replication, as discussed in Chapter 10. The key to its solution is to consider why the enzyme telomerase is essential in eukaryotic DNA replication, and why bacterial and viral chromosomes can be replicated without encountering the “telomere problem.”

11.2 Mitochondria and Chloroplasts Contain DNA Similar to Bacteria and Viruses

That both mitochondria and chloroplasts contain their own DNA and a system for expressing genetic information was first suggested by the discovery of mutations and the resultant inheritance patterns in plants, yeast, and other fungi. Because both mitochondria and chloroplasts are inherited through the maternal cytoplasm in most organisms, and because each of the above-mentioned examples of mutations could be linked hypothetically to the altered function of either chloroplasts or mitochondria, geneticists set out to look for more direct evidence of DNA in these organelles. Not only was unique DNA found to be a normal component of both mitochondria and chloroplasts, but careful examination of the nature of this genetic information revealed a remarkable similarity to that found in viruses and bacteria.

FIGURE 11.2 Electron micrograph of bacteriophage T2, which has had its DNA released by osmotic shock. The chromosome is 52 μm long.

FIGURE 11.3 Electron micrograph of the bacterium E. coli, which has had its DNA released by osmotic shock. The chromosome is 1200 μm long.


Molecular Organization and Gene Products of Mitochondrial DNA

Extensive information is also available concerning the structure and gene products of mitochondrial DNA (mtDNA). In most eukaryotes, mtDNA exists as a double-stranded, closed circle (Figure 11.4) that is free of the chromosomal proteins characteristic of eukaryotic chromosomal DNA. An exception is found in some ciliated protozoans, in which the DNA is linear.

In size, mtDNA varies greatly among organisms. In a variety of animals, including humans, mtDNA consists of about 16,000 to 18,000 bp (16 to 18 kb). However, yeast (Saccharomyces) mtDNA consists of 75 kb. Plants typically exceed this amount; 367 kb is present in mitochondria in the mustard plant, Arabidopsis. Vertebrates have 5 to 10 such DNA molecules per organelle, whereas plants have 20 to 40 copies per organelle.

There are several other noteworthy aspects of mtDNA. With only rare exceptions, introns (noncoding regions within genes) are absent from mitochondrial genes, and gene repetitions are seldom present. Nor is there usually much in the way of intergenic spacer DNA. This is particularly true in species whose mtDNA is fairly small in size, such as humans. In Saccharomyces, with a much larger mtDNA molecule, introns and intergenic spacer DNA account for much of the excess DNA. As will be discussed in Chapter 12, the expression of mitochondrial genes uses several modifications of the otherwise standard genetic code. Also of interest is the fact that replication in mitochondria is dependent on enzymes encoded by nuclear DNA.

Another interesting observation is that in vertebrate mtDNA, the two strands vary in density, as revealed by centrifugation. This provides researchers with a way to isolate the strands for study, designating one heavy (H) and the other light (L). While most of the mitochondrial genes are encoded by the H strand, several are encoded by the complementary L strand.

Molecular Organization and Gene Products of Chloroplast DNA

Chloroplasts provide the photosynthetic function specific to plants. Like mitochondria, they contain an autonomous genetic system distinct from that found in the nucleus and cytoplasm, which has as its foundation a unique DNA molecule (cpDNA). Chloroplast DNA, shown in Figure 11.5, is fairly uniform in size among different organisms, ranging between 100 and 225 kb in length. It shares many similarities to DNA found in bacterial cells. It is circular and double stranded, and it is free of the associated proteins characteristic of eukaryotic DNA.

The size of cpDNA is much larger than that of mtDNA. To some extent, this can be accounted for by a larger number of genes. However, most of the difference appears to be due to the presence in cpDNA of many long noncoding nucleotide sequences both between and within genes, the latter being called introns. Duplications of many DNA sequences are also present.

In the green alga Chlamydomonas, there are about 75 copies of the chloroplast DNA molecule per organelle. In higher plants, such as the sweet pea, multiple copies of the DNA molecule are also present in each organelle, but the molecule is considerably smaller (134 kb) than that in Chlamydomonas (195 kb).

ESSENTIAL POINT
Mitochondria and chloroplasts contain DNA that is remarkably similar in form and appearance to some bacterial and bacteriophage DNA.

FIGURE 11.4 Electron micrograph of mitochondrial DNA (mtDNA) derived from Xenopus laevis.

FIGURE 11.5 Electron micrograph of chloroplast DNA (cpDNA) derived from lettuce.


11.3 Specialized Chromosomes Reveal Variations in the Organization of DNA

We now consider two cases of genetic organization that demonstrate the specialized forms that eukaryotic chromosomes can take. Both types, polytene chromosomes and lampbrush chromosomes, are so large that their organization was discerned using light microscopy long before we understood how mitotic chromosomes form from interphase chromatin. The study of these chromosomes provided many of our initial insights into the arrangement and function of the genetic information. It is important to note that polytene and lampbrush chromosomes are unusual and not typically found in most eukaryotic cells, but the study of their structure has revealed many common themes of chromosome organization.

Polytene Chromosomes

Giant polytene chromosomes are found in various tissues (salivary, midgut, rectal, and malpighian excretory tubules) in the larvae of some flies and in several species of protozoans and plants. Such structures, first observed by E. G. Balbiani in 1881, provided a model system for subsequent investigations of chromosomes. What is particularly intriguing about polytene chromosomes is that they can be seen in the nuclei of interphase cells.

Each polytene chromosome is 200 to 600 μm long, and when they are observed under the light microscope, they reveal a linear series of alternating bands and interbands (Figure 11.6). The banding pattern is distinctive for each chromosome in any given species. Individual bands are sometimes called chromomeres, a generalized term describing lateral condensations of material along the axis of a chromosome.

FIGURE 11.6 Polytene chromosomes derived from larval salivary gland cells of Drosophila.

Extensive study using electron microscopy and radioactive tracers led to an explanation for the unusual appearance of these chromosomes. First, polytene chromosomes represent paired homologs. This is highly unusual because they are present in somatic cells, where in most organisms, chromosomal material is normally dispersed as chromatin and homologs are not paired. Second, their large size and distinctiveness result from the many DNA strands that compose them. The DNA of these paired homologs undergoes many rounds of replication, but without strand separation or cytoplasmic division. As replication proceeds, chromosomes contain 1000 to 5000 DNA strands that remain in precise parallel alignment with one another, giving rise to the distinctive band pattern along the axis of the chromosome.

The presence of bands on polytene chromosomes was initially interpreted as the visible manifestation of individual genes. The discovery that the strands present in bands undergo localized uncoiling during genetic activity further strengthened this view. Each such uncoiling event results in what is called a puff because of its appearance (Figure 11.7). That puffs are visible manifestations of gene activity (transcription that produces RNA) is evidenced by their high rate of incorporation of radioactively labeled RNA precursors, as assayed by autoradiography. Bands that are not extended into puffs incorporate fewer radioactive precursors or none at all.

The study of bands during development in insects such as Drosophila and the midge fly Chironomus reveals differential gene activity. A characteristic pattern of band formation that is equated with gene activation is observed

FIGURE 11.7 Photograph of a puff within a polytene chromosome. The diagram depicts the uncoiling of strands within a band (B) region to produce a puff (P) in polytene chromosomes. Interband regions (IB) are also labeled.


as development proceeds. Despite attempts to resolve the issue, it is not yet clear how many genes are contained in each band. However, we do know that in Drosophila, which contains about 15,000 genes, there are approximately 5000 bands. Interestingly, a band may contain up to 10⁷ base pairs of DNA, enough to encode 50 to 100 average-size genes.

NOW SOLVE THIS
11.2 After salivary gland cells from Drosophila are isolated and cultured in the presence of radioactive thymidylic acid, autoradiography is performed, revealing the presence of thymidine within polytene chromosomes. Predict the distribution of the grains along the chromosomes.
HINT: This problem involves an understanding of the organization of DNA in polytene chromosomes. The key to its solution is to be aware that ³H-thymidine, as a molecular tracer, will only be incorporated into DNA during its replication.

Lampbrush Chromosomes

Another specialized chromosome that has given us insight into chromosomal structure is the lampbrush chromosome, so named because it resembles the brushes used to clean kerosene-lamp chimneys in the nineteenth century. Lampbrush chromosomes were first studied in detail in 1892 in the oocytes of sharks and are now known to be characteristic of most vertebrate oocytes as well as the spermatocytes of some insects. Therefore, they are meiotic chromosomes. Most experimental work has been done with material taken from amphibian oocytes.

These unique chromosomes are easily isolated from oocytes in the first prophase stage of meiosis, where they are active in directing the metabolic activities of the developing cell. The homologs are seen as synapsed pairs held together by chiasmata. However, instead of condensing, as most meiotic chromosomes do, lampbrush chromosomes often extend to lengths of 500 to 800 μm. Later in meiosis, they revert to their normal length of 15 to 20 μm. Based on these observations, lampbrush chromosomes are interpreted as extended, uncoiled versions of the normal meiotic chromosomes.

The two views of lampbrush chromosomes in Figure 11.8 provide significant insights into their morphology. Part (a) shows the meiotic configuration under the light microscope. The linear axis of each structure contains a large number of condensed areas, and as with polytene chromosomes, these are referred to as chromomeres. Emanating from each chromomere is a pair of lateral loops, which give the chromosome its distinctive appearance. In part (b), the scanning electron micrograph (SEM) reveals adjacent loops present along one of the two axes of the chromosome. As with bands in polytene chromosomes, much more DNA is present in each loop than is needed to encode a single gene. Such an SEM provides a clear view of the chromomeres and the chromosomal fibers emanating from them. Each chromosomal loop is thought to be composed of one DNA double helix, while the central axis is made up of two DNA helices. This hypothesis is consistent with the belief that each meiotic chromosome is composed of a pair of sister chromatids. Studies using radioactive RNA precursors reveal that the loops are active in the synthesis of RNA. The lampbrush loops, in a manner similar to puffs in polytene chromosomes, represent DNA that has been reeled out from the central chromomere axis during transcription.

ESSENTIAL POINT
Polytene and lampbrush chromosomes are examples of specialized structures that extended our knowledge of genetic organization and function well in advance of the technology available to the modern-day molecular biologist.

FIGURE 11.8 Lampbrush chromosomes derived from amphibian oocytes. Part (a) is a photomicrograph, with a chiasma labeled; part (b) is a scanning electron micrograph, with the loops and the central axis with chromomeres labeled.
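The polytene figures quoted earlier in this section (1000 to 5000 aligned DNA strands, and about 15,000 genes spread over roughly 5000 bands in Drosophila) lend themselves to a quick back-of-the-envelope check. The Python sketch below assumes, purely for illustration, that each round of replication without separation doubles the number of strands, starting from a single strand; the text does not state the starting count, and the function names are hypothetical.

```python
import math


def strands_after(rounds):
    """Strand count after `rounds` doublings, starting from one strand (assumption)."""
    return 2 ** rounds


def rounds_needed(strands):
    """Number of doublings needed to reach `strands`, starting from one strand."""
    return math.log2(strands)


if __name__ == "__main__":
    for strands in (1000, 5000):  # range of strand counts given in the text
        print(f"{strands} strands take about {rounds_needed(strands):.1f} doublings")

    # Average genes per band from the Drosophila figures given in the text.
    genes, bands = 15_000, 5_000
    print(f"Average genes per band: {genes / bands:.0f}")
```

Under that assumption, 1000 to 5000 strands correspond to roughly 10 to 12 rounds of replication, and the Drosophila figures average out to about three genes per band, far fewer than the 50 to 100 genes that the largest bands could in principle encode.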


11.4 DNA Is Organized into Chromatin in Eukaryotes

We now turn our attention to the way DNA is organized in eukaryotic chromosomes. Our focus will be on eukaryotic cells, in which chromosomes are visible only during mitosis. After chromosome separation and cell division, cells enter the interphase stage of the cell cycle, during which time the components of the chromosome uncoil and are present in the form referred to as chromatin. While in interphase, the chromatin is dispersed in the nucleus, and the DNA of each chromosome is replicated. As the cell cycle progresses, most cells reenter mitosis, whereupon chromatin coils into visible chromosomes once again. This condensation represents a length contraction of some 10,000 times for each chromatin fiber.

The organization of DNA during the transitions just described is much more intricate and complex than in viruses or bacteria, which never exhibit a process similar to mitosis. This is due to the greater amount of DNA per chromosome, as well as the presence of a large number of proteins associated with eukaryotic DNA. For example, while DNA in the E. coli chromosome is 1200 μm long, the DNA in each human chromosome ranges from 19,000 to 73,000 μm in length. In a single human nucleus, all 46 chromosomes contain sufficient DNA to extend to more than 2 meters. This genetic material, along with its associated proteins, is contained within a nucleus that usually measures about 5 to 10 μm in diameter.

Chromatin Structure and Nucleosomes

As we have seen, the genetic material of viruses and bacteria consists of strands of DNA or RNA that are nearly devoid of proteins. In eukaryotic chromatin, a substantial amount of protein is associated with the chromosomal DNA in all phases of the eukaryotic cell cycle. The associated proteins are divided into basic, positively charged histones and less positively charged nonhistones. The histones clearly play the most essential structural role of all the proteins associated with DNA. There are five types, and they all contain large amounts of the positively charged amino acids lysine and arginine. This makes it possible for them to bond electrostatically to the negatively charged phosphate groups of nucleotides in DNA. Recall that a similar interaction has been proposed for several bacterial proteins.

The general model for chromatin structure is based on the assumption that chromatin fibers, composed of DNA and protein, undergo extensive coiling and folding as they are condensed within the cell nucleus. X-ray diffraction studies confirm that histones play an important role in chromatin structure. Chromatin produces regularly spaced diffraction rings, suggesting that repeating structural units occur along the chromatin axis. If the histone molecules are chemically removed from chromatin, the regularity of this diffraction pattern is disrupted.

FIGURE 11.9 An electron micrograph revealing nucleosomes appearing as “beads on a string” along chromatin strands derived from Drosophila melanogaster.

A basic model for chromatin structure was worked out in the mid-1970s. Several observations were particularly relevant to the development of this model:

1. Digestion of chromatin by certain endonucleases, such as micrococcal nuclease, yields DNA fragments that are approximately 200 bp in length or multiples thereof. This demonstrates that enzymatic digestion is not random, for if it were, we would expect a wide range of fragment sizes. Thus, chromatin consists of some type of repeating unit, each of which is protected from enzymatic cleavage, except where any two units are joined. It is the area between units that is attacked and cleaved by the endonuclease.

2. Electron microscopic observations of chromatin reveal that chromatin fibers are composed of linear arrays of spherical particles (Figure 11.9). Discovered by Ada and Donald Olins, the particles occur regularly along the axis of a chromatin strand and resemble beads on a string. This conforms nicely to the earlier observation, which suggests the existence of repeating units. These particles, initially referred to as ν-bodies (ν is the Greek letter nu), are now called nucleosomes.


3. Studies of precise interactions of histone molecules and DNA in the nucleosomes constituting chromatin show that histones H2A, H2B, H3, and H4 occur as two types of tetramers, (H2A)₂·(H2B)₂ and (H3)₂·(H4)₂. Roger Kornberg predicted that each repeating nucleosome unit consists of one of each tetramer (creating an octamer) in association with about 200 bp of DNA. Such a structure is consistent with previous observations and provides the basis for a model that explains the interaction of histones and DNA in chromatin.

4. When nuclease digestion time is extended, some of the 200 bp of DNA are removed from the nucleosome, creating a nucleosome core particle consisting of 147 bp. The DNA lost in this prolonged digestion is responsible for linking nucleosomes together. This linker DNA is associated with the fifth histone, H1.

On the basis of this information, as well as on X-ray and neutron-scattering analyses of crystallized core particles by John T. Finch, Aaron Klug, and others, a detailed model of the nucleosome was put forward in 1984, providing a basis for predicting chromatin structure and its condensation into chromosomes. In this model, illustrated in Figure 11.10, a 147-bp length of the 2-nm-diameter DNA molecule coils around an octamer of histones in a left-handed superhelix that completes about 1.7 turns per nucleosome. Each nucleosome, ellipsoidal in shape, measures about 11 nm at its longest point [Figure 11.10(a)]. Significantly, the formation of the nucleosome represents the first level of packing, whereby the DNA helix is reduced to about one-third of its original length by winding around the histones.

In the nucleus, the chromatin fiber seldom, if ever, exists in the extended form described in the previous paragraph (that is, as an extended chain of nucleosomes). Instead,

FIGURE 11.10 General model of the association of histones and DNA to form nucleosomes, illustrating the way in which each thickness of fiber may be coiled into a more condensed structure, ultimately producing a metaphase chromosome. Panels: (a) nucleosomes (6-nm × 11-nm flat disc), each a histone octamer plus 147 base pairs of DNA (2-nm diameter); (b) 30-nm fiber, with spacer DNA plus H1 histone; (c) chromatin fiber (300-nm diameter) with looped domains; (d) metaphase chromosome, with each chromatid 700 nm in diameter (1400 nm for the pair).


the 11-nm-diameter fiber is further packed into a thicker, 30-nm-diameter structure that was initially called a solenoid [Figure 11.10(b)]. This thicker structure, which is dependent on the presence of histone H1, consists of numerous nucleosomes coiled around and stacked upon one another, creating a second level of packing. This provides a six-fold increase in compaction of the DNA. It is this structure that is characteristic of an uncoiled chromatin fiber in interphase of the cell cycle. In the transition to the mitotic chromosome, still further compaction must occur. The 30-nm structures are folded into a series of looped domains, which further condense the chromatin fiber into a structure that is 300 nm in diameter [Figure 11.10(c)]. These coiled chromatin fibers are then compacted into the chromosome arms that constitute a chromatid, one of the longitudinal subunits of the metaphase chromosome [Figure 11.10(d)]. While Figure 11.10 shows the chromatid arms to be 700 nm in diameter, this value undoubtedly varies among different organisms. At a value of 700 nm, a pair of sister chromatids comprising a chromosome measures about 1400 nm.

The importance of the organization of DNA into chromatin and chromatin into mitotic chromosomes can be illustrated by considering a human cell that stores its genetic material in a nucleus that is about 5 to 10 μm in diameter. The haploid genome contains 3.2 × 10⁹ base pairs of DNA distributed among 23 chromosomes. The diploid cell contains twice that amount. At 0.34 nm per base pair, this amounts to an enormous length of DNA (as stated earlier, to more than 2 m). One estimate is that the DNA inside a typical human nucleus is complexed with roughly 2.5 × 10⁷ nucleosomes.

In the overall transition from a fully extended DNA helix to the extremely condensed status of the mitotic chromosome, a packing ratio (the ratio of DNA length to the length of the structure containing it) of about 500 to 1 must be achieved. In fact, our model accounts for a ratio of only about 50 to 1. Obviously, the larger fiber can be further bent, coiled, and packed to achieve even greater condensation during the formation of a mitotic chromosome.

ESSENTIAL POINT
Eukaryotic chromatin is a nucleoprotein organized into repeating units called nucleosomes, which are composed of 200 base pairs of DNA, an octamer of four types of histones, plus one linker histone.

NOW SOLVE THIS
11.3 If a human nucleus is 10 μm in diameter, and it must hold as much as 2 m of DNA, which is complexed into nucleosomes that during full extension are 11 nm in diameter, what percentage of the volume of the nucleus does the genetic material occupy?
HINT: This problem asks you to make some numerical calculations in order to see just how “filled” the eukaryotic nucleus is with a diploid amount of DNA. The key to its solution is the use of the formula V = (4/3)πr³, which calculates the volume of a sphere.

Chromatin Remodeling

As with many significant findings in genetics, the study of nucleosomes has answered some important questions, but at the same time it has also led us to new ones. For example, in the preceding discussion, we established that histone proteins play an important structural role in packaging DNA into the nucleosomes that make up chromatin. While solving the structural problem of how to organize a huge amount of DNA within the eukaryotic nucleus, a new problem was apparent: the chromatin fiber, when complexed with histones and folded into various levels of compaction, makes the DNA inaccessible to interaction with important nonhistone proteins. For example, the proteins that function in enzymatic and regulatory roles during the processes of replication and gene expression must interact directly with DNA. To accommodate these protein–DNA interactions, chromatin must be induced to change its structure, a process called chromatin remodeling. In the case of replication and gene expression, chromatin must relax its compact structure but be able to reverse the process during periods of inactivity.

Insights into how different states of chromatin structure may be achieved were forthcoming in 1997, when Timothy Richmond and members of his research team were able to significantly improve the level of resolution in X-ray diffraction studies of nucleosome crystals (from 7 Å in the 1984 studies to 2.8 Å in the 1997 studies). At this resolution, most atoms are visible, thus revealing the subtle twists and turns of the superhelix of DNA that encircles the histones. Recall that the double-helical ribbon represents 147 bp of DNA surrounding four pairs of histone proteins. This configuration is repeated over and over in the chromatin fiber and is the principal packaging unit of DNA in the eukaryotic nucleus.

The work of Richmond and colleagues, extended to a resolution of 1.9 Å in 2003, has revealed the details of the location of each histone entity within the nucleosome. Of particular interest to chromatin remodeling is that unstructured histone tails are not packed into the folded histone domains within the core of the nucleosome. For example, tails devoid of any secondary structure extending from histones H3 and H2B protrude through the minor groove channels of the DNA helix. The tails of histone H4 appear to make a connection with adjacent nucleosomes. Histone tails also provide potential targets for a variety of chemical modifications that may be linked to genetic functions along the chromatin fiber, including the regulation of gene expression.
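The length and nucleosome figures quoted earlier in this section can be checked with a few lines of arithmetic. The Python sketch below uses only numbers given in the text (3.2 × 10⁹ bp per haploid genome, 0.34 nm per base pair, a repeat of about 200 bp, and a 147-bp core particle); the function names are illustrative, and the nucleosome count assumes, as a simplification, that the entire diploid genome is packaged at one nucleosome per 200 bp.

```python
NM_PER_BP = 0.34      # length of DNA per base pair, from the text
HAPLOID_BP = 3.2e9    # base pairs in the human haploid genome, from the text
REPEAT_BP = 200       # approximate DNA per nucleosome repeat, from the text
CORE_BP = 147         # DNA in the nucleosome core particle, from the text


def dna_length_m(base_pairs):
    """Extended length of a DNA double helix in meters."""
    return base_pairs * NM_PER_BP * 1e-9


def linker_bp():
    """DNA removed by prolonged nuclease digestion (repeat minus core)."""
    return REPEAT_BP - CORE_BP


def nucleosome_count(base_pairs):
    """Nucleosomes needed if every REPEAT_BP of DNA carries one (simplifying assumption)."""
    return base_pairs / REPEAT_BP


if __name__ == "__main__":
    diploid_bp = 2 * HAPLOID_BP
    print(f"Diploid DNA length: {dna_length_m(diploid_bp):.2f} m")
    print(f"Linker DNA per repeat: about {linker_bp()} bp")
    print(f"Nucleosomes at one per {REPEAT_BP} bp: {nucleosome_count(diploid_bp):.1e}")
```

The length comes out at a little over 2 m, matching the figure in the text. The simple one-per-200-bp count gives about 3.2 × 10⁷ nucleosomes, the same order of magnitude as the rough estimate of 2.5 × 10⁷ quoted above; the text does not say which assumptions lie behind that estimate.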


Several of these potential chemical modifications are now recognized as important to genetic function. One of the most well-studied histone modifications involves acetylation by the action of the enzyme histone acetyltransferase (HAT). The addition of an acetyl group to the positively charged amino group present on the side chain of the amino acid lysine effectively changes the net charge of the protein by neutralizing the positive charge. Lysine is in abundance in histones, and geneticists have known for some time that acetylation is linked to gene activation. It appears that high levels of acetylation open up, or remodel, the chromatin fiber, an effect that increases in regions of active genes and decreases in inactive regions. In a well-studied example, the inactivation of the X chromosome in mammals, forming a Barr body (Chapter 5), histone H4 is known to be greatly underacetylated.

Two other important chemical modifications are the methylation and phosphorylation of amino acids that are part of histones. These chemical processes result from the action of enzymes called methyltransferases and kinases, respectively. Methyl groups may be added to both arginine and lysine residues of histones, and these changes can either increase or decrease transcription depending on which amino acids are methylated. Phosphate groups can be added to the hydroxyl groups of the amino acids serine and histidine, introducing a negative charge on the protein. During the cell cycle, increased phosphorylation, particularly of histone H3, is known to occur at characteristic times. Such chemical modification is believed to be related to the cycle of chromatin unfolding and condensation that occurs during and after DNA replication. It is important to note that the above chemical modifications (acetylation, methylation, and phosphorylation) are all reversible, under the direction of specific enzymes.

Not to be confused with histone methylation, the nitrogenous base cytosine within the DNA itself can also be methylated, forming 5-methyl cytosine. Cytosine methylation is usually negatively correlated with gene activity and occurs most often when the nucleotide cytidylic acid is next to the nucleotide guanylic acid, forming what is called a CpG island.

The research described above has extended our knowledge of nucleosomes and chromatin organization and serves here as a general introduction to the concept of chromatin remodeling. A great deal more work must be done, however, to elucidate the specific involvement of chromatin remodeling during genetic processes. In particular, the way in which the modifications are influenced by regulatory molecules within cells will provide important insights into the mechanisms of gene expression. What is clear is that the dynamic forms in which chromatin exists are vitally important to the way that all genetic processes directly involving DNA are executed. We will return to a more detailed discussion of the role of chromatin remodeling when we consider the regulation of eukaryotic gene expression later in the text (see Chapter 16). In addition, chromatin remodeling is an important topic in the discussion of epigenetics, the study of modifications of an organism's genetic and phenotypic expression that are not attributable to alteration of the DNA sequence making up a gene. This topic is discussed in depth in a future chapter (Special Topics Chapter 1: Epigenetics).

Heterochromatin

Although we know that the DNA of the eukaryotic chromosome consists of one continuous double-helical fiber along its entire length, we also know that the whole chromosome is not structurally uniform from end to end. In the early part of the twentieth century, it was observed that some parts of the chromosome remain condensed and stain deeply during interphase, while most parts are uncoiled and do not stain. In 1928, the terms euchromatin and heterochromatin were coined to describe the parts of chromosomes that are uncoiled and those that remain condensed, respectively.

Subsequent investigation revealed a number of characteristics that distinguish heterochromatin from euchromatin. Heterochromatic areas are genetically inactive because they either lack genes or contain genes that are repressed. Also, heterochromatin replicates later during the S phase of the cell cycle than euchromatin does. The discovery of heterochromatin provided the first clues that parts of eukaryotic chromosomes do not always encode proteins. For example, one heterochromatic region of the chromosome, the telomere, maintains the chromosome's structural integrity during DNA replication, while another, the centromere, facilitates chromosome movement during cell division.

The presence of heterochromatin is unique to and characteristic of the genetic material of eukaryotes. In some cases, whole chromosomes are heterochromatic. A case in point is the mammalian Y chromosome, much of which is genetically inert. And, as we discussed in Chapter 5, the inactivated X chromosome in mammalian females is condensed into an inert heterochromatic Barr body. In some species, such as mealy bugs, all chromosomes of one entire haploid set are heterochromatic.

When certain heterochromatic areas from one chromosome are translocated to a new site on the same or another nonhomologous chromosome, genetically active areas sometimes become genetically inert if they lie adjacent to the translocated heterochromatin. This influence on existing euchromatin is one example of what is more generally referred to as a position effect. That is, the position of a gene or group of genes relative to all other genetic material may affect their expression.

Chromosome Banding

Until about 1970, mitotic chromosomes viewed under the light microscope could be distinguished only by their relative sizes and the positions of their centromeres. In karyotypes,


two or more chromosomes are often visually indistinguishable from one another. Numerous cytological techniques, referred to as chromosome banding, have now made it possible to distinguish such chromosomes from one another as a result of differential staining along the longitudinal axis of mitotic chromosomes.

The most useful of these techniques, called G-banding, involves the digestion of the mitotic chromosomes with the proteolytic enzyme trypsin, followed by Giemsa staining. This procedure stains regions of DNA that are rich in A-T base pairs. Another technique, called C-banding, uses chromosome preparations that are heat denatured. Subsequent Giemsa staining reveals only the heterochromatic regions of the centromeres.

These, and other chromosome-banding techniques, reflect the heterogeneity and complexity of the chromosome along its length. So precise is the banding pattern that when a segment of one chromosome has been translocated to another chromosome, its origin can be determined with great precision.

ESSENTIAL POINT
Heterochromatin, prematurely condensed in interphase and for the most part genetically inert, is illustrated by the centromeric and telomeric regions of eukaryotic chromosomes, the Y chromosome, and the Barr body. ■

11.5 Eukaryotic Genomes Demonstrate Complex Sequence Organization Characterized by Repetitive DNA

Thus far, we have looked at how DNA is organized into chromosomes in bacteriophages, bacteria, and eukaryotes. We now begin an examination of what we know about the organization of DNA sequences within the chromosomes making up an organism's genome, placing our emphasis on eukaryotes.

In addition to single copies of unique DNA sequences that make up genes, many DNA sequences within eukaryotic chromosomes are repetitive in nature. Various levels of repetition occur within the genomes of organisms. Many studies have now provided insights into repetitive DNA, demonstrating various classes of sequences and organization. Figure 11.11 schematizes these categories. Some functional genes are present in more than one copy (they are referred to as multiple-copy genes) and so are repetitive in nature. However, the majority of repetitive sequences do not encode proteins. Nevertheless, many are transcribed, and the resultant RNAs play multiple roles in eukaryotes, including chromatin remodeling and regulation of gene expression, as discussed in greater detail in Chapter 16. We will explore three main categories of repetitive sequences: (1) heterochromatin found to be associated with centromeres and making up telomeres, (2) tandem repeats of both short and long DNA sequences, and (3) transposable sequences that are interspersed throughout the genome of eukaryotes.

FIGURE 11.11 An overview of the categories of repetitive DNA. Repetitive DNA divides into highly repetitive DNA (satellite DNA) and middle repetitive DNA. Middle repetitive DNA comprises tandem repeats (multiple-copy genes, such as rRNA genes; minisatellites, or VNTRs; microsatellites, or STRs) and interspersed retrotransposons (SINEs, such as Alu; LINEs, such as L1).

Satellite DNA

The nucleotide composition of the DNA (e.g., the percentage of C-G versus A-T pairs) of a particular species is reflected in its density, which can be measured with sedimentation equilibrium centrifugation. When eukaryotic DNA is analyzed in this way, the majority is present as a single main band, or peak, of fairly uniform density. However, one or more additional peaks represent DNA that differs slightly in density. This component, called satellite DNA, represents a variable proportion of the total DNA, depending on the species. A profile of main-band and satellite DNA from the mouse is shown in Figure 11.12. By contrast, bacteria contain only main-band DNA.

The significance of satellite DNA remained an enigma until the mid-1960s, when Roy Britten and David Kohne developed the technique for measuring the reassociation kinetics of DNA that had previously been dissociated into single strands. They demonstrated that certain portions of DNA reassociated more rapidly than others. They concluded that


rapid reassociation was characteristic of multiple DNA fragments composed of identical or nearly identical nucleotide sequences, the basis for the descriptive term repetitive DNA.

FIGURE 11.12 Separation of main-band (MB) and satellite (S) DNA from the mouse, using ultracentrifugation in a CsCl gradient. (Axes: absorbance at 260 nm, OD260, versus density, 1.71 to 1.68.)

When satellite DNA is subjected to analysis by reassociation kinetics, it falls into the category of highly repetitive DNA, which is known to consist of relatively short sequences repeated a large number of times. Further evidence suggests that these sequences are present as tandem repeats clustered in very specific chromosomal areas known to be heterochromatic: the regions flanking centromeres. This was discovered in 1969 when several researchers, including Mary Lou Pardue and Joe Gall, applied in situ hybridization to the study of satellite DNA. This technique involves the molecular hybridization between an isolated fraction of radioactively labeled DNA or RNA probes and the DNA contained in the chromosomes of a cytological preparation. Following the hybridization procedure, autoradiography is performed to locate the chromosome areas complementary to the fraction of DNA or RNA.

Pardue and Gall demonstrated that radioactive probes made from mouse satellite DNA hybridize with the DNA of centromeric regions of mouse mitotic chromosomes, which are all telocentric (Figure 11.13). Several conclusions were drawn: Satellite DNA differs from main-band DNA in its molecular composition, as established by buoyant density studies. It is composed of repetitive sequences. Finally, satellite DNA is found in the heterochromatic centromeric regions of chromosomes.

FIGURE 11.13 In situ molecular hybridization between RNA transcribed from mouse satellite DNA and mitotic chromosomes. The grains in the autoradiograph localize the chromosome regions (the centromeres) containing satellite DNA sequences.

Centromeric DNA Sequences

The separation of homologs during mitosis and meiosis depends on centromeres, described cytologically as the primary constrictions along eukaryotic chromosomes. In this role, it is believed that the DNA sequence contained within the centromere is critical. Careful analysis has confirmed this prediction. The minimal region of the centromere that supports the function of chromosomal segregation is designated the CEN region. Within this heterochromatic region of the chromosome, the DNA binds a platform of proteins, which in multicellular organisms includes the kinetochore that binds to the spindle fiber during division (see Figure 2.8).

The CEN regions of the yeast Saccharomyces cerevisiae were the first to be studied. Each centromere serves an identical function, so it is not surprising that CENs from different chromosomes were found to be remarkably similar in their organization. The CEN region of yeast chromosomes consists of about 120 bp. Mutational analysis suggests that portions near the 3′ end of this DNA region are most critical to centromere function since mutations in them, but not those nearer the 5′ end, disrupt centromere function. Thus, the DNA of this region appears to be essential to the eventual binding to the spindle fiber.

Centromere sequences of multicellular eukaryotes are much more extensive than in yeast and vary considerably in size. For example, in Drosophila the CEN region is found within some 200 to 600 kb of DNA, much of which is highly repetitive. Recall from our prior discussion that highly repetitive satellite DNA is localized in the centromere regions of mice. In humans, one of the most recognized satellite DNA sequences is the alphoid family, found mainly in the centromere regions. Alphoid sequences, each about 170 bp in length, are present in tandem arrays of up to 1 million base pairs. Embedded within this repetitive DNA are more specific sequences that are critical to centromere function.
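Tandem arrays such as the alphoid family are runs of one repeat unit copied head to tail. As a rough illustration only, the Python sketch below counts back-to-back copies of a unit in a sequence and works out how many 170-bp units would fit in a 1-million-bp array. The function name and the toy sequence (a 4-bp unit, not real alphoid DNA) are invented for this example.

```python
import re


def longest_tandem_run(sequence: str, unit: str) -> int:
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    # (?:unit)+ matches one or more consecutive copies of the repeat unit.
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(match.group(0)) // len(unit) for match in pattern.finditer(sequence)]
    return max(runs, default=0)


# Hypothetical toy sequence: five tandem copies of ACGT, a spacer, then two more.
toy_sequence = "GGT" + "ACGT" * 5 + "TTC" + "ACGT" * 2
toy_copies = longest_tandem_run(toy_sequence, "ACGT")

# Figures from the text: ~170-bp alphoid units in arrays of up to 1 million bp.
alphoid_unit_bp = 170
array_bp = 1_000_000
max_alphoid_copies = array_bp // alphoid_unit_bp

print(toy_copies, max_alphoid_copies)
```

On these numbers, a full-length array would hold on the order of several thousand alphoid units. Real satellite arrays contain sequence variation between copies, so an exact-match count like this one is a simplification.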


One final observation of interest is that the H3 histone, a normal part of almost all eukaryotic nucleosomes, is substituted by a variant histone designated CENP-A in centromeric heterochromatin. It is believed that the N-terminal protein tails that make CENP-A unique are involved in the binding of kinetochore proteins that are essential to the microtubules of spindle fibers. This finding supports the supposition that the DNA sequence found only in centromeres is related to the function of this unique chromosomal structure.

Middle Repetitive Sequences: VNTRs and STRs

A brief look at still another prominent category of repetitive DNA sheds additional light on the organization of the eukaryotic genome. In addition to highly repetitive DNA, which constitutes about 5 percent of the human genome (and 10 percent of the mouse genome), a second category, middle (or moderately) repetitive DNA, is fairly well characterized. Because we now know a great deal about the human genome, we will use our own species to illustrate this category of DNA in genome organization.

Although middle repetitive DNA does include some duplicated genes (such as those encoding ribosomal RNA), most prominent in this category are either noncoding tandemly repeated sequences or noncoding interspersed sequences. No function has been ascribed to these components of the genome. An example is DNA described as variable number tandem repeats (VNTRs). These repeating DNA sequences may be 15 to 100 bp long and are found within and between genes. Many such clusters are dispersed throughout the genome and are often referred to as minisatellites.

The number of tandem copies of each specific sequence at each location varies from one individual to the next, creating localized regions of 1000 to 20,000 bp (1 to 20 kb) in length. As we will see in Special Topics Chapter 6: DNA Forensics, the variation in size (length) of these regions between individual humans was originally the basis for the forensic technique referred to as DNA fingerprinting.

Another group of tandemly repeated sequences consists of di-, tri-, tetra-, and pentanucleotides, also referred to as microsatellites or short tandem repeats (STRs). Like VNTRs, they are dispersed throughout the genome and vary among individuals in the number of repeats present at any site. For example, in humans, the most common microsatellite is the dinucleotide (CA)n, where n equals the number of repeats. Most commonly, n is between 5 and 50. These clusters have served as useful molecular markers for genome analysis.

Repetitive Transposed Sequences: SINEs and LINEs

Still another category of repetitive DNA consists of sequences that are interspersed individually throughout the genome, rather than being tandemly repeated. They can be either short or long, and many have the added distinction of being similar to transposable elements, which are mobile and can potentially relocate within the genome. A large portion of the human genome is composed of such sequences.

For example, short interspersed elements, called SINEs, are less than 500 base pairs long and may be present 500,000 times or more in the human genome. The best characterized human SINE is a set of closely related sequences called the Alu family (the name is based on the presence of DNA sequences recognized by the restriction endonuclease AluI). Members of this DNA family, also found in other mammals, are 200 to 300 base pairs long and are dispersed rather uniformly throughout the genome, both between and within genes. In humans, this family encompasses more than 5 percent of the entire genome.

Alu sequences are of particular interest because some members of the Alu family are transcribed into RNA, although the specific role of this RNA is not certain. Even so, the consequence of Alu sequences is their potential for transposition within the genome, which is related to chromosome rearrangements during evolution. Alu sequences are thought to have arisen from an RNA element whose DNA complement was dispersed throughout the genome as a result of the activity of reverse transcriptase (an enzyme that synthesizes DNA on an RNA template).

The group of long interspersed elements (LINEs) represents yet another category of repetitive transposable DNA sequences. LINEs are usually about 6 kb in length and in the human genome are present approximately 850,000 times. The most prominent example in humans is the L1 family. Members of this sequence family are about 6400 base pairs long and are present up to 500,000 times. Their 5′ end is highly variable, and their role within the genome has yet to be defined.

The general mechanism for transposition of L1 elements is now clear. The L1 DNA sequence is first transcribed into an RNA molecule. The RNA then serves as the template for the synthesis of the DNA complement using the enzyme reverse transcriptase. This enzyme is encoded by a portion of the L1 sequence. The new L1 copy then integrates into the DNA of the chromosome at a new site. Because of the similarity of this transposition mechanism to that used by retroviruses, LINEs are referred to as retrotransposons.

SINEs and LINEs represent a significant portion of human DNA. SINEs constitute about 13 percent of the human genome, whereas LINEs constitute up to 21 percent. Within both types of elements, repeating sequences of DNA are present in combination with unique sequences.

Middle Repetitive Multiple-Copy Genes

In some cases, middle repetitive DNA includes functional genes present tandemly in multiple copies. For example, many copies exist of the genes encoding ribosomal RNA. Drosophila has 120 copies per haploid genome. Single genetic


units encode a large precursor molecule that is processed into the 5.8S, 18S, and 28S rRNA components. In humans, multiple copies of this gene are clustered on the p arm of the acrocentric chromosomes 13, 14, 15, 21, and 22. Multiple copies of the genes encoding 5S rRNA are transcribed separately from multiple clusters found together on the terminal portion of the p arm of chromosome 1.

ESSENTIAL POINT
Eukaryotic genomes demonstrate complex sequence organization characterized by numerous categories of repetitive DNA, consisting of either tandem repeats clustered in various regions of the genome or single sequences repeatedly interspersed at random throughout the genome. ■

11.6 The Vast Majority of a Eukaryotic Genome Does Not Encode Functional Genes

Given the preceding information concerning various forms of repetitive DNA in eukaryotes, it is of interest to pose an important question: What proportion of the eukaryotic genome actually encodes functional genes?

We have seen that, taken together, the various forms of highly repetitive and moderately repetitive DNA comprise a substantial portion of the human genome. In addition to repetitive DNA, a large amount of the DNA consists of single-copy sequences as defined by reassociation kinetic analysis (Chapter 9) that appear to be noncoding. Included are many instances of what we call pseudogenes. These are DNA sequences representing evolutionary vestiges of duplicated copies of genes that have undergone significant mutational alteration. As a result, although they show some homology to their parent gene, they are usually not transcribed because of insertions and deletions throughout their structure.

Although the proportion of the genome consisting of repetitive DNA varies among organisms, one feature seems to be shared: Only a very small part of the genome actually codes for proteins. For example, the 20,000 to 30,000 genes encoding proteins in the sea urchin occupy less than 10 percent of the genome. In Drosophila, only 5 to 10 percent of the genome is occupied by genes coding for proteins. In humans, it appears that the estimated 20,000 protein-coding genes occupy only about 2 percent of the total DNA sequence making up the genome.

Related to the above observation, we are currently discovering many cases where DNA sequences are transcribed into RNA molecules that are not translated into proteins, and that play important cellular roles, such as regulation of genetic activity. This topic is explored in more depth later in the text; see Chapter 16.
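To get a feel for the scale of the percentages quoted in Sections 11.5 and 11.6, the short Python sketch below converts them into approximate base-pair totals. It assumes a haploid human genome of roughly 3 × 10^9 bp (the figure used in Problem 18 of this chapter); the variable and function names are illustrative, and the results are order-of-magnitude estimates only.

```python
# Assumption: ~3 x 10^9 bp per human haploid genome (figure used in Problem 18).
HAPLOID_GENOME_BP = 3_000_000_000

# Approximate percentages of the human genome quoted in the text.
PERCENT_OF_GENOME = {
    "SINEs": 13,
    "LINEs": 21,
    "protein-coding genes": 2,
}


def approx_bp(percent: int, genome_bp: int = HAPLOID_GENOME_BP) -> int:
    """Convert a whole-number percentage of the genome into base pairs."""
    return genome_bp * percent // 100


bp_by_category = {name: approx_bp(pct) for name, pct in PERCENT_OF_GENOME.items()}

for name, bp in bp_by_category.items():
    print(f"{name}: about {bp:,} bp")
```

Under these assumptions, SINEs and LINEs together account for roughly a billion base pairs, more than fifteen times the amount occupied by protein-coding genes.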

EXPLORING GENOMICS

Database of Genomic Variants: Structural Variations in the Human Genome

Mastering Genetics: Visit the Study Area: Exploring Genomics

In this chapter, we focused on structural details of chromosomes and DNA sequence organization in chromosomes. A related finding is that large segments of DNA and a number of genes can vary greatly in copy number due to duplications, creating copy number variations (CNVs). Many studies are under way to identify and map CNVs and to find possible disease conditions associated with them.

Several thousand CNVs have been identified in the human genome, and estimates suggest there may be thousands more within human populations. In this Exploring Genomics exercise we will visit the Database of Genomic Variants (DGV), which provides a quickly expanding summary of structural variations in the human genome including CNVs.

Exercise I - Database of Genomic Variants

1. Access the DGV at http://dgv.tcag.ca/dgv/app/home. Click the “About the Project” tab to learn more about the purpose of the DGV.

2. Information in the DGV is easily viewed by clicking on a chromosome of interest using the “Find DGV Variants by Chromosome” feature. Using this feature, click on a chromosome of interest to you. A table will appear under the “Variants” tab showing several columns of data including:

■ Start and Stop: Shows the locus for the CNV, including the base pairs that span the variation.

■ Variant Accession: Provides a unique identifying number for each variation. Click on the variant accession number to reveal a separate page of specific details about each CNV, including the chromosomal banding location for the variation and known genes that are located in the CNV.


■ Variant Type: Most variations in this database are CNVs. Variant subtypes, such as deletions or insertions, and duplications, are shown in an adjacent column.

3. Let’s analyze a particular group of CNVs. Many CNVs are unlikely to affect phenotype because they involve large areas of non-protein-coding or nonregulatory sequences. But gene-containing CNVs have been identified, including variants containing genes associated with Alzheimer disease, Parkinson disease, and other conditions.

Defensin (DEF) genes are part of a large family of highly duplicated genes. To learn more about DEF genes and CNVs, use the Keyword Search box and search for DEF. A results page for the search will appear with a listing of relevant CNVs. Click on the name for any of the different DEF genes listed, which will take you to a wealth of information [including links to Online Mendelian Inheritance in Man (OMIM)] about these genes so that you can answer the following questions. Do this for several DEF-containing CNVs on different chromosomes to find the information you will need for your answers.

a. On what chromosome(s) did you find CNVs containing DEF genes?

b. What did you learn about the function of DEF gene products? What do DEF proteins do?

c. Variations in DEF genotypes and DEF gene expression in humans have been implicated in a number of different human disease conditions. Give examples of the kinds of disorders affected by variations in DEF genotypes.

CASE STUDY Helping or hurting?

Roberts syndrome is a rare inherited disorder characterized by facial defects as well as severe limb shortening, extra digits, and deformities of the knees and ankles. A cytogenetic analysis of patients with Roberts syndrome, using Giemsa staining or C-banding, reveals that there is premature separation of centromeres and other heterochromatic regions during mitotic metaphase instead of anaphase. A couple who has an infant with Roberts syndrome is contacted by a local organization dedicated to promoting research on rare genetic diseases, asking if they can photograph the infant as part of a campaign to obtain funding for these conditions. The couple learned that the privacy of such medical images is not well protected, and they often are subsequently displayed on public Web sites. The couple was torn between helping to raise awareness and promoting research on this condition and sheltering their child from having his images used inappropriately. Several interesting questions are raised.

1. In Roberts syndrome, how could premature separation of centromeres during mitosis cause the wide range of phenotypic deficiencies?

2. What ethical obligations do the parents owe to their child in this situation and to helping others with Roberts syndrome by allowing images of their child to be used in raising awareness of this disorder?

3. If the parents decide to allow their infant to be photographed, what steps should the local organization take to ensure appropriate use and distribution of the photos?

See Onion R. (2014). History, or Just Horror? Slate Magazine (http://www.slate.com/articles/news_and_politics/history/2014/11/old_medical_photographs_are_images_of_syphilis_and_tuberculosis_patients.html).

INSIGHTS AND SOLUTIONS

A previously undiscovered single-cell organism was found living at a great depth on the ocean floor. Its nucleus contained only a single linear chromosome with $7 \times 10^{6}$ nucleotide pairs of DNA coalesced with three types of histone-like proteins. Consider the following questions:

1. A short micrococcal nuclease digestion yielded DNA fractions of 700, 1400, and 2100 bp. Predict what these fractions represent. What conclusions can be drawn?

Solution: The chromatin fiber may consist of a variation of nucleosomes containing 700 bp of DNA. The 1400- and 2100-bp fractions, respectively, represent two and three linked nucleosomes. Enzymatic digestion may have been incomplete, leading to the latter two fractions.

2. The analysis of individual nucleosomes reveals that each unit contained one copy of each protein and that the short linker DNA contained no protein bound to it. If the entire chromosome consists of nucleosomes (discounting any linker DNA), how many are there, and how many total proteins are needed to form them?

Solution: Since the chromosome contains $7 \times 10^{6}$ bp of DNA, the number of nucleosomes, each containing $7 \times 10^{2}$ bp, is equal to

$$\frac{7 \times 10^{6}}{7 \times 10^{2}} = 10^{4}\ \text{nucleosomes}$$

The chromosome contains $10^{4}$ copies of each of the three proteins, for a total of $3 \times 10^{4}$ proteins.

3. Further analysis revealed the organism’s DNA to be a double helix similar to the Watson–Crick model but containing 20 bp per complete turn of the right-handed helix. The physical size of the nucleosome was exactly double the volume occupied by that found in all other known eukaryotes, by virtue of increasing the distance along the fiber axis by a factor of two.


Compare the degree of compaction of this organism’s nucleosome to that found in other eukaryotes.

Solution: The unique organism compacts a length of DNA consisting of 35 complete turns of the helix (700 bp per nucleosome/20 bp per turn) into each nucleosome. The normal eukaryote compacts a length of DNA consisting of 20 complete turns of the helix (200 bp per nucleosome/10 bp per turn) into a nucleosome one-half the volume of that in the unique organism. The degree of compaction is therefore less in the unique organism.
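The arithmetic behind Solutions 2 and 3 can be checked with a few lines of Python. This is only a sketch of the calculation: the variable names are invented, and "helical turns per unit of nucleosome volume" is one way of expressing the comparison made in Solution 3, with the typical eukaryotic nucleosome assigned a relative volume of 1.

```python
# Solution 2: number of nucleosomes and total histone-like proteins.
genome_bp = 7 * 10**6          # bp in the single linear chromosome
bp_per_nucleosome = 700        # inferred from the 700-bp digestion fraction
protein_types = 3              # one copy of each protein per nucleosome

nucleosomes = genome_bp // bp_per_nucleosome
total_proteins = nucleosomes * protein_types

# Solution 3: helical turns packed into each nucleosome.
turns_unique = 700 / 20        # 700 bp per nucleosome, 20 bp per turn
turns_typical = 200 / 10       # 200 bp per nucleosome, 10 bp per turn

# Relative nucleosome volumes: the unique organism's is double the typical one.
volume_unique = 2.0
volume_typical = 1.0

turns_per_volume_unique = turns_unique / volume_unique
turns_per_volume_typical = turns_typical / volume_typical

print(nucleosomes, total_proteins)
print(turns_per_volume_unique, turns_per_volume_typical)
```

Because the unique organism packs fewer turns per unit volume (17.5 versus 20), this reading agrees with the conclusion that its DNA is less compacted.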

Problems and Discussion Questions

Mastering Genetics: Visit for instructor-assigned tutorials and problems.

1. HOW DO WE KNOW? In this chapter, we focused on how DNA is organized at the chromosomal level. Along the way, we found many opportunities to consider the methods and reasoning by which much of this information was acquired. From the explanations given in the chapter, propose answers to the following fundamental questions:
(a) How do we know that viral and bacterial chromosomes most often consist of circular DNA molecules devoid of protein?
(b) What is the experimental basis for concluding that puffs in polytene chromosomes and loops in lampbrush chromosomes are areas of intense transcription of RNA?
(c) How did we learn that eukaryotic chromatin exists in the form of repeating nucleosomes, each consisting of about 200 base pairs and an octamer of histones?
(d) How do we know that satellite DNA consists of repetitive sequences and has been derived from regions of the centromere?

2. CONCEPT QUESTION Review the Chapter Concepts list on p. 202. These all relate to how DNA is organized in viral, bacterial, and eukaryote chromosomes. Write a short essay that contrasts the major differences between the organization of DNA in viruses and bacteria versus eukaryotes.

3. Contrast the sizes of the chromosomes of bacteriophage λ and T2 with that of E. coli, and compare the appearance and size of the DNA associated with mitochondria and chloroplasts.

4. Describe how giant polytene chromosomes are formed.

5. What genetic process is occurring in a puff of a polytene chromosome?

6. During what genetic process are lampbrush chromosomes present in vertebrates?

7. Why might we predict that the organization of eukaryotic genetic material will be more complex than that of viruses or bacteria?

8. Describe the sequence of research findings that led to the development of the model of chromatin structure.

9. What are the molecular composition and arrangement of the components in the nucleosome?

10. Describe the transitions that occur as nucleosomes are coiled and folded, ultimately forming a chromatid.

11. Provide a comprehensive definition of heterochromatin, and list as many examples as you can.

12. Contrast the various categories of repetitive DNA.

13. Define satellite DNA. Describe where it is found in the genome of eukaryotes and its role as part of chromosomes.

14. Contrast the structure of SINE and LINE DNA sequences. Why are LINEs referred to as retrotransposons?

15. Mammals contain a diploid genome consisting of at least $10^{9}$ bp. If this amount of DNA is present as chromatin fibers, where each group of 200 bp of DNA is combined with nine histones into a nucleosome and each group of six nucleosomes is combined into a solenoid, achieving a final packing ratio of 50, determine:
(a) the total number of nucleosomes in all fibers.
(b) the total number of histone molecules combined with DNA in the diploid genome.
(c) the combined length of all fibers.

16. Assume that a viral DNA molecule is a 50-μm-long circular strand of a uniform 2-nm diameter. If this molecule is contained in a viral head that is a 0.08-μm-diameter sphere, will the DNA molecule fit into the viral head, assuming complete flexibility of the molecule? Justify your answer mathematically.

17. How many base pairs are in a molecule of phage T2 DNA that is 52 μm long?

18. The human genome contains approximately $10^{6}$ copies of an Alu sequence, one of the best-studied classes of short interspersed elements (SINEs), per haploid genome. Individual Alus share a 282-nucleotide consensus sequence followed by a 3′-adenine-rich tail region. Given that there are approximately $3 \times 10^{9}$ bp per human haploid genome, about how many base pairs are spaced between each Alu sequence?

19. Below is a diagram of the general structure of the bacteriophage λ chromosome. Speculate on the mechanism by which it forms a closed ring upon infection of the host cell.

5′-GGGCGGCGACCT : double-stranded region : 3′
3′ : double-stranded region : CCCGCCGCTGGA-5′

M11_KLUG8414_10_SE_C11.indd 217 16/11/18 5:14 pm

12 The Genetic Code and Transcription

Electron micrograph visualizing the process of transcription.

CHAPTER CONCEPTS

■ Genetic information is stored in DNA using a triplet code that is nearly universal to all living things on Earth.
■ The encoded genetic information stored in DNA is initially copied into an RNA transcript.
■ Once transferred from DNA to RNA, the genetic code exists as triplet codons, using the four ribonucleotides in RNA as the letters composing it.
■ By using four different letters taken three at a time, 64 triplet sequences are possible. Most encode one of the 20 amino acids present in proteins, which serve as the end products of most genes.
■ Several codons provide signals that initiate or terminate protein synthesis.
■ The process of transcription is similar but more complex in eukaryotes compared to bacteria and bacteriophages that infect them.

The linear sequence of deoxyribonucleotides making up DNA ultimately dictates the components constituting proteins, the end product of most genes. The central question is how such information stored as a nucleic acid is decoded into a protein. Figure 12.1 gives a simplified overview of how this transfer of information occurs. In the first step in gene expression, information on one of the two strands of DNA (the template strand) is copied into an RNA complement through transcription. Once synthesized, this RNA acts as a "messenger" molecule bearing the coded information; hence its name, messenger RNA (mRNA). The mRNAs then associate with ribosomes, where decoding into proteins takes place.

In this chapter, we focus on the initial phases of gene expression by addressing two major questions. First, how is genetic information encoded? Second, how does the transfer from DNA to RNA occur, thus defining the process of transcription? As you shall see, ingenious analytical research established that the genetic code is written in units of three letters: ribonucleotides present in mRNA that reflect the stored information in genes. Nearly all triplet code words direct the incorporation of a specific amino acid into a protein as it is synthesized. As we can predict based on our prior discussion of the replication of DNA, transcription is also a complex process dependent on a major polymerase enzyme and a cast of supporting proteins. We will explore what is known about transcription in bacteria and then contrast this model with the differences found in eukaryotes. Together, the information in this and Chapter 13 provides a comprehensive picture of molecular genetics, which serves as the most basic foundation for understanding living organisms. In Chapter 13, we will address how translation occurs and discuss the structure and function of proteins.

FIGURE 12.1 Flowchart illustrating how genetic information encoded in DNA produces protein. The DNA template strand of a gene (3′-TACCACAACTCG-5′) is transcribed into mRNA (5′-AUGGUGUUGAGC-3′), whose triplet code words are translated on ribosomes into the amino acids Met-Val-Leu-Ser of a protein.

12.1 The Genetic Code Exhibits a Number of Characteristics

Before we consider the various analytical approaches that led to our current understanding of the genetic code, let's summarize the general features that characterize it.

1. The genetic code is written in linear form, using the ribonucleotide bases that compose mRNA molecules as "letters." The ribonucleotide sequence is derived from the complementary nucleotide bases in DNA.

2. Each "word" within the mRNA consists of three ribonucleotide letters, thus representing a triplet code. With several exceptions, each group of three ribonucleotides, called a codon, specifies one amino acid.

3. The code is unambiguous: each triplet specifies only a single amino acid.

4. The code is degenerate; that is, a given amino acid can be specified by more than one triplet codon. This is the case for 18 of the 20 amino acids.

5. The code contains one "start" and three "stop" signals, triplets that initiate and terminate translation.

6. No internal punctuation (such as a comma) is used in the code. Thus, the code is said to be commaless. Once translation of mRNA begins, the codons are read one after the other, with no breaks between them.

7. The code is nonoverlapping. Once translation commences, any single ribonucleotide at a specific location within the mRNA is part of only one triplet.

8. The code is nearly universal. With only minor exceptions, almost all viruses, bacteria, archaea, and eukaryotes use a single coding dictionary.

12.2 Early Studies Established the Basic Operational Patterns of the Code

In the late 1950s, before it became clear that mRNA is the intermediate that transfers genetic information from DNA to proteins, researchers thought that DNA itself might directly encode proteins during their synthesis. Because ribosomes had already been identified, the initial thinking was that information in DNA was transferred in the nucleus to the RNA of the ribosome, which served as the template for protein synthesis in the cytoplasm. This concept soon became untenable as accumulating evidence indicated the existence of an unstable intermediate template. The RNA of ribosomes, on the other hand, was extremely stable. As a result, in 1961 François Jacob and Jacques Monod postulated the existence of messenger RNA (mRNA). Once mRNA was discovered, it was clear that even though genetic information is stored in DNA, the code that is translated into proteins resides in RNA. The central question then was how only four letters (the four ribonucleotides) could specify 20 words (the amino acids).

The Triplet Nature of the Code

In the early 1960s, Sydney Brenner argued on theoretical grounds that the code had to be a triplet since three-letter words represent the minimal use of four letters to specify


20 amino acids. A code of four nucleotides, taken two at a time, for example, provides only 16 unique code words (4²). A triplet code yields 64 words (4³), clearly more than the 20 needed, and is much simpler than a four-letter code (4⁴), which specifies 256 words.

Experimental evidence supporting the triplet nature of the code was subsequently derived from research by Francis Crick and his colleagues. Using phage T4, they studied frameshift mutations, which result from the addition or deletion of one or more nucleotides within a gene and subsequently the mRNA transcribed by it. The gain or loss of letters shifts the frame of reading during translation. Crick and his colleagues found that the gain or loss of one or two nucleotides caused a frameshift mutation, but when three nucleotides were involved, the frame of reading was reestablished (Figure 12.2). This would not occur if the code was anything other than a triplet. This work also suggested that most triplet codes are not blank, but rather encode amino acids, supporting the concept of a degenerate code.

FIGURE 12.2 The effect of frameshift mutations on a DNA sequence with the repeating triplet sequence GAG. (a) The insertion of a single nucleotide shifts all subsequent triplet reading frames. (b) The insertion of three nucleotides changes only two triplets, but the frame of reading is then reestablished to the original sequence.

    (a) Initially in frame:  3′ GAG GAG GAG GAG GAG 5′
        C inserted:          3′ GAG GAC GGA GGA GGA 5′   (out of frame from the insertion onward)
    (b) Initially in frame:  3′ GAG GAG GAG GAG GAG 5′
        CAG inserted:        3′ GAG GAC AGG GAG GAG 5′   (two triplets out of frame, then back in frame)

12.3 Studies by Nirenberg, Matthaei, and Others Deciphered the Code

In 1961, Marshall Nirenberg and J. Heinrich Matthaei deciphered the first specific coding sequences, which served as a cornerstone for the complete analysis of the genetic code. Their success, as well as that of others who made important contributions to breaking the code, was dependent on the use of two experimental tools: an in vitro (cell-free) protein-synthesizing system and an enzyme, polynucleotide phosphorylase, which enabled the production of synthetic mRNAs. These mRNAs are templates for polypeptide synthesis in the cell-free system.

Synthesizing Polypeptides in a Cell-Free System

In the cell-free system, amino acids are incorporated into polypeptide chains. This in vitro mixture must contain the essential factors for protein synthesis in the cell: ribosomes, tRNAs, amino acids, and other molecules essential to translation (see Chapter 13). In order to follow (or trace) protein synthesis, one or more of the amino acids must be radioactive. Finally, an mRNA must be added, which serves as the template that will be translated.

FIGURE 12.3 The reaction catalyzed by the enzyme polynucleotide phosphorylase: n [rNDP] (ribonucleoside diphosphates) ⇌ [rNMP]n (RNA) + n [Pi] (inorganic phosphates), with synthesis as the forward direction and degradation as the reverse. Note that the equilibrium of the reaction favors the degradation of RNA but can be "forced" in the direction favoring synthesis.

In 1961, mRNA had yet to be isolated. However, use of the enzyme polynucleotide phosphorylase allowed the artificial synthesis of RNA templates, which could be added to the cell-free system. This enzyme, isolated from bacteria, catalyzes the reaction shown in Figure 12.3. Discovered in 1955 by Marianne Grunberg-Manago and Severo Ochoa, the enzyme functions metabolically in bacterial cells to degrade RNA. However, in vitro, with high concentrations


of ribonucleoside diphosphates, the reaction can be "forced" in the opposite direction to synthesize RNA, as shown.

In contrast to RNA polymerase, polynucleotide phosphorylase does not require a DNA template. As a result, each addition of a ribonucleotide is random, based on the relative concentration of the four ribonucleoside diphosphates added to the reaction mixtures. The probability of the insertion of a specific ribonucleotide is proportional to the availability of that molecule, relative to other available ribonucleotides. This point is absolutely critical to understanding the work of Nirenberg and others in the ensuing discussion.

Together, the cell-free system and the availability of synthetic mRNAs provided a means of deciphering the ribonucleotide composition of various triplets encoding specific amino acids.

The Use of Homopolymers

In their initial experiments, Nirenberg and Matthaei synthesized RNA homopolymers, each with only one type of ribonucleotide. Therefore, the mRNA added to the in vitro system was UUUUUU . . . , AAAAAA . . . , CCCCCC . . . , or GGGGGG . . . . They tested each mRNA and were able to determine which, if any, amino acids were incorporated into newly synthesized proteins. To do this, the researchers labeled 1 of the 20 amino acids added to the in vitro system and conducted a series of experiments, each with a different radioactively labeled amino acid.

For example, in experiments using ¹⁴C-phenylalanine (Table 12.1), Nirenberg and Matthaei concluded that the message poly U (polyuridylic acid) directed the incorporation of only phenylalanine into the homopolymer polyphenylalanine. Assuming the validity of a triplet code, they determined the first specific codon assignment: UUU codes for phenylalanine. Using similar experiments, they quickly found that AAA codes for lysine and CCC codes for proline. Poly G was not an adequate template, probably because the molecule folds back upon itself. Thus, the assignment for GGG had to await other approaches.

Note that the specific triplet codon assignments were possible only because homopolymers were used. This method yields only the composition of triplets, but since three identical letters can have only one possible sequence (e.g., UUU), the actual codons were identified.

TABLE 12.1 Incorporation of ¹⁴C-phenylalanine into Protein

    Artificial mRNA    Radioactivity (counts/min)
    None               44
    Poly U             39,800
    Poly A             50
    Poly C             38

Source: After Nirenberg and Matthaei (1961).

Mixed Heteropolymers

With these techniques in hand, Nirenberg and Matthaei, and Ochoa and coworkers turned to the use of RNA heteropolymers. In this type of experiment, two or more different ribonucleoside diphosphates are added in combination to form the synthetic mRNA. The researchers reasoned that if they knew the relative proportion of each type of ribonucleoside diphosphate, they could predict the frequency of any particular triplet codon occurring in the synthetic mRNA. If they then added the mRNA to the cell-free system and ascertained the percentage of any particular amino acid present in the new protein, they could analyze the results and predict the composition (not the specific sequence) of triplets specifying particular amino acids.

This approach is shown in Figure 12.4. Suppose that A and C are added in a ratio of 1A:5C. The insertion of a ribonucleotide at any position along the RNA molecule during its synthesis is determined by the ratio of A:C. Therefore, there is a 1/6 chance for an A and a 5/6 chance for a C to occupy each position. On this basis, we can calculate the frequency of any given triplet appearing in the message.

For AAA, the frequency is (1/6)³, or about 0.5 percent. For AAC, ACA, and CAA, the frequencies are identical, that is, (1/6)²(5/6), or about 2.3 percent for each triplet. Together, all three 2A:1C triplets account for 6.9 percent of the total three-letter sequences. In the same way, each of three 1A:2C triplets accounts for (1/6)(5/6)², or 11.6 percent (or a total of 34.8 percent); CCC is represented by (5/6)³, or 57.9 percent of the triplets.

By examining the percentages of any given amino acid incorporated into the protein synthesized under the direction of this message, we can propose probable base compositions for each amino acid (Figure 12.4). Since proline appears 69 percent of the time, we could propose that proline is encoded by CCC (57.9 percent) and one triplet of 2C:1A (11.6 percent). Histidine, at 14 percent, is probably coded by one 2C:1A (11.6 percent) and one 1C:2A (2.3 percent). Threonine, at 12 percent, is likely coded by only one 2C:1A. Asparagine and glutamine each appear to be coded by one of the 1C:2A triplets, and lysine appears to be coded by AAA.

Using as many as all four ribonucleotides to construct the mRNA, the researchers conducted many similar experiments. Although determining the composition of the triplet code words for all 20 amino acids represented a significant breakthrough, the specific sequences of triplets were still unknown; other approaches were needed.

ESSENTIAL POINT
The use of RNA homopolymers and mixed heteropolymers in a cell-free system allowed the determination of the composition, but not the sequence, of triplet codons designating specific amino acids. ■


FIGURE 12.4 Results and interpretation of a mixed heteropolymer experiment where a ratio of 1A:5C is used (1/6 A:5/6 C).

RNA Heteropolymer with Ratio of 1A:5C

    Possible        Possible           Probability of occurrence        Final %
    compositions    triplets           of any triplet
    3A              AAA                (1/6)³ = 1/216 = 0.5%            0.5
    2A:1C           AAC  ACA  CAA      (1/6)²(5/6) = 5/216 = 2.3%       3 × 2.3 = 6.9
    1A:2C           ACC  CAC  CCA      (1/6)(5/6)² = 25/216 = 11.6%     3 × 11.6 = 34.8
    3C              CCC                (5/6)³ = 125/216 = 57.9%         57.9
                                                                        ~100

Chemical synthesis of message yields a random RNA such as CCCCCCCCACCCCCCAACCACCCCCACCCCCACCCAA, which is then translated.

    Amino acid    Percentage of amino     Probable base-composition
                  acids in protein        assignments
    Lysine        <1                      AAA
    Glutamine     2                       2A:1C
    Asparagine    2                       2A:1C
    Threonine     12                      1A:2C
    Histidine     14                      1A:2C, 2A:1C
    Proline       69                      CCC, 1A:2C
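The random "chemical synthesis of message" step in Figure 12.4 can also be mimicked by simulation rather than by direct calculation. The sketch below is a hypothetical illustration, not a model of the actual enzyme: it assumes each base is chosen independently with probability proportional to its share of the mixture, builds a long 1A:5C message, reads it in consecutive triplets, and tallies base compositions. The seed and message length are arbitrary choices.

```python
import random
from collections import Counter


def synthesize_message(ratio, length, rng):
    """Build a random RNA string, one base at a time, weighted by `ratio`."""
    bases = list(ratio)
    weights = [ratio[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


def composition_percentages(message):
    """Read the message in consecutive triplets and tally base compositions.

    Triplets are grouped by their sorted letters, so ACC, CAC, and CCA
    are all counted under the key "ACC" (the 1A:2C class).
    """
    triplets = [message[i:i + 3] for i in range(0, len(message) - 2, 3)]
    counts = Counter("".join(sorted(t)) for t in triplets)
    return {comp: 100 * n / len(triplets) for comp, n in counts.items()}


rng = random.Random(12)  # fixed seed so the sketch is repeatable
message = synthesize_message({"A": 1, "C": 5}, 300_000, rng)
observed = composition_percentages(message)
```

With a message this long, the simulated percentages for the 3A, 2A:1C, 1A:2C, and 3C classes should fall close to the calculated values in the figure, which is the sense in which the researchers could predict triplet frequencies from the input ratio alone.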

NOW SOLVE THIS

12.1 In a mixed heteropolymer experiment using polynucleotide phosphorylase, 3/4G:1/4C was used to form the synthetic message. The amino acid composition of the resulting protein was determined to be:

    Glycine     36/64 (56 percent)
    Alanine     12/64 (19 percent)
    Arginine    12/64 (19 percent)
    Proline      4/64 (6 percent)

From this information,
(a) Indicate the percentage (or fraction) of the time each possible codon will occur in the message.
(b) Determine one consistent base-composition assignment for the amino acids present.

HINT: This problem asks you to analyze a mixed heteropolymer experiment and to predict codon composition assignments for the amino acids encoded by the synthetic message. The key to its solution is to first calculate the proportion of each triplet codon in the synthetic RNA and then match these to the proportions of amino acids that are synthesized.

The Triplet Binding Assay

It was not long before more advanced techniques were developed. In 1964, Nirenberg and Philip Leder developed the triplet binding assay, which led to specific assignments of triplets. The technique took advantage of the observation that ribosomes, when presented in vitro with an RNA sequence as short as three ribonucleotides, will bind to it and form a complex similar to that found in vivo. The triplet sequence acts like a codon in mRNA, attracting a transfer RNA molecule containing the complementary sequence and carrying a specific amino acid (Figure 12.5). The triplet sequence in tRNA that is complementary to a codon of mRNA is called an anticodon.

FIGURE 12.5 Illustration of the behavior of the components during the triplet binding assay. The synthetic UUU triplet RNA sequence acts as a codon, attracting the complementary AAA anticodon of the charged tRNA^Phe, which together are bound by the small and large subunits of the ribosome to form an RNA triplet–tRNA–ribosome complex.

Although it was not yet feasible to chemically synthesize long stretches of RNA, triplets of known sequence could be synthesized in the laboratory to serve as templates. All that was needed was a method to determine which tRNA–amino acid was bound to the triplet RNA–ribosome complex. The test system Nirenberg and Leder devised was quite simple. The amino acid to be tested was made radioactive, and a charged tRNA was produced. Because codon compositions

were known, researchers could narrow the range of amino acids that should be tested for each specific triplet.

The radioactively charged tRNA, the RNA triplet, and ribosomes were incubated together and then passed through a nitrocellulose filter, which retains the larger ribosomes but not the other smaller components, such as unbound charged tRNA. If radioactivity is not retained on the filter, an incorrect amino acid has been tested. But if radioactivity remains on the filter, it is retained because the charged tRNA has bound to the triplet associated with the ribosome. When this occurs, a specific codon assignment can be made.

Work proceeded in several laboratories, and in many cases clear-cut, unambiguous results were obtained. Table 12.2, for example, shows 26 triplets assigned to 9 amino acids. However, in some cases, the degree of triplet binding was inefficient and assignments were not possible. Eventually, about 50 of the 64 triplets were assigned. These specific assignments of triplets to amino acids led to two major conclusions. First, the genetic code is degenerate; that is, one amino acid can be specified by more than one triplet. Second, the code is unambiguous. That is, a single triplet specifies only one amino acid. As you shall see later in this chapter, these conclusions have been upheld with only minor exceptions. The triplet binding technique was a major innovation in deciphering the genetic code.

TABLE 12.2 Amino Acid Assignments to Specific Trinucleotides Derived from the Triplet Binding Assay

    Trinucleotides         Amino Acid
    AAA AAG                Lysine
    AUG                    Methionine
    AUU AUC AUA            Isoleucine
    CCG CCA CCU CCC        Proline
    CUC CUA CUG CUU        Leucine
    GAA GAG                Glutamic acid
    UCA UCG UCU UCC        Serine
    UGU UGC                Cysteine
    UUA UUG                Leucine
    UUU UUC                Phenylalanine

Repeating Copolymers

Yet another innovative technique used to decipher the genetic code was developed in the early 1960s by Har Gobind Khorana, who chemically synthesized long RNA molecules consisting of short sequences repeated many times. First, he created shorter sequences (e.g., di-, tri-, and tetranucleotides), which were then replicated many times and finally joined enzymatically to form the long polynucleotides, referred to as copolymers. As shown in Figure 12.6, a dinucleotide made in this way is converted to an mRNA with two repeating triplets. A trinucleotide is converted to an mRNA with three potential triplets, depending on the point at which initiation occurs, and a tetranucleotide creates four repeating triplets.

FIGURE 12.6 The conversion of di-, tri-, and tetranucleotides into repeating RNA copolymers. The triplet codons that are produced in each case are shown.

    Repeating sequence        Polynucleotide                   Repeating triplets
    Dinucleotide UG           5′ UGUGUGUGUGUGUGU 3′            UGU and GUG
    Trinucleotide UUG         5′ UUGUUGUUGUUGUUGUUGU 3′        UUG or UGU or GUU (depending on initiation)
    Tetranucleotide UAUC      5′ UAUCUAUCUAUCUAUCUAUCU 3′      UAU and CUA and UCU and AUC

When synthetic mRNAs were added to a cell-free system, the predicted number of amino acids incorporated into polypeptides was upheld. Several examples are shown in Table 12.3. When the data were combined with those on composition assignment and triplet binding, specific assignments were possible.

One example of specific assignments made in this way will illustrate the value of Khorana's approach. Consider the following experiments in concert with one another:

(1) The repeating trinucleotide sequence UUCUUCUUC . . . can be read as three possible repeating triplets (UUC, UCU, and CUU), depending on the initiation point. When placed in a cell-free translation system, three different polypeptide homopolymers, containing either phenylalanine, serine, or leucine, are produced. Thus, we know that each of the three triplets encodes one of the

three amino acids, but we do not know which codes which;

(2) On the other hand, the repeating dinucleotide sequence UCUCUCUC . . . produces the triplets UCU and CUC and, when used in an experiment, leads to the incorporation of leucine and serine into a polypeptide. Thus, the triplets UCU and CUC specify leucine and serine, but we still do not know which triplet specifies which amino acid. However, when considering both sets of results in concert, we can conclude that UCU, which is common to both experiments, must encode either leucine or serine but not phenylalanine. Thus, either CUU or UUC encodes leucine or serine, while the other encodes phenylalanine;

(3) To derive more specific information, we can examine the results of using the repeating tetranucleotide sequence UUAC, which produces the triplets UUA, UAC, ACU, and CUU. The CUU triplet is one of the two in which we are interested. Three amino acids are incorporated by this experiment: leucine, threonine, and tyrosine. Because CUU must specify phenylalanine, serine, or leucine (from the first experiment), and because, of these, only leucine appears in the resulting polypeptide, we may conclude that CUU specifies leucine. Once this assignment is established, we can logically determine all others. Of the two triplet pairs remaining (UUC and UCU from the first experiment and UCU and CUC from the second experiment), whichever triplet is common to both must encode serine. This is UCU. By elimination, UUC is determined to encode phenylalanine and CUC is determined to encode leucine. Thus, through painstaking logical analysis, four specific triplets encoding three different amino acids have been assigned from these experiments.

TABLE 12.3 Amino Acids Incorporated Using Repeated Synthetic Copolymers of RNA

    Repeating Copolymer    Codons Produced    Amino Acids in Polypeptides
    UG                     UGU                Cysteine
                           GUG                Valine
    AC                     ACA                Threonine
                           CAC                Histidine
    UUC                    UUC                Phenylalanine
                           UCU                Serine
                           CUU                Leucine
    AUC                    AUC                Isoleucine
                           UCA                Serine
                           CAU                Histidine
    UAUC                   UAU                Tyrosine
                           CUA                Leucine
                           UCU                Serine
                           AUC                Isoleucine
    GAUA                   GAU                None
                           AGA                None
                           UAG                None
                           AUA                None

From these and similar interpretations, Khorana reaffirmed the identity of triplets that had already been deciphered and filled in gaps left from other approaches. A number of examples are shown in Table 12.3. Of great interest, the use of two tetranucleotide sequences, GAUA and


GUAA, suggested that at least two triplets were termination codons. Khorana reached this conclusion because neither of these repeating sequences directed the incorporation of more than a few amino acids into a polypeptide, too few for him to detect. There are no triplets common to both messages, and both seemed to contain at least one triplet that terminates protein synthesis. Of the possible triplets in the poly-(GAUA) sequence shown in Table 12.3, UAG was later shown to be a termination codon.

ESSENTIAL POINT
Use of the triplet binding assay and of repeating copolymers allowed the determination of the specific sequences of triplet codons designating specific amino acids. ■

NOW SOLVE THIS

12.2 When repeating copolymers are used to form synthetic mRNAs, dinucleotides produce a single type of polypeptide that contains only two different amino acids. On the other hand, using a trinucleotide sequence produces three different polypeptides, each consisting of only a single amino acid. Why? What will be produced when a repeating tetranucleotide is used?

HINT: This problem asks you to consider different outcomes of repeating copolymer experiments. The key to its solution is to be aware that when using a repeating copolymer of RNA, translation can be initiated at different ribonucleotides. You must simply determine the number of triplet codons produced by initiation at each of the different ribonucleotides.

12.4 The Coding Dictionary Reveals the Function of the 64 Triplets

The various techniques used to decipher the genetic code have yielded a dictionary of 61 triplet codons assigned to amino acids. The remaining three triplets are termination signals and do not specify any amino acid.

Degeneracy and the Wobble Hypothesis

A general pattern of triplet codon assignments becomes apparent when we look at the genetic coding dictionary. Figure 12.7 designates the assignments in a particularly illustrative form first suggested by Francis Crick.

FIGURE 12.7 The coding dictionary. AUG encodes methionine, which initiates most polypeptide chains. All other amino acids except tryptophan, which is encoded only by UGG, are encoded by two to six triplets. The triplets UAA, UAG, and UGA are termination signals and do not encode any amino acids.

    First position                 Second position of codon                      Third position
    (5′-end)         U            C            A             G                   (3′-end)
    U                UUU Phe      UCU Ser      UAU Tyr       UGU Cys             U
                     UUC Phe      UCC Ser      UAC Tyr       UGC Cys             C
                     UUA Leu      UCA Ser      UAA Stop      UGA Stop            A
                     UUG Leu      UCG Ser      UAG Stop      UGG Trp             G
    C                CUU Leu      CCU Pro      CAU His       CGU Arg             U
                     CUC Leu      CCC Pro      CAC His       CGC Arg             C
                     CUA Leu      CCA Pro      CAA Gln       CGA Arg             A
                     CUG Leu      CCG Pro      CAG Gln       CGG Arg             G
    A                AUU Ile      ACU Thr      AAU Asn       AGU Ser             U
                     AUC Ile      ACC Thr      AAC Asn       AGC Ser             C
                     AUA Ile      ACA Thr      AAA Lys       AGA Arg             A
                     AUG Met      ACG Thr      AAG Lys       AGG Arg             G
    G                GUU Val      GCU Ala      GAU Asp       GGU Gly             U
                     GUC Val      GCC Ala      GAC Asp       GGC Gly             C
                     GUA Val      GCA Ala      GAA Glu       GGA Gly             A
                     GUG Val      GCG Ala      GAG Glu       GGG Gly             G

    Initiation: AUG. Termination: UAA, UAG, UGA.

Most evident is that the code is degenerate, as the early researchers predicted. That is, almost all amino acids are specified by two, three, or four different codons. Three amino acids (serine, arginine, and leucine) are each encoded by six different codons. Only tryptophan and methionine are encoded by single codons.

Also evident is the pattern of degeneracy. Most often, in a set of codons specifying the same amino acid, the first two letters are the same, with only the third differing. Crick discerned a pattern in the degeneracy at the third position, and in 1966, he postulated the wobble hypothesis.

Crick's hypothesis first predicted that the initial two ribonucleotides of triplet codes are more critical than the third in attracting the correct tRNA during translation. He postulated that hydrogen bonding at the third position of the codon–anticodon interaction is less constrained and need not adhere as specifically to the established base-pairing rules. The wobble hypothesis thus proposes a more flexible set of base-pairing rules at the third position of the codon (Table 12.4).

This relaxed base-pairing requirement, or "wobble," allows the anticodon of a single form of tRNA to pair with more than one triplet in mRNA. Consistent with the wobble hypothesis and degeneracy, U at the first position (the 5′-end) of the tRNA anticodon may pair with A or G at the third position (the 3′-end) of the mRNA codon, and G may likewise pair with U or C. Inosine (I), one of the modified bases found in tRNA, may pair with C, U, or A. Applying these wobble rules, a minimum of about 30 different tRNA species is necessary to accommodate the 61 triplets specifying an amino acid. If nothing more, wobble can be considered a potential economy measure, provided that the


fidelity of translation is not compromised. Current estimates are that 30 to 40 tRNA species are present in bacteria and up to 50 tRNA species exist in animal and plant cells.

TABLE 12.4 Codon–Anticodon Base-Pairing Rules

Base at First Position (5′-end) of tRNA    Base at Third Position (3′-end) of mRNA
A                                          U
C                                          G
G                                          C or U
U                                          A or G
I                                          A, U, or C

The Ordered Nature of the Code

Still another observation has become apparent in the pattern of codon sequences and their corresponding amino acids, leading to the description referred to as an ordered genetic code. Chemically similar amino acids often share one or two “middle” bases in the different triplets encoding them. For example, either U or C is often present in the second position of triplets that specify hydrophobic amino acids, including valine and alanine, among others. Two codons (AAA and AAG) specify the positively charged amino acid lysine. If only the middle letter of these codons is changed from A to G (AGA and AGG), the positively charged amino acid arginine is specified.

The chemical properties of amino acids will be discussed in more detail in Chapter 13. The end result of an “ordered” code is that it buffers the potential effect of mutation on protein function. While many mutations of the second base of triplet codons result in a change of one amino acid to another, the change is often to an amino acid with similar chemical properties. In such cases, protein function may not be noticeably altered.

Initiation and Termination

In contrast to the in vitro experiments discussed earlier, initiation of protein synthesis in vivo is a highly specific process. In bacteria, the initial amino acid inserted into all polypeptide chains is a modified form of methionine, N-formylmethionine (fMet). Only one codon, AUG, codes for methionine, and it is sometimes called the initiator codon. However, when AUG appears internally in mRNA, rather than at an initiating position, unformylated methionine is inserted into the polypeptide chain. Rarely, another codon, GUG, specifies methionine during initiation, though it is not clear why this happens, since GUG normally encodes valine.

In bacteria, either the formyl group is removed from the initial methionine upon the completion of protein synthesis or the entire formylmethionine residue is removed. In eukaryotes, methionine is also the initial amino acid during polypeptide synthesis. However, it is not formylated.

As mentioned in the preceding section, three other triplets (UAG, UAA, and UGA) serve as termination codons, punctuation signals that do not code for any amino acid. They are not recognized by a tRNA molecule, and translation terminates when they are encountered. Mutations that produce any of the three triplets internally in a gene will also result in termination. Consequently, only a partial polypeptide has been synthesized when it is prematurely released from the ribosome. When such a change occurs in the DNA, it is called a nonsense mutation.

ESSENTIAL POINT
The complete coding dictionary reveals that of the 64 possible triplet codons, 61 encode the 20 amino acids found in proteins, while three triplets terminate translation.

12.5 The Genetic Code Has Been Confirmed in Studies of Bacteriophage MS2

The various aspects of the genetic code discussed thus far yield a fairly complete picture, suggesting that it is triplet in nature, degenerate, unambiguous, and commaless, and that it contains punctuation start and stop signals. That these features are correct was confirmed by analysis of the RNA-containing bacteriophage MS2 by Walter Fiers and his coworkers.

MS2 is a bacteriophage that infects E. coli. Its nucleic acid (RNA) contains only about 3500 ribonucleotides, making up only four genes, specifying a coat protein, an RNA replicase, a lysis protein, and a maturation protein. The small genome and few gene products enabled Fiers and his colleagues to sequence the genes and their products. When the chemical constitution of these genes and their encoded proteins were compared, they were found to exhibit colinearity. That is, based on the coding dictionary, the linear sequence of triplet codons corresponds precisely with the linear sequence of amino acids in each protein. Punctuation was also confirmed. For example, in the coat protein gene, the codon for the first amino acid is AUG, the common initiator codon. The codon for the last amino acid is followed by two consecutive termination codons, UAA and UAG. The analysis clearly showed that the genetic code in this virus was identical to that established experimentally in bacterial systems.
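The punctuation rules described above (initiation at AUG, termination at UAA, UAG, or UGA, and premature termination caused by a nonsense mutation) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the mRNA sequence is hypothetical and is not taken from MS2, the function name `translate` is our own, and the initial formylmethionine is written simply as M.

```python
from itertools import product

# Standard coding dictionary, built in the conventional U, C, A, G order.
# One-letter amino acid symbols; "*" marks the three termination codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG until a termination codon is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no initiator codon found
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # UAA, UAG, or UGA: stop
            break
        peptide.append(amino_acid)
    return "".join(peptide)


# Hypothetical mRNA: AUG AAA GUU UGG followed by two stop codons (UAA UAG).
mrna = "GGAUGAAAGUUUGGUAAUAGCC"

# Hypothetical nonsense mutation: UGG (Trp) becomes UGA (termination).
mutant = mrna[:13] + "A" + mrna[14:]

sense_codons = sum(aa != "*" for aa in CODON_TABLE.values())
print(sense_codons)        # 61 codons specify amino acids
print(translate(mrna))     # MKVW
print(translate(mutant))   # MKV (polypeptide released prematurely)
```

The mutant differs from the original message by a single base, yet the polypeptide is one residue shorter, which is the defining outcome of a nonsense mutation.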

M12_KLUG8414_10_SE_C12.indd 226 16/11/18 5:14 pm 12.7 Different Initiation Points Create Overlapping Genes 227

12.6 The Genetic Code Is Nearly Universal

Between 1960 and 1978, it was generally assumed that the genetic code would be found to be universal, applying equally to viruses, bacteria, archaea, and eukaryotes. Certainly, the nature of mRNA and the translation machinery seemed to be very similar in these organisms. For example, cell-free systems derived from bacteria can translate eukaryotic mRNAs. Poly U stimulates synthesis of polyphenylalanine in cell-free systems when the components are derived from eukaryotes. Many recent studies involving recombinant DNA technology (see Chapter 17) reveal that eukaryotic genes can be inserted into bacterial cells, where they are then transcribed and translated. Within eukaryotes, mRNAs from mice and rabbits have been injected into amphibian eggs and efficiently translated. For the many eukaryotic genes that have been sequenced, notably those for hemoglobin molecules, the amino acid sequence of the encoded proteins adheres to the coding dictionary established from bacterial studies.

However, several 1979 reports on the coding properties of DNA derived from mitochondria (mtDNA) of yeast and humans undermined the principle of the universality of the genetic language. Since then, mtDNA has been examined in many other organisms.

Cloned mtDNA fragments have been sequenced and compared with the amino acid sequences of various mitochondrial proteins, revealing several exceptions to the coding dictionary (Table 12.5). Most surprising is that the codon UGA, normally specifying termination, encodes tryptophan during translation in yeast and human mitochondria. In yeast mitochondria, threonine is inserted instead of leucine when CUA is encountered in mRNA. In human mitochondria, AUA, which normally specifies isoleucine, directs the internal insertion of methionine.

TABLE 12.5 Exceptions to the Universal Code

Codon   Normal Code Word   Altered Code Word   Source
UGA     Termination        Trp                 Human and yeast mitochondria; Mycoplasma
CUA     Leu                Thr                 Yeast mitochondria
AUA     Ile                Met                 Human mitochondria
AGA     Arg                Termination         Human mitochondria
AGG     Arg                Termination         Human mitochondria
UAA     Termination        Gln                 Paramecium, Tetrahymena, and Stylonychia
UAG     Termination        Gln                 Paramecium

In 1985, several other exceptions to the standard coding dictionary were discovered in the bacterium Mycoplasma capricolum and in the nuclear genes of the protozoan ciliates Paramecium, Tetrahymena, and Stylonychia. For example, as shown in Table 12.5, one alteration converts the termination codon UGA to tryptophan, yet several others convert the normal termination codons UAA and UAG to glutamine. These changes are significant because bacteria and several eukaryotes are involved, representing distinct species that have evolved separately over a long period of time.

Note the apparent pattern in several of the altered codon assignments. The change in coding capacity involves only a shift in recognition of the third, or wobble, position. For example, AUA specifies isoleucine in the cytoplasm and methionine in the mitochondrion, but in the cytoplasm, methionine is specified by AUG. Similarly, UGA calls for termination in the cytoplasm, but it specifies tryptophan in the mitochondrion; in the cytoplasm, tryptophan is specified by UGG. It has been suggested that such changes in codon recognition may represent an evolutionary trend toward reducing the number of tRNAs needed in mitochondria; only 22 tRNA species are encoded in human mitochondria, for example. However, until more examples are found, the differences must be considered to be exceptions to the previously established general coding rules.

12.7 Different Initiation Points Create Overlapping Genes

Earlier we stated that the genetic code is nonoverlapping: each ribonucleotide in an mRNA is part of only one codon. However, this characteristic of the code does not rule out the possibility that a single mRNA may have multiple initiation points for translation. If so, these points could theoretically create several different reading frames within the same mRNA, thus specifying more than one polypeptide and leading to the concept of overlapping genes.

That this might actually occur in some viruses was suspected when phage φX174 was carefully investigated. The circular DNA chromosome consists of 5386 nucleotides, which should encode a maximum of 1795 amino acids, sufficient for five or six proteins. However, this small virus in fact synthesizes 11 proteins consisting of more than 2300 amino acids. A comparison of the nucleotide sequence of the DNA and the amino acid sequences of the polypeptides synthesized has clarified the apparent paradox. At least four cases of multiple initiation have been discovered, creating overlapping genes.
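The φX174 paradox is a matter of simple arithmetic, and the idea of reading frames can be made concrete in a few lines of Python. The sketch below is illustrative only: the genome length and the amino acid total come from the text, while the short RNA sequence and the helper name `reading_frames` are hypothetical and chosen just to show a second AUG appearing in a different frame.

```python
# Figures quoted in the text for phage phiX174.
GENOME_NUCLEOTIDES = 5386
OBSERVED_AMINO_ACIDS = 2300  # "more than 2300", so this is a lower bound

# One nonoverlapping reading of the genome: three nucleotides per amino acid.
max_amino_acids = GENOME_NUCLEOTIDES // 3
shortfall = OBSERVED_AMINO_ACIDS - max_amino_acids

print(max_amino_acids)  # 1795
print(shortfall)        # 505 amino acids that a single reading cannot supply


def reading_frames(seq):
    """Split a sequence into complete triplets in each of its three frames."""
    return [
        [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        for frame in range(3)
    ]


# Hypothetical sequence containing AUG in two different frames.
frames = reading_frames("AUGAAUGCCUAA")
for number, codons in enumerate(frames):
    print(number, codons)
```

In this toy sequence, frame 0 reads AUG AAU GCC UAA, whereas frame 1 contains its own AUG (UGA AUG CCU). A second initiation point in a shifted frame is how the same stretch of nucleotides could specify part of more than one polypeptide.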


For example, in one case, the coding sequences for the initiation of two polypeptides are found at separate positions within the reading frame that specifies the sequence of a third polypeptide. In another case, seven different polypeptides may be created from a DNA sequence that might otherwise have specified only three polypeptides.

A similar situation has been observed in other viruses, including phage G4 and the animal virus SV40. Like φX174, phage G4 contains a circular single-stranded DNA molecule. The use of overlapping reading frames optimizes the use of a limited amount of DNA present in these small viruses. However, such an approach to storing information has a distinct disadvantage in that a single mutation may affect more than one protein and thus increase the chances that the change will be deleterious or lethal.

12.8 Transcription Synthesizes RNA on a DNA Template

Even while the genetic code was being studied, it was quite clear that proteins were the end products of many genes. Thus, while some geneticists attempted to elucidate the code, other research efforts focused on the nature of genetic expression. The central question was how DNA, a nucleic acid, could specify a protein composed of amino acids.

The complex multistep process begins with the transfer of genetic information stored in DNA to RNA. The process by which RNA molecules are synthesized on a DNA template is called transcription. It results in an mRNA molecule complementary to the gene sequence of one of the double helix’s two strands. Each triplet codon in the mRNA is, in turn, complementary to the anticodon region of its corresponding tRNA as the amino acid is correctly inserted into the polypeptide chain during translation. The significance of transcription is enormous, for it is the initial step in the process of information flow within the cell. The idea that RNA is involved as an intermediate molecule in the process of information flow between DNA and protein was suggested by the following observations:

1. DNA is, for the most part, associated with chromosomes in the nucleus of the eukaryotic cell. However, protein synthesis occurs in association with ribosomes located outside the nucleus in the cytoplasm. Therefore, DNA does not appear to participate directly in protein synthesis.
2. RNA is synthesized in the nucleus of eukaryotic cells, where DNA is found, and is chemically similar to DNA.
3. Following its synthesis, most messenger RNA migrates to the cytoplasm, where protein synthesis (translation) occurs.

Collectively, these observations suggested that genetic information, stored in DNA, is transferred to an RNA intermediate, which directs the synthesis of proteins. As with most new ideas in molecular genetics, the initial supporting evidence was based on experimental studies of bacteria and their phages. It was clearly established that during initial infection, RNA synthesis preceded phage protein synthesis and that the RNA is complementary to phage DNA.

The results of these experiments agree with the concept of a messenger RNA (mRNA) being made on a DNA template and then directing the synthesis of specific proteins in association with ribosomes. This concept was formally proposed by François Jacob and Jacques Monod in 1961 as part of a model for gene regulation in bacteria. Since then, mRNA has been isolated and studied thoroughly. There is no longer any question about its role in genetic processes.

12.9 RNA Polymerase Directs RNA Synthesis

To establish that RNA can be synthesized on a DNA template, it was necessary to demonstrate that there is an enzyme capable of directing this synthesis. By 1959, several investigators, including Samuel Weiss, had independently isolated such a molecule from rat liver. Called RNA polymerase, it has the same general substrate requirements as does DNA polymerase, the major exception being that the substrate nucleotides contain the ribose rather than the deoxyribose form of the sugar. Unlike DNA polymerase, it requires no primer to initiate synthesis. The overall reaction summarizing the synthesis of RNA on a DNA template can be expressed as

$$n(\mathrm{NTP}) \xrightarrow{\text{RNA polymerase}} (\mathrm{NMP})_n + n(\mathrm{PP_i})$$

As this equation reveals, nucleoside triphosphates (NTPs) are substrates for the enzyme, which catalyzes the polymerization of nucleoside monophosphates (NMPs), or nucleotides, into a polynucleotide chain $(\mathrm{NMP})_n$. Nucleotides are linked during synthesis by 3′ to 5′ phosphodiester bonds (see Figure 9.9). The energy released by cleaving the triphosphate precursor into the monophosphate form drives the reaction, and inorganic pyrophosphates ($\mathrm{PP_i}$) are produced.

A second equation summarizes the sequential addition of each ribonucleotide as the process of transcription progresses:

$$(\mathrm{NMP})_n + \mathrm{NTP} \xrightarrow{\text{RNA polymerase}} (\mathrm{NMP})_{n+1} + \mathrm{PP_i}$$

As this equation shows, each step of transcription involves the addition of one ribonucleotide (NMP) to the


growing polyribonucleotide chain $(\mathrm{NMP})_{n+1}$, using a nucleoside triphosphate (NTP) as the precursor.

RNA polymerase from E. coli has been extensively characterized and shown to consist of subunits designated α, β, β′, and ω. The core enzyme contains the subunits α (two copies), β, β′, and ω. A slightly more complex form of the enzyme, the holoenzyme, contains an additional subunit called the sigma (σ) factor and has a molecular weight of almost 500,000 Da. The β and β′ polypeptides provide the catalytic basis and active site for transcription, while the σ factor [Figure 12.8(a)] plays a regulatory function in the initiation of RNA transcription. Although there is but a single form of the core enzyme in E. coli, there are several different σ factors, creating variations of the polymerase holoenzyme. On the other hand, eukaryotes display three distinct forms of RNA polymerase, each consisting of a greater number of polypeptide subunits than in bacteria.

FIGURE 12.8 The early stages of transcription in bacteria, showing (a) the components of the process; (b) template binding at the −10 site involving the σ factor of RNA polymerase and subsequent initiation of RNA synthesis; and (c) chain elongation, after the σ factor has dissociated from the transcription complex and the enzyme moves along the DNA template.

ESSENTIAL POINT
Transcription, the initial step in gene expression, is the synthesis, under the direction of RNA polymerase, of a strand of RNA complementary to a DNA template.

Promoters, Template Binding, and the σ Factor

Transcription results in the synthesis of a single-stranded RNA molecule complementary to a region along only one strand of the DNA double helix. When discussing transcription, the DNA strand that serves as a template for RNA polymerase is denoted as the template strand and the complementary DNA strand is called the coding strand. Note that the complementary strand is called the coding strand because it and the RNA molecule transcribed from the template strand have the same 5′ to 3′ nucleotide sequence, but with uridine (U) substituted for thymidine (T) in the RNA.

The initial step is template binding [Figure 12.8(b)]. In bacteria, the site of this initial binding is established when the RNA polymerase σ factor recognizes specific DNA sequences called promoters. These regions are located in the region upstream (5′) from the point of initial transcription of a gene. It is believed that the enzyme “explores” a length of DNA until it recognizes the promoter region and binds to about 60 nucleotide pairs of the helix, 40 of which are upstream from the point of initial transcription. Once this occurs, the helix is denatured or unwound locally, making the DNA template accessible to the action of the enzyme. The point at which transcription actually begins is called the transcription start site, often indicated as position +1.

The importance of promoter sequences cannot be overemphasized. They govern the efficiency of the initiation of transcription. In bacteria, both strong promoters and weak promoters have been discovered. Because the interaction of promoters with RNA polymerase governs transcription, the nature of the binding between them is at the heart of discussions concerning genetic regulation, the subject of Chapters 15 and 16. While we will later pursue more


detailed information involving promoter–enzyme interactions, we must address three points here.

The first point is the concept of consensus sequences of DNA. These sequences are similar (homologous) in different genes of the same organism or in one or more genes of related organisms. Their conservation throughout evolution attests to the critical nature of their role in biological processes. Two such sequences have been found in bacterial promoters. One, TATAAT, is located 10 nucleotides upstream from the site of initial transcription (the −10 region, or Pribnow box). The other, TTGACA, is located 35 nucleotides upstream (the −35 region). Mutations in either region diminish transcription, often severely.

Sequences such as these are said to be cis-acting DNA elements. Use of the term cis is drawn from organic chemistry nomenclature, meaning “next to” or on the same side as, in contrast to being “across from,” or trans, to other functional groups. In molecular genetics, then, cis-elements are adjacent parts of the same DNA molecule. This is in contrast to trans-acting factors, molecules that bind to these DNA elements to influence gene expression.

The second point is that the degree of RNA polymerase binding to different promoters varies greatly, causing variable gene expression. Currently, this is attributed to sequence variation in the promoters. In bacteria, both strong promoters and weak promoters have been discovered, causing a variation in time of initiation from once every 1 to 2 seconds to as little as once every 10 to 20 minutes. Mutations in promoter sequences may severely reduce the initiation of gene expression.

The third point involves the σ factor in bacteria. In E. coli, the major form is designated as σ70, based on its molecular weight of 70 kilodaltons (kDa). The promoters of most bacterial genes are recognized by this form; however, several alternative σ factors (e.g., σ32, σ54, σS, and σE) are called upon to regulate other genes. Each σ factor recognizes different promoter sequences, which in turn provides specificity to the initiation of transcription.

Initiation, Elongation, and Termination of RNA Synthesis

Once it has recognized and bound to the promoter [Figure 12.8(b)], RNA polymerase catalyzes initiation, the insertion of the first 5′-ribonucleoside triphosphate, which is complementary to the first nucleotide at the start site of the DNA template strand. As we noted earlier, no primer is required. Subsequent ribonucleotide complements are inserted and linked by phosphodiester bonds as RNA polymerization proceeds. This process of chain elongation [Figure 12.8(c)] continues in a 5′ to 3′ extension, creating a temporary DNA/RNA duplex whose chains run antiparallel to one another.

After a few ribonucleotides have been added to the growing RNA chain, the σ factor dissociates from the holoenzyme and elongation proceeds under the direction of the core enzyme. In E. coli, this process proceeds at the rate of about 50 nucleotides/second at 37°C.

Eventually, the enzyme traverses the entire gene until it encounters a specific nucleotide sequence that acts as a termination signal. Such termination sequences are extremely important in bacteria because of the close proximity of the end of one gene and the upstream sequences of the adjacent gene. An interesting aspect of termination in bacteria is that the termination sequence alluded to above is actually transcribed into RNA. The unique sequence of nucleotides in this termination region causes the newly formed transcript to fold back on itself, forming what is called a hairpin secondary structure, held together by hydrogen bonds. There are two different types of transcription termination mechanisms in bacteria, both of which are dependent on the formation of a hairpin structure in the RNA being transcribed.

Most transcripts in E. coli are terminated by intrinsic termination. In intrinsic termination, a hairpin structure encoded by the termination sequence causes RNA polymerase to stall. Immediately after the hairpin is a string of uracil residues. The U bases of the transcript have a relatively weak interaction with the A bases on the template strand of the DNA because there are only two hydrogen bonds per base pair. This leads to dissociation of RNA polymerase and the transcript is released.

Other bacterial transcripts are terminated by rho-dependent termination, which involves a termination factor called rho (ρ). Rho is a large hexameric protein with RNA helicase activity: it can dissociate RNA hairpins and DNA–RNA interactions. Rho binds to a specific sequence on the transcript and moves in the 3′ direction chasing after RNA polymerase. When RNA polymerase reaches the hairpin structure encoded by the termination sequence, it pauses and rho catches up. Rho moves through the hairpin and then causes dissociation of RNA polymerase by breaking the hydrogen bonds between the DNA template and the transcript.

In bacteria, groups of genes whose products function together are often clustered along the chromosome. In many such cases, they are contiguous, and all but the last gene lack the termination sequence. The result is that during transcription, a large mRNA is produced that encodes more than one protein. Since genes in bacteria are sometimes called cistrons, the RNA is called a polycistronic mRNA. The products of genes transcribed in this fashion are usually all needed by the cell at the same time, so this is an efficient way to transcribe and subsequently translate the needed genetic information. Polycistronic mRNAs are rare in eukaryotes.
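The relationship among the template strand, the coding strand, and the transcript, together with the idea of a promoter consensus sequence, can be sketched in Python. This is a minimal illustration under stated assumptions: the template sequence and the candidate −10 site are hypothetical, the function names are our own, and counting matching positions is only a crude stand-in for how strongly a promoter resembles the consensus.

```python
# Watson-Crick complements for DNA bases.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def coding_strand(template_5_to_3):
    """Return the coding strand, written 5' to 3', for a template strand."""
    # Complement each base, then reverse because the strands are antiparallel.
    return template_5_to_3.translate(DNA_COMPLEMENT)[::-1]


def transcribe(template_5_to_3):
    """Return the RNA, written 5' to 3', transcribed from a template strand."""
    # The transcript matches the coding strand, with U in place of T.
    return coding_strand(template_5_to_3).replace("T", "U")


def consensus_matches(site, consensus):
    """Count the positions at which a site agrees with a consensus sequence."""
    return sum(a == b for a, b in zip(site, consensus))


# Hypothetical template strand, written 5' to 3'.
template = "TTACGTCAT"
print(coding_strand(template))  # ATGACGTAA
print(transcribe(template))     # AUGACGUAA

# Hypothetical -10 region compared with the TATAAT consensus.
print(consensus_matches("TATGAT", "TATAAT"))  # 5 of 6 positions match
```

Because the polymerase reads the template strand 3′ to 5′ while building RNA 5′ to 3′, the complement must be reversed before it is written in the conventional direction. Forgetting that step is the most common source of error when working problems of this type by hand.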


NOW SOLVE THIS

12.3 The following represent deoxyribonucleotide sequences in the template strand of DNA:

Sequence 1: 5′-CTTTTTTGCCAT-3′
Sequence 2: 5′-ACATCAATAACT-3′
Sequence 3: 5′-TACAAGGGTTCT-3′

(a) For each strand, determine the mRNA sequence that would be derived from transcription.
(b) Using Figure 12.7, determine the amino acid sequence that is encoded by these mRNAs.
(c) For Sequence 1, what is the sequence of the coding DNA strand?

HINT: This problem asks you to consider the outcome of the transfer of complementary information from DNA to RNA and to determine the amino acids encoded by this information. The key to its solution is to remember that in RNA, uracil is complementary to adenine, and that while DNA stores genetic information in the cell, the code that is translated is contained in the RNA complementary to the template strand of DNA.

12.10 Transcription in Eukaryotes Differs from Bacterial Transcription in Several Ways

Much of our knowledge of transcription has been derived from studies of bacteria. The general aspects of the mechanics of these processes are mostly similar in eukaryotes, but there are several notable differences:

1. Transcription in eukaryotes occurs within the nucleus. Thus, unlike the bacterial process, in eukaryotes the RNA transcript is not free to associate with ribosomes prior to the completion of transcription. For the mRNA to be translated, it must move out of the nucleus into the cytoplasm.
2. Transcription in eukaryotes occurs under the direction of three separate forms of RNA polymerase, rather than the single form seen in bacteria.
3. Initiation of transcription of eukaryotic genes requires that compact chromatin fiber, characterized by nucleosome coiling, be uncoiled to make the DNA helix accessible to RNA polymerase and other regulatory proteins. This transition, referred to as chromatin remodeling (Chapter 11), reflects the dynamics involved in the conformational change that occurs as the DNA helix is opened.
4. Initiation and regulation of transcription entail a more extensive interaction between cis-acting DNA sequences and trans-acting protein factors. For example, while bacterial RNA polymerase requires only a σ factor to bind the promoter and initiate transcription, in eukaryotes, several general transcription factors (GTFs) are required to bind the promoter, recruit RNA polymerase, and initiate transcription. Furthermore, in addition to promoters, eukaryotic genes often have other cis-acting control units called enhancers and silencers (discussed below, and in more detail in Chapter 16), which greatly influence transcriptional activity.
5. In bacteria, transcription termination is often dependent upon the formation of a hairpin secondary structure in the transcript. However, eukaryotic transcription termination is more complex. Transcriptional termination for protein-coding genes involves sequence-specific cleavage of the transcript, which then leads to eventual dissociation of RNA polymerase from the DNA template.
6. In eukaryotes, the initial (or primary) transcripts of protein-coding mRNAs, called pre-mRNAs, undergo complex alterations, generally referred to as “processing,” to produce a mature mRNA. Processing often involves the addition of a 5′-cap and a 3′-tail, and the removal of intervening sequences that are not a part of the mature mRNA.

In the remainder of this chapter we will look at the basic details of transcription and mRNA processing in eukaryotic cells. The process of transcription is highly regulated, determining which DNA sequences are copied into RNA and when and how frequently they are transcribed. We will return to topics directly related to the regulation of eukaryotic gene transcription later in the text (see Chapter 16).

Initiation, Elongation, and Termination of Transcription in Eukaryotes

As noted earlier, eukaryotic RNA polymerase exists in three distinct forms. Each eukaryotic RNA polymerase is larger and more complex than the single form of RNA polymerase found in bacteria. For example, yeast and human RNA polymerase II enzymes consist of 12 subunits. While the three forms of the enzyme share certain protein subunits, each nevertheless transcribes different types of genes, as indicated in Table 12.6.

TABLE 12.6 RNA Polymerases in Eukaryotes

Form   Product          Location
I      rRNA             Nucleolus
II*    mRNA, snRNA      Nucleoplasm
III    5S rRNA, tRNA    Nucleoplasm

* RNAP II also synthesizes a variety of other RNAs, including miRNAs and lncRNAs (see Chapter 16).


RNA polymerases I and III (RNAP I and RNAP III) transcribe transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs), which are needed in essentially all cells at all times for the basic process of protein synthesis. In contrast, RNA polymerase II (RNAP II), which transcribes protein-coding genes, is highly regulated. Protein-coding genes are often expressed at different times, in response to different signals, and in different cell types. Thus, RNAP II activity is tightly regulated on a gene-by-gene basis. For this reason, most studies of transcription in eukaryotes have focused on RNAP II.

The activity of RNAP II is dependent on both the cis-acting regulatory elements of the gene and a number of trans-acting transcription factors that bind to these DNA elements. (We will consider cis elements first and then turn to trans factors.)

At least four different types of cis-acting DNA elements regulate the initiation of transcription by RNAP II. The first of these, the core promoter, includes the transcription start site. It determines where RNAP II binds to the DNA and where it begins transcribing the DNA into RNA. Another promoter element, called a proximal-promoter element, is located upstream of the start site and helps modulate the level of transcription. The last two types of cis-acting elements, called enhancers and silencers, influence the efficiency or the rate of transcription initiation by RNAP II from the core-promoter element.

In some eukaryotic genes, a cis-acting element within the core promoter is the Goldberg–Hogness box, or TATA box. Located about 30 nucleotide pairs upstream (−30) from the start point of transcription, TATA boxes share a consensus sequence TATA(A/T)AAR, where R indicates any purine nucleotide. The sequence and function of TATA boxes are analogous to those found in the −10 promoter region of bacterial genes. However, recall that in bacteria, RNA polymerase binds directly to the −10 promoter region. As we will see below, the same is not the case in eukaryotes.

Although eukaryotic promoter elements can determine the site and general efficiency of initiation, other elements, known as enhancers and silencers, have more dramatic effects on eukaryotic gene transcription. As their names suggest, enhancers increase transcription levels and silencers decrease them. The locations of these elements can vary from immediately upstream of a promoter to downstream, within, or kilobases away from a gene. In other words, they can modulate transcription from a distance. Each eukaryotic gene has its own unique arrangement of promoter, enhancer, and silencer elements.

Complementing the cis-acting regulatory sequences are various trans-acting factors that facilitate RNAP II binding and, therefore, the initiation of transcription. These proteins are referred to as transcription factors. There are two broad categories of transcription factors: the general transcription factors (GTFs) that are absolutely required for all RNAP II–mediated transcription, and the transcriptional activators and transcriptional repressors that influence the efficiency or the rate of RNAP II transcription initiation.

The general transcription factors are essential because RNAP II cannot bind directly to eukaryotic core-promoter sites and initiate transcription without their presence. The general transcription factors involved with human RNAP II binding are well characterized and are designated TFIIA, TFIIB, and so on. One of these, TFIID, binds directly to the TATA-box sequence. Once initial binding of TFIID to DNA occurs, the other general transcription factors, along with RNAP II, bind sequentially to TFIID, forming an extensive pre-initiation complex.

Transcriptional activators and repressors bind to enhancer and silencer elements and regulate transcription initiation by aiding or preventing the assembly of pre-initiation complexes and the release of RNAP II from pre-initiation into full transcription elongation. They appear to supplant the role of the σ factor seen in the bacterial enzyme and are important in eukaryotic gene regulation. We will consider the roles of general and specific transcription factors in eukaryotic gene regulation, as well as the various DNA elements to which they bind, in more detail later in the text (Chapter 16).

Unlike in bacteria, there is no specific sequence that signals for the termination of transcription. In fact, RNAP II often continues transcription well beyond what will be the eventual 3′-end of the mature mRNA. Once transcription has incorporated a specific sequence AAUAAA, known as the polyadenylation signal sequence (discussed below), the transcript is enzymatically cleaved roughly 10 to 35 bases further downstream in the 3′ direction. Cleavage of the transcript destabilizes RNAP II, and both DNA and RNA are released from the enzyme as transcription is terminated. This completes the cycle that constitutes transcription.

Processing Eukaryotic RNA: Caps and Tails

Although the base sequence of DNA in bacteria is transcribed into an mRNA that is immediately and directly translated into the amino acid sequence as dictated by the genetic code, eukaryotic mRNAs require significant alteration before they are transported to the cytoplasm and translated. By 1970, evidence showed that eukaryotic mRNA is transcribed initially as a precursor molecule much larger than that which is translated into protein. It was proposed that this primary transcript of a gene (a pre-mRNA) must be processed in the nucleus before it appears in the cytoplasm as a mature mRNA molecule. The various processing steps, discussed in the sections that follow, are summarized in Figure 12.9.

An important posttranscriptional modification of eukaryotic RNA transcripts destined to become


mRNAs occurs at the 5′ end of these molecules, where a 7-methylguanosine (m7G) cap is added. This cap is added shortly after synthesis of the initial RNA transcript has begun. The cap stabilizes the mRNA by protecting the 5′ end of the molecule from nuclease attack. Subsequently, the cap facilitates the transport of mature mRNAs from the nucleus into the cytoplasm and is required for the initiation of translation of the mRNA into protein. Chemically, the cap is a guanosine residue with a methyl group (CH3) at position 7 of the base. The cap is also distinguished by a unique 5′ to 5′ triphosphate bridge that connects it to the initial ribonucleotide of the RNA.

Further insights into the processing of RNA transcripts during the maturation of mRNA came from the discovery that mRNAs contain, at their 3′ end, a stretch of as many as 250 adenylic acid residues. As discussed earlier in the context of eukaryotic transcription termination, the transcript is cleaved roughly 10 to 35 ribonucleotides after the highly conserved AAUAAA polyadenylation signal sequence. An enzyme known as poly-A polymerase then catalyzes the addition of a poly-A tail to the free 3′-OH group at the end of the transcript. Poly-A tails are found at the 3′ end of almost all mRNAs studied in a variety of eukaryotic organisms. The exceptions in eukaryotes seem to be mRNAs that encode histone proteins.

While the AAUAAA signal sequence is not found on all eukaryotic mRNAs, it appears to be essential to those that have it. If the sequence is changed as a result of a mutation, those transcripts that would normally have it cannot add the poly-A tail. In the absence of this tail, these RNA transcripts are rapidly degraded by nucleases. The poly-A binding protein, as the name suggests, binds to poly-A tails and prevents nucleases from degrading the 3′ end of the mRNA. In addition, the poly-A tail is important for export of the mRNA from the nucleus to the cytoplasm and for translation of the mRNA.

Poly-A tails are also found on mRNAs in bacteria and archaea. However, these bacterial poly-A tails are generally much shorter and found on only a small fraction of mRNA molecules. In addition, whereas poly-A tails are protective in eukaryotes, poly-A tails are generally associated with mRNA degradation in bacteria.

FIGURE 12.9 Posttranscriptional RNA processing in eukaryotes. Beginning at the promoter (P) of a gene, transcription produces a pre-mRNA containing several introns (I) and exons (E), as identified under the DNA template strand. Shortly after transcription begins, an m7G cap is added to the 5′ end. Next, and during transcription elongation, the introns are spliced out and the exons joined. Finally, a poly-A tail is added to the 3′ end. While this figure depicts these steps sequentially, in some eukaryotic transcripts, the poly-A tail is added before splicing of all introns has been completed.
[Figure: coding and template DNA strands with promoter (P), exons E1 to E3, and introns I1 to I3; transcription yields the pre-mRNA; the m7G cap is added; splicing joins E1, E2, and E3; the poly-A tail is added to give the mature mRNA.]

ESSENTIAL POINT
The process of creating the initial transcript during transcription is more complex in eukaryotes than in bacteria, including the addition of a 5′ m7G cap and a 3′ poly-A tail to the pre-mRNA. ■

12.11 The Coding Regions of Eukaryotic Genes Are Interrupted by Intervening Sequences Called Introns

As mentioned above, the primary mRNA transcript, or pre-mRNA, is often longer than the mature mRNA in eukaryotes. An explanation for this phenomenon emerged in 1977 when research groups led by Phillip Sharp and Richard Roberts independently published direct evidence that the genes of animal viruses contain internal (also referred to as intervening or intragenic) nucleotide sequences that do not encode amino acids in the final protein product. These noncoding internal sequences are also present in pre-mRNAs, but they are removed during RNA processing to produce the mature mRNA (Figure 12.9), which is then translated. Such nucleotide sequences, which intervene between sequences that code for amino acids, are called introns (derived from intragenic region). Sequences that are retained in the mature mRNA and expressed are called exons (for expressed region). The process of removing introns from a pre-mRNA and joining together exons is called RNA splicing.

One of the first intron-containing genes identified was the β-globin gene in mice and rabbits, studied independently by Philip Leder and Richard Flavell. The mouse gene contains an intron 550 nucleotides long, beginning


TABLE 12.7 Contrasting Human Gene Size, mRNA Size, and Number of Introns

Gene                        Gene Size (kb)   mRNA Size (kb)   Number of Introns
Insulin                            1.7              0.4                2
Collagen [pro-α-2(1)]             38.0              5.0               51
Albumin                           25.0              2.1               14
Phenylalanine hydroxylase         90.0              2.4               12
Dystrophin                      2400.0             17.0               79

[Figure 12.10: exon and intron maps of the mouse insulin, rabbit β-globin, and chicken ovalbumin genes, with the length of each segment given in nucleotides.]

FIGURE 12.10 Intron and exon sequences in various eukaryotic genes. The numbers indicate the number of nucleotides present in various intron and exon regions.

immediately after the sequence specifying the 104th amino acid. In rabbits, there is an intron of 580 base pairs near the sequence for the 110th amino acid, a strikingly similar pattern to that seen in mice. In addition, another intron of about 120 nucleotides exists earlier in both genes. Similar introns have been found in the β-globin gene in all mammals examined thus far.

The ovalbumin gene of chickens, as shown in Figure 12.10, contains seven introns. In fact, the majority of the gene's DNA sequence is composed of introns and is thus noncoding. The pre-mRNA is nearly three times the length of the mature spliced mRNA.

The identification of introns in eukaryotic genes involves a direct comparison of nucleotide sequences of DNA with those of mRNA and their correlation with amino acid sequences. Such an approach allows the precise identification of all intervening sequences. By identifying common sequences that appear at intron/exon boundaries, scientists are now able to identify introns with excellent accuracy using only the genomic DNA sequence and computational tools. We will return to this topic when we consider genomic analysis (Chapter 18).

We now have a fairly comprehensive view of intron-containing eukaryotic genes from many species. In the budding yeast Saccharomyces cerevisiae, 283 out of the roughly 6000 protein-coding genes have introns. However, introns are far more common in humans; roughly 94 percent of human protein-coding genes contain introns, with an average of nine exons and eight introns per gene. An extreme example of the number of introns present in a single gene is provided by the gene coding for one of the subunits of collagen, the major connective tissue protein in vertebrates. The pro-α-2(1) collagen gene contains 51 introns. The precision of RNA splicing must be extraordinary if errors are not to be introduced into the mature mRNA.

Equally noteworthy is the difference between the length of a typical gene and the length of the final mRNA after introns are removed by splicing. As shown in Table 12.7, only about 13 percent of the collagen gene consists of exons that appear in mature mRNA. For other genes, an even more extreme picture emerges. Only about 8 percent of the albumin gene codes for the amino acids in the albumin protein, and in the largest human gene known, dystrophin (which is the protein product absent in Duchenne muscular dystrophy), less than 1 percent of the gene sequence is retained in the mRNA.

Although the vast majority of mammalian genes examined thus far contain introns, there are several exceptions. Notably, the genes coding for histones and for interferon, a signaling protein of the immune system, appear to contain no introns.

Why Do Introns Exist?
A curious genetics student who first learns about the concept of introns and RNA splicing often wonders why introns exist. If intron sequences are destined for removal, then why are they there in the first place? Wouldn't it be more efficient if introns were absent and hence never transcribed? Indeed, scientists asked these same questions shortly after introns were discovered in 1977. However, we now know that introns serve several functions:

1. Some genes can encode more than one protein product through the alternative use of exons. This process, known as alternative splicing (described in more detail in Chapter 16), produces different mature mRNAs from the same pre-mRNA by splicing out introns and ligating together different combinations of exons. This means that a eukaryotic genome can encode a greater number of proteins than it has protein-coding genes.

2. Introns may also be important to the evolution of genes. On evolutionary time scales, DNA sequences may be moved around within the genome. The modular exon/intron gene structure allows for new genes to evolve when, for example, an exon is introduced into an existing gene.


3. Once an intron is excised from a pre-mRNA, it is generally degraded. However, there are many documented cases where an intron actually contains a functional noncoding RNA (see Chapter 16). In such cases, the excised intron is processed to liberate the noncoding RNA, which then functions within the cell.

4. Introns can also regulate transcription. For example, intronic sequences in the DNA frequently harbor cis-regulatory elements, such as enhancers and silencers that upregulate or downregulate transcription, respectively.
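To make the first function in the list above concrete, here is a minimal Python sketch of how one pre-mRNA can give rise to several mature mRNAs. The exon names, the choice of which exons may be skipped, and the function `possible_isoforms` are all hypothetical and exist only for illustration; real alternative splicing is regulated, and only some of the combinations enumerated here would be expected in a cell.

```python
from itertools import combinations


def possible_isoforms(exons, optional):
    """List exon combinations for mature mRNAs made from one pre-mRNA.

    exons: exon names in their order along the gene.
    optional: exons that may be skipped (an assumption for illustration).
    Exons are never reordered; each optional exon is either kept or skipped.
    """
    skippable = [e for e in exons if e in optional]
    isoforms = []
    for n_skipped in range(len(skippable) + 1):
        for skipped in combinations(skippable, n_skipped):
            isoforms.append(tuple(e for e in exons if e not in skipped))
    return isoforms


# Hypothetical gene with four exons; E2 and E3 may be skipped.
isoforms = possible_isoforms(["E1", "E2", "E3", "E4"], {"E2", "E3"})
for isoform in isoforms:
    print("-".join(isoform))
```

With two optional exons, this single hypothetical gene yields four distinct mature mRNAs, which is the sense in which a genome can encode more proteins than it has protein-coding genes.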

ESSENTIAL POINT
Many eukaryotic genes contain intervening sequences, or introns, which are transcribed into the pre-mRNA and must be spliced out to create the mature mRNA. ■

Splicing Mechanisms: Self-Splicing RNAs
The discovery of introns led to intensive attempts to elucidate the mechanism by which they are excised and exons are spliced back together. A great deal of progress has been made, relying heavily on in vitro studies. Interestingly, it appears that somewhat different mechanisms exist for different classes of transcripts, as well as for RNAs produced in mitochondria and chloroplasts.

We might envision the simplest possible mechanism for removing an intron to involve two steps: (1) the intron is cut at both ends by an endonuclease and (2) the adjacent exons are joined, or ligated, by a ligase. This is, apparently, what happens to the introns present in transfer RNAs (tRNAs) in bacteria. However, in studies of other RNAs (tRNAs in higher eukaryotes, and rRNAs and pre-mRNAs in all eukaryotes), precise excision of introns is much more complex and a much more interesting story.

Introns in eukaryotes can be categorized into several groups based on their splicing mechanisms. Group I introns, such as those in the primary transcript of rRNAs, require no outside help for intron excision; the intron itself is the source of the enzymatic activity necessary for splicing. This amazing discovery was made in 1982 by Thomas Cech and colleagues during a study of the ciliate protozoan Tetrahymena. RNAs that are capable of such catalytic activity are referred to as ribozymes.

The self-excision process for group I introns is illustrated in Figure 12.11. Chemically, two reactions take place. The first is an interaction between a free guanosine (symbolized as "G"), which acts as a cofactor in the reaction, and the primary transcript [Figure 12.11(a)]. After guanosine is positioned in the active site of the intron, its 3′-OH group attacks and breaks the phosphodiester bond ("P") between the nucleotides at the 5′ end of the intron and the 3′ end of the left-hand exon [Figure 12.11(b)]. The second reaction involves the interaction of the newly formed 3′-OH group on the left-hand exon and the phosphodiester bond at the right intron/exon boundary [Figure 12.11(c)]. The intron is spliced out and the two exons are ligated, leading to the mature RNA [Figure 12.11(d)].

FIGURE 12.11 Splicing mechanism for removal of a group I intron. The process is one of self-excision involving two transesterification reactions.
[Figure: (a) free guanosine (G) with its 3′-OH enters the active site within the intron between exon 1 and exon 2; (b) reaction 1 at the 5′ splice site; (c) reaction 2 at the 3′ splice site; (d) intron spliced out and exons joined.]

Self-excision of group I introns, as described above, is known to occur in preliminary transcripts for mRNAs, tRNAs, and rRNAs in bacteria, lower eukaryotes, and higher plants.

Self-excision also governs the removal of introns from the primary mRNA and tRNA transcripts produced in mitochondria and chloroplasts; these are examples of group II introns. Splicing of group II introns is somewhat different from that of group I introns, but it also involves two autocatalytic reactions leading to the excision of introns. Group II introns are found in fungi, plants, protists, and bacteria.
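The two transesterification reactions described above for group I introns can be followed step by step in a small Python sketch. This is only a bookkeeping model that treats RNA as text; the function name `group_i_self_splice` and the three sequences are hypothetical and are not taken from any real transcript.

```python
def group_i_self_splice(exon1, intron, exon2, cofactor="G"):
    """Model the two reactions of group I self-splicing as string operations."""
    # Reaction 1: the 3'-OH of the free guanosine attacks the 5' splice site.
    # Exon 1 is released with a free 3'-OH, and the guanosine becomes
    # attached to the 5' end of the intron.
    left_exon = exon1
    g_intron_exon2 = cofactor + intron + exon2

    # Reaction 2: the 3'-OH of exon 1 attacks the 3' splice site.
    # The two exons are joined and the intron (carrying the added G) is released.
    cut = len(cofactor) + len(intron)
    released_intron = g_intron_exon2[:cut]
    right_exon = g_intron_exon2[cut:]
    mature_rna = left_exon + right_exon
    return mature_rna, released_intron


# Hypothetical sequences, chosen only to make the output easy to read.
mature, released = group_i_self_splice("AUGGCU", "AAUUGCGGGAAAGG", "CCGUAA")
print(mature)    # exon 1 joined directly to exon 2
print(released)  # excised intron with the guanosine cofactor at its 5' end
```

Note that no protein enzyme appears anywhere in the model: the only inputs are the transcript and the guanosine cofactor, which mirrors the point that the intron itself supplies the catalytic activity.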


Splicing Mechanisms: The Spliceosome
Compared to the group I and group II introns discussed above, introns in nuclear-derived eukaryotic pre-mRNAs may be much larger, and their removal appears to require a much more complex mechanism. These splicing reactions are not autocatalytic, but instead are mediated by a molecular complex called the spliceosome. This structure is very large, 40S in yeast and 60S in mammals, being the same size as ribosomal subunits! Introns removed by the spliceosome are known as spliceosomal introns.

One set of essential components of spliceosomes is the small nuclear RNAs (snRNAs). These RNAs are usually 80 to 400 nucleotides long and, because they are rich in uridine residues, have been arbitrarily named U1, U2, . . . , U6. The snRNAs are complexed with proteins to form small nuclear ribonucleoproteins (snRNPs), pronounced "snurps," which are named after the specific snRNAs contained within them (the U2 snRNA is contained within the U2 snRNP).

Figure 12.12 depicts a model of the steps involved in the removal of one spliceosomal intron. Keep in mind that while this figure shows separate components, the process involves the huge spliceosome that envelops the RNA being spliced. The nucleotide sequences near the ends of the intron begin at the 5′ end with a GU dinucleotide sequence, called the donor sequence, and terminate at the 3′ end with an AG dinucleotide, called the acceptor sequence. These, as well as other consensus sequences shared by introns, attract specific snRNAs of the spliceosome. For example, the snRNA U1 bears a nucleotide sequence that is complementary to the 5′ donor sequence end of the intron. Base pairing resulting from this complementarity promotes the binding that represents the initial step in the formation of the spliceosome. After the other snRNPs (U2, U4, U5, and U6) are added, splicing commences. As with group I splicing, two reactions occur. The first involves the interaction of the 2′-OH group from an adenine (A) residue present within the branch point region of the intron. The A residue attacks the 5′ splice site, cutting the RNA chain. In a subsequent step involving several other snRNPs, an intermediate structure is formed and the second reaction ensues, linking the cut 5′ end of the intron to the A. This results in the formation of a characteristic loop structure called a lariat, which contains the excised intron. The exons are then ligated and the snRNPs are released.

FIGURE 12.12 A model of the splicing mechanism for removal of a spliceosomal intron. Excision is dependent on snRNPs (U1, U2, etc.). The lariat structure is characteristic of this mechanism.
[Figure: an intron lying between exon 1 and exon 2, with a 5′ splice site (GU), a branch point (A), and a 3′ splice site (AG); snRNPs U1, U2, U4, U5, and U6 assemble on the intron; a lariat is formed; the intron is excised and the exons are ligated.]

EVOLVING CONCEPT OF THE GENE
The elucidation of the genetic code in the 1960s supported the concept that the gene is composed of a linear series of triplet nucleotides encoding the amino acid sequence of a protein. While this is indeed the case in bacteria and viruses, in 1977, it became apparent that in eukaryotes, the gene is divided into coding sequences, called exons, which are interrupted by noncoding sequences, called introns (intervening sequences), which must be spliced out during production of the mature mRNA. ■
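The role of the GU donor and AG acceptor sequences in spliceosomal splicing can be illustrated with a short Python sketch. It assumes that the intron coordinates are already known and simply checks the boundary dinucleotides and the presence of an internal A before removing each intron. The function `splice_introns` and all sequences are hypothetical; a real spliceosome recognizes longer consensus sequences than the two dinucleotides tested here.

```python
def splice_introns(pre_mrna, intron_spans):
    """Remove annotated introns from a pre-mRNA.

    pre_mrna: RNA sequence as a string of A, C, G, U.
    intron_spans: sorted (start, end) pairs, zero-based, end exclusive.
    Returns the mature mRNA and the list of excised intron sequences.
    """
    exons = []
    excised = []
    position = 0
    for start, end in intron_spans:
        intron = pre_mrna[start:end]
        if not intron.startswith("GU"):
            raise ValueError("intron lacks the 5' GU donor sequence")
        if not intron.endswith("AG"):
            raise ValueError("intron lacks the 3' AG acceptor sequence")
        if "A" not in intron[2:-2]:
            raise ValueError("intron has no internal A to serve as branch point")
        exons.append(pre_mrna[position:start])
        excised.append(intron)  # released as a lariat in the cell
        position = end
    exons.append(pre_mrna[position:])
    return "".join(exons), excised


# Hypothetical transcript: exon 1, one intron, exon 2.
exon1, intron, exon2 = "AUGGCC", "GUAAGUCUAACUUUCAG", "UUUUAA"
pre_mrna = exon1 + intron + exon2
span = (len(exon1), len(exon1) + len(intron))
mature, lariats = splice_introns(pre_mrna, [span])
print(mature)
print(lariats)
```

The precision requirement mentioned earlier in the chapter is visible here: shifting either coordinate of the span by a single nucleotide would either fail the boundary check or change the mature sequence.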


12.12 RNA Editing May Modify the Final Transcript

In the late 1980s, still another unexpected form of posttranscriptional RNA processing was discovered in several organisms. In this form, referred to as RNA editing, the nucleotide sequence of a pre-mRNA is actually changed prior to translation. As a result, the ribonucleotide sequence of the mature RNA differs from the sequence encoded in the exons of the DNA from which the RNA was transcribed.

Although other variations exist, there are two main types of RNA editing: insertion/deletion editing, in which nucleotides are added to or subtracted from the total number of bases; and substitution editing, in which the identities of individual nucleotide bases are altered. Substitution editing is used in some nuclear-derived eukaryotic RNAs and is prevalent in mitochondrial and chloroplast RNAs transcribed in plants.

Trypanosoma, a parasite that causes African sleeping sickness, uses extensive insertion/deletion editing in mitochondrial RNAs. The uridines added to an individual transcript can make up more than 60 percent of the coding sequence, usually forming the initiation codon and bringing the rest of the sequence into the proper reading frame. Insertion/deletion editing in trypanosomes is directed by gRNA (guide RNA) templates, which are also transcribed from the mitochondrial genome. These small RNAs share a high degree of complementarity to the edited region of the final mRNAs. They base-pair with the pre-edited mRNAs to direct the editing machinery to make the correct changes.

An excellent example of substitutional editing involves the subunits constituting the glutamate receptor channels (GluR) in mammalian brain tissue. In this case, adenosine (A) to inosine (I) editing occurs in pre-mRNAs prior to their translation, during which I is read as guanosine (G). A family of three ADAR (adenosine deaminase acting on RNA) enzymes is responsible for this editing. The double-stranded RNAs required for editing by the ADAR enzymes are provided by intron/exon pairing of the GluR mRNA transcripts. The editing alters the physiological parameters (solute permeability and desensitization response time) of the receptors containing the subunits.

Findings such as these have established that RNA editing provides still another important mechanism of posttranscriptional modification. These discoveries have important implications for the regulation of gene expression.

12.13 Transcription Has Been Visualized by Electron Microscopy

We conclude our coverage of transcription by referring you back to the chapter opening photograph (p. 218), which is a striking visualization of transcription occurring in the oocyte nucleus of Xenopus laevis, the clawed frog. Note the central axis that runs horizontally from left to right and from which threads appear to be emanating vertically. This axis, appearing as a thin thread, is the DNA of most of one gene encoding ribosomal RNA (rDNA). Each of the emanating threads, which grows longer the farther to the right it is found, is an rRNA molecule being transcribed. What is apparent is that multiple copies of RNA polymerase have initiated transcription at a point near the left end and that transcription by each of them has proceeded to the right. Simultaneous transcription by many of these polymerases results in the electron micrograph that has captured an image of the entire process.

It is fascinating to visualize the process and to confirm our expectations based on the biochemical analysis of this process.
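As a brief aside on the substitution editing described in Section 12.12, the following Python sketch shows how a single A-to-I change can alter the amino acid specified by a codon, because inosine is read as guanosine during translation. The codon CAG is used purely as an illustration, the two-entry codon table is a fragment of the standard genetic code, and the function names are hypothetical.

```python
# Two entries of the standard genetic code, enough for this example.
CODON_TO_AMINO_ACID = {"CAG": "Gln", "CGG": "Arg"}


def a_to_i_edit(rna, index):
    """Return the RNA with the adenosine at `index` deaminated to inosine (I)."""
    if rna[index] != "A":
        raise ValueError("only adenosine can be edited to inosine")
    return rna[:index] + "I" + rna[index + 1:]


def as_read_during_translation(rna):
    """Inosine is read as guanosine during translation."""
    return rna.replace("I", "G")


codon = "CAG"                                  # illustrative codon
edited = a_to_i_edit(codon, 1)                 # middle A becomes I
decoded = as_read_during_translation(edited)   # I is read as G
print(codon, CODON_TO_AMINO_ACID[codon])
print(edited, "read as", decoded, CODON_TO_AMINO_ACID[decoded])
```

The DNA sequence of the gene is untouched in this model; only the RNA changes, which is why the edited protein cannot be predicted from the exon sequence alone.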

CASE STUDY Treatment dilemmas

A 30-year-old woman was undergoing therapy for β-thalassemia, a recessive trait caused by absence of or reduced synthesis of the hemoglobin β chain, a subunit of the oxygen-carrying molecule in red blood cells. In this condition, red blood cells are rapidly destroyed, freeing a large amount of iron, which is deposited in tissues and organs. The blood transfusions the patient had received every 2 or 3 weeks since the age of 7 to stave off anemia were further aggravating iron buildup. Her major organs were showing damage, and she was in danger of death from cardiac disease. Her physician suggested that she consider undergoing a hematopoietic (bone marrow) stem cell transplant (HSCT). Since these stem cells give rise to red blood cells, such a transplant could potentially restore her health. While this might seem like an easy decision, it is not. Advanced cases have a high risk (almost 30 percent) for transplantation-related death. At this point, the woman is faced with a difficult and important decision.

1. Consider different ways in which a mutation, a single base-pair change or small deletion, in the gene encoding the hemoglobin β chain could lead to β-thalassemia. For example, how might mutations in promoter, enhancer, or coding regions yield this outcome?

2. Why is it important that the physician emphasize to the patient that she must bear the responsibility for the final decision (i.e., that once she has considered all aspects of the decision, she act autonomously)?

3. If you were faced with this decision, what further input might you seek?

For related reading, see Caocci, G., et al. (2011). Ethical issues of unrelated hematopoietic stem cell transplantation in adult thalassemia patients. BMC Medical Ethics 12(4):1–7.


GENETICS, ETHICS, AND SOCIETY

Treating Duchenne Muscular Dystrophy with Exon-Skipping Drugs

One in every 3500 newborn males is afflicted by a serious X-linked recessive disease known as Duchenne muscular dystrophy (DMD). This disease is progressive, resulting in muscle degeneration, heart disease, and premature death.

DMD is caused by mutations in the dystrophin gene, a 2400-kb gene containing 79 exons. The DMD primary transcript is 2100 kb long and takes about 16 hours to transcribe. Many DMD mutations are frameshift mutations that shift the reading frame in the mRNA so that at least one codon downstream of the frameshift mutation becomes a stop codon. This stop codon causes premature termination of translation of the mRNA. Most of the resulting truncated dystrophin proteins are not functional. There are no cures for DMD and few effective treatments.

Recently, a new DMD drug called eteplirsen completed clinical trials and received accelerated approval by the U.S. Food and Drug Administration (FDA). This drug uses a technique called exon skipping to target mutations in exon 51 of dystrophin.

In exon skipping, the exon containing the mutation is removed during pre-mRNA splicing. If the exons that precede and follow the skipped exon are spliced together in such a way that the reading frame is restored, the translated protein may retain some activity, even though it lacks the amino acids encoded by the skipped exon.

Eteplirsen is a molecule known as an antisense oligonucleotide (ASO). ASOs are short synthetic single-stranded DNA molecules that have specific sequences complementary to a portion of a targeted mRNA. When ASOs enter cells, they bind to their complementary sequence in the target mRNA. This results in the mRNA's degradation, or interferes with its splicing or translation.

Once inside a cell, eteplirsen binds to the dystrophin pre-mRNA exon 51 splice junctions, interfering with normal pre-mRNA splicing. As a result, exon 50 is spliced to exon 52, thereby eliminating exon 51 and its mutation from the mature mRNA. This exon deletion restores the correct reading frame to the dystrophin mRNA. Upon translation, the new dystrophin protein, although lacking some amino acids, has enough activity to restore partial function to the patient's muscles.

Your Turn
Take time, individually or in groups, to consider the ethical and technical issues that surround DMD exon-skipping drugs.

1. In 2016, the FDA gave accelerated approval to eteplirsen. Although the FDA's advisory panel initially voted against approval, intense lobbying by DMD patients and their families may have successfully pressured the FDA to approve this new drug. Discuss this case of accelerated approval, considering ethical arguments on both sides of the controversy.

Read about the FDA's approval of eteplirsen at: Tavernise, S., F.D.A. approves muscular dystrophy drug that patients lobbied for. New York Times, September 19, 2016 (http://www.nytimes.com/2016/09/20/business/fda-approves-muscular-dystrophy-drug-that-patients-lobbied-for.html).

2. Only about 13 percent of DMD patients have exon 51 mutations leading to premature translation termination. Describe several other exon-skipping drugs that are under development for the treatment of DMD. Are any of these new drugs in clinical trials?

Investigate the links on the Muscular Dystrophy Association Web site (https://www.mda.org/quest/article/exon-skipping-dmd-what-it-and-whom-can-it-help).
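The reading-frame reasoning behind exon skipping can be checked with simple arithmetic: the frame at the start of a downstream exon depends on the total number of nucleotides retained upstream of it, taken modulo 3. The Python sketch below uses three hypothetical exons whose lengths are invented for illustration (they are not dystrophin exon lengths), and the function `frame_offset_at_exon` is likewise an assumption made for this example.

```python
def frame_offset_at_exon(exon_lengths, exon_index, skipped=()):
    """Reading-frame offset (0, 1, or 2) at the start of a given exon.

    exon_lengths: coding nucleotides per exon, in order along the mRNA.
    exon_index: zero-based index of the exon whose frame is checked.
    skipped: zero-based indices of exons left out of the mature mRNA.
    Compare the result with the wild-type value: if the two are equal,
    the downstream exon is read in its normal frame.
    """
    upstream = sum(
        length
        for i, length in enumerate(exon_lengths[:exon_index])
        if i not in skipped
    )
    return upstream % 3


# Hypothetical lengths: the mutant carries a one-nucleotide insertion in exon 2.
wild_type = [90, 60, 120]
mutant = [90, 61, 120]

normal = frame_offset_at_exon(wild_type, 2)
shifted = frame_offset_at_exon(mutant, 2)
rescued = frame_offset_at_exon(mutant, 2, skipped={1})
print(normal, shifted, rescued)
```

In this toy case the insertion shifts the frame of the last exon, and skipping the mutated exon restores it at the cost of the amino acids that exon would have encoded. Skipping works only when the nucleotides removed leave the downstream frame unchanged, which is one reason a given exon-skipping drug helps only a subset of patients.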

INSIGHTS AND SOLUTIONS

1. Calculate how many triplet codons would be possible had evolution seized on six bases (three complementary base pairs) rather than four bases within the structure of DNA. Would six bases accommodate a two-letter code, assuming 20 amino acids and start and stop codons?

Solution: Six things taken three at a time will produce (6)³ or 216 triplet codes. If the code was a doublet, there would be (6)² or 36 two-letter codes, more than enough to accommodate 20 amino acids and start and stop signals.

2. In a heteropolymer experiment using 1/2C:1/4A:1/4G, how many different triplets will occur in the synthetic RNA molecule? How frequently will the most frequent triplet occur?

Solution: There will be (3)³ or 27 triplets produced. The most frequent will be CCC, present (1/2)³ or 1/8 of the time.

3. In a regular copolymer experiment, where UUAC is repeated over and over, how many different triplets will occur in the synthetic RNA, and how many amino acids will occur in the polypeptide when this RNA is translated? (Consult Figure 12.7.)

Solution: The synthetic RNA will repeat four triplets (UUA, CUU, ACU, UAC) over and over. Because both UUA and CUU


encode leucine, while ACU and UAC encode threonine and tyrosine, respectively, the polypeptides synthesized under the directions of this RNA would contain three amino acids in the repeating sequence Leu-Leu-Thr-Tyr.

4. Actinomycin D inhibits DNA-dependent RNA synthesis. This antibiotic is added to a bacterial culture where a specific protein is being monitored. Compared to a control culture, where no antibiotic is added, translation of the protein declines over a period of 20 minutes, until no further protein is made. Explain these results.

Solution: The mRNA, which is the basis for translation of the protein, has a lifetime of about 20 minutes. When actinomycin D is added, transcription is inhibited and no new mRNAs are made. Those already present support the translation of the protein for up to 20 minutes.
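The arithmetic in Solutions 1 through 3 can be verified with a few lines of Python. This is only a checking aid: the variable and function names are chosen for this sketch, and the codon table contains just the four codons named in Solution 3.

```python
from fractions import Fraction
from itertools import product

# Problem 1: number of codes available with six different bases.
triplets_with_six_bases = 6 ** 3
doublets_with_six_bases = 6 ** 2
signals_needed = 20 + 2  # 20 amino acids plus start and stop
doublet_code_sufficient = doublets_with_six_bases >= signals_needed

# Problem 2: random heteropolymer made from 1/2 C : 1/4 A : 1/4 G.
base_fraction = {"C": Fraction(1, 2), "A": Fraction(1, 4), "G": Fraction(1, 4)}
triplet_fraction = {}
for bases in product(base_fraction, repeat=3):
    probability = Fraction(1)
    for base in bases:
        probability *= base_fraction[base]
    triplet_fraction["".join(bases)] = probability
most_frequent = max(triplet_fraction, key=triplet_fraction.get)

# Problem 3: repeating copolymer of UUAC, read three nucleotides at a time.
CODON_TO_AMINO_ACID = {"UUA": "Leu", "CUU": "Leu", "ACU": "Thr", "UAC": "Tyr"}


def codons_from_repeat(unit, n_codons):
    """Read n_codons triplets from the start of an RNA made by repeating unit."""
    rna = unit * (3 * n_codons)  # always at least 3 * n_codons nucleotides
    return [rna[i:i + 3] for i in range(0, 3 * n_codons, 3)]


codons = codons_from_repeat("UUAC", 8)
peptide = [CODON_TO_AMINO_ACID[c] for c in codons]

print(triplets_with_six_bases, doublets_with_six_bases, doublet_code_sufficient)
print(len(triplet_fraction), most_frequent, triplet_fraction[most_frequent])
print(sorted(set(codons)), "-".join(peptide))
```

Using exact fractions rather than floating-point numbers keeps the triplet frequencies in the same form as the solution text, so 1/8 appears as 1/8 rather than 0.125.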

Problems and Discussion Questions

Mastering Genetics™ Visit for instructor-assigned tutorials and problems.

1. HOW DO WE KNOW? In this chapter, we focused on the genetic code and the transcription of genetic information stored in DNA into complementary RNA molecules. Along the way, we found many opportunities to consider the methods and reasoning by which much of this information was acquired. From the explanations given in the chapter, what answers would you propose to the following fundamental questions:
(a) How did we determine the compositions of codons encoding specific amino acids?
(b) How were the specific sequences of triplet codes determined experimentally?
(c) How were the experimentally derived triplet codon assignments verified in studies using bacteriophage MS2?
(d) How do we know that mRNA exists and serves as an intermediate between information encoded in DNA and its concomitant gene product?
(e) How do we know that the initial transcript of a eukaryotic gene contains noncoding sequences that must be removed before accurate translation into proteins can occur?

2. CONCEPT QUESTION Review the Chapter Concepts list on p. 218. These all center on how genetic information is stored in DNA and transferred to RNA prior to translation into proteins. Write a short essay that summarizes the key properties of the genetic code and the process by which RNA is transcribed on a DNA template.

3. In studies of frameshift mutations, Crick, Barnett, Brenner, and Watts–Tobin found that either three nucleotide insertions or deletions restored the correct reading frame.
(a) Assuming the code is a triplet, what effect would the addition or loss of six nucleotides have on the reading frame?
(b) If the code were a sextuplet (consisting of six nucleotides), would the reading frame be restored by the addition or loss of three, six, or nine nucleotides?

4. The mRNA formed from the repeating tetranucleotide UUAC incorporates only three amino acids, but the use of UAUC incorporates four amino acids. Why?

5. In studies using repeating copolymers, AC . . . incorporates threonine and histidine, and CAACAA . . . incorporates glutamine, asparagine, and threonine. What triplet code can definitely be assigned to threonine?

6. In a coding experiment using repeating copolymers (as shown in Table 12.3), the following data were obtained.

Copolymer   Codons Produced   Amino Acids in Polypeptide
AG          AGA, GAG          Arg, Glu
AAG         AGA, AAG, GAA     Lys, Arg, Glu

AGG is known to code for arginine. Taking into account the wobble hypothesis, assign each of the four remaining different triplet codes to its correct amino acid.

7. In the triplet binding assay technique, radioactivity remains on the filter when the amino acid corresponding to the experimental triplet is labeled. Explain the basis of this technique.

8. When the amino acid sequences of insulin isolated from different organisms were determined, some differences were noted. For example, alanine was substituted for threonine, serine was substituted for glycine, and valine was substituted for isoleucine at corresponding positions in the protein. List the single-base changes that could occur in triplets to produce these amino acid changes.

9. In studies of the amino acid sequence of wild-type and mutant forms of tryptophan synthetase in E. coli, the following changes have been observed:

[Diagram: Gly changes to Arg or to Glu; Arg changes to Thr, Ser, or Ile; Glu changes to Val or Ala.]

Determine a set of triplet codes in which only a single-nucleotide change produces each amino acid change.

10. Why doesn't polynucleotide phosphorylase (Ochoa's enzyme) synthesize RNA in vivo?

11. Refer to Table 12.1. Can you hypothesize why a mixture of (Poly U)+(Poly A) would not stimulate incorporation of 14C-phenylalanine into protein?

12. Predict the amino acid sequence produced during translation of the short theoretical mRNA sequences below. (Note that the second sequence was formed from the first by a deletion of only one nucleotide.) What type of mutation gave rise to sequence 2?

Sequence 1: 5′-AUGCCGGAUUAUAGUUGA-3′
Sequence 2: 5′-AUGCCGGAUUAAGUUGA-3′

13. A short RNA molecule was isolated that demonstrated a hyperchromic shift indicating secondary structure (see p. 177 in Chapter 9). Its sequence was determined to be

5′-AGGCGCCGACUCUACU-3′


(a) Propose a two-dimensional model for this molecule.
(b) What DNA sequence would give rise to this RNA molecule through transcription?
(c) If the molecule were a tRNA fragment containing a CGA anticodon, what would the corresponding codon be?
(d) If the molecule were an internal part of a message, what amino acid sequence would result from it following translation? (Refer to the code chart in Figure 12.7.)

14. A glycine residue exists at position 210 of the tryptophan synthetase enzyme of wild-type E. coli. If the codon specifying glycine is GGA, how many single-base substitutions will result in an amino acid substitution at position 210, and what are they? How many will result if the wild-type codon is GGU?

15. Shown here is a theoretical viral mRNA sequence

5′-AUGCAUACCUAUGAGACCCUUGGA-3′

(a) Assuming that it could arise from overlapping genes, how many different polypeptide sequences can be produced? Using the chart in Figure 12.7, what are the sequences?
(b) A base-substitution mutation that altered the sequence in part (a) eliminated the synthesis of all but one polypeptide. The altered sequence is shown below. Use Figure 12.7 to determine why it was altered.

5′-AUGCAUACCUAUGUGACCCUUGGA-3′

16. Most proteins have more leucine than histidine residues but more histidine than tryptophan residues. Correlate the number of codons for these three amino acids with this information.

17. Define the process of transcription. Where does this process fit into the central dogma of molecular genetics?

18. Describe the structure of RNA polymerase in bacteria. What is the core enzyme? What is the role of the σ factor?

19. In a written paragraph, describe the abbreviated chemical reactions that summarize RNA polymerase-directed transcription.

20. Messenger RNA molecules are very difficult to isolate from bacteria because they are quickly degraded. Can you suggest a reason why this occurs? Eukaryotic mRNAs are more stable and exist longer in the cell than do bacterial mRNAs. Is this an advantage or a disadvantage for a pancreatic cell making large quantities of insulin?

21. One form of posttranscriptional modification of most eukaryotic RNA transcripts is the addition of a poly-A tail at the 3′ end. The absence of a poly-A tail leads to rapid degradation of the transcript. Poly-A tails of various lengths are also added to many bacterial RNA transcripts where, instead of promoting stability, they enhance degradation. In both cases, RNA secondary structures, stabilizing proteins, or degrading enzymes interact with poly-A tails. Considering the activities of RNAs, what might be the general functions of 3′-polyadenylation?

22. In a mixed heteropolymer experiment, messages were created with either 4/5C:1/5A or 4/5A:1/5C. These messages yielded proteins with the amino acid compositions shown in the following table. Using these data, predict the most specific coding composition for each amino acid.

Amino Acid   4/5C:1/5A   4/5A:1/5C
Pro            63.0%        3.5%
His            13.0%        3.0%
Thr            16.0%       16.6%
Glu             3.0%       13.0%
Asp             3.0%       13.0%
Lys             0.5%       50.0%
Total          98.5%       99.1%

23. Shown in this problem are the amino acid sequences of the wild type and three mutant forms of a short protein.
(a) Using Figure 12.7, predict the type of mutation that created each altered protein.
(b) Determine the specific ribonucleotide change that led to the synthesis of each mutant protein.
(c) The wild-type RNA consists of nine triplets. What is the role of the ninth triplet?
(d) For the first eight wild-type triplets, which, if any, can you determine specifically from an analysis of the mutant proteins? In each case, explain why or why not.
(e) Another mutation (mutant 4) is isolated. Its amino acid sequence is unchanged, but mutant cells produce abnormally low amounts of the wild-type proteins. As specifically as you can, predict where this mutation exists in the gene.

Wild type: Met-Trp-Tyr-Arg-Gly-Ser-Pro-Thr
Mutant 1: Met-Trp
Mutant 2: Met-Trp-His-Arg-Gly-Ser-Pro-Thr
Mutant 3: Met-Cys-Ile-Val-Val-Val-Gln-His

24. Alternative splicing is a common mechanism for eukaryotes to expand their repertoire of gene functions. At least one estimate indicates that approximately 50 percent of human genes use alternative splicing, and approximately 15 percent of disease-causing mutations involve aberrant alternative splicing. Different tissues show remarkably different frequencies of alternative splicing, with the brain accounting for approximately 18 percent of such events.
(a) Define alternative splicing and speculate on the evolutionary strategy alternative splicing offers to organisms.
(b) Why might some tissues engage in more alternative splicing than others?

M12_KLUG8414_10_SE_C12.indd 240 16/11/18 5:14 pm 13 Translation and Proteins

Crystal structure of a Thermus thermophilus 70S ribosome containing three bound transfer RNAs.

CHAPTER CONCEPTS

■ The ribonucleotide sequence of messenger RNA (mRNA) reflects genetic information stored in DNA that makes up genes and corresponds to the amino acid sequences in proteins encoded by those genes.
■ The process of translation decodes the information in mRNA, leading to the synthesis of polypeptide chains.
■ Translation involves the interactions of mRNA, tRNA, ribosomes, and a variety of translation factors essential to the initiation, elongation, and termination of the polypeptide chain.
■ Proteins, the final product of many genes, achieve a three-dimensional conformation that is based on the primary amino acid sequences of the polypeptide chains making up each protein.
■ The function of any protein is closely tied to its three-dimensional structure, which can be disrupted by mutation.

In Chapter 12, we established that a genetic code stores information in the form of triplet nucleotides in DNA and that this information is initially expressed through the process of transcription into a messenger RNA that is complementary to one strand of the DNA helix. However, the final product of gene expression, in the case of protein-coding genes, is a polypeptide chain consisting of a linear series of amino acids whose sequence has been prescribed by the genetic code. In this chapter, we will examine how the information present in mRNA is utilized to create polypeptides, which then fold into protein molecules. We will also review the evidence confirming that proteins are the end products of many genes, and we will briefly discuss the various levels of protein structure, diversity, and function. This information extends our understanding of gene expression and provides an important foundation for interpreting how the mutations that arise in DNA can result in the diverse phenotypic effects observed in organisms.

13.1 Translation of mRNA Depends on Ribosomes and Transfer RNAs

Translation of mRNA is the biological polymerization of amino acids into polypeptide chains. This process, alluded to in our discussion of the genetic code in Chapter 12, occurs only in association with ribosomes, which serve as


nonspecific workbenches. The central question in translation is how triplet ribonucleotides of mRNA direct specific amino acids into their correct position in the polypeptide. This question was answered once transfer RNA (tRNA) was discovered. This class of molecules adapts specific triplet codons in mRNA to their correct amino acids. The adaptor hypothesis for the role of tRNA was postulated by Francis Crick in 1957.

In association with a ribosome, mRNA presents a triplet codon that calls for a specific amino acid. A specific tRNA molecule contains within its nucleotide sequence three consecutive ribonucleotides complementary to the codon, called the anticodon, which can base-pair with the codon. Another region of this tRNA is covalently bonded to its corresponding amino acid.

Hydrogen bonding of tRNAs to mRNA holds the amino acids in proximity so that a peptide bond can be formed. (Amino acid structure and peptide bond chemistry will be covered later in this chapter.) This process occurs over and over as mRNA runs through the ribosome and amino acids are polymerized into a polypeptide. Before we discuss the actual process of translation, let's first consider the structures of the ribosome and tRNA.

Ribosomal Structure

Because of its essential role in the expression of genetic information, the ribosome has been extensively analyzed. One bacterial cell contains about 60,000 ribosomes, and a eukaryotic cell contains many times more. Electron microscopy reveals that the bacterial ribosome is about 40 nm at its largest dimension and consists of two subunits, one large and one small. Both subunits consist of one or more molecules of rRNA and an array of ribosomal proteins. When the two subunits are associated with each other in a single ribosome, the structure is sometimes called a monosome.

The main differences between bacterial and eukaryotic ribosomes are summarized in Figure 13.1. The subunit and rRNA components are most easily isolated and characterized on the basis of their sedimentation behavior in sucrose gradients [their rate of migration, or Svedberg coefficient (S), which is a reflection of their density, mass, and shape]. In bacteria, the monosome is a 70S particle; in eukaryotes, it is 80S. Sedimentation coefficients, which reflect the variable rate of migration of different-sized particles and molecules, are not additive. For example, the bacterial 70S monosome consists of a 50S and a 30S subunit, and the eukaryotic 80S monosome consists of a 60S and a 40S subunit.

The larger subunit in bacteria consists of a 23S rRNA molecule, a 5S rRNA molecule, and 33 ribosomal proteins. In the eukaryotic equivalent, a 28S rRNA molecule is accompanied by a 5.8S and 5S rRNA molecule and 47 proteins. The smaller bacterial subunit consists of a 16S rRNA component and 21 proteins. In the eukaryotic equivalent, an 18S rRNA component and 33 proteins are found. The approximate molecular weights (MWs) and the number of nucleotides of these components are also shown in Figure 13.1.

Bacteria
  Monosome: 70S (2.3 × 10^6 Da)
  Large subunit: 50S (1.5 × 10^6 Da) = 23S rRNA (2904 nucleotides) + 5S rRNA (121 nucleotides) + 33 proteins
  Small subunit: 30S (0.8 × 10^6 Da) = 16S rRNA (1542 nucleotides) + 21 proteins

Eukaryotes
  Monosome: 80S (4.3 × 10^6 Da)
  Large subunit: 60S (2.9 × 10^6 Da) = 28S rRNA (5034 nucleotides) + 5.8S rRNA (156 nucleotides) + 5S rRNA (121 nucleotides) + 47 proteins
  Small subunit: 40S (1.4 × 10^6 Da) = 18S rRNA (1870 nucleotides) + 33 proteins

FIGURE 13.1 A comparison of the components of bacterial and eukaryotic ribosomes. Specific values are given for E. coli and human ribosomes.
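
The point that sedimentation coefficients are not additive, while masses are, can be checked directly against the numbers in Figure 13.1. The following is a minimal sketch in Python; the dictionary layout and the function name `compare_additivity` are illustrative choices, not part of the text, and only the figure's values are used.

```python
# Sedimentation coefficients (S) and masses (in units of 10^6 Da),
# transcribed from Figure 13.1. Each entry is (S value, mass).
RIBOSOMES = {
    "bacteria":   {"monosome": (70, 2.3), "large": (50, 1.5), "small": (30, 0.8)},
    "eukaryotes": {"monosome": (80, 4.3), "large": (60, 2.9), "small": (40, 1.4)},
}


def compare_additivity(parts):
    """Sum the two subunit values and set them beside the monosome values."""
    large_s, large_mw = parts["large"]
    small_s, small_mw = parts["small"]
    monosome_s, monosome_mw = parts["monosome"]
    return {
        "s_sum": large_s + small_s,          # naive sum of S values
        "s_monosome": monosome_s,            # measured S of the intact ribosome
        "mw_sum": round(large_mw + small_mw, 1),
        "mw_monosome": monosome_mw,
    }


for kind, parts in RIBOSOMES.items():
    print(kind, compare_additivity(parts))
```

With the figure's values, the subunit masses add up to the monosome mass in both cases, whereas the summed S values (80 and 100) overshoot the measured 70S and 80S, which is what "not additive" means here.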


It is now clear that the rRNA molecules perform the all-important catalytic functions associated with translation. The many proteins, whose functions were long a mystery, are thought to promote the binding of the various molecules involved in translation and, in general, to fine-tune the process. This conclusion is based on the observation that some of the catalytic functions in ribosomes still occur in experiments involving "ribosomal protein-depleted" ribosomes.

Genes coding for the rRNA components are redundant in the genome. For example, the E. coli genome contains seven copies of a single sequence that encodes all three rRNAs: 23S, 16S, and 5S. The initial transcript of each set of these genes produces a 30S RNA molecule that is enzymatically cleaved into these smaller components. Coupling of the genetic information encoding these three rRNA components ensures that after multiple transcription events, equal quantities of all three will be present as ribosomes are assembled.

In eukaryotes, many more copies of a precursor sequence are present. Each copy is initially transcribed into an RNA molecule of about 45S that is subsequently processed into 28S, 18S, and 5.8S rRNA components. These species are homologous to the three rRNA components of E. coli. In Drosophila, approximately 120 copies per haploid genome are present, while in Xenopus laevis, more than 500 copies of the larger precursor sequence are present per haploid genome. In mammalian cells, the initial transcript is also 45S. The unique 5S rRNA component of eukaryotes is not part of this larger transcript. Instead, copies of the gene coding for the 5S rRNA are distinct and located separately.

The rRNA genes, called rDNA, are present in clusters at various chromosomal sites. Each cluster in eukaryotes consists of tandem repeats, with each repeat unit separated by a noncoding spacer DNA sequence. In humans, these gene clusters have been localized near the ends of chromosomes 13, 14, 15, 21, and 22. A separate gene cluster encoding 5S rRNA has been located on human chromosome 1.

Despite a detailed knowledge of the structure and genetic origin of the ribosomal components, a complete understanding of the function of these components has, to date, eluded geneticists. This is not surprising; the ribosome is the largest and perhaps the most intricate of all cellular structures. For example, the human monosome has a combined molecular weight of 4.3 million Da!

ESSENTIAL POINT
Translation is the synthesis of polypeptide chains under the direction of mRNA in association with ribosomes. ■

tRNA Structure

Because of their small size and stability in the cell, transfer RNAs (tRNAs) have been investigated extensively and are the best characterized RNA molecules. They are composed of only 75 to 90 nucleotides, displaying a nearly identical structure in bacteria and eukaryotes. In both types of organisms, tRNAs are transcribed from DNA as larger precursors, which are cleaved into mature 4S tRNA molecules. Take, for example, tRNA^Tyr (the superscript identifies the specific tRNA by the amino acid that binds to it, called its cognate amino acid). In E. coli, mature tRNA^Tyr is composed of 77 nucleotides, yet its precursor contains 126 nucleotides.

In 1965, Robert Holley and his colleagues reported the complete sequence of tRNA^Ala isolated from yeast. Of great interest was their finding that a number of nucleotides in the tRNA contain a modified base not typically found in mRNA. Two of these nucleotides, inosinic acid and pseudouridylic acid, are illustrated in Figure 13.2. These modified bases serve several functions. For example, they confer structural stability and are important for hydrogen bonding between the tRNA and the mRNA being translated.

Holley's sequence analysis led him to propose the two-dimensional cloverleaf model of tRNA. It was known that tRNA demonstrates a secondary structure due to base pairing. Holley discovered that he could arrange the linear model in such a way that several stretches of base pairing would result. This arrangement created a series of paired stems and unpaired loops resembling the shape of

[Structural formulas: inosinic acid (I) and pseudouridylic acid (Ψ), each drawn as a modified base attached to ribose and phosphate]

FIGURE 13.2 Ribonucleotides containing two unusual nitrogenous bases found in transfer RNA.


a cloverleaf. Loops consistently contained modified bases that did not generally form base pairs. Holley's model is shown in Figure 13.3.

FIGURE 13.3 Holley's two-dimensional cloverleaf model of transfer RNA. Blocks represent nitrogenous bases. [Labeled parts: anticodon (G I C), anticodon stem, variable loop, D stem, TΨC stem, acceptor stem, 5′-end (pG), 3′-end (CCA-OH, amino acid binding site)]

The triplets GCU, GCC, and GCA specify alanine; therefore, Holley looked for an anticodon sequence complementary to one of these codons in his tRNA^Ala molecule. He found it in the form of CGI (the 3′ to 5′ direction) in one loop of the cloverleaf. The nitrogenous base I (inosinic acid) can form hydrogen bonds with U, C, or A, the third members of the alanine triplets. Thus, the anticodon loop was established.

Studies of other tRNA species reveal many constant features. At the 3′-end, all tRNAs contain the sequence . . . pCpCpA-3′. This is the end of the molecule where the amino acid is covalently joined to the terminal adenosine residue. All tRNAs contain the nucleotide 5′-Gp . . . at the other end of the molecule. In addition, the lengths of various stems and loops are very similar. Each tRNA that has been examined also contains an anticodon complementary to the known amino acid codon for which it is specific, and all anticodon loops are present in the same position of the cloverleaf.

Because the cloverleaf model was predicted strictly on the basis of nucleotide sequence, there was great interest in the X-ray crystallographic examination of tRNA, which reveals a three-dimensional structure. By 1974, , Jon Roberts, Brian Clark, Aaron Klug, and their colleagues had succeeded in crystallizing tRNA and performing X-ray crystallography at a resolution of 3 Å. At this resolution, the pattern formed by individual nucleotides is discernible.

As a result of these studies, a complete three-dimensional model of tRNA was proposed, as shown in Figure 13.4. At one end of the molecule is the anticodon loop, and at the other end is the 3′-acceptor region where the amino acid is bound. Geneticists speculate that the shapes of the intervening loops may be recognized by the specific enzymes responsible for adding amino acids to tRNAs, a subject to which we now turn our attention.

FIGURE 13.4 A three-dimensional model of transfer RNA. [Labeled parts: anticodon, anticodon loop, D loop, TΨC stem, TΨC loop, 5′ and 3′ ends, amino acid binding site]

Charging tRNA

Before translation can proceed, the tRNA molecules must be chemically linked to their respective amino acids. This activation process, called charging, occurs under the direction of enzymes called aminoacyl tRNA synthetases. There are 20 different amino acids, thus there are 20 different aminoacyl tRNA synthetases. Recall from Chapter 12 that the third position of the triplet code has a relaxed base-pairing requirement, or "wobble." Based on this, only 30 to 40 different tRNAs (depending on the type of organism) are necessary to accommodate the 61 amino acid specifying codons of the genetic code.

The charging process is outlined in Figure 13.5. In the initial step, the amino acid is converted to an activated form, reacting with ATP to create an aminoacyladenylic acid. A covalent linkage is formed between the 5′-phosphate group of ATP and the carboxyl end of the amino acid. This molecule remains associated with the synthetase enzyme, forming a complex that then reacts with a specific tRNA molecule. During this next step, the amino acid is attached to the appropriate tRNA through a high-energy ester bond between the 3′-end of the tRNA and the carboxyl group of


the amino acid. The charged tRNA may now participate directly in protein synthesis. Aminoacyl tRNA synthetases are highly specific enzymes because they recognize only one amino acid and the subset of corresponding tRNAs called isoaccepting tRNAs. Accurate charging is crucial if fidelity of translation is to be maintained.

FIGURE 13.5 Steps involved in charging tRNA. The superscript x denotes that only the corresponding specific tRNA and specific aminoacyl tRNA synthetase enzyme are involved in the charging process for each amino acid. [Diagram: aminoacyl tRNA synthetase^x + amino acid (aa^x) + ATP yields enzyme-bound aminoacyladenylic acid (the activated enzyme complex) and two released phosphates; the complex then reacts with tRNA^x to yield charged tRNA^x, AMP, and free aminoacyl tRNA synthetase^x]

ESSENTIAL POINT
Translation depends on tRNA molecules that serve as adaptors between triplet codons in mRNA and the corresponding amino acids. ■

NOW SOLVE THIS
13.1 In 1962, F. Chapeville and others reported an experiment in which they isolated radioactive 14C-cysteinyl-tRNA^Cys (charged tRNA^Cys + cysteine). They then removed the sulfur group from the cysteine, creating alanyl-tRNA^Cys (charged tRNA^Cys + alanine). When alanyl-tRNA^Cys was added to a synthetic mRNA calling for cysteine, but not alanine, a polypeptide chain was synthesized containing alanine. What can you conclude from this experiment?

HINT: This problem is concerned with establishing whether tRNA or the amino acid added to the tRNA during charging is responsible for attracting the charged tRNA to mRNA during translation. The key to its solution is the observation that in this experiment, when the triplet codon in mRNA calls for cysteine, alanine is inserted during translation, even though it is the "incorrect" amino acid.

13.2 Translation of mRNA Can Be Divided into Three Steps

Much like transcription, the process of translation can best be described by breaking it into discrete phases. We will consider three such phases, each with its own set of illustrations (Figures 13.6, 13.7, and 13.8), but keep in mind that translation is a dynamic, continuous process. As you read the following discussion, keep track of the step-by-step events depicted in the figures. While the core concepts of translation are common for bacterial and eukaryotic cells, the process is simpler in bacteria and is discussed in this section. Many of the protein factors involved in bacterial translation, and their roles, are summarized in Table 13.1.

TABLE 13.1 Various Protein Factors Involved during Translation in E. coli

Process                        Factor   Role
Initiation of translation      IF1      Binds to 30S subunit and prevents aminoacyl tRNA from binding to the A site prematurely
                               IF2      Binds to the initiator fMet-tRNA and transfers it to the P site of the 30S-mRNA complex; releases from complex upon GTP hydrolysis, which is required for 50S subunit binding
                               IF3      Binds to 30S subunit, preventing it from associating with the 50S subunit prematurely
Elongation of polypeptide      EF-Tu    Binds GTP; brings aminoacyl tRNA to the A site of the ribosome
                               EF-Ts    Regulates EF-Tu activity
                               EF-G     Stimulates translocation; GTP-dependent
Termination of translation     RF1      Catalyzes release of the polypeptide chain from tRNA and dissociation of the translocation complex; specific for UAA and UAG termination codons
and release of polypeptide     RF2      Behaves like RF1; specific for UGA and UAA codons
                               RF3      Stimulates RF1 and RF2 release
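
The three phases named in this section, together with the release-factor rows of Table 13.1, can be read as a simple procedure: find the start codon, read successive triplets in that frame, and stop at a termination codon recognized by RF1 or RF2. The following Python sketch makes that explicit. It is an illustration under simplifying assumptions, not a model of the ribosome: it takes the first AUG as the start (ignoring initiation factors and any upstream signals), and the function name `sketch_translation` and the returned dictionary are hypothetical. The example message is the short mRNA drawn in Figure 13.6.

```python
# Stop codons and the release factors listed as specific for them in Table 13.1.
STOP_CODON_RELEASE_FACTORS = {
    "UAA": ("RF1", "RF2"),
    "UAG": ("RF1",),
    "UGA": ("RF2",),
}


def sketch_translation(mrna, start_codon="AUG"):
    """Split an mRNA into codons from the first start codon to the first stop codon.

    Simplification: the first AUG found is treated as the initiation codon.
    """
    start = mrna.find(start_codon)
    if start == -1:
        raise ValueError("no start codon found")

    sense_codons = []
    # Step through the message three nucleotides at a time (the reading frame).
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODON_RELEASE_FACTORS:
            return {
                "start": start,
                "sense_codons": sense_codons,
                "stop_codon": codon,
                "release_factors": STOP_CODON_RELEASE_FACTORS[codon],
            }
        sense_codons.append(codon)

    # Ran off the end of the message without meeting a stop codon.
    return {"start": start, "sense_codons": sense_codons,
            "stop_codon": None, "release_factors": ()}


result = sketch_translation("AUGUUCGGUAAGUGA")  # mRNA from Figure 13.6
print(result["sense_codons"], result["stop_codon"], result["release_factors"])
```

For the Figure 13.6 message, the frame set by AUG gives four sense codons followed by UGA, which Table 13.1 assigns to RF2.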


Initiation

Initiation of bacterial translation is depicted in Figure 13.6, which at the top, in a box, shows all the individual components involved in the process. Recall that the ribosome serves as a workbench for the translation process. Ribosomes, when they are not involved in translation, are dissociated into their large and small subunits. Note that ribosomes contain three sites, the aminoacyl (A) site, the peptidyl (P) site, and the exit (E) site, the roles of which will soon become apparent. The initiation phase of translation in bacteria requires the association of a small ribosomal subunit, an mRNA molecule, a specific charged initiator tRNA, GTP, Mg^2+, and three proteinaceous initiation factors (IFs). In bacteria, the initiation codon of mRNA, AUG, calls for the modified amino acid N-formylmethionine (fMet).

The three initiation factors first bind to the small ribosomal subunit, and this complex in turn binds to mRNA (Step 1). In bacteria, a short sequence on the mRNA, called the Shine–Dalgarno sequence, base-pairs with a region of the 16S rRNA of the small ribosomal subunit, facilitating initiation.

While IF1 primarily blocks the A site from being bound to a tRNA and IF3 serves to inhibit the small subunit from associating with the large subunit prematurely, IF2 plays a more direct role in initiation. Essentially a GTPase, IF2 interacts with the mRNA and tRNA^fMet, stabilizing them in the P site (Step 2). This step "sets" the reading frame so that all subsequent groups of three ribonucleotides are translated accurately. Upon release of IF3, the small subunit (and its associated mRNA and tRNA^fMet) then combines with the large ribosomal subunit to create the 70S initiation complex. In this process, a molecule of GTP bound to IF2 is hydrolyzed to GDP, causing a conformational change in IF2, and IF1 and IF2 are subsequently released (Step 3).

FIGURE 13.6 Initiation of translation in bacteria. (The separate components required for all three phases of translation are depicted in the box at the top of the figure.)
Translation Components in Bacteria (boxed): small subunit and large subunit (the assembled ribosome has E, P, and A sites); initiation factors IF1, IF2, and IF3; GTP; elongation factors EF-Tu and EF-G; release factors RF1, RF2, and RF3; initiator tRNA^fMet with anticodon UAC; mRNA 5′-AUG UUC GGU AAG UGA-3′ (many triplet codons).
Initiation of Translation in Bacteria:
1. Initiation factors bind to small subunit and attract mRNA.
2. tRNA^fMet binds to AUG codon of mRNA in P site, forming initiation complex; IF3 is released.
3. Large subunit binds to complex; IF1 and IF2 are released. Subsequent aminoacyl tRNA is poised to enter the A site.

Elongation

The second phase of translation, elongation, is depicted in Figure 13.7. As per our prior discussion, the initiation complex is now poised for the insertion into the A site of the second aminoacyl tRNA bearing the amino acid corresponding to the second triplet sequence on the mRNA. Charged tRNAs are transported into the complex by one of the elongation factors (EFs), EF-Tu (Step 1). Like IF2 during initiation, EF-Tu is a GTPase and is bound by a GTP. Hydrolysis of GTP causes a conformational change of EF-Tu such that it releases the bound aminoacyl tRNA.

The next step is for the terminal amino acid in the P site (methionine in this case) to be linked to the amino acid now present on the tRNA in the A site by the formation of a peptide bond. Such lengthening of a growing polypeptide chain by one amino acid is called elongation. The newly formed


dipeptide remains attached to the end of the tRNA still residing in the A site. The energy to form a new peptide bond is supplied by breaking the high-energy ester bond between a tRNA and its cognate amino acid. These reactions were initially believed to be catalyzed by an enzyme called peptidyl transferase, embedded in the large subunit of the ribosome. However, it is now clear that the catalytic activity is actually a function of the 23S rRNA of the large subunit. In such a case, as we saw with splicing of pre-mRNAs (see Chapter 12), we refer to the complex as a , recognizing the catalytic role that RNA plays in the process.

Before elongation can be repeated, the tRNA attached to the P site, which is now uncharged, must be released from the large subunit. The uncharged tRNA moves briefly into a third site on the ribosome, the E (exit) site. The entire mRNA–tRNA–aa2–aa1 complex then shifts in the direction of the P site by a distance of three nucleotides (Step 2). This event, called translocation, requires elongation factor G (EF-G), a GTPase (Step 3). After peptide bond formation, EF-G hydrolyzes GTP, which causes a conformational change of EF-G such that it elongates. This change causes a ratchet-like movement of the small subunit relative to the large subunit. The end result is that the third codon of mRNA is now positioned in the A site and is ready to accept its specific charged tRNA (Step 4). One simple way to distinguish the A and P sites in your mind is to remember that, following translocation, the P site (P for peptide) contains a tRNA attached to a peptide chain (a peptidyl tRNA), whereas the A site (A for amino acid) contains a charged tRNA with its amino acid attached (an aminoacyl tRNA).

The sequence of elongation and translocation is repeated over and over (Steps 4 and 5). An additional amino acid is added to the growing polypeptide chain each time the mRNA advances by three nucleotides through the ribosome. Once a polypeptide chain of sufficient size is assembled (about 30 amino acids), it begins to emerge from the base of the large subunit through an exit tunnel, as illustrated in Step 6.

As we have seen, the role of the small subunit during elongation is to "decode" the codons in the mRNA, while the role of the large subunit is peptide-bond synthesis. The efficiency of the process is remarkably high: The observed error rate is only about 10^-4. At this rate, an incorrect amino acid will occur only once in every 20 polypeptides of an average length of 500 amino acids! In a species such as E. coli, elongation proceeds at a rate of about 15 amino acids per second at 37°C.

Elongation during Translation in Bacteria (steps shown in Figure 13.7, using the mRNA 5′-AUG UUC GGU AAG UGA-3′):
1. Second charged tRNA has entered the A site, facilitated by EF-Tu; first elongation step commences.
2. Peptide bond forms; uncharged tRNA moves to the E site and subsequently out of the ribosome; the mRNA has been translocated three bases to the left, causing the tRNA bearing the dipeptide to shift into the P site.
3. The first elongation step is complete, facilitated by EF-G. The third charged tRNA is ready to enter the A site.
4. Third charged tRNA has entered the A site, facilitated by EF-Tu; second elongation step begins.
5. Tripeptide formed; second elongation step completed; uncharged tRNA moves to the E site.
6. After many elongation steps, the polypeptide chain is synthesized and exits the ribosome as the termination codon approaches the A site.

FIGURE 13.7 Elongation of the growing polypeptide chain during translation in bacteria.
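
The figures quoted for elongation (an error rate of about 10^-4, polypeptides averaging 500 amino acids, and about 15 amino acids added per second) can be tied together with a few lines of arithmetic. The Python sketch below does this under one stated assumption that is not in the text: errors are treated as independent and equally likely at every position, so the expected number of errors is simply length × rate. The function names are illustrative.

```python
ERROR_RATE = 1e-4        # incorrect amino acids per amino acid added (from the text)
AVERAGE_LENGTH = 500     # amino acids in an average polypeptide (from the text)
ELONGATION_RATE = 15     # amino acids per second in E. coli at 37 °C (from the text)


def expected_errors(length, error_rate=ERROR_RATE):
    """Expected number of incorrect amino acids in one polypeptide."""
    return length * error_rate


def polypeptides_per_error(length, error_rate=ERROR_RATE):
    """How many polypeptides are made, on average, per incorrect amino acid."""
    return 1 / expected_errors(length, error_rate)


def synthesis_time_seconds(length, rate=ELONGATION_RATE):
    """Time to polymerize one polypeptide at a constant elongation rate."""
    return length / rate


print(expected_errors(AVERAGE_LENGTH))         # about 0.05 errors per polypeptide
print(polypeptides_per_error(AVERAGE_LENGTH))  # about 20 polypeptides per error
print(synthesis_time_seconds(AVERAGE_LENGTH))  # about 33 seconds per polypeptide
```

The first two results reproduce the "once in every 20 polypeptides" statement (500 × 10^-4 = 0.05); the third is an additional consequence of the stated rate, roughly half a minute for a 500-residue chain.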


Termination

Termination, the third phase of translation, is depicted in Figure 13.8. The process is signaled by the presence of any one of three possible triplet codons appearing in the A site: UAG, UAA, or UGA. These codons do not specify an amino acid, nor do they call for a tRNA in the A site. They are called stop codons, termination codons, or nonsense codons. Often, several consecutive stop codons are part of an mRNA. When one such stop codon is encountered, the polypeptide, now completed, is still connected to the peptidyl tRNA in the P site, and the A site is empty. The termination codon is not recognized by a tRNA; rather, it is recognized by a release factor (RF1 or RF2), which binds to the A site and stimulates hydrolysis of the polypeptide from the peptidyl tRNA, leading to its release from the translation complex (Step 1). RF1 is specific to the UAA and UAG stop codons, and RF2 is specific to the UAA and the UGA codons. Then, release factor RF3 binds to the ribosome and the tRNA is released from the P site of the ribosome, which then dissociates into its subunits (Step 2). If a termination codon should appear in the middle of an mRNA molecule as a result of mutation, the same process occurs, and the polypeptide chain is prematurely terminated.

FIGURE 13.8 Termination of the process of translation in bacteria.
Termination of Translation in Bacteria:
1. Termination codon enters the A site; RF1 or RF2 stimulates hydrolysis of the polypeptide from peptidyl tRNA.
2. Ribosomal subunits dissociate and mRNA is released; polypeptide folds into native 3D conformation of protein; tRNA is released. (RF3 is shown with GTP, which is converted to GDP + P + energy.)

Polyribosomes

As elongation proceeds and the initial portion of an mRNA molecule has passed through the ribosome, this portion of mRNA is free to associate with another small subunit to form a second initiation complex. The process can be repeated several times with a single mRNA and results in what are called polyribosomes, or just polysomes.

After cells are gently lysed in the laboratory, polyribosomes can be isolated from them and analyzed. The photos in Figure 13.9 show these complexes as seen under the electron microscope. In Figure 13.9(a), you can see the thin lines of mRNA between the individual ribosomes. The micrograph in Figure 13.9(b) is even more remarkable, for it

[Micrograph labels: (a) mRNA, ribosome; (b) ribosome, mRNA, polypeptide chain]

FIGURE 13.9 Polyribosomes as seen under the electron microscope. Those in (a) were derived from rabbit reticulocytes engaged in the translation of hemoglobin mRNA. The polyribosomes in (b) were taken from the giant salivary gland cells of the midgefly, Chironomus thummi. Note that the nascent polypeptide chains are apparent as they emerge from each ribosome. Their length increases as translation proceeds from left (5′) to right (3′) along the mRNA.


shows the polypeptide chains emerging from the ribosomes during translation. The formation of polysome complexes represents an efficient use of the components available for protein synthesis during a unit of time. It is as if the mRNA is threaded through numerous ribosomes that are side by side such that translation is occurring simultaneously in each one, but each subsequent ribosome is a bit behind its neighbor in the amount of mRNA that has been translated.

ESSENTIAL POINT
Translation, like transcription, is subdivided into the stages of initiation, elongation, and termination and relies on base-pairing affinities between complementary nucleotides. ■

13.3 High-Resolution Studies Have Revealed Many Details about the Functional Bacterial Ribosome

Our knowledge of the process of translation and the structure of the ribosome is based primarily on biochemical and genetic observations, in addition to the visualization of ribosomes under the electron microscope. To confirm and refine this information, the next step is to examine the ribosome at even higher levels of resolution. For example, X-ray diffraction analysis of ribosome crystals is one way to achieve this. However, because of its tremendous size and the complexity of molecular interactions occurring in the functional ribosome, it was extremely difficult to obtain the crystals necessary to perform X-ray diffraction studies. Nevertheless, great strides have been made over the past decade. First, the individual ribosomal subunits were crystallized and examined in several laboratories, most prominently that of Venkatraman Ramakrishnan. Then, the crystal structure of the intact 70S ribosome, complete with associated mRNA and tRNAs, was examined by Harry Noller and colleagues. In essence, the entire translational complex was seen at the atomic level. Both Ramakrishnan and Noller derived the ribosomes from the bacterium Thermus thermophilus.

Many noteworthy observations have come from these investigations. For example, the shape of the ribosome changes during different functional states, attesting to the dynamic nature of the process of translation. A great deal has also been learned about the location of the RNA components of the subunits. About one-third of the 16S RNA is responsible for producing a flat projection, referred to as the platform, within the smaller 30S subunit, and it modulates movement of the mRNA–tRNA complex during translocation. One of the models based on Noller’s findings is shown in the opening photograph of this chapter (p. 241).

Crystallographic analysis also supports the concept that rRNA is the real “player” in the ribosome during translation. The interface between the two subunits, considered to be the location in the ribosome where polymerization of amino acids occurs, is composed almost exclusively of RNA. In contrast, the numerous ribosomal proteins are found mostly on the periphery of the ribosome. These observations confirm what has been predicted on genetic grounds: the catalytic steps that join amino acids during translation occur under the direction of RNA, not proteins.

Another interesting finding involves the actual location of the various sites predicted to house tRNAs during translation. All three sites (A, P, and E) have been identified in X-ray diffraction studies, and in each case, the RNA of the ribosome makes direct contact with the various loops and domains of the tRNA molecule. This observation helps us understand why the distinctive three-dimensional conformation that is characteristic of all tRNA molecules has been preserved throughout evolution.

Still another noteworthy observation takes us back almost 50 years, to when Francis Crick proposed the wobble hypothesis, as introduced in Chapter 12. The Ramakrishnan group has identified the precise location along the 16S rRNA of the 30S subunit involved in the decoding step that connects mRNA to the proper tRNA. At this location, two particular nucleotides of the 16S rRNA actually flip out and probe the codon:anticodon region, and are believed to check for accuracy of base pairing during this interaction. According to the wobble hypothesis, the stringency of this step is high for the first two base pairs but less so for the third (or wobble) base pair.

As our knowledge of the translation process in bacteria has continued to grow, a remarkable study was reported in 2010 by Niels Fischer and colleagues. Using a unique high-resolution approach, the technique of time-resolved single particle cryo-electron microscopy (cryo-EM), the 70S E. coli ribosome was captured while in the process of translation and examined at a resolution of 5.5 Å. This research team examined how tRNA is translocated during elongation of the polypeptide chain. They demonstrated that the trajectories are coupled with dynamic conformational changes in the components of the ribosome. Surprisingly, the work has revealed that during translation, the ribosome behaves as a complex molecular machine powered by Brownian movement driven by thermal energy. That is, the energetic requirements for achieving the various conformational changes essential to translocation are inherent to the ribosome itself.

Numerous questions about ribosome structure and function still remain. In particular, the precise role of the many ribosomal proteins is yet to be clarified. Nevertheless, the models that are emerging from the above research provide us with a much better understanding of the mechanism of translation.
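The difference in stringency between the first two codon positions and the third can be illustrated with a short sketch. The Python below is not a model of the ribosome or of the 16S rRNA nucleotides that probe the codon:anticodon region. It only encodes strict Watson-Crick pairing at the first two codon positions and Crick's wobble pairings at the third. The function name `decodes`, the two lookup tables, and the example anticodons are illustrative assumptions, not taken from the studies described above.

```python
# Watson-Crick partners, required at codon positions 1 and 2.
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Crick's wobble rules: the base at the 5' end of the anticodon maps to
# the codon 3' bases it is allowed to pair with (I = inosine).
WOBBLE = {
    "C": {"G"},
    "A": {"U"},
    "U": {"A", "G"},
    "G": {"C", "U"},
    "I": {"U", "C", "A"},
}


def decodes(codon: str, anticodon: str) -> bool:
    """Return True if the anticodon can read the codon.

    Both sequences are written 5'->3'. Pairing is antiparallel, so codon
    position 1 pairs with anticodon position 3, and codon position 3 (the
    wobble position) pairs with anticodon position 1.
    """
    if len(codon) != 3 or len(anticodon) != 3:
        raise ValueError("codon and anticodon must each have three bases")
    first_ok = WATSON_CRICK.get(codon[0]) == anticodon[2]    # strict
    second_ok = WATSON_CRICK.get(codon[1]) == anticodon[1]   # strict
    third_ok = codon[2] in WOBBLE.get(anticodon[0], set())   # relaxed
    return first_ok and second_ok and third_ok


if __name__ == "__main__":
    # Illustrative anticodon 5'-GAA-3': G at the wobble position pairs with C or U.
    for codon in ("UUU", "UUC", "UUA", "UUG"):
        print(codon, decodes(codon, "GAA"))
```

In this sketch, an anticodon beginning with G reads codons ending in C or U but rejects those ending in A or G, while any mismatch at the first or second position is rejected outright. That mirrors the high stringency at the first two base pairs and the lower stringency at the third.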

M13_KLUG8414_10_SE_C13.indd 249 16/11/18 5:14 pm 250 13 Translation and Proteins

13.4 Translation Is More Complex in Eukaryotes

The general features of the model of translation we just discussed were initially derived from investigations of the process in bacteria. Conceptually, the most significant difference between translation in bacteria and eukaryotes is that in bacteria transcription and translation both take place in the cytoplasm and therefore are coupled, whereas in eukaryotes these two processes are separated both spatially and temporally. In eukaryotic cells transcription occurs in the nucleus and translation in the cytoplasm. This separation provides multiple opportunities for the regulation of gene expression in eukaryotic cells (a topic we will turn to in Chapter 16).

Another central difference between bacterial and eukaryotic translation, as we have already seen (Figure 13.1), is that eukaryotes have larger ribosomes composed of a greater number of proteins and RNAs. Interestingly, bacterial and eukaryotic rRNAs share what is called a core sequence, but in eukaryotes, rRNAs are lengthened by the addition of expansion segments (ESs), which are important for ribosome assembly and may also contribute to the regulation and specificity of translation.

The initiation phase is particularly rich in differences between eukaryotes and bacteria. For example, recall that bacterial translation initiation is dependent upon the small subunit pairing with a short sequence upstream of the start codon, the Shine–Dalgarno sequence. Eukaryotes lack this sequence. Instead, in a process termed cap-dependent translation, initiation in eukaryotes begins with the small subunit associating with the 7-methylguanosine (m7G) cap located at the 5′-end of eukaryotic mRNAs (Chapter 12). In this context, a specific sequence on bacterial mRNAs and the physical cap structure on eukaryotic mRNAs serve analogous functions.

Recall, too, that bacterial translation initiation requires initiation factors (IF1, IF2, and IF3; Figure 13.6) to attract mRNA to the small subunit and prevent premature association of the large subunit. In eukaryotes, a suite of eukaryotic initiation factors (eIFs) carries out these processes. A complex consisting of several eIFs, the initiator tRNA, and the small subunit of the ribosome assembles adjacent to the m7G cap. The assembly then slides along the mRNA searching for the start codon in a process known as scanning. While the bacterial start codon is recognized by an initiator tRNA carrying an N-formylmethionine (fMet), the eukaryotic start codon encodes unformylated methionine. It turns out that a unique transfer RNA (tRNAiMet) is used during eukaryotic initiation, one different from the tRNAMet used for AUG codons after the start. Another eukaryote-specific feature of translation initiation is that many mRNAs contain a purine (A or G) three bases upstream from the AUG initiator codon, which is often followed by a G (A/GNNAUGG). Named after its discoverer, Marilyn Kozak, this Kozak sequence is considered to increase the efficiency of translation initiation in eukaryotes.

Interestingly, the poly-A tail at the 3′-end of eukaryotic mRNAs also plays an important role in translation initiation. Recall from Chapter 12 that poly-A binding proteins bind to the poly-A tail to protect the mRNA from degradation. The poly-A binding proteins also bind to one of the eukaryotic initiation factors, eIF4G, which in turn binds to eIF4E, also known as the cap-binding protein. As this name suggests, eIF4E binds the m7G cap. The resulting complex is required for translation initiation for many eukaryotic mRNAs. Because the mRNA forms a loop that is closed where the cap and tail are brought together, the process is called closed-loop translation (Figure 13.10). One possible advantage of closed-loop translation is that the cell will not waste energy translating a partially degraded mRNA lacking either a cap or poly-A tail, features necessary for the closed loop. Another advantage that has been proposed is that the closed-loop structure allows for efficient ribosome recycling, whereby ribosomes that complete synthesis of one polypeptide can dissociate from the mRNA and then reinitiate translation adjacent to the cap, which is a short distance away in the loop.

FIGURE 13.10 Eukaryotic closed-loop translation. eIF4E binds to the cap on the mRNA and to a scaffold protein, eIF4G, which binds to poly-A binding proteins (PABPs) on the poly-A tail of the mRNA. Ribosomes assemble at the cap, scan for the start codon, translate around the loop terminating at a stop codon, and may then reinitiate translation in a process called ribosome recycling.

Still other differences between translation in bacteria and eukaryotes are noteworthy. Eukaryotic mRNAs are much longer lived than are their bacterial counterparts. Bacterial mRNAs, which lack a cap and poly-A tail, are often translated immediately after transcription and degraded within five minutes. On the other hand, eukaryotic mRNAs,


with the protection of a cap and poly-A tail, can persist far longer, with an average half-life of 10 hours for mRNAs in cultured human cells. Thus, eukaryotic mRNAs are often available for translation for much longer periods of time.

After translation initiation, proteins similar to those in bacteria guide the elongation and termination of translation in eukaryotes. Many of these eukaryotic elongation factors (eEFs) and eukaryotic release factors (eRFs) are clearly homologous to their counterparts in bacteria. For example, the role of bacterial EF-Tu, which guides aminoacyl tRNAs into the A site of the ribosome, is fulfilled by eEF1α. Unlike bacteria, eukaryotes have a single release factor, eRF1, which recognizes each of the three stop codons.

We conclude this section by noting that in 2015, the structure of the highly complex 80S human ribosome was visualized by Bruno Klaholz and colleagues at the remarkable resolution of 3.6 Å. This research reveals the interactions between the rRNAs and proteins of the ribosome at an atomic level of detail. Their images reveal that the interface of the large and small subunits remodels during translation, reflecting a rotational movement of the subunits as the ribosome translocates, or moves along the mRNA. Many antibiotics target the bacterial ribosome to block its activity, but they have some negative side effects when used to fight bacterial infections in humans because they partially inhibit the activity of the human ribosome. This study provides an important model that may assist in reducing the side effects of antibiotics by increasing their specificity for bacterial ribosomes. In addition, this study may enable the design of drugs to slow down the rate of translation of the highly active ribosomes in human cancer cells, thus starving these cells of the protein synthesis on which they are dependent.

13.5 The Initial Insight That Proteins Are Important in Heredity Was Provided by the Study of Inborn Errors of Metabolism

Let’s consider how we know that proteins are the end products of genetic expression. The first insight into the role of proteins in genetic processes was provided by observations made by Sir Archibald Garrod and William Bateson early in the twentieth century. Garrod was born into an English family of medical scientists. His father was a physician with a strong interest in the chemical basis of rheumatoid arthritis, and his eldest brother was a leading zoologist in London. It is not surprising, then, that as a practicing physician, Garrod became interested in several human disorders that seemed to be inherited. Although he also studied albinism and cystinuria, we shall describe his investigation of the disorder alkaptonuria. Individuals with this disorder have an important metabolic pathway blocked. As a result, they cannot metabolize the alkapton 2,5-dihydroxyphenylacetic acid, also known as homogentisic acid. Homogentisic acid accumulates in cells and tissues and is excreted in the urine. The molecule’s oxidation products are black and easily detectable in the diapers of newborns. The products tend to accumulate in cartilaginous areas, causing the ears and nose to darken. The deposition of homogentisic acid in joints leads to a benign arthritic condition. This rare disease is not serious, but it persists throughout an individual’s life.

Garrod studied alkaptonuria by looking for patterns of inheritance of this benign trait. Eventually he concluded that it was genetic in nature. Of 32 known cases, he ascertained that 19 were confined to seven families, with one family having four affected siblings. In several instances, the parents were unaffected but known to be related as first cousins. Parents who are so related have a higher probability than unrelated parents of producing offspring that express recessive traits because such parents are both more likely to be heterozygous for some of the same recessive traits. Garrod concluded that this inherited condition was the result of an alternative mode of metabolism, thus implying that hereditary information controls chemical reactions in the body. While genes and enzymes were not familiar terms during Garrod’s time (1902), he used the corresponding concepts of unit factors and ferments.

Only a few geneticists, including Bateson, were familiar with or referred to Garrod’s work. Garrod’s ideas fit nicely with Bateson’s belief that inherited conditions are caused by the lack of some critical substance. In 1909, Bateson published Mendel’s Principles of Heredity, in which he linked Garrod’s ferments with heredity. However, for almost 30 years, most geneticists failed to see the relationship between genes and enzymes. Garrod and Bateson, like Mendel, were ahead of their time.

13.6 Studies of Neurospora Led to the One-Gene:One-Enzyme Hypothesis

In two separate investigations beginning in 1933, George Beadle provided the first convincing experimental evidence that genes are directly responsible for the synthesis of enzymes. The first investigation, conducted in collaboration with Boris Ephrussi, involved Drosophila eye pigments. Together, they confirmed that mutant genes that alter the eye color of fruit flies could be linked to biochemical errors that, in all likelihood, involved the loss of enzyme function. Encouraged by these findings, Beadle then joined with Edward Tatum to investigate nutritional mutations in the pink bread mold Neurospora crassa. This investigation led to the one-gene:one-enzyme hypothesis.


Analysis of Neurospora Mutants by Beadle and Tatum

In the early 1940s, Beadle and Tatum chose to work with Neurospora because much was known about its biochemistry and because mutations could be induced and isolated with relative ease. By inducing mutations, they produced strains that had genetic blocks of reactions essential to the growth of the organism.

Beadle and Tatum knew that this mold could manufacture nearly everything necessary for normal development. For example, using rudimentary carbon and nitrogen sources, this organism can synthesize nine water-soluble vitamins, 20 amino acids, numerous carotenoid pigments, and all essential purines and pyrimidines. Beadle and Tatum irradiated asexual conidia (spores) with X rays to increase the frequency of mutations and allowed them to be grown on “complete” medium containing all the necessary growth factors (e.g., vitamins and amino acids). Under such growth conditions, a mutant strain unable to grow on minimal medium was able to grow by virtue of supplements present in the enriched complete medium. All the cultures were then transferred to minimal medium. If growth occurred on the minimal medium, the organisms were able to synthesize all the necessary growth factors themselves, and the researchers concluded that the culture did not contain a nutritional mutation. If no growth occurred on minimal medium, they concluded that the culture contained a nutritional mutation, and the only task remaining was to determine its type. These results are shown in Figure 13.11(a).

Many thousands of individual spores from this procedure were isolated and grown on complete medium. In subsequent tests on minimal medium, many cultures failed

FIGURE 13.11 Induction, isolation, and characterization of a nutritional auxotrophic mutation in Neurospora. (a) Most conidia are not affected, but one conidium (shown in red) contains a mutation. In (b) and (c), the precise nature of the mutation is established and found to involve the biosynthesis of tyrosine.
Panel (a): Conidia exposed to X-ray or UV radiation are grown on complete medium, then on minimal medium. (1) Growth on both: no induced nutritional mutation. (2) No growth on minimal medium: a nutritional mutation was induced.
Panel (b): Media tested are complete, minimal, minimal + vitamins, minimal + purines and pyrimidines, and minimal + amino acids. Growth occurs only when an amino acid supplement is provided: the induced mutant cannot synthesize an amino acid.
Panel (c): Media tested are complete, minimal, minimal + leucine, minimal + alanine, minimal + tyrosine, and minimal + phenylalanine. Mutant cells grow only when tyrosine is added: the mutation affects synthesis of tyrosine (tyr−).


to grow, indicating that a nutritional mutation had been induced. To identify the mutant type, the mutant strains were then tested on a series of different minimal media [Figure 13.11(b)], each containing groups of supplements, and subsequently on media containing single vitamins, purines, pyrimidines, or amino acids [Figure 13.11(c)] until one specific supplement that permitted growth was found. Beadle and Tatum reasoned that the supplement that restored growth would be the molecule that the mutant strain could not synthesize.

The first mutant strain they isolated required vitamin B6 (pyridoxine) in the medium, and the second required vitamin B1 (thiamine). Using the same procedure, Beadle and Tatum eventually isolated and studied hundreds of mutants deficient in the ability to synthesize other vitamins, amino acids, or other substances.

The findings derived from testing over 80,000 spores convinced Beadle and Tatum that genetics and biochemistry have much in common. It seemed likely that each nutritional mutation caused the loss of the enzymatic activity that facilitated an essential reaction in wild-type organisms. It also appeared that a mutation could be found for nearly any enzymatically controlled reaction. Beadle and Tatum had thus provided sound experimental evidence for the hypothesis that one gene specifies one enzyme, an idea alluded to over 30 years earlier by Garrod and Bateson. With modifications, this concept was to become another major principle of genetics.

ESSENTIAL POINT
Beadle and Tatum’s work with nutritional mutations in Neurospora led them to propose that one gene encodes one enzyme. ■

13.7 Studies of Human Hemoglobin Established That One Gene Encodes One Polypeptide

The one-gene:one-enzyme hypothesis that was developed in the early 1940s was not immediately accepted by all geneticists. This is not surprising because it was not yet clear how mutant enzymes could cause variation in many phenotypic traits. For example, Drosophila mutants demonstrate altered eye size, wing shape, wing-vein pattern, and so on. Plants exhibit mutant varieties of seed texture, height, and fruit size. How an inactive mutant enzyme could result in such phenotypes puzzled many geneticists.

Two factors soon modified the one-gene:one-enzyme hypothesis. First, although nearly all enzymes are proteins, not all proteins are enzymes. As the study of biochemical genetics progressed, it became clear that all proteins are specified by the information stored in genes, leading to the more accurate phraseology, one-gene:one-protein hypothesis. Second, proteins often show a substructure consisting of two or more polypeptide chains. This is the basis of the quaternary protein structure, which we will discuss later in this chapter. Because each distinct polypeptide chain is encoded by a separate gene, a more accurate statement of Beadle and Tatum’s basic tenet is one-gene:one-polypeptide chain hypothesis. These modifications of the original hypothesis became apparent during the analysis of hemoglobin structure in individuals with sickle-cell anemia.

Sickle-Cell Anemia

The first direct evidence that genes specify proteins other than enzymes came from work on mutant hemoglobin molecules found in humans with the disorder sickle-cell anemia. Affected individuals have erythrocytes that, under low oxygen tension, become elongated and curved because of the polymerization of hemoglobin. The sickle shape of these erythrocytes is in contrast to the biconcave disc shape characteristic in unaffected individuals (Figure 13.12). Those with the disease suffer attacks when red blood cells aggregate in the venous side of capillary systems, where oxygen tension is very low. As a result, a variety of tissues are deprived of oxygen and suffer severe damage. When this occurs, an individual is said to experience a sickle-cell crisis. If left untreated, a crisis can be fatal. The kidneys, muscles, joints, brain, gastrointestinal tract, and lungs can be affected.

FIGURE 13.12 A comparison of an erythrocyte from a healthy individual (left) and from an individual afflicted with sickle-cell anemia (right).

In addition to undergoing crises, these individuals are anemic because their erythrocytes are destroyed more rapidly than are normal red blood cells. Compensatory physiological mechanisms include increased red blood cell production by bone marrow, along with accentuated heart


action. These mechanisms lead to abnormal bone size and shape, as well as dilation of the heart.

In 1949, James Neel and E. A. Beet demonstrated that the disease is inherited as a Mendelian trait. Pedigree analysis revealed three genotypes and phenotypes controlled by a single pair of alleles, HbA and HbS. Unaffected and affected individuals result from the homozygous genotypes HbAHbA and HbSHbS, respectively. The red blood cells of heterozygotes, who exhibit the sickle-cell trait but not the disease, undergo much less sickling because over half of their hemoglobin is normal. Although they are largely unaffected, heterozygotes are “carriers” of the defective gene, which is transmitted on average to 50 percent of their offspring.

In the same year, Linus Pauling and his coworkers provided the first insight into the molecular basis of the disease. They showed that hemoglobins isolated from people with and without sickle-cell anemia differ in their rates of electrophoretic migration. In this technique, charged molecules migrate in an electric field. If the net charge of two molecules is different, their rates of migration will be different. On this basis, Pauling and his colleagues concluded that a chemical difference exists between normal (HbA) and sickle-cell (HbS) hemoglobin.

Pauling’s findings suggested two possibilities. It was known that hemoglobin consists of four nonproteinaceous, iron-containing heme groups and a globin portion that contains four polypeptide chains. The alteration in net charge in HbS had to be due, theoretically, to a chemical change in one of these components. Work carried out between 1954 and 1957 by Vernon Ingram demonstrated that the chemical change occurs in the primary structure of the globin portion of the hemoglobin molecule. Ingram showed that HbS differs in amino acid composition compared to HbA. Human adult hemoglobin contains two identical α chains of 141 amino acids and two identical β chains of 146 amino acids. Analysis revealed just a single amino acid change: valine was substituted for glutamic acid at the sixth position of the β chain (Figure 13.13).

The significance of this discovery has been multifaceted. It clearly establishes that a single gene provides the genetic information for a single polypeptide chain. Studies of HbS also demonstrate that a mutation can affect the phenotype by directing a single amino acid substitution. Also, by providing the explanation for sickle-cell anemia, the concept of inherited molecular disease was firmly established. Finally, this work has led to a thorough study of human hemoglobins, which has provided valuable genetic insights.

ESSENTIAL POINT
Pauling and Ingram’s investigations of hemoglobin from patients with sickle-cell anemia led to the modification of the one-gene:one-enzyme hypothesis to indicate that one gene encodes one polypeptide chain. ■

EVOLVING CONCEPT OF THE GENE
In the 1940s, a time when the molecular nature of the gene had yet to be defined, groundbreaking work of Beadle and Tatum provided the first experimental evidence concerning the product of genes, their “one-gene:one-enzyme” hypothesis. This idea received further support and was later modified to indicate that one gene specifies one polypeptide chain. ■

13.8 Variation in Protein Structure Is the Basis of Biological Diversity

Having established that the genetic information is stored in DNA and influences cellular activities through the proteins it encodes, we turn now to a brief discussion of protein structure. How can these molecules play such a critical role in determining the complexity of cellular activities? As we shall see, the fundamental aspects of the structure of proteins provide the basis for incredible complexity and diversity. At the outset, we should differentiate between polypeptides and proteins. Both are molecules composed of amino acids. They differ, however, in their state of assembly and functional capacity. Polypeptides are the precursors of proteins. As it is assembled on the ribosome during translation, the molecule is called a polypeptide. When released from the ribosome following translation, a polypeptide folds and assumes a higher order of structure. When this occurs, a three-dimensional conformation emerges. In many cases, several polypeptides interact to produce this conformation. When the final conformation is achieved, the molecule is now fully functional and is appropriately called a protein. Its three-dimensional conformation is essential to the function of the molecule.

Partial amino acid sequences of β chains (the two sequences differ at position 6):

Normal HbA:       NH2–Val–His–Leu–Thr–Pro–Glu–Glu ... COOH
Sickle-cell HbS:  NH2–Val–His–Leu–Thr–Pro–Val–Glu ... COOH

FIGURE 13.13 A comparison of the amino acid sequence of the β chain found in HbA and HbS.
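The comparison in Figure 13.13 and Pauling's electrophoresis result can be connected with a small sketch. The residue lists below are copied from the figure. The charge table is a deliberate simplification and an assumption of this example: acidic side chains count as −1, lysine and arginine as +1, and everything else as 0 near neutral pH. Histidine appears in both sequences, so scoring it as 0 does not affect the difference. The function names are hypothetical.

```python
# First seven beta-chain residues as shown in Figure 13.13.
HBA_BETA = ["Val", "His", "Leu", "Thr", "Pro", "Glu", "Glu"]
HBS_BETA = ["Val", "His", "Leu", "Thr", "Pro", "Val", "Glu"]

# Simplified side-chain charges near neutral pH (assumption for illustration).
SIDE_CHAIN_CHARGE = {"Asp": -1, "Glu": -1, "Lys": +1, "Arg": +1}


def substitutions(reference, variant):
    """List (position, reference_residue, variant_residue), 1-based."""
    if len(reference) != len(variant):
        raise ValueError("sequences must be the same length")
    return [
        (index + 1, ref, var)
        for index, (ref, var) in enumerate(zip(reference, variant))
        if ref != var
    ]


def side_chain_charge(sequence):
    """Sum the simplified side-chain charges over a residue list."""
    return sum(SIDE_CHAIN_CHARGE.get(residue, 0) for residue in sequence)


if __name__ == "__main__":
    print(substitutions(HBA_BETA, HBS_BETA))   # the single difference, at position 6
    print(side_chain_charge(HBA_BETA), side_chain_charge(HBS_BETA))
```

Under these assumptions the only difference is Glu to Val at position 6, and the segment's side-chain charge changes from −2 to −1. Each β chain therefore loses one negative charge, which is the kind of net-charge difference that alters electrophoretic migration.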


The polypeptide chains of proteins, like nucleic acids, are linear nonbranched polymers. There are 20 commonly occurring amino acids that serve as the subunits (the building blocks) of proteins. Each amino acid has a carboxyl group, an amino group, and an R (radical) group (a side chain) bound covalently to a central carbon (C) atom. The R group gives each amino acid its chemical identity, exhibiting a variety of configurations that can be divided into four main classes: nonpolar (hydrophobic), polar (hydrophilic), positively charged, and negatively charged. Figure 13.14 shows the chemical structure of all 20 amino acids and the class to which each belongs. Because polypeptides are often long polymers and because each position may be occupied by any 1 of the 20 amino acids with their unique chemical properties, enormous variation in chemical conformation and activity is possible. For example, if an average polypeptide is composed of 200 amino acids, 20^200 different molecules, each with a unique sequence, can be created using the 20 different building blocks.

Around 1900, German chemist Emil Fischer determined the manner in which the amino acids are bonded together. He showed that the amino group of one amino acid reacts with the carboxyl group of another amino acid during a dehydration reaction, releasing a molecule of H2O. The resulting covalent bond is a peptide bond [Figure 13.15(a)]. Two amino acids linked together constitute a dipeptide, three a tripeptide, and so on. Once 10 or more amino acids are linked by peptide bonds, the chain is referred to as a polypeptide. Generally, no matter how long a polypeptide is, it will contain a free amino group at one end (the N-terminus) and a free carboxyl group at the other end (the C-terminus).

Four levels of protein structure are recognized: primary, secondary, tertiary, and quaternary. The sequence of amino acids in the linear backbone of the polypeptide constitutes its primary structure. It is specified by the sequence in DNA via an mRNA intermediate. The primary structure of a polypeptide helps determine the specific characteristics of the higher orders of organization as a protein is formed.

Secondary structures are certain regular or repeating configurations in space assumed by amino acids lying close to one another in the polypeptide chain. In 1951, Linus Pauling, Herman Branson, and Robert Corey predicted, on theoretical grounds, an α-helix as one type of secondary structure. The α-helix model [Figure 13.15(b)] has since been confirmed by X-ray crystallographic studies. The helix is composed of a right-handed spiral chain of amino acids

FIGURE 13.14 Chemical structures of the 20 amino acids encoded by organisms, divided into four major classes. Each amino acid has two abbreviations in universal use; for example, alanine is designated by Ala or A.
1. Nonpolar (hydrophobic): Alanine (Ala, A), Valine (Val, V), Leucine (Leu, L), Isoleucine (Ile, I), Methionine (Met, M), Proline (Pro, P), Tryptophan (Trp, W), Phenylalanine (Phe, F)
2. Polar (hydrophilic): Glycine (Gly, G), Serine (Ser, S), Threonine (Thr, T), Cysteine (Cys, C), Tyrosine (Tyr, Y), Asparagine (Asn, N), Glutamine (Gln, Q)
3. Polar, positively charged (basic): Histidine (His, H), Lysine (Lys, K), Arginine (Arg, R)
4. Polar, negatively charged (acidic): Aspartic acid (Asp, D), Glutamic acid (Glu, E)
Amino acid structure: an amino group, a carboxyl group, and an R group attached to a central carbon atom.


FIGURE 13.15 (a) Peptide bond formation between two amino acids, resulting from a dehydration reaction. (b) The right-handed α-helix, which represents one form of secondary structure of a polypeptide chain. (c) The β-pleated sheet, an alternative form of secondary structure of polypeptide chains. To maintain clarity, not all atoms are shown.

stabilized by hydrogen bonds. The side chains (the R groups) of amino acids extend outward from the helix, and each amino acid residue occupies a vertical distance of 1.5 Å in the helix. There are 3.6 residues per turn.

Also in 1951, Pauling and Corey proposed a second structure, the β-pleated sheet. In this model, a single polypeptide chain folds back on itself, or several chains run in either parallel or antiparallel fashion next to one another. Each such structure is stabilized by hydrogen bonds formed between atoms on adjacent chains [Figure 13.15(c)]. A zigzagging plane is formed in space with adjacent amino acids 3.5 Å apart. As a general rule, most proteins demonstrate a mixture of α-helix and β-pleated-sheet structures.

While the secondary structure describes the arrangement of amino acids within certain areas of a polypeptide chain, the tertiary structure defines the three-dimensional conformation of the entire chain in space. Each protein twists and turns and loops around itself in a very particular fashion, characteristic of the specific protein. A model of the three-dimensional tertiary structure of the respiratory pigment myoglobin is shown in Figure 13.16.

FIGURE 13.16 The tertiary level of protein structure in a respiratory pigment, myoglobin. The bound oxygen atom is shown in red.

The three-dimensional conformation achieved by any protein is a product of the primary structure of the polypeptide. As the polypeptide is folded, the most thermodynamically stable conformation is created. This level of organization is essential because the specific function of any protein is directly related to its tertiary structure.

The concept of quaternary structure applies to those proteins composed of more than one polypeptide chain and indicates the position of the various chains in relation to one another. Hemoglobin, a protein consisting of four polypeptide chains, has been studied in great detail. Many enzymes, including DNA and RNA polymerase, demonstrate quaternary structure.

ESSENTIAL POINT
Proteins, the end products of genes, demonstrate four levels of structural organization that together describe their three-dimensional conformation, which is the basis of each molecule’s function. ■
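Two of the numbers quoted in this section lend themselves to a quick calculation: the 20^200 possible sequences for a 200-residue polypeptide, and the α-helix dimensions that follow from a rise of 1.5 Å per residue and 3.6 residues per turn. The sketch below is only arithmetic on those stated values. The function names are hypothetical, and the helix functions assume an ideal, uninterrupted helix.

```python
RISE_PER_RESIDUE = 1.5    # Å of helix axis per residue (value from the text)
RESIDUES_PER_TURN = 3.6   # residues per helical turn (value from the text)


def sequence_count(length: int, alphabet_size: int = 20) -> int:
    """Number of distinct sequences of the given length."""
    return alphabet_size ** length


def helix_pitch() -> float:
    """Axial distance covered by one full turn, in Å."""
    return RISE_PER_RESIDUE * RESIDUES_PER_TURN


def helix_length(residues: int) -> float:
    """Axial length, in Å, of an ideal helix with this many residues."""
    return residues * RISE_PER_RESIDUE


def helix_turns(residues: int) -> float:
    """Number of turns in an ideal helix with this many residues."""
    return residues / RESIDUES_PER_TURN


if __name__ == "__main__":
    count = sequence_count(200)
    print(f"20^200 has {len(str(count))} decimal digits")
    print(f"pitch = {helix_pitch():.1f} Å per turn")
    print(f"18 residues: {helix_length(18):.1f} Å, {helix_turns(18):.1f} turns")
```

Python integers are arbitrary precision, so 20^200 is computed exactly; it has 261 decimal digits. The pitch works out to about 5.4 Å per turn, so an 18-residue ideal helix would span 27 Å in five turns.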


NOW SOLVE THIS

13.2 HbS results from the substitution of valine for glutamic acid at the number 6 position in the β chain of human hemoglobin. HbC is the result of a change at the same position in the β chain, but in this case lysine replaces glutamic acid. Return to the genetic code table (Figure 12.7) and determine whether single-nucleotide changes can account for these mutations. Then view Figure 13.14 and examine the R groups in the amino acids glutamic acid, valine, and lysine. Describe the chemical differences between the three amino acids. Predict how the changes might alter the structure of the molecule and lead to altered hemoglobin function.

HINT: This problem asks you to consider the potential impact of several amino acid substitutions that result from mutations in one of the genes encoding one of the chains making up human hemoglobin. The key to its solution is to consider and compare the structure of the three amino acids (glutamic acid, lysine, and valine) and their net charge (see Figure 13.14).

Protein Folding and Misfolding

It was long thought that protein folding was a spontaneous process whereby a linear molecule exiting the ribosome achieved a three-dimensional, thermodynamically stable conformation based solely on the combined chemical properties inherent in the amino acid sequence. This indeed is the case for many proteins. However, numerous studies have shown that for other proteins, correct folding is dependent on members of a family of molecules called chaperones. Chaperones are themselves proteins (sometimes called molecular chaperones or chaperonins) that function by mediating the folding process by excluding the formation of alternative, incorrect patterns. While they may initially interact with the protein in question, like enzymes, they do not become part of the final product. Chaperones have been identified in all organisms and are even present in mitochondria and chloroplasts.

In eukaryotic cells, chaperones are particularly important when translation occurs on membrane-bound ribosomes, where the newly translated polypeptide is extruded into the lumen of the endoplasmic reticulum. Even in their presence, misfolding may still occur, and one more system of “quality control” exists. As misfolded proteins are transported out of the endoplasmic reticulum to the cytoplasm, they are “tagged” by another class of small proteins called ubiquitins. The protein–ubiquitin complex moves to a cellular structure called the proteasome, which cleaves off ubiquitin and degrades the protein (see Chapter 16 for additional information on this process).

Protein folding is a critically important process, not only because misfolded proteins may be nonfunctional, but also because improperly folded proteins can accumulate and be detrimental to cells and the organisms that contain them. For example, a group of transmissible brain disorders in mammals (scrapie in sheep, bovine spongiform encephalopathy, or mad cow disease, in cattle, and Creutzfeldt–Jakob disease in humans) are caused by the presence in the brain of prions, which are aggregates of a misfolded protein. The misfolded protein (called PrPSc) is an altered version of a normal cellular protein (called PrPC) synthesized in neurons and found in the brains of all adult mammals. The difference between PrPC and PrPSc lies in their secondary protein structures. Normal, noninfectious PrPC folds into an α-helix, whereas infectious PrPSc folds into a β-pleated sheet. When an abnormal PrPSc molecule contacts a PrPC molecule, the normal protein refolds into the abnormal conformation. The process continues as a chain reaction, with potentially devastating results: the formation of prion particles that eventually destroy the brain. Hence, this group of disorders can be considered diseases of secondary protein structure.

Currently, many laboratories are studying protein folding and misfolding, particularly as related to genetics. Numerous inherited human disorders are caused by misfolded proteins that form aggregates. Sickle-cell anemia, discussed earlier in this chapter, is a case in point, where the β chains of hemoglobin are altered as the result of a single amino acid change, causing the molecules to aggregate within erythrocytes, with devastating results. Various progressive neurodegenerative diseases such as Huntington disease, Alzheimer disease, and Parkinson disease are linked to the formation of abnormal protein aggregates in the brain. Huntington disease is inherited as an autosomal dominant trait, whereas less clearly defined genetic components are associated with Alzheimer and Parkinson diseases.

13.9 Proteins Function in Many Diverse Roles

The essence of life on Earth rests at the level of diverse cellular function. While DNA and mRNA serve as vehicles to store and express genetic information, proteins are at the heart of cellular function. And it is the capability of cells to assume diverse structures and functions that distinguishes most eukaryotes from simpler organisms such as bacteria. Therefore, an introductory understanding of protein function is critical to a complete view of genetic processes.

Proteins are the most diverse macromolecules found in cells and serve many different functions. For example, the respiratory pigments hemoglobin and myoglobin transport oxygen, which is essential for cellular metabolism.


Collagen and keratin are structural proteins associated with skin, connective tissue, and hair. Actin and myosin are contractile proteins, found in abundance in muscle tissue, while tubulin is the basis of the function of microtubules in mitotic and meiotic spindles. Still other examples are the immunoglobulins, which function in the immune system of vertebrates; transport proteins, involved in the movement of molecules across membranes; some of the hormones and their receptors, which regulate various types of chemical activity; histones, which bind to DNA in eukaryotic organisms; and transcription factors that regulate gene expression.

Nevertheless, the most diverse and extensive group of proteins (in terms of function) are the enzymes, to which we have referred throughout this chapter. Enzymes specialize in catalyzing chemical reactions within living cells. Like all catalysts, they increase the rate at which a chemical reaction reaches equilibrium, but they do not alter the end-point of the chemical equilibrium. Their remarkable, highly specific catalytic properties largely determine the metabolic capacity of any cell type and provide the underlying basis of what we refer to as biochemistry. The specific functions of many enzymes involved in the genetic and cellular processes are described throughout the text.

ESSENTIAL POINT
Of the myriad functions performed by proteins, the most influential role belongs to enzymes, which serve as highly specific biological catalysts that play a central role in the production of all classes of molecules in living systems. ■

Protein Domains Impart Function

We conclude this chapter by briefly discussing the important finding that regions made up of specific amino acid sequences are associated with specific functions in protein molecules. Such sequences, usually between 50 and 300 amino acids, constitute protein domains and represent modular portions of the protein that fold into stable, unique conformations independently of the rest of the molecule. Different domains impart different functional capabilities. Some proteins contain only a single domain, while others contain two or more.

The significance of domains resides in the tertiary structures of proteins. Each domain can contain a mixture of secondary structures, including α-helices and β-pleated sheets. The unique conformation of a given domain imparts a specific function to the protein. For example, a domain may serve as the catalytic site of an enzyme, or it may impart an ability to bind to a specific ligand. Thus, discussions of proteins may mention catalytic domains, DNA-binding domains, and so on. In short, a protein must be seen as being composed of a series of structural and functional modules. Obviously, the presence of multiple domains in a single protein increases the versatility of each molecule and adds to its functional complexity.

CASE STUDY Crippled ribosomes

Diamond–Blackfan anemia (DBA) is a rare, dominant genetic disorder characterized by bone marrow malfunction, birth defects, and a predisposition to certain cancers. Infants with DBA usually develop anemia in the first year of life, have lower than normal production of red blood cells in their bone marrow, and have a high risk of developing leukemia and bone cancer. At the molecular level, DBA is caused by mutations in any one of 10 genes that encode ribosomal proteins. The first-line therapy for DBA is steroid treatment, but more than half of affected children develop resistance to the drugs, and in these cases, treatment is halted. DBA can be treated successfully with bone marrow or stem cell transplants from donors with closely matching immune system markers. Transplants from unrelated donors have significant levels of complications and mortality.

1. Given that a faulty ribosomal protein is the culprit and causes DBA, discuss the possible role of normal ribosomal proteins. Why might bone marrow cells be more susceptible to such a mutation than other cells?

2. A couple with a child affected with DBA undergoes in vitro fertilization (IVF) and genetic testing of the resulting embryos to ensure that the embryos will not have DBA. However, they also want the embryos screened to ensure that the one implanted can serve as a suitable donor for their existing child. Their plan is to have stem cells from the umbilical cord of the new baby transplanted to their existing child with DBA, thereby curing the condition. What are the ethical pros and cons of this situation?

3. While a stem cell transplant from an unaffected donor is currently the only cure for DBA, genome-editing technologies may one day enable the correction of a mutation in a patient’s own bone marrow stem cells. However, what specific information would be needed, beyond a symptom-based diagnosis of DBA, in order to accomplish this?

For related reading, see Penning, G., et al. (2002). Ethical considerations on preimplantation genetic diagnosis for HLA typing to match a future child as a donor of haematopoietic stem cells to a sibling. Hum. Reprod. 17(3):534–538.


INSIGHTS AND SOLUTIONS

1. As an extension of Beadle and Tatum’s work with Neurospora, it is possible to study multiple mutations whose impact is on the same biochemical pathway. The growth responses in the following chart were obtained using four mutant strains of Neurospora and the chemically related compounds A, B, C, and D. None of the mutants grow on minimal medium. Draw all possible conclusions from these data.

            Growth Supplement
Mutation    A    B    C    D
1           -    -    -    -
2           +    +    -    +
3           +    +    -    -
4           -    +    -    -

Solution: Nothing can be concluded about mutation 1 except that it lacks some essential growth factor, perhaps even unrelated to the biochemical pathway represented by mutations 2, 3, and 4. Nor can anything be concluded about compound C. If it is involved in the pathway, it is a product that was synthesized prior to compounds A, B, and D.

We now analyze these three compounds and the control of their synthesis by the enzymes encoded by mutations 2, 3, and 4. Because product B allows growth in all three cases, it may be considered the “end product”: it bypasses the block in all three instances. Using similar reasoning, product A precedes B in the pathway because it bypasses the block in two of the three steps, and product D precedes B, yielding a partial solution

C(?) → D → A → B

Now let’s determine which mutations control which steps. Since mutation 2 can be alleviated by products D, B, and A, it must control a step prior to all three products, perhaps the direct conversion to D (although we cannot be certain). Mutation 3 is alleviated by B and A, so its effect must precede them in the pathway. Thus, we assign it as controlling the conversion of D to A. Likewise, we can assign mutation 4 to the conversion of A to B, leading to a more complete solution

C(?) --2(?)--> D --3--> A --4--> B
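The reasoning in this solution can be written as a short counting procedure. The sketch below is only an illustration, not part of the original solution: the function name `infer_pathway` and the data layout are hypothetical, and it assumes a single unbranched pathway in which each informative mutant blocks exactly one step. Under those assumptions, a compound that rescues more mutants lies later in the pathway, and a mutant rescued by more compounds is blocked earlier.

```python
# Growth data from the chart: one string of +/- per mutant, for supplements A, B, C, D.
TABLE = {"1": "----", "2": "++-+", "3": "++--", "4": "-+--"}
growth = {m: {c: s == "+" for c, s in zip("ABCD", row)} for m, row in TABLE.items()}


def infer_pathway(growth):
    """Order compounds and blocked steps, assuming a linear, unbranched pathway."""
    compounds = sorted({c for row in growth.values() for c in row})
    rescued_by = {c: [m for m, row in growth.items() if row.get(c)] for c in compounds}

    # Compounds that rescue no mutant, and mutants rescued by nothing, cannot be placed.
    placed = [c for c in compounds if rescued_by[c]]
    unplaced_compounds = [c for c in compounds if not rescued_by[c]]
    informative = [m for m, row in growth.items() if any(row.values())]
    uninformative = [m for m in growth if m not in informative]

    # Later compounds bypass more blocks; earlier blocks are bypassed by more compounds.
    compound_order = sorted(placed, key=lambda c: len(rescued_by[c]))
    step_order = sorted(informative, key=lambda m: -sum(growth[m].values()))
    return compound_order, step_order, unplaced_compounds, uninformative


compound_order, step_order, unplaced, uninformative = infer_pathway(growth)
print(" -> ".join(compound_order))
print("blocked steps, earliest first:", step_order)
print("cannot be placed:", unplaced, uninformative)
```

For the chart above, the compounds come out in the order D, A, B and the blocks in the order 2, 3, 4, with compound C and mutation 1 left unplaced, which agrees with the written solution. The counting shortcut breaks down for branched pathways or mutants that affect more than one step.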

Problems and Discussion Questions

Mastering Genetics: Visit for instructor-assigned tutorials and problems.

1. HOW DO WE KNOW? In this chapter, we focused on the translation of mRNA into proteins as well as on protein structure and function. Along the way, we found many opportunities to consider the methods and reasoning by which much of this information was acquired. From the explanations given in the chapter, what answers would you propose to the following fundamental questions:
(a) What experimentally derived information led to Holley’s proposal of the two-dimensional cloverleaf model of tRNA?
(b) What experimental information verifies that certain codons in mRNA specify chain termination during translation?
(c) How do we know, based on studies of Neurospora nutritional mutations, that one gene specifies one enzyme?
(d) On what basis have we concluded that proteins are the end products of genetic expression?

2. CHAPTER CONCEPTS Review the Chapter Concepts list on p. 241. These all relate to the translation of genetic information stored in mRNA into proteins and how chemical information in proteins imparts function to those molecules. Write a brief essay that discusses the role of ribosomes in the process of translation as it relates to these concepts.

3. List and describe the role of all molecular constituents present in a functional polyribosome.

4. Contrast the roles of tRNA and mRNA during translation, and list all enzymes that participate in the translation processes.

5. Francis Crick proposed the adaptor hypothesis for the function of tRNA. Why did he choose that description?

6. During translation, what molecule bears the anticodon? The codon?

7. Summarize the steps involved in charging tRNAs with their appropriate amino acids.

8. Each transfer RNA requires at least four specific recognition sites that must be inherent in its tertiary structure in order for it to carry out its role. What are these sites?

9. What are the major differences between translation in bacteria and translation in eukaryotes?

10. Explain why the one-gene:one-enzyme hypothesis is no longer considered to be totally accurate.

11. Hemoglobin is a tetramer consisting of two α and two β chains. What level of protein structure is described in this statement?

12. Using sickle-cell anemia as a basis, describe what is meant by a genetic or inherited molecular disease. What are the similarities and dissimilarities between this type of a disorder and a disease caused by an invading microorganism?

13. Describe the genetic and molecular basis of sickle-cell anemia.

14. Assuming that each nucleotide is 0.34 nm long in mRNA, how many triplet codes can simultaneously occupy space in a ribosome that is 20 nm in diameter?

15. Review the concept of colinearity in Section 12.5 (p. 226) and consider the following question: Certain mutations called amber in bacteria and viruses result in premature termination of polypeptide chains during translation. Many amber mutations have been detected at different points along the gene that codes for a head protein in phage T4. How might this system be further investigated to demonstrate and support the concept of colinearity?

16. In your opinion, which of the four levels of protein organization is the most critical to a protein’s function? Defend your choice.

17. List and describe the function of as many nonenzymatic proteins as you can that are unique to eukaryotes.

18. How does an enzyme function? Why are enzymes essential for living organisms?

19. Shown in the following table are several amino acid substitutions in the α and β chains of human hemoglobin. Use the genetic code table in Figure 12.7 to determine how many of them can occur as a result of a single nucleotide change.

Hb Type         | Normal Amino Acid | Substituted Amino Acid
HbJ Toronto     | Ala               | Asp (α-5)
HbJ Oxford      | Gly               | Asp (α-15)
Hb Mexico       | Gln               | Glu (α-54)
Hb Bethesda     | Tyr               | His (β-145)
Hb Sydney       | Val               | Ala (β-67)
HbM Saskatoon   | His               | Tyr (β-63)

20. Three independently assorting genes are known to control the biochemical pathway below that provides the basis for flower color in a hypothetical plant.

colorless --[A-]--> yellow --[B-]--> green --[C-]--> speckled

Homozygous recessive mutations, which disrupt enzyme function controlling each step, are known. Determine the phenotypic results in the F1 and F2 generations resulting from the P1 crosses involving true-breeding plants given here.
(a) speckled (AABBCC) × yellow (AAbbCC)
(b) yellow (AAbbCC) × green (AABBcc)
(c) colorless (aaBBCC) × green (AABBcc)

21. How would the results in cross (a) of Problem 20 vary if genes A and B were linked with no crossing over between them? How would the results of cross (a) vary if genes A and B were linked and 20 map units apart?

22. A series of mutations in the bacterium Salmonella typhimurium results in the requirement of either tryptophan or some related molecule in order for growth to occur. From the data shown here, suggest a biosynthetic pathway for tryptophan.

         Growth Supplement
Mutation | Minimal Medium | Anthranilic Acid | Indole Glycerol Phosphate | Indole | Tryptophan
trp-8    | -              | +                | +                         | +      | +
trp-2    | -              | -                | +                         | +      | +
trp-3    | -              | -                | -                         | +      | +
trp-1    | -              | -                | -                         | -      | +

23. The emergence of antibiotic-resistant strains of Enterococci and transfer of resistance genes to other bacterial pathogens have highlighted the need for new generations of antibiotics to combat serious infections. To grasp the range of potential sites for the action of existing antibiotics, sketch the components of the translation machinery (e.g., see Step 3 of Figure 13.6), and using a series of numbered pointers, indicate the specific location for the action of the antibiotics shown in the following table.

Antibiotic         | Action
1. Streptomycin    | Binds to 30S ribosomal subunit
2. Chloramphenicol | Inhibits the peptidyl transferase function of 70S ribosome
3. Tetracycline    | Inhibits binding of charged tRNA to ribosome
4. Erythromycin    | Binds to free 50S particle and prevents formation of 70S ribosome
5. Kasugamycin     | Inhibits binding of tRNA^fMet
6. Thiostrepton    | Prevents translocation by inhibiting EF-G

14 Gene Mutation, DNA Repair, and Transposition

CHAPTER CONCEPTS

■ Mutations comprise any change in the nucleotide sequence of an organism’s genome.
■ Mutations are a source of genetic variation and provide the raw material for natural selection. They are also the source of genetic damage that contributes to cell death, genetic diseases, and cancer.
■ Mutations have a wide range of effects on organisms depending on the type of base-pair alteration, the location of the mutation within the chromosome, and the function of the affected gene product.
■ Mutations can occur spontaneously as a result of natural biological and chemical processes, or they can be induced by external factors, such as chemicals or radiation.
■ Single-gene mutations cause a wide variety of human diseases.
■ Organisms rely on a number of DNA repair mechanisms to counteract mutations. These mechanisms range from proofreading and correction of replication errors to base excision and homologous recombination repair.
■ Mutations in genes whose products control DNA repair lead to genome hypermutability, human DNA repair diseases, and cancers.
■ Transposable elements may move into and out of chromosomes, causing chromosome breaks and inducing mutations both within coding regions and in gene-regulatory regions.

[Chapter-opening photo] Pigment mutations within an ear of corn, caused by transposition of the Ds element.

The ability of DNA molecules to store, replicate, transmit, and decode information is the basis of genetic function. But equally important are the changes that occur to DNA sequences. Without the variation that arises from changes in DNA sequences, there would be no phenotypic variability, no adaptation to environmental changes, and no evolution. Gene mutations are the source of new alleles and are the origin of genetic variation within populations. On the downside, they are also the source of genetic changes that can lead to cell death, genetic diseases, and cancer.

Mutations also provide the basis for genetic analysis. The phenotypic variations resulting from mutations allow geneticists to identify and study the genes responsible for the modified trait. In genetic investigations, mutations act as identifying “markers” for genes so that they can be followed during their transmission from parents to offspring. Without phenotypic variability, classical genetic analysis would be impossible. For example, if all pea plants displayed a uniform phenotype, Mendel would have had no foundation for his research.

We have examined mutations in large regions of chromosomes: chromosomal mutations (see Chapter 6). In contrast, the mutations we will now explore are those occurring primarily in the base-pair sequence of DNA within and surrounding individual genes: gene mutations. We will also describe how the cell defends itself from mutations using various mechanisms of DNA repair.


14.1 Gene Mutations Are Classified in Various Ways

A mutation can be defined as an alteration in the nucleotide sequence of an organism’s genome. Any base-pair change in any part of a DNA molecule can be considered a mutation. A mutation may comprise a single base-pair substitution, a deletion or insertion of one or more base pairs, or a major alteration in the structure of a chromosome.

Mutations may occur within regions of a gene that code for protein or within noncoding regions of a gene such as introns and regulatory sequences, including promoters, enhancers, and splicing signals. Mutations may or may not bring about a detectable change in phenotype. The extent to which a mutation changes the characteristics of an organism depends on which type of cell suffers the mutation and the degree to which the mutation alters the function of a gene product or a gene-regulatory region.

Because of the wide range of types and effects of mutations, geneticists classify mutations according to several different schemes. These organizational schemes are not mutually exclusive. In this section, we outline some of the ways in which gene mutations are classified.

Classification Based on Type of Molecular Change

Geneticists often classify gene mutations in terms of the nucleotide changes that constitute the mutation. A change of one base pair to another in a DNA molecule is known as a point mutation, or base substitution (see Figure 14.1). A change of one nucleotide of a triplet within a protein-coding portion of a gene may result in the creation of a new triplet that codes for a different amino acid in the protein product. If this occurs, the mutation is known as a missense mutation. A second possible outcome is that the triplet will be changed into a stop codon, resulting in the termination of translation of the protein. This is known as a nonsense mutation. If the point mutation alters a codon but does not result in a change in the amino acid at that position in the protein (due to degeneracy of the genetic code), it can be considered a silent mutation.

Because eukaryotic genomes consist of so much more noncoding DNA than coding DNA (see Chapter 11), the vast majority of mutations are likely to occur in noncoding regions. These mutations may be considered neutral mutations if they do not affect gene products or gene expression. Most silent mutations, which do not change the amino acid sequence of the encoded protein, can also be considered neutral mutations.

You will often see two other terms used to describe base substitutions. If a pyrimidine replaces a pyrimidine or a purine replaces a purine, a transition has occurred. If a purine replaces a pyrimidine, or vice versa, a transversion has occurred.

Another type of change is the insertion or deletion of one or more nucleotides at any point within the gene. As illustrated in Figure 14.1, the loss or addition of a single nucleotide causes all of the subsequent three-letter codons to be changed. These are called frameshift mutations because the frame of triplet reading during translation is altered. A frameshift mutation will occur when any number of bases are added or deleted, except multiples of three, which would reestablish the initial frame of reading (see Figure 12.2). It is possible that one of the many altered triplets will be UAA, UAG, or UGA, the translation termination codons. When one of these triplets is encountered during translation, polypeptide synthesis is terminated at that point. Obviously, the results of frameshift mutations can be very severe, such as producing a truncated protein or defective enzymes, especially if they occur early in the coding sequence.

Classification Based on Effect on Function

As discussed earlier (see Chapter 4), a loss-of-function mutation is one that reduces or eliminates the function of the gene product. Mutations that result in complete loss of

Original sentence:
    THE CAT SAW THE DOG

Substitution (change of one letter; point mutation):
    THE BAT SAW THE DOG
    THE CAT SAW THE HOG
    THE CAT SAT THE DOG

Deletion (loss of one letter, C; frameshift mutation):
    THE ATS AWT HED OG

Insertion (gain of one letter, M; frameshift mutation):
    THE CMA TSA WTH EDO G

FIGURE 14.1 Analogy showing the effects of substitution, deletion, and insertion of one letter in a sentence composed of three-letter words to demonstrate point and frameshift mutations.
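The analogy in Figure 14.1, and the vocabulary used for base substitutions, can be mirrored in a few lines of code. The sketch below is illustrative only: the function names are hypothetical, codons are written as DNA coding-strand triplets (T in place of U), the standard genetic code is assumed, and the original codon is assumed to be a sense codon rather than a stop codon.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
PURINES = {"A", "G"}


def substitution_kind(old, new):
    """Transition: purine<->purine or pyrimidine<->pyrimidine; otherwise transversion."""
    return "transition" if (old in PURINES) == (new in PURINES) else "transversion"


def classify_point_mutation(codon, mutant):
    """Return (effect, kind) for two codons that differ at exactly one position."""
    diffs = [(a, b) for a, b in zip(codon, mutant) if a != b]
    if len(codon) != 3 or len(mutant) != 3 or len(diffs) != 1:
        raise ValueError("expected two codons differing at exactly one position")
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if before == "*":
        raise ValueError("original codon is a stop codon")
    if before == after:
        effect = "silent"
    elif after == "*":
        effect = "nonsense"
    else:
        effect = "missense"
    return effect, substitution_kind(*diffs[0])


def is_frameshift(n_bases_changed):
    """Insertions or deletions shift the frame unless their length is a multiple of 3."""
    return n_bases_changed % 3 != 0


def read_in_triplets(text):
    """Re-read a run of letters three at a time, as the ribosome reads codons."""
    letters = text.replace(" ", "")
    return " ".join(letters[i:i + 3] for i in range(0, len(letters), 3))


sentence = "THECATSAWTHEDOG"
print(read_in_triplets(sentence[:3] + sentence[4:]))        # loss of C
print(read_in_triplets(sentence[:4] + "M" + sentence[4:]))  # insertion of M
print(classify_point_mutation("GAG", "GTG"))                # Glu -> Val
```

Removing the C or inserting an M reproduces the two frameshifted sentences of the figure, while a one-letter substitution leaves every other "word" intact. The Glu-to-Val example (GAG to GTG) is a missense change produced by a transversion.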


function are known as null mutations. Any type of mutation, from a point mutation to deletion of the entire gene, may lead to a loss of function.

Most loss-of-function mutations are recessive. A recessive mutation results in a wild-type phenotype when present in a diploid organism and the other allele is wild type. In this case, the presence of less than 100 percent of the gene product is sufficient to bring about the wild-type phenotype.

Some loss-of-function mutations can be dominant. A dominant mutation results in a mutant phenotype in a diploid organism, even when the wild-type allele is also present. Dominant mutations in diploid organisms can have several different types of effects. A dominant negative mutation in one allele may encode a gene product that is inactive and directly interferes with the function of the product of the wild-type allele. For example, this can occur when the nonfunctional gene product binds to the wild-type gene product in a homodimer, inactivating or reducing the activity of the homodimer.

A dominant mutation can also result from haploinsufficiency, which occurs when one allele is inactivated by mutation, leaving the individual with only one functional copy of a gene. The active allele may be a wild-type copy of the gene but does not produce enough wild-type gene product to bring about a wild-type phenotype. In humans, Marfan syndrome is an example of a disorder caused by haploinsufficiency, in this case as a result of a loss-of-function mutation in one copy of the fibrillin-1 (FBN1) gene.

In contrast, a gain-of-function mutation codes for a gene product with enhanced, negative, or new functions. This may be due to a change in the amino acid sequence of the protein that confers a new activity, or it may result from a mutation in a regulatory region of the gene, leading to expression of the gene at higher levels or at abnormal times or places. Typically, gain-of-function mutations are dominant.

A suppressor mutation is a second mutation that either reverts or relieves the effects of a previous mutation. A suppressor mutation can occur within the same gene that suffered the first mutation (intragenic mutation) or elsewhere in the genome (intergenic mutation).

Depending on their type and location, mutations can have a wide range of phenotypic effects, from none to severe. Some examples of mutation types based on their phenotypic outcomes are listed in Table 14.1.

Classification Based on Location of Mutation

Mutations may be classified according to the cell type or chromosomal locations in which they occur. Somatic mutations are those occurring in any cell in the body except germ cells, whereas germ-line mutations occur only in germ cells. Autosomal mutations are mutations within genes located on the autosomes, whereas X-linked and Y-linked mutations are those within genes located on the X or Y chromosome, respectively.

Mutations arising in somatic cells are not transmitted to future generations. When a recessive autosomal mutation occurs in a somatic cell of an adult multicellular diploid organism, it is unlikely to result in a detectable phenotype. The expression of most such mutations is likely to be masked by expression of the wild-type allele within that cell and the presence of nonmutant cells in the remainder of the organism. Somatic mutations will have a greater impact if they are dominant or, in males, if they are X-linked, since such mutations are most likely to be immediately expressed. In addition, the impact of dominant or X-linked somatic mutations will be more noticeable if they occur early in development, when a small number of undifferentiated cells replicate to give rise to several differentiated tissues or organs.

Mutations in germ cells have the potential of being expressed in all cells of an offspring. Inherited dominant autosomal mutations will be expressed phenotypically in the first generation. X-linked recessive mutations arising in the gametes of a female (the homogametic sex; having

TABLE 14.1 Classifications of Mutations by Phenotypic Effects

Classification | Phenotype | Example
Visible | Visible morphological trait | Mendel’s pea characteristics
Nutritional | Altered nutritional characteristics | Loss of ability to synthesize an essential amino acid in bacteria
Biochemical | Changes in protein function | Defective hemoglobin leading to sickle-cell anemia in humans
Behavioral | Behavior pattern changes | Brain mutations affecting Drosophila mating behaviors
Regulatory | Altered gene expression | Regulatory gene mutations affecting expression of the lac operon in E. coli
Lethal | Altered organism survival | Tay-Sachs and Huntington disease in humans
Conditional | Phenotype expressed only under certain environmental conditions | Temperature-sensitive mutations affecting coat color in Siamese cats


two X chromosomes) may be expressed in male offspring, who are by definition hemizygous for the gene mutation because they have one X and one Y chromosome. This will occur provided that the male offspring receives the affected X chromosome. Because of heterozygosity, the occurrence of an autosomal recessive mutation in the gametes of either males or females (even one resulting in a lethal allele) may go unnoticed for many generations, until the resultant allele has become widespread in the population. Usually, the new allele will become evident only when a chance mating brings two copies of it together into the homozygous condition.

ESSENTIAL POINT
Mutations can have many different effects on gene function, depending on the type of nucleotide changes that comprise the mutation and their locations. Phenotypic effects can range from neutral or silent to loss of function, gain of function, or lethality. ■

NOW SOLVE THIS

14.1 If a point mutation occurs within a human egg cell genome that changes an A to a T, what is the most likely effect of this mutation on the phenotype of an offspring that develops from this mutated egg?

HINT: This problem asks you to predict the effects of a single base-pair mutation on phenotype. The key to its solution involves an understanding of the organization of the human genome as well as the effects of mutations on coding and noncoding regions of genes and the effects of mutations on development.

For more practice, see Problems 4–7.

14.2 Mutations Can Be Spontaneous or Induced

Mutations can be classified as either spontaneous or induced, although these two categories overlap to some degree. Spontaneous mutations are changes in the nucleotide sequence of genes that appear to occur naturally. No specific agents are associated with their occurrence. Many of these mutations arise as a result of normal biological or chemical processes in the organism that alter the structure of nitrogenous bases. Often, spontaneous mutations occur during the enzymatic process of DNA replication, as we discuss later in this chapter.

In contrast to spontaneous mutations, the mutations that result from the influence of extraneous factors are considered to be induced mutations. Induced mutations may be the result of either natural or artificial agents. For example, radiation from cosmic and mineral sources and ultraviolet radiation from the sun are energy sources to which most organisms are exposed and, as such, may be considered natural agents that cause induced mutations.

We will next describe several aspects of spontaneous mutations, including mutation rates.

Spontaneous Mutation Rates in Nonhuman Organisms

Several generalizations can be made regarding spontaneous mutation rates. The mutation rate is defined as the likelihood that a gene will undergo a mutation in a single generation or in forming a single gamete. First, the rate of spontaneous mutation is exceedingly low for all organisms. Second, the rate varies between different organisms. Third, even within the same species, the spontaneous mutation rate varies from gene to gene.

Viral and bacterial genes undergo spontaneous mutation at an average of about 1 in 100 million (10⁻⁸) replications or cell divisions. Maize and Drosophila demonstrate rates several orders of magnitude higher. The genes studied in these groups average between 1 in 1,000,000 (10⁻⁶) and 1 in 100,000 (10⁻⁵) mutations per gamete formed. Some mouse genes are another order of magnitude higher in their spontaneous mutation rate: 1 in 100,000 to 1 in 10,000 (10⁻⁵ to 10⁻⁴). It is not clear why such large variations occur in mutation rates.

The variation in rates between organisms may, in part, reflect the relative efficiencies of their DNA proofreading and repair systems. We will discuss these systems later in the chapter. Variation between genes in a given organism may be due to inherent differences in mutability in different regions of the genome. Some DNA sequences appear to be highly susceptible to mutation and are known as mutation hot spots.

Spontaneous Mutation Rates in Humans

Now that whole-genome sequencing is becoming both rapid and economical, it is possible to examine entire genomes, both coding and noncoding regions, and to compare genomes from parents and offspring and estimate spontaneous germ-line mutation rates.

In 2012, a research group in Iceland sequenced the genomes of 78 parent/offspring sets, comprising 219 individuals, and compared the single-nucleotide polymorphisms (SNPs) (see Chapter 7) throughout their genomes.¹ Their data revealed that a newborn baby’s genome contains an average of 60 new mutations, compared with those of his or her parents. Their research also revealed that the number of new mutations depends significantly on the age of the father at the time of conception.

¹ Kong, A., et al. (2012). Rate of de novo mutations and the importance of father’s age to disease risk. Nature 488:471–475.


For example, when the father is 20 years old, he contributes approximately 25 new mutations to the child. When he is 40 years old, he contributes approximately 65 new mutations. In contrast, the mother contributes about 15 new mutations, at any age. The researchers estimated that the father contributes approximately 2 mutations per year of his age, with the mutation rate doubling every 16.5 years. The large proportion of mutations contributed by fathers is likely due to the fact that male germ cells go through more cell divisions during a lifetime than do female germ cells.

Of the 4933 new SNP mutations that were identified in this study, only 73 occurred within gene exons. Other studies have suggested that about 10 percent of single-nucleotide mutations lead to negative phenotypic changes. If so, then an average spontaneous mutation rate of 60 new mutations might yield about six deleterious phenotypic effects per generation.

It is estimated that somatic cell mutation rates are between 4 and 25 times higher than those in germ-line cells. It is well accepted that somatic mutations are responsible for the development of most cancers. We will discuss the effects of somatic mutations on the development of cancer in more detail later (see Chapter 19).

ESSENTIAL POINT
Spontaneous mutations can occur naturally without the action of extraneous agents. Induced mutations occur as a result of extraneous agents, which can be either natural or human-made. Spontaneous mutation rates vary between organisms and between different regions of the genome. ■

14.3 Spontaneous Mutations Arise from Replication Errors and Base Modifications

In this section, we will outline some of the processes that lead to spontaneous mutations. Many of the DNA changes that occur during spontaneous mutagenesis also occur, at a higher rate, during induced mutagenesis.

DNA Replication Errors and Slippage

As we learned earlier (see Chapter 10), the process of DNA replication is imperfect. Occasionally, DNA polymerases insert incorrect nucleotides during replication of a strand of DNA. Although DNA polymerases can correct most of these replication errors using their inherent 3′ to 5′ exonuclease proofreading capacity, misincorporated nucleotides may persist after replication. If these errors are not detected and corrected by DNA repair mechanisms, they may lead to mutations. Replication errors due to mispairing predominantly lead to point mutations. The fact that bases can take several forms, known as tautomers, increases the chance of mispairing during DNA replication, as we will explain shortly.

In addition to mispairing and point mutations, DNA replication can lead to the introduction of small insertions or deletions. These mutations can occur when one strand of the DNA template loops out and becomes displaced during replication, or when DNA polymerase slips or stutters during replication, events termed replication slippage. If a loop occurs in the template strand during replication, DNA polymerase may miss the looped-out nucleotides, and a small deletion in the new strand will be introduced. If DNA polymerase repeatedly introduces nucleotides that are not present in the template strand, an insertion of one or more nucleotides will occur. Insertions and deletions may lead to frameshift mutations or amino acid insertions or deletions in the gene product.

Replication slippage can occur anywhere in the DNA but seems more common in regions containing tandemly repeated sequences. Repeat sequences are hot spots for DNA mutation and in some cases contribute to hereditary diseases, such as fragile-X syndrome and Huntington disease. The hypermutability of repeat sequences in noncoding regions of the genome is the basis for several current methods of forensic DNA analysis.

Tautomeric Shifts

Purines and pyrimidines can exist in tautomeric forms, that is, in alternate chemical forms that differ by the shift of a single proton in the molecule. The biologically important tautomers are the keto–enol forms of thymine and guanine and the amino–imino forms of cytosine and adenine. Tautomeric shifts change the covalent structure of the molecule, allowing hydrogen bonding with noncomplementary bases, and hence, may lead to permanent base-pair changes and mutations. Figure 14.2 compares normal base-pairing arrangements with rare unorthodox pairings. Anomalous T-G and C-A pairs, among others, may be formed.

FIGURE 14.2 Examples of standard base-pairing arrangements (a) compared with examples of the anomalous base pairing that occurs as a result of tautomeric shifts (b). The long triangles indicate the point at which each base bonds to a backbone sugar. Panel (a) shows thymine (keto) paired with adenine (amino) and cytosine (amino) paired with guanine (keto). Panel (b) shows thymine (enol) paired with guanine (keto) and cytosine (imino) paired with adenine (amino).

A mutation occurs during DNA replication when a transiently formed tautomer in the template strand pairs with a noncomplementary base. In the next round of replication, the “mismatched” members of the base pair are separated, and each becomes the template for its normal complementary base. The end result is a point mutation (Figure 14.3).

FIGURE 14.3 Formation of an A-T to G-C transition mutation as a result of a transient tautomeric shift in adenine. With no tautomeric shift, replication produces no mutation. With a tautomeric shift of adenine to the imino form, an anomalous C-A base pair is formed in replication round 1. After the shift back to the amino form, replication round 2 yields one duplex with a G-C transition mutation and one with no mutation.

Depurination and Deamination

Some of the most common causes of spontaneous mutations are two forms of DNA base damage: depurination and deamination. Depurination is the loss of one of the nitrogenous bases in an intact double-helical DNA molecule. Most frequently, the base is either guanine or adenine (in other words, a purine). These bases may be lost if the glycosidic bond linking the 1′-C of the deoxyribose and the number 9 position of the purine ring is broken, leaving an apurinic site on one strand of the DNA. Geneticists estimate that thousands of such spontaneous lesions are formed daily in the DNA of mammalian cells in culture. If apurinic sites are not repaired, there will be no base at that position to act as a template during DNA replication. As a result, DNA polymerase may introduce a nucleotide at random at that site.

In deamination, an amino group in cytosine or adenine is converted to a keto group. In these cases, cytosine is converted to uracil, and adenine is changed to the guanine-resembling compound hypoxanthine (Figure 14.4). The major effect of these changes is an alteration in the base-pairing specificities of these two bases during DNA replication. For example, cytosine normally pairs with guanine. Following its conversion to uracil, which pairs with adenine, the original G-C pair is converted to an A-U pair and then, in the next replication, is converted to an A-T pair. When adenine is deaminated, the original A-T pair is ultimately converted to a G-C pair because hypoxanthine pairs naturally with cytosine, which then pairs with guanine in the next replication.

Oxidative Damage

DNA may also suffer damage from the by-products of normal cellular processes. These by-products include reactive oxygen species (electrophilic oxidants) that are generated during normal aerobic respiration. For example,


superoxides (O₂⁻), hydroxyl radicals (·OH), and hydrogen peroxide (H₂O₂) are created during cellular metabolism and are constant threats to the integrity of DNA. Such reactive oxidants, also generated by exposure to high-energy radiation, can produce more than 100 different types of chemical modifications in DNA, including modifications to bases, loss of bases, and single-stranded breaks.

FIGURE 14.4 Deamination of cytosine and adenine, leading to new base pairing and mutation. Cytosine is converted to uracil, which base-pairs with adenine. Adenine is converted to hypoxanthine, which base-pairs with cytosine.

ESSENTIAL POINT
Spontaneous mutations result from many different causes including errors during DNA replication and changes in DNA base pairing related to tautomeric shifts, depurinations, deaminations, and reactive oxidant damage. ■

NOW SOLVE THIS
14.2 One of the most famous cases of an X-linked recessive mutation in humans is that of hemophilia found in the descendants of Britain’s Queen Victoria. The pedigree of the royal family indicates that Victoria was heterozygous for the trait; however, her father was not affected, and no other member of her maternal line appeared to carry the mutation. What are some possible explanations of how the mutation arose? What types of mutations could lead to the disease?
HINT: This problem asks you to determine the sources of new mutations. The key to its solution is to consider the ways in which mutations occur, the types of cells in which they can occur, and how they are inherited.

14.4 Induced Mutations Arise from DNA Damage Caused by Chemicals and Radiation

All cells on Earth are exposed to a plethora of agents called mutagens, which have the potential to damage DNA and cause induced mutations. Some of these agents, such as some fungal toxins, cosmic rays, and UV light, are natural components of our environment. Others, including some industrial pollutants, medical X rays, and chemicals within tobacco smoke, can be considered as unnatural or human-made additions to our modern world. On the positive side, geneticists have harnessed some mutagens for use in analyzing genes and gene functions. The mechanisms by which some of these natural and unnatural agents lead to mutations are outlined in this section.

Base Analogs

One category of mutagenic chemicals is base analogs, compounds that can substitute for purines or pyrimidines during nucleic acid biosynthesis. For example, the synthetic chemical 5-bromouracil (5-BU), a derivative of uracil, behaves as a thymine analog but with a bromine atom substituted at the number 5 position of the pyrimidine ring. If 5-BU is chemically linked to deoxyribose, the nucleoside analog bromodeoxyuridine (BrdU) is formed. Figure 14.5 compares the structure of 5-BU with that of thymine. The presence of the bromine atom in place of the methyl group increases the probability that a tautomeric shift will occur. If BrdU is incorporated into DNA in place of thymidine and a tautomeric shift to the enol form of 5-BU occurs, 5-BU base-pairs with guanine. After one round of replication, an A-T to G-C transition results. Furthermore, the presence of 5-BU within DNA increases the sensitivity of the molecule to UV light, which itself is mutagenic.

Alkylating, Intercalating, and Adduct-Forming Agents

A number of naturally occurring and human-made chemicals alter the structure of DNA and cause mutations. The sulfur-containing mustard gases, used during World War I, were some of the first chemical mutagens identified in chemical warfare studies. Mustard gases are alkylating agents; that is, they donate an alkyl group, such as CH₃ or CH₂CH₃, to amino or keto groups in nucleotides. Ethylmethane sulfonate (EMS), for example, alkylates the keto groups in the number 6 position of guanine and in the number 4 position of thymine. As with base analogs, base-pairing affinities are altered, and transition mutations result.


For example, 6-ethylguanine acts as an analog of adenine and pairs with thymine (Figure 14.6).

FIGURE 14.5 Similarity of the chemical structure of 5-bromouracil (5-BU) and thymine. In the common keto form, 5-BU base-pairs normally with adenine, behaving as a thymine analog. In the rare enol form, it pairs anomalously with guanine.

FIGURE 14.6 Conversion of guanine to 6-ethylguanine by the alkylating agent ethylmethane sulfonate (EMS). The 6-ethylguanine base-pairs with thymine.

Intercalating agents are chemicals that have dimensions and shapes that allow them to wedge between the base pairs of DNA. Wedged intercalating agents cause base pairs to distort and DNA strands to unwind. These changes in DNA structure affect many functions including transcription, replication, and repair. Deletions and insertions occur during DNA replication and repair, leading to frameshift mutations.

Another group of chemicals that cause mutations are known as adduct-forming agents. A DNA adduct is a substance that covalently binds to DNA, altering its conformation and interfering with replication and repair. Two examples of adduct-forming substances are acetaldehyde (a component of cigarette smoke) and heterocyclic amines (HCAs). HCAs are cancer-causing chemicals that are created during the cooking of meats such as beef, chicken, and fish. HCAs are formed at high temperatures from amino acids and creatine. Many HCAs covalently bind to guanine bases. At least 17 different HCAs have been linked to the development of cancers, such as those of the stomach, colon, and breast.

Ultraviolet Light

All electromagnetic radiation consists of energetic waves that we define by their different wavelengths (Figure 14.7). The full range of wavelengths is referred to as the electromagnetic spectrum, and the energy of any radiation in the spectrum varies inversely with its wavelength. Waves in the range of visible light and longer are benign when they interact with most organic molecules. However, waves of shorter length than visible light, being inherently more energetic, have the potential to disrupt organic molecules.

Purines and pyrimidines absorb ultraviolet (UV) radiation most intensely at a wavelength of about 260 nanometers (nm). Although Earth’s ozone layer absorbs the most dangerous types of UV radiation, sufficient UV radiation can induce thousands of DNA lesions per hour in any cell exposed to this radiation. One major effect of UV radiation on DNA is the creation of pyrimidine dimers (chemical species consisting of two identical pyrimidines), particularly ones consisting of two thymidine residues (Figure 14.8). The dimers distort the DNA conformation and inhibit normal replication. As a result, errors can be introduced in the base sequence of DNA during replication through the actions of error-prone DNA polymerases. When UV-induced dimerization is extensive, it is responsible (at least in part) for the killing effects of UV radiation on cells.

Ionizing Radiation

As noted above, the energy of radiation varies inversely with wavelength. Therefore, X rays, gamma rays, and cosmic rays are more energetic than UV radiation (Figure 14.7).


FIGURE 14.7 The regions of the electromagnetic spectrum and their associated wavelengths. In order of decreasing wavelength and increasing energy, the regions are radio waves, microwaves, infrared, visible light (750 nm to 380 nm), UV, X rays, gamma rays, and cosmic rays. The wavelength scale is marked at 10³ m, 10⁹ nm (1 m), 10⁶ nm, 10³ nm, 1 nm, 10⁻³ nm, and 10⁻⁵ nm.

As a result, they penetrate deeply into tissues, causing ionization of the molecules encountered along the way. Hence, this type of radiation is called ionizing radiation.

As ionizing radiation penetrates cells, stable molecules and atoms are transformed into free radicals, chemical species containing one or more unpaired electrons. Free radicals can directly or indirectly affect the genetic material, altering purines and pyrimidines in DNA, breaking phosphodiester bonds, disrupting the integrity of chromosomes, and producing a variety of chromosomal aberrations, such as deletions, translocations, and chromosomal fragmentation.

Although it is often assumed that radiation from artificial sources such as nuclear power plant waste and medical X rays is the most significant source of radiation exposure for humans, scientific data indicate otherwise. Scientists estimate that less than 20 percent of human radiation exposure arises from human-made sources. The greatest radiation exposure comes from radon gas, cosmic rays, and natural soil radioactivity. More than half of human-made radiation exposure comes from medical X rays and radioactive pharmaceuticals.
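The inverse relation between wavelength and energy that underlies the UV and ionizing radiation discussions can be made concrete with the standard photon-energy relation E = hc/λ. The sketch below is illustrative only: the function name and the 550 nm and 1 nm sample wavelengths are assumptions chosen for the example, while 260 nm is the UV absorption peak quoted in the text.

```python
# Photon energy from wavelength, E = h * c / wavelength.
PLANCK_J_S = 6.62607015e-34      # Planck constant in J*s (exact SI value)
LIGHT_M_PER_S = 299_792_458      # speed of light in m/s (exact SI value)
JOULES_PER_EV = 1.602176634e-19  # joules per electronvolt (exact SI value)


def photon_energy_ev(wavelength_nm: float) -> float:
    """Return the energy, in electronvolts, of one photon of this wavelength."""
    wavelength_m = wavelength_nm * 1e-9
    return PLANCK_J_S * LIGHT_M_PER_S / wavelength_m / JOULES_PER_EV


# 550 nm lies inside the visible range; 260 nm is the UV absorption peak of
# purines and pyrimidines; 1 nm is a hypothetical X-ray wavelength.
for label, nm in [("visible, 550 nm", 550), ("UV, 260 nm", 260), ("X ray, 1 nm", 1)]:
    print(f"{label}: {photon_energy_ev(nm):.2f} eV")
```

Halving the wavelength doubles the energy per photon, which is why radiation of shorter wavelength than visible light is the kind that can disrupt organic molecules.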

ESSENTIAL POINT
Mutations can be induced by many types of chemicals and radiations. These agents can damage both DNA bases and the sugar-phosphate backbones of DNA molecules. ■

NOW SOLVE THIS
14.3 The cancer drug melphalan is an alkylating agent of the mustard gas family. It acts in two ways: by causing alkylation of guanine bases and by cross linking DNA strands together. Describe two ways in which melphalan might kill cancer cells. What are two ways in which cancer cells could repair the DNA-damaging effects of melphalan?
HINT: This problem asks you to consider the effect of the alkylation of guanine on base pairing during DNA replication. The key to its solution is to consider the effects of mutations on cellular processes that allow cells to grow and divide. In Section 14.6, you will learn about the ways in which cells repair the types of mutations introduced by alkylating agents.

FIGURE 14.8 Depiction of a thymine dimer induced by UV radiation. The covalent crosslinks (shown in red) occur between carbon atoms of the pyrimidine rings. The drawing shows a single strand of DNA in which two adjacent thymine bases, each attached to a backbone sugar, are joined as a thymine dimer.


14.5 Single-Gene Mutations Cause a Wide Range of Human Diseases

Although most human genetic diseases are polygenic (that is, caused by variations in several genes), even a single base-pair change in one of the approximately 20,000 human genes can lead to a serious inherited disorder. These monogenic diseases can be caused by many different types of single-gene mutations. Table 14.2 lists some examples of the types of single-gene mutations that can lead to serious genetic diseases. A comprehensive database of human genes, mutations, and disorders is available in the Online Mendelian Inheritance in Man (OMIM) database (described in the Exploring Genomics feature in Chapter 3). As of 2018, the OMIM database has cataloged approximately 5000 human phenotypes for which the molecular basis is known.

Geneticists estimate that approximately 30 percent of mutations that cause human diseases are single base-pair changes that create nonsense mutations. These mutations not only code for a prematurely terminated protein product, but also trigger rapid decay of the mRNA. Many more mutations are missense mutations that alter the amino acid sequence of a protein and frameshift mutations that alter the protein sequence and create internal nonsense codons. Other common disease-associated mutations affect the sequences of gene promoters, mRNA splicing signals, and other noncoding sequences that affect transcription, processing, and stability of mRNA or protein. One recent study showed that about 15 percent of all point mutations that cause human genetic diseases result in abnormal mRNA splicing. Approximately 85 percent of these splicing mutations alter the sequence of 5′ and 3′ splice signals. The remainder create new splice sites within the gene. Splicing defects often result in degradation of the abnormal mRNA or creation of abnormal protein products.

Single-Gene Mutations and β-Thalassemia

Although some single-gene diseases, such as sickle-cell anemia (discussed in Chapter 13), are caused by one specific base-pair change within a gene, most are caused by any of a large number of different mutations. The mutation profile associated with β-thalassemia provides an example of the latter, more common, type of monogenic disease.

β-thalassemia is an inherited autosomal recessive blood disorder resulting from a reduction or absence of hemoglobin. It is the most common single-gene disease in the world, affecting people worldwide, but especially populations in Mediterranean, North African, Middle Eastern, Central Asian, and Southeast Asian countries.

People with β-thalassemia have varying degrees of anemia, from severe to mild, with symptoms including weakness, delayed development, jaundice, enlarged organs, and often a need for frequent blood transfusions.

Mutations in the β-globin gene (HBB gene) cause β-thalassemia. The HBB gene encodes the 146-amino-acid β-globin polypeptide. Two β-globin polypeptides associate with two α-globin polypeptides to form the adult hemoglobin tetramer. The HBB gene spans 1.6 kilobases of DNA on the short arm of chromosome 11. It is made up of three exons and two introns.

Scientists have discovered approximately 400 different mutations in the HBB gene that cause β-thalassemia, although most cases worldwide are associated with only about 20 of these mutations. Table 14.3 provides a summary of the types of single-gene mutations that cause β-thalassemia.

TABLE 14.2 Examples of Human Disorders Caused by Single-Gene Mutations

Type of Mutation | Disorder | Molecular Change
Missense | Achondroplasia | Glycine to arginine at position 380 of FGFR3 gene
Nonsense | Marfan syndrome | Tyrosine to STOP codon at position 2113 of fibrillin-1 gene
Insertion | Familial hypercholesterolemia | Various short insertions throughout the LDLR gene
Deletion | Cystic fibrosis | Three-base-pair deletion of phenylalanine codon at position 508 of CFTR gene
Trinucleotide repeat expansions | Huntington disease | >40 repeats of (CAG) sequence in coding region of Huntingtin gene
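The missense and nonsense categories in Table 14.2 can be illustrated by translating a codon before and after a single-base change. This is a minimal sketch using the standard genetic code; the function name and the example codons are hypothetical choices made for illustration and are not taken from the table.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(original: str, mutant: str) -> str:
    """Classify a change between two DNA coding-strand codons."""
    before, after = CODON_TABLE[original], CODON_TABLE[mutant]
    if before == after:
        return "silent"      # same amino acid (or stop) is encoded
    if after == "*":
        return "nonsense"    # an amino acid codon becomes a stop codon
    if before == "*":
        return "stop-loss"   # a stop codon becomes an amino acid codon
    return "missense"        # one amino acid is replaced by another


# Hypothetical example codons chosen only to show each outcome.
print(classify_substitution("GGG", "AGG"))  # glycine to arginine
print(classify_substitution("TAT", "TAA"))  # tyrosine to stop
print(classify_substitution("GAA", "GAG"))  # glutamate to glutamate
```

Insertions, deletions, and repeat expansions change the length of the sequence, so they fall outside this codon-for-codon comparison.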


TABLE 14.3 Types of Mutations in the HBB Gene That Cause β-Thalassemia

Gene Region Affected | Number of Mutations Known | Description
5′ upstream region | 22 | Single base-pair mutations occur between -101 and -25 upstream from transcription start site. For example, a T → A transversion in the TATA sequence at -30 results in decreased gene transcription and severe disease.
mRNA CAP site | 1 | Single base-pair mutation (A → C transversion) at +1 position leads to decreased levels of mRNA.
5′ untranslated region | 3 | Single base-pair mutations at +20, +22, and +33 cause decreases in transcription and translation and mild disease.
ATG translation initiation codon | 7 | Single base-pair mutations alter the mRNA AUG sequence, resulting in no translation and severe disease.
Exons 1, 2, and 3 coding regions | 36 | Single base-pair missense and nonsense mutations, and mutations that create abnormal mRNA splice sites. Disease severity varies from mild to extreme.
Introns 1 and 2 | 38 | Single base-pair transitions and transversions that reduce or abolish mRNA splicing and create abnormal splice sites that affect mRNA stability. Most cause severe disease.
Polyadenylation site | 6 | Single base-pair changes in the AATAAA sequence reduce the efficiency of mRNA cleavage and polyadenylation, yielding long mRNAs or unstable mRNAs. Disease is mild.
Throughout and surrounding the HBB gene | >100 | Short insertions, deletions, and duplications that alter coding sequences, create frameshift stop codons, and alter mRNA splicing.
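Table 14.3 distinguishes transitions from transversions among the single base-pair changes. As a hedged sketch of that distinction, the hypothetical helper below treats a substitution as a transition when both bases belong to the same chemical class (purine to purine or pyrimidine to pyrimidine) and as a transversion otherwise.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single DNA base change."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError("expected two different bases drawn from A, C, G, T")
    # Same chemical class on both sides means a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(substitution_type("A", "G"))  # purine to purine
print(substitution_type("A", "C"))  # purine to pyrimidine, as at the mRNA CAP site
```

Of the twelve possible single-base substitutions, four are transitions and eight are transversions.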

14.6 Organisms Use DNA Repair Systems to Counteract Mutations

Living systems have evolved a variety of elaborate repair systems that counteract both spontaneous and induced DNA damage. These DNA repair systems are absolutely essential to the maintenance of the genetic integrity of organisms and, as such, to the survival of organisms on Earth. The balance between mutation and repair results in the observed mutation rates of individual genes and organisms. Of foremost interest in humans is the ability of these systems to counteract genetic damage that would otherwise result in genetic diseases and cancer. The link between defective DNA repair and cancer susceptibility is described later (see Chapter 19).

We now embark on a review of these and other DNA repair mechanisms, with the emphasis on the major approaches that organisms use to counteract genetic damage.

Proofreading and Mismatch Repair

Some of the most common types of mutations arise during DNA replication when an incorrect nucleotide is inserted by DNA polymerase. The major DNA synthesizing enzyme in bacteria (DNA polymerase III) makes an error approximately once every 100,000 insertions, leading to an error rate of 10⁻⁵. Fortunately, DNA polymerase proofreads each step, catching 99 percent of those errors. If an incorrect nucleotide is inserted during polymerization, the enzyme can recognize the error and “reverse” its direction. It then behaves as a 3′ to 5′ exonuclease, cutting out the incorrect nucleotide and replacing it with the correct one. This improves the efficiency of replication 100-fold, creating only 1 mismatch in every 10⁷ insertions, for a final error rate of 10⁻⁷.

To cope with errors such as base–base mismatches, small insertions, and deletions that remain after proofreading, another mechanism, called mismatch repair (MMR), may be activated. During MMR, the mismatches are detected, the incorrect nucleotide is removed, and the correct nucleotide is inserted in its place.
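The proofreading arithmetic quoted for DNA polymerase III can be checked directly. The short sketch below uses exact fractions and only the two figures given in the text (one error per 100,000 insertions, 99 percent of errors caught); the function name is a hypothetical helper introduced for the example.

```python
from fractions import Fraction


def residual_error_rate(raw_rate: Fraction, fraction_caught: Fraction) -> Fraction:
    """Error rate left after proofreading removes a given fraction of errors."""
    return raw_rate * (1 - fraction_caught)


raw_rate = Fraction(1, 100_000)      # one error per 100,000 insertions (10^-5)
fraction_caught = Fraction(99, 100)  # proofreading catches 99 percent of errors

final_rate = residual_error_rate(raw_rate, fraction_caught)
print(final_rate)             # 1/10000000, that is 10^-7
print(raw_rate / final_rate)  # 100, the fold improvement
```

Catching 99 percent of errors leaves 1 percent of them, so the error rate falls by a factor of 100, from 10⁻⁵ to 10⁻⁷.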


Following replication, the repair enzymes mentioned below are able to recognize any mismatch that is introduced on the newly synthesized DNA strand and bind to the strand. An endonuclease enzyme creates a nick in the backbone of the newly synthesized DNA strand, either 5′ or 3′ to the mismatch. An exonuclease unwinds and degrades the nicked DNA strand, until the region of the mismatch is reached. Finally, DNA polymerase fills in the gap created by the exonuclease, using the correct DNA strand as a template. DNA ligase then seals the gap.

A series of E. coli gene products, MutH, MutL, and MutS, as well as exonucleases, DNA polymerase III, and DNA ligase, are involved in MMR. Mutations in the mutH, mutL, and mutS genes result in bacterial strains deficient in MMR. In humans, mutations in genes that code for DNA MMR proteins (such as hMSH2 and hMLH1, which are the human equivalents of the mutS and mutL genes of E. coli) are associated with hereditary nonpolyposis colon cancer. MMR defects are commonly found in other cancers, such as leukemias, lymphomas, and tumors of the ovary, prostate, and endometrium. Cells from these cancers show genome-wide increases in the rate of spontaneous mutation. The link between defective MMR and cancer is supported by experiments with mice. Mice that are engineered to have deficiencies in MMR genes accumulate large numbers of mutations and are cancer-prone.

Postreplication Repair and the SOS Repair System

Another type of DNA repair system, called postreplication repair, responds after damaged DNA has escaped repair and has failed to be completely replicated. As illustrated in Figure 14.9, when DNA bearing a lesion of some sort (such as a pyrimidine dimer) is being replicated, DNA polymerase may stall at the lesion and then skip over it, leaving an unreplicated gap on the newly synthesized strand. To correct the gap, RecA protein directs a recombinational exchange with the corresponding region on the undamaged parental strand of the same polarity (the “donor” strand). When the undamaged segment of the donor strand DNA replaces the gapped segment, a gap is created on the donor strand. The gap can be filled by repair synthesis as replication proceeds. Because a recombinational event is involved in this type of DNA repair, it is considered to be a form of homologous recombination repair.

FIGURE 14.9 Postreplication repair occurs if DNA replication has skipped over a lesion such as a thymine dimer. Through the process of recombination, the correct complementary sequence is recruited from the parental strand and inserted into the gap opposite the lesion. The new gap is filled by DNA polymerase and DNA ligase. The panels show, in order: DNA unwound prior to replication, with the lesion (T T) and its complementary region (A A); replication skipping over the lesion and continuing; the undamaged complementary region of the parental strand being recombined, which forms a new gap; and the new gap being filled by DNA polymerase and DNA ligase.

Another postreplication repair pathway, the E. coli SOS repair system, also responds to damaged DNA, but in a different way. In the presence of a large number of unrepaired DNA mismatches and gaps, the bacteria can induce expression of about 20 genes (including lexA, recA, and uvr) whose products allow DNA replication to occur even in the presence of DNA lesions. This type of repair is a last resort to minimize DNA damage, hence its name. During SOS repair, DNA synthesis becomes error-prone, inserting random and possibly incorrect nucleotides in places that would normally stall DNA replication. As a result, SOS repair itself becomes mutagenic, although it may allow the cell to survive DNA damage that would otherwise kill it.

Photoreactivation Repair: Reversal of UV Damage

As was illustrated in Figure 14.8, UV light introduces mutations by the creation of pyrimidine dimers. UV-induced damage to E. coli DNA can be partially reversed if, following irradiation, the cells are exposed briefly to visible light, especially in the

M14_KLUG8414_10_SE_C14.indd 272 16/11/18 5:14 pm 14.6 Organisms Use DNA Repair Systems to Counteract Mutations 273

blue range of the visible spectrum. The process is dependent on the activity of a protein called photoreactivation enzyme (PRE) or photolyase. The enzyme’s mode of action is to cleave the cross-linking bonds between thymine dimers. Although the enzyme will associate with a thymine dimer in the dark, it must absorb a photon of blue light to cleave the dimer. The enzyme is also detectable in many organisms, including other bacteria, fungi, plants, and some vertebrates, though not in humans. Humans and other organisms that lack photoreactivation repair must rely on other repair mechanisms to reverse the effects of UV radiation.

Base and Nucleotide Excision Repair

A number of light-independent DNA repair systems exist in all bacteria and eukaryotes. The basic mechanisms involved in these types of repair, collectively referred to as excision repair or cut-and-paste mechanisms, consist of the following three steps.

1. The damage, distortion, or error present on one of the two strands of the DNA helix is recognized and enzymatically clipped out by an endonuclease. Excisions in the phosphodiester backbone usually include a number of nucleotides adjacent to the error as well, leaving a gap on one strand of the helix.

2. A DNA polymerase fills in the gap by inserting nucleotides complementary to those on the intact strand, which it uses as a replicative template. The enzyme adds these nucleotides to the free 3′-OH end of the clipped DNA. In E. coli, this step is usually performed by DNA polymerase I.

3. DNA ligase seals the final “nick” that remains at the 3′-OH end of the last nucleotide inserted, closing the gap.

There are two types of excision repair: base excision repair and nucleotide excision repair. Base excision repair (BER) corrects DNA that contains incorrect base pairings due to the presence of chemically modified bases or uridine nucleosides that are inappropriately incorporated into DNA or created by deamination of cytosine. The first step in the BER pathway involves the recognition of an inappropriately paired base by enzymes called DNA glycosylases. There are a number of DNA glycosylases, each of which recognizes a specific base. For example, the enzyme uracil DNA glycosylase recognizes the presence of uracil in DNA (Figure 14.10). DNA glycosylases first cut the glycosidic bond between the target base and its sugar, creating an apyrimidinic (or apurinic) site. The sugar with the missing base is then recognized by an enzyme called AP endonuclease. The AP endonuclease makes cuts in the phosphodiester backbone at the apyrimidinic or apurinic site. The gap is filled by DNA polymerase and DNA ligase.

FIGURE 14.10 Base excision repair (BER) accomplished by uracil DNA glycosylase, AP endonuclease, DNA polymerase, and DNA ligase. Uracil is recognized as a noncomplementary base, excised, and replaced with the complementary base (C). [Figure steps: duplex with U–G mismatch (5′-ACUAGT-3′ over 3′-TGGTCA-5′); uracil DNA glycosylase recognizes and excises incorrect base; AP endonuclease recognizes lesion and nicks DNA strand; DNA polymerase and DNA ligase fill gap; mismatch repaired (5′-ACCAGT-3′).]

Although much has been learned about the mechanisms of BER in E. coli, BER systems have also been detected in eukaryotes from yeast to humans. Experimental evidence shows that both mouse and human cells that are defective in BER activity are hypersensitive to the killing effects of gamma rays and oxidizing agents.

Nucleotide excision repair (NER) pathways repair “bulky” lesions in DNA that alter or distort the double helix. These lesions include the UV-induced pyrimidine dimers and DNA adducts discussed previously.

The NER pathway (Figure 14.11) was first discovered in 1964 by Paul Howard-Flanders and coworkers, who isolated several independent E. coli mutants that are sensitive to UV radiation. One group of genes was designated uvr (ultraviolet repair) and included the uvrA, uvrB, and uvrC mutations. In the NER pathway, the uvr gene products are involved in recognizing and clipping out lesions in the DNA. Usually, a specific number of nucleotides are clipped out around both sides of the lesion. In E. coli, usually a total of 13 nucleotides are removed, including the lesion. The repair is then completed by DNA polymerase I and DNA ligase, in a manner similar to that occurring in BER. The undamaged strand opposite the lesion is used as a template for the replication, resulting in repair.
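The base excision repair steps described above can be sketched as a toy model. This is only an illustration under simplifying assumptions: strands are plain strings, the function name `base_excision_repair` is hypothetical, and the enzymatic steps are reduced to comments. The input duplex is the one shown in Figure 14.10.

```python
# Toy model of base excision repair (BER) on the duplex from Figure 14.10.
# Strings stand in for DNA strands; the function name is illustrative only.

PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def base_excision_repair(damaged, template):
    """Replace each uracil in `damaged` using `template` as the guide.

    `damaged` is written 5'->3' and `template` 3'->5', so position i of one
    strand pairs with position i of the other.
    """
    strand = list(damaged)
    for i, base in enumerate(strand):
        if base != "U":
            continue
        # Step 1: uracil DNA glycosylase removes the base, leaving an AP site.
        strand[i] = None
        # Step 2: AP endonuclease nicks the backbone at the AP site
        # (the sequence itself is unchanged in this toy model).
        # Step 3: DNA polymerase inserts the base complementary to the
        # template strand, and DNA ligase seals the nick.
        strand[i] = PAIR[template[i]]
    return "".join(strand)


repaired = base_excision_repair("ACUAGT", "TGGTCA")
print(repaired)  # ACCAGT
```

The uracil opposite G is replaced by C, matching the repaired duplex in the figure; a strand with no uracil is returned unchanged.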

M14_KLUG8414_10_SE_C14.indd 273 16/11/18 5:14 pm 274 14 Gene Mutation, DNA Repair, and Transposition

FIGURE 14.11 Nucleotide excision repair (NER) of a UV-induced thymine dimer. During repair, 13 nucleotides are excised in bacteria, and 28 nucleotides are excised in eukaryotes. [Figure steps: DNA is damaged (lesion); nuclease (uvr gene products) excises lesion; gap is filled (DNA polymerase I); gap is sealed and normal pairing is restored (DNA ligase).]

Nucleotide Excision Repair and Xeroderma Pigmentosum in Humans

The mechanism of NER in eukaryotes is much more complicated than that in bacteria and involves many more proteins, encoded by about 30 genes. Much of what is known about the system in humans has come from detailed studies of individuals with xeroderma pigmentosum (XP), a rare recessive genetic disorder that predisposes individuals to severe skin abnormalities, skin cancers, and a wide range of other symptoms including developmental and neurological defects. Patients with XP are extremely sensitive to UV radiation in sunlight. In addition, they have a 2000-fold higher rate of cancer, particularly skin cancer, than the general population. The condition is severe and may be lethal, although early detection and protection from sunlight can arrest it (Figure 14.12).

FIGURE 14.12 Two individuals with xeroderma pigmentosum. These XP patients show characteristic XP skin lesions induced by sunlight, as well as mottled redness (erythema) and irregular pigment changes to the skin, in response to cellular injury.

The repair of UV-induced lesions in XP has been investigated in vitro, using human fibroblast cell cultures derived from normal individuals and those with XP. (Fibroblasts are undifferentiated connective tissue cells.) The involvement of multiple genes in NER and XP has been investigated using somatic cell hybridization. Fibroblast cells from any two unrelated XP patients, when grown together in tissue culture, can fuse together, forming heterokaryons. A heterokaryon is a single cell with two nuclei from different organisms but a common cytoplasm. If the mutation in each of the two XP cells occurs in the same gene, the heterokaryon, like the cells that fused to form it, will still be unable to undergo NER. This is because there is no normal copy of the relevant gene present in the heterokaryon. However, if NER does occur in the heterokaryon, the mutations in the two XP cells must have been present in two different genes. Hence, the two mutants are said to demonstrate complementation, a concept discussed earlier (see Chapter 4). Complementation occurs because the heterokaryon has at least one normal copy of each gene in the fused cell. By fusing XP cells from a large number of XP patients, researchers were able to determine how many genes contribute to the XP phenotype. Based on these and other studies, XP patients were divided into seven complementation groups, indicating that at least seven different genes code for proteins that are involved in nucleotide excision repair in humans. A gene representing each of these complementation groups, XPA to XPG (Xeroderma Pigmentosum gene A to G), has now been identified, and a homologous gene for each has been identified in yeast.

Approximately 20 percent of XP patients do not fall into any of the seven complementation groups. Cells from most


of these patients have mutations in the gene coding for DNA polymerase eta (Pol η), which is a lower-fidelity DNA polymerase that allows DNA replication to proceed past damaged DNA. Approximately another 6 percent of XP patients do not have mutations in either the seven complementation group genes or the DNA polymerase eta (POLH) gene, suggesting that other genes or mutations outside of coding regions may be involved in XP.

Double-Strand Break Repair in Eukaryotes

Thus far, we have discussed repair pathways that deal with damage or errors within one strand of DNA. We conclude our discussion of DNA repair by considering what happens in eukaryotic cells when both strands of the DNA helix are cleaved, for example as a result of exposure to ionizing radiation. These types of damage are extremely dangerous to cells, leading to chromosome rearrangements, cancer, or cell death.

Specialized forms of DNA repair, the DNA double-strand break (DSB) repair pathways, are activated and are responsible for reattaching two broken DNA strands. Defects in these pathways are associated with X-ray hypersensitivity and immune deficiencies, as well as familial dispositions to breast and ovarian cancer. Several human disease syndromes, such as Fanconi anemia and ataxia telangiectasia, result from defects in DSB repair.

One pathway involved in double-strand break repair is homologous recombination repair. The first step in this process involves the activity of an enzyme that recognizes the double-strand break and then digests back the 5′-ends of the broken DNA helix, leaving overhanging 3′-ends (Figure 14.13). One overhanging end searches for a region of sequence complementarity on the sister chromatid and then invades the homologous DNA duplex, aligning the complementary sequences. Once aligned, DNA synthesis proceeds from the 3′ overhanging ends, using the undamaged homologous DNA strands as templates. The interaction of two sister chromatids is necessary because, when both strands of one helix are broken, there is no undamaged parental DNA strand available to use as a template DNA sequence during repair. After DNA repair synthesis, the resulting heteroduplex molecule is resolved and the two chromatids separate.

FIGURE 14.13 Steps in homologous recombination repair of double-stranded breaks. [Figure steps: double-stranded break; break detected and 5′-ends digested; 3′-end invades homologous region of sister chromatid; DNA synthesis across damaged region; heteroduplex resolved and gaps filled by DNA synthesis and ligation.]

DSB repair usually occurs during the late S or early G2 phase of the cell cycle, after DNA replication, a time when sister chromatids are available to be used as repair templates. Because an undamaged template is used during repair synthesis, homologous recombination repair is an accurate process.

A second pathway, called nonhomologous end joining, also repairs double-strand breaks. However, as the name implies, the mechanism does not recruit a homologous region of DNA during repair. This system is activated in G1, prior to DNA replication. End joining involves a complex of many proteins and may include the DNA-dependent protein kinase and the breast cancer susceptibility gene product,


BRCA1. These and other proteins bind to the free ends of the broken DNA, trim the ends, and ligate them back together. Because some nucleotide sequences are lost in the process of end joining, it is an error-prone repair system. In addition, if more than one chromosome suffers a double-strand break, the wrong ends could be joined together, leading to abnormal chromosome structures, such as those discussed earlier (see Chapter 6).

ESSENTIAL POINT
Organisms counteract mutations by using a range of DNA repair mechanisms. Errors in DNA synthesis can be repaired by proofreading, mismatch repair, and postreplication repair. DNA damage can be repaired by photoreactivation repair, SOS repair, base excision repair, nucleotide excision repair, and double-strand break repair. ■

NOW SOLVE THIS

14.4 Geneticists often use the alkylating agent ethylmethane sulfonate (EMS; see Figure 14.6) to induce mutations in Drosophila. Why is EMS a mutagen of choice for genetic research? What would be the effects of EMS in a strain of Drosophila lacking functional mismatch repair systems?

HINT: This problem asks you to evaluate EMS as a useful mutagen and to determine its effects in the absence of DNA repair. The key to its solution is to consider the chemical effects of EMS on DNA. Also, consider the types of DNA repair that may operate on EMS-mutated DNA and the efficiency of these processes.

FIGURE 14.14 The Ames test, which screens compounds for potential mutagenicity. The high number of his+ revertant colonies on the right side of the figure confirms that the substance being tested was indeed mutagenic. [Figure steps: his- mutants and the potential mutagen are each combined with liver enzymes; add mutagenic mixture to filter paper disk; spread bacteria on agar medium without histidine; place disk on surface of medium after plating bacteria; incubate at 37°C; compare spontaneous his+ revertants (control) with his+ revertants induced by mutagen.]

14.7 The Ames Test Is Used to Assess the Mutagenicity of Compounds

Mutagenicity can be tested in various organisms, including fungi, plants, and cultured mammalian cells; however, one of the most common tests, which we describe here, uses bacteria.

The Ames test (named for American biochemist Bruce Ames, who invented the assay in the 1960s) uses a number of different strains of the bacterium Salmonella typhimurium that have been selected for their ability to reveal the presence of specific types of mutations. For example, some strains are used to detect base-pair substitutions, and other strains detect various frameshift mutations. Each strain contains a mutation in one of the genes of the histidine operon. The mutant strains are unable to synthesize histidine (his- strains) and therefore require histidine for growth. The assay measures the frequency of reverse mutations that occur within the mutant gene, yielding wild-type bacteria (his+ revertants) (Figure 14.14). The his- strains also have an increased sensitivity to mutagens due to the presence of mutations in genes involved in both DNA damage repair and the synthesis of the lipopolysaccharide barrier that coats these bacteria and protects them from external substances.

Many substances entering the human body are relatively innocuous until activated metabolically, usually in the liver, to more chemically reactive products. Thus, the Ames test includes a step in which the test compound is incubated in vitro in the presence of a mammalian liver extract. Alternatively, test compounds may be injected into a mouse where they are modified by liver enzymes and then recovered for use in the Ames test.
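As a rough illustration of how revertant counts from such an assay might be compared, the sketch below computes the ratio of mean his+ revertant colonies on treated plates to the mean on control plates. The colony counts are hypothetical and are not taken from the text, and the function name `fold_increase` is illustrative only.

```python
# Compare his+ revertant colony counts on treated vs. control plates.
# All colony counts below are hypothetical, chosen only for illustration.

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def fold_increase(treated_counts, control_counts):
    """Ratio of mean revertants on treated plates to mean on control plates."""
    return mean(treated_counts) / mean(control_counts)


control_plates = [18, 22, 20]     # spontaneous revertants (hypothetical)
treated_plates = [180, 210, 210]  # revertants with test compound (hypothetical)

ratio = fold_increase(treated_plates, control_plates)
print(ratio)  # 10.0
```

A ratio well above 1 would suggest that the compound raises the reversion frequency above the spontaneous background; how large a ratio counts as a positive result is a judgment that this sketch does not attempt to encode.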


In the initial use of Ames testing in the 1970s, a large number of known carcinogens, or cancer-causing agents, were examined, and more than 80 percent of these were shown to be strong mutagens. This is not surprising, as the transformation of cells to the malignant state occurs as a result of mutations. For example, more than 60 compounds found in cigarette smoke test positive in the Ames test and cause cancer in animal tests. Although a positive response in the Ames test does not prove that a compound is carcinogenic, the Ames test is useful as a preliminary screening device, as it is a rapid, convenient way to assess mutagenicity. Other tests of potential mutagens and carcinogens use laboratory animals such as rats and mice; however, these tests can take several years to complete and are more expensive. The Ames test is used extensively during the development of industrial and pharmaceutical chemical compounds.

14.8 Transposable Elements Move within the Genome and May Create Mutations

Transposable elements (TEs), informally known as “jumping genes,” are DNA sequences that can move or transpose within and between chromosomes, inserting themselves into various locations within the genome. They can range from 50 to 10,000 base pairs in length.

TEs are present in the genomes of all organisms from bacteria to humans. Not only are they ubiquitous, but they also make up large portions of some eukaryotic genomes. For example, almost 50 percent of the human genome is derived from TEs. Some organisms with unusually large genomes, such as salamanders and barley, contain hundreds of thousands of copies of various types of TEs constituting as much as 85 percent of these genomes.

The movement of TEs from one place in the genome to another has the capacity to disrupt genes and cause mutations, as well as to create chromosomal damage such as double-strand breaks. TEs also act as sites of genome rearrangement events, when homologous recombination occurs between DNA sequences with sequence similarities.

Since their discovery, TEs have also become valuable tools in genetic research. Geneticists harness these DNA elements as mutagens, as cloning tags, and as vehicles for introducing foreign DNA into model organisms.

TEs can be classified into two groups, based on their methods of transposition. Retrotransposons move using an RNA intermediate, and DNA transposons move in and out of the genome as DNA elements. We will look at both groups in the sections that follow.

DNA Transposons

DNA transposons move from one location to another without going through an RNA intermediate stage. They are abundant in many organisms from bacteria to humans.

DNA transposons share several structural features that are important for their function (Figure 14.15). Inverted terminal repeats (ITRs) are located on each end of the TE, and an open reading frame (ORF) codes for the enzyme transposase; both are required for movement of the TE in and out of the genome. ITRs are DNA sequences of between 9 and 40 bp long that are identical in sequence, but inverted relative to each other. ITRs are essential for transposition and are recognized and bound by the transposase enzyme. Short direct repeats (DRs) are present in the host DNA, flanking each TE insertion. These flanking DRs are created as a consequence of the TE insertion process.

DNA transposons vary considerably in length and are classified as either autonomous or nonautonomous. Autonomous transposons are able to transpose by themselves, as they encode a functional transposase enzyme and have intact ITRs. Nonautonomous transposons cannot move on their own because they do not encode their own functional transposase enzyme. They require the presence of an autonomous transposon elsewhere in the genome, so that the transposase synthesized by the autonomous element can be used by the nonautonomous element for transposition.

Most DNA transposons move through the genome using “cut-and-paste” mechanisms, in which the transposon is physically cut out of the genome and then inserted into a new position in the same or a different chromosome. Usually, but not always, the site from which the DNA transposon was cut is repaired accurately, leaving no trace of the original DNA transposon.

FIGURE 14.15 Structural features of DNA transposons. DNA transposons, shown in red, contain an open reading frame (ORF) that encodes the enzyme transposase. Some DNA transposons also contain ORFs encoding other proteins in addition to transposase. Inverted terminal repeats (ITRs), shown in detail below the main diagram, are short DNA sequences that are inverted relative to each other. Direct repeats (DRs, shown in blue) flank the DNA transposon in the chromosomal DNA. [Figure layout: DR, ITR, transposase ORF, ITR, DR. Left ITR: 5′-AGCTTAGGC-3′ over 3′-TCGAATCCG-5′. Right ITR: 5′-GCCTAAGCT-3′ over 3′-CGGATTCGA-5′.]
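The statement that ITRs are identical in sequence but inverted relative to each other can be checked directly on the two ITR sequences shown in Figure 14.15: one ITR should be the reverse complement of the other. The helper names below are illustrative only.

```python
# Check that the two ITRs from Figure 14.15 are inverted repeats,
# i.e. that one is the reverse complement of the other.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def are_inverted_repeats(left, right):
    """True if `right` is the reverse complement of `left`."""
    return reverse_complement(left) == right


left_itr = "AGCTTAGGC"   # top strand, left end of the transposon
right_itr = "GCCTAAGCT"  # top strand, right end of the transposon

print(reverse_complement(left_itr))               # GCCTAAGCT
print(are_inverted_repeats(left_itr, right_itr))  # True
```

Read 5′ to 3′ on opposite strands, the two ends of the element therefore present the same sequence, which is consistent with both ends being recognized by the same transposase.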


Examples of two DNA transposons and the ways in which their movements can affect gene expression are described next.

DNA Transposons: The Ac–Ds System in Maize

DNA transposons were first discovered by Barbara McClintock in the late 1940s as a result of her research on the genetics of maize. Her work involved analysis of the genetic behavior of two mutations, Dissociation (Ds) and Activator (Ac), expressed in either the endosperm or aleurone layers of maize seeds. She then correlated her genetic observations with cytological examinations of maize chromosomes. Initially, McClintock determined that Ds was located on chromosome 9. If Ac was also present in the genome, Ds induced breakage at a point on the chromosome adjacent to its own location. If chromosome breakage occurred in somatic cells during their development, progeny cells often lost part of the broken chromosome, causing a variety of phenotypic effects. The chapter opening photo illustrates the types of phenotypic effects caused by Ds mutations in kernels of corn.

Subsequent analysis suggested to McClintock that both Ds and Ac elements sometimes moved to new chromosomal locations. While Ds moved only if Ac was also present, Ac was capable of autonomous movement. Where Ds came to reside determined its genetic effects; that is, it might cause chromosome breakage, or it might inhibit expression of a certain gene. In cells in which Ds caused a gene mutation, Ds might move again, restoring the mutated gene to wild type. Figure 14.16 illustrates the types of movements and effects brought about by Ds and Ac elements.

In McClintock’s original observation, pigment synthesis was restored in cells in which the Ds element jumped out of chromosome 9. McClintock concluded that Ds was a mobile controlling element. Similar mobility was later also revealed for Ac. We now commonly refer to these as transposable elements (TEs).

FIGURE 14.16 Effects of Ac and Ds elements on gene expression. (a) If Ds is present in the absence of Ac, there is normal expression of a distantly located hypothetical gene W. (b) In the presence of Ac, Ds may transpose to a region adjacent to W. Ds can induce chromosome breakage, which may lead to loss of a chromosome fragment bearing the W gene. (c) In the presence of Ac, Ds may transpose into the W gene, disrupting W-gene expression. If Ds subsequently transposes out of the W gene, W-gene expression may return to normal. [Figure panel labels: (a) In absence of Ac, Ds is not transposable; wild-type expression of W occurs. (b) When Ac is present, Ds may be transposed; chromosome breaks and fragment is lost; W expression ceases, producing mutant effect. (c) Ds can move into and out of another gene; Ds is transposed into W gene and W expression is inhibited, producing mutant effect; Ds “jumps” out of W gene and wild-type expression of W is restored.]

Several Ac and Ds elements have now been analyzed, and the relationship between the two elements has been clarified. The first Ds element studied (Ds9) is nearly identical to Ac except for a 194-bp deletion within the transposase gene. The deletion of part of the transposase gene in the Ds9 element explains its dependence on the Ac element for transposition. Several other Ds elements have also been sequenced, and each contains an even larger deletion within the transposase gene. In each case, however, the ITRs are retained.

Although the significance of Barbara McClintock’s mobile controlling elements was not fully appreciated following her initial observations, molecular analysis has since verified her conclusions. She was awarded the Nobel Prize in Physiology or Medicine in 1983.

Retrotransposons

Retrotransposons are TEs that amplify and move within the genome using RNA as an intermediate. Their methods of transposition are sometimes described as “copy-and-paste” mechanisms. In many ways, retrotransposons resemble retroviruses, which replicate using similar mechanisms. However, retrotransposons do not encode all of the proteins


that are required to form mature virus particles and therefore are not infective.

Retrotransposons can be very abundant in some organisms. For example, maize genomes are made up of as much as 78 percent retrotransposon DNA. Approximately 42 percent of the human genome consists of retrotransposons or their remnants, whereas only approximately 3 percent of the human genome consists of DNA transposons.

There are two types of retrotransposons: the long-terminal-repeat (LTR) retrotransposons and the non-LTR retrotransposons. In addition, retrotransposons, like DNA transposons, can be either autonomous or nonautonomous.

Like DNA transposons, retrotransposons encode proteins that are required for their transposition and are flanked by direct repeats at their insertion sites. The structure of an LTR retrotransposon and its gene products is shown in Figure 14.17.

FIGURE 14.17 Structure of an LTR retrotransposon. Open reading frames (ORFs) encode the enzymes integrase and reverse transcriptase (RT). Transcription promoters and polyadenylation sites are located within 5′ and 3′ long terminal repeats (LTRs). The bottom part of the diagram shows transcription of the LTR retrotransposon and translation into integrase and reverse transcriptase. Non-LTR transposons lack LTRs, and their promoter and polyadenylation sites are located adjacent to the retrotransposon ORFs. [Figure layout: DR, LTR, integrase and reverse transcriptase ORFs, LTR, DR; transcription yields a polyadenylated RNA carrying the integrase ORF and RT ORF; translation yields integrase and reverse transcriptase.]

The steps in retrotransposon transposition involve the actions of both retrotransposon-encoded proteins and those that are part of the cell’s normal transcriptional and translational machinery. In the first step, the cell’s RNA polymerases transcribe the retrotransposon DNA into one or more RNA copies. In the second step, the RNA copies are translated into the two enzymes required for transposition: reverse transcriptase and integrase. The retrotransposon RNAs are then converted to double-stranded DNA copies through the actions of reverse transcriptase. The ends of the double-stranded DNAs are recognized by integrase, which then inserts the retrotransposons into the genome. Because many RNA copies can be converted to DNA and transposed in this way, retrotransposons are able to accumulate rapidly and may create mutations at many sites in the genome. In addition, the original retrotransposon is not excised during transposition.

Next, we will look at one well-studied example of a retrotransposon, copia, and describe its effects on the white locus in Drosophila.

Retrotransposons: The Copia–White-Apricot System in Drosophila

Copia elements are transcribed into “copious” amounts of RNA (hence their name). They are present in 10 to 100 copies in the genomes of Drosophila cells. Mapping studies show that they are transposable to different chromosomal locations and are dispersed throughout the genome.

Each copia element consists of approximately 5000 to 8000 bp of DNA, including a long terminal repeat (LTR) sequence of 267 bp at each end. Like other LTR retrotransposons, transcription of the copia element begins in the 5′ LTR, which contains a promoter and transcription start site. The transcript is cleaved and polyadenylated within the 3′ LTR, which contains a polyadenylation site. These features allow the retrotransposon to be transcribed by the cell’s RNA polymerase.

One of the earliest descriptions of copia effects came from research into the white-apricot mutation in Drosophila. This mutation changes the Drosophila eye color from a wild-type red to an orange-yellow color [Figure 14.18(a)]. DNA sequencing studies demonstrate that the mutation is caused by an insertion of copia into the second intron of the white gene. As a result of this insertion, most of the transcripts that originate from the white gene promoter terminate prematurely within the 3′ LTR of the copia retrotransposon. These prematurely terminated transcripts do not encode functional white gene product, resulting in a loss of red pigment in the eye [Figure 14.18(b)]. Because some white gene transcripts read through the copia element, enough white gene product is produced to yield a light-orange colored eye.

Transposable Elements in Humans

Recent genomic sequencing data reveal that almost half of the human genome is composed of TE DNA. As discussed earlier (see Chapter 11), the major families of human TEs are the long interspersed elements (LINEs) and short interspersed elements (SINEs), both of which are non-LTR retrotransposons. Together, they make up 34 percent of the human genome. Other families of TEs account for a further 11 percent (Table 14.4). As coding sequences comprise only about 1 percent of the human genome, there is about 40 to 50 times more TE DNA in the human genome than DNA in functional genes.


FIGURE 14.18 Effects of copia insertion into the white gene of Drosophila. (a) The white gene (top, blue) in wild-type Drosophila contains six exons, all of which are present in the mRNA (bottom, orange). (b) The white gene in mutant white-apricot Drosophila contains an insertion of copia (red) in the second intron and a prematurely terminated mRNA containing only two exons. [Figure panels: (a) wild-type white gene locus and mRNA containing 6 exons; wild-type eye color. (b) copia insertion in second intron and prematurely terminated mRNA; mutant white-apricot eye color.]

Although most human TEs appear to be inactive, the potential mobility and mutagenic effects of these elements have far-reaching implications for human genetics, as can be seen in an example of a TE “caught in the act.” The case involves a male child with hemophilia. One cause of hemophilia is a defect in blood-clotting factor VIII, the product of an X-linked gene. Haig Kazazian and colleagues found LINEs inserted at two points within the gene. Researchers were interested in determining if one of the mother’s X chromosomes also contained this specific LINE. If so, the unaffected mother would be heterozygous and could pass the LINE-containing chromosome to her son. The surprising finding was that the LINE sequence was not present on either of her X chromosomes but was detected on chromosome 22 of both parents. This suggests that this mobile element may have transposed from one chromosome to another in the gamete-forming cells of the mother, prior to being transmitted to the son.

LINE insertions into the human dystrophin gene (another X-linked gene) have resulted in at least two separate cases of Duchenne muscular dystrophy. In one case, a LINE inserted into exon 48, and in another case, it inserted into exon 44, both leading to frameshift mutations and premature termination of translation of the dystrophin protein. There are also reports that LINEs have inserted into the APC and c-myc genes, leading to mutations that may have contributed to the development of some colon and breast cancers.

SINE insertions are also responsible for more than 30 cases of human disease. In one case, an Alu element integrated into the BRCA2 gene, inactivating this tumor-suppressor gene and leading to a familial case of breast cancer.

TABLE 14.4 Transposable Elements in the Human Genome

Element Type       Length       Copies in Genome   % of Genome
LINEs              1–6 kb       850,000            21
SINEs              100–500 bp   1,500,000          13
LTR elements       65 kb        443,000            8
DNA transposons    80–300 bp    294,000            3
Unclassified       -            3,000              0.1
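The percentages quoted in the text can be cross-checked against Table 14.4 with a few lines of arithmetic. The values below are copied from the table; the 1 percent figure for coding sequence is the approximate value given in the text, and the variable names are illustrative only.

```python
# Cross-check the TE percentages in Table 14.4 against the text.

te_percent = {
    "LINEs": 21,
    "SINEs": 13,
    "LTR elements": 8,
    "DNA transposons": 3,
    "Unclassified": 0.1,
}

line_sine = te_percent["LINEs"] + te_percent["SINEs"]
total = sum(te_percent.values())
other = total - line_sine

coding_percent = 1  # approximate share of coding sequence, from the text
ratio = total / coding_percent

print(line_sine)        # 34
print(round(other, 1))  # 11.1
print(round(total, 1))  # 45.1
print(round(ratio))     # 45
```

The sums agree with the statements that LINEs and SINEs together make up 34 percent of the genome, that other families add about 11 percent, and that TE DNA is roughly 40 to 50 times more abundant than coding sequence.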


Other genes that have been mutated by Alu integrations are the factor IX gene (leading to hemophilia B), the ChE gene (leading to acholinesterasemia), and the NF1 gene (leading to neurofibromatosis).

Transposable Elements, Mutations, and Evolution

TEs can have a wide range of effects on genes, based on where they are inserted and their composition. Here are a few examples:

■ The insertion of a TE into one of a gene’s coding regions may disrupt the gene’s normal translation reading frame or may induce premature termination of translation of the gene’s mRNA.
■ The insertion of a TE containing polyadenylation or transcription termination signals into a gene’s intron may bring about termination of the gene’s transcription within the element. In addition, it can cause aberrant splicing of an RNA transcribed from the gene.
■ Insertions of a TE into a gene’s transcription regulatory region may disrupt the gene’s normal regulation or may cause the gene to be expressed differently as a result of the presence of the TE’s own promoter or enhancer sequences.
■ The presence of two or more identical TEs in a genome creates the potential for recombination between the transposons, leading to duplications, deletions, inversions, or chromosome translocations. Any of these rearrangements may bring about phenotypic changes or disease.

It is thought that about 0.2 percent of detectable human mutations may be due to TE insertions. Other organisms appear to suffer more damage due to transposition. For example, about 10 percent of new mouse mutations and 50 percent of Drosophila mutations are caused by insertions of TEs in or near genes.

Because of their ability to alter genes and chromosomes, TEs contribute to evolution. Some mutations caused by TE insertions or deletions may be beneficial to the organism under certain circumstances. These mutations may be selected for and maintained through evolution. In some cases, TEs themselves may be modified to perform functions that become beneficial to the organism. One example of TEs that contributed to evolution is provided by Drosophila telomeres. LINE-like elements are present at the ends of Drosophila chromosomes, and these elements have evolved to act as telomeres, maintaining the length of Drosophila chromosomes over successive cell divisions.

ESSENTIAL POINT
Transposable elements can move within a genome, creating mutations and altering gene expression. They may also contribute to evolution. ■

CASE STUDY An unexpected diagnosis

Six months pregnant, an expectant mother had a routine ultrasound that showed that the limbs of the fetus were unusually short. Her physician suspected that the baby might have a genetic form of dwarfism called achondroplasia, an autosomal dominant trait occurring with a frequency of about 1 in 27,000 births. The parents were directed to a genetic counselor to discuss this diagnosis. In the conference, they learned that achondroplasia is caused by a mutant allele. Sometimes it is passed from one generation to another, but in 80 percent of all cases it is the result of a spontaneous mutation that arises in a gamete of one of the parents. They also learned that most children with achondroplasia have normal intelligence and a normal life span.

1. What information would be most relevant to concluding which of the two mutation origins, inherited or new, most likely pertains in this case? How does this conclusion affect this couple’s decision to have more children?

2. It has been suggested that prenatal genetic testing for achondroplasia be made available and offered to all women. Would you agree with this initiative? What ethical considerations would you weigh when evaluating the medical and societal consequences of offering such testing?

For related reading see Radoi, V., et al. (2016). How to provide a genetic counseling in a simple case of antenatal diagnosis of achondroplasia. Gineco.eu. 12:56–58. DOI:10.18643/gieu.2016.56.


INSIGHTS AND SOLUTIONS

1. A rare dominant mutation expressed at birth was studied in humans. Records showed that six cases were discovered in 40,000 live births. Family histories revealed that in two cases, the mutation was already present in one of the parents. Calculate the spontaneous mutation rate for this mutation. What are some underlying assumptions that may affect our conclusions?

Solution: Only four cases represent a new mutation. Because each live birth represents two gametes, the sample size is from 80,000 meiotic events. The rate is equal to

$4/80{,}000 = 1/20{,}000 = 5 \times 10^{-5}$

We have assumed that the mutant gene is fully penetrant and is expressed in each individual bearing it. If it is not fully penetrant, our calculation may be an underestimate because one or more mutations may have gone undetected. We have also assumed that the screening was 100 percent accurate. One or more mutant individuals may have been “missed,” again leading to an underestimate. Finally, we assumed that the viabilities of the mutant and nonmutant individuals are equivalent and that they survive equally in utero. Therefore, our assumption is that the number of mutant individuals at birth is equal to the number at conception. If this were not true, our calculation would again be an underestimate.

2. Consider the following estimates:

(a) There are $7 \times 10^{9}$ humans living on this planet.

(b) Each individual has about 20,000 ($0.2 \times 10^{5}$) genes.

(c) The average mutation rate at each locus is $10^{-5}$.

How many spontaneous mutations are currently present in the human population? Assuming that these mutations are equally distributed among all genes, how many new mutations have arisen in each gene in the human population?

Solution: First, since each individual is diploid, there are two copies of each gene per person, each arising from a separate gamete. Therefore, the total number of spontaneous mutations is

$(2 \times 0.2 \times 10^{5}\ \text{genes/individual}) \times (7 \times 10^{9}\ \text{individuals}) \times (10^{-5}\ \text{mutations/gene})$
$= (0.4 \times 10^{5}) \times (7 \times 10^{9}) \times (10^{-5})\ \text{mutations}$
$= 2.8 \times 10^{9}\ \text{mutations in the population}$

$\dfrac{2.8 \times 10^{9}\ \text{mutations}}{0.2 \times 10^{5}\ \text{genes}} = 14 \times 10^{4}\ \text{mutations per gene in the population}$

3. The base analog 2-amino purine (2-AP) substitutes for adenine during DNA replication, but it may base-pair with cytosine. The base analog 5-bromouracil (5-BU) substitutes for thymine, but it may base-pair with guanine. Follow the double-stranded trinucleotide sequence shown at the top of the figure through three rounds of replication, assuming that, in the first round, both analogs are present and become incorporated wherever possible. Before the second and third round of replication, any unincorporated base analogs are removed. What final sequences occur?

Solution (figure; each duplex is listed as its three base pairs, top to bottom):

Starting duplex: A–T, G–C, T–A

Round I (5-BU substitutes for T; 2-AP substitutes for A):
  A–5BU, G–C, T–2AP
  2AP–T, G–C, 5BU–A

Round II:
  A–T, G–C, T–A
  G–5BU, G–C, C–2AP
  2AP–C, G–C, 5BU–G
  A–T, G–C, T–A

Round III (products of the two analog-containing duplexes):
  G–C, G–C, C–G and G–5BU, G–C, C–2AP
  2AP–C, G–C, 5BU–G and G–C, G–C, C–G
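The arithmetic in Solutions 1 and 2 above can be checked with a short Python sketch. It is only an illustration: the function and parameter names are hypothetical, and it relies on the same simplifying assumptions stated in the solutions (full penetrance, perfect screening, equal viability, and mutations spread evenly over all genes).

```python
from fractions import Fraction


def spontaneous_mutation_rate(total_cases, inherited_cases, live_births):
    """Rate of new mutations per gamete (hypothetical helper for Solution 1)."""
    new_mutations = total_cases - inherited_cases  # cases not already present in a parent
    gametes = 2 * live_births                      # each live birth represents two gametes
    return Fraction(new_mutations, gametes)


def mutations_in_population(individuals, genes_per_genome, rate_per_locus):
    """Total mutations and mutations per gene (hypothetical helper for Solution 2)."""
    gene_copies_per_person = 2 * genes_per_genome  # diploid: two copies of each gene
    total = individuals * gene_copies_per_person * rate_per_locus
    per_gene = total / genes_per_genome            # assumes an even spread across genes
    return total, per_gene


# Solution 1: six cases, two inherited, 40,000 live births.
rate = spontaneous_mutation_rate(6, 2, 40_000)
print(rate, float(rate))

# Solution 2: 7 x 10^9 people, 20,000 genes, 10^-5 mutations per locus.
total, per_gene = mutations_in_population(7 * 10**9, 20_000, Fraction(1, 10**5))
print(int(total), int(per_gene))
```

Exact fractions are used instead of floating-point numbers so that the results can be compared directly with the values in the worked solutions.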


PROBLEMS AND DISCUSSION QUESTIONS

Visit Mastering Genetics for instructor-assigned tutorials and problems.

1. HOW DO WE KNOW? In this chapter, we focused on how gene mutations arise and how cells repair DNA damage. At the same time, we found opportunities to consider the methods and reasoning by which much of this information was acquired. From the explanations given in the chapter,
(a) How do we know that many cancer-causing agents (carcinogens) are also mutagenic?
(b) How do we know that certain chemicals and wavelengths of radiation induce mutations in DNA?
(c) How do we know that DNA repair mechanisms detect and correct the majority of spontaneous and induced mutations?

2. CONCEPT QUESTION Review the Chapter Concepts list on p. 261. These concepts relate to how gene mutations occur, their phenotypic effects, and how mutations can be repaired. Write a short essay contrasting how these concepts may differ between bacteria and eukaryotes.

3. What is a spontaneous mutation, and why are spontaneous mutations rare?

4. Why would a mutation in a somatic cell of a multicellular organism not necessarily result in a detectable phenotype?

5. Most mutations are thought to be deleterious. Why, then, is it reasonable to state that mutations are essential to the evolutionary process?

6. Why is a random mutation more likely to be deleterious than beneficial?

7. Most mutations in a diploid organism are recessive. Why?

8. What is the difference between a silent mutation and a neutral mutation?

9. Describe a tautomeric shift and how it may lead to a mutation.

10. Contrast and compare the mutagenic effects of deaminating agents, alkylating agents, and base analogs.

11. Why are frameshift mutations likely to be more detrimental than point mutations, in which a single pyrimidine or purine has been substituted?

12. Why are X rays more potent mutagens than UV radiation?

13. Contrast the various types of DNA repair mechanisms known to counteract the effects of UV radiation. What is the role of visible light in repairing UV-induced mutations?

14. Mammography is an accurate screening technique for the early detection of breast cancer in humans. Because this technique uses X rays diagnostically, it has been highly controversial. Can you explain why? What reasons justify the use of X rays for such a medical screening technique?

15. A significant number of mutations in the HBB gene that cause human β-thalassemia occur within introns or in upstream noncoding sequences. Explain why mutations in these regions often lead to severe disease, although they may not directly alter the coding regions of the gene.

16. Describe how the Ames test screens for potential environmental mutagens. Why is it thought that a compound that tests positively in the Ames test may also be carcinogenic?

17. What genetic defects result in the disorder xeroderma pigmentosum (XP) in humans? How do these defects create the phenotypes associated with the disorder?

18. Compare DNA transposons and retrotransposons. What properties do they share?

19. In maize, a Ds or Ac transposon can alter the function of genes at or near the site of transposon insertion. It is possible for these elements to transpose away from their original insertion site, causing a reversion of the mutant phenotype. In some cases, however, even more severe phenotypes appear, due to events at or near the mutant allele. What might be happening to the transposon or the nearby gene to create more severe mutations?

20. It is estimated that about 0.2 percent of human mutations are due to TE insertions, and a much higher degree of mutational damage is known to occur in some other organisms. In what way might a TE insertion contribute positively to evolution?

21. In a bacterial culture in which all cells are unable to synthesize leucine (leu⁻), a potent mutagen is added, and the cells are allowed to undergo one round of replication. At that point, samples are taken, a series of dilutions are made, and the cells are plated on either minimal medium or minimal medium containing leucine. The first culture condition (minimal medium) allows the growth of only leu⁺ cells, while the second culture condition (minimal medium with leucine added) allows growth of all cells. The results of the experiment are as follows:

Culture Condition            Dilution      Colonies
Minimal medium               $10^{-1}$     18
Minimal medium + leucine     $10^{-7}$     6

What is the rate of mutation at the locus associated with leucine biosynthesis?

22. Presented here are hypothetical findings from studies of heterokaryons formed from seven human xeroderma pigmentosum cell strains:

        XP1   XP2   XP3   XP4   XP5   XP6   XP7
XP1     -
XP2     -     -
XP3     -     -     -
XP4     +     +     +     -
XP5     +     +     +     +     -
XP6     +     +     +     +     -     -
XP7     +     +     +     +     -     -     -

Note: + = complementation; - = no complementation

These data are measurements of the occurrence or nonoccurrence of unscheduled DNA synthesis in the fused heterokaryon. None of the strains alone shows any unscheduled DNA synthesis. Which strains fall into the same complementation groups? How many different groups are revealed based on these data? What can we conclude about the genetic basis of XP from these data?

23. Skin cancer carries a lifetime risk nearly equal to that of all other cancers combined. Following is a graph [modified from K. H. Kraemer (1997). Proc. Natl. Acad. Sci. (USA) 94:11–14] depicting the age of onset of skin cancers in patients with or without XP, where the cumulative percentage of skin cancer is plotted against age. The non-XP curve is based on 29,757 cancers surveyed by the National Cancer Institute, and the curve representing those with XP is based on 63 skin cancers from the Xeroderma Pigmentosum Registry.

[Graph: cumulative percentage with cancers (0 to 100) plotted against age of onset in years (0 to 80), with one curve for XP and one for non-XP.]

(a) Provide an overview of the information contained in the graph.
(b) Explain why individuals with XP show such an early age of onset.

24. It has been noted that most transposons in humans and other organisms are located in noncoding regions of the genome: regions such as introns, pseudogenes, and stretches of particular types of repetitive DNA. There are several ways to interpret this observation. Describe two possible interpretations. Which interpretation do you favor? Why?

25. Mutations in the IL2RG gene cause approximately 30 percent of severe combined immunodeficiency disorder (SCID) cases in humans. These mutations result in alterations to a protein component of cytokine receptors that are essential for proper development of the immune system. The IL2RG gene is composed of eight exons and contains upstream and downstream sequences that are necessary for proper transcription and translation. Below are some of the mutations observed. For each, explain its likely influence on the IL2RG gene product (assume its length to be 375 amino acids).
(a) Nonsense mutation in a coding region
(b) Insertion in exon 1, causing frameshift
(c) Insertion in exon 7, causing frameshift
(d) Missense mutation
(e) Deletion in exon 2, causing frameshift
(f) Deletion in exon 2, in frame
(g) Large deletion covering exons 2 and 3

15 Regulation of Gene Expression in Bacteria

CHAPTER CONCEPTS

■ In bacteria, regulation of gene expression is often linked to the metabolic needs of the cell.
■ Efficient expression of genetic information in bacteria is dependent on intricate regulatory mechanisms that exert control over transcription.
■ Mechanisms that regulate transcription are categorized as exerting either positive or negative control of gene expression.
■ Bacterial genes that encode proteins with related functions tend to be organized in clusters and are often under coordinated control. Such clusters, including their adjacent regulatory sequences, are called operons.
■ Transcription of genes within operons is either inducible or repressible.
■ Often, the end product of a metabolic pathway induces or represses gene expression in that pathway.

[Chapter-opening figure: A model generated using crystal structure analysis depicting the lac repressor tetramer bound to two operator DNA sequences (shown in blue).]

Previous chapters have discussed how DNA is organized into genes, how genes store genetic information, and how this information is expressed through the processes of transcription and translation. We now consider one of the most fundamental questions in molecular genetics: How is genetic expression regulated? It is now clear that gene expression varies widely in bacteria under different environmental conditions. For example, detailed analysis of proteins in Escherichia coli shows that concentrations of the 4000 or so polypeptide chains encoded by the genome vary widely. Some proteins may be present in as few as 5 to 10 molecules per cell, whereas others, such as ribosomal proteins and the many proteins involved in the glycolytic pathway, are present in as many as 100,000 copies per cell. Although most bacterial gene products are present continuously at a basal level (a few copies), the concentration of these products can increase dramatically when required. Clearly, fundamental regulatory mechanisms must exist to control the expression of the genetic information.

In this chapter, we will explore regulation of gene expression in bacteria. As we have seen in several previous chapters, bacteria have been especially useful research organisms in genetics for a number of reasons. For one thing, they have extremely short reproductive cycles, and literally hundreds of generations, giving rise to billions of genetically identical bacteria, can be produced in overnight cultures. In addition, they can be studied in


“pure culture,” allowing mutant strains of genetically unique bacteria to be isolated and investigated separately.

In addition to regulating gene expression in response to environmental conditions, bacteria also must respond to attacks from bacteriophages, the viruses that infect them (see Chapter 8). Later in this chapter we will explore the regulation of a genetic system called CRISPR-Cas that serves as an immune system to fight invading bacteriophage DNA sequences. Note that we introduced CRISPR-Cas in the opening discussion in Chapter 1 as one of the most exciting discoveries recently made in genetics. As such, in addition to our coverage in this chapter, we will continue to present information involving this system in Chapter 17, Special Topics Chapter 3 (Gene Therapy), and Special Topics Chapter 6 (Genetically Modified Foods).

15.1 Bacteria Regulate Gene Expression in Response to Environmental Conditions

The regulation of gene expression has been extensively studied in bacteria, particularly in E. coli. Geneticists have learned that highly efficient genetic mechanisms have evolved in these organisms to turn transcription of specific genes on and off, depending on the cell’s metabolic need for the respective gene products. Not only do bacteria respond to changes in their environment, but they also regulate gene activity associated with a variety of cellular activities, including the replication, recombination, and repair of their DNA, and with cell division.

The idea that microorganisms regulate the synthesis of their gene products is not a new one. As early as 1900, it was shown that when lactose (a galactose and glucose-containing disaccharide) is present in the growth medium of yeast, the organisms synthesize enzymes required for lactose metabolism. When lactose is absent, synthesis diminishes to a basal level. Soon thereafter, investigators generalized that bacteria adapt to their environment, producing certain enzymes only when specific chemical substrates are present. These are now referred to as inducible enzymes. In contrast, enzymes that are produced continuously, regardless of the chemical makeup of the environment, are called constitutive enzymes.

More recent investigation has revealed a contrasting system, whereby the presence of a specific molecule inhibits gene expression. Such molecules are usually end products of anabolic biosynthetic pathways. For example, utilizing a multistep metabolic pathway, the amino acid tryptophan can be synthesized by bacterial cells. If a sufficient supply of tryptophan is present in the environment or culture medium, then it is inefficient for the organism to expend energy to synthesize the enzymes necessary for tryptophan production. A mechanism has therefore evolved whereby tryptophan plays a role in repressing the transcription of mRNA needed for producing tryptophan-synthesizing enzymes. In contrast to the inducible system controlling lactose metabolism, the system governing tryptophan expression is said to be repressible.

Regulation, whether of the inducible or repressible type, may be under either negative or positive control. Under negative control, genetic expression occurs unless it is shut off by some form of a regulator molecule. In contrast, under positive control, transcription occurs only if a regulator molecule directly stimulates RNA production. In theory, either type of control or a combination of the two can govern inducible or repressible systems.

ESSENTIAL POINT
Regulation of gene expression in bacteria is often linked to the metabolic needs of the organism. ■

15.2 Lactose Metabolism in E. coli Is Regulated by an Inducible System

Beginning in 1946 with the studies of Jacques Monod and continuing through the next decade with significant contributions by François Jacob and André Lwoff, genetic and biochemical evidence concerning lactose metabolism was amassed. Research provided insights into the way in which gene activity is repressed when lactose is absent but induced when it is available. In the presence of lactose, the concentration of the enzymes responsible for its metabolism increases rapidly from a few molecules to thousands per cell. The enzymes responsible for lactose metabolism are thus inducible, and lactose serves as the inducer.

In bacteria, genes that code for enzymes with related functions (for example, the set of genes involved with lactose metabolism) tend to be organized in clusters on the bacterial chromosome, and transcription of these genes is often under the coordinated control of a single regulatory region. Such clusters, including their adjacent regulatory sequences, are called operons. The location of the regulatory region is almost always upstream (5′) of the gene cluster it controls. Because the regulatory region is on the same strand as those genes, we refer to it as a cis-acting site. Cis-acting regulatory regions are bound by molecules that control transcription of the gene cluster. Such molecules are called trans-acting factors. Events


FIGURE 15.1 A simplified overview of the genes and regulatory units involved in the control of lactose metabolism. (The regions within this stretch of DNA are not drawn to scale.)

[Diagram labels, left to right. Regulatory region: repressor gene (lacI); promoter–operator (P, O). Structural genes: β-galactosidase gene (lacZ); permease gene (lacY); transacetylase gene (lacA). Bracket label: lac operon.]

at the regulatory site determine whether the genes are transcribed into mRNA and thus whether the corresponding enzymes or other protein products may be synthesized from the genetic information in the mRNA. Binding of a trans-acting element at a cis-acting site can regulate the gene cluster either negatively (by turning off transcription) or positively (by turning on transcription of genes in the cluster). In this section, we discuss how transcription of such bacterial gene clusters is coordinately regulated.

The discovery of a regulatory gene and a regulatory site that are part of the gene cluster was paramount to the understanding of how gene expression is controlled in the system. Neither of these regulatory elements encodes enzymes necessary for lactose metabolism, which is the function of the three genes in the cluster. As illustrated in Figure 15.1, the three structural genes and the adjacent regulatory site constitute the lactose (lac) operon. Together, the entire gene cluster functions in an integrated fashion to provide a rapid response to the presence or absence of lactose.

Structural Genes

Genes coding for the primary structure of an enzyme are called structural genes. There are three structural genes in the lac operon. The lacZ gene encodes β-galactosidase, an enzyme whose primary role is to convert the disaccharide lactose to the monosaccharides glucose and galactose (Figure 15.2). This conversion is essential if lactose is to serve as the primary energy source in glycolysis. The second gene, lacY, specifies the primary structure of permease, an enzyme that facilitates the entry of lactose into the bacterial cell. The third gene, lacA, codes for the enzyme transacetylase. While its physiological role is still not completely clear, it may be involved in the removal of toxic by-products of lactose digestion from the cell.

FIGURE 15.2 The catabolic conversion of the disaccharide lactose into its monosaccharide units, galactose and glucose.

[Reaction shown: lactose + H2O, acted on by β-galactosidase, yields galactose + glucose.]

To study the genes coding for these three enzymes, researchers isolated numerous mutations that lacked the function of one or the other enzyme. Such lac⁻ mutants were first isolated and studied by Joshua Lederberg. Mutant cells that fail to produce active β-galactosidase (lacZ⁻) or permease (lacY⁻) are unable to use lactose as an energy source. Mutations were also found in the transacetylase gene. Mapping studies by Lederberg established that all three genes are closely linked or contiguous to one another on the bacterial chromosome, in the order Z–Y–A (see Figure 15.1).

Knowledge of their close linkage led to another discovery relevant to what later became known about the regulation of structural genes: All three genes are transcribed as a single unit, resulting in a so-called polycistronic mRNA (Figure 15.3) (recall that cistron refers to the part of a nucleotide sequence coding for a single gene). This results in the coordinate regulation of all three genes, since a single-message RNA is simultaneously translated into all three gene products.


FIGURE 15.3 The structural genes of the lac operon are transcribed into a single polycistronic mRNA, which is translated simultaneously by several ribosomes into the three enzymes encoded by the operon.

[Diagram: the structural genes lacZ, lacY, and lacA are transcribed into one polycistronic mRNA; ribosomes move along the mRNA, translating it into the proteins β-galactosidase, permease, and transacetylase.]

The Discovery of Regulatory Mutations

How does lactose stimulate transcription of the lac operon and induce the synthesis of the enzymes for which it codes? A partial answer came from studies using gratuitous inducers, chemical analogs of lactose such as the sulfur-containing analog isopropylthiogalactoside (IPTG), shown in Figure 15.4. Gratuitous inducers behave like natural inducers, but they do not serve as substrates for the enzymes that are subsequently synthesized. Their discovery provides strong evidence that the primary induction event does not depend on the interaction between the inducer and the enzyme.

FIGURE 15.4 The gratuitous inducer isopropylthiogalactoside (IPTG).

What, then, is the role of lactose in induction? The answer to this question requires the study of another class of mutations described as constitutive mutations. In cells bearing these types of mutations, enzymes are produced regardless of the presence or absence of lactose. Studies of the constitutive mutation lacI⁻ mapped the mutation to a site on the bacterial chromosome close to, but not part of, the structural genes lacZ, lacY, and lacA. This mutation led researchers to discover the lacI gene, which is appropriately called a repressor gene. A second set of constitutive mutations producing effects identical to those of lacI⁻ is present in a region immediately adjacent to the structural genes. This class of mutations, designated lacOᶜ, is located in the operator region of the operon. In both types of constitutive mutants, the enzymes are produced continually, inducibility is eliminated, and gene regulation has been lost.

The Operon Model: Negative Control

Around 1960, Jacob and Monod proposed a hypothetical mechanism involving negative control that they called the operon model, in which a group of genes is regulated and expressed together as a unit. As we saw in Figure 15.1, the lac operon they proposed consists of the Z, Y, and A structural genes, as well as the adjacent sequences of DNA referred to as the operator region. They argued that the lacI gene regulates the transcription of the structural genes by producing a repressor molecule and that the repressor is allosteric, meaning that the molecule reversibly interacts with another molecule, undergoing both a conformational change in three-dimensional shape and a change in chemical activity. Figure 15.5 illustrates the components of the lac operon as well as the action of the lac repressor in the presence and absence of lactose.

Jacob and Monod suggested that the repressor normally binds to the DNA sequence of the operator region. When it does so, it inhibits the action of RNA polymerase, effectively repressing the transcription of the structural genes [Figure 15.5(b)]. However, when lactose is present, this sugar binds to the repressor and causes an allosteric (conformational) change. The change alters the binding site of the repressor, rendering it incapable of interacting with operator DNA [Figure 15.5(c)]. In the absence of the repressor–operator interaction, RNA polymerase transcribes the structural genes, and the enzymes necessary for lactose metabolism are produced. Because transcription occurs only when the repressor fails to bind to the operator region, regulation is said to be under negative control.

To summarize, the operon model invokes a series of molecular interactions between proteins, inducers, and DNA to explain the efficient regulation of structural gene expression. In the absence of lactose, the enzymes encoded by the genes are not needed and the expression of genes encoding these enzymes is repressed. When lactose is present, it indirectly induces the activation of the genes by binding with the repressor.* If all lactose is metabolized, none is available to bind to the repressor, which is again free to bind to operator DNA and to repress transcription.

* Technically, the inducer is allolactose, an isomer of lactose. When lactose enters the bacterial cell, some of it is converted to allolactose by the β-galactosidase enzyme.

Both the I⁻ and Oᶜ constitutive mutations interfere with these molecular interactions, allowing continuous


FIGURE 15.5 The components of the wild-type lac operon and the response in the absence and presence of lactose.

(a) Components: the repressor gene (I), promoter (P), operator gene (O), leader (L), and structural genes Z, Y, and A; the repressor protein, which has an operator-binding site and a lactose-binding site; RNA polymerase; and lactose.

(b) I⁺ O⁺ Z⁺ Y⁺ A⁺ (wild type), no lactose present: Repressed. The repressor binds to the operator, blocking transcription. No transcription; no enzymes.

(c) I⁺ O⁺ Z⁺ Y⁺ A⁺ (wild type), lactose present: Induced. The operator-binding region of the repressor is altered when bound to lactose. No binding occurs; transcription proceeds, and the mRNA is translated into enzymes.

transcription of the structural genes. In the case of the I⁻ mutant, seen in Figure 15.6(a), the repressor protein is altered or absent and cannot bind to the operator region, so the structural genes are always turned on. In the case of the Oᶜ mutant [Figure 15.6(b)], the nucleotide sequence of the operator DNA is altered and will not bind with a normal repressor molecule. The result is the same: The structural genes are always transcribed.

ESSENTIAL POINT
Genes involved in the metabolism of lactose are coordinately regulated by a negative control system that responds to the presence or absence of lactose. ■

Genetic Proof of the Operon Model

The operon model is a good one because it leads to three major predictions that can be tested to determine its validity. The major predictions to be tested are that (1) the I gene produces a diffusible product (that is, a trans-acting product), (2) the O region is involved in regulation but does not produce a product (it is cis-acting), and (3) the O region must be adjacent to the structural genes to regulate transcription.

The creation of partially diploid bacteria allows us to assess these assumptions, particularly those that predict the presence of trans-acting regulatory elements. For example, the F plasmid may contain chromosomal genes (Chapter 8), in which case it is designated F′. When an F⁻ cell acquires such


[Figure 15.6 diagram.
(a) I- O+ Z+ Y+ A+ (mutant repressor gene), no lactose present: constitutive. The operator-binding region of the repressor is altered. No binding occurs; transcription proceeds, and the mRNA is translated into enzymes.
(b) I+ O^C Z+ Y+ A+ (mutant operator gene), no lactose present: constitutive. The nucleotide sequence of the operator gene is altered. No binding occurs; transcription proceeds, and the mRNA is translated into enzymes.]

FIGURE 15.6 The response of the lac operon in the absence of lactose when a cell bears either the I- or the O^C mutation.

a plasmid, it contains its own chromosome plus one or more additional genes present in the plasmid. This host cell is thus a merozygote, a cell that is diploid for certain added genes (but not for the rest of the chromosome). The use of such a plasmid makes it possible, for example, to introduce an I+ gene into a host cell whose genotype is I-, or to introduce an O+ region into a host cell of genotype O^C. The Jacob–Monod operon model predicts how regulation should be affected in such cells. Adding an I+ gene to an I- cell should restore inducibility, because the normal wild-type repressor, which is a trans-acting factor, would be produced by the inserted I+ gene. In contrast, adding an O+ region to an O^C cell should have no effect on constitutive enzyme production, since regulation depends on an O+ region being located immediately adjacent to the structural genes; that is, O+ is a cis-acting element.

Results of these experiments are shown in Table 15.1, where Z represents the structural genes (and the inserted genes are listed after the designation F′). In both cases described above, the Jacob–Monod model is upheld (part B of Table 15.1). Part C of the table shows the reverse experiments, where either an I- gene or an O^C region is added to cells of normal inducible genotypes. As the model predicts, inducibility is maintained in these partial diploids.

TABLE 15.1 A Comparison of Gene Activity (+ or -) in the Presence or Absence of Lactose for Various E. coli Genotypes

                              Presence of β-Galactosidase Activity
    Genotype                  Lactose Present     Lactose Absent
A.  I+ O+ Z+                        +                   -
    I+ O+ Z-                        -                   -
    I- O+ Z+                        +                   +
    I+ O^C Z+                       +                   +
B.  I- O+ Z+ / F′ I+                +                   -
    I+ O^C Z+ / F′ O+               +                   +
C.  I+ O+ Z+ / F′ I-                +                   -
    I+ O+ Z+ / F′ O^C               +                   -
D.  I^S O+ Z+                       -                   -
    I^S O+ Z+ / F′ I+               -                   -

*Note: In parts B to D, most genotypes are partially diploid, containing an F factor plus attached genes (F′).

Another prediction of the operon model is that certain mutations in the I gene should have the opposite effect of I-. That is, instead of being constitutive because the repressor cannot bind the operator, mutant repressor molecules should


[Figure 15.7 diagram. I^S O+ Z+ Y+ A+ (mutant repressor gene), lactose present: repressed. The lactose-binding region of the repressor is altered; no binding to lactose. The repressor is always bound to the operator, blocking transcription.]

FIGURE 15.7 The response of the lac operon in the presence of lactose in a cell bearing the I^S mutation.

be produced that cannot interact with the inducer, lactose. Thus, these repressors would always bind to the operator sequence, and the structural genes would be permanently repressed. In cases like this, the presence of an additional I+ gene would have little or no effect on repression.

In fact, such a mutation, I^S, was discovered wherein the operon, as predicted, is “superrepressed,” as shown in part D of Table 15.1 (and depicted in Figure 15.7). An additional I+ gene does not effectively relieve repression of gene activity. These observations are consistent with the idea that the repressor contains separate DNA-binding domains and inducer-binding domains.

Isolation of the Repressor

Although Jacob and Monod’s operon theory succeeded in explaining many aspects of genetic regulation in bacteria, the nature of the repressor molecule was not known when their landmark paper was published in 1961. Although they had assumed that the allosteric repressor was a protein, RNA was also a viable candidate because the activity of the molecule required the ability to bind to DNA. Despite many attempts to isolate and characterize the hypothetical repressor molecule, no direct chemical evidence was forthcoming. A single E. coli cell contains no more than ten or so copies of the lac repressor, and direct chemical identification of ten molecules in a population of millions of proteins and RNAs in a single cell presented a tremendous challenge.

In 1966, Walter Gilbert and Benno Müller-Hill reported the isolation of the lac repressor in partially purified form. To facilitate the isolation, they used a regulator quantity (I^q) mutant strain that contains about ten times as much repressor as do wild-type E. coli cells. Also instrumental in their success was the use of the gratuitous inducer IPTG, which binds to the repressor, and the technique of equilibrium dialysis. In this technique, extracts of I^q cells were placed in a dialysis bag and allowed to attain equilibrium with an external solution of radioactive IPTG, a molecule small enough to diffuse freely in and out of the bag. At equilibrium, the concentration of radioactive IPTG was higher inside the bag than in the external solution, indicating that an IPTG-binding material was present in the cell extract and was too large to diffuse across the wall of the bag.

NOW SOLVE THIS
15.1 Even though the lac Z, Y, and A structural genes are transcribed as a single polycistronic mRNA, each gene contains the initiation and termination signals essential for translation. Predict what will happen when a cell growing in the presence of lactose contains a deletion of one nucleotide (a) early in the Z gene and (b) early in the A gene.
HINT: This problem requires you to combine your understanding of the genetic expression of the lac operon with that of the genetic code, frameshift mutations, and termination of transcription. The key to its solution is to consider the effect of the loss of one nucleotide within a polycistronic mRNA.

Ultimately, the IPTG-binding material was purified and shown to have various characteristics of a protein. In contrast, extracts of I- constitutive cells having no lac repressor activity did not exhibit IPTG binding, strongly suggesting that the isolated protein was the repressor molecule.

To confirm this thinking, Gilbert and Müller-Hill grew E. coli cells in a medium containing radioactive sulfur and then isolated the IPTG-binding protein, which was labeled in its sulfur-containing amino acids. Next, this protein was mixed with DNA from a strain of phage lambda (λ) carrying the lacO+ gene. When the two substances are separate, the DNA sediments at 40S and the IPTG-binding protein sediments at 7S. However, when the DNA and protein were mixed and sedimented in a gradient, using ultracentrifugation, the radioactive protein sedimented at the same rate as did the DNA, indicating that the protein binds to the DNA. Further experiments showed that the IPTG-binding, or repressor, protein binds only to DNA containing the lac region and does not bind to lac DNA containing an operator-constitutive O^C mutation.
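The cis/trans reasoning behind Table 15.1 can be written as a few lines of logic. The following Python sketch is only an illustration of the Jacob–Monod predictions as stated in the text, not a quantitative model; the names `LacCopy` and `beta_gal_activity` are hypothetical and chosen here for convenience. It assumes that repressor products from every I allele in the cell are pooled (trans-acting), while each operator controls only the structural genes on its own DNA molecule (cis-acting).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LacCopy:
    """Alleles carried by one DNA molecule (chromosome or F' plasmid).

    None means that region is not present on this molecule.
    (Hypothetical helper type, used only for this illustration.)
    """
    i: Optional[str] = None  # repressor gene allele: "+", "-", or "S"
    o: Optional[str] = None  # operator allele: "+" or "C"
    z: Optional[str] = None  # structural genes: "+" or "-"


def beta_gal_activity(copies, lactose):
    """Return True if beta-galactosidase activity is predicted."""
    i_alleles = {c.i for c in copies if c.i is not None}
    # The repressor is diffusible (trans-acting): pool all I gene products.
    superrepressor = "S" in i_alleles               # cannot bind lactose
    active_wild_type = "+" in i_alleles and not lactose
    repressor_can_bind = superrepressor or active_wild_type

    for c in copies:
        if c.z != "+" or c.o is None:
            continue  # no functional Z gene under an operator on this molecule
        # The operator is cis-acting: it governs only genes on its own molecule.
        blocked = c.o == "+" and repressor_can_bind
        if not blocked:
            return True
    return False


if __name__ == "__main__":
    genotypes = {
        "I+ O+ Z+": [LacCopy("+", "+", "+")],
        "I- O+ Z+ / F' I+": [LacCopy("-", "+", "+"), LacCopy(i="+")],
        "I+ O^C Z+ / F' O+": [LacCopy("+", "C", "+"), LacCopy(o="+")],
        "I^S O+ Z+ / F' I+": [LacCopy("S", "+", "+"), LacCopy(i="+")],
    }
    for name, copies in genotypes.items():
        with_lac = "+" if beta_gal_activity(copies, lactose=True) else "-"
        without = "+" if beta_gal_activity(copies, lactose=False) else "-"
        print(f"{name:22s} lactose present: {with_lac}   lactose absent: {without}")
```

Pooling the I alleles but checking each operator per molecule is the design choice that separates trans-acting from cis-acting elements; swapping those two treatments would no longer match parts B and C of the table.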


ESSENTIAL POINT
Research on the lac operon in E. coli pioneered our understanding of the regulation of gene expression. ■

15.3 The Catabolite-Activating Protein (CAP) Exerts Positive Control over the lac Operon

As described in the preceding discussion of the lac operon, the role of β-galactosidase is to cleave lactose into its components, glucose and galactose. Then, to be used by the cell, the galactose, too, must be converted to glucose. What if the cell found itself in an environment that contained ample amounts of both lactose and glucose? Given that glucose is the preferred carbon source for E. coli, it would not be energetically efficient for a cell to induce transcription of the lac operon, since what it really needs, glucose, is already present. As we will see next, still another molecular component, called the catabolite-activating protein (CAP), is involved in diminishing the expression of the lac operon when glucose is present. This inhibition is called catabolite repression.

To understand catabolite repression, let’s backtrack for a moment to review the system depicted in Figure 15.5. When the lac repressor is bound to the inducer (lactose), the lac operon is activated, and RNA polymerase transcribes the structural genes. As stated earlier in the text (see Chapter 12), transcription is initiated as a result of the binding that occurs between RNA polymerase and the nucleotide sequence of the promoter region, found upstream (5′) from the initial coding sequences. Within the lac operon, the promoter is found between the I gene and the operator region (O) (see Figure 15.1). Careful examination has revealed that polymerase binding is never very efficient unless CAP is also present to facilitate the process.

The mechanism is summarized in Figure 15.8. In the absence of glucose and under inducible conditions, CAP exerts positive control by binding to the CAP site, facilitating RNA-polymerase binding at the promoter, and thus transcription. Therefore, for maximal transcription of the structural genes to occur, the repressor must be bound by

[Figure 15.8 diagram. The promoter region contains a CAP-binding site and a polymerase site, followed by the operator (O) and the structural genes.
(a) Glucose absent: As cAMP levels increase, cAMP binds to CAP (catabolite-activating protein), causing an allosteric transition. The CAP–cAMP complex binds the CAP-binding site, RNA polymerase binds, and transcription and translation occur.
(b) Glucose present: cAMP levels decrease, CAP cannot bind efficiently, RNA polymerase seldom binds, and transcription and translation are diminished.]

FIGURE 15.8 Catabolite repression. (a) In the absence of glucose, cAMP levels increase, resulting in the formation of a cAMP–CAP complex, which binds to the CAP site of the promoter, stimulating transcription. (b) In the presence of glucose, cAMP levels decrease, cAMP–CAP complexes are not formed, and transcription is not stimulated.
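The combination of negative control (the repressor) and positive control (cAMP–CAP) summarized in Figure 15.8 can be sketched as a small truth table. The Python below is a qualitative illustration only; the function name `lac_state` and the level labels "none", "diminished", and "maximal" are assumptions made for this sketch, not terms defined by the operon model itself.

```python
from typing import NamedTuple


class LacState(NamedTuple):
    repressor_on_operator: bool
    cap_on_cap_site: bool
    transcription: str  # "none", "diminished", or "maximal" (labels chosen here)


def lac_state(lactose: bool, glucose: bool) -> LacState:
    """Qualitative status of the lac operon for one growth condition."""
    # Negative control: the repressor leaves the operator only when bound by lactose.
    repressor_on_operator = not lactose
    # Glucose inhibits adenyl cyclase, so cAMP is plentiful only without glucose.
    camp_high = not glucose
    # Positive control: CAP binds the CAP site only as the cAMP-CAP complex.
    cap_on_cap_site = camp_high

    if repressor_on_operator:
        level = "none"
    elif cap_on_cap_site:
        level = "maximal"
    else:
        level = "diminished"
    return LacState(repressor_on_operator, cap_on_cap_site, level)


if __name__ == "__main__":
    for lactose in (False, True):
        for glucose in (False, True):
            print(f"lactose={lactose!s:5} glucose={glucose!s:5} -> {lac_state(lactose, glucose)}")
```

Treating the two controls as independent Boolean inputs mirrors the text's point that maximal transcription requires both relief of repression and binding of the cAMP–CAP complex.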


lactose (so as not to repress operon expression) and CAP must be bound to the CAP-binding site.

This leads to the central question about CAP: How does the presence of glucose inhibit CAP binding? The answer involves still another molecule, cyclic adenosine monophosphate (cAMP), which is a nucleotide with an adenine base, a ribose sugar, and a single phosphate bound to the sugar at both the 5′ and 3′ positions. In order to bind to the promoter, CAP must first be bound to cAMP. The level of cAMP is itself dependent on an enzyme, adenyl cyclase, which catalyzes the conversion of ATP to cAMP. The role of glucose in catabolite repression is now clear. It inhibits the activity of adenyl cyclase, causing a decline in the level of cAMP in the cell. Under this condition, CAP cannot form the cAMP–CAP complex essential to the positive stimulation of transcription of the lac operon.

Regulation of the lac operon by catabolite repression results in efficient energy use, because the presence of glucose will override the need for the metabolism of lactose, if lactose is also available to the cell. In contrast to the negative regulation conferred by the lac repressor, the action of cAMP–CAP constitutes positive regulation. Thus, a combination of positive and negative regulatory mechanisms determines transcription levels of the lac operon. Catabolite repression involving CAP has also been observed for other inducible operons, including those controlling the metabolism of galactose and arabinose.

ESSENTIAL POINT
The catabolite-activating protein (CAP) exerts positive control over lac gene expression by interacting with RNA polymerase at the lac promoter and by responding to the levels of cyclic AMP in the bacterial cell. ■

NOW SOLVE THIS
15.2 Predict the level of genetic activity of the lac operon as well as the status of the lac repressor and the CAP protein under the cellular conditions listed in the accompanying table.

       Lactose   Glucose
(a)       -         -
(b)       +         -
(c)       -         +
(d)       +         +

HINT: This problem asks you to combine your knowledge of the Jacob–Monod model of the regulation of the lac operon with your understanding of how catabolite repression impacts on this model. The key to its solution is to keep in mind that regulation involving lactose is a negative control system, while regulation involving glucose and catabolite repression is a positive control system.

15.4 The Tryptophan (trp) Operon in E. coli Is a Repressible Gene System

Although the process of induction had been known for some time, it was not until 1953 that Monod and colleagues discovered a repressible operon. When grown in minimal medium (see Chapter 8), wild-type E. coli produce the enzymes necessary for the biosynthesis of amino acids as well as many other essential macromolecules. Focusing his studies on the amino acid tryptophan and the enzyme tryptophan synthetase, Monod discovered that if tryptophan is present in sufficient quantity in the growth medium, the enzymes necessary for its synthesis are not produced. It is energetically advantageous for bacteria to repress expression of genes involved in tryptophan synthesis when ample tryptophan is present in the growth medium.

Further investigation showed that a series of enzymes encoded by five contiguous genes on the E. coli chromosome are involved in tryptophan synthesis. These genes are part of an operon, and in the presence of tryptophan, all are coordinately repressed and none of the enzymes are produced. Because of the great similarity between this repression and the induction of enzymes for lactose metabolism, Jacob and Monod proposed a model of gene regulation analogous to the lac system.

To account for repression, Jacob and Monod suggested the presence of a normally inactive repressor that alone cannot interact with the operator region of the operon. However, the repressor is an allosteric molecule that can bind to tryptophan. When tryptophan is present, the resultant complex of repressor and tryptophan attains a new conformation that binds to the operator, repressing transcription. Thus, when tryptophan, the end product of this anabolic pathway, is present, the system is repressed and enzymes are not made. Since the regulatory complex inhibits transcription of the operon, this repressible system is under negative control. And as tryptophan participates in repression, it is referred to as a corepressor in this regulatory scheme.

Evidence for the trp Operon

Support for the concept of a repressible operon was soon forthcoming, based primarily on the isolation of two distinct categories of constitutive mutations. The first class, trpR-, maps at a considerable distance from the structural genes. This locus represents the gene coding for the repressor. Presumably, the mutation inhibits either the repressor’s interaction with tryptophan or repressor formation entirely. Whichever the case, repression never occurs in cells with the trpR- mutation. As expected, if the trpR+ gene encodes a functional repressor molecule, the presence of a copy of this gene will restore repressibility.


The second constitutive mutation is analogous to that of the operator of the lactose operon, because it maps immediately adjacent to the structural genes. Furthermore, the insertion of a plasmid bearing a wild-type operator gene into mutant cells does not restore repression. This is what would be predicted if the mutant operator no longer interacts with the repressor–tryptophan complex.

A detailed model of the trp operon and its regulation is shown in Figure 15.9. The five contiguous structural genes (trpE, D, C, B, and A) are transcribed as a polycistronic message directing translation of the enzymes that catalyze the biosynthesis of tryptophan. As in the lac operon, a promoter region (trpP) represents the binding site for RNA polymerase, and an operator region (trpO) is bound by the repressor. In the absence of binding, transcription is initiated within the trpP–trpO region and proceeds along a leader sequence 162 nucleotides prior to the first structural gene (trpE). Within that leader sequence, still another regulatory site has been demonstrated, called an attenuator, the subject of Section 15.5. As we will see, this regulatory unit is an integral part of this operon’s control mechanism.
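The repressor-level logic of the trp operon, including the behavior of the two classes of constitutive mutations, can be sketched in a few lines of Python. This is a simplified illustration that ignores attenuation; the function name `trp_enzymes_made` and its parameters are hypothetical. It assumes that trpR alleles from the chromosome and any plasmid are pooled (the repressor is trans-acting) and that only the operator adjacent to the structural genes matters (the operator is cis-acting).

```python
def trp_enzymes_made(tryptophan, trp_r_alleles, adjacent_operator="+"):
    """Predict whether the tryptophan biosynthetic enzymes are produced.

    tryptophan        -- True if ample tryptophan (the corepressor) is present
    trp_r_alleles     -- trpR alleles in the cell, e.g. ["-"] or ["-", "+"]
                         (chromosome and plasmid copies pooled; trans-acting)
    adjacent_operator -- allele of the operator next to the structural genes:
                         "+" (wild type) or "c" (operator-constitutive)
    """
    # The repressor alone is inactive; it binds the operator only
    # as the repressor-tryptophan complex.
    functional_repressor = "+" in trp_r_alleles
    complex_forms = functional_repressor and tryptophan
    # A constitutive operator no longer interacts with the complex.
    operator_bound = complex_forms and adjacent_operator == "+"
    return not operator_bound


if __name__ == "__main__":
    print("wild type, Trp present:       ", trp_enzymes_made(True, ["+"]))
    print("wild type, Trp absent:        ", trp_enzymes_made(False, ["+"]))
    print("trpR-, Trp present:           ", trp_enzymes_made(True, ["-"]))
    print("trpR- plus plasmid trpR+:     ", trp_enzymes_made(True, ["-", "+"]))
    print("operator-constitutive, Trp:   ", trp_enzymes_made(True, ["+"], "c"))
```

Because the function looks only at the operator adjacent to the structural genes, a wild-type operator supplied on a plasmid has no way to change the result, which matches the observation that such a plasmid does not restore repression.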

[Figure 15.9 diagram.
(a) Components, 5′ to 3′: repressor gene (trpR); regulatory region consisting of promoter (trpP), operator (trpO), leader (L), and attenuator (A); structural genes trpE, trpD, trpC, trpB, and trpA, which together with the regulatory region make up the trp operon. Also shown: the repressor protein with its tryptophan-binding site, and tryptophan (corepressor).
(b) Tryptophan absent: The repressor alone cannot bind to the operator. Transcription proceeds, producing a polycistronic mRNA.
(c) Tryptophan present: The repressor binds to tryptophan, causing an allosteric transition. The repressor–tryptophan complex binds to the operator, and transcription is blocked.]

FIGURE 15.9 A repressible operon. (a) The components involved in regulation of the tryptophan operon. (b) In the absence of tryptophan, an inactive repressor is made that cannot bind to the operator (O), thus allowing transcription to proceed. (c) When tryptophan is present, it binds to the repressor, causing an allosteric transition to occur. This complex binds to the operator region, leading to repression of the operon.


ESSENTIAL POINT
Unlike the inducible lac operon, the trp operon is repressible. In the presence of tryptophan, the repressor binds to the regulatory region of the trp operon and represses transcription initiation. ■

EVOLVING CONCEPT OF THE GENE
The groundbreaking work of Jacob, Monod, and Lwoff in the early 1960s, which established the operon model for the regulation of gene expression in bacteria, expanded the concept of the gene to include noncoding regulatory sequences that are present upstream (5′) from the coding region. In bacterial operons, the transcription of several contiguous structural genes whose products are involved in the same biochemical pathway is regulated in a coordinated fashion.

15.5 RNA Plays Diverse Roles in Regulating Gene Expression in Bacteria

In the preceding sections of this chapter we focused on gene regulation brought about by DNA-binding regulatory proteins that interact with promoter and operator regions of the genes to be regulated. These regulatory proteins, such as the lac repressor and the CAP protein, act to decrease or increase transcription initiation from their target promoters by affecting the binding of RNA polymerase to the promoter.

Gene regulation in bacteria can also occur through the interactions of regulatory molecules with specific regions of a nascent mRNA, after transcription has been initiated. The binding of these regulatory molecules alters the secondary structure of the mRNA, leading to premature transcription termination or repression of translation. We will discuss three types of regulation involving RNA: attenuation, riboswitches, and small noncoding RNAs, abbreviated in bacteria as sRNAs. These types of regulation fine-tune levels of gene expression in bacteria.

Attenuation

Charles Yanofsky, Kevin Bertrand, and their colleagues observed that, when tryptophan is present and the trp operon is repressed, initiation of transcription still occurs at a low level but is subsequently terminated at a point about 140 nucleotides along the transcript. They called this process attenuation, as it “weakens or impairs” expression of the operon. In contrast, when tryptophan is absent or present in very low concentrations, transcription is initiated but is not subsequently terminated, instead continuing beyond the leader sequence into the structural genes.

Based on these observations, Yanofsky and colleagues presented a model to explain how attenuation occurs (Figure 15.10). They proposed that the initial DNA sequence that is transcribed gives rise to an mRNA sequence that has

[Figure 15.10 diagram.
(a) Stem-loop structures in the leader RNA sequence: regions 1, 2, 3, and 4, with the tryptophan (UGG) codons in region 1 and a run of uridines (UUUUUUU) following region 4. Translation is initiated on the leader.
(b) Alternative secondary structures of the leader RNA.
Charged tRNA^Trp not available: The ribosome stalls on the tryptophan codons, allowing formation of the 2-3 stem-loop (antiterminator hairpin). Transcription proceeds.
Charged tRNA^Trp available: The ribosome moves into the 1-2 region, preventing formation of the 2-3 stem-loop. The terminator hairpin forms, and transcription terminates.]

FIGURE 15.10 The attenuation model regulating the tryptophan operon.
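The decision summarized in Figure 15.10 depends on whether the ribosome stalls at the tryptophan codons of the leader transcript. A minimal Python sketch of that decision follows. The leader string used here is a shortened toy sequence invented for illustration, not the actual trp leader, and the function names are hypothetical; the sketch only tracks whether the ribosome stalls at an in-frame UGG codon and which hairpin follows from that.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Toy leader open reading frame (assumption for illustration only):
# AUG AAA GGU UGG UGG CGC UGA, with two adjacent Trp (UGG) codons.
TOY_LEADER = "AUGAAAGGUUGGUGGCGCUGA"


def ribosome_stall_position(leader_mrna, charged_trna_trp_available):
    """Return the index of the codon where the ribosome stalls, or None."""
    start = leader_mrna.find("AUG")
    if start == -1:
        raise ValueError("no AUG start codon in leader")
    for pos in range(start, len(leader_mrna) - 2, 3):
        codon = leader_mrna[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        # Without charged tRNA^Trp, the ribosome cannot read a UGG codon.
        if codon == "UGG" and not charged_trna_trp_available:
            return pos
    return None


def attenuation_outcome(leader_mrna, charged_trna_trp_available):
    """Map ribosome behavior to the hairpin formed and the fate of transcription."""
    stalled_at = ribosome_stall_position(leader_mrna, charged_trna_trp_available)
    if stalled_at is not None:
        # Stalled ribosome allows the 2-3 stem-loop to form.
        return {"hairpin": "antiterminator", "transcription": "proceeds"}
    # Ribosome moves on, the 2-3 stem-loop cannot form, terminator forms instead.
    return {"hairpin": "terminator", "transcription": "terminates"}


if __name__ == "__main__":
    for available in (True, False):
        print(available, attenuation_outcome(TOY_LEADER, available))
```

The sketch deliberately leaves out RNA folding itself; it encodes only the coupling between translation of the leader and the choice of hairpin that the model proposes.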


the potential to fold into two mutually exclusive stem-loop structures referred to as hairpins. If tryptophan is scarce, an mRNA secondary structure referred to as the antiterminator hairpin is formed. Transcription proceeds past the antiterminator hairpin region, and the entire mRNA is subsequently produced. Alternatively, in the presence of excess tryptophan, the mRNA structure that is formed is referred to as a terminator hairpin, and transcription is almost always terminated prematurely, just beyond a region called the attenuator.

A key point in Yanofsky’s model is that the transcript of the leader sequence (see Figure 15.9) must be translated for the antiterminator hairpin to form. This leader transcript contains two triplets (UGG) that encode tryptophan, and these are present just downstream of the AUG sequence that signals the initiation of translation by ribosomes. When adequate tryptophan is present, charged tRNA^Trp is present in the cell, whereby ribosomes translate these UGG triplets, proceed through the attenuator, and allow the terminator hairpin to form, and the operon is not transcribed. If cells are starved of tryptophan, charged tRNA^Trp will be unavailable and ribosomes “stall” during translation of the UGG triplets. Thus, the antiterminator hairpin forms within the leader transcript, and as a result, transcription proceeds, leading to expression of the entire set of structural genes.

Many other bacterial operons use attenuation to control gene expression. These include operons that encode enzymes involved in the biosynthesis of amino acids such as threonine, histidine, leucine, and phenylalanine. As with the trp operon, attenuation occurs in a leader sequence that contains an attenuator region.

Riboswitches

Since the elucidation of attenuation in the trp operon, numerous cases of gene regulation that also depend on alternative forms of mRNA secondary structure have been documented. These are examples of what are more generally referred to as riboswitches. As with attenuation of the trp operon discussed earlier, the mechanism of riboswitch regulation involves short ribonucleotide sequences (or elements) present in the 5′-untranslated regions (5′-UTRs) of mRNAs. These RNA elements are capable of binding with small molecule ligands, such as metabolites, whose synthesis or activity is controlled by the genes encoded by the mRNA. Such binding causes a conformational change in one domain of the riboswitch element, which induces another change at a second RNA domain, most often creating a transcription terminator structure. This terminator structure interfaces directly with the transcriptional machinery and shuts it down.

Riboswitches can recognize a broad range of ligands, including amino acids, purines, vitamin cofactors, amino sugars, and metal ions. They are widespread in bacteria. In Bacillus subtilis, for example, approximately 5 percent of this bacterium’s genes are regulated by riboswitches. They are also found in archaea, fungi, and plants and may prove to be present in animals as well.

The two important domains within a riboswitch are the ligand-binding site, called the aptamer, and the expression platform, which is capable of forming the terminator structure. Figure 15.11 illustrates the principles involved in riboswitch control. The 5′-UTR of an mRNA is shown on the left side of the figure in the absence of the ligand (metabolite). RNA polymerase has transcribed the unbound ligand-binding site,

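[Figure 15.11 diagram. Left: the ligand-binding site is unbound, the mRNA is in the antiterminator conformation, and RNA polymerase continues transcription. Right: the ligand binds, inducing conformational changes; the mRNA adopts the terminator conformation, and transcription is terminated.]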

FIGURE 15.11 An illustration of the mechanism of riboswitch regulation of gene expression, where the default position (left) is in the antiterminator conformation. Upon binding by the ligand, the mRNA adopts the terminator conformation (right).


and in the default conformation, the expression domain adopts an antiterminator conformation. Thus, transcription continues through the expression platform and into the coding region. On the right side of the figure, the presence of the ligand on the ligand-binding site induces an alternative conformation in the expression platform, creating the terminator conformation. RNA polymerase is effectively blocked, and transcription ceases.

Small Noncoding RNAs Play Regulatory Roles in Bacteria

Bacterial small noncoding RNAs (sRNAs) were discovered decades ago, and it is thought that E. coli contain roughly 80 to 100 different sRNAs, while other species are reported to have three times that number. sRNAs are generally between 50 and 500 nucleotides long and are involved in gene regulation and the modification of protein function. sRNAs involved in gene regulation are often transcribed from loci that partially overlap the coding genes that they regulate. However, they are transcribed from the opposite strand of DNA and in the opposite direction, making them complementary to mRNAs transcribed from that locus. In other cases, sRNAs are complementary to target mRNAs but are transcribed from loci that do not overlap target genes. sRNAs regulate gene expression by binding to mRNAs (usually at the 5′-end) that are being transcribed. In some cases, the binding of sRNAs to mRNAs blocks translation of the mRNA by masking the ribosome-binding site (RBS). In other cases, binding enhances translation by preventing secondary structures from forming in the mRNA that would block translation, often by masking the RBS (see Figure 15.12). Thus, sRNAs can be both negative and positive regulators of gene expression.

[Figure 15.12 diagram. Negative regulation: sRNA pairing inhibits ribosome binding at the RBS; translation is repressed. Positive regulation: sRNA pairing unmasks the RBS; translation proceeds.]

FIGURE 15.12 Bacterial small noncoding RNAs regulate gene expression. Bacterial sRNAs can be negative regulators of gene expression by binding to mRNAs and preventing translation by masking the ribosome-binding site (RBS), or they can be positive regulators of gene expression by binding to mRNAs and preventing secondary structures (that would otherwise mask an RBS), thereby enabling translation.

sRNAs have been shown to play important roles in gene regulation in response to changing environmental conditions or stress. For example, the sRNA DsrA of E. coli is upregulated in response to low temperature and promotes the expression of genes that enable the long-term survival of the cell under stressful conditions. In contrast, RyhB sRNA from E. coli is a negative regulator of gene expression. In response to low iron levels, RyhB is transcribed to inhibit the translation of several nonessential iron-containing enzymes so that the more critical iron-containing enzymes can utilize what little iron is present in the cytoplasm.

ESSENTIAL POINT
RNAs are sometimes involved in the regulation of gene expression in bacteria, including the process of attenuation, the involvement of riboswitches, and interactions involving small noncoding RNAs (sRNAs). ■

15.6 CRISPR-Cas Is an Adaptive Immune System in Bacteria

In the final section of our discussion of the regulation of gene expression in bacteria, we introduce a molecular mechanism by which bacteria respond to a specific bacteriophage (or simply phage) attack by expressing genetic information that leads to the destruction of that phage’s invading DNA.

While bacteria have evolved mechanisms to ward off phage infection, most of these mechanisms provide innate immunity because they are not tailored to a specific phage. For example, some bacteria possess mechanisms to prevent phage binding to the cell surface, block phage DNA from entering the cell, or induce suicide in infected cells to prevent the spread of infection. In contrast, adaptive immunity refers to a defense mechanism whereby past exposure to a pathogen stimulates a robust defense against future exposure to the same pathogen. For example, adaptive immunity in humans enables vaccines to provide protection against specific disease-causing viruses. While it was once thought that bacteria were not advanced enough to possess such immunity, we now know that bacteria possess an immune system called CRISPR-Cas that can be adapted to fight specific types of phage by preventing phage gene expression and subsequent phage reproduction.

The Discovery of CRISPR

A CRISPR is a genomic locus in bacteria that contains clustered regularly interspaced short palindromic repeats. Prior to the coining of this term, CRISPR loci were first identified in 1987 in the Escherichia coli genome based on a simple


[Figure 15.13 diagram. Streptococcus thermophilus CRISPR locus: a leader followed by spacers, each flanked by repeats.
Repeat: GTTTTTGTACTCTCAAGATTTAAGTAACTGTACAAC
Spacer 1: GAGCTACCAGCTACCCCGTATGTCAGAGAG (Streptococcus phage 20617)
Spacer 2: TTGAATACCAATGCCAGCTTCTTTTAAGGC (Streptococcus phage CHPC1151)
Spacer 3: TAGATTTAATCAGTAATGAGTTAGGCATAA (Streptococcus phage TP-778L)]

FIGURE 15.13 A CRISPR locus from the bacterium Streptococcus thermophilus (LMG18311). Spacer sequences are derived from portions of bacteriophage genomes and are flanked on either side by a repeat sequence. Only 3 of 33 total spacers in this CRISPR locus are shown.

description of repeated DNA sequences with nonrepetitive spacer sequences between them. Since then, CRISPR loci have been identified in approximately 50 percent of bacteria species and in approximately 90 percent of archaea, another type of prokaryote (Figure 15.13). The spacers remained a mystery until 2005 when three independent studies demonstrated that CRISPR spacer sequences were identical to fragments of phage genomes. This insight led to speculation that viral sequences within CRISPR loci serve as a “molecular memory” of previous viral attacks.

The first experimental evidence that CRISPRs are important for adaptive immunity came from an unexpected place. Danisco, a Danish food science company, sought to create a strain of Streptococcus thermophilus that was more resistant to phage, thus making it more efficient for use in the production of yogurt and cheese. Philippe Horvath’s lab at Danisco, in collaboration with others, found that when they exposed S. thermophilus to a specific phage, bacterial cells that survived became resistant to the same phage strain, but not to other phage strains. Furthermore, the resistant bacteria possessed new spacers within their CRISPR loci with an exact sequence match to portions of the genome of the phages by which they had been challenged.

Next, the Horvath lab showed that deletion of new spacers in the resistant strains abolished their phage resistance. Remarkably, the converse was also true; experimental insertion of new viral sequence-derived spacers into the CRISPR loci of sensitive bacteria rendered them resistant!

The CRISPR-Cas Mechanism for RNA-Guided Destruction of Invading DNA

Studies from several labs have elucidated the mechanism underlying the bacterial adaptive immune system. In addition to the CRISPR loci, adaptive immunity is dependent on a set of adjacent CRISPR-associated (cas) genes. The cas genes encode a wide variety of Cas proteins such as DNases, RNases, and proteins of unknown function. The CRISPR-Cas mechanism includes three steps outlined in Figure 15.14.

1. The first step is known as spacer acquisition. Invading phage DNA is cleaved into small fragments, which are directly inserted into the CRISPR locus to become new spacers. The Cas1 nuclease and an associated Cas2 protein are required for spacer acquisition. New spacers are inserted proximal to the leader sequence of the CRISPR locus, with older spacers being located progressively more distal. When new spacers are added, repeat sequences are duplicated such that each spacer is flanked by repeats.

2. In the second step, CRISPR loci are transcribed, starting at the promoter within the leader. These long transcripts are then processed into short CRISPR-derived RNAs (crRNAs), each containing a single spacer and repeat sequences (on either or both sides). This step is called crRNA biogenesis.

3. The third step is referred to as target interference. Mature crRNAs associate with Cas nucleases and recruit them to complementary sequences in invading phage DNA. The Cas nucleases then cleave the viral DNA, thus neutralizing infection.

In short, CRISPR-Cas is a simple system whereby bacteria incorporate small segments of viral DNA into their own genome and then express them as short crRNAs that guide a nuclease to cleave complementary viral DNA. Thus, CRISPR-Cas uses viral DNA sequences to specifically fight that same virus.

While the three steps of the CRISPR-Cas mechanism are conserved across bacterial species, the molecular details and Cas proteins involved are varied. For example, the well-studied CRISPR-Cas system of Streptococcus pyogenes uses a single nuclease called Cas9 for target interference while many other species require a large complex of Cas proteins.
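The three steps of the CRISPR-Cas mechanism (spacer acquisition, crRNA biogenesis, and target interference) can be sketched as a toy model. The Python sketch below is illustrative only: the repeat sequence, the phage sequences, the fixed spacer length, and all function names are hypothetical, and sequence identity on one strand stands in for base pairing between a crRNA and its complementary target.

```python
# Toy model of the three CRISPR-Cas steps. All sequences and names are
# hypothetical placeholders, not real repeats or phage genomes.

REPEAT = "GTTTTAGA"  # placeholder repeat sequence (assumption)


def acquire_spacer(locus, phage_genome, start, length=8):
    """Step 1, spacer acquisition: copy a phage fragment into the locus.

    The newest spacer is placed first (leader-proximal); older spacers
    end up progressively more distal.
    """
    spacer = phage_genome[start:start + length]
    return [spacer] + locus


def crrna_biogenesis(locus):
    """Step 2, crRNA biogenesis: transcribe the array, then process it.

    The pre-crRNA is modeled as repeat-spacer-repeat-...; processing is
    modeled by cutting at repeats, which leaves one spacer per crRNA.
    (Real crRNAs retain some flanking repeat sequence.)
    """
    pre_crrna = REPEAT + "".join(spacer + REPEAT for spacer in locus)
    return [piece for piece in pre_crrna.split(REPEAT) if piece]


def target_interference(crrnas, invading_dna):
    """Step 3, target interference: is any spacer found in the invader?

    A match stands in for crRNA-guided cleavage of the invading DNA.
    """
    return any(spacer in invading_dna for spacer in crrnas)


PHAGE_A = "ATGCCGTACGGATCCTTAGC"  # hypothetical phage genome
PHAGE_B = "TTTTTTTTTTTTTTTTTTTT"  # hypothetical unrelated phage

locus = acquire_spacer([], PHAGE_A, start=4)
crrnas = crrna_biogenesis(locus)
print(crrnas)                                # spacers carried by the crRNAs
print(target_interference(crrnas, PHAGE_A))  # previously seen phage
print(target_interference(crrnas, PHAGE_B))  # phage never encountered
```

In this sketch, resistance is specific to the phage that donated the spacer, which mirrors the observation that surviving S. thermophilus cells resisted the same phage strain but not others.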


[Figure 15.14 diagram labels: CRISPR locus (cas genes, leader, repeats, spacers); 1. Acquisition; CRISPR transcription; 2. crRNA biogenesis (pre-crRNA processed into mature crRNA); 3. Target interference (Cas nuclease and crRNA on target viral DNA).]

FIGURE 15.14 The mechanism of CRISPR-Cas adaptive immunity in bacteria.

Later in the text (see Chapter 17), we will see how CRISPR-Cas has been adapted as a powerful tool for genome editing with applications in gene therapy (see Special Topics Chapter 3: Gene Therapy) and gene-edited foods (see Special Topics Chapter 6: Genetically Modified Foods).

ESSENTIAL POINT
Bacteria have an adaptive immune system that uses an RNA-guided nuclease to cleave invading viral DNA in a sequence-specific manner. ■

CASE STUDY MRSA in the National Football League (NFL)

In 2013, there was an outbreak of methicillin-resistant Staphylococcus aureus (MRSA) at an NFL training facility. One player suffered a career-ending infection to his foot and sued the team owners for $20 million for unsanitary conditions that contributed to the bacterial infection. A settlement with undisclosed terms was reached in 2017. MRSA is highly contagious and is spread by direct skin contact or by airborne transmission and can result in amputation or death. In addition, MRSA is very difficult to treat because it is resistant to many antibiotics. For example, β-lactam antibiotics, such as penicillin, function by binding to and inactivating bacterial penicillin-binding proteins (PBPs), which synthesize the bacterial cell wall. However, MRSA expresses an alternative type of PBP, called PBP2a, encoded by the mecA gene. β-lactam antibiotics only weakly bind PBP2a, and thus cell wall synthesis can continue in their presence. Moreover, in a system somewhat analogous to the regulation of the lac operon, mecA is induced by the presence of β-lactam antibiotics and repressed in their absence. This “on-demand” expression of mecA means that when the infection is treated with antibiotics, the cells ramp up their resistance.

1. Speculate on how mecA expression is inducible and repressible based on what you know about the lac operon.

2. Chen et al. [Antimicrob. Agents Chemother. (2014) 58(2): 1047–1054] studied several strains of MRSA isolated in Taiwan and found that some single point mutations in the mecA promoter were linked to increased antibiotic resistance while other point mutations were linked to decreased antibiotic resistance. Why might mecA promoter mutations have these opposing effects?

3. What ethical responsibility do team owners have with respect to preventing the spread of pathogenic bacteria? What responsibilities must players assume to prevent infecting other athletes?

See CDC page: Methicillin-resistant Staphylococcus aureus (MRSA) (https://www.cdc.gov/mrsa/community/team-hc-providers/advice-for-athletes.html).


INSIGHTS AND SOLUTIONS

A hypothetical operon (theo) in E. coli contains several structural genes encoding enzymes that are involved sequentially in the biosynthesis of an amino acid. Unlike the lac operon, in which the repressor gene is separate from the operon, the gene encoding the regulator molecule is contained within the theo operon. When the end product (the amino acid) is present, it combines with the regulator molecule, and this complex binds to the operator, repressing the operon. In the absence of the amino acid, the regulatory molecule fails to bind to the operator, and transcription proceeds.

Categorize and characterize this operon, and then consider the following mutations, as well as the situation in which the wild-type gene is present along with the mutant gene in partially diploid cells (F′):

(a) Mutation in the operator region
(b) Mutation in the promoter region
(c) Mutation in the regulator gene

In each case, will the operon be active or inactive in transcription, assuming that the mutation affects the regulation of the theo operon? Compare each response with the equivalent situation for the lac operon.

Solution: The theo operon is repressible and under negative control. When there is no amino acid present in the medium (or the environment), the product of the regulatory gene cannot bind to the operator region, and transcription proceeds under the direction of RNA polymerase. The enzymes necessary for the synthesis of the amino acid are produced, as is the regulator molecule. If the amino acid is present, either initially or after sufficient synthesis has occurred, the amino acid binds to the regulator, forming a complex that interacts with the operator region, causing repression of transcription of the genes within the operon.

The theo operon is similar to the tryptophan system, except that the regulator gene is within the operon rather than separate from it. Therefore, in the theo operon, the regulator gene is itself regulated by the presence or absence of the amino acid.

(a) As in the lac operon, a mutation in the theo operator region inhibits binding with the repressor complex, and transcription occurs constitutively. The presence of an F′ plasmid bearing the wild-type allele would have no effect, since it is not adjacent to the structural genes.

(b) A mutation in the theo promoter region would no doubt inhibit binding to RNA polymerase and therefore inhibit transcription. This would also happen in the lac operon. A wild-type allele present in an F′ plasmid would have no effect.

(c) A mutation in the theo regulator gene, as in the lac system, may inhibit either its binding to the corepressor (the amino acid) or its binding to the operator. In both cases, transcription will be constitutive, because the theo system is repressible. Both cases result in the failure of the regulator to bind to the operator, allowing transcription to proceed. In the lac system, failure to bind the inducer lactose would permanently repress the system. The addition of a wild-type allele would restore repressibility, provided that this gene was transcribed constitutively.
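The logic of the solution can be written as a small truth function. This is a minimal sketch for a single copy of the operon, under the assumptions stated in the problem; the function and parameter names are hypothetical, each mutation is treated as a complete loss of function, and the F′ partial-diploid cases are not modeled.

```python
def theo_transcribed(amino_acid_present, operator_ok=True,
                     promoter_ok=True, regulator_ok=True):
    """Return True if the hypothetical theo operon is transcribed.

    Models a repressible operon under negative control, one copy only.
    Each *_ok flag set to False represents a loss-of-function mutation.
    """
    if not promoter_ok:
        # RNA polymerase cannot bind, so there is no transcription.
        return False
    # Repression requires all three: a functional regulator, the amino
    # acid (corepressor) bound to it, and an operator the complex can bind.
    repressed = regulator_ok and amino_acid_present and operator_ok
    return not repressed


for label, kwargs in [
    ("wild type", {}),
    ("(a) operator mutation", {"operator_ok": False}),
    ("(b) promoter mutation", {"promoter_ok": False}),
    ("(c) regulator mutation", {"regulator_ok": False}),
]:
    without = theo_transcribed(False, **kwargs)
    with_aa = theo_transcribed(True, **kwargs)
    print(f"{label}: no amino acid -> {without}, amino acid -> {with_aa}")
```

Operator and regulator mutations both give transcription whether or not the amino acid is present (constitutive expression), while the promoter mutation gives none in either condition.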

Problems and Discussion Questions

Mastering Genetics: Visit for instructor-assigned tutorials and problems.

1. HOW DO WE KNOW? In this chapter, we focused on the regulation of gene expression in bacteria. Along the way, we found many opportunities to consider the methods and reasoning by which much of this information was acquired. From the explanations given in the chapter, what answers would you propose to the following fundamental questions?
(a) How do we know that bacteria regulate the expression of certain genes in response to the environment?
(b) What evidence established that lactose serves as the inducer of a gene whose product is related to lactose metabolism?
(c) What led researchers to conclude that a repressor molecule regulates the lac operon?
(d) How do we know that the lac repressor is a protein?
(e) How do we know that the trp operon is a repressible control system, in contrast to the lac operon, which is an inducible control system?

2. CONCEPT QUESTION Review the Chapter Concepts list on p. 285. These all relate to the regulation of gene expression in bacteria. Write a brief essay that discusses why you think regulatory systems evolved in bacteria (i.e., what advantages do regulatory systems provide to these organisms?), and, in the context of regulation, discuss why genes related to common functions are found together in operons.

3. Contrast positive versus negative control of gene expression.

4. Contrast the role of the repressor in an inducible system and in a repressible system.

5. For the lac genotypes shown in the following table, predict whether the structural genes (Z) are constitutive, permanently repressed, or inducible in the presence of lactose.

Genotype               Constitutive   Repressed   Inducible
I+ O+ Z+                                          *
I- O+ Z+
I- O^C Z+
I- O^C Z+ / F′ O+
I+ O^C Z+ / F′ O+
I^S O+ Z+
I^S O+ Z+ / F′ I+


6. For the genotypes and conditions (lactose present or absent) shown in the following table, predict whether functional enzymes, nonfunctional enzymes, or no enzymes are made.

Genotype                     Condition    Functional    Nonfunctional   No Enzyme
                                          Enzyme Made   Enzyme Made     Made
I+ O+ Z+                     No lactose                                 *
I+ O^C Z+                    Lactose
I- O+ Z-                     No lactose
I- O+ Z-                     Lactose
I- O+ Z+ / F′ I+             No lactose
I+ O^C Z+ / F′ O+            Lactose
I+ O+ Z- / F′ I+ O+ Z+       Lactose
I- O+ Z- / F′ I+ O+ Z+       No lactose
I^S O+ Z+ / F′ O+            No lactose
I+ O^C Z+ / F′ O+ Z+         Lactose

7. The locations of numerous lacI- and lacI^S mutations have been determined within the DNA sequence of the lacI gene. Among these, lacI- mutations were found to occur in the 5′-upstream region of the gene, while lacI^S mutations were found to occur farther downstream in the gene. Are the locations of the two types of mutations within the gene consistent with what is known about the function of the repressor that is the product of the lacI gene?

8. Describe the experimental rationale that allowed the lac repressor to be isolated.

9. What properties demonstrate that the lac repressor is a protein? Describe the evidence that it indeed serves as a repressor within the operon system.

10. Predict the effect on the inducibility of the lac operon of a mutation that disrupts the function of (a) the crp gene, which encodes the CAP protein, and (b) the CAP-binding site within the promoter.

11. Erythritol is a natural sugar abundant in fruits and fermenting foods. Pathogenic bacterial strains that catabolize erythritol contain four closely spaced genes, all involved in erythritol metabolism. One of the four genes (eryD) encodes a product that represses the expression of the other three genes. Erythritol catabolism is stimulated by erythritol. Present a regulatory model to account for the regulation of erythritol catabolism in such bacterial strains. Does this system appear to be under inducible or repressible control?

12. Describe the role of attenuation in the regulation of tryptophan biosynthesis.

13. What is the major difference between the mechanism involved in attenuation and riboswitches and the mechanism involved in the regulation of the lactose operon?

14. A bacterial operon is responsible for the production of the biosynthetic enzymes needed to make the hypothetical amino acid tisophane (tis). The operon is regulated by a separate gene, R. The deletion of R causes the loss of enzyme synthesis. In the wild-type condition, when tis is present, no enzymes are made; in the absence of tis, the enzymes are made. Mutations in the operator gene (O-) result in repression regardless of the presence of tis. Is the operon under positive or negative control? Propose a model for (a) repression of the genes in the presence of tis in wild-type cells and (b) the mutations.

15. Bacterial sRNAs can bind to mRNAs through complementary binding to regulate gene expression. What determines whether the sRNA/mRNA binding will promote or repress mRNA translation?

16. Why is the CRISPR-Cas system of bacteria considered an adaptive immunity rather than an innate immunity?

17. How does the molecular mechanism of the CRISPR-Cas system use a viral DNA sequence against that same virus?

18. In the publication that provided the first evidence of CRISPR-Cas as an adaptive immune system [Barrangou, R., et al. (2007). Science. 315:1709–1712], the authors state that CRISPR-Cas “provides a historical perspective of phage exposure, as well as a predictive tool for phage sensitivity.” Explain how this is true using what you know about the CRISPR locus.

16 Regulation of Gene Expression in Eukaryotes

CHAPTER CONCEPTS

■ While transcription and translation are tightly coupled in bacteria, in eukaryotes, these processes are spatially and temporally separated, and thus independently regulated.
■ Chromatin remodeling, as well as modifications to DNA and histones, play important roles in regulating gene expression in eukaryotes.
■ Eukaryotic transcription initiation requires the assembly of transcription regulatory proteins on DNA sites known as promoters, enhancers, and silencers.
■ Following transcription, there are several mechanisms that regulate gene expression, referred to as posttranscriptional regulation.
■ Alternative splicing allows for a single gene to encode different protein isoforms with different functions.
■ RNA-binding proteins regulate mRNA stability, degradation, localization, and translation.
■ Noncoding RNAs may regulate gene expression by targeting mRNAs for destruction or translational inhibition.
■ Posttranslational modification of proteins can alter their activity or promote their degradation.

Chromosome territories in a human fibroblast cell nucleus. Each chromosome is stained with a different-colored probe.

Virtually all cells in a multicellular eukaryotic organism contain a complete genome; however, such organisms often possess different cell types with diverse morphologies and functions. This simple observation highlights the importance of the regulation of gene expression in eukaryotes. For example, skin cells and muscle cells differ in appearance and function because they express different genes. Skin cells express keratins, fibrous structural proteins that bestow the skin with protective properties. Muscle cells express high levels of myosin II, a protein that mediates muscle contraction. Skin cells do not express myosin II, and muscle cells do not express keratins.

In addition to gene expression that is cell-type specific, some genes are only expressed under certain conditions or at certain times. For example, when oxygen levels in the blood are low, such as at high altitude or after rigorous exercise, expression of the hormone erythropoietin is upregulated, which leads to an increase in red blood cell production and thus oxygen-carrying capacity.

Underscoring the importance of regulation, the misregulation of genes in eukaryotes is associated with developmental defects and disease. For instance, the overexpression of genes that regulate cellular growth can lead to uncontrolled cellular proliferation, a hallmark of cancer. Therefore, understanding the mechanisms that control gene expression in eukaryotes is of great interest and may lead to therapies for human diseases.



We will start this chapter by briefly comparing and contrasting eukaryotic gene expression with that of bacteria (see Chapter 15). Next, we will explore several molecular mechanisms that regulate gene expression in eukaryotes.

16.1 Organization of the Eukaryotic Cell Facilitates Gene Regulation at Several Levels

As you’ve previously seen (Chapter 15), in bacteria, the regulation of gene expression is often linked to the metabolic needs of the cell. For example, bacteria express genes to metabolize lactose when it is present in the environment. Gene expression in bacteria is largely controlled by mechanisms that exert positive or negative control over transcription and translation. While positive and negative regulation of transcription and translation are prominent regulatory mechanisms in eukaryotes as well, there are many important differences between these processes to consider, and several additional levels for gene regulation are found in eukaryotes. Figure 16.1 compares differences in the main mechanisms of gene regulation in bacteria and eukaryotes, which include the following:

■ Eukaryotic DNA, unlike that of bacteria, is associated with histones and other proteins to form chromatin. Eukaryotic cells decrease chromatin compaction to make genes accessible to transcription and increase chromatin compaction to inhibit transcription.
■ In bacteria, the processes of transcription and translation both take place in the cytoplasm and are coupled. However, eukaryotic cells are organized with DNA in the nucleus and ribosomes in the cytoplasm. Thus, transcription and translation are separated spatially and temporally.
■ Whereas bacterial mRNAs are translated directly, the mRNAs of many eukaryotic genes must be spliced, capped, and polyadenylated prior to export from the nucleus and translation in the cytoplasm. Each of these processes can be regulated in order to influence the numbers and types of mRNAs available for translation.
■ Bacterial mRNAs are often translated immediately and degraded very rapidly. In contrast, eukaryotic mRNAs can have long half-lives and can be transported, localized, and translated in specific subcellular destinations.
■ Proteins in bacteria and in eukaryotes can be posttranslationally modified by processes such as phosphorylation and methylation, which serve many functions,

[Figure 16.1 diagram labels. BACTERIA: DNA, RNA polymerase, transcription, mRNA, ribosome, translation, protein product, protein modification (all in the cytoplasm). EUKARYOTES: in the nucleus, chromatin, chromatin remodeling, DNA, RNA polymerase, transcription, pre-mRNA, spliceosome, splicing, cap, poly-A tail, mRNA; nuclear export through a nuclear pore in the nuclear membrane; in the cytoplasm, mRNA stability, mRNA localization, ribosome, translation, protein product, protein modification.]

FIGURE 16.1 A comparison of gene regulation in bacteria (left) and eukaryotes (right).


including the regulation of protein activities. However, the repertoire of posttranslational modifications in eukaryotes is more extensive, leading to additional regulatory opportunities.

ESSENTIAL POINT
Compared to bacteria, the regulation of gene expression in eukaryotes is more complex and includes additional steps such as the regulation of chromatin packaging and the regulation of mRNA processing. ■

16.2 Eukaryotic Gene Expression Is Influenced by Chromatin Modifications

Recall from earlier in the text (see Chapter 11) that eukaryotic DNA is combined with histones and nonhistone proteins to form chromatin. The basic structure of chromatin is characterized by repeating DNA–histone complexes called nucleosomes that are wound into 30-nm fibers, which in turn form other, even more compact structures. The presence of these compact chromatin structures inhibits many processes, including DNA replication, repair, and transcription. In this section, we outline some of the ways in which eukaryotic cells modify chromatin in order to regulate gene expression.

Chromosome Territories

Despite the widespread analogy that chromosomes within the nucleus look like a bowl of cooked spaghetti, the nucleus has an elegant architecture. Each chromosome within the interphase nucleus occupies a discrete domain called a chromosome territory and stays separate from other chromosomes (see the chapter opening image on page 302). Channels between chromosomes contain little or no DNA and are called interchromatin compartments.

Transcriptionally active genes are located at the edges of chromosome territories next to the channels of the interchromatin compartments. This organization brings actively transcribed genes into closer association with each other and with the transcriptional machinery, thereby facilitating their coordinated expression. Transcripts produced at the edge of chromosome territories move into the adjacent interchromatin compartments, which house RNA processing machinery and are contiguous with nuclear pores. This arrangement facilitates the capping, splicing, and poly-A tailing of mRNAs during and after transcription, and the eventual export of mRNAs into the cytoplasm.

Nuclear architecture and transcriptional regulation are interdependent; changes in nuclear architecture affect transcription, and changes in transcriptional activity necessitate changes in chromosome organization.

Histone Modifications and Chromatin Remodeling

The tight association of DNA with histones and other chromatin-binding proteins inhibits access to the DNA by proteins involved in many functions including transcription. This inhibitory conformation is often referred to as “closed” chromatin. Before transcription can be initiated, the structure of chromatin must become “open” to RNA polymerase.

One way in which chromatin conformation can switch between “open” and “closed” is by changes to histone composition in nucleosomes (see Chapter 11). Most nucleosomes contain histone H2A, while some nucleosomes contain variant histones, such as H2A.Z. Whereas nucleosomes are generally a physical barrier to RNA polymerases and DNA-binding transcriptional regulators, nucleosomes containing the H2A.Z variant are not as stable and thus are less of a barrier. Interestingly, H2A.Z-containing nucleosomes are enriched at sequences that regulate gene expression, which suggests that the histone composition of nucleosomes can influence transcription.

A second mechanism of chromatin alteration involves histone modification. Histone modification refers to the covalent addition of functional groups to the N-terminal tails of histone proteins. The most common histone modifications are added acetyl, methyl, or phosphate groups. Acetylation decreases the positive charge on histones, resulting in a reduced affinity of the histone for the negative charges on the backbone phosphates of DNA. This in turn may assist the formation of open chromatin conformations, which would allow the binding of transcription regulatory proteins to DNA.

Histone acetylation is catalyzed by histone acetyltransferase (HAT) enzymes. In some cases, HATs are recruited to genes by the presence of certain transcriptional activator proteins that bind to transcription regulatory regions. In other cases, transcriptional activator proteins themselves have HAT activity. Of course, what can be opened can also be closed. In that case, histone deacetylases (HDACs) remove acetyl groups from histone tails. HDACs can be recruited to genes by some transcriptional repressor proteins that bind to gene regulatory sequences.

A third alteration mechanism is chromatin remodeling, which involves the repositioning or removal of nucleosomes on DNA, brought about by chromatin remodeling complexes. Chromatin remodeling complexes are large multi-subunit enzymes that use the energy of ATP hydrolysis to move and rearrange nucleosomes. Repositioning of


nucleosomes makes regions of the chromosome accessible to transcription regulatory proteins and RNA polymerase.

See Special Topics Chapter ST1: Epigenetics for more information about chromatin modifications that regulate gene expression.

DNA Methylation

Another type of chromatin modification that plays a role in gene regulation in some eukaryotes is the enzyme-mediated addition or removal of methyl groups to or from bases in DNA. DNA methylation in eukaryotes most often occurs at position 5 of cytosine (5-methylcytosine), causing the methyl group to protrude into the major groove of the DNA helix. Methylation occurs most often on the cytosine of CG doublets in DNA, usually on both strands (m marks the methyl group on each cytosine):

5′ - mCpG - 3′
3′ - GpCm - 5′

Methylatable CpG sequences are not randomly distributed throughout the genome, but tend to be concentrated in CpG-rich regions, called CpG islands, which are often located in or near promoter regions. Roughly 70 percent of human genes have a CpG island in their promoter sequence.

Evidence suggests that DNA methylation represses gene expression. For example, large transcriptionally inert regions of the genome, such as the inactivated X chromosome in female mammalian cells (see Chapter 5), are often heavily methylated. Conversely, blocking methylation of genes on the inactivated X chromosome leads to their expression.

By what mechanism might methylation affect gene regulation? Data from in vitro studies suggest that methylation can repress transcription by inhibiting the binding of transcription factors to DNA. Methylated DNA may also recruit repressive chromatin remodeling complexes and HDACs to gene-regulatory regions.

It is important to know that while cytosine methylation is clearly an important mechanism for gene regulation in some eukaryotes, it is not uniformly true for all eukaryotes. For example, while DNA methylation is an important gene regulatory mechanism in humans, mice, and many plants, DNA methylation is absent in yeast and the roundworm Caenorhabditis elegans.

See Special Topics Chapter ST1: Epigenetics for more information about DNA methylation and the regulation of gene expression.

ESSENTIAL POINT
Chromatin modifications, such as nucleosomal chromatin remodeling, histone modifications, and DNA methylation, may either allow or inhibit the binding of transcriptional machinery to the DNA. ■

NOW SOLVE THIS
16.1 Cancer cells often have abnormal patterns of chromatin modifications. In some cancers, the DNA repair genes MLH1 and BRCA1 are hypermethylated on their promoter regions. Explain how this abnormal methylation pattern could contribute to cancer.
HINT: This problem involves an understanding of the types of genes that are mutated in cancer cells. The key to its solution is to consider how DNA methylation affects gene expression of cancer-related genes.

16.3 Eukaryotic Transcription Initiation Requires Specific Cis-Acting Sites

Earlier in the text (see Chapter 12), we noted that eukaryotes possess three different types of RNA polymerases (RNAPs). RNAP I and III transcribe ribosomal RNAs, transfer RNAs, and some small nuclear RNAs. RNAP II transcribes protein-coding genes and genes for some noncoding RNAs. For simplicity, in this chapter we will limit our discussion to regulation of genes transcribed by RNAP II, the best studied of the eukaryotic RNAPs.

Regulation of eukaryotic transcription requires the binding of many regulatory factors to specific DNA sequences located in and around genes, as well as to sequences located at great distances. In this section, we will discuss some of the DNA sequences (known as cis-acting elements) that are required for the accurate and regulated transcription of genes transcribed by RNAP II. As defined earlier in the text (see Chapter 12), cis-acting DNA elements are located on the same chromosome as the gene that they regulate. This is in contrast to trans-acting factors (such as DNA-binding proteins) that can influence the expression of a gene on any chromosome.

Promoters and Promoter Elements

A promoter is a region of DNA that is recognized and bound by the basic transcriptional machinery, including RNAP II and the general transcription factors (see Chapter 12). Promoters are required for transcription initiation and are located immediately adjacent to the genes they regulate. They specify the site or sites at which transcription begins (the transcription start site), as well as the direction of transcription along the DNA.

There are two subcategories within eukaryotic promoters. First, core promoters are the minimum part of the promoter needed for accurate initiation of transcription by


RNAP II. Core promoters are sequences 80 nucleotides long and include the transcription start site. Second, proximal promoter elements are generally located up to 250 nucleotides upstream of the transcription start site and contain binding sites for sequence-specific DNA-binding proteins that modulate the efficiency of transcription.

Core promoters are classified in two ways with respect to transcription start sites. Focused core promoters specify transcription initiation at a single specific start site. In contrast, dispersed core promoters direct initiation from a number of weak transcription start sites located over a 50- to 100-nucleotide region (Figure 16.2). The major type of initiation for most genes of lower eukaryotes is focused transcription initiation, while about 70 percent of vertebrate genes employ dispersed promoters. Focused promoters are usually associated with genes whose transcription levels are highly regulated in terms of time or place. Dispersed promoters, in contrast, are associated with genes that are transcribed constitutively, so-called housekeeping genes whose expression is required in almost all cell types. Thus, a single transcription start site may facilitate precise regulation of some genes, whereas multiple start sites may allow for a steady level of transcription of genes that are required constitutively.

[Figure 16.2 diagram labels: (a) Focused promoter: gene, one major transcript. (b) Dispersed promoter: gene, multiple transcripts.]

FIGURE 16.2 Focused and dispersed promoters. Focused promoters (a) specify one specific transcription initiation site. Dispersed promoters (b) specify weak transcription initiation at multiple start sites over an approximately 100-bp region. Transcription start sites and the directions of transcription are indicated with arrows.

While it is not yet clear how dispersed promoters specify multiple transcriptional start sites, much more is known about the structure of focused promoters. Within focused promoters, there are numerous core-promoter elements: short nucleotide sequences that are bound by specific regulatory factors. Such core-promoter elements vary between species and between genes within a species. However, focused promoters contain several common core-promoter elements, which are shown in Figure 16.3 and described in more detail below.

The Initiator element (Inr) encompasses the transcription start site, from approximately nucleotides -2 to +4, relative to the start site. In humans, the Inr consensus sequence is YYAN(A/T)YY (where Y indicates any pyrimidine nucleotide and N indicates any nucleotide). The transcription start site is the first A residue at +1. The TATA box is located at approximately -30 relative to the transcription start site and has the consensus sequence TATA(A/T)AAR (where R indicates any purine nucleotide). The TFIIB recognition element (BRE) is found at positions either immediately upstream or downstream from the TATA box. The motif ten element (MTE) and downstream promoter element (DPE) are located downstream of the transcription start site, at approximately +18 to +27 and +28 to +33, respectively.

Versions of the BRE, TATA box, and Inr elements appear to be universal components of all focused promoters. However, the MTEs and DPEs are found in only some of these promoters. This is not a comprehensive list of all known core-promoter elements; the important concept here is that the core-promoter elements serve as a platform for the assembly of RNAP II and the general transcription factors, which is a critical step in gene expression and will be discussed in Section 16.4.

[Figure 16.3 diagram, positions relative to the transcription start site at +1, on a scale from -40 to +40: BRE, -38 to -32; TATA, -30 to -24; BRE, -23 to -17; Inr, -2 to +4; MTE, +18 to +27; DPE, +28 to +33.]

FIGURE 16.3 Core-promoter elements found in focused promoters. Core-promoter elements are usually located between -40 and +40 nucleotides, relative to the transcription start site, indicated as +1. BRE is the TFIIB recognition element, which can be found on one side or other of the TATA box. TATA is the TATA box, Inr is the Initiator element, MTE is the motif ten element, and DPE is the downstream promoter element.

In addition to core-promoter elements, many genes also contain proximal-promoter elements located upstream of the TATA box and BRE. Proximal-promoter elements act along with the core-promoter elements to increase the levels of basal transcription. For example, the CAAT box is a common proximal-promoter element. The CAAT box has the consensus sequence CAAT or CCAAT and is usually located about 70 to 80 base pairs upstream from the


[Figure 16.4 plot: y-axis, relative transcription level (ticks at 0, 1.0, 3.0, and 3.5); x-axis, position in the promoter, with the GC, CCAAT, TATA, and initiation regions marked.]

FIGURE 16.4 Summary of the effects on transcription levels of different point mutations in the promoter region of the β-globin gene. Each line represents the level of transcription produced in a separate experiment by a single-nucleotide mutation (relative to wild type) at a particular location. Dots represent nucleotides for which no mutation was tested. Note that mutations within specific elements of the promoter have the greatest effects on the level of transcription.

transcription start site. Mutations on either side of this element have no effect on transcription, whereas mutations within the CAAT sequence dramatically lower the rate of transcription. Thus, for genes with a CAAT box, it appears to be required for robust transcription. Figure 16.4 summarizes the transcriptional effects of mutations in the CAAT box and other promoter elements. The GC box is another element often found in promoter regions and has the consensus sequence GGGCGG. It is located, in one or more copies, at about position -110 and is bound by transcription factors.

Enhancers and Silencers

In addition to promoters, transcription of eukaryotic genes is also influenced by DNA sequences called enhancers. While promoters are always found immediately upstream of a gene, enhancers can be located on either side of a gene, nearby or at some distance from the gene, or even within the gene. Some studies show that enhancers can be located as far away as a million base pairs from the genes they regulate. Like promoters, they are cis regulatory elements because they only serve to regulate genes on the same chromosome. However, there are several differences between promoters and enhancers.

1. While promoter sequences are essential for minimal or basal-level transcription, enhancers, as their name suggests, increase the rate of transcription. In addition, enhancers often confer time- and tissue-specific gene expression.
2. Whereas promoters must be immediately upstream of the genes they regulate, the position of an enhancer is not critical; it will function the same whether it is upstream, downstream, or within a gene.
3. Whereas promoters are orientation specific, an enhancer can be inverted, relative to the gene it regulates, without a significant effect on its action.

Another type of cis-acting regulatory element, the silencer, acts as a negative regulator of transcription. Silencers, like enhancers, are cis-acting short DNA sequence elements that may be located far upstream, downstream, or within the genes they regulate. Another similarity to enhancers is that they also often act in tissue- or temporal-specific ways to control gene expression.

ESSENTIAL POINT
Eukaryotic transcription is regulated by cis-acting regulatory sequences such as promoters, enhancers, and silencers. ■

16.4 Eukaryotic Transcription Initiation Is Regulated by Transcription Factors That Bind to Cis-Acting Sites

Promoters, enhancers, and silencers influence transcription initiation by acting as binding sites for transcription regulatory proteins, broadly termed transcription factors. In addition to the general transcription factors (GTFs) required for the basic process of transcription initiation (see Chapter 12), some transcription factors serve to increase the levels of transcription initiation and are known as activators, while others reduce transcription levels and are known as repressors.
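Because transcription factors recognize short sequence motifs, the consensus sequences quoted earlier for the Inr, TATA box, CAAT box, and GC box can be searched for directly in a DNA string. The following is a minimal sketch, not a real promoter-prediction method: the function names and the toy sequence are hypothetical, W is the standard IUPAC code standing in for the "A or T" position, and a real analysis would also weigh the position of each hit relative to the transcription start site.

```python
import re

# IUPAC codes used by the consensus sequences in the text
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "[CT]",    # any pyrimidine
    "R": "[AG]",    # any purine
    "W": "[AT]",    # A or T
    "N": "[ACGT]",  # any nucleotide
}

# Consensus sequences as given in the chapter
CONSENSUS = {
    "Inr": "YYANWYY",
    "TATA box": "TATAWAAR",
    "CAAT box": "CCAAT",
    "GC box": "GGGCGG",
}


def consensus_to_pattern(consensus):
    """Translate a consensus string into a regex; the lookahead allows overlapping hits."""
    body = "".join(IUPAC[base] for base in consensus)
    return re.compile("(?=(" + body + "))")


def find_elements(seq):
    """Return (element name, 0-based start, matched text) for every hit, sorted by start."""
    seq = seq.upper()
    hits = []
    for name, consensus in CONSENSUS.items():
        for match in consensus_to_pattern(consensus).finditer(seq):
            hits.append((name, match.start(), match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical toy sequence, built only to contain one copy of each element
toy_sequence = "GGGCGGACCCAATGATATAAAAGGCTCAGTCT"
for hit in find_elements(toy_sequence):
    print(hit)
```

Short motifs like these occur by chance throughout a genome, so a match on its own does not show that a site is functional; the mutation data summarized in Figure 16.4 are the kind of evidence needed for that.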


Transcription factors can modulate gene expression as appropriate for different cell types, in response to environmental cues, or at the correct time in development. To do this, transcription factors may be expressed in only certain types of cells, thereby achieving tissue-specific regulation of their target genes. Some transcription factors are expressed in cells only at certain times during development or in response to certain external or internal signals. In some cases, a transcription factor may be present in a cell but will only become active when modified by phosphorylation or by binding to another molecule such as a hormone. These modifications to transcription factors may also be regulated in tissue- or temporal-specific ways. Finally, the inputs of multiple transcription factors binding to different enhancers and promoter elements are integrated to fine-tune the levels and timing of transcription initiation.

The Human Metallothionein 2A Gene: Multiple Cis-Acting Elements and Transcription Factors

The human metallothionein 2A (MT2A) gene provides an example of how the transcription of one gene can be regulated by the interplay of multiple promoter and enhancer elements and the transcription factors that bind them. The product of the MT2A gene is a protein that binds to heavy metals such as zinc and cadmium, thereby protecting cells from the toxic effects of high levels of these metals. The protein is also implicated in protecting cells from the effects of oxidative stress. The MT2A gene is expressed at a low or basal level in all cells but is transcribed at high levels when cells are exposed to heavy metals or stress hormones such as glucocorticoids.

The cis-acting regulatory elements controlling transcription of the MT2A gene include promoter, enhancer, and silencer elements (Figure 16.5). Core-promoter elements, such as the TATA box and Inr, are bound by several general transcription factors and RNAP II and are required for transcription initiation. The proximal-promoter element, GC, is bound by the SP1 factor, which is present at all times in most eukaryotic cells and stimulates transcription at low levels. Expression levels are also modulated in response to extracellular growth signals via the activator proteins 1, 2, and 4 (AP1, AP2, and AP4), which are present at various levels in different cell types. AP2 binds an enhancer called the AP2 response element (ARE). AP1 and AP4 bind overlapping sites within the basal element (BLE), which provides some degree of selectivity in how these factors stimulate transcription of MT2A when bound to the BLE in different cell types.

High levels of MT2A transcription are stimulated by heavy-metal toxicity or stress. The heavy-metal response employs an activator called metal-inducible transcription factor 1 (MTF-1). MTF-1 is normally found in the cytoplasm but translocates to the nucleus in the presence of heavy metals. Direct metal binding induces conformational changes that lead to MTF-1’s nuclear translocation. In the nucleus, MTF-1 binds several metal response elements (MREs) that enhance MT2A transcription.

A different activator and enhancer mediate stress-induced transcription of MT2A: The glucocorticoid receptor is an activator that binds to the glucocorticoid response element (GRE). Under stressful conditions, vertebrates secrete a steroid hormone called glucocorticoid. Upon glucocorticoid binding, the glucocorticoid receptor, which is normally located in the cytoplasm, undergoes a conformational change that allows it to enter the nucleus, bind to the GRE, and enhance MT2A gene transcription. In addition to activation, transcription of the MT2A gene can be repressed by the actions of the repressor protein PZ120, which binds over the transcription start site.

The presence of multiple regulatory elements and transcription factors that bind to them allows the MT2A gene to be transcriptionally activated or repressed in response to subtle changes in both extracellular and intracellular conditions.

ESSENTIAL POINT
Transcription factors influence transcription by binding to promoters, enhancers, and silencers. ■
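One way to keep the MT2A example straight is to tabulate which factor binds which element and under what condition. The sketch below is only a bookkeeping aid built from the description above; the signal names (`growth_signal`, `heavy_metal`, `glucocorticoid`), the function names, and the three-level output are hypothetical simplifications, and the model says nothing about actual transcription rates.

```python
# (cis-acting element, transcription factor, condition under which the factor acts)
# "always" marks SP1, which the text describes as present at all times in most cells.
MT2A_SITES = [
    ("GC", "SP1", "always"),
    ("BLE", "AP1", "growth_signal"),
    ("BLE", "AP4", "growth_signal"),
    ("ARE", "AP2", "growth_signal"),
    ("MRE", "MTF-1", "heavy_metal"),
    ("GRE", "glucocorticoid receptor", "glucocorticoid"),
]


def bound_activators(signals):
    """List (element, factor) pairs expected to be occupied for a set of signals."""
    signals = set(signals)
    return [
        (element, factor)
        for element, factor, condition in MT2A_SITES
        if condition == "always" or condition in signals
    ]


def qualitative_level(signals, repressor_bound=False):
    """Very coarse summary: 'repressed', 'high', or 'basal/modulated' (an assumption)."""
    if repressor_bound:  # PZ120 bound over the transcription start site
        return "repressed"
    if {"heavy_metal", "glucocorticoid"} & set(signals):
        return "high"
    return "basal/modulated"


print(bound_activators({"heavy_metal"}))
print(qualitative_level({"heavy_metal"}))
```

Treating repression as overriding activation is a choice made for this sketch; the text states only that PZ120 can repress the gene.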

[Figure 16.5 diagram: the MT2A regulatory region on a scale from -250 to the transcription start site. Elements shown, left to right: GRE, ARE, ARE, ARE, MRE, MRE, BLE, MRE, GC, MRE, TATA, Inr. Factors shown below their binding sites: glucocorticoid receptor, AP2, MTF-1, AP1, AP4, SP1, TFIID, and the repressor PZ120.]

FIGURE 16.5 The human metallothionein 2A gene promoter and enhancer regions, containing multiple cis-acting regulatory sites. The transcription factors controlling both basal and induced levels of MT2A transcription are indicated below the gene, with arrows pointing to their binding sites.


16.5 Activators and Repressors Interact with General Transcription Factors and Affect Chromatin Structure

We have thus far discussed how chromatin remodeling and chromatin modifications are necessary to make cis-acting regulatory sequences accessible to binding by transcription factors. In this section we will see how these events come together to regulate the initiation of transcription by facilitating or inhibiting the binding of RNAP II to promoters, and also by regulating RNAP II activity.

Formation of the RNA Polymerase II Transcription Initiation Complex

A critical step in the initiation of transcription is the formation of a pre-initiation complex (PIC). The PIC consists of RNAP II and several general transcription factors (GTFs), which assemble onto the promoter in a specific order and provide a platform for RNAP II to recognize transcription start sites and to initiate transcription.

The GTFs that assist RNAP II at a core promoter are called TFIIA (Transcription Factor for RNAP IIA), TFIIB, TFIID, TFIIE, TFIIF, TFIIH, and a large multi-subunit complex called Mediator. The GTFs and their interactions with the core promoter and RNAP II are outlined in Figure 16.6 and described as follows.

The first step in the formation of a PIC (Step 1 in Figure 16.6) is the binding of TFIID to the TATA box. TFIID is a multi-subunit complex that contains TBP (TATA Binding Protein) and approximately 13 proteins called TAFs (TBP Associated Factors). In addition to binding the TATA box, TFIID binds other core-promoter elements such as Inr elements, DPEs, and MTEs. TFIIA interacts with TFIID and assists the binding of TFIID to the core promoter. Once TFIID has made contact with the core promoter, TFIIB binds to BREs near the TATA box (Step 2 in Figure 16.6). Finally, the other GTFs (Mediator, IIF, IIE, and IIH) help recruit RNAP II to the promoter (Steps 3 and 4). The fully formed PIC mediates the unwinding of promoter DNA at the start site and the transition of RNAP II from transcription initiation to elongation.

[Figure 16.6 diagram: promoter region from -50 to +10 showing the TATA box and transcription start site, with PIC assembly in four steps: (1) TFIID (TBP plus TAFs) and TFIIA; (2) TFIIB; (3) RNAP II, TFIIF, and Mediator; (4) TFIIE and TFIIH.]

FIGURE 16.6 The assembly of general transcription factors (TFIIA, TFIIB, etc.; abbreviated as IIA, IIB, etc.) required for the initiation of transcription by RNAP II.

Mechanisms of Transcription Activation and Repression

Researchers have proposed several models by which activators and repressors modulate transcription. Common to these models is the formation of DNA loops that bring distant enhancers and silencers into close physical proximity with the promoter regions of the genes that they regulate.

In the recruitment model, DNA loops serve to recruit activators and GTFs to the promoter region, which increases the rate of PIC assembly and/or stability and stimulates release of RNAP II from the PIC. Direct interactions between activators and repressors with Mediator and TFIID may serve to close DNA loops between promoters and enhancers. In other cases, proteins called coactivators serve as a bridge between activators and promoter-bound GTFs. Large complexes of activators and coactivators that come together to direct transcriptional activation are called enhanceosomes (Figure 16.7). In a similar way, repressors bound at silencer elements may decrease the rate of PIC assembly and the release of RNAP II.

In the chromatin alterations model, interactions within DNA loops may facilitate changes in chromatin compaction


that promote or repress transcription. For example, some activators physically interact with chromatin modifiers such as histone acetyl transferases (HATs) to promote an “open” chromatin conformation (see Section 16.2). In this scenario, activators bound to enhancers may recruit HATs to promoter regions to make the chromatin more accessible for PIC formation. Similarly, some repressors recruit histone deacetylases (HDACs) to promoter regions, which makes the DNA less accessible for transcription.

In the nuclear relocation model, DNA looping may relocate a target gene to a nuclear region that is favorable or inhibitory to transcription, that is, to regions of the nucleus that contain high or low concentrations of RNAP II and transcription regulatory factors.

Importantly, these three models are not mutually exclusive; some genes may be regulated by mechanisms that include all three models at the same time.

[Figure 16.7 diagram: a DNA loop brings an enhancer, bound by an enhanceosome (activator and coactivator proteins), next to the pre-initiation complex at the TATA box (Mediator, RNAP II, TFIIA, TFIIB, TFIIE, TFIIF, TFIIH, TBP, and TAFs).]

FIGURE 16.7 Formation of a DNA loop allows factors that bind to an enhancer or silencer at a distance from a promoter to interact with general transcription factors in the pre-initiation complex and to regulate the level of transcription.

NOW SOLVE THIS

16.2 The hormone estrogen converts the estrogen receptor (ER) protein from an inactive molecule to an active transcription factor. The activated ER binds to cis-acting sites that act as enhancers, located near the promoters of a number of genes. In some tissues, the presence of estrogen appears to activate transcription of ER-target genes, whereas in other tissues, it appears to repress transcription of those same genes. Offer an explanation as to how this may occur.

HINT: This problem involves an understanding of how transcription enhancers and silencers work. The key to its solution is to consider the many ways that trans-acting factors can interact at enhancers to bring about changes in transcription initiation.

ESSENTIAL POINT
Activators and repressors act by enhancing or repressing the formation of a pre-initiation complex at the promoter, opening or closing chromatin, and/or relocating genes to specific nuclear sites. ■

16.6 Regulation of Alternative Splicing Determines Which RNA Spliceforms of a Gene Are Translated

As you learned earlier in the text (see Chapter 12), eukaryotic pre-mRNA transcripts are processed by the addition of a 5′-cap, the synthesis of a 3′ poly-A tail, and removal of noncoding introns through the process of splicing. However, the pre-mRNAs of many eukaryotic genes may be spliced in alternative ways to generate different spliceforms that include or omit different exons. This process, known as alternative splicing, enables a single gene to encode more than one variant of its protein product. These variants, known as isoforms, differ in the amino acids encoded by differentially included or excluded exons. Isoforms of the same gene may have different functions. Even small changes to the amino acid sequence of a protein may alter the active site of an enzyme, modify the DNA-binding specificity of a transcription factor, or change the localization of a protein within the cell. Thus, alternative splicing is important for the regulation of gene expression.

An elegant example of how protein activity can be modulated by alternative splicing is evidenced by the Drosophila Mhc gene, which encodes a motor protein responsible for muscle contraction. Different isoforms of this protein are expressed in different types of muscle with slightly different contractile properties. When an embryo-specific isoform is expressed in flight muscles, it slows the kinetic properties of the flight muscles and the flies beat their wings at a lower frequency! Thus, in this example we see evidence that alternative splicing regulates gene expression by specifying isoforms with functions that are specific to the cells they are expressed in.

Types of Alternative Splicing

There are many different ways in which a pre-mRNA may be alternatively spliced (Figure 16.8). One example involves cassette exons; such exons may be excluded from the mature mRNA by joining the 3′-end of the upstream exon to the 5′-end of the downstream exon. Skipping of cassette exons is the most prevalent type of alternative splicing in animals, accounting for nearly 40 percent of the alternative splicing events.
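Cassette-exon skipping can be pictured as joining an ordered list of exons while leaving one out. The sketch below illustrates only that idea; the exon names and RNA sequences are invented for the example, introns are assumed to have been removed already, and nothing here models how the spliceosome actually chooses splice sites.

```python
def splice(exons, skip=()):
    """Join exon sequences in order, omitting any exon whose name is in `skip`."""
    return "".join(sequence for name, sequence in exons if name not in skip)


# Hypothetical three-exon pre-mRNA (exon sequences only)
pre_mrna_exons = [
    ("exon1", "AUGGCU"),
    ("exon2", "GAAUUC"),  # treated here as the cassette exon
    ("exon3", "UGGUAA"),
]

inclusion_form = splice(pre_mrna_exons)                 # all three exons joined
skipping_form = splice(pre_mrna_exons, skip={"exon2"})  # exon1 joined directly to exon3

print(inclusion_form)
print(skipping_form)
```

In this toy case the cassette exon is a multiple of three nucleotides long, so skipping it removes two codons without shifting the reading frame; a cassette exon of another length would change the reading frame downstream.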


[Figure 16.8 diagram: for each type of alternative splicing, a pre-mRNA and two resulting mRNA spliceforms (each 5′ to 3′ with a poly-A tail). Types shown: cassette exons; alternative 5′ (or 3′) splice site; intron retention; mutually exclusive exons; alternative promoters; alternative polyadenylation.]

FIGURE 16.8 Different types of alternative splicing events. Exons are indicated by boxes with introns depicted by solid thick lines between them. Alternative splicing is indicated by thin red lines either above or below the pre-mRNA. Transcription start sites (bent arrows) and polyadenylation signals (Poly A) are indicated in the alternative promoters and polyadenylation examples.

In slightly over a quarter of the alternative splicing events in animals, splicing occurs at an alternative splice site within an exon that may be upstream or downstream of the normally used splice site. While some of these splice events are likely “noise,” or errors in splice site selection by the spliceosome (Chapter 12), some instances of alternative splice site usage are important regulatory events.

Intron retention is the most common type of alternative splicing event in plants, fungi, and single-celled eukaryotes known as protozoa but is rare in mammals. In some cases, introns, which are normally noncoding sequences, are included in the mature mRNAs and are translated, producing novel isoforms. In other cases, intron retention serves to negatively regulate gene expression at the posttranscriptional level; such mRNAs are degraded or are retained in the nucleus.

In rare cases, splicing is co-regulated for a cluster of two or more adjacent exons such that inclusion of one exon leads to the exclusion of the others in the same cluster. The use of these so-called mutually exclusive exons allows for swapping of protein domains encoded by different exons.

Pre-mRNAs with different 5′- and 3′-ends may be produced from the same gene due to different transcription initiation and termination sites. Some genes have alternative promoters, so they have more than one site where transcription may be initiated. Transcription from alternative promoters produces pre-mRNAs with different 5′-exons, which may be alternatively spliced to downstream exons. Tissue-specific expression of isoforms may result from different transcription factors recognizing different promoters of a gene in different tissues.

Spliceforms with different 3′-ends are produced by alternative polyadenylation. The polyadenylation signal (Chapter 12) is a sequence that directs transcriptional termination and addition of a poly-A tail. Thus, when a polyadenylation signal is transcribed, transcription is soon terminated and any downstream exon sequences are omitted. However, when an exon containing a polyadenylation signal


is skipped, downstream exons are included and a downstream polyadenylation signal will be used. While alternative polyadenylation may produce spliceforms with different coding sequences, it also specifies different 3′ untranslated regions (UTRs) that are important for other posttranscriptional regulatory events discussed later in this chapter.

Figure 16.9 presents an example of alternative splicing of the pre-mRNA transcribed from the mammalian calcitonin/calcitonin gene-related peptide (CT/CGRP) gene. In thyroid cells, the CT/CGRP pre-mRNA is spliced to produce a mature mRNA containing the first four exons only. In these cells, the polyadenylation signal in exon 4 triggers transcription termination and addition of a poly-A tail. Thus, exons 5 and 6 are omitted. This mRNA is translated and processed into the calcitonin (CT) peptide, a 32-amino-acid peptide hormone that regulates calcium levels in the blood. In neurons, the CT/CGRP primary transcript is alternatively spliced to skip exon 4. Since the polyadenylation signal in exon 4 is quickly spliced out during transcription of the pre-mRNA, transcription continues, exons 5 and 6 are included, and the polyadenylation signal of exon 6 is recognized. The CGRP mRNA is translated and processed into a 37-amino-acid peptide hormone (CGRP) that stimulates the dilation of blood vessels. Through alternative splicing, the expression of the CT/CGRP gene is regulated such that two peptide hormones with different structures and functions are synthesized in different cell types.

[Figure 16.9 diagram: the primary transcript (exons 1 to 6, with two poly-A signals) in the middle; above it, the CT mRNA (exons 1, 2, 3, 4; thyroid cells), translated and processed into the calcitonin peptide; below it, the CGRP mRNA (exons 1, 2, 3, 5, 6; neuronal cells), translated and processed into CGRP.]

FIGURE 16.9 Alternative splicing of the CT/CGRP gene transcript. The primary transcript, which is shown in the middle of the diagram, can be spliced into two different mRNAs, both containing the first three exons but differing in their final exons.

Alternative Splicing and the Proteome

Alternative splicing increases the number of proteins that can be made from each gene. As a result, the number of proteins that an organism can make, its proteome, may greatly exceed the number of genes in the genome. Alternative splicing is found in plants, fungi, and animals but is especially common in vertebrates, including humans. Deep sequencing of RNA from human cells suggests that over 95 percent of human multi-exon genes undergo alternative splicing. While not all of these splicing events affect protein-coding sequences, it is clear that alternative splicing contributes greatly to human proteome diversity.

How many different polypeptides can be produced through alternative splicing of the same pre-mRNA? One answer to that question comes from research on the Dscam gene in Drosophila melanogaster. The mature Dscam mRNA contains 24 exons; however, the pre-mRNA includes different alternative options for exons 4, 6, 9, and 17 (Figure 16.10). There are 12 alternatives for exon 4; 48 alternatives for exon 6; 33 alternatives for exon 9; and 2 alternatives for exon 17. The number of possible combinations that could be formed in this way (12 × 48 × 33 × 2) suggests that, theoretically, the Dscam gene can produce 38,016 different proteins. Although this is an impressive number of isoforms, does the Drosophila nervous system require all these alternatives? Recent research suggests that it does.

[Figure 16.10 diagram: genomic DNA and pre-mRNA, mRNA, and protein for Dscam, with exon 4 (12 alternatives), exon 6 (48 alternatives), exon 9 (33 alternatives), and exon 17 (2 alternatives) marked.]

FIGURE 16.10 Alternative splicing of the Drosophila Dscam gene mRNA. Organization of the Dscam gene and the pre-mRNA. The Dscam gene encodes a protein that guides axon growth during development. Each mRNA will contain one of the 12 possible exons for exon 4 (red), one of the 48 possible exons for exon 6 (blue), one of the 33 possible exons for exon 9 (green), and one of the 2 possible exons for exon 17 (yellow).

During nervous system development, neurons must accurately connect with each other. Even in Drosophila, with only about 250,000 neurons, this is a formidable task. Neurons have cellular processes called axons that form connections with other nerve cells. The Drosophila Dscam gene encodes a protein that guides axon growth, ensuring


that neurons are correctly wired together. Each neuron expresses a different subset of Dscam protein isoforms, and in vitro studies show that each Dscam isoform binds to the same isoform, but not to others. Therefore, it appears that the diversity of Dscam isoforms provides a molecular identity tag for each neuron, helping guide axons to the correct target and preventing miswiring of the nervous system.

The Drosophila genome contains about 14,000 protein-coding genes, but the Dscam gene alone encodes 2.5 times that many proteins. Because alternative splicing is far more common in vertebrates, the suite of proteins that can be produced from the human genome may be astronomically high. A large-scale mass spectrometry study of the human proteome found that the 20,000 protein-coding genes in the human genome can produce at least 290,000 different proteins. See Chapter 18 for additional information on proteomes.

Regulation of Alternative Splicing

For pre-mRNAs that are alternatively spliced, how are specific splicing patterns selected? How does the spliceosome select one splice site instead of another? We know that this process is highly regulated, with some spliceforms only present in some cell types or under certain conditions. Many RNA-binding proteins (RBPs), which are a class of proteins that bind to specific RNA sequences or RNA secondary structures, are involved in the regulation of alternative splicing. Since RBPs often exhibit tissue-specific expression, they are important regulators of tissue-specific alternative splicing. RBPs may act by binding and hiding splice sites to promote the use of alternative sites, by binding near alternative splice sites to recruit the spliceosome to such sites, and/or by directly interacting with the splicing machinery.

Alternative Splicing and Human Diseases

Since alternative splicing is an important mechanism for the regulation of gene expression, it is not surprising that defects in alternative splicing are associated with human diseases. Genetic disorders caused by mutations that disrupt RNA splicing are known as spliceopathies.

Myotonic dystrophy, abbreviated from the Greek term dystrophia myotonica as DM, is an autosomal dominant disorder that afflicts 1 in 8000 individuals. Patients with DM exhibit myotonia (inability to relax muscles), muscle wasting, insulin resistance, cataracts, intellectual disability, and cardiac muscle problems. Studies have shown that several of these symptoms are caused by widespread alternative splicing defects in muscle cells and neurons.

There are two forms of DM (DM1 and DM2), which are caused by mutations in different genes, but with similar outcomes that lead to splicing defects. DM1 is caused by expansion of a CTG repeat in the 3′-UTR of the DMPK gene. Unaffected individuals have 5 to 35 copies of the CTG repeat, whereas patients with DM1 have 150 to 2000 copies. The severity of the symptoms is directly related to the number of repeats. DM2 is caused by an expansion of a CCTG repeat sequence within the first intron of the CNBP gene (also known as ZNF9). Unaffected individuals have 11 to 26 repeats, while patients with DM2 have over 11,000 copies of this repeat; the severity of symptoms is not related to the number of repeats.

Interestingly, DM is not caused by defects in the proteins encoded by DMPK and CNBP. Rather, repeat-containing RNAs accumulate in the nucleus, instead of being exported to the cytoplasm, and are bound by proteins that regulate alternative splicing. In this way, these RNAs sequester splicing regulators and prevent them from regulating many RNAs that encode proteins important for muscle and neuron function. Strategies to degrade the repeat-containing RNAs, or to block the binding of the splicing regulators to the RNAs, are currently being researched for therapeutic purposes.

ESSENTIAL POINT
The pre-mRNAs of many eukaryotic genes undergo alternative splicing to produce different spliceforms encoding different protein isoforms, which may have different functions. Defects in alternative splicing are associated with several human diseases. ■

16.7 Gene Expression Is Regulated by mRNA Stability and Degradation

The steady-state level of an mRNA (meaning the total amount at any one point in time) is a function of the rate at which the gene is transcribed and the rate at which the mRNA is degraded. The steady-state level determines the amount of mRNA that is available for translation. mRNA stability can vary widely between different mRNAs, lasting a few minutes to several days, and can be regulated in response to the needs of the cell. Thus, the molecular mechanisms that control mRNA stability and degradation play important roles in the regulation of gene expression.

Mechanisms of mRNA Decay

RNA is susceptible to degradation, or decay, by exoribonucleases, which are enzymes that degrade RNA via the removal of terminal nucleotides. However, two features of mRNAs provide protection against exoribonucleases: a 7-methylguanosine (m7G) cap at the 5′-end and a poly-A tail at the 3′-end. Maintenance or removal of the cap and poly-A tail are thus critical steps in determining the stability or decay of an mRNA.

Most eukaryotic mRNAs are degraded by deadenylation-dependent decay (Figure 16.11). This process is initiated by deadenylases, which are enzymes that shorten the poly-A tail. A newly synthesized mRNA has


[Figure 16.11 diagram: an mRNA with a 5′-cap, AUG start codon, UAA stop codon, and a poly-A tail of about 200 adenines is shortened by a deadenylase to a tail of about 10. It is then degraded either 3′ to 5′ by the exosome or, after removal of the m7G cap by a decapping enzyme, 5′ to 3′ by XRN1.]

FIGURE 16.11 Deadenylation-dependent mRNA decay.

a poly-A tail that is about 200 nucleotides long. However, if deadenylation shortens it to less than 30 nucleotides, the mRNA will be degraded. In some cases, an exoribonuclease-containing exosome complex destroys the mRNA in a 3′ to 5′ manner. In other cases, the shortened poly-A tail leads to the recruitment of decapping enzymes, which remove the 5′-cap and allow a specific exoribonuclease, XRN1, to destroy the mRNA in a 5′ to 3′ direction.

More rarely, mRNAs may meet their demise through deadenylation-independent decay. This pathway begins at the 5′-cap rather than the 3′ poly-A tail. Similar to deadenylation-dependent decay, decapping enzymes are recruited to remove the cap, and then the XRN1 exoribonuclease digests the mRNA in the 5′ to 3′ direction. In addition, in this pathway, mRNAs may also be cleaved internally by endoribonucleases. Following endoribonucleolytic attack, newly formed 5′- and 3′-ends are unprotected and subject to exoribonuclease digestion.

How are specific mRNAs targeted for decay by deadenylation-dependent or -independent decay? In many cases, RBPs have been identified that regulate stability and decay of specific mRNAs. However, we are still far from understanding the complex interactions of mRNAs and RBPs that determine mRNA fate.

mRNA Surveillance and Nonsense-Mediated Decay

Aberrant mRNAs can lead to nonfunctional proteins if translated. Eukaryotic cells have evolved several ways to eliminate these potentially harmful mRNAs. For example, mRNAs that lack poly-A tails or are improperly spliced may be retained in the nucleus to allow more time for processing or may be degraded by exoribonucleases.

mRNAs with a premature stop codon trigger an mRNA surveillance response. Premature stop codons may result from a nonsense mutation in the gene or be due to an RNA polymerase error during transcription. Translation of an mRNA with a premature stop codon would lead to a truncated and nonfunctional protein, which is a waste of cellular energy and resources and could possibly have toxic effects. However, nonsense-mediated decay (NMD), the most thoroughly studied mRNA surveillance pathway, efficiently eliminates mRNAs with premature stop codons.

How does NMD work? Research suggests that recognition of premature stop codons often occurs when translation terminates too far from the poly-A tail, upstream of an exon–exon junction, or upstream of other specific sequences. Once identified, mRNAs with premature stop codons are quickly degraded. In yeast and mammalian cells, decay is most often initiated by decapping enzymes or deadenylases, followed by rapid exoribonuclease digestion. In other species, such as Drosophila, NMD involves endoribonuclease attack near the premature stop codon and subsequent exoribonuclease digestion.

ESSENTIAL POINT
Modulation of mRNA stability and decay can eliminate aberrant mRNAs and regulate gene expression. ■
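One of the recognition criteria described for nonsense-mediated decay, a stop codon lying upstream of an exon–exon junction, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: positions are nucleotide coordinates along the spliced mRNA, and the function name `is_nmd_candidate` and the `min_distance` parameter are hypothetical. The sketch does not model the other signals mentioned in the text (distance from the poly-A tail or other specific sequences).

```python
def is_nmd_candidate(stop_pos, junctions, min_distance=0):
    """Flag an mRNA whose stop codon lies upstream of an exon-exon junction.

    stop_pos     -- position of the stop codon on the spliced mRNA (nt)
    junctions    -- positions of exon-exon junctions on the same mRNA (nt)
    min_distance -- hypothetical tuning parameter: how far downstream a
                    junction must lie to count (0 = any downstream junction)
    """
    return any(junction - stop_pos > min_distance for junction in junctions)


# Illustrative, made-up coordinates: junctions at 120 and 450 nt.
print(is_nmd_candidate(900, [120, 450]))  # stop codon after the last junction
print(is_nmd_candidate(300, [120, 450]))  # stop codon before the junction at 450
```

A stop codon in the final exon has no junction downstream of it, so it is treated as normal; a stop codon in an earlier exon is flagged as potentially premature.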

M16_KLUG8414_10_SE_C16.indd 314 16/11/18 5:15 pm 16.8 Noncoding RNAs Play Diverse Roles in Posttranscriptional Regulation 315

16.8 Noncoding RNAs Play Diverse Roles in Posttranscriptional Regulation

In addition to mRNAs that encode proteins, there are several types of noncoding RNAs (ncRNAs) that serve a variety of functions in the eukaryotic cell. You should already be familiar with rRNAs and tRNAs, which play important roles in translation (Chapter 13), and snRNAs, which mediate RNA splicing (Chapter 12). Here we will consider ncRNAs that serve as posttranscriptional regulators of gene expression.

RNA Interference

RNA interference (RNAi) is a mechanism by which ncRNA molecules guide the posttranscriptional silencing of mRNAs in a sequence-specific manner. The most important concept of RNAi is that since ncRNAs can associate with mRNAs through complementary base pairing, ncRNAs are able to target specific mRNAs and, via associated proteins, destroy the mRNAs or block their translation.

The ncRNAs involved in RNAi, broadly termed small noncoding RNAs (sncRNAs), are double-stranded RNAs that are 20 to 31 nucleotides long with 2-nucleotide overhangs at their 3′-ends. There are two main subtypes of sncRNAs: small interfering RNAs (siRNAs) and microRNAs (miRNAs). Although they arise from different sources, their mechanisms of action are similar.

siRNAs are derived from longer double-stranded RNA (dsRNA) molecules. These long dsRNAs may appear within cells as a result of virus infection or the expression of transposons, also called “jumping genes” (see Chapter 14), both of which may synthesize dsRNAs as part of their life cycles. RNAi may have evolved as a mechanism to recognize these dsRNAs and inactivate them, protecting the cell from external (viral) or internal (transposon) assaults. siRNAs can also be derived from lab-synthesized dsRNAs and introduced into cells for research or therapeutic purposes.

Whatever the source, when long dsRNAs are present in a eukaryotic cell, they are cleaved into approximately 22-nucleotide-long siRNAs by an enzyme called Dicer (Figure 16.12, left). These siRNAs then associate with the RNA-induced silencing complex (RISC). RISC contains an Argonaute family protein that binds RNA and has endoribonuclease or “slicer” activity. RISC cleaves and evicts one of the two strands of the double-stranded siRNA and retains the other strand as a single-stranded siRNA “guide” to recruit RISC to a complementary mRNA. RISC then cleaves the mRNA in the middle of the region of siRNA–mRNA complementarity (Figure 16.12, left, bottom). Cleaved mRNA fragments lacking a cap or a poly-A tail are then quickly degraded in the cell by exoribonucleases (see Section 16.7).

FIGURE 16.12 RNA interference pathways. Left: Double-stranded RNA is processed into siRNAs by Dicer. siRNAs then associate with RISC containing an Argonaute (AGO) family protein. RISC unwinds the siRNAs into single-stranded siRNAs and cleaves mRNAs complementary to the siRNA. Right: miRNA genes are transcribed as primary-miRNAs (pri-miRNAs), which are trimmed at the 5′- and 3′-ends by the nuclear enzyme Drosha to form pre-miRNAs, which are exported to the cytoplasm and processed by Dicer. These miRNAs then associate with RISC and mRNAs. If the miRNA and mRNA are perfectly complementary, the mRNA is destroyed; if there is a partial match, translation is inhibited.

RNAi can also be mediated by miRNAs, which are produced from the transcription and processing of miRNA genes in eukaryote genomes. miRNA genes include self-complementary sequences and are transcribed into primary miRNAs (pri-miRNAs), which, like mRNAs, have a cap and a poly-A tail and may contain introns that are spliced out.


However, because of their self-complementary sequences, pri-miRNAs form hairpin structures. A nuclear enzyme called Drosha removes the noncomplementary 5′- and 3′-ends to produce pre-miRNAs (Figure 16.12, right). These hairpins are then exported to the cytoplasm where they are cleaved by Dicer to produce mature double-stranded miRNAs and further processed to single-stranded miRNAs by RISC. Like siRNAs, miRNAs associate with RISC to target complementary sequences on mRNAs. Such complementary sequences serve as binding sites or miRNA response elements (MREs). MREs are commonly found in 3′-UTRs of mRNAs but can also be found in the 5′-UTRs or the coding region. If the miRNA–mRNA match is perfect (common in plants), the target mRNA is cleaved by RISC and degraded. But if the miRNA–mRNA match is partial (common in animals), it blocks translation (Figure 16.12, right, bottom).

miRNAs are found in plants and animals, and are encoded by some viruses. There are at least 1500 miRNA genes in the human genome, and possibly as many as 5000. miRNAs have been shown to regulate diverse target mRNAs associated with a broad range of processes such as cell-cycle control, development, and nervous system function.

NOW SOLVE THIS
16.3 Some scientists use the analogy that the RNA-induced silencing complex (RISC) is a “programmable search engine” that uses miRNAs as programs. What does RISC search for, and how does an miRNA “program” the search? What does RISC do when it finds what it is searching for?
HINT: The important concept here is that complementary base pairing enables an miRNA to guide RISC to its target for posttranscriptional regulation.

RNA Interference in Medicine

The discovery of RNAi opened the door to RNA molecules being developed as potential pharmaceutical agents. In theory, any disease caused by overexpression of a specific gene, or even normal expression of an abnormal gene product, could be treated by RNAi. In early 2018, there were 30 ongoing clinical trials to treat viral infections like hepatitis and Ebola, as well as cancers, eye diseases, hemophilia, hypercholesterolemia, and even alcoholism. Some of these trials were in phase III, meaning that they were being administered to large groups to confirm effectiveness and monitor side effects.

In August of 2018, Alnylam Pharmaceuticals announced FDA approval of Patisiran, an siRNA and nanoparticle delivery system that treats transthyretin amyloidosis. This disorder is characterized by nervous system and cardiac problems due to a buildup of a mutant form of the transthyretin protein. Patisiran is the first ever RNAi-based drug to receive FDA approval.

Long Noncoding RNAs and Posttranscriptional Regulation

In addition to sncRNAs discussed previously, eukaryotic genomes also encode many long noncoding RNAs (lncRNAs). One obvious distinction is that lncRNAs are longer than sncRNAs and are often arbitrarily designated to be greater than 200 nucleotides in length. lncRNAs are produced in a similar fashion to mRNAs; they are modified with a cap and a poly-A tail, and they can be spliced. In contrast to mRNAs, they have no start and stop codons, indicating that they do not encode protein. The conservative estimate is that the human genome encodes 17,000 lncRNAs.

lncRNAs have been linked to diverse regulatory functions. Some lncRNAs bind to chromatin-regulating complexes to influence chromatin modifications and alter patterns of gene expression (see Special Topics Chapter ST1: Epigenetics). Others regulate transcription by directly associating with transcription factors. However, in this section we will focus on the posttranscriptional roles of lncRNAs.

When lncRNAs are complementary to mRNAs or pre-mRNAs, the two can hybridize by base pairing. In some cases, this leads to regulation of alternative splicing. For example, an lncRNA that binds to splice sites for an exon can lead to its exclusion from the mature transcript. In other cases, lncRNA–mRNA hybridization produces a dsRNA that triggers an RNAi response. It is processed by Dicer into siRNAs that then target complementary mRNAs for destruction by RISC (see Figure 16.12). Other studies show that lncRNAs can bind to mRNAs in ways that regulate their stability, decay, and translation.

Some lncRNAs function as competing endogenous RNAs (ceRNAs). Conceptually, ceRNAs are “sponges” that “soak up” miRNAs due to the presence of complementary miRNA-binding sites in their sequence, the miRNA response elements (MREs) introduced earlier in this chapter. Thus, ceRNAs compete with mRNAs for miRNA binding. Whereas miRNAs downregulate their mRNA targets, ceRNAs are able to “derepress” the mRNA targets by sequestering miRNAs away from them. In other words, ceRNAs are decoys. The efficacy of a ceRNA depends on variables such as how many MREs it contains, how many copies of the ceRNA are expressed in the cell, and the affinity of its MREs for the miRNA.

ESSENTIAL POINT
Noncoding RNAs, such as microRNAs (miRNAs) and small interfering RNAs (siRNAs), can mediate sequence-specific degradation or translational inhibition of target mRNAs in a process called RNA interference (RNAi). Long noncoding RNAs (lncRNAs) have a variety of functions in the cell such as influencing alternative splicing and serving as decoys for miRNAs to allow translation of mRNAs that would otherwise be targeted by such miRNAs. ■
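The relationship between miRNA–mRNA complementarity and the outcome of RNAi (a perfect match leads to cleavage, a partial match to translational inhibition) can be illustrated with a short Python sketch. It is a deliberate simplification, not a model of how RISC actually scores targets: the guide sequence is made up, the function names are hypothetical, and the `partial_threshold` cutoff of 0.5 is an arbitrary assumption chosen only to separate "partial" from "no meaningful" pairing.

```python
# Watson-Crick pairing rules for RNA.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the sequence that pairs perfectly with `rna` (both read 5' to 3')."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def risc_outcome(guide, target_site, partial_threshold=0.5):
    """Classify the outcome for a guide RNA and an equally long mRNA site.

    partial_threshold is a hypothetical cutoff used only for this sketch.
    """
    if len(guide) != len(target_site):
        raise ValueError("guide and target site must have the same length")
    perfect_site = reverse_complement(guide)
    paired = sum(a == b for a, b in zip(perfect_site, target_site))
    fraction = paired / len(guide)
    if fraction == 1.0:
        return "mRNA cleavage"
    if fraction >= partial_threshold:
        return "translational inhibition"
    return "no silencing"


guide = "AUGGCUAACGUUAGCCUAGGAU"          # made-up 22-nucleotide guide
perfect_site = reverse_complement(guide)  # fully complementary mRNA site
print(risc_outcome(guide, perfect_site))
```

In this picture, the guide strand is the "program" and the complementarity check is the "search" from the analogy in Now Solve This 16.3.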


16.9 mRNA Localization and Translation Initiation Are Highly Regulated

We have already encountered several posttranscriptional mechanisms that impact translation. Alternative splicing determines which spliceforms may be translated, and mRNAs may be degraded or targeted by an miRNA to stop translation. However, translation may be regulated more directly as well.

Some mRNAs are localized to discrete regions of the cell where they are translated locally. This generates asymmetric protein distributions within the cell that enable different parts of the cell to have different functions. Similar to other posttranscriptional mechanisms, the regulation of mRNA localization and localized translation is governed by cis-regulatory sequences on the mRNA and by RNA-binding proteins.

One of the best-described RBP–mRNA interactions governs the localization and translational control of actin mRNAs in crawling cells. Following injury, fibroblasts migrate to the site of the wound and assist in wound healing. Fibroblasts control their direction of movement by controlling where within the cell they polymerize new cytoskeletal actin microfilaments. The “leading edge” of the cell where this actin polymerization occurs is called the lamellipodium.

Actin mRNA is localized to lamellipodia, and that localization is dependent on a 54-nucleotide element in the actin mRNA 3′-UTR termed a zip code. The zip code is a cis-regulatory element that serves as a binding site for an RBP called zip code binding protein 1 (ZBP1). ZBP1 initially binds the zip code of actin mRNAs in the nucleus. Once exported to the cytoplasm, ZBP1 blocks translation initiation by preventing the association of the large subunit (60S) of the ribosome. In addition, ZBP1 associates with cytoskeleton motor proteins to facilitate the transport of the mRNA to the lamellipodium. Once the mRNAs arrive at the final destination, a kinase called Src phosphorylates ZBP1, which disrupts RNA binding and allows translation initiation (Figure 16.13).

FIGURE 16.13 Localization and translational regulation of actin mRNA. The RNA-binding protein ZBP1 associates with actin mRNA in the nucleus and escorts it to the cytoplasm. ZBP1 blocks translation and binds cytoskeleton motor proteins (MP), which transport ZBP1 and actin mRNA to the cell periphery. At the cell periphery, ZBP1 is phosphorylated by Src and dissociates from actin mRNA, allowing it to be translated by a ribosome (40S/60S). Actin translation and polymerization at the leading edge direct cell movement.

Since Src activity is limited to the cell periphery, this mechanism allows the transport of actin mRNAs in a translationally repressed state to the cell periphery, thus controlling where actin will be translated.

Actin mRNA localization is important in other cells as well, such as in neurons where localized translation of actin helps guide axon growth. In addition, many other mRNAs are also localized and translated in specific locations within the cell. mRNA localization and translational control are particularly important for nervous system function. In fact, defects in mRNA localization in neurons have been implicated in human disorders such as fragile-X syndrome, spinal muscular atrophy, and spinocerebellar ataxia.

NOW SOLVE THIS
16.4 Consider the example that actin mRNA localization is important for fibroblast migration. What would you predict to be the consequence of deleting the zip code sequence element of the actin mRNA?
HINT: The key to answering this question is recalling that the zip code is a cis-acting element that is bound by an RNA-binding protein involved in its localization and translational control.

ESSENTIAL POINT
mRNAs may not be translated immediately; some mRNAs are stored for later use and/or are localized to specific regions of the cell and then locally translated to create asymmetric protein distribution within the eukaryotic cell. ■


16.10 Posttranslational Modifications Regulate Protein Activity

Even after translation is complete, the activity of the gene products can still be regulated through a suite of posttranslational modifications. You’ve already encountered one of these mechanisms earlier in the text (Chapter 13), when we discussed the addition of iron-containing heme groups to the oxygen-carrying protein hemoglobin. In addition, proteins may be posttranslationally modified by the covalent attachment of various molecules. Such additions can change a protein’s stability, subcellular localization, or affinity for other molecules. Since covalent modification is an enzyme-catalyzed event, the regulation of such enzymes is a critical step for controlling gene expression at the posttranslational level.

Regulation of Proteins by Phosphorylation

We do not know the full extent of posttranslational modifications within the proteome for any given species or cell type. However, phosphorylation is the most common type of posttranslational modification, accounting for approximately 65 percent of all posttranslational modifications. Phosphorylation is mediated by a class of enzymes called kinases. Kinases catalyze the addition of a phosphate group to serine, tyrosine, or threonine amino acid side chains. Such additions are reversible. Phosphatases are enzymes that remove phosphates. It is calculated that the human genome contains 518 kinase-encoding genes and 147 phosphatase-encoding genes. This suite of enzymes can be used in countless ways to regulate protein activity.

Phosphorylation usually induces conformational changes. These changes can have different effects depending on the type of target. For example, enzymes may be turned on or off by phosphorylation where conformational changes alter substrate binding. Transcription factors may be turned on or off by phosphorylation based on how the conformational changes impact their affinity for the target DNA sequence. In some cases, there is more than one phosphorylation site on a protein; phosphorylation at one site may activate, while phosphorylation at the other site may inactivate the protein.

Ubiquitin-Mediated Protein Degradation

One important way to regulate protein activity after translation is through the targeting of specific proteins for degradation. The principal mechanism by which the eukaryotic cell targets a protein for degradation is by covalently modifying it with ubiquitin, a small protein with 76 amino acids that is found in all eukaryotic cells. The fact that it is ubiquitous gives ubiquitin its name.

Ubiquitin is covalently attached to a target protein via a lysine side chain through a process called ubiquitination. Subsequently, lysine side chains in the attached ubiquitin molecule can be modified by the addition of other ubiquitin molecules. This process can be repeated to form long poly-ubiquitin chains, which serve as “tags” that mark the protein for destruction. Poly-ubiquitinated proteins are recognized by the proteasome, a multi-subunit protein complex with protease (protein cleaving) activity. The proteasome unwinds target proteins, removes their ubiquitin tags, and breaks the protein into small peptides about 7 to 8 amino acids long (Figure 16.14).

Since ubiquitinated proteins are quickly destroyed, the determination of which proteins get ubiquitinated is a major regulatory step. A class of enzymes, ubiquitin ligases, recognize and bind specific target proteins and catalyze the processive addition of ubiquitin residues (Figure 16.14). In turn, ubiquitin ligase activity can be regulated in many ways to serve the needs of the cell.

FIGURE 16.14 Ubiquitin-mediated protein degradation. Ubiquitin ligase enzymes recognize substrate proteins and catalyze the addition of ubiquitin (Ub) residues to create a long chain. Ubiquitinated proteins are then recognized by the proteasome, which removes ubiquitin tags, unfolds the protein, and proteolytically cleaves it into small polypeptides.

It is estimated that there are over 600 ubiquitin ligase–encoding genes in the human genome and that some human ubiquitin ligases interact with over 40 different substrate proteins. Overall, scientists estimate that human ubiquitin ligases target over 9000 different proteins, which accounts for approximately 40 percent of the protein-coding genes in the human genome. This suggests that ubiquitin-mediated protein degradation may be a broadly used mechanism to regulate biological function.

ESSENTIAL POINT
Following translation, protein activity can be modulated by posttranslational modifications, such as phosphorylation or ubiquitin-mediated degradation. ■

EXPLORING GENOMICS

Tissue-Specific Gene Expression

Mastering Genetics: Visit the Study Area: Exploring Genomics

In this chapter, we discussed how gene expression can be regulated in complex ways. One aspect of regulation we considered is the way promoter, enhancer, and silencer sequences can govern transcriptional initiation of genes to allow for tissue-specific gene expression. All cells and tissues of an organism possess the same genome (with some genomic variation as you will learn in Chapter 18), and many genes are expressed in all cell and tissue types. However, muscle cells, blood cells, and all other tissue types express genes that are largely tissue specific (i.e., they have limited or no expression in other tissue types). In this exercise, we return to the National Center for Biotechnology Information (NCBI) site and use the search tool BLAST (Basic Local Alignment Search Tool), which you were introduced to in an earlier Exploring Genomics exercise (see Chapter 9). We will use BLAST to learn more about tissue-specific gene-expression patterns.

Exercise – Tissue-Specific Gene Expression

1. Access BLAST from the NCBI Web site at https://blast.ncbi.nlm.nih.gov/Blast.cgi.

2. The following are GenBank accession numbers for four different genes that show tissue-specific expression patterns. You will perform your searches on these genes.

   NM_021588.1
   NM_007391
   AY260853.1
   NM_004917

3. For each gene, carry out a nucleotide BLAST search using the accession numbers for your sequence query. (Refer to the Exploring Genomics feature in Chapter 9 if you need to refresh your memory on BLAST searches.) Because the accession numbers are for nucleotide sequences, be sure to use the “Nucleotide BLAST” program when running your searches. Once you enter “Nucleotide BLAST,” under the “Choose Search Set” category, make sure the database is set to “Others (nr, etc.)” so that you are not searching an organism-specific database.

4. Once your BLAST search results appear, look at the top alignment for each gene. Clicking on the link for the top alignment will take you to the page showing the sequence alignment for this gene. To the far right of the page, if you scroll down, you will see a section called “Related Information.” The “Gene” link provides a report on details related to this gene.

   Some alignments will display a link for “Map Viewer,” which will take you to genome mapping information about the gene. The “UniGene” link will show you a UniGene report. For some genes, upon entering UniGene you may need to click a link above the gene name or the gene name itself in order to retrieve a UniGene report. Be sure to explore the “EST Profile” link under the “Gene Expression” category in each UniGene report. EST profiles will show a table of gene-expression patterns in different tissues.

5. Also explore the “GEO Profiles” link under the “Gene Expression” category of the UniGene reports, when available. These links will take you to a number of gene-expression studies related to each gene of interest. Explore these resources for each gene, and then answer the following questions:

   a. What is the identity of each sequence, based on sequence alignment? How do you know this?
   b. What species was each gene cloned from?
   c. Which tissue(s) are known to express each gene?
   d. Does this gene show regulated expression during different times of development?
   e. Which gene shows the most restricted pattern of expression by being expressed in the fewest tissues?


CASE STUDY A mysterious muscular dystrophy

A man in his early 30s suddenly developed weakness in his hands and neck, followed weeks later by burning muscle pain, all symptoms of late-onset muscular dystrophy. His internist ordered genetic tests to determine whether he had one of the most common adult-onset muscular dystrophies: myotonic dystrophy type 1 (DM1) or myotonic dystrophy type 2 (DM2). The tests detect mutations in the DMPK and CNBP genes, the only genes known to be associated with DM1 and DM2. While awaiting the results of the gene tests, the internist explained that the disease-causing mutations in these genes do not result in changes to the coding sequence. Rather, myotonic dystrophies result from increased, or expanded, numbers of tri- and tetranucleotide repeats in the 3′ untranslated region of the DMPK or CNBP genes. The doctor went on to explain that the presence of RNAs with expanded numbers of repeats leads to aberrant alternative splicing of other mRNAs, causing widespread disruption of cellular pathways. This discussion raises a number of interesting questions.

1. What is alternative splicing, where does it occur, and how could disrupting it affect the expression of the affected gene(s)?

2. What role might the expanded tri- and tetranucleotide repeats play in the altered splicing?

3. DM1 is characterized by a phenomenon known as genetic anticipation (see Chapter 4), where the age of onset tends to decrease and the severity of the symptoms tends to increase from one generation to the next due to expansion of the trinucleotide repeats in the DMPK gene. What are the implications of a diagnosis of DM1 in this patient with respect to his 4-year-old son and 2-year-old daughter?

For related reading, see Pavićević, D. S., et al. (2013). Molecular genetics and genetic testing in myotonic dystrophy type 1. Biomed. Res. Int. 2013. Article ID 391821.

INSIGHTS AND SOLUTIONS

1. As a research scientist, you have decided to study transcriptional regulation of a gene whose DNA has been cloned and sequenced. To begin your study, you obtain the DNA clone of the gene, which includes the coding sequence and at least 1 kb of upstream DNA. You then create a number of subclones of this DNA, containing various deletions in the gene’s upstream region. These deletion templates are shown in the following figure.

[Figure: The undeleted template, showing the upstream region, the TATA box, the +1 transcription start site, and the RNA transcript, followed by the deleted templates with upstream endpoints at -127, -81, -50, and -11. Shading distinguishes the region deleted from the region remaining.]

To test these DNA subclones for their ability to direct transcription of the gene, you prepare two different types of in vitro transcription systems. The first is a purified system containing RNAP II and the general transcription factors TFIIA, TFIIB, TFIID, TFIIE, TFIIF, and TFIIH. The second system consists of a crude nuclear extract, which is made by extracting most of the proteins from the nuclei of cultured cells. When you test your two transcription systems using each of your templates, you obtain the following results:

DNA Added        Purified System    Nuclear Extract
Undeleted        +                  ++++
-127 deletion    +                  ++++
-81 deletion     +                  ++++
-50 deletion     +                  +
-11 deletion     0                  0

+     Low-efficiency transcription
++++  High-efficiency transcription
0     No transcription

(a) Why is there no transcription from the -11 deletion template in both the crude extract and the purified system?
(b) How do the results for the nuclear extract and the purified system differ, for the undeleted template? How would you interpret this result?
(c) For each of the various deletion templates, compare the results obtained from both the nuclear extract and the purified systems.
(d) What do these data tell you about the transcription regulation of this gene?

Solution:
(a) The lack of transcription from the -11 template suggests that some essential DNA sequences are missing from this deletion template. As the -50 template does show some transcription in both the crude extract and purified system, it is likely that the essential missing sequences, at least for basal levels of transcription, lie between -50 and -11. As the TATA box is located in this region, its absence in the -11 template may be the reason for the lack of transcription.

(b) The undeleted template containing large amounts of upstream DNA is sufficient to promote high levels of transcription in a nuclear extract, but only low levels in a purified system. These data suggest that something is missing in the purified system, compared with the nuclear extract, and this component is important for high levels of transcription from this promoter. As crude nuclear extracts are not defined in content, it would not be clear from these data what factors in the extract are the essential ones.

(c) Both the -127 and -81 templates act the same way as the undeleted template in both the nuclear extract and the purified system: high levels of transcription in nuclear extracts but low levels in a purified system. In contrast, the -50 template shows only low levels of transcription in both systems. These results indicate that all of the sequences necessary for high levels of transcription in a crude system are located between -81 and -50.

(d) First, these data tell you that general transcription factors alone are not sufficient to specify high efficiencies of transcription from this promoter. The DNA sequence elements through which the general transcription factors work are located within 50 bp of the transcription start site. Second, the data tell you that the promoter for this gene is likely a member of the “focused” class of promoters, with one defined transcription start site and an essential TATA box. Third, high levels of transcription require sequences between -81 and -50 relative to the transcription start site. These sequences (enhancers) interact with some component(s) (transcriptional activators) of crude nuclear extracts.

2. Scientists estimate that more than 15 percent of disease-causing mutations involve errors in alternative splicing. However, there is an interesting case in which a mutation that deletes an exon results in increased protein production. Mutations that delete exon 45 of the 79-exon dystrophin gene are the most common cause of Duchenne muscular dystrophy (DMD), a disease associated with progressive muscle degeneration. However, some individuals with deletions of both exons 45 and 46 have Becker muscular dystrophy (BMD), a milder form of muscular dystrophy. Provide a possible explanation for why BMD patients, with a deletion of both exon 45 and 46, produce more dystrophin than DMD patients do.

Solution: Having a deletion of one exon has several possible effects on a gene product. One possibility is that the mRNA transcribed from the exon-deleted dystrophin gene is unstable, leading to a lack of dystrophin protein production. Even if the mRNA is stable, the resulting mutated dystrophin protein could be targeted for rapid degradation, leading to the absence of stable active protein. Another possibility is that the deletion of one exon creates a frameshift leading to a premature stop codon. As the dystrophin gene has 79 exons spanning over 2.6 million base pairs, a frameshift in exon 46 could create a stop codon near the middle of the gene, which would trigger nonsense-mediated mRNA decay. Any mRNA escaping degradation would encode a shorter than normal dystrophin protein, which would likely be nonfunctional.

It is possible that a deletion encompassing both exon 45 and 46 could restore the reading frame of the dystrophin protein in exon 47. The protein product of this gene would be missing amino acid sequences encoded by the two missing exons; however, the protein itself could still have some activity, partially preserving the wild-type phenotype.
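The reasoning used in Problem 1 to map regulatory sequences from deletion templates can be written out as a small Python sketch. It assumes a numeric encoding of the results table (0 for no transcription, 1 for low efficiency, 4 for high efficiency) and uses -1000 as a stand-in endpoint for the undeleted template, which the problem describes only as carrying at least 1 kb of upstream DNA. Both choices, and the function name, are assumptions made for illustration.

```python
# Results from Problem 1, encoded as 0 = none, 1 = low (+), 4 = high (++++).
# The numeric encoding and the -1000 endpoint for the undeleted template
# are assumptions made for this sketch.
TEMPLATES = [
    # (upstream endpoint, purified system, nuclear extract)
    (-1000, 1, 4),
    (-127, 1, 4),
    (-81, 1, 4),
    (-50, 1, 1),
    (-11, 0, 0),
]


def intervals_where_activity_drops(endpoints, levels):
    """Return (longer_endpoint, shorter_endpoint) pairs where transcription falls.

    A drop between two successive deletion endpoints implies that a sequence
    needed for that level of transcription lies in the interval removed.
    """
    drops = []
    for i in range(1, len(endpoints)):
        if levels[i] < levels[i - 1]:
            drops.append((endpoints[i - 1], endpoints[i]))
    return drops


endpoints = [t[0] for t in TEMPLATES]
purified = [t[1] for t in TEMPLATES]
nuclear = [t[2] for t in TEMPLATES]

print("Purified system:", intervals_where_activity_drops(endpoints, purified))
print("Nuclear extract:", intervals_where_activity_drops(endpoints, nuclear))
```

With these values, the purified system loses activity only between -50 and -11, whereas the nuclear extract also shows a drop between -81 and -50, which mirrors the conclusions in parts (a) and (c) of the solution.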

Problems and Discussion Questions

Mastering Genetics: Visit for instructor-assigned tutorials and problems.

1. HOW DO WE KNOW? In this chapter, we focused on the regulation of gene expression in eukaryotes. At the same time, we found many opportunities to consider the methods and reasoning by which much of this information was acquired. From the explanations given in the chapter:
(a) How do we know that transcription and translation are spatially and temporally separated in eukaryotic cells?
(b) How do we know that DNA methylation is associated with transcriptionally silent genes?
(c) How do we know that core-promoter elements are important for transcription?
(d) How do we know that the orientation of promoters relative to the transcription start site is important while enhancers are orientation independent?
(e) How do we know that alternative splicing enables one gene to encode different isoforms with different functions?
(f) How do we know that small noncoding RNA molecules can regulate gene expression?

2. CONCEPT QUESTION Review the Chapter Concepts list on p. 302. The third concept describes how transcription initiation requires the assembly of transcription regulatory proteins on DNA sites known as promoters, enhancers, and silencers. Write a short essay describing which types of trans-acting proteins bind to which type of cis-regulatory element, and how these interactions influence transcription initiation.

3. What features of eukaryotes provide additional opportunities for the regulation of gene expression compared to bacteria?

4. Describe the organization of the interphase nucleus. Include in your presentation a description of chromosome territories and interchromatin compartments.

5. Provide a brief description of two different types of histone modifications and how they impact transcription.

6. Present an overview of the manner in which chromatin can be remodeled. Describe the manner in which these remodeling processes influence transcription.


7. Distinguish between the cis-acting regulatory elements referred to as promoters and enhancers.
8. Describe the manner in which activators and repressors influence the rate of transcription initiation. How might chromatin structure be involved in such regulation?
9. Many promoter regions contain CAAT boxes containing consensus sequences CAAT or CCAAT approximately 70 to 80 bases upstream from the transcription start site. How might one determine the influence of CAAT boxes on the transcription rate of a given gene?
10. Research indicates that promoters may fall into one of two classes: focused or dispersed. How do these classes differ, and which genes tend to be associated with each?
11. Explain the features of the Initiator (Inr) elements, BREs, DPEs, and MTEs of focused promoters.
12. List three types of alternative splicing patterns and how they lead to the production of different protein isoforms.
13. Consider the CT/CGRP example of alternative splicing shown in Figure 16.9. Which different types of alternative splicing patterns are represented?
14. Explain how the use of alternative promoters and alternative polyadenylation signals produces mRNAs with different 5′- and 3′-ends.
15. The regulation of mRNA decay relies heavily upon deadenylases and decapping enzymes. Explain how these classes of enzymes are critical to initiating mRNA decay.
16. Nonsense-mediated decay is an mRNA surveillance pathway that eliminates mRNAs with premature stop codons. How does the cell distinguish between normal mRNAs and those with a premature stop?
17. In 1998, future Nobel laureates Andrew Fire and Craig Mello, and colleagues, published an article in Nature entitled, “Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans.” Explain how RNAi is both “potent and specific.”
18. Present an overview of RNA interference (RNAi). How does the silencing process begin, and what major components participate?
19. RNAi may be directed by small interfering RNAs (siRNAs) or microRNAs (miRNAs); how are these similar, and how are they different?
20. miRNAs target endogenous mRNAs in a sequence-specific manner. Explain, conceptually, how one might identify potential mRNA targets for a given miRNA if you only know the sequence of the miRNA and the sequence of all mRNAs in a cell or tissue of interest.
21. In principle, RNAi may be used to fight viral infection. How might this work?
22. Competing endogenous RNAs act as molecular “sponges.” What does this mean, and what do they compete with?
23. How and why are eukaryotic mRNAs transported and localized to discrete regions of the cell?
24. How is it possible that a given mRNA in a cell is found throughout the cytoplasm but the protein that it encodes is only found in a few specific regions?
25. How may the covalent modification of a protein with a phosphate group alter its function?
26. The proteasome is a multi-subunit machine that unfolds and degrades proteins. How is its activity regulated such that it only degrades certain proteins?
27. When challenged with a low oxygen environment, known as hypoxia, the body produces a hormone called erythropoietin (EPO), which then stimulates red blood cell production to carry more oxygen. Transcription of the gene encoding EPO is dependent upon the hypoxia-inducible factor (HIF), which is a transcriptional activator. However, HIF alone is not sufficient to activate EPO. For example, Wang et al. (2010. PLOS ONE 5: e10002) showed that HIF recruits another protein called p300 to an enhancer for the EPO gene. Furthermore, deletion of p300 significantly impaired transcription of the EPO gene in response to hypoxia. Given that p300 is a type of histone acetyl transferase, how might p300 influence transcription of the EPO gene?
28. The TBX20 transcription factor is important for the development of heart tissue. Deletion of the Tbx20 gene in mice results in poor heart development and the death of mice well before birth. To better understand how TBX20 regulates heart development at a genetic level, Sakabe et al. (2012. Hum. Mol. Genet. 21:2194–2204) performed a transcriptome analysis in which they compared the levels of all mRNAs between heart cells from wild-type mice and mice with Tbx20 deleted.
(a) How might such a transcriptome analysis provide information about how TBX20 regulates heart development?
(b) This study concluded that TBX20 acts as an activator of some genes but a repressor of other genes in cardiac tissue. How might a single transcription factor have opposite effects on the transcription of different genes?
29. Many viruses that infect eukaryotic cells express genes that alter the regulation of host gene expression to promote viral replication. For example, herpes simplex virus-1 (HSV-1) expresses a protein called ICP0, which is necessary for successful viral infection and replication within the host. Lutz et al. (2017. Viruses 9: 210) showed that ICP0 can act as a ubiquitin ligase and target the redundant transcriptional repressors ZEB1 and ZEB2, which leads to upregulation of the miR-183 cluster (a set of three miRNAs transcribed from the same locus).
(a) What likely happens to ZEB1 and ZEB2 upon HSV-1 infection?
(b) How may ICP0 expression in a host cell lead to upregulation of the miR-183 cluster?
(c) Speculate on how miR-183 cluster upregulation may benefit the virus.

17 Recombinant DNA Technology

CHAPTER CONCEPTS

■■ Recombinant DNA technology creates combinations of DNA sequences from different sources.
■■ A common application of recombinant DNA technology is to clone a DNA segment of interest.
■■ Specific DNA segments are inserted into vectors to create recombinant DNA molecules that are transferred into eukaryotic or bacterial host cells, where the recombinant DNA replicates as the host cells divide.
■■ DNA libraries are collections of cloned DNA and were historically used to isolate specific genes.
■■ DNA segments can be quickly amplified and cloned millions of times using the polymerase chain reaction (PCR).
■■ Nucleic acids can be analyzed using a range of molecular techniques.
■■ Sequencing reveals the nucleotide composition of DNA, and major improvements in sequencing technologies have rapidly advanced many areas of modern genetics research, particularly genomics.
■■ Gene knockout methods and transgenic animals have become invaluable for studying gene function in vivo.
■■ Genome editing methods demonstrate value in basic research and exciting potential for clinical applications of genetic technology.
■■ Recombinant DNA technology has revolutionized our ability to investigate the genomes of diverse species and has led to the modern revolution in genomics.

[Photo: A researcher examines an agarose gel containing separated DNA fragments stained with the DNA-binding dye ethidium bromide and visualized under ultraviolet light.]

Researchers of the mid- to late 1970s developed various techniques to create, replicate, and analyze recombinant DNA molecules: DNA created by joining together pieces of DNA from different sources. These techniques, called recombinant DNA technology and often known as “gene splicing” in the early days, marked a major advance in research in molecular biology and genetics, allowing scientists to isolate and study specific DNA sequences. For their contributions to the development of this technology, Daniel Nathans, Hamilton Smith, and Werner Arber were awarded the 1978 Nobel Prize in Physiology or Medicine.

The power of recombinant DNA technology is astonishing, enabling geneticists to identify and isolate a single gene or DNA segment of interest from a genome, and to produce large quantities of identical copies of this specific molecule. These identical copies, or clones, can then be manipulated for numerous purposes, including conducting research on the structure and organization of the DNA, studying gene expression, studying protein products to understand their structure and function, and producing important commercial products from the protein encoded by a gene.

The fundamental techniques involved in recombinant DNA technology subsequently led to the field of genomics, enabling scientists to sequence and analyze entire genomes. Note that some of the topics discussed in this chapter are explored in greater depth later in the text (see Special Topics Chapters 2: Genetic Testing, 3: Gene Therapy, 5: DNA Forensics, and 6: Genetically Modified Foods). In this chapter, we survey basic methods of recombinant DNA technology used to isolate, replicate, and analyze DNA.

17.1 Recombinant DNA Technology Began with Two Key Tools: Restriction Enzymes and Cloning Vectors

Although natural genetic processes such as crossing over produce recombined DNA molecules, the term recombinant DNA is reserved for molecules produced by artificially joining DNA obtained from different sources. We begin our discussion of recombinant DNA technology by considering two important tools used to construct and amplify recombinant DNA molecules: DNA-cutting enzymes called restriction enzymes and cloning vectors. The use of restriction enzymes and cloning vectors was largely responsible for advancing the field of molecular biology because a wide range of laboratory techniques are based on recombinant DNA technology.

Restriction Enzymes Cut DNA at Specific Recognition Sequences

Bacteria produce restriction enzymes as a defense mechanism against infection by bacteriophage. They restrict or prevent viral infection by degrading the DNA of invading viruses. More than 4300 restriction enzymes have been identified, and over 600 are commercially produced and available for use by researchers. A restriction enzyme recognizes and binds to DNA at a specific nucleotide sequence called a recognition sequence or restriction site (Figure 17.1). The enzyme then cuts both strands of the DNA within that sequence by cleaving the phosphodiester backbone. Scientists commonly refer to this as “digestion” of DNA. The usefulness of restriction enzymes is their ability to accurately and reproducibly cut DNA into fragments. Restriction enzymes represent sophisticated molecular scissors for cutting DNA into fragments of desired sizes. Restriction sites are distributed randomly in the genome, and the size of the DNA fragments resulting from digestion with a given enzyme can be estimated based on the probable frequency of its recognition sequence. The actual fragment sizes produced will vary, however, because of variability in the number and locations of recognition sequences in relation to one another.

Recognition sequences exhibit a form of symmetry described as a palindrome: The nucleotide sequence reads the same on both strands of the DNA when read in the 5′ to 3′ direction. Each restriction enzyme recognizes its particular recognition sequence and cuts the DNA in a characteristic cleavage pattern (see Figure 17.1). The most common recognition sequences are four or six nucleotides long, but some contain eight or more nucleotides. Enzymes such as HindIII make offset cuts in the DNA strands, thus producing fragments with single-stranded overhanging ends called cohesive ends (or “sticky” ends), while others such as AluI cut both strands at the same nucleotide pair, producing DNA fragments with double-stranded ends called blunt-end fragments.
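The two quantitative ideas in this passage, palindromic symmetry and the "probable frequency" of a recognition sequence, can be illustrated with a short sketch. The function names are hypothetical, and the spacing estimate assumes a random sequence with equal frequencies of the four bases, which real genomes only approximate.

```python
# Illustrative sketch only; names are hypothetical and not from the text.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand of seq, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_palindromic(site: str) -> bool:
    """True if the site reads the same 5' to 3' on both strands."""
    return site.upper() == reverse_complement(site)


def expected_fragment_length(site: str) -> int:
    """Average spacing between sites in an idealized random sequence.

    Assumes each of the four bases is equally likely at every position,
    so a site of length n is expected about once every 4**n bases.
    """
    return 4 ** len(site)


# Recognition sequences listed in Figure 17.1
sites = {"HindIII": "AAGCTT", "BamHI": "GGATCC", "Sau3AI": "GATC", "AluI": "AGCT"}
for enzyme, site in sites.items():
    print(enzyme, site, is_palindromic(site), expected_fragment_length(site))
```

Under this idealization a four-base site such as GATC is expected roughly every 256 bp and a six-base site such as AAGCTT roughly every 4096 bp, which is why four-base cutters tend to produce much smaller fragments.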

Enzyme    Recognition Sequence    DNA Fragments Produced        Source Microbe

HindIII   A-A-G-C-T-T             A          A-G-C-T-T          Haemophilus influenzae Rd
          T-T-C-G-A-A             T-T-C-G-A          A
                                  (cohesive ends)

BamHI     G-G-A-T-C-C             G          G-A-T-C-C          Bacillus amyloliquefaciens H
          C-C-T-A-G-G             C-C-T-A-G          G
                                  (cohesive ends)

Sau3AI    G-A-T-C                           G-A-T-C             Staphylococcus aureus 3A
          C-T-A-G                 C-T-A-G
                                  (cohesive ends)

AluI      A-G-C-T                 A-G      C-T                  Arthrobacter luteus
          T-C-G-A                 T-C      G-A
                                  (blunt ends)

FIGURE 17.1 Common restriction enzymes, with their recognition sequences, DNA cutting patterns, and source microbes. Arrows indicate the location in the DNA cut by each enzyme.
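As a rough illustration of the cutting patterns in Figure 17.1, the sketch below digests a linear sequence by tracking only the top strand. The `ENZYMES` table, the `digest` function, and the example sequence are hypothetical names and data chosen for this sketch; the cut offsets follow the fragment patterns shown in the figure.

```python
# Illustrative sketch only; top strand of a linear molecule, read 5' to 3'.

# Recognition site and cut offset within the site on the top strand.
ENZYMES = {
    "HindIII": ("AAGCTT", 1),  # A^AGCTT, cohesive ends
    "BamHI": ("GGATCC", 1),    # G^GATCC, cohesive ends
    "Sau3AI": ("GATC", 0),     # ^GATC, cohesive ends
    "AluI": ("AGCT", 2),       # AG^CT, blunt ends
}


def digest(seq: str, enzyme: str) -> list[str]:
    """Return the top-strand fragments produced by cutting at every site."""
    site, offset = ENZYMES[enzyme]
    seq = seq.upper()
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    # Drop empty pieces that arise when a cut falls at the very end of the molecule.
    return [fragment for fragment in fragments if fragment]


example = "CCAAGCTTGGAGCTCC"  # made-up sequence with one HindIII and two AluI sites
print(digest(example, "HindIII"))  # ['CCA', 'AGCTTGGAGCTCC']
print(digest(example, "AluI"))     # ['CCAAG', 'CTTGGAG', 'CTCC']
print(digest(example, "BamHI"))    # no site, so the molecule stays intact
```

Modeling only the top strand hides the overhangs on the complementary strand, so this sketch shows where cuts fall and how many fragments result, not the full double-stranded end structure.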


One of the first restriction enzymes to be identified was isolated from Escherichia coli strain R and was designated EcoRI. DNA fragments produced by EcoRI digestion have cohesive ends because they can base-pair with complementary single-stranded ends on other DNA fragments cut using EcoRI. When mixed together, single-stranded ends of DNA fragments from different sources cut with the same restriction enzyme can anneal, or stick together, by hydrogen bonding of complementary base pairs in single-stranded ends (Figure 17.2). Addition of the enzyme DNA ligase (recall the role of DNA ligase in DNA replication as discussed earlier in the text; see Chapter 10) to DNA fragments will seal the phosphodiester backbone of DNA to covalently join the fragments together to form recombinant DNA molecules.

Scientists often use restriction enzymes that create cohesive ends since the overhanging ends make it easier to combine fragments. Blunt-end ligation is more technically challenging because it is not facilitated by hydrogen bonding, but a scientist can ligate fragments digested at different sequences by different blunt-end generating enzymes.

ESSENTIAL POINT
Recombinant DNA technology was made possible by the discovery of proteins called restriction enzymes, which cut DNA at specific sequences, producing fragments that can be joined with other DNA fragments to form recombinant DNA molecules. ■

DNA Vectors Accept and Replicate DNA Molecules to Be Cloned

Scientists recognized that DNA fragments resulting from restriction-enzyme digestion could be copied or cloned if they also had a technique for replicating the fragments. Thus, a second key tool that allowed DNA cloning was the development of cloning vectors, DNA molecules that accept DNA fragments and replicate these fragments when vectors are introduced into host cells.

Many different vectors are available for cloning. Vectors differ in terms of the host cells they are able to enter and in the size of DNA fragment inserts they can carry, but most DNA vectors share several key properties.
■■ A vector contains several restriction sites that allow insertion of the DNA fragments to be cloned.
■■ Vectors must be capable of replicating in host cells independent of the host cell chromosome(s).
■■ To make it possible to distinguish between host cells that have taken up vectors and host cells that have not, the vector should carry a selectable marker gene (often an antibiotic resistance gene) or a reporter gene (a gene that encodes a protein which produces a visible effect, such as color or fluorescent light).
■■ Most vectors incorporate specific sequences that allow for sequencing inserted DNA.

[Figure 17.2 diagram: DNA from two sources, each carrying the EcoRI recognition sequence G-A-A-T-T-C / C-T-T-A-A-G, is cleaved with EcoRI to give fragments with complementary sticky ends (5′ G and A-A-T-T-C 3′ on the top strand; 3′ C-T-T-A-A and G 5′ on the bottom strand). Annealing allows recombinant DNA molecules to form by complementary base pairing; the two strands are not covalently bonded, as indicated by the nicks in the DNA backbone. DNA ligase seals the nicks in the DNA backbone, covalently bonding the two strands.]

FIGURE 17.2 DNA from different sources is cleaved with EcoRI and mixed to allow annealing. The enzyme DNA ligase forms phosphodiester bonds between these fragments to create a recombinant DNA molecule.
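The cut-anneal-ligate sequence in Figure 17.2 can be sketched in a few lines. This is a simplified top-strand model with hypothetical function names and made-up source sequences; annealing and ligation are collapsed into a single compatibility check and join.

```python
# Illustrative sketch only; top strand of linear molecules, read 5' to 3'.

ECORI_SITE = "GAATTC"
CUT_OFFSET = 1  # EcoRI cuts G^AATTC on the top strand


def ecori_cut(seq: str) -> tuple[str, str]:
    """Split a sequence at its first EcoRI site into left and right arms."""
    pos = seq.find(ECORI_SITE)
    if pos == -1:
        raise ValueError("no EcoRI site found")
    cut = pos + CUT_OFFSET
    return seq[:cut], seq[cut:]


def ligate(left: str, right: str) -> str:
    """Join two arms only if their ends look like EcoRI-generated sticky ends.

    Simplification: a left arm ending in G pairs with a right arm that
    starts with the AATTC overhang; real annealing involves both strands.
    """
    if not (left.endswith("G") and right.startswith("AATTC")):
        raise ValueError("ends are not EcoRI-compatible")
    return left + right


source_a = "ATATGAATTCGCGC"  # made-up DNA from source A
source_b = "CCCCGAATTCTTTT"  # made-up DNA from source B

a_left, a_right = ecori_cut(source_a)
b_left, b_right = ecori_cut(source_b)

recombinant = ligate(a_left, b_right)
print(recombinant)                # ATATGAATTCTTTT
print(ECORI_SITE in recombinant)  # True: joining the arms regenerates the site
```

Because both sources were cut with the same enzyme, the left arm of one molecule can join the right arm of the other, and the recombinant molecule again contains an intact EcoRI site.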


Bacterial Plasmid Vectors

Genetically modified bacterial plasmids, derived from naturally occurring plasmids, were the first vectors developed, and they are still widely used for cloning. Recall from earlier discussions (Chapter 8) that plasmids are naturally occurring extrachromosomal, double-stranded, circular DNA molecules that replicate independently from the chromosomes within bacterial cells [Figure 17.3(a)]. These plasmids can be extensively modified by genetic engineering to serve as cloning vectors. Many commercially prepared plasmids are readily available with a range of useful features [Figure 17.3(b)]. One example is a region called the multiple cloning site, a short sequence that has been genetically engineered to contain a number of restriction sites for commonly used restriction enzymes. Multiple cloning sites allow scientists to clone a range of different fragments generated by many commonly used restriction enzymes.

[Figure 17.3(b) diagram labels: multiple cloning site containing restriction sites for HindIII, SphI, PstI, SalI, AccI, HincII, XbaI, BamHI, SmaI, XmaI, KpnI, BanII, SstI, and EcoRI; two DNA-sequencing primer sites; lacZ gene; ampicillin resistance gene (ampR); origin of replication (ori).]

FIGURE 17.3 (a) Color-enhanced electron micrograph of plasmids isolated from E. coli. (b) Diagram of a typical DNA cloning plasmid.

Plasmids are introduced into bacteria by the process of transformation (see Chapter 8). Two main techniques are widely used for bacterial transformation. One approach involves treating cells with calcium ions and using a brief heat shock to introduce the plasmid DNA into cells. The other technique, called electroporation, uses a brief, but high-intensity, pulse of electricity to move plasmid DNA into bacterial cells.

Only one or a few plasmids generally enter a bacterial host cell by transformation. But because plasmids have an origin of replication (ori) site that allows for plasmid replication, it is possible to produce several hundred copies of a plasmid in a single host cell. This greatly enhances the number of DNA clones that can be produced.

Cloning DNA with a plasmid generally begins by cutting both the plasmid DNA and the DNA to be cloned with the same restriction enzyme (Figure 17.4). Typically, the plasmid is cut once within the multiple cloning site, converting the circular molecule into a linear vector. DNA restriction fragments from the DNA to be cloned are added to the linearized vector in the presence of DNA ligase. Sticky ends of DNA fragments anneal, joining the DNA to be cloned and the plasmid. DNA ligase is then used to create phosphodiester bonds to seal nicks in the DNA backbone, thus producing recombinant DNA, which is then introduced into bacterial host cells by transformation. Once inside the cell, plasmids replicate quickly to produce multiple copies.

However, when cloning DNA using plasmids, not all plasmids will incorporate DNA to be cloned. For example, a plasmid cut with a restriction enzyme generating sticky ends can close back on itself (self-ligate) if cut ends of the plasmid rejoin. Obviously, such nonrecombinant plasmids are not desired. Also, during transformation, not all host cells will take up plasmids. Therefore, it is important that bacterial cells containing recombinant DNA can be readily identified in a cloning experiment. One way this is accomplished is through the use of selectable marker genes, such as those that provide resistance to antibiotics (ampR for ampicillin resistance, for example). Another strategy is the use of reporter genes such as lacZ. Figure 17.5 provides an example of how the latter can be used to identify bacteria containing recombinant plasmids. This process is often referred to as “blue-white” screening for a reason that will soon become obvious.

In blue-white screening, a plasmid is used that contains the lacZ gene into which a multiple cloning site has been incorporated. The lacZ gene encodes the enzyme β-galactosidase, which, as you learned earlier in the text (see Chapter 15), is used to cleave the disaccharide lactose into its component monosaccharides glucose and galactose. Blue-white screening takes advantage of this enzymatic activity. Using this approach, one can easily identify transformed cells containing recombinant or nonrecombinant plasmids.


[Figure 17.4 diagram labels: Plasmid vector is cut with a restriction enzyme. DNA to be cloned is cut with the same restriction enzyme. The two DNAs are ligated to form a recombinant molecule. Introduction into bacterial host cells by transformation (host cell chromosome shown). Cells carrying recombinant plasmids can be selected by plating on agar containing antibiotics and color indicators such as X-gal (refer to Figure 17.5).]

FIGURE 17.4 Cloning with a plasmid vector.
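The final selection step of Figure 17.4, plating on agar that contains ampicillin and X-gal, amounts to a small decision rule. The sketch below encodes that rule; the `Cell` class and `colony_outcome` function are hypothetical names, and the model assumes an ampR plasmid whose multiple cloning site lies within lacZ.

```python
# Illustrative sketch only; a deliberately simplified model of blue-white screening.
from dataclasses import dataclass


@dataclass
class Cell:
    has_plasmid: bool             # did the cell take up a plasmid (ampR, lacZ)?
    insert_in_lacz: bool = False  # is foreign DNA inserted in the cloning site?


def colony_outcome(cell: Cell) -> str:
    """Predict what a plated cell looks like on ampicillin plus X-gal agar."""
    if not cell.has_plasmid:
        # No ampR gene, so the cell is killed by ampicillin.
        return "no colony"
    if cell.insert_in_lacz:
        # lacZ is disrupted: no functional beta-galactosidase, X-gal is not cleaved.
        return "white"
    # Self-ligated plasmid: intact lacZ, X-gal is cleaved and turns blue.
    return "blue"


plated = {
    "nontransformed": Cell(has_plasmid=False),
    "nonrecombinant plasmid": Cell(has_plasmid=True),
    "recombinant plasmid": Cell(has_plasmid=True, insert_in_lacz=True),
}
for label, cell in plated.items():
    print(label, "->", colony_outcome(cell))
```

The white colonies are the ones worth picking, since they are the only class that both survived ampicillin and carry an insert.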

If a DNA fragment is inserted anywhere in the multiple cloning site, the lacZ gene is disrupted and will not produce functional copies of β-galactosidase. The agar plates used in the assay contain an antibiotic, ampicillin in this case. Nontransformed bacteria do not have the ampR gene and are killed by the ampicillin.

These agar plates also contain a substance called X-gal (technically 5-bromo-4-chloro-3-indolyl-β-D-galactopyranoside), which is similar to lactose in structure. X-gal is, therefore, a substrate for β-galactosidase, and when it is cleaved, it turns blue. Bacterial cells that carry nonrecombinant plasmids (those that have self-ligated and thus do not contain inserted DNA) have a functional lacZ gene and produce β-galactosidase. As a result, these cells turn blue because the functional enzyme cleaves X-gal in the medium. By contrast, recombinant bacteria (with plasmids containing a DNA fragment inserted into the lacZ gene) will form white colonies when they grow on X-gal medium because functional β-galactosidase cannot be made (Figure 17.5, bottom). Bacteria in a colony are clones of each other: genetically identical cells with copies of recombinant plasmids. White colonies can be transferred to flasks of bacterial culture broth and grown in large quantities, after which it is relatively easy to isolate and purify recombinant plasmids from these cells.

[Figure 17.5 diagram labels: A plasmid carrying the ampicillin resistance gene (ampR) and a lacZ gene that contains the multiple cloning site. 1. Multiple cloning site of plasmid is cut with restriction enzyme. 2. DNA to be cloned cut with same restriction enzyme. 3. DNA ligase joins together plasmid DNA and DNA to be cloned to create recombinant plasmid; DNA insertion disrupts lacZ gene. 4. Transform bacteria with plasmids. Grow cells on agar containing ampicillin and X-gal. Bacteria with nonrecombinant plasmids are blue; bacteria with recombinant plasmids are white.]

FIGURE 17.5 In blue-white screening, DNA inserted into the multiple cloning site of a plasmid disrupts the lacZ gene. Bacteria containing recombinant DNA are unable to metabolize X-gal, resulting in white colonies and allowing direct identification of colonies that carry DNA inserts to be cloned. (Bottom) Photo of a Petri dish showing the growth of bacterial cells after uptake of plasmids. Cells in blue colonies contain vectors without DNA inserts (nonrecombinant plasmids), whereas cells in white colonies contain vectors carrying DNA inserts (recombinant plasmids). Nontransformed cells did not grow into colonies due to the presence of ampicillin in the plating medium.

Plasmids are still the workhorses for many applications of recombinant DNA technology, but they have a major limitation: Because they are small, they can only accept inserted pieces of DNA up to about 25 kilobases (kb) in size, and most plasmids can often only accept substantially smaller pieces. Therefore, as recombinant DNA technology developed and it became desirable to clone large pieces of DNA, other vectors were developed primarily for their ability to accept larger pieces of DNA and because they could be used with other types of host cells besides bacteria.

Other Types of Cloning Vectors

Phage vectors were among the earliest vectors used in addition to plasmids. These included genetically modified strains of bacteriophage λ, a double-stranded DNA virus. The genome of λ phage has been sequenced, and it has been modified to incorporate many of the important features of cloning vectors described earlier in this chapter, including a multiple cloning site. Phage vectors were popular for quite some time and are still in use today because they can carry inserts up to 45 kb, more than twice as long as DNA inserts in most plasmid vectors. DNA fragments are ligated into the phage vector to produce recombinant λ vectors that are subsequently packaged into phage protein heads in vitro, and then the phage are used to infect bacterial host cells growing on petri plates. Inside the bacteria, the vectors replicate and form many copies of infective phage, each of which carries a DNA insert. As they reproduce, they lyse their bacterial host cells, forming the clear spots known as plaques on the bacterial lawn (described in Chapter 8), from which phage can be isolated and the cloned DNA can be recovered.

The mapping and analysis of large eukaryotic genomes, including the human genome, require cloning vectors that can carry very large DNA fragments such as segments of an entire chromosome. Bacterial artificial chromosomes (BACs) and yeast artificial chromosomes (YACs) are examples of vectors that can be used for these purposes. BACs are essentially very large but low copy number (typically one or two copies per bacterial cell) plasmids that can accept DNA inserts in the 100- to 300-kb range. A YAC, like natural eukaryotic chromosomes, has telomeres at each end, origins of replication, and a centromere. These components are joined to selectable marker genes and to a cluster of restriction-enzyme recognition sequences for insertion of foreign DNA. Yeast chromosomes range in size from 230 kb to over 1900 kb, making it possible to clone DNA inserts from 100 to 1000 kb in YACs. The ability to clone large pieces of DNA in these vectors made them an important tool in the Human Genome Project (see Chapter 18).

Unlike the vectors described so far, expression vectors are designed to ensure mRNA expression of a cloned gene with the purpose of producing many copies of the gene’s encoded protein in a host cell. This is an important distinction since most plasmids, phage vectors, and YACs only carry DNA and do not signal the cell to transcribe it into mRNA. Expression vectors are available for both bacterial and eukaryotic host cells and contain the appropriate sequences to initiate both transcription and translation of the cloned gene. For many research applications that involve studies of protein structure and function, producing a recombinant protein in bacteria (or other host cells) and purifying the protein is a routine approach, although it is not always easy to properly express a protein that maintains its biological function. The biotechnology industry also relies heavily on expression vectors to produce commercially valuable protein products from cloned genes.

Introducing genes into plants is a common application that can be done in many ways. One widely used approach involves a species of soil bacterium and a type of plasmid called a Ti plasmid. We will discuss aspects of genetic engineering of food plants later in the text (see Special Topics Chapter 6: Genetically Modified Foods).

ESSENTIAL POINT
Vectors replicate autonomously in host cells and facilitate the cloning and manipulation of newly created recombinant DNA molecules. ■

17.2 DNA Libraries Are Collections of Cloned Sequences

Cloning DNA into smaller vectors, particularly plasmids, produces only relatively small DNA segments, representing just a single gene or even a portion of a gene. Even when several hundred genes are introduced into larger vectors such as BACs or YACs, one still needs a method for identifying the DNA pieces that were cloned. Consider this: In our cloning discussions so far, we have described how DNA can be inserted into vectors and cloned (a relatively straightforward process), but we have not discussed how a researcher knows what particular DNA sequence they have cloned. Simply cutting DNA and inserting it into vectors does not tell you what gene or sequences are being copied.

During the first several decades of DNA cloning, scientists created DNA libraries, which represent collections of cloned DNA. Depending on how a library is constructed, it may contain genes and noncoding regions of DNA. Generally speaking, there are two main types of libraries, genomic DNA libraries and complementary DNA (cDNA) libraries.

Genomic Libraries and cDNA Libraries

Ideally, a genomic library consists of many overlapping fragments of the genome, with at least one copy of every DNA sequence in an organism’s chromosomes, which together span the entire genome. In making a genomic library, chromosomal DNA is extracted from cells or tissues and cut randomly with restriction enzymes, and the resulting fragments are inserted into vectors. Vectors in the genomic library may contain more than one gene or only a portion of a gene. Also, libraries built from eukaryotic cells will contain coding and noncoding segments of DNA such as introns.

Since some vectors (such as plasmids) can carry only a few thousand base pairs of inserted DNA, BACs and YACs were commonly used to accommodate the large sizes of DNA necessary to span the approximately 3 billion bp of human DNA (as in the Human Genome Project). As you will learn later in the text (see Chapter 18), whole-genome sequencing approaches (see Figure 18.1) and new sequencing methodologies are replacing traditional genomic DNA libraries because they effectively allow one to sequence an entire genomic DNA sample without the need for inserting DNA fragments into vectors and cloning them in host cells. But the concept of a DNA library is still important for a number of modern applications.

Complementary DNA (cDNA) libraries offer certain advantages over genomic libraries and continue to be a useful methodology for specific approaches to gene cloning and other applications. This is primarily because a cDNA library contains DNA copies, called cDNA, that are made from mRNA molecules isolated from cultured cells or a tissue sample. A cDNA library therefore represents only the genes being expressed in cells at the time the library was made, unlike a genomic library, which contains all of the DNA, coding and noncoding, in a genome. This is a key point: cDNA libraries provide a snapshot, or catalog, of just the genes that were transcriptionally active in a tissue at a particular time.

As a result, cDNA libraries have been particularly useful for identifying and studying genes expressed in certain cells or tissues under certain conditions: for example, during development, cell death, cancer, and other biological processes. One can also use these libraries to compare expressed genes from normal tissues and diseased tissues. For instance, this approach has been widely used to identify genes involved in cancer formation, such as those genes that contribute to progression from a normal cell to a cancer cell and genes involved in cancer cell metastasis (spreading).

The initial steps required to prepare a cDNA library are shown in Figure 17.6. Key to the technique is the process of reverse transcription. Because most eukaryotic mRNAs have a poly-A tail at the 3′-end, a short oligo(dT) molecule is annealed to this tail to serve as a primer for initiating DNA synthesis by the enzyme reverse transcriptase. Reverse transcriptase uses the mRNA as a template to synthesize a complementary DNA strand (cDNA) and forms a double-stranded mRNA/cDNA duplex. The mRNA is then partially digested with the enzyme RNase H to produce gaps in the RNA strand. The 3′-ends of the remaining mRNA serve as primers for DNA polymerase I, which synthesizes a second DNA strand. The result is a double-stranded cDNA molecule that can be cloned into suitable vectors, usually plasmids.

Because one typically wouldn’t know what restriction enzymes could be used to cut cDNA produced by the method just described, one usually needs to attach linker sequences


to the ends of the cDNA in order to insert it into a plasmid. Linkers are short double-stranded oligonucleotides containing a restriction-enzyme recognition sequence (e.g., for EcoRI). After attachment to the cDNAs, the linkers are cut with EcoRI and ligated to vectors treated with the same enzyme.

FIGURE 17.6 Producing cDNA from mRNA. Steps shown: (1) add oligo(dT) primer, which anneals to the poly-A tail of the mRNA; (2) add reverse transcriptase to synthesize a strand of cDNA, giving a double-stranded RNA/DNA duplex; (3) partially digest RNA with RNase H; (4) add DNA polymerase I to synthesize second strand of DNA; (5) add DNA ligase to seal nicks, giving double-stranded cDNA.

Specific Genes Can Be Recovered from a Library by Screening

Genomic and cDNA libraries often consist of several hundred thousand different DNA clones, much like a large book library may have many books but only a few of interest to your studies in genetics. So how can libraries be used to locate a specific gene of interest? To find a specific gene, we need to identify and isolate only the clone or clones containing that gene. We must also determine whether a given clone contains all or only part of the gene we are trying to study.

For several decades an approach called library screening was routinely used to sort through a library and isolate specific genes of interest. Many of the first genes to be cloned and sequenced were identified this way. Library screening usually involves use of a probe, any DNA or RNA sequence that is complementary to some part of the target gene or sequence to be identified in a library. The probe will bind (hybridize) to any complementary DNA sequences present in one or more clones.

Probes are derived from a variety of sources; often related genes isolated from another species can be used if enough of the DNA sequence is conserved. For example, genes from rats, mice, or even Drosophila that have conserved sequence similarity to human genes can be used as probes to identify human genes during library screening. A probe must be labeled or tagged in some way so that it can be identified. Initially probes were labeled with radioactive isotopes, but modern applications use probes labeled with nonradioactive compounds that undergo chemical or color reactions to indicate the location of a specific clone in a library.

Although less central to research today, libraries still have their place in certain applications in the genetics lab. However, as you will learn later in the text (see Chapter 18), the basic methods of recombinant DNA technology, including DNA libraries, were the foundation for the development of more powerful whole-genome techniques that led to the genomics era of modern genetics and molecular biology. Genomic techniques, in which entire genomes are being sequenced without creating libraries, have largely replaced libraries at least for cloning and isolating one or a few genes at a time. We will also consider later (Chapter 18) how DNA sequence analysis using bioinformatics allows one to identify protein-coding and noncoding sequences in cloned DNA.

NOW SOLVE THIS

17.1 A plasmid that is both ampicillin and tetracycline resistant is cleaved with PstI, which cleaves within the ampicillin resistance gene. The cut plasmid is ligated with PstI-digested Drosophila DNA to prepare a genomic library, and the mixture is used to transform E. coli K12.

[Figure: plasmid map labeled with BamHI, PvuII, PstI, and SalI sites, the ampicillin resistance gene, and the tetracycline resistance gene.]


(a) Which antibiotic should be added to the medium to select cells that have incorporated a plasmid?
(b) If recombinant cells were plated on medium containing ampicillin or tetracycline and medium with both antibiotics, on which plates would you expect to see growth of bacteria containing plasmids with Drosophila DNA inserts?
(c) How can you explain the presence of colonies that are resistant to both antibiotics?

HINT: This problem involves an understanding of antibiotic selectable marker genes in plasmids and antibiotic selection for identifying bacteria transformed with recombinant plasmid DNA. The key to its solution is to recognize that inserting foreign DNA into the plasmid vector disrupts one of the antibiotic resistance genes in the plasmid.

ESSENTIAL POINT
DNA libraries are collections of cloned DNA that can be screened or sequenced to isolate specific sequences of interest. ■

17.3 The Polymerase Chain Reaction Is a Powerful Technique for Copying DNA

Cloning DNA using vectors and host cells is labor intensive and time consuming. In 1986, a technique called the polymerase chain reaction (PCR) was developed. PCR revolutionized recombinant DNA methodology and further accelerated the pace of biological research. The significance of this method was underscored by the awarding of the 1993 Nobel Prize in Chemistry to Kary Mullis, who developed the technique.

PCR is a rapid method of DNA cloning that extends the power of recombinant DNA and in many cases eliminates the need to use host cells for cloning. Genomic DNA or mitochondrial DNA can be used and the DNA to be cloned can come from many sources, including mummified remains, fossils, or forensic samples such as dried blood, semen, or hair. PCR is a method of choice for many applications, whether in molecular biology, human genetics, evolution, development, conservation, or forensics.

By copying a specific DNA sequence through a series of in vitro reactions, PCR can amplify target DNA sequences that are initially present in very small quantities in a population of other DNA molecules. When performing PCR, double-stranded target DNA to be amplified is placed in a tube with DNA polymerase, Mg2+ (an important cofactor for DNA polymerase), and the four deoxyribonucleoside triphosphates. In addition, some information about the nucleotide sequence of the target DNA is required. This sequence information is used to synthesize two oligonucleotide primers: short (typically about 20 nucleotides long) single-stranded DNA sequences, one complementary to the 5′ end of one strand of target DNA to be amplified and another primer complementary to the opposing strand of target DNA at its 3′ end. When added to a sample of double-stranded DNA that has been denatured into single strands, the primers bind to complementary nucleotides at each end within the sequence to be cloned. DNA polymerase can then extend the 3′ end of each primer to synthesize second strands of the target DNA. One complete reaction process, called a cycle, doubles the number of DNA molecules in the reaction [Figure 17.7(a)]. Repetition of the process produces large numbers of copied target DNA very quickly [Figure 17.7(b)]. If desired, the PCR products can be cloned into plasmid vectors for further use.

The amount of amplified DNA produced is theoretically limited only by the number of times these cycles are repeated, although several factors prevent PCR reactions from amplifying very long stretches of DNA. Most routine PCR applications involve a series of three reaction steps in a cycle. These three steps are as follows:

1. Denaturation: The double-stranded DNA to be cloned is denatured into single strands by heating to 92 to 95°C for about 1 minute.
2. Hybridization/Annealing: The temperature of the reaction is lowered to between 45°C and 65°C, which allows primer binding, also called hybridization or annealing, to the denatured, single-stranded DNA. The primers serve as starting points for DNA polymerase to synthesize new DNA strands complementary to the target DNA. When selecting a hybridization temperature for an experiment, factors such as primer length, base composition of primers (GC-rich primers are more thermally stable than AT-rich primers), and whether or not all bases in a primer are complementary to bases in the target sequence are among primary considerations.
3. Extension: The reaction temperature is adjusted to between 65°C and 75°C, and DNA polymerase uses the primers as a starting point to synthesize new DNA strands by adding nucleotides to the ends of the primers in a 5′ to 3′ direction.

Each cycle results in amplification, a doubling of the number of DNA molecules present at the start of the cycle. PCR is, therefore, a “chain reaction” because the products of previous cycles serve as templates for each subsequent cycle. Each cycle takes 2 to 5 minutes and can be repeated immediately, so that in less than 3 hours, 25 to 30 cycles result in over a million-fold increase in the amount of DNA. Instruments called thermocyclers, or simply PCR machines, that can be programmed to carry out a predetermined number of cycles, automate the process. The large amounts of a


specific DNA sequence produced can be used for many purposes, including cloning into plasmid vectors, DNA sequencing, clinical diagnosis, and genetic screening.

FIGURE 17.7 Steps in the polymerase chain reaction (PCR). (a) PCR: one cycle of amplification (doubles the number of DNA molecules): 1. denature DNA (92–95°C); 2. anneal primers (45–65°C); 3. extend primers (65–75°C). In this schematic representation, a relatively short sequence of DNA is shown being amplified. Notice that the first cycle produces amplified molecules with a strand that extends beyond the target sequence. (b) PCR: three cycles of amplification (first cycle: 2 product molecules; second cycle: 4 product molecules; third cycle: 8 product molecules). Repeated cycles of PCR can quickly amplify the target DNA sequence more than a millionfold. Products in part (b) that consist of only the target sequence are outlined and highlighted.

PCR requires a special type of DNA polymerase. Multiple PCR cycles involve repeated heating and cooling of samples, which eventually lead to heat denaturation and loss of activity of most proteins. PCR reactions rely on thermostable forms of DNA polymerase that are capable of withstanding multiple heating and cooling cycles without significant loss of activity. PCR became a major tool when DNA polymerase was isolated from Thermus aquaticus, a bacterium living in habitats like the hot springs of Yellowstone National Park, where it was first discovered. Called Taq polymerase, this enzyme is capable of tolerating extreme temperature changes and was the first thermostable polymerase used for PCR.

Although PCR is a valuable research tool with many advantages over previously used techniques, it does have limitations. One is that some information about the nucleotide sequence of the target DNA must be known in order to synthesize primers. Another is its sensitivity: even minor contamination of the sample with DNA from other sources can cause problems. For example, cells shed from a researcher’s skin can contaminate samples gathered from a crime scene, making it difficult to obtain accurate results. Also, PCR typically cannot be used to amplify particularly long segments of DNA. Normally, DNA polymerase in a PCR reaction extends primers only for relatively short distances. Because of this, scientists tend to amplify pieces of DNA that are only several thousand nucleotides in length, which is fine for most routine applications.
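The arithmetic behind the "chain reaction" can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a laboratory tool: the function names are hypothetical, the `efficiency` parameter (1.0 means perfect doubling every cycle) is an assumption added for illustration, and `wallace_tm` uses a common rule of thumb for short primers (2°C per A or T, 4°C per G or C) that is not taken from this chapter. It is included only to show why GC-rich primers are more thermally stable than AT-rich primers.

```python
import math


def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Copies present after `cycles` rounds of PCR.

    efficiency=1.0 models ideal doubling per cycle (an assumption);
    real reactions are usually somewhat less efficient.
    """
    return initial_copies * (1.0 + efficiency) ** cycles


def cycles_for_fold(fold, efficiency=1.0):
    """Smallest whole number of cycles giving at least `fold` amplification."""
    return math.ceil(math.log(fold) / math.log(1.0 + efficiency))


def wallace_tm(primer):
    """Rule-of-thumb melting temperature (deg C) for a short primer.

    Counts 2 degrees for each A or T and 4 degrees for each G or C.
    """
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    print(pcr_copies(1, 25))        # copies from one molecule after 25 ideal cycles
    print(cycles_for_fold(1e6))     # ideal cycles needed for a millionfold increase
    print(wallace_tm("ATATATATATATATATATAT"), wallace_tm("GCGCGCGCGCGCGCGCGCGC"))
```

With ideal doubling, a millionfold increase needs only about 20 cycles, which is consistent with the statement that 25 to 30 cycles give well over a millionfold amplification even when individual cycles are less than perfectly efficient.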


PCR Applications

PCR has become one of the most versatile and widely used techniques in modern genetics and molecular biology, with many different applications. DNA cloning using a PCR-based approach has several advantages over library cloning approaches. PCR is rapid and can be carried out in a few hours, rather than the days required for making and screening DNA libraries. PCR is also very sensitive and amplifies specific DNA sequences from vanishingly small DNA samples. PCR sensitivity is invaluable in several kinds of applications, including genetic testing, forensics, and molecular paleontology. With carefully designed primers, DNA samples that have been partially degraded, contaminated with other materials, or embedded in a matrix (such as the fossilized tree resin known as amber) can be amplified. This allows the study of samples from fossils or from single cells, such as those recovered at crime scenes, where a single hair or even a saliva-moistened postage stamp can serve as the sole source of the DNA. Later in the text (see Special Topics Chapter 5, DNA Forensics), we will discuss how PCR is used in human identification, including remains identification, and in forensic applications.

Another important application of PCR is as a diagnostic tool. As you will learn in Special Topics Chapter 2, Genetic Testing, gene-specific primers provide a way of using PCR for screening mutations involved in genetic disorders. PCR is also a key method for detecting bacteria and viruses (such as hepatitis or HIV) in humans, and pathogenic bacteria such as E. coli and Staphylococcus aureus in contaminated food.

Reverse transcription PCR (RT-PCR) is a powerful methodology for studying gene expression, that is, mRNA production by cells or tissues. In RT-PCR, RNA is isolated from cells or tissues to be studied, and reverse transcriptase is used to generate cDNA molecules, as described earlier when we discussed preparation of cDNA libraries. This reaction is followed by PCR to amplify cDNA with a set of primers specific for the gene of interest. Amplified cDNA fragments are then separated and visualized on an agarose gel. Because the amount of amplified cDNA in RT-PCR is based on the relative number of mRNA molecules in the starting reaction, RT-PCR can be used to evaluate relative levels of gene expression in different samples.

Finally, in discussing PCR approaches, one of the most valuable modern techniques is quantitative real-time PCR (qPCR) or simply real-time PCR. A key facet of this method is the ability to quantify the PCR product as it is made during an experiment (as the reactions occur in “real time”) without having to run a gel. This, in turn, allows the calculation of the amount of template DNA originally present in a sample.

ESSENTIAL POINT
PCR allows DNA to be amplified, or copied, without cloning and is a rapid and sensitive method with wide-ranging applications. ■

NOW SOLVE THIS

17.2 You have just created the world’s first genomic library from the African okapi, a relative of the giraffe. No genes from this genome have been previously isolated or described. You wish to isolate the gene encoding the oxygen-transporting protein β-globin from the okapi library. This gene has been isolated from humans, and its nucleotide sequence and amino acid sequence are available in databases. Using the information available about the human β-globin gene, what strategy can you use to isolate this gene from the okapi library?

HINT: This problem asks you to design PCR primers to amplify the β-globin gene from a species whose genome you just cloned. The key to its solution is to remember that you have at your disposal sequence data for the human β-globin gene and to consider that PCR experiments require the use of primers that bind to complementary bases in the DNA to be amplified.

For more practice, see Problem 13.

17.4 Molecular Techniques for Analyzing DNA and RNA

A wide range of molecular techniques is available to almost anyone who does research involving DNA and RNA, particularly those who study the structure, expression, and regulation of genes. There are far more techniques available than we can address in this chapter. In the following sections, we consider some of the techniques that are most commonly used to analyze DNA and RNA.

Agarose Gel Electrophoresis

One of the most routine techniques for analyzing DNA is agarose gel electrophoresis, a method that separates DNA fragments by size, with the smallest pieces moving farthest through the gel (see Chapter 9; refer to Figure 9.14). The fragments form a series of bands that can be visualized by treating the gel with DNA-binding stains such as ethidium bromide and illuminating it with ultraviolet light (Figure 17.8). This is usually the method of choice when smaller pieces of DNA need to be analyzed or isolated.

Before DNA sequencing and bioinformatics became routine, newly cloned DNA would be digested with enzymes and separated by gel electrophoresis. The digestion pattern of fragments generated could then be interpreted to determine the location of restriction sites for different enzymes, to create a restriction map. Restriction maps are now often created by simply using software to identify restriction-enzyme cutting sites in sequenced DNA. The Exploring Genomics


exercise in this chapter involves a Web site, Webcutter, which is commonly used for generating restriction maps.

FIGURE 17.8 An agarose gel containing separated DNA fragments stained with a DNA-binding dye (ethidium bromide) and visualized under ultraviolet light. Smaller fragments migrate faster and farther than do larger fragments, resulting in the distribution shown. Molecular techniques involving agarose gel electrophoresis are routinely used in a wide range of applications.

Nucleic Acid Blotting and Hybridization Techniques

Several of the techniques described in this chapter and elsewhere in the book rely on hybridization between complementary nucleic acid (DNA or RNA) molecules. One of the most widely used hybridization methods is called Southern blot analysis or simply Southern blotting (after Edwin Southern, who devised it). Southern blotting is another pioneering method that served essential roles in the early decades of DNA cloning, such as identifying which clones in a library contained a given DNA sequence, identifying specific genes in genomic DNA digested with a restriction enzyme, and identifying the number of copies of a particular sequence or gene that are present in a genome. Modern DNA sequencing approaches have now replaced most of these application examples.

Gel electrophoresis can be used to characterize the number and molecular weights of fragments produced by restriction digestion of small genomes, when the number of fragments generated is relatively low. However, digestion of large genomes (such as the human genome, with more than 3 billion nucleotides) would produce so many different fragments that they would run together on a gel to produce a continuous smear. Southern blotting enables the identification of a particular DNA fragment of interest.

Southern blotting involves separation of DNA fragments by gel electrophoresis, transfer or “blotting” of fragments to a DNA-binding membrane, and hybridization of these fragments to a labeled DNA or RNA probe. The membrane is washed to remove excess probe and overlaid with a piece of X-ray film for autoradiography or detected with a digital camera with chemiluminescence probes. Hybridization identifies specific DNA sequences present in the fragments, because only fragments hybridizing to the probe are visualized (Figure 17.9).

FIGURE 17.9 (a) Agarose gel stained with ethidium bromide to show DNA fragments. (b) Chemiluminescent image of a Southern blot prepared from the gel in part (a). Only those bands containing DNA sequences complementary to the probe show hybridization.

Southern blotting led to the subsequent development of other widely used blotting approaches. RNA blotting was called Northern blot analysis or simply Northern blotting. Prior to the development of RT-PCR and real-time PCR, Northern blotting was commonly used to study gene expression (RNA production) by cells and tissues because it could both characterize and quantify the transcriptional activity of genes.

A related blotting technique for analyzing proteins was also developed. It is known as Western blotting. Thus part of the historical significance of Southern blotting is that it led to the development of other blotting methods that are key tools for studying nucleic acids and proteins.

Finally, as noted earlier in the text (see Chapter 9), fluorescence in situ hybridization (FISH) is a powerful tool that involves hybridizing a probe directly to a chromosome or RNA without blotting (see Figure 9.13 and Chapter 16 opening photograph). FISH can be carried out with isolated chromosomes on a slide or directly in situ in tissue sections or entire organisms, such as embryos. One type of application is in the field of developmental genetics, where FISH is used to identify which cell types in an embryo express different genes during specific stages of development (Figure 17.10).

Variations of the FISH technique are also being used to produce spectral karyotypes in which individual chromosomes can be detected using probes labeled with dyes


that will fluoresce at different wavelengths (see Chapter 6 opening photograph). Spectral karyotyping has proven to be extremely valuable for detecting deletions, translocations, duplications, and other anomalies in chromosome structure, such as chromosomal rearrangements, and for detecting chromosomal abnormalities in cancer cells.

FIGURE 17.10 In situ hybridization of a zebrafish embryo 48 hours after fertilization. The probe used shows expression of atp2a1 mRNA, which encodes a muscle-specific calcium pump, and is visualized as a dark blue stain. Notice that this staining is restricted to muscle cells surrounding the developing spinal cord of the embryo.

ESSENTIAL POINT
DNA and RNA can be analyzed through a variety of methods that involve hybridization techniques. ■

17.5 DNA Sequencing Is the Ultimate Way to Characterize DNA at the Molecular Level

In a sense, cloned DNA, from a single gene to an entire genome, is completely characterized at the molecular level only when its nucleotide sequence is known. The ability to sequence DNA has greatly enhanced our understanding of genome organization and increased our knowledge of gene structure, function, and mechanisms of regulation.

Historically, the most commonly used method of DNA sequencing was developed by Fred Sanger and colleagues during the 1970s and is known as dideoxy chain-termination sequencing or simply Sanger sequencing. Because Sanger sequencing was an important foundational method for newer, more modern approaches to DNA sequencing, we will briefly discuss the technique here. A double-stranded DNA molecule whose sequence is to be determined is converted to single strands, one of which is used as a template for synthesizing a series of complementary strands. The template DNA is mixed with a primer that is complementary to either the target DNA or the vector, DNA polymerase, and the four deoxyribonucleotide triphosphates (dATP, dCTP, dGTP, and dTTP).

The key to the Sanger technique is the addition of a small amount of modified deoxyribonucleotides, called dideoxynucleotides (abbreviated ddNTPs) (Figure 17.11, inset box). Notice that a dideoxynucleotide has a 3′ hydrogen instead of a 3′ hydroxyl group. Dideoxynucleotides are called chain-termination nucleotides because they lack the 3′ oxygen required to form a phosphodiester bond with another nucleotide. If ddNTPs are included in a DNA synthesis reaction, the polymerase will occasionally (randomly) insert a ddNTP instead of a dNTP into a growing DNA strand. Once this occurs, synthesis terminates because DNA polymerase cannot add new nucleotides to a ddNTP due to its lack of a 3′ oxygen. The Sanger reaction takes advantage of this key modification.

This outcome is illustrated in Step 2 of Figure 17.11. Notice that the shortest fragment generated is a sequence that has added ddCTP to the 3′ end of the primer and the chain has terminated. Over time as the reaction proceeds, eventually a ddNTP will be inserted at every location in the sequence. The result is a population of newly synthesized DNA molecules each of which is terminated by a ddNTP and that differ in length by one nucleotide. The size difference allows for separation of reaction products by gel electrophoresis, which can then be used to determine the sequence.

When the Sanger technique was first developed, four separate reaction tubes, each with a different single ddNTP (e.g., ddATP, ddCTP, ddGTP, and ddTTP), were used. These reactions typically used either a radioactively labeled primer or a radioactively labeled ddNTP to permit analysis of the sequence following polyacrylamide gel electrophoresis and autoradiography. Historically, this approach involved large polyacrylamide gels in which each reaction was loaded on a separate lane of the gel and the ladder-like banding patterns revealed by autoradiography were read to determine the sequence. This original approach could typically read about 800 bases for each of 100 DNA molecules simultaneously. Read length (the amount of sequence that can be generated in a single individual reaction) and the total amount of DNA sequence generated in a sequence run (effectively, the read length times the number of reactions an instrument can process during a given period of time) together have become a hot area for innovation in sequencing technology.

Modifications of the Sanger technique in the mid-1980s led to technologies that allowed sequencing reactions to occur in a single tube. As shown in Figure 17.11, the four ddNTPs were each labeled with a different-colored fluorescent dye. These reactions were carried out in PCR-like fashion using cycling reactions that permit greater read and run capabilities. The reaction products were separated through a single, ultrathin-diameter polyacrylamide


tube gel called a capillary gel (capillary gel electrophoresis). As DNA fragments move through the gel, they are scanned with a laser. The laser stimulates fluorescent dyes on each DNA fragment, which then emit different wavelengths of light for each ddNTP. Emitted light is captured by a detector that amplifies the signal and feeds this information into a computer to convert the light patterns into a DNA sequence that is technically called an electropherogram or chromatograph. The data are represented as a series of colored peaks, each corresponding to one nucleotide in the sequence.

FIGURE 17.11 Computer-automated DNA sequencing using the chain-termination (modified Sanger) method. Panels: 1. reaction components (DNA template, primer, DNA polymerase, dNTPs, labeled ddNTPs); 2. primer extension and chain termination; 3. DNA fragments separated by capillary gel electrophoresis; 4. laser and detector detect fluorescence of each ddNTP and provide input to a computer for sequence analysis. Template strand shown: 3′-TCAGGATCGGATCTGTAC-5′; primer: 5′-AGTCCTAGC-3′. The inset box at upper right illustrates dideoxynucleotide (ddNTP) structure: a deoxynucleotide (dNTP) carries an OH group at the 3′ carbon, whereas a dideoxynucleotide (ddNTP) carries an H. (1) A primer is annealed to a sequence adjacent to the DNA being sequenced (usually near the multiple cloning site of a cloning vector). A reaction mixture is added to the primer-template combination. This includes DNA polymerase, the four dNTPs, and small molar amounts of ddNTPs labeled with fluorescent dyes. (2) All four ddNTPs are added to the same reaction tube. During primer extension, the polymerase occasionally (randomly) inserts a ddNTP instead of a dNTP, terminating the synthesis of the chain because the ddNTP does not have the OH group needed to attach the next nucleotide. Over the course of the reaction, all possible termination sites will have a ddNTP inserted, and thus all possible lengths of chains are produced. The products of the reaction are added to a single lane on a capillary gel (3), and the bands are read by a detector and imaging system (4) from the newly synthesized strand. In this case, the sequence obtained begins with 5′-CTAGACATG-3′ as seen in the chromatograph in step 4.

For about two decades following the early 1990s, DNA sequencing was largely performed through computer-automated Sanger-reaction-based technology (shown in Figure 17.11) and referred to as computer-automated high-throughput DNA sequencing. These systems were a big improvement over manual Sanger systems because they could generate relatively large amounts of DNA sequence in relatively short periods of time. Computer-automated sequencers could achieve read lengths of approximately 1000 bp with about 99.999 percent accuracy for about $0.50 per kb. Automated DNA sequencers of this time period often contained multiple capillary gels (as many as 96) that were several feet long and could process several thousand bases of sequences, so many of these instruments made it possible to generate over 2 million bp of sequences in a day! Such systems became essential for the rapidly accelerating progress of the Human Genome Project. But by around 2005 and in the time since, sequencing technologies were to improve dramatically.
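The logic of chain termination can be sketched in Python using the template and primer shown in Figure 17.11. This is a simplified illustration under stated assumptions rather than a model of the chemistry: the function names are hypothetical, and instead of simulating random ddNTP incorporation the sketch simply enumerates every possible terminated product, which is the population the text says the reaction eventually produces. Sorting the products by length stands in for capillary gel electrophoresis, and reading the last base of each product stands in for detecting the fluorescent ddNTP.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def new_strand(template_3to5):
    """Full complementary strand, written 5'->3', for a template written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3to5)


def termination_products(template_3to5, primer_5to3):
    """All chain-terminated products: the primer plus 1, 2, ... added bases.

    Each product ends in the ddNTP that stopped synthesis at that position.
    """
    full = new_strand(template_3to5)
    if not full.startswith(primer_5to3):
        raise ValueError("primer is not complementary to the 3' end of the template")
    return [full[:end] for end in range(len(primer_5to3) + 1, len(full) + 1)]


def read_sequence(products):
    """Order products shortest to longest and read each terminal base."""
    return "".join(fragment[-1] for fragment in sorted(products, key=len))


if __name__ == "__main__":
    template = "TCAGGATCGGATCTGTAC"   # written 3'->5', as in Figure 17.11
    primer = "AGTCCTAGC"              # written 5'->3'
    fragments = termination_products(template, primer)
    print(fragments[0])               # shortest product: primer plus one ddCTP
    print(read_sequence(fragments))   # sequence read from the terminal bases
```

Run on the figure's sequences, the shortest product ends in C and the terminal bases read in order of increasing fragment length give CTAGACATG, matching the sequence reported in the figure caption.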


Sequencing Technologies Have Progressed Rapidly

Sanger sequencing approaches, including those involving computer-automated instruments, are rarely used today except for occasional sequencing of a relatively short piece of DNA or in labs that cannot afford more expensive sequencing instruments. When it comes to sequencing entire genomes, Sanger sequencing technologies and early-generation computer-automated approaches are outdated. Compared to newer technologies, the costs of those approaches were relatively high, and sequencing output, even with computer automation, was simply not high enough to support the growing demand for genomic data. This demand is being driven in large part by personalized genomics (see Chapter 18) and the desire to reveal the genetic basis of human diseases, which involves routine sequencing of complete individual human genomes.

The development of genomics has spurred a demand for sequencers that are capable of generating millions of bases of DNA sequences in a relatively short time. Next-generation sequencing (NGS) technologies (the second generation after Sanger methods) were the next big advance in DNA sequencing. NGS technologies dispensed with first-generation methods (the Sanger technique and capillary electrophoresis) in favor of sophisticated, parallel formats (simultaneous reaction formats) that synthesized DNA from tens of thousands of identical strands simultaneously and then used state-of-the-art fluorescence imaging techniques to detect newly synthesized strands and average sequence data across many molecules being sequenced. NGS technologies provided an unprecedented capacity for generating massive amounts of DNA sequence data rapidly and at dramatically reduced costs per base. NGS has become so routine that now when scientists talk about “sequencing” they are referring to NGS.

Ultimately, several companies emerged as winners in the race to commercialize NGS technology. The Illumina HiSeq, and variations of this instrument, is one of the most common NGS platforms currently used in cutting-edge sequencing laboratories today. This system uses a sequencing-by-synthesis (SBS) approach. DNA fragments serve as templates, and nucleotides labeled with different dyes are incorporated. Next, unincorporated nucleotides are washed away and incorporated nucleotides are imaged. This cycle is then repeated. In the SBS method developed by Illumina, the DNA fragments are attached to a solid support and then, using reactions similar to the Sanger method, the fluorescently tagged terminator nucleotides are added and detected. The fluorescent tags and terminator portions of the nucleotides are removed to allow another cycle of extension.

SBS methods can now generate about 600 Gb of data in 10 days, enough to sequence four complete human genomes, with each base sequenced an average of 30 times for accuracy! The instrumentation needed to run these platforms is expensive. For example, the Illumina HiSeq instrument costs about $650,000. But given the massive amounts of sequence data NGS methods can generate, the average cost per base is much lower than Sanger sequencing. Incidentally, NGS technologies also created major data management challenges for saving and storing large data files.

Shortly after NGS methods were commercialized, companies were announcing progress on third-generation sequencing (TGS). TGS methods are based on strategies that sequence a single molecule of single-stranded DNA, and at least four different approaches are being explored.

Pacific Biosciences’ PacBio technology is one of the leaders in TGS and involves an approach known as single-molecule sequencing in real time (SMRT). The PacBio instrument works by attaching single-stranded molecules of the DNA to be sequenced to a single molecule of DNA polymerase anchored to a substrate and then visualizing, in real time, the polymerase as it synthesizes a strand of DNA (see Figure 17.12). The DNA polymerase is confined within a

[Figure 17.12 panels]
1. DNA polymerase located in a nanopore anchored to a solid substrate binds a single-stranded DNA molecule to be sequenced.
2. DNA polymerase adds fluorescently tagged nucleotides to synthesize DNA.
3. Fluorescent tag is cleaved off each base as it is added to the DNA strand.

FIGURE 17.12 Third-generation sequencing (TGS). A simplified version of one approach to TGS is shown here. In this example, a DNA polymerase molecule anchored within a nanopore binds to a single strand of DNA. As the polymerase incorporates fluorescently labeled nucleotides into a new DNA strand (shown in pink), each base emits a characteristic color that can be detected.


nanopore (a hole of about 10 nm in diameter located within a thin layer of metal on a glass substrate), a setup needed to detect the addition of individual nucleotides as they are added to the growing strand. The terminal phosphates of the nucleotides are tagged with a fluorescent dye, with each base assigned a characteristic color. Upon incorporation of a nucleotide, the tag is cleaved along with the phosphate and flashes. Each colored flash is detected and recorded.

The PacBio was one of the first “long-read” instruments to reach the market. It generates read lengths of over 10,000 bp. Most TGS technologies still have somewhat high error rates: about 15 percent errors per sequence generated. The list price for one PacBio sequencer is about $350,000, which is making this technology more affordable at least for biotechnology companies, pharmaceutical companies, and well-funded academic research laboratories.

A few years ago, Oxford Nanopore Technologies developed a portable, single-molecule sequencer called the MinION that is the size of a USB memory stick! Although accuracy of this sequencer limits its applications, there is no reason to think that the technology for highly accurate pocket-sized sequencers will not advance in the near future.

Rapid advances in sequencing technology were driven by the demands of genome scientists (particularly those working on the Human Genome Project) to rapidly generate more sequence with greater accuracy and at lower cost. Because of these technological advances, innovative approaches to genome sequencing are driving a range of new research and clinical applications. RNA sequencing has also emerged as a new technique that makes it possible to measure gene expression on a genome-wide scale. We will discuss this later in the book (in Chapter 18).

ESSENTIAL POINT
DNA sequencing technologies are changing rapidly. Next-generation and third-generation methods produce fairly large amounts of accurate sequence data in a short time at lower cost than traditional approaches. ■

17.6 Creating Knockout and Transgenic Organisms for Studying Gene Function

Thus far we have focused on approaches to working with recombinant DNA in vitro. Recombinant DNA technology has also made it possible to directly manipulate genes in vivo in ways that allow scientists to learn more about gene function in living organisms. These approaches also enable scientists to create genetically engineered plants and animals for research and for commercial applications. Here we discuss gene knockout technology and the creation of transgenic animals as examples of gene targeting, a collection of methods that have revolutionized research in genetics.

Gene Targeting and Knockout Animal Models

The concept behind gene targeting is to manipulate a specific allele, locus, or base sequence, through approaches that involve homologous recombination, to learn about the functions of a gene of interest.

In the 1980s, scientists devised a gene-targeting technique for creating gene knockout (often abbreviated as KO) organisms, specifically mice. The pioneers of knockout technology, Dr. Mario Capecchi of the University of Utah and colleagues Oliver Smithies of the University of North Carolina, Chapel Hill, and Sir Martin Evans of Cardiff University, United Kingdom, received the 2007 Nobel Prize in Physiology or Medicine for developing this technique.

A knockout is an example of a loss-of-function mutation. One can learn about gene function by creating a knockout that disrupts or eliminates copies of a specific gene or genes of interest and then asking, “What happens?” If physical, behavioral, biochemical, or other metabolic changes are observed in the KO animal, this would suggest that the gene of interest has some functional role or roles in the observed phenotypes. The KO techniques developed in mice led to similar technologies for making KOs in other animal species, including zebrafish, rats, pigs, and fruit flies, as well as in plants.

Research in genetics, molecular biology, and biomedical fields has been revolutionized by the study of KO mice and other KO organisms. Our understanding of gene function has been advanced, and transgenic animals have been created, often resulting in animal models for many human diseases. KO animals are not limited to removal of a single gene: double-knockout animals (DKOs) and even triple-knockout animals (TKOs) are also possible. This approach is typically used when scientists want to study the functional effects of disrupting two or three genes thought to be involved in a related mechanism or pathway. Applications of KO technology have also provided the foundation for gene-targeting approaches in gene therapy that we discuss later in the text (see Special Topics Chapter 3, Gene Therapy).

Generally, generating a KO mouse or a transgenic mouse is a very labor-intensive project that can take several years of experiments and crosses and a significant budget to complete. However, once a KO mouse is made, assuming it is fertile, a colony of mice can be maintained. Increased global availability of this resource is possible because scientists often share KO mice and biotechnology companies produce and sell them commercially.

A KO animal can be made in several ways, but the same basic methods apply when making most KO animals (Figure 17.13). Because newer technologies such as CRISPR-Cas (see Section 17.7) are becoming the methods


of choice for making KO and transgenic animals, here we provide a very brief overview of traditional methods. To begin, the DNA sequence for the KO target gene and information about noncoding flanking genomic sequences must be known. A targeting vector is then constructed, which contains a copy of the gene of interest that has been mutated by insertion of a large segment of foreign DNA, typically a selectable marker gene. The inserted DNA disrupts the reading frame of the target gene so that a nonfunctional protein will be made if the mutated gene is expressed. The targeting vector is introduced into cells and undergoes homologous recombination with the genomic copy of the gene of interest (the target gene), disrupting or replacing it, thereby rendering it nonfunctional. The selectable marker gene helps scientists determine whether or not the targeting vector has been properly introduced into the genome.

There are several ways to introduce the targeting vector into cells. One popular approach involves using electroporation to deliver the vector into cultured embryonic stem (ES) cells. The ES cells are harvested from the inner cell mass of a mouse embryo at the blastocyst stage. Alternatively, the targeting vector is directly injected into the blastocyst with the hope that it will enter ES cells in the inner cell mass. Sometimes it is possible to make KOs by isolating newly fertilized eggs from a female mouse (or female of another desired animal species) and microinjecting the targeting vector DNA directly into the diploid nucleus of the egg or into one of the haploid pronuclei prior to fusion [Figure 17.14(a)].

Only a small percentage of ES cells take up the targeting vector. In those that do, the actions of the endogenous enzyme recombinase catalyze homologous recombination between the targeting vector and the sequence for the gene of interest. Replacement of the original gene usually occurs on only one of the two chromosomes.

Scientists can select recombinant ES cells by treating them with a reagent that kills cells lacking the targeting vector. The selected cells are injected into mouse embryos at the blastocyst stage where they will be incorporated into the inner cell mass of the blastocyst. Several blastocysts are then placed into the uterus of a surrogate mother mouse, sometimes called a pseudopregnant mouse: a female that has been mated with an infertile male to stimulate production of pregnancy hormones. These, in turn, trigger physiological changes that make the uterus receptive to implantation of the blastocysts.

The surrogate will give birth to mice that are chimeras: Some cells in their body arise from injected KO stem cells, and others arise from endogenous inner cell mass cells of the recipient blastocyst. As long as germ cells develop from

[Figure 17.13 panels]
1. Designing the targeting vector (neoR vector; target gene in genome)
2. Transform ES cells (from agouti mouse) with targeting vector and select cells for recombination
3. Microinject ES cells into blastocyst (inner cell mass) from black-colored mouse
4. Transfer into pseudopregnant surrogate mother; birth of chimeras
5. Chimeric mouse bred to black mouse to create mice heterozygous (+/-) for gene knockout
6. Breed heterozygous mice to produce mice homozygous (-/-) for gene knockout

FIGURE 17.13 A basic strategy for producing a knockout mouse.
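The breeding steps at the end of this strategy (heterozygotes first, then homozygous knockouts from a heterozygote-by-heterozygote mating) follow ordinary single-gene Mendelian ratios. The short Python sketch below enumerates the expected genotype fractions for such a cross. The function name `cross` and the two-character genotype strings are illustrative choices, not notation from the text, and the sketch assumes a single autosomal gene with no effect of genotype on viability.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1, parent2):
    """Return expected genotype fractions for a single-gene cross.

    Each parent is a 2-character string of alleles: '+' is wild type and
    '-' is the knockout allele (illustrative notation only).
    """
    counts = Counter()
    for a, b in product(parent1, parent2):
        # Sort the pair so that '+-' and '-+' count as the same genotype.
        genotype = "".join(sorted((a, b)))
        counts[genotype] += 1
    total = sum(counts.values())
    return {g: Fraction(n, total) for g, n in counts.items()}


# Mating two heterozygous (+/-) mice.
f2 = cross("+-", "+-")
for genotype, fraction in sorted(f2.items()):
    print(genotype, fraction)
```

Under these assumptions, one quarter of the offspring of two heterozygotes are expected to be homozygous for the knockout allele. A lower observed fraction is one hint that the knockout reduces viability.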


FIGURE 17.14 (a) Microinjecting DNA into a fertilized egg to create a knockout (or a transgenic) mouse. A fertilized egg is held by a suction or holding pipette (seen below the egg), and a microinjection needle delivers cloned DNA into the nucleus. (b) On the left is a knockout or null mouse (-/-) for both copies of the leptin (Lep) gene. The mouse on the right is wild type (+/+) for the Lep gene. Normal copies of the Lep gene produce a peptide hormone called leptin. The Lep knockout mouse weighs almost five times as much as its wild-type sibling.

recombinant ES cells, the mutated gene sequence will be inherited in all of the offspring generated by these mice.

Typically, however, most F1 generation KO mice produced this way are heterozygous (+/-) for the gene of interest and not homozygous for the KO. Sibling matings of F1 animals can then be used to generate homozygous KO animals, referred to as null mice and given a -/- designation because they lack wild-type copies of the targeted gene of interest. As mentioned at the beginning of this section, KO animal models serve invaluable roles for learning about gene function, and they continue to be essential for biomedical research on disease genes [see Figure 17.14(b)].

Despite all of the work that goes into trying to produce a KO organism, sometimes viable offspring are never born. The KO is said to result in embryonic lethality. Knocking out a gene that is important during embryonic development will kill the mouse, often before researchers have a chance to study it. When this occurs, researchers typically examine embryos from the surrogate mouse to try to determine at what stage of embryonic development embryos are dying. This examination often reveals specific organ defects, which can be informative about the function of the KO gene.

If null mice for a particular gene of interest cannot be derived by traditional KO approaches, an alternative approach called conditional knockout can often provide a way to study such a gene. Conditional knockouts allow one to control the particular time in an animal’s development that a target gene is disrupted. For example, if a target gene displays embryonic lethality, one can use a conditional KO to allow an animal to progress through development and be born before activating the disruption. Another advantage of conditional KOs is that target genes can also be turned off in a particular tissue or organ instead of the entire animal.

A worldwide collaboration to disable all (20,000) protein-coding genes in the mouse genome was recently completed. The purpose of the initiative was to create a KO mouse resource that would help advance human disease research. Rather than using a gene-by-gene approach to make KOs, researchers used a high-throughput technique that involves knocking out thousands of genes in embryonic stem cells and then creating offspring in which specific genes of interest can then be turned off. Projects such as this provide a “library” of KO animals for the research community so that individual scientists can use these animals rather than having to go through the expense and technical challenges of making their own knockouts.

Making a Transgenic Animal: The Basics

Transgenic animals, also sometimes called knock-in animals, express, or overexpress, a particular gene of interest (the transgene). In other words, genes are turned on instead of off, the opposite of KOs. As with KOs, many of the prevailing techniques used to make transgenic animals were developed in mice; Figure 17.15 illustrates two examples.

The method of creating a transgenic animal is conceptually simple, and many of the steps are similar to those involved in making a KO animal. But instead of trying to disrupt a


FIGURE 17.15 Examples of transgenic mice. (a) Transgenic mice incorporating the green fluorescent protein gene (gfp), a popular reporter gene, from jellyfish enable scientists to tag particular genes with green fluorescent protein. Thanks to the expression of gfp, which makes the transgenic mice glow green under ultraviolet light, scientists can track activity of the tagged genes, including activity in subsequent generations of mice generated from these transgenics. (b) The mouse on the left is transgenic for a rat growth hormone gene, cloned downstream from a mouse metallothionein promoter. When the transgenic mouse was fed zinc, the metallothionein promoter induced the transcription of the growth hormone gene, stimulating the growth of the transgenic mouse.

target gene, a vector with a functional transgene is created to undergo homologous recombination and enter into the host genome. In some applications, tissue-specific promoter sequences can be used so that transgene expression is limited to certain tissues. For example, in the biotechnology industry mammary-specific promoters are used so that recombinant products accumulate in milk for subsequent purification. It is often much easier to make a transgenic animal than a KO animal because vector incorporation into the host genome does not need to occur at a particular locus (just hopefully in a noncoding region) as is necessary when making a KO.

As with KOs, the vector with the transgene can be put into ES cells or injected directly into embryos or eggs. Likewise, a marker or reporter gene is included on the vector to help researchers identify successful transformation. In a relatively small percentage of embryos or eggs, the transgenic DNA becomes randomly inserted into the genome by recombination due to the action of naturally occurring DNA recombinases. The rest of the process is similar to making a KO: Embryos are implanted in surrogate mothers, and crosses from resulting progeny are used to derive mice that are homozygous for the transgene.

There are many variations in the types of transgenic organisms that can be created and in their roles in basic and applied research. In some experiments, a transgene is overexpressed in order to study its effects on the phenotype of the organism. Other experimental variations include creation of transgenic animals that express mutant genes or genes from a different species. For example, so-called humanized mice (transgenic mice that express human genes) have been created to study responses to different drugs for treating diseases, to understand the roles of genes in evolution, and to study embryonic development. Transgenic animals and plants are also created to produce commercially valuable biotechnology products. In Special Topics Chapter 6, Genetically Modified Foods, you will learn about examples of transgenic food crops.

As mentioned earlier in this section, newly developed technologies such as CRISPR-Cas are replacing these more traditional methods of gene manipulation. Therefore, we will conclude this chapter with a section on genome editing to precisely modify a particular gene or sequence in the genome.

ESSENTIAL POINT
Gene-targeting methods to create knockout animals and transgenic animals are widely used, valuable approaches for studying gene function in vivo. ■

17.7 Genome Editing with CRISPR-Cas

The process of genome editing involves removing, adding, or changing specific DNA sequences in the genome of living cells. The ability to edit genomes both specifically and efficiently has broad implications for research, biotechnology, and medicine. Essential to this process is a


“programmable” nuclease that can be directed to cut the genome in a sequence-specific manner. Later in the text (see Special Topics Chapter 3, Gene Therapy), we discuss how genome editing with transcription activator-like effector nucleases (TALENs) and zinc finger nucleases (ZFNs) can be used for gene therapy. However, the fastest and most efficient approach to genome editing is the CRISPR (clustered regularly interspaced short palindromic repeats)-Cas system. CRISPR-Cas was discovered by scientists trying to understand how bacteria fight viral infection and was later adopted as a genome-editing tool. See Chapter 15 for an explanation of how bacteria use CRISPR-Cas as an immune response system against foreign genetic material such as viral DNA.

The CRISPR-Cas9 Molecular Mechanism

The most widely used CRISPR-Cas system for genome editing uses a nuclease called Cas9 from the bacterium Streptococcus pyogenes. Cas9 contains two nuclease domains and thus can create a double-strand break in DNA. However, Cas9 will only cut DNA sequences if two specific parameters are satisfied [Figure 17.16(a)].

First, Cas9 will only cut DNA near a specific sequence called a protospacer adjacent motif (PAM), which is 5′-NGG-3′, where N is any base. Cas9 recognizes the PAM sequence and cuts three base pairs upstream of it. Second, Cas9 requires a single guide RNA (sgRNA), a short RNA molecule that is required for Cas9 activity and specificity. Cas9 is directed to cut sequences that are complementary to a 20-nucleotide region of the sgRNA. Thus, any site in the genome (near a PAM sequence) can be targeted by Cas9 with an sgRNA that is synthesized to be complementary to that site. The first demonstration of in vivo genome editing with CRISPR-Cas9 was performed on cultured mammalian cells in 2012 by introducing plasmid expression vectors carrying genes encoding Cas9 and an sgRNA with a specific targeting sequence. Since then, CRISPR-Cas has been used to edit the genomes of many different organisms using a wide range of Cas9 and sgRNA delivery methods, such as viruses, plasmids, and direct injection.

Using CRISPR-Cas9 to cut a genome at a precise location is only the first step of genome editing. Creating specific changes to the genome with intended consequences, such as disrupting a gene’s function or correcting a mutation, requires a second step that takes advantage of the eukaryotic cell’s double-strand DNA break repair mechanisms (see Chapter 14). Double-stranded breaks in the genome may be repaired by nonhomologous end-joining (NHEJ) or by homology-directed repair (HDR). NHEJ simply involves the ligation of broken DNA fragments. This process is error prone and often results in small insertions or deletions (indels) at the repair site. HDR is less error prone because it uses an undamaged homologous chromosome or sister chromatid as a template to correctly repair a broken chromosome.

When the goal of CRISPR-Cas9 genome editing is to disrupt a gene and create a nonfunctional allele, simply

[Figure 17.16 panels: (a) sgRNA, Cas9, PAM, target gene in genome, double-stranded break; (b) NHEJ, indel creates frameshift; (c) HDR, donor template, specific edit]

FIGURE 17.16 CRISPR-Cas9 genome editing. (a) An sgRNA guides Cas9 to cleave a target site adjacent to a PAM sequence. The double-stranded DNA break can be repaired by (b) NHEJ, which introduces insertions or deletions (indels), or by (c) HDR, which can make specific edits using an introduced donor template.
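The two targeting requirements for Cas9 (a 20-nucleotide target followed by a 5′-NGG-3′ PAM, with the cut falling three base pairs upstream of the PAM) can be turned into a simple sequence scan. The Python sketch below is only an illustration under stated assumptions. It searches one strand of a DNA string written 5′ to 3′, ignores the complementary strand, and does no off-target checking. The function name `find_cas9_sites`, the dictionary keys, and the demo sequence are hypothetical and are not taken from any published sgRNA design tool.

```python
import re


def find_cas9_sites(dna, guide_len=20):
    """List candidate Cas9 target sites on the given strand (5'->3').

    A candidate is `guide_len` bases immediately followed by an NGG PAM.
    """
    dna = dna.upper()
    # The lookahead lets overlapping candidates be reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    sites = []
    for match in pattern.finditer(dna):
        pam_start = match.start() + guide_len  # 0-based index of the N in NGG
        sites.append({
            "protospacer": match.group(1),
            "pam": match.group(2),
            "pam_start": pam_start,
            # Number of bases to the left of the break: 3 bp upstream of the PAM.
            "cut_index": pam_start - 3,
        })
    return sites


# Made-up sequence: 20 A's, then the PAM "TGG", then filler bases.
demo = "A" * 20 + "TGG" + "CCC"
for site in find_cas9_sites(demo):
    print(site)
```

In practice, the sgRNA would be synthesized to match the reported 20-base target. A fuller design tool would also scan the reverse complement and check each candidate against the rest of the genome.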


introducing Cas9 and an sgRNA into the cell is often sufficient. While the HDR mechanism may repair Cas9-induced double-stranded breaks correctly in some cells, in other cells the error-prone NHEJ pathway may introduce indels that result in a shift of the coding sequence reading frame, and thus lead to gene disruption [Figure 17.16(b)].

However, if the goal of CRISPR-Cas9 genome editing is to make a more precise edit, HDR can be “tricked” into using an artificial donor template (instead of the homologous chromosome) to make complex substitutions, deletions, or additions. The donor template is an experimentally introduced DNA molecule carrying a sequence with desired edits flanked by “homology arms” with sequences that match regions adjacent to the genomic target. Through the HDR mechanism, the target sequence in the genome is replaced by the sequence on the donor template [Figure 17.16(c)].

CRISPR-Cas Infidelity

CRISPR-Cas is clearly a powerful tool with immense potential, but it does have limitations. In some cases, Cas9 not only cuts at the intended target but also at off-target sites in the genome. Off-target edits may be due to an sgRNA having more than one perfect match in the genome or the sgRNA directing Cas9 to a sequence with one or a few mismatches. To address this problem, several labs have tried modifying Cas9 to improve its specificity. Others have designed Web-based algorithms to improve sgRNA design. Others still have turned back to bacteria and archaea looking for CRISPR systems with alternative enzymes to Cas9 that have improved fidelity or other desirable traits. Some studies have shown that the Cpf1 nuclease, from the CRISPR system of bacteria in the Francisella and Prevotella genera, exhibits lower off-target editing than Cas9. Improving the specificity of CRISPR-Cas edits to the human genome will be important for the safety of medical applications of this technology.

CRISPR-Cas Technology Has Diverse Applications

CRISPR-Cas is an indispensable tool for basic genetics research. A fundamental objective that geneticists often pursue is to determine the function of an uncharacterized gene. A simple way to do this is to delete the gene and observe the phenotypic consequences. Although this is conceptually simple, the ability to efficiently and quickly delete a gene from the genome has only recently become possible with CRISPR-Cas technology. Beyond research, CRISPR-Cas is also beginning to have an impact on biotechnology and medicine.

Biotechnology is the use of living organisms to create a product or a process that helps improve the quality of life for humans or other organisms. CRISPR-Cas has greatly facilitated biotechnological innovation because it enables the rapid and cost-effective production of genetically modified organisms for various purposes. CRISPR-Cas biotechnological applications range from simple but useful innovations, such as the creation of tomatoes that ripen more quickly, to massive and controversial endeavors, such as “bringing back” the woolly mammoth by editing mammoth genes into the elephant genome. The woolly mammoth is not back yet, but there are already many examples of CRISPR-Cas edited or modified organisms that serve biotechnological purposes.

A major challenge in raising livestock is managing disease. For example, each year the pig farming industry in the United States loses over $600 million to a single disease, porcine reproductive and respiratory syndrome (PRRS). The porcine reproductive and respiratory syndrome virus (PRRSV) causes PRRS by infecting immune cells in the pig’s lungs, which leads to respiratory complications and reproductive failure. Studies have determined that PRRSV gains entry into cells via the CD163 receptor. Therefore, researchers used CRISPR-Cas9 to remove the CD163 gene from the pig genome. Pigs homozygous for a CD163 deletion showed no clinical signs of the disease following exposure to the virus.

CRISPR-Cas technology is currently being used to modify food crops to introduce traits such as enhanced nutritional value, increased shelf life, and pest or drought resistance. For example, the biotech company DuPont Pioneer used CRISPR-Cas to modify the ARGOS8 gene in corn to create a drought-resistant strain. Past studies had shown that ARGOS8 expression improves drought resistance and that expression of this gene is low in corn. Therefore, scientists removed the native promoter of ARGOS8 and introduced a promoter that directs stronger expression. Under drought conditions, ARGOS8-modified corn produced five bushels more per acre than unmodified corn. CRISPR-Cas has also been used to create mushrooms that resist browning after being sliced. (For additional information on the use of CRISPR-Cas in creating gene-edited food crops, see Special Topics Chapter 6, Genetically Modified Foods.) We are likely to see more CRISPR-Cas-derived foods in the grocery store soon.

Perhaps one of the most anticipated applications of CRISPR-Cas technology is for treating, or even curing, human genetic diseases; in other words, gene therapy. (For a broader discussion of gene therapy, see Special Topics Chapter 3, Gene Therapy.) Several clinical trials using CRISPR-Cas to treat cancers are currently under way, while numerous other CRISPR-Cas clinical trials are being planned for a wide variety of genetic disorders such as muscle, blood, and liver diseases, as well as heritable blindness, HIV infection, and various cancers.

ESSENTIAL POINT
The fastest and most efficient method of genome editing is the CRISPR-Cas system, which uses an sgRNA to direct sequence-specific cutting of the genome by the Cas9 nuclease. ■
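The reading-frame shift that NHEJ-induced indels can cause, described earlier in this section, comes down to simple arithmetic: an insertion or deletion changes the frame unless its length is a multiple of three. The Python sketch below illustrates this with a made-up 15-base coding sequence. The helper names `codons`, `shifts_frame`, and `delete` are hypothetical, and the example assumes the reading frame starts at the first base.

```python
def codons(seq):
    """Split a coding sequence into complete triplets, reading from base 0."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def shifts_frame(indel_length):
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


def delete(seq, start, length):
    """Return seq with `length` bases removed beginning at 0-based index `start`."""
    return seq[:start] + seq[start + length:]


original = "ATGGCCAAGTTTGGA"          # made-up coding sequence (5 codons)
one_base = delete(original, 4, 1)     # 1-bp deletion: every later codon changes
three_base = delete(original, 3, 3)   # 3-bp deletion: one codon lost, frame kept

print(codons(original))
print(codons(one_base), shifts_frame(1))
print(codons(three_base), shifts_frame(3))
```

Comparing the printed codon lists shows why a one-base deletion is usually far more disruptive to a protein than a three-base deletion at the same position.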


EXPLORING GENOMICS

Manipulating Recombinant DNA: Restriction Mapping

Mastering Genetics: Visit the Study Area: Exploring Genomics

As you learned in this chapter, restriction enzymes are sophisticated “scissors” that geneticists and molecular biologists routinely use to cut DNA for recombinant DNA experiments. A wide variety of online tools assist scientists working with restriction enzymes and manipulating recombinant DNA for different applications. Here we explore Webcutter and Primer3, two sites that make recombinant DNA experiments much easier.

Exercise I – Creating a Restriction Map in Webcutter

Suppose you had cloned and sequenced a gene and you wanted to design a probe approximately 600 bp long that could be used to analyze expression of this gene in different human tissues by Northern blot analysis. Internet sites such as Webcutter make it relatively easy to design experiments for manipulating recombinant DNA. In this exercise, you will use Webcutter to create a restriction map of human DNA with the enzymes EcoRI, BamHI, and PstI.

1. Access Webcutter at http://www.firstmarket.com/cutter/cut2.html. Go to the Study Area for Essentials of Genetics, and open the Exploring Genomics exercise for this chapter. Copy the sequence of cloned human DNA found there, and paste it into the text box in Webcutter.

2. Scroll down to “Please indicate which enzymes to include in the analysis.” Click the button indicating, “Use only the following enzymes.” Select the restriction enzymes EcoRI, BamHI, and PstI from the list provided, and then click “Analyze sequence.” (Note: Use the command, control, or shift key to select multiple restriction enzymes.)

3. After examining the results provided by Webcutter, create a table showing the number of cutting sites for each enzyme and the fragment sizes that would be generated by digesting with each enzyme. Draw a restriction map indicating cutting sites for each enzyme with distances between each site and the total size of this piece of human DNA.

Exercise II – Designing a Recombinant DNA Experiment

Now that you have created a restriction map of your piece of human DNA, you need to ligate the DNA into a plasmid DNA vector that you can use to make your probe (molecular biologists often refer to this as subcloning). To do this, you will need to determine which restriction enzymes would best be suited for cutting both the plasmid and the human DNA.

1. Referring back to the Study Area and the Exploring Genomics exercise for this chapter, copy the plasmid DNA sequence from Exercise I into the text box in Webcutter and identify cutting sites for the same enzymes you used in Exercise I. Then answer the following questions:

a. What is the total size of the plasmid DNA analyzed in Webcutter?

b. Which enzyme(s) could be used in a recombinant DNA experiment to ligate the plasmid to the largest DNA fragment from the human gene? Briefly explain your answer.

c. What size recombinant DNA molecule will be created by ligating these fragments?

d. Draw a simple diagram showing the cloned DNA inserted into the plasmid, and indicate the restriction-enzyme cutting site(s) used to create this recombinant plasmid.
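To see what a restriction-mapping tool is doing with your sequence, it can help to sketch the idea in a few lines of Python. The example below is a simplified illustration and not a description of how Webcutter itself is implemented. It treats the DNA as a linear molecule, searches only the strand you supply (sufficient here because all three recognition sites are palindromic), and uses a made-up test sequence. The function names `cut_positions` and `fragment_sizes` are hypothetical.

```python
# Recognition sequences and the offset of the top-strand cut within each
# site for the three enzymes used in Exercise I.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),  # G^AATTC
    "BamHI": ("GGATCC", 1),  # G^GATCC
    "PstI": ("CTGCAG", 5),   # CTGCA^G
}


def cut_positions(dna, enzyme):
    """Return 0-based cut positions (bases to the left of each cut)."""
    site, offset = ENZYMES[enzyme]
    dna = dna.upper()
    positions = []
    i = dna.find(site)
    while i != -1:
        positions.append(i + offset)
        i = dna.find(site, i + 1)
    return positions


def fragment_sizes(dna, enzyme):
    """Return fragment lengths, in order, from a complete digest of linear DNA."""
    edges = [0] + cut_positions(dna, enzyme) + [len(dna)]
    return [right - left for left, right in zip(edges, edges[1:])]


# Made-up 22-bp sequence with one EcoRI site and one PstI site.
demo = "AAAA" + "GAATTC" + "TTTT" + "CTGCAG" + "CC"
for name in ENZYMES:
    print(name, cut_positions(demo, name), fragment_sizes(demo, name))
```

The fragment sizes for each enzyme always add up to the total length of the molecule, which is a useful check on the table and map you draw in step 3 of Exercise I. For a circular plasmid, as in Exercise II, the first and last fragments of the linear calculation would be joined into one.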

CASE STUDY Ethical issues and genetic technology

In the 1970s, scientists realized that there might be unforeseen dangers and ethical issues with the use of recombinant DNA technology. A self-imposed moratorium on related research was implemented to develop safety protocols. As the Human Genome Project, designed to sequence and analyze the DNA of the human genome, came into existence in 1990, it was accompanied by the Ethical, Legal, and Social Implications (ELSI) program. ELSI was charged with identifying and addressing issues arising from genomic research.

This program focused mainly on privacy issues, the ethical use of genetic technology in medicine, and the design and conduct of genetic research, including gene therapy. The program led to the passage of federal legislation regulating the use of genetic information and the institution of guidelines limiting the scope of gene therapy. These guidelines prohibit germ-line therapy, which impacts future generations, and also prohibit gene therapy designed to enhance physical or mental aptitudes.

The recent development of CRISPR-Cas as a new genetic technology may allow for the removal of mutant alleles that cause devastating neurological disorders such as Huntington disease and prevent its transmission to future generations. Similar technology could be used to selectively eradicate the species of mosquito that transmits malaria, a painful and life-shortening disease that affects millions worldwide. With the development of these revolutionary methods, there are calls to redefine issues and to institute a new set of ethical guidelines for using these methods to eliminate genetic disorders and to revolutionize agriculture.

1. What undesirable or unforeseen consequences might occur in ecosystems if a species is eradicated using these new technologies?

2. Do we have the ethical right to alter the genomes of future generations of humans even if intervention eliminates lethal alleles?

3. Should these new technologies be regulated internationally to prevent their use by bioterrorists? How could violations be detected, and how could such regulations be enforced?

For related reading, see Rodriguez, E. (2016). Ethical issues in using Crispr/Cas9 system. J. Clin. Res. Bioethics 7:266 (doi:10.4172/2155-9627.1000266).

INSIGHTS AND SOLUTIONS

1. The recognition sequence for the restriction enzyme Sau3AI is GATC (see Figure 17.1); in the recognition sequence for the enzyme BamHI (GGATCC), the four internal bases are identical to the Sau3AI sequence. The single-stranded ends produced by the two enzymes are identical. Suppose you have a cloning vector that contains a BamHI recognition sequence and you also have foreign DNA that was cut with Sau3AI.

(a) Can this DNA be ligated into the BamHI site of the vector, and if so, why?

(b) Can the DNA segment cloned into this sequence be cut from the vector with Sau3AI? With BamHI? What potential problems do you see with the use of BamHI?

Solution:

(a) DNA cut with Sau3AI can be ligated into the vector’s BamHI cutting site because the single-stranded ends generated by the two enzymes are identical.

(b) The DNA can be cut from the vector with Sau3AI because the recognition sequence for this enzyme (GATC) is maintained on each side of the insert. Cutting the cloned insert with BamHI is more problematic. In the ligated vector, the conserved sequences are GGATC (left) and GATCC (right). The correct base for recognition by BamHI will follow the conserved sequence (to produce GGATCC on the left) only about 25 percent of the time, and the correct base will precede the conserved sequence (and produce GGATCC on the right) about 25 percent of the time as well. Thus, BamHI will be able to cut the insert from the vector only 0.25 × 0.25 = 0.0625, or about 6 percent, of the time.

[Figure: insert ligated into the vector. Left junction: GGATC (top strand) over CCTAG (bottom strand). Right junction: GATCC (top strand) over CTAGG (bottom strand).]
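The 0.25 × 0.25 estimate in solution (b) can be checked by brute-force enumeration. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the insert bases flanking each GATC are equally likely to be A, C, G, or T and are independent of each other, which is the same simplification the solution makes.

```python
from fractions import Fraction
from itertools import product

BASES = "ACGT"
BAMHI_SITE = "GGATCC"


def bamhi_excision_probability():
    """Fraction of inserts for which BOTH junctions regenerate a BamHI site.

    Left junction:  vector G + GATC + first insert base after the GATC.
    Right junction: last insert base before the GATC + GATC + vector C.
    Assumes each flanking insert base is uniformly random and independent.
    """
    both_sites = 0
    total = 0
    for after_left, before_right in product(BASES, repeat=2):
        left_junction = "GGATC" + after_left
        right_junction = before_right + "GATCC"
        total += 1
        if left_junction == BAMHI_SITE and right_junction == BAMHI_SITE:
            both_sites += 1
    return Fraction(both_sites, total)


print(bamhi_excision_probability())         # 1/16
print(float(bamhi_excision_probability()))  # 0.0625
```

Only one of the sixteen flanking-base combinations (C after the left GATC, G before the right GATC) restores both sites, which is the 6 percent figure in the solution.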

Problems and Discussion Questions

Mastering Genetics: Visit for instructor-assigned tutorials and problems.

1. HOW DO WE KNOW? In this chapter we focused on how specific DNA sequences can be copied, identified, characterized, and sequenced. At the same time, we found many opportunities to consider the methods and reasoning underlying these techniques. From the explanations given in the chapter, what answers would you propose to the following fundamental questions?
(a) In a recombinant DNA cloning experiment, how can we determine whether DNA fragments of interest have been incorporated into plasmids and, once host cells are transformed, which cells contain recombinant DNA?
(b) What steps make PCR a chain reaction that can produce millions of copies of a specific DNA molecule in a matter of hours without using host cells?
(c) How has DNA-sequencing technology evolved in response to the emerging needs of genome scientists?
(d) How can gene knockouts, transgenic animals, and gene-editing techniques be used to explore gene function?

2. CONCEPT QUESTION Review the Chapter Concepts list on p. 323. All of these refer to recombinant DNA methods and applications. Write a short essay or sketch a diagram that provides an overview of how recombinant DNA techniques help geneticists study genes.

3. What roles do restriction enzymes, vectors, and host cells play in recombinant DNA studies? What role does DNA ligase perform in a DNA cloning experiment? How does the action of DNA ligase differ from the function of restriction enzymes?

4. The human insulin gene contains a number of sequences that are removed in the processing of the mRNA transcript. Bacterial cells cannot excise these sequences from mRNA transcripts, yet this gene can be cloned into a bacterial cell and produce insulin. Explain how this is possible.

5. Although many cloning applications involve introducing recombinant DNA into bacterial host cells, many other cell types are also used as hosts for recombinant DNA. Why?

6. Using DNA sequencing on a cloned DNA segment, you recover the nucleotide sequence shown below. Does this segment contain a palindromic recognition sequence for a restriction enzyme? If so, what is the double-stranded sequence of the palindrome, and what enzyme would cut at this sequence? (Consult Figure 17.1 for a list of restriction sites.)

CAGTATGGATCCCAT

M17_KLUG8414_10_SE_C17.indd 345 16/11/18 5:15 pm 346 17 RECOMBINANT DNA TECHNOLOGY

7. Restriction sites are palindromic; that is, they read the same in the 5′ to 3′ direction on each strand of DNA. What is the advantage of having restriction sites organized this way?

8. List the advantages and disadvantages of using plasmids as cloning vectors. What advantages do BACs and YACs provide over plasmids as cloning vectors?

9. What are the advantages of using a restriction enzyme whose recognition site is relatively rare? When would you use such enzymes?

10. In the context of recombinant DNA technology, of what use is a probe?

11. If you performed a PCR experiment starting with only one copy of double-stranded DNA, approximately how many DNA molecules would be present in the reaction tube after 15 cycles of amplification?

12. What advantages do cDNA libraries provide over genomic DNA libraries? Describe cloning applications where the use of a genomic library is necessary to provide information that a cDNA library cannot.

13. In a typical PCR reaction, describe what is happening in stages occurring at temperature ranges (a) 92–95°C, (b) 45–65°C, and (c) 65–75°C.

14. We usually think of enzymes as being most active at around 37°C, yet in PCR the DNA polymerase is subjected to multiple exposures of relatively high temperatures and seems to function appropriately at 65–75°C. What is special about the DNA polymerase typically used in PCR?

15. Traditional Sanger sequencing has largely been replaced in recent years by next-generation and third-generation sequencing approaches. Describe advantages of these sequencing methods over first-generation Sanger sequencing.

16. How is fluorescence in situ hybridization (FISH) used to produce a spectral karyotype?

17. What is the difference between a knockout animal and a transgenic animal?

18. One complication of making a transgenic animal is that the transgene might integrate at random into the coding region, or the regulatory region, of an endogenous gene. What might be the consequences of such random integrations? How might this complicate genetic analysis of the transgene?

19. When disrupting a mouse gene by knockout, why is it desirable to breed mice until offspring homozygous (−/−) for the knockout target gene are obtained?

20. What techniques can scientists use to determine if a particular transgene has been integrated into the genome of an organism?

21. Gene targeting and genome editing are both techniques for removing or modifying a particular gene, each of which can produce the same ultimate goal. Describe some of the differences between the experimental methods used for these two techniques.

22. The CRISPR-Cas system has great potential but also raises many ethical issues about its potential applications because theoretically it can be used to edit any gene in the genome. What do you think are some of the concerns about the use of CRISPR-Cas on humans? Should CRISPR-Cas applications be limited for use on only certain human genes but not others? Explain your answers.

23. What is a single guide RNA, and what role does it play in CRISPR-Cas genome editing in eukaryotic cells?

24. What is the difference between nonhomologous end-joining (NHEJ) and homology-directed repair (HDR) in the context of genome editing?

25. What safety considerations must be taken before CRISPR-Cas is used to edit human embryos to cure disease?

26. Provide one example of a CRISPR-Cas application for biotechnology.

27. Why is genome editing by CRISPR-Cas advantageous over traditional methods for creating knockout or transgenic animals? Explain your answers.

18 Genomics, Bioinformatics, and Proteomics

[Chapter-opening figure: Alignment comparing the DNA sequence for the leptin gene from dogs (blue) and from humans (red). Vertical lines and shaded boxes indicate identical bases. LEP encodes a hormone that functions to suppress appetite. This type of analysis is a common application of bioinformatics and a good demonstration of comparative genomics.]

CHAPTER CONCEPTS

■■ Genomics applies recombinant DNA, DNA sequencing methods, and bioinformatics to sequence, assemble, and analyze genomes.
■■ Disciplines in genomics encompass several areas of study, including structural and functional genomics, comparative genomics, and metagenomics, and have led to an “omics” revolution in modern biology.
■■ Bioinformatics merges information technology with biology and mathematics to store, share, compare, and analyze nucleic acid and protein sequence data.
■■ The Human Genome Project has greatly advanced our understanding of the organization, size, and function of the human genome.
■■ Fifteen years after completion of the Human Genome Project, a new era of genomics studies is providing deeper insights into the human genome.
■■ Comparative genomics analysis has revealed similarities and differences in genome size and organization.
■■ Metagenomics is the study of genomes from environmental samples and is valuable for identifying microbial genomes.
■■ Transcriptome analysis provides insight into patterns of gene expression and gene-regulatory activity of a genome.
■■ Proteomics focuses on the protein content of cells and on the structures, functions, and interactions of proteins.
■■ Synthetic genomes have been assembled, elevating interest in potential applications of engineering cells through synthetic biology.

In 1977, as recombinant DNA–based techniques, including DNA sequencing, were developing, Fred Sanger and colleagues launched the field of genomic analysis or simply genomics, the study of genomes, by sequencing the 5400-nucleotide DNA genome of the virus φX174. Over the past 40 years, genomic analysis has advanced so quickly that modern biological research is experiencing a genomics revolution. Genomics is one of the most rapidly advancing and exciting areas of modern genetics, providing scientists, and even the general public, with unprecedented information about genomes of different organisms, including humans, on a daily basis.

In this chapter, we will examine basic technologies used in genomic analysis, including bioinformatics, the use of computer hardware and software and mathematics applications to organize, share, and analyze data related to gene structure, gene sequence and expression, and protein structure and function. We discuss examples of genome data derived from different species including the human genome, and consider selected disciplines of genomics. We discuss transcriptome analysis, the study of genes expressed in a cell or tissue (the “transcriptome”), and proteomics, the study of proteins present in a cell or tissue. We conclude by presenting the concept of synthetic genomes and applications in synthetic biology.


18.1 Whole-Genome Sequencing Is Widely Used for Sequencing and Assembling Entire Genomes

A primary limitation of most recombinant DNA approaches is that they typically can identify only relatively small numbers of genes at a time. By contrast, genomics approaches allow the identification of many genes because entire genomes can be sequenced. The most widely used strategy for sequencing and assembling an entire genome involves variations of a method called whole-genome sequencing (WGS), also known as shotgun cloning or shotgun sequencing. In simple terms, this technique is analogous to you and a friend taking your respective copies of this genetics textbook and randomly ripping the pages into strips about 5 to 7 inches long. Each chapter represents a chromosome, and all the letters in the entire book are the “genome.” Then you and your friend would go through the painstaking task of comparing the pieces of paper to find places that match, overlapping sentences (areas where there are similar sentences on different pieces of paper). Eventually, in theory, many of the strips containing matching sentences would overlap in ways that you could use to reconstruct the pages and assemble the order of the entire text.

Figure 18.1 shows a basic overview of WGS. First, multiple copies of an entire chromosome are cut into short, overlapping fragments, either by mechanical or enzymatic methods. For simplicity, here we present only an example using restriction enzymes. In separate digests, different restriction enzymes can be used so that chromosomes are cut at different sites. Alternatively, partial digests of DNA using the same restriction enzyme can be performed. To achieve a partial digest, the reaction mixture is incubated for only a short period of time so that not every target site in a particular DNA molecule is cut by the restriction enzyme. Restriction digests of whole chromosomes generate thousands to millions of overlapping DNA fragments. For example, a 6-bp cutter such as EcoRI creates about 700,000 fragments when used to digest the human genome!

One of the earliest bioinformatics applications to be developed for genomics was the use of algorithm-based

[Figure 18.1 diagram: a stretch of genomic DNA divided into segments 1–4 by restriction sites labeled EcoRI, BamHI, and BamHI. Steps shown:
1. Genomic DNA cut with different restriction enzymes to create a series of overlapping fragments
2. Overlapping sequenced fragments aligned using computer programs to assemble an entire chromosome
3. Alignment of fragments based on identical DNA sequences creates an assembly of contiguous fragments or “contigs”]

FIGURE 18.1 An overview of whole-genome sequencing and assembly. One strategy (shown here) involves using restriction enzymes to digest genomic DNA into contigs, which are then sequenced and aligned using bioinformatics to identify overlapping fragments based on sequence identity. Notice that EcoRI digestion produces two fragments (contigs 1 and 2–4), whereas digestion with BamHI produces three fragments (contigs 1–2, 3, and 4).
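The EcoRI fragment count quoted in this section can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes a random sequence with equal base frequencies, so that a specific 6-bp site occurs on average once every 4^6 = 4096 bases, and it assumes a haploid human genome of roughly 3.1 billion bp. Both the function name and the genome-size constant are assumptions introduced here for illustration.

```python
def expected_fragments(genome_bp, site_length):
    """Expected number of fragments from a complete digest.

    Assumes random sequence with equal base frequencies, so a specific
    site of `site_length` bases occurs about once every 4**site_length bp.
    """
    return genome_bp / 4 ** site_length


HUMAN_GENOME_BP = 3.1e9  # assumed approximate haploid genome size

six_cutter = expected_fragments(HUMAN_GENOME_BP, 6)   # e.g., EcoRI
four_cutter = expected_fragments(HUMAN_GENOME_BP, 4)  # e.g., Sau3AI

print(f"6-bp cutter: about {six_cutter:,.0f} fragments")
print(f"4-bp cutter: about {four_cutter:,.0f} fragments")
```

The 6-bp estimate comes out in the same range as the figure of about 700,000 given in the text. Real genomes are not random sequences, so an observed count can differ from this idealized value.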


Sequence alignment between contigs 1 and 2

Contig 1 5'–ATTTTTTTTGTATTTTTAATAGAGACGAGGTGTCACCATGTTGGACAGGCTGGTCTCGAACTCCTGACCTCAGGTGATCTGCCC–3'
Contig 2 5'–GGTCTCGAACTCCTGACCTCAGGTGATCTGCCCACCTCAGCCTCCCAAAGTGCTGGA

Sequence alignment between contigs 2 and 3

Contig 2 (continued) TTACAAGCATGAGCCACCACTCCCAGGC–3'
Contig 3 5'–GAGCCACCACTCCCAGGCTTTATTTTCTATTTTTTAATTACAGCCATCCTAGTGAATGTGAAGTAGTATCTCACTGAGGTTTTGATTT–3'

Assembled sequence of a partial segment of chromosome 2 based on alignment of three contigs

5'–ATTTTTTTTGTATTTTTAATAGAGACGAGGTGTCACCATGTTGGACAGGCTGGTCTCGAACTCCTGACCTCAGGTGATCTGCCCACCTCAGCCTCCCAAAGTGCTGGATTACAAGCATGAGCCACCACTCCCAGGCTTTATTTTCTATTTTTTAATTACAGCCATCCTAGTGAATGTGAAGTAGTATCTCACTGAGGTTTTGATTT–3'

FIGURE 18.2 DNA-sequence alignment of contigs on human chromosome 2. Single-stranded DNA for three different contigs from human chromosome 2 is shown in blue, red, or green. The actual sequence from chromosome 2 is shown, but in reality, contig alignment involves fragments that are several thousand bases in length. Alignment of the three contigs allows a portion of chromosome 2 to be assembled. Alignment of all contigs for a particular chromosome would result in assembly of a completely sequenced chromosome.
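The end-to-end overlap shown in Figure 18.2 can be mimicked with a small suffix/prefix matcher. This is a deliberately simplified sketch, not the algorithm used by real assemblers: it assumes the contigs are already in the correct order, on the same strand, and free of sequencing errors, and the three short sequences are made-up examples rather than the chromosome 2 contigs from the figure.

```python
def overlap(left, right, min_len=5):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def assemble(ordered_contigs, min_len=5):
    """Merge contigs that are already ordered, using exact end overlaps."""
    assembled = ordered_contigs[0]
    for nxt in ordered_contigs[1:]:
        k = overlap(assembled, nxt, min_len)
        if k == 0:
            raise ValueError("no overlap of at least min_len bases")
        assembled += nxt[k:]  # append only the non-overlapping tail
    return assembled


# Hypothetical toy contigs (much shorter than real ones).
contig1 = "ATTGAGACGAGGTCTCGAAC"
contig2 = "GGTCTCGAACTCCTGAGCCACC"
contig3 = "GAGCCACCACTCCCAGGCTTT"

print(overlap(contig1, contig2))  # 10-base overlap
print(overlap(contig2, contig3))  # 8-base overlap
print(assemble([contig1, contig2, contig3]))
```

The `min_len` threshold is there because very short matches arise by chance. Production assemblers also have to tolerate mismatches, consider both strands, and work out the contig order, none of which this sketch attempts.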

software programs for creating a DNA-sequence alignment, in which similar sequences of bases are lined up for comparison. Alignment identifies overlapping sequences, allowing scientists to reconstruct their order in a chromosome. Because these overlapping fragments are adjoining segments that collectively form one continuous DNA molecule within a chromosome, they are called contiguous fragments, or contigs. Figure 18.2 shows an example of contig alignment and assembly for a portion of human chromosome 2. For simplicity, this figure shows relatively short sequences for each contig, which in actuality would be much longer. The figure is also simplified in that, in actual alignments, assembled sequences do not always overlap only at their ends.

The WGS method was developed by J. Craig Venter and colleagues at The Institute for Genome Research (TIGR, now named The J. Craig Venter Institute). In 1995, TIGR scientists used this approach to sequence the 1.83-million-bp genome of the bacterium Haemophilus influenzae. This was the first completed genome sequence from a free-living (i.e., nonviral) organism, and it demonstrated “proof of concept” that shotgun sequencing could be used to sequence an entire genome. Even after the genome for H. influenzae was sequenced, many scientists were skeptical that a shotgun approach would work on the larger genomes of eukaryotes. Now WGS approaches are the predominant method for sequencing genomes.

The major technological breakthrough that made genomics possible was the development of computer-automated sequencers. In the past 15 years, high-throughput sequencing has increased the productivity of DNA-sequencing technology over 500-fold. The total number of bases that could be sequenced in a single reaction was doubling about every 24 months. At the same time, this increase in efficiency brought about a dramatic decrease in cost, from about $1.00 to less than $0.001 per base pair. As we will discuss in Section 18.3, without question the development of high-throughput sequencing was essential for the Human Genome Project. And as you know from earlier in the text (see Chapter 17), next- and third-generation sequencers now enable genome scientists to produce a sequence more than 50,000 times faster than sequencers in 2000 with greater output, improved accuracy, and reduced cost.

ESSENTIAL POINT
Whole-genome sequencing enables scientists to assemble sequence maps of entire genomes. ■

18.2 DNA Sequence Analysis Relies on Bioinformatics Applications and Genome Databases

Genomics necessitated the rapid development of bioinformatics, the use of computer hardware and software and mathematics applications to organize, share, and analyze data related to gene structure, gene sequence and expression, and protein structure and function. However, even before WGS projects had been initiated, a large amount of sequence information from a range of


different organisms was accumulating as a result of gene cloning by recombinant DNA techniques.

Scientists around the world needed databases that could be used to store, share, and obtain the maximum amount of information from protein and DNA sequences. Thus, bioinformatics software was already being used to compare and analyze DNA sequences and to create private and public databases. But once genomics emerged as a new approach for analyzing DNA, bioinformatics became even more important than before. Today, it is a dynamic area of biological research, providing new career opportunities for anyone interested in merging an understanding of biological data with information technology, mathematics, and statistical analysis.

Among the most common applications of bioinformatics are to:

■■ Compare DNA sequences, as in contig alignment, or compare sequences from individuals of different species
■■ Identify genes in a genomic DNA sequence
■■ Find gene-regulatory regions, such as promoters and enhancers
■■ Identify structural sequences, such as telomeric sequences, in chromosomes
■■ Predict the amino acid sequence of a putative polypeptide encoded by a cloned gene sequence
■■ Analyze protein structure, and predict protein functions on the basis of identified domains and motifs
■■ Deduce evolutionary relationships between genes and organisms on the basis of sequence information

High-throughput DNA-sequencing techniques were developed nearly simultaneously with the expansion of the Internet. As genome data accumulated, many DNA-sequence databases became freely available online. They are essential resources for archiving and sharing sequence information with other researchers and with the public. One of the largest genomic databases, called GenBank, is maintained by the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland. GenBank shares and acquires data from other databases in Japan and Europe. Containing more than 220 billion bases of sequence data from over 100,000 species, it is the largest publicly available database of DNA sequences, and it doubles in size roughly every 18 months! The Human Genome Nomenclature Committee, supported by the NIH, establishes rules for assigning names and symbols to newly cloned human genes. As sequences are identified and genes are named, each sequence deposited into GenBank is provided with an accession number that scientists can use to access and retrieve that sequence for analysis.

The NCBI is an invaluable source of public access databases and bioinformatics tools for analyzing genome data. You have already been introduced to NCBI and GenBank earlier in the text through several Exploring Genomics exercises. In the Exploring Genomics feature for this chapter, you will use NCBI and GenBank to compare and align contigs to assemble a chromosome segment.

Annotation to Identify Gene Sequences

One of the fundamental challenges of genomics is that, although genome projects generate tremendous amounts of DNA-sequence information, these data are of little use until they have been analyzed and interpreted. Thus, after a genome has been sequenced and compiled, scientists are faced with the task of identifying gene-regulatory sequences and other sequences of interest in the genome so that gene maps can be developed. This process, called annotation, relies heavily on bioinformatics, and a wealth of different software tools are available to carry it out.

One initial approach to annotating a sequence is to compare the newly sequenced genomic DNA to the known sequences already stored in various databases. The NCBI provides access to BLAST (Basic Local Alignment Search Tool), a very popular software application for searching through banks of DNA and protein sequence data. Using BLAST, we can compare a segment of genomic DNA to sequences throughout major databases such as GenBank to identify portions that align with or are the same as existing sequences. For WGS projects, simple BLAST alignments are insufficient and more complex algorithms are required to align the billions of reads of DNA sequence generated.

Figure 18.3 shows a representative example of a sequence alignment based on a BLAST search. Here a 280-bp contig from rat chromosome 12 (the query sequence) was used to search a mouse database to determine whether a sequence in the rat contig matched a known gene in mice. Notice that the rat contig is aligned with base pairs 174,612 to 174,891 of mouse chromosome 8 (the subject sequence). The accession number for the mouse sequence, NT_039455.6, is indicated at the top of the figure. BLAST searches calculate an identity value, determined by the sum of identical matches between aligned sequences divided by the total number of bases aligned. Gaps, indicating missing bases in one of the two sequences, are usually ignored in calculating similarity scores. The aligned rat and mouse sequences were 93 percent identical and showed no gaps in the alignment.

Notice that the BLAST report also provides an “Expect” value, or E-value, based on the number of matching sequences in the database that would be expected by chance. E-values take into account the length of the query sequence. Shorter sequences have a much greater likelihood of being present in the database by chance as compared to longer sequences. The lower the E-value (the closer it is to 0), the higher the significance of the match. DNA sequences that have E-values much less than 1.0 are considered to be significantly similar.
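To make the E-value discussion concrete, the sketch below uses the standard relationship between a bit score S′ and the expected number of chance hits, E = m · n · 2^(−S′), where m is the query length and n is the database length. The chapter does not give this formula. The function name, the 5-billion-base database size, and the 20-base query with a 40-bit score are assumptions chosen for illustration, and BLAST itself applies length corrections that are omitted here.

```python
def expect_value(bit_score, query_len, database_len):
    """Approximate E-value: chance hits expected with at least this bit score.

    Uses E = m * n * 2**(-S'), ignoring the effective-length corrections
    that BLAST applies in practice.
    """
    return query_len * database_len * 2.0 ** (-bit_score)


ASSUMED_DB_BASES = 5e9  # hypothetical database size, for illustration only

# A long, high-scoring match (score and length taken from Figure 18.3).
long_match = expect_value(bit_score=418, query_len=280,
                          database_len=ASSUMED_DB_BASES)

# A hypothetical short query with a much lower bit score.
short_match = expect_value(bit_score=40, query_len=20,
                           database_len=ASSUMED_DB_BASES)

print(f"280-base query, 418 bits: E is about {long_match:.1e}")
print(f"20-base query, 40 bits:   E is about {short_match:.1e}")
```

Because each additional bit of score halves E, a long, well-matched query reaches vanishingly small E-values, whereas a short query cannot accumulate enough score to rule out chance matches in a large database.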


ref|NT_039455.6|Mm8_39495_36 Mus musculus chromosome 8 genomic contig, strain C57BL/6J
Features in this part of subject sequence: insulin receptor
Score = 418 bits (226), Expect = 2e-114
Identities = 262/280 (93%), Gaps = 0/280 (0%)

Query 1      CAGGCCATCCCGAAAGCGAAGATCCCTTGAAGAGGTGGGCAATGTGACAGCCACTACACC 60
Sbjct 174891 CAGGCCATCCCGAAAGCGAAGATCCCTTGAAGAGGTGGGGAATGTGACAGCCACCACACT 174832

Query 61     CACACTTCCAGATTTTCCCAACATCTCCTCCACCATCGCGCCCACAAGCCACGAAGAGCA 120
Sbjct 174831 CACACTTCCAGATTTCCCCAACGTCTCCTCTACCATTGTGCCCACAAGTCAGGAGGAGCA 174772

Query 121    CAGACCATTTGAGAAAGTAGTAAACAAGGAGTCACTTGTCATCTCTGGCCTGAGACACTT 180
Sbjct 174771 CAGGCCATTTGAGAAAGTGGTGAACAAGGAGTCACTTGTCATCTCTGGCCTGAGACACTT 174712

Query 181    CACTGGGTACCGCATTGAGCTGCAGGCATGCAATCAGGACTCCCCAGAAGAGAGGTGCAG 240
Sbjct 174711 CACTGGGTACCGCATTGAGCTGCAGGCATGCAATCAAGATTCCCCAGATGAGAGGTGCAG 174652

Query 241    CGTGGCTGCCTACGTCAGTGCCCGGACCATGCCTGAAGGT 280
Sbjct 174651 TGTGGCTGCCTACGTCAGTGCCCGGACCATGCCTGAAGGT 174612

FIGURE 18.3 BLAST results showing a 280-base sequence of a chromosome 12 contig from rats (Rattus norvegicus, the “query”) aligned with a portion of chromosome 8 from mice (Mus musculus, the “subject”) that contains a partial sequence for the insulin receptor gene. Vertical lines indicate exact matches. The rat contig sequence was used as a query sequence to search a mouse database in GenBank. Notice that the two sequences show 93 percent identity, strong evidence that this rat contig sequence contains a gene for the insulin receptor.
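The identity value reported in Figure 18.3 can be reproduced by comparing the two aligned sequences column by column. The sketch below transcribes the query and subject rows from the figure (with extraction spaces removed). The function name is hypothetical, and the code assumes an ungapped alignment of equal-length sequences, which holds for this example (Gaps = 0/280).

```python
# Aligned rows transcribed from Figure 18.3 (query = rat, subject = mouse).
QUERY_ROWS = [
    "CAGGCCATCCCGAAAGCGAAGATCCCTTGAAGAGGTGGGCAATGTGACAGCCACTACACC",
    "CACACTTCCAGATTTTCCCAACATCTCCTCCACCATCGCGCCCACAAGCCACGAAGAGCA",
    "CAGACCATTTGAGAAAGTAGTAAACAAGGAGTCACTTGTCATCTCTGGCCTGAGACACTT",
    "CACTGGGTACCGCATTGAGCTGCAGGCATGCAATCAGGACTCCCCAGAAGAGAGGTGCAG",
    "CGTGGCTGCCTACGTCAGTGCCCGGACCATGCCTGAAGGT",
]
SUBJECT_ROWS = [
    "CAGGCCATCCCGAAAGCGAAGATCCCTTGAAGAGGTGGGGAATGTGACAGCCACCACACT",
    "CACACTTCCAGATTTCCCCAACGTCTCCTCTACCATTGTGCCCACAAGTCAGGAGGAGCA",
    "CAGGCCATTTGAGAAAGTGGTGAACAAGGAGTCACTTGTCATCTCTGGCCTGAGACACTT",
    "CACTGGGTACCGCATTGAGCTGCAGGCATGCAATCAAGATTCCCCAGATGAGAGGTGCAG",
    "TGTGGCTGCCTACGTCAGTGCCCGGACCATGCCTGAAGGT",
]


def identity(query, subject):
    """Return (identical positions, aligned length) for an ungapped alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for q, s in zip(query, subject) if q == s)
    return matches, len(query)


query = "".join(QUERY_ROWS)
subject = "".join(SUBJECT_ROWS)
matches, aligned = identity(query, subject)
print(f"Identities = {matches}/{aligned} ({100 * matches / aligned:.1f}%)")
```

The count agrees with the 262/280 reported in the figure; the report shows the percentage as a whole number (93%).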

Because this mouse sequence on chromosome 8 is known to contain an insulin receptor gene (encoding a protein that binds the hormone insulin), it is highly likely that the rat contig sequence also contains an insulin receptor gene. We will return to the topic of similarity in Sections 18.4 and 18.7, where we consider how similarity between gene sequences can be used to infer function and to identify evolutionarily related genes through comparative genomics.

Hallmark Characteristics of a Gene Sequence Can Be Recognized during Annotation

Gene-prediction programs are used to annotate sequences. Yet even with bioinformatics, identifying a gene in a particular sequence of DNA is not always straightforward, particularly when one is studying genes that do not code for proteins. In fact, a reasonable question whenever one sequences a genome is, “Where are the genes?” In other words, how does one know what sequences of a genome are genes and which sequences are not genes or parts of a gene?

As we discussed earlier in the text (see Chapters 12 and 16), several hallmark features of genes exist, whether the genome under study is from a eukaryote or a bacterium (Figure 18.4). These characteristics can be identified in gene sequences by using bioinformatics software that incorporates search elements for them.

[Figure 18.4 diagram: a gene drawn from left to right as 5'-UTR (exon), Exon, Intron, Exon, Intron, Exon, 3'-UTR (exon). Labeled features: transcription regulatory element (e.g., enhancers, silencers); promoter (e.g., TATA box, CAAT box, GC box); translation initiation site; 5' splice site (GT); 3' splice site (AG); translation termination site; polyadenylation site.]

FIGURE 18.4 Characteristics of a protein-coding gene that can be used during annotation to identify a gene in an unknown sequence of genomic DNA. Most eukaryotic genes are organized into coding segments (exons) and noncoding segments (introns). When annotating a genome sequence to determine whether it contains a gene, it is necessary to distinguish between introns and exons; gene-regulatory sequences, such as promoters and enhancers; untranslated regions (UTRs); and gene termination sequences.


For instance, gene-regulatory regions found upstream of genes are marked by identifiable sequences such as promoters, enhancers, and silencers. Recall from earlier in the text (see Chapter 16) that TATA box, GC box, and CAAT box sequences are often present in the promoter region of eukaryotic genes. Protein-coding genes contain one or more open reading frames (ORFs), nucleotide sequences that, after transcription and mRNA splicing, are translated into the amino acid sequence of a protein. ORFs typically begin with an initiation sequence (usually ATG, which is transcribed into the AUG start codon of an mRNA molecule) and end with a termination sequence (TAA, TAG, or TGA), which corresponds to the stop codons UAA, UAG, and UGA in mRNA.

Within eukaryotic ORFs, recall also that splice sites between exons and introns can be identified, because they contain a predictable sequence (most introns begin with GT and end with AG). Annotation can sometimes be a little bit easier for bacterial genes than for eukaryotic genes because there are no introns in bacterial genes.

Finally, downstream elements, such as termination sequences and, in eukaryotes, a polyadenylation sequence that signals the addition of a poly-A tail to the 3′ end of an mRNA transcript, are also important for annotation.

One computational approach to assigning functions to genes is to use sequence similarity searches, as described previously. Programs such as BLAST are used to search through databases to find alignments between the newly sequenced genome and genes that have already been identified, either in the same or in different species (Figure 18.3). Inferring gene function from similarity searches is based on a relatively simple idea. If a genome sequence shows statistically significant similarity to the sequence of a gene whose function is known, then it is likely that the genome sequence encodes a protein with a similar or related function.

Another major benefit of similarity searches is that they are often able to identify homologous genes, genes that are evolutionarily related. Homologous genes in the same species are called paralogs. For example, in the globin gene family, the α- and β-globin subunits in humans are paralogs resulting from a gene-duplication event. Paralogs often have similar or identical functions.

If homologous genes in different species are thought to have descended from a gene in a common ancestor, the genes are known as orthologs. After the human genome was sequenced, many ORFs in it were identified as protein-coding genes based on their alignment with related genes of known function in other species. An example is the leptin gene, which was first discovered in mice and later identified in humans.

NOW SOLVE THIS Figure 18.5 compares portions of the human (LEP) and the mouse (Lep) leptin genes, which are over 85 percent identi- 18.1 In a sequence encompassing 99.4 percent of the cal in sequence. This close match between the two sequences euchromatic regions of human chromosome 1, Gregory confirmed the leptin-coding function of the human DNA and, et al. [(2006) Nature 441:315–321] identified 3141 genes. therefore, its identity as the human LEP gene. (a) How does one identify a gene within a raw A gene sequence can be used to predict a polypeptide sequence of bases in DNA? sequence, which can then be analyzed for specific structural (b) What features of a genome are used to verify likely gene assignments? domains and motifs. Identification of protein domains, (c) Given that chromosome 1 contains approximately such as ion channels, membrane-spanning regions, DNA- 8 percent of the human genome, and assuming binding regions, among others, can in turn be used to predict that there are approximately 20,000 genes, would protein function. you consider chromosome 1 to be “gene rich”? Human LEP gene HINT: This problem involves a basic understanding of bioinfor- GTCACCAGGATCAATGACAT T TCACACACG- - - TCAGTCTCCTCCAAACAGAAAGTCACC

matics and gene annotation approaches to determine how potential GTCACCAGGATCAATGACAT T TCACACACGCAGTCGGTATCCGCCAAGCAGAGGGTCACT gene sequences can be identified in a stretch of sequenced DNA. Mouse Lep gene GGTTTGGACTTCA TTCCTGGGCTCCACCCCATCC TGACCT TATCCAAGATGGACCAGACA

GGCTTGGACTTCA TTCCTGGGCTTCACCCCAT TC TGAGTT TGTCCAAGATGGACCAGACT

Predicting Gene and Protein Functions CTGGCAGTCTACCAACAGATCCTCACCAGTA TGCCTTCCAGAAACGTGATCCA AATATCC

by Sequence Analysis CTGGCAGTCTA TCAACAGGTCCTCACCAGCCTGCCTTCCCA AAATGTGCTGCAGATAGCC Functional genomics interprets DNA sequences and FIGURE 18.5 comparison of the human LEP and mouse establishes gene functions based on the projected RNAs or Lep genes. Partial sequences for these orthologs are shown possible proteins they encode and, as well, identifies other with the human LEP gene on top (in blue) and the mouse components of the genome, such as gene-regulatory ele- Lep gene sequence below it (in red). Notice from the num- ber of identical nucleotides, indicated by shaded boxes and ments. Functional genomics also involves experimental vertical lines, that the nucleotide sequence for these two approaches to confirm or refute computational predictions. genes is very similar.
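Two ideas from this section lend themselves to a short sketch: scanning a sequence for ORFs (an ATG followed in the same reading frame by TAA, TAG, or TGA) and scoring how similar two aligned sequences are. The Python below is only an illustration under stated assumptions. The function names, the `min_codons` threshold, and the decision to count a gap column as a mismatch are choices made here, not part of BLAST or of any annotation pipeline. The ORF scan reads one strand only and ignores introns, so it applies most directly to intron-free sequence such as a bacterial gene or a spliced mRNA.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Aligned excerpts transcribed from Figure 18.5 ("-" marks an alignment gap).
HUMAN_LEP = (
    "GTCACCAGGATCAATGACATTTCACACACG---TCAGTCTCCTCCAAACAGAAAGTCACC"
    "GGTTTGGACTTCATTCCTGGGCTCCACCCCATCCTGACCTTATCCAAGATGGACCAGACA"
    "CTGGCAGTCTACCAACAGATCCTCACCAGTATGCCTTCCAGAAACGTGATCCAAATATCC"
)
MOUSE_LEP = (
    "GTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAAGCAGAGGGTCACT"
    "GGCTTGGACTTCATTCCTGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACT"
    "CTGGCAGTCTATCAACAGGTCCTCACCAGCCTGCCTTCCCAAAATGTGCTGCAGATAGCC"
)


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ATG...stop ORFs on the given strand.

    `min_codons` (a hypothetical threshold) is the minimum number of codons
    before the stop codon; `end` is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # look for the next ORF
    return orfs


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns with identical, non-gap bases."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG"))
    print(f"{percent_identity(HUMAN_LEP, MOUSE_LEP):.1f} percent identical")
```

The identity computed for this 180-column excerpt need not equal the "over 85 percent" figure quoted for the genes as a whole, because the figure shows only part of each gene. Real similarity searches also weigh alignment length and database size when judging statistical significance, which this sketch does not attempt.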


ESSENTIAL POINT
Bioinformatics can be used for sequence annotation to identify protein-coding and noncoding sequences of a gene, such as regulatory elements, and to predict gene function based on sequence analysis. ■

18.3 The Human Genome Project Revealed Many Important Aspects of Genome Organization in Humans

Now that you have a general idea of the basic strategies used for analyzing a genome, let’s look at the largest genomics project completed to date. The Human Genome Project (HGP) was a coordinated international effort to determine the sequence of the human genome and to identify all the genes it contains.

Origins of the Project

The publicly funded Human Genome Project began in 1990 under the direction of James Watson, the co-discoverer of the double helix structure of DNA. Eventually the public project was headed by Dr. Francis Collins, who had previously led a research team involved in identifying the CFTR gene as the cause of cystic fibrosis. In the United States, the Collins-led HGP was coordinated by the Department of Energy and the National Center for Human Genome Research (now called the National Human Genome Research Institute), a division of the National Institutes of Health. It established a 15-year plan with a proposed budget of $3 billion to identify all human genes, originally thought to number between 80,000 and 100,000, to sequence and map them all, and to sequence the approximately 3 billion base pairs thought to make up the 24 chromosomes (22 autosomes, plus X and Y) in humans. Other primary goals of the HGP included the following:

■■ Establish functional categories for all human genes.
■■ Analyze genetic variations among humans, including the identification of single-nucleotide polymorphisms (SNPs).
■■ Map and sequence the genomes of several model organisms used in experimental genetics, including Escherichia coli, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Mus musculus (mouse).
■■ Develop new sequencing technologies, such as high-throughput computer-automated sequencers, to facilitate genome analysis.
■■ Disseminate genome information among both scientists and the general public.

Recognizing the impact that genetic information would have on society, the HGP also set up the ELSI Program (standing for Ethical, Legal, and Social Implications) to consider these types of issues arising from the HGP and to ensure that personal genetic information would be safeguarded and not used in discriminatory ways.

As the HGP grew into an international effort, scientists in 18 countries were involved in the project. Much of the work was carried out by the International Human Genome Sequencing Consortium, involving nearly 3000 scientists working at 20 centers in six countries (China, France, Germany, Great Britain, Japan, and the United States).

In 1999, a privately funded human genome project led by J. Craig Venter at Celera Genomics (aptly named from a word meaning “swiftness”) was announced. Celera’s goal was to use WGS and computer-automated high-throughput DNA sequencers to sequence the human genome more rapidly than the clone-by-clone approach used by the HGP. Recall that Venter and his colleagues had proven the potential of WGS in 1995 when they completed the genome for H. influenzae. Celera’s announcement set off an intense competition between the two teams, each of which aspired to be first with the human genome sequence. This contest eventually led to the HGP finishing ahead of schedule and under budget after scientists from the public project began to use high-throughput sequencers and WGS strategies as well.

Major Features of the Human Genome

In June 2000, the leaders of the public and private genome projects met at the White House with President Bill Clinton and jointly announced the completion of a draft sequence of the human genome. In February 2001, they each published an analysis covering about 96 percent of the euchromatic region of the genome. The public project sequenced euchromatic portions of the genome 12 times and set a quality-control standard of a 0.01 percent error rate for their sequence. Although this error rate may seem very low, it still allows about 600,000 errors in the human genome sequence. Celera sequenced certain areas of the genome more than 35 times when compiling the genome.

The remaining work of completing the sequence involved filling in gaps clustered around centromeres, telomeres, and repetitive sequences (regions rich in CG base pairs can be particularly tough to sequence and interpret), correcting misaligned segments, and re-sequencing portions of the genome to ensure accuracy. In 2003 sequencing


and error fixing were deemed sufficient to pass the international project’s definition of completion: that the analysis contained fewer than 1 error per 10,000 nucleotides and that it covered 95 percent of the gene-containing portions of the genome. Yet even at the time of “completion” there were still some 350 gaps in the sequence that continued to be worked on.

And obviously the HGP did not sequence the genome of every person on Earth. The assembled sequence consists of haploid genomes pooled from different individuals so that they provide a reference genome representative of major, common elements widely shared among human populations. Examples of major features of the human genome are summarized in Table 18.1. As you can see, many unexpected observations have provided us with major new insights. The genome is not static! Genome variations, including the abundance of repetitive sequences scattered throughout the genome, verify that the genome is dynamic and reveal many evolutionary examples of sequences that have changed in structure and location. In many ways, the HGP has revealed just how little we know about our genome.

TABLE 18.1 Major Features of the Human Genome
■■ The human genome contains 3.1 billion nucleotides, but protein-coding sequences make up only about 2 percent of the genome.
■■ The genome sequence is 99.9 percent similar in individuals of all nationalities. SNPs and copy number variations (CNVs) account for genome diversity from person to person.
■■ The genome is dynamic. At least 50 percent of the genome is derived from transposable elements, such as LINE and Alu sequences, and other repetitive DNA sequences.
■■ The human genome contains approximately 20,000 protein-coding genes, far fewer than the originally predicted number of 80,000–100,000 genes.
■■ The average size of a human gene is 25 kb, including gene-regulatory regions, introns, and exons. On average, mRNAs produced by human genes are 3000 nt long.
■■ Many human genes produce more than one protein through alternative splicing, thus enabling human cells to produce a much larger number of proteins (perhaps as many as 200,000) from only 20,000 genes.
■■ More than 50 percent of human genes show a high degree of sequence similarity to genes in other organisms; however, more than 40 percent of the genes identified have no known molecular function.
■■ Genes are not uniformly distributed on the 24 human chromosomes. Gene-rich clusters are separated by gene-poor “deserts” that account for 20 percent of the genome. These deserts correlate with G bands seen in stained chromosomes. Chromosome 19 has the highest gene density, and chromosome 13 and the Y chromosome have the lowest gene densities.
■■ Chromosome 1 contains the largest number of genes, and the Y chromosome contains the smallest number.
■■ Human genes are larger and contain more and larger introns than genes of invertebrates, such as Drosophila. The largest known human gene encodes dystrophin, a muscle protein. This gene, associated in mutant form with muscular dystrophy (Chapter 14), is 2.5 Mb in length (Chapter 12), larger than many bacterial chromosomes. Most of this gene is composed of introns.
■■ The number of introns in human genes ranges from 0 (in histone genes) to 234 (in the gene for titin, which encodes a muscle protein).

Two of the biggest surprises discovered by the HGP were that less than 2 percent of the genome codes for proteins and that there are only around 20,000 protein-coding genes. Scientists had originally estimated the number of genes to be about 100,000, based in part on a prediction that human cells produce about 100,000 proteins. This overestimate of gene number occurred, in part, because many genes code for multiple proteins through alternative splicing. Recall from earlier in the text (see Chapter 12) that alternative splicing patterns can generate multiple mRNA molecules, and thus multiple proteins, from a single gene. Initial estimates suggested that over 50 percent of human genes undergo alternative splicing to produce multiple transcripts and multiple proteins. Recent studies, however, suggest that ~94 to 95 percent of human pre-mRNAs containing multiple exons can potentially result in multiple different protein products.

There is still no consensus among scientists worldwide about the exact number of human genes, partly because it is unclear whether or not many of the presumed genes produce functional proteins. Currently, annotation predicts that the human genome encodes approximately 21,000 proteins. But (see Table 18.1), due to posttranslational modifications and other processes, the total number of proteins present in human cells is thought to be anywhere from approximately 200,000 to 1 million! Genome scientists continue to annotate the genome, and as mentioned earlier, functional genomics studies have important roles in determining whether or not computational predictions about the number of protein-coding and non–protein-coding genes are accurate.

It is clear that human genes encode an incredible diversity of proteins. During the HGP, functional categories were assigned for human genes, primarily on the basis of:

1. Functions determined previously (for example, from recombinant DNA cloning of human genes and known mutations involved in human diseases)
2. Comparisons to known genes and predicted protein sequences from other species
3. Predictions based on annotation and analysis of protein functional domains and motifs
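As a brief aside on the completion standard quoted in this section (fewer than 1 error per 10,000 nucleotides, that is, a 0.01 percent error rate), the arithmetic can be sketched in a few lines of Python. The function and constant names are illustrative only. The genome size is the 3.1 billion nucleotides listed in Table 18.1. Reading the text's figure of about 600,000 errors as a count over both haploid sets (roughly 6 billion bp) is an assumption made here, not something the chapter states.

```python
def expected_errors(genome_bp, bases_per_error):
    """Expected number of erroneous bases at 1 error per `bases_per_error`."""
    return genome_bp / bases_per_error


HAPLOID_BP = 3_100_000_000   # genome size from Table 18.1
BASES_PER_ERROR = 10_000     # 0.01 percent error rate

# One haploid reference sequence.
haploid_errors = expected_errors(HAPLOID_BP, BASES_PER_ERROR)

# Assumption: both haploid sets counted, which lands near the ~600,000 figure.
diploid_errors = expected_errors(2 * HAPLOID_BP, BASES_PER_ERROR)

print(f"haploid: {haploid_errors:,.0f} errors")
print(f"diploid: {diploid_errors:,.0f} errors")
```

Either way, the point made in the text holds: an error rate that sounds negligible still leaves hundreds of thousands of uncertain positions in a genome of this size.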


Although functional categories and assignments continue to be revised, at the time the HGP was completed, the functions of over 40 percent of human genes were unknown. Determining human gene functions, deciphering complexities of gene-expression regulation and gene interaction, and uncovering the relationships between human genes and phenotypes are among the many ongoing challenges for genome scientists.

Individual Variations in the Human Genome

The HGP originally revealed that in all humans, regardless of racial and ethnic origins, the genomic sequence is approximately 99.9 percent the same. As we discuss in other chapters, most genetic differences among humans result from single-nucleotide polymorphisms (SNPs) and copy number variations (CNVs). Recall that SNPs are single-base changes in the genome and variations of many SNPs are associated with disease conditions, such as sickle-cell anemia and cystic fibrosis. Later in the text (see Special Topics Chapter 2: Genetic Testing), we will examine how SNPs can be detected and used for diagnosis and treatment of disease.

After the draft sequence of the human genome was completed, it initially appeared that most genetic variations between individuals (the 0.1 percent differences) were due to SNPs. While SNPs are important contributing factors to genome variation, structural differences that we discussed earlier in the text (see Chapter 11) such as deletions, duplications, inversions, and CNVs, which can span millions of base pairs of DNA, play much more important roles in genome variation than previously thought. Recall that CNVs are duplications or deletions of relatively large sections of DNA on the order of several hundred or several thousand base pairs. Many of the CNVs that vary the most among genomes appear to be at least 1 kilobase.

Although most human DNA is present in two copies per cell, one from each parent, CNVs are segments of DNA that are duplicated or deleted, resulting in variations in the number of copies of a DNA segment inherited by individuals. In some cases CNVs are major deletions at the exon level or involving entire genes; other deletions affect gene function by frameshifts in the reading code. CNVs that are duplicated can result in overexpression of a particular gene, yet many deleted and duplicated CNVs do not present clearly identifiable phenotypes.

Current estimates of the number of CNVs in an individual genome range from about 12 to perhaps 4 to 5 dozen per person. Some studies estimate that there may be as many as 1500 CNVs greater than 1 kb among the human genome. Other studies claim there are more than 1.5 million deletions of less than 100 bp that contribute to genome variation between individuals.

Accessing the Human Genome Project on the Internet

It is now possible to access databases and other sites on the Internet that display maps for all human chromosomes. You will visit a number of these databases in Exploring Genomics exercises. Figure 18.6(a) displays a partial gene map for chromosome 12 that was taken from an NCBI database called Genome Data Viewer. The first image shows an ideogram, or cytogenetic map, of chromosome 12. To the right of the ideogram is a column showing the contigs (arranged vertically) that were aligned to sequence this chromosome. The UniGene column displays a histogram representation of gene density on chromosome 12. Notice that relatively few genes are located near the centromere. Finally, gene symbols, loci, and gene names (by description) are provided for 20 selected genes. When accessing these maps on the Internet, one can magnify, or zoom in on, each region of the chromosome, revealing all genes mapped to a particular area.

You can see that most of the genes listed in Figure 18.6(a) have been assigned descriptions based on the functions of their products: Some are transmembrane proteins; some are enzymes such as kinases; some are receptors, including several involved in olfaction; and so on.

The HGP’s most valuable contribution will perhaps be the identification of disease genes and the development of new treatment strategies as a result. Thus, extensive maps have been developed for genes implicated in human disease conditions. The disease gene map of chromosome 21 shown in Figure 18.6(b) indicates genes involved in amyotrophic lateral sclerosis (ALS), Alzheimer disease, cataracts, deafness, and several different cancers. In a later chapter (see Special Topics Chapter 2: Genetic Testing), we discuss implications of the HGP for the identification of genes involved in human genetic diseases and for disease diagnosis, detection, and gene therapy applications.

ESSENTIAL POINT
The HGP has revealed many surprises about human genetics, including gene number and the high degree of DNA sequence similarity both between individuals and between humans and other species, and showed that many genes encode multiple RNAs and proteins. ■

18.4 The “Omics” Revolution Has Created a New Era of Biological Research

The Human Genome Project and the development of genomics techniques have been largely responsible for launching a new era of biological research: the era of “omics.” It seems


(a) [Gene map of chromosome 12: ideogram (bands 12p13.33 to 12q24.33, scale 0 to 130 Mb), contig column, and UniGene cluster histogram of gene density, followed by 20 selected genes.]

Gene symbol  Locus      Description
FGF23        12p13.32   Fibroblast growth factor 23
VWF          12p13.31   Von Willebrand factor
TNFRSF1A     12p13.31   TNF receptor superfamily member 1A
CD4          12p13.31   CD4 molecule
GNB3         12p13.31   G protein subunit beta 3
CDKN1B       12p13.1    Cyclin dependent kinase inhibitor 1B
KRAS         12p12.1    KRAS proto-oncogene, GTPase
LRRK2        12q12      Leucine rich repeat kinase 2
VDR          12q13.11   1,25-dihydroxyvitamin D3 receptor; nuclear receptor subfamily 1 group I member 1; vitamin D nuclear receptor variant 1; vitamin D3 receptor
SP1          12q13.13   Sp1 transcription factor
CDK2         12q13.2    Cyclin dependent kinase 2
IFNG         12q15      OTTHUMP00000240111
MDM2         12q15      MDM2 proto-oncogene
IGF1         12q23.2    Insulin like growth factor 1
PAH          12q23.2    Phenylalanine hydroxylase
ALDH2        12q24.12   Aldehyde dehydrogenase 2 family (mitochondrial)
PTPN11       12q24.13   Protein tyrosine phosphatase, non-receptor type 11
HNF1A        12q24.31   HNF1 homeobox A
P2RX7        12q24.31   Purinergic receptor P2X 7
UBC          12q24.31   Ubiquitin C

(b) [Disease gene map of chromosome 21 (48 million bases). Conditions and genes labeled on the map:] Coxsackie and adenovirus receptor; Amyloidosis cerebroarterial, Dutch type; Alzheimer disease, APP-related; Schizophrenia, chronic; Usher syndrome, autosomal recessive; Amyotrophic lateral sclerosis; Oligomycin sensitivity; Jervell and Lange-Nielsen syndrome; Long QT syndrome; Down syndrome cell-adhesion molecule; Homocystinuria; Cataract, congenital, autosomal dominant; Deafness, autosomal recessive; Myxovirus (influenza) resistance; Leukemia, acute myeloid; Myeloproliferative syndrome, transient; Leukemia transient of Down syndrome; Enterokinase deficiency; Multiple carboxylase deficiency; T-cell lymphoma invasion and metastasis; Mycobacterial infection, atypical; Down syndrome (critical region); Autoimmune polyglandular disease, type 1; Bethlem myopathy; Epilepsy, progressive myoclonic; Holoprosencephaly, alobar; Knobloch syndrome; Hemolytic anemia; Breast cancer; Platelet disorder, with myeloid malignancy.

FIGURE 18.6 (a) A gene map for chromosome 12 from the NCBI database Genome Data Viewer. (b) Partial map of disease genes on human chromosome 21. Maps such as this depict genes thought to be involved in human genetic disease conditions.


that every year, more areas of biological research are being described as having an omics connection. Some examples of “omics” are:

■■ Proteomics: the analysis of all the proteins in a cell or tissue
■■ Metabolomics: the analysis of proteins and enzymatic pathways involved in cell metabolism
■■ Glycomics: the analysis of the carbohydrates of a cell or tissue
■■ Toxicogenomics: the analysis of the effects of toxic chemicals on genes, including mutations created by toxins and changes in gene expression caused by toxins
■■ Metagenomics: the analysis of genomes of organisms collected from the environment
■■ Pharmacogenomics: the development of customized medicine based on a person’s genetic profile for a particular condition
■■ Transcriptomics: the analysis of all expressed genes in a cell or tissue

We will consider several of these genomics disciplines in other parts of this chapter.

ESSENTIAL POINT
Genomics has led to a number of related “omics” disciplines that are rapidly changing how modern biologists study DNA, RNA, proteins, and many aspects of cell function. ■

After the HGP, What’s Next?

Since completion of a reference sequence of the human genome, studies have continued at a very rapid pace and new areas for human genome research have emerged. One such area is the analysis of the epigenome, in which a Human Epigenome Project is creating hundreds of maps of epigenetic changes in different cell and tissue types and evaluating potential roles of epigenetics in complex diseases (see also Special Topics Chapter 1: Epigenetics). Another research area is the characterization of SNPs (the International HapMap Project) and CNVs for their role in genome variation, disease, and pharmacogenomics applications. Yet another theme includes human cancer genome projects. We will discuss aspects of one of these (Cancer Genome Atlas Project) later in the text (see Chapter 19). Here we consider several examples of genome research that are extensions of the HGP.

Personal Genome Projects

As we discussed earlier in this chapter (and in Chapter 17), next- and third-generation sequencing technologies are capable of generating sequence reads at higher speeds with greater accuracy and have greatly reduced the cost of DNA sequencing. Expectations for continued cost reductions along with continued technological advances are high (see Figure 18.7) and have led several companies to propose WGS for individual people, a personal genomics approach.

[Figure 18.7 graph. Horizontal axis: year, 2000 to 2018. Left axis: whole-genome sequence data in billions of base pairs, 0 to 300. Right axis: cost per million base pairs of sequence (log scale), $1 to $10,000. Annotations on the graph:]
■■ Automated Sanger Sequencing: At the peak of this technique, a single machine could produce hundreds of thousands of base pairs in a single run.
■■ 454 sequencing is considered the first “next-generation” technique. A machine could sequence hundreds of millions of base pairs in a single run.
■■ Sequencing by Ligation: This technique employed in SOLiD instruments uses a different chemistry from previous technologies and samples every base twice, reducing the error rate.
■■ Sequencing by Synthesis: Other companies such as Illumina modified the next-generation, sequencing-by-synthesis techniques and can produce billions of base pairs in a single run.
■■ Third-Generation Sequencing: Some methods can read sequence from short, single DNA molecules. Others, such as Pacific Biosciences, and Ion Torrent, can read from longer molecules as they pass through a pore.
■■ Milestones marked: first drafts of two composite haploid human genomes; Human Genome Project completed; J. Craig Venter diploid genome; James Watson, a woman with acute myeloid leukemia, a Yoruba male from Nigeria, and the first Asian genome.

FIGURE 18.7 Human genome sequence explosion. Sequencing costs have steadily declined since 2000 due to innovations in sequencing technology. As a result, notice that the number of individual genomes sequenced has dramatically increased.


Several companies have now developed technology that can sequence a genome for less than $1000. Whether the $1000 mark represents the costs of reagents to sequence a genome or actual costs when sequence preparation, labor, and analysis of the genome are taken into account can be debated. Having somebody such as a geneticist analyze your personal genome data and consider how genome variations may affect your health may be expensive. So even if the cost of sequencing a genome is less than $1000, interpreting genome data to make sense for medical treatment may cost thousands of dollars more. Regardless of how the actual cost of sequencing a genome is calculated, the modern cost is substantially lower than the $3 billion cost proposed at the start of the HGP.

As of early 2018, an estimated 400,000 individual human genomes have been sequenced. In Special Topics Chapter 2: Genetic Testing, we will talk about the implication of personal genomics for genetic testing.

Somatic Genome Mosaicism and the Emerging Pangenome

The HGP pooled samples from multiple individuals to create a haploid reference genome. In contrast, personal genome projects (PGPs) sequence a diploid genome. Personal genome projects generate sequences of millions of short DNA fragments from maternal and paternal chromosomes that are mapped onto the reference genome. Such projects indicate that haploid reference genome comparisons often underestimate the extent of genome variation between individuals by five-fold or more.

Therefore, genome variation between individuals may be closer to 0.5 percent than to the 0.1 percent predicted by the HGP, and in a 3-billion-bp genome this is a significant difference in sequence variation (roughly 15 million versus 3 million differing base pairs). Integrating genome data from several complete personal genomes of individuals from different ethnic groups will also be of great value in evolutionary genetics to address fundamental questions about human diversity, ancestry, and migration patterns.

Personal genome projects are also revealing that there can be significant genome mosaicism in human somatic cells. It is now apparent that cells in an individual person do not all contain identical genomes. Instead, we have to think of an individual as being made up of a population of cells, each with its own unique personal genome. Somatic mosaicism can result from errors in DNA replication, creating aneuploidy, CNVs, SNPs, and other variations that accumulate as cells divide during development. In somatic cells these variations are passed to daughter cells during mitosis, but typically not transmitted to offspring.

We are only beginning to understand the frequency and effects of genetic mosaicism on health and disease.

Newer sequencing methods and expansion of personal genome projects are demonstrating significant genomic variation not just in humans but in most species. It has now become apparent that genomic variation among individuals is much more prevalent than can be determined by a reference sequence. In bacteria such as Streptococcus, for example, several dozen different genes can exist between isolates of the same strain. As a consequence, genome scientists in many fields have now replaced single reference genomes with the concept of the pangenome to describe all distinct genes and variations in a species (Figure 18.8).

[Figure 18.8 diagram: a reference genome; a comparison of four individual genomes (A, B, C, and D); and a pangenome depiction of genome variation in which each segment is labeled with the genomes that carry it.]

FIGURE 18.8 A pangenome attempts to visualize all genomic segments and gene variations found in a species. Notice that there are variations in individual genomes not represented in the reference genome, but these variations are included in the pangenome.

Whole-Exome Sequencing

While we have thus far focused on sequencing the entire genome, it is worthwhile to recall that only about 2 percent of the genome sequence consists of protein-coding genes. Thus, in personal genomic analysis, the focus has shifted toward whole-exome sequencing (WES), that is, sequencing only the 180,000 exons in a person’s genome. WES reveals mutations involved in disease by focusing only on protein-coding segments of the genome, where there are more disease-related genetic variations than in other regions of the genome. WES can be done at a cost of much less than $1000 with greater than 100× coverage (that is, each base in the exome is sequenced more than 100 times on average). Of course, a limitation of this approach is its failure to identify mutations in gene-regulatory regions that influence gene expression. As the cost of WGS continues to drop, scientists and clinicians trying to detect disease-causing mutations are debating whether it


makes sense to simply sequence the entire genome or to just sequence exomes first to find mutations.

In 2015, after seven years of work, a group of scientists called the 1000 Genomes Project Consortium reported on the genomes of 2504 individuals from 26 populations representing Europe, East Asia, South Asia, Africa, and the Americas. WGS and WES data revealed nearly 88 million bi- and multi-allelic SNPs and many other structural variations such as CNVs. One interpretation of this work is that it reveals clear variations in individuals and associates particular diseases with geographic or ancestral background. Thus sequencing genomes of individuals from diverse populations can help us to better understand the spectrum of human genetic variation and to learn the causes of genetic diseases across diverse groups. We will come back to the topic of WES later in the text when we discuss genetic testing (see Special Topics Chapter 2, Genetic Testing).

Encyclopedia of DNA Elements (ENCODE) Project

In 2003, a few months after the announcement that the human genome had been sequenced, a group of about three dozen research teams around the world began the Encyclopedia of DNA Elements (ENCODE) Project. Using both experimental and computational approaches, a main goal of ENCODE was to identify and analyze all functional elements of the genome, including those that regulate the expression of human genes (such as transcriptional start sites, promoters, and enhancers). ENCODE projects have also been initiated for mouse, worm, and fly genomes.

Because only a relatively small percentage (less than 2 percent) of the human genome codes for proteins, ENCODE focused heavily not on genes but on all the rest of the sequences, commonly referred to as “junk” DNA. So what are all these other bases in the genome doing? The term junk DNA has always been a misnomer. We know that such sequences are important for chromosome structure, the regulation of gene expression, and other roles. Just because these sequences themselves do not code for protein does not mean that they are unimportant.

ENCODE studied gene expression in 147 different cell types because genome activity differs from cell to cell. After about a decade of research and a cost of $288 million, in 2012 a group of 30 research papers was published revealing the major initial findings of the ENCODE Project. Selected highlights of what ENCODE revealed in initial papers and more recent papers include the following:

■■ The majority, 80 percent, of the human genome is considered functional. In part, this is because large segments of the genome are transcribed into RNA; however, most do not encode proteins. These various RNAs include tRNAs, rRNAs, miRNAs, and long noncoding RNAs (lncRNAs), defined as non–protein-coding transcripts longer than 200 nucleotides. A conservative estimate is that there may be over 17,000 genes for lncRNAs. It may turn out that the number of noncoding RNA sequences will outnumber protein-coding genes.
■■ The functional sequences also include gene-regulatory regions: 70,000 promoter and nearly 400,000 enhancer regions.
■■ There are 20,687 protein-coding genes in the human genome.
■■ A total of 11,224 sequences are characterized as pseudogenes, previously thought to be inactive in all individuals. Some of these are inactive in most individuals but occasionally active in certain cell types of some individuals, which may eventually warrant their reclassification as active, transcribed genes.
■■ SNPs associated with disease are enriched within noncoding functional elements of the genome, often residing near protein-coding genes.

The ENCODE findings have broadly defined the functional roles of the genome to include encoding either proteins or noncoding RNAs and displaying biochemical properties such as the binding of regulatory proteins that influence transcription or chromatin structure. It is worth noting, however, that a relatively large body of geneticists and other scientists do not agree with ENCODE’s definition of functional sequences. One reason cited is that ENCODE did not adequately address many of the repetitive sequences in the genome such as transposons, LINEs, SINEs, and other sequences such as telomeres and centromeres. There has also been significant debate about the value of ENCODE, given the cost of the project. Despite this, research teams are using information from ENCODE to identify risk factors for certain diseases, with the hopes of developing appropriate cures and treatments.

EVOLVING CONCEPT OF THE GENE
Based on the work of the ENCODE project, we now know that DNA sequences that have previously been thought of as “junk DNA” because they do not encode proteins are nonetheless often transcribed into what we call noncoding RNA (ncRNA). Since the function of some of these RNAs is now being determined, we must consider whether the concept of the gene should be expanded to include DNA sequences that encode ncRNAs. At this writing, there is no consensus, but it is important for you to be aware of these current findings as you develop your final interpretation of a gene. ■

Nutrigenomics Considers Genetics and Diet

As evidence of the impact of genomics, a field of nutritional science called nutritional genomics, or nutrigenomics, has emerged. Nutrigenomics focuses on understanding the

M18_KLUG8414_10_SE_C18.indd 359 16/11/18 5:16 pm 360 18 Genomics, Bioinformatics, and Proteomics

interactions between diet and genes. We have all had routine medical tests for blood pressure, blood sugar levels, and heart rate. Based on these tests, your physician may recommend that you change your diet and exercise more to lose weight or that you reduce your intake of sodium to help lower your blood pressure.

Now several companies claim to provide nutrigenomics tests that analyze your genome for genes thought to be associated with different medical conditions linked to nutrient metabolism. The companies then provide a customized nutrition report, recommending diet changes for improving your health and preventing illness, based on your genes! It is important to note that these tests have not yet been validated as accurate and they have not been approved by the U.S. Food and Drug Administration. It remains to be seen whether this approach as currently practiced is of valid scientific or nutritional value.

No Genome Left Behind and the Genome 10K Plan

Without question, new sequencing technologies that have been developed are an important part of the transformational effect the HGP has had on modern biology. Recent headline-grabbing genomes that have been completed include:

■■ Apple and tomato: The apple, which has more than 57,000 genes, and the tomato, which has 31,760 genes, each have more genes than humans!
■■ Potato: This plant, which shares 92 percent of its DNA with tomatoes, turns out to be a fruit.
■■ Chickpea: It is one of the earliest cultivated legumes and the second most widely grown legume after the soybean.
■■ Red-spotted newt: This newt has a genome of almost 10 times the size of a human genome.

Modern sequencing technologies are prompting some to consider the question, “What would you do if you could sequence everything?” In 2009, partners around the world, including genome scientists and zoo and museum curators, began work on sequencing 10,000 vertebrate genomes: the Genome 10K Project. Shortly after the HGP finished, the National Human Genome Research Institute (NHGRI) assembled a list of mammals and other vertebrates as priorities for genome sequencing in part because of their potential benefit for learning about the human genome through comparative genomics. The Genome 10K Project will also provide insight into genome evolution and speciation.

Stone-Age Genomics

In yet another example of how genomics has taken over areas of DNA analysis, a number of labs around the world are involved in analyzing “ancient” DNA. These so-called stone-age genomics studies are generating fascinating data from minuscule amounts of ancient DNA obtained from bone and other tissues such as hair that are tens of thousands to about 700,000 years old and often involve samples from extinct species. Analyses of DNA from a 2400-year-old Egyptian mummy, bison, mosses, platypus, mammoths, Pleistocene-age cave bears and polar bears, coelacanths, and Neanderthals are some of the most prominent examples of stone-age genomics. In 2013, scientists reported the oldest intact genome sequence to be successfully analyzed to date. It came from a 700,000-year-old bone fragment from an ancient horse uncovered from the frozen ground in the Yukon Territory of Canada. This result is interesting in part because evolutionary biologists have used genomic data to estimate that ancient ancestors of modern horses branched off from other animal lineages around 4 million years ago, about twice as long ago as prior estimates.

A little over a decade ago, researchers published about 13 million bp of a sequence from a 27,000-year-old woolly mammoth found frozen and nearly intact in Siberia. This study revealed a 98.5 percent sequence identity between mammoths and African elephants. Subsequent WGS studies by other scientists using mitochondrial and nuclear DNA from Siberian mammoths suggested that the mammoth genome differs from the African elephant by as little as 0.6 percent. These studies are also great demonstrations of how stable DNA can be under the right conditions, particularly when frozen.

In Section 18.5 we discuss recent work on the Neanderthal genome. Obtaining the genome of a human ancestor this old was previously unimaginable. This work is providing new insights into our understanding of human evolution.

ESSENTIAL POINT
Personal genome sequencing, including exome sequencing, and epigenome analysis will provide unparalleled insight into individual variations in genomes, and both approaches have tremendous potential for the diagnosis of genetic diseases. ■

18.5 Comparative Genomics Provides Novel Information about the Human Genome and the Genomes of Model Organisms

Comparative genomics compares the genomes of different organisms to answer questions about genetics and other aspects of biology. It is a field with many research and practical applications, including gene discovery and the development of model organisms to study human diseases.
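Figures such as the 98.5 percent mammoth and African elephant sequence identity quoted above come from comparing aligned sequences position by position. The following is a minimal sketch of that calculation, assuming the two sequences have already been aligned to equal length with "-" marking gaps. The function name and the toy sequences are hypothetical and are not taken from any study; published analyses may also treat gaps differently.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of non-gap alignment columns in which both sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gap columns are excluded from the comparison
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


# Toy, made-up alignment (not real mammoth or elephant sequence)
seq_1 = "ATGCC-TAGGCTA"
seq_2 = "ATGCCGTAGACTA"
identity = percent_identity(seq_1, seq_2)
print(f"{identity:.1f}% identical")
```

Excluding gap columns is only one convention; counting gaps as mismatches would give a lower value for the same alignment.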


It also incorporates the study of gene and genome evolution and the relationship between organisms and their environment.

Comparative genomics uses a wide range of techniques and resources, such as the construction and use of nucleotide and protein databases, fluorescence in situ hybridization (FISH), and the creation of gene knockout animals. Comparative genomics can reveal genetic differences and similarities between organisms to provide insight into how those differences contribute to differences in phenotype, life cycle, or other attributes and to ascertain the evolutionary history of those genetic differences.

As of 2018, over 23,000 whole genomes have been sequenced, including many model organisms and a number of viruses. This is quite extraordinary progress in a relatively short time span! Among these organisms are yeast (S. cerevisiae, the first eukaryotic genome to be completely sequenced), bacteria such as E. coli, the nematode roundworm (Caenorhabditis elegans), the thale cress plant (Arabidopsis thaliana), mice (M. musculus), zebrafish (Danio rerio), and, of course, Drosophila. In recent years, genomes of chimpanzees, dogs, chickens, gorillas, sea urchins, honey bees, pigs, pufferfish, rice, and wheat have all been sequenced. These complete genome sequences have been invaluable for comparative genomics studies of gene function in these organisms and in humans. As shown in Table 18.2, the number of genes humans share with other species is very high, ranging from about 30 percent of the genes in yeast to 80 percent in mice and 98 percent in chimpanzees. The human genome even contains around 100 genes that are also present in many bacteria.

Comparative studies have demonstrated not only significant differences in genome organization among bacteria and eukaryotes but also many similarities between genomes of nearly all species. Analysis of the growing number of genome sequences confirms that all living organisms are related and descended from a common ancestor. A key piece of evidence for this is the observation that all organisms studied to date use similar gene sets for basic cellular functions, such as DNA replication, transcription, and translation.

Comparative genomics has further shown that many genes identified as being involved in human disease are also present in other species. For instance, approximately 60 percent of genes associated with nearly 300 human diseases are also present in Drosophila. These include genes involved in a variety of cancers; cardiovascular disease; cystic fibrosis; and other conditions. These genetic relationships are the rationale for using model organisms to study inherited human disorders; the effects of the environment on genes; and the interactions of genes in complex diseases.

Next, we consider how comparative genomics studies of sea urchins, a model organism, and Neanderthals have revealed interesting elements of the human genome.

TABLE 18.2 Comparison of Selected Genomes

Genome sizes are given in million (megabase, Mb) or billion (gigabase, Gb) base pairs.

Organism (Scientific Name) | Approximate Size of Genome (Date Completed) | Diploid (2n) Chromosome Number | Number of Genes | Approximate Percentage (%) of Genes Shared with Humans
African clawed frog (Xenopus laevis) | 3.1 Gb (2016) | 36 | 45,000 | 70
Bacterium (Escherichia coli) | 4.6 Mb (1997) | 1 | 4,403 | Not determined
Chicken (Gallus gallus) | 1 Gb (2004) | 78 | 20,000–23,000 | 60
Dog (Canis familiaris) | 2.5 Gb (2003) | 78 | 18,400 | 75
Chimpanzee (Pan troglodytes) | 3 Gb (2005) | 48 | 20,000–24,000 | 98
Fruit fly (Drosophila melanogaster) | 165 Mb (2000) | 8 | 13,600 | 50
Human (Homo sapiens) | 3.1 Gb (2004) | 46 | 20,000 | 100
Mouse (Mus musculus) | 2.5 Gb (2002) | 40 | 30,000 | 80
Pig (Sus scrofa) | 3 Gb (2012) | 38 | 21,640 | 84
Rat (Rattus norvegicus) | 2.75 Gb (2004) | 42 | 22,000 | 80
Rhesus macaque (Macaca mulatta) | 2.87 Gb (2007) | 42 | 20,000 | 93
Rice (Oryza sativa) | 389 Mb (2005) | 24 | 41,000 | Not determined
Roundworm (Caenorhabditis elegans) | 97 Mb (1998) | 12 | 19,099 | 40
Sea urchin (Strongylocentrotus purpuratus) | 814 Mb (2006) | 42 | 23,500 | 60
Thale cress (plant) (Arabidopsis thaliana) | 140 Mb (2000) | 10 | 27,500 | Not determined
Zebrafish (Danio rerio) | 1.4 Gb (2013) | 50 | 41,800 | 70
Yeast (Saccharomyces cerevisiae) | 12 Mb (1996) | 32 | 5,700 | 30

Originally adapted from Palladino, M. A. (2006) Understanding the Human Genome Project, 2nd ed. Benjamin Cummings.
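As a rough illustration of how the values in Table 18.2 can be compared, the sketch below computes gene density (genes per megabase) for a few rows. The sizes and gene counts are copied from the table; the dictionary layout and function name are illustrative choices, and gene density is a derived quantity that the table itself does not report.

```python
# (genome size in Mb, approximate number of genes), copied from Table 18.2
GENOMES = {
    "Escherichia coli": (4.6, 4403),
    "Saccharomyces cerevisiae": (12, 5700),
    "Caenorhabditis elegans": (97, 19099),
    "Drosophila melanogaster": (165, 13600),
    "Homo sapiens": (3100, 20000),  # 3.1 Gb expressed in Mb
}


def genes_per_mb(size_mb: float, n_genes: int) -> float:
    """Gene density: approximate number of genes per megabase of genome."""
    return n_genes / size_mb


densities = {name: genes_per_mb(*row) for name, row in GENOMES.items()}

# Print from most to least gene-dense genome
for name, density in sorted(densities.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:26s} {density:8.1f} genes/Mb")
```

For these rows, gene density falls as genome size rises, which is one way to see that gene number does not scale with genome size.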


The Sea Urchin Genome

In 2006, researchers from the Sea Urchin Genome Sequencing Consortium completed the 814-million-bp genome of the sea urchin Strongylocentrotus purpuratus. Sea urchins are shallow-water marine invertebrates often studied by developmental biologists. Fossil records indicate that sea urchins appeared during the Early Cambrian period, around 520 million years ago (mya).

Sea urchins have an estimated 23,500 genes, including representatives of almost all major vertebrate gene families. Sequence alignment and homology searches demonstrate that the sea urchin contains many genes with important functions in humans, yet is missing genes that are important in flies and worms. For example, certain cytochrome P450 genes, which play a role in the breakdown of toxic compounds, are not found in sea urchins. The sea urchin genome also has an abundance (25 to 30 percent) of pseudogenes, nonfunctional duplications of protein-coding genes. Sea urchins have a smaller average intron size than humans, supporting a general trend revealed by comparative genomics that intron size is correlated with overall genome size.

Urchins have nearly 1000 genes for sensing light and odor, indicative of great sensory abilities. In this respect, their genome is more typical of vertebrates than invertebrates. A number of orthologs of human genes involved in hearing and balance are present, as are many human-disease-associated orthologs, including genes for protein kinases, transcription factors, innate immunity, and low-density lipoprotein receptors. Sea urchins and humans share approximately 7000 orthologs.

The Neanderthal Genome and Modern Humans

Svante Pääbo, of the Max Planck Institute for Evolutionary Anthropology in Germany, specializes in studying ancient genomes. In 1997, he and colleagues sequenced portions of Neanderthal (Homo neanderthalensis) mitochondrial DNA from an undated fossil. Nine years later, in late 2006, Pääbo’s group along with a number of scientists in the United States reported the first sequence of nuclear DNA isolated from Neanderthal bone samples. Continuing this work, in 2010 he led an international team of scientists who reported completion of a rough draft of the Neanderthal genome that encompassed more than 4 billion bp of DNA. Bones from three females who lived in Vindija Cave in Croatia about 38,000 to 44,000 years ago were used to produce this draft sequence. In 2016, Pääbo and colleagues published a complete, high-quality (52-fold coverage) nuclear genomic sequence from a Neanderthal bone sample that is at least 50,000 years old. Interestingly, a sequence recently recovered from fossilized remains of a Neanderthal that lived over 400,000 years ago is thought to be the oldest Neanderthal DNA ever analyzed.

Because Neanderthals are members of the human family, and closer relatives to humans than chimpanzees, the Neanderthal genome provides an unprecedented opportunity to use comparative genomics to advance our understanding of evolutionary relationships between modern humans and our predecessors. In particular, scientists are interested in identifying areas in the genome where humans have undergone rapid evolution since splitting (diverging) from Neanderthals. Much of this analysis involves applying a comparative genomics approach to all three genomes (human, Neanderthal, and chimpanzee).

The human and Neanderthal genomes are 99 percent identical. Comparative genomics has identified 78 protein-coding sequences in humans that seem to have arisen since the divergence from Neanderthals and that may have helped modern humans adapt. Some of these sequences are involved in cognitive development and sperm motility. Among the many genes shared by humans and Neanderthals is FOXP2, a gene that has been linked to speech and language ability. There are many genes that influence speech, so this finding does not mean that Neanderthals spoke as we do. But because Neanderthals had the same modern human FOXP2 gene, scientists have speculated that Neanderthals possessed linguistic abilities.

The realization that modern humans and Neanderthals lived in overlapping ranges as recently as 30,000 years ago has led to speculation about interactions between modern humans and Neanderthals. Genome studies suggest that interbreeding took place between Neanderthals and modern humans an estimated 45,000 to 80,000 years ago in the eastern Mediterranean. In fact, the genome of non-African H. sapiens contains approximately 1 to 4 percent of sequence inherited from Neanderthals. These exciting studies, previously thought to be impossible, are having ramifications in many areas of the study of human evolution, and it will be interesting indeed to follow the progress of this work.

ESSENTIAL POINT
Studies in comparative genomics are revealing fascinating similarities and variations in genomes from different organisms. ■

18.6 Metagenomics Applies Genomics Techniques to Environmental Samples

Metagenomics, also called environmental genomics, is the discipline that uses WGS approaches to sequence genomes from entire communities of microbes in environmental


samples of water, air, and soil. Oceans, glaciers, deserts, and virtually every other environment on Earth are being sampled for metagenomics projects. Human genome pioneer J. Craig Venter left Celera to form the J. Craig Venter Institute, and his group played a central role in developing metagenomics as an emerging area of genomics research.

One of the institute’s major initiatives was a global expedition to sample marine and terrestrial microorganisms from around the world and to sequence their genomes. This project yielded over 1.2 million novel DNA sequences from 1800 microbial species, including 148 previously unknown bacterial species, and identified hundreds of photoreceptor genes.

A key benefit of metagenomics is its potential for teaching us more about millions of yet uncharacterized species of bacteria. Many new viruses, particularly bacteriophages, have been identified through studies of water and soil samples. Further, important new information about genetic diversity in microbes is emerging that is needed to understand complex interactions between microbial communities and their environment and to allow phylogenetic classification of newly identified microbes. Metagenomics also has great potential for identifying genes with novel functions, some of which may have valuable applications in medicine and biotechnology.

The general method used in metagenomics often involves isolating DNA directly from an environmental sample without requiring cultures of the microbes or viruses. Such an approach is necessary because often it is difficult to replicate the complex array of growth conditions the microbes need to survive in culture.

An example of how metagenomics can provide novel insight into the microbial world around us is reflected by a recent study of the microbiota found in subways. This project revealed that most microbes present are non-disease-causing bacteria normally prevalent on human skin and in the GI tract, although pathogens such as Bacillus anthracis were occasionally identified. But almost half of the DNA sequenced did not match any organism in eukaryotic, bacterial, archaeal, and viral genome databases!

The Human Microbiome Project

In 2007 the National Institutes of Health announced plans for the Human Microbiome Project (HMP), a $170 million effort to sequence the genomes of an estimated 600 to 1000 bacteria, viruses, yeasts, and other microorganisms that live on and inside humans. At the start of the project, microorganisms were thought to outnumber human cells by about 10 to 1, although this prediction is likely too high.

Many microbes, such as E. coli in the digestive tract, have important roles in human health, and, of course, other microbes make us ill. The HMP had several major goals, including:

■■ Determining if individuals share a core human microbiome.
■■ Understanding whether changes in the microbiome can be correlated with changes in human health.
■■ Developing new methods, including bioinformatics tools, to support analysis of the microbiome.
■■ Addressing ethical, legal, and social implications raised by human microbiome research.

The HMP involved about 200 scientists at 80 institutions who, in 2012, published a series of papers summarizing their findings. A total of 242 healthy individuals in the United States participated in the studies, and each person was sampled up to three times over nearly two years. Fifteen body sites of males and 18 sites of females were selected for analysis, and WGS of genomes for microbes and viruses present at these sites was performed. In addition, sequences for 16S rRNA genes were used specifically to compare bacterial samples. From this work, more than 3000 reference sequences for microbes isolated from the human body were developed.

The HMP amassed more than 1000 times the sequencing data generated by the HGP. We have formulated the following concepts about the human microbiome:

■■ Sequence data identified an estimated 81 to 99 percent of the microbes and viruses distributed among body areas in human males and females.
■■ As many as 1000 bacterial strains may be present in each person.
■■ The microbiome starts at birth. Babies acquire bacteria from their mothers’ microbiome.
■■ A surprise to HMP scientists, the microbiome can be substantially different from person to person. Based on variation between individuals, an estimated 10,000 bacterial species may be part of the total human microbiome.
■■ Although the microbiome of the human gut differs from person to person, it remains relatively stable over time in individuals.

Based on these findings, there is no single reference human microbiome to which people can be compared. Microbial diversity varies greatly from individual to individual, and a personalization of the microbiome occurs in individuals. For instance, comparing sequences of the microbiomes from two healthy people of equivalent age reveals microbiomes that can be quite different. There are, however, similarities in certain parts of the body; signature bacteria with characteristic genes are specific to certain locations. Knowledge about the personalized nature of the microbiome is already proving valuable for improving human health


and medicine, which may include microbiome-specific therapeutic drugs in the future. Criteria are being sought for a healthy microbiome, which is expected to help determine the role of bacteria in maintaining normal health. This may provide insight into how antibiotics can disturb a person’s microbiome and why certain individuals are susceptible to certain diseases, especially chronic conditions such as psoriasis, irritable bowel syndrome, and potentially even obesity.

Related to this project, a team of researchers at the University of California, Los Angeles, analyzed DNA sequences from 101 college students, 49 of whom had acne and 52 of whom did not. Over 1000 strains of Propionibacterium acnes were isolated. Using WGS and bioinformatics, researchers clustered these strains into ten strain types (related strains). Six of these types were more common among acne-prone students, and one type appeared repeatedly in skin samples from students without acne. Sequence analysis of types associated with acne indicated groups of genes that may contribute to the skin disease. Further analysis of these strain types may help dermatologists develop new drugs targeted at killing acne-causing strains of P. acnes.

A Venn diagram, like the image shown in Figure 18.9, is a common way to represent overlapping data in metagenomics datasets. In this figure, overlapping circles indicate numbers of human gut microbiome genes from individuals with liver cirrhosis, Type 2 diabetes, and irritable bowel syndrome. Notice that each disease has a unique profile of microbial genes but that significant overlaps between microbial genes for each disease occur. Of the nearly 580,000 microbial genes, 403 were shared and thus considered common markers for all three diseases.

The Case Study at the end of this chapter briefly discusses a clinical application focused on the importance of gut microbes for intestinal health.

ESSENTIAL POINT
Metagenomics sequences genomes from organisms in environmental samples, often identifying new sequences that encode proteins with novel functions. ■

18.7 Transcriptome Analysis Reveals Profiles of Expressed Genes in Cells and Tissues

Once any genome has been sequenced and annotated, a formidable challenge remains: that of understanding genome function by analyzing the genes it contains and the ways expressed genes are regulated. Transcriptome analysis (also called transcriptomics or global analysis of gene expression) studies the expression of genes in a genome both qualitatively (by identifying which genes are expressed or not expressed) and quantitatively (by measuring varying levels of expression for different genes). In other words, transcriptome analysis attempts to catalog and quantify the total RNA content of a cell, tissue, or organism.

Even though in theory all cells or tissue types of an organism possess the same genes, depending on each specific cell’s function certain genes will be highly expressed, others expressed at low levels, and some not expressed at all. Transcriptome analysis reveals gene-expression profiles that, for the same genome, may vary from cell to cell or from tissue type to tissue type. Identifying where and when genes are expressed by a genome is essential for understanding how the genome functions.

Transcriptome analysis provides insights into (1) normal patterns of gene expression that are important for understanding how a cell or tissue type differentiates during development, (2) how gene expression dictates and controls the physiology of differentiated cells, and (3) mechanisms of disease development that result from or cause gene-expression changes in cells. Later in the text (see Special Topics Chapter 2, Genetic Testing), we will consider why transcriptome analysis is gradually becoming an important diagnostic tool in certain areas of medicine.

[Figure 18.9: three-circle Venn diagram. Circle labels: Liver cirrhosis, Type 2 diabetes, Irritable bowel disease. Region values as printed: 341,025; 111,608; 65,095; 13,373; 44,373; 1,801; and 403 in the central overlap.]
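The region counts printed in Figure 18.9 can be checked with a few lines of set arithmetic. The sketch below is illustrative only: the counts are the values printed in the figure, but which pairwise overlap belongs to which pair of diseases is inferred from the figure layout and should be treated as an assumption, as should the short labels LC, T2D, and IBS.

```python
# Region counts as printed in Figure 18.9. Keys name the diseases whose circles
# overlap in that region. Assumption: the assignment of the three pairwise
# overlaps to specific disease pairs is inferred from the figure layout.
regions = {
    frozenset({"LC"}): 341_025,        # liver cirrhosis only
    frozenset({"T2D"}): 111_608,       # Type 2 diabetes only
    frozenset({"IBS"}): 65_095,        # irritable bowel circle only
    frozenset({"LC", "T2D"}): 13_373,
    frozenset({"LC", "IBS"}): 44_373,
    frozenset({"T2D", "IBS"}): 1_801,
    frozenset({"LC", "T2D", "IBS"}): 403,
}

# Every gene falls in exactly one region, so the union is the sum of regions.
total_genes = sum(regions.values())


def genes_in(disease: str) -> int:
    """Total genes in one circle: every region that includes that disease."""
    return sum(n for members, n in regions.items() if disease in members)


shared_by_all = regions[frozenset({"LC", "T2D", "IBS"})]
print(total_genes)
print({d: genes_in(d) for d in ("LC", "T2D", "IBS")})
print(f"{100 * shared_by_all / total_genes:.3f}% of genes shared by all three")
```

The regions sum to 577,678, which matches the "nearly 580,000 microbial genes" in the text. That total does not depend on how the pairwise overlaps are assigned, whereas the per-disease totals do.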

FIGURE 18.9 Venn diagram representation of gut microbial genes from patients with liver cirrhosis, Type 2 diabetes, and irritable bowel syndrome. Notice that different diseases show large numbers of unique genes with smaller numbers of shared genes.

DNA Microarray Analysis

A number of different techniques can be used for transcriptome analysis. PCR-based methods, such as reverse transcription PCR (RT-PCR) and quantitative real-time PCR (qPCR) (described in Chapter 17), are useful because of their


ability to detect genes expressed at low levels. For nearly two decades DNA microarray analysis has been widely used because it enables researchers to analyze all of a sample’s expressed genes simultaneously (see Figure 18.10).

Most DNA microarrays, also known as gene chips, are prepared by “spotting” single-stranded DNA molecules onto glass slides using a computer-controlled high-speed robotic arm called an arrayer. Arrayers are fitted with a number of tiny pins. Each pin is immersed in a small amount of solution containing millions of copies of a different single-stranded DNA molecule. [For example, many microarrays are prepared with single-stranded complementary DNAs (cDNAs) or expressed sequence tags (ESTs), short fragments of DNA cloned from expressed genes.] The arrayer fixes the DNA onto the slide at specific locations (called spots, fields, or features) that will be scanned and recorded by a computer. A single microarray can have over 20,000 different spots of DNA (and over 1 million for exon-specific microarrays), each containing a unique sequence that serves as a probe for a different gene.

To use a microarray for transcriptome analysis, scientists typically begin by extracting mRNA from cells or tissues (Figure 18.10). The mRNA is usually then reverse transcribed to synthesize cDNA tagged with fluorescently

[Figure 18.10 flowchart, starting from a tissue sample]

1. Isolate mRNA (mRNA molecules).
2. Make cDNA by reverse transcription, using fluorescently labeled nucleotides (labeled cDNA molecules, single strands).
3. Hybridization: Apply the cDNA mixture to a DNA microarray (chip). Fixed to each spot on a microarray are millions of copies of short single-stranded DNA molecules, a different gene to each spot. The cDNA hybridizes to the complementary DNA strand on the microarray.
4. Rinse off excess cDNA, put the microarray in a scanner to measure fluorescence of each spot. Fluorescence intensity indicates the amount of mRNA expressed in the tissue sample.

Readout from the scanner:
No fluorescence: gene not expressed in tissue sample
Moderate fluorescence: low gene expression
Bright fluorescence: highly expressed gene in tissue sample

FIGURE 18.10 DNA microarray analysis for analyzing gene-expression patterns in a tissue.
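The readout step in Figure 18.10 can be sketched as a simple classification of background-corrected spot intensities, and a two-sample comparison as a log2 ratio of the two dye signals. This is only an illustration: the thresholds, gene names, and intensity values below are invented, and real microarray software applies more careful normalization and statistics.

```python
import math


def classify_spot(intensity: float, background: float,
                  low: float = 100.0, high: float = 1000.0) -> str:
    """Map one spot's fluorescence to a coarse expression call.

    `low` and `high` are hypothetical cutoffs chosen for this example.
    """
    signal = max(intensity - background, 0.0)  # subtract local background
    if signal < low:
        return "not expressed"
    if signal < high:
        return "low expression"
    return "high expression"


def log2_ratio(sample_a: float, sample_b: float) -> float:
    """Relative expression between two labeled samples hybridized to one spot."""
    if sample_a <= 0 or sample_b <= 0:
        raise ValueError("signals must be positive")
    return math.log2(sample_a / sample_b)


# Made-up scanner intensities for three spots
spots = {"geneA": 5200.0, "geneB": 480.0, "geneC": 95.0}
background = 50.0
calls = {gene: classify_spot(value, background) for gene, value in spots.items()}
print(calls)

# A spot four times brighter in sample A than in sample B has a log2 ratio of 2
print(log2_ratio(800.0, 200.0))
```

A log ratio is used here because it treats a fourfold increase and a fourfold decrease symmetrically (+2 and -2).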


labeled nucleotides. Microarray studies often involve comparing gene expression in different cell or tissue samples. In this case, cDNA prepared from one tissue is usually labeled with one color dye, red for example, and cDNA from another tissue is labeled with a different-colored dye, such as green. Labeled cDNAs are then denatured and incubated overnight with the microarray so that they will hybridize to spots on the microarray that contain complementary DNA sequences. Next, the microarray is washed to remove excess cDNA, and then scanned by a laser, which causes the cDNA hybridized to the microarray to fluoresce. The patterns of fluorescent spots reveal which genes are expressed in the tissue of interest, and the intensity of spot fluorescence indicates the relative level of expression. The brighter the spot, the more the particular mRNA is expressed in that tissue.

DNA microarrays have dramatically changed the way gene-expression patterns are analyzed. As discussed earlier (in Chapter 17), Northern blot analysis was one of the earliest methods used for analyzing gene expression. Then PCR techniques proved to be more rapid and have increased sensitivity. The biggest advantage of DNA microarrays is that they enable thousands of genes to be studied simultaneously. As a result, however, they can generate an overwhelming amount of gene-expression data. Over 1 million human gene-expression datasets are available in publicly accessible databases, and commercially available DNA microarrays for analyzing human gene expression have been widely used. But one limitation of DNA microarrays is that they can often yield variable results. For example, one experiment under certain conditions may not always yield similar patterns of gene expression when the experiment is repeated. Some of this variability can be due to real differences in gene expression, but some can be the result of variation in chip preparation, cDNA synthesis, probe hybridization, or washing conditions, all of which must be carefully controlled to limit such variability. Commercially available DNA microarrays can reduce the variability that can result when individual researchers make their own arrays.

RNA Sequencing Technology Allows for In Situ Analysis of Gene Expression

As we have discussed, the significant value of gene expression microarrays has been the ability to quantify RNA expression for large numbers of genes simultaneously. But another limitation of microarrays is that the investigator is limited to studying the expression of only those genes with probes on the chip. However, in recent years, significant progress has been made on direct RNA sequencing (RNA-seq), also called whole-transcriptome shotgun sequencing, a modern approach that will likely render DNA microarrays obsolete in the future.

RNA-seq not only allows for quantitative analysis of all RNAs expressed in a particular tissue, but it also provides actual sequence data. And RNA-seq can also be carried out inside the cell (in situ). For this application, the cell itself can serve as a “gene chip.”

Generally, most RNA-seq methods incorporate reverse-transcribing RNA in situ, sequencing cDNA, mapping sequences to a reference genome (to determine which sequences were transcribed from specific genes), and quantifying gene expression. Methods incorporating fluorescence in situ RNA-seq are enabling scientists also to visualize where specific RNAs are being transcribed in intact cells and tissues.

It is now also becoming possible to carry out DNA sequencing and RNA-seq on individual cells! Several groups are taking an integrated approach to sequence genomes and transcriptomes in the same cell. This strategy enables the correlation of genetic variability and mRNA expression variability simultaneously in single cells, thus analyzing both the genome and the transcriptome.

We will consider applications of RNA-seq later in the text (see Special Topics Chapter 2: Genetic Testing), but clearly this approach has already demonstrated great value for disease diagnosis, including understanding how genome variations such as CNVs impact transcriptome expression. Now that we have discussed genomes and transcriptomes, we turn our attention to the ultimate end products of most genes, the proteins encoded by a genome.

ESSENTIAL POINT
Transcriptome analysis examines expression patterns for thousands of genes simultaneously. ■

18.8 Proteomics Identifies and Analyzes the Protein Composition of Cells

As genomes have been sequenced and studied, biologists have focused increasingly on understanding the complex structures, functions, and interactions of the proteins that genomes encode: the proteome, defined as the complete set of proteins encoded by a given genome. Proteomics, then, is the identification, characterization, and quantitative analysis of proteomes.

Proteomics provides information about many things:
■■ A protein’s structure and function
■■ Posttranslational modifications
■■ Protein–protein, protein–nucleic acid, and protein–metabolite interactions
■■ Cellular localization of proteins


■■ Protein stability and aspects of translational and posttranslational levels of gene-expression regulation
■■ Relationships (shared domains, evolutionary history) to other proteins

Proteomics projects have been used to characterize major families of proteins for some species. For example, about two-thirds of the Drosophila proteome has been well cataloged using proteomics.

Proteomics is also of clinical value because it allows comparison of proteins in normal and diseased tissues, which can lead to the identification of proteins as biomarkers for disease conditions. Proteomic analysis of mitochondrial proteins during aging, proteomic maps of atherosclerotic plaques from human coronary arteries, and protein profiles in saliva as a way to detect and diagnose diseases are examples of such work.

Reconciling the Number of Genes and the Number of Proteins Expressed by a Cell or Tissue

While Beadle and Tatum’s one-gene:one-enzyme hypothesis was a worthy proposal in the 1940s (see Chapter 13), genomics has revealed that the link between gene and gene product is often much more complex. Genes can have multiple transcription start sites that produce several different types of RNA transcripts. Alternative splicing and editing of pre-mRNA molecules can generate dozens of different proteins from a single gene. As a result, proteomes are substantially larger than genomes. Sequencing of mRNAs from human tissues found that over 95 percent of protein-coding genes with more than one exon are alternatively spliced.

However, it is unclear how many different proteins are translated from this pool of transcripts. To address this, the Human Proteome Map (HPM), published in 2014, aimed to catalog the human proteome in all its complexity. This project involved proteomic analysis of a wide range of human tissues and cell types using methods we will discuss in the next section. Based on the results of this study, we now know that the 20,000 protein-coding genes in the human genome can produce at least 290,000 different proteins. The HPM accounted for 85% of all annotated protein-coding genes in humans that currently exist in human proteomics databases. Refer to PDQ 20 to access the online database for the HPM.

The specific protein content (or profile) of a cell is determined in large part by its gene-expression patterns, that is, its transcriptome. However, a number of other factors affect the proteome profile of a cell. To begin with, many proteins undergo co-translational or posttranslational modifications, such as cleavage of a signal sequence that targets a protein for an organelle pathway, a propeptide, or initiator methionine residues; linkage to carbohydrates and lipids; or the addition of chemical groups through methylation, acetylation, and phosphorylation; and other modifications. Over a hundred different mechanisms of posttranslational modification are known.

In addition, many proteins work via elaborate protein–protein interactions or as part of a large macromolecular complex. Furthermore, although every cell in the body contains an equivalent set of genes, not all cells express the same genes and, hence, the same proteins. Proteomics also considers proteins that a cell might acquire from another cell, not just the proteins encoded by the genome of the cell type being analyzed.

The early history of proteomics dates back to 1975 and the development of two-dimensional gel electrophoresis (2DGE), a technique for separating hundreds to thousands of proteins with high resolution. In this technique, proteins isolated from cells or tissues of interest are first loaded onto a polyacrylamide tube gel and separated by isoelectric focusing, which causes proteins to migrate based on their electrical charge in a pH gradient. In an electrical field, proteins will migrate through an established pH gradient until they reach the pH at which their net charge is zero (with equal numbers of positively-charged and negatively-charged side chains). Next, a second migration, perpendicular to the first, is performed, in which the proteins are further separated by their molecular weight using sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE) (Figure 18.11).

It is not uncommon for a 2D gel loaded with a complex mixture of proteins to show several thousand spots, as in Figure 18.11, which displays the complex mixture of proteins in human platelets (thrombocytes). Particularly abundant protein spots in this gel have been labeled with the names of identified proteins. With thousands of different spots on the gel, how are the identities of the proteins ascertained?

In some cases, 2D gel patterns from experimental samples can be compared to gels run with reference standards containing known proteins with well-characterized migration patterns. Reference gels for different biological samples such as human plasma are available, and computer software programs can be used to align and compare the spots from different gels. In the early days of 2DGE, proteins were often identified by cutting spots out of a gel and sequencing the amino acids the spots contained. Only relatively short sequences of amino acids can typically be generated this way; rarely can an entire polypeptide be sequenced using this technique. BLAST and similar programs can be used to search protein databases containing amino acid sequences of known proteins. However, because of alternative splicing or posttranslational modifications, peptide sequences may not always match easily with the final product, and the identity of the protein may have to be confirmed by another approach.


[Figure 18.11 panels. 1st Dimension: Load protein sample onto an isoelectric focusing tube gel (pH 4.0 at the + end, pH 10.0 at the − end). Electrophoresis separates proteins according to their isoelectric point, where their net charge is zero compared to the pH of the gel. 2nd Dimension: Rotate tube gel 90° and place onto an SDS-polyacrylamide gel (SDS-PAGE). Electrophoresis separates proteins according to molecular weight. Stained gel shows proteins as a series of spots separated horizontally by isoelectric point (x-axis: pH 4.0 to 8.0) and vertically by molecular weight (y-axis: 10 to 200 kDa). Labeled spots include α-actinin, vinculin, gelsolin, pyruvate kinase, transferrin, albumin, calreticulin, HSP 60, ER-60, actin, LDH, tropomyosin, triose phosphate isomerase, thioredoxin, and ubiquitin.]

FIGURE 18.11 Two-dimensional gel electrophoresis (2DGE) uses the different biochemical properties of proteins to separate proteins in a complex mixture. The 2D gel photo shows separations of human platelet proteins. Each spot represents a different polypeptide separated by isoelectric point, pH (x-axis), and molecular weight (y-axis). In this photo, some protein spots have been identified by name based on comparison to a reference gel or by determination of a protein sequence using mass spectrometry. Notice that many spots on the gel are unlabeled, indicating proteins of unknown identity.
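The first dimension of 2DGE depends on each protein having a pH at which its net charge is zero. A minimal sketch of that idea, assuming a simple Henderson-Hasselbalch model and approximate textbook-style pKa values (the pKa table and the function names `net_charge` and `isoelectric_point` are illustrative assumptions, not part of the original text):

```python
# Approximate pKa values (assumed for illustration; real values shift with
# sequence context and differ between published tables).
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0, "N_TERM": 9.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "C_TERM": 2.0}


def net_charge(sequence, ph):
    """Henderson-Hasselbalch estimate of a peptide's net charge at a given pH."""
    charge = 0.0
    # Basic groups carry +1 while protonated.
    for group, pka in PKA_POSITIVE.items():
        count = 1 if group == "N_TERM" else sequence.count(group)
        charge += count / (1.0 + 10 ** (ph - pka))
    # Acidic groups carry -1 once deprotonated.
    for group, pka in PKA_NEGATIVE.items():
        count = 1 if group == "C_TERM" else sequence.count(group)
        charge -= count / (1.0 + 10 ** (pka - ph))
    return charge


def isoelectric_point(sequence, lo=0.0, hi=14.0):
    """Bisect for the pH at which the estimated net charge is zero."""
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if net_charge(sequence, mid) > 0:
            lo = mid  # still positive, so the pI lies at a higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0


# A lysine-rich peptide focuses at high pH, an aspartate-rich one at low pH.
print(round(isoelectric_point("KKKK"), 2), round(isoelectric_point("DDDD"), 2))
```

Under this model an acidic peptide stops migrating near the low-pH end of the tube gel and a basic peptide near the high-pH end, which is the horizontal separation seen in the stained gel.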

Mass Spectrometry for Protein Identification

As important as 2DGE has been for protein analysis, mass spectrometry (MS) has been instrumental to the development of proteomics. Mass spectrometry techniques analyze ionized samples in gaseous form and measure the mass-to-charge (m/z) ratio of the different ions in a sample. Proteins analyzed by mass spectrometry generate characteristic m/z spectra. Their identities can be determined by correlating these spectra with an m/z database that contains known protein sequences. Certain forms of MS can provide peptide sequences directly from spectra. Some of the most valuable proteomics applications of this technology are to identify an unknown protein or proteins in a complex mix of proteins, to sequence peptides, to identify posttranslational modifications of proteins, and to characterize multiprotein complexes. Many other


NOW SOLVE THIS

18.2 Annotation of a proteome attempts to relate each protein to a function in time and space. Traditionally, protein annotation depended on an amino acid sequence comparison between a query protein and a protein with known function. If the two proteins shared a considerable portion of their sequence, the query would be assumed to share the function of the annotated protein. Following is a representation of this method of protein annotation involving a query sequence and three different human proteins. Note that the query sequence aligns to common domains within the three other proteins. What argument might you present to suggest that the function of the query is not related to the function of the other three proteins?

[Diagram: query amino acid sequence aligned against three proteins; shaded regions mark the region of amino acid sequence match to the query.]

HINT: This problem asks you to think about sequence similarities between four proteins and predict functional relationships. The key to its solution is to remember that although protein domains may have related functions, proteins can contain several different interacting domains that determine protein function.

[Figure 18.12 panels. An unknown protein cut out from a spot on a 2D gel is first digested into small peptide fragments using a protease such as trypsin. Subject peptide fragments to mass spectrometry to produce mass-to-charge (m/z) spectra. Compare m/z spectra for the unknown protein to a proteomics database of m/z spectra for known peptides. A spectrum match would identify the peptide sequence of the unknown protein. Spectrum axes: m/z (200 to 1000) on the x-axis and relative abundance on the y-axis; fragment-ion peaks are labeled a2, b2, and y3 through y9, and the matched peptide reads S Q A A E L L.]

FIGURE 18.12 Mass spectrometry for identifying an unknown protein isolated from a 2D gel. The peptide in this example was revealed to have the amino acid sequence serine (S)-glutamine (Q)-alanine (A)-alanine (A)-glutamic acid (E)-leucine (L)-leucine (L), shown in single-letter amino acid code.

biochemical methods can be used together with MS, and new MS techniques do not involve running gels. Here we simply provide an introduction to MS.

One common MS approach is matrix-assisted laser desorption ionization (MALDI). This approach is ideally suited for identifying proteins and is widely used for proteomic analysis of tissue samples. In brief, MALDI employs an ultraviolet laser to heat, vaporize, and ionize peptide fragments. Released ions are then analyzed for mass; MALDI displays the m/z ratio of each ionized peptide as a series of peaks representative of the molecular masses of peptides in the mixture and their relative abundance (Figure 18.12). Because different proteins produce different sets of peptide fragments, MALDI produces a peptide “fingerprint” that is characteristic of the protein being analyzed.

For many MS approaches including MALDI, proteins are first extracted from cells or tissues of interest and usually separated by 2DGE. Protein spots are cut out of the gel, and proteins are purified out of each gel spot. Computer-automated high-throughput instruments are available that can pick all the spots out of a 2D gel, after which MALDI is used to identify the proteins in the different spots. Just about any source that provides a sufficient number of cells can be used: blood, whole tissues and organs, tumor samples, microbes, and many other substances. Many proteins involved in cancer have been identified by the use of MALDI to compare protein profiles in normal tissue and tumor samples.

The use of MALDI-generated m/z spectra databases to identify unknown samples is limited by database quality. An unknown protein from a 2D gel can be identified by MALDI only if proteomics databases have a MALDI spectrum for that protein. But as is occurring with genomics databases, proteomics databases with thousands of well-characterized proteins from different organisms are rapidly developing. The human proteome database described earlier in this section was developed from MS data.

ESSENTIAL POINT
Proteomics methods such as mass spectrometry are valuable for analyzing proteomes, the protein content of a cell. ■


NOW SOLVE THIS

18.3 Because of its accessibility and biological significance, the proteome of human plasma has been intensively studied and used to provide biomarkers for such conditions as myocardial infarction (troponin) and congestive heart failure (B-type natriuretic peptide). Polanski and Anderson compiled a list of 1261 proteins, some occurring in plasma, that appear to be differentially expressed in human cancers [Polanski, M., and Anderson, N. L. (2006). Biomarker Insights 2:1–48]. Of these, only 9 have been recognized by the FDA as tumor-associated proteins. First, what advantage should there be in using plasma as a diagnostic screen for cancer? Second, what criteria should be used to validate that a cancerous state can be assessed through the plasma proteome?

HINT: This problem asks you to consider criteria that are valuable for using plasma proteomics as a diagnostic screen for cancer. The key to its solution is to consider proteomics data that you would want to evaluate to determine whether a particular protein is involved in cancer.

18.9 Synthetic Genomes and the Emergence of Synthetic Biology

Many years ago, studying genomes led to a fundamental question: “What is the minimum number of genes necessary to support life?” Today, scientists are even more interested in this question because the answer is expected to set the groundwork to create artificial cells or designer organisms based on genes encoded by a synthetic or artificial genome constructed in the laboratory. To do so we need a better understanding of the minimum number of genes, often referred to as the “core genes,” required to support life.

To help advance synthetic genome work, we can use the small genomes of obligate parasites. For example, the bacterium Mycoplasma genitalium, a human parasitic pathogen, is among the simplest self-replicating bacteria known and has served as a model for understanding the minimal elements of a genome necessary for a self-replicating cell. M. genitalium has a genome of 580 kb, has 525 genes, and is one of the smallest bacterial genomes known.

In contrast, the 1.8-Mb genome of Haemophilus influenzae (the first bacterial genome sequenced) has 1815 genes. By comparing the nucleotide sequences of the M. genitalium genes with the H. influenzae genes, scientists from the J. Craig Venter Institute (JCVI) identified 256 genes whose sequence was similar enough to consider that they arose from a common ancestral gene; that is, they are orthologous. Thus, comparative genomics estimated that at least 256 genes might represent the minimum gene set essential for life. But could the number be more or less than 256? To answer this question, Venter and colleagues applied an experimental approach. They used transposon-based methods to selectively mutate each gene in M. genitalium using the following rationale. Mutations in genes that produce a lethal phenotype indicate that the genes are essential, but nonessential mutated genes would not be lethal. They found that of the 525 genes in M. genitalium, about 375 genes were essential, thus constituting the minimum gene set in this bacterium.

In 2010, the JCVI published the first report of a functional synthetic genome. In this approach they designed and chemically synthesized more than one thousand 1080-bp segments called cassettes covering the entire 1.08-Mb Mycoplasma mycoides genome (Figure 18.13).

[Figure 18.13 flowchart: Design of M. mycoides genome → Chemical synthesis of 1078 1080-bp oligonucleotide cassettes spanning the entire 1.08-Mb M. mycoides genome → Cloning of cassettes in E. coli → Complete genome assembly in S. cerevisiae → Genome transplantation to M. capricolum.]

FIGURE 18.13 Building a synthetic version of the 1.08-Mb Mycoplasma mycoides genome.
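The cassette numbers in Figure 18.13 can be checked with simple arithmetic. Laid end to end without overlap, 1078 cassettes of 1080 bp would total about 1.16 Mb, more than the 1.08-Mb genome, so neighboring cassettes must share sequence. The sketch below assumes an 80-bp overlap between adjacent cassettes and treats the genome as linear; both are assumptions made for illustration and are not stated in the text above.

```python
def tiled_span(n_cassettes, cassette_len, overlap):
    """Length covered by cassettes laid end to end with a fixed overlap."""
    return n_cassettes * cassette_len - (n_cassettes - 1) * overlap


def cassettes_needed(target_len, cassette_len, overlap):
    """Fewest overlapping cassettes that cover a linear target of target_len bp."""
    step = cassette_len - overlap  # new sequence contributed by each extra cassette
    return -(-(target_len - overlap) // step)  # ceiling division


ASSUMED_OVERLAP = 80  # bp; an assumption, not given in the text

span = tiled_span(1078, 1080, ASSUMED_OVERLAP)
print(span)                                          # total bp covered
print(cassettes_needed(span, 1080, ASSUMED_OVERLAP))  # cassettes for that span
```

With these assumptions the 1078 cassettes cover roughly 1.08 Mb, which is consistent with the genome size quoted in the figure.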


A homologous recombination technique was used to assemble the cassettes into 11 separate 100-kb assemblies that were eventually combined to completely span the entire 1.08-Mb M. mycoides genome.

The entire assembled genome, called JCVI-syn1.0, was then transplanted into a close relative, M. capricolum, as recipient cells. Genome transplantation is effectively the true test of the functionality of a synthetic genome, and an essential outcome is the complete phenotypic transformation of the recipient cells by the synthetic genome. Transplantation resulted in cells with the JCVI-syn1.0 genotype and the phenotype of a new strain of M. mycoides. Transformation of M. capricolum into JCVI-syn1.0 M. mycoides was verified, in part, because these cells were shown to express the lacZ gene, which was only present in the synthetic genome. Selection for tetracycline resistance and a determination that recipient cells also made proteins characteristic of M. mycoides, and not M. capricolum, were also used to verify strain conversion.

The synthetic genome effectively rebooted the M. capricolum recipient cells to change them from one form to another. When this work was announced, J. Craig Venter claimed, “This is equivalent to changing a Macintosh computer into a PC by inserting a new piece of software.” This was tedious work, spanning over 15 years, and ninety-nine percent of the experiments involved failed! Keep in mind also that these experiments did not create life from an inanimate object, since they were based on converting one living strain into another. But clearly these studies provided key “proof of concept” that synthetic genomes could be produced, assembled, and successfully transplanted to create a microbial strain encoded by a synthetic genome. Thus this research brought scientists closer to producing novel synthetic genomes incorporating genes for specific traits of interest. Yet this work still did not define the minimal genome and answer the question of how simple the genome can be.

In 2016, JCVI announced that 473 genes is the minimal bacterial genome. To make this determination, they created a synthetic version of the M. mycoides genome that was about half the size of JCVI-syn1.0 discussed earlier. This synthetic 531-kb genome, called JCVI-syn3.0, contained 473 genes, encoding 438 proteins and 35 RNAs. About one-third (149 genes) of the genes in JCVI-syn3.0 that are essential for life have no known function. And several of these 149 genes are present in other organisms including humans. Investigating their roles is of significant interest to the JCVI team.

In the future, gene-editing approaches such as CRISPR-Cas are expected to make it easier to alter genomes to address the minimal genome question. Several teams have already applied CRISPR-based functional analysis to bacteria such as Bacillus subtilis. Using CRISPR to knock out bacterial genes, geneticists can then screen for phenotypic changes as a way to identify essential genes in bacteria.

The Quest to Create a Synthetic Human Genome

The JCVI synthetic bacterial genome projects inspired researchers to address questions about the minimal genome and the identification of essential genes in more advanced eukaryotes, including yeast and humans. Using a single-celled yeast, Saccharomyces cerevisiae, scientists created synthetic versions of six chromosomes, which constituted a little more than one-third of the yeast genome. This work demonstrated that creating eukaryotic synthetic genomes is possible, but geneticists interested in the genes essential for a functional self-replicating human cell did not have the necessary tools to make significant progress experimentally. Thus, when the Human Genome Project was completed, creating a complete synthetic human genome was considered technically impossible.

In 2016 a group of scientists proposed an initiative, called the Human Genome Project-Write (HGP-Write), to synthesize an entire human genome. Many have questioned the objectives for synthesizing a human genome and the ethical implications of doing so. With an estimated 10-year timeline and a cost of approximately $100 million, organizers hope HGP-Write will lead to technological advances in DNA synthesis while lowering costs, in the same way that the Human Genome Project resulted in advances in DNA sequencing and dramatic cost reductions. But by 2018, the project was scaled back. Rather than synthesizing a 3 billion bp human genome, the project will instead focus on recoding a genome to create cells that are resistant to infection by viruses.

Synthetic Biology for Bioengineering Applications

Venter’s work with M. mycoides JCVI-syn1.0, a decade-long project that cost about $40 million, was hailed as a defining moment in the emerging field of synthetic biology, a discipline that applies engineering design principles to biological systems. What are other potential applications of synthetic genomes and synthetic biology? One of JCVI’s goals is to create microorganisms that can be used to synthesize biofuels. Other possibilities include creating synthetic microbes with genomes engineered to (1) express gene products to degrade pollutants (bioremediation); (2) synthesize new biopharmaceutical products; (3) synthesize chemicals and fuels from sunlight and carbon dioxide; and (4) produce “semisynthetic” crops that contain synthetic chromosomes encoding genes for beneficial traits such as drought resistance or improved photosynthetic efficiency.

ESSENTIAL POINT
Synthetic genomes and synthetic biology offer the potential for geneticists to create engineered cells with novel characteristics that may have commercial value. ■


GENETICS, ETHICS, AND SOCIETY

Privacy and Anonymity in the Era of Genomic Big Data

Our lives are surrounded by Big Data. Enormous quantities of personal information are stored on private and public databases, revealing our purchasing preferences, search engine histories, social contacts, and even GPS locations.

Perhaps the most personal of all Big Data entries are genome sequences. Tens of thousands of individuals are now donating DNA for whole-genome sequencing, by both private gene-sequencing companies and public research projects. Most people who donate their DNA for sequence analysis do so with little concern. After all, what consequences could possibly come from access to gigabytes of As, Cs, Ts, and Gs? Surprisingly, the answer is: quite a lot.

One of the first inklings of genetic privacy problems arose in 2005, when a 15-year-old boy named Ryan Kramer tracked down his anonymous sperm-donor father using his own Y chromosome sequence data and the Internet. Ryan submitted a DNA sample to a genealogy company that generates Y chromosome profiles and puts people into contact with others who share similar genetic profiles, indicating relatedness. The search matched Ryan with two men with the same last name. Ryan combined the name information with the only information that he had about his sperm-donor father: date of birth, birth place, and college degree. Using an Internet search, he obtained the names of everyone born on that date in that place. One man with the same last name as his two Y chromosome matches also had the appropriate college degree. Ryan then contacted his sperm-donor father.

More recently, several published reports reveal the ease with which anyone’s identity can be traced using DNA-sequence profiles and Internet searches. These searches can reveal people’s identities and disease susceptibilities.

To many people, the implications of “genomic re-identification” are disturbing. Genomic information leaks could reveal personal medical information, physical appearance, and racial origins. They could also be used to synthesize DNA to plant at a crime scene or could be used in unforeseen ways in the future as we gain more information about what resides in our genome.

Your Turn

Take time to consider the following questions concerning the ethical challenges of ensuring genetic privacy.

1. What are some of the ethical arguments for and against maintaining genetic privacy and anonymity?
For more information, see Hansson, M. G., et al. (2016). The risk of re-identification versus the need to identify individuals in rare disease research. Eur. J. Hum. Genet. 24:1553–1558.

2. Would you send a DNA sample to a private company for whole-genome sequencing? If not, why not? If so, what privacy assurances would you need before ordering your genome sequence?
This topic is discussed in Niemiec, E., and Howard, H. C. (2016). Ethical issues in consumer genome sequencing: Use of consumers’ samples and data. Appl. Transl. Genom. 8:23–30.

EXPLORING GENOMICS

Contigs, Shotgun Sequencing, and Comparative Genomics

Mastering Genetics: Visit the Study Area: Exploring Genomics

In this chapter, we discussed how WGS can be used to assemble chromosome maps. Recall that in this technique chromosomal DNA is digested with different restriction enzymes (or mechanically sheared) to create a series of overlapping DNA fragments called contiguous sequences, or “contigs.” The contigs are then subjected to DNA sequencing, after which bioinformatics-based programs are used to arrange the contigs in their correct order on the basis of short overlapping sequences of nucleotides.

In this Exploring Genomics exercise you will carry out a simulation of contig alignment to help you understand the underlying logic of this approach to creating sequence maps of a chromosome. For this purpose, you will use the National Center for Biotechnology Information BLAST site and apply a DNA alignment program.

Exercise I – Arranging Contigs to Create a Chromosome Map

1. Access BLAST from the NCBI Web site at https://blast.ncbi.nlm.nih.gov/Blast.cgi. Locate and select the “Global Align” category under “Specialized searches.” This feature allows you to compare two DNA sequences at a time to check for sequence similarity alignments.

2. Go to the Study Area, and open the Exploring Genomics exercise for this chapter. Listed are eight contig sequences, called Sequences A through H, taken from an actual human chromosome sequence deposited in GenBank. For this exercise we have used short fragments; however, in reality, contigs are usually several thousand base pairs long. To complete this exercise, copy one sequence into the “Enter Query Sequence” box and one sequence into the “Enter Subject Sequence” box and then run an alignment (by clicking on “Align”). Repeat these steps with other combinations of two sequences to determine which sequences overlap, and then use your findings to create a sequence map that places overlapping contigs in their proper order. Here are a few tips to consider:
■■ Develop a strategy to be sure that you analyze alignments for all pairs of contigs.
■■ Only consider alignment overlaps that show 100 percent sequence identity.

3. On the basis of your alignment results, answer the following questions, referring to the sequences by their letter codes (A through H):
a. What is the correct order of overlapping contigs?
b. What is the length, measured in number of nucleotides, of each sequence overlap between contigs?
c. What is the total size of the chromosome segment that you assembled?
d. Did you find any contigs that do not overlap with any of the others? Explain.

4. Run a nucleotide-nucleotide BLAST search with any one of the overlapping contigs to determine which chromosome these contigs were taken from, and report your answer.
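The logic of the exercise (test every pair, accept only 100 percent identical overlaps, then chain the contigs) can be sketched as follows. This is a toy illustration with made-up sequences, not the Study Area data and not how BLAST works internally; it only checks exact suffix-to-prefix matches on one strand, and the function names and `min_len` threshold are assumptions.

```python
def overlap_length(left, right, min_len=3):
    """Longest suffix of `left` that exactly matches a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def order_contigs(contigs, min_len=3):
    """Chain contigs by exact overlaps.

    Returns [(name, overlap_with_previous), ...]; contigs that overlap
    nothing are left out of the chain.
    """
    successor = {}
    for a, seq_a in contigs.items():
        # Best right-hand neighbor for contig a, if any.
        candidates = [
            (overlap_length(seq_a, seq_b, min_len), b)
            for b, seq_b in contigs.items() if b != a
        ]
        best_k, best_b = max(candidates, default=(0, None))
        if best_k > 0:
            successor[a] = (best_b, best_k)
    has_predecessor = {b for b, _ in successor.values()}
    starts = [a for a in successor if a not in has_predecessor]
    if not starts:
        return []
    chain, seen = [(starts[0], 0)], {starts[0]}
    while chain[-1][0] in successor:
        nxt, k = successor[chain[-1][0]]
        if nxt in seen:  # guard against circular chains
            break
        chain.append((nxt, k))
        seen.add(nxt)
    return chain


def assembled_length(contigs, chain):
    """Total length of the merged segment, counting each overlap once."""
    return sum(len(contigs[name]) - k for name, k in chain)


# Made-up contigs: A, B, and C overlap in that order; D overlaps nothing.
TOY_CONTIGS = {
    "A": "ATGCGTACGTTAG",
    "B": "GTTAGCCATGGA",
    "C": "ATGGATTTCAGC",
    "D": "CCCCCCCCCC",
}
chain = order_contigs(TOY_CONTIGS)
print(chain, assembled_length(TOY_CONTIGS, chain))
```

The returned chain answers questions 3a and 3b for the toy data, `assembled_length` answers 3c, and any contig missing from the chain corresponds to question 3d.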

CASE STUDY Your microbiome may be a risk factor for disease

A number of genes involved in susceptibility to inflammatory bowel disorders (IBDs), including Crohn disease and ulcerative colitis, have been identified. However, it is clear that other nongenetic risk factors often trigger the onset of these diseases. As noted in Section 18.6, the Human Microbiome Project has provided valuable insights about the roles of the gut microbiome and its impact on intestinal disorders, including IBD. It is known that the gut microbiome of those with IBD is different from those in remission, and it is also different from individuals who do not have IBD. These observations suggest that transfer of gut microbiota from healthy individuals via fecal microbiota transplantation (FMT) might be a successful treatment for IBD. This idea is supported by the use of FMT for a potentially life-threatening form of colitis caused by the bacterium Clostridium difficile. After successful clinical trials, the U.S. Food and Drug Administration (FDA) has classified FMT as an investigational new drug. However, until it is formally approved, FMT can only be used to treat C. difficile infections that are resistant to antibiotic therapy.

1. Suppose you had Crohn disease or ulcerative colitis and wanted to undertake FMT. Before undergoing the treatment, what genetic analyses might you consider to inform yourself about human genes, microbial genes, or the constitution of your gut microbiota and their correlation to or roles in Crohn disease or ulcerative colitis?

2. The use of FMT, whether in a physician’s office or at home, raises a number of ethical issues. What might they be, and which of them would concern you the most?

3. Several Internet sources offer screened donor fecal samples for use in FMT. What risks would you assume in undertaking this therapy at home using these samples? If you are willing to use this therapy on yourself, would you use it on one of your children?

For related reading, see Daloiso, V., et al. (2015). Ethical aspects of fecal microbiota transplantation (FMT). Eur. Rev. Med. Pharmacol. Sci. 19(17):3173–3180.

INSIGHTS AND SOLUTIONS

1. One of the main problems in annotation is deciding how long a putative ORF must be before it is accepted as a gene. Shown on p. 519 are three different ORF scans of the same E. coli genome region, the region containing the lacY gene. Regions shaded in brown indicate ORFs. The top scan was set to accept ORFs of 50 nucleotides as genes. The middle and bottom scans accepted ORFs of 100 and 300 nucleotides as genes, respectively. How many putative genes are detected in each scan? The longest ORF covers 1254 bp; the next longest, 234 bp; and the shortest, 54 bp. How can we decide the actual number of genes in this region? In this type of ORF scan, is it more likely that the number of


genes in the genome will be overestimated or underestimated? Why?

[Figure: three ORF scans of the same region, with minimum ORF lengths of 50, 100, and 300 nucleotides. Each scan shows the sequenced strand and the complementary strand.]

Solution: Generally, one can examine conserved sequences in other organisms to indicate that an ORF is likely a coding region. One can also match a sequence to previously described sequences that are known to code for proteins. The problem of deciding which ORF is actually a gene is not easily solved. The shorter the minimum ORF length a scan accepts, the more likely it is to overestimate the number of genes, because short ORFs often occur by chance, whereas ORFs longer than 200 nucleotides are less likely to do so. For these scans, notice that the 50-bp scan produces the highest number of possible genes, whereas the 300-bp scan produces the lowest number (1) of possible genes.
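The effect of the minimum-length setting can be illustrated with a small sketch. The code below is not the software used to produce the scans in the figure; it is a naive ATG-to-stop scan over both strands, and the toy sequence and the thresholds (12, 30, and 100 nucleotides) are assumptions chosen only to keep the example short.

```python
# Illustrative sketch only: a naive ATG-to-stop ORF scan on both strands.
# The toy sequence and the thresholds (12, 30, 100) are made up for this example.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len):
    """List (strand, frame, start, length) for ORFs of at least min_len nucleotides."""
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # index of the first ATG of the ORF currently open in this frame
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start  # length includes the stop codon
                    if length >= min_len:
                        orfs.append((strand, frame, start, length))
                    start = None
    return orfs


long_orf = "ATG" + "GCT" * 20 + "TAA"   # 66 nucleotides
short_orf = "ATG" + "GCT" * 3 + "TGA"   # 15 nucleotides
toy = long_orf + "C" + short_orf

for threshold in (12, 30, 100):
    print(threshold, len(find_orfs(toy, threshold)))
```

On this toy sequence the scan reports two ORFs at a threshold of 12, one at 30, and none at 100. This mirrors the pattern in the solution: a permissive threshold accepts short ORFs that may be chance occurrences, while a strict one can miss real but short genes.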

2. Sequencing of the heterochromatic regions (repeat-rich sequences concentrated in centromeres and telomeres) of the Drosophila genome indicates that within 20.7 Mb, there are 297 protein-coding genes (Misra et al. (2002). Annotation of the Drosophila melanogaster euchromatic genome: a systematic review. Genome Biol. 3:research0083.1). Given that the euchromatic regions of the genome contain 13,379 protein-coding genes in 116.8 Mb, what general conclusion is apparent?

Solution: Gene density in euchromatic regions of the Drosophila genome is about one gene per 8730 base pairs (116.8 Mb/13,379), while gene density in heterochromatic regions is about one gene per 70,000 base pairs (20.7 Mb/297). Clearly, a given region of heterochromatin is much less likely to contain a gene than the same-sized region in euchromatin.
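The arithmetic behind this solution can be checked with a few lines of code. This is only a worked calculation using the figures quoted in the problem; the function name is a hypothetical helper introduced for illustration.

```python
def bases_per_gene(region_mb, gene_count):
    """Average number of base pairs per gene in a region given in megabases."""
    return region_mb * 1_000_000 / gene_count


# Figures quoted in the problem statement.
euchromatin = bases_per_gene(116.8, 13_379)
heterochromatin = bases_per_gene(20.7, 297)

# How many times sparser genes are in heterochromatin than in euchromatin.
ratio = heterochromatin / euchromatin

print(round(euchromatin), round(heterochromatin), round(ratio, 1))
```

With these inputs, genes in heterochromatin are roughly eight times sparser than in euchromatin, which is the basis for the conclusion stated in the solution.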

Problems and Discussion Questions

Mastering Genetics: Visit for instructor-assigned tutorials and problems.

1. HOW DO WE KNOW? In this chapter, we focused on the analysis of genomes, transcriptomes, and proteomes and considered important applications and findings from these endeavors. At the same time, we found many opportunities to consider the methods and reasoning by which much of this information was acquired. From the explanations given in the chapter, what answers would you propose to the following fundamental questions?
   (a) How do we know which contigs are part of the same chromosome?
   (b) How do we know if a genomic DNA sequence contains a protein-coding gene?
   (c) What evidence supports the concept that humans share substantial sequence similarities and gene functional similarities with model organisms?
   (d) How can proteomics identify differences between the number of protein-coding genes predicted for a genome and the number of proteins expressed by a genome?
   (e) How has the concept of a reference genome evolved to encompass a broader understanding of genomic variation in humans?
   (f) How have microarrays demonstrated that, although all cells of an organism have the same genome, some genes are expressed in almost all cells, whereas other genes show cell- and tissue-specific expression?
2. CONCEPT QUESTION Review the Chapter Concepts list on p. 347. All of these pertain to how genomics, bioinformatics, and proteomics approaches have changed how scientists study genes and proteins. Write a short essay that explains how recombinant DNA techniques were used to identify and study genes compared to how modern genomic techniques have revolutionized the cloning and analysis of genes.
3. What is functional genomics? How does it differ from comparative genomics?
4. Contrast WGS for gene identification to linkage map-based approaches that you learned about in Chapter 7.
5. What is bioinformatics, and why is this discipline essential for studying genomes? Provide two examples of bioinformatics applications.
6. Annotation involves identifying genes and gene-regulatory sequences in a genome. List and describe characteristics of a genome that are hallmarks for identifying genes in an unknown sequence. What characteristics would you look for in a bacterial genome? A eukaryotic genome?
7. How do high-throughput techniques such as computer-automated, next-generation sequencing, and mass spectrometry facilitate research in genomics and proteomics? Explain.
8. BLAST searches and related applications are essential for analyzing gene and protein sequences. Define BLAST, describe basic


features of this bioinformatics tool, and give an example of information provided by a BLAST search.
9. Describe three major goals of the Human Genome Project.
10. Describe the human genome in terms of genome size, the percentage of the genome that codes for proteins, how much is composed of repetitive sequences, and how many genes it contains. Describe two other features of the human genome.
11. Recall that when the HGP was completed, more than 40 percent of the genes identified had unknown functions. The PANTHER database provides access to comprehensive and current functional assignments for human genes (and genes from other species). Go to http://www.pantherdb.org/data/. In the frame on the left side of the screen locate the “Quick links” and use the “Whole genome function views” link to a view of a pie chart of current functional classes for human genes. Mouse over the pie chart to answer these questions. What percentage of human genes encode transcription factors? Cytoskeletal proteins? Transmembrane receptor regulatory/adaptor proteins?
12. The Human Genome Project has demonstrated that in humans of all races and nationalities approximately 99.9 percent of the genome sequence is the same, yet different individuals can be identified by DNA fingerprinting techniques. What is one primary variation in the human genome that can be used to distinguish different individuals? Briefly explain your answer.
13. Through the Human Genome Project (HGP), a relatively accurate human genome sequence was published from combined samples from multiple individuals. It serves as a reference for a haploid genome. How do results from personal genome projects (PGP) differ from those of the HGP?
14. Explain differences between whole-genome sequencing (WGS) and whole-exome sequencing (WES), and describe advantages and disadvantages of each approach for identifying disease-causing mutations in a genome. Which approach was used for the Human Genome Project?
15. Describe the significance of the Genome 10K project.
16. It can be said that modern biology is experiencing an “omics” revolution. What does this mean? Explain your answer.
17. Metagenomics studies generate very large amounts of sequence data. Provide examples of genetic insight that can be learned from metagenomics.
18. What are DNA microarrays? How are they used?
19. Annotation of the human genome sequence reveals a discrepancy between the number of protein-coding genes and the number of predicted proteins actually expressed by the genome. Proteomic analysis indicates that human cells are capable of synthesizing more than 100,000 different proteins and perhaps three times this number. What is the discrepancy, and how can it be reconciled?
20. In Section 18.8 we briefly discussed The Human Proteome Map (HPM). An interactive Web site for the HPM is available at http://www.humanproteomemap.org. Visit this site, and then answer the questions in parts (a) and (b) and complete part (c).
   (a) How many proteins were identified in this project?
   (b) How many fetal tissues were analyzed?
   (c) Use the “Query” tab and select the “Gene family” dropdown menu to do a search on the distribution of proteins encoded by a pathway of interest to you. Search in fetal tissues, adult tissues, or both.
21. Researchers in search of loci in the human genome that are likely to contribute to the constellation of factors leading to hypertension have compared candidate loci in humans and rats [Stoll, M., et al. (2000). New Target Regions for Human Hypertension via Comparative Genomics. Genome Res. 10:473–482]. Through this research, they identified 26 chromosomal regions that they consider likely to contain hypertension genes. How can comparative genomics aid in the identification of genes responsible for such a complex human disease? The researchers state that comparisons of rat and human candidate loci to those in the mouse may help validate their studies. Why might this be so?
22. Whole-exome sequencing (WES) is helping physicians diagnose a genetic condition that has defied diagnosis by traditional means. The implication here is that exons in the nuclear genome are sequenced in the hopes that, by comparison with the genomes of nonaffected individuals, a diagnosis might be revealed.
   (a) What are the strengths and weaknesses of this approach?
   (b) If you were ordering WES for a patient, would you also include an analysis of the patient’s mitochondrial genome?

19 The Genetics of Cancer

CHAPTER CONCEPTS

■■ Cancer is characterized by genetic defects in fundamental aspects of cellular function, including DNA repair, chromatin modification, cell-cycle regulation, apoptosis, and signal transduction.
■■ Most cancer-causing mutations occur in somatic cells; only about 5 to 10 percent of cancers have a hereditary component.
■■ Cancer cells in primary and secondary tumors are clonal, arising from one stem cell that accumulated several cancer-causing mutations.
■■ The development of cancer is a multistep process requiring multiple mutations and clonal expansions.
■■ Cancer cells show high levels of genomic instability, leading to the accumulation of multiple mutations, some in cancer-related genes.
■■ Alterations in epigenetic features such as DNA methylation and histone modifications play significant roles in the development of cancers.
■■ Mutations in proto-oncogenes and tumor-suppressor genes contribute to the development of cancers.
■■ Genetic changes that lead to cancer can be triggered by numerous natural and human-made carcinogens.

[Figure: Colored scanning electron micrograph of two prostate cancer cells in the final stages of cell division (cytokinesis). The cells are still joined by strands of cytoplasm.]

Cancer is the second most common cause of death in Western countries, exceeded only by heart disease. It strikes people of all ages, and one out of three people will experience a cancer diagnosis sometime in his or her lifetime. Each year, more than 14 million cases of cancer are diagnosed worldwide and more than 8 million people will die from the disease.

Over the last 40 years, scientists have discovered that cancer is a genetic disease at the somatic cell level, characterized by the presence of gene products derived from mutated or abnormally expressed genes. The combined effects of numerous abnormal gene products lead to the uncontrolled growth and spread of cancer cells. Although some mutated cancer genes may be inherited, most are created within somatic cells that then divide and form tumors. Completion of the Human Genome Project and numerous large-scale rapid DNA-sequencing studies have opened the door to a wealth of new information about the mutations that trigger a cell to become cancerous. This new understanding of cancer genetics is also leading to new gene-specific treatments, some of which are now in use or are entering clinical trials. Some scientists predict that gene-targeted therapies will replace chemotherapies within the next 25 years.

The goal of this chapter is to highlight our current understanding of the nature and causes of cancer. As we will see, cancer is a genetic disease that arises from the accumulation of mutations in genes controlling many basic aspects of cellular function. We will examine the relationship between genes and cancer and consider how mutations, chromosomal changes,


epigenetics, and environmental agents play roles in the development of cancer. Please note that some of the topics discussed in this chapter are explored in greater depth elsewhere in the text (see Chapter 16 and Special Topics Chapter 1: Epigenetics).

19.1 Cancer Is a Genetic Disease at the Level of Somatic Cells

Perhaps the most significant development in our understanding of the causes of cancer is the realization that cancer is a genetic disease. Genomic alterations that are found in cancer cells range from single-nucleotide substitutions to large-scale chromosome rearrangements, amplifications, and deletions (Figure 19.1). However, unlike other genetic diseases, cancer is caused by mutations that arise predominantly in somatic cells. Only about 5 to 10 percent of cancers are associated with germ-line mutations that increase a person’s susceptibility to certain types of cancer. Another important difference between cancers and other genetic diseases is that cancers rarely arise from a single mutation in a single gene. They arise instead from the accumulation of many mutations in many genes. The mutations that lead to cancer affect multiple cellular functions, including repair of DNA damage, cell division, apoptosis, cellular differentiation, migratory behavior, and cell–cell contact.

FIGURE 19.1 (a) Spectral karyotype of a normal cell. (b) Karyotype of a cancer cell showing translocations, deletions, and aneuploidy, which are characteristic features of cancer cells.

What Is Cancer?

Clinically, cancer defines a large number of complex diseases that behave differently depending on the cell types from which they originate and the types of genetic alterations that occur within each cancer type. Cancers vary in their ages of onset, growth rates, invasiveness, prognoses, and responsiveness to treatments. However, at the molecular level, all cancers exhibit common characteristics that unite them as a family.

All cancer cells share two fundamental properties: (1) abnormal cell growth and division (proliferation), and (2) defects in the normal restraints that keep cells from spreading and colonizing other parts of the body (metastasis). In normal cells, these functions are tightly controlled by genes that are expressed appropriately in time and place. In cancer cells, these genes are either mutated or are expressed inappropriately.

It is this combination of uncontrolled cell proliferation and metastatic spread that makes cancer cells dangerous. When a cell simply loses genetic control over cell growth, it may grow into a multicellular mass, a benign tumor. Such a tumor can often be removed by surgery and may cause no serious harm. However, if cells in the tumor also have the ability to break loose, enter the bloodstream, invade other tissues, and form secondary tumors (metastases), they become malignant. Malignant tumors are often difficult to treat and may become life threatening. As we will see later in the chapter, there are multiple steps and genetic mutations that convert a benign tumor into a dangerous malignant tumor.

ESSENTIAL POINT
Cancer cells show two fundamental properties: abnormal cell proliferation and a propensity to spread and invade other parts of the body (metastasis). ■

The Clonal Origin of Cancer Cells

Although malignant tumors may contain billions of cells, and may invade and grow in numerous parts of the body, all cancer cells in the primary and secondary tumors are clonal, meaning that they originated from a common ancestral cell that accumulated specific cancer-causing mutations. This is an important concept in understanding the molecular causes of cancer and has implications for its diagnosis.


Numerous data support the concept of cancer clonality. For example, reciprocal chromosomal translocations are characteristic of many cancers, including leukemias and lymphomas (two cancers involving white blood cells). Cancer cells from patients with Burkitt lymphoma show reciprocal translocations between chromosome 8 (with translocation breakpoints at or near the c-myc gene) and chromosomes 2, 14, or 22 (with translocation breakpoints at or near one of the immunoglobulin genes). Each patient with Burkitt lymphoma exhibits unique breakpoints in his or her c-myc and immunoglobulin gene DNA sequences; however, all lymphoma cells within that patient contain identical translocation breakpoints. This demonstrates that cells in a tumor arise from a single cell, and this cell passes on its genetic aberrations to its progeny.

Although all cancer cells within a tumor are clonal, containing the same core set of cancer-causing genes that arose in the ancestral tumor cell, not all cells in a tumor are genetically identical throughout their entire genomes. Next-generation sequencing studies reveal that tumors are composed of subpopulations, or subclones, each of which contains its own sets of distinctive mutations. We will discuss the origins and implications of cancer subclones later in the chapter.

Driver Mutations and Passenger Mutations

Scientists are now applying some of the recent advances in DNA sequencing to identify all the somatic mutations within tumors. These studies compare the DNA sequences of genomes from cancer cells and normal cells derived from the same patient. Data from these studies are revealing that tens of thousands of somatic mutations can be present in cancer cells. Researchers believe that only a handful of mutations in each tumor, called driver mutations, give a growth advantage to a tumor cell. The remainder of the mutations may be acquired over time, perhaps as a result of the high levels of DNA damage that occur in cancer cells, but these mutations have no direct contribution to the cancer phenotype. These are known as passenger mutations. The total number of driver mutations that occur in any particular cancer is small, between 2 and 8.

It is now possible to sequence the genomes of individual tumor cells. These studies confirm that there is a great deal of genetic variation between individual cells and subclones within tumors. Most of these variations are due to the accumulation of different types of passenger mutations, with the few key driver mutations remaining constant between subclones. Although most of these passenger mutations may not initially confer a selective advantage on the cells that contain them, if environmental conditions change (such as during chemotherapy or radiotherapy), a passenger mutation that confers a new phenotype such as drug resistance will be selected for, leading to clonal expansion of that cell and the appearance of a new subclone within the tumor.

The Cancer Stem Cell Hypothesis

A concept that is related to the clonal origin of cancer cells is that of the cancer stem cell. Many scientists now believe that most of the cells within tumors do not proliferate. Those that do proliferate and give rise to all the cells within the tumor or within a tumor subclone are known as cancer stem cells. Stem cells are undifferentiated cells that have the capacity for self-renewal, a process in which the stem cell divides unevenly, creating one daughter cell that goes on to differentiate into a mature cell type and one that remains a stem cell. The cancer stem cell hypothesis is in contrast to the random or stochastic model that predicts that every cell within a tumor has the potential to form a new tumor.

Cancer stem cells have been identified in leukemias as well as in solid tumors of the brain, breast, colon, ovary, pancreas, and prostate.

It is possible that cancer stem cells may arise from normal adult stem cells within a tissue, or they may be created from more differentiated somatic cells that acquire properties similar to stem cells after accumulating numerous mutations and changes to chromatin structure.

Cancer as a Multistep Process, Requiring Multiple Mutations and Clonal Expansions

Although we know that cancer is a genetic disease initiated by driver mutations that lead to uncontrolled cell proliferation and metastasis, a single mutation is not sufficient to transform a normal cell into a tumor-forming (tumorigenic) malignant cell. If it were sufficient, then cancer would be far more prevalent than it is. In humans, mutations occur spontaneously at a rate of about 10⁻⁶ mutations per gene, per cell division, mainly due to the intrinsic error rates of DNA replication. Because there are approximately 10¹⁶ cell divisions in a human body during a lifetime, a person might suffer up to 10¹⁰ mutations per gene somewhere in the body, during his or her lifetime. However, only about one person in three will suffer from cancer.

The phenomenon of age-related cancer is another indication that cancer develops from the accumulation of several mutagenic events in a single cell. The incidence of most cancers rises exponentially with age. If a single mutation were sufficient to convert a normal cell to a malignant one, then cancer incidence would appear to be independent of age. Another indication that cancer is a multistep process is the delay that occurs between exposure to carcinogens (cancer-causing agents) and the appearance of the cancer. For example, there was an incubation period of 5 to 8 years between the time people were exposed to radiation from the atomic explosions at Hiroshima and Nagasaki and the onset of leukemias.


Each step in tumorigenesis (the development of a malignant tumor) appears to be the result of two or more genetic alterations that release a cancer stem cell from the controls that normally operate on proliferation and malignancy. Each step confers a selective advantage to the growth and survival of the cell and is propagated through successive clonal expansions leading to a fully malignant tumor.

The stepwise clonal evolution of tumors is illustrated by the development of colorectal cancer. Colorectal cancers are known to proceed through several clinical stages that are characterized by the stepwise accumulation of genetic defects in several genes (Figure 19.2). The first step is the conversion of a normal epithelial cell into a small cluster of cells known as an adenoma or polyp. This step requires inactivating mutations in the adenomatous polyposis coli (APC) gene, a gene that encodes a protein involved in the normal differentiation of intestinal cells. The APC gene is a tumor-suppressor gene, which will be discussed later in the chapter. The resulting adenoma grows slowly and is considered benign.

The second step in the development of colorectal cancer is the acquisition of a second genetic alteration in one of the cells within the small adenoma. This is usually a mutation in the KRAS gene, a gene whose product is normally involved with regulating cell growth. The mutations in KRAS that contribute to colorectal cancer cause the KRAS protein to become constitutively active, resulting in unregulated cell division. The cell containing the APC and KRAS mutations grows by clonal expansion to form a larger intermediate adenoma of approximately 1 cm in diameter. The cells of the original small adenoma (containing the APC mutation) are now vastly outnumbered by cells containing the two mutations.

The third step, which transforms a large adenoma into a malignant tumor (carcinoma), requires several more waves of clonal expansions triggered by the acquisition of defects in several genes, including TP53, PI3K, and TGF-β. The products of these genes control several important aspects of normal cell growth and division, such as apoptosis, growth signaling, and cell-cycle regulation, all of which we will discuss in more detail later in the chapter. The resulting carcinoma is able to further grow and invade the underlying tissues of the colon. A few cells within the carcinoma may break free of the tumor, migrate to other parts of the body, and form metastases.

FIGURE 19.2 Steps in the development of colorectal cancers. Some of the genes that acquire driver mutations and cause the progressive development of colorectal cancer are shown above the photographs. These driver mutations accumulate over time and can take 40 years or more to result in the formation of a malignant tumor. [Figure labels: genes and pathways APC, Kras, PI3K, TGF-β, cell cycle/apoptosis genes; stages normal colonic epithelium, small adenoma, large adenoma, carcinoma; patient age (years) 30–50, 40–60, 50–70.]

ESSENTIAL POINT
The development of cancer is a multistep process, requiring mutations in several cancer-related genes. ■

19.2 Cancer Cells Contain Genetic Defects Affecting Genomic Stability, DNA Repair, and Chromatin Modifications

Cancer cells contain large numbers of mutations and chromosomal abnormalities. Many researchers believe that the fundamental defect in cancer cells is a derangement of the cells’ normal ability to repair DNA damage. The resulting loss of genomic integrity leads to a general increase in the mutation rate for every gene in the genome, including cancer-causing driver mutations. The high level of genomic instability seen in cancer cells is known as the mutator phenotype. In addition, recent research has revealed that cancer cells contain aberrations in the types and locations of chromatin modifications, particularly DNA and histone methylation patterns.

Genomic Instability and Defective DNA Repair

Genomic instability in cancer cells is characterized by the presence of somatic point mutations and chromosomal


effects such as translocations, aneuploidy, chromosome loss, DNA amplification, and deletions (Figure 19.1). Cancer cells that are grown in cultures in the lab also show a great deal of genomic instability: duplicating, losing, and translocating chromosomes or parts of chromosomes. Often cancer cells show specific chromosomal defects that are used to diagnose the type and stage of the cancer. For example, leukemic white blood cells from patients with chronic myelogenous leukemia (CML) contain a specific translocation, in which the C-ABL gene on chromosome 9 is translocated into the BCR gene on chromosome 22. This translocation creates a structure known as the Philadelphia chromosome (Figure 19.3). The BCR-ABL fusion gene codes for a chimeric BCR-ABL protein. The normal ABL protein is a protein kinase that acts within signal transduction pathways, transferring growth factor signals from the external environment to the nucleus. The BCR-ABL protein is an abnormal signal transduction molecule in CML cells, which stimulates these cells to proliferate even in the absence of external growth signals.

FIGURE 19.3 A reciprocal translocation involving the long arms of chromosomes 9 and 22 results in the formation of a characteristic chromosome, the Philadelphia chromosome, which is found in chronic myelogenous leukemia (CML) cells. The t(9;22) translocation results in the fusion of the C-ABL proto-oncogene on chromosome 9 (q34.1) with the BCR gene on chromosome 22 (q11.2). The fusion protein is a powerful hybrid molecule that allows cells to escape control of the cell cycle, contributing to the development of CML.

A number of inherited cancers are caused by defects in genes that control DNA repair. For example, xeroderma pigmentosum (XP) is a rare hereditary disorder that is characterized by extreme sensitivity to ultraviolet (UV) light and other carcinogens. Patients with XP often develop skin cancer. Cells from patients with XP are defective in nucleotide excision repair, with mutations appearing in any one of seven genes whose products are necessary to carry out DNA repair. XP cells are impaired in their ability to repair DNA lesions such as thymine dimers induced by UV light. The relationship between XP and genes controlling nucleotide excision repair is also described earlier in the text (see Chapter 14).

Another example is hereditary nonpolyposis colorectal cancer (HNPCC), which is caused by mutations in genes controlling DNA repair. HNPCC is an autosomal dominant syndrome, affecting about 1 in every 200 to 1000 people. Patients affected by HNPCC have an increased risk of developing colon, ovary, uterine, and kidney cancers. Cells from patients with HNPCC show higher than normal mutation rates and genomic instability. At least eight genes are associated with HNPCC, and four of these genes control aspects of DNA mismatch repair. Inactivation of any of these four genes (MSH2, MSH6, MLH1, and MLH3) causes a rapid accumulation of genome-wide mutations and the subsequent development of cancers.

The observation that hereditary defects in genes controlling nucleotide excision repair and DNA mismatch repair lead to high rates of cancer lends support to the idea that the mutator phenotype is a significant contributor to the development of cancer.

Chromatin Modifications and Cancer Epigenetics

The field of cancer epigenetics is providing new perspectives on the genetics of cancer. Epigenetics is the study of chromosome-associated changes that affect gene expression but do not alter the nucleotide sequence of DNA. Epigenetic effects can be inherited from one cell to its progeny cells and may be present in either somatic or germ-line cells. DNA methylation and histone modifications such as acetylation and phosphorylation are examples of epigenetic modifications. The genomic patterns and locations of these modifications can affect gene expression. The effects of chromatin modifications and epigenetic factors on gene expression and cancer are discussed in more detail later in the text (see Special Topics Chapter 1: Epigenetics).

Cancer cells contain altered DNA methylation patterns. Overall, there is much less DNA methylation in cancer cells than in normal cells. At the same time, the promoters of some genes are hypermethylated in cancer cells. These changes are thought to release transcriptional repression over the bulk of genes that would be silent in normal cells, including cancer-causing genes, while at the same time repressing transcription of genes that would regulate normal cellular functions such as DNA repair and cell-cycle control.

Histone modifications are also disrupted in cancer cells. Genes that encode histone acetylases, deacetylases, methyltransferases, and demethylases are often mutated or aberrantly expressed in cancer cells. The large numbers of epigenetic abnormalities in tumors have prompted some


scientists to speculate that there may be more epigenetic defects in cancer cells than there are gene mutations. In addition, because epigenetic modifications are reversible, it may be possible to treat cancers using epigenetic-based therapies.

ESSENTIAL POINT
Cancer cells show high rates of mutation, chromosomal abnormalities, genomic instability, and abnormal patterns of chromatin modifications. ■

NOW SOLVE THIS
19.1 In chronic myelogenous leukemia (CML), leukemic blood cells can be distinguished from other cells of the body by the presence of a functional BCR-ABL hybrid protein. Explain how this characteristic provides an opportunity to develop a therapeutic approach to a treatment for CML.
HINT: This problem asks you to imagine a therapy that is based on the unique genetic characteristics of CML leukemic cells. The key to its solution is to remember that the BCR-ABL fusion protein is found only in CML white blood cells and that this unusual protein has a specific function thought to directly contribute to the development of CML. To help you answer this problem, you may wish to learn more about the cancer drug Gleevec (see https://www.cancer.gov/about-cancer/treatment/drugs/imatinibmesylate).

19.3 Cancer Cells Contain Genetic Defects Affecting Cell-Cycle Regulation

One of the fundamental aberrations in all cancer cells is a loss of control over cell proliferation. Cell proliferation is the process of cell growth and division that is essential for all development and tissue repair in multicellular organisms. Although some cells, such as epidermal cells of the skin or blood-forming cells in the bone marrow, continue to grow and divide throughout an organism’s lifetime, most cells in adult multicellular organisms remain in a nondividing, quiescent, and differentiated state. Differentiated cells are those that are specialized for specific functions, such as photoreceptor cells of the retina or muscle cells of the heart. In contrast, many differentiated cells, such as those in the liver and kidney, are able to grow and divide when stimulated by extracellular signals and growth factors. In this way, multicellular organisms are able to replace dead and damaged tissue. However, the growth and differentiation of cells must be strictly regulated; otherwise, the integrity of organs and tissues would be compromised by the presence of inappropriate types and quantities of cells. Normal regulation over cell proliferation involves a large number of gene products including those that control steps in the cell cycle.

In this section, we will review steps in the cell cycle, some of the genes that control the cell cycle, and how these genes, when mutated, contribute to the development of cancer.

The Cell Cycle and Signal Transduction

The cellular events that occur in sequence from one cell division to the next comprise the cell cycle (see Chapter 2). In early to mid-G1, the cell makes a decision either to enter the next cell cycle or to withdraw from the cell cycle into quiescence. Continuously dividing cells do not exit the cell cycle but proceed through the G1, S, G2, and M phases; however, if the cell receives signals to stop growing, it enters the G0 phase of the cell cycle. During G0, the cell remains metabolically active but does not grow or divide. Most differentiated cells in multicellular organisms can remain in this G0 phase indefinitely. Some, such as neurons, never reenter the cell cycle. In contrast, cancer cells are unable to enter G0, and instead, they continuously cycle. Their rate of proliferation is not necessarily any greater than that of normal proliferating cells; however, they are not able to become quiescent at the appropriate time or place.

Normal cells in G0 can often be stimulated to reenter the cell cycle by the presence of external growth signals such as growth factors and hormones that bind to cell-surface receptors. These receptors then relay the signal from the plasma membrane to the cytoplasm and the nucleus. The process of transmitting growth signals from the external environment to the cell nucleus is known as signal transduction. Ultimately, signal transduction stimulates the expression of genes whose products then allow cells to enter the cell cycle and divide. Cancer cells often have defects in signal transduction pathways. Sometimes, abnormal signal transduction molecules send continuous growth signals to the nucleus even in the absence of external growth factors. An example of abnormal signal transduction due to mutations in the ras gene is described in Section 19.4. In addition, malignant cells may not respond to external signals from surrounding cells, signals that would normally inhibit cell proliferation within a mature tissue.

Cell-Cycle Control and Checkpoints

In normal cells, progress through the cell cycle is tightly regulated, and each step must be completed before the next step can begin. There are at least three distinct points in the cell cycle at which the cell monitors external signals and internal equilibrium before proceeding to the next stage. These are the G1/S, G2/M, and M checkpoints. At the G1/S checkpoint, the cell monitors its size and determines whether its DNA has
