GENE STRUCTURE

Previous lectures have detailed the chemistry of the DNA molecule, the genetic material, as well as the mechanisms for replicating and maintaining the integrity of the DNA. We now want to understand the functional aspects of DNA as the genetic material. How does this DNA molecule function as a gene, what is a gene, and how can we consider mutation in the context of the gene?

The Genetic Code

The DNA molecule is a linear array of nucleotides. In most cases, we think of the genetic information as a code that specifies the linear array of amino acids constituting a protein. It is also true that some genes encode RNA molecules that function as RNAs and do not encode proteins. Examples include the ribosomal RNAs (rRNAs), which form the ribosome structure and play a catalytic role in protein synthesis; the transfer RNAs (tRNAs), which carry specific amino acids to the translation machinery; and the small nuclear RNAs (snRNAs), which play a critical role in the processing of mRNA precursors by splicing. Most genes, however, function to encode proteins. Twenty amino acids are utilized in the synthesis of proteins. Since there are only four possible nucleotides, a single nucleotide cannot code for an amino acid. Pairs of nucleotides would also be insufficient, giving only 16 combinations (4 x 4). A group of three nucleotides, however, gives 64 possible codons (4 x 4 x 4), more than enough to code for all 20 amino acids. This is in fact the code: a triplet code in which every three nucleotides encode one amino acid.
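The counting argument above can be checked with a short Python sketch. The names `BASES`, `AMINO_ACIDS`, and `codeword_count` are illustrative choices, not part of the lecture; the sketch simply enumerates every word of a given length over four bases and compares the count with the 20 amino acids.

```python
from itertools import product

BASES = "ACGU"      # the four RNA nucleotides
AMINO_ACIDS = 20    # amino acids used in protein synthesis


def codeword_count(length: int) -> int:
    """Number of distinct words of the given length over four bases."""
    return len(list(product(BASES, repeat=length)))


for length in (1, 2, 3):
    n = codeword_count(length)
    verdict = "sufficient" if n >= AMINO_ACIDS else "insufficient"
    print(f"word length {length}: {n} combinations ({verdict})")
```

Only the triplet gives at least 20 combinations, and the surplus of codons over amino acids is what makes degeneracy possible.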

Properties of the genetic code:
- Read in groups of three
- Unambiguous (a triplet codon specifies a unique amino acid)
- Degenerate (more than one codon specifies an amino acid)
- Stop codons

The genetic code is universal (the same code is used in all organisms, both prokaryotic and eukaryotic) with one exception: there are a few differences in the code used in mitochondria.
- UGA is not a stop signal but codes for tryptophan. The mitochondrial tRNA(Trp) recognizes both UGG and UGA, obeying traditional wobble rules.
- Internal methionine is encoded by both AUG and AUA; initiating methionines are specified by AUG, AUA, AUU, and AUC.
- AGA and AGG are not arginine codons but are stop codons. Thus, there are four stop codons (UAA, UAG, AGA, and AGG) in the mitochondrial code.
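As a rough illustration, the mitochondrial differences listed above can be expressed as a handful of overrides on the standard codon table. This is a minimal sketch: the variable and function names are illustrative, the standard table is built from the conventional 64-letter ordering (bases in the order U, C, A, G), and only the internal-codon reassignments from the list are applied (alternative initiation codons are not modelled).

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard code, one letter per codon in the order UUU, UUC, UUA, UUG, UCU, ... GGG
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}

# Overrides taken from the list in the text: UGA -> Trp, AUA -> Met, AGA/AGG -> stop
MITO = dict(STANDARD)
MITO.update({"UGA": "W", "AUA": "M", "AGA": "*", "AGG": "*"})


def stop_codons(table):
    """Sorted list of codons that the table reads as stop ('*')."""
    return sorted(codon for codon, aa in table.items() if aa == "*")


def codons_per_residue(table):
    """How many codons specify each amino acid (a measure of degeneracy)."""
    return Counter(table.values())


print("standard stops:     ", stop_codons(STANDARD))
print("mitochondrial stops:", stop_codons(MITO))
print("Trp codons:", codons_per_residue(STANDARD)["W"], "->", codons_per_residue(MITO)["W"])
```

The same `codons_per_residue` helper also makes the degeneracy property concrete: leucine, for example, is specified by six codons in the standard table.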

A mutation is any heritable change in the genetic material, resulting in an alteration of the DNA sequence. Mutations are usually considered in the context of a change that alters gene function and thus the phenotype of the organism. Understanding the molecular mechanisms responsible for mutations, either simple changes in DNA sequence or more drastic deletions, insertions, or rearrangements of DNA material, as well as the mechanisms responsible for recognizing and correcting these alterations, is of central importance to the understanding of disease mechanisms.

The Nature of Mutations

Mutations, which can alter the coding properties of a DNA segment, are of several types:

A. Substitution mutations convert one type of base pair into another. G-C to A-T and A-T to G-C changes are referred to as transition mutations (replacement of a purine-pyrimidine base pair by the other purine-pyrimidine base pair, so that a purine replaces a purine and a pyrimidine replaces a pyrimidine). G-C to C-G, G-C to T-A, A-T to T-A, and A-T to C-G changes are called transversions (replacement of a purine-pyrimidine base pair by a pyrimidine-purine base pair). Although transitions are more common than transversions, both kinds of mutations occur as a consequence of replication errors, both can result from chemical damage to DNA, and both have been implicated as causative factors in inherited genetic disease and cancer. A single nucleotide change can convert a codon to that of another amino acid, thus altering the protein (a missense mutation). Such a change can also create a stop codon (a nonsense mutation).
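The distinction between transitions and transversions reduces to whether the base on a given strand stays within its chemical class. A minimal Python sketch follows; the function name `classify_substitution` is illustrative, and the change is described by the base on one strand only (so G-C to A-T is written as G to A).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base change on one strand as a transition or transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("no substitution: reference and altered base are identical")
    # Purine <-> purine or pyrimidine <-> pyrimidine keeps the chemical class
    if {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES:
        return "transition"
    return "transversion"


for ref, alt in [("G", "A"), ("T", "C"), ("G", "C"), ("G", "T"), ("A", "T"), ("A", "C")]:
    print(f"{ref} -> {alt}: {classify_substitution(ref, alt)}")
```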

B. Small insertions/deletions comprise a second relatively common class of mutation. Genetic changes of this sort involve insertion or loss of a small number of contiguous base pairs (one to several hundred). Repetitive runs of a mono-, di-, or trinucleotide sequence are extremely prone to insertion/deletion mutation, an effect that has been attributed to slippage of template and primer strands during replication:


Repeat elements like the (CA)n element shown above or the (A)n element, which contains a run of adenine residues on one strand paired with a run of thymine bases on the other, are very common in human chromosomes. For example, about 50,000 (CA)n repeats are distributed throughout the human genome, with each repeat element typically containing 10 to 60 copies of the (CA) dinucleotide (i.e., n = 10 to 60). Because of their propensity to slip during DNA biosynthesis, repetitive sequences are particularly prone to mutation. As described below, a high incidence of mutations in (CA)n microsatellite sequences is a valuable diagnostic for certain human malignancies.
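To make the (CA)n notation concrete, the following sketch scans a sequence for uninterrupted runs of the CA dinucleotide and reports where each run starts and its copy number n. The function name, the default threshold of 10 copies (taken from the typical range quoted above), and the test sequence are all illustrative assumptions; only one strand is scanned.

```python
import re


def find_ca_repeats(seq: str, min_copies: int = 10):
    """Return (start, n) for each uninterrupted (CA)n run with n >= min_copies."""
    pattern = re.compile(r"(?:CA){%d,}" % min_copies)
    return [(m.start(), len(m.group()) // 2) for m in pattern.finditer(seq.upper())]


# Hypothetical sequence: one long repeat (n = 12) and one short run (n = 3)
sequence = "GGT" + "CA" * 12 + "TTG" + "CA" * 3 + "G"
print(find_ca_repeats(sequence))
```

Gain or loss of a single CA unit by slippage changes the length by two base pairs, which is not a multiple of three.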

A deletion or insertion will result in a frameshift if its length is not a multiple of three base pairs.

DNA Sequence               Protein Sequence           Type of Mutation
ATG AAA TTT TGT CGT AAA    MET LYS PHE CYS ARG LYS    Wild type
ATG AAc TTT TGT CGT AAA    MET asn PHE CYS ARG LYS    Missense
ATG AAC TTT TGa CGT AAA    MET ASN PHE stop           Nonsense
ATG AAA T TGT CGT AAA      MET LYS leu ser stop       Deletion/Frameshift
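The four example sequences above can be translated mechanically to confirm the protein column. The sketch below is illustrative only: it builds the standard codon table from the conventional 64-letter ordering (bases in the order T, C, A, G), uses one-letter amino acid symbols, and the function name `translate` is an assumption of this example.

```python
from itertools import product

BASES = "TCAG"
# Standard code, one letter per codon in the order TTT, TTC, TTA, TTG, TCT, ... GGG
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}


def translate(dna: str) -> str:
    """Translate from the first base, three at a time, stopping at a stop codon."""
    dna = dna.replace(" ", "").upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


examples = {
    "wild type": "ATG AAA TTT TGT CGT AAA",
    "missense": "ATG AAc TTT TGT CGT AAA",
    "nonsense": "ATG AAC TTT TGa CGT AAA",
    "deletion/frameshift": "ATG AAA T TGT CGT AAA",
}
for label, dna in examples.items():
    print(f"{label:20s} {translate(dna)}")
```

In the frameshift example the two-base deletion shifts the reading frame so that the downstream codons are read as TTG, TCG, and TAA, which introduces a premature stop.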

Definition of a Gene

Traditionally, a gene has been defined either as the unit of heredity or as that portion of a chromosome encoding a functional RNA or protein. Although these two views generally coincide in the case of prokaryotic genes, the situation is much more complex in the case of a eukaryotic gene. A prokaryotic gene is relatively simple in structure, including the coding sequence to specify the synthesis of a protein and a minimal amount of regulatory sequence to control the expression of the gene. In contrast, a eukaryotic gene can be vastly more complex and can occupy large regions of chromosomes. This is because most eukaryotic genes, particularly those in mammalian cells, are discontinuous. That is, coding regions are often separated by non-coding sequence. More importantly, the regulatory sequences that are responsible for the expression of the gene can be complex and separated by large distances from the actual gene sequence. Since a gene must ultimately be defined in a phenotypic sense, the expression of the gene is critical: phenotype is determined not only by the sequence of a particular protein but also by the ability of a given cell to express that protein. In consideration of all of these issues, a functional gene can be defined as those DNA sequences necessary to achieve the normal expression of the gene product.

Obviously, the ability to precisely define a gene is critical for an understanding of the basis of gene function, including the nature of sequences important for the normal, regulated expression of the gene. In addition, a knowledge of the gene structure and sequence is critical for evaluating and understanding the molecular basis for gene mutation that underlies a disease state. Finally, we will see later that a knowledge of the characteristics of a gene, including those sequences that define open reading frames, splice site signals that define exon/intron junctions, and the sequences that constitute transcription regulatory signals, is critical in the search for an unknown gene.

Isolation and Study of a Eukaryotic Gene:

How does one go about studying the structure and characteristics of a particular gene? Since there are approximately 3 x 10^9 base pairs in the human genome, and any given gene may be no more than 10^4 base pairs, a single gene represents only a few millionths of the total, and it cannot be analyzed directly within the total population of human DNA. Clearly, a gene must be isolated apart from the total DNA and amplified to allow a detailed study.

Early studies of gene structure and function in eukaryotic cells made use of animal viruses, particularly the so-called DNA tumor viruses including adenovirus and polyomaviruses, as a mechanism to isolate and study individual genes. Viruses simply represent a relatively small set of genes packaged in a protein coat. The ability to isolate and purify viruses thus provided a mechanism to isolate pure populations of specific genes. Since these viruses make use of cellular activities for the expression of the viral genes, the basic aspects of viral gene structure and function generally reflect those found for cellular genes. Thus, the initial studies of these viruses laid much of the groundwork for subsequent analysis of cellular genes.

With the advent of molecular cloning through recombinant DNA procedures, it then became feasible to isolate individual eukaryotic genes. Cloning allows both the purification (isolation) of the gene, away from all others, and the amplification of the gene to provide sufficient material to carry out biochemical analyses. We have already discussed the procedures involved in the generation of a library of clones and the methods for detection of a particular clone by hybridization with a radioactive DNA probe of complementary sequence. But where does the probe come from? Let's take the globin gene as an example. Standard biochemical procedures for protein purification can yield sufficient amounts of pure preparations of the globin protein to carry out analysis. Protein chemistry can then generate limited amino acid sequence from the purified protein.
The amino acid sequence can then be used to predict the nucleotide sequence of the gene based on a knowledge of the genetic code. There will be some ambiguities since the code is degenerate (an amino acid can be encoded by multiple codons) but a mixture of DNA probes could be synthesized, one of which would correspond to the actual globin gene sequence. This mixture of probes could then be employed to screen the library of clones to identify the one that carries the globin gene. An alternative approach would be to generate an antibody that is specific to the globin protein by injection of the purified protein into rabbits or mice. Once an antibody has been generated, it can then be used to screen an expression library for a clone that encodes a portion of the globin protein. In this case, the expression library, likely in the form of a bacteriophage lambda vector, would generate fusion proteins within the phage-infected cells in which a portion of the β-galactosidase protein is fused to the random sequences that were cloned. If one of these carried the globin sequence, and if this was a portion of the globin protein that was recognized by the antibody, then by screening plaques with the antibody followed by a secondary procedure that would detect antibody bound to the filters, essentially the same procedure used for Western blot analysis, one could identify those clones in the library that carried DNA sequences encoding globin protein.
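The size of the probe mixture required by the degenerate-probe strategy described above follows directly from the degeneracy of the code: it is the product of the number of codons for each amino acid in the sequenced peptide. The sketch below is a hedged illustration. The peptides are hypothetical and are not taken from globin, and the names `CODONS_PER_AA` and `probe_pool_size` are assumptions of this example.

```python
from collections import Counter
from itertools import product
from math import prod

BASES = "TCAG"
# Standard code, one letter per codon in the order TTT, TTC, TTA, TTG, TCT, ... GGG
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}

# Number of codons that specify each amino acid
CODONS_PER_AA = Counter(CODON_TABLE.values())


def probe_pool_size(peptide: str) -> int:
    """Number of distinct DNA sequences that could encode the peptide."""
    peptide = peptide.upper()
    for residue in peptide:
        if residue == "*" or residue not in CODONS_PER_AA:
            raise ValueError(f"not an amino acid symbol: {residue!r}")
    return prod(CODONS_PER_AA[residue] for residue in peptide)


# Hypothetical peptides (one-letter code), chosen only to contrast low and high degeneracy
print(probe_pool_size("MWKDF"))  # Met and Trp have one codon each
print(probe_pool_size("LSR"))    # Leu, Ser and Arg have six codons each
```

Under this counting, a stretch rich in methionine and tryptophan needs far fewer probes than one rich in leucine, serine, or arginine, which is one consideration when choosing which part of the peptide sequence to base the probe on.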

Discontinuous Nature of a Eukaryotic Gene

The ability to isolate a gene such as that encoding the β-globin protein finally allowed detailed molecular analyses to be undertaken of the structure of a eukaryotic gene. This led to the startling discovery that most eukaryotic genes are discontinuous. That is, sequences that are found in the messenger RNA are not contiguous in the DNA. This fact was initially discovered with the DNA virus adenovirus by hybridizing a viral messenger RNA to the viral DNA genome and finding that segments of the DNA were looped out because the sequences in the RNA were not contiguous in the DNA. Subsequent studies with a variety of cellular genes revealed that this was a common property of most genes. Moreover, the analysis of cellular genes revealed that actual coding sequences were interrupted by intervening sequences. Shown above is an analysis of the structure of the ovalbumin gene, relative to the mRNA product, as observed by forming a hybrid between the mRNA and the genomic DNA and then examining the resulting structure in an electron microscope (top figure). The DNA has been denatured and the single-stranded portion can be seen extending at the ends. The schematic shown in the middle represents a tracing of the electron micrograph to indicate the structures. The double-stranded regions hybridized to the mRNA (designated 1 - 7) represent genomic sequences that are preserved in the mRNA, whereas the looped-out regions (A through G) define genomic sequence not present in the mRNA sequence. The deduced structure of the gene, showing exons and introns, is depicted at the bottom. This analysis, and a previous one using adenovirus, defined the exon/intron structure of eukaryotic genes and the fact that the mRNA is processed from an initial precursor.

Structure of a Typical Eukaryotic Gene

As a result of the analysis of a large number of eukaryotic genes, essential features of gene organization have become evident. Most striking, when compared to the organization and structure of prokaryotic genes, is the extreme complexity of the genes in eukaryotic cells, particularly higher eukaryotes. This often involves a large array of discontinuous segments that must be assembled into the final mRNA product as well as a complex array of regulatory sequences that govern the transcription of the gene.

Exons - sequences in the gene that are found in the functional mRNA. Includes coding sequence but may also include non-coding sequence. The beginning of the first exon defines the site of initiation of transcription since there is no processing of the 5' end of the primary transcript. The end of the final exon defines the site of cleavage of the primary transcript at what is known as the polyadenylation site, creating the mature mRNA 3' terminus. Transcription does not terminate at this position but continues some distance downstream.

Introns - intervening sequences in the gene that are removed in the formation of the functional mRNA. Usually includes non-coding sequence but there are instances of alternative processing where sequences can be both introns and exons.
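The relationship between the primary transcript, its exons, and its introns can be sketched as simple interval arithmetic. Everything in this example is hypothetical: the toy sequence, the coordinates, and the function names `splice` and `introns`. Exons are given as 0-based, half-open (start, end) intervals in transcript order; the first exon starts at the transcription start site and the last ends at the polyadenylation site, as in the definitions above.

```python
def splice(primary_transcript: str, exons):
    """Join the exon intervals, in order, to give the mature mRNA sequence."""
    return "".join(primary_transcript[start:end] for start, end in exons)


def introns(exons):
    """Intervals lying between consecutive exons, i.e. the sequences removed."""
    return [(end, next_start) for (_, end), (next_start, _) in zip(exons, exons[1:])]


# Toy transcript: exon sequence in upper case, intron sequence in lower case
transcript = "AAAA" + "gtccag" + "CCCC" + "gtttag" + "GGGG"
exon_coords = [(0, 4), (10, 14), (20, 24)]

print(splice(transcript, exon_coords))
print(introns(exon_coords))

# Alternative processing: skipping the middle exon makes it part of the removed sequence
print(splice(transcript, [exon_coords[0], exon_coords[2]]))
```

The last call illustrates how the same stretch of sequence can behave as an exon in one mRNA and as part of an intron in another.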

This arrangement can vary from relatively simple (two exons separated by one intervening intron sequence) to extremely complex, whereby a very large number of exons form the final mRNA. For instance, the dystrophin gene, which is mutated in Duchenne muscular dystrophy, comprises at least 70 exons and more than one million base pairs of DNA. With respect to considerations of mutation frequency, one might suspect that the very large domains of certain genes contribute to an increased frequency of mutation by creating a larger target size for mutation. Although much of the size is due to intervening sequence, the mutation of which would generally have no consequence, it is also possible that an actively transcribed domain would be more susceptible to mutagenic events.

Complexity of Gene Organization in Metazoan Organisms

As a result of the ability to analyze and study gene structure and organization in higher eukaryotic systems, including humans, it is now apparent that a much increased complexity of gene organization as well as gene structure can be seen when compared to the lower eukaryotic organisms such as yeast. One example is seen in the generation of large, multi-gene families such as the globin gene cluster shown below.

In addition, it is also evident that the exon-intron structure of various genes is highly complex. Certainly, in some cases it provides flexibility in gene expression. Alternative splicing can create distinct gene products from one locus. It is also possible that this is a mechanism for evolution of function. In many cases, exons appear to encode protein "domains" - independently functioning units of a protein. Thus, through a process of exon shuffling, new proteins could be created from parts of others.

Unequal crossing over as a mechanism for gene duplication as well as exon shuffling

The analysis of eukaryotic gene structure has revealed the evolution of gene families resulting from gene duplication and subsequent evolution of sequence. It is likely that unequal crossing over during meiosis, involving inappropriate pairing via repetitive DNA sequences, is responsible for generating this complexity. As an example, consider the globin locus as shown in the figure below. In this case, the two chromosome homologues mis-align as a result of pairing of a repeated sequence known as Alu that is found frequently along each chromosome. A recombination event then leads to the formation of one chromosome with a duplication of the globin gene and a second chromosome in which the globin gene has been deleted. The gamete receiving the chromosome with the deleted region will not give rise to viable progeny, but the other chromosome will now carry two copies of the globin gene.

The same event can give rise to changes in gene structure. For instance, consider a region of chromosomal DNA that contains five functional sequence blocks (genes or exons) as shown above. Normally, precise alignment of the two chromosomal pairs would ensure that crossing over events would not change the overall organization. If, however, the two chromosomes were to mis-align as a result of pairing between repeated sequences that might be found interspersed in the chromosome (solid circles), then unequal crossing over would leave one product with a duplication of a gene (or exon) and the other with a deletion.
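The outcome of equal versus unequal crossing over can be sketched by treating each chromosome as an ordered list of blocks. This is a toy model under stated assumptions: the block labels, the position of the repeats ("R"), and the function name `crossover` are all hypothetical, and the breakpoints are expressed as list indices at which the two homologues are cut and exchanged.

```python
def crossover(chrom_a, chrom_b, break_a, break_b):
    """Exchange the segments downstream of the breakpoints on two homologues."""
    product_1 = chrom_a[:break_a] + chrom_b[break_b:]
    product_2 = chrom_b[:break_b] + chrom_a[break_a:]
    return product_1, product_2


# Five functional blocks (genes or exons) with a repeated sequence "R" on either side of block 2
chromosome = ["1", "R", "2", "R", "3", "4", "5"]

# Precise alignment: both homologues break after the same repeat, organization is unchanged
print(crossover(chromosome, chromosome, 2, 2))

# Mis-alignment: the second repeat of one homologue pairs with the first repeat of the other
duplicated, deleted = crossover(chromosome, chromosome, 4, 2)
print(duplicated)  # carries two copies of block 2
print(deleted)     # has lost block 2
```

Because the exchange happens within a repeat, the breakpoint need not be precise for the functional blocks themselves to remain intact.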

The fact that intron sequences, as well as intergenic sequences, are essentially non-functional means that there is great flexibility in the capacity to re-arrange genes and exons due to recombination. That is, there is no requirement for a precise break and rejoining event if the basic functional unit (entire gene or exon with necessary splice signals) is maintained.

Comparison of the sequences/structure of the albumin and alpha-fetoprotein genes illustrates the process of gene evolution

Cloning of the albumin and alpha-fetoprotein genes, and subsequent DNA sequence analysis, have revealed considerable sequence similarities between the two genes and also within each gene. In particular, the sequence and arrangement of exons 3-6 are similar to those of exons 7-10 and exons 11-14 for both genes. As such, it has been proposed that a primordial gene may have given rise to the present-day albumin and alpha-fetoprotein genes as the result of an initial triplication of this group of four exons followed by a duplication to create the related genes.

Functional Immunoglobulin Genes and Related T Cell Receptor Genes Are Created in Somatic Cells by Genomic Rearrangements

Virtually every cell in the human body, or in any organism, possesses the exact same complement of genes. Thus, even though the gene encoding albumin is only expressed in the liver, the exact same gene is also found in the brain. Clearly, the control of gene expression is a critical event in determining cell phenotype and we will return to the basis for this regulation later. There are, nevertheless, at least three exceptions to the rule that every cell contains the same set of genes. First, the random events of somatic mutation generate changes in genes that will be unique to the cell in which the mutation occurred. This is an ongoing process that for the most part is of no consequence. In certain instances, however, it does alter the function of a gene - perhaps the best example is the somatic mutation within the variable region of the immunoglobulin genes that generates a large part of the diversity of the antibody repertoire. Second, mutations, deletions, or chromosome alterations that affect key cellular regulatory genes can initiate and contribute to the process of oncogenesis. Mutations can activate oncogenes to a constitutively active, unregulated state. Deletions can eliminate genes that normally function to negatively control cellular proliferation. Chromosome rearrangements can generate novel, chimeric genes that possess unique properties. Third, the genes encoding the immunoglobulin molecules as well as the T cell receptors must undergo somatic recombination events in order to create a functional gene. Indeed, these recombination events, together with enhanced somatic mutation of the genes, are responsible for creating the tremendous diversity of the immunological response. Moreover, as indicated above, somatic mutations within the V and J segments of the immunoglobulin genes, as well as errors in the recombination events, also contribute to the generation of antibody diversity.

Experimental evidence for immunoglobulin gene rearrangement: A Southern blot assay of germ cell DNA or embryo DNA versus DNA from an antibody-producing B cell, in this case a B cell tumor (myeloma), reveals a distinct difference in the organization of the immunoglobulin heavy chain gene, as shown in the figure below.

For instance, if one were analyzing the structure of the gene in the region of the J segments, where recombination was occurring, a Southern blot could detect a change in the genomic organization as indicated by a new band. If this same assay were carried out using any tissue source other than a B cell, the same pattern of immunoglobulin gene arrangement as found in the germ line would be obtained. Moreover, if one assayed for virtually any other gene sequence, one would find the same pattern in the germ line as in a somatic cell. Thus, there is a B cell specific alteration in the structure of the immunoglobulin DNA.
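Why a rearrangement produces a new band can be illustrated with a small calculation: the band detected by a probe corresponds to the restriction fragment that contains the probe sequence, and deleting DNA next to the probe changes which restriction sites flank it. The sketch below is purely illustrative. All coordinates are hypothetical, the rearrangement is modelled simply as a deletion of the DNA between two joined segments, and the function names are assumptions of this example.

```python
from bisect import bisect_right


def probe_fragment_length(cut_sites, probe_pos):
    """Length of the restriction fragment containing the probe (cut_sites must be sorted)."""
    i = bisect_right(cut_sites, probe_pos)
    if i == 0 or i == len(cut_sites):
        raise ValueError("probe is not flanked by restriction sites on both sides")
    return cut_sites[i] - cut_sites[i - 1]


def delete_region(cut_sites, probe_pos, start, end):
    """Remove the interval [start, end) and return the shifted sites and probe position."""
    if start <= probe_pos < end:
        raise ValueError("probe sequence lies within the deleted region")
    size = end - start
    new_sites = [s if s < start else s - size for s in cut_sites if not (start <= s < end)]
    new_probe = probe_pos if probe_pos < start else probe_pos - size
    return new_sites, new_probe


# Hypothetical germ-line map (base-pair coordinates) with a probe in the J region
germline_sites = [1000, 9000, 15000]
j_probe = 12000
print("germ line band:", probe_fragment_length(germline_sites, j_probe))

# Hypothetical rearrangement deleting the DNA between the joined segments,
# including the restriction site at 9000
b_cell_sites, b_cell_probe = delete_region(germline_sites, j_probe, 4000, 11000)
print("B cell band:   ", probe_fragment_length(b_cell_sites, b_cell_probe))
```

In this toy map the probe detects a 6000 bp fragment in germ-line DNA and a 7000 bp fragment after the deletion, which would appear as a band of different size on the blot.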
