
Chapter 7. Functional Genomics

Contents

7. Functional Genomics
7.1. Annotating Protein Function
7.1.1. The BLAST Search Tool at NCBI
7.1.2. Functional Annotation Using Yeast Knockout Mutations
7.1.3. Functional Annotation Using Mouse Knockout Mutations
7.1.4. Gene Knockdown Using RNAi
7.1.5. Gene Editing Using CRISPR-CAS
7.1.6. Protein Localization Using GFP Tags
7.1.7. Protein-Protein Interactions by Yeast 2-hybrid
7.2. Proteomics
7.2.1. What is Proteomics
7.2.2. Protein Modifications
7.2.3. Technology for Proteomic Analysis
7.2.4. Structural Analysis of Proteins
7.3. Transcriptome and Gene Expression
7.3.1. Single Transcript Abundance Estimation
7.3.2. Genome-wide Transcript Abundance Estimation

CONCEPTS OF GENOMIC Page 7- 1

CHAPTER 7. FUNCTIONAL GENOMICS

The Central Dogma of Molecular Biology, simply stated, is that DNA is transcribed into RNA and RNA is translated into protein. Further, we know that the phenotype of a gene is determined by the proteins made inside the cell. Thus, it is because of the function of the protein that the gene is expressed as a phenotype. In this chapter we will examine how protein function is worked out, and therefore how the annotation of proteins found at NCBI is determined. The tools that are required to do this will also be considered. We then extend this functional analysis to include a more sophisticated analysis of gene expression that involves both all transcripts made by a genome, the transcriptome, and all proteins made from a genome, the proteome.

7.1. ANNOTATING PROTEIN FUNCTION

Previously, in Chapter 6, Structural Genomics, we examined how genes were identified in sequenced genomes. Once the coding sequences, the location of introns and exons, the promoter region, and the 3’-UTR have been identified, the next step is to annotate the structure and function of the protein products of protein-coding ORFs. The general approach and some of the tools available for this task are outlined here.

7.1.1. The BLAST Search Tool at NCBI

The Basic Local Alignment Search Tool (BLAST) at NCBI provides the ability to search a sequence database with a given sequence called a query sequence. The sequences most similar to the query are reported back with a sequence alignment, a computed score indicating the similarity of the query to each sequence found in the database, and any annotation attached to those hit sequences. Thus, we can learn how the proteins most closely related to the query sequence function.

There are several types of BLAST that can be used. These include:

I. Nucleotide BLAST – using a nucleotide query to search a nucleotide database for the most similar sequences.
II. Protein BLAST – using a protein query to search a protein database for the most similar sequences.
III. Translated BLAST searches, including
   a. BLASTX – using a nucleotide query translated in all six reading frames to search a protein database for the sequences most similar to the translated query.
   b. TBLASTN – using a protein sequence to search a nucleotide database translated in all six reading frames.

Note that we do a laboratory covering the BLAST tool at NCBI, in which examples of these searches are executed and examined.

The utility of BLAST can be further extended by linking BLAST search results to a gene ontology (GO). The Gene Ontology Consortium provides a framework for relating functional information about genes to the function of the whole organism, e.g. determining when, where, and how a gene functions (including metabolic and developmental pathways). This extension of BLAST can be executed at several different web pages, including the BLAST2GO page. Such analyses have been conducted on virtually every genome sequenced, to provide at least a minimal functional analysis of the genome sequence. An example of the GO analysis of the yeast genome is given in Figure 7.1. The number of genes in various categories, organized in two different ways, is shown in the figure.

Figure 7.1. Yeast GO analysis, indicating the number of genes in each category. Each of the roughly 6200 genes identified in the yeast genome has been placed in one or more GO categories, and a graphic summary of the analysis is presented above. Note that the categories can be altered so that critical points of interest can be investigated in more detail.

7.1.2. Functional Annotation Using Yeast Knockout Mutations

The purpose of making a knockout mutation is to replace the coding sequence of an endogenous gene in the yeast genome with a replacement sequence that makes the endogenous gene nonfunctional.
Typically, the replacement sequence is a selectable marker gene that allows easy detection of the insertion. The most commonly used selectable marker is a gene for resistance to the antibiotic kanamycin. The KanR (kanamycin resistance) cassette, including a promoter for the gene, is inserted between a sequence of approximately 50 base-pairs near the start site at the 5’-end of the gene and a sequence about 50 base-pairs long near the 3’-end of the gene. As a result, the middle portion of the gene is removed (deleted) and the cassette is inserted. The insertion is accomplished by a process referred to as homologous recombination (Figure 7.2a).

This recombinant construct is made in a shuttle vector (refer to Chapter 4, The Genomic Toolkit, section 4.2.3.) so that it can be transferred into yeast cells, and selection is performed for cells that are stably resistant to kanamycin. Stable resistance indicates that the knockout construct has been successfully recombined into the chromosomal DNA.

Successful knockouts are further verified using PCR. A forward primer (A) outside the gene on the 5’-end and a reverse primer (B) inside the endogenous gene are used in one PCR reaction, while a forward primer (C) also inside the endogenous gene and a reverse primer (D) outside the 3’-end of the gene are used in a second reaction. If the knockout was unsuccessful, these two sets of primers will each amplify endogenous gene sequences. If the knockout was successfully made, these primers will not amplify sequences. However, the same (A) and (D) primers paired with a reverse primer that lands inside the KanR cassette (KanB) and a forward primer that also lands inside the cassette (KanC), as shown in Figure 7.2b, will amplify sequences in the knockout, while they will not amplify endogenous gene sequences.

Figure 7.2. a) Yeast knockout mutations are constructed using a KanR cassette by homologous recombination (see text for details). b) Verification of the knockout by PCR using 4 primer sets: A-B, C-D, A-KanB, and KanC-D.

When you want to determine the phenotype of a gene in any eukaryote, knockout mutations of the yeast ortholog are a valuable tool. However, the technique only works when an ortholog of the gene of interest can be found in yeast. It is particularly useful for genes with simple metabolic functions.

The steps involved would be as follows:

I. Identify the putative protein of interest using a BLAST search.
II. Determine whether there is a yeast ortholog of your gene of interest.
III. Construct a knockout of that gene and determine the phenotype of the knockout.
IV. Obtain a cDNA clone of the gene of interest from the appropriate eukaryotic organism.
V. Construct a yeast expression shuttle vector that expresses the gene of interest.
VI. Move that construct into the yeast knockout and determine whether the normal phenotype is restored or whether the knockout phenotype remains.

The limitation of this analysis is that your gene of interest must have a yeast ortholog. However, yeast has a genome that contains approximately 6,300 genes, and most eukaryotes contain several-fold more genes. Yeast is essentially unicellular, while most eukaryotes of interest are multicellular and have many genes associated with the developmentally appropriate expression of genes. For applying the knockout approach to complex multicellular organisms, mouse knockouts have proven more useful for many genes.

7.1.3. Functional Annotation Using Mouse Knockout Mutations

A mouse knockout is made using homologous recombination just as in yeast. The mouse knockout cassette contains the ends of the target gene just as with yeast, but it contains two selectable markers, one inside the gene of interest and the other outside the gene of interest. These two markers make it possible to distinguish between a true homologous recombination knockout and integration of the foreign DNA vector into a random site in the genome, which would not produce a mouse knockout (Figure 7.3).

The deletion module is introduced into embryonic stem (ES) cells from an agouti mouse, and the ES cells are incubated on “selective” media such that only cells carrying the internal deletion marker (neoR in the case shown in Figure 7.3) and lacking the marker outside the deletion site (tk in Figure 7.3) are allowed to grow. This enriches for cells that have the deletion module inserted into the target gene via homologous recombination and minimizes cells that had the full DNA vector randomly integrated into the genome, which is the other possible outcome. Note that it is still possible that only the inside marker integrates randomly and the outside marker is lost without homologous recombination taking place, but this is a less likely event than homologous recombination.

Figure 7.3. A mouse knockout is made using homologous recombination just as in yeast. The knockout cassette contains the ends of the target gene just as with yeast, but it contains two selectable markers, one inside the gene of interest and the other outside the gene of interest. Here neoR is the marker inside the gene, and thymidine kinase (tk) is the marker outside. Embryonic mouse stem cells are incubated with the deletion module, with two possible outcomes. Either the neoR marker homologously recombines into the gene of interest (left), or the deletion module randomly integrates, in which case the cells will express both the neoR and the tk genes.

Once cells expressing the inside marker have been grown and selected, they are injected into a developing embryo of a black mouse (Figure 7.4). The animals that result from these injected embryos will be chimeric, meaning that they contain a mixture of agouti injected cells and the original black mouse embryonic cells. These chimeric mice are then mated, and agouti mice and black mice result from this cross. Note that agouti is dominant to black, and the agouti mice are also likely heterozygous for the knockout mutation. This can be verified using PCR, and then a line of knockout mice can be created.

Figure 7.4. Once the agouti mouse ES cells containing the knockout construct are selected, they are injected into the embryo of a black mouse, where they become part of a chimeric embryo that produces a chimeric mouse. If any of the agouti, knockout-bearing cells become germline cells, agouti mice will result from the cross of two chimeric agouti mice. These resulting agouti mice should be homozygous for the knockout (deletion) of the gene of interest, and a line of these mice can be propagated for future study.

Once the knockout mice have been obtained, the next task is to determine the biochemical, physiological, morphological, regulatory, and/or behavioral phenotype of the knocked-out gene. Not only do such gene knockouts provide valuable information about the expected phenotype of the gene, but they also can yield valuable experimental material for investigation of the function of the gene.

One difficulty in using knockout mutations to investigate gene function is that the knockout may be lethal. Complete absence of the gene/protein may lead to a non-viable mouse that cannot survive and a line that cannot be propagated for further study. Under such circumstances, a possible solution is to create a gene knockdown rather than a knockout.

7.1.4. Gene Knockdown Using RNAi

We have previously discussed the role of natural RNAi in gene regulation by miRNAs and siRNAs, but it is also possible to construct a gene that can be used for synthetic RNAi (Figure 7.5). The engineered synthetic gene places the sequence to be knocked down in the construct such that the sequence is found with its reverse complement in the same construct. Such a construct, when transcribed, will produce a hairpin structure having a double-stranded stem with a loop, referred to as a short hairpin. The short hairpin RNA is cut into 21-23 bp short dsRNAs, and one strand of the dsRNA is loaded into a RISC particle, where it can act to silence the target mRNA (Figure 7.6).

Figure 7.5. Formation of a short hairpin RNA that contains a double-stranded stem and a hairpin loop. This is formed from an engineered gene designed to produce the hairpin RNA upon transcription.

Figure 7.6. Gene silencing from a double-stranded RNA construct derived from the stem of a stem-loop structure. The construct is cut into 21-23 nucleotide short dsRNAs, one strand of which loads into a RISC particle, where it works to reduce or eliminate the target mRNA sequence and inhibit the translation of the mRNA.

In this case the expression of the target gene is dramatically lowered. It may in fact be eliminated, but by adjusting the target sequence within the target gene it is possible to achieve the desired level of expression knockdown, such that the functional significance of the gene can be evaluated.

7.1.5. Gene Editing Using CRISPR-CAS

One of the newest technologies for editing genes is the CRISPR-CAS9 technique. Follow the How CRISPR Works link, or watch the brief NOVA video on how CRISPR works. Note that this technology is just now emerging, and it is clear that it has many uses related to the technologies for gene functional analysis discussed above. It is very powerful and can be used to delete an entire gene, or to edit only a single nucleotide or a few nucleotides, to determine the role that specific parts of proteins play.

The ethical consequences of using CRISPR for gene therapy on humans have been subject to debate, and an interesting presentation is given in a TED Talk by Ellen Jorgensen.

7.1.6. Protein Localization Using GFP Tags

An important type of annotation for proteins involves determining where inside the cell the protein is normally localized. This can be accomplished using a jellyfish protein that naturally fluoresces green, called Green Fluorescent Protein (GFP). The GFP coding sequence can be fused to the 5’- or 3’-end of the coding sequence of the protein of interest, and this modified chimeric gene is inserted into the organism of choice using one of the techniques above.

It is preferable to examine the localization of a protein in the organism from which the protein comes, but sometimes this is not possible. Since most of the protein localization signals within protein sequences are universal, it is possible to make the construct in yeast and examine localization in the yeast system rather than in the system of origin.

7.1.7. Protein-Protein Interactions by Yeast 2-hybrid

Many proteins interact with one or more other proteins, and knowing the interacting partners of a protein is a critical part of understanding the function of the protein. For example, protein kinases or methylases that phosphorylate or methylate other proteins must of necessity interact with the proteins they modify. Subunits of multisubunit proteins must interact, and the proteins of larger complexes such as spliceosomes and ribosomes are all examples of protein-protein interactions.

The goal of a yeast two-hybrid experiment is to determine whether two or more proteins interact with each other. Review the role of the Gal4p protein in regulating the GAL-regulon genes that we discussed earlier. Recall that Gal4p has a DNA binding domain (BD) and an activation domain (AD) that is responsible for activating RNA polymerase to transcribe the gene associated with the UAS promoter.
To conduct a yeast two-hybrid experiment, a chimeric gene is first constructed in the yeast genome that has a UAS enhancer sequence ahead of a reporter gene, such as the lacZ gene from the lac operon in E. coli. Then two plasmid constructs are made. The first has the BD fused with an interacting protein that we will call the bait. The second plasmid has the AD fused with a second protein that we call the prey. Figure 7.7 describes the results obtained when the bait interacts with the prey. If they do not interact, then no transcription of the reporter occurs. By testing a series of proteins as preys, multiple candidate interacting partners of the bait can be tested.

Figure 7.7. Overview of the yeast two-hybrid assay, checking for interactions between two proteins, called here Bait and Prey. (A) The Gal4 transcription factor gene produces a two-domain protein (BD and AD), which is essential for transcription of the reporter gene (LacZ); (B, C) Two fusion proteins are prepared: Gal4BD+Bait and Gal4AD+Prey. Neither of them alone is sufficient to initiate transcription of the reporter gene; (D) When both fusion proteins are produced and the Bait part of the first interacts with the Prey part of the second, transcription of the reporter gene occurs.
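The logic of the two-hybrid assay in Figure 7.7 can be written down as a small truth-table model. The sketch below is only an illustration in Python, not a description of any real analysis software: the class `Fusion`, the functions `reporter_transcribed` and `screen_preys`, and the protein names `BaitX`, `PreyA`, `PreyB`, and `PreyC` are all hypothetical. The set of interacting pairs stands in for the biological truth that the experiment is trying to discover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fusion:
    """A Gal4 domain fused to a test protein (hypothetical model)."""
    gal4_domain: str  # "BD" (DNA-binding domain) or "AD" (activation domain)
    partner: str      # name of the fused bait or prey protein


def reporter_transcribed(fusions, interacting_pairs):
    """Return True if the reporter gene would be transcribed.

    Transcription requires a BD fusion (bait) and an AD fusion (prey)
    in the same cell whose fused proteins interact, which brings the
    AD to the UAS where the BD is bound.
    """
    baits = [f.partner for f in fusions if f.gal4_domain == "BD"]
    preys = [f.partner for f in fusions if f.gal4_domain == "AD"]
    return any(
        frozenset((bait, prey)) in interacting_pairs
        for bait in baits
        for prey in preys
    )


def screen_preys(bait, candidate_preys, interacting_pairs):
    """Test one bait against a series of preys; return the preys that score."""
    hits = []
    for prey in candidate_preys:
        cell = [Fusion("BD", bait), Fusion("AD", prey)]
        if reporter_transcribed(cell, interacting_pairs):
            hits.append(prey)
    return hits


# Hypothetical "true" interactions, unknown to the experimenter in practice.
TRUE_INTERACTIONS = {
    frozenset(("BaitX", "PreyA")),
    frozenset(("BaitX", "PreyC")),
}

if __name__ == "__main__":
    print(screen_preys("BaitX", ["PreyA", "PreyB", "PreyC"], TRUE_INTERACTIONS))
```

A cell carrying only one of the two fusions never scores in this model, which mirrors panels B and C of Figure 7.7. Real screens also produce false positives and false negatives, which this sketch deliberately ignores.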

7.2. PROTEOMICS

The study of genes and DNA sequences could be described as Genomics. A subdivision of genomics dealing with all the RNAs found in cells is referred to as Transcriptomics. Proteomics refers to all of the proteins expressed in organisms, their structure, and when and where they function in the organism and in cells. The study of the metabolites produced in cells can also be characterized as metabolomics. Understanding the integration of these types of information is referred to as systems biology.

Figure 7.8. The subdivisions of genomic biology and their relationship to the central dogma. (Diagram: DNA, Genomics; Transcription; RNA, Transcriptomics; Translation; Proteins, Proteomics; Cellular Catalysis; Metabolites, Metabolomics; Phenotype, Phenomics; the whole is labeled Systems Biology.)

7.2.1. What is Proteomics

Proteomics could be considered:

a) A catalog of all proteins expressed throughout the life cycle of the organism.
b) A catalog of all proteins expressed in each cell or tissue of an organism.
c) A catalog of all proteins expressed under all conditions in an organism.
d) A catalog of all proteins expressed in all tissues of an organism.
e) Understanding the structural properties of proteins.
f) Analyzing the function of all proteins in an organism:
   • To understand how proteins of an organism interact with each other.
   • To understand how proteins of an organism are modified and regulated.

While genomics has greatly facilitated proteomics projects, characterizing a proteome is considerably more complex than sequencing a genome. At the most basic level, there are far more proteins than genes in a eukaryotic organism. For example, humans possess approximately 25,000 genes, but are estimated to have between 200,000 and 2 million unique proteins. Many of these proteins are produced by alternative splicing. These splice variants are likely to have non-overlapping functions. In addition, the exact proteins that are expressed at any given moment depend on a person’s age, health, and environmental stimuli. To complicate matters further, the diverse chemical properties of proteins make it difficult to develop a “one size fits all” approach to characterizing the proteome. Instead, a wide variety of technologies is necessary.

7.2.2. Protein Modifications

There are numerous ways in which one gene can produce multiple protein sequences, and numerous mechanisms of posttranslational control of protein function, including covalent modification and other mechanisms.

Review Chapter 3, section 3.7.2. on mRNA splicing. Recall that Figure 3.57 shows an example of transcript splicing that generates calcitonin in thyroid tissue, while the same transcript generates CGRP in neuronal cells. This is an example of two unique proteins being created from a single gene transcript. This phenomenon is not uncommon. Often related proteins are made from the same primary transcript under different circumstances.

Protein function may be altered by posttranslational modifications as well. Posttranslational modifications are defined as any changes to the covalent bonds of a protein after it has been fully translated. There are numerous types of posttranslational modifications, a subset of which is shown in Table 7.1. Many of these, such as phosphorylation, acetylation, and methylation, have already been discussed. A detailed description of the chemistry involved in each of these posttranslational modifications is beyond the scope of this chapter. The list given here is meant to show the functional diversity of posttranslational modification and its importance.

Table 7.1. Chemical Modification of Proteins that Affect Functionality

1) Phosphorylation: activation and inactivation of enzymes
2) Acetylation: protein stability, used in histones
3) Methylation: regulation of gene expression
4) Acylation: membrane tethering, targeting
5) Glycosylation: cell–cell recognition, signaling
6) Hydroxyproline: protein stability, ligand interactions
7) Ubiquitination: destruction signal
8) Others
   a. Sulfation: protein–protein and ligand interactions
   b. Disulfide-bond formation: protein stability
   c. Deamidation: protein–protein and ligand interactions
   d. Pyroglutamic acid: protein stability
   e. GPI anchor: membrane tethering
   f. Nitration of tyrosine: inflammation

Another type of posttranslational protein modification is proteolytic cleavage, i.e., fragmenting of the protein caused by a specific protease degrading the protein. Besides the calcitonin and CGRP examples that we have already discussed (see Figure 3.57, Chapter 3, Section 3.7.2), many peptide hormones such as insulin are produced by proteolytic cleavage of a primary translation product. The digestive proteases trypsin and chymotrypsin are also examples of protein activation by cleavage of a peptide fragment from the protein.

7.2.3. Technology for Proteomic Analysis

The advent of proteomics required the development of technologies for studying multiple proteins simultaneously. A few of the most important of these are listed below.

2-D Gel Electrophoresis

2-D gel electrophoresis is one of the oldest proteomics technologies. In this approach, proteins are usually first separated by their charge in a tube of polyacrylamide with a pH gradient going from end to end. When a protein encounters a pH level where its charge is neutralized (its isoelectric point), it no longer moves along an applied electric field. Once proteins have been separated on the basis of charge, the tube is transferred onto a second gel slab with constant pH. Applying an electric field across the second gel will separate the proteins on the basis of their molecular weight. The end result is that each protein will have a unique x–y position on the 2-D gel. Samples of proteins identified by their position on the gel can be removed for further experimental analysis.

Figure 7.9. 2-D gel electrophoresis. Separation of proteins in the X-axis direction takes place from acidic to basic, while separation in the Y-axis direction takes place from low to high molecular weight. Usually one can identify over 2500 spots (unique proteins) in a 2-D protein gel.

Differential in gel electrophoresis is a recent development that has allowed researchers to compare proteomic profiles in two different samples more accurately. To understand this technology, consider two populations of E. coli, one grown in the presence of benzoic acid and the other grown in its absence. The proteins in one sample are labeled with one fluorescent dye (Cy3 in this case, blue in Figure 7.10), and the proteins in the second sample are labeled with another dye (Cy5, red in Figure 7.10). The two dyes are matched for charge and mass so that they will affect proteins migrating in a gel in the same way. Protein samples derived under the two conditions are then mixed together and loaded onto a single 2-D gel. After the gel has been run, it is exposed to light of one wavelength in order to excite the Cy3 dye and light of another wavelength in order to excite the Cy5 dye. Examples of the results are shown in Figure 7.10. Images captured in this way can be further processed by software to estimate differences in the expression of individual proteins under the two conditions. On the aggregate level, the images can be subtracted or overlaid to compare overall patterns of protein expression, and consequently to learn the effect of the treatment on the proteome.

Figure 7.10. Differential 2-D gel electrophoresis. Two protein samples are stained with either Cy3 or Cy5 fluorescent dyes (panels: with benzoic acid, Cy3; without benzoic acid, Cy5). The samples are mixed and separated as in Figure 7.9. Subsequently the gels can be analyzed by looking at each sample using light of the appropriate wavelength.

Despite the long history of 2-D gel electrophoresis, there are several caveats associated with this technique. First, it does not work well with very large or very small proteins, and low-abundance proteins are difficult to detect with this technique. Also, membrane-bound proteins cannot be characterized using 2-D gels. Unfortunately, the most promising drug targets belong to this class of proteins, and they may not be abundantly expressed. However, it remains a viable approach to identifying differential protein expression, one of the main goals of proteomic analysis.

The 2-D gel technique requires a way to identify the protein in each spot on the gel. What has emerged is the use of mass spectrometry to identify the spots once they have been removed from the gel.

Mass spectrometers are devices that measure the mass-to-charge ratios of ions. These ions might be very simple or as complex as peptides. Four components make up every mass spectrometer: an ion source, a mass analyzer, an ion detector, and a data acquisition unit. Because mass spectrometers are only able to analyze ions, a sample must first be ionized in the ion source. The mass analyzer typically consists of some combination of magnetic or electric fields that can be manipulated by the experimenter to determine the mass-to-charge ratio of an ion of interest. The ion detector measures the presence of ions, and the data acquisition unit allows experimental measurements to be analyzed by computer. The picture in the slide shows a researcher using a mass spectrometer.

Figure 7.10. A typical mass spectrum of a protein. The peaks represent the mass position of the various molecular ions made and analyzed.

A mass-spectrometry experiment can begin with extraction of a protein of interest from a 2-D gel. Typically, proteins are too large to be analyzed directly by mass spectrometers, so they must first be broken down into smaller, more manageable peptides. This is done by digesting the protein with a protease, such as trypsin, that cleaves the protein between specific amino acids. The mass spectrum generated is processed by a computer that attempts to identify the protein likely to be represented by the spectrum. Theoretical peptide mass fingerprints (i.e., mass spectra) are calculated for all proteins in a database. This is done by first identifying the trypsin cleavage sites in all proteins in the database and then calculating the mass of the peptides that would result from cleavage with trypsin. These calculated fragments are then compared with the fragments obtained from the mass-spectrometry experiment. A close match allows researchers to identify the protein represented by the experimental mass spectrum. In this way a library of calculated protein mass signatures can be used to identify an unknown protein derived from a gel.
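The database-matching step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm used by any particular search engine: the function names, the toy sequences `toy_A` and `toy_B`, and the 0.01 Da tolerance are all hypothetical. The sketch assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P), uses standard monoisotopic residue masses, ignores missed cleavages and modifications, and scores a protein simply by counting matched peptide masses.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def trypsin_digest(sequence):
    """Split a protein after every K or R that is not followed by P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fingerprint(sequence):
    """Theoretical peptide mass fingerprint of one protein."""
    return [peptide_mass(p) for p in trypsin_digest(sequence)]


def count_matches(observed, theoretical, tolerance=0.01):
    """Number of observed masses within tolerance of a theoretical mass."""
    return sum(
        any(abs(obs - theo) <= tolerance for theo in theoretical)
        for obs in observed
    )


def identify(observed, database, tolerance=0.01):
    """Return (best protein name, match count) for a set of observed masses."""
    scores = {
        name: count_matches(observed, fingerprint(seq), tolerance)
        for name, seq in database.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical two-protein "database" and a simulated measurement of toy_B.
TOY_DATABASE = {"toy_A": "GASKPVTRLLK", "toy_B": "MEEKDDRFFK"}
OBSERVED = [m + 0.002 for m in fingerprint("MEEKDDRFFK")]

if __name__ == "__main__":
    print(trypsin_digest("MEEKDDRFFK"))
    print(identify(OBSERVED, TOY_DATABASE))
```

Counting matches is the simplest possible score. Practical tools weight matches statistically and allow for missed cleavages, so this sketch should be read only as a picture of the compare-against-calculated-fragments idea.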
Protein Chips

Protein chips are able to simultaneously detect and quantitate thousands of different protein molecules. This involves fastening some method for detection of each specific protein to a matrix such as a nitrocellulose membrane. The diverse chemistry of proteins requires varied methods for detecting proteins and measuring their activity.

To date, protein chips have been designed to detect the presence of proteins by using antibodies; to detect protein–protein, protein–nucleic acid, protein–small molecule, and protein–lipid interactions; and to measure enzyme–substrate reactions. The image in Figure 7.11 comes from a protein-chip experiment that uses antibodies to detect yeast proteins. Each dot in the array represents a different protein. Note that the terms “protein” and “peptide microarray” are sometimes used in place of “protein chip.”

Figure 7.11. Protein chip showing various spots on a protein array. This figure came from a yeast experiment where the level of expression of yeast proteins was measured by the size and intensity of each spot on the array.

7.2.4. Structural Analysis of Proteins

One of the most valuable aspects of proteome and protein analysis is examination of protein 3-dimensional structure (tertiary structure). Structural analysis of proteins is typically done by X-ray crystallography. This is a costly and time-consuming process that requires purification and crystallization of the protein, and lengthy data collection and calculation to generate a 3-D structure related to the amino acid primary structure of the given protein.

There are tools at NCBI and at other sources that are available to examine the structural properties of proteins, compare protein structures, and search the structure database for proteins with similar structures. Because of the time and expense of determining a structure, these bioinformatic techniques can be very useful in comparing proteins.
Structural investigation of proteins allows pharmaceutical designers to investigate protein surfaces and determine binding sites for potential drugs once a protein is identified as a potential target for a therapeutic drug. Structures are also useful for investigating protein-protein interactions, by identifying specific amino acids on the protein surface that favor stronger or weaker interactions. The ability to investigate the properties of proteins in this way is an important tool in the design of new drugs.

We will examine this site more in the laboratory portion of the course.

7.3. TRANSCRIPTOME AND GENE EXPRESSION

The transcriptome of an organism is defined as:

a) All the mRNAs and other RNAs expressed throughout the life cycle of the organism.
b) All mRNAs and other RNAs expressed in each cell or tissue of an organism.
c) All mRNAs and other RNAs expressed under all conditions in an organism.
d) All mRNAs and other RNAs expressed in all tissues of an organism.

Transcriptomics, or transcriptome analysis, is the investigation of all RNA sequences made by an organism, defining when and where in the organism each one is made. As such, quantitatively measuring the level of every mRNA the organism makes, in every tissue and in response to various environmental signals, is a major goal of transcriptomics.

As previously discussed in Chapter 3, section 3.7 (see Figure 3.54), the expression of genes is controlled at various levels, including transcriptional regulation, processing regulation, several posttranscriptional steps, mRNA degradation regulation, translational regulation, and several posttranslational steps. The sum effect of these regulatory steps establishes the level of a given mRNA inside a cell.

Note that the level of any molecule inside a cell results from the rate at which the molecule is made and the rate at which the molecule is degraded. Thus, if you increase the rate at which a translatable mRNA molecule is produced but do not change the rate at which it is degraded, the mRNA will accumulate to a higher level inside the cell. We call the level of mRNA that results from balancing synthesis and degradation the steady-state level, i.e. the level that exists when the rate of production/synthesis is equal to the rate of degradation.

Two additional caveats are critical. First, the regulatory steps that we discussed in Chapter 3 are able to modulate the steady-state level of an mRNA. Second, the number of protein molecules made from an mRNA is directly proportional to the steady-state level of the mRNA.

We also know that combinatorial gene regulation (Chapter 3, section 3.5.6, Figure 3.33) is a significant part of the reason groups of genes are controlled differently in different tissues and in response to different signals. This means that transcriptional regulation controls an organism’s transcriptome, i.e. the ultimate level of expression of all genes, under all circumstances, in all tissues of the organism.

Targeting specific individual mRNAs that have been implicated in specific processes is one way to assess this complex pattern of gene regulation and to determine changes in the expression of a gene. This was in fact the initial way in which the transcriptome was analyzed, and it remains valuable for validating the newer multi-transcript approaches that allow assessment of nearly all RNAs made by an organism simultaneously.

7.3.1. Single Transcript Abundance Estimation

Three approaches to estimating steady-state levels of single transcript abundance have been developed. These approaches can be either semi-quantitative, e.g. Northern blotting and RT-PCR, or quantitative, e.g. real-time PCR. Note that although these approaches were originally developed to analyze the expression of mRNA one sequence at a time, they remain useful as tools for validating the expression of individual genes in more complex genome-wide approaches.

Northern Blot Analysis of RNA

Northern blotting analyzes RNA in much the same way that Southern blotting analyzes DNA (see Figure 7.12). RNA is extracted from the cell and size-separated by gel electrophoresis. Multiple samples can be run on the same gel, limited only by the width of the gel and the number of lanes available for loading samples. Once the sample is separated, the RNAs are eluted directly from the gel and bound to a membrane filter, making a copy of the gel blotted onto the filter. All of the mRNAs are then bound to the filter, so a means of identifying each mRNA you wish to analyze is required.

Individual RNAs bound to the membrane are identified by hybridization between the bound cellular RNAs and a labeled (usually DNA) probe that is complementary to the mRNA to be identified. Hybridization of the probe leads to the production of a spot on the blot corresponding to the location of the RNA being detected. Based on the size and intensity of the spot, the amount of RNA detected by the probe can be estimated, and based on the position on the blot, the size of the RNA can be determined (if size standards were included).

In addition to size, Northern blot analysis is used to determine whether a specific mRNA is present in a cell type and, if so, at what level (size and intensity of the spot). The steady-state level of the mRNA is estimated from the size and intensity of the spot, and gene expression is measured in this way.
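Because a Northern blot spot is read as an estimate of the steady-state mRNA level, it can help to see how that level follows from the balance of synthesis and degradation described in section 7.3. The short Python sketch below is only an illustration: it assumes a constant synthesis rate and first-order degradation (a fixed fraction of the mRNA pool lost per hour), neither of which is stated in the text, and the rate values are hypothetical.

```python
# Minimal sketch: steady-state mRNA level from synthesis and degradation.
# Assumptions (not from the text): synthesis is constant and degradation is
# first-order, i.e. dM/dt = synthesis_rate - degradation_constant * M.
# All numbers are hypothetical.

def steady_state(synthesis_rate, degradation_constant):
    """Level at which synthesis equals degradation (dM/dt = 0)."""
    return synthesis_rate / degradation_constant

def simulate_mrna(synthesis_rate, degradation_constant, hours, dt=0.01, start=0.0):
    """Follow the mRNA level over time with simple Euler steps."""
    level = start
    for _ in range(int(round(hours / dt))):
        level += (synthesis_rate - degradation_constant * level) * dt
    return level

# 100 molecules made per hour, half of the pool degraded per hour.
baseline = steady_state(100.0, 0.5)   # 200.0 molecules per cell

# Double the synthesis rate, leave degradation unchanged.
induced = steady_state(200.0, 0.5)    # 400.0 molecules per cell

# Starting from zero, the simulated level climbs toward the induced value.
simulated = simulate_mrna(200.0, 0.5, hours=40.0)

print(baseline, induced, round(simulated, 3))
```

Under these assumptions, doubling the synthesis rate while leaving degradation unchanged doubles the steady-state level, which is the behavior the text describes in words.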

Figure 7.12. Northern blot analysis of RNA. Extracted RNA is placed in the wells of an agarose gel, and the RNAs are separated on the basis of size. The size-separated RNAs are blotted onto a nitrocellulose membrane. The membrane is removed from the gel and probed with a labeled probe that allows detection of a specific mRNA species. The labeled probe is then detected, and an image is produced showing the position of the labeled probe in the original gel. Note that the example shown here uses radioactively labeled probes and detection by autoradiography. However, newer techniques employ chemiluminescent probes and detect light using a very sensitive camera.
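When size standards are run alongside the samples, the size of a detected RNA can be estimated from how far its band migrated. The sketch below shows one common way to do this, under the assumption that the logarithm of RNA size falls roughly linearly with migration distance over the range of the standards. The ladder sizes and distances are hypothetical values chosen for illustration.

```python
import math

# Minimal sketch: estimate RNA size from migration distance.
# Assumption: log10(size) is approximately linear in migration distance
# across the range covered by the size standards.

def fit_size_curve(standards):
    """Least-squares fit of log10(size_nt) = slope * migration_mm + intercept."""
    distances = [d for d, _ in standards]
    log_sizes = [math.log10(s) for _, s in standards]
    n = len(standards)
    mean_d = sum(distances) / n
    mean_s = sum(log_sizes) / n
    sxx = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((d - mean_d) * (s - mean_s) for d, s in zip(distances, log_sizes))
    slope = sxy / sxx
    intercept = mean_s - slope * mean_d
    return slope, intercept

def estimate_size(migration_mm, slope, intercept):
    """Convert a migration distance back to an estimated size in nucleotides."""
    return 10 ** (slope * migration_mm + intercept)

# Hypothetical ladder: (migration distance in mm, size in nucleotides).
ladder = [(10.0, 8000), (20.0, 4000), (30.0, 2000), (40.0, 1000), (50.0, 500)]
slope, intercept = fit_size_curve(ladder)

# A band detected 35 mm from the well lies between the 2000 and 1000 nt standards.
band_size = estimate_size(35.0, slope, intercept)
print(round(band_size))
```

The estimate is only as good as the standards: bands that migrate outside the range of the ladder should not be sized by extrapolation.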

To be able to accurately determine the relative amounts of RNA, a number of conditions must be met when performing a Northern blot. First, it is essential to verify the integrity of the RNA. If the RNA is partially degraded, the hybridization intensity will not accurately reflect the amount of RNA in the original sample. Second, there must be more labeled probe than there is RNA complementary to it. If not, the amount bound will reflect the amount of probe put into the hybridization mix, and there will be competition among the complementary RNAs in different positions on the blot. Third, sufficient time needs to be allowed for the probe to find its complementary RNA. Finally, there should be an independent means of showing that approximately equal amounts of RNA were loaded in each lane of the gel. This task is usually done by using a second probe complementary to an RNA that is found in equal amounts in the various RNA samples. This loading control can also be used to normalize the hybridization intensities by adjusting for discrepancies in the amount of RNA in each lane.

Northern blotting is widely used to study changes in the expression of individual genes in development, tissue specialization, or the response of cells to various physiological stimuli. Note that the technique requires a sequence-specific probe in order to be useful. If the sequence of the gene of interest is unknown, it may not be possible to generate a sequence-specific probe. Thus, Northern blotting requires sequence information concerning the gene to be studied.

Figure 7.13. Examples of Northern blots showing typical results. [A] Section of a Northern blot showing the expression of the AIM1 gene in root (R), leaf (L), stem (S), cotyledon (C), silique (Si), and flower (F) tissues. The size and intensity of the spot correspond to the relative steady-state level of mRNA in each cell type. [B] A comparison of the expression of the AIM1 and AtMFP2 genes in tissues showing low expression (R & L) with tissues showing high expression (Si & F). [C] Comparison of AIM1 expression in root with expression at varying times in either total darkness (etiolated) or continuous light. Note the rRNA loading control shown in (C), showing that equivalent amounts of RNA were added to each lane of the gel.

RT-PCR

RT-PCR (Reverse Transcriptase – Polymerase Chain Reaction) is a technique that uses reverse transcriptase to make a DNA copy of each mRNA strand, called a first-strand cDNA (Figure 7.14). First-strand complementary DNAs (cDNAs) are generated from each individual mRNA using an oligo(dT) primer that anneals to the poly(A) tails of mRNAs. The specific mRNA of interest is subsequently analyzed using sequence-specific PCR primers to amplify a part or all of the given mRNA.

Figure 7.14. Reverse transcription of an mRNA followed by PCR to amplify the DNA strand. An oligo-dT primer is used to prime the synthesis of a first-strand cDNA. This is then amplified by PCR using a forward (light green) and a reverse (yellow) primer.

Following PCR, the products are separated by agarose gel electrophoresis and visualized using a double-stranded-DNA-specific dye (Figure 7.15).

One of the difficulties with RT-PCR is knowing when to stop the PCR. If enough cycles of PCR are run to use up all of the dNTPs in the reaction, then the amount of product formed will underestimate the amount of mRNA in the sample, and if too few PCR cycles are run, a measurable amount of product may not be formed. This means that estimation of the true amount of mRNA in the sample must be done using several different numbers of PCR cycles, and even then it may not be possible to make a valid comparison.

Figure 7.15. RT-PCR technology. PCR-amplified first-strand cDNA is separated by agarose gel electrophoresis. The PCR products are visualized using a dsDNA stain, and the image generated is photographically recorded.

Real-time PCR

Real-time PCR, or quantitative PCR (qPCR), is a more quantitative way to measure changes in mRNA levels than either Northern blots or RT-PCR. Reverse transcriptase is used with an mRNA template to generate first-strand cDNA as in RT-PCR, but the PCR amplification step is done in the presence of a dye such as SYBR Green, which stains only dsDNA as it is being made. The thermocycler used for quantitative PCR is equipped with a laser detector that detects SYBR Green fluorescence in real time as the dsDNA is being made.

Analysis of the fluorescence output allows the determination of the number of cycles of PCR required to reach a level of dsDNA that is about halfway to the maximum that can be achieved, and this is used to estimate the number of mRNA molecules in the original sample.
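Both the stopping problem in RT-PCR and the cycle-counting idea behind real-time PCR can be illustrated with a toy amplification model. The sketch below is a simplification, not a description of real reaction kinetics: it assumes perfect doubling in every cycle and treats reagent exhaustion as a hard ceiling on product. The starting copy numbers and the ceiling are hypothetical.

```python
# Minimal sketch: endpoint versus cycle-counting readouts of PCR.
# Assumptions (illustrative only): product doubles exactly each cycle until
# it reaches a fixed plateau that stands in for exhausted reagents.

def amplify(initial_copies, cycles, plateau_copies):
    """Return the product amount after each cycle, capped at the plateau."""
    history = []
    copies = float(initial_copies)
    for _ in range(cycles):
        copies = min(copies * 2.0, plateau_copies)
        history.append(copies)
    return history

def cycles_to_half_max(history, plateau_copies):
    """First cycle at which product reaches half of the plateau, or None."""
    for cycle_number, copies in enumerate(history, start=1):
        if copies >= plateau_copies / 2.0:
            return cycle_number
    return None

PLATEAU = 1e9                            # hypothetical maximum product
low = amplify(1_000, 30, PLATEAU)        # sample with fewer starting copies
high = amplify(8_000, 30, PLATEAU)       # sample with 8-fold more copies

# Endpoint after 30 cycles: both samples have hit the plateau and look equal.
endpoint_low, endpoint_high = low[-1], high[-1]

# Cycle counting: the 8-fold richer sample gets there 3 cycles earlier.
half_low = cycles_to_half_max(low, PLATEAU)     # 19
half_high = cycles_to_half_max(high, PLATEAU)   # 16

print(endpoint_low == endpoint_high, half_low, half_high)
```

In this model the endpoint measurement hides the 8-fold difference entirely, while the number of cycles needed to reach half-maximum differs by three, one cycle for each doubling of starting template.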

Figure 7.17. Quantitative PCR. Analysis of the fluorescent output reveals how many cycles of PCR are required to get to a given level of fluorescence. This allows comparison of samples for the number of copies of the mRNA detected by the PCR primers.
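A comparison of this kind is often carried out as what is commonly called the ΔΔCt calculation: the cycle count for the mRNA of interest is first corrected by the cycle count for a reference RNA in the same sample, and the corrected values are then compared between treatments. The Python sketch below assumes that dsDNA doubles exactly every cycle. The cycle values are hypothetical replicate measurements, and the function and variable names are illustrative.

```python
from statistics import mean

# Minimal sketch of a relative quantification from qPCR cycle counts.
# Assumption: product doubles every cycle, so a difference of one cycle
# corresponds to a 2-fold difference in starting template.

def relative_expression(target_treated, reference_treated,
                        target_control, reference_control):
    """Fold change of the target mRNA in treated versus control samples.

    Each argument is a list of replicate cycle counts (cycles needed to
    reach the chosen fluorescence level).
    """
    # Correct each sample for the amount of RNA loaded, using the reference.
    delta_treated = mean(target_treated) - mean(reference_treated)
    delta_control = mean(target_control) - mean(reference_control)
    # Fewer cycles means more starting template, hence the negative exponent.
    return 2.0 ** -(delta_treated - delta_control)

# Hypothetical replicate cycle counts.
fold_change = relative_expression(
    target_treated=[22.0, 22.5, 21.5],
    reference_treated=[18.0, 18.25, 17.75],
    target_control=[25.0, 25.5, 24.5],
    reference_control=[18.0, 18.0, 18.0],
)
print(fold_change)
```

Here the target reaches the signal level three cycles earlier in the treated sample after correction by the reference, which corresponds to an 8-fold higher mRNA level under the doubling assumption.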

By subsequently running a reference RNA, one can ensure that each sample is compared on the basis of the same amount of cellular RNA, and by running replicated samples from different treatments, the level of mRNA expression can be compared. Since the amount of dsDNA doubles each PCR cycle, the number of PCR cycles needed to reach a given amount of signal corresponds to the relative number of mRNA molecules in each sample. Because the amount of product made is estimated in real time, the limitations of simple RT-PCR are largely overcome by quantitative PCR. Additionally, one calculates a real comparison and is able to correct for sample loading differences. This makes qPCR the standard for estimating gene expression differences, and it is the most commonly used tool for validating individual mRNAs in the genome-wide studies to be described below.

Figure 7.16. Quantitative or real-time PCR is a form of RT-PCR, but a thermocycler is used that measures the amount of dsDNA formed in real time, i.e. with each cycle of PCR. As a consequence, it is not necessary to run electrophoresis to separate the products of PCR.

7.3.2. Genome-wide Transcript Abundance Estimation

Because of the single-gene techniques above, we learned a great deal about eukaryotic gene expression, but the advent of high-throughput DNA sequencing techniques requires the ability to survey the expression of the entire transcriptome simultaneously. Such techniques are referred to as Genome-wide Transcript Abundance (GWTA) estimation techniques. We will discuss the microarray as a tool for GWTA estimation, and the transcriptome sequencing techniques, such as Massively Parallel Signature Sequencing (MPSS), that are now the standard. We will also briefly mention Serial Analysis of Gene Expression (SAGE) techniques.

Microarrays

Microarrays permit the simultaneous analysis of the RNA expression of thousands of genes. For fully sequenced genomes, microarrays can be used to analyze the expression of every gene. Northern blots, on the other hand, are limited by the number of lanes on the gel and by the number of probes that can be used on the same blot. Northern blots normally have 20–40 lanes, and no more than three probes can be used simultaneously. Thus, microarrays increase the throughput by several orders of magnitude.

Another difference between microarrays and Northern blots is that in microarrays the DNA sequences that correspond to the labeled “probes” of a Northern blot are attached to a solid support that can be glass, plastic, or a nylon membrane, while the mRNAs, which are separated by size and immobilized in a Northern blot, are labeled either directly or through a cDNA intermediary. Thus, on the microarray, the bound DNA probe must be in excess, just as the free DNA probe is in a Northern blot. To be consistent with the terminology of Northern blots, for microarrays the bound DNA is referred to as the “probe,” and the labeled RNA or cDNA is called the “target.”

Most microarray experiments compare the RNA populations found in two different samples. The samples can be tumor tissue and normal tissue, cells that have received a drug treatment and cells that have not, or cells at two different points in the cell cycle (see Figure 7.18), etc.

Labeling of the target RNA is usually performed by generating a single-stranded cDNA using the enzyme reverse transcriptase. One method of labeling uses fluorescently labeled nucleotides that are incorporated into the cDNA during the reverse-transcription reaction. This is generally how nucleotides labeled with the dyes Cy3 and Cy5 are incorporated into targets used in competitive hybridization. Many different labels can be incorporated, depending on the type of microarray experiment that is being performed.

[Figure 7.18 panel labels: Biological treatment to generate samples; Preparation of RNA; Preparation and labeling of cDNAs (Cy3, Cy5)]
[Figure 7.18 panel labels: Hybridization of labeled cDNAs with fixed probe set; Image analysis of hybridization pattern]

Figure 7.18. A yeast microarray experiment to compare the effect of two treatments, such as budding cells versus sporulating cells, on the expression of all genes in the yeast genome. Samples were labeled with Cy3 and Cy5, followed by a comparative analysis of the amount of each label in each sample.

For experiments in which two different RNA populations are analyzed on the same microarray (competitive hybridization), two dyes are used that fluoresce at different wavelengths, most commonly Cy3 and Cy5. Labeling can also use biotin-conjugated RNA bases. Fluorescently labeled avidin is then bound to the biotin.

Additionally, several different solid supports for the arrayed probes can be used, including membrane supports such as nitrocellulose, glass (microscope slides), and computer chips. Spotted microarrays are usually produced on glass microscope slides, while a process called photolithography is used to produce computer chips. Photolithography leads to the synthesis of the probe directly on the computer chip.

Once fluorescently labeled samples from each treatment have been prepared, they can be mixed and hybridized to the sequence array. Each labeled RNA sequence will find an appropriate location in the array where it can hybridize with the probe located at that position.

Confocal laser scanning microscopy is used to determine the amount of fluorescently labeled target that has hybridized to the DNA on the microarray (Figure 7.19).

Figure 7.19. Confocal laser scanning microscopy to detect positional fluorescence in a microarray. A laser beam is aimed at each spot on the microarray. The fluorescent light that is emitted upon excitation of the dye passes through a pinhole that effectively eliminates all surrounding light. This condition permits a precise determination of the level of fluorescence coming from the hybridized target at a single spot on the microarray. For competitive hybridization, the microarray is scanned twice, using different wavelengths for each of the fluorescent dyes Cy3 and Cy5.

If the specific RNA is expressed in both treatments, both Cy3- and Cy5-labeled cDNAs will hybridize at that spot. At another spot perhaps only Cy3-labeled cDNA hybridizes, while at still other locations only Cy5-labeled cDNA will hybridize. Thus, by measuring the amount of each dye detected at each position in the array, one can determine the level of expression of that specific mRNA. All of the conditions for quantitative mRNA hybridization discussed for Northern blots also apply to microarrays, and it may not be possible to establish optimal conditions for all of the probes on the array. This can lead to misestimation of some of the RNAs in the samples.

Figure 7.20. A microarray image showing spots with predominantly Cy3-expressing sequences (), predominantly Cy5-expressing sequences (), and equally expressed sequences (). [Labels in figure: CDKNIA, MYC]

SAGE – Serial Analysis of Gene Expression

There are several other means of obtaining genome-wide expression profiles, including SAGE (Serial Analysis of Gene Expression). SAGE differs from microarray techniques because it requires a DNA sequencing step. The basic concept behind SAGE is that the abundance of different RNAs in a sample can be determined by sequencing each cDNA made from the RNA. In SAGE, instead of sequencing the entire cDNA, only a short DNA sequence tag is sequenced, but this is enough sequence to identify the transcript unambiguously. The number of occurrences of the tag is then determined for the sample, and these counts are compared between treatments.

A “tag” sequence is cut from each cDNA using restriction enzymes. The tags are then ligated together in a specific way to form a concatemer. Concatemers are then further ligated together and sequenced. The borders of each tag are identified using software that recognizes the four-base restriction site and the tag length. The sequences of the individual tags are then compared with the known sequences of the 3’ untranslated regions of the genes for the organism under analysis. The relative abundance of each tag is calculated and is taken as a measure of the level of expression of the associated gene.

Setting up SAGE is time-consuming and laborious. Thus, SAGE is not widely used, although it was the first sequence-based method for RNA abundance estimation.

Transcriptome Sequencing Approaches to Gene Expression Analysis

The advent of high-throughput DNA sequencing techniques has created a number of cDNA-sequencing approaches to evaluating gene expression. These expand on SAGE but are much less laborious and costly. One of the early methods is called MPSS, or Massively Parallel Signature Sequencing.

MPSS is used to analyze the level of gene expression in a sample by counting the number of individual mRNA molecules that each gene in the organism produces. By analyzing the tag counts in RNA samples prepared from different treatments, the expression of genes can be compared across the treatments. Basically, “tagged” products from cDNA are amplified by PCR so that each mRNA molecule produces circa 100,000 PCR products with a unique tag. The tags are used to attach the PCR products to microbeads used for DNA sequencing with a next-generation sequencer. A sequence signature of ~16–20 bp of high-quality sequence is obtained. This is the signature sequencing part of the analysis, and it is performed in parallel.
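Both SAGE and MPSS come down to reducing each transcript to a short tag and counting how often each tag occurs. The border-finding step described for SAGE can be sketched in a few lines of Python. This is a deliberately simplified illustration: it assumes the four-base site is CATG and the tag length is 10 bases, neither of which is specified in the text, and it ignores details of real SAGE libraries such as the pairing of tags before concatenation. The tag sequences and gene names are hypothetical.

```python
from collections import Counter

# Minimal sketch: split a sequenced concatemer into tags and count them.
# Assumptions (not from the text): the four-base restriction site is CATG
# and each tag is 10 bases long.
ANCHOR_SITE = "CATG"
TAG_LENGTH = 10

def extract_tags(concatemer, anchor=ANCHOR_SITE, tag_length=TAG_LENGTH):
    """Return the fixed-length tags that follow each anchor site."""
    tags = []
    position = concatemer.find(anchor)
    while position != -1:
        start = position + len(anchor)
        tag = concatemer[start:start + tag_length]
        if len(tag) == tag_length:          # skip a truncated tag at the end
            tags.append(tag)
        # Resume the search after the tag, so a CATG inside a tag is ignored.
        position = concatemer.find(anchor, start + tag_length)
    return tags

# Hypothetical tags and the genes whose 3' UTRs they are taken to match.
tag_to_gene = {"AAACCCGGGT": "geneA", "TTTGGGAAAC": "geneB"}

concatemer = ("CATG" + "AAACCCGGGT" +
              "CATG" + "TTTGGGAAAC" +
              "CATG" + "AAACCCGGGT")

tags = extract_tags(concatemer)
tag_counts = Counter(tags)
gene_counts = Counter(tag_to_gene[tag] for tag in tags if tag in tag_to_gene)
print(gene_counts)
```

Running the same counting on concatemers from a second treatment would give a second set of counts, and the two sets can then be compared gene by gene.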
Since at least a million sequence signatures are obtained per experiment (from millions of beads) in parallel, the procedure is referred to as massively parallel signature sequencing.

Computer analysis is then used to count and analyze each signature sequence (MPSS tag) in an MPSS dataset, and the level of expression of any single gene is calculated by dividing the number of signatures from that gene by the total number of signatures for all mRNAs present in the dataset.

Figure 7.21. A sample MPSS data set showing the counts (on right) for seven different transcripts. Note the 4-base sequence ligated to each tag, used to identify the start of the sequence tag (to the right).

MPSS has routine sensitivity at a level of a few molecules of mRNA per cell, and the datasets are in a digital format that simplifies the management and analysis of the data.

Today, another, more popular approach involves obtaining a cDNA library from the given tissue and/or treatment. Then, using high-throughput cDNA sequencing, random fragments (usually 50 or 100 bp in length) from the library are sequenced using an approach like whole genome shotgun sequencing. These fragments are then assembled to create full-length transcripts for all genes represented in the dataset. The number of short reads contributing to each full-length sequence is then computed, and provides a basis for gene expression comparisons for all genes in the transcriptome. This analysis can be performed in tandem for as many treatments/tissues as you wish to compare, and it becomes a valid basis for expression analysis that requires a minimum of preexisting knowledge of the transcriptome. Thus, this technique is useful for organisms with fully sequenced genomes, and it also allows the investigation of organisms that do not have sequenced genomes available.
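The count-based expression measure described for MPSS, signatures from one gene divided by all signatures in the dataset, is easy to express in code, and the same idea applies to counts of short reads per transcript. The sketch below scales the fraction to counts per million, which is a presentation choice assumed here rather than something stated in the text. The gene names and counts are hypothetical.

```python
# Minimal sketch: normalize tag (or read) counts by the dataset total so that
# datasets of different sizes can be compared.
# Assumption: values are reported per million counts for readability.

def counts_per_million(counts):
    """Divide each gene's count by the dataset total and scale to one million."""
    total = sum(counts.values())
    return {gene: 1_000_000 * n / total for gene, n in counts.items()}

# Hypothetical datasets; treatment B was sequenced twice as deeply as A.
treatment_a = {"gene1": 50, "gene2": 150, "gene3": 800}
treatment_b = {"gene1": 300, "gene2": 300, "gene3": 1400}

cpm_a = counts_per_million(treatment_a)
cpm_b = counts_per_million(treatment_b)

# Ratio of normalized expression, treatment B relative to treatment A.
ratios = {gene: cpm_b[gene] / cpm_a[gene] for gene in treatment_a}
print(ratios)
```

In this example the raw count for gene2 doubles between the datasets, but its normalized level is unchanged, because the second dataset simply contains twice as many counts overall; gene1, by contrast, is three-fold higher after normalization.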