<<

Commentary

Identifying expressed

Katherine J. Martin* and Arthur B. Pardee

Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA 02115

he study of expressed genes has had a oligo(dT) primers that anneal at the 5Ј lated more than 44,000 novel genes that Tgreat impact on biological research ends of poly(A) regions of mRNAs. were not already discovered by the EST (1). Expressed genes are the basic func- Hence, the cDNAs produced are devoid of project. Sequences from both of the EST tional units of genomic DNA. Because long poly(dA) regions. projects are collected in the these regions cannot be identified from database of ESTs (dbEST, http:͞͞www. genomic sequence information per se, the Inception of Expression Analysis ncbi.nlm.nih.gov͞dbEST), which is main- ’s products, messenger RNAs or pro- Information on expressed genes has clas- tained by the National Center for Biotech- teins, must be isolated from cells, directly sically originated from studies of individ- nology Information. GenBank and dbEST sequenced, and identified. As we steadily ual cDNAs, identified and cloned by vir- sequences then are organized into a build up sizable expression databases that tue of their particular importance to a nonredundant list of unique genes by currently include more than 92,000 of the specific topic of research. For the past 20 the UniGene project (http:͞͞ www.ncbi. roughly 100,000 total human genes, the years, DNA sequence information for nlm.nih.gov͞UniGene), which is consid- process of identifying the remaining un- these functionally characterized genes has ered the most regularly updated source for discovered genes is becoming progres- been entered into databases including high-quality, nonredundant information sively more difficult. Existing databases of GenBank, the European Molecular Biol- on expressed genes. CGAP also is creating expressed genes now include virtually all ogy Laboratory, and the DNA Data Base a public database, SAGEmap, to provide of the abundantly expressed human of Japan, which share their respective quantitative data (4). genes—the easier to reach ‘‘low hanging contents. Public human expression databases now fruit on the tree,’’ as well as many middle A random approach recently has been are believed to include a large percentage and rarely expressed genes. The genes that used as a part of a large-scale effort to of all genes. UniGene currently lists se- are still undiscovered are expressed at low collect DNA sequence information for all quence information for 92,571 different levels or are specifically expressed only in expressed human genes. This approach expressed genes (UniGene build #108, certain cell types, developmental stages, entails sequencing partial cDNA clones Feb. 19, 2000). It is noted that the algo- or growth conditions. Such genes hold the generated from mRNA and is termed rithms used by UniGene to cluster redun- COMMENTARY promise of including key regulatory fac- expressed sequence tag (EST) analysis dant sequences are experimental and tors responsible for differentiated pheno- (3). EST sequences generally represent hence this number may increase or types, developmental progression, or cell 200–800 bp of first-pass sequence infor- decrease with improvements and the ad- growth regulation. As we move forward to mation extending in from mRNA 3Ј ends. dition of new sequences. Further, some identify these genes, highly efficient meth- Two large public EST sequencing projects, sequences currently considered to be dif- ods of removing, i.e., subtracting, the bulk the EST project and the Cancer Genome ferent genes may in fact represent non- of identified, abundant genes from cDNA and Anatomy Project (CGAP, http:͞͞ overlapping regions of the same gene. libraries are required. www.ncbi.nlm.nih.gov͞ncigap), have been Hence, more complete sequence informa- In this issue of PNAS, Wang and col- initiated to rapidly identify, i.e. obtain at tion, e.g., from genomic data, also may leagues (2) discover a flaw in current least partial sequence information for all reduce the UniGene tally. The ultimate subtraction methods, which are now expressed genes. target is also uncertain. Estimates of the widely used to identify novel expressed The first EST project (3) was begun in total number of expressed human genes genes. They show that the long poly(A) 1991 and to date has accumulated se- range between 60,000 and 150,000 (5–7). regions present in most expressed mRNAs quences for a total of approximately Serial analysis of gene expression (SAGE) generate a serious problem in subtraction 48,000 different genes. Rates of novel results indicate that 46% of genes cur- reactions. Long poly(dT) regions of tester gene discovery by the EST project were rently have no matches in existing data- cDNA, which is generated from the RNA initially high, but have declined sharply in bases (2, 7), hence predicting a total of of interest, randomly hybridize with long recent years. Ninety percent of the 48,000 130,000 genes. Although precise values for poly(dA) regions of driver cDNA gener- ESTs were accumulated in the 4 years the total number of expressed human ated from the comparison cell type, re- before 1997 and only modest numbers genes or the number that have already sulting in template loss. This loss particu- were added in the past 3 years. The second been identified are uncertain, it is likely larly affects low abundance mRNAs. large-scale expressed sequence tag that current databases are close to Wang et al. predict that this flaw will limit project, the Cancer Genome and Anat- complete. the usefulness of current subtraction omy Project (CGAP) (http:͞͞www.ncbi. The dbEST obtains its information methods and result in a fall-off in gene nlm.nih.gov͞ncicgap), was begun in late from many different types of libraries and identification rates before the identifica- 1996. CGAP currently is maintaining high tissues. Libraries include PCR-amplified tion of all genes is completed. They report rates of gene discovery by applying the and unamplified libraries, normalized, a conceptually and technically straightfor- latest techniques in tissue procurement, subtracted, and unaltered libraries, as well ward solution that significantly enhances cDNA library preparation, and bioinfor- the efficiency with which novel expressed matics to sequence expressed genes in genes are identified. They generate sub- cancerous, precancerous, and normal cell See companion article on page 4162. traction libraries by using short, anchored lines and tissues. CGAP now has accumu- *To whom reprint requests should be addressed.

PNAS ͉ April 11, 2000 ͉ vol. 97 ͉ no. 8 ͉ 3789–3791 Downloaded by guest on October 1, 2021 as libraries generated with a new reverse Subtraction and Normalization Methods gene identification rates once current transcriptase that produces very long As sequencing continues, the remaining methodologies become limiting. They cDNA clones. Libraries are prepared from unidentified genes are becoming progres- identify a problematic area of current more than 1,000 different human tissues sively harder to find because they are of subtraction and normalization methods and cell lines, including normal and cancer progressively lower abundance and are that may limit the usefulness of these cells, different developmental stages and more cell-type restricted. It is important methods. growth states, as well as microdissected, that databases include even these most Conventionally, reverse is bulk, and pooled tissues. scarce and tissue-specific genes, as these performed with a poly(dT) primer that A parallel ongoing effort is to sequence have the potential to include many of the anchors randomly along the approxi- the complete . Draft se- most biologically interesting regulatory mately 200-bp poly(A) tails of mRNAs quence currently is reported for 47% (Hu- and creates long poly(dA͞dT) sequences ͞͞ factors. Subtraction and normalization are man , http: www.ncbi. key methods in the process of identifying in the cDNAs. During subtraction, rare nlm.nih.gov͞genome͞seq) and more than cDNAs are lost at a high rate because of ͞͞ such genes. 90% (Celera , http: www. Normalization methods selectively re- random hybridization of their long celera.com) of the genome. Upon com- duce the level of representation of abun- poly(dA) regions with driver poly(dT). pletion, these projects will provide the dant genes so that the resulting mRNAs Wang et al. (2) demonstrate a method that basic fundamental DNA sequence infor- overcomes this problem: the construction ϫ 9 preparations contain both abundant and mation for the entire 3 10 bp of the rare genes at similar levels. Before nor- of subtraction libraries by using short, human genome. Genomics has made key malization, a typical cell expresses 1,000– anchored oligo(dT) primers. These prim- contributions to biology and medicine. 2,000 different abundant and middle ers are composed of 11 dTs plus one or two other 3Ј bases that anchor their hy- However, its greater value may be as a abundant messages at levels of Ͼ500 and component of integrated resources of bridization at the 5Ј ends of poly(dA) 15–500 copies͞cell, respectively. These genomic data plus data on the 3% of its sequences. The short poly(dA͞dT) tails represent 50–65% of the cell’s total sequence that is translated. Integrated da- produced are less likely to cause the re- mRNA mass (17, 18). In contrast, approx- tabases, linked to functional information moval of rare cDNAs. Such short an- imately 15,000 different rare messages are on molecular processes and disease states, chored primers previously were used in expressed at a level of Ͻ5 copies per cell hold the potential for revolutionizing the DD technique (12) to target reverse and constitute the remaining percentage methods of basic research and disease transcriptase and PCR to the 3Ј ends of of a cell’s mRNA mass. intervention (8–10). expressed genes. Subtraction methods allow one to com- In addition to library subtraction and Wang et al. (2) demonstrate that these normalization, other methods of novel pare two mRNA preparations and remove anchored primers do indeed produce gene discovery currently are used. These from one (the tester) genes that are short 3Ј dA͞dT sequences. These se- include methods that incorporate a sub- present in the other (the driver). This quences reduced loss of a synthetic tester traction step and methods that do not. technique is useful in identifying tissue- template during subtraction reactions. Subtraction-based methods include, for specific genes. Current EST sequencing They also reduced loss of rare cellular example, representational difference combines subtraction and normalization genes. A selected group of rare colon- analysis (11). Methods that do not involve methods with advanced tissue preparation specific mRNAs was assayed after sub- subtraction include differential display methods, such as the isolation of RNA traction. They were increased several-fold (DD) (12) and related fingerprinting tech- from bulk, microdissected, and pooled in the poly(dA͞dT)(-) libraries made by niques, SAGE, differential library screen- tissues, to increase the diversity and re- using the anchored primers, as compared ing, and negative selection (13, 14). duce the redundancy of messages in a with conventional dA͞dT(ϩ) libraries; DD is a PCR-based method whereby given sample. four of the five rare colon specific genes cDNAs made from two mRNA samples To perform a subtraction, first-strand were retained at 1.4- to 7.8-fold higher are compared on side-by-side tracks of a cDNA is generated from tester RNA by levels. Furthermore, Moloney murine leu- sequencing gel. Each mRNA is repre- using poly(dT) primers. Double-stranded kemia virus reverse transcriptase was sented as a single band and differential cDNA then is generated from driver RNA more effective than avian myeloblastosis bands can isolated and cloned. In parallel again by using the poly(dT) primers. virus reverse transcriptase in reducing comparisons, DD was found to have less Tester cDNA is mixed with an excess rare template loss. bias toward abundant genes than EST of driver cDNA, and the cDNAs are A point of interest is that the dC- sequencing or subtractive hybridization denatured and allowed to reanneal. Dou- anchored primer behaved like poly(dT), (15). In recent years, false positive rates ble-stranded cDNA is eliminated by by randomly hybridizing along poly(A) for DD have been reduced from 50% to adsorption to a hydroxyapatite column. tails of mRNAs. Therefore, two base- close to 5% (16). The method currently is Normalization methods use the same ba- anchored primers were used in its place in widely applied to solve specific research sic principles, except that a single RNA which the penultimate base was dC plus an questions (nearly 1,400 citations in Med- preparation is used to generate both tester ultimate dC, dA, or dG. Interestingly, we line), although it contributes only a minor and driver cDNA and hybridization time is find with DD that dC is a perfectly satis- fraction of EST and CGAP entries. limited. factory 3Ј base perhaps because the re- SAGE is a streamlined method of se- peated 30 rounds of PCR progressively quencing genes from any type of expres- Improving Expressed Gene Identification shorten the 3Ј tails to a minimum. sion library, subtracted, normalized, or Wang and colleagues (2) describe the fall- unaltered (13). In SAGE, 10-bp stretches off in rates of new gene discovery by the New Directions of cDNA obtained from a precise location EST project. ‘‘The rate of novel gene Expression approaches already have had relative to the mRNA 3Ј end are concate- identification through the EST project a great impact on biological research merized so that a single lane of DNA declined dramatically from 10.6% of EST (1, 19). Expression databases are receiv- sequencing identifies many genes. SAGE sequences in 1996 (36,000 novel se- ing thousands of queries per day by is used as a tool for both quantitative quences) to only 2.7% of EST sequences researchers. The approach is providing analysis of gene expression levels and collected in 1998 (638 novel sequences).’’ researchers with new tools to address novel gene discovery. They predict a similar decline in CGAP complex molecular events. In a general

3790 ͉ www.pnas.org Martin and Pardee Downloaded by guest on October 1, 2021 sense, expression approaches are con- patients according to distinct subcatego- Arrays with complete collections of tags, as tributing to the current shift in research ries of disease. Such classification has the well as the equipment to hybridize and directions from the classical emphasis on potential to assist clinicians in making interpret the results, will likely follow. A individual genes and molecules to the prognosis and predicting the most effec- significant challenge currently is to develop investigation of patterns of gene expres- tive therapies. approaches to determine the meaning of sion and the elucidation of comprehen- The power of hybridization arrays can changes in individual gene expression levels, sive functional networks. New technolo- be seen by the questions now being an- especially in a biological and functional gies based on the expression approach swered by this method. For example, two context. In summary, Wang and colleagues (2) include cDNA microarrays and comput- groups have studied large-scale gene ex- have identified a flaw in current subtrac- er-based expression analyses. pression changes in leukemias and lym- tion methods that are widely used to iden- Microarrays in particular have had an phomas and found that groups of genes tify novel expressed genes. They predict overwhelming recent popularity. cDNA can be readily identified that sort blood that this will limit the usefulness of current tags for thousands of expressed genes are samples a priori into different disease subtraction methods and result in a fall-off arrayed on a glass or membrane support. groups (8, 9). Studies of breast tumor in gene identification rates. As a solution, When incubated with labeled first-strand tissue biopsies also showed that clinically they construct subtraction libraries by us- cDNA produced from cellular RNA, each relevant gene expression patterns could be ing short, anchored oligo(dT) primers that tag hybridizes to its cognate mRNA and identified (10, 20). These expression pat- do not copy the long poly(A) regions of allows relative quantitation of the expres- terns have the potential to very accurately mRNAs. They present data showing that sion levels. This technology allows wide- identify prognostic groups and predict the this technological improvement has the spread changes in expression patterns to most effective therapies. potential to improve rates of novel gene be probed in a single experiment. Currently, the most comprehensive hy- identification, which will enhance the as- Important clinical applications also are bridization arrays include 20–30% of the sembly of a database that includes all developing from expression technology. 130,000 genes. Hybridization array methods expressed human genes. Expression information linked to func- will become more powerful when tags for all We thank Drs. Kornelia Polyak, Scott McGin- tional information on molecular processes expressed genes are included. For the most nis, and Todd Golub for helpful input and and disease states can accurately classify part, these genes are already in databases. review of the manuscript.

1. Sager, R. (1997) Proc. Natl. Acad. Sci. USA 94, 8. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, Hood, L. (1999) Genet. Anal. 15, 209–215. 952–955. C., Gaasenbeek, M., Mesirov, J. P., Coller, H., 15. Wan, J. S., Sharp, S. J., Poirier, G. M., Wagaman, 2. Wang, S. M., Fears, S. C., Zhang, L., Chen, J.-J. & Loh, M. L., Downing, J. R., Caligiuri, M. A., et al. P. C., Chambers, J., Pyati, J., Hom, Y. L., Galindo, Rowley, J. D. (2000) Proc. Natl. Acad. Sci. USA 97, (1999) Science 286, 531–537. J. E., Huvar, A., Peterson, P. A., et al. (1996) Nat. 4162–4167. 9. Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Biotechnol. 14, 1685–1691. 3. Adams, M. D., Kelley, J. M., Gocayne, J. D., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, 16. Martin, K. J., Kwan, C. P., O’Hare, M. J., Pardee, Dubnick, M., Polymeropoulos, M. H., Xiao, H., H., Tran, T., Yu, X., et al. (2000) Nature (London) A. B. & Sager, R. (1998) BioTechniques 24, 1018– Merril, C. R., Wu, A., Olde, B., Moreno, R. F., et 403, 503–511. 1026.

al. (1991) Science 252, 1651–1656. 10. Martin, K. J., Kritzman, B. M., Price, L. M., Koh, 17. Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, COMMENTARY 4. Lal, A., Lash, A. E., Altschul, S. F., Velculescu, V., B., Kwan, C. P., Zhang, Z., Mackay, A., O’Hare, K. & Watson, J. D. (1994) Molecular Biology of the Zhang, L., McLendon, R. E., Marra, M. A., M. J., Kaelin, C. M., Mutter, G. M., et al. (2000) Cell (Garland, New York). Prange, C., Morin, P. J., Polyak, K., et al. (1999) Cancer Res., in press. 18. Bonaldo, M. F., Lennon, G. & Soares, M. B. Cancer Res. 59, 5403–5407. 11. Lisitsyn, N., Lisitsyn, N. & Wigler, M. (1993) (1996) Genome Res. 6, 791–806. 5. Cohen, J. (1997) Science 275, 767–772. Science 259, 946–951. 19. Boguski, M. S. (1995) Trends Biochem Sci. 20, 6. Bishop, J. O., Morton, J. G., Rosbach, M. & Rich- 12. Liang, P. & Pardee, A. B. (1992) Science 257, 295–296. ardson, M. (1974) Nature (London) 250, 199–204. 967–971. 20. Perou, C. M., Jeffrey, S. S., van de Rijn, M., 7. Velculescu, V. E., Madden, S. L., Zhang, L., Lash, 13. Velculescu, V. E., Zhang, L., Vogelstein, B. & Rees, C. A., Eisen, M. B., Ross, D. T., Perga- A. E., Yu, J., Rago, C., Lal, A., Wang, C. J., Kinzler, K. W. (1995) Science 270, 484–487. menschikov, A., Williams, C. F., Zhu, S. X., Lee, Beaudry, G. A., Ciriello, K. M., et al. (1999) Nat. 14. Nelson, P. S., Hawkins, V., Schummer, M., Bum- J. C., et al. (1999) Proc. Natl. Acad. Sci. USA 96, Genet. 23, 387–388. garner, R., Ng, W. L., Ideker, T., Ferguson C. & 9212–9217.

Martin and Pardee PNAS ͉ April 11, 2000 ͉ vol. 97 ͉ no. 8 ͉ 3791 Downloaded by guest on October 1, 2021