Identifying Expressed Genes

Commentary

Katherine J. Martin* and Arthur B. Pardee
Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA 02115

PNAS, April 11, 2000, vol. 97, no. 8, 3789–3791
See companion article on page 4162.
*To whom reprint requests should be addressed.

The study of expressed genes has had a great impact on biological research (1). Expressed genes are the basic functional units of genomic DNA. Because these regions cannot be identified from genomic sequence information per se, the gene's products, messenger RNAs or proteins, must be isolated from cells, directly sequenced, and identified. As we steadily build up sizable expression databases that currently include more than 92,000 of the roughly 100,000 total human genes, the process of identifying the remaining undiscovered genes is becoming progressively more difficult. Existing databases of expressed genes now include virtually all of the abundantly expressed human genes, the easier to reach "low hanging fruit on the tree," as well as many middle and rarely expressed genes. The genes that are still undiscovered are expressed at low levels or are specifically expressed only in certain cell types, developmental stages, or growth conditions. Such genes hold the promise of including key regulatory factors responsible for differentiated phenotypes, developmental progression, or cell growth regulation. As we move forward to identify these genes, highly efficient methods of removing, i.e., subtracting, the bulk of identified, abundant genes from cDNA libraries are required.

In this issue of PNAS, Wang and colleagues (2) discover a flaw in current subtraction methods, which are now widely used to identify novel expressed genes. They show that the long poly(A) regions present in most expressed mRNAs generate a serious problem in subtraction reactions. Long poly(dT) regions of tester cDNA, which is generated from the RNA of interest, randomly hybridize with long poly(dA) regions of driver cDNA generated from the comparison cell type, resulting in template loss. This loss particularly affects low abundance mRNAs. Wang et al. predict that this flaw will limit the usefulness of current subtraction methods and result in a fall-off in gene identification rates before the identification of all genes is completed. They report a conceptually and technically straightforward solution that significantly enhances the efficiency with which novel expressed genes are identified. They generate subtraction libraries by using short, anchored oligo(dT) primers that anneal at the 5′ ends of poly(A) regions of mRNAs. Hence, the cDNAs produced are devoid of long poly(dA) regions.

Inception of Expression Analysis

Information on expressed genes has classically originated from studies of individual cDNAs, identified and cloned by virtue of their particular importance to a specific topic of research. For the past 20 years, DNA sequence information for these functionally characterized genes has been entered into databases including GenBank, the European Molecular Biology Laboratory, and the DNA Data Base of Japan, which share their respective contents.

A random approach recently has been used as a part of a large-scale effort to collect DNA sequence information for all expressed human genes. This approach entails sequencing partial cDNA clones generated from mRNA and is termed expressed sequence tag (EST) analysis (3). EST sequences generally represent 200–800 bp of first-pass sequence information extending in from mRNA 3′ ends. Two large public EST sequencing projects, the EST project and the Cancer Genome and Anatomy Project (CGAP, http://www.ncbi.nlm.nih.gov/ncicgap), have been initiated to rapidly identify, i.e., obtain at least partial sequence information for, all expressed genes.

The first EST project (3) was begun in 1991 and to date has accumulated sequences for a total of approximately 48,000 different genes. Rates of novel gene discovery by the EST project were initially high, but have declined sharply in recent years. Ninety percent of the 48,000 ESTs were accumulated in the 4 years before 1997 and only modest numbers were added in the past 3 years. The second large-scale expressed sequence tag project, CGAP, was begun in late 1996. CGAP currently is maintaining high rates of gene discovery by applying the latest techniques in tissue procurement, cDNA library preparation, and bioinformatics to sequence expressed genes in cancerous, precancerous, and normal cell lines and tissues. CGAP now has accumulated more than 44,000 novel genes that were not already discovered by the EST project. Sequences from both of the EST sequencing projects are collected in the database of ESTs (dbEST, http://www.ncbi.nlm.nih.gov/dbEST), which is maintained by the National Center for Biotechnology Information. GenBank and dbEST sequences then are organized into a nonredundant list of unique genes by the UniGene project (http://www.ncbi.nlm.nih.gov/UniGene), which is considered the most regularly updated source for high-quality, nonredundant information on expressed genes. CGAP also is creating a public database, SAGEmap, to provide quantitative gene expression data (4).

Public human expression databases now are believed to include a large percentage of all genes. UniGene currently lists sequence information for 92,571 different expressed genes (UniGene build #108, Feb. 19, 2000). It is noted that the algorithms used by UniGene to cluster redundant sequences are experimental and hence this number may increase or decrease with improvements and the addition of new sequences. Further, some sequences currently considered to be different genes may in fact represent nonoverlapping regions of the same gene. Hence, more complete sequence information, e.g., from genomic data, also may reduce the UniGene tally. The ultimate target is also uncertain. Estimates of the total number of expressed human genes range between 60,000 and 150,000 (5–7). Serial analysis of gene expression (SAGE) results indicate that 46% of genes currently have no matches in existing databases (2, 7), hence predicting a total of 130,000 genes. Although precise values for the total number of expressed human genes or the number that have already been identified are uncertain, it is likely that current databases are close to complete.

The dbEST obtains its information from many different types of libraries and tissues. Libraries include PCR-amplified and unamplified libraries, normalized, subtracted, and unaltered libraries, as well as libraries generated with a new reverse transcriptase that produces very long cDNA clones. Libraries are prepared from more than 1,000 different human tissues and cell lines, including normal and cancer cells, different developmental stages and growth states, as well as microdissected, bulk, and pooled tissues.

A parallel ongoing effort is to sequence the complete human genome. Draft sequence currently is reported for 47% (Human Genome Project, http://www.ncbi.nlm.nih.gov/genome/seq) and more than 90% (Celera Genomics, http://www.celera.com) of the genome. Upon completion, these projects will provide the fundamental DNA sequence information for the entire 3 × 10^9 bp of the human genome. Genomics has made key contributions to biology and medicine. However, its greater value may be as a component of integrated resources of genomic data plus data on the 3% of its sequence that is translated. Integrated databases, linked to functional information on molecular processes and disease states, hold the potential for revolutionizing the methods of basic research and disease [...]

Subtraction and Normalization Methods

As sequencing continues, the remaining unidentified genes are becoming progressively harder to find because they are of progressively lower abundance and are more cell-type restricted. It is important that databases include even these most scarce and tissue-specific genes, as these have the potential to include many of the most biologically interesting regulatory factors. Subtraction and normalization are key methods in the process of identifying such genes.

Normalization methods selectively reduce the level of representation of abundant genes so that the resulting mRNA preparations contain both abundant and rare genes at similar levels. Before normalization, a typical cell expresses 1,000–2,000 different abundant and middle abundant messages at levels of >500 and 15–500 copies/cell, respectively. These represent 50–65% of the cell's total mRNA mass (17, 18). In contrast, approximately 15,000 different rare messages are expressed at a level of <5 copies per cell and constitute the remaining percentage of a cell's mRNA mass.

[...] gene identification rates once current methodologies become limiting. They identify a problematic area of current subtraction and normalization methods that may limit the usefulness of these methods.

Conventionally, reverse transcription is performed with a poly(dT) primer that anchors randomly along the approximately 200-bp poly(A) tails of mRNAs and creates long poly(dA/dT) sequences in the cDNAs. During subtraction, rare cDNAs are lost at a high rate because of random hybridization of their long poly(dA) regions with driver poly(dT). Wang et al. (2) demonstrate a method that overcomes this problem: the construction of subtraction libraries by using short, anchored oligo(dT) primers. These primers are composed of 11 dTs plus one or two other 3′ bases that anchor their hybridization at the 5′ ends of poly(dA) sequences. The short poly(dA/dT) tails produced are less likely to cause the removal of rare cDNAs. Such short anchored primers previously were used in the DD technique (12) to target reverse transcriptase and PCR to the 3′ ends of [...]
