
Commentary

Navigating the human transcriptome

Robert L. Strausberg*† and Gregory J. Riggins‡

*Cancer Genomics Office, National Cancer Institute, Bethesda, MD 20892; and ‡Departments of Pathology and Genetics, Duke University Medical Center, Durham, NC 27710

The potential coding capacity of the human genome is currently a topic of great interest. The number of predicted genes from the recent human-genome analysis was at the lower end of previous estimates, which had ranged between about 30,000 and 120,000 (1, 2). Whereas estimates of gene number are likely to increase based on additional experimental evidence and improved gene-finding algorithms, it is clear that gene number is only one mechanism for creating the genetic diversity required to encode the full complement of human proteins. The scientific literature richly describes the presence and functional significance of alternatively processed forms of human transcripts that are derived from different transcription initiation sites, alternative exon splicing, and multiple polyadenylation sites (3–5). Determining the various transcript forms and investigating the purpose of these complex mixtures of instructions will be the next great endeavor toward understanding human biology.

Large-scale analysis of the transcriptome originates from the expressed sequence tag (EST) concept popularized by Venter and coworkers (6). In the EST approach, clones from cDNA libraries are subjected to single-pass sequencing, such that a unique identifier is assigned to each cDNA. Initially, these tags were about 300 nucleotides in length, but sequence tags of more than 700 nucleotides are now common. Many scientists quickly realized the value of using EST sequences to identify new genes and characterize genes expressed in normal and diseased tissues. Public and private efforts soon emerged to capitalize on this opportunity. Key to the success of the public efforts was the formation of the IMAGE consortium (7), an academic-industrial partnership to distribute clones produced by the public efforts. The Merck Gene Index (MGI; ref. 8) and the Cancer Genome Anatomy Project (CGAP; ref. 9) have produced many of the human clones distributed by IMAGE. A guiding principle for these efforts was the immediate distribution of project resources: clones were distributed through the IMAGE consortium (http://image.llnl.gov/) and sequences through a GenBank division, dbEST (10).

CGAP and MGI have sought to develop a comprehensive catalog of human and mouse ESTs. Each project has produced cDNA libraries based on the use of primers for first-strand synthesis that are anchored at the 3′ transcript end [the poly(A) sequence]. Sometimes, the resulting cDNA molecules represent the entire transcript. More often, they are incomplete, either because of mRNA degradation or incomplete enzymatic processing during conversion of mRNA to cDNA. Thus, for a given transcript, there might be several different forms of cDNA in the library. To facilitate gene cataloging, the MGI and CGAP sequenced the 3′ cDNA end, starting from the poly(A) sequence (see Fig. 1). With this approach, it is more likely that sequences from transcripts derived from the same gene will be recognized as such (although alternative polyadenylation sites add complication). In the MGI project, sequencing from the 5′ clone end also was performed to provide more protein-encoding sequences. Because many of the clones are incomplete, 5′ clone-end sequences often define not the true 5′ end of a transcript but, instead, lie within the coding regions.

Complementing the traditional EST approaches have been imaginative new strategies. One of these approaches, Serial Analysis of Gene Expression (SAGE; ref. 11), produces short sequence tags (usually 14 nucleotides in length) located adjacent to defined restriction sites near the 3′ end of the cDNA. Therefore, one advantage of SAGE is that, unlike the EST approach, each transcript has a unique tag, thereby facilitating transcript quantification. In addition, these tags are concatemerized, such that 30 or more gene tags can be read from a single sequencing lane, substantially increasing the efficiency of gene cataloging. However, because the tags are so short, the informatics challenges become greater. The CGAP project, working together with the National Center for Biotechnology Information (NCBI), has generated a SAGE database, SAGEmap (12). This database now includes over 4,000,000 gene tags, complementing the more than 3,700,000 human ESTs in dbEST.

In this issue, Camargo et al. describe a previously untried strategy, the ORF ESTs (ORESTES) approach developed by the Fundação de Amparo à Pesquisa do Estado de São Paulo/Ludwig Institute for Cancer Research (FAPESP/LICR)-Human Cancer Genome Project (13). The power of ORESTES is that a high proportion of the sequence tags are in the coding regions of transcripts (see Fig. 1). In the ORESTES approach, low-stringency PCR conditions are used to produce cDNA libraries from which a relatively small number of individual clones are sequenced. Several thousand ORESTES libraries have been produced, each with different primers, such that each library is expected to contain unique cDNA sequences. As described by Camargo et al., the experimental results confirm the theoretical expectations of ORESTES in that these sequences are spaced throughout the transcript, thereby providing a scaffold to complete full-length transcript sequences. Moreover, the approach has a normalization effect for a broader sampling of the many different transcripts, with less dependence on expression levels. This approach facilitates discovery of genes with low expression levels. The volume of tags is significant as well; 700,000 ORESTES represent nearly 20% of human dbEST.

As shown in Fig. 1, complete gene sequence assembly from ESTs can be quite challenging, because each tag is relatively short, and often the tags don't share homologous sequences. To organize the disparate EST sequences, the NCBI developed UniGene (14). From the outset, UniGene was designed to bin all transcript sequences from a gene into one cluster, thereby serving as a platform for transcript profiling efforts, including those based on microarrays (15, 16). However, because UniGene, in principle, groups all transcript forms of a gene into one cluster, it is too imprecise to serve as the foundation for defining the scope of the human transcriptome.

To proceed effectively with transcriptome efforts, the cDNA sequencing and clone needs will be more demanding than in the past. Currently, there is a significant shift in emphasis of the large-scale cDNA efforts toward the sequencing of complete human transcripts. In 1999, the National Institutes of Health announced the Mammalian Gene Collection Project (ref. 17; http://mgc.nci.nih.gov), which is focused toward the identification and complete sequencing of human and mouse full-ORF cDNAs. To date, that project has produced over 5,000 human sequences (deposited in GenBank). In addition, the German Genome Project recently completed full-ORF human cDNA sequences derived from ≈1,500 human genes (18). Moreover, substantial progress in the sequencing of mouse full-ORF cDNAs already has been reported (19).

Fig. 1. Idealized schematic view of ESTs generated by various public projects. A full-length cDNA is represented by the black bar, with nontranslated regions indicated by the stippled sections. The red arrows depict 5′ and 3′ ESTs based on the MGI approach; blue arrows depict the CGAP 3′ approach. Note that at both the 5′ and 3′ ends, alternate EST positions are possible based on transcript variations or incomplete cDNA synthesis. The purple arrow indicates the MGC EST strategy in which 5′ ESTs are generated to search for full-ORF clones. ORESTES tags are shown by the green arrows, which are spaced more evenly throughout the cDNA sequence. The green bars denote regions where sequence gaps exist that might be subject to the transcript-finishing approach of Camargo et al. The black arrow indicates where a SAGE tag might be located. This arrow is vertical to indicate that SAGE tags are located at a precise site within a transcript.

The transcript-finishing approach of Camargo et al. presented in this issue represents a valuable addition to these efforts. This strategy utilizes the ORESTES scaffold EST sequences to build primers for reverse transcription (RT)-PCR reactions to bridge gaps, thereby confirming the membership of ESTs in a common transcript and providing intervening sequence information. When combined with our increasing knowledge of the human-genome sequence (and improved gene models), the transcript-finishing approach will likely provide a convenient means for delineating the boundaries of each gene and providing complete transcript sequences.

Realizing the full potential of cDNA sequencing and annotation efforts will require the development of new schemes and databases for capturing information about each of the human transcripts. Newer databases, such as the NCBI's RefSeq (20), annotate alternatively spliced forms of transcripts. For example, RefSeq lists at least 14 distinct species of BRCA1 transcripts based on inclusion or exclusion of particular exons. Complete delineation of the human transcriptome will require a public database that is firmly rooted by the use of precise sequence-based annotation to describe all exons (including sequence variation within a particular exon) and alternate forms of processing at the 5′ and 3′ ends for each transcript species. Such a database will serve as a platform for the functional annotation of each transcript species through links to primary data sets and the scientific literature.

One of the great strengths of the human transcriptome effort is that it incorporates the talents of an international consortium and utilizes many different experimental strategies. In addition, although data and clone-release policies differ somewhat among the groups, there is an overall spirit of open sharing. Transcriptome analysis differs from the recently reported human-genome sequencing efforts in that the "biological space" of the transcriptome still remains to be defined. Many of the transcript tags annotated in dbEST and SAGEmap are from projects focused on the molecular characterization of cancer. The reasons for emphasis on cancer are clear, and it also is evident that comprehensive study of the transcriptome will require more substantial study of normal human tissues and cells, as well as those from various disease states.

Imperative to an elucidation of the transcriptome will be the development of new technologies and scientific strategies. We will need not only to identify and analyze different transcripts from a single gene, but also to examine the entirety of the transcript population of cells and tissues such that we can begin to understand the networks of interactions encoded by various transcript forms. For example, the development of DNA chips that facilitate identification and quantification of specific transcripts might be used to address these questions. Undoubtedly, innovation will be a hallmark of transcriptome research for the next several years. That this creative spirit is strong is richly documented in the work of Camargo et al.

The authors thank Dr. Lynette Grouse for helpful discussions during the preparation of this manuscript.

See companion article on page 12103.

†To whom reprint requests should be addressed at: National Cancer Institute, 31 Center Drive, Room 11A03, Bethesda, MD 20892. E-mail: [email protected].

www.pnas.org/cgi/doi/10.1073/pnas.221463598 PNAS | October 9, 2001 | vol. 98 | no. 21 | 11837–11838

1. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001) Nature (London) 409, 860–921.
2. Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304–1351.
3. Gong, Q. H., Cho, J. W., Huang, T., Potter, C., Gholami, N., Basu, N. K., Kubota, S., Carvalho, S., Pennington, M. W., Owens, I. S., et al. (2001) Pharmacogenetics 11, 357–368.
4. Yi, X., White, D. M., Aisner, D. L., Baur, J. A., Wright, W. E. & Shay, J. W. (2000) Neoplasia 2, 433–440.
5. Edwalds-Gilbert, G., Veraldi, K. L. & Milcarek, C. (1997) Nucleic Acids Res. 25, 2547–2561.
6. Adams, M. D., Dubnick, M., Kerlavage, A. R., Moreno, R., Kelley, J. M., Utterback, T. R., Nagle, J. W., Fields, C. & Venter, J. C. (1992) Nature (London) 355, 632–634.
7. Lennon, G., Auffray, C., Polymeropoulos, M. & Soares, M. B. (1996) Genomics 33, 151–152.
8. Williamson, A. R. (1999) Drug Discov. Today 4, 115–122.
9. Strausberg, R. L., Buetow, K. H., Emmert-Buck, M. R. & Klausner, R. D. (2000) Trends Genet. 16, 103–106.
10. Boguski, M. S., Lowe, T. M. J. & Tolstoshev, C. M. (1993) Nat. Genet. 4, 332–333.
11. Velculescu, V. E., Zhang, L., Vogelstein, B. & Kinzler, K. W. (1995) Science 270, 484–487.
12. Lal, A., Lash, A. E., Altschul, S. F., Velculescu, V., Zhang, L., McLendon, R. E., Marra, M. A., Prange, C., Morin, P. J., Polyak, K., et al. (1999) Cancer Res. 59, 5403–5407.
13. Camargo, A. A., Samaia, H. P. B., Dias-Neto, E., Simao, D. F., Migotto, I. A., Briones, M. R. S., Costa, F. F., Nagai, M. A., Verjovski-Almeida, S., Zago, M. A., et al. (2001) Proc. Natl. Acad. Sci. USA 98, 12103–12108.
14. Wheeler, D. L., Church, D. M., Lash, A. E., Leipe, D. D., Madden, T. L., Pontius, J. U., Schuler, G. D., Schriml, L. M., Tatusova, T. A., Wagner, L., et al. (2001) Nucleic Acids Res. 29, 11–16.
15. Luo, J., Duggan, D. J., Chen, Y. D., Sauvageot, J., Ewing, C. M., Bittner, M. L., Trent, J. M. & Isaacs, W. B. (2001) Cancer Res. 61, 4683–4688.
16. Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., et al. (2000) Nature (London) 403, 503–511.
17. Strausberg, R. L., Feingold, E. A., Klausner, R. D. & Collins, F. S. (1999) Science 286, 455–457.
18. Wiemann, S., Weil, B., Wellenreuther, R., Gassenhuber, J., Glassl, S., Ansorge, W., Bocher, M., Blocker, H., Bauersachs, S., Blum, H., et al. (2001) Genome Res. 11, 422–435.
19. Kawai, J., Shinagawa, A., Shibata, K., Yoshino, M., Itoh, M., Ishii, Y., Arakawa, T., Hara, A., Fukunishi, Y., Konno, H., et al. (2001) Nature (London) 409, 685–690.
20. Pruitt, K. D. & Maglott, D. R. (2001) Nucleic Acids Res. 29, 137–140.
