
The Pharmacogenomics Journal (2002) 2, 156–164 © 2002 Nature Publishing Group All rights reserved 1470-269X/02 $25.00 www.nature.com/tpj REVIEW

An international database and integrated analysis tools for the study of cancer gene expression

RL Strausberg1, AA Camargo2, GJ Riggins3, CF Schaefer1, SJ de Souza2, LH Grouse1, A Lal3, KH Buetow1, K Boon3, SF Greenhut1, AJG Simpson2

1National Cancer Institute, Bethesda, MD, USA; 2Ludwig Institute for Cancer Research, São Paulo, Brazil; 3Duke University Medical Center, Durham, NC, USA

Correspondence: RL Strausberg, National Cancer Institute, 31 Center Drive, Room 10A07, MSC 2580, Bethesda, MD 20892, USA
Tel: 301 451–8027
Fax: 301 480–4368
E-mail: RLS@nih.gov

Received 18 January 2002
Revised 21 February 2002
Accepted 27 February 2002

ABSTRACT
Researchers working collaboratively in Brazil and the United States have assembled an International Database of Cancer Gene Expression. Several strategies have been employed to generate gene expression data, including expressed sequence tags (ESTs), serial analysis of gene expression (SAGE), and open reading-frame expressed sequence tags (ORESTES). The database contains six million gene tags that reflect the gene expression profiles in a wide variety of cancerous tissues and their normal counterparts. All sequences are deposited in the public databases, GenBank and SAGEmap. A suite of informatics tools was designed to facilitate in silico analysis of the gene expression datasets and is available through the NCI Cancer Genome Anatomy Project web site (http://cgap.nci.nih.gov).
The Pharmacogenomics Journal (2002) 2, 156–164. doi: 10.1038/sj.tpj.6500103

Keywords: cancer; ORESTES; SAGE; CGAP; gene expression

Cancer is fundamentally a disease of the genome. The genomic changes associated with cancer take many forms, from point mutations and deletions of single nucleotides to the translocation of large chromosomal segments. Moreover, alterations within the genome, in concert with changes in the environment, modulate patterns of gene expression, at both the transcript and protein levels. The challenge to cancer research is to relate the variations in gene expression to specific types of cancer. Indeed, over the past few years, impressive efforts have been made to read these molecular signatures of cancer, toward the goal of gaining a more comprehensive understanding of the basic mechanisms of cancers, which will lead to improved detection, diagnosis, intervention, and ultimately prevention.1–9 The platform vital to support these studies is a complete catalog of human and mouse genes, with a specific focus on those genes expressed in cancer cells and their normal counterparts.

In 1997, the National Cancer Institute initiated the Cancer Genome Anatomy Project (CGAP)10–13 with the aim of facilitating the interface of genomics and cancer research. The first initiative developed within CGAP was the Tumor Gene Index, which aimed to develop a catalog of the genes expressed in a variety of normal and cancerous tissues. Subsequently, additional components evolved within CGAP to further annotate these genes, integrate the gene database with the genome sequence, and correlate these genes with chromosomal aberrations observed in cancers. For example, the Genetic Annotation Initiative (GAI)14–15 has assembled a database of gene-based single nucleotide polymorphisms (SNPs). These SNPs were identified through analysis of the extensive CGAP sequence database, which represents cancerous tissues obtained from many different individuals. The Cancer Chromosomal Aberration Project (cCAP),16,17 another CGAP program, has advanced the convergence of the human physical map and DNA sequence with the cytogenetic map through precise

mapping of bacterial artificial chromosome (BAC) clones along all human chromosomes using fluorescence in situ hybridization (FISH). Many of the BACs used in this initiative were also used as substrates by the public Human Genome Project. The cCAP is a component of the International BAC Mapping Consortium Project.16 CGAP also displays the wealth of information about chromosome abnormalities in cancers that is cataloged in the Mitelman Database of Chromosome Aberrations.18 An electronic and freely accessible version of this database is available for viewing on the CGAP web site (http://cgap.nci.nih.gov/).

In 1998 the Ludwig Institute for Cancer Research (LICR) in São Paulo, Brazil, sought to catalog gene expression in cancer cells through the application of a novel gene tagging approach, open reading frame ESTs (ORESTES). This led to a partnership between the LICR and the State of São Paulo Research Foundation (FAPESP) to launch the FAPESP/LICR Human Cancer Genome Project (HGCP).19–21

In 1998, when discussions between the leaders of the CGAP and HGCP projects were initiated, the synergy between the programs was immediately apparent and an informal collaboration quickly formed. The rationale for the collaboration was that the gene tagging technologies employed in the two projects were complementary, as were the specific types of cancer being targeted. All members of each group were committed to contributing the gene tag data to public databases maintained by the National Center for Biotechnology Information (NCBI). This would ensure that the worldwide community of academic and industrial researchers would have access to the data for application to both basic and applied cancer research. Recently, a formal relationship between the Brazilian and United States scientists resulted in the creation of an international database of cancer gene expression. This report describes the content of this cancer gene expression database, as well as the informatics tools for utilizing it. The web sites for each of these projects, as well as the informatics tools, are listed in Table 1.

METHODS APPLIED TO CATALOG GENE EXPRESSION

The CGAP and HGCP programs have both used approaches to build gene catalogs based on cDNAs derived from human and mouse tumors, as well as normal tissues. The CGAP project has utilized two gene-tagging strategies to build a gene expression catalog, the expressed sequence tag (EST) approach,22 and serial analysis of gene expression (SAGE).23 In the traditional EST approach, clones from cDNA libraries are subjected to single-pass sequencing from the 5′ and/or 3′ end of the cDNAs, producing sequences of several hundred nucleotides, such that a unique identifier is assigned to each cDNA. The cDNA libraries are constructed using primers for first-strand synthesis that are anchored at the 3′ transcript end (the poly(A) sequence). The resulting cDNA molecules may represent the entire transcript, but are more often incomplete, either because of mRNA degradation or incomplete enzymatic processivity during conversion of mRNA to cDNA. Thus, for a specific transcript, there may be multiple forms of cDNA within a library. To facilitate gene cataloging, CGAP has focused its sequencing effort on the 3′ cDNA end, starting from the poly(A) sequence. Using this approach, it is more likely that sequences from transcripts derived from the same gene will be recognized as such (although alternative polyadenylation sites complicate the cataloging of gene transcripts).

The serial analysis of gene expression (SAGE) approach produces very short sequence tags (usually 10 nucleotides in length) located adjacent to defined restriction sites near the 3′ end of the cDNA. In this approach, cDNAs are digested with a frequently cutting restriction enzyme (often NlaIII), and adapters are added that contain the recognition sequence of a type IIs restriction enzyme (BsmFI), such that the 10 nucleotides adjacent to the NlaIII site are isolated. The individual tags are concatenated to form a single DNA molecule, which is then subjected to DNA sequence analysis. The power of SAGE is two-fold. First, unlike EST sequences, which can vary in both location within the transcript and sequence length, the SAGE tags are precisely anchored at a defined

Table 1 The URLs for the CGAP and HGCP projects and descriptions of the analysis tools available on the CGAP web site

| Name | Purpose | Web site |
| --- | --- | --- |
| CGAP | Determines the genomic and gene expression profiles of normal, precancer and cancer cells. | http://cgap.nci.nih.gov/ |
| FAPESP/LICR HGCP | Supports discovery of new human genes and is pursuing completion of all gene coding regions. | http://www.ludwig.org.br/ORESTES |
| Library Finder | Searches for one or more tissue-specific libraries from the CGAP, MGC, SAGE, or dbEST collections. | http://cgap.nci.nih.gov/Tissues/LibraryFinder |
| Gene Library Summarizer (GLS) | Finds all the genes in a specific cDNA library or group of libraries. | http://cgap.nci.nih.gov/Tissues/LibrarySummarizer |
| cDNA xProfiler | Compares gene expression between two pools of libraries. | http://cgap.nci.nih.gov/Tissues/xProfiler |
| Digital Gene Expression Displayer (DGED) | Distinguishes the statistical differences in gene expression between two pools of libraries. | http://cgap.nci.nih.gov/Tissues/GXS |
| Gene Finder | Finds one gene or list of genes, based on selected search criteria. Links to a Gene Info page, including Virtual Northern, and various NCBI and NCI databases are provided. | http://cgap.nci.nih.gov/Genes/GeneFinder |


restriction site within the transcript and are the same length. Therefore, in principle, all of the tags generated from a particular transcript species will be identical, thereby facilitating gene transcript quantification. Second, 30 or more gene tags can be concatenated and read from a single sequencing lane, substantially increasing the cost effectiveness of gene cataloging. As a result, 50 000 or more SAGE tags are often produced from a single library. The challenge of SAGE analysis is that occasionally two genes will have the same SAGE tag (for example, SAGE does not distinguish between human genes DGCR6 and DGCR6L because the genes are identical in the 3′ NlaIII region), or an individual gene may have more than one tag (due to alternative 3′ transcript processing). Unlike the EST method of gene identification, the SAGE approach does not generate a resource of cDNA clones. To make the data accessible to the scientific community, CGAP, working collaboratively with the NCBI, has generated a public SAGE database, SAGEmap24,25 (http://www.ncbi.nlm.nih.gov/SAGE/sagexpsetup.cgi). This database now includes over 5 000 000 gene tags, mainly derived from CGAP libraries.

The FAPESP/LICR-HGCP project adopted an EST-based strategy to generate a gene catalog. However, the strategy used is quite distinct from the EST methods employed by CGAP. Unlike the CGAP project, in which transcript sequencing is focused on the 3′ end of the transcript with the expectation that the gene tags generated are within the non-translated regions of the transcript, a high proportion of the sequence tags generated by the HGCP project are in the coding regions of transcripts. This strategy of cataloging genes is termed open reading frame ESTs (ORESTES).19 The method uses specific oligonucleotides as primers for cDNA synthesis under low stringency PCR conditions, resulting in cDNA libraries from which a relatively small number of individual clones are produced and sequenced. Through the HGCP, thousands of ORESTES libraries have been produced, each with different primers, such that each library is expected to contain unique cDNA sequences. The theoretical expectation of the ORESTES approach, confirmed by experimental results, is that these sequences preferentially target central, generally coding regions of expressed genes.19 For example, Figure 1 shows the distribution of ORESTES tags within the NOTCH2 gene in comparison with traditional ESTs generated by CGAP and other projects including the NIH Mammalian Gene Collection (MGC).26 Moreover, the ORESTES approach has a ‘normalization’ effect, because it allows for a broader sampling of the many different transcript populations with less dependence on expression levels. This should enhance the discovery of genes that are expressed at low levels in tissues.

Figure 1 Distribution of EST sequences within the NOTCH2 gene. An example of the gene tags surveyed for the NOTCH2 gene. The 5′ to 3′ orientation of the mRNA is shown from left to right. Tags generated by CGAP, MGC and other projects are concentrated in the 5′ and 3′ non-translated regions of the gene, while the ORESTES gene tags are located throughout the entire coding sequence of the transcript.

All EST sequences generated by the CGAP and HGCP are deposited in dbEST, a division of GenBank. The classification of all gene tags, generated by both projects, is based on the analysis of all EST data within the UniGene database.27 The UniGene database groups all EST data with homologous sequences into ‘clusters’, such that each cluster potentially represents an individual gene. All CGAP and ORESTES gene tags are classified according to the cluster to which they align. To date, CGAP has deposited 1.2 million EST sequences in GenBank and over 4 million SAGE tags into the SAGEmap databases, while the HGCP has contributed over 740 000 ORESTES sequences to GenBank. These datasets can be downloaded from the anonymous FTP sites maintained by the NCBI. The EST dataset is available at http://www.ncbi.nlm.nih.gov/dbEST/dbEST_access.html and SAGE sequences can be found at ftp://ncbi.nlm.nih.gov/pub/sage/.

BIOLOGICAL CONTENT OF THE DATABASE

Both the CGAP and HGCP projects have sought to build databases that reflect the diversity of cancer biology. In most cases, cDNA libraries were constructed from bulk tumor or normal tissue samples. However, several libraries in the CGAP collection were made from specific cell populations isolated by laser capture microdissection.28,29 Members of both research groups agreed that the projects would analyze gene expression profiles in a diversity of cancers, but that there would be some difference in emphasis to make maximal use of resources. Thus, while the number of EST and SAGE libraries produced for any one type of cancer is relatively limited, the intent was to provide gene tag data for a wide variety of cancers. These datasets could be used to generate information on genes of potential interest, which could then be studied in a larger number of tumors with approaches such as microarray analysis and in situ hybridization. As shown in Table 2, the CGAP project provided the majority of ESTs expressed in normal and cancerous tissues from lung and prostate, while the HGCP ORESTES dataset focused on head and neck, as well as breast tissues. The numbers of CGAP-generated SAGE tags for each of these tissue types are also shown.


Table 2 Examples of the cancerous and normal tissues surveyed by the HGCP and CGAP projects. The numbers of EST and SAGE tags generated by CGAP, in addition to the ORESTES gene tags that were submitted to GenBank, are listed for each project. A complete listing of the tissues surveyed by CGAP and HGCP can be accessed using the CGAP Library Finder and Gene Library Summarizer tools

| Tissue/tumor | HGCP ORESTES | CGAP ESTs | CGAP SAGE |
| --- | --- | --- | --- |
| Brain | 79 238 | 93 623 | 940 628 |
| Breast | 104 065 | 19 888 | 377 175 |
| Colon | 90 841 | 67 027 | 327 493 |
| Head/neck | 122 167 | 13 400 | 0 |
| Kidney | 5219 | 93 593 | 81 438 |
| Lung | 37 910 | 101 342 | 79 026 |
| Lymph node | 0 | 65 264 | 0 |
| Ovary | 13 314 | 34 418 | 326 417 |
| Pancreas | 0 | 25 278 | 169 518 |
| Prostate | 21 949 | 70 011 | 193 221 |

CGAP ANALYSIS TOOLS

To support the effective analysis of the CGAP and HGCP molecular data by the cancer research community, a suite of informatics tools has been designed and made accessible through the CGAP web site (http://cgap.nci.nih.gov) (Figure 2). The tools described include the Library Finder, Gene Library Summarizer, cDNA xProfiler, Digital Gene Expression Displayer and the Virtual Northern. Each tool facilitates the use of individual datasets, but more importantly, allows for integrated views of the EST, ORESTES and/or SAGE datasets. The guiding principle for each of these tools is that they support online analysis of the datasets in real time according to the instructions of the user. That is, the responses are not prepackaged, but allow a researcher to perform in silico experiments using the data stored within the various datasets. As noted below, great care should be used in employing these tools and careful scientific planning is required to ensure that the results are truly informative.

Library Finder

The CGAP site is formatted to provide entry to the datasets from different scientific vantage points such as genes, tissues, chromosomes, or pathways. By selecting the Tissue section and then the Library Finder tool, one can access information on the numbers and types of SAGE, EST and ORESTES libraries. As with each of the tools described in this report, the Library Finder tool provides the user flexibility in performing an in silico experiment (Figure 3). Fields that can be selected include organism (human or mouse), library group (CGAP, ORESTES, SAGE, other projects, or a combination of projects), tissue type (eg breast, brain), library preparation (eg bulk, microdissected, cell line), tissue histology (eg normal, cancer), and library protocol (eg normalized, subtracted). Moreover, the results can be summarized from various perspectives (eg tissue, histology, protocol). For example, one could select CGAP, prostate, and microdissected, to obtain just a list of the 13 prostate cancer libraries prepared through microdissection. Alternatively, an investigator can select library groups SAGE and CGAP, and select breast tissue, with no other delimiters. This selection would reveal a list of 34 breast libraries from the CGAP EST and

Figure 2 Web site for the Cancer Genome Anatomy Project (http://cgap.nci.nih.gov/).


Figure 3 Query form for the Library Finder tool. In this example, the fields selected will generate a list of all the cDNA libraries in the CGAP and ORESTES datasets that were prepared from breast tissues.

SAGE approaches. A search of CGAP, or ORESTES, or SAGE reveals the entire set of libraries from each of those projects.

Gene Library Summarizer (GLS)

The GLS tool finds all of the genes expressed in a single cDNA library or a group of cDNA libraries, then categorizes the genes as ‘Known’ or ‘Unknown’ and further as ‘Unique’ or ‘Non-unique’. Unknown genes are those that are designated only as ESTs. Genes with any given name, including those such as ‘hypothetical protein’, are ‘Known’ in this categorization. A unique gene is one that is found in UniGene only within the category selected by the GLS user. For example, if one chooses to look at breast cancer libraries produced by CGAP and HGCP, and a gene is classified as unique, that indicates that the gene is not found in any other cDNA libraries within UniGene. However, it does not mean that the gene is uniquely expressed in that tissue type; further investigation would be required to establish the actual uniqueness of expression. As with the Library Finder tool, the investigator has much flexibility in selecting experimental parameters. For example, if the user chooses human CGAP and ORESTES libraries, prepared from breast cancer samples, and asks for the results to be summarized by tissue, the display shown in Figure 4 would appear. Clicking on any of the categories then reveals a list of the individual genes, by symbol, name, sequence ID, and CGAP Gene Information (discussed further in the Virtual Northern Tool section).

cDNA xProfiler

The cDNA xProfiler allows an investigator to compare the genes that are expressed in two pools of libraries. For a gene to be ‘present’ in a library pool, there must be at least one EST sequence found in the UniGene cluster for that gene. As with the GLS, the user chooses the organism and library group, and can list libraries by various groupings such as tissue type. Within the tissue type classification, the user can select multiple tissues to include or exclude in the analysis. Also selectable is the minimum number of sequences each library must have to be included in the analysis. For each of the two library pools, the tissue type, tissue preparation, library protocol, tissue histology, and library name can be selected. The choices provide the investigator with maximum flexibility in choosing the parameters for the desired experimental analysis. For example, a simple search might be the comparison of the genes expressed in prostate cancer vs normal prostate tissues. A more complex search might include analysis of the CGAP and ORESTES breast, prostate, and ovarian cancer libraries in pool A, compared with all normal tissue libraries, but excluding brain tissues, in pool B. After making these selections, the user is taken to a screen showing the libraries chosen in each category, with the opportunity for manual intervention to delete libraries from the analysis. After electronic submission of the query, a screen is displayed as shown in Figure 5, listing unique and non-unique, known and unknown genes found in group A,


Figure 4 An example of a query search in the Gene Library Summarizer. The Gene Library Summarizer tool was used to query the CGAP and ORESTES datasets to identify cDNA libraries prepared from breast tissues. The numbers under each column are hyperlinked to a list of genes in each category.

group B, genes found in both groups A and B, genes found in either group A or B, and genes found in one group but not the other. Clicking on any of these categories reveals the list of genes with supporting information.

Digital Gene Expression Displayer (DGED)

The DGED tool is similar to the cDNA xProfiler, in that the investigator chooses two distinct pools of libraries to compare. However, the DGED presents the gene expression comparison results based on statistical significance as calculated by the sequence odds ratio and chi-squared test (both are described in detail on the CGAP web site). For example, an investigator could choose to compare the gene expression profiles in brain tumors vs normal brain tissue. In addition to statistical comparison of sequences, the display also shows the number of libraries in which each gene has been observed (Figure 6). This could be a useful analysis for an investigator who is interested in genes expressed in a large proportion of tumors of a given type.

The Virtual Northern Tool

The analysis described above, which compares the gene expression in glioblastomas vs normal brain based on the DGED tool, allows the investigator to discern genes that may be up or down regulated in these tumors. However, it does not reveal information about the expression of those genes in a wide variety of tissues. To address that need, the Virtual Northern tool was developed. This tool can be accessed for each gene from the Gene Info link throughout the CGAP web site. The Gene Info page provides a wide variety of links to information about the gene and also links to the Virtual Northern. Selecting this tool reveals a figure showing the relative levels of expression for the gene, based both on ESTs and SAGE (Figure 7). The gene expression levels are indicated visually by spots that have different intensities. Using the mouse to scan the spots reveals the number of gene tags for that gene in relation to the total number of gene tags in the database for that tissue.

FEATURES TO CONSIDER WHEN DESIGNING IN SILICO ANALYSES OF THE DATABASES

The integration of different molecular datasets (SAGE, EST, ORESTES) provides a powerful platform for surveying a wealth of gene expression data in cancer tissues, and tools such as the DGED provide concurrent access to all of the datasets. However, it is very important to consider the source of the data when performing any in silico analysis. For example, using the DGED tool it is possible to compare the gene expression profiles in breast cancer and normal breast tissue using the data from all three datasets. However, this would be inadvisable because libraries differ greatly in the number of sequences surveyed per library (tens to hundreds of sequences for ORESTES, a few thousand ESTs, and tens of thousands for SAGE libraries). In addition, while the traditional EST and SAGE approaches seek to survey a wide variety of genes, each ORESTES library includes a limited gene set based on the primer selected. Furthermore, some CGAP libraries were produced by application of normalization and subtraction methods, such that the frequency of gene tags in a library would not reflect the relative proportions of in vivo transcripts. Also, as noted above, ORESTES provides for normalization of gene tags. Therefore, a key principle in the design and analysis of in silico experiments is to carefully consider both the biology and the gene tagging technology used for each library before drawing any scientific conclusions. In the breast cancer example mentioned above, the best approach might be to evaluate gene expression patterns separately for EST, ORESTES, and SAGE approaches, and then compare the resulting lists of differentially expressed genes to identify common features.
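The recommendation to analyze each technology separately and then compare the resulting gene lists can be sketched as follows. This is an illustration only, not a CGAP tool: the fold-change criterion, the `min_fold` threshold, the `pseudocount` used to avoid division by zero, the function names, and the gene identifiers are all assumptions made for the example.

```python
from typing import Dict, Set, Tuple

# One entry per technology: (tag counts in tumor pool, total tags in tumor pool,
#                            tag counts in normal pool, total tags in normal pool)
PoolPair = Tuple[Dict[str, int], int, Dict[str, int], int]


def overexpressed(counts_a: Dict[str, int], total_a: int,
                  counts_b: Dict[str, int], total_b: int,
                  min_fold: float = 2.0, pseudocount: float = 0.5) -> Set[str]:
    """Genes whose depth-normalized frequency in pool A is >= min_fold times pool B.

    Dividing by the pool totals keeps a deep SAGE pool and a shallow
    ORESTES pool from being compared on raw counts.
    """
    hits: Set[str] = set()
    for gene in set(counts_a) | set(counts_b):
        freq_a = (counts_a.get(gene, 0) + pseudocount) / total_a
        freq_b = (counts_b.get(gene, 0) + pseudocount) / total_b
        if freq_a / freq_b >= min_fold:
            hits.add(gene)
    return hits


def consensus(per_technology: Dict[str, PoolPair]) -> Set[str]:
    """Intersect the candidate lists obtained independently for each technology."""
    lists = [overexpressed(*pools) for pools in per_technology.values()]
    return set.intersection(*lists) if lists else set()


if __name__ == "__main__":
    # Hypothetical counts for three hypothetical genes.
    data = {
        "EST": ({"G1": 8, "G2": 3}, 4000, {"G1": 2, "G2": 3}, 4000),
        "SAGE": ({"G1": 40, "G3": 30}, 50000, {"G1": 10, "G3": 2}, 50000),
    }
    print(consensus(data))
```

Keeping the technologies apart until the final intersection means that a gene is retained only when independent tagging methods agree, at the cost of discarding genes that one method cannot sample well.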


Figure 5 An example of a query search in the cDNA xProfiler. The cDNA xProfiler analysis tool is used to compare the genes that are differentially expressed in two pools of cDNA libraries. A query of the genes expressed in breast, prostate, and ovarian cancer libraries (pool A) when compared to those genes expressed in all normal tissues, excluding brain tissue (pool B), is shown. The numbers under each column are linked to a complete list of genes in that category.
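The categories reported by the cDNA xProfiler (genes in pool A, in pool B, in both, in either, and in one pool but not the other) correspond to ordinary set operations. The sketch below assumes a simplified data model in which each library is represented by the set of genes for which at least one EST was observed; the library names, gene identifiers, and function names are hypothetical, and the actual tool may be implemented differently.

```python
from typing import Dict, Iterable, Set


def pool_genes(libraries: Dict[str, Set[str]], pool: Iterable[str]) -> Set[str]:
    """Union of the genes seen at least once in any library of the pool."""
    genes: Set[str] = set()
    for name in pool:
        genes |= libraries[name]
    return genes


def compare_pools(libraries: Dict[str, Set[str]],
                  pool_a: Iterable[str],
                  pool_b: Iterable[str]) -> Dict[str, Set[str]]:
    """Return the gene sets for the categories described in the text."""
    a = pool_genes(libraries, pool_a)
    b = pool_genes(libraries, pool_b)
    return {
        "A": a,
        "B": b,
        "both": a & b,      # found in both pools
        "either": a | b,    # found in at least one pool
        "A_only": a - b,    # found in pool A but not pool B
        "B_only": b - a,    # found in pool B but not pool A
    }


if __name__ == "__main__":
    # Hypothetical libraries and gene identifiers.
    libs = {
        "breast_tumor_1": {"G1", "G2"},
        "prostate_tumor_1": {"G2", "G3"},
        "normal_colon_1": {"G2", "G4"},
    }
    result = compare_pools(libs, ["breast_tumor_1", "prostate_tumor_1"], ["normal_colon_1"])
    for category, genes in result.items():
        print(category, sorted(genes))
```

Presence in a pool here requires only a single observed sequence, so the comparison says nothing about expression level; that is the gap the statistical comparison in the DGED is meant to address.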

Figure 6 An example of a query search in the Digital Gene Expression Displayer. This is an example of the results of an analysis performed using the Digital Gene Expression Displayer (DGED) tool. The genes, expressed in pools A and B, are shown. The DGED not only indicates which genes are found in each library pool, but also includes the statistical significance of the difference in gene expression between the library pools.
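The text states that the DGED ranks genes by an odds ratio and a chi-squared test but defers the formulas to the CGAP web site. The sketch below therefore uses the textbook forms for a 2x2 table (gene vs all other tags, pool A vs pool B), with a Pearson chi-squared statistic without continuity correction; the exact calculation used by the CGAP site may differ. The function name `compare_gene` and the example counts are hypothetical.

```python
import math
from typing import Tuple


def compare_gene(hits_a: int, total_a: int,
                 hits_b: int, total_b: int) -> Tuple[float, float, float]:
    """Return (odds ratio, chi-squared statistic, p-value) for one gene.

    2x2 table:            pool A   pool B
        this gene            a        c
        all other tags       b        d
    """
    a, b = hits_a, total_a - hits_a
    c, d = hits_b, total_b - hits_b

    # Odds ratio (a/b) / (c/d); undefined cases are reported as inf or nan.
    if b * c == 0:
        odds_ratio = math.inf if a * d > 0 else math.nan
    else:
        odds_ratio = (a * d) / (b * c)

    # Pearson chi-squared for a 2x2 table, no continuity correction.
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts: 30 of 10 000 tags in pool A vs 10 of 10 000 in pool B.
    odds, chi2, p = compare_gene(30, 10_000, 10, 10_000)
    print(f"odds ratio={odds:.2f} chi2={chi2:.2f} p={p:.4f}")
```

A test of this kind treats every tag as an independent draw from a single pooled population, so a small p-value does not show that the gene is over-expressed in each individual tumor of the pool.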


The CGAP and HGCP data sets must be used with care, to ensure that the in silico experiments adhere to high scientific standards. Because the numbers of samples that can be analyzed through these datasets are relatively limited, further experimentation with additional tissues and technologies is required to verify insights gained through in silico analyses. When using the in silico analysis tools, such as DGED and the Virtual Northern, it is important to remember that tumors derived from the same anatomical location are quite heterogeneous with respect to gene expression, and that the number of different tumors that can be examined through EST and SAGE approaches is relatively limited. Any results that are deemed statistically significant by the DGED and Virtual Northern tools indicate genes that may be worthy of additional analysis in a larger population of tumor samples. Mining a pool of libraries derived from tumors will reflect gene expression differences in the overall population of tissues, but will not necessarily indicate the frequency with which a particular gene is over or under expressed within any individual tumor within the pool. Moreover, while a minority of the libraries in the database were produced from microdissected samples, most libraries in the database are derived from bulk tumors, and therefore do not address the issue of intra-tumor cellular heterogeneity. However, even with these caveats, the results noted below suggest that gene expression databases are already a powerful new tool in the fight against cancer.

Figure 7 A display of a Virtual Northern. The expression level of a selected gene is shown in a variety of tissues. Placing the mouse over any spot will reveal the actual number of sequences that appeared in the total sequences in the libraries, as shown in the box.

EXAMPLES OF SUCCESSFUL DATA MINING FROM THE GENE CATALOG

The utility of data mining of the EST and SAGE datasets for cancer research is already evident in the scientific literature. In general, data mining efforts are focused toward identification of new genes with relevance to cancer, cataloging genes that are over-expressed in various cancers, and selecting gene sets for further analysis, such as through the use of microarrays.

For example, based on an analysis of genes that are homologous among various organisms, an EST from a CGAP library was identified for the human telomerase catalytic subunit, which led to the identification of a full-length cDNA.30 Data mining for genes preferentially expressed in tumors has also been particularly fruitful. An analysis of both the EST and SAGE datasets has resulted in the identification of genes over expressed in prostate, pancreatic, breast, brain, colon, and ovarian cancers.31–37 For example, starting with CGAP SAGE data, Loging and colleagues36 searched for genes expressed in glioblastoma multiforme. In silico analysis revealed 13 candidate genes that were over-expressed in brain tumors. Further analysis by fluorescent-PCR confirmed that seven of these genes had potential to serve as candidate tumor markers. However, no single gene was up regulated in all of the glioblastomas examined, reflecting the heterogeneity of expression profiles in these tumors. In a similar example, CGAP SAGE libraries constructed from various ovarian tissues were mined to identify a set of genes up regulated in ovarian cancer, which included various secreted and surface proteins.37 In this study, immunohistological analysis validated the SAGE findings with protein expression patterns for several genes including Claudin 3, Claudin 4, and ApoJ.

Particularly exciting is the new opportunity to mine gene sets based on the physiology of the tumor and the microenvironment, such as hypoxia-induced genes and genes associated with the tumor endothelium.38–39 For example, Lal et al38 produced CGAP SAGE libraries which were used to identify 32 putative hypoxia responsive genes in a human glioblastoma cell line. Subsequent real-time polymerase chain reaction (PCR) and in situ analysis of tumor samples confirmed the up-regulation of 20 of these genes in hypoxic regions of a variety of tumors.

Thus, there is already credible evidence to support the belief that the CGAP and HGCP projects will provide substantial new insights into cancer biology, and support the development of new approaches to cancer detection, diagnosis, and treatment. The results presented here suggest that international, public gene expression databases promise immense value in the future. By sharing datasets, the scientific thoughts of researchers worldwide can be harnessed toward addressing problems in cancer research. In addition, while researchers in one country might wish to devote their efforts to study cancers most prominent in their geographical area, by building common databases of gene expression


for all cancers, we have the ability not only to look for differences in cancers, but also for common targets that might lead to interventions for multiple forms of cancer. Therefore, while the current effort results from a collaboration of researchers in Brazil and the United States, the hope is that this will become a worldwide resource toward the eradication of cancer.

ACKNOWLEDGEMENTS

The Human Cancer Genome Project was supported by the Ludwig Institute for Cancer Research (LICR) and Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP). The ORESTES sequences were generated by a virtual network of 33 laboratories from the State of São Paulo, Brazil.

The NCI Cancer Genome Anatomy Project results from the effort of a multidisciplinary team of scientists from academic and industrial laboratories. A list of the CGAP team members can be found at http://cgap.nci.nih.gov/Info/teams.

DUALITY OF INTEREST

None declared.

REFERENCES

1 Garber ME et al. Diversity of gene expression in adenocarcinoma of the lung. Proc Natl Acad Sci USA 2001; 98: 13784–13789.
2 Perou CM et al. Molecular portraits of human breast tumours. Nature 2000; 406: 747–752.
3 Alizadeh AA et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000; 403: 503–511.
4 Bhattacharjee A et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA 2001; 98: 13790–13795.
5 Golub TR et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999; 286: 531–537.
6 Shih LM et al. Top-down morphogenesis of colorectal tumors. Proc Natl Acad Sci USA 2001; 98: 2640–2645.
7 Polyak K, Riggins GJ. Gene discovery using the serial analysis of gene expression technique: implications for cancer research. J Clin Oncol 2001; 19: 2948–2958.
8 Riggins GJ. Using serial analysis of gene expression to identify tumor markers and antigens. Disease Markers 2001; 17: 41–48.
9 Khan J et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001; 7: 673–679.
10 Strausberg RL, Buetow KH, Emmert-Buck MR, Klausner RD. The cancer genome anatomy project: building an annotated gene index. Trends Genet 2000; 16: 103–106.
11 Strausberg RL, Dahl CA, Klausner RD. New opportunities for uncovering the molecular basis of cancer. Nat Genet 1997; 15: 415–416.
12 Strausberg RL. The Cancer Genome Anatomy Project: new resources for reading the molecular signatures of cancer. J Pathol 2001; 195: 31–40.
13 Strausberg RL, Greenhut SF, Grouse LH, Schaefer CF, Buetow KH. In silico analysis of cancer through the cancer genome anatomy project. Trends Cell Biol 2001; 11: S66–S71.
14 Buetow KH et al. High-throughput development and characterization of a genomewide collection of gene-based single nucleotide polymorphism markers by chip-based matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Proc Natl Acad Sci USA 2001; 98: 581–584.
15 Buetow KH, Edmonson MN, Cassidy AB. Reliable identification of large numbers of candidate SNPs from public EST data. Nat Genet 1999; 21: 323–325.
16 Cheung VG et al. Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature 2001; 409: 953–958.
17 Kirsch IR et al. A systematic, high-resolution linkage of the cytogenetic and physical maps of the human genome. Nat Genet 2000; 24: 339–340.
18 Mitelman F, Johansson B, Mertens F (eds). Mitelman Database of Chromosome Aberrations in Cancer. http://cgap.nci.nih.gov/Chromosomes/Mitelman (2002)
19 Dias-Neto E et al. Shotgun sequencing of the human transcriptome with ORF expressed sequence tags. Proc Natl Acad Sci USA 2000; 97: 3491–3496.
20 Camargo AA et al. The contribution of 700 000 ORF sequence tags to the definition of the human transcriptome. Proc Natl Acad Sci USA 2001; 98: 12103–12108.
21 de Souza SJ et al. Identification of human chromosome 22 transcribed sequences with ORF expressed sequence tags. Proc Natl Acad Sci USA 2000; 97: 12690–12693.
22 Adams MD et al. Sequence identification of 2375 human brain genes. Nature 1992; 355: 632–634.
23 Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science 1995; 270: 484–487.
24 Lal A et al. A public database for gene expression in human cancers. Cancer Res 1999; 59: 5403–5407.
25 Lash AE et al. SAGEmap: a public gene expression resource. Genome Res 2000; 10: 1051–1060.
26 Strausberg RL, Feingold EA, Klausner RD, Collins FS. The mammalian gene collection. Science 1999; 286: 455–457.
27 Wheeler DL et al. Database resources of the National Center for Biotechnology Information. Nucl Acids Res 2001; 29: 11–16.
28 Emmert-Buck MR et al. Laser capture microdissection. Science 1996; 274: 998–1001.
29 Emmert-Buck MR et al. Molecular profiling of clinical tissue specimens: feasibility and applications. Am J Pathol 2000; 156: 1109–1115.
30 Nakamura TM et al. Telomerase catalytic subunit homologs from fission yeast and human. Science 1997; 277: 955–959.
31 Argani P et al. Discovery of new markers of cancer through serial analysis of gene expression: prostate stem cell antigen is overexpressed in pancreatic adenocarcinoma. Cancer Res 2001; 61: 4320–4324.
32 Scheurle D et al. Cancer gene discovery using digital differential display. Cancer Res 2000; 60: 4037–4043.
33 Luo J et al. Human prostate cancer and benign prostatic hyperplasia: molecular dissection by gene expression profiling. Cancer Res 2001; 61: 4683–4688.
34 Ryu B, Jones J, Hollingsworth MA, Hruban RH, Kern SE. Invasion-specific genes in malignancy: serial analysis of gene expression comparisons of primary and passaged cancers. Cancer Res 2001; 61: 1833–1838.
35 Porter DA et al. A SAGE (serial analysis of gene expression) view of breast tumor progression. Cancer Res 2001; 61: 5697–5702.
36 Loging WT et al. Identifying potential tumor markers and antigens by database mining and rapid expression screening. Genome Res 2000; 10: 1393–1402.
37 Hough CD, Cho KR, Zonderman AB, Schwartz DR, Morin PJ. Coordinately up-regulated genes in ovarian cancer. Cancer Res 2001; 61: 3869–3876.
38 Lal A et al. Transcriptional response to hypoxia in human tumors. J Natl Cancer Inst 2001; 93: 1337–1343.
39 St Croix B et al. Genes expressed in human tumor endothelium. Science 2000; 289: 1197–1202.
