Using Information Retrieval for Large-Scale Gene Analysis

From: ISMB-00 Proceedings. Copyright © 2000, AAAI (www.aaai.org). All rights reserved.

Genes, Themes and Microarrays: Using Information Retrieval for Large-Scale Gene Analysis

Hagit Shatkay, Stephen Edwards, W. John Wilbur, Mark Boguski
National Center for Biotechnology Information
NLM, NIH
Bethesda, Maryland 20984
{shatkay,edwards}@ncbi.nlm.nih.gov

Abstract

The immense volume of data resulting from DNA microarray experiments, accompanied by an increase in the number of publications discussing gene-related discoveries, presents a major data analysis challenge. Current methods for genome-wide analysis of expression data typically rely on cluster analysis of gene expression patterns. Clustering indeed reveals potentially meaningful relationships among genes, but cannot explain the underlying biological mechanisms. In an attempt to address this problem, we have developed a new approach for utilizing the literature in order to establish functional relationships among genes on a genome-wide scale. Our method is based on revealing coherent themes within the literature, using a similarity-based search in document space. Content-based relationships among abstracts are then translated into functional connections among genes. We describe preliminary experiments applying our algorithm to a database of documents discussing yeast genes. A comparison of the produced results with well-established yeast gene functions demonstrates the effectiveness of our approach.

Keywords: genomics, microarray, machine learning, information retrieval, document databases

Introduction

The development of DNA microarrays during the last few years (Schena et al. 1995; DeRisi, Iyer, & Brown 1997) allows researchers to simultaneously measure the expression levels of thousands of different genes. Experiments involving such arrays produce overwhelming amounts of data. In response, much recent work has been concerned with automating the analysis of microarray data. Currently pursued techniques (e.g. Eisen et al. (1998), Tamayo et al. (1999), Ben-Dor et al. (1999)) concentrate mostly on applying clustering methods directly to the expression data, in order to find clusters of genes demonstrating similar expression patterns. The assumption motivating such a search for co-expressed genes is that simultaneously expressed genes often share a common function. However, there are several reasons that cluster analysis alone cannot fully address this core issue:

1. Genes that are functionally related may demonstrate strong anti-correlation in their expression levels (a gene may be strongly suppressed to allow another to be expressed), and are thus clustered into separate groups, blurring the relationship between them.

2. As shown later, simultaneously expressed genes do not always share a function. Moreover, genes that are expressed at different times may serve complementing roles of one unifying function.

3. Even when similar expression levels correspond to similar functions, the function and the relationships between genes in the same cluster cannot be determined from the cluster data alone. Testing, justifying, and explaining the formed clusters requires a lot of additional research effort.

4. Due to the interrelated nature of biological processes, genes may have more than a single function. The strict assignment of genes to clusters, resulting from most clustering methods currently used, may prove overly stringent, potentially preventing the exposure of complex interrelationships between genes.

The work described in this paper aims to complement the existing methods by providing a much-needed biological context, based on a survey of the existing literature. The assumption underlying our approach is that the function of many genes is described in the literature, and by relating documents talking about well understood genes to documents discussing other genes, we can predict, detect and explain the functional relationships between the many genes that are involved in large-scale experiments. We do not attempt here to draw any functional or relational information from the expression array itself. Instead, we use a large database of documents as our information search space. Each gene is represented by a document, roughly discussing the gene's biological function. The literature database is then searched for documents similar to the gene's document. Thus, for each gene we produce a set of documents that are related to its functional role. We then look for similarities between the resulting sets of documents. Since each set corresponds to a gene, we can map the similar document sets back to their corresponding genes, and establish functional relationships among these genes.

To accomplish this goal, we use a new statistical information-retrieval method (Shatkay & Wilbur 2000) to conduct the similarity search based on the gene's document. As an integral part of our algorithm, we produce an "executive summary", consisting of a few characteristic content bearing terms in the set of documents assigned to each gene. Thus we simultaneously achieve three goals:

• Finding functional relationships between genes.
• Obtaining the literature specifically relevant to the function of these genes.
• Producing a short summary justifying why the genes were considered relevant to each other, and what their function is.

This functional information can then be correlated with the expression array cluster analysis to refine the resulting hypotheses and, by extension, future experiments.

The rest of this paper is organized as follows: The next section surveys related work on gene analysis, both based directly on expression array data and on literature mining. We then describe our approach of using the literature to find function and relationships between genes. Next we discuss our preliminary experiments and results over the set of well-studied yeast genes discussed by Spellman et al. (1998). Our results demonstrate that the automated usage of literature is an extremely powerful tool for determining relationships between genes, for explaining expression-based clusters obtained from array-based experiments, and for assisting in the design of further experiments.

Related Work

The first part of this section provides further background on the analysis of data obtained from gene expression arrays and the challenges it poses; the second part discusses current methods for using the literature for gene analysis.

Analyzing Gene Expression Arrays

DNA microarrays represent the latest in a series of powerful tools based on hybridizing a soluble DNA/RNA molecule to its complementary strand immobilized on a solid support (Southern 1975; Wahl, Meinkoth, & Kimmel 1987; Schena et al. 1995). With DNA microarrays, cDNA corresponding to known genes is spotted onto the solid support (usually a glass slide). The mRNA from cells or tissues is then converted into fluorescently [...]

[...] known genome from Escherichia coli, Mycobacterium tuberculosis, and Saccharomyces cerevisiae already exist (Brown & Botstein 1999), and those representing Caenorhabditis elegans and Drosophila melanogaster genome sequences should be available soon. In addition, commercially available DNA microarrays and oligonucleotide arrays exist for most of the human genes characterized to date and can be expected for the whole human genome once it is completely sequenced and annotated within the next three years.

This new technology allows gene expression experiments to be performed on a genome-wide scale. Experiments with S. cerevisiae have studied changes in gene expression patterns for over 95% of the protein coding genes simultaneously under a variety of conditions (Cho et al. 1998; Spellman et al. 1998; DeRisi, Iyer, & Brown 1997; Chu et al. 1998). This increase in the percentage of the genome measured has an immediate impact on the number of genes awaiting analysis. For example, the number of genes collectively identified as being induced during sporulation dramatically increased from a total of 50 to approximately 500 from a single set of genome-wide microarray experiments (Chu et al. 1998).

With this increased volume of data, manual gene analysis becomes impractical, and there is an immediate need for more powerful methods of data analysis (Ermolaeva et al. 1998; Bassett, Eisen, & Boguski 1999). Most efforts to date have involved clustering genes based on their expression patterns and using these clusters to infer functional correlation. Methods involving hierarchical clustering, commonly applied in sequence and phylogenetic analysis, have been used with the yeast data sets described previously (Eisen et al. 1998). As expected, in many cases this clustering revealed that genes with a common function were indeed coexpressed (Spellman et al. 1998; Eisen et al. 1998). Self-organizing maps (Tamayo et al. 1999) and other clustering methods (Wen et al. 1998; Ben-Dor & Yakhini 1999) have also been shown to effectively group genes by the observed expression patterns.

While clusters of simultaneously expressed genes can correlate with shared function, this is not always the case. The complex and parallel nature of the system causes some genes to share similar expression profiles despite the distinct biological processes in which they are involved. In fact, careful analysis of the CLB2 cluster described by Spellman et al. (1998) reveals genes involved in several different cellular functions. For example, CHS2, BUD8, and IQG1 are all involved in maintenance of the cell wall while ACE2, ALK1, and HST3
