From: ISMB-00 Proceedings. Copyright © 2000, AAAI (www.aaai.org). All rights reserved.

Genes, Themes and Microarrays: Using Information Retrieval for Large-Scale Gene Analysis

Hagit Shatkay, Stephen Edwards, W. John Wilbur, Mark Boguski
National Center for Biotechnology Information, NLM, NIH
Bethesda, Maryland 20984
{shatkay,edwards}@ncbi.nlm.nih.gov

Abstract

The immense volume of data resulting from DNA microarray experiments, accompanied by an increase in the number of publications discussing gene-related discoveries, presents a major data analysis challenge. Current methods for genome-wide analysis of expression data typically rely on cluster analysis of gene expression patterns. Clustering indeed reveals potentially meaningful relationships among genes, but cannot explain the underlying biological mechanisms. In an attempt to address this problem, we have developed a new approach for utilizing the literature in order to establish functional relationships among genes on a genome-wide scale. Our method is based on revealing coherent themes within the literature, using a similarity-based search in document space. Content-based relationships among abstracts are then translated into functional connections among genes. We describe preliminary experiments applying our algorithm to a database of documents discussing yeast genes. A comparison of the produced results with well-established yeast gene functions demonstrates the effectiveness of our approach.

Keywords: genomics, microarray, machine learning, information retrieval, document databases

Introduction

The development of DNA microarrays during the last few years (Schena et al. 1995; DeRisi, Iyer, & Brown 1997) allows researchers to simultaneously measure the expression levels of thousands of different genes. Experiments involving such arrays produce overwhelming amounts of data. In response, much recent work has been concerned with automating the analysis of microarray data. Currently pursued techniques (e.g. Eisen et al. (1998), Tamayo et al. (1999), Ben-Dor et al. (1999)) concentrate mostly on applying clustering methods directly to the expression data, in order to find clusters of genes demonstrating similar expression patterns. The assumption motivating such a search for co-expressed genes is that simultaneously expressed genes often share a common function. However, there are several reasons that cluster analysis alone cannot fully address this core issue:

1. Genes that are functionally related may demonstrate strong anti-correlation in their expression levels (a gene may be strongly suppressed to allow another to be expressed), and are thus clustered into separate groups, blurring the relationship between them.
2. As shown later, simultaneously expressed genes do not always share a function. Moreover, genes that are expressed at different times may serve complementing roles of one unifying function.
3. Even when similar expression levels correspond to similar functions, the function and the relationships between genes in the same cluster cannot be determined from the cluster data alone. Testing, justifying, and explaining the formed clusters requires a lot of additional research effort.
4. Due to the interrelated nature of biological processes, genes may have more than a single function. The strict assignment of genes to clusters, resulting from most clustering methods currently used, may prove overly stringent, potentially preventing the exposure of complex interrelationships between genes.

The work described in this paper aims to complement the existing methods by providing a much-needed biological context, based on a survey of the existing literature. The assumption underlying our approach is that the function of many genes is described in the literature, and by relating documents talking about well understood genes to documents discussing other genes, we can predict, detect and explain the functional relationships between the many genes that are involved in large-scale experiments. We do not attempt here to draw any functional or relational information from the expression array itself. Instead, we use a large database of documents as our information search space. Each gene is represented by a document, roughly discussing the gene's biological function. The literature database is then searched for documents similar to the gene's document. Thus, for each gene we produce a set of documents that are related to its functional role. We then look for similarities between the resulting sets of documents. Since each set corresponds to a gene, we can map the similar document sets back to their corresponding genes, and establish functional relationships among these genes.

To accomplish this goal, we use a new statistical information-retrieval method (Shatkay, Wilbur 2000) to conduct the similarity search based on the gene's document. As an integral part of our algorithm, we produce an "executive summary", consisting of a few characteristic content bearing terms in the set of documents assigned to each gene. Thus we simultaneously achieve three goals:

• Finding functional relationships between genes.
• Obtaining the literature specifically relevant to the function of these genes.
• Producing a short summary justifying why the genes were considered relevant to each other, and what their function is.

This functional information can then be correlated with the expression array cluster analysis to refine the resulting hypotheses and, by extension, future experiments.

The rest of this paper is organized as follows: The next section surveys related work on gene analysis, both based directly on expression array data and on literature mining. We then describe our approach of using the literature to find function and relationships between genes. Next we discuss our preliminary experiments and results over the set of well-studied yeast genes discussed by Spellman et al. (1998). Our results demonstrate that the automated usage of literature is an extremely powerful tool for determining relationships between genes, for explaining expression-based clusters obtained from array-based experiments, and for assisting in the design of further experiments.

Related Work

The first part of this section provides further background on the analysis of data obtained from gene expression arrays and the challenges it poses; the second part discusses current methods for using the literature for gene analysis.

Analyzing Gene Expression Arrays

DNA microarrays represent the latest in a series of powerful tools based on hybridizing a soluble DNA/RNA molecule to its complementary strand immobilized on a solid support (Southern 1975; Wahl, Meinkoth, & Kimmel 1987; Schena et al. 1995). With DNA microarrays, cDNA corresponding to known genes is spotted onto the solid support (usually a glass slide). The mRNA from cells or tissues is then converted into fluorescently labeled cDNA and applied to the unlabeled cDNA matrix (Schena 1999). Since each spot on the matrix corresponds to a known gene or EST, the expression level of thousands of genes can be measured in a single experiment. DNA microarrays consisting of the entire known genome from Escherichia coli, Mycobacterium tuberculosis, and Saccharomyces cerevisiae already exist (Brown & Botstein 1999), and those representing Caenorhabditis elegans and Drosophila melanogaster genome sequences should be available soon. In addition, commercially available DNA microarrays and oligonucleotide arrays exist for most of the human genes characterized to date and can be expected for the whole human genome once it is completely sequenced and annotated within the next three years.

This new technology allows experiments to be performed on a genome-wide scale. Experiments with S. cerevisiae have studied changes in gene expression patterns for over 95% of the protein coding genes simultaneously under a variety of conditions (Cho et al. 1998; Spellman et al. 1998; DeRisi, Iyer, & Brown 1997; Chu et al. 1998). This increase in the percentage of the genome measured has an immediate impact on the number of genes awaiting analysis. For example, the number of genes collectively identified as being induced during sporulation dramatically increased from a total of 50 to approximately 500 from a single set of genome-wide microarray experiments (Chu et al. 1998).

With this increased volume of data, manual gene analysis becomes impractical, and there is an immediate need for more powerful methods of data analysis (Ermolaeva et al. 1998; Bassett, Eisen, & Boguski 1999). Most efforts to date have involved clustering genes based on their expression patterns and using these clusters to infer functional correlation. Methods involving hierarchical clustering, commonly applied in sequence and phylogenetic analysis, have been used with the yeast data sets described previously (Eisen et al. 1998). As expected, in many cases this clustering revealed that genes with a common function were indeed coexpressed (Spellman et al. 1998; Eisen et al. 1998). Self-organizing maps (Tamayo et al. 1999) and other clustering methods (Wen et al. 1998; Ben-Dor & Yakhini 1999) have also been shown to effectively group genes by the observed expression patterns.

While clusters of simultaneously expressed genes can correlate with shared function, this is not always the case. The complex and parallel nature of the system causes some genes to share similar expression profiles despite the distinct biological processes in which they are involved. In fact, careful analysis of the CLB2 cluster described by Spellman et al. (1998) reveals genes involved in several different cellular functions. For example, CHS2, BUD8, and IQG1 are all involved in maintenance of the cell wall while ACE2, ALK1, and HST3 are involved in nuclear events. This example demonstrates the wealth of biological information that is not represented by temporal gene clusters.

In addition, some members of a common signaling pathway may play antagonistic roles and actually show an anti-correlation with regards to gene expression. As

a result, the clusters obtained from shared gene expression profiles must still be analyzed with respect to known biological roles, before reliable conclusions about their biological functions can be drawn from the data.

A more recent approach to array analysis uses Bayesian networks to describe relationships between genes (Friedman et al. 2000). Rather than simply group genes according to their related expression patterns, this approach allows the identification of causal relationships among genes. Indeed, based on the analysis of 800 genes shown to have regulated gene expression during the yeast cell cycle (Spellman et al. 1998), only a few of these genes appeared to dominate the order of expression (Friedman et al. 2000), and the results could highlight the critical genes for establishing the yeast cell cycle. While this analysis can suggest causal relationships between genes, it does not provide the biological explanation for these relations. In some cases, only further experimentation can determine the involved mechanism. However, it is highly likely that in many of these cases, this information currently exists in the published literature.

The current method for explaining the discovered clusters and relationships has been for individuals to search through the literature, gene by gene, or rely on their own knowledge of the biological processes involved. While such a method can be effective on a small scale, it produces a major bottleneck when performing experiments on a genome-wide scale.

It is for this reason that we propose the development of an automated method for relating genes according to their biological function based on the current literature. Our method complements the approaches described above, by providing literature-based explanations to the clusters and the relationships that are discovered through the expression arrays. The next section surveys current research aimed at automating literature mining in the area of gene analysis.

Text Usage in Biological Analysis

With the advancement of genome sequencing techniques comes an overwhelming increase in the amount of literature discussing the discovered genes. As an illustrative example, the number of PubMed documents containing the word gene published between the years 1970-1980 is a little over 35,000, while the number of such documents published between the years 1990-2000 is 402,700, more than a tenfold increase. Thus, surveying the literature for information about genes requires a great deal of time and effort. It cannot be done effectively and efficiently using the currently available search techniques, given the large number of genes involved in current expression array experiments. The problem is further aggravated by the non-uniform nomenclature used in the literature, as illustrated below.

The most widely used on-line source for gene-related abstracts is the PubMed database. An initial step in the search for relevant literature in PubMed is the specification of a boolean query. The user provides either a single term (e.g. OLE1), or a boolean combination of terms (e.g. OLE1 AND sterol). The result is the set of all documents found in the database which satisfy the constraints specified in the query. This form of query suffers from several well-known deficiencies:

• A prohibitively large number of documents are typically retrieved.
• A substantial part of the retrieved documents are irrelevant to the user's information needs.
• Many relevant documents may not be retrieved, despite their relevance. For instance, documents that talk about OLE1 using one of its aliases such as DNA repair protein fatty-acid desaturase 1 or ACYL-COA desaturase 1 will not be retrieved.

A lot of recent work on mining the literature for genes and proteins aims at supporting the boolean paradigm, improving it to produce more accurate results (thus mostly addressing the first two problems). Such work concentrates on automated natural language processing for finding relevant phrases and useful facts in text. It is intended to assist in finding documents about a given gene, or about the relationships between specific genes. Leek (1997) suggests a way of using hidden Markov models (HMMs) for extracting sentences discussing gene positions on chromosomes from text. Craven and Kumlien (1999) introduce a method for transforming flat text documents into databases of facts about relationships between genes/proteins, performing a task similar to the one Leek addresses, without the need to obtain an HMM for discovering these relationships. Rindflesch et al. (2000) present a method based on parsing and using thesauri to automatically extract facts about genes and proteins from documents. Blaschke et al. (1999) also use a similar method for extracting information about protein interaction from scientific text. Most of the above methods have only been applied to small and limited sample sets of documents/terms. They all stem from the boolean query paradigm, and require the user to specify a very accurate query in order to provide high-quality results.

Another recent system aiming at improving the quality of the results returned from boolean search over genes is MedMiner by Tanabe et al. (1999). It provides a good interface to two databases, GeneCards and PubMed. In order to retrieve documents that are likely to be of interest to the user, it relies on a human-generated list of keywords, whose presence in a document discussing genes typically indicates that the document is of high quality and relevance. Still, MedMiner provides abundant information about a single gene or about the relationship between two specified genes. Such quantities of information generated per gene when hundreds of genes are involved cannot be effectively handled by a user.

The above methods all rely on strong assumptions regarding the use of natural language, such as the terms typically used to indicate relationships and the way sentences are structured. With the shift towards the analysis of mammalian systems the problem of non-uniform nomenclature and language usage is likely to worsen. Gene symbols are rarely used in the mammalian system literature. Instead, the discussion involves a large variety of terms describing the genes. This additional complication will make it difficult for the user to form accurate boolean queries. It is also likely to reduce the effectiveness of literature mining strategies that are based on gene symbol identifiers (such as the one suggested by Leek) and on strong assumptions about the way gene names are used in sentences. Moreover, these systems can indeed be helpful when searching for information about a few genes at a time, but do not address the need for finding links and functional relationships among thousands of genes.

An alternative to the boolean query paradigm is the use of similarity queries; the user provides a sample document that is relevant to the subject of interest, and gets back other documents discussing the same subject matter. Such a query mechanism does not depend on the user's choice of query terms, but rather on the contents and quality of the example document. The ability to retrieve quality documents that are indeed similar in contents to the example document strongly depends on defining a similarity measure and a search procedure that ranks the relevant documents high and the irrelevant ones low. We have recently developed a probabilistic algorithm that, given an example document, finds a set of documents that are most relevant to it (a theme) and provides a set of terms summarizing the contents of this set of documents (Shatkay, Wilbur 2000). The use of similarity queries in general, and this algorithm in particular, forms the basis of our approach as described in the next section.

The ultimate challenge in the use of literature for analyzing expression arrays is the ability to obtain an overview of the whole landscape of genes and their related literature. A good literature analysis tool should provide information such as which genes are functionally related to each other, what their shared functionality is and which documents discuss this functionality. It should also provide summaries that allow easy and quick browsing through the literature, and easy access to the most relevant documents. The next section describes the new approach we have developed in order to meet such challenges.

Discovering Gene Functions and Relations through the Literature

The hypothesis underlying our approach is that the function of many individual genes is discussed in the literature and that a good analysis of the literature is a primary step both for experimental design and for result analysis following such experiments.

Acting under this hypothesis, we shift our attention from the gene-expression space to document space. Thus we start with a large database of documents containing all the relevant literature discussing the domain of interest (for instance, all the documents in PubMed that discuss yeast genes). Each gene is mapped to a single document discussing it; each such document is treated as a representative of the gene. We call each document thus associated with a gene the kernel document for that gene.

Using our algorithm for finding similar documents, we obtain for each gene a body of related literature (20-50 documents sharing a common theme) based on the document representing the gene, along with an "executive summary" containing the terms that characterize the relevant literature. It is important to note that the abstracts retrieved by our algorithm are considered relevant not because they contain the same gene name as the one associated with the kernel abstract, but rather because they discuss the same issues (which typically corresponds to functionality) as those discussed in the kernel document.

There are several ways to use the set of documents retrieved for each gene in order to derive relationships among genes:

• One can simply mine this set for the names of other genes as done by any of the algorithms described in the previous section. The main limitation of doing so is the dependency on explicit rules for detecting gene names, with the risk of overlooking important information while detecting unimportant relationships.
• A more effective way is to automatically compare the sets of documents retrieved for each gene, and determine that genes share similar functionality if the literature associated with each of them is similar.
• A third possible way is to use the terms characterizing the retrieved literature, as they occur in the summary, and consider genes as related if their summaries consist of the same (or almost the same) set of terms.

We currently use the second of these methods to determine relationships among genes, as described later in this section.

The first step in our approach requires mapping the set of genes $\{G_1, \ldots, G_N\}$ to a set of kernel documents $\{K_1, \ldots, K_N\}$ (see top of Figure 2). Kernel documents are currently obtained from the available curated literature about yeast genes (as explained in the experiments part of this paper). Our method strongly depends on the quality of the kernel documents. Abstracts discussing experimental methods rather than gene function tend to draw other documents describing the same experimental methods. The result is a document set not representative of the gene's function. On the other

320 SHATKAY ¯ qT __ the probability that the term ti occurs in a Pr(Term) document d, given that d is an off-theme document: I 0.9 qT~fPr(t~ 6 did ~ T) 0.8 0.7 ¯ DBi -- the probability that the term ti occurs in 0.6 a document d, given that d is a document in the 0.5 database, regardless of its being an on-theme or an (14 0.3 off-theme document: DBi~fPr(ti 6 did 6 DB) 0.2 (11 The distribution DBi models the possible arbitrary us- acid age of terms in the language, without being strongly in- dicative of the main topic discussed. (e.g. the sentence "He entered the building" is not particularly relevant Figure 1: Typical term distribution for the Nutrition to the topic construction, despite the occurrence of the theme. term building in it). The a priori probability of any document d 6 DB, re- hand, kernels discussing gene biology typically lead to gardless of its contents, to be a theme documentis de- high quality information about the functionality of re- lated genes. We are currently considering ways to au- noted as Pd: Paa=~Pr(d6 T). tomate the kernel selection process, so that each kernel Throughout this paper, we assume this parameter to faithfully represents the biology of its associated gene. be known and fixed for all documents, and we do not attempt to estimate it here. (In the experiments de- The rest of this section provides the details of our ap- scribed later, I’d = 0.01 for all d 6 DB.) proach. Wefirst outline the similarity query algorithm used for finding related abstracts starting from a ker- The last component of our model is the Bernoulli event nel document. (A complete discussion of the models representing the choice made for ea~ term ti, in each and the algorithms can be found in (Shatkay, Wilbur documentd, whether it is to be generated according to 2000)). We then describe how similarities between the the database probability, DBi or according to the spe- obtained document collections are detected. cific on/off-theme distribution. 
Wedenote this proba- bility, for each term ti, as Ai. Similarity Queries over Documents The process by which each document d E DB is gen- Our algorithm is based on the idea that documents erated, given a specific theme, T, can be modeled as which share a commontheme can be modeled as though follows: First it is decided if the documentd is inside they were generated through sampling from a common the theme T or not. The probability for d 6 T is I’d. set of independent Bernoulli distributions representing Then for each term, ti, it is decided if ti is generated the theme. For example, a set of documents discussing according to the general database distribution, DBi, or genes responsible for nutrition during the cell-cycle, are according to its specific theme/off-theme distribution. likely to contain terms such as fructose or glucose and The probability of a term ti to be generated according quite unlikely to contain the term lipid, as illustrated to the general database distribution DBi is Ai. in Figure 1. Finally, the decision whether to include the term in the documentd is based on one of three possibilities: Each document in our document database, DB, is mod- eled as an M-dimensional binary vector, where M ¯ If ti is to be generated according to the general DB is the number of distinct terms 1 {tz, ..., tM} ill distribution, it is included in d with probability DBi. the database. Formally, a document d is a vector Otherwise: (dl, d2, ..., dM), where: ¯ If d is a theme document, ti is included in d with 6 probabilityT. pi di = 6did~ef 10 ifotherwise ti d , . (1) ¯ If d is an off-theme document,ti is included in d with Given a theme T, we view the presence/absence of probability qT. terms in document d in the database DB, as a result of Note that for each document d 6 DB, we know the M independent Bernoulli events, each of which stems terms it contains. 
The missing information is which from one of three families of Bernoulli distributions: documents are theme documents and which terms are ¯ pT __ the probability that the term ti occurs in generated from the general distribution, DBi, as op- posed to the theme-specific ones, pit and qT. a document d, given that d is a theme document: pT~fPr(ti 6 did 6 T) Given a single documentrepresenting the gene, our task is to find the characteristic set of Bernoulli distribu- ZTermsconsist of one or two words, excluding stop words. tions, (pT, qT and A)2, for all terms i, and use it to Theyaxe extracted from the raw text in a standard prepro- cessing stage. 2Note that estimating DBiis straightforward since all

ISMB 2000 321 find the documents that are highly likely to have been that are informative and descriptive of the specific sub- generated by sampling from these distributions. The ject matter. latter documents are the ones focused on the theme represented by these distributions. In addition, we pro- This output, as shownin the results section of this pa- duce a set of terms characterizing this theme. These per, in and of itself, provides valuable support for gene are the terms that have a high probability to occur in analysis. Still, we further extend it in the next phase, theme documents (high pT) and a much lower proba- to assist in finding relations amongthe genes. bility to occur in documents outside the theme (high ratio pT /qT). Finding Functional Relations among Genes To estimate the Bernoulli parameters under missing Obviously, establishing firm functional relationships be- information as described above, we use an Expecta- tween genes requires performing carefully designed ex- tion Maximization algorithm(EM) (Dempster, Laird, periments. However, the literature can be used to sug- & Rubin 1977); it aims to maximize the likelihood gest possible relations and to provide coherent justifica- of the database partition into theme/off-theme docu- tion for these suggestions. In the following we describe ments, given the Bernoulli parameters, based on the our approach for utilizing the literature in this manner. kernel document. The complete algorithm is described Our primary assumption, which is justified by our re- elsewhere (Shatkay, Wilbur 2000), and we provide only suits, is that commonrelevant literature is a strong in- its outline here. An EMalgorithm starts by initializ- dicator of commonfunctionality. 
That is, genes which ing the model parameters, (pT, qT, )~T), based on some have similar lists of top ranking documents associated prior knowledge; in our case the initial assignment is a with them, share some commonfunction that is de- rough approximation of the Bernoulli parameters based scribed in the commonliterature. on the kernel document and its comparison to the rest of the database. It then alternates between: Our task is thus reduced to finding similarities between the lists of documents retrieved in the previous phase ¯ the E-step of computing the expected values, for the of the algorithm, and to associating with each gene all likelihood of the documents to be in the same theme the other genes that have similar document lists. To as the kernel document, under the current parameter do this we use the PubMedidentifiers associated with estimates, and the documents, without examining the documents’ con- tents. Using the identifiers alone, we construct for each ¯ the M-step of finding new model parameters that kernel a vector characterizing it based on the documents maximizethe likelihood of the database partition into deemedrelevant to it by the first phase of the algorithm. theme/off-theme documents given the parameters. Using this vector representation, we can rank, for each This iterative process is guaranteed, under mild condi- kernel Ki, all the other kernels according to their prox- tions, to provide monotonically increasing convergence imity to Ki in the kernel-vector space. Since each kernel of the likelihood function, and we have proven that our corresponds to a gene, we can mapthe inter-related ker- nels back to their respective genes, and obtain a set of algorithm indeed converges to such a local maximum. genes that are closely related. 
We execute this algorithm for each of the kernel documents, $\{K_1, \ldots, K_N\}$, representing each of the genes, $\{G_1, \ldots, G_N\}$, as illustrated in the top part of Figure 2. The result from the run for each gene consists of:

• a list of the top 50 documents discussing the same theme as the kernel document, ordered by their degree of relevance to the theme, and

• a list of terms (keywords) characterizing the theme, ordered by their degree of relevance to the theme.

Note that the keywords provided in the list are not merely the terms most probable to occur in the set of documents discussing the theme, but rather those that are much more probable to occur in this set than in the rest of the database ($p_T/q_T$ is high). Simply using the most frequent terms (as done, for example, by Tanabe et al. (1999)) typically results in terms that are common throughout the database and therefore non-informative. In contrast, our method provides keywords [...] the required information is present in the database.

Figure 2: Finding Documents and Terms related to Genes (top), and Sets of Related Genes (bottom).

The method is illustrated at the bottom part of Figure 2 and is further described in the following paragraphs.

First, we construct the set of PubMed identifiers of relevant documents, $S_r$, as follows:

Let $N$ be the number of kernel documents used for representing genes^3. We denote each kernel document by $K_i$ where $1 \le i \le N$. For each kernel, $K_i$, let $L_i$ be the set of PubMed identifiers for the 50 top ranking documents associated with kernel $K_i$, formally: $L_i \stackrel{\mathrm{def}}{=} \{ID^i_1, \ldots, ID^i_{50}\}$, where $ID^i_j$ is the PubMed identifier of the $j$th document ranked as relevant for kernel $K_i$.

^3 The number of genes we are analyzing may exceed $N$ since the same kernel document might discuss and represent more than a single gene.

Intuitively speaking, if two distinct genes, $G_i$ and $G_j$, represented by kernels $K_i$ and $K_j$, have similar sets of relevant PubMed identifiers, $L_i$ and $L_j$, then the literature relevant to these two genes has a lot in common. This in turn suggests that some roles and functions (which are typically described in the literature) are shared by these two genes.

Note that when looking for similarities between lists of PubMed identifiers, identifiers that occur only within a single list $L_i$, and do not occur in any other list, $L_j$, do not contribute to the evaluation of $L_j$ as similar to $L_i$. Using this observation, we can reduce the number of PubMed identifiers used for comparing document lists. Formally, let $ID$ denote a PubMed identifier and $|ID|$ denote the total number of identifier lists, $L_i$, in which $ID$ occurs. Our calculations need only take into account those identifiers for which $|ID| > 1$.

Thus, $S_r$ is defined to be the set of PubMed identifiers of all documents that are in the relevance list of at least two kernels. Formally:

$$S_r \stackrel{\mathrm{def}}{=} \bigcup_{i=1}^{N} L_i - \{ID \mid |ID| \le 1\}. \tag{2}$$

We denote the number of PubMed identifiers in $S_r$, $|S_r|$, by $M_r$, and denote each PubMed identifier in $S_r$ as $ID^j$ where $1 \le j \le M_r$.

We can now represent each kernel document $K_i$ as an $M_r$-dimensional vector, $V_i \stackrel{\mathrm{def}}{=} \langle v_i^1, \ldots, v_i^{M_r} \rangle$ over $S_r$, where $v_i^j$ are defined as follows:

$$v_i^j \stackrel{\mathrm{def}}{=} \begin{cases} 1 & \text{if } ID^j \in L_i \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

We then divide each such kernel vector by its length (the length in this case is simply the square root of the number of non-zero entries), obtaining a normalized representation of the kernels as vectors of length 1.

To gauge the similarity between each pair of kernels, we calculate the cosine coefficient between their respective vectors. The cosine coefficient is a well understood measure often used in information retrieval to roughly assess similarity between documents, when documents are represented as vectors of terms (see, for instance, Salton (1989)). We use it here in a non-traditional context, where our vector represents the kernels based on other documents rather than terms. Formally, the cosine coefficient between two vectors, $V_i$, $V_k$, whose respective lengths are $\|V_i\|$, $\|V_k\|$, is the cosine of the angle between the vectors and is defined as:

$$\cos(V_i, V_k) \stackrel{\mathrm{def}}{=} \frac{\sum_{j=1}^{M_r} v_i^j \, v_k^j}{\|V_i\| \, \|V_k\|}.$$

Since the vectors representing the kernels are normalized, their length is 1 and only the numerator needs to be calculated.

We note that the cosine coefficient is 0 whenever the vectors $V_i$ and $V_j$ are orthogonal (independent of each other), and 1 when $V_i = V_j$. Thus, the closer $V_i$ and $V_j$ are, the closer the coefficient is to 1. Hence, by calculating for each kernel vector, $V_i$, the cosine coefficients with respect to all other kernel vectors, $V_j$, we obtain for each kernel a ranking of how related it is to each of the other kernels, $K_j$.

By recalling that each kernel $K_i$ corresponds in turn to a gene $G_i$ we obtain a relationship between the respective genes. The reasoning for the assumed relationship is given by the lists of terms associated with the themes generated from the kernel documents, and thus the reasoning behind the suggested relationships can be easily checked.

It is left to be shown that the documents retrieved as relevant to the genes, the summaries obtained and the relationships implied by using our algorithms are indeed useful. The experiments and the results reported in the next section demonstrate that our methods are indeed capable of meeting these criteria.
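The construction of $S_r$, the binary kernel vectors, and the cosine ranking can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `related_kernels`, the dictionary input format, and the sparse set representation of the binary vectors are all choices made here for brevity. For binary vectors the dot product equals the size of the intersection and the length equals the square root of the number of non-zero entries, so the cosine coefficient reduces to the intersection size divided by the square root of the product of the two set sizes.

```python
from math import sqrt


def related_kernels(relevance_lists):
    """Rank kernels by cosine similarity of their shared relevant documents.

    relevance_lists: dict mapping a kernel id to an iterable of PubMed
    identifiers (the list L_i). The input format is an assumption of this sketch.
    Returns a dict mapping each kernel id to a list of (other_kernel, cosine)
    pairs, sorted by decreasing cosine; pairs with cosine 0 are omitted.
    """
    # |ID|: the number of relevance lists in which each identifier occurs.
    counts = {}
    for ids in relevance_lists.values():
        for pmid in set(ids):
            counts[pmid] = counts.get(pmid, 0) + 1

    # S_r: identifiers that occur in at least two lists (Equation 2).
    shared = {pmid for pmid, c in counts.items() if c > 1}

    # Binary vectors over S_r (Equation 3), stored sparsely as sets.
    vectors = {k: set(ids) & shared for k, ids in relevance_lists.items()}

    ranking = {}
    for i, vi in vectors.items():
        scores = []
        for k, vk in vectors.items():
            if k == i or not vi or not vk:
                continue
            # Dot product of binary vectors = size of the intersection;
            # vector length = sqrt(number of non-zero entries).
            cos = len(vi & vk) / sqrt(len(vi) * len(vk))
            if cos > 0:
                scores.append((k, cos))
        ranking[i] = sorted(scores, key=lambda s: s[1], reverse=True)
    return ranking
```

Storing each vector as a set avoids materializing the $M_r$-dimensional vectors explicitly, which is convenient when each kernel has at most 50 non-zero entries.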

Experiments and Results

The main goal of the methods presented in this work is to provide researchers with quality literature and concise contents summaries regarding genes. A secondary goal is to present and reveal (possibly yet-unknown) relationships among genes.

To check the performance of our algorithms, we apply them to yeast genes, and show how our methods indeed find relevant documents and provide accurate summary terms. Moreover, we also discover meaningful relationships among the genes. We have chosen the yeast DNA microarray testbed since the validity of our methods can only be assessed by comparison of the results with existing summaries of biological information. The Saccharomyces Genome Database^4 (Cherry et al. 1998; Ball et al. 2000) and the Yeast Proteome Database (Costanzo et al. 2000), as well as the functional analysis given by Spellman et al. (1998), are critical for rapid, objective evaluation of our results.

We realize, of course, that the fact that the yeast genes are well studied biases the literature in PubMed to include many abstracts discussing these genes. However, given that PubMed consists of abstracts only, which typically contain little explicit information about the connections among genes, it is obvious that our algorithms contribute a great deal, finding information that can not be easily and effectively obtained by any currently available means.

The rest of this section describes the experimental setting and reports the results obtained by applying our algorithms to the data. The quality of the results was verified through comparison to the functional groups of genes according to Spellman et al. (1998). The portion of Spellman's table relevant to the results discussed here is shown in Table 1. The table categorizes the yeast genes according to their functionality (rows) and the phase in the cell-cycle in which they are expressed (columns).

Experimental Setting

The experiments presented here consist of applying our algorithms to yeast genome data, in an attempt to find relevant literature and gene relations for the yeast genes analyzed by Spellman et al. (1998). The names of all the genes used by Spellman^5 were compared against the Saccharomyces Genome Database (SGD). Out of about 800 genes found by Spellman et al. to be cell-cycle regulated, only 408 genes had curated PubMed references in the SGD, and our experiments concentrate on these 408 genes.

^4 SGD, the Saccharomyces Genome Database, can be accessed at http://genome-www.stanford.edu/Saccharomyces and YPD, the Yeast Proteome Database, at http://www.proteome.com/databases/index.html.

^5 Available through the web site at Stanford, http://genome-www.stanford.edu/cellcycle/.

For each of the genes, the oldest reference cited in SGD was chosen to be the kernel document corresponding to the gene. Since some of the closely related genes share the same reference, we obtain 344 distinct kernel documents on which we test our algorithm.

The database used in our experiments is a subset of PubMed, consisting of 33,700 documents discussing yeast genes. It was constructed by taking the 344 kernel documents, and applying the current PubMed neighboring algorithm (Wilbur & Coffee 1994) to each of the kernel documents. Neighboring was applied again to all the resulting documents and then applied a third time to all the documents in the resulting set. The resulting database contained 42,335 documents, which included 2,250 documents deemed relevant for our 408 target genes by the SGD curators (86% of the total curated documents as of August, 1999). Many of the 42,335 had a title only and no abstract, and we eliminated them from the database, resulting in a set of 33,700 yeast-related documents. We eliminated from these documents the MeSH term taggings typically associated with PubMed entries, as well as all the terms that occur in over 10% of the documents in the database or in 2 or fewer documents. All these terms are typically useless and may have a detrimental effect when looking for descriptive keywords. Eliminating such terms improves both the quality of the results and the running time of the program.

As a first phase in our experiments, we applied our similarity search program, described in the previous section, to the 344 kernels, searching over the database of 33,700 abstracts. For each kernel, the program outputs a list of the top 50 related documents and a list of keywords describing the contents of this relevant set.

The next phase consists of looking for relationships among genes. For each of the kernels, the previous phase produced a list of 50 relevant documents. The first step in the current phase is to construct the set of relevant documents retrieved for all the kernels, eliminating duplicates. That is, if a single document is relevant to more than one kernel, it is still included in the set of relevant documents only once. We then eliminate all documents that are relevant for a single kernel only, as explained in the previous section. We are left with a set of 3063 documents that are relevant to 2 or more kernel documents (this is the set $S_r$, defined in Equation 2).

We then represent each kernel as a 3063-dimensional vector (as specified in Equation 3), and use the cosine coefficient to measure similarity between each kernel and all the other ones. Each kernel is then converted back to the gene(s) for which it was curated. The genes that are grouped as similar according to our method are compared with the ones grouped by functionality according to Spellman's table (parts of which are shown in Table 1).
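The term-pruning step described in the experimental setting (dropping terms that occur in over 10% of the documents or in 2 or fewer documents) could be sketched as follows. This is an illustrative sketch only: the function name `prune_terms`, its parameters, and the assumption that each document is already tokenized into a list of terms are not taken from the paper.

```python
from collections import Counter


def prune_terms(docs, max_percent=10, min_docs=3):
    """Remove terms that are too common or too rare across the collection.

    docs: list of documents, each a list of terms (tokenization is assumed
    to have been done elsewhere).
    A term is kept if it occurs in at least `min_docs` documents and in at
    most `max_percent` percent of all documents.
    """
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    n_docs = len(docs)
    # Integer arithmetic avoids floating-point surprises at the threshold.
    keep = {
        term
        for term, n in df.items()
        if n >= min_docs and n * 100 <= max_percent * n_docs
    }
    return [[t for t in tokens if t in keep] for tokens in docs]
```

With the defaults, a term occurring in exactly 10% of the documents is kept and one occurring in exactly 2 documents is dropped, which matches the thresholds as worded in the text.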

Biological Function: Genes (listed across the cell-cycle phase columns G1, S, G2, M, M/G1)

Replication Initiation: CDC45 ORC1 CDC47 CDC54 CDC6 CDC46 MCM2 MCM6 MCM3

Fatty Acids/Lipids/Sterols/Membranes: EPT1 LPP1 PSD1 AUR1 ERG3 LCB3 ERG2 ERG5 PMA1 ELO1 FAA1 FAA3 SUR1 SUR2 SUR4 PMA2 PMP1 FAA4 FAS1

Nutrition: BAT2 PHO8 AGP1 BAT1 GAP1 DIP5 FET3 FTR1 AUA1 GLK1 HXT1 MEP3 PFK1 PHO3 HXT2 HXT4 HXT7 PHO5 PHO11 PHO12 PHO84 RGT2 SUC2 SUT1 VAP1 VCX1 ZRT1

Table 1: Yeast Genes: expression during cell-cycle and functionality. (Adapted from Spellman et al. (1998))
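As a rough illustration of how a list of related genes can be checked against the functional rows of Table 1, the following sketch counts how many related genes share the kernel gene's function. It is not the authors' evaluation code: the `FUNCTION_OF` mapping copies only a handful of gene-to-function assignments from the rows of Table 1, and the function name `agreement` and its return format are hypothetical.

```python
# A few gene -> function assignments copied from the rows of Table 1.
# The full table contains many more genes; this subset is for illustration.
FUNCTION_OF = {
    "ELO1": "Fatty Acids/Lipids/Sterols/Membranes",
    "FAA1": "Fatty Acids/Lipids/Sterols/Membranes",
    "FAA3": "Fatty Acids/Lipids/Sterols/Membranes",
    "SUR2": "Fatty Acids/Lipids/Sterols/Membranes",
    "HXT1": "Nutrition",
    "HXT7": "Nutrition",
    "RGT2": "Nutrition",
    "CDC6": "Replication Initiation",
}


def agreement(kernel_gene, related_genes, function_of=FUNCTION_OF):
    """Return (same, different, unknown) counts for the related genes.

    same:      related genes with the same function as the kernel gene
    different: related genes assigned a different function
    unknown:   related genes with no function in the mapping
    """
    target = function_of[kernel_gene]
    same = different = unknown = 0
    for gene in related_genes:
        func = function_of.get(gene)
        if func is None:
            unknown += 1
        elif func == target:
            same += 1
        else:
            different += 1
    return same, different, unknown
```

Calling `agreement("ELO1", genes)` with any list of gene names returns the three counts; genes missing from the mapping are reported separately rather than counted as disagreements, since many genes in the experiment have no function assigned in the table.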

To check the validity of the keyword list assigned to each kernel, we compare each keyword to its associated functionality using a mini-thesaurus obtained from a panel of four independent yeast experts. Each functionality description listed in Spellman's table (such as Secretion or Chromatin) is associated with the terms judged most closely related to it according to the experts. Each expert received a list of the 22 function descriptions listed by Spellman et al., and a separate list of 330 alphabetically-sorted summary terms resulting from our program. The experts assigned to each term in the latter list the functionality descriptors that they judged to be most related to it; non-specific terms were left unassigned. An example of two entries in the resulting thesaurus is shown in Table 2.

Function: Associated Terms

Chromatin: chromatids, chromatin, chromosome, sister chromatids, telomere, telomeric

Secretion: acid phosphatase, coatomer, endoplasmic, endoplasmic reticulum, er, golgi apparatus, golgi complex, golgi transport, golgi, v snare

Table 2: Example of thesaurus entries associating gene function with related terms.

For each gene, we compare its functionality according to Spellman with the functionality assigned by the panel to each of its key terms, counting how many of the key terms indeed correspond to the gene's functionality according to Spellman and how many do not. The results are described throughout the rest of this section.

Results

As stated before, for each gene represented by a kernel document we obtain through the similarity query mechanism applied to the whole database:

1. A set of related documents.

2. A set of summarizing keywords.

In addition, from the set of related documents we obtain, for each kernel, through the vector representation and the cosine coefficient calculation, a set of related kernels. The latter kernels are mapped back to form a set of related genes.

To assess the value of the results obtained in the first phase we examine the set of summarizing keywords. (Obviously, objectively assessing the quality of the retrieved documents themselves would also be desirable but there is no well-defined way to do it.) We also examine the lists of related genes obtained in the second phase. The quality of the results is checked through a comparison with the functionality assigned to genes by Spellman et al., shown in Table 1. Since many of the genes in the experiment are not assigned any functionality by Spellman (120 out of the 344 kernels used) we can only verify in this manner results for the ones whose functionality was determined by Spellman et al.

An example of a typical successful search is shown in Table 3. The left column of the table lists the PubMed identifiers for two kernel documents together with the genes they stand for and the functionality of these genes according to Spellman et al. The second column lists, for each of the two kernels, the 10 top keywords associated with the retrieved set of documents, as determined by our algorithm. The third column lists the top 10 genes^6 associated with each of the two kernels, based on the cosine coefficient. The fourth column lists the function of each gene according to Spellman et al., as a means

^6 ELO1 has only 9 genes associated with it, since there were only 9 non-zero cosine coefficients associated with its kernel.

Kernel (PMID, Gene, Function): 8702485, ELO1, Fatty Acid/Lipids/Sterols/Membranes
Keywords: fatty acid, fatty, lipids, acid, grown, medium, carbon, synthase, strains, deficient
Assoc. Genes (Function):
  OLE1 ((Fatty Acid, Sterol Met.)*)
  FAA4 (Fatty Acid/Lipids/Sterols/Membranes)
  FAA3 (Fatty Acid/Lipids/Sterols/Membranes)
  SUR2 (Fatty Acid/Lipids/Sterols/Membranes)
  FAA1 (Fatty Acid/Lipids/Sterols/Membranes)
  ERG2 (Fatty Acid/Lipids/Sterols/Membranes)
  PSD1 (Fatty Acid/Lipids/Sterols/Membranes)
  CYB5 ((Fatty Acid, Sterol Met.)*)
  PGM1 ((Carbohydrates Met.)*)

Kernel (PMID, Gene, Function): 7651133, HXT7, Nutrition
Keywords: hexose, glucose uptake, glucose conc., fructose, glycolytic, glucose, sugars, uptake, aerobic, utilization
Assoc. Genes (Function):
  HXT1 (Nutrition)
  RGT2 (Nutrition)
  HXT4 (Nutrition)
  HXT2 (Nutrition)
  GLK1 (Nutrition)
  SEO1 ((Small Molecules Transport)*)
  PRB1 ((Protein Degradation)*)
  AGP1 (Nutrition)
  ZRT1 (Nutrition)
  MIG2 ((Carbohydrates Met.)*)

Table 3: Example of a result obtained from two different kernels/genes using our algorithm, compared with functionality according to Spellman or YPD (YPD functionality denoted by *).

for checking the validity of our results. Since our experiment included more genes than listed in Spellman's table, some of the genes in the third column are not assigned functionality by Spellman. For these genes (denoted by an * in the table), we found the functionality in YPD.

The table shows that except for two genes (PGM1 and PRB1) all of the genes found for these two kernels have a strong functional relationship to the genes represented by the kernels, and the keywords provide a strong indication of this functionality. (Note that the keywords are associated as a set with the whole kernel entry and not separated as one keyword per associated gene.) We note that PGM1 is involved in carbohydrates metabolism, which is still functionally related to fatty acids metabolism. PRB1 is responsible for protein degradation, which is not related to nutrition. It is included in this set since the abstract chosen for its kernel document discusses regulation of the enzyme prb1p by glucose, rather than the function of prb1p.

The results for about 100 out of the 220 kernels for which we had the Spellman assigned functionality closely resemble the ones demonstrated in Table 3 in the strong agreement with Spellman's cluster assignment and in the accurate description as given by the keywords learned by the similarity query algorithm. As a quantitative measure, we calculated the average number of correct and incorrect keywords among the 5 top-ranking keywords associated with each of these kernels. A keyword occurring in a list for a specific gene (kernel) is considered correct if it appears in our thesaurus entry labeled by the same function as the one assigned to the gene by Spellman. If its thesaurus entry is labeled by a different function, it is considered wrong. If it was assigned no function by our panel of experts it is considered non-descriptive. An average of 3.27 out of the 5 top ranking keywords were associated with the correct function, while only 1.12 out of the 5 were associated with the wrong function, and 0.61 out of the 5 were non-descriptive. The difference between the high rate of correct keyword assignment relative to the wrong and the non-descriptive assignment is highly statistically significant (p << 0.005, according to the two-sample t-test).

For many other kernels the groups of related genes contain many genes not assigned functionality by Spellman, which makes the results harder to validate. Another set of cases, in which our results deviate from Spellman's functionality grouping of genes, are those for which the kernel document was not primarily focused on the function of the gene but contained a lot of detail discussing the experimental methods. In such cases, any document describing the same experimental method was considered similar and drawn into the set of relevant documents, resulting in a mixture of biologically-unrelated documents. The terms included in the keywords list indicate potential problems with this grouping and provide a warning that these results should not be taken at face value. An example of such a result is given in Table 4. In this case, the kernel document focuses on the technique used for studying the

Kernel (PMID, Gene, Function): 6323245; MCM2, MCM3, MCM6; Replication Init.
Keywords: ars; autonom, replicating; replicating sequence; autonomously; minichromosomes; replicating; centromeric; leu2; plasmids; ura3
Assoc. Genes (Function):
  CDC10 (Site Selection/Morphogenesis)
  PHO3 (Nutrition)
  EST1 (DNA Syn.)
  MIF2 (Chromatin)
  PHO12 (Nutrition)
  POL2 (DNA Syn.)
  DHS1 (DNA repair)
  SNQ2
  SMC3 (Chromat. Cohes.)
  EXG2 (Cell Wall Synt.)

Table 4: Example of a result obtained from an uninformative kernel using our algorithm, compared with functionality according to Spellman.

MCM genes, rather than the explicit function of these genes. Consequently, some of the kernels considered similar to it represent the use of similar techniques for studying different biological processes, rather than the biology of their associated genes. The result is a set of genes for which the commonality is that the documents curated for them all discuss manipulations within chromosomes rather than gene function. The keyword list (which highly ranks terms such as autonomous replication and contains leu2 and ura3, which are commonly used selectable markers for plasmids) indicates that the theme underlying this set of documents and genes is not relevant to functional genomics.

Obviously, obtaining good biological information (as shown in Table 3) is much preferable to an indication of poor quality, and for the most part this depends on starting from good quality kernel documents. The excellent experience with the 100 high-quality kernel documents demonstrates that once a single informative document is given for a gene, many other quality documents about the related genes are automatically found, accompanied by a succinct summary of the functional relationship between the genes.

Conclusions and Ongoing Work

Automatically finding connections among documents discussing genes has three clear advantages:

1. It is an efficient way for establishing putative relationships between genes as a preliminary step preceding direct experimental methods.

2. It provides the relevant literature needed by the researchers for performing the results analysis.

3. It generates a summary explaining the discovered relationships. This summary can help researchers explain and evaluate the relationships found through direct clustering of the expression levels.

Thus, this method can be used both for generating hypotheses prior to the experiments, as well as for post-experimental interpretation of the results.

The results presented in this paper demonstrate that given a functionally descriptive kernel document our program can provide insight into gene functional groupings, similar to that currently obtained through laborious, manual literature surveys relying on a lot of human expertise. Obviously our method can not ascribe function to genes which have not yet been studied. However, it can indicate functional relationships among known genes which heretofore have gone unnoticed.

The main limitation our technique currently faces is that of obtaining functionally descriptive kernel documents. We are considering several machine-learning techniques that can greatly assist in automating the kernel selection process. The expectation is that such kernel selection would consistently lead to good results.

Our method complements current techniques used for cluster analysis of the expression array data. We strongly believe that by combining this approach with techniques such as the one suggested by Friedman et al., as well as with expression array clustering approaches, we can achieve a great deal of automation and expedite the tedious task of analyzing the overwhelming amounts of data generated from experiments conducted over gene expression arrays.

Acknowledgments

We are grateful to Jan Fassler, Ken Katz, Steven Sullivan and Tyra Wolfsberg for the time and effort they have put into assigning functional tags to terms.

References

Ball, C. A. et al. 2000. Integrating functional genomic information into the Saccharomyces genome database. Nucleic Acids Res. 28:77-80.

Bassett, D. E.; Eisen, M. B.; and Boguski, M. S. 1999. Gene expression informatics - it's all in your mine. Nature Genetics 21:51-5.

Ben-Dor, A., and Yakhini, Z. 1999. Clustering gene expression patterns. In Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB99).

Ben-Dor, A.; Shamir, R.; and Yakhini, Z. 1999. Clustering gene expression patterns. Journal of Computational Biology 6(3/4):281-297.

Blaschke, C.; Andrade, M. A.; Ouzounis, C.; and Valencia, A. 1999. Automatic extraction of biological information from scientific text: Protein-protein interactions. In Proceedings of the AAAI Conference on Intelligent Systems in Molecular Biology, 60-67.

Brown, P. O., and Botstein, D. 1999. Exploring the new world of the genome with DNA microarrays. Nature Genetics 21:33-7.

Cherry, J. M. et al. 1998. SGD: saccharomyces genome database. Nucleic Acids Res 26:73-9.

Cho, R. J. et al. 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 2:65-73.

Chu, S. et al. 1998. The transcriptional program of sporulation in budding yeast. Science 282:699-705.

Costanzo, M. C. et al. 2000. The yeast proteome database (YPD) and caenorhabditis elegans proteome database (WormPD): comprehensive resources for the organization and comparison of model organism protein information. Nucleic Acids Res 28:73-6.

Craven, M., and Kumlien, J. 1999. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of the AAAI Conference on Intelligent Systems in Molecular Biology, 77-86.

Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39(1):1-38.

DeRisi, J.; Iyer, V.; and Brown, P. 1997. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278:680-686.

Eisen, M. B.; Spellman, P. T.; Brown, P. O.; and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Science 95:14863-14868.

Ermolaeva, O. et al. 1998. Data management and analysis for gene expression arrays. Nature Genetics 20:19-23.

Ferea, T. L.; Botstein, D.; Brown, P. O.; and Rosenzweig, R. F. 1999. Systematic changes in gene expression patterns following adaptive evolution in yeast. Proceedings of the National Academy of Science 96:9721-6.

Friedman, N.; Linial, M.; Nachman, I.; and Pe'er, D. 2000. Using Bayesian networks to analyze expression data. Life Sciences (to appear).

Gillespie, D., and Spiegelman, S. 1965. A quantitative assay for DNA-RNA hybrids with DNA immobilized on membrane. Journal of Molecular Biology 12:829-42.

Leek, T. R. 1997. Information extraction using hidden Markov models. Master's thesis, Department of Computer Science, University of California, San Diego.

Rindflesch, T. C.; Tanabe, L.; Weinstein, J. N.; and Hunter, L. 2000. Edgar: Extraction of drugs, genes and relations from the biomedical literature. In Proceedings of the Pacific Symposium on Biocomputing.

Salton, G. 1989. Automatic Text Processing. Addison-Wesley.

Schena, M.; Shalon, D.; Davis, R. W.; and Brown, P. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467-470.

Schena, M. 1999. DNA Microarrays: A Practical Approach. Oxford University Press.

Shatkay, H., and Wilbur, W. J. 2000. Finding Themes in MedLine Documents. In Proceedings of the IEEE conference on Advances in Digital Libraries (ADL2000).

Southern, E. M. 1975. Detection of specific sequences among DNA fragments separated by gel electrophoresis. Journal of Molecular Biology 98:503-17.

Spellman, P. T. et al. 1998. Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9:3273-3297.

Tamayo, P. et al. 1999. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proceedings of the National Academy of Science 96:2907-2912.

Tanabe, L. et al. 1999. Medminer: An internet text-mining tool for biomedical information, with application to gene expression profiling. BioTechniques 27(6):1210-1217.

Wahl, G. M.; Meinkoth, J. L.; and Kimmel, A. R. 1987. Northern and southern blots. Methods Enzymol 152:572-81.

Wen, X. et al. 1998. Large-scale temporal gene expression mapping of central nervous system development. Proceedings of the National Academy of Science 95:334-9.

Wilbur, W. J., and Coffee, L. 1994. The effectiveness of document neighboring in search enhancement. Information Processing and Management 30(2):253-266.
