
Domain-Specific Synonym Expansion and Validation for Biomedical Information Retrieval (MultiText Experiments for TREC 2004)

Stefan Büttcher, Charles L. A. Clarke, Gordon V. Cormack

School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada

Abstract

In the domain of biomedical publications, synonyms and homonyms are omnipresent and pose a great challenge for document retrieval systems. For this year's TREC Genomics ad hoc retrieval task, we mainly addressed the problem of dealing with synonyms. We examined the impact of domain-specific knowledge on retrieval effectiveness and analyzed the quality of Google as a source of query expansion terms based on pseudo-relevance feedback. Our results show that automatic acronym expansion, realized by querying the AcroMed database of biomedical acronyms, almost always improves the performance of our document retrieval system. Google, on the other hand, produced results that were worse than those of the corpus-based feedback techniques we also used in our experiments. We believe that the primary reason for Google's poor performance in this task is its highly restricted query language.

1 Introduction

This paper describes the work done by members of the MultiText project at the University of Waterloo for the TREC 2004 Genomics track. The MultiText group participated only in the ad hoc retrieval task of the Genomics track, which consisted of finding documents that are relevant with respect to certain topics in a subset of the Medline/PubMed database of medical publications [PM04]. The subset included articles published between 1963 and 2004. It consisted of 4,591,008 different documents with a total size of 14 GB (XML text), where the term document refers to an incomplete article, annotated with certain keywords, such as MeSH terms [MeS04]. Out of these 4,591,008 documents, 3,479,798 contained the abstract of the article; the remaining 1,111,210 contained only the article title and the annotations. In total, 50 topics were given to the track participants. An example topic is shown in Figure 1.

Information retrieval in the context of biomedical databases traditionally has to contend with three major problems: the frequent use of (possibly non-standardized) acronyms, the presence of homonyms (the same word referring to two or more different entities), and synonyms (two or more words referring to the same entity). For the two runs we submitted for the Genomics track, we addressed the problems caused by acronyms and synonyms. We chose to follow a strategy that was a mixture of known heuristics for approximate string matching on genomics-related data and a knowledge-based approach that used domain-specific knowledge from various sources, such as the AcroMed database of biomedical acronyms [Acr04]. Since acronyms can be viewed as a special form of synonym, it was possible to develop a system that deals with both problems (synonyms and acronyms) in a uniform way.

Furthermore, we implemented an automatic query expansion technique using pseudo-relevance feedback. We evaluated the quality of feedback terms produced by a simple Google query, compared the results with known query expansion techniques, and studied possibilities to combine both strategies.

    <TOPIC>
      <ID>14</ID>
      <TITLE>Expression or Regulation of TGFB in HNSCC cancers</TITLE>
      <NEED>Documents regarding TGFB expression or regulation in HNSCC
        cancers.</NEED>
      <CONTEXT>The laboratory wants to identify components of the TGFB
        signaling pathway in HNSCC, and determine new targets to study
        HNSCC.</CONTEXT>
    </TOPIC>

Figure 1: Example topic in original XML form
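Each topic thus consists of an identifier, a short title, and a longer statement of the information need, plus background context. As a minimal sketch, the fields can be read out of the XML like this (the exact tag names are assumptions based on the reconstruction in Figure 1, not verified against the official topic files):

    import xml.etree.ElementTree as ET

    def parse_topic(xml_text):
        # Tag names (TOPIC/ID/TITLE/NEED/CONTEXT) assumed from Figure 1.
        root = ET.fromstring(xml_text)
        return {tag: root.findtext(tag)
                for tag in ("ID", "TITLE", "NEED", "CONTEXT")}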

2 The MultiText system

Our work is based on the MultiText text retrieval system, which was initially implemented at the University of Waterloo in 1994 [CCB94] and has undergone a variety of extensions and modifications over the last decade. MultiText supports several different document scoring mechanisms. We employed the MultiText implementation of Okapi BM25 [RWJ+94] [RWB98] and MultiText's QAP passage scoring algorithm [CCT00] [CCL01], which had initially been developed for question answering tasks.

2.1 Okapi BM25 document scoring

MultiText's implementation of BM25 is the same as the original function proposed by Robertson et al. [RWJ+94] in the absence of document relevance information. Every query term T is assigned a term weight

    w_T = \ln\left( \frac{|D| - |D_T| + 0.5}{|D_T| + 0.5} \right),                    (1)

where D is the set of all documents in the corpus, and D_T is the set of documents containing the term T. For a given query Q = {T_1, ..., T_n}, the score of a document D is computed using the formula

    \sum_{T \in Q} w_T \cdot q_T \cdot \frac{d_T \cdot (1 + k_1)}{d_T + k_1 \cdot \left( (1 - b) + b \cdot \frac{len_D}{len_{avg}} \right)},      (2)

where d_T is the number of occurrences of the term T in the document D, len_D is the length of D, len_avg is the average document length in the corpus, and q_T is the query-specific relative weight of the term T. Usually, q_T equals the number of occurrences of T in the original query. We discuss this parameter in more detail in section 8. The remaining parameters in formula 2 were chosen to be k_1 = 1.2 and b = 0.75.

In addition to mere term queries, MultiText supports structured queries, in particular disjunctions of terms. We call a disjunction of query terms

    T_\vee = T_1 \vee T_2 \vee \cdots \vee T_m

a disjunctive query element. The BM25 weight of a disjunctive query element is

    w_{T_\vee} = \ln\left( \frac{|D| - |D_{T_1} \cup \cdots \cup D_{T_m}| + 0.5}{|D_{T_1} \cup \cdots \cup D_{T_m}| + 0.5} \right).        (3)

For convenience, we let the "+" symbol denote the disjunction operator ("\vee") whenever we present example queries.
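To make the weighting concrete, here is a minimal sketch of equations (1)-(3) in Python (our own illustration, not the MultiText implementation; all function and variable names are ours):

    import math

    K1, B = 1.2, 0.75  # parameter values used in our experiments

    def term_weight(num_docs, doc_freq):
        """Equation (1): w_T from corpus size |D| and document frequency |D_T|."""
        return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

    def bm25_score(query_weights, doc_tf, doc_len, avg_doc_len, num_docs, doc_freqs):
        """Equation (2). query_weights maps term -> q_T, doc_tf maps
        term -> d_T, and doc_freqs maps term -> |D_T|."""
        score = 0.0
        for term, q_t in query_weights.items():
            d_t = doc_tf.get(term, 0)
            if d_t == 0:
                continue
            w_t = term_weight(num_docs, doc_freqs[term])
            norm = d_t + K1 * ((1 - B) + B * doc_len / avg_doc_len)
            score += w_t * q_t * d_t * (1 + K1) / norm
        return score

    def disjunction_weight(num_docs, union_doc_freq):
        """Equation (3): weight of a disjunctive query element, where
        union_doc_freq = |D_T1 u ... u D_Tm| (documents containing at
        least one of the disjoined terms)."""
        return math.log((num_docs - union_doc_freq + 0.5) / (union_doc_freq + 0.5))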

2.2 MultiText QAP passage scoring

MultiText's QAP passage scoring algorithm computes text passage scores based on the concept of self-information. It may be extended to a document scoring algorithm by finding the passage with the highest score within a document and assigning the passage's score to the entire document, although this is not the original purpose of QAP. Our pseudo-relevance feedback mechanisms are derived from the QAP scoring function.

Given a query Q, every passage P of length l, containing the terms \mathcal{T} \subseteq Q, is assigned the score

    H(P) = \sum_{T \in \mathcal{T}} \left( \log_2 \frac{N}{f_T} - \log_2 l \right),        (4)

where N is the total number of tokens in the corpus (corpus size), and f_T is the number of occurrences of the term T in the corpus. H(P) is called the self-information of the passage P.
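Stated as code, the passage score and the derived document score look roughly as follows (again a sketch under our own naming, not the MultiText implementation):

    import math

    def passage_score(terms_in_passage, passage_len, corpus_size, corpus_freq):
        """Equation (4): each query term in the passage contributes
        log2(N / f_T) - log2(l)."""
        return sum(math.log2(corpus_size / corpus_freq[t]) - math.log2(passage_len)
                   for t in terms_in_passage)

    def document_score(passages, corpus_size, corpus_freq):
        """Score a document by its best passage, as described above.
        passages -- iterable of (set of query terms in passage, passage length)"""
        return max(passage_score(t, l, corpus_size, corpus_freq)
                   for t, l in passages)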


    [Figure 2 is a block diagram of the retrieval pipeline. Stage 1:
    topic parsing (XML text) and stop word removal feed a species names
    finder, a variant generator, and an acronym/synonym finder (AcroMed,
    euGenes, LocusLink); the resulting unexpanded query is scored by
    MultiText BM25 and MultiText QAP, the expanded query by MultiText
    BM25. Stage 2: document fusion (interleave), followed by an expansion
    validator and permutation finder, produces a final expanded query
    scored by MultiText BM25. Stage 3: passage feedback, document
    feedback, and Google feedback are combined by term fusion (CombSUM)
    for query expansion (pseudo-relevance feedback), scored by MultiText
    BM25, and post-processed by species names filtering.]

Figure 2: The document retrieval system

3 System overview

An overview of the document retrieval system developed for the Genomics track is given by Figure 2. The system consists of a preprocessing stage and three document retrieval stages. Most parts of the system can be changed by varying the respective parameters. Because there were no training data available, many of these parameters were chosen arbitrarily.

Query preprocessing & synonym expansion (Stage 1). In the preprocessing stage, the input data (XML text) is parsed and all stop words are removed. The list of stop words contains the usual terms, like "the" and "and", but also a number of additional terms inferred from the sample topics, e.g. "find", "information", and "literature". After the stop words have been removed, several synonym generation techniques are applied to the query. A detailed explanation of these techniques can be found in section 4. The result of this process is an expanded query containing possible synonyms of the terms in the original query. The unexpanded query is taken by MultiText to score documents using both BM25 and QAP, while the expanded query is used to produce a third, BM25-scored document set.

Expansion validation (Stage 2). In the second stage (see section 5 for details), the documents returned by the first stage are analyzed and systematically scanned for any occurrences of terms added to the query in the expansion phase of stage 1, or of permutations of query terms. Terms that have been added to the query in stage 1 but occur very infrequently in the documents returned are removed from the query, whereas frequent permutations of query terms which have not been part of the old query might be added. The resulting query is sent to MultiText, and documents are scored by BM25.

Pseudo-relevance feedback (Stage 3). The third stage, which is described in sections 6 and 7, performs a set of standard pseudo-relevance feedback operations in order to create new expansion terms. The terms produced by the feedback are added to the query, and BM25 is used again to compute the document scores. Finally, if our retrieval system has found species names in the original query, it tries to match the documents in the result list against these species names and promotes documents that contain the species names or synonyms of them.

    ID: 14
    TITLE: Expression or Regulation of TGFB in HNSCC cancers
    NEED: Documents regarding TGFB expression or regulation in HNSCC cancers.
    QUERY_TITLE: "expression" "regulation" "tgfb" "hnscc" "cancers"
    EXPANDED_TITLE: "expression" "regulation" ("tgfb"+"blk"+"ced"+"dlhc"+
      "tgfbeta"+"tgfb1"+"tgfb2"+"tgf r"+"transforming growth factor beta"+
      "latent transforming growth factor beta"+"ltbp 1"+"ltbp1"+"mav"+
      "maverick"+.....) ("hnscc"+"head and neck squamous cell") "cancers"
    QUERY_NEED: "regarding" "tgfb" "expression" "regulation" "hnscc" "cancers"
    EXPANDED_NEED: "regarding" ("tgfb"+"blk"+"ced"+"dlhc"+"tgfbeta"+"tgfb1"+
      .....) "expression" "regulation" ("hnscc"+"head and neck squamous cell")
      "cancers"

Figure 3: Example topic after preprocessing and query expansion in stage 1

4 Domain-specific query expansion

For domain-specific query expansion, we used two different techniques in stage 1 of our system:

• a general variant generation approach that creates lexical variants of a list of given terms;
• a database of biomedical synonyms, extracted from various sources available on the web.

In addition to these methods, we examine permutations of query terms to find further expansions. The result of the query expansion in stage 1 is an expanded query that combines each original query term and all its expansions (synonyms etc.) into one disjunctive query element (defined in section 2), as shown in Figure 3.

4.1 Lexical variants

In the context of biomedical articles, we have to deal with an abundant number of lexical variants of the same term. These are mainly caused by the authors' different preferences regarding hyphenation, spacing, and Greek letters. In the Medline corpus used for this year's Genomics track, for instance, the NF-kappa B protein is referred to in 6 different ways: "NF-kappa B" (33,902 times), "NF-kappaB" (28,551), "NFkappaB" (3,211), "NF-kB" (688), "NFkB" (259), and "NFkappa B" (45).

Our system addresses this problem by scanning the original query for interesting terms (i.e., terms that are likely to have lexical variants). We consider a term interesting if it contains at least one hyphen, digit, or Greek letter. For each such term, a list of possible variants is created according to the following rules:

1. Greek letters are contracted to their Latin equivalents: "alpha" becomes "a", "beta" becomes "b", and so on.
2. Hyphens are removed from the term.
3. Before and after every Greek letter, a hyphen is inserted.
4. At every transition from alphabetic to numerical characters and back, a hyphen is inserted.

For each term, every possible combination of the above rules is applied. For example, the set of variants generated for the term "Lsp1alpha" (Larval serum protein 1 alpha) is: "lsp-1-alpha", "lsp-1-a", "lsp-1alpha", "lsp-1a", "lsp1-alpha", "lsp1-a", "lsp1alpha", "lsp1a".

The problem of replacing spaces by hyphens and vice versa can be ignored because the MultiText engine already considers them equivalent. Case folding is automatically applied to all terms as well.
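These rules are simple enough to state as code. The following sketch (our own reconstruction of the rules, not the system's code) treats every Greek-letter contraction and every token boundary as an independent binary choice, which reproduces the eight variants listed above for "Lsp1alpha":

    import itertools
    import re

    GREEK = {"alpha": "a", "beta": "b", "gamma": "g", "delta": "d",
             "kappa": "k"}  # extend as needed

    def tokenize(term):
        """Lower-case, drop hyphens, split into letter/digit runs, and
        split letter runs around embedded Greek-letter names."""
        tokens = []
        pattern = "(" + "|".join(sorted(GREEK, key=len, reverse=True)) + ")"
        for run in re.findall(r"[a-z]+|[0-9]+", term.lower()):
            if run.isdigit():
                tokens.append(run)
            else:
                tokens.extend(t for t in re.split(pattern, run) if t)
        return tokens

    def variants(term):
        """All lexical variants of a term (rules 1-4 in all combinations)."""
        tokens = tokenize(term)
        # Rule 1: each Greek token may be contracted to its Latin initial.
        choices = [(GREEK[t], t) if t in GREEK else (t,) for t in tokens]
        results = set()
        for picked in itertools.product(*choices):
            # Rules 2-4: each junction independently gets "" or "-".
            for seps in itertools.product(["", "-"], repeat=len(picked) - 1):
                results.add(picked[0] +
                            "".join(s + t for s, t in zip(seps, picked[1:])))
        return results

    # variants("Lsp1alpha") yields exactly the eight forms listed above,
    # from "lsp1a" to "lsp-1-alpha".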

4.2 Synonym expansion

In biomedical publications, there are mainly three types of synonyms that can appear in a query:

• acronyms;
• gene names or symbols;
• protein names or symbols.

Acronyms can be dealt with in a straightforward fashion by adding the respective long form to the query whenever a known acronym is encountered. Genes and proteins have additional structure that can be used when looking for appropriate expansions: the symbol "ACP" (acid phosphatase) describes a family of proteins, including ACP1 and ACP2. This allows us to not only add synonyms of "ACP" to the query, but also "ACP1" and "ACP2". Following the Gene Ontology terminology, we call these narrow synonyms (as opposed to exact synonyms, such as acronyms).

We used three different sources to deal with the above kinds of synonyms: the AcroMed database of biomedical acronyms [PCC+01] [Acr04], the Eukaryotic Genes database at the University of Indiana [Gil02] [euG04], and the LocusLink database [LL04].

AcroMed

The AcroMed database contains a list of mappings from acronyms to their long forms, automatically created from Medline abstracts. For our experiments, we used only a subset of the AcroMed data. We ignored all acronym/long-form pairs with fewer than 5 occurrences. The rationale behind this was to avoid low-quality expansion terms caused by wrong AcroMed data. The resulting database contained 25,589 acronyms and 49,822 long forms.

For every input topic, our system searches the topic text for occurrences of acronyms stored in the AcroMed database. In order to allow for lexical variants, an approximate string matching algorithm is used that allows exactly the types of variants that are described in section 4.1.

euGenes

The Genomic Information for Eukaryotic Organisms database (euGenes.org) consists of genetic information for 9 different species. The version used for our experiments contains 184,460 different genes. Our system does not use the information offered by euGenes to its full extent. Instead, it only builds a mapping from gene symbols to full names.

Again, the special needs discussed in section 4.1 are taken into account when the euGenes database is searched for a gene symbol. Furthermore, when looking for expansions, we also allow for narrow synonyms by cutting off the tail of the gene symbol after the last transition from alphabetical to numerical characters or vice versa. The gene symbol "TGFB2" (human gene Transforming Growth Factor Beta 2), for example, is given the database entries "TGFB2" and "TGFB". The same holds for hyphens, so the gene "ATPsyn-beta" (Drosophila gene ATP Synthase Beta) can even be found when the query is asking for "ATPsyn".
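A sketch of this truncation step (our code; the function name is ours):

    import re

    def index_keys(symbol):
        """Index a gene symbol under itself and under the narrow-synonym
        prefix obtained by cutting at the last letter/digit transition or
        hyphen, e.g. "TGFB2" -> {"tgfb2", "tgfb"} and
        "ATPsyn-beta" -> {"atpsyn-beta", "atpsyn"}."""
        s = symbol.lower()
        keys = {s}
        cuts = list(re.finditer(r"[a-z](?=[0-9])|[0-9](?=[a-z])|-", s))
        if cuts:
            prefix = s[:cuts[-1].end()].rstrip("-")
            if prefix:
                keys.add(prefix)
        return keys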
LocusLink

LocusLink is the most prominent source of publicly available information on genes. It provides detailed information about the function and position of genes. We used a version of the LocusLink database containing 128,580 entries. Our system does not utilize the full LocusLink information but only the following fields of a LocusLink record:

• OFFICIAL_SYMBOL: the officially approved symbol for this gene;
• PREFERRED_SYMBOL: a symbol that might become the official symbol in the future, but has not been validated by the nomenclature committee yet;
• OFFICIAL_GENE_NAME: the official name of the gene;
• PREFERRED_GENE_NAME: similar to the PREFERRED_SYMBOL field;
• ALIAS_SYMBOL: an alias that is sometimes used to refer to the gene;
• PRODUCT: the protein product of the transcript.

A gene can be found in the database by searching for its official symbol, its preferred symbol, one of its aliases, or its protein product. The product is included in this list because genes and their products are often treated as synonyms. Narrow synonyms do not get a special treatment in the LocusLink data because that information in many cases is already included in the ALIAS_SYMBOL fields of a gene record.

Some LocusLink records also provide a PMID field that contains a list of PubMed articles related to the gene. This information could have been very beneficial if used in an appropriate way. However, it was not clear to us whether these fields are of sufficiently high quality and how exactly we could make good use of them.

5 Expansion validation

In stage 2, the results of the first stage are combined using an interleaving fusion technique described by Yeung et al. [YCC+03]. From each of the three document lists generated in stage 1, one document is picked in turn and added to the new document list. Duplicates are removed from the result.

    ID: 14
    TITLE: Expression or Regulation of TGFB in HNSCC cancers
    NEED: Documents regarding TGFB expression or regulation in HNSCC cancers.
    QUERY_TITLE: "expression" "regulation" "tgfb" "hnscc" "cancers"
    EXPANDED_TITLE: #1.00 "expression" #1.00 "regulation" #0.45 "tgfb"
      #0.95 ("tgfb"+"tgf beta"+"beta tgf"+"tgfbeta"+"tgfb1"+"tgfb2"+"tgfb3")
      #0.45 "hnscc" #0.95 ("hnscc"+"head and neck squamous cell") #1.00 "cancers"
    QUERY_NEED: "regarding" "tgfb" "expression" "regulation" "hnscc" "cancers"
    EXPANDED_NEED: #1.00 "regarding" #0.45 "tgfb" #0.95 ("tgfb"+"tgf beta"+
      "beta tgf"+"tgfbeta"+"tgfb1"+"tgfb2"+"tgfb3") #0.45 "hnscc"
      #0.95 ("hnscc"+"head and neck squamous cell") #1.00 "cancers"

Figure 4: Example topic in stage 2 (with query-specific term weights qT )

The new document list is used to validate the expansion terms in the following way: the system takes the first 150 documents from the result list produced by stage 1 and tries to match them against the expansion terms generated by the expansion techniques described in section 4. Furthermore, it looks for permutations of expansion terms in the documents returned. This is done because the name of the protein that belongs to a certain gene is often a permutation of the gene name. The Homo sapiens gene "glucosidase, alpha; acid", for instance, encodes the "acid alpha-glucosidase" preprotein.

The documents are scanned for the expansion terms or term sequences, and the number of occurrences is counted for every expansion. For every original query term, the best 10 expansions are kept. This set of remaining expansion candidates is then further restricted by removing all expansions that have less than 1% of the occurrences of the top candidate. The remaining expansions are combined into a new disjunctive query element that is added to the original query.

Terms for which an expansion has been found are given a reduced term weight q_T = 0.45. However, the set of validated expansions gets a weight q_{T'} = 0.95, so the total weight of the term increases to q_{T,total} = 1.4. The idea behind this weight increase is the assumption that terms for which we have an expansion are likely to be key elements of the original query. Having the original term appear twice in the resulting query (once by itself, once in the disjunction) is intended to keep query drift low. Query terms for which no expansion could be found get the default term weight q_T = 1.0.

The result of the validation process is shown in Figure 4.
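In code, the counting and thresholding just described looks roughly like this (a sketch under our own naming; permutation generation is omitted for brevity):

    from collections import Counter

    def validate_expansions(expansions, top_docs):
        """expansions -- dict: original term -> list of candidate strings
        top_docs   -- texts of the first 150 fused stage-1 documents"""
        validated = {}
        for term, candidates in expansions.items():
            counts = Counter()
            for text in top_docs:
                lowered = text.lower()
                for cand in candidates:
                    counts[cand] += lowered.count(cand.lower())
            best = counts.most_common(10)       # keep the 10 best expansions
            if not best or best[0][1] == 0:
                continue
            cutoff = 0.01 * best[0][1]          # drop candidates below 1% of top
            validated[term] = [c for c, n in best if n >= cutoff]
        return validated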
6 Pseudo-relevance feedback

In stage 3 of our retrieval system, we expand the query produced in stage 2 by using the results of three different pseudo-relevance feedback (PRF) algorithms. The output of each PRF process is a vector of term-score pairs. All terms that are already part of the old query (either exactly or in a stem-equivalent form) are removed from the vectors. The results are merged using the normalized CombSUM fusion algorithm [Lee97] [SF94].

From the merged term vector, the best 10 candidates are taken and added to the old query with query-specific term weights

    q_T = 0.3 \cdot score_T,                                   (5)

where score_T is the feedback score of the term T. Because feedback scores are normalized, the top-scoring feedback term will have weight 0.3 in the resulting expanded query.

In the following sections we explain the computation of the feedback scores for each of the three feedback techniques employed.

6.1 Passage feedback

The passage feedback algorithm used by our system creates a vector of expansion candidate terms from a list of scored passages (nodes "MultiText QAP" and "Passage Feedback" in Figure 2). It is based on the QAP passage scoring algorithm described in section 2 and has successfully been used by the MultiText group in TREC 2003 [YCC+03].

The algorithm takes the top 100 documents, as returned by MultiText's QAP, and examines the neighborhood of the best passage within each document. Every time the algorithm encounters a certain term T within that neighborhood, the score of that term is increased:

    score_T := score_T + \log_2 \frac{N}{f_T} - \log_2 l,            (6)

where N is the corpus size, f_T the number of occurrences of T in the corpus, and l is the size of the minimum window that contains both the relevant passage and the term T. If, for example, the relevant passage in the document were "Bart Simpson" and the neighborhood "This is Bart Simpson with his sister Lisa", the size of the minimum window for the term "sister" would be 5. Obviously, this passage feedback algorithm was inspired by the original QAP scoring function.

6.2 Document feedback

The document feedback is similar to the passage feedback. However, it examines entire documents instead of passage neighborhoods. In addition, the document feedback algorithm does not assume that all documents returned by the previous stage are equally relevant. Instead, it gives different weights to the input documents according to their rank and score in the previous stage.

The algorithm examines the top 100 documents, as returned by BM25 in stage 2. For every term T that appears within a document D, the feedback score of that term is increased:

    score_T := score_T + w_D \cdot \log_2 \frac{N}{f_T \cdot len_D},       (7)

where w_D is the feedback weight of the document D. After trying several document weighting schemes, we decided to use the following formula:

    w_D = score_D \cdot c^{rank_D}.                                  (8)

score_D is the document's BM25 score, rank_D is the document rank in the result list produced by stage 2, and c < 1 is chosen in such a way that

    \sum_{i=1}^{10} c^i = \sum_{i=11}^{100} c^i,                      (9)

i.e., the first 10 documents have the same cumulated weight as the following 90 documents. This weighting function assumes that the results produced in stage 2 are already of high quality.

6.3 Google feedback

While the two feedback techniques described so far rely on the quality of the previous query and the text found in the Medline corpus, our feedback system includes a third technique that uses Google as a source of expansion terms.

The NEED part of the original query, stop words removed, is sent to Google. The document snippets returned are considered ordinary documents, and for every term T appearing in a Google snippet S, the term score is updated in the same way as above:

    score_T := score_T + \log_2 \frac{N}{f_T \cdot len_S}.              (10)

7 Species names

After the feedback process has finished, a final filtering is performed on the list of documents returned. The filter does not cause any changes to the document ranks unless our system has found one or more species names in the original query.

For example, if the original query contains the terms "human" and "mice", then the filter will try to match the documents against the strings "human", "humans", "homo sapiens", "mouse", "mice", and "mus musculus". The filter is based on a manually created list of 11 species names (and synonyms) that frequently appear in biomedical publications. The idea behind this process is that a species name that appears in a topic is most probably a key word.

The actual filtering is based on the interleave fusion technique. From the list of documents D returned by BM25, two lists are created:

• D_pos consists of all documents that contain at least one of the search strings;
• D_neg contains all the remaining documents.

D_pos and D_neg are merged into a new document list D' by taking two documents from D_pos and one document from D_neg in turn. Depending on the relative sizes of D_pos and D_neg, the filtering may leave the document order untouched or change it radically.
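The 2:1 interleaving can be sketched as follows (our code, not the system's):

    def species_filter(ranked_ids, texts, species_strings):
        """ranked_ids      -- document ids in BM25 rank order
        texts           -- dict: document id -> document text
        species_strings -- species names/synonyms found for the topic"""
        if not species_strings:
            return ranked_ids               # filter leaves the ranks unchanged
        pos, neg = [], []
        for d in ranked_ids:
            text = texts[d].lower()
            (pos if any(s in text for s in species_strings) else neg).append(d)
        merged, i, j = [], 0, 0
        while i < len(pos) or j < len(neg):
            # Take two documents from D_pos, then one from D_neg, in turn.
            merged.extend(pos[i:i + 2]); i += 2
            merged.extend(neg[j:j + 1]); j += 1
        return merged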
8 Evaluation

We submitted two runs for the Genomics track. The first run only made use of the NEED field of the given topics. The second run considered TITLE and NEED. If a query term appeared in both TITLE and NEED, it was simply given twice the normal term weight (q_T = 2.0). The results of our runs are shown in Table 1.

Table 1: Results for NEED-only and TITLE+NEED runs (50 topics, 8,268 relevant documents; "*": bug-fixed versions)

    Run            Recall   Avg. prec.   R-Prec.
    NEED           5,082    0.3318       0.3630
    TITLE+NEED     5,544    0.3867       0.4037
    NEED*          5,094    0.3394       0.3716
    TITLE+NEED*    5,534    0.3895       0.4046

After the runs had been submitted, we found a bug in the validation part of our retrieval system that gave a validation score to the first term in a document n times as high as to the last term, where n is the length of the document. All experiments discussed in this section have been conducted with the corrected version of the validation algorithm. For comparison, we have included both versions (with and without the bug) in Table 1. It can be seen from the table that using the TITLE as well improves the quality of the results dramatically over the NEED-only run. For the further discussion, we will only consider TITLE+NEED runs.

8.1 System components

This section deals with the retrieval effectiveness of the individual stages of our system. Table 2 shows the major trec_eval values for the intermediate results produced by each stage of our document retrieval system.

Table 2: Intermediate results for TITLE+NEED run (default parameters)

    Stage/Process    Recall   Avg. prec.   R-Prec.
    1/BM25,plain     4,811    0.3353       0.3707
    1/BM25,exp.      4,772    0.2775       0.3138
    1/QAP,plain      3,892    0.2782       0.3215
    2/BM25           5,315    0.3639       0.3845
    3/BM25,PRF       5,534    0.3935       0.4108
    3/Filtering      5,534    0.3895       0.4046

The first row (1/BM25,plain) contains the BM25 baseline, i.e., the results produced by an unexpanded query (stop words removed), where documents have been scored by BM25. The last row contains the results for the final document list obtained after species names filtering. We can see that both the number of relevant documents retrieved and the mean average precision (MAP) increase significantly, from 4,811 to 5,534 (15.0% improvement) and from 0.3353 to 0.3895 (16.2% improvement), respectively.

From the intermediate values, we can also conclude that the domain-specific query expansion techniques applied in stage 1 and stage 2 constitute a little more than half of this improvement (5,315 relevant documents retrieved, 0.3639 MAP), while the pseudo-relevance feedback performed in stage 3 is responsible for the other half. Additional experiments have shown that the effectiveness of the second stage is primarily caused by the generation of permutations of query terms/expansions, not by throwing away infrequent expansions.

Perhaps more interesting than the mean average precision, from the point of view of an actual person who has to look at the documents retrieved by our system, is the precision after 20 documents. This value increases from 0.530 (plain BM25 query) to 0.585 (final result), which is a 10.4% improvement. From Table 2 we can also see that the final species names filtering process could not improve the results in the expected way. We will investigate this phenomenon in section 8.4.

8.2 Domain-specific knowledge

It is clear by now that domain-specific query expansion is beneficial for the effectiveness of our document retrieval system. However, we do not yet know which part of our domain-specific query expansion method is the major factor for the improvement seen. Therefore, we have conducted some additional experiments in which we selectively disabled certain parts of the query expansion subsystem.

Table 3: Comparison of domain-specific knowledge (results after stage 2)

    Technique           Recall   Avg. prec.   R-Prec.
    Heuristics only     5,230    0.3550       0.3832
    Heu.&AcroMed        5,284    0.3722       0.4059
    Heu.&euGenes        5,253    0.3554       0.3827
    Heu.&Loc.Link       5,288    0.3631       0.3843
    All of the above    5,315    0.3639       0.3845

From the results shown in Table 3 (in comparison with Table 2) we see that the general heuristics we employed when generating lexical variants are responsible for the biggest improvement. The second best contributor is the AcroMed acronym database, which causes an improvement of 4.8% over the Heuristics only run. It is surprising that adding gene information from euGenes and LocusLink deteriorates the mean average precision (comparing rows Heu.&AcroMed and All of the above in Table 3), although the additional data increases the recall from 5,284 to 5,315 relevant documents. We believe that this is mainly because the number of alias symbols provided by the LocusLink database is overwhelming. For the term "TGFB" in topic 14, for instance, the expansion techniques in stage 1 produce 185 candidates (including lexical variants). For "P53" in topic 22, they come up with a set of 176 candidates. This high number of expansion terms entails two possible risks: reduction of the weight of the original query term and query drift.

8.3 Pseudo-relevance feedback

Pseudo-relevance feedback could increase the MAP by 8.1% and the recall by 4.1% over the results produced in stage 2 in our experiments (cf. Table 2). Since we utilized three different PRF methods, it is interesting to see which method had the greatest impact.

Table 4: Comparison of pseudo-relevance feedback methods

    Method              Recall   Avg. prec.   R-Prec.
    Passage PRF         5,470    0.3889       0.4092
    Document PRF        5,528    0.3924       0.4065
    Google PRF          5,370    0.3708       0.3942
    Passage&Doc.        5,559    0.3957       0.4128
    All of the above    5,534    0.3935       0.4108

Table 4 shows that passage feedback and document feedback were very effective. We can also see that the combination of both methods (row Passage&Doc.) produced even better results than each technique taken alone. On the other hand, Google feedback performed really poorly. It is not clear what caused Google to deliver much worse results than the other feedback methods. We can, however, identify at least two potential problems related to the use of Google:

• Google only supports exact boolean searches. While this is acceptable for short queries (3-5 terms), it can be problematic when dealing with longer queries.
• Queries are cut off after the 10th term. This does not only mean that we cannot send expanded queries to Google; even unexpanded queries might be too long. In fact, this was the case for 10 of the 50 topics.

8.4 Species names

The results shown in Table 2 suggest that the final filtering process performed after the pseudo-relevance feedback could not improve the performance of our system and even decreased its precision. To understand what happened during the filtering process, it is necessary to look at individual topics: our system could identify species names in 15 out of 50 topics. For 9 of these 15 topics a slight improvement was caused by the filtering, for 4 topics nothing changed, and for the remaining 2 topics the average precision dropped radically (0.8870 to 0.7971 and 0.9478 to 0.7870). Because of the unusually high precision before the filtering, these two topics cannot be considered representative. Hence, the filtering seems to have a slightly positive impact on the precision for topics with low or medium precision. We did not expect precision values above 70% when designing the filtering algorithm.

8.5 Expansion term weights

In stages 2 and 3, as shown in Figure 4, our system gives a term weight q_T = 0.45 to a term T for which we have an expansion. The disjunctive query element (the expansion) that is derived from the original term is given the weight q_{T'} = 0.95. Due to the lack of training data, we could not validate the choice of these parameters. Additional experiments conducted after the qrels for the topics had been released showed that by setting q_T := 0.9 and q_{T'} := 0.5 the MAP could have been increased:

• from 0.3639 to 0.3740 in stage 2;
• from 0.3935 to 0.4042 in stage 3.

This increase, however, carries the cost of a slightly decreased recall (5,253 down from 5,315 documents and 5,519 down from 5,534 documents, respectively).

9 Conclusions & Future work

For the TREC 2004 Genomics track, we have implemented a number of different query expansion techniques, based on simple heuristics generating lexical variants of query terms, domain-specific knowledge used to find synonyms and acronym expansions, and pseudo-relevance feedback. We have shown that most of the techniques utilized by our system improved recall and precision.

Of the sources we employed for knowledge-based query expansion, the AcroMed database of biomedical acronyms produced expansions of the highest quality, outperforming both the euGenes and LocusLink genetic databases.

Since the domain-specific knowledge in the genomics domain can be used to produce a very large number of possible synonyms for initial query terms, techniques to validate these expansions have to be found. We presented a simple way to verify that an expansion term appears in the context we are interested in by looking at the results of an early retrieval stage. Better techniques for expansion validation, probably based on language model approaches, are desirable.

Our experiments with Google as a source of pseudo-relevance feedback were a little disappointing, since Google performed worse than either of the two other feedback methods. Nonetheless, we will keep using Google for this purpose and try to find a way to circumvent the problems created by Google's restricted query interface. We will also study how our own web corpus, which comprises about 1 terabyte of publicly available text data, can be used to produce high-quality expansion terms in the special context of biomedical publications.

References

[Acr04] The Medstract Project – AcroMed 1.1. http://medstract.med.tufts.edu/acro1.1/, 2004.

[CCB94] Charles L. A. Clarke, Gordon V. Cormack, and Forbes J. Burkowski. An Algebra for Structured Text Search and a Framework for its Implementation. Technical report, University of Waterloo, August 1994.

[CCL01] Charles L. A. Clarke, Gordon V. Cormack, and Thomas R. Lynam. Exploiting Redundancy in Question Answering. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 358–365, November 2001.

[CCT00] Charles L. A. Clarke, Gordon V. Cormack, and Elizabeth A. Tudhope. Relevance Ranking for One to Three Term Queries. Information Processing and Management, 36(2):291–311, 2000.

[euG04] Genomic Information for Eukaryotic Organisms. http://eugenes.org/, 2004.

[Gil02] Donald G. Gilbert. euGenes: A Eukaryote Genome Information System. Nucleic Acids Research, 30(1):145–148, 2002.

[Lee97] Joon H. Lee. Analyses of Multiple Evidence Combination. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–276. ACM Press, 1997.

[LL04] National Center for Biotechnology Information – LocusLink Home Page. http://www.ncbi.nih.gov/LocusLink/, 2004.

[MeS04] U.S. National Library of Medicine – Medical Subject Headings Home Page. http://www.nlm.nih.gov/mesh/, 2004.

[PCC+01] James Pustejovsky, Jose Castano, Brent Cochran, Maciej Kotecki, Michael Morrell, and Anna Rumshisky. Linguistic Knowledge Extraction from Medline: Automatic Construction of an Acronym Database. In MedInfo 2001: Proceedings of the 10th World Congress on Health and Medical Informatics, September 2001.

[PM04] U.S. National Library of Medicine – PubMed. http://www.ncbi.nlm.nih.gov/pubmed, 2004.

[RWB98] S. Robertson, S. Walker, and M. Beaulieu. Okapi at TREC-7. In Proceedings of the Seventh Text REtrieval Conference (TREC 1998), November 1998.

[RWJ+94] S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC 1994), November 1994.

[SF94] Joseph A. Shaw and Edward A. Fox. Combination of Multiple Searches. In Proceedings of the Third Text REtrieval Conference (TREC 1994), November 1994.

[YCC+03] David L. Yeung, Charles L. A. Clarke, Gordon V. Cormack, Thomas R. Lynam, and Egidio L. Terra. Task-Specific Query Expansion. In Proceedings of the 12th Text REtrieval Conference (TREC 2003), November 2003.