
Domain-Specific Synonym Expansion and Validation for Biomedical Information Retrieval (MultiText Experiments for TREC 2004)

Stefan Büttcher, Charles L. A. Clarke, Gordon V. Cormack

School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada

Abstract

In the domain of biomedical publications, synonyms and homonyms are omnipresent and pose a great challenge for document retrieval systems. For this year's TREC Genomics ad hoc retrieval task, we mainly addressed the problem of dealing with synonyms. We examined the impact of domain-specific knowledge on retrieval effectiveness and analyzed the quality of Google as a source of query expansion terms based on pseudo-relevance feedback. Our results show that automatic acronym expansion, realized by querying the AcroMed database of biomedical acronyms, almost always improves the performance of our document retrieval system. Google, on the other hand, produced results that were worse than those of the corpus-based feedback techniques we also used in our experiments. We believe that the primary reason for Google's poor performance in this task is its highly restricted query language.

1 Introduction

This paper describes the work done by members of the MultiText project at the University of Waterloo for the TREC 2004 Genomics track. The MultiText group participated only in the ad hoc retrieval task of the Genomics track, which consisted of finding documents that are relevant with respect to certain topics in a subset of the Medline/PubMed database of medical publications [PM04]. The subset included articles published between 1963 and 2004. It consisted of 4,591,008 different documents with a total size of 14 GB (XML text), where the term document refers to an incomplete article, annotated with certain keywords, such as MeSH terms [MeS04]. Out of these 4,591,008 documents, 3,479,798 contained the abstract of the article; the remaining 1,111,210 contained only the article title and the annotations. In total, 50 topics were given to the track participants. An example topic is shown in Figure 1.

Information retrieval in the context of biomedical databases traditionally has to contend with three major problems: the frequent use of (possibly non-standardized) acronyms, the presence of homonyms (the same word referring to two or more different entities), and synonyms (two or more words referring to the same entity). For the two runs we submitted for the Genomics track, we addressed the problems caused by acronyms and synonyms. We chose to follow a strategy that was a mixture of known heuristics for approximate string matching on genomics-related data and a knowledge-based approach that used domain-specific knowledge from various sources, such as the AcroMed database of biomedical acronyms [Acr04]. Since acronyms can be viewed as a special form of synonym, it was possible to develop a system that deals with both problems (synonyms and acronyms) in a uniform way.

Furthermore, we implemented an automatic query expansion technique using pseudo-relevance feedback. We evaluated the quality of feedback terms produced by a simple Google query, compared the results with known query expansion techniques, and studied possibilities to combine both strategies.

    <TOPIC>
      <ID>14</ID>
      <TITLE>Expression or Regulation of TGFB in HNSCC cancers</TITLE>
      <NEED>Documents regarding TGFB expression or regulation in HNSCC
        cancers.</NEED>
      <CONTEXT>The laboratory wants to identify components of the TGFB
        signaling pathway in HNSCC, and determine new targets to study
        HNSCC.</CONTEXT>
    </TOPIC>

Figure 1: Example topic in original XML form
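Each topic thus consists of an identifier, a short title, and a longer statement of the information need, plus background context. As a minimal sketch, the fields can be read out of the XML like this (the exact tag names are assumptions based on the reconstruction in Figure 1, not verified against the official topic files):

    import xml.etree.ElementTree as ET

    def parse_topic(xml_text):
        # Tag names (TOPIC/ID/TITLE/NEED/CONTEXT) assumed from Figure 1.
        root = ET.fromstring(xml_text)
        return {tag: root.findtext(tag)
                for tag in ("ID", "TITLE", "NEED", "CONTEXT")}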

2 The MultiText system

Our work is based on the MultiText text retrieval system, which was initially implemented at the University of Waterloo in 1994 [CCB94] and has undergone a variety of extensions and modifications over the last decade. MultiText supports several different document scoring mechanisms. We employed the MultiText implementation of Okapi BM25 [RWJ+94] [RWB98] and MultiText's QAP passage scoring algorithm [CCT00] [CCL01], which had initially been developed for question answering tasks.

2.1 Okapi BM25 document scoring

MultiText's implementation of BM25 is the same as the original function proposed by Robertson et al. [RWJ+94] in the absence of document relevance information. Every query term T is assigned a term weight

    w_T = \ln\left( \frac{|D| - |D_T| + 0.5}{|D_T| + 0.5} \right),                    (1)

where D is the set of all documents in the corpus, and D_T is the set of documents containing the term T. For a given query Q = {T_1, ..., T_n}, the score of a document D is computed using the formula

    \sum_{T \in Q} w_T \cdot q_T \cdot \frac{d_T \cdot (1 + k_1)}{d_T + k_1 \cdot \left( (1 - b) + b \cdot \frac{len_D}{len_{avg}} \right)},      (2)

where d_T is the number of occurrences of the term T in the document D, len_D is the length of D, len_avg is the average document length in the corpus, and q_T is the query-specific relative weight of the term T. Usually, q_T equals the number of occurrences of T in the original query. We discuss this parameter in more detail in section 8. The remaining parameters in formula 2 were chosen to be k_1 = 1.2 and b = 0.75.

In addition to mere term queries, MultiText supports structured queries, in particular disjunctions of terms. We call a disjunction of query terms

    T_\vee = T_1 \vee T_2 \vee \cdots \vee T_m

a disjunctive query element. The BM25 weight of a disjunctive query element is

    w_{T_\vee} = \ln\left( \frac{|D| - |D_{T_1} \cup \cdots \cup D_{T_m}| + 0.5}{|D_{T_1} \cup \cdots \cup D_{T_m}| + 0.5} \right).        (3)

For convenience, we let the "+" symbol denote the disjunction operator ("\vee") whenever we present example queries.
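To make the weighting concrete, here is a minimal sketch of equations (1)-(3) in Python (our own illustration, not the MultiText implementation; all function and variable names are ours):

    import math

    K1, B = 1.2, 0.75  # parameter values used in our experiments

    def term_weight(num_docs, doc_freq):
        """Equation (1): w_T from corpus size |D| and document frequency |D_T|."""
        return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

    def bm25_score(query_weights, doc_tf, doc_len, avg_doc_len, num_docs, doc_freqs):
        """Equation (2). query_weights maps term -> q_T, doc_tf maps
        term -> d_T, and doc_freqs maps term -> |D_T|."""
        score = 0.0
        for term, q_t in query_weights.items():
            d_t = doc_tf.get(term, 0)
            if d_t == 0:
                continue
            w_t = term_weight(num_docs, doc_freqs[term])
            norm = d_t + K1 * ((1 - B) + B * doc_len / avg_doc_len)
            score += w_t * q_t * d_t * (1 + K1) / norm
        return score

    def disjunction_weight(num_docs, union_doc_freq):
        """Equation (3): weight of a disjunctive query element, where
        union_doc_freq = |D_T1 u ... u D_Tm| (documents containing at
        least one of the disjoined terms)."""
        return math.log((num_docs - union_doc_freq + 0.5) / (union_doc_freq + 0.5))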

2.2 MultiText QAP passage scoring

MultiText's QAP passage scoring algorithm computes text passage scores based on the concept of self-information. It may be extended to a document scoring algorithm by finding the passage with the highest score within a document and assigning the passage's score to the entire document, although this is not the original purpose of QAP. Our pseudo-relevance feedback mechanisms are derived from the QAP scoring function.

Given a query Q, every passage P of length l, containing the terms \mathcal{T} \subseteq Q, is assigned the score

    H(P) = \sum_{T \in \mathcal{T}} \left( \log_2 \frac{N}{f_T} - \log_2 l \right),        (4)

where N is the total number of tokens in the corpus (corpus size), and f_T is the number of occurrences of the term T in the corpus. H(P) is called the self-information of the passage P.
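Stated as code, the passage score and the derived document score look roughly as follows (again a sketch under our own naming, not the MultiText implementation):

    import math

    def passage_score(terms_in_passage, passage_len, corpus_size, corpus_freq):
        """Equation (4): each query term in the passage contributes
        log2(N / f_T) - log2(l)."""
        return sum(math.log2(corpus_size / corpus_freq[t]) - math.log2(passage_len)
                   for t in terms_in_passage)

    def document_score(passages, corpus_size, corpus_freq):
        """Score a document by its best passage, as described above.
        passages -- iterable of (set of query terms in passage, passage length)"""
        return max(passage_score(t, l, corpus_size, corpus_freq)
                   for t, l in passages)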


    [Figure 2 is a block diagram of the retrieval pipeline. Stage 1:
    topic parsing (XML text) and stop word removal feed a species names
    finder, a variant generator, and an acronym/synonym finder (AcroMed,
    euGenes, LocusLink); the resulting unexpanded query is scored by
    MultiText BM25 and MultiText QAP, the expanded query by MultiText
    BM25. Stage 2: document fusion (interleave), followed by an expansion
    validator and permutation finder, produces a final expanded query
    scored by MultiText BM25. Stage 3: passage feedback, document
    feedback, and Google feedback are combined by term fusion (CombSUM)
    for query expansion (pseudo-relevance feedback), scored by MultiText
    BM25, and post-processed by species names filtering.]

Figure 2: The document retrieval system

3 System overview

An overview of the document retrieval system developed for the Genomics track is given by Figure 2. The system consists of a preprocessing stage and three document retrieval stages. Most parts of the system can be changed by varying the respective parameters. Because there were no training data available, many of these parameters were chosen arbitrarily.

Query preprocessing & synonym expansion (Stage 1). In the preprocessing stage, the input data (XML text) is parsed and all stop words are removed. The list of stop words contains the usual terms, like "the" and "and", but also a number of additional terms inferred from the sample topics, e.g. "find", "information", and "literature". After the stop words have been removed, several synonym generation techniques are applied to the query. A detailed explanation of these techniques can be found in section 4. The result of this process is an expanded query containing possible synonyms of the terms in the original query. The unexpanded query is taken by MultiText to score documents using both BM25 and QAP, while the expanded query is used to produce a third, BM25-scored document set.

Expansion validation (Stage 2). In the second stage (see section 5 for details), the documents returned by the first stage are analyzed and systematically scanned for any occurrences of terms added to the query in the expansion phase of stage 1, or of permutations of query terms. Terms that have been added to the query in stage 1 but occur very infrequently in the documents returned are removed from the query, whereas frequent permutations of query terms which have not been part of the old query might be added. The resulting query is sent to MultiText, and documents are scored by BM25.

Pseudo-relevance feedback (Stage 3). The third stage, which is described in sections 6 and 7, performs a set of standard pseudo-relevance feedback operations in order to create new expansion terms. The terms produced by the feedback are added to the query, and BM25 is used again to compute the document scores. Finally, if our retrieval system has found species names in the original query, it tries to match the documents in the result list against these species names and promotes documents that contain the species names or synonyms of them.

    ID: 14
    TITLE: Expression or Regulation of TGFB in HNSCC cancers
    NEED: Documents regarding TGFB expression or regulation in HNSCC cancers.
    QUERY_TITLE: "expression" "regulation" "tgfb" "hnscc" "cancers"
    EXPANDED_TITLE: "expression" "regulation" ("tgfb"+"blk"+"ced"+"dlhc"+
      "tgfbeta"+"tgfb1"+"tgfb2"+"tgf r"+"transforming growth factor beta"+
      "latent transforming growth factor beta"+"ltbp 1"+"ltbp1"+"mav"+
      "maverick"+.....) ("hnscc"+"head and neck squamous cell") "cancers"
    QUERY_NEED: "regarding" "tgfb" "expression" "regulation" "hnscc" "cancers"
    EXPANDED_NEED: "regarding" ("tgfb"+"blk"+"ced"+"dlhc"+"tgfbeta"+"tgfb1"+
      .....) "expression" "regulation" ("hnscc"+"head and neck squamous cell")
      "cancers"

Figure 3: Example topic after preprocessing and query expansion in stage 1

4 Domain-specific query expansion

For domain-specific query expansion, we used two different techniques in stage 1 of our system:

• a general variant generation approach that creates lexical variants of a list of given terms;
• a database of biomedical synonyms, extracted from various sources available on the web.

In addition to these methods, we examine permutations of query terms to find further expansions. The result of the query expansion in stage 1 is an expanded query that combines each original query term and all its expansions (synonyms etc.) into one disjunctive query element (defined in section 2), as shown in Figure 3.

4.1 Lexical variants

In the context of biomedical articles, we have to deal with an abundant number of lexical variants of the same term. These are mainly caused by the authors' different preferences regarding hyphenation, spacing, and Greek letters. In the Medline corpus used for this year's Genomics track, for instance, the NF-kappa B protein is referred to in 6 different ways: "NF-kappa B" (33,902 times), "NF-kappaB" (28,551), "NFkappaB" (3,211), "NF-kB" (688), "NFkB" (259), and "NFkappa B" (45).

Our system addresses this problem by scanning the original query for interesting terms (i.e., terms that are likely to have lexical variants). We consider a term interesting if it contains at least one hyphen, digit, or Greek letter. For each such term, a list of possible variants is created according to the following rules:

1. Greek letters are contracted to their Latin equivalents: "alpha" becomes "a", "beta" becomes "b", and so on.
2. Hyphens are removed from the term.
3. Before and after every Greek letter, a hyphen is inserted.
4. At every transition from alphabetic to numerical characters and back, a hyphen is inserted.

For each term, every possible combination of the above rules is applied. For example, the set of variants generated for the term "Lsp1alpha" (Larval serum protein 1 alpha) is: "lsp-1-alpha", "lsp-1-a", "lsp-1alpha", "lsp-1a", "lsp1-alpha", "lsp1-a", "lsp1alpha", "lsp1a".

The problem of replacing spaces by hyphens and vice versa can be ignored because the MultiText engine already considers them equivalent. Case folding is automatically applied to all terms as well.
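These rules are simple enough to state as code. The following sketch (our own reconstruction of the rules, not the system's code) treats every Greek-letter contraction and every token boundary as an independent binary choice, which reproduces the eight variants listed above for "Lsp1alpha":

    import itertools
    import re

    GREEK = {"alpha": "a", "beta": "b", "gamma": "g", "delta": "d",
             "kappa": "k"}  # extend as needed

    def tokenize(term):
        """Lower-case, drop hyphens, split into letter/digit runs, and
        split letter runs around embedded Greek-letter names."""
        tokens = []
        pattern = "(" + "|".join(sorted(GREEK, key=len, reverse=True)) + ")"
        for run in re.findall(r"[a-z]+|[0-9]+", term.lower()):
            if run.isdigit():
                tokens.append(run)
            else:
                tokens.extend(t for t in re.split(pattern, run) if t)
        return tokens

    def variants(term):
        """All lexical variants of a term (rules 1-4 in all combinations)."""
        tokens = tokenize(term)
        # Rule 1: each Greek token may be contracted to its Latin initial.
        choices = [(GREEK[t], t) if t in GREEK else (t,) for t in tokens]
        results = set()
        for picked in itertools.product(*choices):
            # Rules 2-4: each junction independently gets "" or "-".
            for seps in itertools.product(["", "-"], repeat=len(picked) - 1):
                results.add(picked[0] +
                            "".join(s + t for s, t in zip(seps, picked[1:])))
        return results

    # variants("Lsp1alpha") yields exactly the eight forms listed above,
    # from "lsp1a" to "lsp-1-alpha".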

4.2 Synonym expansion

In biomedical publications, there are mainly three types of synonyms that can appear in a query:

• acronyms;
• gene names or symbols;
• protein names or symbols.

Acronyms can be dealt with in a straightforward fashion by adding the respective long form to the query whenever a known acronym is encountered. Genes and proteins have additional structure that can be used when looking for appropriate expansions: the symbol "ACP" (acid phosphatase) describes a family of proteins, including ACP1 and ACP2. This allows us to not only add synonyms of "ACP" to the query, but also "ACP1" and "ACP2". Following the Gene Ontology terminology, we call these narrow synonyms (as opposed to exact synonyms, such as acronyms).

We used three different sources to deal with the above kinds of synonyms: the AcroMed database of biomedical acronyms [PCC+01] [Acr04], the Eukaryotic Genes database at the University of Indiana [Gil02] [euG04], and the LocusLink database [LL04].

AcroMed

The AcroMed database contains a list of mappings from acronyms to their long forms, automatically created from Medline abstracts. For our experiments, we used only a subset of the AcroMed data. We ignored all acronym/long-form pairs with fewer than 5 occurrences. The rationale behind this was to avoid low-quality expansion terms caused by wrong AcroMed data. The resulting database contained 25,589 acronyms and 49,822 long forms.

For every input topic, our system searches the topic text for occurrences of acronyms stored in the AcroMed database. In order to allow for lexical variants, an approximate string matching algorithm is used that allows exactly the types of variants that are described in section 4.1.

euGenes

The Genomic Information for Eukaryotic Organisms database (euGenes.org) consists of genetic information for 9 different species. The version used for our experiments contains 184,460 different genes. Our system does not use the information offered by euGenes to its full extent. Instead, it only builds a mapping from gene symbols to full names.

Again, the special needs discussed in section 4.1 are taken into account when the euGenes database is searched for a gene symbol. Furthermore, when looking for expansions, we also allow for narrow synonyms by cutting off the tail of the gene symbol after the last transition from alphabetical to numerical characters or vice versa. The gene symbol "TGFB2" (human gene Transforming Growth Factor Beta 2), for example, is given the database entries "TGFB2" and "TGFB". The same holds for hyphens, so the gene "ATPsyn-beta" (Drosophila gene ATP Synthase Beta) can even be found when the query is asking for "ATPsyn".
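A sketch of this truncation step (our code; the function name is ours):

    import re

    def index_keys(symbol):
        """Index a gene symbol under itself and under the narrow-synonym
        prefix obtained by cutting at the last letter/digit transition or
        hyphen, e.g. "TGFB2" -> {"tgfb2", "tgfb"} and
        "ATPsyn-beta" -> {"atpsyn-beta", "atpsyn"}."""
        s = symbol.lower()
        keys = {s}
        cuts = list(re.finditer(r"[a-z](?=[0-9])|[0-9](?=[a-z])|-", s))
        if cuts:
            prefix = s[:cuts[-1].end()].rstrip("-")
            if prefix:
                keys.add(prefix)
        return keys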
LocusLink

LocusLink is the most prominent source of publicly available information on genes. It provides detailed information about the function and position of genes. We used a version of the LocusLink database containing 128,580 entries. Our system does not utilize the full LocusLink information but only the following fields of a LocusLink record:

• OFFICIAL_SYMBOL: the officially approved symbol for this gene;
• PREFERRED_SYMBOL: a symbol that might become the official symbol in the future, but has not been validated by the nomenclature committee yet;
• OFFICIAL_GENE_NAME: the official name of the gene;
• PREFERRED_GENE_NAME: similar to the PREFERRED_SYMBOL field;
• ALIAS_SYMBOL: an alias that is sometimes used to refer to the gene;
• PRODUCT: the protein product of the transcript.

A gene can be found in the database by searching for its official symbol, its preferred symbol, one of its aliases, or its protein product. The product is included in this list because genes and their products are often treated as synonyms. Narrow synonyms do not get a special treatment in the LocusLink data because that information in many cases is already included in the ALIAS_SYMBOL fields of a gene record.

Some LocusLink records also provide a PMID field that contains a list of PubMed articles related to the gene. This information could have been very beneficial if used in an appropriate way. However, it was not clear to us whether these fields are of sufficiently high quality and how exactly we could make good use of them.

5 Expansion validation

In stage 2, the results of the first stage are combined using an interleaving fusion technique described by Yeung et al. [YCC+03]. From each of the three document lists generated in stage 1, one document is picked in turn and added to the new document list. Duplicates are removed from the result.

    ID: 14
    TITLE: Expression or Regulation of TGFB in HNSCC cancers
    NEED: Documents regarding TGFB expression or regulation in HNSCC cancers.
    QUERY_TITLE: "expression" "regulation" "tgfb" "hnscc" "cancers"
    EXPANDED_TITLE: #1.00 "expression" #1.00 "regulation" #0.45 "tgfb"
      #0.95 ("tgfb"+"tgf beta"+"beta tgf"+"tgfbeta"+"tgfb1"+"tgfb2"+"tgfb3")
      #0.45 "hnscc" #0.95 ("hnscc"+"head and neck squamous cell") #1.00 "cancers"
    QUERY_NEED: "regarding" "tgfb" "expression" "regulation" "hnscc" "cancers"
    EXPANDED_NEED: #1.00 "regarding" #0.45 "tgfb" #0.95 ("tgfb"+"tgf beta"+
      "beta tgf"+"tgfbeta"+"tgfb1"+"tgfb2"+"tgfb3") #0.45 "hnscc"
      #0.95 ("hnscc"+"head and neck squamous cell") #1.00 "cancers"

Figure 4: Example topic in stage 2 (with query-specific term weights qT )

The new document list is used to validate the expansion terms in the following way: the system takes the first 150 documents from the result list produced by stage 1 and tries to match them against the expansion terms generated by the expansion techniques described in section 4. Furthermore, it looks for permutations of expansion terms in the documents returned. This is done because the name of the protein that belongs to a certain gene is often a permutation of the gene name. The Homo sapiens gene "glucosidase, alpha; acid", for instance, encodes the "acid alpha-glucosidase" preprotein.

The documents are scanned for the expansion terms or term sequences, and the number of occurrences is counted for every expansion. For every original query term, the best 10 expansions are kept. This set of remaining expansion candidates is then further restricted by removing all expansions that have less than 1% of the occurrences of the top candidate. The remaining expansions are combined into a new disjunctive query element that is added to the original query.

Terms for which an expansion has been found are given a reduced term weight q_T = 0.45. However, the set of validated expansions gets a weight q_{T'} = 0.95, so the total weight of the term increases to q_{T,total} = 1.4. The idea behind this weight increase is the assumption that terms for which we have an expansion are likely to be key elements of the original query. Having the original term appear twice in the resulting query (once by itself, once in the disjunction) is intended to keep query drift low. Query terms for which no expansion could be found get the default term weight q_T = 1.0.

The result of the validation process is shown in Figure 4.
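In code, the counting and thresholding just described looks roughly like this (a sketch under our own naming; permutation generation is omitted for brevity):

    from collections import Counter

    def validate_expansions(expansions, top_docs):
        """expansions -- dict: original term -> list of candidate strings
        top_docs   -- texts of the first 150 fused stage-1 documents"""
        validated = {}
        for term, candidates in expansions.items():
            counts = Counter()
            for text in top_docs:
                lowered = text.lower()
                for cand in candidates:
                    counts[cand] += lowered.count(cand.lower())
            best = counts.most_common(10)       # keep the 10 best expansions
            if not best or best[0][1] == 0:
                continue
            cutoff = 0.01 * best[0][1]          # drop candidates below 1% of top
            validated[term] = [c for c, n in best if n >= cutoff]
        return validated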
6 Pseudo-relevance feedback

In stage 3 of our retrieval system, we expand the query produced in stage 2 by using the results of three different pseudo-relevance feedback (PRF) algorithms. The output of each PRF process is a vector of term-score pairs. All terms that are already part of the old query (either exactly or in a stem-equivalent form) are removed from the vectors. The results are merged using the normalized CombSUM fusion algorithm [Lee97] [SF94].

From the merged term vector, the best 10 candidates are taken and added to the old query with query-specific term weights

    q_T = 0.3 \cdot score_T,                                   (5)

where score_T is the feedback score of the term T. Because feedback scores are normalized, the top-scoring feedback term will have weight 0.3 in the resulting expanded query.

In the following sections we explain the computation of the feedback scores for each of the three feedback techniques employed.

6.1 Passage feedback

The passage feedback algorithm used by our system creates a vector of expansion candidate terms from a list of scored passages (nodes "MultiText QAP" and "Passage Feedback" in Figure 2). It is based on the QAP passage scoring algorithm described in section 2 and has successfully been used by the MultiText group in TREC 2003 [YCC+03].

The algorithm takes the top 100 documents, as returned by MultiText's QAP, and examines the neighborhood of the best passage within each document. Every time the algorithm encounters a certain term T within that neighborhood, the score of that term is increased:

    score_T := score_T + \log_2 \frac{N}{f_T} - \log_2 l,            (6)

where N is the corpus size, f_T the number of occurrences of T in the corpus, and l is the size of the minimum window that contains both the relevant passage and the term T. If, for example, the relevant passage in the document were "Bart Simpson" and the neighborhood "This is Bart Simpson with his sister Lisa", the size of the minimum window for the term "sister" would be 5. Obviously, this passage feedback algorithm was inspired by the original QAP scoring function.

6.2 Document feedback

The document feedback is similar to the passage feedback. However, it examines entire documents instead of passage neighborhoods. In addition, the document feedback algorithm does not assume that all documents returned by the previous stage are equally relevant. Instead, it gives different weights to the input documents according to their rank and score in the previous stage.

The algorithm examines the top 100 documents, as returned by BM25 in stage 2. For every term T that appears within a document D, the feedback score of that term is increased:

    score_T := score_T + w_D \cdot \log_2 \frac{N}{f_T \cdot len_D},       (7)

where w_D is the feedback weight of the document D. After trying several document weighting schemes, we decided to use the following formula:

    w_D = score_D \cdot c^{rank_D}.                                  (8)

score_D is the document's BM25 score, rank_D is the document rank in the result list produced by stage 2, and c < 1 is chosen in such a way that

    \sum_{i=1}^{10} c^i = \sum_{i=11}^{100} c^i,                      (9)

i.e., the first 10 documents have the same cumulated weight as the following 90 documents. This weighting function assumes that the results produced in stage 2 are already of high quality.

6.3 Google feedback

While the two feedback techniques described so far rely on the quality of the previous query and the text found in the Medline corpus, our feedback system includes a third technique that uses Google as a source of expansion terms.

The NEED part of the original query, stop words removed, is sent to Google. The document snippets returned are considered ordinary documents, and for every term T appearing in a Google snippet S, the term score is updated in the same way as above:

    score_T := score_T + \log_2 \frac{N}{f_T \cdot len_S}.              (10)

7 Species names

After the feedback process has finished, a final filtering is performed on the list of documents returned. The filter does not cause any changes to the document ranks unless our system has found one or more species names in the original query.

For example, if the original query contains the terms "human" and "mice", then the filter will try to match the documents against the strings "human", "humans", "homo sapiens", "mouse", "mice", and "mus musculus". The filter is based on a manually created list of 11 species names (and synonyms) that frequently appear in biomedical publications. The idea behind this process is that a species name that appears in a topic is most probably a key word.

The actual filtering is based on the interleave fusion technique. From the list of documents D returned by BM25, two lists are created:

• D_pos consists of all documents that contain at least one of the search strings;
• D_neg contains all the remaining documents.

D_pos and D_neg are merged into a new document list D' by taking two documents from D_pos and one document from D_neg in turn. Depending on the relative sizes of D_pos and D_neg, the filtering may leave the document order untouched or change it radically.
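The 2:1 interleaving can be sketched as follows (our code, not the system's):

    def species_filter(ranked_ids, texts, species_strings):
        """ranked_ids      -- document ids in BM25 rank order
        texts           -- dict: document id -> document text
        species_strings -- species names/synonyms found for the topic"""
        if not species_strings:
            return ranked_ids               # filter leaves the ranks unchanged
        pos, neg = [], []
        for d in ranked_ids:
            text = texts[d].lower()
            (pos if any(s in text for s in species_strings) else neg).append(d)
        merged, i, j = [], 0, 0
        while i < len(pos) or j < len(neg):
            # Take two documents from D_pos, then one from D_neg, in turn.
            merged.extend(pos[i:i + 2]); i += 2
            merged.extend(neg[j:j + 1]); j += 1
        return merged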
8 Evaluation

We submitted two runs for the Genomics track. The first run only made use of the NEED field of the given topics. The second run considered TITLE and NEED. If a query term appeared in both TITLE and NEED, it was simply given twice the normal term weight (q_T = 2.0). The results of our runs are shown in Table 1.

Table 1: Results for NEED-only and TITLE+NEED runs (50 topics, 8,268 relevant documents; "*": bug-fixed versions)

    Run            Recall   Avg. prec.   R-Prec.
    NEED           5,082    0.3318       0.3630
    TITLE+NEED     5,544    0.3867       0.4037
    NEED*          5,094    0.3394       0.3716
    TITLE+NEED*    5,534    0.3895       0.4046

After the runs had been submitted, we found a bug in the validation part of our retrieval system that gave a validation score to the first term in a document n times as high as to the last term, where n is the length of the document. All experiments discussed in this section have been conducted with the corrected version of the validation algorithm. For comparison, we have included both versions (with and without the bug) in Table 1. It can be seen from the table that using the TITLE as well improves the quality of the results dramatically over the NEED-only run. For the further discussion, we will only consider TITLE+NEED runs.

8.1 System components

This section deals with the retrieval effectiveness of the individual stages of our system. Table 2 shows the major trec_eval values for the intermediate results produced by each stage of our document retrieval system.

Table 2: Intermediate results for TITLE+NEED run (default parameters)

    Stage/Process    Recall   Avg. prec.   R-Prec.
    1/BM25,plain     4,811    0.3353       0.3707
    1/BM25,exp.      4,772    0.2775       0.3138
    1/QAP,plain      3,892    0.2782       0.3215
    2/BM25           5,315    0.3639       0.3845
    3/BM25,PRF       5,534    0.3935       0.4108
    3/Filtering      5,534    0.3895       0.4046

The first row (1/BM25,plain) contains the BM25 baseline, i.e., the results produced by an unexpanded query (stop words removed), where documents have been scored by BM25. The last row contains the results for the final document list obtained after species names filtering. We can see that both the number of relevant documents retrieved and the mean average precision (MAP) increase significantly, from 4,811 to 5,534 (15.0% improvement) and from 0.3353 to 0.3895 (16.2% improvement), respectively.

From the intermediate values, we can also conclude that the domain-specific query expansion techniques applied in stage 1 and stage 2 constitute a little more than half of this improvement (5,315 relevant documents retrieved, 0.3639 MAP), while the pseudo-relevance feedback performed in stage 3 is responsible for the other half. Additional experiments have shown that the effectiveness of the second stage is primarily caused by the generation of permutations of query terms/expansions, not by throwing away infrequent expansions.

Perhaps more interesting than the mean average precision, from the point of view of an actual person who has to look at the documents retrieved by our system, is the precision after 20 documents. This value increases from 0.530 (plain BM25 query) to 0.585 (final result), which is a 10.4% improvement. From Table 2 we can also see that the final species names filtering process could not improve the results in the expected way. We will investigate this phenomenon in section 8.4.

8.2 Domain-specific knowledge

It is clear by now that domain-specific query expansion is beneficial for the effectiveness of our document retrieval system. However, we do not yet know which part of our domain-specific query expansion method is the major factor for the improvement seen. Therefore, we have conducted some additional experiments in which we selectively disabled certain parts of the query expansion subsystem.

Table 3: Comparison of domain-specific knowledge (results after stage 2)

    Technique           Recall   Avg. prec.   R-Prec.
    Heuristics only     5,230    0.3550       0.3832
    Heu.&AcroMed        5,284    0.3722       0.4059
    Heu.&euGenes        5,253    0.3554       0.3827
    Heu.&Loc.Link       5,288    0.3631       0.3843
    All of the above    5,315    0.3639       0.3845

From the results shown in Table 3 (in comparison with Table 2) we see that the general heuristics we employed when generating lexical variants are responsible for the biggest improvement. The second best contributor is the AcroMed acronym database, which causes an improvement of 4.8% over the Heuristics only run. It is surprising that adding gene information from euGenes and LocusLink deteriorates the mean average precision (comparing rows Heu.&AcroMed and All of the above in Table 3), although the additional data increases the recall from 5,284 to 5,315 relevant documents. We believe that this is mainly because the number of alias symbols provided by the LocusLink database is overwhelming. For the term "TGFB" in topic 14, for instance, the expansion techniques in stage 1 produce 185 candidates (including lexical variants). For "P53" in topic 22, they come up with a set of 176 candidates. This high number of expansion terms entails two possible risks: reduction of the weight of the original query term and query drift.

8.3 Pseudo-relevance feedback

Pseudo-relevance feedback could increase the MAP by 8.1% and the recall by 4.1% over the results produced in stage 2 in our experiments (cf. Table 2). Since we utilized three different PRF methods, it is interesting to see which method had the greatest impact.

Table 4: Comparison of pseudo-relevance feedback methods

    Method              Recall   Avg. prec.   R-Prec.
    Passage PRF         5,470    0.3889       0.4092
    Document PRF        5,528    0.3924       0.4065
    Google PRF          5,370    0.3708       0.3942
    Passage&Doc.        5,559    0.3957       0.4128
    All of the above    5,534    0.3935       0.4108

Table 4 shows that passage feedback and document feedback were very effective. We can also see that the combination of both methods (row Passage&Doc.) produced even better results than each technique taken alone. On the other hand, Google feedback performed really poorly. It is not clear what caused Google to deliver much worse results than the other feedback methods. We can, however, identify at least two potential problems related to the use of Google:

• Google only supports exact boolean searches. While this is acceptable for short queries (3-5 terms), it can be problematic when dealing with longer queries.
• Queries are cut off after the 10th term. This does not only mean that we cannot send expanded queries to Google; even unexpanded queries might be too long. In fact, this was the case for 10 of the 50 topics.

8.4 Species names

The results shown in Table 2 suggest that the final filtering process performed after the pseudo-relevance feedback could not improve the performance of our system and even decreased its precision. To understand what happened during the filtering process, it is necessary to look at individual topics: our system could identify species names in 15 out of 50 topics. For 9 of these 15 topics a slight improvement was caused by the filtering, for 4 topics nothing changed, and for the remaining 2 topics the average precision dropped radically (0.8870 to 0.7971 and 0.9478 to 0.7870). Because of the unusually high precision before the filtering, these two topics cannot be considered representative. Hence, the filtering seems to have a slightly positive impact on the precision for topics with low or medium precision. We did not expect precision values above 70% when designing the filtering algorithm.

8.5 Expansion term weights

In stages 2 and 3, as shown in Figure 4, our system gives a term weight q_T = 0.45 to a term T for which we have an expansion. The disjunctive query element (the expansion) that is derived from the original term is given the weight q_{T'} = 0.95. Due to the lack of training data, we could not validate the choice of these parameters. Additional experiments conducted after the qrels for the topics had been released showed that by setting q_T := 0.9 and q_{T'} := 0.5 the MAP could have been increased:

• from 0.3639 to 0.3740 in stage 2;
• from 0.3935 to 0.4042 in stage 3.

This increase, however, carries the cost of a slightly decreased recall (5,253 down from 5,315 documents and 5,519 down from 5,534 documents, respectively).

9 Conclusions & Future work

For the TREC 2004 Genomics track, we have implemented a number of different query expansion techniques, based on simple heuristics generating lexical variants of query terms, domain-specific knowledge used to find synonyms and acronym expansions, and pseudo-relevance feedback. We have shown that most of the techniques utilized by our system improved recall and precision.

Of the sources we employed for knowledge-based query expansion, the AcroMed database of biomedical acronyms produced expansions of the highest quality, outperforming both the euGenes and LocusLink genetic databases.

Since the domain-specific knowledge in the genomics domain can be used to produce a very large number of possible synonyms for initial query terms, techniques to validate these expansions have to be found. We presented a simple way to verify that an expansion term appears in the context we are interested in by looking at the results of an early retrieval stage. Better techniques for expansion validation, probably based on language model approaches, are desirable.

Our experiments with Google as a source of pseudo-relevance feedback were a little disappointing, since Google performed worse than either of the two other feedback methods. Nonetheless, we will keep using Google for this purpose and try to find a way to circumvent the problems created by Google's restricted query interface. We will also study how our own web corpus, which comprises about 1 terabyte of publicly available text data, can be used to produce high-quality expansion terms in the special context of biomedical publications.

References

[Acr04] The Medstract Project – AcroMed 1.1. http://medstract.med.tufts.edu/acro1.1/, 2004.

[CCB94] Charles L. A. Clarke, Gordon V. Cormack, and Forbes J. Burkowski. An Algebra for Structured Text Search and a Framework for its Implementation. Technical report, University of Waterloo, August 1994.

[CCL01] Charles L. A. Clarke, Gordon V. Cormack, and Thomas R. Lynam. Exploiting Redundancy in Question Answering. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 358–365, November 2001.

[CCT00] Charles L. A. Clarke, Gordon V. Cormack, and Elizabeth A. Tudhope. Relevance Ranking for One to Three Term Queries. Information Processing and Management, 36(2):291–311, 2000.

[euG04] Genomic Information for Eukaryotic Organisms. http://eugenes.org/, 2004.

[Gil02] Donald G. Gilbert. euGenes: A Eukaryote Genome Information System. Nucleic Acids Research, 30(1):145–148, 2002.

[Lee97] Joon H. Lee. Analyses of Multiple Evidence Combination. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–276. ACM Press, 1997.

[LL04] National Center for Biotechnology Information – LocusLink Home Page. http://www.ncbi.nih.gov/LocusLink/, 2004.

[MeS04] U.S. National Library of Medicine – Medical Subject Headings Home Page. http://www.nlm.nih.gov/mesh/, 2004.

[PCC+01] James Pustejovsky, Jose Castano, Brent Cochran, Maciej Kotecki, Michael Morrell, and Anna Rumshisky. Linguistic Knowledge Extraction from Medline: Automatic Construction of an Acronym Database. In MedInfo 2001: Proceedings of the 10th World Congress on Health and Medical Informatics, September 2001.

[PM04] U.S. National Library of Medicine – PubMed. http://www.ncbi.nlm.nih.gov/pubmed, 2004.

[RWB98] S. Robertson, S. Walker, and M. Beaulieu. Okapi at TREC-7. In Proceedings of the Seventh Text REtrieval Conference (TREC 1998), November 1998.

[RWJ+94] S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC 1994), November 1994.

[SF94] Joseph A. Shaw and Edward A. Fox. Combination of Multiple Searches. In Proceedings of the Third Text REtrieval Conference (TREC 1994), November 1994.

[YCC+03] David L. Yeung, Charles L. A. Clarke, Gordon V. Cormack, Thomas R. Lynam, and Egidio L. Terra. Task-Specific Query Expansion. In Proceedings of the 12th Text REtrieval Conference (TREC 2003), November 2003.