A Neural Autoencoder Approach for Document Ranking and Query Refinement in Pharmacogenomic Information Retrieval
Jonas Pfeiffer 1,2   Samuel Broscheit 1   Rainer Gemulla 1   Mathias Göschl 2
1 University of Mannheim, Mannheim, Germany
2 Molecular Health GmbH, Heidelberg, Germany
[email protected]  [email protected]  [email protected]

Proceedings of the BioNLP 2018 workshop, pages 87-97, Melbourne, Australia, July 19, 2018. © 2018 Association for Computational Linguistics

Abstract

In this study, we investigate learning-to-rank and query refinement approaches for information retrieval in the pharmacogenomic domain. The goal is to improve the information retrieval process of biomedical curators, who manually build knowledge bases for personalized medicine. We study how to exploit the relationships between genes, variants, drugs, diseases and outcomes as features for document ranking and query refinement. For a supervised approach, we are faced with a small amount of annotated data and a large amount of unannotated data. Therefore, we explore ways to use a neural document auto-encoder in a semi-supervised approach. We show that a combination of established algorithms, feature engineering and a neural auto-encoder model yields promising results in this setting.

1 Introduction

Personalized medicine strives to relate genomic detail to patient phenotypic conditions (such as disease, adverse reactions to treatment) and to assess the effectiveness of available treatment options (Brunicardi et al., 2011). For computer-assisted decision making, knowledge bases need to be compiled from published scientific evidence. They describe biomarker relationships between key entity types: Disease, Protein/Gene, Variant/Mutation, Drug, and Patient Outcome (Outcome) (Manolio, 2010). While automated information extraction has been applied with adequate precision and recall to simple relationships, such as Drug-Drug (Asada et al., 2017) or Protein-Protein (Peng and Lu, 2017; Peng et al., 2015; Li et al., 2017) interaction, clinically actionable biomarkers need to satisfy rigorous quality criteria set by physicians and therefore call for manual data curation by domain experts.

To ascertain the timeliness of information, curators are faced with the labor-intensive task of identifying relevant articles in a steadily growing flow of publications (Lee et al., 2018). In our scenario, curators iteratively refine search queries in an electronic library, such as PubMed.[1] The information the curators search for is biomarker-facts in the form of {Gene(s) - Variant(s) - Drug(s) - Disease(s) - Outcome}. For example, a curator starts with a query consisting of a single gene, e.g. q1 = {PIK3CA}, and receives a set of documents D1. After examining D1, the curator identifies the variants H1047R and E545K, which yields the queries q2 = {PIK3CA, H1047R} and q3 = {PIK3CA, E545K} that lead to D2 and D3. As soon as studies are found that contain the entities in a biomarker relationship, the entities and the studies are entered into the knowledge base. This process is then repeated until, theoretically, all published literature regarding the gene PIK3CA has been screened.

[1] https://www.ncbi.nlm.nih.gov/pubmed

Our goal is to reduce the amount of documents which domain experts need to examine. To achieve this, an information retrieval system should rank documents high that are relevant to the query and should facilitate the identification of relevant entities to refine the query.

Classic approaches for document ranking, like tf-idf (Luhn, 1957; Spärck Jones, 1972) or bm25 (Robertson and Zaragoza, 2009), and, for example, the Relevance Model (Lavrenko and Croft, 2001) for query refinement, are established techniques in this setting. They are known to be robust and do not require data for training. However, as they are based on a bag-of-words model (BOW), they cannot represent a semantic relationship of entities in a document. This, for example, yields search results with highly ranked review articles that only list query terms without the desired relationship between them. Therefore, we investigate approaches that model the semantic relationships between biomarker entities. This can be addressed either by combining BOW with rule-based filtering or by supervised learning, i.e. learning-to-rank (LTR).

Our goal is to tailor document ranking and query refinement to the task of the curator. This means that a document ranking model should assign a high rank to a document that contains the query entities in a biomarker relationship. A query refinement model should suggest additional query terms, i.e. biomarker entities, to the curator that are relevant to the current query. Given the complexity of entity relationships and the high variety of textual realizations, this requires either effective feature engineering or large amounts of training data for a supervised approach. The in-house data set of Molecular Health consists of 5833 labeled biomarker-facts and 24 million unlabeled text documents from PubMed. Therefore, a good solution is to exploit the large amount of unlabeled data in a semi-supervised approach. Li et al. (2015) have shown that a neural auto-encoder with LSTMs (Hochreiter and Schmidhuber, 1997) can encode the syntax and semantics of a text in a dense vector representation. We show that this representation can be effectively used as a feature for semi-supervised learning-to-rank and query refinement.

In this paper, we describe a feature engineering approach and a semi-supervised approach. In our experiments we show that the two approaches are almost on par in terms of performance and even improve in a joint model. In Section 2 we describe the neural auto-encoder; we then proceed in Section 3 to describe our models for document ranking and in Section 4 the models for query refinement.

2 Neural Auto-Encoder

In this study, we use an unsupervised method to encode text into a dense vector representation. Our goal is to investigate if we can use this representation as an encoding of the relations between biomarker entities. Following Sutskever et al. (2014), Cho et al. (2014), Dai and Le (2015), and Li et al. (2015), we implemented a text auto-encoder with a sequence-to-sequence approach. In this model, an encoder Enc produces a vector representation v = Enc(d) of an input document d = [w_1, w_2, ..., w_n], with w_i being word embedding representations (Mikolov et al., 2013). This dense representation v is then fed to a decoder Dec that tries to reconstruct the original input, i.e. d̂ = Dec(v). During training we minimize error(d̂, d). After training, we only use Enc(d) to encode the text. We want to explore if we can use Enc to encode the documents and the query. We will use the output of the document encoder Enc as features for a document ranking model and for a query refinement model.

3 Document Ranking

Information retrieval systems rank documents in the order that is estimated to be most useful to a user query by assigning a numeric score to each document. Our pipeline for document ranking is depicted in Figure 1: Given a query q, we first retrieve a set of documents Dq that contain all of the query terms. Then, we compute a representation rep_q(q) for the query q and a representation rep_d(d) for each document d ∈ Dq. Finally, we compute the score with a ranker model score_rank.

Figure 1: Document Ranking

For rep_d we need to find a representation for an arbitrary number of entity-type combinations, because a fact can consist of, e.g., 3 Genes, 4 Variants, 1 Drug, 0 Diseases and 0 Outcomes. In the following, we describe several of the settings for rep_q(q), rep_d(d) and the ranker model.
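To make the retrieve-represent-score pipeline concrete, the following is a minimal sketch in Python. The helper names (retrieve, bow_rep, score_rank_dot) and the toy corpus are our own illustrations, not from the paper; a plain term-frequency bag-of-words and a dot product stand in for the representation functions and the learned ranker described below.

```python
from collections import Counter

def retrieve(query_terms, corpus):
    """Boolean AND retrieval: keep documents containing all query terms."""
    return {doc_id: text for doc_id, text in corpus.items()
            if all(t in text.split() for t in query_terms)}

def bow_rep(tokens):
    """Term-frequency bag-of-words (stand-in for rep_q / rep_d)."""
    return Counter(tokens)

def score_rank_dot(rep_q, rep_d):
    """Dot product of sparse vectors (stand-in for the ranker model)."""
    return sum(freq * rep_d.get(term, 0) for term, freq in rep_q.items())

# Hypothetical two-document corpus for illustration.
corpus = {
    "d1": "PIK3CA H1047R mutation confers resistance to the drug",
    "d2": "review listing PIK3CA and unrelated genes",
}
q = ["PIK3CA"]
Dq = retrieve(q, corpus)
ranked = sorted(Dq,
                key=lambda d: score_rank_dot(bow_rep(q), bow_rep(Dq[d].split())),
                reverse=True)
```

Any scorer with the same signature (e.g. an MLP over learned representations, as in Section 3.2) can be swapped in for the dot product without changing the pipeline.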
3.1 Bag-of-Words Models

We have implemented two commonly used BOW models, tf-idf and bm25. For these models, the text representations rep_q(q) and rep_d(d) are the vector space model.

3.2 Learning-to-Rank Models

For the learning-to-rank models, we chose a multilayer perceptron (MLP) as the scoring function score_rank. In the following we explain how rep_q(q) and rep_d(d) are computed.

Feature Engineering Model  We created a set of basic features: encoding the frequency of entity types, distance features between entity types, and context words of entities. In this model, features are query dependent and are computed on-demand by a feature function f(q, d). The algorithm to compute the distance features is as follows: Given a query q with entities e ∈ q and a document d = [w_1, w_2, ..., w_n], with w_i being words in the document, let type(e) be the function that yields the entity type, e.g. type(e) = Gene. Then, if e_i, e_j ∈ q and there exist w_k = e_i and w_l = e_j, we add |l − k| to the bucket of {type(e_i), type(e_j)}. To summarize the collected distances, we compute min(), max(), mean() and std() over all collected distances for each bucket.

Figure 2: Query Refinement

4 Query Refinement

Query refinement finds additional terms for the initial query q that better describe the information need of the user (Nallapati and Shah, 2006). In our approach we follow Cao et al. (2008), in which the ranked documents Dq are used as pseudo relevance feedback. Our goal is to suggest relevant entities e that are contained in Dq and that are in a biomarker relationship to q. Therefore, we define a scoring function score_ref for ranking candidate entities e for the query q with respect to the retrieved document set Dq.
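The distance-feature algorithm of Section 3.2 can be sketched as follows. The entity-type lookup table and the example sentence are hypothetical, and we use the population standard deviation for std(); the paper does not specify which variant it uses.

```python
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical stand-in for the type(e) function from the paper.
ENTITY_TYPE = {"PIK3CA": "Gene", "H1047R": "Variant", "alpelisib": "Drug"}

def distance_features(query_entities, doc_tokens):
    """For each pair of query entities, add |l - k| for every pair of
    positions k, l where they occur, bucketed by the unordered pair of
    their entity types; summarize each bucket with min/max/mean/std."""
    positions = {e: [i for i, w in enumerate(doc_tokens) if w == e]
                 for e in query_entities}
    buckets = {}
    for ei, ej in combinations(query_entities, 2):
        key = frozenset((ENTITY_TYPE[ei], ENTITY_TYPE[ej]))
        for k in positions[ei]:
            for l in positions[ej]:
                buckets.setdefault(key, []).append(abs(l - k))
    return {key: (min(d), max(d), mean(d), pstdev(d))
            for key, d in buckets.items()}

doc = "the PIK3CA variant H1047R responds to alpelisib in PIK3CA tumors".split()
feats = distance_features(["PIK3CA", "H1047R"], doc)
# One {Gene, Variant} bucket with distances [2, 5].
```

Each bucket thus contributes four numbers to the feature vector that is fed to the MLP ranker.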
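As a rough illustration of pseudo relevance feedback for this setting, the sketch below ranks candidate expansion entities by their frequency in the retrieved set Dq. This frequency count is only a stand-in for score_ref, whose actual definition is not given here, and the documents and entity lists are invented for the example.

```python
from collections import Counter

def score_ref_candidates(query, Dq, candidate_entities):
    """Rank candidate entities by occurrence count in the pseudo
    relevance feedback set Dq (a simple stand-in for score_ref)."""
    counts = Counter()
    for text in Dq.values():
        tokens = text.split()
        for e in candidate_entities:
            if e not in query:  # never suggest terms already in the query
                counts[e] += tokens.count(e)
    return counts.most_common()

Dq = {
    "d1": "PIK3CA H1047R is a hotspot mutation",
    "d2": "PIK3CA E545K and H1047R occur in breast cancer",
}
suggestions = score_ref_candidates({"PIK3CA"}, Dq, ["H1047R", "E545K", "PIK3CA"])
```

In the paper's setting, the candidates would additionally be restricted to entities standing in a biomarker relationship to the query, rather than to arbitrary co-occurring terms.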