BioNLP 2020
The 19th SIGBioMed Workshop on Biomedical Language Processing
Proceedings of the Workshop
July 9, 2020
©2020 The Association for Computational Linguistics
Order copies of this and other ACL proceedings from:
Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360 USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]
ISBN 978-1-952148-09-5
BioNLP 2020: Research unscathed by COVID-19
Kevin Bretonnel Cohen, Dina Demner-Fushman, Sophia Ananiadou, Junichi Tsujii
The past year has been more than exciting for natural language processing in general, and for biomedical natural language processing in particular. A gradual accretion of studies of reproducibility and replicability in natural language processing, biomedical and otherwise, had been making it clear that the reproducibility crisis that has hit most of the rest of science is not going to spare text mining or its related fields. Then, in March of 2020, much of the world ground to a sudden halt.
The outbreak of the COVID-19 disease caused by the novel coronavirus SARS-CoV-2 made computational work more obviously relevant than it had perhaps ever been before. Suddenly, newscasters were arguing about viral clades, the daily news was full of stories about modelling, and your neighbor had heard of PCR. But some of us did not really see a role for natural language processing in the brave new world of computational instant reactions to an international pandemic.
That was wrong.
In mid-to-late March of 2020, a joint project between the Allen Institute for Artificial Intelligence (AI2), the National Library of Medicine (NLM), and the White House Office of Science and Technology Policy (OSTP) released CORD-19, a corpus of work on the SARS-CoV-2 virus, on the COVID-19 disease, and on related coronavirus research. It was immediately notable for its inclusion of "gray literature" from preprint servers, which have mostly been neglected in text mining research, as well as for its flexibility with regard to licensing of content types. Perhaps most importantly, it was released in conjunction with a number of task types, including one related to ethics: although the value of medical ethics has been widely obvious since the Nazi "medical" experimentation horrors of the Second World War, the worldwide pandemic has made the value of medical ethicists more apparent to the general public than at any time since. Those task type definitions enabled the broader natural language processing community to jump into the fray quite quickly, and initial results have been quick to arrive.
Meanwhile, the pandemic did nothing to slow research in biomedical natural language processing on any other topic. That can be seen in the fact that this year the Association for Computational Linguistics SIGBioMed workshop on biomedical natural language processing received 73 submissions. The unfortunate effect of the pandemic was the cancellation of the physical workshop, which would have allowed acceptance of all high-quality submissions as posters, if not as podium presentations. Indeed, the poster sessions at BioNLP have been growing continuously in size, owing to the large number of high-quality submissions that the workshop receives annually. Unfortunately, because this year the Association for Computational Linguistics annual meeting will take place online only, there will be no poster session for the workshop. Consequently, only a handful of submissions could be accepted for presentation.
The transition of traditional conferences to online presentations at the beginning of the COVID-19 pandemic showed that the traditional presentation formats are not as engaging remotely as they are in the context of in-person sessions. We are therefore exploring a new form of presentation, hoping it will be more engaging, interactive, and informative: 22 papers (about 30% of the submissions) will be presented in panel-like sessions. Papers will be grouped by similarity of topic, so that participants with related interests will be able to interact regarding their papers with a hopefully optimal number of people online at the same time. As we write this introduction, the conference plans and platform are still evolving, as are the daily lives of much of the planet, so we hope that you will join us in planning for the worst while hoping for the best.
Panel Discussions
Papers referenced in this section are included in this volume, unless otherwise indicated.
Session 1: High accuracy information retrieval, spin and bias
Invited talk and discussion lead: Kirk Roberts
Presentations: The exploration of information retrieval approaches enhanced with linguistic knowledge continues in work that allows life-science researchers to search PubMed and the CORD-19 collection using patterns over dependency graphs (Taub-Tabib et al.). Representing biomedical relationships in the literature by encoding dependency structure with word embeddings promises to improve retrieval of relationships and literature-based discovery (Paullada et al.). Word embeddings trained on biomedical research articles, together with tests based on their associations and coherence, among others, allow detecting and quantifying gender bias over time (Rios et al.). A BioBERT model fine-tuned for relation extraction might assist in detecting spin in reporting the results of randomized clinical trials (Koroleva et al.). Finally, a novel sequence-to-set approach to generating terms for pseudo-relevance feedback is evaluated (Das et al.).
Session 2: Clinical Language Processing
Invited talk and discussion lead: Tim Miller
Presentations: Not surprisingly, much of the potentially reproducible work in the clinical domain is based on the Medical Information Mart for Intensive Care (MIMIC) data (Johnson et al., 2016). Kovaleva et al. used the MIMIC-CXR data to explore Visual Dialog for radiology and prepare the first publicly available silver- and gold-standard datasets for this task. Searle et al. present a MIMIC-based silver standard for automatic clinical coding and warn that frequently assigned codes in MIMIC-III might be undercoded. Mascio et al. used MIMIC and the Shared Annotated Resources (ShARe)/CLEF dataset in four classification tasks: disease status, temporality, negation, and uncertainty. Temporality is explored in-depth by Lin et al., and Wang et al. explore approaches to a clinical Semantic Textual Similarity (STS) task. Xu et al. apply reinforcement learning to deal with noise in clinical text for readmission prediction after kidney transplant.
Session 3: Language Understanding
Invited talk and discussion lead: Anna Rumshisky
Presentations: Bringing clinical problems and poetry together, this creative work seeks to better understand dyslexia through a self-attention transformer and Shakespearean sonnets (Bleiweiss). Detection of early stages of Alzheimer’s disease using unsupervised clustering is explored with 10 years of President Ronald Reagan’s speeches (Wang et al.). Stavropoulos et al. introduce BIOMRC, a cloze-style dataset for biomedical machine reading comprehension, along with new publicly available models, and provide a leaderboard for the task. Another type of question answering – answering questions that can be answered by electronic medical records – is explored by Rawat et al. Hur et al. study veterinary records to identify reasons for administration of antibiotics. DeYoung et al. expand the Evidence Inference dataset and evaluate BERT-based models for the evidence inference task.
Session 4: Named Entity Recognition and Knowledge Representation
Invited talk and discussion lead: Hoifung Poon
Invited talk: Machine Reading for Precision Medicine
The advent of big data promises to revolutionize medicine by making it more personalized and effective, but big data also presents a grand challenge of information overload. For example, tumor sequencing has become routine in cancer treatment, yet interpreting the genomic data requires painstakingly curating knowledge from a vast biomedical literature, which grows by thousands of papers every day. Electronic medical records contain valuable information to speed up clinical trial recruitment and drug development, but curating such real-world evidence from clinical notes can take hours for a single patient. Natural language processing (NLP) can play a key role in interpreting big data for precision medicine. In particular, machine reading can help unlock knowledge from text by substantially improving curation efficiency. However, standard supervised methods require labeled examples, which are expensive and time-consuming to produce at scale. In this talk, Dr. Poon presents Project Hanover, where the team overcomes the annotation bottleneck by combining deep learning with probabilistic logic, and by exploiting self-supervision from readily available resources such as ontologies and databases. This enables the researchers to extract knowledge from millions of publications, reason efficiently with the resulting knowledge graph by learning neural embeddings of biomedical entities and relations, and apply the extracted knowledge and learned embeddings to support precision oncology.
Hoifung Poon is the Senior Director of Precision Health NLP at Microsoft Research and an affiliated professor at the University of Washington Medical School. He leads Project Hanover, with the overarching goal of structuring medical data for precision medicine. He has given tutorials on this topic at top conferences such as the Association for Computational Linguistics (ACL) and the Association for the Advancement of Artificial Intelligence (AAAI). His research spans a wide range of problems in machine learning and natural language processing (NLP), and his prior work has been recognized with Best Paper Awards from premier venues such as the North American Chapter of the Association for Computational Linguistics (NAACL), Empirical Methods in Natural Language Processing (EMNLP), and Uncertainty in AI (UAI). He received his PhD in Computer Science and Engineering from the University of Washington, specializing in machine learning and NLP.
Presentations: Nejadgholi et al. analyze errors in NER and introduce an F-score that models a forgiving user experience. Peng et al. study NER, relation extraction, and other tasks with a multi-tasking learning approach. Amin et al. explore multi-instance learning for relation extraction. ShafieiBavani et al. also explore relation and event extraction, but in the context of simultaneously predicting relationships between all mention pairs in a text. Chang et al. provide a benchmark for knowledge graph embedding models on the SNOMED-CT knowledge graph and emphasize the importance of knowledge graphs for learning biomedical knowledge representation.
Acknowledging the community
As always, we are profoundly grateful to the authors who chose BioNLP for presenting their innovative research. The authors’ willingness to continue sharing their work through BioNLP consistently makes the workshop noteworthy and stimulating. We are equally indebted to the program committee members (listed elsewhere in this volume), who produced thorough and insightful reviews on an even tighter schedule and with a heavier workload than usual, all while handling the unprecedented changes in their work and lives caused by the COVID-19 pandemic.
References
Johnson, A., Pollard, T., Shen, L. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016). https://doi.org/10.1038/sdata.2016.35
Wang, L.L., Lo, K., Chandrasekhar, Y. et al. CORD-19: The COVID-19 Open Research Dataset. ArXiv, abs/2004.10706, 2020.
Organizers:
Dina Demner-Fushman, US National Library of Medicine
Kevin Bretonnel Cohen, University of Colorado School of Medicine, USA
Sophia Ananiadou, National Centre for Text Mining and University of Manchester, UK
Junichi Tsujii, National Institute of Advanced Industrial Science and Technology, Japan
Program Committee:
Sophia Ananiadou, National Centre for Text Mining and University of Manchester, UK
Emilia Apostolova, Language.ai, USA
Eiji Aramaki, University of Tokyo, Japan
Asma Ben Abacha, US National Library of Medicine
Siamak Barzegar, Barcelona Supercomputing Center, Spain
Olivier Bodenreider, US National Library of Medicine
Leonardo Campillos Llanos, Universidad Autonoma de Madrid, Spain
Qingyu Chen, US National Library of Medicine
Fenia Christopoulou, National Centre for Text Mining and University of Manchester, UK
Aaron Cohen, Oregon Health & Science University, USA
Kevin Bretonnel Cohen, University of Colorado School of Medicine, USA
Brian Connolly, Kroger Digital, USA
Viviana Cotik, University of Buenos Aires, Argentina
Manirupa Das, Amazon Search, Seattle, WA, USA
Dina Demner-Fushman, US National Library of Medicine
Bart Desmet, Clinical Center, National Institutes of Health, USA
Travis Goodwin, US National Library of Medicine
Natalia Grabar, CNRS, France
Cyril Grouin, LIMSI - CNRS, France
Tudor Groza, The Garvan Institute of Medical Research, Australia
Antonio Jimeno Yepes, IBM, Melbourne Area, Australia
Halil Kilicoglu, University of Illinois at Urbana-Champaign, USA
Ari Klein, University of Pennsylvania, USA
Andre Lamurias, University of Lisbon, Portugal
Majid Latifi, Trinity College Dublin, Ireland
Alberto Lavelli, FBK-ICT, Italy
Robert Leaman, US National Library of Medicine
Ulf Leser, Humboldt-Universität zu Berlin, Germany
Maolin Li, National Centre for Text Mining and University of Manchester, UK
Zhiyong Lu, US National Library of Medicine
Timothy Miller, Children’s Hospital Boston, USA
Claire Nedellec, INRA, France
Aurelie Neveol, LIMSI - CNRS, France
Mariana Neves, German Federal Institute for Risk Assessment, Germany
Denis Newman-Griffis, Clinical Center, National Institutes of Health, USA
Nhung Nguyen, The University of Manchester, UK
Karen O’Connor, University of Pennsylvania, USA
Naoaki Okazaki, Tokyo Institute of Technology, Japan
Yifan Peng, US National Library of Medicine
Laura Plaza, UNED, Madrid, Spain
Francisco J. Ribadas-Pena, University of Vigo, Spain
Angus Roberts, The University of Sheffield, UK
Kirk Roberts, The University of Texas Health Science Center at Houston, USA
Roland Roller, DFKI GmbH, Berlin, Germany
Diana Sousa, University of Lisbon, Portugal
Karin Verspoor, The University of Melbourne, Australia
Davy Weissenbacher, University of Pennsylvania, USA
W. John Wilbur, US National Library of Medicine
Shankai Yan, US National Library of Medicine
Chrysoula Zerva, National Centre for Text Mining and University of Manchester, UK
Ayah Zirikly, Clinical Center, National Institutes of Health, USA
Pierre Zweigenbaum, LIMSI - CNRS, France
Additional Reviewers:
Jingcheng Du, School of Biomedical Informatics, UTHealth
Invited Speakers:
Hoifung Poon, Microsoft Research
Tim Miller, Boston Children’s Hospital and Harvard Medical School
Kirk Roberts, School of Biomedical Informatics, UTHealth
Anna Rumshisky, University of Massachusetts Lowell
Table of Contents
Quantifying 60 Years of Gender Bias in Biomedical Research with Word Embeddings Anthony Rios, Reenam Joshi and Hejin Shin ...... 1
Sequence-to-Set Semantic Tagging for Complex Query Reformulation and Automated Text Categorization in Biomedical IR using Self-Attention Manirupa Das, Juanxi Li, Eric Fosler-Lussier, Simon Lin, Steve Rust, Yungui Huang and Rajiv Ramnath...... 14
Interactive Extractive Search over Biomedical Corpora Hillel Taub Tabib, Micah Shlain, Shoval Sadde, Dan Lahav, Matan Eyal, Yaara Cohen and Yoav Goldberg...... 28
Improving Biomedical Analogical Retrieval with Embedding of Structural Dependencies Amandalynne Paullada, Bethany Percha and Trevor Cohen...... 38
DeSpin: a prototype system for detecting spin in biomedical publications Anna Koroleva, Sanjay Kamath, Patrick Bossuyt and Patrick Paroubek ...... 49
Towards Visual Dialog for Radiology Olga Kovaleva, Chaitanya Shivade, Satyananda Kashyap, Karina Kanjaria, Joy Wu, Deddeh Ballah, Adam Coy, Alexandros Karargyris, Yufan Guo, David Beymer, Anna Rumshisky and Vandana Mukherjee ...... 60
A BERT-based One-Pass Multi-Task Model for Clinical Temporal Relation Extraction Chen Lin, Timothy Miller, Dmitriy Dligach, Farig Sadeque, Steven Bethard and Guergana Savova 70
Experimental Evaluation and Development of a Silver-Standard for the MIMIC-III Clinical Coding Dataset Thomas Searle, Zina Ibrahim and Richard Dobson ...... 76
Comparative Analysis of Text Classification Approaches in Electronic Health Records Aurelie Mascio, Zeljko Kraljevic, Daniel Bean, Richard Dobson, Robert Stewart, Rebecca Bendayan and Angus Roberts ...... 86
Noise Pollution in Hospital Readmission Prediction: Long Document Classification with Reinforcement Learning Liyan Xu, Julien Hogan, Rachel E. Patzer and Jinho D. Choi ...... 95
Evaluating the Utility of Model Configurations and Data Augmentation on Clinical Semantic Textual Similarity Yuxia Wang, Fei Liu, Karin Verspoor and Timothy Baldwin ...... 105
Entity-Enriched Neural Models for Clinical Question Answering Bhanu Pratap Singh Rawat, Wei-Hung Weng, So Yeon Min, Preethi Raghavan and Peter Szolovits 112
Evidence Inference 2.0: More Data, Better Models Jay DeYoung, Eric Lehman, Benjamin Nye, Iain Marshall and Byron C. Wallace...... 123
Personalized Early Stage Alzheimer’s Disease Detection: A Case Study of President Reagan’s Speeches Ning Wang, Fan Luo, Vishal Peddagangireddy, Koduvayur Subbalakshmi and Rajarathnam Chandramouli...... 133
BioMRC: A Dataset for Biomedical Machine Reading Comprehension Dimitris Pappas, Petros Stavropoulos, Ion Androutsopoulos and Ryan McDonald ...... 140
Neural Transduction of Letter Position Dyslexia using an Anagram Matrix Representation Avi Bleiweiss ...... 150
Domain Adaptation and Instance Selection for Disease Syndrome Classification over Veterinary Clinical Notes Brian Hur, Timothy Baldwin, Karin Verspoor, Laura Hardefeldt and James Gilkerson ...... 156
Benchmark and Best Practices for Biomedical Knowledge Graph Embeddings David Chang, Ivana Balažević, Carl Allen, Daniel Chawla, Cynthia Brandt and Andrew Taylor ...... 167
Extensive Error Analysis and a Learning-Based Evaluation of Medical Entity Recognition Systems to Approximate User Experience Isar Nejadgholi, Kathleen C. Fraser and Berry de Bruijn ...... 177
A Data-driven Approach for Noise Reduction in Distantly Supervised Biomedical Relation Extraction Saadullah Amin, Katherine Ann Dunfield, Anna Vechkaeva and Guenter Neumann...... 187
Global Locality in Biomedical Relation and Event Extraction Elaheh ShafieiBavani, Antonio Jimeno Yepes, Xu Zhong and David Martinez Iraola...... 195
An Empirical Study of Multi-Task Learning on BERT for Biomedical Text Mining Yifan Peng, Qingyu Chen and Zhiyong Lu ...... 205
Conference Program
Thursday July 9, 2020
08:30–08:40 Opening remarks
08:40–10:30 Session 1: High accuracy information retrieval, spin and bias
08:40–09:10 Invited Talk – Kirk Roberts
09:10–09:20 Quantifying 60 Years of Gender Bias in Biomedical Research with Word Embeddings Anthony Rios, Reenam Joshi and Hejin Shin
09:20–09:30 Sequence-to-Set Semantic Tagging for Complex Query Reformulation and Automated Text Categorization in Biomedical IR using Self-Attention Manirupa Das, Juanxi Li, Eric Fosler-Lussier, Simon Lin, Steve Rust, Yungui Huang and Rajiv Ramnath
09:30–09:40 Interactive Extractive Search over Biomedical Corpora Hillel Taub Tabib, Micah Shlain, Shoval Sadde, Dan Lahav, Matan Eyal, Yaara Cohen and Yoav Goldberg
09:40–09:50 Improving Biomedical Analogical Retrieval with Embedding of Structural Depen- dencies Amandalynne Paullada, Bethany Percha and Trevor Cohen
09:50–10:00 DeSpin: a prototype system for detecting spin in biomedical publications Anna Koroleva, Sanjay Kamath, Patrick Bossuyt and Patrick Paroubek
10:00–10:30 Discussion
10:30–10:45 Coffee Break
10:45–13:00 Session 2: Clinical Language Processing
10:45–11:15 Invited Talk – Tim Miller
11:15–11:25 Towards Visual Dialog for Radiology Olga Kovaleva, Chaitanya Shivade, Satyananda Kashyap, Karina Kanjaria, Joy Wu, Deddeh Ballah, Adam Coy, Alexandros Karargyris, Yufan Guo, David Beymer, Anna Rumshisky and Vandana Mukherjee
11:25–11:35 A BERT-based One-Pass Multi-Task Model for Clinical Temporal Relation Extraction Chen Lin, Timothy Miller, Dmitriy Dligach, Farig Sadeque, Steven Bethard and Guergana Savova
11:35–11:45 Experimental Evaluation and Development of a Silver-Standard for the MIMIC-III Clinical Coding Dataset Thomas Searle, Zina Ibrahim and Richard Dobson
11:45–11:55 Comparative Analysis of Text Classification Approaches in Electronic Health Records Aurelie Mascio, Zeljko Kraljevic, Daniel Bean, Richard Dobson, Robert Stewart, Rebecca Bendayan and Angus Roberts
11:55–12:05 Noise Pollution in Hospital Readmission Prediction: Long Document Classification with Reinforcement Learning Liyan Xu, Julien Hogan, Rachel E. Patzer and Jinho D. Choi
12:05–12:15 Evaluating the Utility of Model Configurations and Data Augmentation on Clinical Semantic Textual Similarity Yuxia Wang, Fei Liu, Karin Verspoor and Timothy Baldwin
12:15–12:45 Discussion
12:45–13:30 Lunch
13:30–15:30 Session 3: Language Understanding
13:30–14:00 Invited Talk – Anna Rumshisky
14:00–14:10 Entity-Enriched Neural Models for Clinical Question Answering Bhanu Pratap Singh Rawat, Wei-Hung Weng, So Yeon Min, Preethi Raghavan and Peter Szolovits
14:10–14:20 Evidence Inference 2.0: More Data, Better Models Jay DeYoung, Eric Lehman, Benjamin Nye, Iain Marshall and Byron C. Wallace
14:20–14:30 Personalized Early Stage Alzheimer’s Disease Detection: A Case Study of President Reagan’s Speeches Ning Wang, Fan Luo, Vishal Peddagangireddy, Koduvayur Subbalakshmi and Rajarathnam Chandramouli
14:30–14:40 BioMRC: A Dataset for Biomedical Machine Reading Comprehension Dimitris Pappas, Petros Stavropoulos, Ion Androutsopoulos and Ryan McDonald
14:40–14:50 Neural Transduction of Letter Position Dyslexia using an Anagram Matrix Representation Avi Bleiweiss
14:50–15:00 Domain Adaptation and Instance Selection for Disease Syndrome Classification over Veterinary Clinical Notes Brian Hur, Timothy Baldwin, Karin Verspoor, Laura Hardefeldt and James Gilkerson
15:00–15:30 Discussion
15:30–15:45 Coffee Break
15:45–17:45 Session 4: Named Entity Recognition and Knowledge Representation
15:45–16:25 Invited Talk: Machine Reading for Precision Medicine, Hoifung Poon
16:25–16:35 Benchmark and Best Practices for Biomedical Knowledge Graph Embeddings David Chang, Ivana Balažević, Carl Allen, Daniel Chawla, Cynthia Brandt and Andrew Taylor
16:35–16:45 Extensive Error Analysis and a Learning-Based Evaluation of Medical Entity Recognition Systems to Approximate User Experience Isar Nejadgholi, Kathleen C. Fraser and Berry de Bruijn
16:45–16:55 A Data-driven Approach for Noise Reduction in Distantly Supervised Biomedical Relation Extraction Saadullah Amin, Katherine Ann Dunfield, Anna Vechkaeva and Guenter Neumann
16:55–17:05 Global Locality in Biomedical Relation and Event Extraction Elaheh ShafieiBavani, Antonio Jimeno Yepes, Xu Zhong and David Martinez Iraola
17:05–17:15 An Empirical Study of Multi-Task Learning on BERT for Biomedical Text Mining Yifan Peng, Qingyu Chen and Zhiyong Lu
17:15–17:45 Discussion
17:45–18:00 Closing remarks
Quantifying 60 Years of Gender Bias in Biomedical Research with Word Embeddings
Anthony Rios (1), Reenam Joshi (2), and Hejin Shin (3)
(1) Department of Information Systems and Cyber Security
(2) Department of Computer Science
(3) Library Systems
University of Texas at San Antonio
{Anthony.Rios, Reenam.Joshi, Hejin.Shin}@utsa.edu
Abstract tion areas to understand the effect of bias. For ex- ample, Kay et al.(2015) found that the Google im- Gender bias in biomedical research can have age search application is biased (Kay et al., 2015). an adverse impact on the health of real peo- Specifically, they found an unequal representation ple. For example, there is evidence that heart of gender stereotypes in image search results for disease-related funded research generally fo- cuses on men. Health disparities can form different occupations (e.g., all police images are between men and at-risk groups of women of men). Likewise, ad-targeting algorithms may (i.e., elderly and low-income) if there is not an include characteristics of sexism and racism (Datta equal number of heart disease-related studies et al., 2015; Sweeney, 2013). Sweeney(2013) for both genders. In this paper, we study tem- found that the names of black men and women poral bias in biomedical research articles by are likely to generate ads related to arrest records. measuring gender differences in word embed- In healthcare, much of the prior work has stud- dings. Specifically, we address multiple ques- ied the bias in the diagnosis process made by doc- tions, including, How has gender bias changed over time in biomedical research, and what tors (Young et al., 1996; Hartung and Widiger, health-related concepts are the most biased? 1998). There have also been studies about ethical Overall, we find that traditional gender stereo- considerations about the use of machine learning types have reduced over time. However, we in healthcare (Cohen et al., 2014). also find that the embeddings of many medical It is possible to analyze and measure the pres- conditions are as biased today as they were 60 ence of gender bias in text. Garg et al.(2018) an- years ago (e.g., concepts related to drug addic- tion and body dysmorphia). alyzed the presence of well-known gender stereo- types over the last 100 years. 
Hamberg(2008) 1 Introduction shown that gender blindness and stereotyped pre- conceptions are the key cause for gender bias in It is important to develop gender-specific best- medicine. Heath et al.(2019) studied the gender- practice guidelines for biomedical research (Hold- based linguistic differences in physician trainee croft, 2007). If research is heavily biased towards evaluations of medical faculty. Salles et al.(2019) one gender, then the biased guidance may con- measured the implicit and explicit gender bias tribute towards health disparities because the evi- among health care professionals and surgeons. dence drawn-on may be questionable (i.e., not well Feldman et al.(2019) quantified the exclusion of studied). For example, there is more research fund- females in clinical studies at scale with automated ing for the study of heart disease in men (Weisz data extraction. Recently, researchers have studied et al., 2004). Therefore, the at-risk populations methods to quantify gender bias using word embed- of older women in low economic classes are not dings trained on biomedical research articles (Ku- as well-investigated. Therefore, this opens up the rita et al., 2019). Kurita et al.(2019) shown that possibility for an increase in the health disparities the resulting embeddings capture some well-known between genders. gender stereotypes. Moreover, the embeddings ex- Among informatics researchers, there has been hibit the stereotypes at a lower rate than embed- increased interest in understanding, measuring, and dings trained on other corpora (e.g., Wikipedia). overcoming bias associated with machine learning However, to the best of our knowledge, there has methods. Researchers have studied many applica- not been an automated temporal study in the change
1 Proceedings of the BioNLP 2020 workshop, pages 1–13 Online, July 9, 2020 c 2020 Association for Computational Linguistics of gender bias. we analyze two groups of words: occupations In this paper, we look at the temporal change of and mental health disorders. For each group, gender bias in biomedical research. To study social we measure the overall change in bias over biases, we make use of word embeddings trained time. Moreover, we measure the individual on different decades of biomedical research arti- bias associated with each occupation and men- cles. The two main question driving this work are, tal health disorder. In what ways has bias changed over time, and Are there certain illnesses associated with a specific 2 Related Work gender? We leverage three computational tech- In this section, we discuss research related to the niques to answer these questions, the Word Em- three major themes of this paper: gender disparities bedding Association Test (WEAT) (Caliskan et al., in healthcare, biomedical word embeddings, and 2017), the Embedding Coherence Test (ECT) (Dev bias in natural language processing (NLP). and Phillips, 2019), and Relational Inner Product Association (RIPA) (Ethayarajh et al., 2019). To 2.1 Gender Disparities in Healthcare. the best of our knowledge, this will be the first tem- poral analysis of bias of word embeddings trained There is evidence of gender disparities in the health- on biomedical research articles. Moreover, to the care system, from the diagnosis of mental health best of our knowledge, this is the first analysis that disorders to differences in substance abuse. An measures the gender bias associated with individual important question is, Do similar biases appear biomedical words. in biomedical research? In this work, while we Our work is most similar to Garg et al.(2018). 
explore traditional gender stereotypes (e.g., Intelli- Garg et al.(2018) study the temporal change of gence vs Appearance), we also measure potential both gender and racial biases using word embed- bias in the occupations and mental health-related dings. Our work substantially differs in three ways. disorders associated with each gender. First, this paper is focused on biomedical litera- With regard to mental health, as an example, af- ture, not general text corpora. Second, we analyze fecting more than 17 million adults in the United gender stereotypes using three distinct methods to States (US) alone, major depression is one of the see if the bias is robust to various measurement most common mental health illnesses (Pratt and techniques. Third, we extend the study beyond Brody, 2014). Depression can cause people to lose gender stereotypes. Specifically, we look at bias in pleasure in daily life, complicate other medical sets of occupation words, as well as bias in men- conditions, and possibly lead to suicide (Pratt and tal health-related word sets. Moreover, we quan- Brody, 2014). Moreover, depression can occur to tify the bias of individual occupational and mental anyone, at any age, and to people of any race or health-related words. ethnic group. While treatment can help individuals In summary, the paper makes the following con- suffering from major depression, or mental illness tributions: in general, only about 35% of individuals suffering from severe depression seek treatment from mental We answer the question; How has the usage health professionals. It is common for people to • of gender stereotypes changed in the last 60 resist treatment because of the belief that depres- years of biomedical research? 
Specifically, sion is not serious, that they can treat themselves, we look at the change in well-known gender or that it would be seen as a personal weakness stereotypes (e.g., Math vs Arts, Career vs Fam- rather than a serious medical illness (Gulliver et al., ily, Intelligence vs Appearance, and occupa- 2010). Unfortunately, while depression can affect tions) in biomedical literature from 1960 to anyone, women are almost twice as likely as men 2020. to have had depression (Albert, 2015). Moreover, depression is generally higher among certain de- The second contribution answers the question; mographic groups, including, but not limited to, • What are the most gender-stereotyped words Hispanic, non-Hispanic black, low income, and for each decade during the last 60 years, and low education groups (Bailey et al., 2019). The have they changed over time? This contribu- focus of this paper is to understand the impact of tion is more focused than simply looking at these mental health disparities in word embeddings traditional gender stereotypes. Specifically, trained on biomedical corpora.
2.2 Biomedical Word Embeddings.

Word embeddings capture the distributional nature of words (i.e., words that appear in similar contexts will have similar vector encodings). Over the years, there have been multiple methods of producing word embeddings, including, but not limited to, latent semantic analysis (Deerwester et al., 1990), Word2Vec (Mikolov et al., 2013a,b), and GloVe (Pennington et al., 2014). Moreover, pre-trained word embeddings have been shown to be useful for a wide variety of downstream biomedical NLP tasks (Wang et al., 2018), such as text classification (Rios and Kavuluru, 2015), named entity recognition (Habibi et al., 2017), and relation extraction (He et al., 2019). In Chiu et al. (2016), the authors study a standard methodology to train good biomedical word embeddings. Essentially, they study the impact of the various Word2Vec-specific hyperparameters. In this paper, we use the strategies proposed in Chiu et al. (2016) to train optimal decade-specific biomedical word embeddings.

2.3 Bias and Natural Language Processing.

Unfortunately, because word embeddings are learned using naturally occurring data, implicit biases expressed in text will be transferred to the vectors. Bias (and fairness) is an important topic among natural language processing researchers. Bias has been found in word embeddings (Bolukbasi et al., 2016; Zhao et al., 2018, 2019), text classification models (Dixon et al., 2018; Park et al., 2018; Badjatiya et al., 2019; Rios, 2020), and machine translation systems (Font and Costa-jussà, 2019; Escudé Font, 2019). In general, each paper focuses on either testing whether bias exists in various models, or on removing bias from classification models for specific applications.

Much of the work on measuring (gender) bias using word embeddings neither studies the temporal aspect (i.e., how bias changes over time) nor focuses on biomedical research (Chaloner and Maldonado, 2019). For example, Caliskan et al. (2017) studied the bias in groups of words, focusing on traditional gender stereotypes. Kurita et al. (2019) expanded on Caliskan et al. (2017) to generalize to contextual word embeddings. Garg et al. (2018) developed a technique to study 100 years of gender and racial bias using word embeddings. They evaluated the bias over time using the US Census as a baseline to compare embedding bias to demographic and occupation shifts. There has also been work on measuring bias in sentence embeddings (May et al., 2019). Furthermore, there has been a significant amount of research that explores different ways to measure bias in word embeddings (Caliskan et al., 2017; Dev and Phillips, 2019; Ethayarajh et al., 2019). In this work, we apply many of these bias measurement techniques (Caliskan et al., 2017; Dev and Phillips, 2019; Ethayarajh et al., 2019) to the biomedical domain.

3 Dataset

We analyze PubMed-indexed titles and abstracts published between 1960 and 2020. The total number of articles per decade is shown in Table 1. The text is lower-cased and tokenized using the SimpleTokenizer available in GenSim (Khosrovian et al., 2008).

    Year         # Articles
    1960-1969     1,479,370
    1970-1979     2,305,257
    1980-1989     3,322,556
    1990-1999     4,109,739
    2000-2010     6,134,431
    2010-2020     8,686,620
    Total        26,037,973

Table 1: The total number of articles in each decade.

We find that the total number of papers has grown substantially each decade, from 1.4 million indexed articles in the 1960s to 8.6 million in the 2010s. Yet, the rate of growth stayed relatively stable each decade.

4 Method

We train the Skip-Gram model on PubMed-indexed titles and abstracts from 1960 to 2020. The hyperparameters of the Skip-Gram model are optimized independently for each decade. Next, given the best set of embeddings for each decade, we explore three different techniques to measure bias: the Word Embedding Association Test (WEAT), the Embedding Coherence Test (ECT), and the Relational Inner Product Association (RIPA). Each method allows us to quantify bias in different ways, such as comparing multiple sets of words (e.g., comparing the bias with respect to Career vs Family), comparing a single set of words (e.g., occupations), and measuring the bias of individual words (e.g., nurse). In this section, we briefly discuss the procedure we used to train the word embeddings,
as well as provide descriptions of each of the bias measurement techniques.

Attribute Words
  Male vs Female:
    X: male, man, boy, brother, he, him, his, son, father, uncle, grandfather
    Y: female, woman, girl, sister, she, her, hers, daughter, mother, aunt, grandmother

Target Words
  Career vs Family:
    A: executive, management, professional, corporation, salary, office, business, career
    B: home, parents, children, family, cousins, marriage, wedding, relatives
  Math vs Art:
    A: math, algebra, geometry, calculus, equations, computation, numbers, addition
    B: poetry, art, Shakespeare, dance, literature, novel, symphony, drama
  Science vs Art:
    A: science, technology, physics, chemistry, Einstein, NASA, experiment, astronomy
    B: poetry, art, Shakespeare, dance, literature, novel, symphony, drama
  Intelligence vs Appearance:
    A: precocious, resourceful, inquisitive, genius, inventive, astute, adaptable, reflective, discerning, intuitive, inquiring, judicious, analytical, apt, venerable, imaginative, shrewd, thoughtful, wise, smart, ingenious, clever, brilliant, logical, intelligent
    B: alluring, voluptuous, blushing, homely, plump, sensual, gorgeous, slim, bald, athletic, fashionable, stout, ugly, muscular, slender, feeble, handsome, healthy, attractive, fat, weak, thin, pretty, beautiful, strong
  Strong vs Weak:
    A: power, strong, confident, dominant, potent, command, assert, loud, bold, succeed, triumph, leader, shout, dynamic, winner
    B: weak, surrender, timid, vulnerable, weakness, wispy, withdraw, yield, failure, shy, follow, lose, fragile, afraid, loser

Table 2: Attribute and target words used by WEAT to measure the presence of traditional gender stereotypes in biomedical literature.

4.1 Word2Vec Model Training.

We train a Skip-Gram model using GenSim (Khosrovian et al., 2008). Following Chiu et al. (2016), we search over the following key hyper-parameters: negative sample size, sub-sampling, minimum count, learning rate, vector dimension, and context window size. See Chiu et al. (2016, Table 2) for more details.

To find the best model, as we search over the various hyper-parameters, we make use of the UMLS-Sim dataset (McInnes et al., 2009). UMLS-Sim consists of 566 medical concept pairs for measuring similarity. The degree of association between terms in UMLS-Sim was rated by four medical residents from the University of Minnesota medical school. All these clinical terms correspond to Unified Medical Language System (UMLS) concepts included in the Metathesaurus (Bodenreider, 2004). Evaluation is performed using Spearman's rho rank correlation between a vector of cosine similarities between each of the 566 pairs of words and their respective medical-resident ratings. Intuitively, the ranking of the pairs using cosine similarity, from the most similar pairs to the least, should be similar to the human (medical expert) annotations.

4.2 Word Embedding Association Test.

The implicit bias test measures unconscious prejudice (Greenwald et al., 1998). WEAT is a generalization of the implicit bias test for word embeddings, measuring the association between two sets of target concepts and two sets of attributes. We use the same target and attribute sets from Kurita et al. (2019). We list the targets and attributes in Table 2. The attribute sets of words are related to the groups that the embeddings are biased towards or against, e.g., Male vs Female. The words in the target categories (Career vs Family, Math vs Arts, Science vs Arts, Intelligence vs Appearance, and Strength vs Weakness) represent the specific types of biases. For example, using the attributes and targets, we want to know whether the learned embeddings that represent men are more related to career than the female-related words (i.e., test if female words are more related to family than male words).

Formally, let X and Y be equal-sized sets of target concept embeddings and let A and B be sets of attribute embeddings. To measure the bias, we follow Caliskan et al. (2017), which defines the following test statistic that is the difference between the sums over the respective target concepts,

    s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B),

where s(w, A, B) measures the association between a single target word w (e.g., career) with
each of the attribute (gendered) words as

    s(w, A, B) = \sum_{a \in A} \cos(\vec{w}, \vec{a}) - \sum_{b \in B} \cos(\vec{w}, \vec{b}),

such that \cos() represents the cosine similarity between two vectors, and \vec{w} \in R^d, \vec{a} \in R^d, and \vec{b} \in R^d represent the word embeddings for w, a, and b, respectively. Similarly, d is the dimension of each word embedding. Instead of using the test statistic directly, to measure bias we use the effect size. The effect size is a normalized measure of the separation of the two distributions, defined as

    \frac{\mu_{x \in X}\, s(x, A, B) - \mu_{y \in Y}\, s(y, A, B)}{\sigma_{w \in X \cup Y}\, s(w, A, B)},

where \mu_{x \in X} and \mu_{y \in Y} represent the mean association scores over the target words in X and Y, respectively. Likewise, \sigma_{w \in X \cup Y} is the standard deviation of the scores over all words in the union of X and Y. Intuitively, a positive score means that the words in X (e.g., male, man, boy) are more similar to the words in A (e.g., strong, power, dominant) than those in Y (e.g., female, woman, girl). Moreover, larger effects represent more biased embeddings.

As previously stated, the attribute and target words are from Kurita et al. (2019). It is important to note that the list is manually curated. Moreover, the bias measurement can change depending on the exact list of words. RIPA is more robust to slight changes to the attribute words than WEAT (Ethayarajh et al., 2019).

    Year        Sim     Pair Cnt
    1960-1969   .6586   101
    1970-1979   .6715   207
    1980-1989   .7033   277
    1990-1999   .7282   265
    2000-2010   .7078   272
    2010-2020   .6867   306

Table 3: Quality of the embeddings trained for each decade, measured using the UMLS-Sim dataset. Sim represents Spearman's rho ranking correlation. Pair Cnt is the number of UMLS-Sim word pairs that were present in that decade's embeddings.

4.3 Embedding Coherence Test.

We also explore a second method of measuring bias, the Embedding Coherence Test (ECT) (Dev and Phillips, 2019). Unlike WEAT, it compares the attribute words (e.g., Male vs Female) with a single target set (e.g., Career). Thus, we do not need two contrasting target sets (e.g., Career vs Family) to measure bias. We take advantage of this to measure bias associated with occupations and mental health-related disorders. Specifically, we use a total of 290 occupation words and 222 mental health-related words. The occupation words come from prior work measuring per-word bias (Dev and Phillips, 2019). To form a list of mental health words, we use the Diagnostic and Statistical Manual of Mental Disorders (DSM-5), a taxonomic and diagnostic tool published by the American Psychiatric Association (Association et al., 2013). For each mental health disorder in DSM-5, which are generally multi-word expressions, we split it into individual words. Next, we manually remove uninformative adjective and function words. For example, the disorder "Specific learning disorder, with impairment in mathematics" is tokenized into the following words: "learning", "disorder", "impairment", and "mathematics". A complete listing of the occupational and mental health words can be found in the appendix.

Formally, ECT first computes the mean vectors for the attribute word sets X and Y, defined as

    \vec{v}_X = \frac{1}{|X|} \sum_{x \in X} \vec{x},

where \vec{v}_X \in R^d and |X| represents the number of words in category X. \vec{v}_Y is calculated similarly. For both \vec{v}_X and \vec{v}_Y, ECT computes the (cosine) similarities with all vectors a \in A, i.e., the cosine similarity is calculated between each target word a \in A and \vec{v}_X, and stored in s_X \in R^{|A|}. The two resultant vectors of similarity scores, s_X (for X) and s_Y (for Y), are used to obtain the final ECT score. It is the Spearman rank correlation between the rank orders of s_X and s_Y: the higher the correlation, the lower the bias. Intuitively, if the correlation is high, then the ranking of target words based on similarity is correlated when calculated for both X and Y (i.e., male and female). Note that the cosine similarities from ECT can also be used to measure scores for individual words, but they are not as robust as RIPA (Ethayarajh et al., 2019).
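To make the WEAT computation in Section 4.2 concrete, the test statistic and effect size can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released code; the `emb` dictionary and its toy two-dimensional vectors are hypothetical stand-ins for a trained decade-specific Skip-Gram model.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def s_word(w, A, B, emb):
    """s(w, A, B): association of target word w with attribute sets A and B."""
    return (sum(cos(emb[w], emb[a]) for a in A)
            - sum(cos(emb[w], emb[b]) for b in B))

def weat_effect_size(X, Y, A, B, emb):
    """Normalized separation between the score distributions of X and Y."""
    sx = [s_word(x, A, B, emb) for x in X]
    sy = [s_word(y, A, B, emb) for y in Y]
    return float((np.mean(sx) - np.mean(sy)) / np.std(sx + sy))

# Toy vectors for illustration only; a positive effect size means the X
# (male) words sit closer to the A (career) words than the Y (female) words.
emb = {"male": np.array([1.0, 0.0]), "female": np.array([0.0, 1.0]),
       "career": np.array([1.0, 0.0]), "family": np.array([0.0, 1.0])}
bias = weat_effect_size(["male"], ["female"], ["career"], ["family"], emb)
```

On this toy example the effect size is maximal, since the male word aligns exactly with the career direction and the female word with the family direction.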
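Similarly, the ECT score from Section 4.3 can be sketched with plain NumPy. Spearman's rho is computed here by ranking (ignoring ties); the three-dimensional toy vectors are illustrative assumptions, not the paper's data.

```python
import numpy as np

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def ranks(v):
    """Rank position of each value (0 = smallest); ties are not handled."""
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(len(v))
    return r

def spearman(a, b):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def ect(X, Y, A, emb):
    """Correlation of target similarities to the mean X and Y vectors.

    Higher correlation = lower bias (1.0 means no measurable difference).
    """
    v_x = np.mean([emb[x] for x in X], axis=0)  # mean "male" vector
    v_y = np.mean([emb[y] for y in Y], axis=0)  # mean "female" vector
    s_x = [cos(emb[a], v_x) for a in A]
    s_y = [cos(emb[a], v_y) for a in A]
    return spearman(s_x, s_y)

# Toy example: every target word relates to both gender vectors in the same
# order, so the rank correlation is 1.0 (no measurable bias).
emb = {"man": np.array([1.0, 0.0, 0.0]), "woman": np.array([0.0, 1.0, 0.0]),
       "doctor": np.array([0.9, 0.9, 0.1]), "nurse": np.array([0.5, 0.5, 1.0]),
       "teacher": np.array([0.2, 0.2, 1.0])}
score = ect(["man"], ["woman"], ["doctor", "nurse", "teacher"], emb)
```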
4.4 Relational Inner Product Association.

While ECT only requires a single target set, both WEAT and ECT calculate the bias between sets of words. However, neither approach calculates a robust bias score for individual words. To study the most gender-biased words over time, we make use of RIPA (Ethayarajh et al., 2019). Intuitively, RIPA uses a single vector to represent gender; each word is then scored by taking the dot product between the gender embedding and the word's own embedding. The sign of the score determines whether the embedding is more male- or female-related.

The major aspect of RIPA is creating the gender embedding. Formally, given S, a non-empty set of ordered word pairs (x, y) (e.g., ('man', 'woman'), ('male', 'female')) that defines the gender association, we take the first principal component of all the difference vectors \{\vec{x} - \vec{y} \mid (x, y) \in S\}, which we call the relation vector \vec{g} \in R^d; this is a one-dimensional bias subspace. Then, for some word vector \vec{w} \in R^d, the dot product is taken with \vec{g} to measure bias.

Figure 1: Each subfigure plots the bias measures using WEAT for one of five gender stereotypes: (a) Career vs Family, (b) Math vs Art, (c) Science vs Art, (d) Intelligence vs Appearance, and (e) Strong vs Weak. A bias score of zero represents no bias, i.e., no measurable difference between the two target categories for each gender. The shaded area of each subplot represents the bootstrap-estimated 95% confidence interval.

5 Results

In this section, we present the results of our study in four parts. First, we report the embedding quality using UMLS-Sim. Second, we study the temporal bias of traditional gender stereotypes, such as Career vs Family and Strong vs Weak. Ideally, we want to understand how, and which, stereotypes have changed over time. To understand the biased stereotypes, we make use of the WEAT method. Third, we look at whether occupational and mental health-related words are biased, and how the bias has changed over time. For this result, we only use a single set of target words; thus, we make use of ECT. Fourth, we use RIPA to find the most biased words for each gender in each decade.

5.1 Embedding Quality.

In Table 3, we report the quality of each decade's embeddings based on the UMLS-Sim dataset. Overall, we find that the quality consistently improves until the 1990s; however, we see drops in the 2000s and 2010s. We hypothesize that the reason for the decrease in embedding quality is the growth of research articles indexed on PubMed. Intuitively, word embeddings are only able to capture a single sense of a word. However, given the breadth of articles PubMed indexes, from machine learning (e.g., BioNLP) to biomaterials, multiple word meanings are being stored in a single vector. Thus, the overall quality begins to drop.
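The quality measure behind Table 3 is the Spearman correlation from Section 4.1. A minimal sketch follows; the word pairs, ratings, and mini-vocabulary are made-up stand-ins for the 566 UMLS-Sim pairs and the decade embeddings.

```python
import numpy as np

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def ranks(v):
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(len(v))
    return r

def embedding_quality(pairs, emb):
    """Spearman's rho between pairwise cosine similarities and human ratings,
    restricted to pairs whose words are both in the vocabulary."""
    sims, human = [], []
    for w1, w2, rating in pairs:
        if w1 in emb and w2 in emb:  # decade vocabularies differ (see Pair Cnt)
            sims.append(cos(emb[w1], emb[w2]))
            human.append(rating)
    rho = float(np.corrcoef(ranks(sims), ranks(human))[0, 1])
    return rho, len(sims)

# Hypothetical term pairs with human similarity ratings, for illustration only.
emb = {"heart": np.array([1.0, 0.0]), "cardiac": np.array([0.9, 0.1]),
       "renal": np.array([0.1, 1.0])}
pairs = [("heart", "cardiac", 4.0), ("heart", "renal", 1.5),
         ("heart", "xylophone", 3.0)]  # last pair is skipped: out of vocabulary
rho, n_pairs = embedding_quality(pairs, emb)
```

The returned pair count corresponds to the Pair Cnt column of Table 3: pairs with an out-of-vocabulary word are simply dropped for that decade.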
5.2 Traditional Gender Stereotypes.

In Figure 1, we plot the bias scores reported using WEAT. Remember, a large positive score means that the male words are more similar to the targets in A (e.g., career) than the female words. A value of zero indicates no measurable bias. Overall, we find that the results from the WEAT test vary depending on the stereotype. For Career vs Family, in Figure 1a, we find a steady linear decrease in bias each decade, with the exception of the 1990s. We also find similar linear decreases in bias for both Science vs Art and Strong vs Weak (Figures 1c and 1e). In Figure 1b, for Math vs Art, however, the bias stays relatively static, i.e., it does not dramatically change over time. Moreover, the WEAT score for Math vs Art is negative, meaning that the female words are more similar to math than the male words. Likewise, for Intelligence vs Appearance (Figure 1d), we see relatively little bias from 1960 to 1989; however, in the 1990s and 2000s, there is a substantial jump in the bias score.

Our evaluation supports prior work evaluating bias in biomedical word embeddings (e.g., Strong vs Weak is the most biased stereotype in biomedical literature) (Chaloner and Maldonado, 2019, Table 2). However, we also find differences when measuring bias over time. For example, we find that from 2010 to 2019 there is not a lot of evidence for the Career vs Family stereotype in biomedical corpora, matching the results from Chaloner and Maldonado (2019, Table 2). Yet, this is only a recent phenomenon. The embeddings trained on articles published from 1990 to 1999 exhibit a Career vs Family bias score greater than 1.5. Overall, comparing to Chaloner and Maldonado (2019, Table 2), this means that the bias in recently published biomedical literature may not be as strong as what is found in general text corpora. But, if we exclude the most recent decade's embeddings, the bias in biomedical literature becomes much stronger. Future work should explore comparing the temporal bias in general text corpora to what is found in biomedical literature.

Figure 2: ECT bias estimates for both the set of occupation and mental disorder words. The shaded area of each subplot represents the bootstrap-estimated 95% confidence interval.

5.3 Occupational and Mental Health Bias.

In Figure 2, we report the gender bias results from ECT on two categories: occupations (e.g., doctor, nurse, teacher) and mental health disorders (e.g., depression, alcoholism, PTSD). Again, unlike WEAT, ECT calculates bias scores on a single target set of words. Therefore, we do not need two contrasting target word sets (e.g., Math vs Art); instead, we can focus on bias for a single set (e.g., Math). Also, the larger the score, the lower the bias; a score of one would represent no difference between male and female words for that specific target set. Interestingly, we find that the ECT scores follow a similar pattern as found in Table 3: the better the embedding quality, the lower the bias.

Comparing Figures 2a and 2b, we find that the word embeddings for both occupations and mental disorders have relatively little bias in the 1990s. Furthermore, while there was small variation, mental disorders experienced little change in bias decade-by-decade. Yet, occupation-related words had a substantial amount of bias in the 1960s and 1970s. Moreover, we find that the bias related to occupations experienced more change than mental disorders, starting at 0.83 in the 1960s and increasing by more than ten points to 0.94 in the 1990s, whereas mental disorder-related bias scores only ranged from 0.90 to 0.94.

5.4 Biased Words.

In Table 4, we analyze the bias of individual occupational and mental health-related words. We found a substantial change in the bias of occupation-related words. We found little change in the bias of mental health-related words since the 1960s.
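The word-level scores in Table 4 come from the RIPA procedure of Section 4.4. A minimal sketch with toy vectors follows; note that implementations differ on whether the difference vectors are mean-centered before taking the principal component (here they are not), and the SVD sign is fixed by convention.

```python
import numpy as np

def gender_direction(pairs, emb):
    """Relation vector g: first principal direction of the pair differences,
    computed as the first right-singular vector of the difference matrix."""
    D = np.stack([emb[x] - emb[y] for x, y in pairs])
    _, _, vt = np.linalg.svd(D, full_matrices=False)
    g = vt[0]
    # The SVD sign is arbitrary; orient g so the first "male" word scores positive.
    if np.dot(emb[pairs[0][0]], g) < 0:
        g = -g
    return g

def ripa(word, g, emb):
    """Signed per-word bias: positive = male-leaning, negative = female-leaning."""
    return float(np.dot(emb[word], g))

# Toy vectors where the first axis encodes gender, for illustration only.
emb = {"man": np.array([1.0, 0.1]), "woman": np.array([-1.0, 0.1]),
       "male": np.array([1.0, -0.1]), "female": np.array([-1.0, -0.1]),
       "mechanic": np.array([0.8, 0.3]), "teacher": np.array([-0.7, 0.3])}
g = gender_direction([("man", "woman"), ("male", "female")], emb)
```

Ranking a vocabulary by the absolute value of these scores, separately for each sign, would yield per-decade lists like those in Table 4.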
Occupations, Male
  1970-1979: promoter, collector, investigator, principal, baker, researcher, character, mechanic, analyst, conductor
  1980-1989: conductor, chef, biologist, collector, dad, singer, chemist, butler, mechanic, promoter
  1990-1999: chef, baker, astronaut, swimmer, prisoner, mechanic, character, worker, soldier, analyst
  2000-2010: dentist, counselor, librarian, pharmacist, teenager, bishop, acquaintance, cardiologist, promoter, attorney
  2010-2020: mediator, promoter, dentist, principal, collector, cop, conductor, substitute, coach, employee

Occupations, Female
  1970-1979: teacher, professor, counselor, physician, pediatrician, consultant, doctor, student, lawyer, pathologist
  1980-1989: housewife, teenager, bishop, lawyer, pediatrician, athlete, physician, pathologist, educator, carpenter
  1990-1999: neurosurgeon, pediatrician, educator, teenager, counselor, neurologist, consultant, dentist, athlete, doctor
  2000-2010: swimmer, baker, butcher, medic, barber, physicist, soldier, baron, director, singer
  2010-2020: priest, fisherman, teenager, chef, writer, nanny, historian, president, inventor, housewife

Mental Disorders, Male
  1970-1979: caffeine, restrictive, attachment, separation, circadian, coordination, benzodiazepine, dependence, selective, conversion
  1980-1989: cannabis, hypnotic, caffeine, coordination, hallucinogen, dependence, attachment, mathematics, restrictive, pdd
  1990-1999: separation, restrictive, coordination, dyskinesia, conversion, mathematics, attachment, residual, parasitosis, developmental
  2000-2010: lacunar, bulimia, erectile, gambling, bereavement, binge, nervosa, mood, depressive, polysubstance
  2010-2020: lacunar, circadian, nicotine, gambling, phencyclidine, ocpd, cocaine, insomnia, sleep, caffeine

Mental Disorders, Female
  1970-1979: dysmorphic, psychogenic, anorexia, adolescent, nervosa, mutism, infancy, munchausen, factitious, disorder
  1980-1989: factitious, dysmorphic, nervosa, mutism, bulimia, tourette, infancy, episode, anorexia, munchausen
  1990-1999: binge, nervosa, bulimia, opioid, hypersomnia, narcolepsy, anorexia, panic, korsakoff, factitious
  2000-2010: dissociative, coordination, separation, parasitosis, terror, hysteria, conversion, malingering, tic, munchausen
  2010-2020: munchausen, mutism, factitious, dysmorphic, hysteria, cotard, claustrophobia, ekbom, diogenes, encopresis
Table 4: The top ten words with the largest RIPA scores (i.e., the most biased) for each decade. The RIPA scores are reported for both occupations and mental health disorders. While all the listed words are biased, each list is ranked from the most biased word to the least.

Yet, while we found little change in mental health bias overall, are there at least a few disorders that changed over time? Moreover, we found a slight bias in mental health terms; therefore, what are the biased terms in each group? We look at the most gender-biased occupational and mental health-related terms for each decade in Table 4. Because of space limitations, we only display the gendered words from the 1970s to the 2010s. The words from the 1960s can be found in the appendix. The word-level scores were generated using RIPA.

First, for occupations, the words vary between male and female. For example, in the 1970s, male-related words include "mechanic", "principal", and "investigator". The female-related words include "teacher", "counselor", and "pediatrician". Interestingly, the jobs associated with men, such as "principal" and "researcher", are positions with power over the jobs associated with women. For example, "principals" (male) have power over "teachers" (female) and "researchers" (male) have power over "students" (female). We also find other well-known occupations that appear to be gender-related. For instance, "butler" in the 1980s is associated with male, while "nanny" is related to female in the 2010s.

With regard to mental health, we find that disorders associated with well-known gender disparities appear to be biased using RIPA (Organization, 2013). For example, through the last 60 years, words associated with addictions are male-related, e.g., "caffeine", "cannabis", "nicotine", and "gambling". Similarly, disorders related to appearance are more female-related, e.g., "dysmorphic" [2] and "anorexia". We also find that disorders related to emotions are more female-related, such as "munchausen" [3], "hysteria" [4], and "terror". Interestingly, we find that the word "hysteria" is heavily biased in the 2010s. Even though the diagnosis of female hysteria substantially fell in the 1900s (Micale, 1993), it still seems to be a biased term. We want to note that this could simply be caused by research studying bias in mental health diagnoses for women; however, the underlying cause of why the term is biased in the 2010s is left for future work.

6 Discussion

In this section, we discuss the impact of the results on two stakeholders of this research: BioNLP researchers and general biomedical researchers. Furthermore, we discuss the limitations of focusing on binary gender (Male vs Female).

[2] Dysmorphia is a mental health disorder in which a person cannot stop thinking about one or more perceived defects or flaws.
[3] Munchausen is a mental disorder in which a person repeatedly and deliberately acts as if he or she has a physical or mental illness.
[4] Hysteria is a (biased) catch-all for symptoms including, but not limited to, nervousness, hallucinations, and emotional outbursts.
6.1 Impact on BioNLP Researchers.

The results in this paper are important for BioNLP research in two ways. First, we have produced decade-specific word embeddings. [5] Therefore, BioNLP researchers can use the embeddings to study other historical phenomena in biomedical research articles. Second, the analysis of historical bias in biomedical research in this paper can be applied to other domains, beyond occupations and mental disorders.

6.2 Impact on Biomedical Researchers.

With regard to general biomedical researchers (e.g., medical researchers and biologists), this work can provide a way to measure, in an automated fashion, which demographics current research is leaning towards. As discussed in Holdcroft (2007), if research is heavily focused on a single gender, then health disparities can increase. Treatments should be explored equally for all at-risk patients. Furthermore, with the use of contextual word embeddings (Scheuerman et al., 2019), implicit bias measurement techniques can be used as part of the writing process to avoid gendered language when it is not necessary (e.g., using singular they vs he/she).

6.3 A Note About Gender.

Similar to prior work measuring gender bias (Chaloner and Maldonado, 2019), we focus on binary gender. However, it is important to note that the results for binary gender do not necessarily generalize to other genders, including, but not limited to, binary trans people, non-binary people, and gender non-conforming people (Scheuerman et al., 2019). Therefore, we want to explicitly note that our research does not necessarily generalize beyond binary gender. In future work, we recommend that similar studies be performed for other genders, beyond simply studying Male vs Female.

How can this study be expanded beyond binary gender? The three bias measurement techniques studied in this paper (i.e., WEAT, ECT, and RIPA) require sets of words representing a single gender (e.g., boy, men, male). Unfortunately, there is not a large number of words to represent every gender of interest. A promising area of research is to explore bias in contextual word embeddings. With the use of contextual word embeddings (Kurita et al., 2019), we can measure the bias of individual words across many contexts. Thus, we can possibly overcome the problem of a limited number of words per gender.

7 Conclusion

In this paper, we studied the historical bias present in word embeddings from 1960 to 2020. In summary, we found that while some biases have shown a consistent decrease over time (e.g., Strong vs Weak), others have stayed relatively static or, worse, increased (e.g., Intelligence vs Appearance). Moreover, we found that the gender bias towards occupations has substantially changed over time, showing that in the past, there was more gender bias associated with certain jobs.

There are two major avenues for future work. First, this work quantified various aspects of gender bias over time. However, we do not know why the bias is present in the word embeddings. For example, is the word "hysteria" biased in 2010 because researchers are associating it with women implicitly, or is it that researchers are studying the historical usage of the diagnosis to ensure the diagnosis is not made because of implicit bias in the future? Thus, our future work will focus on causal studies of bias in biomedical literature. Second, we simply trained independent Skip-Gram word embeddings for each decade. However, recent work has shown that dynamic embeddings, rather than static (decade-specific) ones, perform better with regard to analyzing public perception over time (Gillani and Levy, 2019). Future work will focus on developing new techniques to study bias temporally. Moreover, many techniques may depend on the magnitude of the bias; therefore, we plan to analyze the circumstances in which one embedding approach (e.g., Skip-Gram) may measure bias better than another (e.g., dynamic embeddings).

Acknowledgements

We would like to thank the anonymous reviewers for their invaluable help improving this manuscript. This material is based upon work supported by the National Science Foundation under Grant No. 1947697.

[5] https://github.com/AnthonyMRios/Gender-Bias-PubMed

References

Paul R Albert. 2015. Why is depression more prevalent in women? Journal of Psychiatry & Neuroscience: JPN, 40(4):219.
American Psychiatric Association et al. 2013. Diagnostic and statistical manual of mental disorders (DSM-5). American Psychiatric Pub.

Pinkesh Badjatiya, Manish Gupta, and Vasudeva Varma. 2019. Stereotypical bias removal for hate speech detection task using knowledge-based generalizations. In The World Wide Web Conference, pages 49–59.

Rahn Kennedy Bailey, Josephine Mokonogho, and Alok Kumar. 2019. Racial and ethnic differences in depression: current perspectives. Neuropsychiatric Disease and Treatment, 15:603.

Olivier Bodenreider. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270.

Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pages 4349–4357.

Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Kaytlin Chaloner and Alfredo Maldonado. 2019. Measuring gender bias in word embeddings across domains and discovering new gender bias word categories. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 25–32.

Billy Chiu, Gamal Crichton, Anna Korhonen, and Sampo Pyysalo. 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, pages 166–174.

I Glenn Cohen, Ruben Amarasingham, Anand Shah, Bin Xie, and Bernard Lo. 2014. The legal and ethical concerns that arise from using complex predictive analytics in health care. Health Affairs, 33(7):1139–1147.

Amit Datta, Michael Carl Tschantz, and Anupam Datta. 2015. Automated experiments on ad privacy settings. Proceedings on Privacy Enhancing Technologies, 2015(1):92–112.

Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Sunipa Dev and Jeff Phillips. 2019. Attenuating bias in word vectors. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 879–887.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Conference on AI, Ethics, and Society.

Joel Escudé Font. 2019. Determining bias in machine translation with deep learning techniques. Master's thesis, Universitat Politècnica de Catalunya.

Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. 2019. Understanding undesirable word embedding associations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1696–1705.

Sergey Feldman, Waleed Ammar, Kyle Lo, Elly Trepman, Madeleine van Zuylen, and Oren Etzioni. 2019. Quantifying sex bias in clinical studies at scale with automated data extraction. JAMA Network Open, 2(7):e196700–e196700.

Joel Escudé Font and Marta R Costa-jussà. 2019. Equalizing gender biases in neural machine translation with word embeddings techniques. arXiv preprint arXiv:1901.03116.

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644.

Nabeel Gillani and Roger Levy. 2019. Simple dynamic word embeddings for mapping perceptions in the public sphere. In Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science, pages 94–99.

Anthony G Greenwald, Debbie E McGhee, and Jordan LK Schwartz. 1998. Measuring individual differences in implicit cognition: the implicit association test. Journal of Personality and Social Psychology, 74(6):1464.

Amelia Gulliver, Kathleen M Griffiths, and Helen Christensen. 2010. Perceived barriers and facilitators to mental health help-seeking in young people: a systematic review. BMC Psychiatry, 10(1):113.

Maryam Habibi, Leon Weber, Mariana Neves, David Luis Wiegandt, and Ulf Leser. 2017. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics, 33(14):i37–i48.

Katarina Hamberg. 2008. Gender bias in medicine. Women's Health, 4(3):237–243.

Cynthia M Hartung and Thomas A Widiger. 1998. Gender differences in the diagnosis of mental disorders: Conclusions and controversies of the DSM-IV. Psychological Bulletin, 123(3):260.

Bin He, Yi Guan, and Rui Dai. 2019. Classifying medical relations in clinical text via convolutional neural networks. Artificial Intelligence in Medicine, 93:43–49.

Janae K Heath, Gary E Weissman, Caitlin B Clancy, Haochang Shou, John T Farrar, and C Jessica Dine. 2019. Assessment of gender-based linguistic differences in physician trainee evaluations of medical faculty using automated text mining. JAMA Network Open, 2(5):e193520–e193520.

Anita Holdcroft. 2007. Gender bias in research: how does it affect evidence based medicine? Journal of the Royal Society of Medicine, 100(1):2.

Matthew Kay, Cynthia Matuszek, and Sean A Munson. 2015. Unequal representation and gender stereotypes in image search results for occupations. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3819–3828. ACM.

Keyvan Khosrovian, Dietmar Pfahl, and Vahid Garousi. 2008. GENSIM 2.0: a customizable process simulation model for software process evaluation. In International Conference on Software Process, pages 294–306. Springer.

Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. 2019.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

LA Pratt and DJ Brody. 2014. Depression and obesity in the US adult household population, 2005-2010. NCHS Data Brief, (167):1–8.

Anthony Rios. 2020. FuzzE: Fuzzy fairness evaluation of offensive language classifiers on African-American English. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34.

Anthony Rios and Ramakanth Kavuluru. 2015. Convolutional neural networks for biomedical text classification: application in indexing biomedical articles. In Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics, pages 258–267.

Arghavan Salles, Michael Awad, Laurel Goldin, Kelsey Krus, Jin Vivian Lee, Maria T Schwabe, and Calvin K Lai. 2019. Estimating implicit and ex-
Quantifying social bi- plicit gender bias among health care professionals ases in contextual word representations. In 1st ACL and surgeons. JAMA network open, 2(7):e196545– Workshop on Gender Bias for Natural Language e196545. Processing. Morgan Klaus Scheuerman, Jacob M. Paul, and Jed R. Chandler May, Alex Wang, Shikha Bordia, Samuel Brubaker. 2019. How computers see gender: An Bowman, and Rachel Rudinger. 2019. On measur- evaluation of gender classification in commercial fa- ing social biases in sentence encoders. In Proceed- cial analysis services. Proc. ACM Hum.-Comput. In- ings of the 2019 Conference of the North American teract., 3(CSCW). Chapter of the Association for Computational Lin- guistics: Human Language Technologies, Volume 1 Latanya Sweeney. 2013. Discrimination in online ad (Long and Short Papers), pages 622–628. delivery. Queue, 11(3):10. Bridget T McInnes, Ted Pedersen, and Serguei VS Yanshan Wang, Sijia Liu, Naveed Afzal, Majid Pakhomov. 2009. Umls-interface and umls- Rastegar-Mojarad, Liwei Wang, Feichen Shen, Paul similarity: open source software for measuring paths Kingsbury, and Hongfang Liu. 2018. A comparison and semantic similarity. In AMIA annual sympo- of word embeddings for the biomedical natural lan- sium proceedings, volume 2009, page 431. Ameri- guage processing. Journal of biomedical informat- can Medical Informatics Association. ics, 87:12–20. Mark S Micale. 1993. On the” disappearance” of hys- Daniel Weisz, Michael K Gusmano, and Victor G Rod- teria: A study in the clinical deconstruction of a di- win. 2004. Gender and the treatment of heart dis- agnosis. Isis, 84(3):496–526. ease in older persons in the united states, france, and england: a comparative, population-based view of Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey a clinical phenomenon. Gender medicine, 1(1):29– Dean. 2013a. Efficient estimation of word represen- 40. tations in vector space. ICLR Workshop. 
Terry Young, Rebecca Hutton, Laurel Finn, Safwan Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- Badr, and Mari Palta. 1996. The gender bias in rado, and Jeff Dean. 2013b. Distributed representa- sleep apnea diagnosis: are women missed because tions of words and phrases and their compositional- they have different symptoms? Archives of internal ity. In Advances in neural information processing medicine, 156(21):2445–2451. systems, pages 3111–3119. Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cot- World Health Organization. 2013. Gender Disparities terell, Vicente Ordonez, and Kai-Wei Chang. 2019. in Mental Health. World Health Organization. Gender bias in contextualized word embeddings. arXiv preprint arXiv:1904.03310 Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Re- . ducing gender bias in abusive language detection. Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai- In Proceedings of the 2018 Conference on Empiri- Wei Chang. 2018. Learning gender-neutral word cal Methods in Natural Language Processing, pages embeddings. In Proc. of EMNLP, pages 4847–4853. 2799–2804.
A 1960s Most Biased Words

Male:
• physician
• doctor
• president
• dentist
• psychiatrist
• surgeon
• student
• nurse
• worker
• professor

Female:
• substitute
• principal
• editor
• baker
• character
• author
• pharmacist
• scientist
• therapist
• teacher

B Mental Health-Related Terms

[abuse, acute, adaptation, adjustment, adolescent, adult, affective, agoraphobia, alcohol, alcoholic, alzheimer, amnesia, amnestic, amphetamine, anorexia, anosognosia, anterograde, antisocial, anxiety, anxiolytic, asperger, atelophobia, attachment, attention, atypical, autism, autophagia, avoidant, avoidant, restrictive, barbiturate, behavior, benzodiazepine, bereavement, bibliomania, binge, bipolar, body, borderline, brief, bulimia, caffeine, cannabis, capgras, catalepsy, catatonia, catatonic, childhood, circadian, claustrophobia, cocaine, cognitive, communication, compulsive, condition, conduct, conversion, coordination, cotard, cyclothymia, daydreaming, defiant, deficit, delirium, delusion, delusional, delusions, dependence, depersonalization, depression, depressive, derealization, dermatillomania, desynchronosis, deux, developmental, diogenes, disease, disorder, dissociative, dyscalculia, dyskinesia, dyslexia, dysmorphic, eating, ejaculation, ekbom, encephalitis, encopresis, enuresis, epilepsy, episode, erectile, erotomania, exhibitionism, factitious, fantastica, fetishism, fregoli, fugue, functioning, gambling, ganser, grandiose, hallucinogen, hallucinosis, histrionic, huntington, hyperactivity, hypersomnia, hypnotic, hypochondriasis, hypomanic, hysteria, ideation, identity, impostor, induced, infancy, insomnia, intellectual, intermittent, intoxication, kleptomania, korsakoff, lacunar, lethargica, love, major, maladaptive, malingering, mania, mathematics, megalomania, melancholia, misophonia, mood, munchausen, mutism, narcissistic, narcolepsy, nervosa, neurocysticercosis, neurodevelopmental, nicotine, nightmare, nos, obsessive, obsessive–compulsive, ocd, ocpd, oneirophrenia, opioid, oppositional, orthorexia, pain, panic, paralysis, paranoid, parasitosis, parasomnia, parkinson, partialism, pathological, pdd, perception, persecutory, personality, pervasive, phencyclidine, phobia, phobic, phonological, physical, pica, polysubstance, posttraumatic, pseudologia, psychogenic, psychosis, psychotic, ptsd, pyromania, reactive, residual, retrograde, rumination, schizoaffective, schizoid, schizophrenia, schizophreniform, schizotypal, seasonal, sedative, selective, separation, sexual, sleep, sleepwalking, social, sociopath, somatic, somatization, somatoform, stereotypic, stockholm, stress, stuttering, substance, suicidal, suicide, tardive, terror, tic, tourette, transient, transvestic, tremens, trichotillomania, truman, withdrawal, wonderland]

C Occupations

[detective, ambassador, coach, officer, epidemiologist, rabbi, ballplayer, secretary, actress, manager, scientist, cardiologist, actor, industrialist, welder, biologist, undersecretary, captain, economist, politician, baron, pollster, environmentalist, photographer, mediator, character, housewife, jeweler, physicist, hitman, geologist, painter, employee, stockbroker, footballer, tycoon, dad, patrolman, chancellor, advocate, bureaucrat, strategist, pathologist, psychologist, campaigner, magistrate, judge, illustrator, surgeon, nurse, missionary, stylist, solicitor, scholar, naturalist, artist, mathematician, businesswoman, investigator, curator, soloist, servant, broadcaster, fisherman, landlord, housekeeper, crooner, archaeologist, teenager, councilman, attorney, choreographer, principal, parishioner, therapist, administrator, skipper, aide, chef, gangster, astronomer, educator, lawyer, midfielder, evangelist, novelist, senator, collector, goalkeeper, singer, acquaintance, preacher, trumpeter, colonel, trooper, understudy, paralegal, philosopher, councilor, violinist, priest, cellist, hooker, jurist, commentator, gardener, journalist, warrior, cameraman, wrestler, hairdresser, lawmaker, psychiatrist, clerk, writer, handyman, broker, boss, lieutenant, neurosurgeon, protagonist, sculptor, nanny, teacher, homemaker, cop, planner, laborer, programmer, philanthropist, waiter, barrister, trader, swimmer, adventurer, monk, bookkeeper, radiologist, columnist, banker, neurologist, barber, policeman, assassin, marshal, waitress, artiste, playwright, electrician, student, deputy, researcher, caretaker, ranger, lyricist, entrepreneur, sailor, dancer, composer, president, dean, legislator, salesman, observer, pundit, maid, archbishop, firefighter, vocalist, tutor, proprietor, restaurateur, editor, saint, butler, prosecutor, sergeant, realtor, commissioner, narrator, conductor, historian, citizen, worker, pastor, serviceman, filmmaker, sportswriter, poet, dentist, statesman, minister, dermatologist, technician, nun, instructor, alderman, analyst, chaplain, inventor, lifeguard, bodyguard, bartender, surveyor, consultant, athlete, cartoonist, negotiator, promoter, socialite, architect, mechanic, entertainer, counselor, janitor, firebrand, sportsman, anthropologist, performer, crusader, envoy, trucker, publicist, commander, professor, critic, comedian, receptionist, financier, valedictorian, inspector, steward, confesses, bishop, shopkeeper, ballerina, diplomat, parliamentarian, author, sociologist, photojournalist, guitarist, butcher, mobster, drummer, astronaut, protester, custodian, maestro, pianist, pharmacist, chemist, pediatrician, lecturer, foreman, cleric, musician, cabbie, fireman, farmer, headmaster, soldier, carpenter, substitute, director, cinematographer, warden, marksman, congressman, prisoner, librarian, magician, screenwriter, comic, medic, provost, saxophonist, plumber, correspondent, organist, baker, doctor, constable, treasurer, superintendent, boxer, physician, infielder, businessman, protege]
Sequence-to-Set Semantic Tagging for Complex Query Reformulation and Automated Text Categorization in Biomedical IR using Self-Attention
Manirupa Das†, Juanxi Li†, Eric Fosler-Lussier†, Simon Lin‡, Steve Rust‡, Yungui Huang‡ & Rajiv Ramnath†
The Ohio State University† & Nationwide Children's Hospital‡
{das.65, li.8767, fosler.1, ramnath.6}@osu.edu, {Simon.Lin, Steve.Rust, Yungui.Huang}@nationwidechildrens.org
Proceedings of the BioNLP 2020 workshop, pages 14–27, Online, July 9, 2020. © 2020 Association for Computational Linguistics

Abstract

Novel contexts, comprising a set of terms referring to one or more concepts, may often arise in complex querying scenarios such as in evidence-based medicine (EBM) involving biomedical literature. These may not explicitly refer to entities or canonical concept forms occurring in a fact-based knowledge source, e.g. the UMLS ontology. Moreover, hidden associations between related concepts that are meaningful in the current context may not exist within a single document, but across documents in the collection. Predicting semantic concept tags of documents can therefore serve to associate documents related in unseen contexts, or to categorize them, in information filtering or retrieval scenarios. Thus, inspired by the success of sequence-to-sequence neural models, we develop a novel sequence-to-set framework with attention for learning document representations in a unique unsupervised setting, using no human-annotated document labels or external knowledge resources and only corpus-derived term statistics to drive the training. This can effect term transfer within a corpus for semantically tagging a large collection of documents. Our sequence-to-set modeling approach to predict semantic tags gives, to the best of our knowledge, state-of-the-art results both for an unsupervised query expansion (QE) task on the TREC CDS 2016 challenge dataset, when evaluated on an Okapi BM25–based document retrieval system, and over the MLTM system baseline (Soleimani and Miller, 2016) for both supervised and semi-supervised multi-label prediction tasks on the del.icio.us and Ohsumed datasets. We make our code and data publicly available.¹

¹ https://github.com/mcoqzeug/seq2set-semantic-tagging

1 Introduction

Recent times have seen an upsurge in efforts towards personalized medicine, where clinicians tailor their medical decisions to the individual patient based on the patient's genetic information, other molecular analysis, and the patient's preference. This often requires them to combine clinical experience with evidence from scientific research, such as that available from biomedical literature, in a process known as evidence-based medicine (EBM). Finding the most relevant recent research, however, is challenging not only due to the volume and the pace at which new research is being published, but also due to the complex nature of the information need, arising for example out of a clinical note which may be used as a query. This calls for better automated methods for natural language understanding (NLU), e.g. to derive a set of key terms or related concepts helpful in appropriately transforming a complex query by reformulation, so as to be able to handle and possibly resolve medical jargon, lesser-used acronyms, misspellings, multiple subject areas and often multiple references to the same entity or concept, and retrieve the most related, yet most comprehensive, set of useful results.

At the same time, tremendous strides have been made by recent neural machine learning models in reasoning with texts on a wide variety of NLP tasks. In particular, sequence-to-sequence (seq2seq) neural models, often employing attention mechanisms, have been largely successful in delivering the state-of-the-art for tasks such as machine translation (Bahdanau et al., 2014; Vaswani et al., 2017), handwriting synthesis (Graves, 2013), image captioning (Xu et al., 2015), speech recognition (Chorowski et al., 2015) and document summarization (Cheng and Lapata, 2016). Inspired by these successes, we aimed to harness the power of sequential encoder-decoder architectures with attention to train end-to-end differentiable models that are able to learn the best possible representation of input documents in a collection while being predictive of a set of key terms that best describe the document. These will later be used to transfer a relevant but diverse set of key terms from the most related documents, for "semantic tagging" of the original input documents, so as to aid downstream query refinement for IR by pseudo-relevance feedback (Xu and Croft, 2000).

To this end, and to the best of our knowledge, we are the first to employ a novel, completely unsupervised, end-to-end neural attention-based document representation learning approach, using no external labels, in order to achieve the most meaningful term transfer between related documents, i.e. semantic tagging of documents, in a "pseudo-relevance feedback"–based (Xu and Croft, 2000) setting for unsupervised query expansion. This may also be seen as a method of document expansion as a means for obtaining query refinement terms for downstream IR. The following sections give an account of our specific architectural considerations
in achieving an end-to-end neural framework for semantic tagging of documents using their representations, and a discussion of the results obtained from this approach.

2 Related Work

Pseudo-relevance feedback (PRF), a local context analysis method for automatic query expansion (QE), is extensively studied in information retrieval (IR) research as a means of addressing the word mismatch between queries and documents. It adjusts a query relative to the documents that initially appear to match it, with the main assumption that the top-ranked documents in the first retrieval result contain many useful terms that can help discriminate relevant documents from irrelevant ones (Xu and Croft, 2000; Cao et al., 2008). It is motivated by relevance feedback (RF), a well-known IR technique that modifies a query based on the relevance judgments of the retrieved documents (Salton et al., 1990). RF typically adds common terms from the relevant documents to a query and re-weights the expanded query based on term frequencies in the relevant documents relative to the non-relevant ones. Thus in PRF we find an initial set of most relevant documents; then, assuming that the top k ranked documents are relevant, RF is done as before, without manual interaction by the user. The added terms are, therefore, common terms from the top-ranked documents.

To this end, (Cao et al., 2008) employ term classification for retrieval effectiveness, in a "supervised" setting, to select the most relevant terms. (Palangi et al., 2016) employ a deep sentence embedding approach using LSTMs and show improvement over standard sentence embedding methods, but as a means for directly deriving encodings of queries and documents for use in IR, and not as a method for QE by PRF. In another approach, (Xu et al., 2017) train autoencoder representations of queries and documents to enrich the feature space for learning-to-rank, and show gains in retrieval performance over pre-trained rankers. But this is a fully supervised setup where the queries are seen at train time. (Pfeiffer et al., 2018) also use an autoencoder-based approach for actual query refinement in pharmacogenomic document retrieval. However, here too, their document ranking model uses the encoding of the query and the document for training the ranker, hence the queries are not unseen with respect to the document during training. They mention that their work can be improved upon by the use of seq2seq-based approaches. In this sense, i.e. with respect to QE by PRF and learning a sequential document representation for document ranking, our work is most similar to (Pfeiffer et al., 2018). However, the queries are completely unseen in our case, and we use only the documents in the corpus to train our neural document language models from scratch in a completely unsupervised way.

Classic sequence-to-sequence models like (Sutskever et al., 2014) demonstrate the strength of recurrent models such as the LSTM in capturing short- and long-range dependencies in learning effective encodings for the end task. Works such as (Graves, 2013), (Bahdanau et al., 2014) and (Rocktäschel et al., 2015) further stress the key role that attention, and multi-headed attention (Vaswani et al., 2017), can play in solving the end task. We use these insights in our work.

According to the detailed report provided for this dataset and task in (Roberts et al., 2016), all of the systems described perform direct query reweighting aside from supervised term expansion, and are highly tuned to the clinical queries in this dataset. In a related medical IR challenge (Roberts et al., 2017), the authors specifically mention that with only six partially annotated queries for system development, it is likely that systems were either under- or over-tuned on these queries. Since the setup of the seq2set framework is an attempt to model the PRF-based query expansion method of its closest related work (Das et al., 2018), where the effort is also to train a neural generalized language model for unsupervised semantic tagging, we choose this system as the benchmark to compare our end-to-end approach against for the same task.

3 Methodology

Drawing on sequence-to-sequence modeling approaches for text classification, e.g. textual entailment (Rocktäschel et al., 2015) and machine translation (Sutskever et al., 2014; Bahdanau et al., 2014), we adapt from these settings into a sequence-to-set framework for learning representations of input documents, in order to derive a meaningful set of terms, or semantic tags, drawn from a closely related set of documents, that expand the original documents. These document expansion terms are then used downstream for query reformulation via PRF, for unseen queries.

3.1 The Sequence-to-Set Semantic Tagging Framework

Task Definition: For each query document d_q in a given collection of documents D = {d_1, d_2, ..., d_N}, represented by a set of k keywords or labels, e.g. k terms in d_q derived from the top-|V| TFIDF-scored terms, find an alternate set of k most relevant terms coming from documents "most related" to d_q from elsewhere in the collection. These serve as semantic tags for expanding d_q.

In the unsupervised task setting described later, a document to be tagged is regarded as a query document d_q; its semantic tags are generated via PRF, and these terms will in turn be used for PRF-based expansion of unseen queries in downstream IR. Thus d_q could represent an original complex query text or a document in the collection. In the following sections we describe the building blocks used in the setup for the baseline and proposed models for sequence-to-set semantic tagging as described in the task definition.
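The TFIDF-based pseudo-labeling implied by the task definition can be sketched in a few lines. This is a minimal, self-contained illustration in pure Python (the function name and toy corpus are our own, not from the paper; the actual system restricts the vocabulary to the top-|V| corpus terms, which is omitted here):

```python
import math
from collections import Counter

def tfidf_pseudo_labels(docs, k=3):
    """Label each document with its k top TFIDF-scored terms (the pseudo-labels)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(t for doc in tokenized for t in set(doc))
    labels = []
    for doc in tokenized:
        tf = Counter(doc)
        # classic tf * idf score for every term in this document
        scores = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        labels.append(sorted(scores, key=scores.get, reverse=True)[:k])
    return labels

docs = [
    "chest pain and shortness of breath",
    "brain mri shows a lesion",
    "chest xray shows pneumonia",
]
labels = tfidf_pseudo_labels(docs)  # one k-term pseudo-label set per document
```

Terms that occur in many documents (e.g. "chest" above) receive a low IDF weight and are pushed out of the pseudo-label set, which is the behavior the framework relies on to get discriminative targets.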
We employ an end-to-end framework for unsupervised representation learning of documents using TFIDF-based pseudo-labels (Figure 1(a)) and a separate cosine similarity–based ranking module for semantic tag inference (Figure 1(b)). We employ various methods, such as doc2vec, Deep Averaging, and sequential models such as LSTM, GRU, BiGRU, BiLSTM, BiLSTM with Attention and Self-attention, detailed in Figure 1(c)-(f) (see Appendix A), for learning fixed-length input document representations in our framework. We apply methods like DAN (Iyyer et al., 2015), LSTM, and BiLSTM as our baselines, and formulate attentional models, including a self-attentional Transformer-based one (Vaswani et al., 2017), as our proposed augmented document encoders.

Further, we hypothesize that a sequential, bidirectional or attentional encoder coupled with a decoder, i.e. a sigmoid or softmax prediction layer that conditions on the encoder output v (similar to an approach by (Kiros et al., 2015) for learning a neural probabilistic language model), would enable learning of the optimal semantic tags in our unsupervised query expansion setting, while modeling directly for this task in an end-to-end neural framework. In our setup the decoder predicts a meaningful set of concept tags that best describe a document according to the training objective. The following sections describe our setup.

3.2 Training and Inference Setup

The overall architecture for sequence-to-set semantic tagging consists of two phases, as depicted in the block diagrams in Figures 1(a) and 1(b): the first for training of input representations of documents, and the second for inference to achieve term transfer for semantic tagging. As shown in Figure 1(a), the proposed model architecture first learns the appropriate feature representations of documents in a first pass of training, by taking in the tokens of an input document sequentially, using a document's pre-determined top-k TFIDF-scored terms as the pseudo-class labels for an input instance, i.e. prediction targets for a sigmoid layer for multi-label classification. The training objective is to maximize the probability of these k terms, y_p = (t_1, t_2, ..., t_k) ∈ V, i.e.

    argmax_θ P(y_p = (t_1, t_2, ..., t_k) ∈ V | v; θ)    (1)

given the document's encoding v. For computational efficiency, we take V to be the list of top-10K TFIDF-scored terms from our corpus, thus |V| = 10,000. k is taken as 3, so each document is initially labeled with 3 terms. The sequential model is then trained with the k-hot 10K-dimensional label vector as targets for the sigmoid classification layer, employing a couple of alternative training objectives.

Figure 1: Overview of Sequence-to-Set Framework. (a) Method for training document or query representations; (b) Method for inference via term transfer for semantic tagging. Document Sequence Encoders: (c) Deep Averaging encoder; (d) LSTM last-hidden-state and GRU encoders; (e) BiLSTM last-hidden-state, BiGRU (shown in dotted box), and BiLSTM attended-hidden-states encoders; (f) Transformer self-attentional encoder [source: (Alammar, 2018)].

The first objective, typical for multi-label classification, minimizes a categorical cross-entropy loss, which for a single training instance with ground-truth label set y_p is:

    L_CE(ŷ_p) = − Σ_{i=1}^{|V|} y_i log(ŷ_i)    (2)

Since our goal is to obtain the most meaningful document representations, most predictive of their assigned terms and also able to predict semantic tags not present in the document, we also consider a language model–based loss objective, converting our decoder into a neural language model. Thus, we employ a training objective that maximizes the conditional log likelihood of the label terms L_d of a document d_q, given the document's representation v, i.e. P(L_d | d_q) (where y_p = L_d ∈ V). This amounts to minimizing the negative log likelihood of the label representations conditioned on the document encoding. Thus,

    P(L_d | d_q) = Π_{l ∈ L_d} P(l | d_q),  so  −log P(L_d | d_q) = − Σ_{l ∈ L_d} log(P(l | d_q))    (3)

Since P(l | d_q) ∝ exp(v_l · v), where v_l and v are the label and document encodings, this is equivalent to minimizing:

    L_LM(ŷ_p) = − Σ_{l ∈ L_d} log(exp(v_l · v))    (4)

Equation (4) represents our language model–style loss objective. We run experiments training with both losses (Equations (2) and (4)), as well as with a variant that is a summation of both, with a hyperparameter α used to tune the language model component of the total loss objective.
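As a concrete reading of the two objectives, Equations (2) and (4) and their α-weighted sum can be transcribed directly. This is a numpy sketch under our own naming; details of the real model (e.g. how P(l | d_q) is normalized, or batching) are not shown:

```python
import numpy as np

def cross_entropy_loss(y_true, y_hat, eps=1e-9):
    # Eq. (2): categorical cross-entropy against the k-hot pseudo-label vector y_p.
    return -float(np.sum(y_true * np.log(y_hat + eps)))

def lm_loss(doc_vec, label_vecs):
    # Eq. (4): -sum_{l in L_d} log(exp(v_l . v)), which reduces to -sum_l (v_l . v).
    return -float(sum(np.log(np.exp(np.dot(v_l, doc_vec))) for v_l in label_vecs))

def total_loss(y_true, y_hat, doc_vec, label_vecs, alpha=0.5):
    # Summation variant: alpha tunes the language-model component of the loss.
    return cross_entropy_loss(y_true, y_hat) + alpha * lm_loss(doc_vec, label_vecs)
```

Note that log(exp(v_l · v)) collapses to the dot product itself; in practice a normalization term over the vocabulary (or a sampled approximation) would keep the objective bounded.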
4 Task Settings

4.1 Unsupervised Task – Semantic Tagging for Query Expansion

We now describe the setup and results for experiments run on our unsupervised task setting of semantic tagging of documents for PRF-based query expansion.

4.1.1 Dataset – TREC CDS 2016

The 2016 TREC CDS challenge dataset makes available actual (de-identified) electronic health records (EHR) of patients, in the form of case reports typically describing a challenging medical case. Such a case report represents a query in our system, having a complex information need. There are 30 queries in this dataset, corresponding to such case reports, at 3 levels of granularity: Note, Description and Summary text, as described in (Roberts et al., 2016). The target document collection is the Open Access Subset of PubMed Central (PMC), containing 1.25 million articles consisting of title, keywords, abstract and body sections. We use a subset of 100K of these articles, for which human relevance judgments are made available by TREC, for training. Final evaluation, however, is done on an ElasticSearch index built on top of the entire collection of 1.25 million PMC articles.

4.1.2 Unsupervised Task Experiments

We ran several sets of experiments with various document encoders, employing pre-trained … the Summary Text + Sum. UMLS terms as the "augmented" query. We use the Adam optimizer (Kingma and Ba, 2014) for training our models. After several rounds of hyper-parameter tuning, batch size was set to 128, dropout to 0.3, the prediction layer was fixed to sigmoid, the loss function was switched between cross-entropy and the summation of cross-entropy and LM losses, and models were trained with early stopping.

Results from the various Seq2Set encoder models on the base (Summary Text) and augmented (Summary Text + Summary-based UMLS terms) queries are outlined in Table 1. Evaluating on the base query, a Seq2Set-Transformer model beats all other Seq2Set encoders, and also the TFIDF, MeSH QE terms and Expert QE terms baselines. On the augmented query, the Seq2Set-BiGRU and Seq2Set-Transformer models outperform all other Seq2Set encoders, and Seq2Set-Transformer significantly outperforms all non-ensemble baselines and the Phrase2VecGLM unsupervised QE ensemble system baseline, with P@10 of 0.4333. The best-performing supervised QE systems for this dataset, tuned on all 30 queries, range between 0.35–0.4033 P@10 (Roberts et al., 2016), better than unsupervised QE systems on the base query, but surpassed by the best Seq2Set-based models such as Seq2Set-Transformer on the augmented query, even without ensembling. Semantic tags from a best-performing model do appear to pick terms relating to certain conditions.
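The PRF-style inference step that produces such expansion terms (Figure 1(b)) can be sketched as nearest-neighbour term transfer over document encodings. The following is a simplified numpy illustration with invented toy vectors and tags, not the paper's implementation:

```python
import numpy as np

def expand_query(query_vec, doc_vecs, doc_tags, top_n=3, k=3):
    """PRF-style term transfer: pool tags of the top_n most cosine-similar docs."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    nearest = np.argsort(sims)[::-1][:top_n]       # indices of the most similar docs
    pooled = [t for i in nearest for t in doc_tags[i]]
    # keep the k most frequently transferred tags as expansion terms
    uniq, counts = np.unique(pooled, return_counts=True)
    return list(uniq[np.argsort(counts)[::-1]][:k])

doc_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])  # toy document encodings
doc_tags = [["sepsis", "fever"], ["sepsis", "icu"], ["fracture", "xray"]]
terms = expand_query(np.array([1.0, 0.0]), doc_vecs, doc_tags, top_n=2, k=2)
```

Here the query vector is closest to the first two documents, so their shared tag ("sepsis") dominates the pooled expansion terms; the expanded query would then be re-issued to the BM25 retrieval system.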
Table 1: Results on IR for best Seq2set models, in an unsupervised PRF–based QE setting. Boldface indicates statistical significance @ p<<0.01 over previous.

Unsupervised QE Systems (Base Query)                            P@10
  BM25+Seq2Set-doc2vec (baseline)                               0.0794
  BM25+Seq2Set-TFIDF Terms (baseline)                           0.2000
  BM25+MeSH QE Terms (baseline)                                 0.2294
  BM25+Human Expert QE Terms (baseline)                         0.2511
  BM25+unigramGLM+Phrase2VecGLM ensemble (system baseline)      0.2756
  BM25+Seq2Set-Transformer (L_CE) (model)                       0.2861*
Supervised QE Systems (Base Query)
  BM25+ETH Zurich-ETHSummRR                                     0.3067
  BM25+Fudan Univ. DMIIP-AutoSummary1                           0.4033
Unsupervised QE Systems (Augmented Query)
  BM25+Seq2Set-doc2vec (baseline)                               0.1345
  BM25+Seq2Set-TFIDF Terms (baseline)                           0.3000
  BM25+unigramGLM+Phrase2VecGLM ensemble (system baseline)      0.3091
  BM25+Seq2Set-BiGRU (LM only loss) (model)                     0.3333*
  BM25+Seq2Set-Transformer (L_CE + L_LM) (model)                0.4333*

collection is available where the set of document labels are already known, we can learn to predict tags from this set of known labels on a new set of similar documents. Thus, in order to generalize our Seq2set approach to such other tasks and setups, we aim to validate the performance of our framework on such a labeled dataset of tagged documents, which is equivalent to adapting the Seq2set framework for a supervised setup. In this setup we therefore only need to use the training module of the Seq2set framework shown in Figure 1(a), and measure tag prediction performance on a held out set of documents. For this evaluation, we choose to work with the popular Delicious (del.icio.us) folksonomy dataset, the same as that used by (Soleimani and Miller, 2016), in order to do an appropriate comparison with their MLTM framework, which is also evaluated on a similar document multi-label prediction task.

4.2.1 Dataset – del.icio.us

The Delicious dataset contains tagged web pages retrieved from the social bookmarking site, del.icio.us. There are 20 common tags used as class labels: reference, design, programming, internet, computer, web, java, writing, English, grammar, style, language, books, education, philosophy, politics, religion, science, history and culture. The training set consists of 8250 documents and the test set consists of 4000 documents.

4.2.2 Supervised Task Experiments

We then run Seq2set-based training for our 8 different encoder models on the training set for the 20 labels, and perform evaluation on the test set, measuring sentence-level ROC AUC on the labeled documents in the test set.

[Figure 2: Seq2Set–supervised on del.icio.us, best Transformer model–based encoder.]

Figure 2 shows the ROC AUC for the best performing Transformer model from the Seq2set framework on the del.icio.us dataset, which was trained with a sigmoid–based prediction layer on cross entropy loss with a batch size of 64 and dropout set to 0.3. This best model got an ROC AUC of 0.85, statistically significantly surpassing MLTM (AUC 0.81 @ p << 0.001) for this task and dataset.

[Figure 3: A comparison of document labeling performance of Seq2set versus MLTM.]

Figure 3 also shows a comparison of the ROC AUC scores obtained with training Seq2set and MLTM based models for this task with various labeled data proportions. Here again we see that Seq2set has a clear advantage over the current MLTM state-of-the-art, statistically significantly surpassing it (p << 0.01) when trained with greater than 25% of the labeled dataset.
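The supervised evaluation just described, sigmoid scores per tag evaluated with ROC AUC over the labeled test documents, can be sketched as follows. This is an illustrative sketch, not the authors' code: the gold labels and model scores are random stand-ins shaped like the 4000-document, 20-tag del.icio.us test set.

```python
# Hypothetical sketch of multi-label ROC AUC evaluation over the
# del.icio.us test split (4000 documents, 20 tags). Labels and scores
# are synthetic placeholders for real model outputs.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_docs, n_labels = 4000, 20

# Binary gold labels, and sigmoid scores loosely correlated with them.
y_true = rng.integers(0, 2, size=(n_docs, n_labels))
y_score = np.clip(y_true * 0.6 + rng.random((n_docs, n_labels)) * 0.8, 0, 1)

# Micro-averaged ROC AUC over all (document, tag) decisions.
auc = roc_auc_score(y_true, y_score, average="micro")
print(f"micro ROC AUC: {auc:.3f}")
```

With real model scores in place of the synthetic ones, the same call reproduces the kind of AUC comparison reported against MLTM.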
4.3 Semi-Supervised Text Categorization

We then seek to further validate how well the Seq2set framework can leverage large scale pre-training on unlabeled data, given only a small amount of labeled data for training, to improve prediction performance on a held out set of these known labels. This amounts to a semi-supervised setup–based evaluation of the Seq2set framework. In this setup, we perform the training and evaluation of Seq2set similar to the supervised setup, except we have an added step of pre-training the multi-label prediction on large amounts of unlabeled document data in exactly the same way as the unsupervised setup.

4.3.1 Dataset – Ohsumed

We employ the Ohsumed dataset available from the TREC Information Filtering tracks of years 87-91, and the version of the labeled Ohsumed dataset used by (Soleimani and Miller, 2016) for evaluation, to have an appropriate comparison with their MLTM system, which is also evaluated on this dataset. The version of the Ohsumed dataset due to (Soleimani and Miller, 2016) consists of 11122 training and 5388 test documents, each assigned to one or multiple labels of 23 MeSH disease categories. Almost half of the documents have more than one label.

4.3.2 Semi-Supervised Task Experiments

We first train and test our framework on the labeled subset of the Ohsumed data from (Soleimani and Miller, 2016), similar to the supervised setup described in the previous section. This evaluation gives a statistically significant ROC AUC of 0.93 over the 0.90 AUC for the MLTM system of (Soleimani and Miller, 2016), with a Transformer–based Seq2set model performing best. Next we experiment with the semi-supervised setting, where we first train the Seq2set framework models on a large number of documents that do not have pre-existing labels. This pre-training is performed in exactly the same fashion as the unsupervised setup. Thus we first preprocess the Ohsumed data from years 87-90 to obtain a top-1000 TFIDF score–based vocabulary of tags, pseudo-labeling all the documents in the training set with these. Our training and evaluation for the semi-supervised setup consists of 3 phases. Phase 1: We employ our Seq2set framework (using each one of our encoder models) for multi-label prediction on this pseudo-labeled data, with an output prediction layer of 1000 and a penultimate fully-connected layer of dimension 23, the same as the number of labels in the Ohsumed dataset. Phase 2: After pre-training with pseudo-labels, we discard the final layer and continue to train on the labeled Ohsumed dataset from year 91 by 5-fold cross-validation with early stopping. Phase 3: This is the final evaluation step of our semi-supervised trained Seq2set model on the labeled Ohsumed test dataset used by (Soleimani and Miller, 2016). This constitutes simply inferring predicted tags using the trained model on the test data.

[Figure 4: Seq2Set–semi-supervised on Ohsumed, best Transformer model–based encoder w/ Cross Entropy–based Softmax prediction; 4 layers, 10 attention heads, dropout=0.]

As shown in Figure 4, our evaluation of the Seq2set framework on the Ohsumed dataset, comparing supervised and semi-supervised training setups, yields an ROC AUC of 0.94 for our best performing semi-supervised–trained model of Fig. 4, compared to a best ROC AUC of 0.93 among the various supervised trained models for the same dataset. The top performing semi-supervised model again involves a Transformer–based encoder using a softmax layer for prediction, with 4 layers, 10 attention heads, and no dropout. Thus, the best result of the semi-supervised training experiments (ROC AUC 0.94) statistically significantly outperforms (p << 0.01) the MLTM system baseline (ROC AUC 0.90) on the Ohsumed dataset, while also clearly surpassing the top-performing supervised Seq2set models on the same dataset. This demonstrates that our Seq2set framework is able to leverage the benefits of data augmentation in the semi-supervised setup by training with large amounts of unlabeled data on top of limited labeled data.
5 Conclusion

We develop a novel sequence-to-set end-to-end encoder-decoder–based neural framework for multi-label prediction, training document representations using no external supervision labels, for pseudo-relevance feedback–based unsupervised semantic tagging of a large collection of documents. We find that in this unsupervised task setting of PRF-based semantic tagging for query expansion, a multi-term prediction training objective that jointly optimizes both prediction of the TFIDF–based document pseudo-labels and the log likelihood of the labels given the document encoding surpasses previous methods such as Phrase2VecGLM (Das et al., 2018) that used neural generalized language models for the same. Our initial hypothesis, that bi-directional or self-attentional models could learn the most efficient semantic representations of documents when coupled with a loss more effective than cross-entropy at reducing the language model perplexity of document encodings, is corroborated in all experimental setups. We demonstrate the effectiveness of our novel framework in every task setting, viz. for unsupervised QE via PRF-based semantic tagging for a downstream medical IR challenge task, as well as for both supervised and semi-supervised task settings, where Seq2set statistically significantly outperforms the state-of-the-art MLTM baseline (Soleimani and Miller, 2016) on the same held out set of documents as MLTM, for multi-label prediction on a set of known labels, for automated text categorization; achieving, to the best of our knowledge, the current state-of-the-art for multi-label prediction on documents, with or without known labels. We therefore demonstrate the effectiveness of our Sequence-to-Set framework for multi-label prediction on any set of documents, applicable especially towards the automated categorization, filtering and semantic tagging for QE-based retrieval of biomedical literature for EBM. Future directions would involve experiments replacing TFIDF labels with more meaningful terms (using unsupervised term extraction) for query expansion, initialization with pre-trained embeddings for the biomedical domain such as BlueBERT (Peng et al., 2019) and BioBERT (Lee et al., 2020), and multi-task learning with closely related tasks such as biomedical Named Entity Recognition and Relation Extraction, to learn better document representations and thus more meaningful semantic tags of documents useful for downstream EBM tasks.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.

Jay Alammar. 2018. The illustrated transformer. Online; posted June 27, 2018.

Ben Athiwaratkun, Andrew Gordon Wilson, and Anima Anandkumar. 2018. Probabilistic FastText for multi-sense word embeddings. arXiv preprint arXiv:1806.02901.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270.

Guihong Cao, Jian-Yun Nie, Jianfeng Gao, and Stephen Robertson. 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 243–250. ACM.

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.

Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577–585.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated feedback recurrent neural networks. In International Conference on Machine Learning, pages 2067–2075.

Manirupa Das, Eric Fosler-Lussier, Simon Lin, Soheil Moosavinasab, David Chen, Steve Rust, Yungui Huang, and Rajiv Ramnath. 2018. Phrase2VecGLM: Neural generalized language model–based semantic tagging for complex query reformulation in medical IR. In Proceedings of the BioNLP 2018 Workshop, pages 118–128.

Dina Demner-Fushman, Willie J Rogers, and Alan R Aronson. 2017. MetaMap Lite: an evaluation of a new Java implementation of MetaMap. Journal of the American Medical Informatics Association, 24(4):841–844.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the ACL, volume 1, pages 1681–1691.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.

Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In ICML, volume 14, pages 1188–1196.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Donald AB Lindberg, Betsy L Humphreys, and Alexa T McCray. 1993. The Unified Medical Language System. Yearbook of Medical Informatics, 2(01):41–51.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. 2016. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 24(4):694–707.

Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. In Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019), pages 58–65.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Jonas Pfeiffer, Samuel Broscheit, Rainer Gemulla, and Mathias Göschl. 2018. A neural autoencoder approach for document ranking and query refinement in pharmacogenomic information retrieval. In Proceedings of the BioNLP 2018 Workshop, pages 87–97.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Kirk Roberts, Anupama E Gururaj, Xiaoling Chen, Saeid Pournejati, William R Hersh, Dina Demner-Fushman, Lucila Ohno-Machado, Trevor Cohen, and Hua Xu. 2017. Information retrieval for biomedical datasets: the 2016 bioCADDIE dataset retrieval challenge. Database, 2017.

Kirk Roberts, Ellen Voorhees, Dina Demner-Fushman, and William R. Hersh. 2016. Overview of the TREC 2016 Clinical Decision Support track. Online; posted August 2016.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.

Gerard Salton, Chris Buckley, and Maria Smith. 1990. On the application of syntactic methodologies in automatic text analysis. Information Processing & Management, 26(1):73–92.

Hossein Soleimani and David J Miller. 2016. Semi-supervised multi-label topic models for document classification and sentence labeling. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pages 105–114. ACM.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Bo Xu, Hongfei Lin, Yuan Lin, and Kan Xu. 2017. Learning to rank with query-level semi-supervised autoencoders. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 2395–2398. ACM.

Jinxi Xu and W Bruce Croft. 2000. Improving the effectiveness of information retrieval with local context analysis. ACM Transactions on Information Systems (TOIS), 18(1):79–112.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.
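The joint training objective highlighted in the conclusion, cross-entropy over the multi-label predictions plus an α-weighted language-model term, can be sketched numerically as below. This is our simplified notation, not the released code: lm_nll is a placeholder scalar standing in for the document language-model loss component.

```python
# Sketch of the joint objective L = L_CE + alpha * L_LM described above.
# y_prob are sigmoid outputs per label; lm_nll is a stand-in LM loss.
import numpy as np

def joint_loss(y_true, y_prob, lm_nll, alpha=1000.0):
    """Mean binary cross-entropy over labels plus alpha-weighted LM loss."""
    eps = 1e-12
    ce = -np.mean(y_true * np.log(y_prob + eps)
                  + (1 - y_true) * np.log(1 - y_prob + eps))
    return ce + alpha * lm_nll

# Toy values: 4 labels, reasonably confident sigmoid scores.
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_prob = np.array([0.9, 0.2, 0.8, 0.1])
print(round(joint_loss(y_true, y_prob, lm_nll=0.002, alpha=1000.0), 4))
```

The α values searched in the experiments (1.0 through 10000.0) trade off the two terms; with α = 0 this reduces to plain cross-entropy training.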
A Sequence-based Document Encoders

We describe below the different neural models that we use for the sequence encoder, as part of our encoder-decoder architecture for deriving semantic tags for documents.

A.1 doc2vec encoder

doc2vec is the unsupervised algorithm due to (Le and Mikolov, 2014) that learns fixed-length representations of variable-length documents, representing each document by a dense vector trained to predict surrounding words in contexts sampled from each document. We derive these doc2vec encodings by pre-training on our corpus. We then use them directly as features for inferring semantic tags per Figure 1(b), without training them within our framework against the loss objectives. We expect this to be a strong document encoding baseline in capturing the semantics of documents. TFIDF Terms is our other baseline, where we don't train within the framework but rather use the top-k neighbor documents' TFIDF pseudo-labels as the semantic tags for the query document.

A.2 Deep Averaging Network encoder

The Deep Averaging Network (DAN) for text classification due to (Iyyer et al., 2015), Figure 1(c), is formulated as a neural bag-of-words encoder model for mapping an input sequence of tokens X to one of k labels. v is the output of a composition function g, in this case averaging, applied to the sequence of word embeddings v_w for w ∈ X. For our multi-label classification problem, v is fed to a sigmoid layer to obtain scores for each independent classification. We expect this to be another strong document encoder given results in the literature, and it proves in practice to be.

A.2.1 LSTM and BiLSTM encoders

LSTMs (Hochreiter and Schmidhuber, 1997), by design, encompass memory cells that can store information for a long period of time and are therefore capable of learning and remembering over long and variable sequences of inputs. In addition to three types of gates, i.e. input, forget, and output gates, that control the flow of information into and out of these cells, LSTMs have a hidden state vector h_t^l and a memory vector c_t^l. At each time step, corresponding to a token of the input document, the LSTM can choose to read from, write to, or reset the cell using explicit gating mechanisms. Thus the LSTM is able to learn a language model for the entire document, encoded in the hidden state of the final timestep, which we use as the document encoding to give to the prediction layer. By the same token, owing to the bi-directional processing of its input, a BiLSTM-based document representation is expected to be even more robust at capturing document semantics than the LSTM, with respect to its prediction targets. Here, the document representation used for final classification is the concatenated hidden state outputs from the final step, [→h_t^l; ←h_t^l], depicted by the dotted box in Fig. 1(e).

A.3 BiLSTM with Attention encoder

In addition, we also propose a BiLSTM with attention-based document encoder, where the output representation is the weighted combination of the concatenated hidden states at each time step. Thus we learn an attention-weighted representation at the final output as follows. Let X ∈ R^{d×L} be a matrix consisting of the output vectors [h_1, ..., h_L] that the BiLSTM produces when reading the L tokens of the input document. Each word representation h_i is obtained by concatenating the forward and backward hidden states, i.e. h_i = [→h_i; ←h_i]. d is the size of the embeddings and hidden layers. The attention mechanism produces a vector α of attention weights and a weighted representation r of the input, via:

    M = tanh(WX),        M ∈ R^{d×L}    (5)
    α = softmax(w^T M),  α ∈ R^L        (6)
    r = X α^T,           r ∈ R^d        (7)

Here, the intermediate attention representation m_i (i.e. the i-th column vector in M) of the i-th word in the input document is obtained by applying a non-linearity to the matrix of output vectors X, and the attention weight for the i-th word in the input is the result of a weighted combination (parameterized by w) of the values in m_i. Thus r ∈ R^d is the attention-weighted representation of the word and phrase tokens in an input document, used in optimizing the training objective in downstream multi-label classification, as shown by the final attended representation r in Figure 1(e).

A.4 GRU and BiGRU encoders

A Gated Recurrent Unit (GRU) is a type of recurrent unit in recurrent neural networks (RNNs)
that aims at tracking long-term dependencies while keeping the gradients in a reasonable range. In contrast to the LSTM, a GRU has only 2 gates: a reset gate and an update gate. First proposed by (Chung et al., 2014; Chung et al., 2015) to make each recurrent unit adaptively capture dependencies of different time scales, the GRU, however, does not have any mechanism to control the degree to which its state is exposed, exposing the whole state each time. In the LSTM unit, the amount of the memory content that is seen, or used by other units in the network, is controlled by the output gate, while the GRU exposes its full content without any control. Since the GRU has a simpler structure, models using GRUs generally converge faster than LSTMs; hence they are faster to train and may give better performance in some cases for sequence modeling tasks. The BiGRU has the same structure as the GRU except constructed for bi-directional processing of the input, depicted by the dotted box in Fig. 1(e).

A.5 Transformer self-attentional encoder

Recently, the Transformer encoder-decoder architecture due to (Vaswani et al., 2017), based on a self-attention mechanism in the encoder and decoder, has achieved the state-of-the-art in machine translation tasks at a fraction of the computation cost. Based entirely on attention, and replacing the recurrent layers commonly used in encoder-decoder architectures with multi-headed self-attention, it has outperformed most previously reported ensembles on the task. Thus we hypothesize that this self-attention-based model could learn the most efficient semantic representations of documents for our unsupervised task. Since our models use TensorFlow (Abadi et al., 2016), a natural choice was document representation learning using the Transformer model's available tensor2tensor API. We hoped to leverage, apart from the computational advantages of this model, the capability of capturing semantics over varying lengths of context in the input document, afforded by multi-headed self-attention, Figure 1(f). Self-attention is realized in this architecture by training 3 matrices, made up of vectors, corresponding to a Query vector, a Key vector and a Value vector for each token in the input sequence. The output of each self-attention layer is a summation of weighted Value vectors that passes on to a feed-forward neural network. Position-based encoding to replace recurrences helps lend more parallelism to computations and make things faster. Multi-headed self-attention further lends the model the ability to focus on different positions in the input, with multiple sets of Query/Key/Value weight matrices, which we hypothesize should result in the most effective document representation, among all the models, for our downstream task.

A.6 CNN encoder

Inspired by the success of (Kim, 2014) in employing CNN architectures for achieving gains in NLP tasks, we also employ a CNN-based encoder in the Seq2set framework. (Kim, 2014) train a simple CNN with a layer of convolution on top of pre-trained word vectors, with a sequence of length-n embeddings concatenated to form a matrix input. Filters of different sizes, representing various context windows over neighboring words, are then applied to this input, over each possible window of words in the sequence, to obtain feature maps. This is followed by a max-over-time pooling operation to take the maximum value of the feature map as the feature corresponding to a particular filter. The model then combines these features to form a penultimate layer which is passed to a fully connected softmax layer whose output is the probability distribution over labels. In the case of Seq2set, these features are passed to a sigmoid layer for final multi-label prediction using cross entropy loss or a combination of cross-entropy and LM losses. We use filters of sizes 2, 3, 4 and 5. Like our other encoders, we fine-tune the document representations learnt.

B Embedding Algorithms Experimented with

We describe here the various algorithms used to train word embeddings for use in our models.

Skip-Gram word2vec: We generate word embeddings trained with the skip-gram model with negative sampling (Mikolov et al., 2013b), with dimension settings of 50 with a context window of 4, and also 300 with a context window of 5, using the gensim package2 (Řehůřek and Sojka, 2010).

Probabilistic FastText: The Probabilistic FastText (PFT) word embedding model of (Athiwaratkun et al., 2018) represents each word with a Gaussian mixture density, where the mean of a mixture component, given by the sum of n-grams, can capture multiple word senses, sub-word structure,

2 https://radimrehurek.com/gensim/
and uncertainty information. This model outperforms the n-gram averaging of FastText, getting state-of-the-art performance on several word similarity and disambiguation benchmarks. The probabilistic word representations with flexible sub-word structures can achieve multi-sense representations that also give rich semantics for rare words. This makes them very suitable to generalize for rare and out-of-vocabulary words, motivating us to opt for PFT-based word vector pre-training3 over regular FastText.

ELMo: Another consideration was to use embeddings that can explicitly capture the language model underlying sentences within a document. ELMo (Embeddings from Language Models) word vectors (Peters et al., 2018) presented such a choice, where the vectors are derived from a bidirectional LSTM trained with a coupled language model (LM) objective on a large text corpus. The representations are a function of all of the internal layers of the biLM. Using linear combinations of the vectors derived from each internal state has shown marked improvements over various downstream NLP tasks, because the higher-level LSTM states capture context-dependent aspects of word meaning (e.g., they can be used without modification to perform well on supervised word sense disambiguation tasks) while lower-level states model aspects of syntax. Using the API4 we generate ELMo embeddings fine-tuned for our corpus with dimension settings of 50 and 100, using only the top layer final representations. A discussion of the results from each set of experiments is outlined in the following section and summarized in Table 1.

C Experimental Considerations and Hyperparameter Settings

Of the metrics available, P@10 gives the number of relevant items returned in the top-10 results and NDCG looks at the precision of the returned items at the correct rankings. For our particular dataset domain, the number of relevant results returned in the top-10 is more important; hence Table 1 reports results ranked in ascending order of P@10.

The PRF setting shown in the results table means that we take the top 10-15 documents returned by an ElasticSearch (ES) index for each of the 30 Summary Text queries in our dataset, and subsequently use the semantic tags assigned to each of these top documents as the terms for query expansion for the original query. We then re-run these expanded queries through the ES index to record the retrieval performance. Thus the queries our system is evaluated on are not seen at the time of training our models, but only during evaluation; hence it is unsupervised QE.

Similar to Das et al. (2018), for the feedback loop based query expansion method, we had two separate human judgment–based baselines, one using the MeSH terms available from PMC for the top 15 documents returned in a first round of querying with Summary text, and the other based on human expert annotations of the 30 query topics, made available by the authors.

Since we had mixed results initially with our models, we explored various options to increase the training signal. The first was the use of neighbor documents' labels in our label vector for training. In this scheme, we used the 3-hot TFIDF label representation for each document to pick a list of n-nearest neighbors to it. We then included the labels of those nearest documents into the label vector for the original document for use during training. We experimented with choices 3, 5, 10, 15, 20, 25 for n. We observed improvements with incorporating neighbor labels.

Next we experimented with incorporating word neighbors into the input sequence for our models. We did this in two different ways: the first was to average all the neighbors and concatenate with the original token embedding; the other was to average all of the embeddings together. The word itself was always weighted more than the neighbors. This scheme also gave improvement.

Finally we experimented with incorporating embeddings pre-trained by the latest state-of-the-art methods (Appendix B) as the input tokens for our models. After several rounds of hyper-parameter tuning, batch size was set to 128, and dropout to 0.3. We also performed a small grid search into the space of hyperparameters, with the number of hidden layers varied as 2, 3, 4, and α varied as [1.0, 10.0, 100.0, 1000.0, 10000.0], determining the best settings for each encoder.

A glossary of acronyms and parameters used in training of our models is as follows: sg=skip-gram; pft=Probabilistic FastText; elmo=ELMo; d=embedding dimension; kln=number of "neighbor documents'" labels; nl=number of hidden layers in the model; h=number of multi-attention heads; bs=batch size; dp=dropout; ep=no. of epochs; α=weight parameter for the language model loss component.

Best-performing model settings: Our best performing models on the base query were a Transformer encoder with 10 attention heads (nh = 10), loss: cross-entropy + LM loss with α = 1000.0, input embedding: 50-d pft, bs = 64 and dp = 0.3; and a GRU encoder for the ensemble with parameters, loss: LM-only loss with α = 1000.0, input embedding: 50-d pft, nl = 4, kln = 10, bs = 64 and dp = 0.2.

For the augmented query, our best performing models were: (1) a BiGRU trained with parameters, loss function: LM-only loss with α = 1.0, input embedding: 50-d skip-gram, nl = 3, kln = 5, bs = 128 and dp = 0.3, and (2) a Transformer trained with parameters, loss function: cross-entropy + LM loss with α = 1000.0, input embedding: 50-d skip-gram, nl = 4, kln = 5, bs = 128 and dp = 0.3.

While we obtain significant improvement over the compared baselines with our best-performing models, we believe further gains are possible by a more targeted search through the parameter space.

3 https://github.com/benathi/multisense-prob-fasttext
4 https://allennlp.org/elmo
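The small grid search described in Appendix C, with the number of hidden layers in {2, 3, 4} and α in {1.0, 10.0, 100.0, 1000.0, 10000.0}, can be sketched as below. The train_and_eval function is a hypothetical placeholder scoring stub, not the authors' training loop; a real run would train a full encoder per configuration.

```python
# Sketch of the hyperparameter grid search over (nl, alpha) pairs.
# train_and_eval is a stand-in that returns a validation score (e.g. P@10);
# its simple shape here just makes one configuration clearly best.
from itertools import product

def train_and_eval(nl, alpha):
    # Placeholder scoring function standing in for a full training run.
    return 0.25 + 0.01 * nl - 0.00001 * abs(alpha - 1000.0)

grid_nl = [2, 3, 4]                               # hidden layers, per Appendix C
grid_alpha = [1.0, 10.0, 100.0, 1000.0, 10000.0]  # LM-loss weight values searched

best = max(product(grid_nl, grid_alpha),
           key=lambda cfg: train_and_eval(*cfg))
print("best (nl, alpha):", best)
```

Swapping the stub for a real cross-validated training run yields the per-encoder best settings reported above.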
Interactive Extractive Search over Biomedical Corpora
Hillel Taub-Tabib1  Micah Shlain1,2  Shoval Sadde1  Dan Lahav3  Matan Eyal1  Yaara Cohen1  Yoav Goldberg1,2
1 Allen Institute for AI, Tel Aviv, Israel  2 Bar Ilan University, Ramat-Gan, Israel  3 Tel Aviv University, Tel-Aviv, Israel
{hillelt,micahs}@allenai.org
Abstract

We present a system that allows life-science researchers to search a linguistically annotated corpus of scientific texts using patterns over dependency graphs, as well as using patterns over token sequences and a powerful variant of boolean keyword queries. In contrast to previous attempts at dependency-based search, we introduce a lightweight query language that does not require the user to know the details of the underlying linguistic representations, and instead lets them query the corpus by providing an example sentence coupled with simple markup. Search is performed at interactive speed thanks to an efficient linguistic graph-indexing and retrieval engine. This allows for rapid exploration, development and refinement of user queries. We demonstrate the system using example workflows over two corpora: the PubMed corpus, including 14,446,243 PubMed abstracts, and the CORD-19 dataset[1], a collection of over 45,000 research papers focused on COVID-19 research. The system is publicly available at https://allenai.github.io/spike

1 Introduction

Recent years have seen a surge in the amount of accessible Life Sciences data. Search engines like Google Scholar, Microsoft Academic Search or Semantic Scholar allow researchers to search for published papers based on keywords or concepts, but search results often include thousands of papers, and extracting the relevant information from the papers is a problem not addressed by the search engines. This paradigm works well when the information need can be answered by reviewing a number of papers from the top of the search results. However, when the information need requires extraction of information nuggets from many papers (e.g. all chemical-protein interactions or all risk factors for a disease), the task becomes challenging, and researchers will typically resort to curated knowledge bases or designated survey papers, in case ones are available.

We present a search system that works in a paradigm we call Extractive Search, which allows rapid information-seeking queries aimed at extracting facts rather than documents. Our system combines three query modes: boolean, sequential and syntactic, targeting different stages of the analysis process and different extraction scenarios. Boolean queries (§4.1) are the most standard, and look for the existence of search terms, or groups of search terms, in a sentence, regardless of their order. These are very powerful for finding relevant sentences and for co-occurrence searches. Sequential queries (§4.2) focus on the order and distance between terms. They are intuitive to specify and are very effective where the text includes "anchor words" near the entity of interest. Lastly, syntactic queries (§4.4) focus on the linguistic constructions that connect the query words to each other. Syntactic queries are very powerful, and can work also where the concept to be extracted does not have clear linear anchors. However, they are also traditionally hard to specify and require a strong linguistic background to use. Our system lowers their barrier of entry with a specification-by-example interface.

Our proposed system is based on the following components.

Minimal but powerful query languages. There is an inherent trade-off between simplicity and control. On one extreme, web search engines like Google Search offer great simplicity, but very little control over the exact information need. On the other extreme, information-extraction pattern-specification languages like UIMA Ruta offer great precision and control, but also expose a low-level

[1] https://pages.semanticscholar.org/coronavirus-research
Proceedings of the BioNLP 2020 workshop, pages 28–37, Online, July 9, 2020. © 2020 Association for Computational Linguistics

view of the text and come with an over-hundred-page manual.[2] Our system is designed to offer a high degree of expressivity while remaining simple to grasp: the syntax and functionality can be described in a few paragraphs. The three query languages are designed to share the same syntax to the extent possible, to facilitate knowledge transfer between them and to ease the learning curve.

Linguistic Information, Captures, and Expansions. Each of the three query types is linguistically informed, and the user can condition not only on the word forms, but also on their lemmas, part-of-speech tags, and identified entity types. The user can also request to capture some of the search terms, and to expand them to a linguistic context. For example, in a boolean search query looking for a sentence that contains the lemmas "treat" and "treatment" ('lemma=treat|treatment'), a chemical name ('entity=SIMPLE_CHEMICAL') and the word "infection" ('infection'), a user can mark the chemical name and the word "infection" as captures. This will yield a list of chemical/infection pairs, together with the sentences from which they originated, all of which contain the words relating to treatments. Capturing the word "infection" is not very useful on its own: all matches result in the exact same word. But, by expanding the captured word to its surrounding linguistic environment, the captures list will contain terms such as "PEDV infection", "acyclovir-resistant HSV infection" and "secondary bacterial infection". Running this query over PubMed allows us to create a large and relatively focused list in just a few seconds. The list can then be downloaded as a CSV file for further processing. The search becomes extractive: we are not only looking for documents, but also, by use of captures, extracting information from them.

Sentence Focus, Contextual Restrictions. As our system is intended for extraction of information, it works at the sentence level. However, each sentence is situated in a context, and we allow secondary queries to condition on that context, for example by looking for sentences that appear in paragraphs that contain certain words, or which appear in papers with certain words in their titles, in papers with specific MeSH terms, in papers whose abstracts include specific terms, etc. This combines the focus and information density of a sentence, which is the main target of the extraction, with the rich signals available in its surrounding context.

Interactive Speed. Central to the approach is an indexed solution, based on (Valenzuela-Escárcega et al., 2020), that allows performing all types of queries efficiently over very large corpora, while getting results almost immediately. This allows the users to interactively refine their queries and improve them based on the feedback from the results. This contrasts with machine-learning based solutions that, even neglecting the development time, require substantially longer turnaround times between query and results from a large corpus.

2 Existing Information Discovery Approaches

The primary paradigm for navigating large scientific collections such as MEDLINE/PubMed[3] is document-level search.

The most immediate document-level searching technique is boolean search ("keyword search"). However, these methods suffer from an inability to capture the concepts aimed for by the user, as biomedical terms may have different names in different sub-fields, and as the user may not always know exactly what they are looking for. To overcome this issue, several databases offer semantic searching by exploiting MeSH terms that indicate related concepts. While in some cases MeSH terms can be assigned automatically, e.g. (Mork et al., 2013), in others obtaining related concepts requires a manual assignment which is laborious to obtain.

Beyond the methods incorporated in the literature databases themselves, there are numerous external tools for biomedical document searching. Thalia (Soto et al., 2018) is a system for semantic searching over PubMed. It can recognize different types of concepts occurring in biomedical abstracts, and additionally enables search based on abstract metadata; LIVIVO (Müller et al., 2017) takes on the task of vertically integrating information from divergent research areas in the life sciences; SWIFT-Review[4] offers iterative screening by re-ranking the results based on the user's inputs.

All of these solutions are focused on the document level, which can be limiting: they often surface hundreds of papers or more, requiring careful reading, assessing and filtering by the user in order to locate the relevant facts they are looking for.

[2] https://uima.apache.org/d/ruta-current/tools.ruta.book.pdf
[3] https://www.ncbi.nlm.nih.gov/pubmed/
[4] https://www.sciome.com/swift-review/
To complement document searching, some systems facilitate automatic extraction of biomedical concepts, or patterns, from documents. Such systems are often equipped with analysis capabilities over the extracted information. For example, NaCTeM has created systems that extract biomedical entities, relations and events[5]; ExaCT and RobotReviewer (Kiritchenko et al., 2010; Marshall et al., 2015) take an RCT report and retrieve sentences that match certain study characteristics.

To improve the development of automatic document selection and information extraction, the BioNLP community organized a series of shared tasks (Kim et al., 2009, 2011; Nédellec et al., 2013; Segura Bedmar et al., 2013; Deléger et al., 2016; Chaix et al., 2016; Jin-Dong et al., 2019). The tasks address a diverse set of biomedical topics addressed by a range of NLP-based techniques. While effective, such systems require annotated training data and substantial expertise to produce. As such, they are restricted to several "head" information extraction needs, those that enjoy wide community interest and support. The long tail of information needs of "casual" researchers remains mostly unaddressed.

3 Interactive IE Approach

Existing approaches to information extraction from biomedical data suffer from significant practical limitations. Techniques based on supervised training require extensive data collection and annotation (Kim et al., 2009, 2011; Nédellec et al., 2013; Segura Bedmar et al., 2013; Deléger et al., 2016; Chaix et al., 2016), or a high degree of technical savviness in producing high-quality data sets from distant supervision (Peng et al., 2017; Verga et al., 2017; Wang et al., 2019). On the other hand, rule-based engines are generally too complex to be used directly by domain experts and require a linguist or an NLP specialist to operate. Furthermore, both rule-based and supervised systems typically operate in a pipeline approach where an NER engine identifies the relevant entities and subsequent extraction models identify the relations between them. This approach is often problematic in real-world biomedical IE scenarios, where relevant entities often cannot be extracted by stock NER models.

To address these limitations, we present a system allowing domain experts to interactively query linguistically annotated datasets of scientific research papers, using a novel multifaceted query language which we designed, and which supports boolean search, sequential pattern search, and by-example syntactic search (Shlain et al., 2020), as well as specification of search terms whose matches should be captured or expanded. The queries can be further restricted by contextual information. We demonstrate the system on two datasets: a comprehensive dataset of PubMed abstracts and a dataset of full-text papers focused on COVID-19 research.

Comparison to existing systems. In contrast to document-level search solutions, the results returned by our system are sentences which include highlighted spans that directly answer the user's information need. In contrast to supervised IE solutions, our solution does not require a lengthy process of data collection and labeling, or a precise definition of the problem settings.

Compared to rule-based systems, our system differentiates itself in a number of ways: (i) our query engine automatically translates lightly tagged natural-language sentences into syntactic queries (query-by-example), thus allowing domain experts to benefit from the advantages of syntactic patterns without a deep understanding of syntax; (ii) our queries run against indexed data, allowing our translated syntactic queries to run at interactive speed; and (iii) our system does not rely on relation schemas and does not make assumptions about the number of arguments involved or their types.

In many respects, our system is similar to the PropMiner system (Akbik et al., 2013) for exploratory relation extraction (Akbik et al., 2014). Both PropMiner and our system support by-example queries at interactive speed. However, the query languages we describe in Section 4 are significantly more expressive than PropMiner's language, which supports only binary relations. Furthermore, compared to PropMiner, our annotation pipeline was optimized specifically for the biomedical domain, and our system is freely available online.

Technical details. The datasets were annotated for biomedical entities and syntax using a custom SciSpacy pipeline (Neumann et al., 2019)[6], and the syntactic trees were enriched to BART format using pyBART (Tiktinsky et al., 2020). The annotated data is indexed using the Odinson engine (Valenzuela-Escárcega et al., 2020).

[5] http://www.nactem.ac.uk/
[6] All abstracts underwent sentence splitting, tokenization, tagging, parsing and NER, using all 4 NER models available in SciSpacy.
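To make the indexing idea concrete, here is a minimal sketch (not the Odinson engine the paper actually uses) of an inverted index over linguistically annotated tokens. It shows why a query term like 'lemma=treat' or 'entity=DISEASE' can be answered at interactive speed: each term is a posting-list lookup rather than a corpus scan. The two toy "sentences" are invented stand-ins for the pipeline's real annotated output.

```python
from collections import defaultdict

# Invented stand-ins for annotated sentences: one token dict per token,
# with the annotation layers the paper's queries can condition on.
sentences = [
    [{"word": "Chloroquine", "lemma": "chloroquine", "tag": "NNP", "entity": "SIMPLE_CHEMICAL"},
     {"word": "treats", "lemma": "treat", "tag": "VBZ", "entity": "O"},
     {"word": "malaria", "lemma": "malaria", "tag": "NN", "entity": "DISEASE"}],
    [{"word": "Fever", "lemma": "fever", "tag": "NN", "entity": "O"},
     {"word": "is", "lemma": "be", "tag": "VBZ", "entity": "O"},
     {"word": "common", "lemma": "common", "tag": "JJ", "entity": "O"}],
]

# Index every annotation layer, so each (field, value) pair maps to the
# set of sentence ids that contain a matching token.
index = defaultdict(set)
for sid, sent in enumerate(sentences):
    for tok in sent:
        for field in ("word", "lemma", "tag", "entity"):
            index[(field, tok[field])].add(sid)

def lookup(field, value):
    """Sentence ids containing a token whose `field` equals `value`."""
    return index.get((field, value), set())

# A boolean query such as 'lemma=treat entity=DISEASE' is then just an
# intersection of posting lists:
print(lookup("lemma", "treat") & lookup("entity", "DISEASE"))  # {0}
```

The real engine also stores token positions and dependency edges in the index, which is what lets sequential and syntactic queries run at the same speed; the sketch above only covers the boolean case.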
4 Extractive Query Languages

4.1 Boolean Queries

Boolean queries are the standard in information retrieval (IR): the user provides a set of terms that should, and should not, appear in a document, and the system returns a set of documents that adhere to these constraints. This is a familiar and intuitive model, which can be very effective for initial data exploration as well as for extraction tasks that focus on co-occurrence. We depart from standard boolean queries and extend them by (a) allowing conditioning on different linguistic aspects of each token; (b) allowing capturing of terms into named variables; and (c) allowing linguistic expansion of the captured terms.

The simplest boolean query is a list of terms, where each term is a word, i.e.: 'infection asymptomatic fatal'. The semantics is that all the terms must appear in the sentence. A term can be made optional by prefixing it with a '?' symbol ('infection asymptomatic ?fatal'). Each term can also specify a list of alternatives: 'fatal|deadly|lethal'.

Beyond words. In addition to matching words, terms can also specify linguistic properties: lemmas, parts-of-speech, and domain-specific entity types: 'lemma=infect entity=DISEASE'. Conditions can also be combined: 'lemma=cause|reason&tag=NN'. We find that the ability to search for domain-specific types is very effective in boolean queries, as it allows searching for concepts rather than words. In addition to exact match, we also support matching on regular expressions ('lemma=/caus.*/'). The field names word, lemma, entity, tag can be shortened to w, l, e, t.

Captures. Central to our extractive approach is the ability to designate specific search terms to be captured. Capturing is indicated by prefixing the query term with ':' (for an automatically-named capture) or with 'name:' (for a named capture). The query 'fatal asymptomatic d:e=DISEASE' will look for sentences that contain the terms 'fatal' and 'asymptomatic' as well as a name of a disease, and will capture the disease name under a variable "d". Each query result will be a sentence with a single disease captured. If several diseases appear in the same sentence, each one will be its own result. The user can then focus on the captured entities, and export the entire query result to a CSV file, in which each row contains the sentence, its source, and the captured variables. In the current example, the result will be a list of disease names that co-occur with "fatal" and "asymptomatic". We can also issue a query such as 'chem:e=SIMPLE_CHEMICAL d:e=DISEASE' to get a list of chemical-disease co-occurrences. Using additional terms, we can narrow down to co-occurrences with specific words, and by using contextual restrictions (§4.3) we can focus on co-occurrences in specific papers or domains.

Expansions. Finally, for captured terms we also support linguistic expansions. After the term is matched, we can expand it to a larger linguistic environment based on the underlying syntactic sentence representation. An expansion is expressed by prefixing a term with angle brackets '⟨⟩': '⟨⟩inf:infection asymptomatic fatal' will capture the word "infection" under the variable "inf" and expand it to its surrounding noun phrase, capturing phrases like "malarialike infection", "asymptomatic infection", "chronic infection" and "a mild subclinical infection".

4.2 Sequential (Token) Queries

While boolean queries allow terms to appear in any order, we sometimes care about the exact linear placements of words with respect to each other. The term-specification, capture and expansion syntax is the same as in boolean queries, but here terms must match in the order given in the query. 'interspecies transmission' looks for the exact phrase "interspecies transmission", and 'tag=NNS transmission' looks for the word transmission immediately preceded by a plural noun. By capturing the noun ('which:tag=NNS transmission') we obtain a list of terms that includes the words "bacteria", "diseases", "nuclei" and "crossspecies".

Wildcards. Sequential queries can also use wildcard symbols: '*' (matching any single word), '...' (0 or more words), '...2-5...' (2-5 words). The query 'interspecies kind:...1-3... transmission' looks for the words "interspecies" and "transmission" with 1 to 3 intervening words, capturing the intervening words under "kind". First results include "host-host", "zoonotic", "virus", "TSE agent" and "and interclass".

Repetitions. We also allow specifying repetitions of terms. To do so, the term is enclosed in [] and followed by a quantifier. We support the standard list of regular-expression quantifiers: *, +, ?, {n,m}. For example, 'tag=DT [tag=JJ]* [tag=NN]+'.

4.3 Contextual Restrictions

Each query can be associated with contextual restrictions, which are secondary queries that operate on the same data and restrict the set of sentences that are considered for the main queries. These queries currently have the syntax of the Lucene query language.[7] Our system allows the secondary queries to condition on the paragraph the sentence appears in, and on the title, abstract, authors, publication data, publication venue and MeSH terms of the paper the sentence appears in. Additional sources of information are easy to add. For example, adding the contextual restriction '#d +title:cancer +mesh:"Age Distribution"' restricts query results to sentences from papers which have the word "cancer" in their title and whose MeSH terms include "Age Distribution". Similarly, '#d +title:/corona.*/ +year:[2015 TO 2020]' restricts queries to include sentences from papers published between 2015 and 2020 that have a word starting with corona in their title.

These secondary queries greatly increase the power of boolean, sequential and syntactic queries: one could look for interspecies transmissions that relate to certain diseases, or for sentence-level disease-chemical co-occurrences in papers that discuss specific sub-populations.

4.4 Example-based Syntactic Queries

Recent advances in machine learning brought with them accurate syntactic parsing, but parse trees remain hard to use. We remedy this by employing a novel query language we introduced in (Shlain et al., 2020), which is based on the principle of query-by-example.

The query is specified by starting with a simple natural-language sentence that conveys the desired syntactic structure, for example, 'BMP-6 induces the phosphorylation of Smad1'. Then, words can be marked as anchor words (that need to match exactly) or capture nodes (that are variables). Words can also be neither anchor nor capture, in which case they only support the scaffolding of the sentence. The system then translates the sentence with the captures-and-anchors syntax into a syntactic query graph, which is presented to the user. The user can then restrict capture nodes from "match anything" to matching specific terms (using the term-specification syntax as in boolean or token queries) and can likewise relax the exact-match constraints on anchor words. Like in other query types, capture nodes can be marked for expansion. The syntactic graph is then matched against the pre-parsed and indexed corpus.

This simple markup provides a rich syntax-based query system, while alleviating the user from the need to know linguistic syntax.

For example, consider the query below, the details of which will be discussed shortly:
'⟨⟩p1:[e=GENE_OR_GENE_PRODUCT]BMP-6 $induces the $phosphorylation $of ⟨⟩p2:Smad1'

The words 'induces', 'phosphorylation' and 'of' are anchors (designated by '$'), while 'p1' and 'p2' are captures for 'BMP-6' and 'Smad1'. Both capture nodes are marked for expansion using angle brackets ('⟨⟩'). Node p1 is restricted to match tokens with the same entity type as BMP-6 (indicated by 'e=GENE_OR...'). The query can be shortened by omitting the entity type and retaining only the entity restriction ('e'):
'⟨⟩p1:[e]BMP-6 $induces the $phosphorylation $of ⟨⟩p2:Smad1'

Here, the entity type is inferred by the system from the entity type of BMP-6.[8] The graph for the query is displayed in Figure 1. It has 5 tokens in a specific syntactic configuration determined by directed labeled edges. The 1st token must have the entity tag 'GENE_OR...', the 2nd, 3rd and 4th tokens must be the exact words "induces phosphorylation of", and the 5th is unconstrained. Sentences whose syntactic graph has a subgraph that aligns to the query and adheres to the constraints will match the query. Examples of matching sentences are:
- [ERK activation]p1 induces phosphorylation of [Elk-1]p2.
- [Thrombopoietin]p1 activates human platelets and induces tyrosine phosphorylation of [p80/85 cortactin]p2.

Figure 1: Query graph of the syntactic query '⟨⟩p1:[e]BMP-6 $induces the $phosphorylation $of ⟨⟩p2:Smad1'.

[7] https://lucene.apache.org/core/6_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html
[8] Similarly, we could specify '$[lemma]induces', resulting in the restriction 'lemma=induce' instead of 'word=induces' for the anchor.
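The matching semantics of §4.4 can be sketched with a toy brute-force matcher (the real system compiles the marked-up example into an indexed graph query, which is vastly faster). Anchors constrain the word form, capture nodes constrain the entity type, and a sentence matches when its dependency graph contains a subgraph aligned with the query edges. The parse below is a hand-simplified invention, not real parser output.

```python
from itertools import permutations

# Invented, simplified parse of 'Thrombopoietin induces phosphorylation of cortactin'.
sentence = {
    "tokens":   ["Thrombopoietin", "induces", "phosphorylation", "of", "cortactin"],
    "entities": ["GENE_OR_GENE_PRODUCT", "O", "O", "O", "GENE_OR_GENE_PRODUCT"],
    # dependency edges as (head index, dependent index, label)
    "edges":    {(1, 0, "nsubj"), (1, 2, "dobj"), (2, 4, "nmod")},
}

# Derived from '⟨⟩p1:[e]BMP-6 $induces the $phosphorylation $of ⟨⟩p2:Smad1':
# plain node names are anchors (exact word match); capture nodes carry the
# entity type inferred from the example sentence.
query_edges = [("induces", "p1", "nsubj"),
               ("induces", "phosphorylation", "dobj"),
               ("phosphorylation", "p2", "nmod")]
captures = {"p1": "GENE_OR_GENE_PRODUCT", "p2": "GENE_OR_GENE_PRODUCT"}

def node_ok(name, i, sent):
    if name in captures:                      # capture node: entity must match
        return sent["entities"][i] == captures[name]
    return sent["tokens"][i] == name          # anchor node: exact word match

def match(query_edges, sent):
    """Align query nodes to token indices so that every query edge exists."""
    nodes = sorted({n for h, d, _ in query_edges for n in (h, d)})
    deps = {(h, d): lbl for h, d, lbl in sent["edges"]}
    for assign in permutations(range(len(sent["tokens"])), len(nodes)):
        m = dict(zip(nodes, assign))
        if all(node_ok(n, m[n], sent) for n in nodes) and \
           all(deps.get((m[h], m[d])) == lbl for h, d, lbl in query_edges):
            return {c: sent["tokens"][m[c]] for c in captures}
    return None

print(match(query_edges, sentence))  # {'p1': 'Thrombopoietin', 'p2': 'cortactin'}
```

The sketch omits expansion (growing a capture to its surrounding noun phrase) and the anchor relaxations of footnote 8; it only illustrates the core subgraph-alignment step.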
The sentence tokens corresponding to the p1 and p2 graph nodes will be bound to variables with these names: {p1=ERK, p2=Elk-1} for the first sentence and {p1=Thrombopoietin, p2=p80/85 cortactin} for the second.

5 Example Workflow: Risk-factors

We describe a workflow which is based on using our extractive search system over a corpus of all PubMed abstracts. While the described researcher is hypothetical, the results we discuss are real.

Consider a medical researcher who is trying to compile an up-to-date list of the risk factors for stroke. A PubMed search for "risk factors for stroke" yields 3317 results, and reading through all results is impractical. A Google query for the same phrase brings up an info box from the NHLBI[9] listing 16 common risk factors including high blood pressure, diabetes, heart disease, etc. Having a curated list which clearly outlines the risk factors is helpful, but curated lists or survey papers will often not include rare or recent research findings.

The researcher thus turns to extractive search and tries an exploratory boolean query: 'risk factor stroke'. The figure shows the top results for the query, and the majority of sentences retrieved indeed specify specific risk factors for stroke. This is an improvement over the PubMed results, as the researcher can quickly identify the risk factors discussed without going through the different papers.

Furthermore, the top results contain risk factors like migrane or C804A polymorphism not listed in the NHLBI knowledge base. However, the full result list is lengthy, and extracting all the risk factors from it manually would be tedious. Instead, the researcher notes that many of the top results are variations on the "X is a risk factor for stroke" structure. She thus continues by issuing the following syntactic query, where a capture labeled r is used to directly capture the risk factors: 'r:Diabetes is a $risk $factor for $stroke'.

The figure shows the top results for the query, and the risk factors are indeed labeled with r as expected. Unfortunately, some of the captured risk-factor names are not fully expanded. For example, we capture syndrome instead of metabolic syndrome and temperature instead of low temperature. Being interested in capturing the full names, the researcher adds angle brackets '⟨⟩' to expand the captured elements: '⟨⟩r:Diabetes is a $risk $factor for $stroke'. The full names are now captured as expected.

Now that the researcher has verified that the query yields relevant results, she clicks the download button to download the full result set. The resulting tab-separated file has 1212 rows. Each row includes a result sentence, the captured elements in it (in this case, just the risk factor), and their offsets. Using a spreadsheet to group the rows by risk factor and order the results by frequency, the researcher obtains a list of 640 unique risk factors, 114 of them appearing more than once in the data. Figure 2a lists the top results.

Reviewing the list, the researcher decides that she's not interested in general risk factors, but rather in diseases only, so she modifies the query by adding an entity restriction to the 'r' capture.

Figure 2: Grouped and ranked results. (a) ranked risk factors for stroke; (b) ranked disease risk factors.

[9] https://www.nhlbi.nih.gov/health-topics/stroke
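The spreadsheet step in the workflow above, grouping the exported rows by the captured risk factor and ranking by frequency, can be sketched in a few lines. The rows here are invented stand-ins for the real 1212-row export.

```python
from collections import Counter

# Invented stand-ins for rows of the downloaded TSV: one row per matching
# sentence, with the captured risk factor in its own column ("r").
rows = [
    {"sentence": "Diabetes is a risk factor for stroke.", "r": "diabetes"},
    {"sentence": "Hypertension is a major risk factor for stroke.", "r": "hypertension"},
    {"sentence": "Diabetes mellitus is an independent risk factor for stroke.", "r": "diabetes"},
]

# Group by the captured value and rank by frequency, as done with a
# spreadsheet in the workflow above.
ranked = Counter(row["r"] for row in rows).most_common()
print(ranked)  # [('diabetes', 2), ('hypertension', 1)]
```

On the real export, the same two lines of grouping code yield the 640 unique risk factors the researcher inspects in Figure 2a.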
As seen in the query graph, even though the researcher didn't specify the exact entity type, the query parser correctly resolved it to DISEASE. The results now include diseases like sleep apnoea and hypertension, but do not include smoking, age and alcohol (see Figure 2b).

Analyzing the results, the researcher now wants to compare the risk factors in the general population to ones listed in research papers dealing with children and infants. Luckily, such papers are indexed with corresponding MeSH terms, and the researcher can utilize this fact by appending '#d mesh:Child mesh:Infant -mesh:Adult' to her query. In cases where a desired MeSH term does not exist, an alternative approach is filtering the results based on words in the abstract or title. For example, appending '#d abstract:child abstract:children' to a query will ensure that the result sentences come from abstracts which contain the word child or the word children.

Happy with the results of the initial query, the researcher can further augment her list by querying for other structures which identify risk factors (e.g. '⟨⟩r:Diabetes $causes $stroke', '$risk $factors for $stroke $include ⟨⟩r:Diabetes', etc.).

Importantly, once the researcher has identified one or more effective queries to extract the risk factors for stroke, the queries can easily be modified in useful ways. For example, with a small modification to our original query we can extract:
risk factors for cancer: '⟨⟩r:Diabetes is a $risk $factor for $cancer'
diseases which can be caused by smoking: '$Smoking is a $risk $factor for ⟨⟩d:[e]stroke'
an ad-hoc KB of (risk factor, disease) tuples (for self use or as an easily queryable public resource): '⟨⟩r:Diabetes is a $risk $factor for ⟨⟩d:[e]stroke'

6 Example Workflow: CORD-19

The COVID-19 Open Research Dataset (Wang et al., 2020) is a collection of 45,000 research papers, including over 33,000 with full text, about COVID-19 and the coronavirus family. The corpus was released by the Allen Institute for AI and associated partners in an attempt to encourage researchers to apply recent advances in NLP to the data to generate insights.

Identifying COVID-19 Aliases. Since the CORD-19 corpus includes papers about the entire coronavirus family of viruses, it's useful to identify papers and sentences dealing specifically with COVID-19. Before converging on the acronym COVID-19, researchers have referred to the virus by many names: nCov-19, SARS-COV-ii, novel coronavirus, etc. Luckily, it's fairly easy to identify many of these aliases using a sequential pattern: 'novel coronavirus ( ⟨⟩alias:...1-2... )'. The pattern looks for the words "novel coronavirus" followed by an open parenthesis, one or two words which are captured under the 'alias' variable, and a closing parenthesis. The query retrieves 52 unique candidate aliases for COVID-19, though some of them refer to older coronaviruses such as "MERS", or non-relevant terms such as "Fig2". After ranking by frequency and validating the results, we can reuse the pattern on newly retrieved aliases to extend the list. Through this iterative process we quickly compiled a list of 47 aliases. We marked all occurrences of these terms in the underlying corpus as a new entity type, COVID-19, and re-indexed the dataset with this entity information.

Exploring Drugs and Treatments. To explore drugs and treatments for COVID-19, we search the corpus for chemicals co-occurring with the COVID-19 entity using a boolean query: 'chemical:e=SIMPLE_CHEMICAL|CHEMICAL e=COVID-19'. Table 1(a) shows the top matching chemicals by frequency. While some of the substances listed, like Chloroquine and Remdesivir, are drugs being tested for treating COVID-19, others are only hypothesized as useful or appear in other contexts.

To guide the search toward therapeutic substances in different stages of maturity, we can add indicative terms to the query. For example, the following query can be used to detect substances at the stage of clinical trials: 'chemical:e=SIMPLE_CHEMICAL|CHEMICAL e=COVID-19 l=trial|experiment', while adding 'l=suggest|hypothesize|candidate' can assist in detecting substances in the ideation stage. Table 1(b,c) shows the frequency distributions of the chemicals resulting from the two queries.
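The alias-harvesting sequential pattern above can be approximated outside the system with a plain regular expression: the two anchor words, an opening parenthesis, one or two captured tokens, and a closing parenthesis. The example texts are invented; the real query runs over the indexed CORD-19 corpus.

```python
import re

# Rough regex equivalent of 'novel coronavirus ( ⟨⟩alias:...1-2... )':
# capture one token, optionally followed by a second, between parentheses.
pattern = re.compile(r"novel coronavirus\s*\((\s*\S+(?:\s+\S+)?)\s*\)")

texts = [  # invented example sentences
    "The novel coronavirus (2019-nCoV) emerged in December.",
    "Cases of the novel coronavirus (SARS-CoV-2) are rising.",
    "A novel coronavirus (COVID-19 virus) was identified.",
]
aliases = [m.group(1).strip() for t in texts for m in pattern.finditer(t)]
print(aliases)  # ['2019-nCoV', 'SARS-CoV-2', 'COVID-19 virus']
```

As in the paper's workflow, the harvested candidates would then be ranked by frequency, validated by hand, and fed back in as new anchor phrases to extend the alias list iteratively.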
34 (a) Unrestricted Treatments (via syntactic query) nucleic acid (171), chloroquine (118), nucleotide (115), NCP ribavirin (11), oseltamivir (9), ECMO (6), convales- (87), CR3022 (47), Ksiazek (46), IgG (45), lopinavir/ritonavir cent plasma (4), TCM (3), LPV/r (3), three fusions of (42), ECMO (40), LPV/r (35), corticosteroids (35), oxygen MSCs (2), supportive care (2), protective conditions (2), (32), ribavirin (31), lopinavir (31), Hydroxychloroquine (30), lopinavir/ritonavir (2), intravenous remdesivir (2), hydrox- amino acid (30), ritonavir (27), corticosteroid (24), Sofosbuvir ychloroquine (2), HCQ (2), glucocorticoids (2), FPV (2), ef- (22), amino acids (22), HCQ (19), glucocorticoids (19) fective isolation (2), chloroquine (2), caution (2), bDMARDs (b) Trial (2), azithromycin (2), ARBs (2), antivirals (2), ACE inhibitors (2), 500 mg chloroquine (2), masks (1) chloroquine (29), Remdesivir (8), LPV/r (7), lopinavir (6), HCQ (6), ritonavir (4), Arbidol (4), Sofosbuvir (3), nu- Table 2: Top elements occurring in the syntactic cleotide (3), nucleic acid (3), lopinavir/ritonavir (3), CQ “treated with X” configuration. Note that this query (3), oseltamivir (2), NCT04257656 (2), NCT04252664 (2), does not rely on NER information. Meplazumab (2), Hydroxychloroquine (2), glucocorticoids(2), CEP(2) The top results by frequency are shown in Table (c) Ideation 2. The top ranking results show many of the chem- chloroquine (6), ritonavir (5), S-RBD (4), nucleotide (4), icals obtained by equivalent boolean queries10, but Lopinavir (4), CR3022 (4), Ribavirin (3), nucleic acid (3), interestingly, they also contain non-chemical treat- logP (3), Li (3), ledipasvir (3), IgG (3), HCQ (3), TGEV (2), ments like supportive care, isolation and masks. 
teicoplanin (2), nelfinavir (2), NCP (2), HWs (2) glucocorti- This demonstrates a benefit of using entity agnos- coids (2), ENPEP (2), ECMO (2), darunavir (2), creatinine tic syntactic patterns even in cases where a strong (2), creatine (2), CQ (2), corticosteroid (2), CEP (2), ARB (2) NER model exists. Table 1: Top chemicals co-occuring with the COVID- 19 entity and their counts. (a) Unrestricted. (b) with 7 More Examples Trial related terms. (c) with Ideation related terms. While the workflows discussed above pertain mainly to the medical domain, the system is op- timized for the broader life science domain. Here While the queries are very basic and include only are a sample of additional queries, showing differ- a few terms for each category, the difference is ent potential use-cases. clearly noticeable: while the Malaria drug Chloro- Which genes regulate a cell process: quine tops both lists, the antiviral drug Remdesivir ‘ p1:[e]CD95v:[l]regulates p:[e]apoptosis’ which is currently tested for COVID-19 is second hi hi Which specie is the natural host of a disease: on the list of trial related drugs but does not appear ‘ host:[e]bat is a$natural$host of at all as a top result for ideation related drugs. hi disease:[e]coronavirus’ hi Importantly, entity co-mention queries like the Documented LOF mutations in genes: ones above rely on the availability and accuracy ‘$loss$of$function m:[w]mutation in gene:[e]PAX8’ of underlying NER models. As we’ve seen in hi hi Section5, in cases where the relevant types are 8 Conclusion not extracted by NER, syntactic queries can be used instead. For example the following query We presented a search system that targets extract- captures sentences including chemicals being used ing facts from a biomed corpus and demonstrated on patients (the abstract or paragraph are required its utility in a research and a clinical context over to include COVID-19 related terms). CORD-19 and PubMed. 
The system works in an ‘he was$treated$with a chem:treatment Extractive Search paradigm which allows rapid in- hi #d paragraph:ncov* paragraph:covid* abstract:ncov* formation seeking practices in 3 modes: boolean, abstract:covid*’ sequential and syntactic. The interactive and flexi- ble nature of the system makes it suitable for users in different levels of sophistication.
10 To get more comprehensive coverage we can issue queries for other syntactic structures, like ‘chem:chemical was used$in$treatment’, and combine the results of the different queries.
Acknowledgements

The work performed at BIU is supported by funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No. 802774 (iEXTRACT).
Improving Biomedical Analogical Retrieval with Embedding of Structural Dependencies
Amandalynne Paullada∗, Bethany Percha†, Trevor Cohen‡ ∗Department of Linguistics, University of Washington, Seattle, WA, USA † Dept. of Medicine and Dept. of Genetics & Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA ‡Dept. of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA [email protected], [email protected], [email protected]
Abstract

Inferring the nature of the relationships between biomedical entities from text is an important problem due to the difficulty of maintaining human-curated knowledge bases in rapidly evolving fields. Neural word embeddings have earned attention for an apparent ability to encode relational information. However, word embedding models that disregard syntax during training are limited in their ability to encode the structural relationships fundamental to cognitive theories of analogy. In this paper, we demonstrate the utility of encoding dependency structure in word embeddings in a model we call Embedding of Structural Dependencies (ESD) as a way to represent biomedical relationships in two analogical retrieval tasks: a relationship retrieval (RR) task, and a literature-based discovery (LBD) task meant to hypothesize plausible relationships between pairs of entities unseen in training. We compare our model to skip-gram with negative sampling (SGNS), using 19 databases of biomedical relationships as our evaluation data, with improvements in performance on 17 (LBD) and 18 (RR) of these sets. These results suggest embeddings encoding dependency path information are of value for biomedical analogy retrieval.

1 Introduction

Distributed vector space models of language have been shown to be useful as representations of relatedness and can be applied to information retrieval and knowledge base augmentation, including within the biomedical domain (Cohen and Widdows, 2009). A vast amount of knowledge on biomedical relationships of interest, such as therapeutic relationships, drug-drug interactions, and adverse drug events, exists in largely human-curated knowledge bases (Zhu et al., 2019). However, the rate at which new papers are published means new relationships are being discovered faster than human curators can manually update the knowledge bases. Furthermore, it is appealing to automatically generate hypotheses about novel relationships given the information in scientific literature (Swanson, 1986), a process also known as ‘literature-based discovery.’ A trustworthy model should also be able to reliably represent known relationships that are validated by existing literature.

Neural word embedding techniques such as word2vec1 and fastText2 are a widely-used and effective approach to the generation of vector representations of words (Mikolov et al., 2013a) and biomedical concepts (De Vine et al., 2014). An appealing feature of these models is their capacity to solve proportional analogy problems using simple geometric operators over vectors (Mikolov et al., 2013b). In this way, it is possible to find analogical relationships between words and concepts without the need to specify the relationship type explicitly, a capacity that has recently been used to identify therapeutically-important drug/gene relationships for precision oncology (Fathiamini et al., 2019). However, neural embeddings are trained to predict co-occurrence events without consideration of syntax, limiting their ability to encode information about relational structure, which is an essential component of cognitive theories of analogical reasoning (Gentner and Markman, 1997). Additionally, recent work (Peters et al., 2018) has found that contextualized word embeddings from language models such as ELMo, when evaluated on analogy tasks, perform worse on semantic relation tasks than static embedding models.

The present work explores the utility of encoding syntactic structure in the form of dependency paths into neural word embeddings for analogical

1 https://github.com/tmikolov/word2vec
2 https://fasttext.cc/
Proceedings of the BioNLP 2020 workshop, pages 38–48, Online, July 9, 2020. © 2020 Association for Computational Linguistics
Figure 1: Overview of training and evaluation pipeline. Two embedding models, Embedding of Structural Dependencies (ESD) and Skip-gram with Negative Sampling (SGNS), are trained on data from a corpus of ≈70 million sentences from Medline. The resulting representations are then evaluated on data collected from biomedical knowledge bases.

retrieval of biomedical relations. To this end, we build and evaluate vector space models for representing biomedical relationships, using a corpus of dependency-parsed sentences from biomedical literature as a source of grammatical representations of relationships between concepts.

We compare two methods for learning biomedical concept embeddings, the skip-gram with negative sampling (SGNS) algorithm (Mikolov et al., 2013a) and Embedding of Semantic Predications (ESP) (Cohen and Widdows, 2017), which adapts SGNS to encode concept-predicate-concept triples. In the current work, we adapt ESP to encode dependency paths, an approach we call Embedding of Structural Dependencies (ESD). We train ESD and SGNS on a corpus of approximately 70 million sentences from biomedical research paper abstracts from Medline, and evaluate each model's ability to solve analogical retrieval problems derived from various biomedical knowledge bases. We train ESD on concept-path-concept triples extracted from these sentences, and SGNS on full sentences that have been minimally preprocessed with named entities (see §3). Figure 1 shows the pipeline from training to evaluation.

From an applications perspective, we aim to evaluate the utility of these representations of relationships for two tasks. The first involves correctly identifying a concept that is related in a particular way to another concept, when this relationship has already been described explicitly in the biomedical literature. This task is related to the NLP task of relationship extraction, but rather than considering one sentence at a time, distributional models represent information from across all of the instances in which this pair have co-occurred, as well as information about relationships between similar concepts. We refer to this task as relationship retrieval (RR). The second task involves identifying concepts that are related in a particular way to one another, where this relationship has not been described in the literature previously. We refer to this task as literature-based discovery (LBD), as identifying such implicit knowledge is the main goal of this field (Swanson, 1986).

We evaluate on four kinds of biomedical relationships, characterized by the semantic types of the entity pairs involved, namely chemical-gene, chemical-disease, gene-gene, and gene-disease relationships.

The following paper is structured as follows. §2 describes vector space models of language as they are evaluated for their ability to solve proportional analogy problems, as well as prior work in encoding dependency paths for downstream applications in relation extraction. §3 presents the dependency path corpus from Percha and Altman (2018). §4 summarizes the knowledge bases from which we develop our evaluation data sets. §5 describes the training details for each vector space model. §6 and §7 describe the methods and results for the RR and LBD evaluation paradigms. §8 and §9 offer discussion and conclude the paper. Code and evaluation data will be made available at https://github.com/amandalynne/ESD.

2 Background

We look to prior work in using proportional analogies as a test of relationship representation in the general domain with existing studies on vector space models trained on generic English. While our biomedical data is largely in English, we constrain
our evaluation to specific biomedical concepts and relationships as we apply and extend established methods.

Vector space models of language and analogical reasoning

Vector space models of semantics have been applied in information retrieval, cognitive science and computational linguistics for decades (Turney and Pantel, 2010), with a resurgence of interest in recent years. Mikolov et al. (2013a) and Mikolov et al. (2013b) introduce the skip-gram architecture. This work demonstrated the use of a continuous vector space model of language that could be used for analogical reasoning when vector offset methods are applied, providing the following canonical example: if x_i is the vector corresponding to word i, x_king − x_man + x_woman yields a vector that is close in proximity to x_queen. This result suggests that the model has learned something about semantic gender. They identified some other linguistic patterns recoverable from the vector space model, such as pluralization: x_apple − x_apples ≈ x_car − x_cars, and developed evaluation sets of proportional analogy problems that have since been widely used as benchmarks for distributional models (see for example Levy et al. (2015)).

However, work soon followed that pointed out some of the shortcomings of attributing these results to the models' analogical reasoning capacity. For example, Linzen (2016) showed that the vector for ‘queen’ is itself one of the nearest neighbors to the vector for ‘woman,’ and so it can be argued that the model does not actually learn relational information that can be applied to analogical reasoning, but rather, can rely on the direct similarity between the target terms in the analogy to produce desirable results. Furthermore, Gladkova et al. (2016) introduce the Better Analogy Test Set (BATS) to provide an evaluation set for analogical reasoning that includes a broader set of semantic and syntactic relationships between words. This set proved far more challenging for embedding-based approaches. Newman-Griffis et al. (2017) provide results of vector offset methods applied to a dataset of biomedical analogies derived from UMLS triples, showing that certain biomedical relationships are more difficult to learn with analogical reasoning than others.

Because the aim of this project is to robustly learn a handful of biomedical relationships, we are less concerned about the linguistic generalizability of these particular representations, but future work will examine the application of these vector space models to analogies in the general domain.

Dependency embeddings

Levy and Goldberg (2014a) adapt the SGNS model to encode direct dependency relationships, rather than dependency paths. In this approach, a dependency-type/relative pair is treated as a target for prediction when the head of a phrase is observed (e.g. P(scientist/nsubj | discovers)). The dependency-based skipgram embeddings were shown to better reflect the functional roles of words than those trained on narrative text, which tended to emphasize topical associations. Recent work (Zhang et al. (2018), Zhou et al. (2018), Li et al. (2019)) has also integrated dependency path representations in neural architectures for biomedical relation extraction, framing it as a classification task rather than an analogical reasoning task. The work of Washio and Kato (2018) is perhaps the most closely related to our approach, in that neural embeddings are trained on word-path-word triples. Aside from our application of domain-specific Named Entity Recognition (NER), a key methodological difference between this work and the current work is that their approach represents word pairs as a linear transformation of the concatenation of their embeddings, while we use XOR as a binding operator (following the approach of Kanerva (1996)), which was first used to model biomedical analogical retrieval with semantic predications extracted from the literature by Cohen et al. (2011)3. On account of the use of a binding operator, individual entities, pairs of entities and dependency paths are all represented in a common vector space.

3 Text Data

We train both the ESD and SGNS models on data released by Percha and Altman (2018). This corpus4 consists of about 70 million sentences from a subset of MEDLINE (approximately 16.5 million abstracts) which have PubTator (Wei et al., 2013) annotations applied to identify phrases that denote names of chemicals (including drugs and other chemicals of interest), genes (and the proteins they code for), and diseases (including side effects

3 For related work, see Widdows and Cohen (2014).
4 Version 7 of the corpus retrieved at https://zenodo.org/record/3459420
Figure 2: Example of a path of dependencies between two entities of interest. The full parse is not shown, but rather, the minimum path of dependency relations between the two entities given the sentence.

and other phenotypes). Throughout this paper, we use these shorthand names for each of these categories, following the convention established in Wei et al. (2013) and followed by Percha and Altman (2018).

The following example sentence from an article processed by PubTator shows how multi-word phrases that denote biomedical entities of interest, in this case atypical depression and seasonal affective disorder, are concatenated by underscores to constitute single tokens:

Chromium has a beneficial effect on eating-related atypical symptoms of depression, and may be a valuable agent in treating atypical_depression and seasonal_affective_disorder.

Percha and Altman (2018) also provide pruned Stanford dependency (De Marneffe and Manning, 2008) parses for the sentences in the corpus, consisting, for each sentence, of the minimal path of dependency relations connecting pairs of biomedical named entities identified by PubTator. Specifically, they extract dependency paths that connect chemicals to genes, chemicals to diseases, genes to diseases, and genes to genes. Figure 2 shows an example of a dependency path of relations between two terms, risperidone and rage. We use these dependency paths as representations for predicates that denote biomedical relationships of interest by concatenating the string representations of each path element, which are shown below the sentence in Figure 2. Following Percha and Altman (2018), we exclude paths that denote a coordinating conjunction between elements and paths that denote an appositive construction, both of which are highly common in the set. In this corpus of 70 million sentences, there are about 44 million unique dependency paths that connect concepts of interest, the vast majority (around 40 million) of which appear just once in the corpus. 540,011 of these paths appear at least 5 times in the corpus.

4 Knowledge Bases

We construct our evaluation data sets with exemplars from knowledge bases for four primary kinds of biomedical relationships, characterized by the interactions between pairs of entities of the following types: chemical-gene, chemical-disease, gene-disease, and gene-gene.

We evaluate on pairs of entities from the following knowledge bases: DrugBank (Wishart et al., 2018), Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005), PharmGKB (PGKB) (Whirl-Carrillo et al., 2012), Reactome (Fabregat et al., 2016), Side Effect Resource (SIDER) (Kuhn et al., 2016), and Therapeutic Target Database (TTD) (Wang et al., 2020).

Each knowledge base consists of pairs of entities that relate in a specific way. For example, SIDER Side Effects consists of chemical-disease-typed pairs such that the chemical is known to have the disease as a side effect, e.g. (sertraline, insomnia). Meanwhile, another chemical-disease pair from a different database, Therapeutic Target Database (TTD) indications, is such that the chemical is indicated as a treatment for the disease, e.g. (carphenazine, schizophrenia). In constructing our evaluation sets, we process all terms such that they are lower-cased, and multi-word terms are concatenated by underscores. Furthermore, we eliminate from our evaluation sets any knowledge base terms that do not appear in the training corpus described in §3 at least 5 times. It should be noted that across these sets, a single biomedical entity may appear with numerous spellings and naming conventions.

Table 2 shows the corresponding relationship type for each of the knowledge bases we use, as well as the number of pairs from each that are used in our evaluation data. The relationship retrieval data consists of knowledge base pairs that appear in our training corpus connected by a dependency
path at least once, while the literature-based discovery targets are those knowledge base pairs that do not appear connected by a dependency path in the corpus.

5 Training Details

SGNS. With SGNS, a shallow neural network is trained to estimate the probability of encountering a context term, t_c, within a sliding window centered on an observed term, t_o. The training objective involves maximizing this probability for true context terms, P(t_c | t_o), and minimizing it for randomly drawn counterexamples ¬t_c, P(¬t_c | t_o), with probability estimated as the sigmoid function of the scalar product between the input weight vector of the observed term and the output weight vector of the context term, σ(t_o · t_c). We used the Semantic Vectors5 implementation of SGNS (which performs similarly to the fastText implementation across a range of analogical retrieval benchmarks (Cohen and Widdows, 2018)) to train 250-dimensional embeddings, with a sliding window radius of two, on the complete set of full sentences from the corpus described in §3 as the training corpus. As previously mentioned, multi-word phrases corresponding to named entities recognized by the PubTator system in these sentences are concatenated by underscores, and consequently receive a single vector representation.

ESD. With ESD, a shallow neural network is trained to estimate the probability of encountering the object, o, of a subject-predicate-object triple sPo. The training objective involves maximizing this probability for true objects, P(o | s, P), and minimizing it for randomly drawn counterexamples ¬o, P(¬o | s, P). We adapted the Semantic Vectors implementation of ESP to encode dependency paths, with binary vectors as representational basis (Widdows and Cohen, 2012) and the non-negative normalized Hamming distance (NNHD) to estimate the similarity between them:

NNHD = max(0, 1 − (2 × Hamming distance) / dimensionality)

With this representational paradigm, probability can be estimated as NNHD(o, s ⊗ P), where ⊗ represents the use of pairwise exclusive OR as a binding operator, in accordance with the Binary Spatter Code (Kanerva, 1996). While ESP was originally developed to encode knowledge extracted from the literature using a small set of predefined predicates (e.g. TREATS), we adapt it here to encode a large variety (n=546,085) of dependency paths. For training, we concatenate the dependency relations (the underscored parts in Figure 2) into a single predicate token for which a vector is learned. Some examples of path tokens (concatenated dependency relations) can be seen in Table 1. Unlike the original ESP implementation where predicate vectors were held constant, we permit dependency path vectors to evolve during training6. Further details on ESP can be found in (Cohen and Widdows, 2017). For the current work, we set the dimensionality at 8000 bits (as this is equivalent in representational capacity to 250-dimensional single precision real vectors). For ESD, Table 1 shows the nearest neighboring dependency path vectors to the bound product I(metformin) ⊗ O(diabetes), illustrating paths that indicate the relationship between these terms, and ESD's capability to learn similar representations for paths with similar meaning.

Both SGNS and ESD were trained over five epochs, with a subsampling threshold of 10^-5, a minimum term frequency threshold of 5 (which includes concatenated dependency paths for ESD), and a maximum frequency threshold of 10^6.

6 Evaluation Methods

We use a proportional analogy ranked retrieval task for both the RR and LBD tasks, following prior work as described in §2. Figure 3 visualizes this process. From a set of (X, Y) entity pairs from a knowledge base, given a term C and all terms D such that (C, D) is a pair in the set, we select n random (A, B) cue pairs from a disjoint set of pairs. We refer to (C, D) pairs as ‘target pairs,’ correct D completions as ‘targets,’ and (A, B) pairs as ‘cues.’ The vectors for the cue terms (A, B) and the term C are summed in the following fashion to produce the resulting vector v. Given an analogical pair A:B::C:D, where A and C, B and D are of the same semantic type, respectively, we develop cue vectors for the target D in each model as follows:

SGNS: v = B − A + C
ESD: v = I(A) ⊗ O(B) ⊗ I(C)

5 https://github.com/semanticvectors/semanticvectors
6 This capability has been used to predict drug interactions, with performance exceeding that of models with orders of magnitude more parameters (Burkhardt et al., 2019).
SCORE  PATH
0.974  controlled nmod start_entity end_entity amod controlled
0.935  add-on nmod start_entity end_entity amod add-on
0.565  reduces nsubj start_entity reduces dobj requirement requirement nmod end_entity
0.537  associated compound start_entity end_entity nsubj associated
0.516  start_entity conj efficacy efficacy acl treating treating dobj end_entity
0.438  treatment amod start_entity treatment nmod end_entity

Table 1: Nearest neighboring dependency path embeddings to I(metformin) ⊗ O(diabetes), where I and O indicate input and output weight vectors respectively.
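The path tokens shown in Table 1 are simply the dependency relations along the minimal path joined into one predicate string. A minimal sketch of that concatenation step, using a hypothetical path in the spirit of the table's last row (the element structure and the "|" separator are assumptions for illustration, not the corpus's actual serialization):

```python
# Hypothetical minimal dependency path between two flagged entities,
# expressed as (head, relation, dependent) steps; real paths come from the
# pruned Stanford dependency parses in the Percha and Altman (2018) corpus.
path = [
    ("treatment", "amod", "start_entity"),
    ("treatment", "nmod", "end_entity"),
]

def path_token(path):
    """Concatenate the string representation of each path element into a
    single predicate token, for which one vector is then learned (as in ESD)."""
    return "|".join("_".join(step) for step in path)

print(path_token(path))  # treatment_amod_start_entity|treatment_nmod_end_entity
```

Because the whole path collapses to one token, a frequency threshold over tokens (here, the minimum count of 5) directly prunes the long tail of paths seen only once.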
Figure 3: Overview of analogical ranked retrieval paradigm.

where I and O represent the input and output weight vectors of the ESD model, respectively. The SGNS method is the same as the 3COSADD method described in Levy and Goldberg (2014b).

A K-nearest neighbor search is performed for v (using cosine distance for SGNS, NNHD for ESD) over the search space, and we record the ranks for each correct D target. The search space is constrained such that it consists of those terms from our training corpus that have a vector in both ESD and SGNS, a total of about 300,000 terms overall. For ESD, this space consists of the output weight vectors for each concept. For the proportional analogy task using K-nearest neighbors to rank completions to the analogy, the desired outcome is for the correct targets to be highly similar to the analogy cue vector v, such that the highest ranks are assigned to the correct target terms D in a search over the entire vector space. In this fashion, we perform this KNN search for every (X, Y) pair in the knowledge base and record the ranks for correct targets. We then compare the ranks of terms D across both vector spaces; the higher the ranks, the better the model is at capturing relational similarity.

Table 2 shows, for each knowledge base, how many total unique X terms and total (X, Y) pairs are used for each task. Additionally, we show the average number of correct Y terms per X and the maximum number of correct Y terms per X. For the relationship retrieval task, we consider those (X, Y) pairs which are connected by at least one dependency path in our corpus. Meanwhile, (X, Y) pairs for the LBD task must not be connected by a dependency path in the corpus (we treat these held-out pairs as a proxy for estimating the quality of novel hypotheses). We know from the (X, Y) pair's presence in the knowledge base that it is a gold standard pair for the given relationship type, but from the models' perspective this information is not available from the text alone. Thus, we believe it is a good test of the models' ability to generate plausible hypotheses. To reiterate, the methodology for both the relationship retrieval and literature-based discovery evaluations is the same; the only difference is in which pairs of terms from each knowledge base are used for evaluation data.

We examine the role of increasing the number of cues in improving retrieval. For example, for a given (C, D) target pair, we can combine vectors
                            Relationship Retrieval                  Literature-based Discovery
                            Total X  Total Pairs  Mean Y/X  Max Y/X  Total X  Total Pairs  Mean Y/X  Max Y/X
Chem-Gene
  Gene Targets (DrugBank)      1626         6290         4      107     3569        37162        10      420
  PGKB                          535         2089         4       48     1563        28053        18      144
  Agonists (TTD)                148          172         1        3      307          462         2        7
  Antagonists (TTD)             188          200         1        2      508          620         1        5
  Gene Targets (TTD)           1179         1436         1        7     4088         6430         2       15
  Inhibitors (TTD)              522          669         1        7     1273         2082         2       15
Chem-Disease
  Side Effects (SIDER)          334         1289         4       31      892         6591         7       46
  Drug Indication (SIDER)      1077         2737         3       22     2160         8356         4       45
  Biomarker-Disease (TTD)       298          417         1       11      253          321         1        6
  Drug Indication (TTD)        1749         1958         1        6     2664         2999         1       10
Gene-Disease
  Disease Targets (TTD)         710         1502         2       22     1085         3088         3       27
  OMIM                         2197         2870         1        9     3461         5545         2       11
  PGKB                          600         1693         3       34     1609        12605         8       73
Gene-Gene
  Enzymes (DrugBank)            966         3622         4       33     1781        16242         9       71
  Carriers (DrugBank)           203          345         2       27      444         1174         3       18
  Transporters (DrugBank)       510         2357         5       44     1140        13889        12       94
  PGKB                          497         2595         5       50      940        14142        15       89
  Complex (Reactome)           1757         3061         2        9     2550         6593         3       31
  Reaction (Reactome)           579         1031         2        9     1274         4024         3       29

Table 2: Total unique X terms, total (X, Y) pairs, average number of correct Y terms per X, and maximum number of correct Y terms per X for each knowledge base.

for multiple (A, B) pairs with the C term vector to produce a final cue vector that is closer to the target D. When multiple cues are used, we superpose the cue vector for each of the cues, and normalize the resulting vector, with normalization of real vectors to unit length in SGNS, and normalization of binary vectors using the majority rule with ties split at random with ESD. Cues are always selected from the subset of knowledge base pairs that co-occur in our training corpus. We ensure that none of the (A, B) cue terms overlap with each other, nor with the (C, D) target terms, to assure that self-similarity does not inflate performance. We produced results for a range of 1, 5, 10, 25, and 50 cues, finding that the best results come from using 25 cues; we only report these resulting scores in §7.

As a baseline inspired partly by Linzen (2016), we compute the similarity of vectors for B and D terms and C and D terms compared directly to each other, omitting the analogical task.
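The superposition and normalization step described above can be sketched as follows, with toy random vectors standing in for trained embeddings: unit-length normalization of summed real vectors for SGNS, and a bitwise majority rule with random tie-breaking for ESD's binary vectors.

```python
import numpy as np

rng = np.random.default_rng(1)

def superpose_real(cues):
    """Sum real-valued cue vectors and normalize to unit length (SGNS)."""
    v = np.sum(cues, axis=0)
    return v / np.linalg.norm(v)

def superpose_binary(cues):
    """Bitwise majority rule over binary cue vectors, ties split at random (ESD)."""
    cues = np.asarray(cues)
    n = cues.shape[0]
    sums = cues.sum(axis=0)
    out = (sums * 2 > n).astype(np.uint8)      # bit is 1 where ones are the majority
    ties = sums * 2 == n                       # exactly half ones: split at random
    out[ties] = rng.integers(0, 2, ties.sum(), dtype=np.uint8)
    return out

# 25 toy cue vectors per model (the cue count the paper settles on):
real_cues = [rng.standard_normal(250) for _ in range(25)]
bin_cues = [rng.integers(0, 2, 8000, dtype=np.uint8) for _ in range(25)]

v_real = superpose_real(real_cues)   # unit length, up to floating point
v_bin = superpose_binary(bin_cues)   # still a 0/1 vector after majority vote
```

Note that with an odd number of cues, as here, the binary tie case never fires; the random tie-breaking only matters for even cue counts.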
The intuition the correct semantic type targeted by this analogy, here is that C and D terms are potentially close is actually a side effect of risperidone. A major- together in the vector space merely due to frequent ity of the retrieved results, however, are known or co-occurrence in the corpus, and any analogical plausible treatment targets. Meanwhile, most of the reasoning performance is merely relying on that top 10 terms retrieved by SGNS are names of other fact. Meanwhile, terms B and D can be close to- drugs that are similar to risperidone. Additionally, gether in the vector space simply because they are ‘psychiatric and visual disturbances’ and ‘tardive the same semantic type, and thus occur in similar dyskinesia’ are side effects of risperidone, not treat- contexts. In this case, relational analogy might not ment targets. Notably, all of the results retrieved explain the performance, but mere distributional with ESD are of the correct semantic type, i.e., they similarity. In the B:D comparison setting, cues B are disorders, while SGNS retrieves a mix of drugs are added together to create a single cue vector with and side effects. which to perform the KNN ranking over terms in Quantitative Results For each C term in each which to find the target term D. These cue terms evaluation set, we record the ranks of all D tar-
rank   ESD (ours)                                  SGNS
1      separation anxiety                          risperidone ×
2      schizophrenia                               olanzapine ×
3      depressed state                             quetiapine ×
4      bipolar mania                               aripiprazole ×
5      tardive oromandibular dystonia              clozapine ×
6      treatment of trichotillomania ∗             psychiatric and visual disturbances
7      pervasive developmental disorder (NOS) ∗    ziprasidone ×
8      borderline personality disorder             amisulpride ×
9      psychotic disorders                         paliperidone ×
10     mania                                       tardive dyskinesia
Table 3: Top 10 results for a K-nearest neighbor search over terms for treatment targets for the drug risperidone (an antipsychotic drug), using 25 (drug, indication) pairs from SIDER as cues. Bolded terms are correct targets, i.e., they are listed as treatment targets for risperidone in SIDER. ∗: a disorder that risperidone treats or might treat, based on external literature, or a synonym for a target from SIDER; ×: a chemical, i.e., something that could not be a treatment target for a drug.

get terms resulting from the K-nearest neighbor search. For ease of comparison, we normalize all raw ranks by the length of the full search space (324363 terms in total), and then subtract this value from 1 so that lower ranks (i.e., better results) are displayed as higher numbers, for ease of interpretation. For a baseline score, we ran a simulation in which the entire search space was shuffled randomly 100 times, and recorded the median ranks of the multiple target D terms, given some C. We find that the median rank for D terms in a randomly shuffled space tended toward the middle of the ranked list. Thus, the baseline score is established as 0.5; any score lower than this means the model performed worse than a random shuffle at retrieving target terms. In Table 4, 1 is the highest possible score, and 0 is the lowest.

We report results at 25 (A, B) cues, the setting for which performance was best for both ESD and SGNS. "Full" in Table 4 refers to evaluation with a full A:B::C:D analogy, while "B:D" refers to the baseline that compares vectors for terms directly, rather than using relational information. We do not report C:D comparison results, as they were categorically worse than both Full and B:D results.

8 Discussion

The results in Table 4 show that ESD outperforms SGNS on the RR task for 18 of 19 databases, and for 17 of 19 databases on the LBD task. It is clear that literature-based discovery is harder than relationship retrieval, as the scores are generally lower across the board for this task. We discuss the results for each task separately.

8.1 Relationship retrieval

For a total of 12 out of 19 sets, ESD on full analogies outperforms ESD on direct B:D comparisons, suggesting that the model has learned generalizable relationship information for these types of relations rather than relying on distributional term similarity. Because gene-gene pairs consist of entities of the same semantic type, it can be argued that B:D similarity should be very high, and yet scores are higher for the full analogy over the B:D baseline for most of these sets, for both ESD and SGNS. For SIDER side effects, the B:D baseline for ESD shows higher scores than the full analogy for both LBD and RR; one reason for this could be that there is a high degree of side effect overlap between drugs, and so the side effect terms themselves are highly similar to each other.

8.2 Literature-based discovery

The best performance on a majority of the sets comes from the ESD B:D model, suggesting that the model relies on term similarity over relational information for performance. Although SGNS does not perform the best overall, its full analogy model tends to outperform its B:D counterpart, suggesting that SGNS has managed to extrapolate relational information to the retrieval of held-out targets. As previously mentioned, performance on this task is made difficult by the lack of normalization of concepts across our datasets. Additionally, as Table 4 shows, several top-ranked terms are plausible analogy completions, but do not appear as
                                 Relationship retrieval                     LBD
                                 ESD (ours)         SGNS                    ESD (ours)         SGNS
                                 Full     B:D       Full     B:D            Full     B:D       Full     B:D
Chem-Gene
  Gene Targets (DrugBank)        0.912    0.897     0.839    0.212          0.715    0.806     0.496    0.250
  PGKB                           0.969    0.994     0.705    0.361          0.737    0.918     0.366    0.317
  Agonists (TTD)                 0.997    0.907     0.998    0.647          0.802    0.781     0.924    0.708
  Antagonists (TTD)              1.000    0.900     0.999    0.732          0.802    0.703     0.831    0.750
  Gene Targets (TTD)             0.998    0.867     0.994    0.387          0.746    0.760     0.625    0.479
  Inhibitors (TTD)               0.998    0.874     0.993    0.415          0.773    0.759     0.682    0.392
Chem-Disease
  Side Effects (SIDER)           0.997    0.999     0.967    0.942          0.952    0.994     0.799    0.932
  Drug Indication (SIDER)        1.000    0.995     0.949    0.588          0.969    0.988     0.663    0.605
  Biomarker-Disease (TTD)        0.996    0.997     0.944    0.781          0.932    0.977     0.799    0.726
  Drug Indication (TTD)          1.000    0.994     0.981    0.675          0.977    0.992     0.722    0.661
Gene-Disease
  Disease Targets (TTD)          0.990    0.997     0.900    0.711          0.887    0.989     0.663    0.648
  OMIM                           0.997    0.911     0.950    0.599          0.668    0.792     0.578    0.578
  PGKB                           0.982    0.996     0.781    0.624          0.836    0.969     0.592    0.618
Gene-Gene
  Enzymes (DrugBank)             1.000    1.000     0.987    0.981          0.979    0.999     0.900    0.975
  Carriers (DrugBank)            0.987    1.000     0.636    0.555          0.841    0.962     0.360    0.487
  Transporters (DrugBank)        1.000    1.000     0.974    0.947          0.996    0.999     0.870    0.951
  PGKB                           0.999    0.995     0.899    0.471          0.907    0.956     0.479    0.425
  Complex (Reactome)             1.000    0.819     1.000    0.206          0.866    0.731     0.838    0.399
  Reaction (Reactome)            1.000    0.917     0.996    0.273          0.878    0.826     0.699    0.366
Table 4: Results for relationship retrieval (RR) and literature-based discovery (LBD) for full analogy (A:B::C:D) and B:D retrieval. Scores are displayed here as the median of scores (1 - normalized rank) for all D terms in a knowledge base evaluation set.
gold-standard targets in the databases. Considering the case of SIDER, which is built from automatically extracted information (not human-curated), the plausible results here are missing from the database but are supported by evidence from published papers (e.g., Oravecz and Štuhec (2014)).

Acknowledgements

This research was supported by U.S. National Library of Medicine Grant No. R01 LM011563. The authors would like to thank the anonymous reviewers for their feedback.
9 Conclusion

We have compared two vector space models of language, Embedding of Structural Dependencies and Skip-gram with Negative Sampling, for their ability to represent biomedical relationships from literature in an analogical retrieval task. Our results suggest that encoding structural information in the form of dependency paths connecting biomedical entities of interest can improve performance on two analogical retrieval tasks, relationship retrieval and literature-based discovery. In future work, we would like to compare our methods with knowledge base completion techniques using contextualized vectors from language models as in Bosselut et al. (2019) as another method applicable to literature-based discovery.

References

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779, Florence, Italy. Association for Computational Linguistics.

Hannah A Burkhardt, Devika Subramanian, Justin Mower, and Trevor Cohen. 2019. Predicting adverse drug-drug interactions with neural embedding of semantic predications. In AMIA Annual Symposium Proceedings, volume 2019, page 992. American Medical Informatics Association.

Trevor Cohen and Dominic Widdows. 2009. Empirical distributional semantics: methods and biomedical applications. Journal of Biomedical Informatics, 42(2):390–405.

Trevor Cohen and Dominic Widdows. 2017. Embedding of semantic predications. Journal of Biomedical Informatics, 68:150–166.

Trevor Cohen and Dominic Widdows. 2018. Bringing order to neural word embeddings with embeddings augmented by random permutations (EARP). In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 465–475.

Trevor Cohen, Dominic Widdows, Roger Schvaneveldt, and Thomas C Rindflesch. 2011. Finding schizophrenia's Prozac: emergent relational similarity in predication space. In International Symposium on Quantum Interaction, pages 48–59. Springer.

Marie-Catherine De Marneffe and Christopher D Manning. 2008. The Stanford typed dependencies representation. In Coling 2008: Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1–8. Association for Computational Linguistics.

Lance De Vine, Guido Zuccon, Bevan Koopman, Laurianne Sitbon, and Peter Bruza. 2014. Medical semantic similarity with a neural language model. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 1819–1822.

Antonio Fabregat, Konstantinos Sidiropoulos, Phani Garapati, Marc Gillespie, Kerstin Hausmann, Robin Haw, Bijay Jassal, Steven Jupe, Florian Korninger, Sheldon McKay, et al. 2016. The Reactome pathway knowledgebase. Nucleic Acids Research, 44(D1):D481–D487.

Safa Fathiamini, Amber M Johnson, Jia Zeng, Vijaykumar Holla, Nora S Sanchez, Funda Meric-Bernstam, Elmer V Bernstam, and Trevor Cohen. 2019. Rapamycin − mTOR + BRAF = ? Using relational similarity to find therapeutically relevant drug-gene relationships in unstructured text. Journal of Biomedical Informatics, 90:103094.

Dedre Gentner and Arthur B Markman. 1997. Structure mapping in analogy and similarity. American Psychologist, 52(1):45.

Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: What works and what doesn't. In Proceedings of the NAACL-HLT SRW, pages 47–54, San Diego, California. ACL.

Ada Hamosh, Alan F Scott, Joanna S Amberger, Carol A Bocchini, and Victor A McKusick. 2005. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research, 33(suppl 1):D514–D517.

Pentti Kanerva. 1996. Binary spatter-coding of ordered k-tuples. In International Conference on Artificial Neural Networks, pages 869–873. Springer.

Michael Kuhn, Ivica Letunic, Lars Juhl Jensen, and Peer Bork. 2016. The SIDER database of drugs and side effects. Nucleic Acids Research, 44(D1):D1075–D1079.

Omer Levy and Yoav Goldberg. 2014a. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302–308.

Omer Levy and Yoav Goldberg. 2014b. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171–180.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Zhiheng Li, Zhihao Yang, Chen Shen, Jun Xu, Yaoyun Zhang, and Hua Xu. 2019. Integrating shortest dependency path and sentence sequence into a deep learning framework for relation extraction in clinical text. BMC Medical Informatics and Decision Making, 19(1):22.

Tal Linzen. 2016. Issues in evaluating semantic spaces using word analogies. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 13–18, Berlin, Germany. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR).

Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia. Association for Computational Linguistics.

Denis Newman-Griffis, Albert Lai, and Eric Fosler-Lussier. 2017. Insights into analogy completion from the biomedical domain. In BioNLP 2017, pages 19–28, Vancouver, Canada. Association for Computational Linguistics.

Robert Oravecz and Matej Štuhec. 2014. Trichotillomania successfully treated with risperidone and naltrexone: a geriatric case report. Journal of the American Medical Directors Association, 15(4):301–302.

Bethany Percha and Russ B Altman. 2018. A global network of biomedical relationships derived from text. Bioinformatics, 34(15):2614–2624.

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, Brussels, Belgium. Association for Computational Linguistics.

Don R Swanson. 1986. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 30(1):7–18.

Peter D Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Yunxia Wang, Song Zhang, Fengcheng Li, Ying Zhou, Ying Zhang, Zhengwen Wang, Runyuan Zhang, Jiang Zhu, Yuxiang Ren, Ying Tan, et al. 2020. Therapeutic Target Database 2020: enriched resource for facilitating research and early development of targeted therapeutics. Nucleic Acids Research, 48(D1):D1031–D1041.

Koki Washio and Tsuneaki Kato. 2018. Filling missing paths: Modeling co-occurrences of word pairs and dependency paths for recognizing lexical semantic relations. arXiv preprint arXiv:1809.03411.

Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. 2013. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research, 41(W1):W518–W522.

Michelle Whirl-Carrillo, Ellen M McDonagh, JM Hebert, Li Gong, K Sangkuhl, CF Thorn, Russ B Altman, and Teri E Klein. 2012. Pharmacogenomics knowledge for personalized medicine. Clinical Pharmacology & Therapeutics, 92(4):414–417.

Dominic Widdows and Trevor Cohen. 2012. Real, complex, and binary semantic vectors. In International Symposium on Quantum Interaction, pages 24–35. Springer.

Dominic Widdows and Trevor Cohen. 2014. Reasoning with vectors: A continuous model for fast robust inference. Logic Journal of the IGPL, 23(2):141–173.

David S Wishart, Yannick D Feunang, An C Guo, Elvis J Lo, Ana Marcu, Jason R Grant, Tanvir Sajed, Daniel Johnson, Carin Li, Zinat Sayeeda, et al. 2018. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Research, 46(D1):D1074–D1082.

Yijia Zhang, Hongfei Lin, Zhihao Yang, Jian Wang, Shaowu Zhang, Yuanyuan Sun, and Liang Yang. 2018. A hybrid model based on neural networks for biomedical relation extraction. Journal of Biomedical Informatics, 81:83–92.

Huiwei Zhou, Shixian Ning, Yunlong Yang, Zhuang Liu, Chengkun Lang, and Yingyu Lin. 2018. Chemical-induced disease relation extraction with dependency information and prior knowledge. Journal of Biomedical Informatics, 84:171–178.

Yongjun Zhu, Olivier Elemento, Jyotishman Pathak, and Fei Wang. 2019. Drug knowledge bases and their applications in biomedical informatics research. Briefings in Bioinformatics, 20(4):1308–1321.
DeSpin: a prototype system for detecting spin in biomedical publications

Anna Koroleva
Zurich University of Applied Sciences (ZHAW), Waedenswil, Switzerland
Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
[email protected]

Sanjay Kamath
Total, France
[email protected]

Patrick M.M. Bossuyt
Academic Medical Center, University of Amsterdam, Amsterdam, Netherlands
[email protected]

Patrick Paroubek
LIMSI, CNRS, Université Paris-Saclay, Orsay, France
[email protected]
Abstract 1 Background
Improving the quality of medical research re- It is widely acknowledged nowadays that that the porting is crucial to reduce avoidable waste in quality of reporting of research results in the clin- research and to improve the quality of health ical domain is suboptimal. As a consequence, re- care. Despite various initiatives aiming at im- search findings can often not be replicated, and proving research reporting – guidelines, check- billions of euros may be wasted yearly (Ioannidis, lists, authoring aids, peer review procedures, etc. – overinterpretation of research results, 2005). also known as distorted reporting or spin, is Numerous initiatives aim at improving the qual- still a serious issue in research reporting. ity of research reporting. Guidelines and checklists In this paper, we propose a Natural Language have been developed for every type of clinical re- Processing (NLP) system for detecting several search. Still, the quality of reporting remains low: types of spin in biomedical articles reporting authors fail to choose and follow a correct guide- randomized controlled trials (RCTs). We use a line/checklist (Samaan et al., 2013). Automated combination of rule-based and machine learn- tools, such as Penelope1, are introduced to facili- ing approaches to extract important informa- tate the use of guidelines/checklists. It was proved tion on trial design and to detect potential spin. that authoring aids improve the completeness of The proposed spin detection system includes reporting (Barnes et al., 2015). algorithms for text structure analysis, sentence Enhancing the quality of peer reviewing is an- classification, entity and relation extraction, other step to improve research reporting. Peer re- semantic similarity assessment. Our algo- rithms achieved operational performance for viewing requires assessing a large number of in- the these tasks, F-measure ranging from 79,42 formation items. Nowadays, Natural Language to 97.86% for different tasks. 
The most diffi- Processing (NLP) is applied to facilitate laborious cult task is extracting reported outcomes. manual tasks such as indexing of medical literature Our tool is intended to be used as a semi- (Huang et al., 2011) and systematic review process automated aid tool for assisting both authors (Ananiadou et al., 2009). Similarly, the peer re- and peer reviewers to detect potential spin. viewing process can be partially automated with The tool incorporates a simple interface that al- the help of NLP. lows to run the algorithms and visualize their Our project tackles a specific issue of research output. It can also be used for manual annota- reporting that, to our knowledge, has not been ad- tion and correction of the errors in the outputs. dressed by the NLP community: spin, also referred The proposed tool is the first tool for spin de- to as overinterpretation of research results. In the tection. The tool and the annotated dataset are context of clinical trials assessing a new (experi- freely available. At the time of reported work, Anna Koroleva was a PhD Netherlands. Sanjay Kamath was a PhD student at LIMSI- student at LIMSI-CNRS in Orsay, France and at the Academic CNRS and LRI Univ. Paris-Sud in Orsay, France. Medical Center, University of Amsterdam in Amsterdam, the 1https://www.penelope.ai/
49 Proceedings of the BioNLP 2020 workshop, pages 49–59 Online, July 9, 2020 c 2020 Association for Computational Linguistics mental) intervention, spin consists in exaggerating prototype comprises a set of spin-detecting algo- the beneficial effects of the studied intervention rithms and a simple interface to run the algorithms (Boutron et al., 2010). and display their output. Spin is common in articles reporting random- This paper is organized as follows: first, we pro- ized controlled trials (RCTs) - clinical trials com- vide an overview of some existing semi-automated paring health interventions, to which participants aid systems for authors, reviewers and readers of are allocated randomly to avoid biases - with non- biomedical articles. Second, we introduce in more significant primary outcome. Abstracts are more detail the notion of spin, the types of spin that we prone to spin than full texts. Spin is found in a address, and the information that is required to as- high percentage of abstracts of articles in surgical sess an article for spin. After that, we describe our research (40%) (Fleming, 2016), cardiovascular current algorithms, methods employed and provide diseases (57%) (Khan et al., 2019), cancer (47%) their evaluation. Finally, we discuss the potential (Vera-Badillo et al., 2016), obesity (46.7%) (Austin future development of the prototype. et al., 2018), otolaryngology (70%) (Cooper et al., 2018), anaesthesiology (32,2%) (Kinder et al., 2 Related work 2018), and wound care (71%) (Lockyer et al., Although there has been no attempt to automate 2013). Although the problem of spin has started to spin detection in biomedical articles, a number of attract attention in the medical community in the works addressed developing automated aid tools recent years, the shown prevalence of spin proves to assist authors and readers of scientific articles that it often remains unnoticed by editors and peer in performing various other tasks. Some of these reviewers. 
tools were tested and were shown to reduce the Abstracts are often the only part of the article workload and improve the performance of human available to readers, and spin in abstracts of RCTs experts on the corresponding task. poses a serious threat to the quality of health care by causing overestimation of the intervention by 2.1 Authoring aid tools clinicians (Boutron et al., 2014), which may lead to the use of an ineffective of unsafe intervention in Barnes et al.(2015) assessed the impact of a writing clinical practice. Besides, spin in research articles aid tool based on the CONSORT statement (Schulz is linked to spin in press releases and health news et al., 2010) on the completeness of reporting of (Haneef et al., 2015; Yavchitz et al., 2012), which RCTs. The tools was developed for six domains of has the negative impact of raising false expectations the Methods section (trial design, randomization, regarding the intervention among the public. blinding, participants, interventions, and outcomes) and consisted of reminders of the corresponding The importance of the problem of spin motivated CONSORT item(s), bullet points enumerating the our work. We aimed at developing NLP algorithms key elements to report, and good reporting exam- to aid authors and readers in detecting spin. We ples. The tool was assessed in an RCT in which focused on randomized controlled trials (RCTs) as the participants were asked to write a Methods sec- they are the most important source of evidence for tion of an article based on a trial protocol, either Evidence-based medicine, and spin in RCTs has using the aid tool (’intervention’ group) or without high negative impact. using the tool (’control’ group). 
The results of 41 Our work lies within the scope of the Methods in participants showed that the mean global score for Research on Research (MiRoR) project2, an inter- reporting completeness was higher with the use of national project devoted to improving the planning, the tool than without it. conduct, reporting and peer reviewing of health care research. For the design and development 2.2 Aid tools for readers and reviewers of our toolkit, we benefited from advice from the MiRoR consortium members. Kiritchenko et al.(2010) developed a system called In this paper, we introduce a prototype of a sys- ExaCT to automatically extract 21 key characteris- tem, called DeSpin (Detector of Spin), that au- tics of clinical trial design, such as treatment names, tomatically detects potential spin in abstracts of eligibility criteria, outcomes, etc. ExaCT consists RCTs and relevant supporting information. This of an information extraction algorithm that looks for text fragments corresponding to the target in- 2http://miror-ejd.eu/ formation elements, a web-based user interface
50 through which human experts can view and correct et al.(2015), who divided instances of spin into the suggested fragments. several types and subtypes. The National Library of Medicine’s Medical We addressed the following types of spin: Text Indexer (MTI) is a system providing auto- 1. Outcome switching – unjustified change of the matic recommendations based on the Medical Sub- pre-defined trial outcomes, leading to report- ject Headings (MeSH) terms for indexing medical ing only the favourable outcomes that support articles (Mork et al., 2013). MTI is used to assist the hypothesis of the researchers (Goldacre human indexers, catalogers, and NLM’s History of et al., 2019). Outcome switching is one of the Medicine Division in their work. Its use by index- most common types of spin. It can consist in ers was shown to grow over years (used to index omitting the primary outcome in the results / 15.75% of the articles 2002 vs 62.44% in 2014) conclusions of the abstract, or in the focus on and to improve the performance (precision, recall significant secondary outcomes, e.g.: and F-measure) of indexers (Mork et al., 2017). Marshall et al.(2015) addressed the task of au- The primary end point of this trial was overall tomating assessment of risk of bias in clinical trials. survival. <...> This trial showed a signifi- Bias is phenomenon related to spin: it is a sys- cantly increased R0 resection rate although tematic error or a deviation from the truth in the it failed to demonstrate a survival benefit. results or conclusions that can cause an under- or In this example, the primary outcome (”over- overestimation of the effect of the examined treat- all survival”), the results for which were not ment (Higgins and Green, 2008). 
The authors de- favourable, is mentioned in the conclusion, veloped a system called RobotReviewer that used but it is not reported in the first place and oc- machine learning to assess an article for the risk curs within a concessive clause (starting by of different types of bias and to extract text frag- ”although”). This way of reporting puts the ments that support these judgements. These works focus on the other, favourable, outcome (”R0 showed that automated risk of bias assessment can resection rate”). be achieve reasonable performance, and the extrac- tion of supporting text fragments reached similar 2. Interpreting non-significant outcome as a quality to that of human experts. Marshall et al. proof of equivalence of the treatments, e.g.: (2017) further developed RobotReviewer, adding The median PFS was 10.3 months in the functionality for extracting the PICO (Population, XELIRI and 9.3 months in the FOLFIRI arm Interventions/Comparators, Outcomes) elements (p = 0.78). Conclusion: The XELIRI regi- from articles and detecting study design (RCT), men showed similar PFS compared to the for the purpose of automated evidence synthesis. FOLFIRI regimen. Soboczenski et al.(2019) assessed RobotReviewer The results for the outcome ”median PFS” are in a user study involving 41 participants, evaluating not significant, which is often erroneously time spent for bias assessment, text fragment sug- interpreted as a proof of similarity of the gestions by machine learning, and usability of the treatments. However, a non-significant result tool. Semi-automation in this study was shown to means that the null hypothesis of a difference be quicker than manual assessment; 91% of the au- could not be rejected, which is not equivalent tomated risk of bias judgments and 62% of support- to a demonstration of similarity of the treat- ing text suggestions were accepted by the human ments. This would require the rejection of the reviewers. 
null hypothesis of a difference, or a substantial The cited works demonstrate that semi- difference, in outcomes between treatments. automated aid tools can prove useful for both au- thors and readers/reviewers of medical articles and 3. Focus on within-group comparisons, e.g.: has a potential to improve the quality of the articles Both groups showed robust improvement in and facilitate the analysis of the texts. both symptoms and functioning. 3 Spin: definition and types The goal of randomized controlled trials is to compare two treatments with regard to some We adopt the definition and classification of spin outcomes. If the superiority of the experimen- introduced by Boutron et al.(2010) and Lazarus tal treatment over the control treatment was
not shown, within-group comparisons (reporting the changes within a group of patients receiving a treatment, instead of comparing patients receiving different treatments) can be used to persuade the reader of beneficial effects of the experimental treatment.

Two concepts are vital for spin detection and play a key role in our algorithms:

1. The primary outcome of a trial – the most important variable monitored during the trial to assess how the studied treatment impacts it. Primary outcomes are recorded in trial registries (open online databases storing the information about registered clinical trials), and should be defined in the text of clinical articles, e.g.:

"The primary end point was a difference of > 20% in the microvascular flow index of small vessels among groups."

2. Statistical significance of the primary outcome. Statistical hypothesis testing is used to check for a significant difference in outcomes between two patient groups, one receiving the experimental treatment and the other receiving the control treatment. Statistical significance is often reported as a P-value compared to a predefined threshold, usually set to 0.05. Spin most often occurs when the results for the primary outcome are not significant (Boutron et al., 2010; Fleming, 2016; Khan et al., 2019; Vera-Badillo et al., 2016; Austin et al., 2018; Cooper et al., 2018; Kinder et al., 2018; Lockyer et al., 2013), although trials with a significant effect on the primary outcome may also be prone to spin (Beijers et al., 2017).

Trial results are commonly reported as an effect on the (primary) outcome, along with the p-value, e.g.:

"Microcirculatory flow indices of small and medium vessels were significantly higher in the levosimendan group as compared to the control group (p < 0.05)."

[Footnote: It is important to distinguish between the notions of outcome, effect and result in this context: an outcome is a measure/variable monitored during a clinical trial; effect refers to the change in an outcome observed during a trial; trial results refer to the set of effects for all measured outcomes.]

Statistical significance levels of trial outcomes are vital for spin detection, as spin is commonly related to non-significant results for the primary outcome, or to selective reporting of significant outcomes only.

4 Algorithms

Spin is a complex notion, and thus detecting spin cannot be seen as a binary classification problem. We believe that the most viable approach to spin detection is to assess each (sub)type of spin separately. We aimed at developing algorithms to extract and analyse pieces of information relevant to the addressed types of spin. The extracted information and its analysis, provided by our tool, can help human experts in deciding on the presence or absence of spin of the given (sub)type.

Detection of spin and related information is a complex task which cannot be fully automated. Our system is designed as a semi-automated tool that finds potential instances of the addressed types of spin and extracts the supporting information that can help the user to make the final decision on the presence of spin. In this section, we present the algorithms currently included in the system, according to the types of spin that they are used to detect.

As we aim at detecting spin in the Results and Conclusions sections of articles' abstracts, we first need an algorithm that analyses the given article to detect its abstract and the Results and Conclusions sections within the abstract. We will not mention this algorithm in the list of algorithms for each spin type, to avoid repetition: whenever we talk about extracting some information from the abstract, it implies that the text structure analysis algorithm was applied.

4.1 Outcome switching

We focus on the switching (change/omission) of the primary outcome. Primary outcome switching can occur at several points:

• the primary outcome(s) recorded in the trial registry can differ from the primary outcome(s) declared in the article;
• the primary outcome(s) declared in the abstract can differ from the primary outcome(s) declared in the body of the article;
• the primary outcome(s) recorded in the trial registry can be omitted when reporting the results for the outcomes in the abstract;
• the primary outcome(s) recorded in the article can be omitted when reporting the results for the outcomes in the abstract.
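A minimal sketch of this kind of outcome-switching check — comparing a registry-recorded primary outcome against the outcomes reported in the text — could look like the following. The paper's system scores outcome similarity with a fine-tuned BioBERT classifier; here a simple token-overlap (Jaccard) score stands in for that model so the sketch stays self-contained, and all function names are illustrative:

```python
# Illustrative sketch, not the tool's actual implementation.
# similarity() is a stand-in for the paper's BioBERT-based
# outcome-similarity classifier.

def similarity(outcome_a: str, outcome_b: str) -> float:
    """Stand-in similarity: Jaccard overlap of lower-cased tokens."""
    a, b = set(outcome_a.lower().split()), set(outcome_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def flag_outcome_switching(registry_primary, reported_outcomes, threshold=0.5):
    """Return reported outcomes judged similar to the registry primary
    outcome; an empty match list suggests potential switching or
    selective reporting, to be confirmed by the human expert."""
    matches = [o for o in reported_outcomes
               if similarity(registry_primary, o) >= threshold]
    return {"primary": registry_primary,
            "matches": matches,
            "potential_switching": not matches}

report = flag_outcome_switching(
    "overall survival",
    ["R0 resection rate", "survival"])
print(report["matches"])              # → ['survival']
print(report["potential_switching"])  # → False
```

This mirrors the paper's example: "overall survival" vs. "survival" scores high, so the primary outcome is considered reported; had no reported outcome matched, the case would be flagged for the user.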
Primary outcome switching detection involves the following algorithms:

1. Identification of primary outcomes in trial registries and in the article's text.

2. Identification of reported outcomes from sentences reporting the results, e.g. (reported outcomes are in bold in the original: "Scores", "dyspnea", "cough", "wheeze"):

"The results of this study showed that symptom Scores in massage group were improved significantly compared with control group, and the rate of dyspnea, cough and wheeze in the experimental group than the control group were reduced by approximately 45%, 56% and 52%."

3. Assessment of semantic similarity of pairs of outcomes extracted by the above algorithms, to check for missing outcomes. We perform the assessment for the following sets of outcomes:

• the primary outcome extracted from the registry is compared to the primary outcome(s) declared in the article;
• the primary outcome extracted from the abstract is compared to the primary outcome(s) declared in the body of the article;
• the primary outcome extracted from the article is compared to the outcomes reported in the abstract;
• the primary outcome extracted from the registry is compared to the outcomes reported in the abstract.

These assessments allow us to detect switching of the primary outcome at all the possible stages. If the primary outcome in the registry and in the article, or in the abstract and in the body of the article, differ, we conclude that there is potential outcome switching, which is reported to the user. Similarly, if the primary outcome (from the article or from the registry) is missing from the list of the reported outcomes, we suspect selective reporting of outcomes, and the system reports it to the user.

In the example on page 3, the system should extract "overall survival" as the primary outcome, and "R0 resection rate" and "survival" as reported outcomes. The similarity between "overall survival" and "R0 resection rate" is low, while the similarity between "overall survival" and "survival" is high; thus, we conclude that the primary outcome "overall survival" is reported as "survival".

As semantic similarity often depends on the context, the conclusions of the system are presented to the user, who can check them to assess the correctness of the analysis.

4. Assessment of the discourse prominence of the reported primary outcome (detected by the previous algorithms), by checking whether it is reported in the first place among all the outcomes and whether it is reported in a concessive clause.

In the example above, the system will detect that the primary outcome "survival" is reported within a concessive clause (starting with "although") and will flag the sentence as potentially focusing on secondary outcomes.

4.2 Interpreting a non-significant outcome as proof of equivalence of the treatments

As we stated above, conclusions on the similarity/equivalence of the studied treatments are justified only if the trial was of the non-inferiority or equivalence type. Thus, we employ two algorithms to detect this type of spin:

1. Identification of statements of similarity between treatments, e.g.:

"Both products caused similar leukocyte counts diminution and had similar safety profiles."

2. Identification of markers of non-inferiority or equivalence trial design, e.g.:

"ONCEMRK is a phase 3, multicenter, double-blind, noninferiority trial comparing raltegravir 1200mg QD with raltegravir 400mg BID in treatment-naive HIV-1–infected adults."

If there is a statement of similarity of treatments while no markers of non-inferiority/equivalence design are found, we conclude the presence of spin and report it to the user.

4.3 Focus on within-group comparisons

Any statement in the results and conclusions of the abstract that presents a comparison of two states of a patient group, without comparing it to another group, is a within-group comparison. This type of spin is detected by a single algorithm that identifies
within-group comparisons, which are then reported to the user:

"Young Mania Rating Scale total scores improved with ritanserin."

4.4 Other algorithms

We support extraction of some information that is not directly involved in the detection of spin, but that can help the user in spin assessment and that can be used in the future when new spin types are added. The algorithms include:

1. Extraction of measures of statistical significance, both numerical and verbal (marked in bold in the original):

"Study group patients had a significant lower reintubation rate than did controls; six patients (17%) versus 19 patients (48%), P<0.05; respectively."

2. Extraction of the relation between the reported outcomes and their statistical significance, extracted at the previous stages. For the example above, we extract the pairs ("reintubation rate", "significant") and ("reintubation rate", "P<0.05").

These algorithms, in combination with the assessment of semantic similarity of extracted outcomes, allow us to identify the significance level for the primary outcome.

5 Methods

In this section, we briefly outline the methods used in our algorithms, the datasets used for evaluation, and the current performance of the algorithms. Our approach builds on previous work on the related tasks. As the development of the algorithms, the annotation of the data and the testing of different approaches are described in detail in the corresponding articles, we limit ourselves here to a brief description of the best-performing method that we selected for each task.

The methods we employ can be divided into two groups: machine learning, including deep learning, used for the core tasks for which we have sufficient training data, and rule-based methods, used for the simpler tasks or for tasks where we do not have enough data for machine learning.

5.1 Rule-based methods

We developed rules for the following tasks:

• To find the abstract, we use regular expression rules that are evaluated on a set of 3938 PubMed Central (PMC, https://www.ncbi.nlm.nih.gov/pmc/) articles in XML format with a specific tag for the abstract, used as the gold standard. To evaluate our algorithm, we applied it to the raw texts extracted from the XML files and compared the extracted abstracts to those obtained using the XML tag.

• To extract outcomes from trial registries, we use regular expressions to extract the trial registration number from the article; using it, we find on the web, download and parse the registry entry corresponding to the trial.

• To extract significance levels, we use rules based on regular expressions and token, lemma and POS-tag information.

• To assess the discourse prominence of an outcome, and to detect statements of similarity between treatments, within-group comparisons and markers of non-inferiority design, we employ rules based on token, lemma and POS-tag information.

We annotated the abstracts of 180 articles (2402 sentences) for similarity statements and within-group comparisons (Koroleva, 2020). The proportion of these types of statements in our corpus is low: we identified only 72 similarity statements and 127 within-group comparisons. The evaluation of statements of similarity between treatments and of within-group comparisons was performed with two settings: 1) using the whole text of abstracts; 2) using only the Results and Conclusions sections of the abstract, which raised the precision, as expected (Table 1).

5.2 Machine learning methods

For the core tasks of our system, we either used an existing annotated corpus or annotated our own corpora. Our corpora were annotated by a single annotator (AK), in consultation with our medical advisors from the MiRoR network (Isabelle Boutron, Patrick Bossuyt and Liz Wager).

We tested several approaches for each task, including rule-based and machine-learning approaches (see details below). Overall, we found that the best performance on our tasks was shown
| Algorithm | Method | Annotated dataset | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Primary outcomes extraction | Deep learning | 2,000 sentences / 1,694 outcomes | 86.99 | 90.07 | 88.42 |
| Reported outcomes extraction | Deep learning | 1,940 sentences / 2,251 outcomes | 81.17 | 78.09 | 79.42 |
| Outcome similarity assessment | Deep learning | 3,043 pairs of outcomes | 88.93 | 90.76 | 89.75 |
| Similarity statements extraction (whole abstract) | Rules | 180 abstracts / 2402 sentences | 77.8 | 87.5 | 82.4 |
| Similarity statements extraction (results and conclusions) | Rules | 180 abstracts / 2402 sentences | 85.1 | 87.5 | 86.3 |
| Within-group comparisons (whole abstract) | Rules | 180 abstracts / 2402 sentences | 53.2 | 90.6 | 67.1 |
| Within-group comparisons (results and conclusions) | Rules | 180 abstracts / 2402 sentences | 71.9 | 90.6 | 80.1 |
| Abstract extraction | Rules | 3938 abstracts | 94.7 | 94 | 94.3 |
| Text structure analysis: sections of abstract | Deep learning | PubMed 200k | 97.82 | 95.81 | 96.8 |
| Extraction of significance levels | Rules | 664 sentences / 1,188 significance level markers | 99.18 | 96.58 | 97.86 |
| Outcome – significance level relation extraction | Deep learning | 2,678 pairs of outcomes and significance level markers | 94.3 | 94 | 94 |
Table 1: Overview of algorithms, methods, results and annotated datasets.

by a deep learning approach that has recently proved highly successful in many NLP applications. It employs language representations pre-trained on large unannotated data and fine-tuned on a relatively small amount of annotated data for a specific downstream task. The language representations that we tested include: BERT (Bidirectional Encoder Representations from Transformers) models (Devlin et al., 2018), trained on a general-domain corpus of 3.3B words; the BioBERT model (Lee et al., 2019), trained on the BERT corpus and a biomedical corpus of 18B words; and SciBERT models (Beltagy et al., 2019), trained on the BERT corpus and a scientific corpus of 3.1B words. For each task, we chose the best-performing model. Details about the annotated datasets that we used and the tested approaches can be found below. The best results for each task are summarised in Table 1.

5.2.1 Identification of sections in the abstract

For identifying sections within the abstract (in particular, Results and Conclusions), we used the PubMed 200k dataset introduced in Dernoncourt and Lee (2017). This dataset contains approximately 200,000 abstracts of RCTs with 2.3 million sentences. Each sentence is annotated with one of the following classes, corresponding to the sections of the abstract: background, objective, method, result, or conclusion. We used the train-dev-test split provided by the developers of the dataset.

We compared a rule-based approach and BERT, SciBERT and BioBERT models, fine-tuned for the sentence classification task on the PubMed 200k dataset. The best performance was shown by the fine-tuned BioBERT model.

5.2.2 Outcome extraction

The outcome extraction task includes two subtasks: extracting primary and reported outcomes. For each subtask, we annotated a separate corpus. For primary outcome extraction, we annotated a corpus of 2,000 sentences, coming from 1,672 articles. The sentences were selected randomly, from both abstracts and full texts, without restriction to a particular medical domain. A total of 1,694 primary outcomes was annotated (Koroleva, 2019a). For reported outcome extraction, we annotated reported outcomes in the abstracts of the articles for which we annotated the primary outcomes. The corpus contains 1,940 sentences from 402 articles, with a total of 2,251 reported outcomes (Koroleva, 2019a).

We compared a rule-based system and several machine learning algorithms for primary and reported outcome extraction. Details about the annotated datasets and the methods that we tested can be found in Koroleva et al. (2020). We selected the best-performing approach to be included in our tool.

For primary outcome extraction, the best performance was demonstrated by the BioBERT model
fine-tuned for the named entity recognition task on our corpus of 2,000 sentences annotated for primary outcomes. For reported outcome extraction, the best performance was achieved by the SciBERT model fine-tuned for the named entity recognition task on our corpus of 1,940 sentences annotated with reported outcomes.

5.2.3 Assessment of semantic similarity of outcomes

To annotate semantic similarity between outcomes, we used pairs of sentences from our corpora of outcomes: the first sentence in each pair comes from the corpus of primary outcomes, the second sentence comes from the corpus of reported outcomes, and both sentences are from the same article. We assigned a binary label of similarity (similar/dissimilar) to each pair of outcomes in each sentence pair. The corpus contains 3,043 pairs of outcomes (Koroleva, 2019b).

We tested several semantic similarity measures (string-based, lexical, vector-based) and the BERT, SciBERT and BioBERT models, fine-tuned for the sentence pair classification task on the corpus of outcome pairs. Details on the corpus annotation and on the methods tested can be found in Koroleva et al. (2019). The best performance was shown by the fine-tuned BioBERT model.

5.2.4 Extraction of the relation between reported outcomes and statistical significance levels

To annotate the relation between reported outcomes and statistical significance levels, we selected sentences containing markers of statistical significance from the corpus annotated with reported outcomes. We annotated the pairs of outcomes and significance levels with a binary label ("positive": the significance level is related to the outcome; "negative": the significance level is not related to the outcome). The final corpus contains 663 sentences with 2,552 annotated relations (Koroleva, 2019c).

We tested several machine learning algorithms and the BERT, SciBERT and BioBERT models fine-tuned for the relation extraction task on the annotated corpus. The details on the corpus and the method can be found in Koroleva and Paroubek (2019). The best result for this task was achieved by the fine-tuned BioBERT model.

6 Interface

Our prototype system allows the user to load a text (with or without annotations), run the algorithms, visualize their output, and correct, add or remove annotations. The expected input is an article reporting an RCT in text format, including the abstract. Figure 1 shows the interface with an example of a processed text.

The main items of the drop-down menu at the top of the page are Annotations, allowing the user to visualize and manage the annotations, and Algorithms, allowing the user to run the described algorithms to detect potential spin and the related information. The text fragments identified by the algorithms can be highlighted in the text. When running the algorithms, a report is generated that contains the extracted information and its analysis by the tool (e.g. a mismatch between the outcomes in the text and in the trial registry; the absence of the declared primary outcome among the reported outcomes in the abstract). The report is saved into the Metadata section of the Annotations menu, which can be accessed through the interface, and can be exported to a file via the Generate report item of the Algorithms menu. Human experts can use this report to check the extracted information and the analysis performed by the tool, and to make a final decision on the presence/absence of a given type of spin.

7 Results and conclusions

The current functionality, the methods in use, the annotated datasets and the best achieved results are outlined in Table 1. Performance is assessed per token for outcome and significance level extraction, and per unit for the other tasks.

In this paper, we presented a first prototype tool for assisting authors and reviewers in detecting spin and related information in abstracts of articles reporting RCTs. The employed algorithms show operational performance on complex semantic tasks, even with a relatively low volume of available annotated data. We envisage two possible applications of our system: as an authoring aid or as a peer-reviewing tool. The authoring aid version can be further developed into an educational tool, explaining the notion of spin and its types to the user.

Possible directions for future work include: improving the implementation and interface (adding prompts for interaction with the user; facilitating the installation process), the algorithms (improving current performance, adding detection of new spin
Figure 1: Example of a processed text.

types), the application (promoting the tool among the target audience; encouraging users to submit their manually annotated data, to be used to improve the algorithms), and optimization (parallel processing of multiple input text files). Our system can be easily incorporated into other text processing tools.

Another interesting yet challenging direction for future work is detecting spin/distorted reporting in texts belonging to scientific domains other than biomedicine. First of all, a qualitative study of spin is needed to define and classify spin in each scientific domain (similar to the work of Boutron et al. (2010) and Lazarus et al. (2015) for clinical trials). To the best of our knowledge, there have been no attempts to conduct such a study for non-biomedical texts. It is therefore difficult to hypothesise whether spin-detection algorithms developed for texts reporting clinical trials could be applicable to other domains. It appears that the definition and the types of spin are domain-specific (e.g. outcome-related types of spin, prevalent in the biomedical domain, would not be relevant in domains that do not use the notion of outcome). Hence, we suppose that spin-detection algorithms are domain-specific as well and cannot be directly applied to other domains.

8 Availability

The proposed prototype tool and the associated models are available at: https://github.com/aakorolyova/DeSpin.

Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 676207.
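To make the rule-based extraction of statistical significance markers described earlier (Sections 4.4 and 5.1) concrete, here is a minimal regular-expression sketch. The patterns are simplified, illustrative stand-ins for the tool's actual rules, which additionally use token, lemma and POS-tag information:

```python
import re

# Illustrative stand-ins for the system's significance-marker rules.
# NUMERIC catches p-value expressions such as "P<0.05" or "p = .03";
# VERBAL catches verbal markers such as "significant(ly)" and
# "non-significant".
NUMERIC = re.compile(r"[Pp]\s*[<>=]\s*0?\.\d+")
VERBAL = re.compile(
    r"\b(?:statistically\s+)?(?:non-?significant|significant(?:ly)?)\b",
    re.IGNORECASE)

def extract_significance_markers(sentence: str):
    """Return numerical and verbal significance markers found in a sentence."""
    return {"numerical": NUMERIC.findall(sentence),
            "verbal": VERBAL.findall(sentence)}

s = ("Study group patients had a significant lower reintubation rate than "
     "did controls; six patients (17%) versus 19 patients (48%), P<0.05; "
     "respectively.")
print(extract_significance_markers(s))
# → {'numerical': ['P<0.05'], 'verbal': ['significant']}
```

Pairing each extracted marker with the nearest reported outcome (as in the ("reintubation rate", "P<0.05") example) is the relation-extraction step that the paper handles with a fine-tuned BioBERT model rather than with rules.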
References

Sophia Ananiadou, Brian Rea, Naoaki Okazaki, Rob Procter, and James Thomas. 2009. Supporting systematic reviews using text mining. Social Science Computer Review, 27:509–523.

Jennifer Austin, Christopher Smith, Kavita Natarajan, Mousumi Som, Cole Wayant, and Matt Vassar. 2018. Evaluation of spin within abstracts in obesity randomized clinical trials: A cross-sectional review. Clinical Obesity, 9:e12292.

Caroline Barnes, Isabelle Boutron, Bruno Giraudeau, Raphael Porcher, Douglas Altman, and Philippe Ravaud. 2015. Impact of an online writing aid tool for writing a randomized trial report: the COBWEB (CONSORT-based WEB tool) randomized controlled trial. BMC Medicine, 13:221.

Lian Beijers, Bertus F. Jeronimus, Erick H. Turner, Peter de Jonge, and Annelieke M. Roest. 2017. Spin in RCTs of anxiety medication with a positive primary outcome: a comparison of concerns expressed by the US FDA and in the published literature. BMJ Open, 7(3).

Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. SciBERT: Pretrained contextualized embeddings for scientific text. arXiv preprint arXiv:1903.10676.

Isabelle Boutron, Douglas Altman, Sally Hopewell, Francisco Vera-Badillo, Ian Tannock, and Philippe Ravaud. 2014. Impact of spin in the abstracts of articles reporting results of randomized controlled trials in the field of cancer: the SPIIN randomized controlled trial. Journal of Clinical Oncology.

Isabelle Boutron, Susan Dutton, Philippe Ravaud, and Douglas Altman. 2010. Reporting and interpretation of randomized controlled trials with statistically nonsignificant results for primary outcomes. JAMA.

Craig M. Cooper, Harrison M. Gray, Andrew E. Ross, Tom A. Hamilton, Jaye B. Downs, Cole Wayant, and Matt Vassar. 2018. Evaluation of spin in the abstracts of otolaryngology randomized controlled trials. The Laryngoscope.

Franck Dernoncourt and Ji Young Lee. 2017. PubMed 200k RCT: a dataset for sequential sentence classification in medical abstracts. In Proceedings of the 8th IJCNLP (Volume 2: Short Papers), pages 308–313, Taipei, Taiwan. Asian Federation of NLP.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Padhraig S. Fleming. 2016. Evidence of spin in clinical trials in the surgical literature. Annals of Translational Medicine, 4(19):385.

Ben Goldacre, Henry Drysdale, Aaron Dale, Ioan Milosevic, Eirion Slade, Philip Hartley, Cicely Marston, Anna Powell-Smith, Carl Heneghan, and Kamal R. Mahtani. 2019. COMPare: a prospective cohort study correcting and monitoring 58 misreported trials in real time. Trials, 20(1):118.

Romana Haneef, Clement Lazarus, Philippe Ravaud, Amelie Yavchitz, and Isabelle Boutron. 2015. Interpretation of results of studies evaluating an intervention highlighted in Google Health News: a cross-sectional study of news. PLoS ONE.

Julian P. Higgins and Sally Green, editors. 2008. Cochrane Handbook for Systematic Reviews of Interventions. Wiley & Sons Ltd., West Sussex.

Minlie Huang, Aurélie Névéol, and Zhiyong Lu. 2011. Recommending MeSH terms for annotating biomedical articles. Journal of the American Medical Informatics Association (JAMIA), 18:660–667.

John Ioannidis. 2005. Why most published research findings are false. PLoS Medicine, 2:e124.

Muhammad Khan, Noman Lateef, Tariq Siddiqi, Karim Abdur Rehman, Saed Alnaimat, Safi Khan, Haris Riaz, M Hassan Murad, John Mandrola, Rami Doukky, and Richard Krasuski. 2019. Level and prevalence of spin in published cardiovascular randomized clinical trial reports with statistically nonsignificant primary outcomes: A systematic review. JAMA Network Open, 2:e192622.

N.C. Kinder, M.D. Weaver, Cole Wayant, and Matt Vassar. 2018. Presence of 'spin' in the abstracts and titles of anaesthesiology randomised controlled trials. British Journal of Anaesthesia, 122.

Svetlana Kiritchenko, Berry de Bruijn, Simona Carini, Joel D. Martin, and Ida Sim. 2010. ExaCT: automatic extraction of clinical trial characteristics from journal publications. BMC Medical Informatics and Decision Making.

Anna Koroleva. 2019a. MiRoR11 - P2 - Annotated corpus for primary and reported outcomes extraction.

Anna Koroleva. 2019b. MiRoR11 - P2 - Annotated corpus for semantic similarity of clinical trial outcomes.

Anna Koroleva. 2019c. MiRoR11 - P2 - Annotated corpus for the relation between reported outcomes and their significance levels.

Anna Koroleva. 2020. MiRoR11 - P2 - Annotated dataset for spin-related types of statements (statements of similarity and within-group comparisons).

Anna Koroleva, Sanjay Kamath, and Patrick Paroubek. 2019. Measuring semantic similarity of clinical trial outcomes using deep pre-trained language representations. Journal of Biomedical Informatics: X, 4:100058.

Anna Koroleva, Sanjay Kamath, and Patrick Paroubek. 2020. Extracting outcomes from articles reporting randomized controlled trials using pre-trained deep language representations. EasyChair Preprint no. 2940.

Anna Koroleva and Patrick Paroubek. 2019. Extracting relations between outcomes and significance levels in randomized controlled trials (RCTs) publications. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 359–369, Florence, Italy. Association for Computational Linguistics.

Clément Lazarus, Romana Haneef, Philippe Ravaud, and Isabelle Boutron. 2015. Classification and prevalence of spin in abstracts of non-randomized studies evaluating an intervention. BMC Medical Research Methodology.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746.

Suzanne Lockyer, Robert Willard Hodgson, Jo C. Dumville, and Nicky Cullum. 2013. "Spin" in wound care research: the reporting and interpretation of randomized controlled trials with statistically non-significant primary outcome results or unspecified primary outcomes. Trials.

Iain Marshall, Joël Kuiper, Edward Banner, and Byron C. Wallace. 2017. Automating biomedical evidence synthesis: RobotReviewer. In Proceedings of ACL 2017, System Demonstrations, pages 7–12, Vancouver, Canada. Association for Computational Linguistics.

Iain James Marshall, Joël Kuiper, and Byron C. Wallace. 2015. RobotReviewer: Evaluation of a system for automatically assessing bias in clinical trials. Journal of the American Medical Informatics Association (JAMIA), 23.

James Mork, Alan Aronson, and Dina Demner-Fushman. 2017. 12 years on – is the NLM medical text indexer still useful and relevant? Journal of Biomedical Semantics, 8(1).

James G. Mork, Antonio Jimeno-Yepes, and Alan R. Aronson. 2013. The NLM medical text indexer system for indexing biomedical literature. CEUR Workshop Proceedings, 1094.

Zainab Samaan, Lawrence Mbuagbaw, Daisy Kosa, Victoria Borg Debono, Rejane Dillenburg, Shiyuan Zhang, Vincent Fruci, Brittany Dennis, Monica Bawor, and Lehana Thabane. 2013. A systematic scoping review of adherence to reporting guidelines in health care literature. Journal of Multidisciplinary Healthcare, 6:169–188.

Kenneth F. Schulz, Douglas G. Altman, and David Moher. 2010. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. BMJ, 340.

Frank Soboczenski, Thomas Trikalinos, Joël Kuiper, Randolph G. Bias, Byron Wallace, and Iain J. Marshall. 2019. Machine learning to help researchers evaluate biases in clinical trials: A prospective, randomized user study. BMC Medical Informatics and Decision Making, 19.

Francisco E. Vera-Badillo, Marc Napoleone, Monika K. Krzyzanowska, Shabbir M.H. Alibhai, An-Wen Chan, Alberto Ocana, Bostjan Seruga, Arnoud J. Templeton, Eitan Amir, and Ian F. Tannock. 2016. Bias in reporting of randomised clinical trials in oncology. European Journal of Cancer, 61:29–35.

Amélie Yavchitz, Isabelle Boutron, Aida Bafeta, Ibrahim Marroun, Pierre Charles, Jean J. C. Mantz, and Philippe Ravaud. 2012. Misrepresentation of randomized controlled trials in press releases and news coverage: a cohort study. PLoS Medicine.
Towards Visual Dialog for Radiology
Olga Kovaleva‡∗, Chaitanya Shivade†∗, Satyananda Kashyap⋆, Karina Kanjaria⋆, Adam Coy⋆, Deddeh Ballah⋆, Joy Wu⋆, Yufan Guo⋆, Alexandros Karargyris⋆, David Beymer⋆, Anna Rumshisky‡, Vandana Mukherjee⋆
‡University of Massachusetts Lowell, †Amazon, ⋆IBM Almaden Research Center
Abstract

Current research in machine learning for radiology is focused mostly on images. There exists limited work investigating intelligent interactive systems for radiology. To address this limitation, we introduce a realistic and information-rich task of Visual Dialog in radiology, specific to chest X-ray images. Using MIMIC-CXR, an openly available database of chest X-ray images, we construct both a synthetic and a real-world dataset and provide baseline scores achieved by state-of-the-art models. We show that incorporating the medical history of the patient leads to better performance in answering questions, as opposed to a conventional visual question answering model which looks only at the image. While our experiments show promising results, they indicate that the task is extremely challenging, with significant scope for improvement. We make both the datasets (synthetic and gold standard) and the associated code publicly available to the research community.

1 Introduction

Answering questions about an image is a complex multi-modal task demonstrating an important capability of artificial intelligence. A well-defined task evaluating such capabilities is Visual Question Answering (VQA) (Antol et al., 2015), where a system answers free-form questions reasoning about an image. VQA demands careful understanding of the elements in an image along with the intricacies of the language used in framing a question about it. Visual Dialog (VisDial) (Das et al., 2017; de Vries et al., 2016) is an extension of the VQA problem, where a system is required to engage in a dialog about the image. This adds significant complexity to VQA, as a system should now be able to associate the question with the image, and reason over additional information gathered from previous question answers in the dialog.

Although limited work exploring VQA in radiology exists, VisDial in radiology remains an unexplored problem. With the healthcare setting increasingly requiring efficiency, evaluation of physicians is now based on both the quality and the timeliness of patient care. Clinicians often depend on official reports of imaging exam findings from radiologists to determine the appropriate next step. However, radiologists generally have a long queue of imaging studies to interpret and report, causing subsequent delay in patient care (Bhargavan et al., 2009; Siewert et al., 2016). Furthermore, it is common practice for clinicians to call radiologists asking follow-up questions on the official reporting, leading to further inefficiencies and disruptions in the workflow (Mangano et al., 2014).

Visual dialog is a useful imaging adjunct that can help expedite patient care. It can potentially answer a physician's questions regarding official interpretations without interrupting the radiologist's workflow, allowing the radiologist to concentrate their efforts on interpreting more studies in a timely manner. Additionally, visual dialog could provide clinicians with a preliminary radiology exam interpretation prior to receiving the formal dictation from the radiologist. Clinicians could use the information to start planning patient care and decrease the time from the completion of the radiology exam to subsequent medical management (Halsted and Froehle, 2008).

In this paper, we address these gaps and make the following contributions: 1) we introduce the construction of RadVisDial, the first publicly available dataset for visual dialog in radiology, derived from the MIMIC-CXR (Johnson et al., 2019) dataset; 2) we compare several state-of-the-art models for VQA and VisDial applied to these images; and 3) we conduct a comprehensive set of experiments

[∗ Equal contribution. † Work done at IBM Research.]
Proceedings of the BioNLP 2020 workshop, pages 60–69, Online, July 9, 2020. © 2020 Association for Computational Linguistics

highlighting different challenges of the problem and propose solutions to overcome them.

2 Related Work

Most of the large publicly available datasets (Kaggle, 2017; Rajpurkar et al., 2017) for radiology consist of images associated with a limited amount of structured information. For example, Irvin et al. (2019) and Johnson et al. (2019) make images available along with the output of a text extraction module that produces labels for 13 abnormalities in a chest X-ray. Of note, the task of generating reports from radiology images has recently become popular in the research community (Jing et al., 2018; Wang et al., 2018). Two recent shared tasks at ImageCLEF explored the VQA problem with radiology images (Hasan et al., 2018; Abacha et al., 2019). Lau et al. (2018) also released a small dataset, VQA-RAD, for this specific task.

The first VQA shared task at ImageCLEF (Hasan et al., 2018) used images from articles in PubMed Central. While Abacha et al. (2019) and Lau et al. (2018) use clinical images, the sizes of these datasets are limited. They are a mix of several modalities, including 2D modalities such as X-rays and 3D modalities such as ultrasound, MRI, and CT scans. They also cover several anatomic locations, from the brain to the limbs. This makes a multi-modal task with such images overly challenging, with shared task participants developing separate models (Al-Sadi et al., 2019; Abacha et al., 2018; Kornuta et al., 2019) to first address these subtasks (such as modality detection) before actually solving the problem of VQA.

We address these limitations and build upon MIMIC-CXR (Johnson et al., 2019), the largest publicly available dataset of chest X-rays and corresponding reports. We focus on the problem of visual dialog for a single modality and anatomy in the form of 2D chest X-rays. We restrict the number of questions and generate answers for them automatically, which allows us to report results on a large set of images.

3 Data

3.1 MIMIC-CXR

The MIMIC-CXR dataset1 consists of 371,920 chest X-ray images in the Digital Imaging and Communications in Medicine (DICOM) format along with 206,576 reports. Each report is well structured and typically consists of sections such as Medical Condition, Comparison, Findings, and Impression. Each report can map to one or more images, and each patient can have one or more reports. The images consist of both frontal and lateral views. The frontal views are either anterior-posterior (AP) or posterior-anterior (PA). The initial release of the data also contains annotations for 14 labels (13 abnormalities and one No Findings label) for each image. These annotations are obtained by running the CheXpert labeler (Irvin et al., 2019), a rule-based NLP pipeline, against the associated report. The labeler output assigns one of four possibilities for each of the 13 abnormalities: {yes, no, maybe, not mentioned in the report}.

1 https://physionet.org/content/mimic-cxr/1.0.0/

3.2 Visual Dialog dataset construction

Every training record of the original VisDial dataset (Das et al., 2017) consists of three elements: an image I, a caption for the image C, and a dialog history H consisting of a sequence of ten question-answer pairs. Given the image I, the caption C, a possibly empty dialog history H, and a follow-up question q, the task is to generate an answer a, where {q, a} ∈ H. Following the original formulation, we synthetically create our dataset using the plain-text reports associated with each image (this synthetic dataset will be considered to be silver-standard data for the experiments described in section 5). The Medical Condition section of the radiology report is a single sentence describing the medical history of the patient. We treat this sentence from the Medical Condition section as the caption of the image. We use NegBio (Peng et al., 2018) for extracting sections within a report. We discard all images that do not have a medical condition in their report. Further, each CheXpert label is formulated as a question probing the presence of a disorder, and the output from the labeler is treated as the corresponding answer. Thus, ignoring the No Findings label, there are 52 possible question-answer pairs as a result of 13 questions and 4 possible answers.

We decided to focus on PA images for most of our experiments, as this is the most informative view for chest X-rays, according to our team radiologists. The original VisDial dataset (Das et al., 2017) consists of ten questions per dialog and one dialog per image. Since we only have a set of 13
[Figure 1 content: a VisDial 1.0 example image captioned "Two people are in a wheelchair and one is holding a racket", alongside a chest X-ray from our dataset captioned "80 year old man s/p VATS R lower lobectomy", with dialog turns such as "Q: Pneumonia? A: Yes", "Q: Airspace opacity? A: Yes", "Q: Fracture? A: Not in report", "Q: Lung lesion? A: No".]
possible questions, we limit the length of the dialog to 5 randomly sampled questions. The resulting dataset has 91,060 images in the PA view (with train/validation/test splits containing 77,205, 7,340, and 6,515 images, respectively). This synthetic data will be made available through the MIMIC Derived Data Repository.2 Thus, any individual with access to MIMIC-CXR will have access to our data. Figure 1 shows an example from our dataset and how it compares with one from VisDial 1.0.

2 https://physionet.org/physiotools/mimic-code/HEADER.shtml

Figure 1: Comparison of VisDial 1.0 (left) with our synthetically constructed dataset (right).

3.3 Evaluation

The questions in our dataset are limited to probing the presence of an abnormality in a chest X-ray. Similarly, the answers are limited to one of the four choices. Owing to the restricted nature of the problem, we deviate from the evaluation protocol outlined in Das et al. (2017) and instead calculate the F1-score for each of the four answers. We also report a macro-averaged F1 score across the four answers to make model comparisons easier.

4 Models

For our experiments, we selected a set of models designed for image-based question answering tasks. Namely, we experimented with three architectures: Stacked Attention Network (SAN) (Yang et al., 2016), Late Fusion Network (LF) (Das et al., 2017), and Recursive Visual Attention Network (RvA) (Niu et al., 2019). Following the original VisDial study (Das et al., 2017), we use an encoder-decoder structure with a discriminative decoder for each of the models. Below we give an overview of all three algorithms.

4.1 Stacked Attention Network

The original configuration of SAN was introduced for the general-domain VQA task. The model performs multi-step reasoning by refining question-guided attention over image features in an iterative manner. The attended image features are then combined with the question features for answer prediction. SAN has been successfully adapted for medical VQA tasks such as VQA-RAD (Lau et al., 2018) and the VQA-Med task of the ImageCLEF 2018 challenge (Ionescu et al., 2018). In our setup, we use a stack of two image attention layers and an LSTM-based question representation.

To take the dialog history into account and therefore adjust the SAN model for the needs of the Visual Dialog task, we modify the first image attention layer of the network by adding a term for the LSTM representation of the history. This modification forces the image attention weights to become both question- and history-guided (see Figure 2).

Figure 2: The modified architecture of the SAN model (image taken from (Yang et al., 2016)). The proposed modification shown in orange incorporates the history of dialog turns in the same way as the question, through an LSTM. In our ablation experiments the changed part either reduces to encoding an image caption only or gets cut completely.

4.2 Late Fusion Network

Proposed by Das et al. (2017) as a baseline model for the Visual Dialog task, the Late Fusion Network encodes the question and the dialog history through two separate RNNs, and the image through a CNN. The resulting representations are simply concatenated in a single vector, which is then used by a decoder for predicting the answer. We use this model unchanged, as released in the original Visual Dialog challenge.
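The history-guided attention modification can be sketched as follows. This is an illustrative numpy re-implementation under assumed dimensions, not the authors' released code: the first attention layer scores each image region from the question vector plus an added term for the LSTM-encoded history, so the attention weights become both question- and history-guided.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def guided_attention(image_regions, question_vec, history_vec, W_i, W_q, W_h):
    """First attention layer of the modified SAN sketch: attention weights
    over image regions are driven by both the question and the history."""
    # (num_regions, d) scores combining image, question, and history terms
    scores = np.tanh(image_regions @ W_i + question_vec @ W_q + history_vec @ W_h).sum(axis=1)
    weights = softmax(scores)           # one attention weight per region
    attended = weights @ image_regions  # weighted sum of region features
    return attended, weights

# toy dimensions (assumed purely for illustration)
rng = np.random.default_rng(0)
regions = rng.standard_normal((49, 64))   # e.g. a 7x7 grid of region features
q_vec = rng.standard_normal(32)           # LSTM question encoding
h_vec = rng.standard_normal(32)           # LSTM history encoding (the added term)
W_i = rng.standard_normal((64, 64))
W_q = rng.standard_normal((32, 64))
W_h = rng.standard_normal((32, 64))
attended, w = guided_attention(regions, q_vec, h_vec, W_i, W_q, W_h)
```

In the ablations of subsection 5.2, the history term is replaced by a caption-only encoding or removed entirely, which in this sketch amounts to swapping or zeroing `h_vec`.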
4.3 Recursive Visual Attention

This model is the winner of the 2019 Visual Dialog challenge.3 It recursively browses the history of past dialog turns until the current question is paired with the turn containing the most relevant information. This strategy is particularly useful for resolving co-references, which naturally occur in general-domain dialog questions. As before, we do not modify the architecture of the model.

3 https://visualdialog.org/challenge/2019

Figure 3: Downsampling strategies. Every bar along the X axis represents a single question-answer pair, where questions (13 in total) and answers (4 in total) are obtained through CheXpert.

5 Experiments

This section presents our downsampling strategy, gives details about the conducted ablation studies, and describes experiments with various representations of images and texts.

5.1 Downsampling

A closer analysis of our data showed that the majority of the reports processed by the CheXpert labeler resulted in no mention of most of the 13 pathologies. This produced a heavily skewed dataset that would lead to a biased model instead of true visual understanding. This issue is not unique to radiology; it is observed even in the current benchmarks for VQA, and attempts have been made to mitigate the resulting problems (Hudson and Manning, 2019; Zhang et al., 2016; Agrawal et al., 2018). To discourage answer biases, we performed data balancing, specifically by downsampling the majority labels in our dataset. As mentioned above, the CheXpert labeler outputs four possible answers for 13 labels. To investigate the skew in the data, we plotted the distribution of the 52 question-answer pairs (Figure 3).

Further, we downsampled the question-answer pairs to fit a smoother answer distribution using the method presented in GQA, based on the Earth Mover's Distance (Hudson and Manning, 2019; Rubner et al., 2000). We iterated over the 52 pairs in decreasing order of frequency and downsampled the categories belonging to the skewed head of the distribution. The relative label ranks by frequency remained the same for the balanced sets as for the unbalanced sets. For example, the pairs {'Other pleural findings' → 'Not in report'} and {'Fracture' → 'Not in report'} remained the first and second largest counts in both the unbalanced and downsampled versions of the datasets. To reduce the disparity between dominant and underrepresented categories, we tuned the parameters outlined in (Hudson and Manning, 2019). We experimented with two different sets of parameter values and obtained two datasets with more balanced question-answer distributions. We further refer to them as "minor" and "major" downsampling, reflecting the total amount of data reduced (shown in blue and gray in Figure 3).

5.2 Evaluating importance of context

To assess the importance of the dialog context for question answering, we compare the performance of different variations of the Stacked Attention Network, selected as the best-performing model in the previous experiment (see subsection 6.1). In particular, we examine three scenarios: (a) the model makes a prediction based solely on a given image (essentially solving the VQA task rather than the Visual Dialog task), (b) the model makes its prediction given an image and its caption, and (c) the model makes its prediction given an image, a caption, and a history of question-answer pairs. Similar to the model modifications described in subsection 4.1 and Figure 2, we achieve this by changing the first image attention layer of the SAN model to take in, accordingly, (a) question and image features, (b) question, image, and caption features, and (c) question, image, and full dialog history features.
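The head-downsampling step can be sketched as follows. This is a simplified illustration, not the exact EMD-based GQA procedure: here counts are smoothed with a power transform (a hypothetical stand-in for the tuned parameters), which flattens the head of the distribution while preserving the relative frequency ranks of the categories, as observed for the balanced datasets.

```python
from collections import Counter

def downsample_head(pairs, alpha=0.5):
    """Downsample over-represented question-answer categories toward a
    smoother distribution. Raising each count to a power alpha < 1 is
    monotone, so the relative frequency ranking of categories is kept."""
    counts = Counter(pairs)
    target = {k: max(1, round(c ** alpha)) for k, c in counts.items()}
    kept, seen = [], Counter()
    for p in pairs:  # keep only the first target[p] occurrences of each pair
        if seen[p] < target[p]:
            kept.append(p)
            seen[p] += 1
    return kept

# toy example: a heavily skewed head gets reduced, ranks are preserved
pairs = (["Fracture->Not in report"] * 100
         + ["Edema->Yes"] * 25
         + ["Pneumonia->Maybe"] * 4)
balanced = Counter(downsample_head(pairs))
```

Varying `alpha` plays the role of the "minor" versus "major" settings: smaller values remove more data from the dominant categories.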
5.3 Image representations

We test three approaches for pre-trained image representations. The first approach uses a ResNet-101 architecture (He et al., 2016) for multiclass classification of input X-ray images into the 14 finding labels extracted from the associated reports (as described in section 3.2). Our second method aims to replicate the original CheXpert study (Irvin et al., 2019). Here we use a DenseNet-121 image classifier trained to predict five pre-selected and clinically important labels, namely atelectasis, cardiomegaly, consolidation, edema, and pleural effusion. In both the ResNet- and DenseNet-based approaches, we take the features obtained from the last pooling layer.

Finally, we adopted a bottom-up mechanism for image region proposal introduced by Anderson et al. (2018). More specifically, we first trained a neural network predicting bounding boxes for all the image regions, corresponding to a set of 11 handcrafted clinical annotations adopted from an existing chest X-ray dataset4. We then represented every region as a latent feature vector of a trained patch-wise convolutional autoencoder, and finally concatenated all the obtained vectors to represent the entire image.

Based on the results of the experiment (subsection 6.3), we found that ResNet-101 image vectors yielded the best performance, so we used them in the other experiments.

4 https://www.kaggle.com/c/rsna-pneumonia-detection-challenge

5.4 Effect of incorporating a lateral view

One of the crucial aspects of X-ray radiography exams is to capture the subject from multiple views. Typically, in the case of chest X-rays, radiologists order an additional lateral view to confirm and locate findings that are not clearly visible from a frontal (PA or AP) view. We test whether the VisDial models are able to leverage the additional visual information offered by a lateral (LAT) view. We filter the data down to the patients whose chest X-ray exams had both frontal and lateral views and re-sample the resulting dataset into train (52,952 PA and 8,086 AP images), validation (6,614 PA and 964 AP images), and test (6,508 PA and 1,035 AP images) sets. We train a separate ResNet-101 model for each of the three views on this re-sampled data using the method described in the previous section. The vector representations of a frontal view and the corresponding lateral view are concatenated as an aggregate image representation.

5.5 Text representations

Finally, we investigate the best way of representing the textual data by incorporating different pre-trained word vectors. More specifically, we measure the performance of our best-performing SAN model with (a) randomly initialized word embeddings trained jointly with the rest of the model, (b) domain-independent GloVe Common Crawl embeddings (Pennington et al., 2014), and (c) domain-specific fastText embeddings trained by Romanov and Shivade (2018). The latter are initialized with GloVe embeddings trained on Common Crawl, followed by training on 12M PubMed abstracts, and finally on 2M clinical notes from the MIMIC-III database (Johnson et al., 2016). In all the experiments, we use 300-dimensional word vectors. We also experimented with transformer-based contextual vectors using BERT (Devlin et al., 2019). More specifically, instead of using LSTM representations of the textual data, we extracted the last-layer vectors from ClinicalBERT (Alsentzer et al., 2019), pre-trained on MIMIC notes, and averaged them over the input sequence tokens.

5.6 Question order

In a visual dialog setting, a model is conditioned on the image vector, the image caption, and the dialog history to predict the answer to a new question. We hypothesized that a model should be able to answer later questions in a dialog better, since it has more information from the previous questions and their answers. As described in Section 3.2, we randomly sample 5 questions out of 13 possible choices to construct a dialog. We re-ordered the question-answer pairs in the dialog to reflect the order in which the corresponding abnormality label mentions occurred in the report. However, results for questions ordered based on their occurrence in the narrative did not differ from the setup with a random order of questions.

6 Results

We report macro-averaged F1-scores achieved on the same unbalanced validation set for each of the experiments. When experimenting with different configurations of the same model, we also break down the aggregate score into the F1 scores for individual answer options.
Model              | 'Yes' | 'No' | 'Maybe' | 'Not in report' | Macro F1
SAN (VQA)          | 0.24  | 0    | 0.09    | 0.84            | 0.29
SAN (caption only) | 0.30  | 0.09 | 0.09    | 0.81            | 0.33
SAN (full history) | 0.22  | 0.26 | 0.04    | 0.83            | 0.34
Table 1: Ablation experiments. Per-answer F1-scores along with the macro F1-score are shown for tested SAN configurations.
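The evaluation protocol from subsection 3.3 (per-answer F1 plus a macro average, as reported in Table 1) amounts to a one-vs-rest F1 for each of the four answer options; a minimal sketch:

```python
ANSWERS = ["Yes", "No", "Maybe", "Not in report"]

def per_answer_f1(gold, pred):
    """One-vs-rest F1 for each answer option, plus the macro average."""
    scores = {}
    for a in ANSWERS:
        tp = sum(g == a and p == a for g, p in zip(gold, pred))
        fp = sum(g != a and p == a for g, p in zip(gold, pred))
        fn = sum(g == a and p != a for g, p in zip(gold, pred))
        # with no true positives the F1 is defined as 0 (also avoids 0/0)
        scores[a] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    scores["Macro F1"] = sum(scores[a] for a in ANSWERS) / len(ANSWERS)
    return scores
```

The macro average weights all four answers equally, which is why a model that never predicts a rare answer such as 'Maybe' is penalized even if its overall accuracy is high.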
6.1 Downsampling

Our results (Table 2) show a consistent improvement in the scores across all the models as the training data becomes more balanced. All the models yielded comparable scores, with SAN being slightly better than the other models (0.34 against 0.33 macro F1-score). In our later experiments, we used the major downsampled version of the dataset.

Model | Unbalanced | Downsampled (Minor) | Downsampled (Major)
SAN   | 0.25       | 0.28                | 0.34
LF    | 0.28       | 0.31                | 0.33
RvA   | 0.24       | 0.33                | 0.33

Table 2: Data balancing experiments. Macro F1 scores are reported for every tested model.

6.2 Evaluating importance of context

One of the main findings of our study is the importance of contextual information for answering questions about a given image. As shown in Table 1, adding the image caption and the history of turns results in incremental increases in macro F1-scores. Notably, in the VQA setup, in which the model relies on the image only, it fails to detect the 'No' answer, whereas the history-aware configuration leads to a significant performance gain for this particular label. As expected, and due to the skewed nature of the dataset, the highest and the lowest per-label scores were achieved for the most and the least frequent labels ('Not in report' and 'Maybe'), respectively.

6.3 Image representation

Out of the tested image representations, ResNet-derived vectors perform consistently better than the other approaches (see Table 3). Although in our DenseNet-121 image classification pre-training we were able to replicate the performance of Irvin et al. (2019), the Visual Dialog scores for the corresponding vectors turned out to be lower. We believe this might be due to the fact that, by design, the network uses a limited set of pre-training classes that is not sufficient to generalize well to the full set of diseases used in the Visual Dialog task.

Model | DenseNet-121 | Region Proposal | ResNet
SAN   | 0.27         | 0.29            | 0.34
LF    | 0.33         | 0.31            | 0.33
RvA   | 0.29         | 0.32            | 0.33

Table 3: Comparative performance (macro-F1) of Visual Dialog models on the test set with different image representations.

6.4 Effect of incorporating a lateral view

As expected, for both variations of the frontal view (i.e., AP and PA), appending lateral image vectors enhanced the performance of the tested SAN model (see Table 4). This suggests that lateral and frontal image vectors complement each other, and the models can benefit from using both. However, in our dataset only a subset of reports has both views available, which significantly reduces the amount of training data.

6.5 Word embeddings

Another observation from our experiments is that domain-specific pre-trained word embeddings contribute to better scores (see Table 5). This is due to the fact that domain-specific embeddings contain medical knowledge that helps the model make more justified predictions. When using BERT, we did not notice gains in performance, which most likely means that the last-layer averaging strategy is not optimal and more sophisticated approaches such as (Xiao, 2018) are required.
Alternatively, the final representation of the [CLS] token can be used to represent the input text.
View     | 'Yes' | 'No' | 'Maybe' | 'Not in report' | Macro F1
AP       | 0.40  | 0.21 | 0.12    | 0.79            | 0.381
LAT      | 0.41  | 0.23 | 0.13    | 0.75            | 0.379
AP + LAT | 0.41  | 0.22 | 0.12    | 0.79            | 0.385
PA       | 0.30  | 0.30 | 0.08    | 0.88            | 0.392
LAT      | 0.32  | 0.32 | 0.07    | 0.86            | 0.391
PA + LAT | 0.32  | 0.34 | 0.06    | 0.87            | 0.396
Table 4: Effect of adding the lateral view to a frontal view (AP and PA).
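The view aggregation behind Table 4 amounts to concatenating the frontal and lateral feature vectors produced by the separately trained ResNet-101 encoders; a minimal sketch, with the 2048-dimensional pooled-feature size being an assumption for illustration:

```python
import numpy as np

def aggregate_views(frontal_vec, lateral_vec=None):
    """Build the image representation: the frontal (PA or AP) feature
    vector, concatenated with the lateral-view vector when available."""
    if lateral_vec is None:
        return frontal_vec
    return np.concatenate([frontal_vec, lateral_vec])

# assumed pooled-feature dimensionality
frontal = np.ones(2048)   # placeholder frontal-view features
lateral = np.zeros(2048)  # placeholder lateral-view features
combined = aggregate_views(frontal, lateral)
```

A downstream model consuming the aggregate representation must of course accept the doubled input dimensionality when both views are present.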
Embedding            | 'Yes' | 'No' | 'Maybe' | 'Not in report' | Macro F1
Random               | 0.26  | 0.22 | 0.04    | 0.73            | 0.31
GloVe (Common Crawl) | 0.27  | 0    | 0.09    | 0.80            | 0.29
fastText (MedNLI)    | 0.24  | 0.22 | 0.07    | 0.84            | 0.33
Table 5: Comparative performance of the SAN model with different word embeddings.
7 Comparison with the gold-standard data

To complement our experiments with the silver data and investigate the applicability of the trained models to real-world scenarios, we also collected a set of gold-standard data, which consisted of two expert radiologists having a dialog about a particular chest X-ray. These X-ray images were randomly sampled PA views from the test split of our data. In this section, we present the data collection workflow, outline the associated challenges, compare the resulting dataset with the silver-standard one, and report the performance of the trained models.

7.1 Gold Standard Data Collection

We laid the foundations for our data collection in a manner similar to that of the general Visual Dialog challenge (Das et al., 2017). Two radiologists, designated as a "questioner" and an "answerer", conversed with each other following a detailed annotation guideline created to ensure consistency. The "answerer" in each scenario was provided with an image and a caption (the medical condition). The "questioner" was provided with only the caption and was tasked with asking follow-up questions about the image, which was visible only to the "answerer". In order to make the gold dataset comparable to the silver-standard one, we restricted the beginning of each answer to contain a direct response of 'Yes', 'No', 'Maybe', or 'Not mentioned'. In our annotation guidelines, 'Not mentioned' referred to the lack of evidence for the medical condition asked about by the "questioner" radiologist. The answer was elaborated with additional information if the radiologists found it necessary. The whole data collection procedure resulted in 100 annotated dialogs.

7.2 Gold standard results

Following the gold-standard data collection, we performed some preliminary analyses with the best silver-standard SAN model. Our gold-standard data was split into train (70), validation (20), and test (10) sets. We experimented with three setups: (a) evaluating the silver-data-trained networks on the gold-standard data, (b) training and evaluating the models on the gold data, and (c) fine-tuning the silver-data-trained networks on the gold-standard data. Table 6 shows the results of these experiments. We found that the best macro-F1 score of 0.47 was achieved by the silver-data-trained SAN network fine-tuned on the gold-standard data. We observed that the model could not predict any of the classes when directly evaluated on the gold dataset, suggesting that it was trained to fit data patterns significantly different from those present in the collected dataset. However, pre-training on the silver data serves as a good starting point for further model fine-tuning. The obtained scores in general imply that there are many differences between the gold and silver data, including their vocabularies, answer distributions, and level of question detail.

7.3 Comparison of gold and silver data

To provide a meaningful analysis of the sources of difference between the gold and silver datasets,
we grouped the gold questions semantically by using the CheXpert vocabulary for the 13 labels used for the construction of the silver dataset. The gold questions that could not be grouped via CheXpert were mapped manually using expert clinical knowledge. We systematically compared the gold and silver dialogs on the same 100 chest X-rays and noted the following differences.

Train data  | 'Yes' | 'No' | Macro F1
Silver      | 0.00  | 0.00 | 0.00
Gold        | 0.27  | 0.77 | 0.35
Silver+gold | 0.60  | 0.82 | 0.47

Table 6: Comparative performance of the SAN model trained on different combinations of silver and gold data, and evaluated on the test subset of gold data. Note that the gold annotations did not contain the 'Not in report' and 'Maybe' options.

• Frequency of semantically equivalent questions. Just under half of the gold question types were semantically covered by the questions in the silver dataset.

• Granularity of questions. We observed that the silver dataset tends to ask highly granular questions about specific findings (e.g., "consolidation"), as expected. The radiology experts, however, asked a range of low- (e.g., "Are there any bone abnormalities?"), medium- (e.g., "Are the lungs clear?"), and high-granularity (e.g., "Is there evidence of pneumonia?") questions. The gold dialogs tend to start with broader (low-granularity) questions and narrow the differential diagnosis down as the dialogs progress.

• Question sense. The radiologists also asked questions in the form of whether some structure is "normal" (e.g., "Is the soft tissue normal?"), whereas the silver questions only asked whether an abnormality is present. Since chest X-rays are screening exams where a good proportion of the images may be "normal", having more questions asking whether different anatomies are normal would therefore yield more 'Yes' answers.

• Answer distributions. The answer distributions of the gold and silver data differ greatly. Specifically, while the gold data was composed heavily of 'Yes' or 'No' answers, the silver data comprised mostly 'Not in report'.

8 Discussion

Our main finding is that the introduced task of visual dialog in radiology presents many challenges from the machine learning perspective, including a skewed distribution of classes and a required ability to reason over both visual and textual input data. The best of our baseline models achieved a 0.34 macro-averaged F1-score, indicating significant scope for potential improvements. Our comparison of gold- and silver-standard data shows that some trends are in line with medical doctors' strategies in medical history taking: starting with broader, general questions and then narrowing the scope of their questions to more specific findings (Talbot et al.; Campillos-Llanos et al., 2020).

Despite the difficulty and the practical usefulness of the task, it is important to list the limitations of our study. The questions were limited to the presence of 13 abnormalities extracted by CheXpert, and the answers were limited to 4 options. The studies used in this work (from MIMIC-CXR) originate from a single tertiary hospital in the United States. Moreover, they correspond to a specific group of patients, namely those admitted to the Emergency Department (ED) from 2012 to 2014. Therefore, the data, and hence the model, reflect multiple real-world biases. It should also be noted that chest X-rays are mostly used for screening rather than diagnostic purposes. A radiology image is only one of the many data points (e.g., labs, demographics, medications) used while making a diagnosis. Therefore, although predicting the presence of abnormalities (e.g., pneumonia) based on brief knowledge of the patient's medical history and the chest X-ray might be a good exercise and a promising first step in evaluating machine learning models, it is clinically limited.

There are plenty of directions for future work that we intend to pursue. To make the synthetic data more realistic and expressive, both questions and answers should be diversified with the help of clinicians' expertise and external knowledge bases such as UMLS (Bodenreider, 2004). We plan to enrich the data with more question types, addressing, for example, the location or the size of a given lung abnormality. We plan to collect more real-life dialogs between radiologists and augment the two datasets to get a richer set of more expressive dialogs. We anticipate that bridging the gap between the silver- and the gold-standard data in terms of natural language formulations would significantly reduce the difference in model performance for the two setups.

Another direction is to develop a strategy to manage the uncertain labels, such as 'Maybe' and 'Not in report', to make the dataset more balanced.

9 Conclusion

We explored the task of Visual Dialog for radiology using chest X-rays and released the first publicly available silver- and gold-standard datasets for this task. Having conducted a set of rigorous experiments with state-of-the-art machine learning models used for combined visual and language reasoning, we demonstrated the complexity of the task and outlined promising directions for further research.

Acknowledgments

We would like to thank Mousumi Roy for her help in this project. We are also thankful to Mehdi Moradi for helpful discussions.

References

AB Abacha, SA Hasan, VV Datla, J Liu, D Demner-Fushman, and H Müller. 2019. VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019. In CLEF 2019 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Asma Ben Abacha, Soumya Gayen, Jason J Lau, Sivaramakrishnan Rajaraman, and Dina Demner-Fushman. 2018. NLM at ImageCLEF 2018 visual question answering in the medical domain. In CLEF (Working Notes).

Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. 2018. Don't just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4971–4980.

Aisha Al-Sadi, Bashar Talafha, Mahmoud Al-Ayyoub, Yaser Jararweh, and Fumie Costen. 2019. JUST at ImageCLEF 2019 visual question answering in the medical domain. In CLEF (Working Notes).

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

Mythreyi Bhargavan, Adam H Kaye, Howard P Forman, and Jonathan H Sunshine. 2009. Workload of radiologists in United States in 2006–2007 and trends since 1991–1992. Radiology, 252(2):458–467.

Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270.

Leonardo Campillos-Llanos, Catherine Thomas, Éric Bilinski, Pierre Zweigenbaum, and Sophie Rosset. 2020. Designing a virtual patient dialogue system based on terminology-rich resources: Challenges and evaluation. Natural Language Engineering, 26(2):183–220.

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Mark J Halsted and Craig M Froehle. 2008. Design, implementation, and assessment of a radiology workflow management system. American Journal of Roentgenology, 191(2):321–327.

Sadid A Hasan, Yuan Ling, Oladimeji Farri, Joey Liu, Henning Müller, and Matthew Lungren. 2018. Overview of ImageCLEF 2018 medical domain visual question answering task. In CLEF (Working Notes).

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Drew A Hudson and Christopher D Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6700–6709.

Bogdan Ionescu, Henning Müller, Mauricio Villegas, Alba García Seco de Herrera, Carsten Eickhoff, Vincent Andrearczyk, Yashin Dicente Cid, Vitali Liauchuk, Vassili Kovalev, Sadid A Hasan, et al. 2018. Overview of ImageCLEF 2018: Challenges, datasets and evaluation. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 309–334. Springer.

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. 2019. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of AAAI.

Baoyu Jing, Pengtao Xie, and Eric Xing. 2018. On the automatic generation of medical imaging reports. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2577–2586.

Alistair EW Johnson, Tom J Pollard, Seth Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. 2019. MIMIC-CXR: A large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042.

Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035.

Kaggle. 2017. Data Science Bowl. https://www.kaggle.com/c/data-science-bowl-2017.

Tomasz Kornuta, Deepta Rajan, Chaitanya Shivade, Alexis Asseman, and Ahmet S Ozcan. 2019. Leveraging medical visual question answering with supporting facts. In CLEF (Working Notes).

Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. 2018. A dataset of clinically generated visual questions and answers about radiology images. Scientific Data, 5:180251.

Mark D Mangano, Arifeen Rahman, Garry Choy, Dushyant V Sahani, Giles W Boland, and Andrew J Gunn. 2014. Radiologists' role in the communication of imaging examination results to patients: perceptions and preferences of patients. American Journal of Roentgenology, 203(5):1034–1039.

Yulei Niu, Hanwang Zhang, Manli Zhang, Jianhong Zhang, Zhiwu Lu, and Ji-Rong Wen. 2019. Recursive visual attention in visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6679–6688.

Yifan Peng, Xiaosong Wang, Le Lu, Mohammadhadi Bagheri, Ronald Summers, and Zhiyong Lu. 2018. NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Summits on Translational Science Proceedings, 2018:188.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Pranav Rajpurkar, Jeremy Irvin, Aarti Bagul, Daisy Ding, Tony Duan, Hershel Mehta, Brandon Yang, Kaylie Zhu, Dillon Laird, Robyn L Ball, et al. 2017. MURA: Large dataset for abnormality detection in musculoskeletal radiographs. arXiv preprint arXiv:1712.06957.

Alexey Romanov and Chaitanya Shivade. 2018. Lessons from natural language inference in the clinical domain. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1586–1596.

Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. 2000. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121.

Bettina Siewert, Olga R Brook, Mary Hochman, and Ronald L Eisenberg. 2016. Impact of communication errors in radiology on patient care, customer satisfaction, and work-flow efficiency. American Journal of Roentgenology, 206(3):573–579.

Thomas B Talbot, Kenji Sagae, Bruce John, and Albert A Rizzo. Designing useful virtual standardized patient encounters.

Harm de Vries, Florian Strub, A. P. Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville. 2016. GuessWhat?! Visual object discovery through multi-modal dialogue. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 4466–4475.

Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, and Ronald M Summers. 2018. TieNet: Text-image embedding network for common thorax disease classification and reporting in chest X-rays. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9049–9058.

Han Xiao. 2018. bert-as-service. https://github.com/hanxiao/bert-as-service.

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2016. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29.

Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2016. Yin and yang: Balancing and answering binary visual questions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5014–5022.
A BERT-based One-Pass Multi-Task Model for Clinical Temporal Relation Extraction

Chen Lin1, Timothy Miller1*, Dmitriy Dligach2, Farig Sadeque1, Steven Bethard3 and Guergana Savova1
*Co-first author
1Boston Children's Hospital and Harvard Medical School  2Loyola University Chicago  3University of Arizona
{first.last}@childrens.harvard.edu, [email protected], [email protected]

Abstract

Recently, BERT has achieved state-of-the-art performance in temporal relation extraction from clinical Electronic Medical Records text. However, the current approach is inefficient as it requires multiple passes through each input sequence. We extend a recently-proposed one-pass model for relation classification to a one-pass model for relation extraction. We augment this framework by introducing global embeddings to help with long-distance relation inference, and by multi-task learning to increase model performance and generalizability. Our proposed model produces results on par with the state-of-the-art in temporal relation extraction on the THYME corpus and is much "greener" in computational cost.

1 Introduction

The analysis of many medical phenomena (e.g., disease progression, longitudinal effects of medications, treatment regimens and outcomes) heavily depends on temporal relation extraction from the clinical free text embedded in Electronic Medical Records (EMRs). At a coarse level, a clinical event can be linked to the document creation time (DCT) as a Document Time Relation (DocTimeRel), with possible values of BEFORE, AFTER, OVERLAP, and BEFORE/OVERLAP (Styler IV et al., 2014). At a finer level, a narrative container (Pustejovsky and Stubbs, 2011) can temporally subsume an event as a CONTAINS relation. The THYME corpus (Styler IV et al., 2014) consists of EMR clinical text and is annotated with time expressions (TIMEX3), events (EVENT), and temporal relations (TLINK) using an extension of TimeML (Pustejovsky et al., 2003; Pustejovsky and Stubbs, 2011). It was used in the Clinical TempEval series (Bethard et al., 2015, 2016, 2017).

While the performance of DocTimeRel models has reached above 0.8 F1 on the THYME corpus, the CONTAINS task remains a challenge for both conventional learning approaches (Sun et al., 2013; Bethard et al., 2015, 2016, 2017) and neural models, including structured perceptrons (Leeuwenberg and Moens, 2017), convolutional neural networks (CNNs) (Dligach et al., 2017; Lin et al., 2017), and Long Short-Term Memory (LSTM) networks (Tourille et al., 2017; Dligach et al., 2017; Lin et al., 2018; Galvan et al., 2018). The difficulty is that the limited labeled data is insufficient for training deep neural models for complex linguistic phenomena. Some recent work (Lin et al., 2019) has used massive pre-trained language models (BERT; Devlin et al., 2018) and their variations (Lee et al., 2019) for this task and significantly increased the CONTAINS score by taking advantage of the rich BERT representations. However, that approach has a highly wasteful input representation: the same sentence must be processed multiple times, once for each candidate relation pair.

Inspired by recent work in Green AI (Schwartz et al., 2019; Strubell et al., 2019) and one-pass encodings for multiple-relation extraction (Wang et al., 2019), we propose a one-pass encoding mechanism for the CONTAINS relation extraction task, which can significantly increase efficiency and scalability. The architecture is shown in Figure 1. The three novel modifications to the original one-pass relational model of Wang et al. (2019) are: (1) Unlike Wang et al. (2019), our model operates in the relation extraction setting, meaning it must distinguish between relations and non-relations, as well as classify by relation type. (2) We introduce a pooled embedding for relational classification across long distances. Wang et al. (2019) focused on short-distance relations, but clinical CONTAINS relations often span multiple sentences, so a sequence-level embedding is necessary for such long-distance inference. (3) We use the same BERT encoding of the input instance for both
Proceedings of the BioNLP 2020 workshop, pages 70–75, Online, July 9, 2020. ©2020 Association for Computational Linguistics
the DocTimeRel and CONTAINS tasks, i.e., adding multi-task learning (MTL) on top of one-pass encoding. DocTimeRel and CONTAINS are related tasks. For example, if a medical event A happens BEFORE the DCT while event B happens AFTER the DCT, it is unlikely that there is a CONTAINS relation between A and B. MTL provides an effective way to leverage useful knowledge learned in one task to benefit other tasks. What is more, MTL can potentially employ a regularization effect that alleviates overfitting to a specific task.

2 Methodology

2.1 Twin Tasks

Apache cTAKES (Savova et al., 2010) (http://ctakes.apache.org) is used for segmenting and tokenizing the THYME corpus in order to generate instances. Each instance is a sequence of tokens with the gold-standard event and time expression annotations marked in the token sequences by logging their positional information. Using the entity-aware self-attention based on relative distance (Wang et al., 2019), we can encode every entity, E_i, by its BERT embedding, e_i. If an entity e_i consists of multiple tokens (many time expressions are multi-token), it is average-pooled ("local pool" in Figure 1) over the embeddings of the corresponding tokens in the last BERT layer.

[Figure 1: Model Architecture. e_1, e_2, and t represent entity embeddings for "surgery", "scheduled", and "March 11, 2014" respectively. G is the pooled embedding for the entire input instance.]

For the CONTAINS task, we create relation candidates from all pairs of entities within an input sequence. Each candidate is represented by the concatenation of three embeddings, e_i, e_j, and G, as [G:e_i:e_j], where G is an average-pooled embedding over the entire sequence and is different from the embedding of the [CLS] token. The [CLS] token is the conventional token BERT inserts at the start of every input sequence; its embedding is viewed as the representation of the entire sequence. The concatenated embedding is passed to a linear classifier to predict the CONTAINS, CONTAINED-BY, or NONE relation, \hat{r}_{ij}, as in Eq. (1):

    P(\hat{r}_{ij} \mid x, E_i, E_j) = \mathrm{softmax}(W^L [G : e_i : e_j] + b)    (1)

where W^L \in \mathbb{R}^{3d_z \times l_r}, d_z is the dimension of the BERT embedding, l_r = 3 for the CONTAINS labels, b is the bias, and x is the input sequence.

Similarly, for the DocTimeRel (dtr) task, we feed each entity's embedding, e_i, together with the global pooling G, to another linear classifier to predict the entity's five "temporal statuses": TIMEX if the entity is a time expression, or the dtr type (BEFORE, AFTER, etc.) if the entity is an event:

    P(\hat{dtr}_i \mid x, E_i) = \mathrm{softmax}(W^D [G : e_i] + b)    (2)

where W^D \in \mathbb{R}^{2d_z \times l_d} and l_d = 5.

For the combined task, we define the loss as:

    L(\hat{r}_{ij}, r_{ij}) + \alpha (L(\hat{dtr}_i, dtr_i) + L(\hat{dtr}_j, dtr_j))    (3)

where \hat{r}_{ij} is the predicted relation type, \hat{dtr}_i and \hat{dtr}_j are the predicted temporal statuses for E_i and E_j respectively, r_{ij} is the gold relation type, and dtr_i and dtr_j are the gold temporal statuses. \alpha is a weight that balances the CONTAINS loss and the dtr loss.

2.2 Window-based token sequence processing

Following Lin et al. (2019), we use a set window of tokens (Token-Window), disregarding natural sentence boundaries, for generating instances. BERT may still take punctuation tokens into account. Each token sequence is limited to a set number of entities (Entity-Window) to be processed. We apply a sliding token window (windows may overlap) so that every entity gets processed. Positional information for each entity is output along the token sequence and is propagated through the layers via the entity-aware self-attention mechanism (Wang et al., 2019).

3 Experiments

3.1 Data and Settings

We adopt the THYME corpus (Styler IV et al., 2014) for model fine-tuning and evaluation. The one-pass multi-task model is fine-tuned on the THYME Colon Cancer training set with the uncased BERT base model, using the code released by Wang et al. (2019)¹ as a base. The batch size is set to 4, the learning rate is selected from (1e-5, 2e-5, 3e-5, 5e-5), the Token-Window size is selected from (60, 70, 100), the Entity-Window size is selected from (8, 10, 16), the training epochs are selected from (2, 3, 4, 5), the clipping distance k (the maximum relative position to consider) is selected from (3, 4, 5), and \alpha is selected from (0.01, 0.05). A single NVIDIA GTX Titan Xp GPU is used for the computation.

Model                  P      R      F1
Multi-pass             0.735  0.613  0.669
Multi-pass+Silver      0.674  0.695  0.684
One-pass               0.647  0.671  0.659
One-pass+[CLS]         0.665  0.673  0.669
One-pass+Pooling       0.670  0.689  0.680
One-pass+Pooling+MTL   0.686  0.687  0.686

Table 1: Model performance of CONTAINS relation on the colon cancer test set. Multi-pass baselines are from Lin et al. (2019)'s system without and with self-training using silver instances (system predictions on an unlabeled colon cancer set). We tested a one-pass system with just argument embeddings; with the [CLS] token as the global context vector ([CLS]); with argument embeddings plus a globally pooled context vector (Pooling); and with global pooling as well as multi-task learning (MTL) with DocTimeRel.

Model            Single  MTL
AFTER            0.86    0.83
BEFORE           0.88    0.89
BEFORE/OVERLAP   0.63    0.56
OVERLAP          0.89    0.85
TIMEX            0.98    0.98
OVERALL          0.88    0.86

Table 2: Model performance in F1-scores of temporal statuses on the colon cancer test set. Single: One-pass+Pooling for a single dtr task; MTL: One-pass+Pooling for the twin tasks CONTAINS and dtr.

Model                  P      R      F1
Lin et al. (2019)      0.473  0.700  0.565
One-pass+Pooling       0.506  0.643  0.566
One-pass+Pooling+MTL   0.545  0.624  0.582

Table 3: Model performance of CONTAINS relation on the brain cancer test set.
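To make Eqs. (1)–(3) concrete, the following is a minimal NumPy sketch of the two classification heads and the combined loss. The dimensions, random weights, entity spans, and gold labels are illustrative only; in the actual model the token embeddings come from the fine-tuned BERT encoder with entity-aware self-attention.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z = 8                # BERT embedding size (768 in practice; small here)
n_rel, n_dtr = 3, 5    # CONTAINS / CONTAINED-BY / NONE; 4 dtr values + TIMEX

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Stand-in for the last BERT layer's token embeddings for one window.
tokens = rng.normal(size=(20, d_z))

# Entity embeddings e_i: average-pool each entity's token span ("local pool").
spans = [(2, 4), (7, 8), (12, 15)]                 # illustrative entity spans
ents = [tokens[a:b].mean(axis=0) for a, b in spans]

# Global context G: average-pool over the entire input sequence.
G = tokens.mean(axis=0)

# Eq. (1): relation head over [G : e_i : e_j] for a candidate entity pair.
W_L, b_L = rng.normal(size=(n_rel, 3 * d_z)), np.zeros(n_rel)
def relation_probs(e_i, e_j):
    return softmax(W_L @ np.concatenate([G, e_i, e_j]) + b_L)

# Eq. (2): dtr head over [G : e_i] for a single entity.
W_D, b_D = rng.normal(size=(n_dtr, 2 * d_z)), np.zeros(n_dtr)
def dtr_probs(e_i):
    return softmax(W_D @ np.concatenate([G, e_i]) + b_D)

# Eq. (3): combined loss for one candidate pair; alpha balances the tasks.
def nll(p, gold):
    return -np.log(p[gold])

alpha = 0.01
r_gold, dtr_i_gold, dtr_j_gold = 0, 1, 1           # illustrative gold labels
loss = nll(relation_probs(ents[0], ents[1]), r_gold) + alpha * (
    nll(dtr_probs(ents[0]), dtr_i_gold) + nll(dtr_probs(ents[1]), dtr_j_gold))
```

Note that a single pass produces `tokens` once per window, after which every candidate pair reuses the same encoding; that reuse is the source of the one-pass savings quantified in Section 3.3.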
The best model is selected on the Colon cancer development set and tested on the Colon cancer test set, and on the THYME Brain cancer test set for portability assessment.

¹ https://github.com/helloeve/mre-in-one-pass

3.2 Results on THYME

Table 1 shows the performance of our one-pass models for the CONTAINS task on the Clinical TempEval colon cancer test set. The one-pass (OP) model alone obtains an F1 score of 0.659. Adding the [CLS] token as the global context vector increases the F1 score to 0.669. Using a globally average-pooled context vector G instead of [CLS] improves performance to 0.680, better than the multi-pass model without silver instances (Lin et al., 2019). Applying the MTL setting, the one-pass twin-task (CONTAINS and DocTimeRel) model without any silver data reaches 0.686 F1, which is on par with the multi-pass model trained with additional silver instances on the CONTAINS task, 0.684 F1 (Lin et al., 2019).

Table 2 shows the performance of our one-pass models for the DocTimeRel task on the Clinical TempEval colon cancer test set. The single-task model achieves 0.88 weighted average F1, while the MTL model compromises the performance to 0.86 F1. Of note, this result is not directly comparable to the Bethard et al. (2016) results because the Clinical TempEval evaluation script does not take into account whether an entity is correctly recognized as a time expression (TIMEX). There are two types of entities in the THYME annotation: events and time expressions (TIMEX). The Bethard et al. (2016) evaluation of DocTimeRel was focused on all events, and classified each event into one of four DocTimeRel types. Our evaluation was for all entities: for a given entity, we classify it as a TIMEX or an event; if it is an event, we classify it into one of the four DocTimeRel types, for a total of five classes.

Table 3 shows the portability of our one-pass models on the THYME brain cancer test set. Without any tuning on brain cancer data, the MTL model with global pooling performs at 0.582 F1, which is better than the multi-pass model trained with additional silver instances (0.565 F1) reported in Lin et al. (2019), trading roughly equal amounts of precision for recall to obtain a better balance. Without MTL, the one-pass CONTAINS model with global context embeddings (One-pass+Pooling) achieves 0.566 F1 on the brain cancer test set, significantly lower than the MTL model (using a Wilcoxon signed-rank test over document-by-document comparisons, as in Cherry et al. (2013); p-value = 0.01962).

3.3 Computational Efficiency

Table 4 shows the computational burden for the different models in terms of floating point operations (flops). The flops are derived from TensorFlow's profiling tool on saved model graphs.

Model               flops/inst    inst#   Ratio
OP                  218,767,889   20k     1
OP+MTL              218,783,260   20k     1
Multi-pass          218,724,880   427k    23
Multi-pass+Silver   218,724,880   497k    25

Table 4: Computational complexity in flops per instance (flops/inst) × total number of instances (inst#).

The second column is the flops per training instance; the third column lists the number of instances for the different model settings. The total computational complexity for one training epoch is thus the product of columns 2 and 3. The Ratio column is the ratio of total complexity relative to the OP total flops.

For relation extraction, all entities within a sequence must be paired. If there are n entities in a token sequence, there are n × (n − 1)/2 ways to combine those entities into relation candidates. The multi-pass model would encode the same sequence n × (n − 1)/2 times, while the one-pass model encodes it only once and adds the pairing computation on top of the BERT encoding represented in Figure 1, with a very minor increase in computation per instance (about 43K flops); the MTL model adds another 15K flops; but all models are of the same per-instance magnitude, ~219M flops (Table 4). The one-pass models save a large number of passes over the training instances, 20k vs. 497k, which results in a significant difference in computational load, 1 vs. 25 — potentially a difference of several hours versus several days in GPU time. The exact number of training instances processed by the one-pass model is affected by the Token-Window and Entity-Window hyper-parameters. However, even in the worst-case scenario, when the Token-Window is set to 100 and the Entity-Window is set to 8, there are 108K training instances for the one-pass model, still substantially fewer than what is used for the multi-pass model. In addition, since the one-pass models do not run the extra steps used for generating silver instances (Lin et al., 2019), the time savings are even greater.

4 Discussion

From rows 3–5 of Table 1, we can see that a sequence-wise embedding, either the global pooling G or [CLS], is important for clinical temporal relation extraction, which involves long-distance relations that may cross multiple natural sentences. Entity embeddings are good for tasks that focus on short-distance relations (such as Gábor et al. (2018)), but may not be sufficient for picking up enough context for long-distance relations.

Combining MTL with a one-pass mechanism produces a more efficient and generalizable model. With merely 15K additional flops (Table 4, rows 1 and 2), the model achieves high performance on both tasks. However, we found that it is hard for both tasks to reach top performance simultaneously. If the weight on the dtr loss is increased, the dtr F1 increases at the cost of the CONTAINS scores. Even though the majority of entities in CONTAINS relations have aligned dtr values (e.g., in Figure 2 (#1), both entities have the matching dtr value AFTER), some relations do have conflicting dtr values. For example, in Figure 2 (#2), the dtr for "screening" is BEFORE, while "tests" is BEFORE/OVERLAP (the present perfect tense signifies that the tests happened in the past but last through the present, hence BEFORE/OVERLAP). Even though it is a gold CONTAINS annotation, the model may be confused by an event that happened in the past ("screening") containing another event ("tests") that is longer than its temporal scope. Due to these conflicts, we pick the more challenging CONTAINS task as our priority and set \alpha relatively low (0.01) in order to optimize the model towards the CONTAINS task, ignoring some of the dtr errors or conflicts. In the meantime, the MTL setting does help prevent the model from overfitting to one specific task, thus achieving some level of generalization. The significant 1.6% increase in F1-score on the Brain test set in Table 3 demonstrates the improved generalizability.

In conclusion, we built a "green" model for a challenging problem. Deployed on a single GPU with 25 times better efficiency, it succeeded on both temporal tasks, achieved better generalizability, and is suited to other pre-trained models (Liu et al., 2019; Alsentzer et al., 2019; Beltagy et al., 2019; Lan et al., 2019; Yang et al., 2019, etc.).
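The n × (n − 1)/2 counting argument from Section 3.3 — one BERT encoding per window versus one per candidate pair — reduces to two one-line functions. The Entity-Window of 8 below is the worst case quoted in the paper; the functions themselves are an illustrative sketch, not the released code:

```python
def candidate_pairs(n_entities):
    """All unordered entity pairs in one token window: n(n-1)/2."""
    return n_entities * (n_entities - 1) // 2

def bert_encodings(n_entities, one_pass):
    """BERT forward passes needed to score every pair in one window:
    the multi-pass model re-encodes the window once per candidate pair,
    while the one-pass model encodes it exactly once."""
    return 1 if one_pass else candidate_pairs(n_entities)

pairs = candidate_pairs(8)                   # 28 candidate pairs
multi = bert_encodings(8, one_pass=False)    # 28 window encodings
single = bert_encodings(8, one_pass=True)    # 1 window encoding
```

Since the pairing computation added on top of a single encoding is negligible relative to a BERT forward pass (tens of kiloflops versus hundreds of megaflops), the encoding count dominates the total cost.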
[Figure 2: CONTAINS relations with matching (#1) and conflicting (#2) DocTimeRel values.
#1: "Subsequent staging will include MRI of the pelvis..." — both entities AFTER.
#2: "His colon cancer screening in the past has been fecal occult blood tests yearly since the age of 55..." — screening: BEFORE; tests: BEFORE/OVERLAP.]

Acknowledgments

The study was funded by R01LM10090, R01GM114355 and UG3CA243120 from the United States National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We thank the anonymous reviewers for their valuable suggestions and criticism. The Titan Xp GPU used for this research was donated by the NVIDIA Corporation.

References

Emily Alsentzer, John R Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3606–3611.

Steven Bethard, Leon Derczynski, Guergana Savova, James Pustejovsky, and Marc Verhagen. 2015. SemEval-2015 task 6: Clinical TempEval. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 806–814.

Steven Bethard, Guergana Savova, Wei-Te Chen, Leon Derczynski, James Pustejovsky, and Marc Verhagen. 2016. SemEval-2016 task 12: Clinical TempEval. Proceedings of SemEval, pages 1052–1062.

Steven Bethard, Guergana Savova, Martha Palmer, James Pustejovsky, and Marc Verhagen. 2017. SemEval-2017 task 12: Clinical TempEval. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 563–570.

Colin Cherry, Xiaodan Zhu, Joel Martin, and Berry de Bruijn. 2013. À la recherche du temps perdu: extracting temporal relations from medical text in the 2012 i2b2 NLP challenge. Journal of the American Medical Informatics Association, 20(5):843–848.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dmitriy Dligach, Timothy Miller, Chen Lin, Steven Bethard, and Guergana Savova. 2017. Neural temporal relation extraction. EACL 2017, page 746.

Kata Gábor, Davide Buscaldi, Anne-Kathrin Schumann, Behrang QasemiZadeh, Haïfa Zargayouna, and Thierry Charnois. 2018. SemEval-2018 task 7: Semantic relation extraction and classification in scientific papers. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 679–688, New Orleans, Louisiana. Association for Computational Linguistics.

Diana Galvan, Naoaki Okazaki, Koji Matsuda, and Kentaro Inui. 2018. Investigating the challenges of temporal relation extraction from clinical text. In Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis, pages 55–64.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics.

Tuur Leeuwenberg and Marie-Francine Moens. 2017. Structured learning for temporal relation extraction from clinical records. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics.

Chen Lin, Timothy Miller, Dmitriy Dligach, Hadi Amiri, Steven Bethard, and Guergana Savova. 2018. Self-training improves recurrent neural networks performance for temporal relation extraction. In Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis, pages 165–176.

Chen Lin, Timothy Miller, Dmitriy Dligach, Steven Bethard, and Guergana Savova. 2017. Representations of time expressions for temporal relation extraction with convolutional neural networks. BioNLP 2017, pages 322–327.

Chen Lin, Timothy Miller, Dmitriy Dligach, Steven Bethard, and Guergana Savova. 2019. A BERT-based universal model for both within- and cross-sentence clinical temporal relation extraction. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 65–71.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems 32, pages 5754–5764.
James Pustejovsky, José M Castaño, Robert Ingria, Roser Saurí, Robert J Gaizauskas, Andrea Setzer, Graham Katz, and Dragomir R Radev. 2003. TimeML: Robust specification of event and temporal expressions in text. New directions in question answering, 3:28–34.

James Pustejovsky and Amber Stubbs. 2011. Increasing informativeness in temporal annotation. In Proceedings of the 5th Linguistic Annotation Workshop, pages 152–160. Association for Computational Linguistics.

Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, and Christopher G Chute. 2010. Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5):507–513.

Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. 2019. Green AI. arXiv preprint arXiv:1907.10597.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650.

William F Styler IV, Steven Bethard, Sean Finan, Martha Palmer, Sameer Pradhan, Piet C de Groen, Brad Erickson, Timothy Miller, Chen Lin, Guergana Savova, et al. 2014. Temporal annotation in the clinical domain. Transactions of the Association for Computational Linguistics, 2:143–154.

Weiyi Sun, Anna Rumshisky, and Özlem Uzuner. 2013. Evaluating temporal relations in clinical text: 2012 i2b2 challenge. Journal of the American Medical Informatics Association, 20(5):806–813.

Julien Tourille, Olivier Ferret, Aurélie Névéol, and Xavier Tannier. 2017. Neural architecture for temporal relation extraction: A bi-LSTM approach for detecting narrative containers. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 224–230.

Haoyu Wang, Ming Tan, Mo Yu, Shiyu Chang, Dakuo Wang, Kun Xu, Xiaoxiao Guo, and Saloni Potdar. 2019. Extracting multiple-relations in one-pass with pre-trained transformers. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1371–1377.
Experimental Evaluation and Development of a Silver-Standard for the MIMIC-III Clinical Coding Dataset

Thomas Searle1, Zina Ibrahim1, Richard JB Dobson1,2
1Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, U.K.
2Institute of Health Informatics, University College London, London, U.K.
{firstname.lastname}@kcl.ac.uk

Abstract

Clinical coding is currently a labour-intensive, error-prone, but critical administrative process whereby hospital patient episodes are manually assigned codes by qualified staff from large, standardised taxonomic hierarchies of codes. Automating clinical coding has a long history in NLP research and has recently seen novel developments setting new state-of-the-art results. A popular dataset used in this task is MIMIC-III, a large intensive care database that includes clinical free-text notes and associated codes. We argue for reconsideration of the validity of MIMIC-III's assigned codes, which are often treated as gold-standard, especially as MIMIC-III has not undergone secondary validation. This work presents an open-source, reproducible experimental methodology for assessing the validity of codes derived from EHR discharge summaries. We exemplify the methodology with MIMIC-III discharge summaries and show that the most frequently assigned codes in MIMIC-III are under-coded by up to 35%.

1 Introduction

Clinical coding is the process of translating statements, written by clinicians in natural language to describe a patient's complaint, problem, diagnosis and treatment, into an internationally-recognised coded format (World Health Organisation, 2011). Coding is an integral component of healthcare and provides standardised means for reimbursement, care administration, and for enabling epidemiological studies using electronic health record (EHR) data (Henderson et al., 2006).

Manual clinical coding is a complex, labour-intensive, and specialised process. It is also error-prone due to the subtleties and ambiguities common in clinical text and the often strict timelines imposed on coding encounters. The annual cost of clinical coding is estimated to be $25 billion in the US alone (Farkas and Szarvas, 2008).

To alleviate the burden of the status quo of manual coding, several machine learning (ML) automated coding models have been developed (Larkey and Croft, 1996; Aronson et al., 2007; Farkas and Szarvas, 2008; Perotte et al., 2014; Ayyar et al., 2016; Baumel et al., 2018; Mullenbach et al., 2018; Falis et al., 2019). However, despite continued interest, translation of ML systems into real-world deployments has been limited. An important factor contributing to the limited translation is the fluctuating quality of the manually-coded real hospital data used to train and evaluate such systems, where large margins of error are a direct consequence of the difficulty and error-prone nature of manual coding. To our knowledge, the literature contains only two systematic evaluations of the quality of clinically-coded data, both based on UK trusts, showing accuracy ranging between 50% and 98% (Burns et al., 2012) and error rates between 1% and 45.8% (CHKS Ltd, 2014) respectively. In Burns et al. (2012), the actual accuracy is likely to be lower because the reviewed trusts used varying statistical evaluation methods, validation sources (clinical text vs clinical registries), sampling modes for accuracy estimation (random vs non-random), and qualities of validators (qualified clinical coders vs lay people). CHKS Ltd (2014) highlight that 48% of the reviewed trusts used discharge summaries alone or as the primary source for coding an encounter, to minimise the amount of raw text used for code assignment. However, further portions of the documented encounter are often needed to assign codes accurately.

The Medical Information Mart for Intensive Care (MIMIC-III) database (Johnson et al., 2016) is the largest free resource of hospital data and constitutes a substantial portion of the training data of automated coding models. Nevertheless, MIMIC-III is significantly under-coded for specific conditions (Kokotailo and Hill, 2005), and has been shown to exhibit reproducibility issues in the problem of
Proceedings of the BioNLP 2020 workshop, pages 76–85, Online, July 9, 2020. © 2020 Association for Computational Linguistics

Therefore, serious consideration is needed when using MIMIC-III to train automated coding solutions.

In this work, we seek to understand the limitations of using MIMIC-III to train automated coding systems. To our knowledge, no work has attempted to validate the MIMIC-III clinical coding dataset for all admissions and codes, due to the time-consuming and costly nature of the endeavour. To illustrate the burden: two clinical coders, each working 38 hours a week, re-coding all 52,726 admission notes at a rate of 5 minutes and $3 per document, would amount to ~$316,000 and ~115 weeks of work for a 'gold standard' dataset. Even then, documents with a low inter-annotator agreement would undergo a final coding round by a third coder, further raising the approximate cost and stretching the timeline well beyond the ~115 weeks.

We present an experimental evaluation of coding coverage in the MIMIC-III discharge summaries. The evaluation uses text extraction rules and a validated biomedical named entity recognition and linking (NER+L) tool, MedCAT (Kraljevic et al., 2019), to extract ICD-9 codes, reconciling them with those already assigned in MIMIC-III. The training and experimental setup yield a reproducible open-source procedure for building silver-standard coding datasets from clinical notes. Using this approach, we produce a silver-standard dataset for ICD-9 coding based on MIMIC-III discharge summaries.

This paper is structured as follows: Section 2 reviews essential background and related work in automated clinical coding, with a particular focus on MIMIC-III. Section 3 presents our experimental setup and the semi-supervised development of a silver-standard dataset of clinical codes derived from unstructured EHR data. The results are presented in Section 4, while Section 5 discusses the wider impact of the results and future work.

2 Background

2.1 Clinical Coding Overview

The International Statistical Classification of Diseases and Related Health Problems (ICD) provides a hierarchical taxonomic structure of clinical terminology to classify morbidity data (World Health Organisation, 2011). The framework provides consistent definitions across global health care services to describe adverse health events including illness, injury and disability. Broadly, patient encounters with health services result in a set of clinical codes that directly correlate to the care provided.

Top-level ICD codes represent the highest level of the hierarchy, with ICD-9 and ICD-10 (ICD-10 being the later version) listing 19 and 21 chapters respectively. Clinically meaningful hierarchical subdivisions of each chapter provide further specialisation of a given condition.

Coding clinical text results in the assignment of a single primary diagnosis and further secondary diagnosis codes (World Health Organisation, 2011). The complexity of coding encounters largely stems from the substantial number of available codes. For example, ICD-10-CM, the US-specific extension to the standard ICD-10, includes 72,000 codes. Although a significant portion of the hierarchy corresponds to rare conditions, 'common' conditions to code still number in the thousands.

Moreover, clinical text often contains specialist terminology, spelling mistakes, implicit mentions, abbreviations and bespoke grammatical rules. However, even qualified clinical coders are not permitted to infer codes that are not explicitly mentioned within the text. For example, a diagnostic test result that indicates a condition (with the condition not explicitly written), or a diagnosis that is written as 'questioned' or 'possible', cannot be coded.

Another factor contributing to the laborious nature of coding is the large amount of duplication present in EHRs, a result of features such as copy & paste being made available to clinical staff. It has been reported that 20-78% of clinicians duplicate sections of records between notes (Bowman, 2013), subsequently producing an average data redundancy of 75% (Zhang et al., 2017).

2.2 MIMIC-III - a Clinical Coding Database

MIMIC-III (Johnson et al., 2016) is a de-identified database containing data from the intensive care unit of the Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA, collected 2001-12. MIMIC-III is the world's largest resource of freely-accessible hospital data and contains demographics, laboratory test results, procedures, medications, caregiver notes, imaging reports, admission and discharge summaries, as well as mortality (both in and out of hospital) data for 52,726 critical care patients. MIMIC provides an open-source platform for researchers to work on real patient data. At the time of writing, MIMIC-III has over 900 citations.
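The re-coding cost estimate quoted above follows from simple arithmetic; a quick sketch, taking all constants from the text and assuming each note is coded once by each of the two coders working in parallel:

```python
# Back-of-the-envelope cost of double-coding MIMIC-III discharge notes.
# Constants come from the text; the double-coding assumption is ours.
NOTES = 52_726          # admission notes in MIMIC-III
CODERS = 2              # each note coded by both coders
MINUTES_PER_DOC = 5
DOLLARS_PER_DOC = 3
HOURS_PER_WEEK = 38     # per coder

cost = NOTES * DOLLARS_PER_DOC * CODERS            # total dollar cost
hours = NOTES * MINUTES_PER_DOC * CODERS / 60      # total coder-hours
weeks = hours / (CODERS * HOURS_PER_WEEK)          # coders work in parallel

print(f"${cost:,} over {weeks:.1f} weeks")  # → $316,356 over 115.6 weeks
```

This reproduces the ~$316,000 and ~115 weeks figures, before any third-round adjudication.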
2.3 Automated Clinical Coding

Early ML work on automated clinical coding considered ensembles of simple text classifiers to predict codes from discharge summaries (Larkey and Croft, 1996). Rule-based models have also been formulated by directly replicating coding manuals. A prominent example of rule-based models is the BioNLP 2007 shared task (Aronson et al., 2007), which supplied a gold-standard labelled dataset of radiology reports. The dataset continues to be used to train and validate ML coding. For example, Kavuluru et al. (2015) used the dataset in addition to two US-based hospital EHRs. Although the two additional datasets used by Kavuluru et al. (2015) were not validated to a gold standard, they are reflective of the diversity found in clinical text. Their largest dataset contained 71,463 records, 60,238 distinct code combinations, and had an average document length of 5,303 words.

The majority of automated coding systems are trained and tested using MIMIC-III. Perotte et al. (2014) trained hierarchical support vector machine models on the MIMIC-II EHR (Saeed et al., 2011), the earlier version of MIMIC. The models were trained using the full ICD-9-CM terminology, creating baseline results for subsequent models of 0.395 F1-micro. Ayyar et al. (2016) used a long short-term memory (LSTM) neural network to predict ICD-9 codes in MIMIC-III. However, Ayyar et al. (2016) cannot be directly compared to the former methods, as the model only predicts the nineteen top-level codes.

Methodological developments continued to use MIMIC-III, with tree-of-sequence LSTMs (Xie and Xing, 2018), hierarchical attention gated recurrent unit (HA-GRU) neural networks (Baumel et al., 2018) and convolutional neural networks with attention (CAML) (Mullenbach et al., 2018). The HA-GRU and CAML models were directly compared with Perotte et al. (2014), achieving 0.405 and 0.539 F1-micro respectively. A recent empirical evaluation of ICD-9 coding methods predicted the top fifty ICD-9 codes from MIMIC-III, suggesting condensed memory networks as a superior network topology (Huang et al., 2018).

3 Semi-Supervised Extraction of Clinical Codes

In this section, we describe the data preprocessing, methodology and experimental design for evaluating the coding quality of MIMIC-III discharge summaries. We also describe the semi-supervised creation of a silver-standard dataset of clinical codes from unstructured EHR text based on MIMIC-III discharge summaries.

3.1 Data Preparation

Discharge summary reports provide an overview of a given hospital episode. Automated coding systems often use only discharge reports, as they contain the salient diagnostic text (Perotte et al., 2014; Baumel et al., 2018; Mullenbach et al., 2018) without overburdening the model. MIMIC-III discharge summaries are categorised distinctly from other clinical text. The text is often structured with section headings and content section delimiters such as line breaks. We identify Discharge Diagnosis (DD) sections in the majority of discharge summary reports (92%, n=48,898) using a simple rule-based approach. These sections are lists of diagnoses assigned to the patient during admission. Xie and Xing (2018) previously used these sections to develop a matching algorithm from discharge diagnoses to ICD code descriptions, with moderate success, demonstrating state-of-the-art sensitivity (0.29) and specificity (0.33) scores. For the 8% (n=3,828) that are missing these sections, we manually inspect a handful of examples and observe instances of patient death and administration errors. The SQL procedures used to extract the raw data from a locally built replica of the MIMIC-III database, and the extraction logic for DDs, are available open-source as part of this wider analysis[1].

Table 1 lists example extracted DDs. There is large variation in structure, use of abbreviations, and extensive use of clinical terms. Some DDs list the primary diagnosis alongside secondary diagnoses, whereas others simply list a set of conditions.

3.2 Semi-Supervised Named Entity Recognition and Linkage Tool

We use MedCAT (Kraljevic et al., 2019), a pre-trained named entity recognition and linking (NER+L) model, to identify and extract the corresponding ICD codes in a discharge summary note. MedCAT utilises a fast dictionary-based algorithm for direct text matches, and a shallow neural network to learn fixed-length distributed semantic vectors for ambiguous text spans. The method is conceptually similar to Word2Vec (Mikolov et al., 2013) in that word representations are learnt by detecting correlations of context words, and the learnt vectors exhibit the semantics of the underlying words.

[1] https://tinyurl.com/t7dxn3j
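The simple rule-based DD-section identification described in Section 3.1 could look something like the following sketch; the heading variants and the heading-like stopping heuristic are our own illustrative assumptions, not the authors' exact rules:

```python
import re

# Hypothetical sketch of rule-based Discharge Diagnosis (DD) extraction:
# capture text between a "Discharge Diagnosis:"-style heading and the next
# heading-like line. Heading patterns here are illustrative only.
DD_HEADING = re.compile(
    r"^(discharge diagnos(?:is|es)|final diagnos(?:is|es))\s*:?\s*$",
    re.IGNORECASE,
)

def extract_dd(note):
    """Return the DD subsection of a discharge summary, or None."""
    lines = note.splitlines()
    for i, line in enumerate(lines):
        if DD_HEADING.match(line.strip()):
            section = []
            for nxt in lines[i + 1:]:
                # stop at the next section heading, e.g. "Discharge Condition:"
                if re.match(r"^[A-Z][A-Za-z ]+:\s*$", nxt.strip()):
                    break
                section.append(nxt)
            return "\n".join(section).strip() or None
    return None

note = """Discharge Diagnosis:
CAD now s/p CABG
HTN, DM, Osteoarthritis
Discharge Condition:
Stable"""
print(extract_dd(note))  # prints the two diagnosis lines
```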
The tool can be trained in either an unsupervised or a supervised manner. However, unlike Word2Vec, which learns a single representation for each word, MedCAT enables the learning of 'concept' representations by accommodating synonymous terms, abbreviations or alternative spellings.

We use a MedCAT model pre-loaded with the Unified Medical Language System (UMLS) (Bodenreider, 2004). UMLS is a meta-thesaurus of medical ontologies that provides rich synonym lists that can be used for recognition and disambiguation of concepts. Mappings from UMLS to the ICD-9[2] taxonomy are then used to map extracted UMLS concepts to ICD codes. Our large pre-trained MedCAT UMLS model contains ~1.6 million concepts. This model cannot be made publicly available due to constraints of the UMLS license, but can be trained in an unsupervised manner in ~1 week on MIMIC-III with standard CPU-only hardware[3].

In an effort to keep our analysis tractable, we limit our MedCAT model to extract only the 400 ICD-9 codes that occur most frequently in the dataset. This equates to 76% (n=482,379) of total assigned codes (n=634,709). We exclude the other 6,441 codes that occur less frequently. Future work could consider including more of these codes.

Table 1: Example discharge diagnosis subsections extracted from MIMIC-III discharge summaries

Extracted Discharge Diagnosis | Admission ID
CAD now s/p CABG; HTN, DM, Osteoarthritis, Dyslipidemia | 102894
Left convexity, tentorial, parafalcine subdural hematoma | 161919
Primary Diagnoses: 1. Acute ST segment Elevation Myocardial Infarction. Secondary Diagnoses: 1. Hypertension 2. Hyperlipidemia | 152382
Seizures. | 132065

3.3 Code Prediction Datasets

We run our MedCAT model over each extracted DD subsection. The model assigns each token or sequence of tokens a UMLS code and therefore an associated ICD code. In our comparison of the MedCAT-produced annotations with the MIMIC-III assigned codes, we have three distinct datasets:

1. MedCAT does not identify a concept and the code has been assigned in MIMIC-III. Denoted A_NP for 'Assigned, Not Predicted'.

2. MedCAT identifies a concept and this matches an assigned code in MIMIC-III. Denoted P_A for 'Predicted, Assigned'.

3. MedCAT identifies a concept and this does not match an assigned code in MIMIC-III. Denoted P_NA for 'Predicted, Not Assigned'.

We do not consider the case where both MedCAT and the existing MIMIC-III assigned codes have missed an assignable code, as this would involve manual validation of all notes and, as previously discussed, is infeasible for a dataset of this size.

3.3.1 Producing the Silver Standard

Given the above initial datasets, we produce our final silver-standard clinical coding dataset by:

1. Sampling from the missing predictions dataset (A_NP) to manually collect annotations where our out-of-the-box MedCAT model fails to recognise diagnoses.

2. Fine-tuning our MedCAT model with the collected annotations and re-running on the entire DD subsection dataset, producing updated A_NP, P_A and P_NA datasets.

3. Sampling from P_NA and P_A and annotating predicted diagnoses to validate the correctness of the MedCAT-predicted codes.

4. Excluding any codes that fail the manual validation step, as they are not trustworthy predictions made by MedCAT.

We use the MedCATTrainer annotator (Searle et al., 2019) both to collect annotations (stage 1) and to validate predictions from MedCAT (stage 3). To collect annotations, we manually inspect 10 randomly sampled predictions for each of the 400 unique codes from A_NP and add further acronyms, abbreviations, synonyms, etc. for diagnoses if they are present in the DD subsection, to improve the underlying MedCAT model.

[2] https://bioportal.bioontology.org/ontologies/ICD9CM
[3] https://tinyurl.com/yadtnz3w
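The three comparison datasets defined in Section 3.3 reduce to set algebra per admission between the MedCAT-predicted codes and the MIMIC-III-assigned codes; a minimal sketch with made-up ICD-9 codes:

```python
# Per-admission partition of codes into the paper's three comparison sets.
# The code values below are arbitrary examples, not real results.
def partition(predicted, assigned):
    return {
        "A_NP": assigned - predicted,   # Assigned, Not Predicted
        "P_A": predicted & assigned,    # Predicted, Assigned
        "P_NA": predicted - assigned,   # Predicted, Not Assigned
    }

parts = partition(predicted={"4019", "4280", "0389"},
                  assigned={"4019", "41404", "4280"})
print(parts)  # A_NP holds '41404', P_A holds '4019' and '4280', P_NA holds '0389'
```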
To validate predictions from P_A and P_NA, we use the MedCATTrainer annotator to inspect 10 randomly sampled predictions for each of the 179 and 182 unique codes respectively found. We mark each prediction made by MedCAT as correct or incorrect and report results in Section 4.1. Annotations are available upon request given the appropriate MIMIC-III licenses.

4 Results

The following section presents the distribution of manually collected annotations from sampling A_NP, our validation of the updated P_A and P_NA post MedCAT fine-tuning, and the final distribution of codes found in our produced silver-standard dataset.

Adding annotations to selected text spans directly adds the spans to the MedCAT dictionary, thereby ensuring that further text spans of the same content are annotated by the model, if the text span is unique. We collect 864 annotations after reviewing 4,000 randomly sampled DD notes from the A_NP (Assigned, Not Predicted) dataset. 21.6% of DDs provide further annotations, suggesting that the majority of missed codes lie outside the DD subsection or are incorrectly assigned.

Figure 1: The distributions of manually annotated ICD-9 codes and the assigned codes in MIMIC-III, grouped by top-level ICD-9 axis.

Figure 1 shows the distributions of manually collected code annotations and the current MIMIC-III set of clinical codes, grouped by their top-level axis as specified by the ICD-9-CM hierarchy. We collect proportionally consistent annotations for most groups, including the 390-459 chapter (Diseases of the Circulatory System), which is the top-occurring group on both scales. However, for groups such as 240-279 (endocrine, nutritional and metabolic diseases) and 460-519 (diseases of the respiratory system), we see proportionally fewer manually collected examples despite the high number of occurrences of codes assigned within MIMIC-III. We explain this by the DD subsection lacking appropriate detail to assign the specific code. For example, codes under 250.* for diabetes mellitus and the various forms of complications are assigned frequently, but often lack the appropriate level of detail specifying the type, control status and manifestation of the complication.

Using the manual amendments made on the 864 new annotations, we re-run the MedCAT model on the entire DD subsection dataset, producing updated P_NA, P_A and A_NP datasets. We acknowledge that A_NP likely still includes cases of abbreviations and synonyms, as we only subsampled 10 documents per code, allowing for further improvements to the model.

The MedCAT fine-tuning process was run until convergence as measured by precision, recall and F1, achieving scores of 0.90, 0.92 and 0.91 respectively on a held-out test set with an 80/20 train/test split. The fine-tuning code is made available[4].

4.1 P_A & P_NA Validation

We use the MedCATTrainer interface to validate our MedCAT model predictions in the 'Predicted, Assigned' (P_A) and 'Predicted, Not Assigned' (P_NA) datasets. We sample (a maximum of) 10 unique predictions for each ICD code, resulting in 179 and 182 ICD-9 codes and 1,588 and 1,580 manually validated predictions from P_A and P_NA respectively. The validation of code assignment is performed by a medical informatics PhD student with no professional clinical coding experience and a qualified clinical coder, marking each term as correct or incorrect. We achieve good agreement, with Cohen's Kappa of 0.85 and 0.8, resulting in 95.51% and 87.91% marked correct for P_A and P_NA respectively. We exclude from further experiments all codes that fail this validation step, as they are not trustworthy predictions made by MedCAT.

4.2 Aggregate Assigned Codes & Silver Standard

We proportionally predict ~10% (n=42,837) of total assigned codes (n=432,770). We predict ~16% of total assigned codes (n=258,953) if we only consider the 182 codes that resulted in at least one matched assignment to those present in the MIMIC-III assigned codes.

[4] https://github.com/tomolopolis/MIMIC-III-Discharge-Diagnosis-Analysis/blob/master/Run MedCAT.ipynb
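The inter-annotator agreement in Section 4.1 is reported as Cohen's Kappa; a minimal reference implementation of the statistic, on toy correct/incorrect marks rather than the paper's data:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # chance agreement from each rater's marginal label frequencies
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / n**2
    return (observed - expected) / (1 - expected)

# Toy marks from two annotators (10 predictions each):
r1 = ["correct"] * 8 + ["incorrect"] * 2
r2 = ["correct"] * 7 + ["incorrect"] * 3
print(round(cohens_kappa(r1, r2), 2))
```

Here observed agreement is 0.9 and chance agreement 0.62, giving kappa ≈ 0.74; the paper's 0.85 and 0.8 indicate stronger agreement than this toy case.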
We label and gather our three datasets into a single table with an extra column called 'validated', with values: 'yes' for codes that have matched with an assigned code (P_A), 'new code' for newly discovered codes (P_NA), and 'no' for codes that we were not able to validate (A_NP). We have made this silver-standard dataset available alongside our analysis code[5].

4.3 Undercoding in MIMIC-III

This work aims to identify inconsistencies and variability in coding accuracy in the current MIMIC-III dataset. Ultimately, to rigorously identify undercoding of clinical text, full double-blind manual coding would be performed. However, as previously discussed, this is prohibitively expensive.

Comparing the codes predicted by MedCAT to the existing assigned codes enables the development of an understanding of specific groups of codes that exhibit possible undercoding. In this section we first show the effectiveness of our method in terms of DD subsection prediction coverage. We then present our predicted code distributions against the MIMIC-III assigned codes at the ICD code chapter level, highlighting the most prevalent missing codes and showing correlations between document length and prevalence.

Figure 2: Left: counts of admissions and the associated % of characters covered by MedCAT code predictions. Right: distribution of DD token lengths.

4.3.1 Prediction Coverage

MedCAT provides predictions at the text span level, with only one concept prediction per span. We can therefore calculate the breadth of coverage of our predictions across all DD subsections. Figure 2 shows the proportion of DD subsection text that is included in code predictions. We note that the 100% proportion (n=2,105) is 75% larger than the next largest, indicating that we often utilise the entire DD subsection to suggest codes, although the majority of the coverage distribution is around the 40-50% range.

We find a token length distribution of DD subsections with µ=14.54, σ=15.9, median=10 and IQR=14, and a code extraction distribution with µ=3.6, σ=3.1, median=3 and IQR=4, suggesting the DD subsections are complex and often list multiple conditions, of which we identify, on average, 3 to 4.

Figure 3: Proportions of matching predictions against the total number of assigned codes per admission.

4.3.2 Predicted & Assigned

Figure 3 shows the distributions of the number of assigned codes and the proportion of matches, grouped into buckets of 10% intervals. We see a high proportion of matches in assigned codes in the 1-40% range, indicating that although the DD subsection does contribute to the assigned ICD codes, many of the assigned codes are still missed. We exclude the admissions that had 0 matched codes and discuss this result further in Section 4.3.4.

If we order codes by the number predicted and assigned, we find that the three highest-occurring codes in MIMIC-III (4019, 41404, 4280) also rank highest in our predictions. However, we note that these three common codes only yield 25-39% of their total assigned occurrences, which could be explained by these chronic conditions not being listed in the DD subsection and being referred to elsewhere in the note. If we normalise predictions by their prevalence, we are most successful in matching specific conditions applicable to preterm newborns (7470, 7766), pulmonary embolism (41519) and liver cancer (1550), all of which we match between 55-69%, but which rank 114-305 in total prevalence. We suggest these diagnoses are either acute or the primary cause of an ICU admission, so will be specified in the DD subsection.

[5] https://tinyurl.com/u8yae8n
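The 'validated' flag described in Section 4.2 amounts to labelling each (admission, code) pair by which comparison set it fell into; a sketch with illustrative field names (not the released dataset's exact schema) and made-up codes:

```python
# Assemble silver-standard rows: each (admission, ICD-9 code) pair gets a
# 'validated' value following the scheme in Section 4.2.
# 'hadm_id' / 'icd9' are illustrative column names.
def label_rows(adm_id, parts):
    rows = []
    for code in parts["P_A"]:
        rows.append({"hadm_id": adm_id, "icd9": code, "validated": "yes"})
    for code in parts["P_NA"]:
        rows.append({"hadm_id": adm_id, "icd9": code, "validated": "new code"})
    for code in parts["A_NP"]:
        rows.append({"hadm_id": adm_id, "icd9": code, "validated": "no"})
    return rows

rows = label_rows("102894",
                  {"P_A": {"4019"}, "P_NA": {"0389"}, "A_NP": {"41404"}})
print(rows)
```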
Figure 4: Predicted, Assigned codes grouped by top-level code group vs. total assigned codes.

We also group the predicted codes into their respective top-level ICD-9 groups in Figure 4, and observe that predicted, assigned codes display a similar distribution to total assigned codes. We quantify the difference in distributions via the Wasserstein metric, or 'Earth Mover's Distance' (Ramdas et al., 2015). This metric provides a single measure to compare each of our three datasets' distributions against the current assigned code distribution. We compute a small distance of 2.7×10⁻³ between the two distributions, suggesting our method proportionally identifies previously assigned codes from the DD subsection alone.

4.3.3 Predicted & Not Assigned

This dataset highlights codes that may have been missed from the current assigned codes.

Figure 5: Predicted, Not Assigned codes grouped by top-level code group vs. total assigned codes.

Figure 5 shows that the distribution of predicted but not assigned codes is minimally different for most codes, supporting our belief that the MIMIC-III assigned codes are not wholly untrustworthy, but are likely under-coded in specific areas. From this dataset we calculate how many examples of each code have potentially been missed, or potentially under-coded. For the 10 most frequently assigned codes we see 0-35% missing occurrences. We also identify that the most frequent code, 4019 (Unspecified Essential Hypertension), has 16%, or 3,312, potentially missing occurrences.

To understand whether DD subsection length impacts the occurrence of 'missed' codes, we first calculate a Pearson correlation coefficient of 0.17 between DD subsection line length and counts of assigned codes over all admissions. This suggests a weak positive correlation between admission complexity and the number of existing assigned codes. In contrast, we find a stronger positive correlation of 0.504 between predicted-but-not-assigned codes and DD subsection line length. This implies that where an episode has a greater number of diagnoses, or the complexity of an admission is greater, more codes are likely to be missed during routine collection.

We compute the Wasserstein metric between these two distributions at 1.6×10⁻². This demonstrates a degree of similarity between the distributions, albeit 8x further than the Predicted and Assigned dataset distance presented in Section 4.3.2. We expect to see a larger distance here, as we are detecting codes that are indicated in the text but have been missed during routine code assignment.

4.3.4 Assigned & Not Predicted

We observe that the distribution of assigned but not predicted codes largely mirrors the distribution of total codes assigned in MIMIC-III, with a Wasserstein distance of 2.7×10⁻³ that is similar to the distance observed for our Predicted and Assigned dataset (Section 4.3.2). This suggests that our method is proportionally consistent at not annotating codes that have likely been assigned from elsewhere in the admission, but may also be incorrectly assigned.

5 Discussion

On aggregate, the codes predicted by our MedCAT model suggest that the discharge diagnosis sections listed in 92% of all discharge notifications are not sufficient for full coding of an episode. Unsurprisingly, this confirms that clinicians infrequently document all codeable diagnoses within the discharge summary, although, as previously stated, coders are not permitted to make clinical inferences. Therefore, to correctly assign a code, the diagnoses must be present within the documented patient episode, within the structured or unstructured data.
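The two statistics used above, Pearson correlation and the first Wasserstein distance, can be sketched on toy data as follows; the histogram-based W1 assumes unit spacing between adjacent ordered top-level groups, and all numbers below are made up:

```python
import numpy as np

# Pearson correlation between DD subsection length and missed-code counts
# (toy values standing in for per-admission measurements):
dd_lines = np.array([3, 5, 8, 12, 20, 35, 50])
missed = np.array([0, 1, 1, 3, 4, 9, 12])
r = np.corrcoef(dd_lines, missed)[0, 1]

# First Wasserstein distance between two normalised histograms over the
# same ordered support, with unit spacing between adjacent bins:
def wasserstein_1d(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    return np.abs(np.cumsum(p - q)).sum()

d = wasserstein_1d([0.30, 0.25, 0.20, 0.15, 0.10],
                   [0.28, 0.26, 0.21, 0.14, 0.11])
print(round(r, 2), round(d, 3))
```

A larger `d` means more probability mass has to be moved between the two code-group distributions, which is how the paper contrasts the 2.7×10⁻³ and 1.6×10⁻² distances.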
However, the positive correlation between document length and the number of predicted codes indicates that missed codes are more prevalent in highly complex cases with many diagnoses. From a coding workflow perspective, coders operate under strict time schedules and are required to code a minimum number of episodes each day. Therefore, it logically follows that the complexity of a case directly correlates with the number of codes missed during routine collection.

Looking at individual code groups, we find that 240-279 is not predicted proportionally with assigned codes in both P_A and P_NA. We explain this as follows. Firstly, DD subsections generally convey clinically important diagnoses for follow-up care. Certain codes such as 250.* describe diabetes mellitus with specific complications, but the DD subsection will often only describe the diagnoses 'DMI' or 'DMII'. Secondly, ICU admissions are for individuals with severe illness, who are therefore likely to have a high degree of co-morbidity. This is implied by the majority of patients (74%) being assigned between 4 and 16 codes.

We also observe that E000-E999 and V01-V99 codes are disproportionately not predicted. However, this is expected, given that both groups are supplementary codes that describe conditions or factors contributing to an admission, but would likely not be relevant for the DD subsection.

In contrast, we observe a disproportionately large number of predictions for 001-139 (Infectious and Parasitic Diseases). This is primarily driven by 0389 (Unspecified septicemia). A proportion of these predictions may be in error, as the specific form of septicemia is likely described in more detail elsewhere in the note and therefore coded as the more specific form.

5.1 Method Reproducibility & Wider Utility

In line with the suggestions of Johnson et al. (2017), the original authors of MIMIC-III, we have attempted to provide the research community with all available materials to reproduce and build upon our experiments and method for the development of silver-standard datasets. Specifically, we have made the following available as open source: the SQL scripts to extract the raw data from a replica of the MIMIC-III database; the script required to parse DD subsections; an example script to build a pre-trained MedCAT model; and the script required to run MedCAT on the DD subsections, load results into the annotator, re-run MedCAT and perform the experimental analysis, alongside outputting the silver-standard dataset[6].

[6] https://github.com/tomolopolis/MIMIC-III-Discharge-Diagnosis-Analysis

Given these materials, it is possible for researchers to replicate and build upon our method, or directly use the silver-standard dataset in future work that investigates automated clinical coding using MIMIC-III. The silver-standard dataset clearly marks whether each assigned code has been validated or not, or whether it is a new code according to our method.

6 Conclusions & Future Work

This work highlighted specific problems with using MIMIC-III as a dataset for training and testing an automated clinical coding system that would limit model performance in a real deployment.

We identified and deterministically extracted the discharge diagnosis (DD) subsections from discharge summaries. We subsequently trained an NER+L model (MedCAT) to extract ICD-9 codes from the DD subsections, comparing the results across the full set of assigned codes. We find that our method covers 47% of all tokens, considering we only take 400 of the ~7k unique codes and perform minimal data cleaning of the DD subsection. We have shown in Sections 4.3.2 and 4.3.3 that the MedCAT-predicted codes are proportionally in line with assigned codes in MIMIC-III.

Interestingly, we found a 0.504 positive correlation between DD length and the number of codes predicted by MedCAT but not assigned in MIMIC-III. This result can be understood by observing that the ICU admissions in MIMIC-III can be extremely complex, with up to 30 clinical codes assigned to a single episode. The DD subsections alone can contain up to 50 line items, indicating highly complex cases where codes could easily be missed.

We found that the code group 390-459 (Diseases of the Circulatory System) is both the most assigned group and the group of codes with the most missing predictions from our model. Furthermore, codes such as hypertension (4019), sepsis and septicemia (0389, 99591), gastrointestinal hemorrhage (95789), chronic kidney disease (5859), anemia (2859) and chronic obstructive asthma (49320) are all frequently assigned but
are also the highest-occurring conditions that appear in the DD subsection but are not assigned in the MIMIC-III dataset. This suggests that MIMIC-III exhibits specific cases of undercoding, especially for codes that occur frequently in patients but are not likely to be the primary diagnosis for an admission to the ICU.

As we only use the DD section, there are many codes that likely appear elsewhere in the note that we cannot assign. Although 92% of discharge summaries contain DD subsections, we only match ~16% of assigned codes. We suggest this is due to: our NER+L model lacking the ability to identify more synonyms and abbreviations for conditions; the DD subsections lacking enough detail to assign codes; and, on some occasions, little evidence to suggest a code assignment. Our textual span coverage, presented in Section 4.3.1, demonstrates that we often cover all available discharge diagnoses, although there is still room for improvement, as the majority of the coverage distribution is around the 50% mark.

For future work, we foresee applying the same method to either the entire discharge summary or more specific sections such as 'previous medical history' to surface chronic codeable diagnoses that could be validated against the current assigned code set. Researchers would, however, likely need to address false positive code predictions, as clinical coding requires assigned codes to be from current conditions associated with an admission.

In conclusion, this work has found that frequently assigned codes in MIMIC-III display signs of undercoding of up to 35% for some codes. With this finding, we urge researchers to continue to develop automated clinical coding systems using MIMIC-III, but also to consider using our silver-standard dataset, or building on our method to further improve the dataset.

Acknowledgments

RD's work is supported by: 1. The National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London. 2. Health Data Research UK, which is funded by the UK Medical Research Council, Engineering and Physical Sciences Research Council, Economic and Social Research Council, Department of Health and Social Care (England), Chief Scientist Office of the Scottish Government Health and Social Care Directorates, Health and Social Care Research and Development Division (Welsh Government), Public Health Agency (Northern Ireland), British Heart Foundation and Wellcome Trust. 3. The National Institute for Health Research University College London Hospitals Biomedical Research Centre. This paper represents independent research part-funded by the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London. The views expressed are those of the author(s) and not necessarily those of the NHS, MRC, NIHR or the Department of Health and Social Care.

References

Alan R. Aronson, Olivier Bodenreider, Dina Demner-Fushman, Kin Wah Fung, Vivian K. Lee, James G. Mork, Aurélie Névéol, Lee Peters, and Willie J. Rogers. 2007. From indexing the biomedical literature to coding clinical text: Experience with MTI and machine learning approaches. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, BioNLP '07, pages 105–112, Stroudsburg, PA, USA. Association for Computational Linguistics.

Sandeep Ayyar, O. B. Don, and W. Iv. 2016. Tagging patient notes with ICD-9 codes. In Proceedings of the 29th Conference on Neural Information Processing Systems.

Tal Baumel, Jumana Nassour-Kassis, Raphael Cohen, Michael Elhadad, and Noémie Elhadad. 2018. Multi-label classification of patient notes: Case study on ICD code assignment. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence.

Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res., 32(Database issue):D267–70.

Sue Bowman. 2013. Impact of electronic health record systems on information integrity: quality and safety implications. Perspect. Health Inf. Manag., 10:1c.

E. M. Burns, E. Rigby, R. Mamidanna, A. Bottle, P. Aylin, P. Ziprin, and O. D. Faiz. 2012. Systematic review of discharge coding accuracy.

CHKS Ltd. 2014. CHKS - insight for better healthcare: The quality of clinical coding in the NHS. https://bit.ly/34eU5g3. Accessed: 2019-5-10.

Matus Falis, Maciej Pajak, Aneta Lisowska, Patrick Schrempf, Lucas Deckers, Shadia Mikhael, Sotirios Tsaftaris, and Alison O'Neil. 2019. Ontological attention ensembles for capturing semantic concepts in ICD code prediction from clinical text.
84 Richard´ Farkas and Gyorgy¨ Szarvas. 2008. Automatic Aaditya Ramdas, Nicolas Garcia, and Marco Cuturi. construction of rule-based ICD-9-CM coding sys- 2015. On wasserstein two sample testing and related tems. BMC Bioinformatics, 9 Suppl 3:S10. families of nonparametric tests. Toni Henderson, Jennie Shepheard, and Vijaya Sun- Mohammed Saeed, Mauricio Villarroel, Andrew T dararajan. 2006. Quality of diagnosis and procedure Reisner, Gari Clifford, Li-Wei Lehman, George coding in ICD-10 administrative data. Med. Care, Moody, Thomas Heldt, Tin H Kyaw, Benjamin 44(11):1011–1019. Moody, and Roger G Mark. 2011. Multiparameter intelligent monitoring in intensive care II: a public- Jinmiao Huang, Cesar Osorio, and Luke Wicent Sy. access intensive care unit database. Crit. Care Med., 2018. An empirical evaluation of deep learning for 39(5):952–960. ICD-9 code assignment using MIMIC-III clinical notes. Thomas Searle, Zeljko Kraljevic, Rebecca Bendayan, Daniel Bean, and Richard Dobson. 2019. Med- Alistair E W Johnson, Tom J Pollard, and Roger G CATTrainer: A biomedical free text annotation in- Mark. 2017. Reproducibility in critical care: a mor- terface with active learning and research use case tality prediction case study. In Proceedings of the specific customisation. In Proceedings of the 2nd Machine Learning for Healthcare Conference, 2019 Conference on Empirical Methods in Natural volume 68 of Proceedings of Machine Learning Language Processing and the 9th International Research, pages 361–376, Boston, Massachusetts. Joint Conference on Natural Language Processing PMLR. (EMNLP-IJCNLP): System Demonstrations, pages 139–144, Stroudsburg, PA, USA. Association for Alistair E W Johnson, Tom J Pollard, Lu Shen, Li- Computational Linguistics. Wei H Lehman, Mengling Feng, Mohammad Ghas- semi, Benjamin Moody, Peter Szolovits, Leo An- World Health Organisation. 2011. International thony Celi, and Roger G Mark. 2016. 
MIMIC-III, Statistical Classification of Diseases and Related a freely accessible critical care database. Sci Data, Health Problems. 3:160035. Pengtao Xie and Eric Xing. 2018. A neural archi- Ramakanth Kavuluru, Anthony Rios, and Yuan Lu. tecture for automated ICD coding. In Proceedings 2015. An empirical evaluation of supervised learn- of the 56th Annual Meeting of the Association ing approaches in assigning diagnosis codes to for Computational Linguistics (Volume 1: Long electronic medical records. Artif. Intell. Med., Papers), pages 1066–1076. 65(2):155–166. Rui Zhang, Serguei V S Pakhomov, Elliot G Arso- Rae A Kokotailo and Michael D Hill. 2005. Coding niadis, Janet T Lee, Yan Wang, and Genevieve B of stroke and stroke risk factors using international Melton. 2017. Detecting clinically relevant new in- classification of diseases, revisions 9 and 10. Stroke, formation in clinical notes across specialties and set- 36(8):1776–1781. tings. BMC Med. Inform. Decis. Mak., 17(Suppl 2):68. Zeljko Kraljevic, Daniel Bean, Aurelie Mascio, Lukasz Roguski, Amos Folarin, Angus Roberts, Rebecca Bendayan, and Richard Dobson. 2019. MedCAT – medical concept annotation tool. Leah S Larkey and W Bruce Croft. 1996. Combin- ing classifiers in text categorization. In SIGIR, vol- ume 96, pages 289–297. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- rado, and Jeff Dean. 2013. Distributed representa- tions of words and phrases and their compositional- ity. In C J C Burges, L Bottou, M Welling, Z Ghahra- mani, and K Q Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc. James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. 2018. Explainable pre- diction of medical codes from clinical text. Adler Perotte, Rimma Pivovarov, Karthik Natarajan, Nicole Weiskopf, Frank Wood, and Noemie´ Elhadad. 2014. Diagnosis code assignment: models and evaluation metrics. J. Am. Med. Inform. Assoc., 21(2):231–237.
Comparative Analysis of Text Classification Approaches in Electronic Health Records
Aurelie Mascio∗  Zeljko Kraljevic∗  Daniel Bean  Richard Dobson
Robert Stewart  Rebecca Bendayan  Angus Roberts
Department of Biostatistics & Health Informatics, King's College London, UK
[email protected]  [email protected]
Abstract

Text classification tasks which aim at harvesting and/or organizing information from electronic health records are pivotal to support clinical and translational research. However, these present specific challenges compared to other classification tasks, notably due to the particular nature of the medical lexicon and language used in clinical records. Recent advances in embedding methods have shown promising results for several clinical tasks, yet there is no exhaustive comparison of such approaches with other commonly used word representations and classification models. In this work, we analyse the impact of various word representations, text pre-processing and classification algorithms on the performance of four different text classification tasks. The results show that traditional approaches, when tailored to the specific language and structure of the text inherent to the classification task, can achieve or exceed the performance of more recent ones based on contextual embeddings such as BERT.

1 Introduction

Clinical text classification is an important task in natural language processing (NLP) (Yao et al., 2019), where it is critical to harvest data from electronic health records (EHRs) and facilitate its use for decision support and translational research. Thus, it is increasingly used to retrieve and organize information from the unstructured portions of EHRs (Mujtaba et al., 2019).

Examples include tasks such as: (1) detection of smoking status (Uzuner et al., 2008); (2) classification of medical concept mentions into family versus patient related (Dai, 2019); (3) obesity classification from free text (Uzuner, 2009); (4) identification of patients for clinical trials (Meystre et al., 2019).

Most of these tasks involve mapping mentions in narrative texts (e.g. "pneumonia") to their corresponding medical concepts (and concept IDs), generally using the Unified Medical Language System (UMLS) (Bodenreider, 2004), and then training a classifier to identify these correctly (e.g. "pneumonia positive" versus "pneumonia negative") (Yao et al., 2019).

Text classification performed on medical records presents specific challenges compared to the general domain (such as newspaper texts), including dataset imbalance, misspellings, abbreviations and semantic ambiguity (Mujtaba et al., 2019). Despite recent advances in NLP, including neural-network based word representations such as BERT (Devlin et al., 2019), few approaches have been extensively tested in the medical domain, and rule-based algorithms remain prevalent (Koleck et al., 2019). Furthermore, there is no consensus on which word representation is best suited to specific downstream classification tasks (Si et al., 2019; Wang et al., 2018).

The purpose of this study is to analyse the impact of numerous word representation methods (bag-of-words versus traditional and contextual word embeddings) as well as classification approaches (deep learning versus traditional machine learning methods) on the performance of four different text classification tasks. To our knowledge this is the first paper to test a comprehensive range of word representation, text pre-processing and classification method combinations on several medical text tasks.

∗ These two authors contributed equally.
Proceedings of the BioNLP 2020 workshop, pages 86–94, Online, July 9, 2020. © 2020 Association for Computational Linguistics
2 Materials & Methods

2.1 Datasets and text classification tasks

In order to conduct our analysis we derived text classification tasks from MIMIC-III (Multiparameter Intelligent Monitoring in Intensive Care) (Johnson et al., 2016) and the Shared Annotated Resources (ShARe)/CLEF dataset (Mowery et al., 2014). These datasets are commonly used for challenges in medical text mining and act as benchmarks for evaluating machine learning models (Purushotham et al., 2018).

MIMIC-III dataset  MIMIC-III (Johnson et al., 2016) is an openly available dataset developed by the MIT Lab for Computational Physiology. It comprises clinical notes, demographics, vital signs, laboratory tests and other data associated with 40,000 critical care patients.

We used MedCAT (Kraljevic et al., 2019) to prepare the dataset and annotate a sample of clinical notes from MIMIC-III with UMLS concepts (Bodenreider, 2004). We selected the concepts with the UMLS semantic type Disease or Syndrome (corresponding to T047), out of which we picked the 100 most frequent Concept Unique Identifiers (CUIs), which group mentions with the same meaning. For each concept we then randomly sampled 4 documents containing a mention of that concept, resulting in 400 documents with 2367 annotations in total. The 100 most frequent concepts in these documents were manually annotated (and manually corrected in case of disagreement) for two text classification tasks:

• Status (affirmed/other, indicating if the disease is affirmed or negated/hypothetical);

• Temporality (current/other, indicating if the disease is current or past).

Such contextual properties are often critical in the medical domain in order to extract valuable information, as evidenced by the popularity of algorithms like ConText or NegEx (Harkema et al., 2009; Chapman et al., 2001).

Annotations were performed by two annotators, achieving an overall inter-annotator agreement above 90%. These annotations will be made publicly available.

ShARe/CLEF (MIMIC-II) dataset  The ShARe/CLEF annotated dataset proposed by Mowery et al. (2014) is based on 433 clinical records from the MIMIC-II database (Saeed et al., 2002). It was generated for community distribution as part of the Shared Annotated Resources (ShARe) project (Elhadad et al., 2013), and contains annotations including disorder mention spans, with several contextual attributes. For our analysis we derived two tasks from this dataset, focusing on two attributes, comprising 8075 annotations for each:

• Negation (yes/no, indicating if the disorder is negated or affirmed);

• Uncertainty (yes/no, indicating if the disorder is hypothetical or affirmed).

Text classification tasks  For both annotated datasets, we extracted from each document the portions of text containing a mention of the concepts of interest, keeping 15 words on each side of the mention (including line breaks). Each task is then made up of sequences comprising around 31 words, centered on the mention of interest, with its corresponding meta-annotation (status, temporality, negation, uncertainty), making up four text classification tasks, denoted:

• MIMIC|Status;

• MIMIC|Temporality;

• ShARe|Negation;

• ShARe|Uncertainty.

Table 1 summarizes the class distribution for each task.

Task                       Class 1       Class 2       Total
MIMIC|Status               1586 (67%)    781 (33%)     2367
  (1: affirmed, 2: other)
MIMIC|Temporality          2026 (86%)    341 (14%)     2367
  (1: current, 2: other)
ShARe|Negation             1470 (18%)    6605 (82%)    8075
  (1: yes, 2: no)
ShARe|Uncertainty          729 (9%)      7346 (91%)    8075
  (1: yes, 2: no)

Table 1: Class distribution
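The windowing step described above can be illustrated with a short sketch; the `extract_window` function and the toy note below are hypothetical, not the authors' code (the paper kept 15 words per side, yielding sequences of around 31 words):

```python
def extract_window(tokens, mention_idx, size=15):
    """Keep `size` tokens on each side of the mention (mention included)."""
    start = max(0, mention_idx - size)
    return tokens[start:mention_idx + size + 1]

# Toy note; in the paper this would be a tokenized MIMIC-III clinical note.
note = "patient denies chest pain but reports pneumonia last year".split()
window = extract_window(note, note.index("pneumonia"), size=3)
print(window)  # ['pain', 'but', 'reports', 'pneumonia', 'last', 'year']
```

With `size=15` and a mention far from the document edges, this yields the 31-word sequences used for all four tasks.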
Figure 1: Main workflow
2.2 Evaluation steps and main workflow

We used the four different text classification tasks described in Section 2.1 to explore various combinations of word representation models (Section 2.3), text pre-processing and tokenizing variations (Section 2.4), and classification algorithms (Section 2.5). To evaluate the different approaches we followed the steps detailed in Table 2 and Figure 1 for all four classification tasks. For each step we measured the impact by evaluating the best possible combination, based on the average F1 score (the weighted average score derived from 10-fold cross-validation results).

Step  Description                                                      Outcome (best F1)
A     Run all bag-of-words and traditional embeddings + classification
      algorithms and select the best combination (using baseline
      methods for text pre-processing and tokenization)                A-1
B     Using A-1 as the new baseline model, test different
      pre-processing methods (lowercasing, punctuation removal,
      lemmatization, stemming)                                         B-1
C     Using B-1 as the new baseline model, compare various tokenizers
      (word and subword level)                                         C-1
D     Test contextual embedding approaches: BERT (base, uncased)
      and BioBERT                                                      D-1

Table 2: Evaluation steps

2.3 Word representation models

Word embeddings, as opposed to bag-of-words (BoW), present the advantage of capturing semantic and syntactic meaning by representing words as real-valued vectors in a dimensional space (vectors that are close in that space represent similar words). Contextual embeddings go one step further by capturing the context surrounding the word, whilst traditional embeddings assign a single representation to a given word.

For our analysis we considered four off-the-shelf embedding models, pre-trained on public domain data, and compared them to the same embedding models trained on biomedical corpora, as well as a BoW representation. For the traditional embeddings we chose three
commonly used algorithms, namely Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and FastText (Bojanowski et al., 2017). We used publicly available models pre-trained on Wikipedia and Google News for all three (Yamada et al., 2018). To obtain medical-specific models we trained all three on MIMIC-III clinical notes (covering 53,423 intensive care unit stays, including those used in the classification tasks) (Johnson et al., 2016). The following hyperparameters, aligned to the off-the-shelf pre-trained models, were used: dimension of 300, window size of 10, minimum word count of 5, uncased, punctuation removed.

For the contextual embeddings we used BERT base (Devlin et al., 2019) and BioBERT (Lee et al., 2019), which are pre-trained respectively on general domain corpora and biomedical literature (PubMed abstracts and PMC articles). Finally, we used a BoW representation as a baseline approach.

2.4 Text pre-processing and tokenizers

In addition to pre-training several embedding models, we tested two different text tokenization methods, using the following types of tokenizers: (1) SciSpaCy (Neumann et al., 2019), a traditional tokenizer based on word detection; and (2) byte-pair encoding (BPE), adapted to word segmentation, which works at the subword level (Gage, 1994; Sennrich et al., 2016).

For the word-level tokenizer we chose SciSpaCy as it is specifically aimed at biomedical and scientific text processing. We further tested additional text pre-processing: lowercasing, punctuation removal, stopword removal, stemming and lemmatization.

For the subword BPE tokenizer we used byte-level byte-pair encoding (BBPE) (Wang et al., 2019; Wolf et al., 2020). In this case the only pre-processing performed is lowercasing, whilst everything else, including line breaks and spaces, is left as is. This approach limits the vocabulary size and is especially useful in the medical domain, where a large number of words are very rare. We limited the number of words to 30,522, a standard vocabulary size also used in BERT (Devlin et al., 2019).

2.5 Text classification algorithms

On all four classification tasks, we tested various machine learning algorithms which are widely used for clinical data mining tasks and achieve state-of-the-art performance (Yao et al., 2019), namely artificial neural network (ANN), convolutional neural network (CNN), recurrent neural network (RNN), bi-directional long short-term memory (Bi-LSTM), and BERT (Devlin et al., 2019; Wolf et al., 2020). We compared these with a statistics-based approach as a baseline, using a Support Vector Machine (SVM) classifier, a popular method used for classification tasks (Cortes and Vapnik, 1995).

For Bi-LSTM and RNN, we tested both a standard approach and one configured to simulate attention on the medical entity of interest. This custom approach consists in taking the representation of the network at the position of the entity of interest, which in most cases corresponds to the center of each sequence. We refer to this latter approach as custom Bi-LSTM and custom RNN.

For ANN and statistics-based models, which are limited by the size of the dataset and embeddings (300 dimensions × 31 words × 2300 or 8000 sequences), we chose to represent sequences by averaging the embeddings of the words composing each sequence. This representation method is commonly used and has proven efficient for various NLP applications (Kenter et al., 2016).

Furthermore, each of these models was tested using different sets of parameters (e.g. varying the support function, dropout and optimizer, as reported in Table 3); the ones producing the best performance were selected for further testing and are summarized in Table 3.

Parameter                      SVM                     ANN             CNN                      RNN / Bi-LSTM
Kernel or activation function  Radial basis, Linear,   ReLU + sigmoid  ReLU (with max pooling)  N/A
                               Poly, Sigmoid
Layers                         N/A                     2               3                        2
Filters                        N/A                     N/A             128                      N/A
Hidden unit dimensions         N/A                     100             N/A                      300
Dropout                        N/A                     0, 0.5          0, 0.5                   0, 0.5
Optimizer                      N/A                     Adam,           Adam,                    Adam,
                                                       Stochastic      Stochastic               Stochastic
                                                       Gradient        Gradient                 Gradient
                                                       Descent         Descent                  Descent
Learning rate                  N/A                     0.001           0.001                    0.001
Epochs                         tolerance: 0.001        5000            200                      50

Table 3: Classifiers and corresponding parameters evaluated. Parameters highlighted in bold were the ones selected based on performance.
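The sequence-averaging representation used for the SVM and ANN can be sketched as follows; the 4-dimensional toy vectors are illustrative stand-ins for the 300-dimensional Word2Vec embeddings, not the trained models:

```python
import numpy as np

# Hypothetical embedding lookup (the paper used 300-d vectors trained on MIMIC-III).
embeddings = {
    "patient":   np.array([0.2, 0.2, 0.5, 0.1]),
    "denies":    np.array([0.1, 0.8, 0.3, 0.0]),
    "pneumonia": np.array([0.9, 0.1, 0.0, 0.2]),
}

def sequence_vector(tokens, embeddings):
    """Represent a token sequence as the mean of its in-vocabulary word vectors."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:  # fully out-of-vocabulary sequence: fall back to a zero vector
        return np.zeros(len(next(iter(embeddings.values()))))
    return np.mean(vecs, axis=0)

print(sequence_vector(["patient", "denies", "pneumonia"], embeddings))
```

The resulting fixed-length vector can then be fed to any conventional classifier such as the SVM described above.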
3 Results

3.1 Performance comparison for all embedding and algorithm combinations (steps A & D)

In this section we compare the performance of the different embeddings and classification approaches. We report the weighted average F1/precision/recall (the weighted average value obtained from the 10-fold cross-validation results) for selected combinations on the four classification tasks in Tables 4 and 5 (full results in Appendix A.1).

For all word embedding methods tested (Word2Vec, GloVe, FastText), the ones trained on biomedical data show the best performance (see Table 4).

For classification algorithms, the best performance is obtained when using the custom Bi-LSTM model configured to target the biomedical concept of interest (see Table 5). Both contextual embeddings (BERT and BioBERT), whether trained on biomedical or general corpora, outperform any other combination of embedding and classification algorithm tested, and give results very close to the customized Bi-LSTM, as shown in Table 5.

This indicates that for tasks incorporating information about the position of the entity of interest in the text (e.g. ShARe, which reports disorder mention span offsets), the custom Bi-LSTM approach performs better than BioBERT, without necessitating any text pre-processing. On the other hand, when looking at pure text classification, BioBERT shows better performance than a Bi-LSTM approach, and consequently may be preferred for tasks where the sequence of interest is not easily centered on a specific entity. Finally, whilst the performance of BERT and BioBERT is relatively similar, BioBERT converges faster across all tasks tested.

3.2 Impact of text pre-processing (step B)

In addition to exploring various embeddings, we tested the impact of text pre-processing on classification task performance. To do so, we selected the best performing word embedding obtained in the previous step (Word2Vec trained on MIMIC-III, using the SciSpaCy tokenizer), and compared performances between all text cleaning variations (lowercasing, punctuation removal, stemming, lemmatization).

For each variant investigated, the same pre-processing settings were applied to prepare the annotated corpus as well as the entire MIMIC-III dataset, which was then used to re-train Word2Vec. This ensured the same vocabulary was used across the embedding and the sequences to classify for each experiment.

The results, summarized in Table 6, suggest that text pre-processing has a minor impact for all classification algorithms tested. Notably, stemming and lemmatization have a slightly negative impact on performance.

3.3 Impact of tokenizers (step C)

We tested the impact of tokenization on the performance of text classification tasks, focusing on the SciSpaCy and BBPE tokenizers, as they allow us to compare whole-word versus subword-unit methods. The results for the MIMIC|Status task (using Word2Vec trained on MIMIC-III) are shown in Table 7, and indicate that performance is roughly similar when using the BBPE tokenizer compared to SciSpaCy.

Furthermore, we compared both approaches in terms of speed and vocabulary size. Tokenizing text took on average 2.5 times longer with SciSpaCy (250 seconds to tokenize 100,000 medical notes for SciSpaCy versus 99 seconds for BBPE, excluding model loading time). For the models trained on the MIMIC-III corpus, SciSpaCy comprised 474,145 words, and BBPE 29,452 subword units.

3.4 Embeddings analysis: word similarities comparison

Finally, in order to analyse the differences between embeddings trained on general and medical corpora, we compared the semantic information captured by Word2Vec (using the SciSpaCy tokenizer and without any preliminary text pre-processing). Table 8 explores word similarities by showing the top ten similar words for a medical term ("cancer") and non-medical terms ("concentration" and "attention").

Notably, it highlights the numerous misspellings, abbreviations and domain-specific meanings contained in the medical lexicon, suggesting that general corpora such as Wikipedia may not be appropriate when working on data from medical records (and, by implication, for other specific domains).
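The weighted-average F1 used throughout these comparisons can be computed as below; this is a from-scratch sketch equivalent to scikit-learn's `f1_weighted` scoring, with made-up labels for illustration:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for c in support:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += (support[c] / len(y_true)) * f1
    return total

# Hypothetical gold and predicted labels for a binary meta-annotation task.
print(round(weighted_f1([1, 1, 1, 0], [1, 1, 0, 0]), 3))  # 0.767
```

In the paper, this score is additionally averaged over the 10 cross-validation folds.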
                                              F1 score (average from 10-fold cross-validation)
Model             Tokenizer  Embedding        MIMIC|Status  MIMIC|Temporality  ShARe|Negation  ShARe|Uncertainty
Bi-LSTM (custom)  SciSpaCy   Wiki|Word2Vec    92.8%         97.3%              98.4%           96.7%
Bi-LSTM (custom)  SciSpaCy   Wiki|GloVe       93.4%         97.2%              98.4%           97.2%
Bi-LSTM (custom)  SciSpaCy   Wiki|FastText    93.6%         96.9%              98.6%           96.4%
Bi-LSTM (custom)  SciSpaCy   MIMIC|Word2Vec   94.5%         97.9%              98.7%           97.3%
Bi-LSTM (custom)  SciSpaCy   MIMIC|GloVe      93.9%         97.9%              98.7%           96.9%
Bi-LSTM (custom)  SciSpaCy   MIMIC|FastText   93.7%         97.6%              98.5%           97.2%
BERT              WordPiece  BERT base        91.5%         97.3%              98.2%           93.6%
BioBERT           WordPiece  BioBERT          93.4%         97.3%              98.5%           94.2%
SVM               SciSpaCy   Wiki|Word2Vec    76.9%         94.8%              88.5%           85.9%
SVM               SciSpaCy   Wiki|GloVe       78.6%         94.9%              88.8%           87.1%
SVM               SciSpaCy   Wiki|FastText    78.1%         94.4%              88.7%           86.3%
SVM               SciSpaCy   BoW              82.7%         96.0%              90.2%           91.7%
SVM               SciSpaCy   MIMIC|Word2Vec   80.6%         95.1%              89.8%           90.2%
SVM               SciSpaCy   MIMIC|GloVe      79.1%         94.1%              89.4%           87.6%
SVM               SciSpaCy   MIMIC|FastText   79.6%         93.7%              88.9%           88.0%

Table 4: Comparison of embeddings (steps A & D)
                                              F1 score (average from 10-fold cross-validation)
Model             Tokenizer  Embedding        MIMIC|Status  MIMIC|Temporality  ShARe|Negation  ShARe|Uncertainty
Bi-LSTM           SciSpaCy   MIMIC|Word2Vec   88.4%         97.1%              96.2%           94.1%
Bi-LSTM (custom)  SciSpaCy   MIMIC|Word2Vec   94.5%         97.9%              98.7%           97.3%
BERT              WordPiece  BERT base        91.5%         97.3%              98.2%           93.6%
BioBERT           WordPiece  BioBERT          93.4%         97.3%              98.5%           94.2%
ANN               SciSpaCy   MIMIC|Word2Vec   80.9%         96.5%              88.6%           86.7%
CNN               SciSpaCy   MIMIC|Word2Vec   84.6%         97.3%              92.0%           87.5%
RNN               SciSpaCy   MIMIC|Word2Vec   77.0%         96.8%              94.0%           87.1%
RNN (custom)      SciSpaCy   MIMIC|Word2Vec   89.5%         96.7%              97.9%           96.5%
SVM               SciSpaCy   MIMIC|Word2Vec   80.6%         95.1%              89.8%           90.2%
ANN               SciSpaCy   BoW              79.8%         94.8%              89.3%           89.3%
SVM               SciSpaCy   BoW              82.7%         96.0%              90.2%           91.7%

Table 5: Comparison of classification algorithms (steps A & D)
                                                              F1 score (average from 10-fold cross-validation)
Task          Embedding       Text pre-processing           SVM    ANN    RNN    RNN (custom)  CNN    Bi-LSTM (custom)
MIMIC|Status  MIMIC|Word2Vec  Lowercase (L)                 80.6%  80.9%  77.0%  89.5%         84.6%  94.5%
MIMIC|Status  MIMIC|Word2Vec  L + punctuation removal (LP)  80.1%  80.0%  80.2%  86.1%         84.7%  94.4%
MIMIC|Status  MIMIC|Word2Vec  LP + lemmatizing              80.6%  79.6%  78.0%  86.3%         83.8%  94.1%
MIMIC|Status  MIMIC|Word2Vec  LP + stemming                 80.4%  79.7%  79.4%  86.1%         84.1%  94.1%

Table 6: Comparison of text pre-processing methods (step B)
                                              F1 score (average from 10-fold cross-validation)
Task          Embedding       Tokenizer  SVM    ANN    RNN    RNN (custom)  CNN    Bi-LSTM (custom)
MIMIC|Status  MIMIC|Word2Vec  SciSpaCy   80.6%  80.9%  77.0%  89.5%         84.6%  94.5%
MIMIC|Status  MIMIC|Word2Vec  BBPE       78.8%  80.5%  76.5%  86.0%         84.3%  94.7%

Table 7: Comparison of tokenizing methods (step C)
Term: "cancer"
  Word2Vec Medical:  ca (0.78), carcinoma (0.78), cancer- (0.75), caner (0.71), adenocarcinoma (0.71), ca- (0.64), melanoma (0.64), cancer;dehydration (0.63), cancer/sda (0.61), rcc (0.61)
  Word2Vec General:  prostate (0.85), colorectal (0.82), melanoma (0.8), pancreatic (0.8), leukemia (0.79), entity/breast cancer (0.79), leukaemia (0.78), tumour (0.77), cancers (0.76), ovarian (0.75)

Term: "concentration"
  Word2Vec Medical:  hmf (0.51), concentrations (0.49), formula (0.47), mct (0.47), polycose (0.47), virtue (0.45), corn (0.45), dosage (0.44), planimetry (0.44), equation (0.44)
  Word2Vec General:  concentrations (0.71), arbeitsdorf (0.67), vulkanwerft (0.65), sophienwalde (0.64), lagerbordell (0.64), sterntal (0.64), dürrgoy (0.62), straflager (0.61), maidanek (0.61), szebnie (0.61)

Term: "attention"
  Word2Vec Medical:  paid (0.43), approximation (0.34), followup (0.33), proximity (0.32), short-term (0.31), mangagement (0.31), atetntion (0.31), attnetion (0.3), atention (0.3), non-rotated (0.3)
  Word2Vec General:  attentions (0.65), notoriety (0.63), attracted (0.63), criticism (0.63), publicity (0.57), praise (0.57), aroused (0.56), acclaim (0.55), interest (0.55), admiration (0.55)

Table 8: Comparison of word similarities between general and domain-specific embeddings
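Neighbour lists like those in Table 8 are obtained by ranking the vocabulary by cosine similarity to a query word's vector. A minimal sketch with invented 3-dimensional vectors follows; the words and values are illustrative only, as the real queries ran against the trained Word2Vec models:

```python
import numpy as np

def most_similar(query, vectors, topn=2):
    """Rank all other vocabulary words by cosine similarity to `query`."""
    q = vectors[query]
    sims = {
        w: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        for w, v in vectors.items() if w != query
    }
    return sorted(sims, key=sims.get, reverse=True)[:topn]

# Toy embedding space: two oncology terms near "cancer", one unrelated term.
vectors = {
    "cancer":         np.array([1.0, 0.9, 0.0]),
    "carcinoma":      np.array([0.9, 1.0, 0.1]),
    "adenocarcinoma": np.array([0.8, 0.9, 0.2]),
    "attention":      np.array([0.0, 0.1, 1.0]),
}
print(most_similar("cancer", vectors))  # ['carcinoma', 'adenocarcinoma']
```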
4 Discussion

This study compared the impact of various embedding and classification methods on four different text classification tasks. Notably, we investigated the impact of pre-training embedding models on clinical corpora versus off-the-shelf models trained on general corpora.

The results suggest that using embeddings pre-trained for the specific task (clinical corpora in our case) leads to better performance with any classification algorithm tested. However, pre-training such embeddings is not necessarily feasible, due to either data or technical constraints. In this case our results highlight that off-the-shelf embeddings trained on large general corpora such as Wikipedia still produce acceptable performance. In particular, BERTbase outperformed most algorithms tested, even when these were combined with clinical embeddings.

Additionally, BioBERT was not pre-trained on medical notes but on texts from a related domain (biomedical articles and abstracts, as opposed to clinical records), and therefore misses specificities inherent to the medical domain such as misspellings or technical jargon. Despite this, BioBERT's performance is only marginally below that of the best model (custom Bi-LSTM combined with clinical embeddings).

The various experiments conducted on text pre-processing led to only small variations in performance, and even negatively impacted the performance of several algorithms, for the text classification task and embedding model tested. Given the additional constraints required to perform this step (the need to train embeddings on pre-processed texts and to clean input data) and the mixed results in performance, pre-processing does not appear to be essential.

Novel tokenization methods based on subword dictionaries, whilst not improving performance, eliminate several shortcomings of SciSpaCy and similar methods, notably their speed and vocabulary size. In light of these limitations and the very small difference in performance for the task tested, BBPE appears to be a suitable alternative to traditional tokenizers and significantly reduces computational costs.

Finally, the custom Bi-LSTM outperforms BioBERT when it simulates attention on the entity of interest. However, this configuration requires information on the entity mention span, in order to center each document on this span. For some datasets such information may be readily available, or can be obtained by performing an additional named-entity extraction step. Many text classification tasks, however, do not have this information, or may not rely on the specific entities/keywords required (e.g. sentiment analysis tasks). When the Bi-LSTM is not customized, both BERT models (trained on general and specific domains) produce the best performance, and consequently should be preferred for texts that do not easily allow such customization.

5 Conclusion

In this article we have explored the performance of various word representation approaches (comparing bag-of-words to traditional and contextual embeddings trained on both specific and general corpora, combined with various text pre-processing and tokenizing methods) as well as classification algorithms on four different text classification tasks, all based on publicly available datasets.

A detailed performance comparison on these four tasks highlighted the efficacy of contextual embeddings when compared to traditional methods when no customization is possible, whether these embeddings are trained on specific or general corpora. When combined with appropriate entity extraction steps and domain-specific embeddings, Bi-LSTM outperforms contextual embeddings. Across all classification algorithms, text pre-processing and tokenization approaches showed limited impact for the task and embedding tested, suggesting a rule of thumb of opting for the least time- and resource-intensive method.

Acknowledgments

We thank the anonymous reviewers for their valuable suggestions. This work was supported by Health Data Research UK, an initiative funded by UK Research and Innovation, Department of Health and Social Care (England) and the devolved administrations, and leading medical research charities.
AM is funded by Takeda California, Inc. RD, RS and AR are part-funded by the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London. RD is also supported by the National Institute for Health Research University College London Hospitals Biomedical Research Centre, and by the BigData@Heart Consortium, funded by the Innovative Medicines Initiative-2 Joint Undertaking under grant agreement No. 116074. This Joint Undertaking receives support from the European Union's Horizon 2020 research and innovation programme and EFPIA. DB is funded by a UKRI Innovation Fellowship as part of Health Data Research UK MR/S00310X/1. RB is funded in part by grant MR/R016372/1 for the King's College London MRC Skills Development Fellowship programme funded by the UK Medical Research Council (MRC) and by grant IS-BRC-1215-20018 for the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London.

This paper represents independent research part-funded by the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Wendy W. Chapman, Will Bridewell, Paul Hanbury, Gregory F. Cooper, and Bruce G. Buchanan. 2001. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics, 34(5):301–310.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273–297.

Hong-Jie Dai. 2019. Family member information extraction via neural sequence labeling models with different tag schemes. BMC Medical Informatics and Decision Making, 19(10):257.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.

N. Elhadad, W. W. Chapman, T. O'Gorman, M. Palmer, and G. Savova. 2013. The ShARe schema for the syntactic and semantic annotation of clinical texts.

Philip Gage. 1994. A new algorithm for data compression.

Henk Harkema, John N. Dowling, Tyler Thornblade, and Wendy W. Chapman. 2009. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. Journal of Biomedical Informatics, 42(5):839–851.

Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9.

Tom Kenter, Alexey Borisov, and Maarten de Rijke. 2016. Siamese CBOW: Optimizing word embeddings for sentence representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 941–951, Berlin, Germany. Association for Computational Linguistics.

T. A. Koleck, C. Dreisbach, P. E. Bourne, and S. Bakken. 2019. Natural language processing of symptoms documented in free-text narratives of electronic health records: A systematic review. Journal of the American Medical Informatics Association, 26(4):364–379.

Zeljko Kraljevic, Daniel Bean, Aurelie Mascio, Lukasz Roguski, Amos Folarin, Angus Roberts, Rebecca Bendayan, and Richard Dobson. 2019. MedCAT – Medical Concept Annotation Tool. arXiv:1912.10166.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, btz682.

Stéphane M. Meystre, Paul M. Heider, Youngjun Kim, Daniel B. Aruch, and Carolyn D. Britten. 2019. Automatic trial eligibility surveillance based on unstructured clinical data. International Journal of Medical Informatics, 129:13–19.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv:1301.3781.

Danielle L. Mowery, Sumithra Velupillai, Brett R. South, Lee Christensen, David Martinez, Liadh Kelly, Lorraine Goeuriot, Noemie Elhadad, Guergana Savova, and Wendy W. Chapman. 2014. Task 2: ShARe/CLEF eHealth Evaluation Lab 2014.

Ghulam Mujtaba, Liyana Shuib, Norisma Idris, Wai Lam Hoo, Ram Gopal Raj, Kamran Khowaja, Khairunisa Shaikh, and Henry Friday Nweke. 2019. Clinical text classification research trends: Systematic literature review and open issues. Expert Systems with Applications, 116:494–520.

Mark Neumann, Daniel King, Iz Beltagy, and Waleed

Özlem Uzuner. 2009. Recognizing obesity and comorbidities in sparse data. Journal of the American Medical Informatics Association, 16(4):561–570.

Changhan Wang, Kyunghyun Cho, and Jiatao Gu. 2019. Neural machine translation with byte-level subwords. arXiv:1909.03341.

Y. Wang, L. Wang, M. Rastegar-Mojarad, S. Moon, F. Shen, N. Afzal, S. Liu, Y. Zeng, S. Mehrabi, S. Sohn, and H. Liu. 2018. Clinical information extraction applications: A literature review. Journal of Biomedical Informatics, 77:34–49.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2020. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv:1910.03771.

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2018. Wikipedia2Vec: An optimized tool for learning embeddings of words and entities from Wikipedia.
Ammar. 2019. ScispaCy: Fast and Robust Models arXiv:1812.06280 [cs]. ArXiv: 1812.06280. for Biomedical Natural Language Processing. Pro- ceedings of the 18th BioNLP Workshop and Shared Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Task, pages 319–327. ArXiv: 1902.07669. Clinical text classification with rule-based features and knowledge-guided convolutional neural net- Jeffrey Pennington, Richard Socher, and Christopher works. BMC Medical Informatics and Decision Manning. 2014. Glove: Global Vectors for Word Making, 19(3):71. Representation. In Proceedings of the 2014 Con- ference on Empirical Methods in Natural Language A Appendices Processing (EMNLP), pages 1532–1543, Doha, A.1 Test set comparison of word Qatar. Association for Computational Linguistics. representation, text pre-processing, tokenization and classification Sanjay Purushotham, Chuizheng Meng, Zhengping methods across tasks Che, and Yan Liu. 2018. Benchmarking deep learn- ing models on large healthcare datasets. Journal of . Biomedical Informatics, 83:112–134.
M. Saeed, C. Lieu, G. Raber, and R. G. Mark. 2002. MIMIC II: a massive temporal ICU patient database to support research in intelligent patient monitoring. Computers in Cardiology, 29:641–644.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909 [cs]. ArXiv: 1508.07909.
Yuqi Si, Jingqi Wang, Hua Xu, and Kirk Roberts. 2019. Enhancing clinical concept extraction with contex- tual embeddings. Journal of the American Medical Informatics Association, 26(11):1297–1304.
Ozlem Uzuner, Ira Goldstein, Yuan Luo, and Isaac Ko- hane. 2008. Identifying patient smoking status from medical discharge records. Journal of the American Medical Informatics Association: JAMIA, 15(1):14– 24.
Noise Pollution in Hospital Readmission Prediction: Long Document Classification with Reinforcement Learning
Liyan Xu† Julien Hogan‡ Rachel Patzer‡ Jinho D. Choi† †Department of Computer Science, ‡Department of Surgery Emory University, Atlanta, US {liyan.xu,julien.hogan,rpatzer,jinho.choi}@emory.edu
Abstract

This paper presents a reinforcement learning approach to extracting noise from long clinical documents for the task of readmission prediction after kidney transplant. We face the challenges of developing robust models on a small dataset where each document may consist of over 10K tokens full of noise, including tabular text and task-irrelevant sentences. We first experiment with four types of encoders to empirically decide the best document representation, and then apply reinforcement learning to remove noisy text from the long documents, modeling the noise extraction process as a sequential decision problem. Our results show that the old bag-of-words encoder outperforms deep learning-based encoders on this task, and that reinforcement learning is able to improve upon the baseline while pruning out 25% of the text segments. Our analysis shows that reinforcement learning is able to identify both typical noisy tokens and task-specific noisy text.

1 Introduction

Prediction of hospital readmission has long been recognized as an important topic in surgery. Previous studies have shown that post-discharge readmission takes up tremendous social resources, while at least half of the cases are preventable (Basu Roy et al., 2015; Jones et al., 2016). Clinical notes, as part of the patients' Electronic Health Records (EHRs), contain valuable information but are often too time-consuming for medical experts to evaluate manually. Thus, it is of significance to develop prediction models utilizing various sources of unstructured clinical documents.

The task addressed in this paper is to predict 30-day hospital readmission after kidney transplant, which we treat as a long document classification problem without using specific domain knowledge. The data we use is the unstructured clinical documents of each patient up to the date of discharge.

In particular, we face three types of challenges in this task. First, the documents can be very long; the documents associated with these patients can have tens of thousands of tokens. Second, the dataset is relatively small, with fewer than 2,000 patients available, as kidney transplant is a non-trivial medical surgery. Third, the documents are noisy, containing many target-irrelevant sentences and tabular data in various text forms (Section 2).

The lengthy documents together with the small dataset impose a great challenge on representation learning. In this work, we experiment with four types of encoders: bag-of-words (BoW), averaged word embedding, and two deep learning-based encoders, ClinicalBERT (Huang et al., 2019) and LSTM with weight-dropped regularization (Merity et al., 2018). To overcome the long sequence issue, documents are split into multiple segments for both ClinicalBERT and LSTM (Section 4).

After we observe the best performing encoders, we further propose to apply reinforcement learning (RL) to automatically extract task-specific noisy text from the long documents, as we observe that many text segments do not contain predictive information, so removing this noise can potentially improve performance. We model the noise extraction process as a sequential decision problem, which also aligns with the fact that clinical documents are received in time-sequential order. At each step, a policy network with strong entropy regularization (Mnih et al., 2016) decides whether to prune the current segment given the context, and the reward comes from a downstream classifier after all decisions have been made (Section 5).

Empirical results show that the best performing encoder is BoW, and that the deep learning approaches suffer from severe overfitting given the huge feature space in contrast to the limited training data. RL is experimented with on this BoW encoder, and is able to improve upon the baseline while pruning out around 25%
Proceedings of the BioNLP 2020 workshop, pages 95–104, Online, July 9, 2020. © 2020 Association for Computational Linguistics
of the text segments (Section 6). Further analysis shows that RL is able to identify traditional noisy tokens with low document frequency (DF), as well as task-irrelevant tokens with high DF but of little information (Section 7).

2 Data

This work is based on the Emory Kidney Transplant Dataset (EKTD), which contains structured chart data as well as unstructured clinical notes associated with 2,060 patients. The structured data comprises 80 features that are lab results before discharge, as well as binary labels indicating whether each patient is readmitted within 30 days after kidney transplant; 30.7% of the patients are labeled as positive.

Type  P      T         Description
CO    1,354  4,395.3   Report for every outpatient consultation before transplantation
DS    514    1,296.7   Summary at the time of discharge from every hospital admission happened before transplant
EC    1,110  1,073.6   Results of echocardiography
HP    1,422  3,025.1   Summary of the patient's medical history and clinical examination
OP    1,472  4,224.8   Report of surgical procedures
PG    1,415  13,723.4  Medical note during hospitalization summarizing the patient's medical status each day
SC    2,033  1,189.2   Report from the evaluation of each transplant candidate by the selection committee
SW    1,118  1,407.6   Report from encounters with social workers

Table 1: Statistics of our dataset with respect to different types of clinical notes. P: # of patients, T: avg. # of tokens, CO: Consultations, DS: Discharge Summary, EC: Echocardiography, HP: History and Physical, OP: Operative, PG: Progress, SC: Selection Conference, SW: Social Worker. The report for SC is written by the committee that consists of surgeons, nephrologists, transplant coordinators, social workers, etc. at the end of the transplant evaluation. All 8 types follow an approximately 3:7 positive-negative class distribution.

2.1 Preprocessing

As the clinical notes are collected through various sources of EMRs, many noisy documents exist in EKTD: 515 documents are HTML pages and 303 of them are duplicates. These documents are removed during preprocessing. Moreover, most documents contain not only written text but also tabular data, because some EMR systems can only export entire documents in table format. While there are many tabular texts in the documents (e.g., lab results and prescriptions as in Table 2), it is impractical to write rules to filter them out, as the exported formats are not consistent across EMRs. Thus, any tokens containing digits or symbols, except for one-character tokens, are removed during preprocessing. Although numbers may provide useful features, most quantitative measurements are already included in the structured data, so those features can be better extracted from the structured data if necessary. The remaining tabular text contains headers and values that do not provide much helpful information and becomes another source of noise, which we handle by training a reinforcement learning model to identify it (Section 5).

Lab Fishbone (BMP, CBC, CMP, Diff) and critical labs - Last 24 hours 03/08/2013 12:45
142(Na) 104(Cl) 70H(BUN) - 10.7L(Hgb)
92(Glu) 6.5(WBC) 137L(Plt) 3.6(K) 26(CO2)

Table 2: An example of tabular text in EKTD.
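The token-level cleaning rule above (drop any token containing a digit or symbol, except one-character tokens) can be sketched as follows. This is an illustrative reimplementation, not the authors' released code; the helper name is our own:

```python
import re

def clean_tokens(tokens):
    """Drop tokens containing digits or symbols, keeping one-character
    tokens, as in the EKTD preprocessing step (a sketch, not the exact rules)."""
    kept = []
    for tok in tokens:
        if len(tok) == 1:
            kept.append(tok)  # one-character tokens are always kept
        elif re.fullmatch(r"[A-Za-z]+", tok):
            kept.append(tok)  # purely alphabetic tokens survive
        # tokens with digits or symbols (e.g. "10.7L(Hgb)") are removed
    return kept

print(clean_tokens(["142(Na)", "lab", "fishbone", "-", "10.7L(Hgb)", "bmp"]))
# → ['lab', 'fishbone', '-', 'bmp']
```

Applied to the tabular text of Table 2, this strips the lab values themselves but leaves the broken headers behind, which is exactly the residual noise the RL model later targets.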
The unstructured data includes 8 types of notes such that all patients have zero to many documents for each note type. It is possible to develop a more accurate prediction model by co-training the structured and unstructured data; however, this work focuses on investigating the potential of the unstructured data only, which is more challenging.

Table 1 gives the statistics of each clinical note type after preprocessing. The average number of tokens is measured by counting the tokens in all documents from the same note type of each patient. Given this preprocessed dataset, our task is to take all documents in each note type as a single input and predict whether or not the patient associated with those documents will be readmitted.

3 Related Work

Shin et al. (2019) presented ensemble models utilizing both the structured and the unstructured data in EKTD, where separate logistic regression (LR) models are trained on the structured data and on each type of notes respectively, and the final prediction for each patient is obtained by averaging the predictions from each model. Since some patients may lack documents from certain note types, predictions on those note types are simply ignored in the averaging process. For the unstructured notes, a concatenation of Term Frequency-Inverse Document Frequency
(TF-IDF) and Latent Dirichlet Allocation (LDA) representations is fed into LR. However, we have found that the LDA representation contributes only marginally, while LDA takes significantly more inference time. Thus, we drop LDA and only use TF-IDF as our BoW encoder (Section 4.1).

Various deep learning models for text classification have been proposed in recent years. Pretrained language models like BERT have shown state-of-the-art performance on many NLP tasks (Devlin et al., 2019), and ClinicalBERT has also been introduced for the medical domain (Huang et al., 2019). However, deep learning approaches have two drawbacks on this particular dataset. First, deep learning requires a large dataset to train, whereas most of our unstructured note types have fewer than 2,000 samples. Second, these approaches are not designed for long documents, and it is difficult to keep long-term dependencies over thousands of tokens.

Reinforcement learning has been explored to combat data noise by previous work (Zhang et al., 2018; Qin et al., 2018) in the short text setting: a policy network makes decisions left-to-right over tokens, and is jointly trained with another classifier. However, there is little investigation of RL in the long text setting, as it still requires an effective encoder to give a meaningful representation of long documents. Therefore, in our experiments, the first step is to select the best encoder, and then apply RL to the long document classification.

4 Document Representation

4.1 Bag-of-Words

For the baseline model, the bag-of-words representation with TF-IDF scores, excluding stopwords (Nothman et al., 2018), is fed into logistic regression (LR). The objective is to minimize the negative log likelihood of the gold label y_i:

  -(1/m) Σ_{i=1}^{m} [ y_i log p(g_i) + (1 - y_i) log(1 - p(g_i)) ]   (1)

where g_i is the TF-IDF representation of D_i. In addition, we experiment with two common techniques in the encoder to reduce the feature space: token stemming and document frequency cutoff.

4.2 Averaged Word Embedding

Word embeddings generated by fastText are used to establish another baseline that utilizes subwords to better represent unseen terms (Bojanowski et al., 2017). This is suitable for our task, as unseen terms and misspellings frequently appear in these clinical notes. The averaged word embedding is used to represent the input document consisting of multiple notes, which is fed into LR with the same training objective.

4.3 ClinicalBERT

Following Huang et al. (2019), the pretrained BERT language model (Devlin et al., 2019) is first tuned on the MIMIC-III clinical note corpus (Johnson et al., 2016), which has been shown to provide better word similarities in medical domains. Then, a dense layer is added on the CLS token of the last BERT layer. All parameters are fine-tuned to optimize the binary cross entropy loss, the same objective as Equation 1.

Since BERT has a limit on the input length, the input document of each patient is split into multiple subsequences. Each subsequence is within the BERT length limit, and serves as an independent sample with the same label as the patient. The training data is therefore noisily inflated. The final probability of readmission is computed as follows:

  p(y_i = 1 | g_i) = (p_max^{n_i} + p_mean^{n_i} · n_i / c) / (1 + n_i / c)   (2)

where g_i is the BERT representation of patient i, n_i is the corresponding number of subsequences, and c is a hyperparameter to control the influence of n_i. p_max^{n_i} and p_mean^{n_i} are the max and mean probabilities across the subsequences, respectively.

The motivation behind balancing the max and mean probabilities is that subsequences do not contain equal information: p_max^{n_i} represents the best potential, while longer text should give more importance to p_mean^{n_i}, because p_max^{n_i} is more easily affected by noise as the text length grows. Although Equation 2 seems intuitive, the use of pseudo labels on subsequences becomes another source of noise, especially when there are thousands of tokens; thus, the performance is uncertain. Section 6.2 provides a detailed empirical analysis of this model.

4.4 Weight-dropped LSTM

We split the documents of each patient into multiple short segments, and feed the segment representation to a long short-term memory network (LSTM) at each time step:

  h_j ← LSTM(s_j, h_{j-1}; θ)   (3)
Figure 1: Overview of our reinforcement learning approach. Rewards are calculated and sent back to the policy network after all actions a1:T have been sampled for the given episode.
where h_j is the hidden state at time step j, s_j is the jth segment, and θ is the set of parameters.

Although segmentation of the documents is still necessary, no pseudo labels are needed. We obtain the segment representation by averaging its token embeddings from the last layer of BERT. The final hidden state at each step j is the concatenated hidden states of a single-layer bi-directional LSTM. After we obtain the hidden state for each segment, a max-pooling operation is performed on h_{1:n} over the time dimension to obtain a fixed-length vector, similar to Kim (2014) and Adhikari et al. (2019). A dense layer immediately follows.

It is particularly important to strengthen regularization on this dataset with its small sample size. Dropout (Srivastava et al., 2014) as a means of regularization has been shown to be effective in deep learning models, and Merity et al. (2018) successfully applied a dropout-like technique in LSTMs: DropConnect (Wan et al., 2013) is applied on the four hidden-to-hidden matrices, preventing overfitting from occurring on the recurrent weights.

5 Reinforcement Learning

Reinforcement learning is applied to the best performing encoder from Section 4 to prune noisy text, which can lead to comparable or even better performance, as many text segments in these clinical notes are found to be irrelevant to this task. Figure 1 gives an overview of our reinforcement learning approach. The pruning process is modeled as a sequential decision problem, reflecting the fact that these notes are received in time order. It consists of two separate components: a policy network and a downstream classifier. To avoid having too many time steps, the policy operates on the segment level instead of the token level. For each patient, the documents are split into short segments g_{1:T} = {g_1, g_2, ..., g_T}, and the policy network makes a sequence of decisions a_{1:T} = {a_1, a_2, ..., a_T} over the segments. The downstream classifier is responsible for the reward, and the REINFORCE algorithm is used to train the policy (Williams, 1992).

State: At each time step, the state s_t is the concatenation of two parts: the representation of the previously selected text, and the current segment representation g_t. The previously selected text serves as the context and provides a prior importance. Both parts are represented by an effective encoder, e.g., the best performing encoder from Section 4.

Action: The action space at each step is binary: {Keep, Prune}. If the action is Keep, the current segment is added to the selected text; otherwise, it is discarded. The final selected text for a patient is the concatenation of the segments selected by the policy.

Reward: The reward comes at the end, when all actions have been sampled for the entire sequence. The final selected text is fed to the downstream classifier, and the negative log-likelihood of the gold label is used as the reward R. In addition, we also include a reward term R_p to encourage pruning, as follows:

  R_p = c · α · [2σ(l/β) - 1]   (4)

where c and β are hyperparameters to control the scale of R_p, l is the number of segments, α = |{a_k = Prune}| / l is the ratio of pruned segments, and σ is the sigmoid function. The value of the term 2σ(l/β) - 1 falls into the range (0, 1). When l is small, it downgrades the encouragement of pruning; when l is large, it also gives an upper bound on R_p. Additionally, we apply exponential decay to the reward. The final reward is d^l · R + R_p, where d is the discount rate.

Policy Network: The policy network maintains a stochastic policy π(a_t | s_t; θ):

  π(a_t | s_t; θ) = σ(W s_t + b)   (5)

where θ is the set of policy parameters W and b, and a_t and s_t are the action and state at time step t, respectively. During training, an action is sampled at
each step with the probability given by the policy. After the sampling has been performed over the entire sequence, the delayed reward is computed. During evaluation, the action is picked by argmax_a π(a | s_t; θ).

The training is guided by the REINFORCE algorithm (Williams, 1992), which optimizes the policy to maximize the expected reward:

  J(θ) = E_{a_{1:T} ~ π} [ R_{a_{1:T}} ]   (6)

and the gradient has the following form:

  ∇_θ J(θ) = E_τ [ Σ_{t=1}^{T} ∇_θ log π(a_t | s_t; θ) R_τ ]   (7)
           ≈ (1/N) Σ_{i=1}^{N} Σ_{t=1}^{T} ∇_θ log π(a_{it} | s_{it}; θ) R_{τ_i}   (8)

where τ represents the sampled trajectory {a_1, a_2, ..., a_T}, and N is the number of sampled trajectories. R_{τ_i} here equals the delayed reward from the downstream classifier at the last step.

To encourage exploration and avoid local optima, we add entropy regularization (Mnih et al., 2016) to the policy loss:

  J_reg(θ) = (λ/N) Σ_{i=1}^{N} (1/T_i) Σ_{t=1}^{T_i} ∇_θ H(π(s_{it}; θ))   (9)

where H is the entropy, λ is the regularization strength, and T_i is the trajectory length.

Finally, the downstream classifier and the policy network are warm-started by separate training, and then jointly trained together.

6 Experiments

Before the experiments, we perform the preprocessing described in Section 2.1, and then randomly split the patients in every note type into 5 folds to perform cross-validation, as suggested by Shin et al. (2019). To evaluate each fold F_i, 12.5% of the training set, that is, the combined data of the other 4 folds, is held out as the development set, and the best configuration from this development set is used to decode F_i. The same split is used across all experiments for fair comparison. Following Shin et al. (2019), the averaged Area Under the Curve (AUC) across these 5 folds is used as the evaluation metric.

6.1 Baseline

Bag-of-Words: We first conduct experiments using the bag-of-words encoder (BoW; Section 4.1) to establish the baseline. Experiments are performed on all note types using the vanilla TF-IDF, a document frequency (DF) cutoff at 2 (removing all tokens whose DF ≤ 2), and token stemming. For every experiment, the class weight is assigned inversely proportional to the class frequencies, and the inverse of the regularization strength C is searched from {0.01, 0.1, 1, 10}; the best results are achieved with C = 1 on the development set.

Encoder                      CO    DS    EC    HP    OP    PG    SC    SW
Bag-of-Words (§4.1)          58.6  62.1  52.0  58.9  51.8  61.2  59.3  51.6
+ Cutoff                     58.6  62.3  52.8  59.0  51.9  61.3  59.3  51.9
+ Stemming                   58.9  61.8  53.4  59.4  51.9  61.5  59.3  51.6
Averaged Embedding (§4.2)    56.3  53.7  52.4  54.0  53.4  54.7  54.2  46.6
ClinicalBERT (§4.3)          51.9  53.3  -     52.7  -     -     52.3  -
Weight-dropped LSTM (§4.4)   53.7  55.8  -     54.2  -     -     54.5  -

Table 3: The Area Under the Curve (AUC) scores achieved by different encoders on the 5-fold cross-validation. See the caption of Table 1 for the descriptions of CO, DS, EC, HP, OP, PG, SC, and SW. For the deep learning encoders, only four types are selected in the experiments (Section 6.2).

Table 3 describes the cross-validation results on every note type. The top AUC is 62.3%, which is within expectation given the difficulty of this task. Some note types are not as predictive as the others, such as Operative (OP) and Social Worker (SW), with AUCs under 52%. Most note types have standard deviations in the range 0.02 to 0.03. In comparison to the previous work (Shin et al., 2019), we achieve 0.671 AUC combining both the structured and unstructured data, despite not using LDA in our encoder.

Noise Observation: The DF cutoff coupled with token stemming significantly reduces the feature space for the BoW model. As shown in Table 4, the DF cutoff by itself achieves about a 50% reduction of the feature space. Furthermore, applying the DF cutoff leads to slightly higher AUCs on most of the note
types, despite almost half of the tokens being removed from the vocabulary. This implies that there exists a large amount of noisy text that appears in only a few documents, causing the models to be overfitted more easily. These results further verify our previous observation and strengthen the necessity of extracting noise from these long documents using reinforcement learning (Section 6.3).

Type  Vanilla  + Cutoff       + Stemming
CO    28,213   15,022 (46.8)  12,243 (56.6)
DS    11,029   6,117 (44.5)   5,228 (52.6)
HP    20,245   11,276 (44.3)  9,329 (53.9)
SC    19,050   9,873 (48.2)   8,200 (57.0)

Table 4: The dimensions of the feature spaces used by each BoW model with respect to the four note types. The numbers in parentheses indicate the percentage reduction from the vanilla model.

Averaged Word Embedding: For the averaged word embedding encoder (AWE; Section 4.2), embeddings generated by fastText, trained on Common Crawl and the English Wikipedia with 300 dimensions, are used.¹ AWE is outperformed by BoW on every note type except Operative (OP; Table 3). This empirical result implies that AWE over thousands of tokens is not effective in generating the document representation, so the averaged embeddings are less discriminative than the sparse vectors generated by BoW for such long documents.

6.2 Deep Learning-based Encoders

For the deep learning encoders, the four note types with good baseline performance (≈60% AUC) and reasonable sequence length (< 5,000) are selected for the following experiments: Consultations (CO), Discharge Summary (DS), History and Physical (HP), and Selection Conference (SC) (see Tables 1 and 3).

Segmentation: For both the ClinicalBERT and LSTM models, the input document is split into segments as described in Section 4.3. For LSTM, we set the maximum segment length to 128 for CO and HP, and 64 for DS and SC, to balance between segment length and sequence length. The segment length for ClinicalBERT is set to 318 (approaching 500 after BERT tokenization) to avoid the noise brought by too many pseudo labels. More statistics about the segmentation are summarized in Table 5.

Type + Model  SEN  SEQ   INST
CO + BERT     318  14.8  11,376
CO + LSTM     128  36.8  948
DS + BERT     318  4.6   1,588
DS + LSTM     64   22.5  371
HP + BERT     318  10.1  8,364
HP + LSTM     128  27.3  987
SC + BERT     318  3.7   5,206
SC + LSTM     64   25.4  1,422

Table 5: SEN: maximum segment length (number of tokens) allowed by the corresponding model, SEQ: average sequence length (number of segments), INST: average number of samples in the training set.

For the ClinicalBERT, we use the PyTorch BERT implementation with the base configuration:² 768 embedding dimensions and 12 transformer layers, and we load the weights provided by Huang et al. (2019), whose language model has been finetuned on large-scale clinical notes.³ We finetune the entire ClinicalBERT with batch size 4, learning rate 2 × 10⁻⁵, and weight decay rate 0.01.

For the weight-dropped LSTM, we set the batch size to 64, the learning rate to 10⁻³, and the weight-drop rate to 0.5, and search the hidden state dimension from {128, 256, 512} on the development set. Early stopping is used for both approaches.

Result Analysis: Table 3 shows the final results achieved by the ClinicalBERT and LSTM models. The AUCs of both models experience a non-trivial drop from the baseline. After further investigation, the issue is that both models suffer from severe overfitting given the huge feature space, and struggle to learn generalized decision boundaries from this data. Figure 2 shows an example of the weak correlation between the training loss and the AUC scores on the development set. As more steps are processed, the training loss gradually decreases to 0. However, the model has high variance, and it does not necessarily give better performance on the development set as the training loss drops. This issue is more apparent with ClinicalBERT on CO, because there are too many pseudo labels acting as noise, which makes it harder for the model to distinguish useful patterns from noise.

¹ https://fasttext.cc/docs/en/crawl-vectors.html
² https://github.com/huggingface/transformers
³ https://github.com/kexinhuang12345/clinicalBERT
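The subsequence aggregation of Equation 2 (Section 4.3), whose pseudo labels are discussed above, can be sketched as follows. This is our own illustrative reimplementation of the published formula, not the authors' code; the value of `c` is an arbitrary example, not the tuned hyperparameter:

```python
def readmission_probability(subseq_probs, c=2.0):
    """Combine per-subsequence probabilities into one patient-level score,
    following Equation 2: (p_max + p_mean * n/c) / (1 + n/c).
    `c` is a hyperparameter controlling the influence of n (example value)."""
    n = len(subseq_probs)
    p_max = max(subseq_probs)
    p_mean = sum(subseq_probs) / n
    return (p_max + p_mean * n / c) / (1 + n / c)

# With a single subsequence, max and mean coincide and the score equals it;
# as n grows, more weight shifts from p_max toward p_mean.
print(round(readmission_probability([0.9, 0.1, 0.2], c=2.0), 3))  # → 0.6
```

Note how one confident subsequence (0.9) is damped by the mean as the sequence lengthens, which is exactly the noise-robustness motivation given for Equation 2.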
Figure 2: Training loss and AUC scores on the development set during the LSTM training on the CO type. The AUC scores depict high variance while showing weak correlation to the training loss.

6.3 Reinforcement Learning

According to Table 3, the BoW model achieves the best performance. Therefore, we decide to use TF-IDF to represent the long text of each patient, along with logistic regression as the classifier for reinforcement learning. Document segmentation is the same as for LSTM (Table 5). During training, the segments within each note are shuffled to reduce overfitting risks, and sequences with more than 36 segments are truncated. The downstream classifier is warm-started by loading the weights from the logistic regression model in the previous experiment.

Tuning Analysis: We find that two hyperparameters are essential to the final success of reinforcement learning (RL). The first is the reward discount rate d. The scale of the policy gradient ∇_θ J(θ) depends on the sequence length T, while the delayed reward R_τ is always on the same scale regardless of T. Therefore, different sequence lengths across episodes cause turbulence in the policy gradient, leading to unstable training. It is important to apply reward decay to stabilize the scale of ∇_θ J(θ).

The second is the entropy regularization coefficient λ, which forces the model to add bias towards uncertainty. Without strong entropy regularization, the training easily falls into a local optimum in the early stage, which is to keep all segments, as shown in Figure 3(a). λ = 6 gives the model a decent incentive to explore aggressively, as shown in Figure 3(b), and finally leads to a higher AUC.
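The shaped reward of Equation 4, combined with the exponential decay d^l discussed above, can be sketched as follows; this is an illustrative reimplementation, and the default constants are example values rather than the tuned ones:

```python
import math

def shaped_reward(log_likelihood_reward, num_segments, num_pruned,
                  c=0.1, beta=8.0, d=0.95):
    """Final RL reward: d^l * R + R_p, where
    R_p = c * alpha * (2*sigmoid(l/beta) - 1)   (Equation 4).
    R is the classifier's (negative log-likelihood) reward, l the number
    of segments, and alpha the ratio of pruned segments."""
    l = num_segments
    alpha = num_pruned / l
    sigmoid = 1.0 / (1.0 + math.exp(-l / beta))
    r_p = c * alpha * (2.0 * sigmoid - 1.0)
    return (d ** l) * log_likelihood_reward + r_p

# Longer sequences decay R more strongly, while the pruning bonus cap
# 2*sigmoid(l/beta) - 1 grows toward its upper bound of 1.
r = shaped_reward(-0.5, num_segments=36, num_pruned=9)
```

With `num_pruned = 0` the bonus vanishes and only the decayed classifier reward remains, matching the role of R_p as a pure pruning incentive.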
The policy network is then trained for 400 episodes while freezing the downstream classifier. After the warm start, both models are jointly trained. We set the number of sampled episodes N to 10, the learning rate to 2 × 10⁻⁴, the scaling factor β in Equation 4 to 8, and the discount rate to 0.95. Moreover, we search the reward coefficient c in {0.02, 0.1, 0.4}, and the entropy coefficient λ in {2, 4, 6, 8}.

Figure 3: Retaining ratios on the development set of SC while training the reinforcement learning model. Entropy regularization encourages more exploration.

          CO    DS    HP    SC
Best      58.9  62.3  59.4  59.3
RL        59.8  62.4  60.6  60.2
Pruning   26%   5%    19%   23%

Table 6: The AUC scores and the pruning ratios of reinforcement learning (RL). Best: AUC scores from the best performing models in Table 3.

The AUC scores and the pruning ratios (the number of pruned segments divided by the sequence length) are shown in Table 6. Our reinforcement learning approach outperforms the best performing models in Table 3, achieving around 1% higher AUC scores on three note types, CO, HP, and SC, while pruning out up to 26% of the input documents.

7 Noise Analysis

To investigate the noise extracted by RL, we analyze the pruned segments on the validation sets of the Consultations type (CO), and compare the results with other basic noise removal techniques.

Qualitative Analysis: Table 7 demonstrates the potential of the learned policy to automatically identify noisy text in the long documents. The original notes of the shown examples are tabular text with headers and values, mostly lab results and medical prescriptions. After the data cleaning step, the text becomes broken and does not make much sense for humans to evaluate. The learned policy can identify noisy segments by looking at the presence of headers such as "lab fishbone" and "lab report", and certain medical terms that frequently appear in tabular reports, such as "chloride", "creatinine", "hemoglobin", "methylprednisolone", etc. We find
that many pruned segments have strong indicators of headers and specific medical terms, which appear mostly in tabular text rather than written notes.
lab fishbone ( bmp , cbc , cmp , diff ) and critical labs - last hours ( not an official lab report . please see flowsheet ( or printed official lab reports ) for official lab results . ) ( na ) ( cl ) h ( bun )- ( hgb ) ( glu ) ( wbc ) ( plt ) ( ) h ( cr ) ( hct ) na = not applicable a = abnormal ( ftn ) = footnote . laboratory studies : sodium , potassium , chloride ,., bun , creatinine , glucose . total bilirubin 1 , phos of , calcium , ast 9 , alt , alk phos . parathyroid hormone level . white blood cell count , hemoglobin , hematocrit , platelets . inr , ptt , and pt . methylprednisolone ivpb : mg , ivpb , give in surgery , routine , / , infuse over : minute . mycophenolate mofetil : mg = 4 cap , po , capsule , once , now , / , stop date / , ml . documented medications documented accupril : mg , po , qday , 0 refill , substitution allowed .
Table 7: Examples of pruned segments by the learned policy. Tokens that have feature importance lower than −0.001 (towards the Prune action) are marked bold.
the social worker met with this pleasant year old caucasian male on this date for kidney transplant evaluation . the patient was alert , oriented and easily engaged in conversation with the social worker today . he resides in atlanta with his spouse of years , who he describes as very supportive . he reports occasional alcohol drinks per month but denies any illicit drug use . he has a grade education . he has been married for years . he is working full - time while on peritoneal dialysis as a business asset manager . he has medicare and an aarp prescriptions supplement . family history : mother deceased at age with complications of obesity , high blood pressure and heart disease .
Table 8: Examples of kept segments by the learned policy. Tokens that have feature importance greater than 0.0005 (towards the Keep action) are marked bold.
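The bolding criterion in Tables 7 and 8 is based on per-token feature importance. As an illustration only (the helper below and its inputs are hypothetical, assuming a linear classifier over TF-IDF features rather than the authors' exact scoring), a token's contribution to the Keep/Prune decision can be read off as its learned weight times its feature value:

```python
def token_contributions(tokens, vocab, weights, tfidf):
    """Per-token contribution to a linear classifier's logit.

    tokens:  list of token strings in a segment
    vocab:   token -> feature index (hypothetical mapping)
    weights: learned coefficient per feature
    tfidf:   TF-IDF value per feature for this segment
    Strongly negative contributions push towards Prune,
    strongly positive ones towards Keep.
    """
    scores = {}
    for tok in set(tokens):
        j = vocab.get(tok)
        if j is not None:  # out-of-vocabulary tokens get no score
            scores[tok] = weights[j] * tfidf[j]
    return scores
```

Tokens whose contribution falls below a negative threshold would be marked as Prune indicators, mirroring the bolding in the tables.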
Table 8 shows examples that are kept by the policy. Tokens that contribute towards the Keep action are words related to human and social life, such as “social worker”, “engaged”, “drinks”, “married”,
“medicare”, and terms related to health conditions, such as “obesity”, “heart”, and “high blood pressure”. These terms indeed appear mostly in written text rather than tabular data.
In addition, we notice that the policy is able to remove certain duplicate segments. Medical professionals sometimes repeat descriptions from previous notes in a new document, causing duplicate content. The policy learns to make use of the already selected context, and assigns negative coefficients to certain tokens. Duplicate segments are only selected once if the segment contains many tokens that have opposite feature importance in the context and segment vectors.
Figure 4: Log-scale distribution of document frequency for tokens with top negative feature importance.
Quantitative Analysis  We examine tokens that are pruned by RL and compare with a document frequency (DF) cutoff. We select 3000 unique tokens in the vocabulary that have the top negative feature importance (towards the Prune action) in the segment vector of CO. Figure 4 shows the DF distribution of these tokens. We observe that the majority of those tokens have small DF values. This shows that the learned policy is able to identify certain tokens with small DF values as noise, which aligns with the DF cutoff. Moreover, the distribution also shows a non-trivial number of tokens with large DF values, demonstrating that RL can also identify task-specific noisy tokens that commonly appear in documents, which in this case are certain tokens in noisy tabular text.
Both RL and the DF cutoff achieve higher AUC while reducing input features, suggesting that, given the small sample size, the extracted text is more likely to cause overfitting than to be a generalizable pattern, which also verifies our initial hypothesis.
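The document-frequency (DF) cutoff used as a point of comparison can be sketched as follows; the whitespace tokenization and threshold here are placeholders, not the settings used in the experiments:

```python
from collections import Counter

def df_prune(docs, min_df=2):
    """Remove tokens whose document frequency falls below min_df.

    Document frequency counts, for each token, the number of
    documents it occurs in (not its total number of occurrences).
    """
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # count each token once per document
    keep = {tok for tok, n in df.items() if n >= min_df}
    return [" ".join(t for t in doc.split() if t in keep) for doc in docs]
```

Unlike the learned policy, this cutoff can only remove rare tokens; it cannot catch high-DF tokens that are noise for the specific task, which is the gap the RL analysis above highlights.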
8 Conclusion
In this paper, we address the task of 30-day readmission prediction after kidney transplant, and propose to improve performance by applying reinforcement learning with noise extraction capability. To overcome the challenge of long-document representation with a small dataset, we experiment with four different encoders. Empirical results show that bag-of-words is the most suitable encoder, surpassing overfitted deep learning models, and that reinforcement learning is able to improve performance while identifying both traditional noisy tokens that appear in few documents and task-specific noisy text that appears commonly.
Acknowledgments
We gratefully acknowledge the support of the National Institutes of Health grant R01MD011682, Reducing Disparities among Kidney Transplant Recipients. Any opinions, findings, and conclusions or recommendations expressed in this material are
those of the authors and do not necessarily reflect the views of the National Institutes of Health.
Evaluating the Utility of Model Configurations and Data Augmentation on Clinical Semantic Textual Similarity
Yuxia Wang   Fei Liu   Karin Verspoor   Timothy Baldwin
School of Computing and Information Systems
The University of Melbourne
Victoria, Australia
{yuxiaw, fliu3}@student.unimelb.edu.au
[email protected]   [email protected]
Proceedings of the BioNLP 2020 workshop, pages 105–111, Online, July 9, 2020. © 2020 Association for Computational Linguistics

Abstract

In this paper, we apply pre-trained language models to the Semantic Textual Similarity (STS) task, with a specific focus on the clinical domain. In the low-resource setting of clinical STS, these large models tend to be impractical and prone to overfitting. Building on BERT, we study the impact of a number of model design choices, namely different fine-tuning and pooling strategies. We observe that the impact of domain-specific fine-tuning on clinical STS is much less than that in the general domain, likely due to the concept richness of the domain. Based on this, we propose two data augmentation techniques. Experimental results on N2C2-STS demonstrate substantial improvements, validating the utility of the proposed methods.

1 Introduction

Semantic Textual Similarity (STS) is a language understanding task, involving assessing the degree of semantic equivalence between two pieces of text based on a graded numerical score (Corley and Mihalcea, 2005). It has application in tasks such as information retrieval (Hliaoutakis et al., 2006), question answering (Hoogeveen et al., 2018), and summarization (AL-Khassawneh et al., 2016). In this paper, we focus on STS in the clinical domain, in the context of a recent task within the framework of N2C2 (the National NLP Clinical Challenges),1 which makes use of the extended MedSTS data set (Wang et al., 2018), referred to as N2C2-STS, with limited annotated sentence pairs (1.6K) that are rich in domain terms.

1 https://portal.dbmi.hms.harvard.edu/projects/n2c2-2019-t1/

Neural STS models typically consist of encoders to generate text representations, and a regression layer to measure the similarity score (He et al., 2015; Mueller and Thyagarajan, 2016; He and Lin, 2016; Reimers and Gurevych, 2019). These architectures require a large amount of training data, an unrealistic requirement in low-resource settings.

Recently, pre-trained language models (LMs) such as GPT-2 (Radford et al., 2018) and BERT (Devlin et al., 2019) have been shown to benefit from pre-training over large corpora followed by fine-tuning over specific tasks. However, for small-scale datasets, only limited fine-tuning can be done. For example, GPT-2 achieved strong results across four large natural language inference (NLI) datasets, but was less successful over the small-scale RTE corpus (Bentivogli et al., 2009), performing below a multi-task biLSTM model. Similarly, while the large-scale pre-training of BERT has led to impressive improvements on a range of tasks, only very modest improvements have been achieved on STS tasks such as STS-B (Cer et al., 2017) and MRPC (Dolan and Brockett, 2005) (with 5.7k and 3.6k training instances, resp.). Compared to general-domain STS benchmarks, labeled clinical STS data is even more scarce, which tends to cause overfitting during fine-tuning. Moreover, further model scaling is a challenge due to GPU/TPU memory limitations and longer training time (Lan et al., 2019). This motivates us to search for model configurations which strike a balance between model flexibility and overfitting.

In this paper, we study the impact of a number of model design choices. First, following Reimers and Gurevych (2019), we study the impact of various pooling methods on STS, and find that convolution filters coupled with max and mean pooling outperform a number of alternative approaches. This can largely be attributed to their improved model expressiveness and ability to capture local interactions (Yu et al., 2019). Next, we consider different parameter fine-tuning strategies, with varying degrees of flexibility, ranging from keeping all parameters frozen during training to allowing all parameters to be updated. This allows us to identify the optimal model flexibility without over-tuning, thereby further improving model performance.

Finally, inspired by recent studies, including sentence ordering prediction (Lan et al., 2019) and data-augmented question answering (Yu et al., 2019), we focus on data augmentation methods to expand the modest amount of training data. We first consider segment reordering (SR), permuting segments that are delimited by commas or semicolons. Our second method increases linguistic diversity with back translation (BT). Extensive experiments on N2C2-STS reveal the effectiveness of data augmentation on clinical STS, particularly when combined with the best parameter fine-tuning and pooling strategies identified in Section 3, achieving an absolute gain in performance.

2 Related Work

2.1 Model Configurations

In pre-training, a spectrum of design choices has been proposed to optimize models, such as the pre-training objective, training corpus, and hyperparameter selection. Specific examples of objective functions include masked language modeling in BERT, permutation language modeling in XLNet (Yang et al., 2019), and sentence order prediction (SOP) in ALBERT (Lan et al., 2019). Additionally, RoBERTa (Liu et al., 2019) explored the benefits of a larger mini-batch size, a dynamic masking strategy, and increasing the size of the training corpus (16G to 160G). However, all these efforts target improving downstream tasks indirectly, by optimizing the capability and generalizability of LMs while adapting a single fully-connected layer to capture task features.

Sentence-BERT (Reimers and Gurevych, 2019) makes use of task-specific structures to optimize STS, concentrating on computational and time efficiency, and is evaluated on relatively larger datasets in the general domain. For evaluating the impact of the number of layers transferred from the pre-trained language model to the supervised target task, GPT-2 has been analyzed on two datasets. However, they are both large: MultiNLI (Williams et al., 2018) with >390k instances, and RACE (Lai et al., 2017) with >97k instances. These tasks also both involve reasoning-related classification, as opposed to the nuanced regression task of STS.

2.2 Data Augmentation

Synonym replacement is one of the most commonly used data augmentation methods to simulate linguistic diversity, but it introduces ambiguity if accurate context-dependent disambiguation is not performed. Moreover, random selection and replacement of a single word, as used in general texts, is not plausible for term-rich clinical text, resulting in too much semantic divergence (e.g., patient to affected role and discharge to home to spark to home). By contrast, replacing a complete mention of a concept can increase error propagation due to the prerequisite concept extraction and normalization.

Random insertion, deletion, and swapping of words have been demonstrated to be effective on five text classification tasks (Wei and Zou, 2019). But those experiments targeted topic prediction, in contrast to semantic reasoning such as STS and MultiNLI. Intuitively, these operations do not change the overall topic of a text, but can skew the meaning of a sentence, undermining the STS task. Swapping an entire semantic segment may mitigate the risk of introducing label noise for the STS task.

Compared to the semantic and syntactic distortion potentially caused by the aforementioned methods, back translation (BT) (Sennrich et al., 2016), translating to a target language and then back to the original language, produces fluent augmented data and reliable improvements for tasks demanding adequate semantic understanding, such as low-resource machine translation (Xia et al., 2019) and question answering (Yu et al., 2019). This motivates our application of BT to low-resource clinical STS, to bridge linguistic variation between two sentences. This work represents the first exploration of applying BT to STS.

3 STS Model Configurations

In this section, we study the impact of a number of model design choices on BERT for STS, using a 12-layer base model initialized with pretrained weights.

3.1 Hierarchical Convolution (HConv)

The resource-poor and concept-rich nature of clinical STS makes it difficult to train a large model end-to-end on sentence pairs. To address this, most recent studies have made use of pre-trained language models, such as BERT. The most straightforward way to use BERT is the feature-based approach, where the output of the last transformer block is
taken as input to the task-specific classifier. Many have proposed the use of a dummy CLS token to generate the feature vector, where CLS is a special symbol added in front of every sequence during pre-training, with its final hidden state always used as the aggregate sequence representation for classification tasks; we refer to this as CLS pooling. Other types of pooling, such as mean and max pooling, are investigated by Reimers and Gurevych (2019). However, this results in inferior performance, as shown in the first row of Table 1.2 As a consequence, the best strategy for extracting feature vectors to represent a sentence remains an open question.

2 Due to space constraints, we limit our comparison to the CLS pooling strategy, based on the observation of little improvement when using other types of pooling (mean, max) and concatenation, or sequence-processing recurrent units.

Model         SICK-R      STS-B       N2C2-STS
Feature-based:
CLS-BERT      53.6/52.1   49.3/67.9   14.6/28.4
HConv-BERT    80.1/73.6   83.0/83.2   79.4/74.4
Fine-tuning:
CLS-BERT      88.6/82.9   90.0/89.6   86.7/81.9
HConv-BERT    88.7/83.5   90.1/89.6   87.7/80.7

Table 1: Pearson and Spearman correlation (r/ρ) between the predicted score and the gold labels for three STS datasets using the feature-based approach (upper half) and fine-tuning (bottom half) with CLS-BERT and HConv-BERT. Performance is reported by convention as r/ρ × 100.

In this work, we first experiment with the feature-based approach, coupled with convolutional filters. This is inspired by the use of convolutional filters in QANet (Yu et al., 2019) to capture local interactions. The difference lies in where the convolutional filters are applied. With QANet, multiple conv filters are incorporated into each transformer encoder block to process the input from the previous layer. In contrast, HConv-BERT is largely based on BERT, with the addition of a single task-specific classifier placed on top of BERT, consisting of conv filters organised in a hierarchical fashion. This results in a much simplified model, making HConv-BERT less prone to overfitting.

Specifically, we run a collection of convolutional filters with kernels of size k ∈ [2, 4], each with J = 768 output channels (indexed by j ∈ [1, J]), over the temporal axis (indexed by i ∈ [1, T]):

    c_{i,k_j} = w_{k_j} ∗ x_{i:i+k−1} + b_{k_j}    (1)
    c_{i,k} = [c_{i,k_1}; ...; c_{i,k_J}]          (2)

where x_{i:i+k−1} is the output BERT features for the token span i to i+k−1, ∗ is the convolution operation, w_{k_j} and b_{k_j} are the convolution filter and bias term for the j-th kernel of size k, and [a; b] denotes the concatenation of a and b.

To capture interactions between distant elements, we feed the output c_{i,k} into another convolution layer of kernel size 2 with M = 128 output channels (indexed by m ∈ [1, M]):

    c^k_{i,m} = w_m ∗ c_{i:i+1,k} + b_m    (3)
    c^k_i = [c^k_{i,1}; ...; c^k_{i,M}]    (4)

where c_{i:i+1,k} is the output of the first convolutional layer over the span i to i+1 as defined in Equation (2), and w_m and b_m are the filter and bias term for the second convolutional layer, with a kernel size of 2 and output dimension M = 128.

Lastly, we extract feature vectors by max and mean pooling over the temporal axis, followed by concatenation:

    v^k_max = max_i c^k_i,   v^k_avg = avg_i c^k_i               (5)
    v = [v^2_max; v^3_max; v^4_max; v^2_avg; v^3_avg; v^4_avg]   (6)

The upper half of Table 1 shows that the proposed hierarchical convolution (HConv) architecture provides substantial performance gains.

3.2 Model Flexibility

We also evaluate the utility of this mechanism in the fine-tuning setting with varying modelling flexibility. Concretely, we progressively increase the number of trainable parameters by transformer blocks. That is, for the base BERT model with 12 layers, we allow errors to be back-propagated through the last l layers while keeping the rest (12 − l) fixed. The results on STS-B and N2C2-STS are shown in Figure 1.

Figure 1: Evaluation of CLS-BERT and HConv-BERT over datasets from the general (STS-B) and clinical (N2C2) domains. r refers to Pearson correlation. N2C2-STS is split into 1233 and 409 instances for training and dev.

We observe a performance crossover between HConv and CLS-pooling on both datasets as the number of trainable transformer layers increases. While HConv reaches peak performance before the crossover, CLS-pooling often requires more blocks to be trainable to achieve comparable accuracy, rendering the model much slower. Notably, the proposed mechanism peaks with far fewer trainable blocks on N2C2-STS than STS-B. We speculate that this is due to the size difference between the two datasets. To verify this hypothesis, we further look into the relationship between the number of trainable transformer blocks and training data size.
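Equations (1)–(6) of Section 3.1 amount to two stacked one-dimensional convolutions followed by max and mean pooling and concatenation. A minimal NumPy sketch with toy channel sizes (standing in for J = 768 and M = 128; random weights, illustrative only, not the trained model):

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1-D convolution over the temporal axis.
    x: (T, D) inputs; w: (k, D, J) filters; b: (J,) bias -> (T-k+1, J)."""
    k = w.shape[0]
    return np.stack([
        np.einsum("kd,kdj->j", x[i:i + k], w) + b
        for i in range(x.shape[0] - k + 1)
    ])

def hconv_features(x, seed=0, J=8, M=4):
    """Hierarchical convolution + pooling, mirroring Eqs. (1)-(6)."""
    rng = np.random.default_rng(seed)
    maxes, means = [], []
    D = x.shape[1]
    for k in (2, 3, 4):                        # first-stage kernel sizes
        w1, b1 = rng.normal(size=(k, D, J)), np.zeros(J)
        c = conv1d(x, w1, b1)                  # Eqs. (1)-(2): (T-k+1, J)
        w2, b2 = rng.normal(size=(2, J, M)), np.zeros(M)
        ck = conv1d(c, w2, b2)                 # Eqs. (3)-(4): kernel size 2
        maxes.append(ck.max(axis=0))           # Eq. (5): max pooling
        means.append(ck.mean(axis=0))          # Eq. (5): mean pooling
    return np.concatenate(maxes + means)       # Eq. (6): concatenation
```

The concatenated vector v would then feed a small regression layer, which is the only task-specific component added on top of BERT in the feature-based setting.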
In Figure 2, we observe performance degradation as the size of the training data shrinks, with the models trained on the full set achieving far superior Pearson correlation to those trained on the smaller subsets. Zooming into the curve representing each subset, we find that peak performance is attained at different points depending on data size: with the smallest subset (500 instances), the number of parameter updates is also limited, and only updating the top few layers of transformer blocks is simply not enough to make the model fully adapt to the task. It is therefore beneficial to allow the model access to more trainable layers (e.g., 11) to improve performance. Based on this, we set the number of trainable blocks to 6 for SICK-R (consisting of 4,500 training instances), as presented in the bottom half of Table 1, with HConv outperforming CLS-pooling.

Figure 2: Impact of the number of trainable transformer blocks based on HConv-BERT over different data sizes, randomly sampled from STS-B, ranging from 500 instances to the full set (5,749).

4 Data Augmentation

The accuracy of an STS model unsurprisingly depends on the amount of labeled data. This is reflected in Figure 2, where models trained with more data outperform those with fewer training instances. In this section, we propose two data augmentation methods, namely segment reordering (SR) and back translation (BT), to address the data sparsity issue in clinical STS.

Segment reordering. Clinical texts often consist of text segments describing multiple events and patient symptoms. Each segment is often an independent semantic unit, separated by commas or semicolons. Inspired by the random word swapping of Wei and Zou (2019), we exploit this property and propose a heuristic, named segment reordering (SR), to generate permutations of the original sequence based on these segments. While we expect this to introduce some noise to the training data, our hypothesis is that the increase in training data size will outweigh this. For instance, consider the text new confusion or inability to stay alert and awake; feeling like you are going to pass out. Flipping the order of the two segments new confusion or inability to stay alert and awake and feeling like you are going to pass out will not hinder the overall understanding of the text. More formally, for a given pair of sentences S1 and S2, each consisting of a sequence of segments S1 = {s11, ..., s1m} and S2 = {s21, ..., s2n}, we generate a new pair by randomly permuting the segment order, effectively doubling the size of the training corpus.
Back translation. Inspired by the work of Yu et al. (2019), we make use of machine translation tools to perform back translation (BT). Here, we choose Chinese as the pivot language, as it is linguistically distant from English and supported by mature commercial translation solutions. That is, we first translate from English to Chinese and then back to English. We use Google Translate to translate each sentence in a sentence pair from English to Chinese, and Baidu Translation3 to translate back to English. For example, for the original sentence negative for cough and stridor, the back-translated result is bad for coughing and wheezing. We apply this to each sentence pair, doubling the amount of training data.

3 https://fanyi.baidu.com/

5 Experiments

5.1 Experimental Setup

We evaluate the effectiveness of SR and BT on N2C2-STS with four baseline models: BERTbase (Devlin et al., 2019) and BERTclinical (Alsentzer et al., 2019), both using CLS-pooling and consisting of 12 layers; ConvBERTbase, based on BERTbase with hierarchical convolution and fine-tuning over the last 4 layers (consistent with our findings on the best model configuration in Section 3); and ConvBERTSTS-B, where we take ConvBERTbase and fine-tune first over STS-B, and then over N2C2-STS.

We split the training partition of N2C2-STS into 1,233 (train) and 409 (dev) instances, and report results on the test set (412 instances).

5.2 Results

Experimental results are presented in Table 2.

Model              r      ρ
IBM-N2C2          90.1    —
BERTbase          86.7   81.9
 + SR             87.1   80.8
 + BT             87.2   81.7
BERTclinical      86.1   81.4
 + SR             87.4   82.7
 + BT             88.6   82.4
Conv1dBERTbase    87.7   80.7
 + SR             88.0   81.4
 + BT             88.1   82.2
Conv1dBERTSTS-B   87.9   82.5
 + SR             88.6   83.1
 + BT             89.4   83.0

Table 2: Pearson r and Spearman ρ on N2C2-STS for models with and without segment reordering ("SR") and back translation ("BT").

We see clear benefits of the two proposed data augmentation methods, consistently boosting performance across all categories, with BT providing larger gains than SR. This is likely caused by the rather naïve implementation of SR, resulting in unnatural segment sequences. A possible fix is to further filter out such irregular statements with a language model pre-trained on clinical corpora. We leave this for future work.

It is impressive that the best-performing configuration, ConvBERTSTS-B + BT, achieves results comparable to the state-of-the-art IBM-N2C2, an approach heavily reliant on external, domain-specific resources and an ensemble of multiple pre-trained language models.

We additionally conduct a cross-domain experiment on BIOSSES (Soğancıoğlu et al., 2017), a biomedical literature STS dataset comprising 100 sentence pairs derived from the Text Analysis Conference Biomedical Summarization task, with scores ranging from 0 (complete unrelatedness) to 4 (exact equivalence). Specifically, the baseline model BERTbase and the proposed ConvBERTSTS-B + BT are both fine-tuned on N2C2-STS, and then applied with no further training to BIOSSES. Despite the increase in task difficulty, the proposed method demonstrates strong generalisability, outperforming the baseline by absolute gains of 2.4 (r) and 3.9 (ρ), reaching 85.42/82.83 (r/ρ).

6 Conclusions

In this paper, we have presented an empirical study of the impact of a number of model design choices on a BERT-based approach to clinical STS. We have demonstrated that the proposed hierarchical convolution mechanism outperforms a number of conventional pooling methods. Also, we have investigated parameter fine-tuning strategies with varying degrees of flexibility, and identified the optimal number of trainable transformer blocks, thereby preventing over-tuning. Lastly, we have verified the utility of two data augmentation methods on clinical STS. It may be interesting to examine the impact of leveraging target languages other than Chinese in BT, which we leave for future work.
References

Yazan Alaya AL-Khassawneh, Naomie Salim, and Adekunle Isiaka Obasae. 2016. Sentence similarity techniques for automatic text summarization. Journal of Soft Computing and Decision Support Systems, 3(3):35–41.

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada.

Courtney Corley and Rada Mihalcea. 2005. Measuring the semantic similarity of texts. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 13–18. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Hua He, Kevin Gimpel, and Jimmy Lin. 2015. Multi-perspective sentence similarity modeling with convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1576–1586, Lisbon, Portugal.

Hua He and Jimmy Lin. 2016. Pairwise word interaction modeling with deep neural networks for semantic similarity measurement. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 937–948, San Diego, California.

Angelos Hliaoutakis, Giannis Varelas, Epimenidis Voutsakis, Euripides GM Petrakis, and Evangelos Milios. 2006. Information retrieval by semantic similarity. International Journal on Semantic Web and Information Systems (IJSWIS), 2(3):55–73.

Doris Hoogeveen, Andrew Bennett, Yitong Li, Karin M Verspoor, and Timothy Baldwin. 2018. Detecting misflagged duplicate questions in community question-answering archives. In Twelfth International AAAI Conference on Web and Social Media.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Jonas Mueller and Aditya Thyagarajan. 2016. Siamese recurrent architectures for learning sentence similarity. In Thirtieth AAAI Conference on Artificial Intelligence.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3980–3990, Hong Kong, China.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany.

Gizem Soğancıoğlu, Hakime Öztürk, and Arzucan Özgür. 2017. BIOSSES: a semantic sentence similarity estimation system for the biomedical domain. Bioinformatics, 33(14):i49–i58.

Yanshan Wang, Naveed Afzal, Sunyang Fu, Liwei Wang, Feichen Shen, Majid Rastegar-Mojarad, and Hongfang Liu. 2018. MedSTS: a resource for clinical semantic textual similarity. Language Resources and Evaluation, pages 1–16.

Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6381–6387, Hong Kong, China.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Mengzhou Xia, Xiang Kong, Antonios Anastasopoulos, and Graham Neubig. 2019. Generalized data augmentation for low-resource translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5786–5796, Florence, Italy. Association for Computational Linguistics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5754–5764.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2019. QANet: Combining local convolution with global self-attention for reading comprehension. In The Sixth International Conference on Learning Representations (ICLR).
Entity-Enriched Neural Models for Clinical Question Answering

Bhanu Pratap Singh Rawat¹*, Wei-Hung Weng²,³,#, So Yeon Min²,³,§, Preethi Raghavan³,⁴,†, Peter Szolovits²,³,‡
¹UMass-Amherst, ²MIT CSAIL, ³MIT-IBM Watson AI Lab, ⁴IBM Research, Cambridge
*[email protected], #[email protected], §[email protected], †[email protected], ‡[email protected]
*The author did this work while interning at MIT-IBM Watson AI Lab.

Proceedings of the BioNLP 2020 workshop, pages 112–122, Online, July 9, 2020. © 2020 Association for Computational Linguistics

Abstract

We explore state-of-the-art neural models for question answering on electronic medical records and improve their ability to generalize better on previously unseen (paraphrased) questions at test time. We enable this by learning to predict logical forms as an auxiliary task along with the main task of answer span detection. The predicted logical forms also serve as a rationale for the answer. Further, we also incorporate medical entity information in these models via the ERNIE (Zhang et al., 2019a) architecture. We train our models on the large-scale emrQA dataset and observe that our multi-task entity-enriched models generalize to paraphrased questions ∼5% better than the baseline BERT model.

Figure 1: A synthetic example of a clinical context, question, its logical form and the expected answer.
Context: The patient had an elective termination of her pregnancy on [DATE]. The work-up for the extent of the patient's disease included mri scan of the cervical and thoracic spine which revealed multiple metastatic lesions in the vertebral bodies; a T3 lesion extending from the body to the right neural foramina with foraminal obstruction. An abdominal and pelvic ct scan with iv contrast revealed bilateral pulmonary nodules and bilateral pleural effusions, extensive liver metastases, narrowing of the intrahepatic ivc and distention of the azygous system suggestive of ivc obstruction by liver metastases.
Question: How was the patient's extensive liver metastases diagnosed?
Paraphrase: What diagnosis was used for the patient's extensive liver metastases?
Logical Form: {LabEvent (x) [date=x, result=x] OR ProcedureEvent (x) [date=x, result=x] OR VitalEvent (x) [date=x, result=x]} reveals ConditionEvent (|problem|)
Answer: An abdominal and pelvic ct scan with iv contrast

1 Introduction

The field of question answering (QA) has seen significant progress with several resources, models and benchmark datasets. Pre-trained neural language encoders like BERT (Devlin et al., 2019) and its variants (Seo et al., 2016; Zhang et al., 2019b) have achieved near-human or even better performance on popular open-domain QA tasks such as SQuAD 2.0 (Rajpurkar et al., 2016). While there has been some progress in biomedical QA on medical literature (Šuster and Daelemans, 2018; Tsatsaronis et al., 2012), existing models have not been similarly adapted to the clinical domain of electronic medical records (EMRs).

Community-shared large-scale datasets like emrQA (Pampari et al., 2018) allow us to apply state-of-the-art models, establish benchmarks, innovate and adapt them to clinical domain-specific needs. emrQA enables question answering from electronic medical records (EMRs), where a question is asked by a physician against a patient's medical record (clinical notes). Thus, we adapt these models for EMR QA while focusing on model generalization via the following: (1) learning to predict the logical form (a structured semantic representation that captures the answering needs corresponding to a natural language question) along with the answer, and (2) incorporating medical entity embeddings into models for EMR QA. We now examine the motivation behind these.

A physician interacting with a QA system on EMRs may ask the same question in several different ways; one physician may frame a question as "Is the patient allergic to penicillin?" whereas another could frame it as "Does penicillin cause any allergic reactions to the patient?". Since paraphrasing is a common form of generalization in natural language processing (NLP) (Bhagat et al., 2009), a QA model should be able to generalize well to such paraphrased question variants that may not be seen during training (and avoid simply memorizing the questions). However, current state-of-the-art models do not consider the use of meta-information such as the semantic parse or logical form of the questions in unstructured QA. In order to give the model the ability to understand the semantic information about the answering needs of a question, we frame our problem in a multitask learning setting where the primary task is extractive QA and the auxiliary task is the logical form prediction of the question.

Fine-tuning on medical corpora (MIMIC-III, PubMed (Johnson et al., 2016; Lee et al., 2020)) helps models like BERT align their representations with medical vocabulary, since they were previously trained on open-domain corpora such as WikiText (Zhu et al., 2015). However, another challenge for developing EMR QA models is that different physicians can use different medical terminology to express the same entity; e.g., "heart attack" vs. "myocardial infarction". Mapping these phrases to the same UMLS semantic type¹, such as disease or syndrome (dsyn), provides common information between such medical terminologies. Incorporating such entity information about tokens in the context and question can further improve the performance of QA models for the clinical domain.

Our contributions are as follows:

1. We establish state-of-the-art benchmarks for EMR QA on a large clinical question answering dataset, emrQA (Pampari et al., 2018).

2. We demonstrate that incorporating an auxiliary task of predicting the logical form of a question helps the proposed models generalize well over unseen paraphrases, improving the overall performance on emrQA by ∼5% over BERT (Devlin et al., 2019) and by ∼3.5% over clinicalBERT (Alsentzer et al., 2019). We support this hypothesis by running our proposed model over both emrQA and another clinical QA dataset, MADE (Jagannatha et al., 2019).

3. The predicted logical form for unseen paraphrases helps in understanding the model better and provides a rationale (explanation) for why the answer was predicted for the provided question. This information is critical in the clinical domain, as it provides an accompanying answer justification for clinicians.

4. We incorporate medical entity information by including entity embeddings via the ERNIE architecture (Zhang et al., 2019a) and observe that the model accuracy and ability to generalize go up by ∼3% over BERTbase (Devlin et al., 2019).

¹https://metamap.nlm.nih.gov/SemanticTypesAndGroups.shtml

2 Problem Formulation

We formulate the EMR QA problem as a reading comprehension task. Given a natural language question (asked by a physician) and a context, where the context is a set of contiguous sentences from a patient's EMR (unstructured clinical notes), the task is to predict the answer span from the given context. Along with the (question, context, answer) triplet, clinical entities extracted from the question and context are also available as input, as is the logical form (LF): a structured representation that captures the answering needs of the question through the entities, attributes and relations required to be in the answer (Pampari et al., 2018). A question may have multiple paraphrases, where all paraphrases map to the same LF (and the same answer, Fig. 1).

3 Methodology

In this section, we briefly describe BERT (Devlin et al., 2019), ERNIE (Zhang et al., 2019a) and our proposed model.

3.1 Bidirectional Encoder Representations from Transformers (BERT)

BERT (Devlin et al., 2019) uses multi-layer bidirectional Transformer (Vaswani et al., 2017) networks to encode contextualized language representations. BERT representations are learned from two tasks: masked language modeling (Taylor, 1953) and next sentence prediction. We chose BERT because pre-trained BERT models can be fine-tuned with just one additional inference layer, and BERT achieved state-of-the-art results on a wide range of tasks, including question answering benchmarks such as SQuAD (Rajpurkar et al., 2016, 2018) and language inference tasks such as MultiNLI (Williams et al., 2017). clinicalBERT (Alsentzer et al., 2019) yielded superior performance on clinical NLP tasks such as the i2b2 named entity recognition (NER) challenges (Uzuner et al., 2011). It was created by further fine-tuning BERTbase on biomedical and clinical corpora (MIMIC-III) (Johnson et al., 2016).

3.2 Enhanced Language Representation with Informative Entities (ERNIE)

We adopt the ERNIE framework (Zhang et al., 2019a) to integrate entity-level clinical concept information into the BERT architecture, which has not yet been explored in previous works.
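Before describing the models, it helps to make the extractive-QA interface of §2 concrete: given per-token start and end scores from an encoder, the predicted answer is the highest-scoring valid (start ≤ end) span. A minimal NumPy sketch, where the function name and the `max_len` cap are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=30):
    """Pick the (start, end) pair maximizing start+end score with start <= end."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(start_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Toy example: token 3 scores highest as a start, token 5 as an end.
start = np.array([0.1, 0.2, 0.1, 2.0, 0.3, 0.1])
end   = np.array([0.1, 0.1, 0.2, 0.3, 0.4, 1.9])
print(best_span(start, end))  # (3, 5)
```

The same span-selection interface applies whether the encoder underneath is BERT, clinicalBERT or the entity-enriched variants described next.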
Figure 2: The network architecture of our multi-task learning question answering model (M-cERNIE). The question and context are provided to a multi-head attention model and are also passed through MetaMap to extract clinical entities, which are passed through a separate multi-head attention. The token and entity representations are then passed through an information fusion layer to extract entity-enriched token representations, which are then used for answer span prediction. The pooled sequence representation from the information fusion layer is passed through a logical form inference layer to predict the logical form. (Example input — Question: "Has the patient ever gone into edema?"; Context: "Extremities no clubbing, ... cyanosis or edema."; MetaMap entity: Finding; predicted logical form: ConditionEvent(|problem|) OR SymptomEvent(|problem|).)
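The data flow in Figure 2 can be summarized in a small sketch: token and entity sequences are encoded separately, fused per token (a zero entity vector where MetaMap found nothing), and fed to two heads, span selection and logical form classification. The shapes, variable names and plain linear layers below are illustrative assumptions, not the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D_TOK, D_ENT, N_LF, SEQ = 8, 4, 30, 6      # toy sizes; emrQA defines 30 logical forms

W_t = rng.normal(size=(D_TOK, D_TOK))      # affine layer for token embeddings
W_e = rng.normal(size=(D_ENT, D_TOK))      # affine layer for entity embeddings
b = np.zeros(D_TOK)
W_span = rng.normal(size=(D_TOK, 2))       # per-token start/end scores (span selection layer)
W_lf = rng.normal(size=(D_TOK, N_LF))      # logical form inference layer

def forward(tokens, entities):
    # Information fusion: h_j = sigma(W_t w_j + W_e e_k + b).
    # A zero entity row reduces this to the token-only case.
    h = np.tanh(tokens @ W_t + entities @ W_e + b)
    span_logits = h @ W_span               # shape (SEQ, 2): start and end logits
    pooled = h.mean(axis=0)                # pooled sequence representation
    lf_logits = pooled @ W_lf              # shape (N_LF,): logical form logits
    return span_logits, lf_logits

tokens = rng.normal(size=(SEQ, D_TOK))     # contextual token embeddings
entities = rng.normal(size=(SEQ, D_ENT))   # aligned MetaMap entity embeddings
entities[2] = 0.0                          # a token with no MetaMap entity
span_logits, lf_logits = forward(tokens, entities)
print(span_logits.shape, lf_logits.shape)  # (6, 2) (30,)
```

In the actual model the token encoder is a pre-trained transformer and the entity path is its own multi-head attention stack; the sketch only fixes the interfaces between the components.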
ERNIE has shown significant improvement on different entity typing and relation classification tasks, as it utilises the extra entity information provided by knowledge graphs. ERNIE uses BERT to extract contextualized token embeddings and a multi-head attention model to generate entity embeddings. These two sets of embeddings are aligned and provided as input to an information fusion layer, which produces entity-enriched token embeddings. For a token (w_j) and its aligned entity (e_k = f(w_j)), the information fusion process is as follows:

    h_j^(i) = σ(W_t^(i) w_j^(i) + W_e^(i) e_k^(i) + b^(i))    (1)

Here h_j represents the entity-enriched token embedding, σ is the non-linear activation function, W_t refers to an affine layer for token embeddings and W_e refers to an affine layer for entity embeddings. For tokens without corresponding entities, the information fusion process becomes:

    h_j^(i) = σ(W_t^(i) w_j^(i) + b^(i))    (2)

Initially, each entity embedding is assigned randomly and is fine-tuned along with the token embeddings throughout the training procedure. The ERNIE architecture remains applicable even when logical forms are not available.

3.3 Multi-task Learning for Extractive QA

In order to improve the ability of a QA model to generalize better over paraphrases, it helps to provide the model with information about the logical form that links these paraphrases. Since the answer to all the paraphrased questions is the same (and hence the logical form is the same), we constructed a multi-task learning framework to incorporate the logical form information into the model. Thus, along with predicting the answer span, we added an auxiliary task to also predict the corresponding logical form of the question. Multi-task learning provides an inductive bias that enhances the primary task's performance via auxiliary tasks (Weng et al., 2019). In our setting, the primary task is span detection of the answer and the auxiliary task is logical form prediction, for both emrQA and MADE (both datasets are explained in detail in §4). The final loss for our model is defined as:

    L_model = ω L_lf + (1 − ω) L_span,    (3)

where ω is the weight given to the loss of the auxiliary task (L_lf), logical form prediction; L_span is the loss for answer span prediction and L_model is the final loss for our proposed model. The multi-task learning model can work with either BERT or ERNIE as the base model. Figure 2 depicts the proposed multi-task model that predicts both the answer and the logical form given a question, along with the ERNIE architecture used to learn entity-enriched token embeddings.

4 Datasets

We used the emrQA² and MADE³ datasets for our experiments. We provide a brief summary of each dataset and the methodology followed to split these datasets into train and test sets.

MADE: The MADE 1.0 (Jagannatha et al., 2019) dataset was hosted as an adverse drug reactions (ADRs) and medication extraction challenge from EMRs. It was converted into a QA dataset by following the same procedure as enumerated in the emrQA literature (Pampari et al., 2018). The MADE QA dataset is smaller than emrQA, as emrQA consists of multiple datasets taken from i2b2 (Uzuner et al., 2011), whereas MADE only has relations and entity mentions specific to ADRs and medications. This results in a clinical QA dataset with different properties from emrQA. MADE also has fewer logical forms (8 LFs) than emrQA because of its fewer entities and relations. The 8 LFs for MADE are provided in Appendix B.

emrQA: The emrQA corpus (Pampari et al., 2018) is the only community-shared clinical QA dataset that consists of questions, posed by physicians against the electronic medical records (EMRs) of a patient, along with their answers.
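The joint objective in Eq. (3) is a convex combination of the two task losses, with ω a tuned hyper-parameter. A minimal sketch using cross-entropy for both tasks; splitting L_span into averaged start- and end-position losses is a common convention for span QA and an assumption here, not a detail the paper states:

```python
import numpy as np

def cross_entropy(logits, target):
    """Negative log-likelihood of the target class under softmax(logits)."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def multitask_loss(start_logits, end_logits, lf_logits,
                   start_gold, end_gold, lf_gold, omega=0.3):
    # L_span: average of start- and end-position losses
    l_span = 0.5 * (cross_entropy(start_logits, start_gold)
                    + cross_entropy(end_logits, end_gold))
    l_lf = cross_entropy(lf_logits, lf_gold)       # auxiliary task loss
    return omega * l_lf + (1 - omega) * l_span     # Eq. (3)

loss = multitask_loss(np.array([2.0, 0.1, 0.1]), np.array([0.1, 0.1, 2.0]),
                      np.array([0.5, 1.5]), 0, 2, 1)
print(round(loss, 3))  # 0.277
```

Setting ω = 0 recovers the plain extractive-QA objective, which is how the same code path can train the single-task baselines.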
The dataset was developed by leveraging existing annotations available for other clinical natural language processing (NLP) tasks (i2b2 challenge datasets (Uzuner et al., 2011)). It is a credible resource for clinical QA, as logical forms generated by a physician help slot-fill question templates and extract corresponding answers from annotated notes. Multiple question templates can be mapped to the same logical form (LF), as shown in Table 1, and are referred to as paraphrases of each other.

The emrQA corpus has over 1M question, logical form, and answer/evidence triplets; an example of a context, question, its logical form and a paraphrase is shown in Fig. 1. The evidences are the sentences from the clinical note that are relevant to a particular question. There are a total of 30 logical forms in the emrQA dataset.⁴

²https://github.com/panushri25/emrQA
³https://bio-nlp.org/index.php/projects/39-nlp-challenges
⁴https://github.com/panushri25/emrQA/blob/master/templates/templates-all.csv

Table 1: A logical form (LF) and its respective question templates (paraphrases).
LF: MedicationEvent (|medication|) [dosage=x]
  How much |medication| does the patient take per day?
  What is her current dose of |medication|?
  What is the current dose of the patient's |medication|?
  What is the current dose of |medication|?
  What is the dosage of |medication|?
  What was the dosage prescribed of |medication|?

4.1 Train/test splits

The emrQA dataset is generated using a semi-automated process that normalizes real physician questions to create question templates, associates expert-annotated logical forms with each template and slot-fills them using annotations for various NLP tasks from i2b2 challenge datasets (e.g., Fig. 1). emrQA is rich in paraphrases, as physicians often tend to express the same information need in different ways. As shown in Table 1, all paraphrases of a question map to the same logical form. Thus, if a model has observed some of the paraphrases, it should be able to generalize to the others effectively with the help of their shared logical form "MedicationEvent (|medication|) [dosage=x]". In order to simulate this, and to test the true capability of the model to generalize to unseen paraphrased questions, we create a splitting scheme and refer to it as a paraphrase-level split.

Paraphrase-level split: The basic idea is that some question templates would be observed by the model during training and the remaining ones would be used during validation and testing. The steps taken for creating this split are enumerated below:

1. First, the clinical notes are separated into train, val and test sets. Then the question, logical form and context triplets are generated for each set, resulting in the full dataset. Here the context is the set of contiguous sentences from the EMR.

2. Then, for each logical form (LF), 70% of its
corresponding question templates are chosen for the train dataset and the rest are kept for the validation and test datasets. Considering the LF shown in Table 1, four of the question templates (QT_tr) would be assigned for training and the other two (QT_v/t) for validation/testing. Any sample in the training dataset whose question is generated from the question template set QT_v/t would then be discarded; similarly, any validation/test sample with a question generated from the question template set QT_tr would be discarded.

3. To compare the generalizability of our model, we also keep a training dataset with both sets of question templates (QT_tr + QT_v/t). Essentially, a baseline model which has observed all the question templates (QT_tr + QT_v/t) should be able to perform better on the QT_v/t set than a model which has only observed the QT_tr set. This comparison helps us measure the improvement in performance contributed by logical forms even when a set of question templates is not observed by the model.

The dataset statistics for both emrQA and MADE are shown in Table 2. The training set with both question template sets (QT_tr + QT_v/t) is shown with '(r)' appended as a suffix, as it is essentially a random split, whereas the training set with only the QT_tr question templates is marked '(pl)' for the paraphrase-level split.

Table 2: Train, validation and test data splits.
Dataset  Split            Train    Val.    Test
emrQA    # Notes          433      44      47
         # Samples (pl)   133,589  21,666  19,401
         # Samples (r)    198,118  21,666  19,401
MADE     # Notes          788      88      213
         # Samples (pl)   73,224   4,806   9,235
         # Samples (r)    113,975  4,806   9,235

5 Experiments

In this section, we briefly discuss the experimental settings, the clinical entity extraction method, the implementation details of our proposed model and the evaluation metrics for our experiments.

5.1 Experimental Setting

As a reading comprehension style task, the model has to identify the span of the answer given the question-context pair. For both the emrQA and MADE datasets, the span is marked as the answer to the question and the sentence is marked as the evidence. Hence, we perform extractive question answering at two levels: sentence and paragraph.

Sentence setting: In this setting, the evidence sentence which contains the answer span is provided as the context, and the model has to predict the span of the answer given the question.

Paragraph setting: Clinical notes are noisy and often contain incomplete sentences, lists and embedded tables, making it difficult to segment paragraphs in notes. Hence, we define the context as the evidence sentence plus 15–20 sentences around it. We randomly chose the length of the paragraph (l_para) and another number smaller than the length of the paragraph (l_pre < l_para). We then chose the l_pre contiguous sentences which precede the evidence sentence in the EMR and the (l_para − l_pre) sentences after the evidence sentence. We adopted this strategy because the model could otherwise have benefited from the information that the evidence sentence is exactly in the middle of a fixed-length paragraph. The model has to predict the span of the answer from the l_para-sentence paragraph (context) given the question.

The datasets are suffixed with '-p' and '-s' for the paragraph and sentence settings respectively. The sentence setting is a relatively easier setting for the model compared to the paragraph setting, because the scope of the answer is narrowed down to fewer tokens and there is less noise. For both settings, as also mentioned in §4, we kept a train set where all the question templates (paraphrases) are observed by the model during training; it is referred to with the '(r)' suffix, suggesting 'random' selection and no filtering based on question templates (paraphrases). All these dataset abbreviations are shown in the first column of Table 3.

5.2 Extracting Entity Information

MetaMap (Aronson, 2001) uses a knowledge-intensive approach to discover the different clinical concepts referred to in a text according to the Unified Medical Language System (UMLS) (Bodenreider, 2004). The clinical ontologies embedded in MetaMap, such as SNOMED (Spackman et al., 1997) and RxNorm (Liu et al., 2005), are quite useful in extracting ∼127 entity types across diagnoses, medications, procedures and signs/symptoms. We shortlisted
these entities (semantic types) by mapping them to the entities which were used for creating the logical forms of the questions, as these are the main entities for which the questions are posed. The selected entities are: acab, aggp, anab, anst, bpoc, cgab, clnd, diap, emod, evnt, fndg, inpo, lbpr, lbtr, phob, qnco, sbst, sosy and topp. Their descriptions are provided in Appendix C.

These filtered entities (Table 7), extracted from MetaMap, are provided to ERNIE. A separate embedding space is defined for the entity embeddings, which are passed through a multi-head attention layer (Vaswani et al., 2017) before interacting with the token embeddings in the information fusion layer. The entity-enriched token embeddings are then used to predict the span of the answer from the context. We fine-tuned these entity embeddings along with the token embeddings, as opposed to using pre-learned entities that are not fine-tuned during downstream tasks (Zhang et al., 2019a). The architecture is illustrated in Fig. 2.

5.3 Implementation Details

The BERT model was released with pre-trained weights as BERTbase and BERTlarge. BERTbase has fewer parameters but achieved state-of-the-art results on a number of open-domain NLP tasks. We performed our experiments with BERTbase and hence, from here onwards, we refer to BERTbase as BERT. A version of BERTbase fine-tuned on clinical notes was released as clinicalBERT (cBERT) (Alsentzer et al., 2019). We use cBERT as the multi-head attention model for obtaining the token representations in ERNIE, and we refer to this version of ERNIE, with entities from MetaMap, as cERNIE (clinical ERNIE). Our final multi-task learning model, which incorporates the auxiliary task of predicting logical forms, is referred to as M-cERNIE (multi-task clinical ERNIE). The code for all the models is provided at https://github.com/emrQA/bionlp_acl20.

Evaluation Metrics: For our extractive question answering task, we used exact match and F1-score for evaluation, following earlier literature (Rajpurkar et al., 2016).

6 Results and Discussion

In this section, we compare the results of all the models introduced in §3. With the help of different experiments, we analyse whether the induced entity and logical form information helps the model achieve better performance. We also analyse the logical form predictions to understand whether they provide a rationale for the answer predicted by our proposed model. The compiled results for all the models are shown in Table 3. The hyper-parameter values for the best performing models are provided in Appendix A.

Table 3: F1-score and exact match values for models on emrQA and MADE. The '-s' suffix refers to the sentence setting and '-p' to the paragraph setting for the context provided in our reading comprehension style QA task. The '(pl)' marks the paraphrase-level split and '(r)' the random split, as explained in §4. BERT refers to BERTbase, cBERT to clinicalBERT, cERNIE to clinical ERNIE and M-cERNIE to the multi-task learning clinical ERNIE model.
Dataset       Model     F1-score        Exact Match
emrQA-s (pl)  BERT      72.13           65.81
              cBERT     74.75 (+2.62)   67.25 (+1.44)
              cERNIE    77.39 (+5.26)   70.17 (+4.36)
              M-cERNIE  79.87 (+7.74)   71.86 (+6.05)
emrQA-s (r)   cBERT     82.34           74.58
emrQA-p (pl)  BERT      64.19           56.30
              cBERT     65.45 (+1.26)   57.58 (+1.28)
              cERNIE    66.15 (+1.96)   59.80 (+3.5)
              M-cERNIE  67.21 (+3.02)   61.22 (+4.92)
emrQA-p (r)   cBERT     72.51           65.14
MADE-s (pl)   BERT      68.45           60.73
              cBERT     70.19 (+1.74)   62.00 (+1.27)
              cERNIE    71.51 (+3.06)   65.31 (+4.58)
              M-cERNIE  73.83 (+5.38)   67.53 (+6.8)
MADE-s (r)    cBERT     73.70           65.54
MADE-p (pl)   BERT      63.39           57.49
              cBERT     64.97 (+1.58)   58.94 (+1.45)
              cERNIE    65.71 (+2.32)   60.55 (+3.06)
              M-cERNIE  64.58 (+1.19)   59.39 (+1.9)
MADE-p (r)    cBERT     66.89           61.27

Does clinical entity information improve models' performance? Across all settings, the F1-score of cERNIE improves by ∼2–5% over BERT and ∼0.75–3% over cBERT. The exact match performance improved by ∼3–4.5% over BERT and ∼1.5–3.25% over cBERT. Also, as expected, the performance in the sentence setting (-s) improved relatively more than in the paragraph setting.
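The exact match and F1 metrics cited above follow the SQuAD-style evaluation: exact match checks identity of the normalized answer strings, and F1 is the harmonic mean of token-level precision and recall. A minimal sketch, with answer normalization simplified relative to the official SQuAD script:

```python
from collections import Counter

def normalize(text):
    # Simplified: lowercase and whitespace-tokenize (the official script
    # also strips punctuation and articles).
    return text.lower().split()

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("chest x ray", "Chest X Ray"))           # 1.0
print(round(f1_score("a chest x ray", "chest x ray"), 2))  # 0.86
```

F1 rewards partially overlapping spans, which matters for clinical answers whose boundaries (e.g. leading determiners or trailing modifiers) are often annotated inconsistently.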
score for evaluation as per earlier literature (Ra- The entity-enriched tokens help in identifying the jpurkar et al., 2016). tokens which are required by the question. For example, in Fig.3, the token ‘infiltrative’ in the 6 Results and Discussion question as well as the context get highlighted with In this section, we compare the results of all the the help of the identified entity ‘topp’ (therapeutic models that we introduced in 3. With the help of or preventive procedure) and then relevant tokens § different experiments, we try to analyse whether in the context, chest x ray, get highlighted with the the induced entity and logical form information relevant entity ‘diap’ (diagnostic procedure). This
117 information aids the model in narrowing down its the performance of M-cERNIE on MADE-s and focus to highlighted diagnostic procedures in the emrQA-s datasets for logical form prediction, as context for answer extraction. we saw most improvement in sentence setting (-s). We calculated macro-weighted precision, recall and Question: How was diffuse infiltrativetopp diagnosedfndg? F1-score for logical form classification. The model Context: Earlier that day, pt had a chest x ray which showed diap achieved a F1-score of 0.45 0.59 for both diffuse infiltrativetopp process concerning for ARDS. ∼ − Answer: chest x ray datasets, as shown in Table4, exact match setting. We analysed the confusion matrix of predicted LF Figure 3: An example of a question, context, their ex- and observed that the model mainly gets confused tracted entities and expected answer. between the logical forms which convey similar semantic information as shown in Fig.4. Does logical form information help the model Q1: What were the results of the abnormal BMI on 2094-12-02? generalize better? In order to answer this ques- Logical Form: LabEvent (|test|) [abnormalResultFlag=Y, date=|date|, tion, we compared the performance of our M- result=x] OR ProcedureEvent (|test|) [abnormalResultFlag=Y, date=|d ate|, result=x] OR VitalEvent (|test|) [date=|date|, (result=x)>vital.ref cERNIE model to cERNIE model and observed high] OR VitalEvent (|test|) [date=|date|, (result=x)
Setting   Dataset   Precision   Recall   F1-score
Exact     emrQA     0.65        0.61     0.59
Exact     MADE      0.47        0.52     0.45
Relaxed   emrQA     0.93        0.91     0.92
Relaxed   MADE      0.83        0.85     0.84

Table 4: Precision, Recall and F1-score for logical form prediction.

Can logical form information be induced in multi-class QA tasks as well? To answer this question, we performed another experiment in which the model has to classify evidence sentences versus non-evidence sentences, making it a two-class classification task. The model is provided a tuple of a question and a sentence, and it has to predict whether or not the sentence is evidence. The final loss of the model (L_model) changes to:

    L_model = ω · L_lf + (1 − ω) · L_evidence        (4)

where ω is the weight given to the loss of the auxiliary task (L_lf), logical form prediction; L_evidence is the loss for evidence classification; and L_model is the final loss for our proposed model. We conducted our experiments on the emrQA dataset, as evidence sentences are provided in it. In the multi-class setting, the [CLS] token representation is used for evidence classification as well as for logical form prediction.

Dataset   Model      Precision   Recall   F1-score
emrQA     cBERT      0.67        0.99     0.76
emrQA     cERNIE     0.69        0.98     0.78 (+0.02)
emrQA     M-cERNIE   0.73        0.99     0.82 (+0.06)

Table 5: Macro-weighted precision, recall and F1-score of the proposed models on the test dataset (multi-choice QA). For the model names, c: clinical; M: multitask.

The multi-task entity-enriched model (M-cERNIE) achieved an absolute improvement of 6% over cBERT and 4% over cERNIE. This suggests that the inductive bias introduced via LF prediction also helps to improve the overall performance of the model for multi-class QA.

7 Related Work

In the general domain, BERT-based models sit at the top of various leader boards across different tasks, including QA (Rajpurkar et al., 2018, 2016). Nogueira and Cho (2019) applied BERT to the MS MARCO passage-retrieval QA task and observed improvements over state-of-the-art results. Nogueira et al. (2019) further extended this work by combining BERT with re-ranking of the predictions for queries that would be issued for each document. However, BERT-based models have not been adapted to answering physician questions on EMRs.

In the case of domain-specific QA, logical forms or semantic parses are typically used to integrate domain knowledge through KB-based (knowledge-base) structured QA datasets, where a model is learnt to map a natural language question to an LF. GeoQuery (Zelle and Mooney, 1996) and ATIS (Dahl et al., 1994) are the oldest known manually generated question–LF annotations on closed-domain databases. QALD (Lopez et al., 2013), Free917 (Cai and Yates, 2013) and SimpleQuestions (Bordes et al., 2015) contain hundreds of hand-crafted questions and their corresponding database queries. Prior work has also used LFs as a way to generate questions via crowdsourcing (Wang et al., 2015). WebQuestions (Berant et al., 2013) contains thousands of questions from Google search, for which LFs are learned as latent representations that help answer questions from Freebase. Prior work has not investigated the utility of logical forms in unstructured QA, especially as a means to generalize a QA model across different paraphrases of a question.

There have also been efforts to use multi-task learning for efficient question answering. McCann et al. (2018) learn multiple tasks together, yielding an overall boost in model performance on SQuAD (Rajpurkar et al., 2016). Similarly, Lu et al. (2019) exploit information across different tasks at the intersection of vision and natural language processing to improve performance across all of them. Rawat et al. (2019) provided weak supervision to the model while predicting the answer, but little work has incorporated the logical form of the question for unstructured question answering in a multi-task setting. Hence, we decided to explore this direction and incorporate the structured semantic information of questions into extractive question answering.

8 Conclusion

The proposed entity-enriched QA models trained with an auxiliary task improve over state-of-the-art models by about 3–6% on the large-scale clinical QA dataset emrQA (Pampari et al., 2018), as well as on MADE (Jagannatha et al., 2019). We
also show that multitask learning of logical forms along with the answer results in better generalization over unseen paraphrases for EMR QA. The predicted logical forms also serve as an accompanying justification for the answer, helping to add credibility to the predicted answer for the physician.

Acknowledgement

This work is supported by the MIT-IBM Watson AI Lab, Cambridge, MA, USA.

References

Emily Alsentzer, John R Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. NAACL-HLT Clinical NLP Workshop.

Alan R Aronson. 2001. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In AMIA, page 17.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544.

Rahul Bhagat, Eduard Hovy, and Siddharth Patwardhan. 2009. Acquiring paraphrases from text corpora. In Proceedings of the Fifth International Conference on Knowledge Capture, pages 161–168.

Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.

Qingqing Cai and Alexander Yates. 2013. Large-scale semantic parsing via schema matching and lexicon extension. In ACL, volume 1, pages 423–433.

Deborah A Dahl, Madeleine Bates, Michael Brown, William Fisher, Kate Hunicke-Smith, David Pallett, Christine Pao, Alexander Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In HLT, pages 43–48.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT.

Abhyuday Jagannatha, Feifan Liu, Weisong Liu, and Hong Yu. 2019. Overview of the first natural language processing challenge for extracting medication, indication, and adverse drug events from electronic health record notes (MADE 1.0). Drug Safety, 42(1):99–111.

Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Simon Liu, Wei Ma, Robin Moore, Vikraman Ganesan, and Stuart Nelson. 2005. RxNorm: prescription for electronic drug information exchange. IT Professional, 7(5):17–23.

Vanessa Lopez, Christina Unger, Philipp Cimiano, and Enrico Motta. 2013. Evaluating question answering over linked data. Web Semantics: Science, Services and Agents on the World Wide Web, 21:3–13.

Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 2019. 12-in-1: Multi-task vision and language representation learning. arXiv preprint arXiv:1912.02315.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.

Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085.

Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document expansion by query prediction. arXiv preprint arXiv:1904.08375.

Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrQA: A large corpus for question answering on electronic medical records. EMNLP.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Bhanu Pratap Singh Rawat, Fei Li, and Hong Yu. 2019. Naranjo question answering using end-to-end multi-task learning model. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2547–2555. ACM.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.
Kent A Spackman, Keith E Campbell, and Roger A Côté. 1997. SNOMED RT: a reference terminology for health care. In Proceedings of the AMIA Annual Fall Symposium, page 640. American Medical Informatics Association.

Simon Šuster and Walter Daelemans. 2018. CliCR: a dataset of clinical case reports for machine reading comprehension. arXiv preprint arXiv:1803.09720.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.
Wilson L Taylor. 1953. “cloze procedure”: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.
George Tsatsaronis, Michael Schroeder, Georgios Paliouras, Yannis Almirantis, Ion Androutsopoulos, Eric Gaussier, Patrick Gallinari, Thierry Artieres, Michael R Alvers, Matthias Zschunke, et al. 2012. Bioasq: A challenge on large-scale biomedical se- mantic indexing and question answering. In 2012 AAAI Fall Symposium Series.
Özlem Uzuner, Brett R South, Shuying Shen, and Scott L DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information pro- cessing systems, pages 5998–6008.
Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In ACL-IJCNLP, volume 1, pages 1332–1342.
Wei-Hung Weng, Yuannan Cai, Angela Lin, Fraser Tan, and Po-Hsuan Cameron Chen. 2019. Multi- modal multitask representation learning for pathol- ogy biobank metadata prediction. arXiv preprint arXiv:1909.07846.
Adina Williams, Nikita Nangia, and Samuel R Bow- man. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
John M Zelle and Raymond J Mooney. 1996. Learn- ing to parse database queries using inductive logic programming. In Proceedings of the national con- ference on artificial intelligence, pages 1050–1055.
Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019a. Ernie: En- hanced language representation with informative en- tities. arXiv preprint arXiv:1905.07129.
Zhuosheng Zhang, Yuwei Wu, Junru Zhou, Sufeng Duan, and Hai Zhao. 2019b. Sg-net: Syntax-guided machine reading comprehension. arXiv preprint arXiv:1908.05147.
A Model Hyper-parameters

Most of the hyper-parameters across our models remained the same: learning rate: 2e−5, weight decay: 1e−5, warm-up proportion: 10%, and hidden dropout probability: 0.1. The parameters that varied across models for different datasets are enumerated in Table 6. The hyper-parameters provided in Table 6 apply to all models on a particular dataset. This also suggests that even after adding an auxiliary task, the proposed model doesn't need a lot of hyper-parameter tuning.

Dataset     Entity Embedding Dim   Auxiliary Task Wt.
emrQA-rel   100                    0.3
BoolQ       90                     0.3
emrQA       100                    0.3
MADE        80                     0.2

Table 6: Hyper-parameter values across different datasets.

Semantic Type   Description
acab            Acquired Abnormality
aggp            Age Group
anab            Anatomical Abnormality
anst            Anatomical Structure
bpoc            Body Part, Organ, or Organ Component
cgab            Congenital Abnormality
clnd            Clinical Drug
diap            Diagnostic Procedure
emod            Experimental Model of Disease
evnt            Event
fndg            Finding
inpo            Injury or Poisoning
lbpr            Laboratory Procedure
lbtr            Laboratory or Test Result
phob            Physical Object
qnco            Quantitative Concept
sbst            Substance
sosy            Sign or Symptom
topp            Therapeutic or Preventive Procedure

Table 7: Selected semantic types as per MetaMap and their brief descriptions.
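The auxiliary-task weight in Table 6 is the ω of the combined objective in Equation 4. As a concrete illustration, that weighted loss can be sketched in PyTorch; the function and tensor names here are illustrative, not taken from the authors' released code:

```python
import torch
import torch.nn.functional as F

def multitask_loss(lf_logits, lf_labels, ev_logits, ev_labels, omega=0.3):
    """L_model = omega * L_lf + (1 - omega) * L_evidence (Equation 4).
    omega is the auxiliary-task weight from Table 6 (e.g., 0.3 for emrQA)."""
    loss_lf = F.cross_entropy(lf_logits, lf_labels)   # logical form prediction
    loss_ev = F.cross_entropy(ev_logits, ev_labels)   # answer/evidence classification
    return omega * loss_lf + (1.0 - omega) * loss_ev

# Toy batch: 4 examples, 8 logical-form classes, 2 evidence classes.
lf_logits, ev_logits = torch.randn(4, 8), torch.randn(4, 2)
lf_labels, ev_labels = torch.tensor([0, 3, 7, 2]), torch.tensor([1, 0, 1, 1])
loss = multitask_loss(lf_logits, lf_labels, ev_logits, ev_labels)
```

With ω = 0 the auxiliary term vanishes and this reduces to the single-task objective used by the non-multitask baselines.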
B Logical forms (LFs) for MADE dataset

1. MedicationEvent (|medication|) [sig=x]
2. MedicationEvent (|medication|) causes ConditionEvent (x) OR SymptomEvent (x)
3. MedicationEvent (|medication|) given ConditionEvent (x) OR SymptomEvent (x)
4. [ProcedureEvent (|treatment|) given/conducted ConditionEvent (x) OR SymptomEvent (x)] OR [MedicationEvent (|treatment|) given ConditionEvent (x) OR SymptomEvent (x)]
5. MedicationEvent (x) CheckIfNull ([enddate]) OR MedicationEvent (x) [enddate>currentDate] OR ProcedureEvent (x) [date=x] given ConditionEvent (|problem|) OR SymptomEvent (|problem|)
6. MedicationEvent (x) CheckIfNull ([enddate]) OR MedicationEvent (x) [enddate>currentDate] given ConditionEvent (|problem|) OR SymptomEvent (|problem|)
7. MedicationEvent (|treatment|) OR ProcedureEvent (|treatment|) given ConditionEvent (x) OR SymptomEvent (x)
8. MedicationEvent (|treatment|) OR ProcedureEvent (|treatment|) improves/worsens/causes ConditionEvent (x) OR SymptomEvent (x)
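For illustration, templates of this kind can be represented as plain strings with |slot| markers filled from the question. The dictionary below transcribes a few of the templates above; the `fill` helper is hypothetical, not part of the paper's code:

```python
# A few of the MADE logical-form templates above, keyed by their number.
# |slots| are filled from the question; x binds to the answer entity.
MADE_LF_TEMPLATES = {
    1: "MedicationEvent (|medication|) [sig=x]",
    2: "MedicationEvent (|medication|) causes ConditionEvent (x) OR SymptomEvent (x)",
    3: "MedicationEvent (|medication|) given ConditionEvent (x) OR SymptomEvent (x)",
    7: ("MedicationEvent (|treatment|) OR ProcedureEvent (|treatment|) "
        "given ConditionEvent (x) OR SymptomEvent (x)"),
}

def fill(template_id, **slots):
    """Instantiate a template by substituting each |slot| marker."""
    lf = MADE_LF_TEMPLATES[template_id]
    for name, value in slots.items():
        lf = lf.replace(f"|{name}|", value)
    return lf
```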
C Selected entities from MetaMap

The list of selected semantic types, in the form of entities, along with their brief descriptions, is provided in Table 7.
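Downstream, MetaMap output can be restricted to these types. A small sketch, assuming entities arrive as (text, semantic-type) pairs; the input format is illustrative:

```python
# Semantic-type codes from Table 7.
SELECTED_TYPES = {
    "acab", "aggp", "anab", "anst", "bpoc", "cgab", "clnd", "diap", "emod",
    "evnt", "fndg", "inpo", "lbpr", "lbtr", "phob", "qnco", "sbst", "sosy",
    "topp",
}

def filter_entities(entities):
    """Keep only entities whose MetaMap semantic type is in the selected set."""
    return [(text, st) for text, st in entities if st in SELECTED_TYPES]

ents = [("atrial fibrillation", "fndg"),   # Finding: kept
        ("yesterday", "tmco"),             # Temporal Concept: dropped
        ("warfarin", "clnd")]              # Clinical Drug: kept
kept = filter_entities(ents)
```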
Evidence Inference 2.0: More Data, Better Models
Jay DeYoung⋆Ψ, Eric Lehman⋆Ψ, Ben NyeΨ, Iain J. MarshallΦ, and Byron C. WallaceΨ
⋆Equal contribution
ΨKhoury College of Computer Sciences, Northeastern University
ΦKing's College London
{deyoung.j, lehman.e, nye.b, b.wallace}@northeastern.edu, [email protected]
Abstract

How do we most effectively treat a disease or condition? Ideally, we could consult a database of evidence gleaned from clinical trials to answer such questions. Unfortunately, no such database exists; clinical trial results are instead disseminated primarily via lengthy natural language articles. Perusing all such articles would be prohibitively time-consuming for healthcare practitioners; they instead tend to depend on manually compiled systematic reviews of medical literature to inform care. NLP may speed this process up, and eventually facilitate immediate consult of published evidence. The Evidence Inference dataset (Lehman et al., 2019) was recently released to facilitate research toward this end. This task entails inferring the comparative performance of two treatments, with respect to a given outcome, from a particular article (describing a clinical trial) and identifying supporting evidence. For instance: Does this article report that chemotherapy performed better than surgery for five-year survival rates of operable cancers? In this paper, we collect additional annotations to expand the Evidence Inference dataset by 25%, provide stronger baseline models, systematically inspect the errors that these make, and probe dataset quality. We also release an abstract-only (as opposed to full-text) version of the task for rapid model prototyping. The updated corpus, documentation, and code for new baselines and evaluations are available at http://evidence-inference.ebm-nlp.com/.

1 Introduction

As reports of clinical trials continue to amass at rapid pace, staying on top of all current literature to inform evidence-based practice is next to impossible. As of 2010, about seventy clinical trial reports were published daily, on average (Bastian et al., 2010). This has risen to over one hundred thirty trials per day.¹ Motivated by the rapid growth in clinical trial publications, there now exist a plethora of tools to partially automate the systematic review task (Marshall and Wallace, 2019). However, efforts at fully integrating the PICO framework into this process have been limited (Eriksen and Frandsen, 2018). What if we could build a database of Participants,² Interventions, Comparisons, and Outcomes studied in these trials, and the findings reported concerning these? If done accurately, this would provide direct access to which treatments the evidence supports. In the near-term, such technologies may mitigate the tedious work necessary for manual synthesis.

Recent efforts in this direction include the EBM-NLP project (Nye et al., 2018) and Evidence Inference (Lehman et al., 2019), both of which comprise annotations collected on reports of Randomized Control Trials (RCTs) from PubMed.³ Here we build upon the latter, which tasks systems with inferring findings in full-text reports of RCTs with respect to particular interventions and outcomes, and extracting evidence snippets supporting these.

¹ See https://ijmarshall.github.io/sote/.
² We omit Participants in this work as we focus on the document-level task of inferring study result directionality, and the Participants are inherent to the study, i.e., studies do not typically consider multiple patient populations.
³ https://pubmed.ncbi.nlm.nih.gov/

We expand the Evidence Inference dataset and evaluate transformer-based models (Vaswani et al., 2017; Devlin et al., 2018) on the task. Concretely, our contributions are:

• We describe the collection of an additional 2,503 unique 'prompts' (see Section 2) with matched full-text articles; this is a 25% expansion of the original evidence inference dataset that we will release. We additionally have collected an abstract-only subset of data intended to facilitate rapid iterative design of models,
Proceedings of the BioNLP 2020 workshop, pages 123–132, Online, July 9, 2020. © 2020 Association for Computational Linguistics
as working over full-texts can be prohibitively time-consuming.

• We introduce and evaluate new models, achieving SOTA performance for this task.

• We ablate components of these models and characterize the types of errors that they tend to still make, pointing to potential directions for further improving models.

2 Annotation

In the Evidence Inference task (Lehman et al., 2019), a model is provided with a full-text article describing a randomized controlled trial (RCT) and a 'prompt' that specifies an Intervention (e.g., aspirin), a Comparator (e.g., placebo), and an Outcome (e.g., duration of headache). We refer to these as ICO prompts. The task then is to infer whether a given article reports that the Intervention resulted in a significant increase, significant decrease, or produced no significant difference in the Outcome, as compared to the Comparator.

Our annotation process largely follows that outlined in Lehman et al. (2019); we summarize it briefly here. Data collection comprises three steps: (1) prompt generation; (2) prompt and article annotation; and (3) verification. All steps are performed by Medical Doctors (MDs) hired through Upwork.⁴ Annotators were divided into mutually exclusive groups performing these tasks, described below. Combining this new data with the dataset introduced in Lehman et al. (2019) yields in total 12,616 unique prompts stemming from 3,346 unique articles, increasing the original dataset by 25%.⁵ To acquire the new annotations, we hired 11 doctors: 1 for prompt generation, 6 for prompt annotation, and 4 for verification.

2.1 Prompt Generation

In this collection phase, a single doctor is asked to read an article and identify triplets of interventions, comparators, and outcomes; we refer to these as ICO prompts. Each doctor is assigned a unique article, so as not to overlap with one another. Doctors were asked to find a maximum of 5 prompts per article as a practical trade-off between the expense of exhaustive annotation and acquiring annotations over a variety of articles. This resulted in our collecting 3.77 prompts per article, on average. We asked doctors to derive at least 1 prompt from the body (rather than the abstract) of the article. A large part of the task's difficulty stems from the wide variety of treatments and outcomes used in the trials: 35.8% of interventions, 24.0% of comparators, and 81.6% of outcomes are unique to one another.

In addition to these ICO prompts, doctors were asked to report the relationship between the intervention and comparator with respect to the outcome, and cite the span from the article that supports their reasoning. We find that 48.4% of the collected prompts can be answered using only the abstract. However, 63.0% of the evidence spans supporting judgments (provided by both the prompt generator and prompt annotator) are from outside of the abstract. Additionally, 13.6% of evidence spans cover more than one sentence in length.

2.2 Prompt Annotation

Following the guidelines presented in Lehman et al. (2019), each prompt was assigned to a single doctor. They were asked to report the difference between the specified intervention and comparator, with respect to the given outcome. In particular, options for this relationship were: "increase", "decrease", "no difference", or "invalid prompt." Annotators were also asked to mark a span of text supporting their answers: a rationale. However, unlike Lehman et al. (2019), here annotators were not restricted via the annotation platform to look only at the abstract at first; they were free to search the article as necessary.

Because trials tend to investigate multiple interventions and measure more than one outcome, articles will usually correspond to multiple — potentially many — valid ICO prompts (with correspondingly different findings). In the data we collected, 62.9% of articles comprise at least two ICO prompts with different associated labels (for the same article).

2.3 Verification

Given both the answers and rationales of the prompt generator and prompt annotator, a third doctor — the verifier — was asked to determine the validity of both of the previous stages.⁶ We estimate the accuracy of each task with respect to these verification labels. For prompt generation, answers

⁴ http://upwork.com.
⁵ We use the first release of the data by Lehman et al., which included 10,137 prompts. A subsequent release contained 10,113 prompts, as the authors removed prompts where the answer and rationale were produced by different doctors.
⁶ The verifier can also discard low-quality or incorrect prompts.
Figure 1: BERT-to-BERT pipeline. Evidence identification and classification stages are trained separately. The identifier is trained via negative samples against the positive instances, the classifier via only those same positive evidence spans. Decoding assigns a score to every sentence in the document, and the sentence with the highest evidence score is passed to the classifier.

were 94.0% accurate, and rationales were 96.1% accurate. For prompt annotation, the answers were 90.0% accurate, and the accuracy of the rationales was 88.8%. The drop in accuracy between prompt generation answers and prompt annotation answers is likely due to confusion with respect to the scope of the intervention, comparator, and outcome.

We additionally calculated agreement statistics amongst the doctors across all stages, yielding a Krippendorff's α of 0.854. In contrast, the agreement between prompt generator and annotator (excluding the verifier) was α = 0.784.

2.4 Abstract Only Subset

We subset the articles and their content, yielding 9,680 of 24,686 annotations, or approximately 40%. This leaves 6,375 prompts, 50.5% of the total.

3 Models

We consider a simple BERT-based (Devlin et al., 2018) pipeline comprising two independent models, as depicted in Figure 1. The first identifies evidence-bearing sentences within an article for a given ICO. The second model then classifies the reported findings for an ICO prompt using the evidence extracted by the first model. These models place a dense layer on top of representations yielded by Biomed RoBERTa (Gururangan et al., 2020),⁷ a variant of RoBERTa (Liu et al., 2019) pre-trained over scientific corpora,⁸ followed by a Softmax.

Specifically, we first perform sentence segmentation over full-text articles using ScispaCy (Neumann et al., 2019). We use this segmentation to recover evidence-bearing sentences. We train an evidence identifier by learning to discriminate between evidence-bearing sentences and randomly sampled non-evidence sentences.⁹ We then train an evidence classifier over the evidence-bearing sentences to characterize the trial's finding as reporting that the Intervention significantly decreased, did not significantly change, or significantly increased the Outcome compared to the Comparator in an ICO. When making a prediction for an (ICO, document) pair we use the highest-scoring evidence sentence from the identifier, feeding this to the evidence classifier for a final result. Note that the evidence classifier is conditioned on the ICO frame; we prepend the ICO embedding (from Biomed RoBERTa) to the embedding of the identified evidence snippet. Reassuringly, removing this signal degrades performance (Table 1).

For all models we fine-tuned the underlying BERT parameters. We trained all models using the Adam optimizer (Kingma and Ba, 2014) with a BERT learning rate of 2e-5. We train these models for 10 epochs, keeping the best-performing version on a nested held-out set with respect to macro-averaged F-scores.

⁷ An earlier version of this work used SciBERT (Beltagy et al., 2019); we preserve these results in Appendix C.
⁸ We use the [CLS] representations.
⁹ We train this via negative sampling because the vast majority of sentences are not evidence-bearing.
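The negative-sampling setup for the evidence identifier can be sketched as follows. This is a simplification with hypothetical field names, assuming k negatives drawn uniformly per positive, not the released implementation:

```python
import random

def make_identifier_examples(sentences, evidence_idx, ico, k=4, rng=random):
    """Pair the ICO prompt with every evidence sentence (label 1) and with
    k randomly sampled non-evidence sentences per positive (label 0)."""
    evidence = set(evidence_idx)
    pool = [i for i in range(len(sentences)) if i not in evidence]
    examples = []
    for i in sorted(evidence):
        examples.append((ico, sentences[i], 1))
        for j in rng.sample(pool, min(k, len(pool))):
            examples.append((ico, sentences[j], 0))
    return examples

doc = ["Background.", "Methods.", "Mortality was significantly lower.",
       "Discussion of limitations."]
examples = make_identifier_examples(
    doc, evidence_idx=[2], ico=("progesterone", "placebo", "mortality"),
    k=2, rng=random.Random(0))
```

The identifier then only has to discriminate positives from the sampled negatives, rather than score every sentence during training.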
When training the evidence identifier, we experiment with different numbers of random samples per positive instance. We used Scikit-Learn (Pedregosa et al., 2011) for evaluation and diagnostics, and implemented all models in PyTorch (Paszke et al., 2019). We additionally reproduce the end-to-end system from Lehman et al. (2019): a gated recurrent unit (Cho et al., 2014) to encode the document, attention (Bahdanau et al., 2015) conditioned on the ICO, with the resultant vector (plus the ICO) fed into an MLP for a final significance decision.

4 Experiments and Results

Model                  Cond?   P      R      F
BR Pipeline            ✓       .784   .777   .780
BR Pipeline                    .513   .510   .510
BR Pipeline abs.       ✓       .776   .777   .776
Baseline               ✓       .526   .516   .514
Diagnostics:
BR Pipeline 1.0        ✓       .762   .764   .763
Baseline 1.0           ✓       .531   .519   .520
BR ICO Only                    .522   .515   .511
BR Oracle Spans        ✓       .851   .853   .851
BR Oracle Sentence     ✓       .845   .843   .843
BR Oracle Spans                .806   .812   .808
BR Oracle Sentence             .802   .795   .797
BR Oracle Spans abs.   ✓       .830   .823   .824
Baseline Oracle 1.0    ✓       .740   .739   .739
Baseline Oracle        ✓       .760   .761   .759

Table 1: Classification scores. BR Pipeline: Biomed RoBERTa pipeline. abs.: abstracts only. Baseline: model from Lehman et al. (2019). Diagnostic models: Baseline scores from Lehman et al. (2019); BR Pipeline when trained using the Evidence Inference 1.0 data; BR classifier when presented with only the ICO element, an entire human-selected evidence span, or a human-selected evidence sentence. Full-document BR models are trained with four negative samples; abstracts are trained with sixteen; Baseline oracle span results are from Lehman et al. (2019). In all cases, 'Cond?' indicates whether or not the model had access to the ICO elements; P/R/F scores are macro-averaged.

Our main results are reported in Table 1. We make a few key observations. First, the gains over the prior state-of-the-art model — which was not BERT based — are substantial: 20+ absolute points in F-score, even beyond what one might expect to see shifting to large pre-trained models.¹⁰ Second, conditioning on the ICO prompt is key; failing to do so results in substantial performance drops. Finally, we seem to have reached a plateau in terms of the performance of the BERT pipeline model; adding the newly collected training data does not budge performance (evaluated on the augmented test set). This suggests that to realize stronger performance here, we perhaps need a less naive architecture that better models the domain. We next probe specific aspects of our design and training decisions.

Impact of Negative Sampling  As negative sampling is a crucial part of the pipeline, we vary the number of samples and evaluate performance. We provide detailed results in Appendix A, but to summarize briefly: we find that two to four negative samples (per positive) performs best for the end-to-end task, with little change in both AUROC and accuracy of the best-fit evidence sentence. This is likely because the model needs only to maximize discriminative capability, rather than calibration.

Distribution Shift  In addition to the comparable Krippendorff α values computed above, we measure the impact of the new data on pipeline performance. We compare the performance of the pipeline with all data ("Biomed RoBERTa (BR) Pipeline") vs. just the old data ("Biomed RoBERTa (BR) Pipeline 1.0") in Table 1. As performance stays relatively constant, we believe the new data to be well-aligned with the existing release. This also suggests that the performance of the current simple pipeline model may have plateaued; better performance perhaps requires inductive biases via domain knowledge or improved strategies for evidence identification.

Oracle Evidence  We report two types of Oracle evidence experiments — one using ground-truth evidence spans ("Oracle spans"), the other using sentences for classification. In the former experiment, we choose an arbitrary evidence span¹¹ for each prompt for decoding. For the latter, we arbitrarily choose a sentence contained within a span. Both experiments are trained to use a matching classifier. We find that using a span versus a sentence causes a marginal change in score. Both diagnostics provide an upper bound on this model type, and improve over the original Oracle baseline by approximately 10 points. Using Oracle evidence as opposed to a trained evidence identifier leaves an end-to-end performance gap of approximately 0.08 F1.

¹⁰ To verify the impact of architecture changes, we experiment with randomly initialized and fine-tuned BERTs. We find that these perform worse than the original models in all instances and elide more detailed results.
¹¹ Evidence classification operates on a single sentence, but an annotator's selection is span-based. Furthermore, the prompt annotation stage may produce different evidence spans than prompt generation.
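At test time the pipeline reduces to ranking every sentence with the identifier and classifying the argmax, as described for Figure 1. A sketch with stand-in scoring functions; the real models are the fine-tuned Biomed RoBERTa identifier and classifier, and the label names here are illustrative:

```python
def decode(sentences, ico, identify, classify):
    """identify(ico, s) -> evidence score; classify(ico, s) -> label.
    The highest-scoring sentence is passed to the classifier."""
    best = max(sentences, key=lambda s: identify(ico, s))
    return best, classify(ico, best)

# Toy stand-ins for the two fine-tuned models.
identify = lambda ico, s: 1.0 if "significantly" in s else 0.1
classify = lambda ico, s: "sig_decreased" if "lower" in s else "no_diff"

sents = ["Patients were randomized.", "Mortality was significantly lower.",
         "Further trials are suggested."]
evidence, label = decode(sents, ("progesterone", "placebo", "mortality"),
                         identify, classify)
```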
Ev. Cls   ID Acc.   Sig⊖    Sig∼    Sig⊕
Sig⊖      .667      .684    .153    .163
Sig∼      .674      .060    .840    .099
Sig⊕      .652      .085    .107    .808

Table 2: Breakdown of the conditioned Biomed RoBERTa pipeline model's mistakes and performance by evidence class. ID Acc. is the 'identification accuracy', i.e., the percentage of instances for which the identifier selects a true evidence sentence. To the right is a confusion matrix for end-to-end predictions. 'Sig⊖' indicates significantly decreased, 'Sig∼' indicates no significant difference, and 'Sig⊕' indicates significantly increased.

Conditioning  As the pipeline can optionally condition on the ICO, we ablate over both the ICO and the actual document text. We find that using the ICO alone performs about as effectively as an unconditioned end-to-end pipeline, 0.51 F1 (Table 1). However, when fed Oracle sentences, the unconditioned pipeline's performance jumps to 0.80 F1. As shown in Table 3 (Appendix A), this large decrease in score can be attributed to the model losing the ability to identify the correct evidence sentence.

Mistake Breakdown  We further perform an analysis of model mistakes in Table 2. We find that the BERT-to-BERT model is somewhat better at identifying significantly decreased spans than spans for the significantly increased or no significant difference evidence classes. Spans for no significant difference tend to be classified correctly, and spans for the significantly increased category tend to be confused in a similar pattern to the significantly decreased class. End-to-end mistakes are relatively balanced between all possible confusion classes.

Abstract Only Results  We report a full suite of experiments over the abstracts-only subset in Appendix B. We find that the pipeline models perform similarly on the abstract-only subset, differing in score by less than .01 F1. Somewhat surprisingly, we find that the abstracts oracle model falls behind the full-document oracle model, perhaps due to a difference between language reporting general results vs. more detailed conclusions.

5 Conclusions and Future Work

We have introduced an expanded version of the Evidence Inference dataset. We have proposed and evaluated BERT-based models for the evidence inference task (which entails identifying snippets of evidence for particular ICO prompts in long documents and then classifying the reported finding on the basis of these), achieving state-of-the-art results on this task.

With this expanded dataset, we hope to support further development of NLP for assisting Evidence-Based Medicine. Our results demonstrate promise for the task of automatically inferring results from Randomized Control Trials, but still leave room for improvement. In future work, we intend to jointly automate the identification of ICO triplets and inference concerning these. We are also keen to investigate whether pre-training on related scientific 'fact verification' tasks might improve performance (Wadden et al., 2020).

Acknowledgments

We thank the anonymous BioNLP reviewers. This work was supported by the National Science Foundation, CAREER award 1750978.
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Hilda Bastian, Paul Glasziou, and Iain Chalmers. 2010. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up? PLoS Med, 7(9):e1000326.

Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. SciBERT: Pretrained contextualized embeddings for scientific text. arXiv preprint arXiv:1903.10676.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Mette Brandt Eriksen and Tove Faber Frandsen. 2018. The impact of patient, intervention, comparison, outcome (PICO) as a search strategy tool on literature search quality: a systematic review. Journal of the Medical Library Association, 106(4).

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Eric Lehman, Jay DeYoung, Regina Barzilay, and Byron C. Wallace. 2019. Inferring which medical treatments work from reports of clinical trials. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3705–3717, Minneapolis, Minnesota. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Iain J. Marshall and Byron C. Wallace. 2019. Toward systematic review automation: A practical guide to using machine learning tools in research synthesis. Systematic Reviews, 8(1):163.

Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. ScispaCy: Fast and robust models for biomedical natural language processing.

Benjamin Nye, Junyi Jessy Li, Roma Patel, Yinfei Yang, Iain Marshall, Ani Nenkova, and Byron Wallace. 2018. A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 197–207, Melbourne, Australia. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

David Wadden, Kyle Lo, Lucy Lu Wang, Shanchuan Lin, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. Fact or fiction: Verifying scientific claims. In Association for Computational Linguistics (ACL).
Appendix

A Negative Sampling Results

We report negative sampling results for Biomed RoBERTa pipelines in Table 3 and Figure 2.

Table 3: Evidence Inference v2.0 evidence identification validation scores varying across negative sampling strategies using Biomed RoBERTa in the pipeline.

Neg. Samples  Cond?  AUROC  Top1 Acc
1             X      0.973  0.682
2             X      0.972  0.700
4             X      0.972  0.671
8             X      0.961  0.492
16            X      0.590  0.027
1                    0.915  0.236
2                    0.921  0.226
4                    0.925  0.251
8                    0.899  0.165
16                   0.508  0.015

Figure 2: End to end pipeline scores for different negative sampling strategies with Biomed RoBERTa.

B Abstract Only Results

We repeat the experiments described in Section 4. Our primary findings are that the abstract-only task is easier and that sixteen negative samples perform better than four. Otherwise results follow a similar trend to the full-document task. We document these in Tables 4, 5, 6 and Figure 3.

Table 4: Classification Scores. Biomed RoBERTa abstract-only version of Table 1. All evidence identification models trained with sixteen negative samples.

Model            Cond?  P     R     F
BR Pipeline      X      .776  .777  .776
BR Pipeline             .513  .510  .510
Diagnostics:
ICO Only                .545  .543  .537
Oracle Spans     X      .830  .823  .824
Oracle Sentence  X      .845  .843  .843
Oracle Spans            .814  .809  .809
Oracle Sentence         .802  .795  .797

Table 5: Abstract only (v2.0) evidence identification validation scores varying across negative sampling strategies using Biomed RoBERTa.

Neg. Samples  Cond?  AUROC  Top1 Acc
1             X      0.983  0.647
2             X      0.982  0.664
4             X      0.981  0.680
8             X      0.978  0.656
16            X      0.980  0.673
1                    0.944  0.351
2                    0.953  0.373
4                    0.947  0.334
8                    0.938  0.273
16                   0.947  0.308

C SciBERT Results

We report the original SciBERT results in Tables 7, 8, 9 and Figures 4, 5. Table 7 contains the Biomed RoBERTa numbers for comparison. Note that the original SciBERT experiments use the Evidence Inference v1.0 dataset, as the v2.0 collection was incomplete at the time experiment configurations were determined. Biomed RoBERTa experiments use the v2.0 set for calibration. We find that Biomed RoBERTa generally performs better, with a notable exception in performance on abstracts-only Oracle span classification.

C.1 Negative Sampling Results

We report SciBERT negative sampling results in Table 9 and Figure 4.

C.2 Abstract Only Results

We repeat the experiments described in Section 4 and report results in Tables 10, 11, 12 and Figure 5. Our primary findings are that the abstract-only task is easier and that eight negative samples perform better than four. Otherwise results follow a similar trend to the full-document task.
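The negative-sampling knob varied across these tables can be illustrated with a small helper (a hypothetical sketch, not the authors' sampling code): for each annotated evidence span, k non-evidence sentences are drawn as negatives for training the evidence identification model.

```python
import random

def sample_negatives(n_sentences, evidence_idx, k=16, seed=0):
    """Pair each evidence sentence with k sampled non-evidence sentences.

    Illustrative sketch of the negative-sampling setting varied in this
    appendix; integer indices stand in for the sentences of one article.
    """
    rng = random.Random(seed)
    pool = [i for i in range(n_sentences) if i not in evidence_idx]
    return {pos: rng.sample(pool, min(k, len(pool))) for pos in evidence_idx}

# Example: a 30-sentence article with evidence at sentences 3 and 17.
pairs = sample_negatives(30, {3, 17}, k=16)
```

Raising k gives the identifier more non-evidence contrast at the cost of a more imbalanced training signal, which is the trade-off the tables above probe.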
Figure 3: End to end pipeline scores on the abstract-only subset for different negative sampling strategies with Biomed RoBERTa.

Table 7: Replica of Table 1 with both SciBERT and Biomed RoBERTa results. Classification Scores. BR Pipeline: Biomed RoBERTa BERT Pipeline; SB Pipeline: SciBERT Pipeline; abs.: abstracts only; Baseline: model from Lehman et al. (2019). Diagnostic models: Baseline scores from Lehman et al. (2019); BR Pipeline when trained using the Evidence Inference 1.0 data; BR classifier when presented with only the ICO element, an entire human-selected evidence span, or a human-selected evidence sentence. Full-document BR models are trained with four negative samples; abstracts are trained with sixteen; Baseline oracle span results are from Lehman et al. (2019). In all cases, 'Cond?' indicates whether or not the model had access to the ICO elements; P/R/F scores are macro-averaged over classes.

Model                 Cond?  P     R     F
BR Pipeline           X      .784  .777  .780
SB Pipeline           X      .750  .750  .749
BR Pipeline                  .513  .510  .510
SB Pipeline                  .489  .486  .486
BR Pipeline abs.      X      .776  .777  .776
SB Pipeline abs.      X      .803  .798  .799
Baseline              X      .526  .516  .514
Diagnostics:
BR Pipeline 1.0       X      .762  .764  .763
SB Pipeline 1.0       X      .749  .761  .753
Baseline 1.0          X      .531  .519  .520
BR ICO Only                  .522  .515  .511
SB ICO Only                  .494  .501  .494
BR Oracle Spans       X      .851  .853  .851
SB Oracle Spans       X      .840  .840  .838
BR Oracle Sentence    X      .845  .843  .843
SB Oracle Sentence    X      .829  .830  .829
BR Oracle Spans              .806  .812  .808
SB Oracle Spans              .786  .789  .787
BR Oracle Sentence           .802  .795  .797
SB Oracle Sentence           .780  .770  .773
BR Oracle Spans abs.  X      .830  .823  .824
SB Oracle Spans abs.  X      .866  .862  .863
Baseline Oracle 1.0          .740  .739  .739
Baseline Oracle       X      .760  .761  .759

Table 6: Breakdown of the abstract-only conditioned Biomed RoBERTa pipeline model mistakes and performance by evidence class. ID Acc. is the breakdown by final evidence truth. To the right is a confusion matrix for end-to-end predictions.

Ev. Cls  ID Acc.  Conf. Cls (Sig⊖ / Sig∼ / Sig⊕)
Sig⊖     .728     .761 / .067 / .172
Sig∼     .691     .130 / .802 / .068
Sig⊕     .573     .123 / .109 / .768
Table 8: Replica of Table 2 for SciBERT. Breakdown of the conditioned BERT pipeline model mistakes and performance by evidence class. ID Acc. is the "identification accuracy". To the right is a confusion matrix for end-to-end predictions. 'Sig⊖' indicates significantly decreased, 'Sig∼' indicates no significant difference, and 'Sig⊕' indicates significantly increased.

Ev. Cls  ID Acc.  Predicted Class (Sig⊖ / Sig∼ / Sig⊕)
Sig⊖     .711     .697 / .143 / .160
Sig∼     .643     .076 / .838 / .086
Sig⊕     .635     .146 / .141 / .713

Figure 4: End to end pipeline scores for different negative sampling strategies for SciBERT.
Table 9: Evidence Inference v1.0 evidence identification validation scores varying across negative sampling strategies for SciBERT.

Neg. Samples  Cond?  AUROC  Top1 Acc
1             X      .969   .663
2             X      .959   .673
4             X      .968   .659
8             X      .961   .627
16            X      .967   .593
1                    .894   .094
2                    .890   .181
4                    .843   .083
8                    .862   .170
16                   .403   .014
Table 10: Classification Scores. SciBERT/abstract-only version of Table 1. All evidence identification models trained with eight negative samples.

Model            Cond?  P     R     F
BERT Pipeline    X      .803  .798  .799
BERT Pipeline           .528  .513  .510
Diagnostics:
ICO Only                .480  .480  .479
Oracle Spans     X      .866  .862  .863
Oracle Sentence  X      .848  .842  .844
Oracle Spans            .804  .802  .801
Oracle Sentence         .817  .776  .783
Table 11: Abstract only (v1.0) evidence identification validation scores varying across negative sampling strategies for SciBERT.

Neg. Samples  Cond?  AUROC  Top1 Acc
1             X      0.980  0.573
2             X      0.978  0.596
4             X      0.977  0.623
8             X      0.950  0.609
16            X      0.975  0.615
1                    0.946  0.340
2                    0.939  0.342
4                    0.912  0.286
8                    0.938  0.313
16                   0.940  0.282

Figure 5: End to end pipeline scores on the abstract-only subset for different negative sampling strategies for SciBERT.
Table 12: Breakdown of the abstract-only conditioned SciBERT pipeline model mistakes and performance by evidence class. ID Acc. is the breakdown by final evidence truth. To the right is a confusion matrix for end-to-end predictions.

Ev. Cls  ID Acc.  Conf. Cls (Sig⊖ / Sig∼ / Sig⊕)
Sig⊖     .767     .750 / .044 / .206
Sig∼     .686     .092 / .816 / .092
Sig⊕     .591     .109 / .064 / .827
Table 13: Corpus statistics. Labels -1, 0, 1 indicate significantly decreased, no significant difference, and significantly increased, respectively.

                         Train            Dev          Test         Total
Number of prompts        10150            1238         1228         12616
Number of articles       2672             340          334          3346
Label counts (-1/0/1)    2465/4563/3122   299/544/395  295/516/417  3059/5623/3934
Personalized Early Stage Alzheimer's Disease Detection: A Case Study of President Reagan's Speeches
Ning Wang, Fan Luo, Vishal Peddagangireddy, K.P. Subbalakshmi, R. Chandramouli
Stevens Institute of Technology, Hoboken, NJ 07030
{nwang7, fluo4, vpeddaga, ksubbala, mouli}@stevens.edu
Abstract

Alzheimer's disease (AD)-related global healthcare cost is estimated to be $1 trillion by 2050. Currently, there is no cure for this disease; however, clinical studies show that early diagnosis and intervention help to extend the quality of life and inform technologies for personalized mental healthcare. Clinical research indicates that the onset and progression of Alzheimer's disease lead to dementia and other mental health issues; as a result, the language capabilities of patients start to decline. In this paper, we show that machine learning-based unsupervised clustering of, and anomaly detection with, linguistic biomarkers are promising approaches for intuitive visualization and personalized early stage detection of Alzheimer's disease. We demonstrate this approach on ten years (1980 to 1989) of President Ronald Reagan's speech data set. Key linguistic biomarkers that indicate early-stage AD are identified. Experimental results show that Reagan had early onset of Alzheimer's sometime between 1983 and 1987. This finding is corroborated by prior work that analyzed his interviews using a statistical technique. The proposed technique also identifies the exact speeches that reflect linguistic biomarkers for early stage AD.

1 Introduction

Alzheimer's disease is a serious mental health issue faced by the global population. About 44 million people worldwide are diagnosed with AD. The U.S. alone has 5.5 million AD patients. According to the Alzheimer's Association, the total cost of care for AD is estimated to be $1 trillion by 2050. There is no cure for AD yet; however, studies have shown that early diagnosis and intervention can delay the onset.

Regular mental health assessment is a key challenge faced by the medical community. This is due to a variety of reasons including social, economic, and cultural factors. Therefore, Internet based technologies that unobtrusively and continually collect, store, and analyze mental health data are critical. For example, a home smart speaker device can record a subject's speech periodically, automatically extract AD-related speech or linguistic features, and present easy to understand machine learning based analysis and visualization. Such a technology will be highly valuable for personalized medicine and early intervention. This may also encourage people to sign up for such a low cost and home-based AD diagnostic technology. Data and results of such a technology will also instantly provide invaluable information to mental health professionals.

Several studies show that subtle linguistic changes are observed even at the early stages of AD. In (Forbes-McKay and Venneri, 2005), more than 70% of AD patients scored low in a picture description task. Therefore, a critical research question is: can spontaneous temporal language impairments caused by AD be detected at an early stage of the disease? Relations between AD, language functions, and language domains are summarized in (Szatloczki et al., 2015). In (Venneri et al., 2008), a significant correlation between the lexical attributes characterising residual linguistic production and the integrity of regions of the medial temporal lobes in early AD patients is observed. Specific lexical and semantic deficiencies of AD patients at early to moderate stages are also detected in a verbal communication task in (Boyé et al., 2014). Therefore, in this paper, we explore a machine-learning based clustering and data visualization technique to identify linguistic changes in a subject over a time period. The proposed methodology is also highly personalized since it observes and analyzes the linguistic patterns of each individual separately, using only his/her own linguistic biomarkers over a period of time.
133 Proceedings of the BioNLP 2020 workshop, pages 133–139 Online, July 9, 2020 c 2020 Association for Computational Linguistics
First, we explore a machine learning algorithm called t-distributed stochastic neighbor embedding (t-SNE) (Maaten and Hinton, 2008). t-SNE is useful in dimensionality reduction suitable for visualization of high-dimensional datasets. It calculates the probability that two points in a high-dimensional space are similar, computes the corresponding probability in a low-dimensional space, and minimizes the difference between these two probabilities for mapping or visualization. During this process, the sum of Kullback-Leibler divergences (Liu and Shum, 2003) over all the data points is minimized. Our hypothesis is that high-dimensional AD-related linguistic features, when visualized in a low-dimensional space, may quickly and intuitively reveal useful information for early diagnosis. Such a visualization will also help individuals and general medical practitioners (who are first points of contact) to assess the situation for further tests. Second, we investigate two unsupervised machine learning techniques, one-class support vector machine (SVM) and isolation forest, for temporal detection of linguistic abnormalities indicative of early-stage AD. These proposed approaches are tested on President Reagan's speech dataset and corroborated with results from other research in the literature.

This paper is organized as follows. Background research is discussed in Section 2. Section 3 presents the Reagan speech data set used in this paper, the data pre-processing techniques that were applied, and the AD-related linguistic feature selection rationale and methodology. Section 4 contains the clustering and visualization of the hand-crafted linguistic features and anomaly detection to infer the onset of AD and to detect the time period of changes in the linguistic statistical characteristics, along with experimental results to demonstrate the proposed method. In Section 5, we describe the machine learning algorithms for detecting anomalies from personalized linguistic biomarkers collected over a period of time to identify early-stage AD. Concluding remarks are given in Section 6.

2 Background

The "picture description" task has been widely studied to differentiate between AD and non-AD or control subjects. In this task, a picture (the "cookie theft picture") is shown and the subject is asked to describe it. It has been observed that subjects with AD usually convey sparse information about the picture, ignoring expected facts and inferences (Giles et al., 1996). AD patients have difficulty in naming things and replace target words with simpler semantically neighboring words. Understanding of metaphors and sarcasm also deteriorates in people diagnosed with AD (Rapp and Wild, 2011).

Machine learning based classifier design to differentiate between AD and non-AD subjects is an active area of research. A significant correlation exists between dementia severity and linguistic measures such as confrontation naming, articulation, word-finding ability, and semantic fluency. Some studies have reported an 84.8 percent accuracy in distinguishing AD patients from healthy controls using temporal and acoustic features (Rentoumi et al., 2014; Fraser et al., 2016).

Our study differs from the prior work in several ways. Prior work attempts to identify linguistic features and machine learning classifiers for differentiating between AD and non-AD subjects from a corpus (e.g., Dementia Bank (MacWhinney, 2007)) containing text transcripts of interviews with patients and healthy controls. In this paper, we first analyze the speech transcripts (over several years) of a single person (President Ronald Reagan) and visualize time-dependent linguistic information using t-SNE. The goal is to identify visual clues for linguistic changes that may indicate the onset of AD. Note that such an easy to understand visual representation depicting differences in linguistic patterns will be useful to both a common person and a general practitioner (who is the first point of contact for the majority of patients). Since most general practitioners are not trained mental health professionals, the impact of such a visualization tool will be high, especially at the early stages of AD. Significant AD-related linguistic biomarkers derived from the t-SNE analysis are then used in two unsupervised clustering algorithms for detecting AD-related temporal linguistic anomalies. This provides an estimate for the time when early-stage AD symptoms begin to be observed.

3 Reagan Speeches: Data Collection, Pre-processing, and Feature Selection

We describe the data collection, pre-processing, and feature engineering methodologies in this section. Ronald Reagan was the 40th president (served from 1981 to 1989) of the United States of America. He was an extraordinary orator, a radio announcer, actor, and the host for a show called "General Electric Theatre." Clearly, for being successful in these professional domains one needs to have good memory, consciousness, intuition, command over the language, and ease of communicating with a large audience. Reagan officially announced that he had been diagnosed with AD on November 5, 1994, but it was speculated that his cognitive abilities were on the decline even while in office (Gottschalk et al., 1988). Therefore, analyzing his speech transcripts for early signs of AD may reveal interesting patterns, if any.

The Reagan Library is the repository of presidential records for President Reagan's administration. We downloaded his 98 speeches from 1980 to 1989, as shown in Table 1. We removed special characters, tags, and numbers and kept only the words from each speech transcript. The resulting data was then lemmatized and tokenized.

Table 1: President Reagan's speech dataset

Year  No. of speeches
1980  6
1981  8
1982  11
1983  12
1984  14
1985  13
1986  14
1987  10
1988  9
1989  1

Part-of-speech (POS) features: People diagnosed with AD use more pronouns than nouns. Therefore, POS features such as the number of pronouns and the pronoun-to-noun ratio are important. We identified adverbs, nouns, verbs, and pronouns for each speech transcript with natural language processing (NLP) tools. POS tags having at least 10 occurrences were selected to compute their percentage ratios. Similarly, words that have at least a frequency of 10 were selected and their occurrence percentages were computed. The full set of POS features we used were: (1) number of pronouns, (2) pronoun-noun ratio, (3) number of adverbs, (4) number of nouns, (5) number of verbs, (6) pronoun frequency rate, (7) noun frequency rate, (8) verb frequency rate, (9) adverb frequency rate, (10) word frequency rate, and (11) word frequency rate without excluding stop words.

Vocabulary Richness: AD patients show a decline in their vocabulary range. Therefore, the vocabulary richness metrics Honore's Statistic (HS), Sichel Measure (SICH), and Brunet's Measure (BM) were calculated for each speech. Higher values of Honore's and Sichel measures indicate greater vocabulary richness, but a higher value corresponds to low vocabulary richness for Brunet's measure.

Readability Measures: We computed two readability measures, namely the Automated Readability Index (ARI) and the Flesch-Kincaid readability (FKR) score. A higher ARI indicates complex speech with rich vocabulary, whereas a lower Flesch-Kincaid score indicates rich vocabulary.

Figure 1: Correlation matrix of the linguistic features.

Figure 1 shows the correlation matrix of the chosen linguistic features computed for the Reagan speech dataset. Note that some of the features are highly correlated; therefore, we pruned the feature set to the following 9 features:

1. pronoun-noun ratio
2. word frequency rate
3. verb frequency rate
4. pronoun frequency rate
5. adverb frequency rate
6. Honore's measure
7. Brunet's measure
8. Sichel measure
9. Automated Readability Index

Any further analysis in this paper uses the above 9 selected linguistic features.
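The paper does not spell out the formulas behind these features. The standard definitions of the vocabulary-richness measures and of ARI, which we assume are the ones intended, can be sketched in Python; the token list and POS tags below are illustrative inputs, not the paper's data.

```python
import math
from collections import Counter

def vocabulary_richness(tokens):
    """Standard vocabulary-richness measures over a token list.

    N = token count, V = distinct types, V1 = hapax legomena
    (types occurring once), V2 = dislegomena (types occurring twice).
    """
    counts = Counter(tokens)
    n, v = len(tokens), len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)
    v2 = sum(1 for c in counts.values() if c == 2)
    honore = 100 * math.log(n) / (1 - v1 / v)  # higher = richer (undefined if every type is a hapax)
    sichel = v2 / v                            # higher = richer
    brunet = n ** (v ** -0.165)                # higher = poorer
    return honore, sichel, brunet

def pronoun_noun_ratio(pos_tags):
    """Pronoun-to-noun ratio from Penn Treebank-style POS tags."""
    pronouns = sum(1 for t in pos_tags if t.startswith("PRP"))
    nouns = sum(1 for t in pos_tags if t.startswith("NN"))
    return pronouns / nouns if nouns else 0.0

def automated_readability_index(chars, words, sentences):
    """ARI from character, word, and sentence counts."""
    return 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43
```

Each speech transcript then reduces to one nine-dimensional feature vector built from these helpers plus the frequency-rate features.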
4 Clustering and Visualization of Linguistic Features

We selected the t-SNE machine learning technique for clustering and visualization of the linguistic features extracted from Reagan's speeches. t-SNE is better at creating a single map that reveals structure at many different scales, which is important for high-dimensional data that lie on several different, but related, low-dimensional manifolds. This implies that it can capture much of the local structure of the high-dimensional data very well, while also revealing global structure such as the presence of clusters at several scales (van der Maaten and Hinton, 2008). If there are AD-related linguistic patterns, then t-SNE may reveal them as clusters.

In t-SNE, the high-dimensional Euclidean distances between datapoints are converted into conditional probabilities representing similarities between them. For example, the similarity of data point $x_j$ with $x_i$ is the conditional probability $p_{j|i}$ that $x_i$ would pick $x_j$ as its neighbor. Neighbors are picked in proportion to their probability density under a Student t-distribution centered at $x_i$ in the low-dimensional space. The Kullback-Leibler divergence between a joint probability distribution $P$ in the high-dimensional space and a joint probability distribution $Q$ in the low-dimensional space is then minimized:

$$\min \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \qquad (1)$$

t-SNE has two tunable hyperparameters: perplexity and learning rate. Perplexity allows us to balance the weighting between local and global relationships of the data; it gives a sense of the number of close neighbors for each point. A perplexity value between 5 and 50 is recommended. The learning rate for t-SNE is usually in the interval [10, 1000]. For a high learning rate, each point will be approximately equidistant from its nearest neighbours; for a low rate, there will be few outliers and the points may look compressed into a dense cloud. Since t-SNE's cost function is not convex, different initializations can produce different results. After tuning t-SNE's hyperparameters for the dataset, we chose a perplexity value equal to 4 and a learning rate equal to 100.

Pronoun-to-noun ratio: Figure 2 shows t-SNE based clustering of the speech transcripts; speeches sharing similar patterns are clustered together. The radius of each circle is proportional to the pronoun-to-noun ratio. Notice that the cluster on the left side of the graph contains speeches (from 1983 to 1987) that have higher values of the ratio. Recall that a higher pronoun-to-noun ratio is an indicator of early stage AD.

Figure 2: 2-dimensional visualization of speech transcripts where the size of each circle in the map is proportional to the pronoun-to-noun ratio.

Pronoun frequency: Fig. 3 shows the clustering and low-dimensional visualization results when the size of each circle is proportional to the pronoun frequency. We again see that the cluster on the left side of the map contains speeches with higher pronoun frequency, another indicator of early stage AD.

Figure 3: 2-dimensional visualization of speech transcripts where the size of each circle in the map is proportional to the pronoun frequency.

Readability score: From Fig. 4 we observe that the speeches from 1983-1987 have lower readability scores.

Word repetition frequency: The word repetition frequency map in Fig. 5 shows that three speeches stand out. These three speeches have a higher repetition of high-frequency words. Interestingly, two of these speeches have word lengths smaller than the mean length of all the speeches.
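As an illustration (not the authors' code), the configuration above — perplexity 4, learning rate 100 — maps directly onto scikit-learn's `TSNE`; the random matrix below is a stand-in for the 98 × 9 speech-feature matrix.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(98, 9))  # stand-in for 98 speeches x 9 linguistic features

# Hyperparameters chosen in the text: perplexity=4, learning_rate=100.
tsne = TSNE(n_components=2, perplexity=4, learning_rate=100,
            init="random", random_state=0)
emb = tsne.fit_transform(X)   # 2-D coordinates, one row per speech
```

Each row of `emb` can then be plotted as a circle whose radius encodes a chosen biomarker (e.g., the pronoun-to-noun ratio), reproducing the style of maps described above.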
Figure 4: 2-dimensional visualization of speech transcripts where the size of each circle in the map is proportional to the readability score.

Figure 5: 2-dimensional visualization of speech transcripts where the size of each circle in the map is proportional to the word repetition frequency.

From these clustering results, the speeches identified as showing early signs of AD are:

• 03-23-1983: Address to the Nation on National Security ("Star Wars" SDI Speech)
• 05-28-1984: Remarks honoring the Vietnam War's Unknown
• 02-06-1985: State of the Union Address (American Revolution II)
• 04-24-1985: Address to the Nation on the Federal Budget and Deficit Reduction
• 05-05-1985: Speech at Bergen-Belsen Concentration Camp Memorial
• 05-05-1985: Speech at Bitburg Air Base
• 04-14-1986: Address to the Nation on the Air Strike Against Libya
• 06-24-1986: Address to the Nation on Aid to the Contras
• 08-12-1987: Address to the Nation on the Iran-Contra Affair
• 09-21-1987: Address to the General Assembly of the United Nations (INF Agreement and Iran)

Therefore, t-SNE based clustering and low-dimensional visualization of President Reagan's speeches from 1980 to 1989 reveals the following:

• he started showing signs of early AD well before the official announcement in 1994;
• it is highly likely that he developed AD sometime between 1983 and 1987;
• over time, President Reagan showed a significant reduction in the number of unique words but a significant increase in conversational fillers and non-specific nouns;
• the proposed method identifies specific speeches that exhibit linguistic markers for AD.

Some of these findings are corroborated by prior research that analyzed his interviews and compared them with President Bush's public speeches (Berisha et al., 2015).

5 Linguistic Anomaly Detector for AD

The t-SNE-clustering-based approach provides visualization of speeches that are statistically different ("anomalies"). But we need an automated method to identify these anomalies for early signs of AD. Therefore, we investigate a one-class support vector machine (SVM) anomaly detector (Erfani et al., 2016). This method is useful in practice when the majority of a subject's speeches over several years would be (statistically) typical of a healthy control ("normal") until he/she begins to exhibit early signs of AD. Our hypothesis is that early stage AD will begin to reveal itself as statistical anomalies in the linguistic feature space. In this section we investigate this hypothesis.

We designed a one-class SVM with the following hyperparameter values (the choices of these values are not discussed for the sake of clarity and focus): ν = 0.5 (an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors), kernel = rbf (radial basis function), and γ = 1 / (number of features × variance of features) (kernel coefficient).

Figure 6: Speeches from 1984 to 1986 are detected as abnormal or anomalies.

Figure 7: Isolation forest algorithm for AD detection.

Figure 8: Year-wise speeches identified as anomalous.

Figure 6 shows that the one-class SVM detector identified several speeches from 1984 to 1986 as anomalous. That is, President Reagan's speeches in these years are different from the previous years in the linguistic feature space.
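Both detectors used in this section have direct scikit-learn counterparts. The snippet below is a sketch under assumed inputs (a random stand-in for the 98 × 9 feature matrix, with ten shifted rows playing the anomalies), not the authors' implementation; `gamma="scale"` equals 1 / (n_features × Var(X)), the kernel coefficient given above, while the isolation-forest contamination level is illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(98, 9))   # placeholder: 98 speeches x 9 features
X[-10:] += 4.0                 # shift ten rows to act as "anomalous" speeches

# One-class SVM with the hyperparameters reported in the text.
svm_labels = OneClassSVM(nu=0.5, kernel="rbf", gamma="scale").fit_predict(X)

# Isolation forest: isolates anomalies directly, no "normal" profile needed.
iso = IsolationForest(n_estimators=100, contamination=0.1,
                      random_state=0).fit(X)
iso_labels = iso.predict(X)    # +1 = normal, -1 = anomaly
```

Mapping each flagged row back to its speech date yields the year-wise anomaly timelines discussed below.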
Therefore, it is likely that:

• he started showing signs of early AD well before the official announcement in 1994;
• the onset of AD started in 1984 and became more pronounced in 1985 and 1986, which is corroborated by prior research (e.g., (Berisha et al., 2015)) that analyzed his interviews.

A one-class SVM learns the profile of non-AD speeches as "normal" over a period of time and detects anomalies to signal the onset of AD. But in many practical instances, long historical data may not be available for a subject. In this case, we must identify anomalies explicitly instead of learning what is normal. Isolation forest (Ding and Fei, 2013) is an unsupervised machine learning algorithm that is applicable for this purpose.

We applied the isolation forest algorithm on the speech dataset and tuned its hyperparameters. Figures 7 and 8 show the results. Figure 7 shows the ten speeches that were isolated as anomalies; the corresponding dates of these speeches are seen in Figure 8. We observe that a few speeches between 1984 and 1988 are identified as linguistic anomalies. This time period overlaps significantly with the results of the one-class SVM and of t-SNE clustering.

6 Conclusions

A set of nine linguistic biomarkers for AD was identified. Two complementary unsupervised machine learning methods were applied on the linguistic features extracted from President Reagan's 98 speeches given between 1980 and 1989. The first method, t-SNE, identified and visualized speeches indicating early onset of AD. A higher pronoun usage frequency, lower readability scores, and a higher repetition of high-frequency words were revealed to be the key characteristics of potential AD-related speeches. A subset of speeches from 1983 to 1987 was detected to possess these characteristics.

The second machine learning method, one-class SVM, learned what is "normal" (i.e., non-AD speech) to detect anomalies in speeches over a period of time. This approach detected several speeches between 1984 and 1986 as potentially AD-related. Since normal speech may not be available historically, we also applied the isolation forest algorithm, which explicitly detects anomalies without learning what is normal. This detected 10 speeches from 1983 to 1987 as AD-related.

From the experimental analysis, our conclusion
is that President Reagan had signs of AD sometime between 1983 and 1988. This conclusion corroborates results from other studies in the literature. Note that the fact that President Reagan had AD was publicly disclosed only in November 1994.

References

Visar Berisha, Shuai Wang, Amy LaCross, and Julie Liss. 2015. Tracking discourse complexity preceding Alzheimer's disease diagnosis: a case study comparing the press conferences of presidents Ronald Reagan and George Herbert Walker Bush. Journal of Alzheimer's Disease, 45(3):959–963.

Maïté Boyé, Natalia Grabar, and Thi Mai Tran. 2014. Contrastive conversational analysis of language production by Alzheimer's and control people. In MIE, pages 682–686.

Zhiguo Ding and Minrui Fei. 2013. An anomaly detection approach based on isolation forest algorithm for streaming data using sliding window. IFAC Proceedings Volumes, 46(20):12–17.

Sarah M Erfani, Sutharshan Rajasegarar, Shanika Karunasekera, and Christopher Leckie. 2016. High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognition, 58:121–134.

Katrina E Forbes-McKay and Annalena Venneri. 2005. Detecting subtle spontaneous language decline in early Alzheimer's disease with a picture description task. Neurological Sciences, 26(4):243–254.

Kathleen C Fraser, Jed A Meltzer, and Frank Rudzicz. 2016. Linguistic features identify Alzheimer's disease in narrative speech. Journal of Alzheimer's Disease, 49(2):407–422.

Elaine Giles, Karalyn Patterson, and John R Hodges. 1996. Performance on the Boston cookie theft picture description task in patients with early dementia of the Alzheimer's type: missing information. Aphasiology, 10(4):395–408.

Louis A Gottschalk, Regina Uliana, and Ronda Gilbert. 1988. Presidential candidates and cognitive impairment measured from behavior in campaign debates. Public Administration Review, pages 613–619.

Ce Liu and Heung-Yeung Shum. 2003. Kullback-Leibler boosting. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings., volume 1, pages I–I. IEEE.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Brian MacWhinney. 2007. The TalkBank project. In Creating and Digitizing Language Corpora, pages 163–180. Springer.

Alexander M Rapp and Barbara Wild. 2011. Nonliteral language in Alzheimer dementia: a review. Journal of the International Neuropsychological Society, 17(2):207–218.

Vassiliki Rentoumi, Ladan Raoufian, Samrah Ahmed, Celeste A de Jager, and Peter Garrard. 2014. Features and machine learning classification of connected speech samples from patients with autopsy proven Alzheimer's disease with and without additional vascular pathology. Journal of Alzheimer's Disease, 42(s3):S3–S17.

Greta Szatloczki, Ildiko Hoffmann, Veronika Vincze, Janos Kalman, and Magdolna Pakaski. 2015. Speaking in Alzheimer's disease, is that an early sign? Importance of changes in language abilities in Alzheimer's disease. Frontiers in Aging Neuroscience, 7:195.

Annalena Venneri, William J McGeown, Heidi M Hietanen, Chiara Guerrini, Andrew W Ellis, and Michael F Shanks. 2008. The anatomical bases of semantic retrieval deficits in early Alzheimer's disease. Neuropsychologia, 46(2):497–510.
BIOMRC: A Dataset for Biomedical Machine Reading Comprehension
Petros Stavropoulos1,2, Dimitris Pappas1,2, Ion Androutsopoulos1, Ryan McDonald3,1
1Department of Informatics, Athens University of Economics and Business, Greece
2Institute for Language and Speech Processing, Research Center ‘Athena’, Greece
3Google Research
{pstav1993, pappasd, ion}@aueb.gr, [email protected]
Abstract

We introduce BIOMRC, a large-scale cloze-style biomedical MRC dataset. Care was taken to reduce noise, compared to the previous BIOREAD dataset of Pappas et al. (2018). Experiments show that simple heuristics do not perform well on the new dataset, and that two neural MRC models that had been tested on BIOREAD perform much better on BIOMRC, indicating that the new dataset is indeed less noisy, or at least that its task is more feasible. Non-expert human performance is also higher on the new dataset compared to BIOREAD, and biomedical experts perform even better. We also introduce a new BERT-based MRC model, the best version of which substantially outperforms all other methods tested, reaching or surpassing the accuracy of biomedical experts in some experiments. We make the new dataset available in three different sizes, also releasing our code, and providing a leaderboard.

1 Introduction

Creating large corpora with human annotations is a demanding process in both time and resources. Research teams often turn to distantly supervised or unsupervised methods to extract training examples from textual data. In machine reading comprehension (MRC) (Hermann et al., 2015), a training instance can be automatically constructed by taking an unlabeled passage of multiple sentences, along with another smaller part of text, also unlabeled, usually the next sentence. Then a named entity of the smaller text is replaced by a placeholder. In this setting, MRC systems are trained (and evaluated for their ability) to read the passage and the smaller text, and guess the named entity that was replaced by the placeholder, which is typically one of the named entities of the passage. This kind of question answering (QA) is also known as cloze-type questions (Taylor, 1953). Several datasets have been created following this approach, either using books (Hill et al., 2016; Bajgar et al., 2016) or news articles (Hermann et al., 2015). Datasets of this kind are noisier than MRC datasets containing human-authored questions and manually annotated passage spans that answer them (Rajpurkar et al., 2016, 2018; Nguyen et al., 2016). They require no human annotations, however, which is particularly important in biomedical question answering, where employing annotators with appropriate expertise is costly. For example, the BIOASQ QA dataset (Tsatsaronis et al., 2015) currently contains approximately 3k questions, much fewer than the 100k questions of SQUAD (Rajpurkar et al., 2016), exactly because it relies on expert annotators.

To bypass the need for expert annotators and produce a biomedical MRC dataset large enough to train (or pre-train) deep learning models, Pappas et al. (2018) adopted the cloze-style questions approach. They used the full text of unlabeled biomedical articles from PUBMED CENTRAL,[1] and METAMAP (Aronson and Lang, 2010) to annotate the biomedical entities of the articles. They extracted sequences of 21 sentences from the articles. The first 20 sentences were used as a passage and the last sentence as a cloze-style question. A biomedical entity of the ‘question’ was replaced by a placeholder, and systems have to guess which biomedical entity of the passage can best fill the placeholder. This allowed Pappas et al. to produce a dataset, called BIOREAD, of approximately 16.4 million questions. As the same authors reported, however, the mean accuracy of three humans on a sample of 30 questions from BIOREAD was only 68%. Although this low score may be due to the fact that the three subjects were not biomedical experts, it is easy to see, by examining samples of BIOREAD, that many examples of the dataset do not make sense. Many instances contain passages or questions crossing article sections, or originating from the references sections of articles, or they include captions and footnotes (Table 1). Another source of noise is METAMAP, which often misses or mistakenly identifies biomedical entities (e.g., it often annotates ‘to’ as the country Togo).

‘question’ originating from a caption: “figure 4 htert @entity6 and @entity4 XXXX cell invasion.”
‘question’ originating from a reference: “2004 , 17 , 250 257 .14967013 c samuni y. ; samuni u. ; goldstein s. the use of cyclic XXXX as hno scavengers .”
‘passage’ containing captions: “figure 2: distal UNK showing high insertion of rectum into common channel. figure 3: illustration of the cloacal malformation. figure 4: @entity5 showing UNK”

Table 1: Examples of noisy BIOREAD data. XXXX is the placeholder, and UNK is the ‘unknown’ token.

In this paper, we introduce BIOMRC, a new dataset for biomedical MRC that can be viewed as an improved version of BIOREAD. To avoid crossing sections and extracting text from references, captions, tables etc., we use abstracts and titles of biomedical articles as passages and questions, respectively, which are clearly marked up in PUBMED data, instead of using the full text of the articles. Using titles and abstracts is a decision that favors precision over recall. Titles are likely to be related to their abstracts, which reduces the noise-to-signal ratio significantly and makes it less likely to generate irrelevant questions for a passage. We replace a biomedical entity in each title with a placeholder, and we require systems to guess the hidden entity by considering the entities of the abstract as candidate answers. Unlike BIOREAD, we use PUBTATOR (Wei et al., 2012), a repository that provides approximately 25 million abstracts and their corresponding titles from PUBMED, with multiple annotations.[2] We use DNORM’s biomedical entity annotations, which are more accurate than METAMAP’s (Leaman et al., 2013). We also perform several checks, discussed below, to discard passage-question instances that are too easy, and we show that the accuracy of experts and non-expert humans reaches 85% and 82%, respectively, on a sample of 30 instances for each annotator type, which is an indication that the new dataset is indeed less noisy, or at least that the task is more feasible for humans. Following Pappas et al. (2018), we release two versions of BIOMRC, LARGE and LITE, containing 812k and 100k instances respectively, for researchers with more or fewer resources, along with the 60 instances (TINY) that humans answered. Random samples from BIOMRC LARGE were selected to create LITE and TINY. BIOMRC TINY is used only as a test set; it has no training and validation subsets.

We tested on BIOMRC LITE the two deep learning MRC models that Pappas et al. (2018) had tested on BIOREAD LITE, namely Attention Sum Reader (AS-READER) (Kadlec et al., 2016) and Attention Over Attention Reader (AOA-READER) (Cui et al., 2017). Experimental results show that AS-READER and AOA-READER perform better on BIOMRC, with the accuracy of AOA-READER reaching 70%, compared to the corresponding 52% accuracy of Pappas et al. (2018), which is a further indication that the new dataset is less noisy, or that at least its task is more feasible. We also developed a new BERT-based (Devlin et al., 2019) MRC model, the best version of which (SCIBERT-MAX-READER) performs even better, with its accuracy reaching 80%. We encourage further research on biomedical MRC by making our code and data publicly available, and by creating an on-line leaderboard for BIOMRC.[3]

2 Dataset Construction

Using PUBTATOR, we gathered approx. 25 million abstracts and their titles. We discarded articles with titles shorter than 15 characters or longer than 60 tokens, articles without abstracts, or with abstracts shorter than 100 characters or with fewer than 10 sentences. We also removed articles with abstracts containing fewer than 5 entity annotations, or fewer than 2 or more than 20 distinct biomedical entity identifiers. (PUBTATOR assigns the same identifier to all the synonyms of a biomedical entity; e.g., ‘hemorrhagic stroke’ and ‘stroke’ have the same identifier ‘MESH:D020521’.) We also discarded articles containing entities not linked to any of the ontologies used by PUBTATOR,[4] or entities linked to multiple ontologies (entities with multiple ids), or entities whose spans overlapped with those of other entities. We also removed articles with no entities in their titles, and articles with no entities shared by the title and the abstract.[5]

[1] https://www.ncbi.nlm.nih.gov/pmc/
[2] Like PUBMED, PUBTATOR is supported by NCBI. Consult: www.ncbi.nlm.nih.gov/research/pubtator/
[3] Our code, data, and information about the leaderboard will be available at http://nlp.cs.aueb.gr/publications.html.
[4] PUBTATOR uses the Open Biological and Biomedical Ontology (OBO) Foundry, which comprises over 60 ontologies.
[5] A further reason for using the title as the question is that the entities of the titles are typically mentioned in the abstract.

Proceedings of the BioNLP 2020 workshop, pages 140–149, Online, July 9, 2020. © 2020 Association for Computational Linguistics
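For concreteness, the discard criteria above can be collected into a single filter function. This is a minimal sketch: the argument names and the flat list of per-annotation entity identifiers are illustrative assumptions, not PUBTATOR's actual output format.

```python
# Sketch of the article filters described above (hypothetical record format).
# An article is kept only if every check passes.
def keep_article(title_tokens, abstract_sentences,
                 abstract_entity_ids, title_entity_ids):
    """abstract_entity_ids: one identifier per entity annotation;
    synonyms share an identifier (e.g. 'MESH:D020521')."""
    title = " ".join(title_tokens)
    abstract = " ".join(abstract_sentences)
    if len(title) < 15 or len(title_tokens) > 60:            # title length limits
        return False
    if len(abstract) < 100 or len(abstract_sentences) < 10:  # abstract limits
        return False
    if len(abstract_entity_ids) < 5:                         # at least 5 annotations
        return False
    distinct = set(abstract_entity_ids)
    if not 2 <= len(distinct) <= 20:                         # 2-20 distinct identifiers
        return False
    if not title_entity_ids:                                 # title must contain entities
        return False
    if not distinct & set(title_entity_ids):                 # title/abstract must share one
        return False
    return True
```

The remaining checks (entities not linked to a PUBTATOR ontology, multi-ontology identifiers, overlapping spans) would need the character-level annotation spans and are omitted from this sketch.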
Passage: BACKGROUND: Most brain metastases arise from @entity0 . Few studies compare the brain regions they involve, their numbers and intrinsic attributes. METHODS: Records of all @entity1 referred to Radiation Oncology for treatment of symptomatic brain metastases were obtained. Computed tomography (n = 56) or magnetic resonance imaging (n = 72) brain scans were reviewed. RESULTS: Data from 68 breast and 62 @entity2 @entity1 were compared. Brain metastases presented earlier in the course of the lung than of the @entity0 @entity1 (p = 0.001). There were more metastases in the cerebral hemispheres of the breast than of the @entity2 @entity1 (p = 0.014). More @entity0 @entity1 had cerebellar metastases (p = 0.001). The number of cerebral hemisphere metastases and presence of cerebellar metastases were positively correlated (p = 0.001). The prevalence of at least one @entity3 surrounded with > 2 cm of @entity4 was greater for the lung than for the breast @entity1 (p = 0.019). The @entity5 type, rather than the scanning method, correlated with differences between these variables. CONCLUSIONS: Brain metastases from lung occur earlier, are more @entity4 , but fewer in number than those from @entity0 . Cerebellar brain metastases are more frequent in @entity0 .
Candidates: @entity0 : [‘breast and lung cancer’] ; @entity1 : [‘patients’] ; @entity2 : [‘lung cancer’] ; @entity3 : [‘metastasis’] ; @entity4 : [‘edematous’, ‘edema’] ; @entity5 : [‘primary tumor’]
Question: Attributes of brain metastases from XXXX.
Answer: @entity0 : [‘breast and lung cancer’]
Figure 1: Example passage-question instance of BIOMRC. The passage is the abstract of an article, with biomedical entities replaced by @entityN pseudo-identifiers. The original entity names are shown in square brackets. Both ‘edematous’ and ‘edema’ are replaced by ‘@entity4’, because PUBTATOR considers them synonyms. The question is the title of the article, with a biomedical entity replaced by XXXX. @entity0 is the correct answer.
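A construction like the instance in Figure 1 can be sketched as follows. This is a toy sketch under simplifying assumptions: entity annotations arrive as (start, end, identifier) character spans, and the first title entity is the one hidden; the real pipeline additionally applies the checks of Section 2.

```python
# Sketch of building a BIOMRC-style cloze instance (toy example, Setting B
# numbering). Entity annotations are assumed to arrive as (start, end, id)
# character spans; this record format is illustrative, not PUBTATOR's output.
def make_instance(title, abstract, title_spans, abstract_spans):
    # Map each distinct entity identifier to an @entityN pseudo-identifier,
    # in order of first appearance (abstract first).
    ids = []
    for _, _, eid in abstract_spans + title_spans:
        if eid not in ids:
            ids.append(eid)
    pseudo = {eid: "@entity%d" % i for i, eid in enumerate(ids)}

    def substitute(text, spans, repl=None):
        # Replace spans right-to-left so earlier offsets stay valid.
        for start, end, eid in sorted(spans, reverse=True):
            text = text[:start] + (repl or pseudo[eid]) + text[end:]
        return text

    passage = substitute(abstract, abstract_spans)
    hidden = title_spans[0]            # hide one title entity (here: the first)
    question = substitute(title, [hidden], repl="XXXX")
    return {"passage": passage,
            "question": question,
            "answer": pseudo[hidden[2]],
            # Candidate answers are the entities of the abstract.
            "candidates": sorted({pseudo[e] for _, _, e in abstract_spans})}
```

Note how synonyms are handled for free: because PUBTATOR assigns one identifier to all synonyms of an entity, every synonym span maps to the same @entityN, as with ‘edematous’/‘edema’ in Figure 1.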
                    BIOMRC LARGE                           BIOMRC LITE                         BIOMRC TINY
                    Train    Dev     Test    Total         Train   Dev    Test   Total        Set. A  Set. B  Total
Instances           700,000  50,000  62,707  812,707       87,500  6,250  6,250  100,000      30      30      60
Avg candidates      6.73     6.68    6.68    6.72          6.72    6.68   6.65   6.71         6.60    6.57    6.58
Max candidates      20       20      20      20            20      20     20     20           13      11      13
Min candidates      2        2       2       2             2       2      2      2            2       3       2
Avg abstract len.   253.79   257.41  253.70  254.01        253.78  257.32 255.56 254.11       248.13  264.37  256.25
Max abstract len.   543      516     511     543           519     500    510    519          371     386     386
Min abstract len.   57       89      77      57            60      109    103    60           147     154     147
Avg title len.      13.93    14.28   13.99   13.96         13.89   14.22  14.09  13.92        14.17   14.70   14.43
Max title len.      51       46      43      51            49      40     42     49           21      35      35
Min title len.      3        3       3       3             3       3      3      3            6       4       4
Table 2: Statistics of BIOMRC LARGE, LITE, and TINY. The questions of the TINY version were answered by humans. All lengths are measured in tokens, using a whitespace tokenizer.
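Statistics like those of Table 2 are straightforward to recompute for any split. The sketch below assumes instances are dicts with 'passage', 'question' and 'candidates' fields, an illustrative format rather than the official release format.

```python
# Sketch of computing Table 2-style statistics for one data split.
# Lengths are counted in whitespace tokens, as in the paper.
def split_statistics(instances):
    def stats(values):
        return {"avg": round(sum(values) / len(values), 2),
                "max": max(values), "min": min(values)}
    n_cands = [len(inst["candidates"]) for inst in instances]
    abs_len = [len(inst["passage"].split()) for inst in instances]
    title_len = [len(inst["question"].split()) for inst in instances]
    return {"instances": len(instances),
            "candidates": stats(n_cands),
            "abstract_len": stats(abs_len),
            "title_len": stats(title_len)}
```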
Finally, to avoid making the dataset too easy for a system that would always select the entity with the most occurrences in the abstract, we removed a passage-question instance if the most frequent entity of its passage (abstract) was also the answer to the cloze-style question (title with placeholder); if multiple entities had the same top frequency in the passage, the instance was retained. We ended up with approx. 812k passage-question instances, which form BIOMRC LARGE, split into training, development, and test subsets (Table 2). The LITE and TINY versions of BIOMRC are subsets of LARGE.

In all versions of BIOMRC (LARGE, LITE, TINY), the entity identifiers of PUBTATOR are replaced by pseudo-identifiers of the form @entityN (Fig. 1), as in the CNN and Daily Mail datasets (Hermann et al., 2015). We provide all BIOMRC versions in two forms, corresponding to what Pappas et al. (2018) call Settings A and B in BIOREAD.[6] In Setting A, each pseudo-identifier has a global scope, meaning that each biomedical entity has a unique pseudo-identifier in the whole dataset. This allows a system to learn information about the entity represented by a pseudo-identifier from all the occurrences of the pseudo-identifier in the training set. For example, after seeing the same pseudo-identifier multiple times, a model may learn that it stands for a drug, or that a particular pseudo-identifier tends to neighbor with specific words. Then, much like a language model, a system may guess the pseudo-identifier that should fill in the placeholder even without the passage, or at least it may infer a prior probability for each candidate answer. In contrast, Setting B uses a local scope, i.e., it restarts the numbering of the pseudo-identifiers (from @entity0) anew in each passage-question instance. This forces the models to rely only on information about the entities that can be inferred from the particular passage and question. This corresponds to a non-expert answering the question, who does not have any prior knowledge of the biomedical entities.

[6] Pappas et al. (2018) actually call ‘option a’ and ‘option b’ our Setting B and A, respectively.

Table 2 provides statistics on BIOMRC. In TINY, we use 30 different passage-question instances in Settings A and B, because in both settings we asked the same humans to answer the questions, and we
did not want them to remember instances from one setting to the other. In LARGE and LITE, the instances are the same across the two settings, apart from the numbering of the entity identifiers.

3 Experiments and Results

We experimented only on BIOMRC LITE and TINY, since we did not have the computational resources to train the neural models we considered on the LARGE version of BIOMRC. Pappas et al. (2018) also reported experimental results only on a LITE version of their BIOREAD dataset. We hope that others may be able to experiment on BIOMRC LARGE, and we make our code available, as already noted.

3.1 Methods

We experimented with the four basic baselines (BASE1–4) that Pappas et al. (2018) used in BIOREAD, the two neural MRC models used by the same authors, AS-READER (Kadlec et al., 2016) and AOA-READER (Cui et al., 2017), and a BERT-based (Devlin et al., 2019) model we developed.

Basic baselines: BASE1, 2, 3 return the first, the last, and the entity that occurs most frequently in the passage (or randomly one of the entities with the same highest frequency, if multiple exist), respectively. Since in BIOMRC the correct answer is never (by construction) the most frequent entity of the passage, unless there are multiple entities with the same highest frequency, BASE3 performs poorly. Hence, we also include a variant, BASE3+, which randomly selects one of the entities of the passage with the same highest frequency, if multiple exist; otherwise it selects the entity with the second highest frequency. BASE4 extracts all the token n-grams from the passage that include an entity identifier (@entityN), and all the n-grams from the question that include the placeholder (XXXX).[7] Then, for each candidate answer (entity identifier), it counts the tokens shared between the n-grams that include the candidate and the n-grams that include the placeholder. The candidate with the most shared tokens is selected. These baselines are used to check that the questions cannot be answered by simplistic heuristics (Chen et al., 2016).

[7] We tried n = 2, ..., 6 and use n = 3, which gave the best accuracy on the development set of BIOMRC LARGE.

Neural baselines: We use the same implementations of AS-READER (Kadlec et al., 2016) and AOA-READER (Cui et al., 2017) as Pappas et al. (2018), who also provide short descriptions of these neural models, not repeated here to save space. The hyper-parameters of both methods were tuned on the development set of BIOMRC LITE.

Figure 2: Illustration of our SCIBERT-based models. Each sentence of the passage is concatenated with the question and fed to SCIBERT. The top-level embedding produced by SCIBERT for the first sub-token of each candidate answer is concatenated with the top-level embedding of [MASK] (which replaces the placeholder XXXX) of the question, and they are fed to an MLP, which produces the score of the candidate answer. In SCIBERT-SUM-READER, the scores of multiple occurrences of the same candidate are summed, whereas SCIBERT-MAX-READER takes their maximum.

[8] https://www.semanticscholar.org/
[9] BERT’s tokenizer splits the entity identifiers into sub-tokens (Devlin et al., 2019). We use the first one. The top-level token representations of BERT are context-aware, and it is common to use the first or last sub-token of each named entity.

BERT-based model: We use SCIBERT (Beltagy et al., 2019), a pre-trained BERT (Devlin et al., 2019) model for scientific text. SCIBERT is pre-trained on 1.14 million articles from Semantic Scholar,[8] of which 82% (935k) are biomedical and the rest come from computer science. For each passage-question instance, we split the passage into sentences using NLTK (Bird et al., 2009). For each sentence, we concatenate it (using BERT’s [SEP] token) with the question, after replacing the XXXX with BERT’s [MASK] token, and we feed the concatenation to SCIBERT (Fig. 2). We collect SCIBERT’s top-level vector representations of the entity identifiers (@entityN) of the sentence and of [MASK].[9] For each entity of the sentence, we concatenate its top-level representation with that of [MASK], and we feed them to a Multi-Layer Perceptron (MLP) to obtain a score for the particular entity (candidate answer). We thus obtain a score for all the entities of the passage. If an entity occurs multiple times in the passage, we take the sum or the maximum of the scores of its occurrences. In both cases, a softmax is then applied to the scores of all the entities, and the entity with the maximum score is selected as the answer. We call
this model SCIBERT-SUM-READER or SCIBERT-MAX-READER, depending on how it aggregates the scores of multiple occurrences of the same entity. SCIBERT-SUM-READER is closer to AS-READER and AOA-READER, which also sum the scores of multiple occurrences of the same entity. This summing aggregation, however, favors entities with several occurrences in the passage, even if the scores of all the occurrences are low. Our experiments indicate that SCIBERT-MAX-READER performs better. In all cases, we only update the parameters of the MLP during training, keeping the parameters of SCIBERT frozen to their pre-trained values, to speed up training. With more computing resources, it may be possible to improve the scores of SCIBERT-MAX-READER (and SCIBERT-SUM-READER) further by fine-tuning SCIBERT on BIOMRC training data.

3.2 Results on BIOMRC LITE

BIOMRC LITE – Setting A
Method               Train Acc  Dev Acc  Test Acc  Train Time    Params  Word Emb.  Entity Emb.
BASE1                37.58      36.38    37.63     0             0       0          0
BASE2                22.50      23.10    21.73     0             0       0          0
BASE3                10.03      10.02    10.53     0             0       0          0
BASE3+               44.05      43.28    44.29     0             0       0          0
BASE4                56.48      57.36    56.50     0             0       0          0
AS-READER            84.63      62.29    62.38     18 x 0.92 hr  12.87M  12.69M     1.59M
AOA-READER           82.51      70.00    69.87     29 x 2.10 hr  12.87M  12.69M     1.59M
SCIBERT-SUM-READER   71.74      71.73    71.28     11 x 4.38 hr  154k    0          0
SCIBERT-MAX-READER   81.38      80.06    79.97     19 x 4.38 hr  154k    0          0

BIOMRC LITE – Setting B
Method               Train Acc  Dev Acc  Test Acc  Train Time    Params  Word Emb.  Entity Emb.
BASE1                37.58      36.38    37.63     0             0       0          0
BASE2                22.50      23.10    21.73     0             0       0          0
BASE3                10.03      10.02    10.53     0             0       0          0
BASE3+               44.05      43.28    44.29     0             0       0          0
BASE4                56.48      57.36    56.50     0             0       0          0
AS-READER            79.64      66.19    66.19     18 x 0.65 hr  6.82M   6.66M      0.60k
AOA-READER           84.62      71.63    71.57     36 x 1.82 hr  6.82M   6.66M      0.60k
SCIBERT-SUM-READER   68.92      68.64    68.24     6 x 4.38 hr   154k    0          0
SCIBERT-MAX-READER   81.43      80.21    79.10     15 x 4.38 hr  154k    0          0

Table 3: Training, development, and test accuracy (%) on BIOMRC LITE in Settings A (global scope of entity identifiers) and B (local scope), training times (epochs x time per epoch), and numbers of trainable parameters (total, word embedding parameters, entity identifier embedding parameters). In the lower zone (neural methods), the difference from each accuracy score to the next best is statistically significant (p < 0.02). We used single-tailed Approximate Randomization (Dror et al., 2018), randomly swapping the answers to 50% of the questions for 10k iterations.

Table 3 reports the accuracy of all methods on BIOMRC LITE for Settings A and B. In both settings, all the neural models clearly outperform all the basic baselines, with BASE3 (most frequent entity of the passage) performing worst and BASE3+ performing much better, as expected. In both settings, SCIBERT-MAX-READER clearly outperforms all the other methods on both the development and test sets. The performance of SCIBERT-SUM-READER is approximately ten percentage points worse than SCIBERT-MAX-READER’s on the development and test sets of both settings, indicating that the superior results of SCIBERT-MAX-READER are to a large extent due to the different aggregation function (max instead of sum) it uses to combine the scores of multiple occurrences of a candidate answer, not to the extensive pre-training of SCIBERT. AOA-READER, which does not employ any pre-training, is competitive to SCIBERT-SUM-READER in Setting A, and performs better than SCIBERT-SUM-READER in Setting B, which again casts doubts on the value of SCIBERT’s extensive pre-training. We expect, however, that the performance of the SCIBERT-based models could be improved further by fine-tuning SCIBERT’s parameters.

The performance of SCIBERT-SUM-READER is slightly better in Setting A than in Setting B, which might suggest that the model manages to capture global properties of the entity pseudo-identifiers from the entire training set. However, the performance of SCIBERT-MAX-READER is almost the same across the two settings, which contradicts the previous hypothesis. Furthermore, the development and test performance of AS-READER and AOA-READER is higher in Setting B than A, indicating that these two models do not capture global properties of entities well, performing better when forced to consider only the information of the particular passage-question instance. Overall, we see no strong evidence that the models we considered are able to learn global properties of the entities.

In both Settings A and B, AOA-READER performs better than AS-READER, which was expected since it uses a more elaborate attention mechanism, at the expense of taking longer to train (Table 3).[10] The two SCIBERT-based models are also competitive in terms of training time, because we only train the MLP (154k parameters) on top of SCIBERT, keeping the parameters of SCIBERT frozen.

[10] We trained all models for a maximum of 40 epochs, using early stopping on the dev. set, with patience of 3 epochs.

The trainable parameters of AS-READER and AOA-READER are almost double in Setting A compared to Setting B. To some extent, this difference is due to the fact that for both models we learn a word embedding for each @entityN pseudo-identifier, and in Setting A the numbering of the identifiers is not reset for each passage-question
instance, leading to many more pseudo-identifiers (31.77k pseudo-identifiers in the vocabulary of Setting A vs. only 20 in Setting B); this accounts for a difference of 1.59M parameters.[11] The rest of the difference in total parameters (from Setting A to B) is due to the fact that we tuned the hyper-parameters of each model separately for each setting (A, B), on the corresponding development set. Hyper-parameter tuning was performed separately for each model in each setting, but led to the same numbers of trainable parameters for AS-READER and AOA-READER, because the trainable parameters are dominated by the parameters of the word embeddings. Note that the hyper-parameters of the two SCIBERT-based models (of their MLPs) were very minimally tuned, hence these models may perform even better with more extensive tuning.

AOA-READER was also better than AS-READER in the experiments of Pappas et al. (2018) on a LITE version of their BIOREAD dataset, but the development and test accuracy of AOA-READER in Setting A of BIOREAD was reported to be only 52.41% and 51.19%, respectively (cf. Table 3); in Setting B, it was 50.44% and 49.94%, respectively. The much higher scores of AOA-READER (and AS-READER) on BIOMRC LITE are an indication that the new dataset is less noisy, or that the task is at least more feasible for machines. The results of Pappas et al. (2018) were slightly higher in Setting A than in Setting B, suggesting that AOA-READER was able to benefit from the global scope of entity identifiers, unlike our findings in BIOMRC.[12]

Figure 3: More detailed statistics and results on the development subset of BIOMRC LITE. Number of passage-question instances with 2, 3, ..., 20 candidate answers (top left). Accuracy (%) of the basic baselines (top right). Accuracy (%) of the neural models in Settings A (bottom left) and B (bottom right).

Figure 3 shows how many passage-question instances of the development subset of BIOMRC LITE have 2, 3, ..., 20 candidate answers (top left), and the corresponding accuracy of the basic baselines (top right) and of the neural models (bottom). BASE3+ is the best basic baseline for 2 and 3 candidates, and for 2 candidates it is competitive to the neural models. Overall, however, BASE4 is clearly the best basic baseline, but it is outperformed by all neural models in almost all cases, as in Table 3. SCIBERT-MAX-READER is again the best system in both settings, almost always outperforming the other systems. AS-READER is the worst neural model in almost all cases. AOA-READER is competitive to SCIBERT-SUM-READER in Setting A, and slightly better overall than SCIBERT-SUM-READER in Setting B, as can be seen in Table 3.

[11] Hyper-parameter tuning led to 50- and 30-dimensional word embeddings in Settings A and B, respectively. AS-READER and AOA-READER learn word embeddings from the training set, without using pre-trained embeddings.
[12] For AS-READER, Pappas et al. (2018) report results only for Setting B: 37.90% development and 42.01% test accuracy on BIOREAD LITE. They did not consider BERT-based models.

3.3 Results on BIOMRC TINY

Pappas et al. (2018) asked humans (non-experts) to answer 30 questions from BIOREAD in Setting A, and 30 other questions in Setting B. We mirrored their experiment by providing 30 questions (from
Passage: The study enrolled 53 @entity1 (29 males, 24 females) with @entity1576 aged 15-88 years. Most of them were 59 years of age and younger. In 1/3 of the @entity1 the diseases started with symptoms of @entity1729, in 2/3 of them – with pulmonary affection. @entity55 was diagnosed in 50 @entity1 (94.3%), acute @entity3617 – in 3 @entity1. ECG changes were registered in about half of the examinees who had no cardiac complaints. 25 of them had alterations in the end part of the ventricular ECG complex; rhythm and conduction disturbances occurred rarely. Mycoplasmosis @entity1 suffering from @entity741 ( @entity741 ) had stable ECG changes while in those free of @entity741 the changes were short. @entity296 foci were absent. @entity299 comparison in @entity1 with @entity1576 and in other @entity1729 has found that cardiovascular system suffers less in acute mycoplasmosis. These data are useful in differential diagnosis of @entity296.
Candidates: @entity1 : [‘patients’] ; @entity1576 : [‘respiratory mycoplasmosis’] ; @entity1729 : [‘acute respiratory infections’, ‘acute respiratory viral infection’] ; @entity55 : [‘Pneumonia’] ; @entity3617 : [‘bronchitis’] ; @entity741 : [‘IHD’, ‘ischemic heart disease’] ; @entity296 : [‘myocardial infections’, ‘Myocardial necrosis’] ; @entity299 : [‘Cardiac damage’]
Question: Cardio-vascular system condition in XXXX.
Expert Human Answers: annotator1: @entity1576; annotator2: @entity1576.
Non-expert Human Answers: annotator1: @entity296; annotator2: @entity296; annotator3: @entity1576.
Systems’ Answers: AS-READER: @entity1729; AOA-READER: @entity296; SCIBERT-SUM-READER: @entity1576.
Figure 4: Example from BIOMRC TINY. In Setting A, humans see both the pseudo-identifiers (@entityN) and the original names of the biomedical entities (shown in square brackets). Systems see only the pseudo-identifiers, but the pseudo-identifiers have global scope over all instances, which allows the systems, at least in principle, to learn entity properties from the entire training set. In Setting B, humans no longer see the original names of the entities, and systems see only the pseudo-identifiers with local scope (numbering reset per passage-question instance).
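The numbering reset of Setting B (local scope), described in the caption above, amounts to a small renumbering pass over each instance. A toy sketch, assuming entities are already marked up as @entityN strings:

```python
import re

# Sketch of Setting B's local scope: renumber the global @entityN
# pseudo-identifiers of one instance from @entity0 upward, in order of
# first appearance. The (passage, question) pair is an illustrative format.
def to_setting_b(passage, question):
    mapping = {}

    def rename(match):
        old = match.group(0)
        if old not in mapping:
            mapping[old] = "@entity%d" % len(mapping)
        return mapping[old]

    # Rename in the passage first, then apply the same mapping to the question.
    new_passage = re.sub(r"@entity\d+", rename, passage)
    new_question = re.sub(r"@entity\d+", rename, question)
    return new_passage, new_question
```

Because `re.sub` scans the original string in a single pass, replacements are never rescanned, so a freshly assigned local identifier cannot collide with a not-yet-renamed global one.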
BIOMRC LITE) to three non-experts (graduate CS students) in Setting A, and 30 other questions in Setting B. We also showed the same questions of each setting to two biomedical experts. As in the experiment of Pappas et al. (2018), in Setting A both the experts and the non-experts were also provided with the original names of the biomedical entities (the entity names before replacing them with @entityN pseudo-identifiers), to allow them to use prior knowledge; see the top three zones of Fig. 4 for an example. By contrast, in Setting B the original names of the entities were hidden.

Table 4 reports the human and system accuracy scores on BIOMRC TINY. Both experts and non-experts perform better in Setting A, where they can use prior knowledge about the biomedical entities. The gap between experts and non-experts is three points larger in Setting B than in Setting A, presumably because experts can better deduce properties of the entities from the local context. Turning to the system scores, SCIBERT-MAX-READER is again the best system, but again much of its performance is due to the max-aggregation of the scores of multiple occurrences of entities. With sum-aggregation, SCIBERT-SUM-READER obtains exactly the same scores as AOA-READER, which again performs better than AS-READER. (AOA-READER and SCIBERT-SUM-READER make different mistakes, but their scores just happen to be identical because of the small size of TINY.) Unlike our results on BIOMRC LITE, we now see all systems performing better in Setting A compared to Setting B, which suggests they do benefit from the global scope of entity identifiers. Also, SCIBERT-MAX-READER performs better than both experts and non-experts in Setting A, and better than non-experts in Setting B. However, BIOMRC TINY contains only 30 instances in each setting, and hence the results of Table 4 are less reliable than those from BIOMRC LITE (Table 3).

In the corresponding experiments of Pappas et al. (2018), which were conducted in Setting B only, the average accuracy of the (non-expert) humans was 68.01%, but the humans were also allowed not to answer (when clueless), and unanswered questions were excluded when computing accuracy. On average, they did not answer 21.11% of the questions, hence their accuracy drops to 46.90% if unanswered questions are counted as errors. In our experiment, the humans were also allowed not to answer (when clueless), but we counted unanswered questions as errors, which we believe better reflects human performance. Non-experts answered all questions in Setting A, and did not answer 13.33% (4/30) of the questions on average in Setting B. The decrease in the questions non-experts did not answer (from 21.11% to 13.33%) in Setting B (the only one considered in BIOREAD) again suggests that the new dataset is less noisy, or at least that the task is more feasible for humans, even when the names of the entities are hidden. Experts did not answer 2.5% (0.75/30) and 1.67% (0.5/30) of the questions on average in Settings A and B, respectively.

Inter-annotator agreement was also higher for experts than non-experts in our experiment, in both
Settings A and B (Table 5). In Setting B, the agreement of non-experts was particularly low (47.22%), possibly because without entity names they had to rely more on the text of the passage and question, which they had trouble understanding. By contrast, the agreement of experts was slightly higher in Setting B than in Setting A, possibly because without prior knowledge about the entities, which may differ across experts, they had to rely to a larger extent on the particular text of the passage and question.

Table 4: Accuracy (%) on BIOMRCTINY. Best human and system scores shown in bold.

Method             | Setting A | Setting B
Experts (Avg)      | 85.00     | 61.67
Non-Experts (Avg)  | 81.67     | 55.56
AS-READER          | 66.67     | 46.67
AOA-READER         | 70.00     | 56.67
SCIBERT-SUM-READER | 70.00     | 56.67
SCIBERT-MAX-READER | 90.00     | 60.00

Table 5: Human agreement (Cohen's Kappa, %) on BIOMRCTINY. Avg. pairwise scores for non-experts.

Annotators (Setting) | Kappa
Experts (A)          | 70.23
Non-Experts (A)      | 65.61
Experts (B)          | 72.30
Non-Experts (B)      | 47.22

4 Related work

Several biomedical MRC datasets exist, but they have orders of magnitude fewer questions than BIOMRC (Ben Abacha and Demner-Fushman, 2019) or are not suitable for a cloze-style MRC task (Pampari et al., 2018; Ben Abacha et al., 2019; Zhang et al., 2018). The closest dataset to ours is CLICR (Šuster and Daelemans, 2018), a biomedical MRC dataset with cloze-type questions created using full-text articles from BMJ case reports.13 CLICR contains 100k passage-question instances, the same number as BIOMRCLITE, but far fewer than the 812.7k instances of BIOMRCLARGE. Šuster et al. used CLAMP (Soysal et al., 2017) to detect biomedical entities and link them to concepts of the UMLS Metathesaurus (Lindberg et al., 1993). Cloze-style questions were created from the 'learning points' (summaries of important information) of the reports, by replacing biomedical entities with placeholders. Šuster et al. experimented with the Stanford Reader (Chen et al., 2017) and the Gated-Attention Reader (Dhingra et al., 2017), which perform worse than AOA-READER (Cui et al., 2017).

13 https://casereports.bmj.com/

The QA dataset of BIOASQ (Tsatsaronis et al., 2015) contains questions written by biomedical experts. The gold answers comprise multiple relevant documents per question, relevant snippets from the documents, exact answers in the form of entities, as well as reference summaries, written by the experts. Creating data of this kind, however, requires significant expertise and time. In the eight years of BIOASQ, only 3,243 questions and gold answers have been created. It would be particularly interesting to explore whether larger automatically generated datasets like BIOMRC and CLICR could be used to pre-train models, which could then be fine-tuned for human-generated QA or MRC datasets.

Outside the biomedical domain, several cloze-style open-domain MRC datasets have been created automatically (Hill et al., 2016; Hermann et al., 2015; Dunn et al., 2017; Bajgar et al., 2016), but they have been criticized for containing questions that can be answered by simple heuristics like our basic baselines (Chen et al., 2016). There are also several large open-domain MRC datasets annotated by humans (Kwiatkowski et al., 2019; Rajpurkar et al., 2016, 2018; Trischler et al., 2017; Nguyen et al., 2016; Lai et al., 2017). To our knowledge, the biggest human-annotated corpus is Google's Natural Questions dataset (Kwiatkowski et al., 2019), with approximately 300k human-annotated examples. Datasets of this kind require extensive annotation effort, which for open-domain datasets is usually crowd-sourced. Crowd-sourcing, however, is much more difficult for biomedical datasets, because of the required expertise of the annotators.

5 Conclusions and Future Work

We introduced BIOMRC, a large-scale cloze-style biomedical MRC dataset. Care was taken to reduce noise, compared to the previous BIOREAD dataset of Pappas et al. (2018). Experiments showed that BIOMRC's questions cannot be answered well by simple heuristics, and that two neural MRC models that had been tested on BIOREAD perform much better on BIOMRC, indicating that the new dataset is indeed less noisy or at least that its task is more feasible. Human performance was also higher on a sample of BIOMRC compared to BIOREAD, and biomedical experts performed even better. We also developed a new BERT-based model, the best version of which outperformed all other methods tested, reaching or surpassing the accuracy of biomedical experts in some experiments. We make BIOMRC available in three different sizes, also releasing our code and providing a leaderboard.

We plan to tune the BERT-based model more extensively to further improve its efficiency, and to investigate whether some of its techniques (mostly its max-aggregation, but also using sub-tokens) can also benefit the other neural models we considered. We also plan to experiment with other MRC models that recently performed particularly well on open-domain MRC datasets (Zhang et al., 2020). Finally, we aim to explore whether pre-training neural models on BIOMRC is beneficial for human-generated biomedical datasets (Tsatsaronis et al., 2015).

Acknowledgments

We are most grateful to I. Almirantis, S. Kotitsas, V. Kougia, A. Nentidis, and S. Xenouleas, who participated in the human evaluation with BIOMRCTINY.

References

Alan R. Aronson and François-Michel Lang. 2010. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3):229–236.

Ondrej Bajgar, Rudolf Kadlec, and Jan Kleindienst. 2016. Embracing data abundance: BookTest Dataset for Reading Comprehension. CoRR.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: Pretrained Language Model for Scientific Text. In EMNLP.

Asma Ben Abacha and Dina Demner-Fushman. 2019. A question-entailment approach to question answering. BMC Bioinformatics, 20:511.

Asma Ben Abacha, Chaitanya Shivade, and Dina Demner-Fushman. 2019. Overview of the MEDIQA 2019 Shared Task on Textual Inference, Question Entailment and Question Answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 370–379, Florence, Italy.

Steven Bird, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python. O'Reilly Media Inc.

Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2358–2367, Berlin, Germany.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada.

Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2017. Attention-over-Attention Neural Networks for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 593–602, Vancouver, Canada.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.

Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2017. Gated-Attention Readers for Text Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1832–1846, Vancouver, Canada.

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The Hitchhiker's Guide to Testing Statistical Significance in Natural Language Processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383–1392, Melbourne, Australia.

Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Güney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. CoRR.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching Machines to Read and Comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, pages 1693–1701, Cambridge, MA, USA.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations. CoRR.

Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. 2016. Text Understanding with the Attention Sum Reader Network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 908–918, Berlin, Germany.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics, 7:453–466.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding Comprehension Dataset From Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark.

Robert Leaman, Rezarta Islamaj Doğan, and Zhiyong Lu. 2013. DNorm: disease name normalization with pairwise learning to rank. Bioinformatics, 29(22):2909–2917.

Donald A. B. Lindberg, Betsy L. Humphreys, and Alexa T. McCray. 1993. The Unified Medical Language System. Yearbook of Medical Informatics, 1:41–51.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. ArXiv.

Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrQA: A Large Corpus for Question Answering on Electronic Medical Records. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2357–2368, Brussels, Belgium.

Dimitris Pappas, Ion Androutsopoulos, and Haris Papageorgiou. 2018. BioRead: A New Dataset for Biomedical Reading Comprehension. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas.

Ergin Soysal, Jingqi Wang, Min Jiang, Yonghui Wu, Serguei Pakhomov, Hongfang Liu, and Hua Xu. 2017. CLAMP – a toolkit for efficiently building customized clinical natural language processing pipelines. Journal of the American Medical Informatics Association, 25(3):331–336.

Simon Šuster and Walter Daelemans. 2018. CliCR: a Dataset of Clinical Case Reports for Machine Reading Comprehension. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1551–1563, New Orleans, Louisiana.

Wilson L. Taylor. 1953. "Cloze Procedure": A New Tool for Measuring Readability. Journalism Quarterly, 30(4):415–433.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A Machine Comprehension Dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200, Vancouver, Canada.

G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weissenborn, A. Krithara, S. Petridis, D. Polychronopoulos, Y. Almirantis, J. Pavlopoulos, N. Baskiotis, P. Gallinari, T. Artieres, A. Ngonga, N. Heino, E. Gaussier, L. Barrio-Alvers, M. Schroeder, I. Androutsopoulos, and G. Paliouras. 2015. An Overview of the BioASQ Large-Scale Biomedical Semantic Indexing and Question Answering Competition. BMC Bioinformatics, 16(138).

Chih-Hsuan Wei, Bethany R. Harris, Donghui Li, Tanya Z. Berardini, Eva Huala, Hung-Yu Kao, and Zhiyong Lu. 2012. Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts. Database, 2012.

Xiao Zhang, Ji Wu, Zhiyang He, Xien Liu, and Ying Su. 2018. Medical Exam Question Answering with Large-scale Reading Comprehension. ArXiv.

Zhuosheng Zhang, Junjie Yang, and Hai Zhao. 2020. Retrospective Reader for Machine Reading Comprehension. ArXiv.
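As an aside on the evaluation discussed above, the contrast between the max- and sum-aggregation of the SCIBERT readers can be sketched over per-occurrence candidate scores. The entity names and score values below are invented for illustration; a real reader would produce these scores from its attention or output logits.

```python
def aggregate_entity_scores(occurrence_scores, mode="max"):
    """Combine the scores of multiple occurrences of each candidate entity.

    occurrence_scores: list of (entity, score) pairs, one pair per
    occurrence of a candidate in the passage. Returns the top entity.
    """
    combined = {}
    for entity, score in occurrence_scores:
        if mode == "max":
            combined[entity] = max(combined.get(entity, float("-inf")), score)
        else:  # "sum"
            combined[entity] = combined.get(entity, 0.0) + score
    return max(combined, key=combined.get)

# Hypothetical scores: @entity3 occurs three times with modest scores,
# @entity7 once with a high score, so the two modes disagree.
scores = [("@entity3", 0.3), ("@entity7", 0.8), ("@entity3", 0.3), ("@entity3", 0.3)]
```

With these scores, max-aggregation picks @entity7 (0.8 beats 0.3), while sum-aggregation picks @entity3 (0.9 beats 0.8), mirroring how the two readers can rank candidates differently.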
Neural Transduction of Letter Position Dyslexia using an Anagram Matrix Representation
Avi Bleiweiss
BShalem Research
Sunnyvale, CA, USA
[email protected]
Abstract

Research on analyzing reading patterns of dyslectic children has mainly been driven by classifying dyslexia types offline. We contend that a framework to remedy reading errors inline is more far-reaching and will help to further advance our understanding of this impairment. In this paper, we propose a simple and intuitive neural model to reinstate migrating words that transpire in letter position dyslexia, a visual analysis deficit to the encoding of character order within a word. Introduced by the anagram matrix representation of an input verse, the novelty of our work lies in the expansion from a one- to a two-dimensional context window for training. This warrants that words which differ only in the disposition of letters remain interpreted as semantically similar in the embedding space. Subject to the apparent constraints of the self-attention transformer architecture, our model achieved a unigram BLEU score of 40.6 on our reconstructed dataset of the Shakespeare sonnets.

1 Introduction

Dyslexia is a reading disorder that is perhaps the most studied of learning disabilities, with an estimated prevalence rate of 5 to 17 percent of school-age children in the US (Shaywitz and Shaywitz, 2005; Made by Dyslexia, 2019). Counter to popular belief, dyslexia is not only tied to the visual analysis system of the brain, but also presents a linguistic problem, hence its relevance to natural language processing (NLP). Dyslexia manifests itself in several forms; this work centers on Letter Position Dyslexia (LPD), a selective deficit to encoding the position of a letter within a word while sustaining both letter identification and character binding to words (Friedmann and Gvion, 2001).

A growing body of research advocates heterogeneity of dyslexia causes to poor non-word and irregular-word reading (McArthur et al., 2013). Along the same lines, Kezilas et al. (2014) suggest that character transposition effects in LPD are most likely caused by a deficit specific to coding the letter position and are evidenced by an interaction between the orthographic and visual analysis stages of reading. To this end, more recently Marcet et al. (2019) managed to significantly reduce migration errors by either altering letter contrast or presenting letters to young adult readers sequentially.

To dyslectic children, not all letter positions are equally impaired, as medial letters in a word are by far more vulnerable to reading errors compared to the first and last characters of the word (Friedmann and Gvion, 2001). Children with LPD have high migration errors, where the transposition of letters in the middle of the word leads to another word, for example, slime–smile or cloud–could. On the other hand, not all reading errors in cases of selective LPD are migratable, as evidenced by words read without a lexical sense, e.g., slime–silme. Intriguingly, increasing the word length does not elevate the error rate; moreover, shorter words that have lexical anagrams are prone to a larger proportion of migration errors compared to longer words that possess no anagram words. A key observation for LPD is that although words read may share all letters in most of the positions, they still remain semantically unrelated.

Machine learning tools to classify dyslexia use a large corpus of reading errors for training and mainly aim to automate and substitute diagnostic procedures expensively managed by human experts. Lakretz et al. (2015) used both LDA and Naive Bayes models and showed an area under curve (AUC) performance of about 0.8 that exceeded the quality of clinician-rendered labels. In their study, Rello and Ballesteros (2015) proposed a statistical model that predicts dyslectic readers using eye tracking measures. Employing an SVM-based binary classifier, they achieved about 80% accuracy.
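The slime–smile and cloud–could migrations described above can be simulated with a small helper that anchors the outer letters of a word and shuffles the inner ones. This is an illustrative sketch, not the paper's generation code; the seed and whitespace tokenization are arbitrary choices.

```python
import random

def migrate(word, rng):
    """Simulate an LPD-style letter-position migration: keep the first
    and last letters fixed and randomly permute the inner letters.
    Words of three letters or fewer have at most one inner letter and
    are left intact."""
    if len(word) <= 3:
        return word
    inner = list(word[1:-1])
    rng.shuffle(inner)
    return word[0] + "".join(inner) + word[-1]

rng = random.Random(7)  # fixed seed for reproducibility
migrated = [migrate(w, rng) for w in "when forty winters shall besiege thy brow".split()]
```

Every migrated word keeps its first and last letters and its full multiset of characters, so "thy" survives unchanged while "winters" may shuffle into a non-word.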
Proceedings of the BioNLP 2020 workshop, pages 150–155, Online, July 9, 2020. © 2020 Association for Computational Linguistics
Instead, our approach applies deep learning to the task of restoring LPD inline, which we further formulate as a sequence transduction problem. Thus, given an input verse that contains shuffled-letter words identified as transpositional errors, the objective of our neural model is to predict the originating unshuffled words. We use language similarity between predicted verses and ground-truth target text sequences to quantitatively evaluate our model. Our main contribution is a concise representation of the input verse that scales up to moderate an exhaustive set of LPD permutable data.

2 Anagram Matrix

Using colon notation, we denote an input verse to our model as a text sequence w_{1:n} = (w_1, ..., w_n) of n words, interchangeably with n collections of letters l_{1:n} = (l^{(1)}_{1:|w_1|}, ..., l^{(n)}_{1:|w_n|}). We generate migrated word patterns synthetically by anchoring the first and last character of each word and randomly permuting the positions of the inner letters (l^{(1)}_{2:|w_1|-1}, ..., l^{(n)}_{2:|w_n|-1}). Thus, given a word with character length |l^{(i)}|, the number of possible unique transpositions for each word follows t_{1:n} = ((|l^{(1)}| - 2)!, ..., (|l^{(n)}| - 2)!). Next, we extract a migration amplification factor k = max_{i=1..n} t_i that we apply to each word in an input verse independently, and form the sequence m_{1:k} = (m_1, ..., m_k). Word length commonly used in experiments of previous LPD studies averages five letters and ranges from four to seven letters, hence migrating to feasible 2, 6, 24, and 120 letter substitutions, respectively. We note that words with 1, 2, or 3 letters are held intact and are not migratable.

To address the inherent semantic unrelatedness between transpositioned words, we define a two-dimensional migration-verse array in the form of an anagram matrix A = [m^{(1)}_{1:k}; ...; m^{(n)}_{1:k}] ∈ R^{k×n}, where the m^{(i)} are column vectors, [·;·] is column-bound matrix concatenation, and k and n are the transposition and input-verse dimensions, respectively. In Table 1, we render a subset of an anagram matrix drawn from a target verse with a maximal word length of seven letters. The anagram matrix founds an effective context structure for a two-pass embedding training, and our training dataset is thus reconstructed on the basis of a collection of anagram matrices with varying dimensions.

Table 1: A snapshot of letter-position migration patterns in the form of an anagram matrix. The unedited version of the text sequence is highlighted on top.

when forty winters shall besiege thy brow
wehn fotry wenitrs sahll bseeige thy borw
when froty winrtes slhal begseie thy borw
wehn fotry wrenits slahl begisee thy borw
when forty wtenirs shall begeise thy brow
when froty wtneirs shall bigeese thy brow
when ftory weinrts sahll bgiesee thy borw
wehn frtoy wirtens slhal bisgeee thy borw
wehn froty wterins slahl beeisge thy brow
when froty wtiners shlal beesgie thy borw
wehn frtoy wnetris shall beisege thy borw

3 LPD Embeddings

Models for learning word vectors train locally on a one-dimensional context window by scanning the entire corpus (Mikolov et al., 2013). Through evaluation on a word analogy task, these models capture linguistic regularities as linear relationships between word embeddings. Mikolov et al. (2013) proposed the skip-gram and continuous-bag-of-words (CBOW) neural architectures, with the objective to predict the context of the target word and the target word given its context, respectively. Notably, LPD migrating words tend mostly to fall outside the English vocabulary, and thus word embeddings pretrained on large corpora are of limited use in our system.¹

¹ https://nlp.stanford.edu/projects/glove/

Figure 1: A two-dimensional context window of size two drawn from outside context cells of an anagram matrix. The center words are shown in gray for both the normal by-row {w_{t,t-2}, ..., w_{t,t+2}} and transposed column-wise {w_{t-2,t}, ..., w_{t+2,t}} forms of feeding our neural network.

While the essence of our task is formalized as verse simplification, mending LPD relies on robust discovery of word similarities along both the migration and verse axes of the anagram matrix. To this extent, we reshape the context window used to train word embeddings from a one- to a two-dimensional array. In Figure 1, we show a bi-dimensional con-
text window of size two that is a visible subset drawn from outside context cells of an anagram matrix. Learning word vectors for LPD is a two-pass process in our model. First, the context window W feeds our neural network row-by-row for each transpositioned verse; this is then followed by iterating the migration vectors m^{(i)} in W^T as inputs.

4 Model

Our task is inspired by recent advances in neural machine translation (NMT). NMT architectures have shown state-of-the-art results both in the form of a powerful sequence model (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015) and, more recently, the cross-attention ConvS2S (Elbayad et al., 2018) and the self-attention based transformer (Vaswani et al., 2017) networks. Given an unintelligible diction of shuffled-letter words, our model aims to output a verse that preserves the semantics of the input, and uses the transformer, which outperforms both recurrent and convolutional configurations on many language translation tasks.

[Figure 2: model layers — word embeddings, shared self-attention, feed-forward network, softmax]

...tial to the quality of repairing reading errors due to extensive out-of-vocabulary non-migrating words.

5 Setup

To quantitatively evaluate our LPD transduction approach, we chose to mainly report n-gram BLEU precision (Papineni et al., 2002), which defines the language similarity between a predicted text sequence and the ground-truth reference verse. In the BLEU metric, higher scores indicate better performance.

5.1 Corpus

Rather than clinical reading tests, we used the Sonnets by William Shakespeare (Shakespeare, 1997). This is motivated by the apostrophe-rich data that forces left-out letters. The raw dataset comprises 2,154 verses that range from four to fifteen word sequences. In Figure 3, we show the distribution of word length across the dataset; 18,858 unique tokens are up to seven letters long inclusive and take about 62 percent of the entire corpus words. To conform to preceding LPD research, we conducted a cleanup step that removes all words of eight letters or more from the dataset. We hypothesize that evaluating LPD on a single word basis lets us perform this step without loss of generality.
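The unigram BLEU precision used for evaluation can be sketched in a simplified, single-reference form: clipped unigram precision multiplied by the brevity penalty. This is an illustrative approximation of the metric, not the exact multi-reference implementation of Papineni et al. (2002).

```python
import math
from collections import Counter

def unigram_bleu(predicted, reference):
    """Simplified unigram BLEU against a single reference: clipped
    unigram precision times the brevity penalty."""
    pred, ref = predicted.split(), reference.split()
    clipped = sum((Counter(pred) & Counter(ref)).values())  # clipped matches
    precision = clipped / len(pred)
    brevity = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return brevity * precision

# One migrated word ("froty") costs one unigram match out of seven,
# and equal lengths make the brevity penalty 1.0.
score = unigram_bleu("when forty winters shall besiege thy brow",
                     "when froty winters shall besiege thy brow")  # 6/7
```

The `Counter` intersection clips each predicted unigram's count to its count in the reference, which prevents a prediction from being rewarded for repeating a matching word.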