BioNLP 2020

The 19th SIGBioMed Workshop on Biomedical Language Processing

Proceedings of the Workshop

July 9, 2020

©2020 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]

ISBN 978-1-952148-09-5

BioNLP 2020: Research unscathed by COVID-19

Kevin Bretonnel Cohen, Dina Demner-Fushman, Sophia Ananiadou, Junichi Tsujii

The past year has been more than exciting for natural language processing in general, and for biomedical natural language processing in particular. A gradual accretion of studies of reproducibility and replicability in natural language processing, biomedical and otherwise, had been making it clear that the reproducibility crisis that has hit most of the rest of science is not going to spare natural language processing or its related fields. Then, in March of 2020, much of the world ground to a sudden halt.

The outbreak of the COVID-19 disease caused by the novel coronavirus SARS-CoV-2 made computational work more obviously relevant than it had perhaps ever been before. Suddenly, newscasters were arguing about viral clades, the daily news was full of stories about modelling, and your neighbor had heard of PCR. But some of us did not really see a role for natural language processing in this brave new world of instant computational reactions to an international pandemic.

That was wrong.

In mid-to-late March of 2020, a joint project between the Allen Institute for Artificial Intelligence (AI2), the National Library of Medicine (NLM), and the White House Office of Science and Technology Policy (OSTP) released CORD-19, a corpus of work on the SARS-CoV-2 virus, on the COVID-19 disease, and on related coronavirus research. It was immediately notable for its inclusion of "gray literature" from preprint servers, which has mostly been neglected in text mining research, as well as for its flexibility with regard to licensing of content types. Perhaps most importantly, it was released in conjunction with a number of task types, including one related to ethics: although the value of medical ethics has been widely obvious since the Nazi "medical" experimentation horrors of the Second World War, the worldwide pandemic has made the value of medical ethicists more apparent to the general public than at any time since. Those task type definitions enabled the broader natural language processing community to jump into the fray quite quickly, and initial results have been quick to arrive.

Meanwhile, the pandemic did nothing to slow research in biomedical natural language processing on any other topic. That can be seen in the fact that this year the Association for Computational Linguistics SIGBIOMED workshop on biomedical natural language processing received 73 submissions. One unfortunate effect of the pandemic was the cancellation of the physical workshop, which would have allowed acceptance of all high-quality submissions as posters, if not as podium presentations. Indeed, the poster sessions at BioNLP have been continuously growing in size, due to the large number of high-quality submissions that the workshop receives annually. Unfortunately, because this year the Association for Computational Linguistics annual meeting will take place online only, there will be no poster session for the workshop. Consequently, only a handful of submissions could be accepted for presentation.

The transition of traditional conferences to online presentations at the beginning of the COVID-19 pandemic showed that the traditional presentation formats are not as engaging remotely as they are in in-person sessions. We are therefore exploring a new form of presentation, hoping it will be more engaging, interactive, and informative: 22 papers (about 30% of the submissions) will be presented in panel-like sessions. Papers will be grouped by similarity of topic, meaning that participants with related interests will be able to interact regarding their papers with a hopefully optimal number of people online at the same time. As we write this introduction, the conference plans and platform are still evolving, as are the daily lives of much of the planet, so we hope that you will join us in planning for the worst, while hoping for the best.

Panel Discussions

Papers referenced in this section are included in this volume, unless otherwise indicated.

Session 1: High accuracy information retrieval, spin and bias
Invited talk and discussion lead: Kirk Roberts

Presentations: The exploration of information retrieval approaches enhanced with linguistic knowledge continues in work that allows life-science researchers to search PubMed and the CORD-19 collection using patterns over dependency graphs (Taub-Tabib et al.). Representing biomedical relationships in the literature by encoding dependency structure with word embeddings promises to improve retrieval of relationships and literature-based discovery (Paullada et al.). Word embeddings trained on biomedical research articles, and tests based on their associations and coherence, among others, allow detecting and quantifying gender bias over time (Rios et al.). A BioBERT model fine-tuned for relation extraction might assist in detecting spin in reporting the results of randomized clinical trials (Koroleva et al.). Finally, a novel sequence-to-set approach to generating terms for pseudo-relevance feedback is evaluated (Das et al.).

Session 2: Clinical Language Processing
Invited talk and discussion lead: Tim Miller

Presentations: Not surprisingly, much of the potentially reproducible work in the clinical domain is based on the Medical Information Mart for Intensive Care (MIMIC) data (Johnson et al., 2016). Kovaleva et al. used the MIMIC-CXR data to explore Visual Dialog for radiology and prepared the first publicly available silver- and gold-standard datasets for this task. Searle et al. present a MIMIC-based silver standard for automatic clinical coding and warn that frequently assigned codes in MIMIC-III might be undercoded. Mascio et al. used MIMIC and the Shared Annotated Resources (ShARe)/CLEF dataset in four classification tasks: disease status, temporality, negation, and uncertainty. Temporality is explored in depth by Lin et al., and Wang et al. explore approaches to a clinical Semantic Textual Similarity (STS) task. Xu et al. apply reinforcement learning to deal with noise in clinical text for readmission prediction after kidney transplant.

Session 3: Language Understanding
Invited talk and discussion lead: Anna Rumshisky

Presentations: Bringing clinical problems and poetry together, this creative work seeks to better understand dyslexia through a self-attention transformer and Shakespearean sonnets (Bleiweiss). Detection of early stages of Alzheimer's disease using unsupervised clustering is explored with 10 years of President Ronald Reagan's speeches (Wang et al.). Stavropoulos et al. introduce BIOMRC, a cloze-style dataset for biomedical machine reading comprehension, along with new publicly available models, and provide a leaderboard for the task. Another type of question answering, answering questions that can be answered by electronic medical records, is explored by Rawat et al. Hur et al. study veterinary records to identify reasons for administration of antibiotics. DeYoung et al. expand the Evidence Inference dataset and evaluate BERT-based models for the evidence inference task.

Session 4: Named Entity Recognition and Knowledge Representation
Invited talk and discussion lead: Hoifung Poon

Invited talk: Machine Reading for Precision Medicine

The advent of big data promises to revolutionize medicine by making it more personalized and effective, but big data also presents a grand challenge of information overload. For example, tumor sequencing has become routine in cancer treatment, yet interpreting the genomic data requires painstakingly curating knowledge from a vast biomedical literature, which grows by thousands of papers every day. Electronic medical records contain valuable information to speed up clinical trial recruitment and drug development, but curating such real-world evidence from clinical notes can take hours for a single patient. Natural language processing (NLP) can play a key role in interpreting big data for precision medicine. In particular, machine reading can help unlock knowledge from text by substantially improving curation efficiency. However, standard supervised methods require labeled examples, which are expensive and time-consuming to produce at scale. In this talk, Dr. Poon presents Project Hanover, in which the team overcomes the annotation bottleneck by combining deep learning with probabilistic logic, and by exploiting self-supervision from readily available resources such as ontologies and databases. This enables the researchers to extract knowledge from millions of publications, reason efficiently with the resulting knowledge graph by learning neural embeddings of biomedical entities and relations, and apply the extracted knowledge and learned embeddings to support precision oncology.

Hoifung Poon is the Senior Director of Precision Health NLP at Microsoft Research and holds an affiliate appointment at the University of Washington Medical School. He leads Project Hanover, with the overarching goal of structuring medical data for precision medicine. He has given tutorials on this topic at top conferences such as the Association for Computational Linguistics (ACL) and the Association for the Advancement of Artificial Intelligence (AAAI). His research spans a wide range of problems in machine learning and natural language processing (NLP), and his prior work has been recognized with Best Paper Awards from premier venues such as the North American Chapter of the Association for Computational Linguistics (NAACL), Empirical Methods in Natural Language Processing (EMNLP), and Uncertainty in AI (UAI). He received his PhD in Computer Science and Engineering from the University of Washington, specializing in machine learning and NLP.

Presentations: Nejadgholi et al. analyze errors in NER and introduce an F-score that models a forgiving user experience. Peng et al. study NER, relation extraction, and other tasks with a multi-task learning approach. Amin et al. explore multi-instance learning for relation extraction. ShafieiBavani et al. also explore relation and event extraction, but in the context of simultaneously predicting relationships between all mention pairs in a text. Chang et al. provide a benchmark for knowledge graph embedding models on the SNOMED-CT knowledge graph and emphasize the importance of knowledge graphs for learning biomedical knowledge representations.

Acknowledging the community

As always, we are profoundly grateful to the authors who chose BioNLP for presenting their innovative research. The authors' willingness to continue sharing their work through BioNLP consistently makes the workshop noteworthy and stimulating. We are equally indebted to the program committee members (listed elsewhere in this volume), who produced thorough reviews with an admirable level of insight despite a review timeline even shorter than usual and a higher workload, while at the same time handling the unprecedented changes in their work and lives caused by the COVID-19 pandemic.

References

Johnson, A., Pollard, T., Shen, L. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016). https://doi.org/10.1038/sdata.2016.35

Wang, L.L., Lo, K., Chandrasekhar, Y. et al. CORD-19: The COVID-19 Open Research Dataset. arXiv preprint arXiv:2004.10706 (2020).

Organizers:
Dina Demner-Fushman, US National Library of Medicine
Kevin Bretonnel Cohen, University of Colorado School of Medicine, USA
Sophia Ananiadou, National Centre for Text Mining and University of Manchester, UK
Junichi Tsujii, National Institute of Advanced Industrial Science and Technology, Japan

Program Committee:
Sophia Ananiadou, National Centre for Text Mining and University of Manchester, UK
Emilia Apostolova, Language.ai, USA
Eiji Aramaki, University of Tokyo, Japan
Asma Ben Abacha, US National Library of Medicine
Siamak Barzegar, Barcelona Supercomputing Center, Spain
Olivier Bodenreider, US National Library of Medicine
Leonardo Campillos Llanos, Universidad Autonoma de Madrid, Spain
Qingyu Chen, US National Library of Medicine
Fenia Christopoulou, National Centre for Text Mining and University of Manchester, UK
Aaron Cohen, Oregon Health & Science University, USA
Kevin Bretonnel Cohen, University of Colorado School of Medicine, USA
Brian Connolly, Kroger Digital, USA
Viviana Cotik, University of Buenos Aires, Argentina
Manirupa Das, Amazon Search, Seattle, WA, USA
Dina Demner-Fushman, US National Library of Medicine
Bart Desmet, Clinical Center, National Institutes of Health, USA
Travis Goodwin, US National Library of Medicine
Natalia Grabar, CNRS, France
Cyril Grouin, LIMSI - CNRS, France
Tudor Groza, The Garvan Institute of Medical Research, Australia
Antonio Jimeno Yepes, IBM, Melbourne Area, Australia
Halil Kilicoglu, University of Illinois at Urbana-Champaign, USA
Ari Klein, University of Pennsylvania, USA
Andre Lamurias, University of Lisbon, Portugal
Majid Latifi, Trinity College Dublin, Ireland
Alberto Lavelli, FBK-ICT, Italy
Robert Leaman, US National Library of Medicine
Ulf Leser, Humboldt-Universität zu Berlin, Germany
Maolin Li, National Centre for Text Mining and University of Manchester, UK
Zhiyong Lu, US National Library of Medicine
Timothy Miller, Children’s Hospital Boston, USA
Claire Nedellec, INRA, France
Aurelie Neveol, LIMSI - CNRS, France
Mariana Neves, German Federal Institute for Risk Assessment, Germany
Denis Newman-Griffis, Clinical Center, National Institutes of Health, USA
Nhung Nguyen, The University of Manchester, UK
Karen O’Connor, University of Pennsylvania, USA
Naoaki Okazaki, Tokyo Institute of Technology, Japan
Yifan Peng, US National Library of Medicine
Laura Plaza, UNED, Madrid, Spain
Francisco J. Ribadas-Pena, University of Vigo, Spain
Angus Roberts, The University of Sheffield, UK
Kirk Roberts, The University of Texas Health Science Center at Houston, USA

Roland Roller, DFKI GmbH, Berlin, Germany
Diana Sousa, University of Lisbon, Portugal
Karin Verspoor, The University of Melbourne, Australia
Davy Weissenbacher, University of Pennsylvania, USA
W John Wilbur, US National Library of Medicine
Shankai Yan, US National Library of Medicine
Chrysoula Zerva, National Centre for Text Mining and University of Manchester, UK
Ayah Zirikly, Clinical Center, National Institutes of Health, USA
Pierre Zweigenbaum, LIMSI - CNRS, France

Additional Reviewers:
Jingcheng Du, School of Biomedical Informatics, UTHealth

Invited Speakers:
Hoifung Poon, Microsoft Research
Tim Miller, Boston Children's Hospital and Harvard Medical School
Kirk Roberts, School of Biomedical Informatics, UTHealth
Anna Rumshisky, University of Massachusetts Lowell

Table of Contents

Quantifying 60 Years of Gender Bias in Biomedical Research with Word Embeddings Anthony Rios, Reenam Joshi and Hejin Shin ...... 1

Sequence-to-Set Semantic Tagging for Complex Query Reformulation and Automated Text Categorization in Biomedical IR using Self-Attention Manirupa Das, Juanxi Li, Eric Fosler-Lussier, Simon Lin, Steve Rust, Yungui Huang and Rajiv Ramnath...... 14

Interactive Extractive Search over Biomedical Corpora Hillel Taub Tabib, Micah Shlain, Shoval Sadde, Dan Lahav, Matan Eyal, Yaara Cohen and Yoav Goldberg...... 28

Improving Biomedical Analogical Retrieval with Embedding of Structural Dependencies Amandalynne Paullada, Bethany Percha and Trevor Cohen...... 38

DeSpin: a prototype system for detecting spin in biomedical publications Anna Koroleva, Sanjay Kamath, Patrick Bossuyt and Patrick Paroubek ...... 49

Towards Visual Dialog for Radiology Olga Kovaleva, Chaitanya Shivade, Satyananda Kashyap, Karina Kanjaria, Joy Wu, Deddeh Ballah, Adam Coy, Alexandros Karargyris, Yufan Guo, David Beymer, Anna Rumshisky and Vandana Mukherjee ...... 60

A BERT-based One-Pass Multi-Task Model for Clinical Temporal Relation Extraction Chen Lin, Timothy Miller, Dmitriy Dligach, Farig Sadeque, Steven Bethard and Guergana Savova 70

Experimental Evaluation and Development of a Silver-Standard for the MIMIC-III Clinical Coding Dataset Thomas Searle, Zina Ibrahim and Richard Dobson ...... 76

Comparative Analysis of Text Classification Approaches in Electronic Health Records Aurelie Mascio, Zeljko Kraljevic, Daniel Bean, Richard Dobson, Robert Stewart, Rebecca Bendayan and Angus Roberts ...... 86

Noise Pollution in Hospital Readmission Prediction: Long Document Classification with Reinforcement Learning Liyan Xu, Julien Hogan, Rachel E. Patzer and Jinho D. Choi ...... 95

Evaluating the Utility of Model Configurations and Data Augmentation on Clinical Semantic Textual Similarity Yuxia Wang, Fei Liu, Karin Verspoor and Timothy Baldwin ...... 105

Entity-Enriched Neural Models for Clinical Question Answering Bhanu Pratap Singh Rawat, Wei-Hung Weng, So Yeon Min, Preethi Raghavan and Peter Szolovits 112

Evidence Inference 2.0: More Data, Better Models Jay DeYoung, Eric Lehman, Benjamin Nye, Iain Marshall and Byron C. Wallace...... 123

Personalized Early Stage Alzheimer’s Disease Detection: A Case Study of President Reagan’s Speeches Ning Wang, Fan Luo, Vishal Peddagangireddy, Koduvayur Subbalakshmi and Rajarathnam Chandramouli...... 133

BioMRC: A Dataset for Biomedical Machine Reading Comprehension Dimitris Pappas, Petros Stavropoulos, Ion Androutsopoulos and Ryan McDonald ...... 140

Neural Transduction of Letter Position Dyslexia using an Anagram Matrix Representation Avi Bleiweiss ...... 150

Domain Adaptation and Instance Selection for Disease Syndrome Classification over Veterinary Clinical Notes Brian Hur, Timothy Baldwin, Karin Verspoor, Laura Hardefeldt and James Gilkerson ...... 156

Benchmark and Best Practices for Biomedical Knowledge Graph Embeddings David Chang, Ivana Balažević, Carl Allen, Daniel Chawla, Cynthia Brandt and Andrew Taylor ...... 167

Extensive Error Analysis and a Learning-Based Evaluation of Medical Entity Recognition Systems to Approximate User Experience Isar Nejadgholi, Kathleen C. Fraser and Berry de Bruijn ...... 177

A Data-driven Approach for Noise Reduction in Distantly Supervised Biomedical Relation Extraction Saadullah Amin, Katherine Ann Dunfield, Anna Vechkaeva and Guenter Neumann...... 187

Global Locality in Biomedical Relation and Event Extraction Elaheh ShafieiBavani, Antonio Jimeno Yepes, Xu Zhong and David Martinez Iraola...... 195

An Empirical Study of Multi-Task Learning on BERT for Biomedical Text Mining Yifan Peng, Qingyu Chen and Zhiyong Lu ...... 205

Conference Program

Thursday July 9, 2020

08:30–08:40 Opening remarks

08:40–10:30 Session 1: High accuracy information retrieval, spin and bias

08:40–09:10 Invited Talk – Kirk Roberts

09:10–09:20 Quantifying 60 Years of Gender Bias in Biomedical Research with Word Embeddings Anthony Rios, Reenam Joshi and Hejin Shin

09:20–09:30 Sequence-to-Set Semantic Tagging for Complex Query Reformulation and Automated Text Categorization in Biomedical IR using Self-Attention Manirupa Das, Juanxi Li, Eric Fosler-Lussier, Simon Lin, Steve Rust, Yungui Huang and Rajiv Ramnath

09:30–09:40 Interactive Extractive Search over Biomedical Corpora Hillel Taub Tabib, Micah Shlain, Shoval Sadde, Dan Lahav, Matan Eyal, Yaara Cohen and Yoav Goldberg

09:40–09:50 Improving Biomedical Analogical Retrieval with Embedding of Structural Dependencies Amandalynne Paullada, Bethany Percha and Trevor Cohen

09:50–10:00 DeSpin: a prototype system for detecting spin in biomedical publications Anna Koroleva, Sanjay Kamath, Patrick Bossuyt and Patrick Paroubek

10:00–10:30 Discussion

10:30–10:45 Coffee Break

Thursday July 9, 2020 (continued)

10:45–13:00 Session 2: Clinical Language Processing

10:45–11:15 Invited Talk – Tim Miller

11:15–11:25 Towards Visual Dialog for Radiology Olga Kovaleva, Chaitanya Shivade, Satyananda Kashyap, Karina Kanjaria, Joy Wu, Deddeh Ballah, Adam Coy, Alexandros Karargyris, Yufan Guo, David Beymer, Anna Rumshisky and Vandana Mukherjee

11:25–11:35 A BERT-based One-Pass Multi-Task Model for Clinical Temporal Relation Extraction Chen Lin, Timothy Miller, Dmitriy Dligach, Farig Sadeque, Steven Bethard and Guergana Savova

11:35–11:45 Experimental Evaluation and Development of a Silver-Standard for the MIMIC-III Clinical Coding Dataset Thomas Searle, Zina Ibrahim and Richard Dobson

11:45–11:55 Comparative Analysis of Text Classification Approaches in Electronic Health Records Aurelie Mascio, Zeljko Kraljevic, Daniel Bean, Richard Dobson, Robert Stewart, Rebecca Bendayan and Angus Roberts

11:55–12:05 Noise Pollution in Hospital Readmission Prediction: Long Document Classification with Reinforcement Learning Liyan Xu, Julien Hogan, Rachel E. Patzer and Jinho D. Choi

12:05–12:15 Evaluating the Utility of Model Configurations and Data Augmentation on Clinical Semantic Textual Similarity Yuxia Wang, Fei Liu, Karin Verspoor and Timothy Baldwin

12:15–12:45 Discussion

12:45–13:30 Lunch

Thursday July 9, 2020 (continued)

13:30–15:30 Session 3: Language Understanding

13:30–14:00 Invited Talk – Anna Rumshisky

14:00–14:10 Entity-Enriched Neural Models for Clinical Question Answering Bhanu Pratap Singh Rawat, Wei-Hung Weng, So Yeon Min, Preethi Raghavan and Peter Szolovits

14:10–14:20 Evidence Inference 2.0: More Data, Better Models Jay DeYoung, Eric Lehman, Benjamin Nye, Iain Marshall and Byron C. Wallace

14:20–14:30 Personalized Early Stage Alzheimer’s Disease Detection: A Case Study of President Reagan’s Speeches Ning Wang, Fan Luo, Vishal Peddagangireddy, Koduvayur Subbalakshmi and Rajarathnam Chandramouli

14:30–14:40 BioMRC: A Dataset for Biomedical Machine Reading Comprehension Dimitris Pappas, Petros Stavropoulos, Ion Androutsopoulos and Ryan McDonald

14:40–14:50 Neural Transduction of Letter Position Dyslexia using an Anagram Matrix Representation Avi Bleiweiss

14:50–15:00 Domain Adaptation and Instance Selection for Disease Syndrome Classification over Veterinary Clinical Notes Brian Hur, Timothy Baldwin, Karin Verspoor, Laura Hardefeldt and James Gilkerson

15:00–15:30 Discussion

15:30–15:45 Coffee Break

Thursday July 9, 2020 (continued)

15:45–17:45 Session 4: Named Entity Recognition and Knowledge Representation

15:45–16:25 Invited Talk: Machine Reading for Precision Medicine, Hoifung Poon

16:25–16:35 Benchmark and Best Practices for Biomedical Knowledge Graph Embeddings David Chang, Ivana Balažević, Carl Allen, Daniel Chawla, Cynthia Brandt and Andrew Taylor

16:35–16:45 Extensive Error Analysis and a Learning-Based Evaluation of Medical Entity Recognition Systems to Approximate User Experience Isar Nejadgholi, Kathleen C. Fraser and Berry de Bruijn

16:45–16:55 A Data-driven Approach for Noise Reduction in Distantly Supervised Biomedical Relation Extraction Saadullah Amin, Katherine Ann Dunfield, Anna Vechkaeva and Guenter Neumann

16:55–17:05 Global Locality in Biomedical Relation and Event Extraction Elaheh ShafieiBavani, Antonio Jimeno Yepes, Xu Zhong and David Martinez Iraola

17:05–17:15 An Empirical Study of Multi-Task Learning on BERT for Biomedical Text Mining Yifan Peng, Qingyu Chen and Zhiyong Lu

17:15–17:45 Discussion

17:45–18:00 Closing remarks

Quantifying 60 Years of Gender Bias in Biomedical Research with Word Embeddings

Anthony Rios (1), Reenam Joshi (2), and Hejin Shin (3)
(1) Department of Information Systems and Cyber Security
(2) Department of Computer Science
(3) Library Systems
University of Texas at San Antonio
{Anthony.Rios, Reenam.Joshi, Hejin.Shin}@utsa.edu

Abstract

Gender bias in biomedical research can have an adverse impact on the health of real people. For example, there is evidence that heart disease-related funded research generally focuses on men. Health disparities can form between men and at-risk groups of women (i.e., elderly and low-income) if there is not an equal number of heart disease-related studies for both genders. In this paper, we study temporal bias in biomedical research articles by measuring gender differences in word embeddings. Specifically, we address multiple questions, including: How has gender bias changed over time in biomedical research, and what health-related concepts are the most biased? Overall, we find that traditional gender stereotypes have reduced over time. However, we also find that the embeddings of many medical conditions are as biased today as they were 60 years ago (e.g., concepts related to drug addiction and body dysmorphia).

1 Introduction

It is important to develop gender-specific best-practice guidelines for biomedical research (Holdcroft, 2007). If research is heavily biased towards one gender, then the biased guidance may contribute towards health disparities because the evidence drawn on may be questionable (i.e., not well studied). For example, there is more research funding for the study of heart disease in men (Weisz et al., 2004). Therefore, the at-risk populations of older women in low economic classes are not as well investigated, which opens up the possibility of an increase in the health disparities between genders.

Among informatics researchers, there has been increased interest in understanding, measuring, and overcoming bias associated with machine learning methods. Researchers have studied many application areas to understand the effect of bias. For example, Kay et al. (2015) found that the Google image search application is biased: specifically, they found an unequal representation of gender stereotypes in image search results for different occupations (e.g., all police images are of men). Likewise, ad-targeting algorithms may include characteristics of sexism and racism (Datta et al., 2015; Sweeney, 2013). Sweeney (2013) found that the names of black men and women are likely to generate ads related to arrest records. In healthcare, much of the prior work has studied bias in the diagnosis process made by doctors (Young et al., 1996; Hartung and Widiger, 1998). There have also been studies of ethical considerations around the use of machine learning in healthcare (Cohen et al., 2014).

It is possible to analyze and measure the presence of gender bias in text. Garg et al. (2018) analyzed the presence of well-known gender stereotypes over the last 100 years. Hamberg (2008) showed that gender blindness and stereotyped preconceptions are a key cause of gender bias in medicine. Heath et al. (2019) studied gender-based linguistic differences in physician trainee evaluations of medical faculty. Salles et al. (2019) measured implicit and explicit gender bias among health care professionals and surgeons. Feldman et al. (2019) quantified the exclusion of females from clinical studies at scale with automated data extraction. Recently, researchers have studied methods to quantify gender bias using word embeddings trained on biomedical research articles (Kurita et al., 2019). Kurita et al. (2019) showed that the resulting embeddings capture some well-known gender stereotypes; moreover, the embeddings exhibit the stereotypes at a lower rate than embeddings trained on other corpora (e.g., Wikipedia). However, to the best of our knowledge, there has not been an automated temporal study of the change in gender bias.

In this paper, we look at the temporal change of gender bias in biomedical research. To study social biases, we make use of word embeddings trained on different decades of biomedical research articles. The two main questions driving this work are: In what ways has bias changed over time, and Are there certain illnesses associated with a specific gender? We leverage three computational techniques to answer these questions: the Word Embedding Association Test (WEAT) (Caliskan et al., 2017), the Embedding Coherence Test (ECT) (Dev and Phillips, 2019), and the Relational Inner Product Association (RIPA) (Ethayarajh et al., 2019). To the best of our knowledge, this is the first temporal analysis of bias in word embeddings trained on biomedical research articles. Moreover, to the best of our knowledge, this is the first analysis that measures the gender bias associated with individual biomedical words.

Our work is most similar to Garg et al. (2018), who study the temporal change of both gender and racial biases using word embeddings. Our work substantially differs in three ways. First, this paper is focused on biomedical literature, not general text corpora. Second, we analyze gender stereotypes using three distinct methods to see if the bias is robust to various measurement techniques. Third, we extend the study beyond gender stereotypes. Specifically, we look at bias in sets of occupation words, as well as bias in mental health-related word sets. Moreover, we quantify the bias of individual occupational and mental health-related words.

In summary, the paper makes the following contributions:

• We answer the question: How has the usage of gender stereotypes changed in the last 60 years of biomedical research? Specifically, we look at the change in well-known gender stereotypes (e.g., Math vs Arts, Career vs Family, Intelligence vs Appearance, and occupations) in biomedical literature from 1960 to 2020.

• The second contribution answers the question: What are the most gender-stereotyped words for each decade during the last 60 years, and have they changed over time? This contribution is more focused than simply looking at traditional gender stereotypes. Specifically, we analyze two groups of words: occupations and mental health disorders. For each group, we measure the overall change in bias over time. Moreover, we measure the individual bias associated with each occupation and mental health disorder.

2 Related Work

In this section, we discuss research related to the three major themes of this paper: gender disparities in healthcare, biomedical word embeddings, and bias in natural language processing (NLP).

2.1 Gender Disparities in Healthcare.

There is evidence of gender disparities in the healthcare system, from the diagnosis of mental health disorders to differences in substance abuse. An important question is: Do similar biases appear in biomedical research? In this work, while we explore traditional gender stereotypes (e.g., Intelligence vs Appearance), we also measure potential bias in the occupations and mental health-related disorders associated with each gender.

With regard to mental health, as an example, major depression is one of the most common mental health illnesses, affecting more than 17 million adults in the United States (US) alone (Pratt and Brody, 2014). Depression can cause people to lose pleasure in daily life, complicate other medical conditions, and possibly lead to suicide (Pratt and Brody, 2014). Moreover, depression can occur to anyone, at any age, and to people of any race or ethnic group. While treatment can help individuals suffering from major depression, or mental illness in general, only about 35% of individuals suffering from severe depression seek treatment from mental health professionals. It is common for people to resist treatment because of the belief that depression is not serious, that they can treat themselves, or that it would be seen as a personal weakness rather than a serious medical illness (Gulliver et al., 2010). Unfortunately, while depression can affect anyone, women are almost twice as likely as men to have had depression (Albert, 2015). Moreover, depression is generally higher among certain demographic groups, including, but not limited to, Hispanic, non-Hispanic black, low income, and low education groups (Bailey et al., 2019). The focus of this paper is to understand the impact of these mental health disparities on word embeddings trained on biomedical corpora.

2.2 Biomedical Word Embeddings.

Word embeddings capture the distributional nature of words (i.e., words that appear in similar contexts will have a similar vector encoding). Over the years, there have been multiple methods of producing word embeddings, including, but not limited to, latent semantic analysis (Deerwester et al., 1990), Word2Vec (Mikolov et al., 2013a,b), and GloVe (Pennington et al., 2014). Moreover, pre-trained word embeddings have been shown to be useful for a wide variety of downstream biomedical NLP tasks (Wang et al., 2018), such as text classification (Rios and Kavuluru, 2015), named entity recognition (Habibi et al., 2017), and relation extraction (He et al., 2019). In Chiu et al. (2016), the authors study a standard methodology to train good biomedical word embeddings; essentially, they study the impact of the various Word2Vec-specific hyperparameters. In this paper, we use the strategies proposed in Chiu et al. (2016) to train optimal decade-specific biomedical word embeddings.

2.3 Bias and Natural Language Processing.

Unfortunately, because word embeddings are learned using naturally occurring data, implicit biases expressed in text will be transferred to the vectors. Bias (and fairness) is an important topic among natural language processing researchers. Bias has been found in word embeddings (Bolukbasi et al., 2016; Zhao et al., 2018, 2019), text classification models (Dixon et al., 2018; Park et al., 2018; Badjatiya et al., 2019; Rios, 2020), and machine translation systems (Font and Costa-jussà, 2019; Escudé Font, 2019). In general, each paper focuses on either testing whether bias exists in various models, or on removing bias from classification models for specific applications.

Much of the work on measuring (gender) bias using word embeddings neither studies the temporal aspect (i.e., how bias changes over time) nor focuses on biomedical research (Chaloner and Maldonado, 2019). For example, Caliskan et al. (2017) studied the bias in groups of words, focusing on traditional gender stereotypes. Kurita et al. (2019) expanded on Caliskan et al. (2017) to generalize to contextual word embeddings. Garg et al. (2018) developed a technique to study 100 years of gender and racial bias using word embeddings. They evaluated the bias over time using the US Census as a baseline to compare embedding bias to demographic and occupation shifts. There has also been work on measuring bias in sentence embeddings (May et al., 2019). Furthermore, there has been a significant amount of research that explores different ways to measure bias in word embeddings (Caliskan et al., 2017; Dev and Phillips, 2019; Ethayarajh et al., 2019). In this work, we apply many of these bias measurement techniques (Caliskan et al., 2017; Dev and Phillips, 2019; Ethayarajh et al., 2019) to the biomedical domain.

3 Dataset

We analyze PubMed-indexed titles and abstracts published anytime between 1960 and 2020. The total number of articles per decade is shown in Table 1. The text is lower-cased and tokenized using the SimpleTokenizer available in GenSim (Khosrovian et al., 2008). We find that the total number of papers has grown substantially each decade, from 1.4 million indexed articles in the 1960s to 8.6 million in the 2010s. Yet, the rate of growth stayed relatively stable each decade.

Year        # Articles
1960-1969    1,479,370
1970-1979    2,305,257
1980-1989    3,322,556
1990-1999    4,109,739
2000-2010    6,134,431
2010-2020    8,686,620
Total       26,037,973

Table 1: The total number of articles in each decade.

4 Method

We train the Skip-Gram model on PubMed-indexed titles and abstracts from 1960 to 2020. The hyper-parameters of the Skip-Gram model are optimized independently for each decade. Next, given the best set of embeddings for each decade, we explore three different techniques to measure bias: the Word Embedding Association Test (WEAT), the Embedding Coherence Test (ECT), and the Relational Inner Product Association (RIPA). Each method allows us to quantify bias in different ways, such as comparing multiple sets of words (e.g., comparing the bias with respect to Career vs Family), comparing a single set of words (e.g., occupations), and measuring the bias of individual words (e.g., nurse). In this section, we briefly discuss the procedure we used to train the word embeddings, as well as provide descriptions of each of the bias measurement techniques.
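The dataset preparation in Section 3 (grouping PubMed titles and abstracts by decade, then lower-casing and tokenizing them with GenSim) can be sketched as follows. This is a minimal illustration, not the authors' code: the record format, the helper names `decade_of` and `bucket_abstracts`, and the use of `gensim.utils.simple_preprocess` as a stand-in for the SimpleTokenizer are all assumptions.

```python
from collections import defaultdict
from gensim.utils import simple_preprocess

# Decade buckets as used in Table 1 (note the paper's own labels overlap at 2010).
DECADES = [(1960, 1969, "1960-1969"), (1970, 1979, "1970-1979"),
           (1980, 1989, "1980-1989"), (1990, 1999, "1990-1999"),
           (2000, 2010, "2000-2010"), (2011, 2020, "2010-2020")]

def decade_of(year):
    """Map a publication year to its decade label, or None if out of range."""
    for lo, hi, label in DECADES:
        if lo <= year <= hi:
            return label
    return None

def bucket_abstracts(records):
    """records: iterable of (year, title, abstract) tuples from a local PubMed
    dump (the download/parsing step is omitted).  Returns a dict mapping each
    decade label to a list of tokenized documents."""
    corpora = defaultdict(list)
    for year, title, abstract in records:
        label = decade_of(year)
        if label is not None:
            # simple_preprocess lower-cases and tokenizes, approximating the
            # SimpleTokenizer preprocessing described in Section 3.
            corpora[label].append(simple_preprocess(f"{title} {abstract}"))
    return corpora
```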

Attribute Words

Male vs Female
  X: male, man, boy, brother, he, him, his, son, father, uncle, grandfather
  Y: female, woman, girl, sister, she, her, hers, daughter, mother, aunt, grandmother

Target Words

Career vs Family
  A: executive, management, professional, corporation, salary, office, business, career
  B: home, parents, children, family, cousins, marriage, wedding, relatives

Math vs Art
  A: math, algebra, geometry, calculus, equations, computation, numbers, addition
  B: poetry, art, Shakespeare, dance, literature, novel, symphony, drama

Science vs Art
  A: science, technology, physics, chemistry, Einstein, NASA, experiment, astronomy
  B: poetry, art, Shakespeare, dance, literature, novel, symphony, drama

Intelligence vs Appearance
  A: precocious, resourceful, inquisitive, genius, inventive, astute, adaptable, reflective, discerning, intuitive, inquiring, judicious, analytical, apt, venerable, imaginative, shrewd, thoughtful, wise, smart, ingenious, clever, brilliant, logical, intelligent
  B: alluring, voluptuous, blushing, homely, plump, sensual, gorgeous, slim, bald, athletic, fashionable, stout, ugly, muscular, slender, feeble, handsome, healthy, attractive, fat, weak, thin, pretty, beautiful, strong

Strong vs Weak
  A: power, strong, confident, dominant, potent, command, assert, loud, bold, succeed, triumph, leader, shout, dynamic, winner
  B: weak, surrender, timid, vulnerable, weakness, wispy, withdraw, yield, failure, shy, follow, lose, fragile, afraid, loser

Table 2: Attribute and target words used by WEAT to measure the presence of traditional gender stereotypes in biomedical literature.

4.1 Word2Vec Model Training.

We train a Skip-Gram model using GenSim (Khosrovian et al., 2008). Following Chiu et al. (2016), we search over the following key hyper-parameters: negative sample size, sub-sampling, minimum count, learning rate, vector dimension, and context window size. See Chiu et al. (2016, Table 2) for more details.

To find the best model, as we search over the various hyper-parameters, we make use of the UMLS-Sim dataset (McInnes et al., 2009). UMLS-Sim consists of 566 medical concept pairs for measuring similarity. The degree of association between terms in UMLS-Sim was rated by four medical residents from the University of Minnesota medical school. All these clinical terms correspond to Unified Medical Language System (UMLS) concepts included in the Metathesaurus (Bodenreider, 2004). Evaluation is performed using Spearman's rho rank correlation between a vector of cosine similarities between each of the 566 pairs of words and their respective medical-resident ratings. Intuitively, the ranking of the pairs using cosine similarity, from most similar pairs to the least, should be similar to the human (medical expert) annotations.
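A minimal sketch of the decade-wise Skip-Gram training and UMLS-Sim model selection described above, assuming the GenSim 4.x `Word2Vec` API and SciPy's `spearmanr`. The hyper-parameter grid here is a small illustrative subset of the search space from Chiu et al. (2016), and `umls_pairs` is assumed to be a list of (term1, term2, expert_rating) triples.

```python
from itertools import product
from gensim.models import Word2Vec
from scipy.stats import spearmanr

def umls_sim_score(model, umls_pairs):
    """Spearman's rho between model cosine similarities and medical-resident
    ratings, over the UMLS-Sim pairs present in the model vocabulary."""
    sims, ratings = [], []
    for w1, w2, rating in umls_pairs:
        if w1 in model.wv and w2 in model.wv:
            sims.append(float(model.wv.similarity(w1, w2)))
            ratings.append(rating)
    rho = spearmanr(sims, ratings).correlation if sims else float("nan")
    return rho, len(sims)

def train_decade_embeddings(decade_corpora, umls_pairs):
    """decade_corpora: decade label -> list of tokenized documents.
    Returns decade label -> (best Skip-Gram model, its UMLS-Sim rho)."""
    grid = list(product([5, 10],        # negative samples
                        [1e-3, 1e-5],   # sub-sampling threshold
                        [100, 200],     # vector dimension
                        [2, 5]))        # context window size
    best = {}
    for decade, sentences in decade_corpora.items():
        best_rho, best_model = float("-inf"), None
        for negative, sample, size, window in grid:
            model = Word2Vec(sentences, sg=1, negative=negative, sample=sample,
                             vector_size=size, window=window,
                             min_count=5, workers=4, epochs=5)
            rho, _ = umls_sim_score(model, umls_pairs)
            if rho == rho and rho > best_rho:  # rho == rho filters out NaN
                best_rho, best_model = rho, model
        best[decade] = (best_model, best_rho)
    return best
```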

4.2 Word Embedding Association Test

The implicit bias test measures unconscious prejudice (Greenwald et al., 1998). WEAT is a generalization of the implicit bias test for word embeddings, measuring the association between two sets of target concepts and two sets of attributes. We use the same target and attribute sets from Kurita et al. (2019). We list the targets and attributes in Table 2. The attribute sets of words are related to the groups which the embeddings are biased towards or against, e.g., Male vs Female. The words in the target categories (Career vs Family, Math vs Arts, Science vs Arts, Intelligence vs Appearance, and Strength vs Weakness) represent the specific types of biases. For example, using the attributes and targets, we want to know whether the learned embeddings that represent men are more related to career than the female-related words (i.e., test if female words are more related to family than male words).

Formally, let X and Y be equal-sized sets of target concept embeddings and let A and B be sets of attribute embeddings. To measure the bias, we follow Caliskan et al. (2017), which defines the following test statistic as the difference between the sums over the respective target concepts,

$$s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B)$$

where $s(w, A, B)$ measures the association between a single target word $w$ (e.g., career) and each of the attribute (gendered) words as

$$s(w, A, B) = \sum_{a \in A} \cos(\vec{w}, \vec{a}) - \sum_{b \in B} \cos(\vec{w}, \vec{b}),$$

such that $\cos(\cdot)$ represents the cosine similarity between two vectors, and $\vec{w}, \vec{a}, \vec{b} \in \mathbb{R}^d$ represent the word embeddings, with $d$ the dimension of each word embedding. Instead of using the test statistic directly, to measure bias we use the effect size. The effect size is a normalized measure of the separation of the two distributions, defined as

$$\frac{\mu_{x \in X}\, s(x, A, B) - \mu_{y \in Y}\, s(y, A, B)}{\sigma_{w \in X \cup Y}\, s(w, A, B)}$$

where $\mu_{x \in X}$ and $\mu_{y \in Y}$ represent the mean score over target words for a specific attribute word, and $\sigma_{w \in X \cup Y}$ is the standard deviation of the scores for the words in the union of X and Y. Intuitively, a positive score means that the attribute words in X (e.g., male, man, boy) are more similar to the target words A (e.g., strong, power, dominant) than Y (e.g., female, woman, girl). Moreover, larger effects represent more biased embeddings.

As previously stated, the attribute and target words are from Kurita et al. (2019). It is important to note that the list is manually curated; moreover, the bias measurement can change depending on the exact list of words. RIPA is more robust to slight changes in the attribute words than WEAT (Ethayarajh et al., 2019).

Year       Sim    Pair Cnt
1960-1969  .6586  101
1970-1979  .6715  207
1980-1989  .7033  277
1990-1999  .7282  265
2000-2010  .7078  272
2010-2020  .6867  306

Table 3: Quality of the embeddings trained for each decade, measured using the UMLS-Sim dataset. Sim represents Spearman's rho ranking correlation. Pair count is the number of UMLS-Sim word pairs that were present in that decade's embeddings.

4.3 Embedding Coherence Test.

We also explore a second method of measuring bias, the Embedding Coherence Test (ECT) (Dev and Phillips, 2019). Unlike WEAT, it compares the attribute words (e.g., Male vs Female) with a single target set (e.g., Career). Thus, we do not need two contrasting target sets (e.g., Career vs Family) to measure bias. We take advantage of this to measure bias associated with occupations and mental health-related disorders. Specifically, we use a total of 290 occupation words and 222 mental health-related words. The occupation words come from prior work measuring per-word bias (Dev and Phillips, 2019). To form a list of mental health words, we use the Diagnostic and Statistical Manual of Mental Disorders (DSM-5), a taxonomic and diagnostic tool published by the American Psychiatric Association (Association et al., 2013). Each mental health disorder in DSM-5 is generally a multi-word expression, so we split it into individual words and then manually remove uninformative adjective and function words. For example, the disorder "Specific learning disorder, with impairment in mathematics" is tokenized into the following words: "learning", "disorder", "impairment", and "mathematics". A complete listing of the occupational and mental health words can be found in the appendix.

Formally, ECT first computes the mean vectors for the attribute word sets X and Y, defined as

$$\vec{v}_X = \frac{1}{|X|} \sum_{x \in X} \vec{x}$$

where $\vec{v}_X \in \mathbb{R}^d$ and $|X|$ represents the number of words in category X. $\vec{v}_Y$ is calculated similarly. For both $\vec{v}_X$ and $\vec{v}_Y$, ECT computes the (cosine) similarities with all vectors $a \in A$, i.e., the cosine similarity is calculated between each target word $a \in A$ and $\vec{v}_X$ and stored in $s_X \in \mathbb{R}^{|A|}$. The two resultant vectors of similarity scores, $s_X$ (for X) and $s_Y$ (for Y), are used to obtain the final ECT score. It is the Spearman's rank correlation between the rank orders of $s_X$ and $s_Y$: the higher the correlation, the lower the bias. Intuitively, if the correlation is high, then the rank of target words based on similarity is correlated when calculated for both X and Y (i.e., male and female).
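The three bias measures can be sketched directly from the WEAT and ECT definitions above and from the RIPA construction given in Section 4.4 below. This is an illustrative NumPy implementation under assumed inputs (lists of word vectors and a word-to-vector dict), not the authors' released code; the function names are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def weat_effect_size(X, Y, A, B):
    """WEAT effect size: X, Y are the gendered attribute-word vectors and
    A, B are the contrasting target-word vectors (e.g., Career vs Family)."""
    def s(w):  # association of one word with A versus B
        return sum(cos(w, a) for a in A) - sum(cos(w, b) for b in B)
    x_scores = np.array([s(x) for x in X])
    y_scores = np.array([s(y) for y in Y])
    pooled = np.concatenate([x_scores, y_scores])
    return (x_scores.mean() - y_scores.mean()) / pooled.std()

def ect_score(X, Y, A):
    """ECT: Spearman correlation between the target words' similarities to the
    mean male vector and the mean female vector; higher means less bias."""
    v_x, v_y = np.mean(X, axis=0), np.mean(Y, axis=0)
    s_x = [cos(a, v_x) for a in A]
    s_y = [cos(a, v_y) for a in A]
    return spearmanr(s_x, s_y).correlation

def ripa_scores(gender_pairs, word_vectors):
    """RIPA (Section 4.4): the relation vector g is the first principal component
    of the difference vectors of gendered word pairs; each word is scored by its
    dot product with g.  The sign convention (male vs. female direction) should
    be fixed afterwards by checking a known pair such as ('he', 'she')."""
    diffs = np.array([x - y for x, y in gender_pairs])
    diffs -= diffs.mean(axis=0)   # centering before PCA is one common choice
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    g = vt[0]
    return {w: float(np.dot(vec, g)) for w, vec in word_vectors.items()}
```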

[Figure 1 appears here: five panels plotting WEAT bias (y-axis: Bias) by decade (x-axis: Decade) for (a) Career vs Family, (b) Math vs Art, (c) Science vs Art, (d) Strong vs Weak, and (e) Intelligence vs Appearance. See the caption below.]

Figure 1: Each subfigure plots the bias measures using WEAT for one of five gender stereotypes: (a) Career vs Family, (b) Math vs Art, (c) Science vs Art, (d) Intelligence vs Appearance, and (e) Strong vs Weak. A bias score of zero represents no bias, i.e., no measurable difference between the two target categories for each gender. The shaded area of each subplot represents the bootstrap estimated 95% confidence interval.

4.4 Relational Inner Product Association.

While ECT only requires a single target set, both WEAT and ECT [1] calculate the bias between sets of words. However, neither approach calculates a robust bias score for individual words. To study the most gender-biased words over time, we make use of RIPA (Ethayarajh et al., 2019). Intuitively, RIPA uses a single vector to represent gender; each word is then scored by taking the dot product between the gender embedding and its respective embedding. The sign of the score determines whether the embedding is more male- or female-related.

[1] The cosine similarities from ECT can be used to measure scores for individual words, but it is not as robust as RIPA (Ethayarajh et al., 2019).

The major aspect of RIPA is creating the gender embedding. Formally, given S, a non-empty set of ordered word pairs (x, y) (e.g., ('man', 'woman'), ('male', 'female')) that defines the gender association, we take the first principal component of all the difference vectors $\{\vec{x} - \vec{y} \mid (x, y) \in S\}$, which we call the relation vector $\vec{g} \in \mathbb{R}^d$; this is a one-dimensional bias subspace. Then, for some word vector $\vec{w} \in \mathbb{R}^d$, the dot product is taken with $\vec{g}$ to measure bias.

5 Results

In this section, we present the results of our study in four parts. First, we report the embedding quality using UMLS-Sim. Second, we study the temporal bias of traditional gender stereotypes, such as Career vs Family and Strong vs Weak. Ideally, we want to understand how, and which, stereotypes have changed over time; to understand the biased stereotypes, we make use of the WEAT method. Third, we look at whether occupational and mental health-related words are biased, and how the bias has changed over time. For this result, we only use a single set of target words; thus, we make use of ECT. Fourth, we use RIPA to find the most biased words for each gender in each decade.

5.1 Embedding Quality.

In Table 3, we report the quality of each decade's embeddings based on the UMLS-Sim dataset. Overall, we find that the quality consistently improves until the 1990s; however, we see drops in the 2000s and 2010s. We hypothesize that the reason for the decrease in embedding quality is the growth of research articles indexed on PubMed. Intuitively, word embeddings are only able to capture a single sense of a word. However, given the breadth of articles PubMed indexes, from machine learning (e.g., BioNLP) to biomaterials, multiple word meanings are being stored in a single vector. Thus, the overall quality begins to drop.

5.2 Traditional Gender Stereotypes.

In Figure 1, we plot the bias scores reported using WEAT. Remember, a large positive score means

[Figure 2 appears here: two panels plotting ECT bias (y-axis: Bias) by decade (x-axis: Decade) for (a) Occupations and (b) Mental Disorders. See the caption below.]

Figure 2: ECT bias estimates for both the set of occupation and mental disorder words. The shaded area of each subplot represents the bootstrap estimated 95% confidence interval. that the male words are more similar to the targets bias in general text corpora to what is found in A (e.g., career) than the female words. There is biomedical literature. no measurable bias with a value of zero. Overall, we find that the results from the WEAT test vary 5.3 Occupational and Mental Health Bias. depending on the stereotype. For Career vs Family, In Figure2, we report the gender bias results from in Figure 1a, we find a steady linear decrease in ECT on two categories: occupations (e.g., doc- bias each decade—with the exception of the 1990s. tor, nurse, teacher) and mental health disorders We also find similar linear decreases in bias for (e.g., depression, alcoholism, PTSD). Again, un- both Science vs Art and Strong vs Weak (Figures 1c like WEAT, ECT calculates bias scores on a single and 1e). In Figure 1b, for Math vs Art, however, target set of words. Therefore, we do not need two the bias stays relatively static, i.e., it does not dra- contrasting target word sets (e.g., Math vs Art), in- matically change over time. Moreover, the WEAT stead we can focus on bias for a single set (e.g., score for Math vs Art is negative, meaning that the Math). Also, the larger the score, the lower the female words are more similar to math than the bias—a score of one would represent no difference male words. Likewise, for Intelligence vs Appear- between male and female words for that specific tar- ance (Figure 1d), we see relatively little bias from get set. Interestingly, we find that the ECT scores 1960 to 1989, however, in the 1990s and 2000s, we follow a similar pattern as found in Table3, the had a substantial jump in the bias score. better the embedding quality, the lower the bias. Comparing Figures 2a and 2b, we find that Our evaluation supports prior work evaluating the word embeddings for both occupations and bias in biomedical word embeddings (e.g., Strong mental disorders have relatively little bias in the vs Weak is the most biased stereotype in biomed- 1990s. Furthermore, while there was small vari- ical literature) (Chaloner and Maldonado, 2019, ation, mental disorders experienced little change Table 2). However, we also find differences when in bias decade-by-decade. Yet, occupation-related measuring bias over time. For example, we find words had a substantial amount of bias in the 1960s that from 2010 to 2019 there is not a lot evidence and 1970s. Moreover, we find that the bias re- for the Career vs Family stereotype in biomedical lated to occupations experienced more change, than corpora, matching the results from Chaloner and mental disorders, starting 0.83 in the 1960s and in- Maldonado(2019, Table 2). Yet, this is only a crease by more than ten points to 0.94 in the 1990s. recent phenomenon. The embeddings trained on ar- Whereas, mental disorder-related bias scores only ticles published from 1990 to 1999 exhibit a Career ranged from 0.90 to 0.94. vs Family bias score greater than 1.5. Overall, com- paring to Chaloner and Maldonado(2019, Table 5.4 Biased Words. 2), this means that the bias in recently published In Figure2, we analyze the bias of individ- biomedical literature may not be as strong as what ual occupational and mental health-related words. is found in general text corpora. 
But, if we exclude We found a substantial change in the bias of the most recent decade’s embeddings, the bias in occupational-related words. biomedical literature becomes much stronger. Fu- We found little change in the bias of mental ture work should explore comparing the temporal health-related words since the 1960s. Yet, while

Occupations (Male)
Rank  1970-1979     1980-1989     1990-1999   2000-2010     2010-2020
1     promoter      conductor     chef        dentist       mediator
2     collector     chef          baker       counselor     promoter
3     investigator  biologist     astronaut   librarian     dentist
4     principal     collector     swimmer     pharmacist    principal
5     baker         dad           prisoner    teenager      collector
6     researcher    singer        mechanic    bishop        cop
7     character     chemist       character   acquaintance  conductor
8     mechanic      butler        worker      cardiologist  substitute
9     analyst       mechanic      soldier     promoter      coach
10    conductor     promoter      analyst     attorney      employee

Occupations (Female)
Rank  1970-1979     1980-1989     1990-1999     2000-2010   2010-2020
1     teacher       housewife     neurosurgeon  swimmer     priest
2     professor     teenager      pediatrician  baker       fisherman
3     counselor     bishop        educator      butcher     teenager
4     physician     lawyer        teenager      medic       chef
5     pediatrician  pediatrician  counselor     barber      writer
6     consultant    athlete       neurologist   physicist   nanny
7     doctor        physician     consultant    soldier     historian
8     student       pathologist   dentist       baron       president
9     lawyer        educator      athlete       director    inventor
10    pathologist   carpenter     doctor        singer      housewife

Mental Disorders (Male)
Rank  1970-1979       1980-1989     1990-1999     2000-2010      2010-2020
1     caffeine        cannabis      separation    lacunar        lacunar
2     restrictive     hypnotic      restrictive   bulimia        circadian
3     attachment      caffeine      coordination  erectile       nicotine
4     separation      coordination  dyskinesia    gambling       gambling
5     circadian       hallucinogen  conversion    bereavement    phencyclidine
6     coordination    dependence    mathematics   binge          ocpd
7     benzodiazepine  attachment    attachment    nervosa        cocaine
8     dependence      mathematics   residual      mood           insomnia
9     selective       restrictive   parasitosis   depressive     sleep
10    conversion      pdd           developmental polysubstance  caffeine

Mental Disorders (Female)
Rank  1970-1979    1980-1989    1990-1999    2000-2010    2010-2020
1     dysmorphic   factitious   binge        dissociative munchausen
2     psychogenic  dysmorphic   nervosa      coordination mutism
3     anorexia     nervosa      bulimia      separation   factitious
4     adolescent   mutism       opioid       parasitosis  dysmorphic
5     nervosa      bulimia      hypersomnia  terror       hysteria
6     mutism       tourette     narcolepsy   hysteria     cotard
7     infancy      infancy      anorexia     conversion   claustrophobia
8     munchausen   episode      panic        malingering  ekbom
9     factitious   anorexia     korsakoff    tic          diogenes
10    disorder     munchausen   factitious   munchausen   encopresis

Table 4: The top ten words with the largest RIPA scores (i.e., the most biased) across each decade. The RIPA scores are reported for both occupations and mental health disorders. While all the listed words are biased, they are ranked starting with the most biased word to the least. we found little change in mental health bias over- e.g., “caffeine”, “cannabis”, “nicotine”, and “gam- all, are there at least a few disorders that changed bling”. Similarly, disorders related to appearance over time? Moreover, we found a slight bias in are more female-related, e.g., “dysmorphic” 2 and mental health terms, therefore, What are the biased “anorexia”. We also find that disorders related to terms in each group? We look at the most gen- emotions are more female-related, such as “mun- der biased occupational and mental health-related chausen” 3, “hysteria” 4, and “terror”. Interest- terms for each decade in Table4. Because of space ingly, we find that the word “hysteria” is heavily limitations, we only display the gendered words biased in the 2010s. Even though the diagnosis of from the 1970s to the 2010s. The words from the female hysteria substantially fell in the 1900s (Mi- 1960s can be found in the appendix. The word-level cale, 1993), it still seems to be a biased term. We scores were generated using RIPA. First, for occu- want to note that this could simply be caused by pations, the words vary between male and female. research studying mental health diagnosis bias in For example, in the 1970s, male-related words in- women, however, the underlying cause of why the clude “mechanic”, “principal”, and “investigator”. term is biased in the 2010s is left for future work. The female-related words include “teacher”, “coun- selor”, and “pediatrician”. Interestingly, the jobs 6 Discussion associated with men such as “principal” and “re- searcher” are positions with power over the jobs In this section, we discuss the impact of the results associated with woman. For example “principals” on two stakeholders of this research: BioNLP re- (male) have power over “teachers” (female) and searchers and general biomedical researchers. Fur- “researchers” (male) have power over “students” thermore, we discuss the limitations of focusing on (female). We also find other well-known occupa- binary gender (Male vs Female). tions appear to be gender-related. For instance, “butler” in the 1980s is associated to male while 2Dysmorphia is a mental health disorder in which you “nanny” is related to female in the 2010s. can’t stop thinking about one or more perceived defects or flaws With regard to mental health, we find that dis- 3Munchausen is a mental disorder in which a person re- orders associated with well-known gender dispari- peatedly and deliberately acts as if he or she has a physical or ties appear to be biased using RIPA (Organization, mental illness 4Hysteria is a (biased) catch-all for symptoms including, 2013). For example, through the last 60 years, but not limited to, nervousness, hallucinations, and emo- words associated with addictions are male-related, tional outbursts.

6.1 Impact on BioNLP researchers.

The results in this paper are important for BioNLP research in two ways. First, we have produced decade-specific word embeddings.⁵ Therefore, BioNLP research can use the embeddings to study other historical phenomena in biomedical research articles. Second, the analysis of historical bias in biomedical research in this paper can be applied to other domains, beyond occupations and mental disorders.

6.2 Impact on Biomedical Researchers.

With regard to general biomedical researchers (e.g., medical researchers and biologists), this work can provide a way to measure, in an automated fashion, which demographics current research is leaning towards. As discussed in Holdcroft (2007), if research is heavily focused on a single gender, then health disparities can increase. Treatments should be explored equally for all at-risk patients. Furthermore, with the use of contextual word embeddings (Scheuerman et al., 2019), implicit bias measurement techniques can be used as part of the writing process to avoid gendered language when it is not necessary (e.g., using singular they vs. he/she).

6.3 A Note About Gender.

Similar to prior work measuring gender bias (Chaloner and Maldonado, 2019), we focus on binary gender. However, it is important to note that the results for binary gender do not necessarily generalize to other genders, including, but not limited to, binary trans people, non-binary people, and gender non-conforming people (Scheuerman et al., 2019). Therefore, we want to explicitly note that our research does not necessarily generalize beyond binary gender. In future work, we recommend that studies be performed for other genders, beyond simply studying Male vs. Female.

How can this study be expanded beyond binary gender? The three bias measurement techniques studied in this paper (i.e., WEAT, ECT, and RIPA) require sets of words representing a single gender (e.g., boy, men, male). Unfortunately, there is not a large number of words to represent every gender of interest. A promising area of research is to explore bias in contextual word embeddings. With the use of contextual word embeddings (Kurita et al., 2019), we can measure the bias of individual words across many contexts. Thus, we can possibly overcome the problem of a limited number of words per gender.

7 Conclusion

In this paper, we studied the historical bias present in word embeddings from 1960 to 2020. In summary, we found that while some biases have shown a consistent decrease over time (e.g., Strong vs. Weak), others have stayed relatively static or, worse, increased (e.g., Intelligence vs. Appearance). Moreover, we found that the gender bias towards occupations has substantially changed over time, showing that in the past, there was more gender bias associated with certain jobs.

There are two major avenues for future work. First, this work quantified various aspects of gender bias over time. However, we do not know why the bias is present in the word embeddings. For example, is the word "hysteria" biased in 2010 because researchers are associating it with women implicitly, or is it that researchers are studying the historical usage of the diagnosis to ensure the diagnosis is not made because of implicit bias in the future? Thus, our future work will focus on causal studies of bias in biomedical literature. Second, we simply trained independent Skip-Gram word embeddings for each decade. However, recent work has shown that dynamic embeddings, rather than static (decade-specific) ones, perform better with regard to analyzing public perception over time (Gillani and Levy, 2019). Future work will focus on developing new techniques to study bias temporally. Moreover, many techniques may depend on the magnitude of the bias; therefore, we plan to analyze the circumstances in which one embedding approach (e.g., Skip-Gram) may measure bias better than another (e.g., dynamic embeddings).

⁵ https://github.com/AnthonyMRios/Gender-Bias-PubMed
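The decade-specific Skip-Gram embeddings mentioned above could be reproduced along the following lines; this is a minimal sketch that assumes the abstracts are already tokenized and grouped by decade, and the toolkit (gensim) and hyperparameters shown are illustrative rather than the authors' exact configuration.

```python
from gensim.models import Word2Vec

def train_decade_embeddings(sentences_by_decade, dim=300, window=5, min_count=10):
    """Train one independent Skip-Gram (sg=1) model per decade.
    `sentences_by_decade` maps a decade label (e.g. "1970s") to a list of
    tokenized sentences; all settings here are placeholders."""
    models = {}
    for decade, sentences in sentences_by_decade.items():
        models[decade] = Word2Vec(
            sentences, vector_size=dim, window=window,
            min_count=min_count, sg=1, workers=4, epochs=5)
    return models
```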
Acknowledgements

We would like to thank the anonymous reviewers for their invaluable help improving this manuscript. This material is based upon work supported by the National Science Foundation under Grant No. 1947697.

References

Paul R Albert. 2015. Why is depression more prevalent in women? Journal of psychiatry & neuroscience: JPN, 40(4):219.

9 American Psychiatric Association et al. 2013. Diag- Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, nostic and statistical manual of mental disorders and Lucy Vasserman. 2018. Measuring and mitigat- (DSM-5 R ). American Psychiatric Pub. ing unintended bias in text classification. In Confer- ence on AI, Ethics, and Society. Pinkesh Badjatiya, Manish Gupta, and Vasudeva Varma. 2019. Stereotypical bias removal for hate Joel Escude´ Font. 2019. Determining bias in machine speech detection task using knowledge-based gen- translation with deep learning techniques. Master’s eralizations. In The World Wide Web Conference, , Universitat Politecnica` de Catalunya. pages 49–59. Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. Rahn Kennedy Bailey, Josephine Mokonogho, and 2019. Understanding undesirable word embedding Alok Kumar. 2019. Racial and ethnic differences in associations. In Proceedings of the 57th Annual depression: current perspectives. Neuropsychiatric Meeting of the Association for Computational Lin- disease and treatment, 15:603. guistics, pages 1696–1705. Olivier Bodenreider. 2004. The unified medical lan- Sergey Feldman, Waleed Ammar, Kyle Lo, Elly Trep- guage system (umls): integrating biomedical termi- man, Madeleine van Zuylen, and Oren Etzioni. 2019. nology. Nucleic acids research, 32(suppl 1):D267– Quantifying sex bias in clinical studies at scale with D270. automated data extraction. JAMA network open, 2(7):e196700–e196700. Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Joel Escude´ Font and Marta R Costa-jussa.` 2019. Man is to computer programmer as woman is to Equalizing gender biases in neural machine trans- homemaker? debiasing word embeddings. In Ad- lation with word embeddings techniques. arXiv vances in neural information processing systems, preprint arXiv:1901.03116. pages 4349–4357. Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and Aylin Caliskan, Joanna J Bryson, and Arvind James Zou. 2018. Word embeddings quantify Narayanan. 2017. Semantics derived automatically 100 years of gender and ethnic stereotypes. Pro- from language corpora contain human-like biases. ceedings of the National Academy of Sciences, Science, 356(6334):183–186. 115(16):E3635–E3644. Kaytlin Chaloner and Alfredo Maldonado. 2019. Mea- Nabeel Gillani and Roger Levy. 2019. Simple dynamic suring gender bias in word embeddings across do- word embeddings for mapping perceptions in the mains and discovering new gender bias word cate- public sphere. In Proceedings of the Third Work- gories. In Proceedings of the First Workshop on shop on Natural Language Processing and Compu- Gender Bias in Natural Language Processing, pages tational Social Science, pages 94–99. 25–32. Anthony G Greenwald, Debbie E McGhee, and Jor- dan LK Schwartz. 1998. Measuring individual dif- Billy Chiu, Gamal Crichton, Anna Korhonen, and ferences in implicit cognition: the implicit associa- Sampo Pyysalo. 2016. How to train good word em- tion test. Journal of personality and social psychol- beddings for biomedical nlp. In Proceedings of the ogy, 74(6):1464. 15th workshop on biomedical natural language pro- cessing, pages 166–174. Amelia Gulliver, Kathleen M Griffiths, and Helen Christensen. 2010. Perceived barriers and facilita- I Glenn Cohen, Ruben Amarasingham, Anand Shah, tors to mental health help-seeking in young people: Bin Xie, and Bernard Lo. 2014. The legal and a systematic review. BMC psychiatry, 10(1):113. 
ethical concerns that arise from using complex pre- dictive analytics in health care. Health affairs, Maryam Habibi, Leon Weber, Mariana Neves, 33(7):1139–1147. David Luis Wiegandt, and Ulf Leser. 2017. Deep learning with word embeddings improves biomed- Amit Datta, Michael Carl Tschantz, and Anupam Datta. ical named entity recognition. Bioinformatics, 2015. Automated experiments on ad privacy set- 33(14):i37–i48. tings. Proceedings on Privacy Enhancing Technolo- gies, 2015(1):92–112. Katarina Hamberg. 2008. Gender bias in medicine. Women’s Health, 4(3):237–243. Scott Deerwester, Susan T Dumais, George W Fur- nas, Thomas K Landauer, and Richard Harshman. Cynthia M Hartung and Thomas A Widiger. 1998. 1990. Indexing by latent semantic analysis. Jour- Gender differences in the diagnosis of mental disor- nal of the American society for information science, ders: Conclusions and controversies of the dsm–iv. 41(6):391–407. Psychological bulletin, 123(3):260. Sunipa Dev and Jeff Phillips. 2019. Attenuating bias in Bin He, Yi Guan, and Rui Dai. 2019. Classifying med- word vectors. In The 22nd International Conference ical relations in clinical text via convolutional neural on Artificial Intelligence and Statistics, pages 879– networks. Artificial intelligence in medicine, 93:43– 887. 49.

10 Janae K Heath, Gary E Weissman, Caitlin B Clancy, Jeffrey Pennington, Richard Socher, and Christopher D Haochang Shou, John T Farrar, and C Jessica Dine. Manning. 2014. Glove: Global vectors for word rep- 2019. Assessment of gender-based linguistic differ- resentation. In Proceedings of the 2014 conference ences in physician trainee evaluations of medical fac- on empirical methods in natural language process- ulty using automated text mining. JAMA network ing (EMNLP), pages 1532–1543. open, 2(5):e193520–e193520. LA Pratt and DJ Brody. 2014. Depression and obe- Anita Holdcroft. 2007. Gender bias in research: how sity in the us adult household population, 2005-2010. does it affect evidence based medicine? Journal of NCHS data brief, (167):1–8. the Royal Society of Medicine, 100(1):2. Anthony Rios. 2020. FuzzE: Fuzzy fairness evaluation Matthew Kay, Cynthia Matuszek, and Sean A Munson. of offensive language classifiers on african-american 2015. Unequal representation and gender stereo- english. In Proceedings of the AAAI Conference on types in image search results for occupations. In Artificial Intelligence, volume 34. Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3819– Anthony Rios and Ramakanth Kavuluru. 2015. Convo- 3828. ACM. lutional neural networks for biomedical text classi- fication: application in indexing biomedical articles. Keyvan Khosrovian, Dietmar Pfahl, and Vahid Garousi. In Proceedings of the 6th ACM Conference on Bioin- 2008. Gensim 2.0: a customizable process simula- formatics, Computational Biology and Health Infor- tion model for software process evaluation. In Inter- matics, pages 258–267. national conference on software process, pages 294– 306. Springer. Arghavan Salles, Michael Awad, Laurel Goldin, Kelsey Krus, Jin Vivian Lee, Maria T Schwabe, and Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, Calvin K Lai. 2019. Estimating implicit and ex- and Yulia Tsvetkov. 2019. Quantifying social bi- plicit gender bias among health care professionals ases in contextual word representations. In 1st ACL and surgeons. JAMA network open, 2(7):e196545– Workshop on Gender Bias for Natural Language e196545. Processing. Morgan Klaus Scheuerman, Jacob M. Paul, and Jed R. Chandler May, Alex Wang, Shikha Bordia, Samuel Brubaker. 2019. How computers see gender: An Bowman, and Rachel Rudinger. 2019. On measur- evaluation of gender classification in commercial fa- ing social biases in sentence encoders. In Proceed- cial analysis services. Proc. ACM Hum.-Comput. In- ings of the 2019 Conference of the North American teract., 3(CSCW). Chapter of the Association for Computational Lin- guistics: Human Language Technologies, Volume 1 Latanya Sweeney. 2013. Discrimination in online ad (Long and Short Papers), pages 622–628. delivery. Queue, 11(3):10. Bridget T McInnes, Ted Pedersen, and Serguei VS Yanshan Wang, Sijia Liu, Naveed Afzal, Majid Pakhomov. 2009. Umls-interface and umls- Rastegar-Mojarad, Liwei Wang, Feichen Shen, Paul similarity: open source software for measuring paths Kingsbury, and Hongfang Liu. 2018. A comparison and semantic similarity. In AMIA annual sympo- of word embeddings for the biomedical natural lan- sium proceedings, volume 2009, page 431. Ameri- guage processing. Journal of biomedical informat- can Medical Informatics Association. ics, 87:12–20. Mark S Micale. 1993. On the” disappearance” of hys- Daniel Weisz, Michael K Gusmano, and Victor G Rod- teria: A study in the clinical deconstruction of a di- win. 2004. 
Gender and the treatment of heart dis- agnosis. Isis, 84(3):496–526. ease in older persons in the united states, france, and england: a comparative, population-based view of Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey a clinical phenomenon. Gender medicine, 1(1):29– Dean. 2013a. Efficient estimation of word represen- 40. tations in vector space. ICLR Workshop. Terry Young, Rebecca Hutton, Laurel Finn, Safwan Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- Badr, and Mari Palta. 1996. The gender bias in rado, and Jeff Dean. 2013b. Distributed representa- sleep apnea diagnosis: are women missed because tions of words and phrases and their compositional- they have different symptoms? Archives of internal ity. In Advances in neural information processing medicine, 156(21):2445–2451. systems, pages 3111–3119. Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cot- World Health Organization. 2013. Gender Disparities terell, Vicente Ordonez, and Kai-Wei Chang. 2019. in Mental Health. World Health Organization. Gender bias in contextualized word embeddings. arXiv preprint arXiv:1904.03310 Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Re- . ducing gender bias in abusive language detection. Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai- In Proceedings of the 2018 Conference on Empiri- Wei Chang. 2018. Learning gender-neutral word cal Methods in Natural Language Processing, pages embeddings. In Proc. of EMNLP, pages 4847–4853. 2799–2804.

11 A 1960s Most Biased Words depression, depressive, derealization, dermatillo- mania, desynchronosis, deux, developmental, dio- Male: genes, disease, disorder, dissociative, dyscalcu- lia, dyskinesia, dyslexia, dysmorphic, eating, ejac- physician • ulation, ekbom, encephalitis, encopresis, enure- doctor sis, epilepsy, episode, erectile, erotomania, exhi- • bitionism, factitious, fantastica, fetishism, fregoli, president • fugue, functioning, gambling, ganser, grandiose, dentist hallucinogen, hallucinosis, histrionic, huntington, • psychiatrist hyperactivity, hypersomnia, hypnotic, hypochon- • driasis, hypomanic, hysteria, ideation, identity, surgeon • impostor, induced, infancy, insomnia, intellec- student tual, intermittent, intoxication, kleptomania, ko- • rsakoff, lacunar, lethargica, love, major, maladap- nurse • tive, malingering, mania, mathematics, megalo- worker mania, melancholia, misophonia, mood, mun- • chausen, mutism, narcissistic, narcolepsy, nervosa, professor • neurocysticercosis, neurodevelopmental, nicotine, nightmare, nos, obsessive, obsessive–compulsive, Female: ocd, ocpd, oneirophrenia, opioid, oppositional, or- substitute thorexia, pain, panic, paralysis, paranoid, parasito- • sis, parasomnia, parkinson, partialism, pathologi- principal • cal, pdd, perception, persecutory, personality, per- editor vasive, phencyclidine, phobia, phobic, phonolog- • ical, physical, pica, polysubstance, posttraumatic, baker • pseudologia, psychogenic, psychosis, psychotic, character ptsd, pyromania, reactive, residual, retrograde, ru- • mination, schizoaffective, schizoid, schizophrenia, author • schizophreniform, schizotypal, seasonal, sedative, pharmacist selective, separation, sexual, sleep, sleepwalking, • scientist social, sociopath, somatic, somatization, somato- • form, stereotypic, stockholm, stress, stuttering, sub- therapist • stance, suicidal, suicide, tardive, terror, tic, tourette, teacher transient, transvestic, tremens, trichotillomania, tru- • man, withdrawal, wonderland] B Mental Health-Related Terms C Occupations [abuse, acute, adaptation, adjustment, adoles- cent, adult, affective, agoraphobia, alcohol, alco- [detective, ambassador, coach, officer, epidemiol- holic, alzheimer, amnesia, amnestic, amphetamine, ogist, rabbi, ballplayer, secretary, actress, man- anorexia, anosognosia, anterograde, antisocial, anx- ager, scientist, cardiologist, actor, industrial- iety, anxiolytic, asperger, atelophobia, attachment, ist, welder, biologist, undersecretary, captain, attention, atypical, autism, autophagia, avoidant, economist, politician, baron, pollster, environmen- avoidant, restrictive, barbiturate, behavior, benzodi- talist, photographer, mediator, character, housewife, azepine, bereavement, bibliomania, binge, bipolar, jeweler, physicist, hitman, geologist, painter, em- body, borderline, brief, bulimia, caffeine, cannabis, ployee, stockbroker, footballer, tycoon, dad, pa- capgras, catalepsy, catatonia, catatonic, childhood, trolman, chancellor, advocate, bureaucrat, strate- circadian, claustrophobia, cocaine, cognitive, com- gist, pathologist, psychologist, campaigner, magis- munication, compulsive, condition, conduct, con- trate, judge, illustrator, surgeon, nurse, mission- version, coordination, cotard, cyclothymia, day- ary, stylist, solicitor, scholar, naturalist, artist, dreaming, defiant, deficit, delirium, delusion, delu- mathematician, businesswoman, investigator, cura- sional, delusions, dependence, depersonalization, tor, soloist, servant, broadcaster, 
fisherman, land-

12 lord, housekeeper, crooner, archaeologist, teenager, rian, citizen, worker, pastor, serviceman, filmmaker, councilman, attorney, choreographer, principal, sportswriter, poet, dentist, statesman, minister, der- parishioner, therapist, administrator, skipper, aide, matologist, technician, nun, instructor, alderman, chef, gangster, astronomer, educator, lawyer, mid- analyst, chaplain, inventor, lifeguard, bodyguard, fielder, evangelist, novelist, senator, collector, goal- bartender, surveyor, consultant, athlete, cartoonist, keeper, singer, acquaintance, preacher, trumpeter, negotiator, promoter, socialite, architect, mechanic, colonel, trooper, understudy, paralegal, philoso- entertainer, counselor, janitor, firebrand, sports- pher, councilor, violinist, priest, cellist, hooker, man, anthropologist, performer, crusader, envoy, jurist, commentator, gardener, journalist, warrior, trucker, publicist, commander, professor, critic, co- cameraman, wrestler, hairdresser, lawmaker, psy- median, receptionist, financier, valedictorian, in- chiatrist, clerk, writer, handyman, broker, boss, spector, steward, confesses, bishop, shopkeeper, lieutenant, neurosurgeon, protagonist, sculptor, ballerina, diplomat, parliamentarian, author, sociol- nanny, teacher, homemaker, cop, planner, la- ogist, photojournalist, guitarist, butcher, mobster, borer, programmer, philanthropist, waiter, barrister, drummer, astronaut, protester, custodian, maestro, trader, swimmer, adventurer, monk, bookkeeper, pianist, pharmacist, chemist, pediatrician, lecturer, radiologist, columnist, banker, neurologist, bar- foreman, cleric, musician, cabbie, fireman, farmer, ber, policeman, assassin, marshal, waitress, artiste, headmaster, soldier, carpenter, substitute, director, playwright, electrician, student, deputy, researcher, cinematographer, warden, marksman, congress- caretaker, ranger, lyricist, entrepreneur, sailor, man, prisoner, librarian, magician, screenwriter, dancer, composer, president, dean, comic, medic, provost, saxophonist, plumber, correspondent, or- legislator, salesman, observer, pundit, maid, arch- ganist, baker, doctor, constable, treasurer, superin- bishop, firefighter, vocalist, tutor, proprietor, restau- tendent, boxer, physician, infielder, businessman, rateur, editor, saint, butler, prosecutor, sergeant, protege] realtor, commissioner, narrator, conductor, histo-

13 Sequence-to-Set Semantic Tagging for Complex Query Reformulation and Automated Text Categorization in Biomedical IR using Self-Attention

Manirupa Das†, Juanxi Li†, Eric Fosler-Lussier†, Simon Lin‡, Steve Rust‡, Yungui Huang‡ & Rajiv Ramnath†
The Ohio State University† & Nationwide Children's Hospital‡
{das.65, li.8767, fosler.1, ramnath.6}@osu.edu
{Simon.Lin, Steve.Rust, Yungui.Huang}@nationwidechildrens.org

Abstract

Novel contexts, comprising a set of terms referring to one or more concepts, may often arise in complex querying scenarios such as in evidence-based medicine (EBM) involving biomedical literature. These may not explicitly refer to entities or canonical concept forms occurring in a fact-based knowledge source, e.g. the UMLS ontology. Moreover, hidden associations between related concepts, meaningful in the current context, may not exist within a single document, but across documents in the collection. Predicting semantic concept tags of documents can therefore serve to associate documents related in unseen contexts, or to categorize them, in information filtering or retrieval scenarios. Thus, inspired by the success of sequence-to-sequence neural models, we develop a novel sequence-to-set framework with attention, for learning document representations in a unique unsupervised setting, using no human-annotated document labels or external knowledge resources and only corpus-derived term statistics to drive the training. This can effect term transfer within a corpus for semantically tagging a large collection of documents. Our sequence-to-set modeling approach to predict semantic tags gives, to the best of our knowledge, the state-of-the-art for both an unsupervised query expansion (QE) task for the TREC CDS 2016 challenge dataset when evaluated on an Okapi BM25–based document retrieval system, and over the MLTM system baseline (Soleimani and Miller, 2016) for both supervised and semi-supervised multi-label prediction tasks with the del.icio.us and Ohsumed datasets. We make our code and data publicly available.¹

¹ https://github.com/mcoqzeug/seq2set-semantic-tagging

1 Introduction

Recent times have seen an upsurge in efforts towards personalized medicine, where clinicians tailor their medical decisions to the individual patient, based on the patient's genetic information, other molecular analysis, and the patient's preference. This often requires them to combine clinical experience with evidence from scientific research, such as that available from biomedical literature, in a process known as evidence-based medicine (EBM). Finding the most relevant recent research, however, is challenging not only due to the volume and the pace at which new research is being published, but also due to the complex nature of the information need, arising, for example, out of a clinical note which may be used as a query. This calls for better automated methods for natural language understanding (NLU), e.g. to derive a set of key terms or related concepts helpful in appropriately transforming a complex query by reformulation, so as to be able to handle and possibly resolve medical jargon, lesser-used acronyms, misspellings, multiple subject areas and often multiple references to the same entity or concept, and retrieve the most related, yet most comprehensive, set of useful results.

At the same time, tremendous strides have been made by recent neural machine learning models in reasoning with texts on a wide variety of NLP tasks. In particular, sequence-to-sequence (seq2seq) neural models, often employing attention mechanisms, have been largely successful in delivering the state-of-the-art for tasks such as machine translation (Bahdanau et al., 2014), (Vaswani et al., 2017), handwriting synthesis (Graves, 2013), image captioning (Xu et al., 2015), speech recognition (Chorowski et al., 2015) and document summarization (Cheng and Lapata, 2016). Inspired by these successes, we aimed to harness the power of sequential encoder-decoder architectures with attention, to train end-to-end differentiable models that are able to learn the best possible representation of input documents in a collection while being predictive of a set of key terms that best describe the docu-

14 Proceedings of the BioNLP 2020 workshop, pages 14–27 Online, July 9, 2020 c 2020 Association for Computational Linguistics ment. These will be later used to transfer a relevant setting, to select most relevant terms. (Palangi et al., but diverse set of key terms from the most related 2016) employ a deep sentence embedding approach documents, for “semantic tagging” of the original using LSTMs and show improvement over standard input documents so as to aid in downstream query sentence embedding methods, but as a means for refinement for IR by pseudo-relevance feedback directly deriving encodings of queries and docu- (Xu and Croft, 2000). ments for use in IR, and not as a method for QE To this end and to the best of our knowledge, we by PRF. In another approach, (Xu et al., 2017) are the first to employ a novel, completely unsu- train autoencoder representations of queries and pervised end-to-end neural attention-based docu- documents to enrich the feature space for learning- ment representation learning approach, using no to-rank, and show gains in retrieval performance external labels, in order to achieve the most mean- over pre-trained rankers. But this is a fully super- ingful term transfer between related documents, vised setup where the queries are seen at train time. i.e. semantic tagging of documents, in a “pseudo- (Pfeiffer et al., 2018) also use an autoencoder-based relevance feedback”–based (Xu and Croft, 2000) approach for actual query refinement in pharma- setting for unsupervised query expansion. This may cogenomic document retrieval. However here too, also be seen as a method of document expansion their document ranking model uses the encoding of as a means for obtaining query refinement terms the query and the document for training the ranker, for downstream IR. The following sections give an hence the queries are not unseen with respect to the account of our specific architectural considerations document during training. They mention that their in achieving an end-to-end neural framework for work can be improved upon by the use of seq2seq- semantic tagging of documents using their repre- based approaches. In this sense, i.e. with respect sentations, and a discussion of the results obtained to QE by PRF and learning a sequential document from this approach. representation for document ranking, our work is most similar to (Pfeiffer et al., 2018). However the 2 Related Work queries are completely unseen in our case and we use only the documents in the corpus, to train our Pseudo-relevance feedback (PRF), a local context neural document language models from scratch in analysis method for automatic query expansion a completely unsupervised way. (QE), is extensively studied in information retrieval (IR) research as a means of addressing the word Classic sequence-to-sequence models like mismatch between queries and documents. It ad- (Sutskever et al., 2014) demonstrate the strength justs a query relative to the documents that initially of recurrent models such as the LSTM in captur- appear to match it, with the main assumption that ing short and long range dependencies in learn- the top-ranked documents in the first retrieval result ing effective encodings for the end task. Works contain many useful terms that can help discrimi- such as (Graves, 2013), (Bahdanau et al., 2014), nate relevant documents from irrelevant ones (Xu (Rocktaschel¨ et al., 2015), further stress the key and Croft, 2000), (Cao et al., 2008). 
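As a concrete illustration of the PRF procedure summarized here and elaborated in the next paragraph (retrieve, assume the top-ranked documents are relevant, and add their common terms to the query), a minimal sketch follows. The tokenization, stopword list, and the upstream first-pass ranker (e.g. a BM25 index) are assumptions of this sketch, not the systems discussed in this section.

```python
from collections import Counter

def prf_expand(query_terms, ranked_docs, k=10, n_expansion_terms=5, stopwords=frozenset()):
    """Pseudo-relevance feedback: treat the top-k documents returned by a first-pass
    retriever as relevant and append their most frequent unseen terms to the query.
    `ranked_docs` is a list of tokenized documents, already sorted by retrieval score."""
    counts = Counter()
    query_set = set(query_terms)
    for doc_tokens in ranked_docs[:k]:
        counts.update(t for t in doc_tokens if t not in stopwords and t not in query_set)
    expansion = [term for term, _ in counts.most_common(n_expansion_terms)]
    return list(query_terms) + expansion
```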
It is moti- role that attention, and multi-headed attention vated by relevance feedback (RF), a well-known (Vaswani et al., 2017) can play in solving the end IR technique that modifies a query based on the task. We use these insights in our work. relevance judgments of the retrieved documents According to the detailed report provided for this (Salton et al., 1990). It typically adds common dataset and task in (Roberts et al., 2016) all of the terms from the relevant documents to a query and systems described perform direct query reweight- re-weights the expanded query based on term fre- ing aside from supervised term expansion and quencies in the relevant documents relative to the are highly tuned to the clinical queris in this dataset. non-relevant ones. Thus in PRF we find an ini- In a related medical IR challenge (Roberts et al., tial set of most relevant documents, then assuming 2017) the authors specifically mention that with that the top k ranked documents are relevant, RF only six partially annotated queries for system de- is done as before, without manual interaction by velopment, it is likely that systems were either the user. The added terms are, therefore, common under- or over-tuned on these queries. Since the terms from the top-ranked documents. setup of the seq2set framework is an attempt to To this end, (Cao et al., 2008) employ term classi- model the PRF based query expansion method of fication for retrieval effectiveness, in a “supervised” its closest related work (Das et al., 2018) where

15 the effort is also to train a neural generalized lan- 3.1 The Sequence-to-Set Semantic Tagging guage model for unsupervised semantic tagging, Framework we choose this system as the benchmark to com- Task Definition: For each query document dq pare against to our end-to-end approach for the in a given a collection of documents D = same task. d , d , ..., d , represented by a set of k key- { 1 2 N } words or labels, e.g. k terms in dq derived from top V TFIDF-scored terms, find an alternate set −| | 3 Methodology of k most relevant terms coming from documents “most related” to dq from elsewhere in the collec- tion. These serve as semantic tags for expanding Drawing on sequence-to-sequence modeling ap- dq. proaches for text classification, e.g. textual entail- In the unsupervised task setting described later, ment (Rocktaschel¨ et al., 2015) and machine trans- a document to be tagged is regarded as a query lation (Sutskever et al., 2014), (Bahdanau et al., document d ; its semantic tags are generated via 2014) we adapt from these settings into a sequence- q PRF, and these terms will in turn be used for PRF– to-set framework, for learning representations of based expansion of unseen queries in downstream input documents, in order to derive a meaningful IR. Thus d could represent an original complex set of terms, or semantic tags drawn from a closely q query text or a document in the collection. related set of documents, that expand the origi- In the following sections we describe the build- nal documents. These document expansion terms ing blocks used in the setup for the baseline and are then used downstream for query reformulation proposed models for sequence-to-set semantic tag- via PRF, for unseen queries. We employ an end- ging as described in the task definition. to-end framework for unsupervised representation pseudo- learning of documents using TFIDF-based 3.2 Training and Inference Setup labels (Figure1(a))and a separate cosine similarity- based ranking module for semantic tag inference The overall architecture for sequence-to-set seman- (Figure1(b)). tic tagging consists of two phases, as depicted in the block diagrams in Figures1(a) and1(b): the We employ various methods such as doc2vec, first, for training of input representations of doc- Deep Averaging, sequential models such as LSTM, uments; and the second for inference to achieve GRU, BiGRU, BiLSTM, BiLSTM with Attention term transfer for semantic tagging. As shown in and Self-attention, detailed in Figure1(c)-(f), see Figure1(a), the proposed model architecture would AppendixA, for learning fixed-length input docu- first learn the appropriate feature representations ment representations in our framework. We apply of documents in a first pass of training, by taking methods like DAN (Iyyer et al., 2015), LSTM, and in the tokens of an input document sequentially, BiLSTM as our baselines and formulate attentional using a document’s pre-determined top k TFIDF- − models including a self-attentional Transformer- scored terms as the pseudo-class labels for an input based one (Vaswani et al., 2017) as our proposed instance, i.e. prediction targets for a sigmoid layer augmented document encoders. for multi-label classification. The training objec- Further, we hypothesize that a sequential, bi- tive is to maximize probability for these k terms, or y = (t , t , ..t ) V , i.e. directional or attentional encoder coupled with a p 1 2 k ∈ decoder, i.e. 
a sigmoid or softmax prediction layer, that conditions on the encoder output v (similar to arg max P (yp = (t1, t2, ..tk) V v; θ) (1) θ ∈ | an approach by (Kiros et al., 2015) for learning a neural probabilistic language model), would en- given the document’s encoding v. For computa- able learning of the optimal semantic tags in our tional efficiency, we take V to be the list of top- unsupervised query expansion setting, while mod- 10K TFIDF-scored terms from our corpus, thus eling directly for this task in an end-to-end neural V = 10, 000. k is taken as 3, so each document is | | framework. In our setup the decoder predicts a initially labeled with 3 terms. The sequential model meaningful set of concept tags that best describe a is then trained with the k-hot 10K-dimensional la- document according to the training objective. The bel vector as targets for the sigmoid classification following sections describe our setup. layer, employing a couple of alternative training

Figure 1: Overview of Sequence-to-Set Framework. (a) Method for training document or query representations; (b) Method for inference via term transfer for semantic tagging; Document Sequence Encoders: (c) Deep Averaging encoder; (d) LSTM last hidden state, GRU encoders; (e) BiLSTM last hidden state, BiGRU (shown in dotted box), BiLSTM attended hidden states encoders; and (f) Transformer self-attentional encoder [source: (Alammar, 2018)].

objectives. The first, typical for multi-label classification, minimizes a categorical cross-entropy loss, which for a single training instance with ground-truth label set y_p is:

L_{CE}(\hat{y}_p) = -\sum_{i=1}^{|V|} y_i \log(\hat{y}_i)    (2)

Since our goal is to obtain the most meaningful document representations, most predictive of their assigned terms and also able to predict semantic tags not present in the document, we also consider a language model–based loss objective, converting our decoder into a neural language model. Thus, we employ a training objective that maximizes the conditional log likelihood of the label terms L_d of a document d_q, given the document's representation v, i.e. P(L_d | d_q) (where y_p = L_d ∈ V). This amounts to minimizing the negative log likelihood of the label representations conditioned on the document encoding. Thus,

P(L_d \mid d_q) = \prod_{l \in L_d} P(l \mid d_q), \qquad -\log P(L_d \mid d_q) = -\sum_{l \in L_d} \log P(l \mid d_q)    (3)

Since P(l \mid d_q) \propto \exp(v_l \cdot v), where v_l and v are the label and document encodings, it is equivalent to minimizing:

L_{LM}(\hat{y}_p) = -\sum_{l \in L_d} \log\big(\exp(v_l \cdot v)\big)    (4)

Equation (4) represents our language model–style loss objective. We run experiments training with both losses (Equations (2) & (4)), as well as a variant that is a summation of both, with a hyper-parameter α used to tune the language model component of the total loss objective.
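A minimal NumPy/scikit-learn sketch of the pseudo-labeling and the two losses just described (Equations (2) and (4)) is given below. The helper names and the TFIDF vectorizer settings are illustrative; the authors' implementation is a neural training pipeline, and this sketch only mirrors the objectives term by term.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_pseudo_labels(corpus, k=3, vocab_size=10000):
    """Top-k TFIDF-scored terms per document, used as the k-hot pseudo-label set."""
    vec = TfidfVectorizer(max_features=vocab_size)
    X = vec.fit_transform(corpus)                     # sparse (N, |V|) matrix
    terms = np.array(vec.get_feature_names_out())
    labels = []
    for i in range(X.shape[0]):
        row = X.getrow(i).toarray().ravel()
        labels.append(terms[np.argsort(row)[::-1][:k]].tolist())
    return labels

def cross_entropy_loss(y_true, y_hat, eps=1e-12):
    """Eq. (2): categorical cross-entropy against the k-hot pseudo-label vector."""
    return -float(np.sum(y_true * np.log(y_hat + eps)))

def lm_style_loss(doc_vec, label_vecs):
    """Eq. (4): -sum_l log(exp(v_l . v)), which reduces to -sum_l v_l . v."""
    return -float(sum(np.dot(v_l, doc_vec) for v_l in label_vecs))

def total_loss(y_true, y_hat, doc_vec, label_vecs, alpha=0.5):
    """Summation variant, with alpha weighting the language-model component."""
    return cross_entropy_loss(y_true, y_hat) + alpha * lm_style_loss(doc_vec, label_vecs)
```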

17 4 Task Settings the Summary Text + Sum. UMLS terms as the “augmented” query. We use the Adam op- 4.1 Unsupervised Task – Semantic Tagging timizer (Kingma and Ba, 2014) for training our for Query Expansion models. After several rounds of hyper-paramater We now describe the setup and results for experi- tuning, batch size was set to 128, dropout to 0.3, ments run on our unsupervised task setting of se- the prediction layer was fixed to sigmoid, the loss mantic tagging of documents for PRF–based query function switched between cross-entropy and sum- expansion. mation of cross entropy and LM losses, and models trained with early stopping. 4.1.1 Dataset – TREC CDS 2016 Results from various Seq2Set encoder models The 2016 TREC CDS challenge dataset, makes on base (Summary Text) and augmented available actual electronic health records (EHR) (Summary Text + Summary-based of patients (de-identified), in the form of case re- UMLS terms) query, are outlined in Table1. ports, typically describing a challenging medical Evaluating on base query, a Seq2Set-Transformer case. Such a case report represents a query in model beats all other Seq2Set encoders, and our system, having a complex information need. also the TFIDF, MeSH QE terms and Expert There are 30 queries in this dataset, corresponding QE terms baselines. On the augmented query, to such case reports, at 3 levels of granularity Note, the Seq2Set-BiGRU and Seq2Set-Transformer Description and Summary text as described in models outperform all other Seq2Set encoders, (Roberts et al., 2016). The target document collec- and Seq2Set-Transformer outperforms all non- tion is the Open Access Subset of PubMed Central ensemble baselines and the Phrase2VecGLM (PMC), containing 1.25 million articles consisting unsupervised QE ensemble system baseline of title, keywords, abstract and body sections. significantly, with P@10 of 0.4333. Best per- We use a subset of 100K of these articles for which forming supervised QE systems for this dataset, human relevance judgments are made available by tuned on all 30 queries, range between 0.35– TREC, for training. Final evaluation however is 0.4033 P@10 (Roberts et al., 2016), better than done on an ElasticSearch index built on top of the unsupervised QE systems on base query, but entire collection of 1.25 million PMC articles. surpassed by the best Seq2Set-based models such as Seq2Set-Transformer on augmented query, 4.1.2 Unsupervised Task Experiments even without ensemble. Semantic tags from a We ran several sets of experiments with vari- best-performing model, do appear to pick terms ous document encoders, employing pre-trained relating to certain conditions, e.g.: . Phrase2VecGLM (Das et al., 2018), the only other known system for this dataset, that performs “unsu- 4.2 Supervised Task – Automated Text pervised semantic tagging of documents by PRF”, Categorization for downstream query expansion. Thus we take this system as the current state-of-the-art system The Seq2set framework’s unsupervised semantic baseline, while our non-attention-based document tagging setup is primarily applicable in those set- encoding models constitute our standard baselines. tings where no pre-existing document labels are Our document-TFIDF representations–based query available. In such a scenario, of unsupervised se- expansion forms yet another baseline. 
Summary mantic tagging of a large document collection, the text UMLS (Lindberg et al., 1993; Bodenreider, Seq2set framework therefore consists of separate 2004) terms for use in our augmented models is training and inference steps to infer tags from other available to us via the UMLS Java Metamap API documents after encodings have been learnt. We (Demner-Fushman et al., 2017). The first was a therefore conduct a series of extensive evaluations set of experiments with our different models using in the manner described in the previous section, the Summary Text as the base query. Follow- using a downstream QE task in order to validate ing this we ran experiments with our models using our method. However, when a tagged document

Unsupervised QE Systems (Base Query) — P@10
BM25+Seq2Set-doc2vec (baseline): 0.0794
BM25+Seq2Set-TFIDF Terms (baseline): 0.2000
BM25+MeSH QE Terms (baseline): 0.2294
BM25+Human Expert QE Terms (baseline): 0.2511
BM25+unigramGLM+Phrase2VecGLM ensemble (system baseline): 0.2756
BM25+Seq2Set-Transformer (L_CE) (model): 0.2861*

Supervised QE Systems (Base Query) — P@10
BM25+ETH Zurich-ETHSummRR: 0.3067
BM25+Fudan Univ. DMIIP-AutoSummary1: 0.4033

Unsupervised QE Systems (Augmented Query) — P@10
BM25+Seq2Set-doc2vec (baseline): 0.1345
BM25+Seq2Set-TFIDF Terms (baseline): 0.3000
BM25+unigramGLM+Phrase2VecGLM ensemble (system baseline): 0.3091
BM25+Seq2Set-BiGRU (LM only loss) (model): 0.3333*
BM25+Seq2Set-Transformer (L_CE + L_LM) (model): 0.4333*

Figure 2: Seq2Set–supervised on del.icio.us, best Transformer model–based encoder

Table 1: Results on IR for best Seq2set models, in an unsupervised PRF–based QE setting. Boldface indi- cates statistical significance @p<<0.01 over previous. collection is available where the set of document labels are already known, we can learn to predict tags from this set of known labels on a new set of Figure 3: A comparison of document labeling perfor- similar documents. Thus, in order to generalize our mance of Seq2set versus MLTM Seq2set approach to such other tasks and setups, we therefore aim to validate the performance of our framework on such a labeled dataset of tagged doc- uments, which is equivalent to adapting the Seq2set 4.2.2 Supervised Task Experiments framework for a supervised setup. In this setup we We then run Seq2set-based training for our 8 dif- therefore only need to use the training module of ferent encoder models on the training set for the the Seq2set framework shown in Figure1(a), and 20 labels, and perform evaluation on the test set measure tag prediction performance on a held out measuring sentence-level ROC AUC on the labeled set of documents. For this evaluation, we there- documents in the test set. fore choose to work with the popular Delicious (del.icio.us) folksonomy dataset, same as that used Figure2 shows the ROC AUC for the best by (Soleimani and Miller, 2016) in order to do an performing Transformer model from the Seq2set appropriate comparison with their MLTM frame- framework on the del.icio.us dataset, which was work that is also evaluated on a similar document trained with a sigmoid–based prediction layer on multi-label prediction task. cross entropy loss with a batch size of 64 and dropout set to 0.3. This best model got an ROC 4.2.1 Dataset – del.icio.us AUC of 0.85 , statistically significantly surpassing The Delicious dataset contains tagged web MLTM (AUC 0.81 @ p << 0.001) for this task pages retrieved from the social bookmark- and dataset. ing site, del.icio.us. There are 20 com- Figure3 also shows a comparison of the ROC mon tags used as class labels: reference, AUC scores obtained with training Seq2set and design, programming, internet, computer, MLTM based models for this task with various web, java, writing, English, grammar, labeled data proportions. Here again we see style, language, books, education, philosophy, that Seq2set has clear advantage over the current politics, religion, science, history and culture. MLTM state-of-the-art, statistically significantly The training set consists of 8250 documents and surpassing it (p << 0.01) when trained with the test set consists of 4000 documents. greater than 25% of the labeled dataset.
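For completeness, the two evaluation measures reported in this section (P@10 for the retrieval results in Table 1 and ROC AUC for the multi-label comparison against MLTM) can be computed as in the sketch below; the argument names are ours and assume the rankings, binary label matrix, and sigmoid outputs are already in memory.

```python
from sklearn.metrics import roc_auc_score

def precision_at_k(ranked_doc_ids, relevant_ids, k=10):
    """P@10 as in Table 1: fraction of the top-k retrieved documents judged relevant."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / float(k)

def multilabel_auc(y_true, y_scores, average="macro"):
    """ROC AUC over the 20 del.icio.us tags; y_true is an (N, 20) binary matrix and
    y_scores holds the sigmoid outputs of the trained Seq2Set encoder."""
    return roc_auc_score(y_true, y_scores, average=average)
```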

19 based Seq2set model performing best. Next we experiment with the semi-supervised setting where we first train the Seq2set framework models on a large number of documents that do not have pre- existing labels. This pre-training is performed in exactly a similar fashion as the unsupervised setup. Thus we first preprocess the Ohsumed data from years 87-90 to obtain a top-1000 TFIDF score– based vocabulary of tags, pseudo-labeling all the documents in the training set with these. Our train- ing and evaluation for the semi-supervised setup Figure 4: Seq2Set–semi-supervised on Ohsumed, consists of 3 phases: Phase 1: We employ our best Transformer model–based encoder w/ Cross Entropy–based Softmax prediction; 4 layers, 10 atten- seq2set framework (using each one of our encoder tion heads, dropout=0 models) for multi-label prediction on this pseudo- labeled data, having an output prediction layer of 1000 having a penultimate fully-connected layer 4.3 Semi-Supervised Text Categorization of dimension 23, same as the number of labels in We then seek to further validate how well the the Ohsumed dataset; Phase 2: After pre-training Seq2set framework can leverage large scale pre- with pseudolabels we discard the final layer and training on unlabeled data given only a small continue to train labeled Ohsumed dataset from amount of labeled data for training, to be able to 91 by 5-fold cross-validation with early stopping. improve prediction performance on a held out set Phase 3: This is the final evaluation step of our of these known labels. This amounts to a semi- semi-supervised trained Seq2set model on the la- supervised setup–based evaluation of the Seq2set beled Ohsumed test dataset used by (Soleimani framework. In this setup, we perform the training and Miller, 2016). This constitutes simply infer- and evaluation of Seq2set similar to the supervised ring predicted tags using the trained model on the setup, except we have an added step of pre-training test data. As shown in Figure4, our evaluation of the multi-label prediction on large amounts of un- the Seq2set framework for the Ohsumed dataset, labeled document data in exactly the same way as comparing supervised and semi-supervised train- the unsupervised setup. ing setups, yields an ROC AUC of 0.94 for our best performing semi-supervised–trained model of Fig. 4.3.1 Dataset – Ohsumed 4, compared to the various supervised trained mod- We employ the Ohsumed dataset available from the els for the same dataset that got a best ROC AUC of TREC Information Filtering tracks of years 87-91 0.93. The top performing semi-supervised model and the version of the labeled Ohsumed dataset again involves a Transformer–based encoder using used by (Soleimani and Miller, 2016) for evalua- a softmax layer for prediction, with 4 layers, 10 tion, to have an appropriate comparison with their attention heads, and no dropout. Thus, the best re- MLTM system also evaluated for this dataset. The sults on the semi-supervised training experiments version of the Ohsumed dataset due to (Soleimani (ROC AUC 0.94) statistically significantly out- and Miller, 2016) consists of 11122 training and performs (p << 0.01) the MLTM system base- 5388 test documents, each assigned to one or multi- line (ROC AUC 0.90) on the Ohsumed dataset, ple labels of 23 MeSH diseases categories. Almost while also clearly surpassing the top-performing half of the documents have more than one label. supervised Seq2set models on the same dataset. 
This demonstrates that our Seq2set framework is 4.3.2 Semi-Supervised Task Experiments able to leverage the benefits of data augmentation We first train and test our framework on the la- in the semi-supervised setup by training with large beled subset of the Ohsumed data from (Soleimani amounts of unlabeled data on top of limited labeled and Miller, 2016) similar to the supervised setup data. described in the previous section. This evalua- tion gives a statistically significant ROC AUC of 0.93 over the 0.90 AUC for the MLTM system of (Soleimani and Miller, 2016) for a Transformer–

20 5 Conclusion References Mart´ın Abadi, Paul Barham, Jianmin Chen, Zhifeng We develop a novel sequence-to-set end-to- Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, end encoder-decoder–based neural framework for Sanjay Ghemawat, Geoffrey Irving, Michael Isard, multi-label prediction, by training document repre- et al. 2016. Tensorflow: A system for large-scale sentations using no external supervision labels, for machine learning. In 12th USENIX Symposium on Operating Systems Design{ and Implementation} pseudo-relevance feedback–based unsupervised se- ( OSDI 16), pages 265–283. mantic tagging of a large collection of documents. { } We find that in this unsupervised task setting of Jay Alammar. 2018. The illustrated transformer. On- PRF-based semantic tagging for query expansion, a line; posted June 27, 2018. multi-term prediction training objective that jointly Ben Athiwaratkun, Andrew Gordon Wilson, and An- optimizes both prediction of the TFIDF–based doc- ima Anandkumar. 2018. Probabilistic fasttext for ument pseudo-labels and the log likelihood of the multi-sense word embeddings. arXiv preprint arXiv:1806.02901. labels given the document encoding, surpasses pre- vious methods such as Phrase2VecGLM (Das et al., Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- 2018) that used neural generalized language mod- gio. 2014. Neural machine translation by jointly els for the same. Our initial hypothesis that bi- learning to align and translate. arXiv preprint arXiv:1409.0473. directional or self-attentional models could learn the most efficient semantic representations of doc- Olivier Bodenreider. 2004. The unified medical lan- uments when coupled with a loss more effective guage system (umls): integrating biomedical termi- nology. Nucleic acids research, 32(suppl 1):D267– than cross-entropy at reducing language model per- D270. plexity of document encodings, is corroborated in all experimental setups. We demonstrate the ef- Guihong Cao, Jian-Yun Nie, Jianfeng Gao, and fectiveness of our novel framework in every task Stephen Robertson. 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceed- setting, viz. for unsupervised QE via PRF-based ings of the 31st annual international ACM SIGIR semantic tagging for a downstream medical IR chal- conference on Research and development in infor- lenge task; as well as for both, supervised and mation retrieval, pages 243–250. ACM. semi-supervised task settings, where Seq2set sta- Jianpeng Cheng and Mirella Lapata. 2016. Neural sum- tistically significantly outperforms the state-of-art marization by extracting sentences and words. arXiv MLTM baseline (Soleimani and Miller, 2016) on preprint arXiv:1603.07252. the same held out set of documents as MLTM, for Jan K Chorowski, Dzmitry Bahdanau, Dmitriy multi-label prediction on a set of known labels, for Serdyuk, Kyunghyun Cho, and Yoshua Bengio. automated text categorization; achieving to the best 2015. Attention-based models for speech recogni- of our knowledge, the current state-of-the-art for tion. In Advances in neural information processing multi-label prediction on documents, with or with- systems, pages 577–585. out known labels. We therefore demonstrate the Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, effectiveness of our Sequence-to-Set framework and Yoshua Bengio. 2014. Empirical evaluation of for multi-label prediction, on any set of documents, gated recurrent neural networks on sequence model- applicable especially towards the automated cate- ing. 
arXiv preprint arXiv:1412.3555. gorization, filtering and semantic tagging for QE- Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, based retrieval, of biomedical literature for EBM. and Yoshua Bengio. 2015. Gated feedback recur- Future directions would involve experiments re- rent neural networks. In International Conference on Machine Learning, pages 2067–2075. placing TDIDF labels with more meaningful terms (using unsupervised term extraction) for query ex- Manirupa Das, Eric Fosler-Lussier, Simon Lin, So- pansion, initialization with pre-trained embeddings heil Moosavinasab, David Chen, Steve Rust, Yungui for the biomedical domain such as BlueBERT Huang, and Rajiv Ramnath. 2018. Phrase2vecglm: Neural generalized language model–based semantic (Peng et al., 2019) and BioBERT (Lee et al., 2020), tagging for complex query reformulation in medical and multi-task learning with closely related tasks ir. In Proceedings of the BioNLP 2018 workshop, such as biomedical Named Entity Recognition and pages 118–128. Relation Extraction to learn better document repre- Dina Demner-Fushman, Willie J Rogers, and Alan R sentations and thus more meaningful semantic tags Aronson. 2017. Metamap lite: an evaluation of of documents useful for downstream EBM tasks. a new java implementation of metamap. Journal

21 of the American Medical Informatics Association, Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. 24(4):841–844. Transfer learning in biomedical natural language processing: An evaluation of bert and elmo on ten Alex Graves. 2013. Generating sequences with benchmarking datasets. In Proceedings of the 2019 recurrent neural networks. arXiv preprint Workshop on Biomedical Natural Language Process- arXiv:1308.0850. ing (BioNLP 2019), pages 58–65.

Sepp Hochreiter and Jurgen¨ Schmidhuber. 1997. Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Long short-term memory. Neural computation, Gardner, Christopher Clark, Kenton Lee, and Luke 9(8):1735–1780. Zettlemoyer. 2018. Deep contextualized word repre- sentations. arXiv preprint arXiv:1802.05365. Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daume´ III. 2015. Deep unordered compo- Jonas Pfeiffer, Samuel Broscheit, Rainer Gemulla, and sition rivals syntactic methods for text classification. Mathias Goschl.¨ 2018. A neural autoencoder ap- In Proceedings of the 53rd Annual Meeting of the proach for document ranking and query refinement ACL, volume 1, pages 1681–1691. in pharmacogenomic information retrieval. In Pro- ceedings of the BioNLP 2018 workshop, pages 87– Yoon Kim. 2014. Convolutional neural net- 97. works for sentence classification. arXiv preprint ˇ arXiv:1408.5882. Radim Rehu˚rekˇ and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Cor- Diederik P Kingma and Jimmy Ba. 2014. Adam: A pora. In Proceedings of the LREC 2010 Workshop method for stochastic optimization. arXiv preprint on New Challenges for NLP Frameworks, pages 45– arXiv:1412.6980. 50, Valletta, Malta. ELRA. http://is.muni.cz/ publication/884893/en. Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, Kirk Roberts, Anupama E Gururaj, Xiaoling Chen, and Sanja Fidler. 2015. Skip-thought vectors. In Saeid Pournejati, William R Hersh, Dina Demner- Advances in neural information processing systems, Fushman, Lucila Ohno-Machado, Trevor Cohen, pages 3294–3302. and Hua Xu. 2017. Information retrieval for biomed- ical datasets: the 2016 biocaddie dataset retrieval Quoc V Le and Tomas Mikolov. 2014. Distributed rep- challenge. Database, 2017. resentations of sentences and documents. In ICML, Kirk Roberts, Ellen Voorhees, Dina Demner-Fushman, volume 14, pages 1188–1196. and William R. Hersh. 2016. Overview of the trec 2016 clinical decision support track. Online; posted Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, August-2016. Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. Biobert: a pre-trained biomed- Tim Rocktaschel,¨ Edward Grefenstette, Karl Moritz ical language representation model for biomedical Hermann, Toma´sˇ Kociskˇ y,` and Phil Blunsom. 2015. text mining. Bioinformatics, 36(4):1234–1240. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664. Donald AB Lindberg, Betsy L Humphreys, and Alexa T McCray. 1993. The unified medical lan- Gerard Salton, Chris Buckley, and Maria Smith. 1990. guage system. Yearbook of Medical Informatics, On the application of syntactic methodologies in au- 2(01):41–51. tomatic text analysis. Information Processing & Management, 26(1):73–92. Tomas Mikolov, Kai Chen, Greg Corrado, and Jef- frey Dean. 2013a. Efficient estimation of word Hossein Soleimani and David J Miller. 2016. Semi- representations in vector space. arXiv preprint supervised multi-label topic models for document arXiv:1301.3781. classification and sentence labeling. In Proceedings of the 25th ACM international on conference on in- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- formation and knowledge management, pages 105– rado, and Jeff Dean. 2013b. Distributed representa- 114. ACM. tions of words and phrases and their compositional- ity. In Advances in neural information processing Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. systems, pages 3111–3119. 
Sequence to sequence learning with neural networks. In Advances in neural information processing sys- Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, tems, pages 3104–3112. Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. 2016. Deep sentence embedding using Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob long short-term memory networks: Analysis and ap- Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz plication to information retrieval. IEEE/ACM Trans- Kaiser, and Illia Polosukhin. 2017. Attention is all actions on Audio, Speech and Language Processing you need. In Advances in Neural Information Pro- (TASLP), 24(4):694–707. cessing Systems, pages 5998–6008.

22 Bo Xu, Hongfei Lin, Yuan Lin, and Kan Xu. 2017. Learning to rank with query-level semi-supervised autoencoders. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Manage- ment, pages 2395–2398. ACM. Jinxi Xu and W Bruce Croft. 2000. Improving the ef- fectiveness of information retrieval with local con- text analysis. ACM Transactions on Information Sys- tems (TOIS), 18(1):79–112. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual atten- tion. In International conference on machine learn- ing, pages 2048–2057.

A Sequence-based Document Encoders

We describe below the different neural models that we use for the sequence encoder, as part of our encoder-decoder architecture for deriving semantic tags for documents.

A.1 doc2vec encoder

doc2vec is the unsupervised algorithm due to (Le and Mikolov, 2014) that learns fixed-length representations of variable-length documents, representing each document by a dense vector trained to predict surrounding words in contexts sampled from each document. We derive these doc2vec encodings by pre-training on our corpus. We then use them directly as features for inferring semantic tags per Figure 1(b), without training them within our framework against the loss objectives. We expect this to be a strong document-encoding baseline in capturing the semantics of documents. TFIDF Terms is our other baseline where we do not train within the framework but rather use the top-k neighbor documents' TFIDF pseudo-labels as the semantic tags for the query document.

A.2 Deep Averaging Network encoder

The Deep Averaging Network (DAN) for text classification due to (Iyyer et al., 2015), Figure 1(c), is formulated as a neural bag-of-words encoder model for mapping an input sequence of tokens X to one of k labels. v is the output of a composition function g, in this case averaging, applied to the sequence of word embeddings v_w for w ∈ X. For our multi-label classification problem, v is fed to a sigmoid layer to obtain scores for each independent classification. We expect this to be another strong document encoder given results in the literature, and it proves in practice to be.

A.2.1 LSTM and BiLSTM encoders

LSTMs (Hochreiter and Schmidhuber, 1997), by design, encompass memory cells that can store information for a long period of time and are therefore capable of learning and remembering over long and variable sequences of inputs. In addition to three types of gates, i.e. input, forget, and output gates, that control the flow of information into and out of these cells, LSTMs have a hidden state vector h_t^l and a memory vector c_t^l. At each time step, corresponding to a token of the input document, the LSTM can choose to read from, write to, or reset the cell using explicit gating mechanisms. Thus the LSTM is able to learn a language model for the entire document, encoded in the hidden state of the final timestep, which we use as the document encoding to give to the prediction layer. By the same token, owing to the bi-directional processing of its input, a BiLSTM-based document representation is expected to be even more robust at capturing document semantics than the LSTM, with respect to its prediction targets. Here, the document representation used for final classification is the concatenation of the hidden state outputs from the final step, [→h_t^l ; ←h_t^l], depicted by the dotted box in Fig. 1(e).

A.3 BiLSTM with Attention encoder

In addition, we also propose a BiLSTM with attention-based document encoder, where the output representation is the weighted combination of the concatenated hidden states at each time step. Thus we learn an attention-weighted representation at the final output as follows. Let X ∈ ℝ^(d×L) be a matrix consisting of the output vectors [h_1, ..., h_L] that the BiLSTM produces when reading the L tokens of the input document. Each word representation h_i is obtained by concatenating the forward and backward hidden states, i.e. h_i = [→h_i ; ←h_i]. d is the size of the embeddings and hidden layers. The attention mechanism produces a vector α of attention weights and a weighted representation r of the input, via:

    M = tanh(WX),        M ∈ ℝ^(d×L)    (5)
    α = softmax(w^T M),  α ∈ ℝ^L        (6)
    r = X α^T,           r ∈ ℝ^d        (7)

Here, the intermediate attention representation m_i (i.e. the i-th column vector in M) of the i-th word in the input document is obtained by applying a non-linearity to the matrix of output vectors X, and the attention weight for the i-th word in the input is the result of a weighted combination (parameterized by w) of the values in m_i. Thus r ∈ ℝ^d is the attention-weighted representation of the word and phrase tokens in an input document, used in optimizing the training objective in downstream multi-label classification, as shown by the final attended representation r in Figure 1(e).
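A minimal NumPy sketch of the attention computation in Equations 5-7 may make the shapes concrete; the variable names (X, W, w) follow the equations above, and the random initialization is purely illustrative, it is not the authors' implementation:

    import numpy as np

    d, L = 8, 5                      # hidden size and document length (illustrative)
    X = np.random.randn(d, L)        # BiLSTM outputs, one column per token
    W = np.random.randn(d, d)        # attention projection (learned in practice)
    w = np.random.randn(d)           # attention scoring vector (learned in practice)

    M = np.tanh(W @ X)               # Eq. 5: intermediate representation, shape (d, L)
    scores = w @ M                   # unnormalized per-token scores, shape (L,)
    alpha = np.exp(scores) / np.exp(scores).sum()   # Eq. 6: softmax over tokens
    r = X @ alpha                    # Eq. 7: attention-weighted document vector, shape (d,)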

A.4 GRU and BiGRU encoders

A Gated Recurrent Unit (GRU) is a type of recurrent unit in recurrent neural networks (RNNs) that aims at tracking long-term dependencies while keeping the gradients in a reasonable range. In contrast to the LSTM, a GRU has only 2 gates: a reset gate and an update gate. First proposed by (Chung et al., 2014), (Chung et al., 2015) to make each recurrent unit adaptively capture dependencies of different time scales, the GRU, however, does not have any mechanism to control the degree to which its state is exposed, exposing the whole state each time. In the LSTM unit, the amount of the memory content that is seen, or used by other units in the network, is controlled by the output gate, while the GRU exposes its full content without any control. Since the GRU has a simpler structure, models using GRUs generally converge faster than LSTMs, hence they are faster to train and may give better performance in some cases for sequence modeling tasks. The BiGRU has the same structure as the GRU except constructed for bi-directional processing of the input, depicted by the dotted box in Fig. 1(e).

A.5 Transformer self-attentional encoder

Recently, the Transformer encoder-decoder architecture due to (Vaswani et al., 2017), based on a self-attention mechanism in the encoder and decoder, has achieved the state of the art in machine translation tasks at a fraction of the computation cost. Based entirely on attention, and replacing the recurrent layers commonly used in encoder-decoder architectures with multi-headed self-attention, it has outperformed most previously reported ensembles on the task. Thus we hypothesize that this self-attention-based model could learn the most efficient semantic representations of documents for our unsupervised task. Since our models use tensorflow (Abadi et al., 2016), a natural choice was document representation learning using the Transformer model's available tensor2tensor API. We hoped to leverage, apart from the computational advantages of this model, the capability of capturing semantics over varying lengths of context in the input document, afforded by multi-headed self-attention, Figure 1(f). Self-attention is realized in this architecture by training 3 matrices, made up of vectors, corresponding to a Query vector, a Key vector and a Value vector for each token in the input sequence. The output of each self-attention layer is a summation of weighted Value vectors that passes on to a feed-forward neural network. Position-based encoding to replace recurrences helps to lend more parallelism to computations and make things faster. Multi-headed self-attention further lends the model the ability to focus on different positions in the input, with multiple sets of Query/Key/Value weight matrices, which we hypothesize should result in the most effective document representation, among all the models, for our downstream task.

A.6 CNN encoder

Inspired by the success of (Kim, 2014) in employing CNN architectures for achieving gains in NLP tasks, we also employ a CNN-based encoder in the seq2set framework. (Kim, 2014) train a simple CNN with a layer of convolution on top of pre-trained word vectors, with a sequence of length n embeddings concatenated to form a matrix input. Filters of different sizes, representing various context windows over neighboring words, are then applied to this input, over each possible window of words in the sequence, to obtain feature maps. This is followed by a max-over-time pooling operation to take the maximum value of the feature map as the feature corresponding to a particular filter. The model then combines these features to form a penultimate layer which is passed to a fully connected softmax layer whose output is the probability distribution over labels. In the case of seq2set, these features are passed to a sigmoid layer for the final multi-label prediction, using cross-entropy loss or a combination of cross-entropy and LM losses. We use filters of sizes 2, 3, 4 and 5. Like our other encoders, we fine-tune the document representations learnt.

B Embedding Algorithms Experimented with

We describe here the various algorithms used to train word embeddings for use in our models.

Skip-Gram word2vec: We generate word embeddings trained with the skip-gram model with negative sampling (Mikolov et al., 2013b), with dimension settings of 50 with a context window of 4, and also 300 with a context window of 5, using the gensim package² (Řehůřek and Sojka, 2010).

²https://radimrehurek.com/gensim/
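As an illustration of this pre-training step, the following gensim call trains skip-gram embeddings with negative sampling at one of the settings described above (50 dimensions, window of 4); the toy corpus and any hyperparameter not stated in the text (e.g. the negative-sample count) are placeholders, not the authors' exact configuration:

    from gensim.models import Word2Vec

    # `sentences` stands in for the tokenized corpus (a list of token lists).
    sentences = [["aspirin", "reduces", "inflammation"],
                 ["ibuprofen", "treats", "fever"]]

    model = Word2Vec(
        sentences,
        vector_size=50,   # called `size` in gensim < 4.0
        window=4,         # context window of 4, per the 50-d setting above
        sg=1,             # skip-gram
        negative=5,       # negative sampling (count is an assumption)
        min_count=1,
    )
    print(model.wv["aspirin"][:5])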

Probabilistic FastText: The Probabilistic FastText (PFT) word embedding model of (Athiwaratkun et al., 2018) represents each word with a Gaussian mixture density, where the mean of a mixture component, given by the sum of n-grams, can capture multiple word senses, sub-word structure, and uncertainty information. This model outperforms the n-gram averaging of FastText, achieving state-of-the-art performance on several word similarity and disambiguation benchmarks. The probabilistic word representations, with flexible sub-word structures, can achieve multi-sense representations that also give rich semantics for rare words. This makes them very suitable for generalizing to rare and out-of-vocabulary words, motivating us to opt for PFT-based word vector pre-training³ over regular FastText.

³https://github.com/benathi/multisense-prob-fasttext

ELMo: Another consideration was to use embeddings that can explicitly capture the language model underlying sentences within a document. ELMo (Embeddings from Language Models) word vectors (Peters et al., 2018) presented such a choice, where the vectors are derived from a bidirectional LSTM trained with a coupled language model (LM) objective on a large text corpus. The representations are a function of all of the internal layers of the biLM. Using linear combinations of the vectors derived from each internal state has shown marked improvements on various downstream NLP tasks, because the higher-level LSTM states capture context-dependent aspects of word meaning (e.g., they can be used without modification to perform well on supervised word sense disambiguation tasks) while lower-level states model aspects of syntax. Using the API⁴ we generate ELMo embeddings fine-tuned for our corpus with dimension settings of 50 and 100, using only the top-layer final representations. A discussion of the results from each set of experiments is outlined in the following section and summarized in Table 1.

⁴https://allennlp.org/elmo

C Experimental Considerations and Hyperparameter Settings

Of the metrics available, P@10 gives the number of relevant items returned in the top-10 results and NDCG looks at the precision of the returned items at the correct rankings. For our particular dataset domain, the number of relevant results returned in the top 10 is more important, hence Table 1 reports results ranked in ascending order of P@10.

The PRF setting shown in the results table means that we take the top 10-15 documents returned by an ElasticSearch (ES) index for each of the 30 Summary Text queries in our dataset, and subsequently use the semantic tags assigned to each of these top documents as the terms for query expansion for the original query. We then re-run these expanded queries through the ES index to record the retrieval performance. Thus the queries our system is evaluated on are not seen at the time of training our models, but only during evaluation, hence it is unsupervised QE.

Similar to Das et al. (2018), for the feedback-loop based query expansion method, we had two separate human judgment-based baselines, one using the MeSH terms available from PMC for the top 15 documents returned in a first round of querying with Summary text, and the other based on human expert annotations of the 30 query topics, made available by the authors.

Since we had mixed results initially with our models, we explored various options to increase the training signal. The first was the use of neighbor documents' labels in our label vector for training. In this scheme, we used the 3-hot TFIDF label representation for each document to pick a list of n nearest neighbors to it. We then included the labels of those nearest documents in the label vector for the original document for use during training. We experimented with the choices 3, 5, 10, 15, 20, 25 for n. We observed improvements from incorporating neighbor labels.

Next we experimented with incorporating word neighbors into the input sequence for our models. We did this in two different ways: the first was to average all the neighbors and concatenate with the original token embedding, the other was to average all of the embeddings together. The word itself was always weighted more than the neighbors. This scheme also gave improvement.

Finally, we experimented with incorporating embeddings pre-trained by the latest state-of-the-art methods (Appendix B) as the input tokens for our models. After several rounds of hyper-parameter tuning, batch size was set to 128 and dropout to 0.3. We also performed a small grid search over the space of hyperparameters, with the number of hidden layers varied as 2, 3, 4 and α varied as [1.0, 10.0, 100.0, 1000.0, 10000.0], determining the best settings for each encoder.
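The grid described above is small enough to enumerate exhaustively; the sketch below shows the kind of loop involved, where train_and_evaluate is a hypothetical stand-in for training one encoder configuration and returning its validation P@10 (it is not part of the paper's code):

    from itertools import product

    def train_and_evaluate(num_layers, alpha):
        # Hypothetical placeholder: train the encoder with these settings
        # and return a validation P@10 score.
        return 0.0

    best = None
    for nl, alpha in product([2, 3, 4], [1.0, 10.0, 100.0, 1000.0, 10000.0]):
        score = train_and_evaluate(nl, alpha)
        if best is None or score > best[0]:
            best = (score, nl, alpha)

    print("best P@10 %.3f with nl=%d, alpha=%.1f" % best)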

A glossary of the acronyms and parameters used in training our models is as follows: sg=skip-gram; pft=Probabilistic FastText; elmo=ELMo; d=embedding dimension; kln=number of "neighbor documents'" labels; nl=number of hidden layers in the model; h=number of multi-attention heads; bs=batch size; dp=dropout; ep=number of epochs; α=weight parameter for the language model loss component.

Best-performing model settings: Our best-performing models on the base query were a Transformer encoder with 10 attention heads: nh = 10, loss: cross-entropy + LM loss with α = 1000.0, input embedding: 50-d pft, bs = 64 and dp = 0.3; and a GRU encoder for the ensemble with parameters, loss: LM-only loss with α = 1000.0, input embedding: 50-d pft, nl = 4, kln = 10, bs = 64 and dp = 0.2.

For the augmented query, our best-performing models were: (1) a BiGRU trained with parameters, loss function: LM-only loss with α = 1.0, input embedding: 50-d skip-gram, nl = 3, kln = 5, bs = 128 and dp = 0.3, and (2) a Transformer trained with parameters, loss function: cross-entropy + LM loss with α = 1000.0, input embedding: 50-d skip-gram, nl = 4, kln = 5, bs = 128 and dp = 0.3.

While we obtain significant improvement over the compared baselines with our best-performing models, we believe further gains are possible with a more targeted search through the parameter space.

Interactive Extractive Search over Biomedical Corpora

Hillel Taub-Tabib¹  Micah Shlain¹,²  Shoval Sadde¹  Dan Lahav³  Matan Eyal¹  Yaara Cohen¹  Yoav Goldberg¹,²
¹Allen Institute for AI, Tel Aviv, Israel
²Bar Ilan University, Ramat-Gan, Israel
³Tel Aviv University, Tel-Aviv, Israel
{hillelt,micahs}@allenai.org

Abstract

We present a system that allows life-science researchers to search a linguistically annotated corpus of scientific texts using patterns over dependency graphs, as well as using patterns over token sequences and a powerful variant of boolean keyword queries. In contrast to previous attempts at dependency-based search, we introduce a light-weight query language that does not require the user to know the details of the underlying linguistic representations, and instead allows them to query the corpus by providing an example sentence coupled with simple markup. Search is performed at interactive speed due to an efficient linguistic graph-indexing and retrieval engine. This allows for rapid exploration, development and refinement of user queries. We demonstrate the system using example workflows over two corpora: the PubMed corpus, including 14,446,243 PubMed abstracts, and the CORD-19 dataset¹, a collection of over 45,000 research papers focused on COVID-19 research. The system is publicly available at https://allenai.github.io/spike.

¹https://pages.semanticscholar.org/coronavirus-research

1 Introduction

Recent years have seen a surge in the amount of accessible Life Sciences data. Search engines like Google Scholar, Microsoft Academic Search or Semantic Scholar allow researchers to search for published papers based on keywords or concepts, but search results often include thousands of papers, and extracting the relevant information from the papers is a problem not addressed by the search engines. This paradigm works well when the information need can be answered by reviewing a number of papers from the top of the search results. However, when the information need requires extraction of information nuggets from many papers (e.g. all chemical-protein interactions or all risk factors for a disease), the task becomes challenging and researchers will typically resort to curated knowledge bases or designated survey papers, in case ones are available.

We present a search system that works in a paradigm which we call Extractive Search, and which allows rapid information-seeking queries that are aimed at extracting facts, rather than documents. Our system combines three query modes: boolean, sequential and syntactic, targeting different stages of the analysis process and different extraction scenarios. Boolean queries (§4.1) are the most standard, and look for the existence of search terms, or groups of search terms, in a sentence, regardless of their order. These are very powerful for finding relevant sentences and for co-occurrence searches. Sequential queries (§4.2) focus on the order and distance between terms. They are intuitive to specify and are very effective where the text includes "anchor-words" near the entity of interest. Lastly, syntactic queries (§4.4) focus on the linguistic constructions that connect the query words to each other. Syntactic queries are very powerful, and can work also where the concept to be extracted does not have clear linear anchors. However, they are also traditionally hard to specify and require a strong linguistic background to use. Our system lowers their barrier of entry with a specification-by-example interface.

Our proposed system is based on the following components.

Minimal but powerful query languages. There is an inherent trade-off between simplicity and control. On the one extreme, web search engines like Google Search offer great simplicity, but very little control over the exact information need. On the other extreme, information extraction pattern-specification languages like UIMA Ruta offer great precision and control, but also expose a low-level view of the text and come with an over-hundred-page manual.²

Our system is designed to offer a high degree of expressivity while remaining simple to grasp: the syntax and functionality can be described in a few paragraphs. The three query languages are designed to share the same syntax to the extent possible, to facilitate knowledge transfer between them and to ease the learning curve.

Linguistic Information, Captures, and Expansions. Each of the three query types is linguistically informed, and the user can condition not only on the word forms, but also on their lemmas, parts-of-speech tags, and identified entity types. The user can also request to capture some of the search terms, and to expand them to a linguistic context. For example, in a boolean search query looking for a sentence that contains the lemmas "treat" and "treatment" ('lemma=treat|treatment'), a chemical name ('entity=SIMPLE_CHEMICAL') and the word "infection" ('infection'), a user can mark the chemical name and the word "infection" as captures. This will yield a list of chemical/infection pairs, together with the sentence from which they originated, all of which contain the words relating to treatments. Capturing the word "infection" is not very useful on its own: all matches result in the exact same word. But, by expanding the captured word to its surrounding linguistic environment, the captures list will contain terms such as "PEDV infection", "acyclovir-resistant HSV infection" and "secondary bacterial infection". Running this query over PubMed allows us to create a large and relatively focused list in just a few seconds. The list can then be downloaded as a CSV file for further processing. The search becomes extractive: we are not only looking for documents, but also, by use of captures, extracting information from them.

Sentence Focus, Contextual Restrictions. As our system is intended for extraction of information, it works at the sentence level. However, each sentence is situated in a context, and we allow secondary queries to condition on that context, for example by looking for sentences that appear in paragraphs that contain certain words, or which appear in papers with certain words in their titles, in papers with specific MeSH terms, in papers whose abstracts include specific terms, etc. This combines the focus and information density of a sentence, which is the main target of the extraction, with the rich signals available in its surrounding context.

Interactive Speed. Central to the approach is an indexed solution, based on (Valenzuela-Escárcega et al., 2020), that allows performing all types of queries efficiently over very large corpora, while getting results almost immediately. This allows the users to interactively refine their queries and improve them based on feedback from the results. This contrasts with machine learning based solutions that, even neglecting the development time, require substantially longer turnaround times between query and results from a large corpus.

2 Existing Information Discovery Approaches

The primary paradigm for navigating large scientific collections such as MEDLINE/PubMed³ is document-level search. The most immediate document-level searching technique is boolean search ("keyword search"). However, these methods suffer from an inability to capture the concepts aimed for by the user, as biomedical terms may have different names in different sub-fields and as the user may not always know exactly what they are looking for. To overcome this issue, several databases offer semantic searching by exploiting MeSH terms that indicate related concepts. While in some cases MeSH terms can be assigned automatically, e.g. (Mork et al., 2013), in others obtaining related concepts requires a manual assignment which is laborious to obtain.

Beyond the methods incorporated in the literature databases themselves, there are numerous external tools for biomedical document searching. Thalia (Soto et al., 2018) is a system for semantic searching over PubMed. It can recognize different types of concepts occurring in biomedical abstracts, and additionally enables search based on abstract metadata; LIVIVO (Müller et al., 2017) takes on the task of vertically integrating information from divergent research areas in the life sciences; SWIFT-Review⁴ offers iterative screening by re-ranking the results based on the user's inputs.

All of these solutions are focused on the document level, which can be limiting: they often surface hundreds of papers or more, requiring careful reading, assessing and filtering by the user in order to locate the relevant facts they are looking for.

²https://uima.apache.org/d/ruta-current/tools.ruta.book.pdf
³https://www.ncbi.nlm.nih.gov/pubmed/
⁴https://www.sciome.com/swift-review/

To complement document searching, some systems facilitate automatic extraction of biomedical concepts, or patterns, from documents. Such systems are often equipped with analysis capabilities over the extracted information. For example, NaCTeM has created systems that extract biomedical entities, relations and events⁵; ExaCT and RobotReviewer (Kiritchenko et al., 2010; Marshall et al., 2015) take an RCT report and retrieve sentences that match certain study characteristics.

⁵http://www.nactem.ac.uk/

To improve the development of automatic document selection and information extraction, the BioNLP community organized a series of shared tasks (Kim et al., 2009, 2011; Nédellec et al., 2013; Segura Bedmar et al., 2013; Deléger et al., 2016; Chaix et al., 2016; Jin-Dong et al., 2019). The tasks address a diverse set of biomedical topics addressed by a range of NLP-based techniques. While effective, such systems require annotated training data and substantial expertise to produce. As such, they are restricted to several "head" information extraction needs, those that enjoy wide community interest and support. The long tail of information needs of "casual" researchers remains mostly un-addressed.

3 Interactive IE Approach

Existing approaches to information extraction from biomedical data suffer from significant practical limitations. Techniques based on supervised training require extensive data collection and annotation (Kim et al., 2009, 2011; Nédellec et al., 2013; Segura Bedmar et al., 2013; Deléger et al., 2016; Chaix et al., 2016), or a high degree of technical savviness in producing high-quality data sets from distant supervision (Peng et al., 2017; Verga et al., 2017; Wang et al., 2019). On the other hand, rule-based engines are generally too complex to be used directly by domain experts and require a linguist or an NLP specialist to operate. Furthermore, both rule-based and supervised systems typically operate in a pipeline approach where an NER engine identifies the relevant entities and subsequent extraction models identify the relations between them. This approach is often problematic in real-world biomedical IE scenarios, where relevant entities often cannot be extracted by stock NER models.

To address these limitations we present a system allowing domain experts to interactively query linguistically annotated datasets of scientific research papers, using a novel multifaceted query language which we designed, and which supports boolean search, sequential pattern search, and by-example syntactic search (Shlain et al., 2020), as well as specification of search terms whose matches should be captured or expanded. The queries can be further restricted by contextual information.

We demonstrate the system on two datasets: a comprehensive dataset of PubMed abstracts and a dataset of full-text papers focused on COVID-19 research.

Comparison to existing systems. In contrast to document-level search solutions, the results returned by our system are sentences which include highlighted spans that directly answer the user's information need. In contrast to supervised IE solutions, our solution does not require a lengthy process of data collection and labeling or a precise definition of the problem settings.

Compared to rule-based systems, our system differentiates itself in a number of ways: (i) our query engine automatically translates lightly tagged natural language sentences to syntactic queries (query-by-example), thus allowing domain experts to benefit from the advantages of syntactic patterns without a deep understanding of syntax; (ii) our queries run against indexed data, allowing our translated syntactic queries to run at interactive speed; and (iii) our system does not rely on relation schemas and does not make assumptions about the number of arguments involved or their types.

In many respects, our system is similar to the PropMiner system (Akbik et al., 2013) for exploratory relation extraction (Akbik et al., 2014). Both PropMiner and our system support by-example queries at interactive speed. However, the query languages we describe in Section 4 are significantly more expressive than PropMiner's language, which supports only binary relations. Furthermore, compared to PropMiner, our annotation pipeline was optimized specifically for the biomedical domain, and our system is freely available online.

Technical details. The datasets were annotated for biomedical entities and syntax using a custom SciSpacy pipeline (Neumann et al., 2019)⁶, and the syntactic trees were enriched to BART format using pyBART (Tiktinsky et al., 2020). The annotated data is indexed using the Odinson engine (Valenzuela-Escárcega et al., 2020).

⁶All abstracts underwent sentence splitting, tokenization, tagging, parsing and NER, using all 4 NER models available in SciSpacy.
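To make the kind of preprocessing described under Technical details more concrete, here is a minimal SciSpacy sketch that tokenizes, parses and runs one of the library's biomedical NER models over a sentence. This illustrates standard SciSpacy usage only; it is not the authors' exact pipeline, and it omits the pyBART enrichment and Odinson indexing steps. The example text is made up.

    import spacy  # requires: pip install scispacy plus the two model packages below

    nlp = spacy.load("en_core_sci_sm")      # tokenization, tagging, parsing, generic NER
    ner = spacy.load("en_ner_bc5cdr_md")    # one of SciSpacy's specialized NER models

    text = "Chloroquine was administered to patients with severe infection."
    doc = nlp(text)
    for tok in doc:
        # surface form, lemma, POS tag, dependency label, and syntactic head
        print(tok.text, tok.lemma_, tok.tag_, tok.dep_, tok.head.text)
    for ent in ner(text).ents:
        print(ent.text, ent.label_)         # e.g. chemical and disease mentions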

4 Extractive Query Languages

4.1 Boolean Queries

Boolean queries are the standard in information retrieval (IR): the user provides a set of terms that should, and should not, appear in a document, and the system returns a set of documents that adhere to these constraints. This is a familiar and intuitive model, which can be very effective for initial data exploration as well as for extraction tasks that focus on co-occurrence. We depart from standard boolean queries and extend them by (a) allowing conditioning on different linguistic aspects of each token; (b) allowing capturing of terms into named variables; and (c) allowing linguistic expansion of the captured terms.

The simplest boolean query is a list of terms, where each term is a word, i.e.: 'infection asymptomatic fatal'. The semantics is that all the terms must appear in the sentence. A term can be made optional by prefixing it with a '?' symbol ('infection asymptomatic ?fatal'). Each term can also specify a list of alternatives: 'fatal|deadly|lethal'.

Beyond words. In addition to matching words, terms can also specify linguistic properties: lemmas, parts-of-speech, and domain-specific entity-types: 'lemma=infect entity=DISEASE'. Conditions can also be combined: 'lemma=cause|reason&tag=NN'. We find that the ability to search for domain-specific types is very effective in boolean queries, as it allows searching for concepts rather than words. In addition to exact match, we also support matching on regular expressions ('lemma=/caus.*/'). The field names word, lemma, entity, tag can be shortened to w, l, e, t.

Captures. Central to our extractive approach is the ability to designate specific search terms to be captured. Capturing is indicated by prefixing the query term with ':' (for an automatically-named capture) or with 'name:' (for a named capture). The query 'fatal asymptomatic d:e=DISEASE' will look for sentences that contain the terms 'fatal' and 'asymptomatic' as well as the name of a disease, and will capture the disease name under a variable "d". Each query result will be a sentence with a single disease captured. If several diseases appear in the same sentence, each one will be its own result. The user can then focus on the captured entities, and export the entire query result to a CSV file, in which each row contains the sentence, its source, and the captured variables. In the current example, the result will be a list of disease names that co-occur with "fatal" and "asymptomatic".

We can also issue a query such as 'chem:e=SIMPLE_CHEMICAL d:e=DISEASE' to get a list of chemical-disease co-occurrences. Using additional terms, we can narrow down to co-occurrences with specific words, and by using contextual restrictions (§4.3) we can focus on co-occurrences in specific papers or domains.

Expansions. Finally, for captured terms we also support linguistic expansions. After the term is matched, we can expand it to a larger linguistic environment based on the underlying syntactic sentence representation. An expansion is expressed by prefixing a term with angle brackets '<>': '<>inf:infection asymptomatic fatal' will capture the word "infection" under the variable "inf" and expand it to its surrounding noun-phrase, capturing phrases like "malarialike infection", "asymptomatic infection", "chronic infection" and "a mild subclinical infection 9".

4.2 Sequential (Token) Queries

While boolean queries allow terms to appear in any order, we sometimes care about the exact linear placement of words with respect to each other. The term-specification, capture and expansion syntax is the same as in boolean queries, but here terms must match in the order given in the query. 'interspecies transmission' looks for the exact phrase "interspecies transmission", and 'tag=NNS transmission' looks for the word transmission immediately preceded by a plural noun. By capturing the noun ('which:tag=NNS transmission') we obtain a list of terms that includes the words "bacteria", "diseases", "nuclei" and "crossspecies".

Wildcards. Sequential queries can also use wildcard symbols: '*' (matching any single word), '...' (0 or more words), '...2-5...' (2-5 words). The query 'interspecies kind:...1-3... transmission' looks for the words "interspecies" and "transmission" with 1 to 3 intervening words, capturing the intervening words under "kind". First results include "host-host", "zoonotic", "virus", "TSE agent", "and interclass".

Repetitions. We also allow specifying repetitions of terms. To do so, the term is enclosed in [] and followed by a quantifier. We support the standard list of regular expression quantifiers: *, +, ?, {n,m}. For example, 'tag=DT [tag=JJ]* [tag=NN]+'.
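The wildcard and repetition operators behave like ordinary regular-expression quantifiers applied to token streams. The following Python illustration mimics the '...1-3...' wildcard of the 'interspecies kind:...1-3... transmission' query with a plain string regex; it is an analogy only, not the system's implementation, and the example sentence is made up:

    import re

    # "interspecies", then 1-3 intervening tokens (captured), then "transmission"
    pattern = re.compile(r"\binterspecies\b((?:\s+\S+){1,3})\s+transmission\b")

    sentence = "evidence of interspecies virus transmission between bats and pigs"
    m = pattern.search(sentence)
    if m:
        print(m.group(1).strip())   # -> "virus", playing the role of the 'kind' capture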

4.3 Contextual Restrictions

Each query can be associated with contextual restrictions, which are secondary queries that operate on the same data and restrict the set of sentences that are considered for the main queries. These queries currently have the syntax of the Lucene query language.⁷ Our system allows the secondary queries to condition on the paragraph the sentence appears in, and on the title, abstract, authors, publication data, publication venue and MeSH terms of the paper the sentence appears in. Additional sources of information are easy to add. For example, adding the contextual restriction '#d +title:cancer +mesh:"Age Distribution"' restricts a query's results to sentences from papers which have the word "cancer" in their title and whose MeSH terms include "Age Distribution". Similarly, '#d +title:/corona.*/ +year:[2015 TO 2020]' restricts queries to sentences from papers published between 2015 and 2020 that have a word starting with corona in their title.

⁷https://lucene.apache.org/core/6_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html

These secondary queries greatly increase the power of boolean, sequential and syntactic queries: one could look for interspecies transmissions that relate to certain diseases, or for sentence-level disease-chemical co-occurrences in papers that discuss specific sub-populations.

4.4 Example-based Syntactic Queries

Recent advances in machine learning brought with them accurate syntactic parsing, but parse-trees remain hard to use. We remedy this by employing a novel query language we introduced in (Shlain et al., 2020), which is based on the principle of query-by-example.

The query is specified by starting with a simple natural language sentence that conveys the desired syntactic structure, for example, 'BMP-6 induces the phosphorylation of Smad1'. Then, words can be marked as anchor words (that need to match exactly) or capture nodes (that are variables). Words can also be neither anchor nor capture, in which case they only support the scaffolding of the sentence. The system then translates the sentence with the captures and anchors syntax into a syntactic query graph, which is presented to the user. The user can then restrict capture nodes from "match anything" to matching specific terms (using the term-specification syntax as in boolean or token queries) and can likewise relax the exact-match constraints on anchor words. Like in other query types, capture nodes can be marked for expansion. The syntactic graph is then matched against the pre-parsed and indexed corpus.

This simple markup provides a rich syntax-based query system, while alleviating the user from the need to know linguistic syntax.

For example, consider the query below, the details of which will be discussed shortly:
'<>p1:[e=GENE_OR_GENE_PRODUCT]BMP-6 $induces the $phosphorylation $of <>p2:Smad1'

The words 'induces', 'phosphorylation' and 'of' are anchors (designated by '$'), while 'p1' and 'p2' are captures for 'BMP-6' and 'Smad1'. Both capture nodes are marked for expansion using angle brackets ('<>'). Node p1 is restricted to match tokens with the same entity type as BMP-6 (indicated by 'e=GENE_OR...'). The query can be shortened by omitting the entity type and retaining only the entity restriction ('e'):
'<>p1:[e]BMP-6 $induces the $phosphorylation $of <>p2:Smad1'

Here, the entity type is inferred by the system from the entity type of BMP-6.⁸

⁸Similarly, we could specify '$[lemma]induces', resulting in the restriction 'lemma=induce' instead of 'word=induces' for the anchor.

[Figure 1: Query graph of the syntactic query '<>p1:[e]BMP-6 $induces the $phosphorylation $of <>p2:Smad1'.]

The graph for the query is displayed in Figure 1. It has 5 tokens in a specific syntactic configuration determined by directed labeled edges. The 1st token must have the entity tag 'GENE_OR...', the 2nd, 3rd, and 4th tokens must be the exact words "induces phosphorylation of", and the 5th is unconstrained. Sentences whose syntactic graph has a subgraph that aligns to the query and adheres to the constraints will match the query. Examples of matching sentences are:
- ERK activation induces phosphorylation of Elk-1.
- Thrombopoietin activates human platelets and induces tyrosine phosphorylation of p80/85 cortactin.

The sentence tokens corresponding to the p1 and p2 graph nodes will be bound to variables with these names: {p1=ERK, p2=Elk-1} for the first sentence and {p1=Thrombopoietin, p2=p80/85 cortactin} for the second.

5 Example Workflow: Risk-factors

We describe a workflow based on using our extractive search system over a corpus of all PubMed abstracts. While the described researcher is hypothetical, the results we discuss are real.

Consider a medical researcher who is trying to compile an up-to-date list of the risk factors for stroke. A PubMed search for "risk factors for stroke" yields 3317 results, and reading through all results is impractical. A Google query for the same phrase brings up an info box from NHLBI⁹ listing 16 common risk factors including high blood pressure, diabetes, heart disease, etc. Having a curated list which clearly outlines the risk factors is helpful, but curated lists or survey papers will often not include rare or recent research findings.

⁹https://www.nhlbi.nih.gov/health-topics/stroke

The researcher thus turns to extractive search and tries an exploratory boolean query:
'risk factor stroke'

[Screenshot: top results for the boolean query]

The figure shows the top results for the query, and the majority of sentences retrieved indeed specify specific risk factors for stroke. This is an improvement over the PubMed results, as the researcher can quickly identify the risk factors discussed without going through the different papers.

Furthermore, the top results contain risk factors like migraine or C804A polymorphism not listed in the NHLBI knowledge base. However, the full result list is lengthy, and extracting all the risk factors from it manually would be tedious. Instead, the researcher notes that many of the top results are variations on the "X is a risk factor for stroke" structure. She thus continues by issuing the following syntactic query, where a capture labeled r is used to directly capture the risk factors:
'r:Diabetes is a $risk $factor for $stroke'

[Screenshot: top results for the syntactic query]

The figure shows the top results for the query, and the risk factors are indeed labeled with r as expected. Unfortunately, some of the captured risk factor names are not fully expanded. For example, we capture syndrome instead of metabolic syndrome and temperature instead of low temperature. Being interested in capturing the full names, the researcher adds angle brackets '<>' to expand the captured elements:
'<>r:Diabetes is a $risk $factor for $stroke'

The full names are now captured as expected. Now that the researcher has verified that the query yields relevant results, she clicks the download button to download the full result set. The resulting tab-separated file has 1212 rows. Each row includes a result sentence, the captured elements in it (in this case, just the risk factor), and their offsets. Using a spreadsheet to group the rows by risk factor and order the results by frequency, the researcher obtains a list of 640 unique risk factors, 114 of them appearing more than once in the data. Figure 2a lists the top results.

[Figure 2: Grouped and ranked results. (a) ranked risk factors for stroke; (b) ranked disease risk factors.]
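The spreadsheet step above is a plain group-and-count over the downloaded file; a pandas equivalent might look like the following. The file name and the column name risk_factor are guesses at the export format, not something specified in the paper:

    import pandas as pd

    # The exported result set is a tab-separated file with one row per match.
    df = pd.read_csv("stroke_risk_factors.tsv", sep="\t")

    # Group by the captured risk-factor span and rank by frequency.
    counts = (df.groupby("risk_factor")
                .size()
                .sort_values(ascending=False))

    print(len(counts))        # number of unique risk factors
    print(counts.head(20))    # the most frequently extracted ones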

Reviewing the list, the researcher decides that she's not interested in general risk factors, but rather in diseases only. She modifies the query by adding an entity restriction to the 'r' capture. As seen in the query graph, even though the researcher didn't specify the exact entity type, the query parser correctly resolved it to DISEASE. The results now include diseases like sleep apnoea and hypertension, but do not include smoking, age and alcohol (see Figure 2b).

Analyzing the results, the researcher now wants to compare the risk factors in the general population to ones listed in research papers dealing with children and infants. Luckily, such papers are indexed with corresponding MeSH terms, and the researcher can utilize this fact by appending '#d mesh:Child|mesh:Infant -mesh:Adult' to her query. In cases where a desired MeSH term does not exist, an alternative approach is filtering the results based on words in the abstract or title. For example, appending '#d abstract:child|abstract:children' to a query will ensure that the result sentences come from abstracts which contain the word child or the word children.

Happy with the results of the initial query, the researcher can further augment her list by querying for other structures which identify risk factors (e.g. 'r:Diabetes $causes $stroke', '$risk $factors for $stroke $include <>r:Diabetes', etc.).

Importantly, once the researcher has identified one or more effective queries to extract the risk factors for stroke, the queries can easily be modified in useful ways. For example, with a small modification to our original query we can extract:

risk factors for cancer:
'r:Diabetes is a $risk $factor for $cancer'

diseases which can be caused by smoking:
'$Smoking is a $risk $factor for <>d:[e]stroke'

an ad-hoc KB of (risk factor, disease) tuples (for self use or as an easily queryable public resource):
'<>r:Diabetes is a $risk $factor for <>d:[e]stroke'

6 Example Workflow: CORD-19

The COVID-19 Open Research Dataset (Wang et al., 2020) is a collection of 45,000 research papers, including over 33,000 with full text, about COVID-19 and the coronavirus family. The corpus was released by the Allen Institute for AI and associated partners in an attempt to encourage researchers to apply recent advances in NLP to the data to generate insights.

Identifying COVID-19 Aliases. Since the CORD-19 corpus includes papers about the entire Coronavirus family of viruses, it's useful to identify papers and sentences dealing specifically with COVID-19. Before converging on the acronym COVID-19, researchers have referred to the virus by many names: nCov-19, SARS-COV-ii, novel coronavirus, etc. Luckily, it's fairly easy to identify many of these aliases using a sequential pattern:
'novel coronavirus ( alias:...1-2... )'

The pattern looks for the words "novel coronavirus" followed by an open parenthesis, one or two words which are to be captured under the 'alias' variable, and a closing parenthesis. The query retrieves 52 unique candidate aliases for COVID-19, though some of them refer to older coronaviruses such as "MERS", or non-relevant terms such as "Fig2". After ranking by frequency and validating the results, we can reuse the pattern on newly retrieved aliases to extend the list. Through this iterative process we quickly compile a list of 47 aliases. We marked all occurrences of these terms in the underlying corpus as a new entity type, COVID-19, and re-indexed the dataset with this entity information.

Exploring Drugs and Treatments. To explore drugs and treatments for COVID-19 we search the corpus for chemicals co-occurring with the COVID-19 entity using a boolean query:
'chemical:e=SIMPLE_CHEMICAL|CHEMICAL e=COVID-19'

Table 1(a) shows the top matching chemicals by frequency. While some of the substances listed, like Chloroquine and Remdesivir, are drugs being tested for treating COVID-19, others are only hypothesized as useful or appear in other contexts.

To guide the search toward therapeutic substances in different stages of maturity, we can add indicative terms to the query. For example, the following query can be used to detect substances at the stage of clinical trials: 'chemical:e=SIMPLE_CHEMICAL|CHEMICAL e=COVID-19 l=trial|experiment', while adding 'l=suggest|hypothesize|candidate' can assist in detecting substances in the ideation stage.

Table 1(b,c) shows the frequency distributions of the chemicals resulting from the two queries.
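The alias-bootstrapping loop described under "Identifying COVID-19 Aliases" above is a simple worklist iteration; the sketch below illustrates the idea only, with run_alias_query as a hypothetical stand-in for submitting the sequential pattern for one seed term and returning the candidate aliases found:

    def run_alias_query(seed):
        # Hypothetical stand-in for running the sequential pattern
        # "<seed> ( alias:...1-2... )" against the corpus.
        return set()

    seeds = {"novel coronavirus"}
    aliases, frontier = set(), set(seeds)
    while frontier:
        term = frontier.pop()
        for candidate in run_alias_query(term):
            if candidate not in aliases and candidate not in seeds:
                aliases.add(candidate)    # would be manually validated in practice
                frontier.add(candidate)   # reuse the pattern on the new alias

    print(sorted(aliases))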

Table 1: Top chemicals co-occurring with the COVID-19 entity and their counts. (a) Unrestricted. (b) With Trial-related terms. (c) With Ideation-related terms.

(a) Unrestricted: nucleic acid (171), chloroquine (118), nucleotide (115), NCP (87), CR3022 (47), Ksiazek (46), IgG (45), lopinavir/ritonavir (42), ECMO (40), LPV/r (35), corticosteroids (35), oxygen (32), ribavirin (31), lopinavir (31), Hydroxychloroquine (30), amino acid (30), ritonavir (27), corticosteroid (24), Sofosbuvir (22), amino acids (22), HCQ (19), glucocorticoids (19)

(b) Trial: chloroquine (29), Remdesivir (8), LPV/r (7), lopinavir (6), HCQ (6), ritonavir (4), Arbidol (4), Sofosbuvir (3), nucleotide (3), nucleic acid (3), lopinavir/ritonavir (3), CQ (3), oseltamivir (2), NCT04257656 (2), NCT04252664 (2), Meplazumab (2), Hydroxychloroquine (2), glucocorticoids (2), CEP (2)

(c) Ideation: chloroquine (6), ritonavir (5), S-RBD (4), nucleotide (4), Lopinavir (4), CR3022 (4), Ribavirin (3), nucleic acid (3), logP (3), Li (3), ledipasvir (3), IgG (3), HCQ (3), TGEV (2), teicoplanin (2), nelfinavir (2), NCP (2), HWs (2), glucocorticoids (2), ENPEP (2), ECMO (2), darunavir (2), creatinine (2), creatine (2), CQ (2), corticosteroid (2), CEP (2), ARB (2)

While the queries are very basic and include only a few terms for each category, the difference is clearly noticeable: while the Malaria drug Chloroquine tops both lists, the antiviral drug Remdesivir, which is currently being tested for COVID-19, is second on the list of trial-related drugs but does not appear at all as a top result for ideation-related drugs.

Importantly, entity co-mention queries like the ones above rely on the availability and accuracy of underlying NER models. As we've seen in Section 5, in cases where the relevant types are not extracted by NER, syntactic queries can be used instead. For example, the following query captures sentences including chemicals being used on patients (the abstract or paragraph are required to include COVID-19 related terms):
'he was $treated $with a <>chem:treatment #d paragraph:ncov*|paragraph:covid*|abstract:ncov*|abstract:covid*'

Table 2: Top elements occurring in the syntactic "treated with X" configuration. Note that this query does not rely on NER information. Treatments (via syntactic query): ribavirin (11), oseltamivir (9), ECMO (6), convalescent plasma (4), TCM (3), LPV/r (3), three fusions of MSCs (2), supportive care (2), protective conditions (2), lopinavir/ritonavir (2), intravenous remdesivir (2), hydroxychloroquine (2), HCQ (2), glucocorticoids (2), FPV (2), effective isolation (2), chloroquine (2), caution (2), bDMARDs (2), azithromycin (2), ARBs (2), antivirals (2), ACE inhibitors (2), 500 mg chloroquine (2), masks (1)

The top results by frequency are shown in Table 2. The top-ranking results show many of the chemicals obtained by the equivalent boolean queries¹⁰, but interestingly, they also contain non-chemical treatments like supportive care, isolation and masks. This demonstrates a benefit of using entity-agnostic syntactic patterns even in cases where a strong NER model exists.

7 More Examples

While the workflows discussed above pertain mainly to the medical domain, the system is optimized for the broader life-science domain. Here is a sample of additional queries, showing different potential use-cases.

Which genes regulate a cell process:
'<>p1:[e]CD95 <>v:[l]regulates <>p:[e]apoptosis'

Which species is the natural host of a disease:
'<>host:[e]bat is a $natural $host of <>disease:[e]coronavirus'

Documented LOF mutations in genes:
'$loss $of $function <>m:[w]mutation in <>gene:[e]PAX8'

8 Conclusion

We presented a search system that targets extracting facts from a biomedical corpus and demonstrated its utility in a research and a clinical context over CORD-19 and PubMed. The system works in an Extractive Search paradigm which allows rapid information-seeking practices in 3 modes: boolean, sequential and syntactic. The interactive and flexible nature of the system makes it suitable for users at different levels of sophistication.

¹⁰To get more comprehensive coverage we can issue queries for other syntactic structures, like '<>chem:chemical was used $in $treatment', and combine the results of the different queries.

Acknowledgements

The work performed at BIU is supported by funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No. 802774 (iEXTRACT).

Alan Akbik, Oresti Konomi, and Michail Melnikov. 2013. PropMiner: A workflow for interactive information extraction and exploration using dependency trees. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 157–162.
Alan Akbik, Thilo Michael, and Christoph Boden. 2014. Exploratory relation extraction in large text corpora. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2087–2096.
Estelle Chaix, Bertrand Dubreucq, Abdelhak Fatihi, Dialekti Valsamou, Robert Bossy, Mouhamadou Ba, Louise Deléger, Pierre Zweigenbaum, Philippe Bessieres, Loic Lepiniec, et al. 2016. Overview of the regulatory network of plant seed development (SeeDev) task at the BioNLP Shared Task 2016. In Proceedings of the 4th BioNLP Shared Task Workshop, pages 1–11.
Louise Deléger, Robert Bossy, Estelle Chaix, Mouhamadou Ba, Arnaud Ferré, Philippe Bessieres, and Claire Nédellec. 2016. Overview of the Bacteria Biotope task at BioNLP Shared Task 2016. In Proceedings of the 4th BioNLP Shared Task Workshop, pages 12–22.
Kim Jin-Dong, Nédellec Claire, Bossy Robert, and Deléger Louise, editors. 2019. Proceedings of The 5th Workshop on BioNLP Open Shared Tasks. Association for Computational Linguistics, Hong Kong, China.
Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun'ichi Tsujii. 2009. Overview of BioNLP'09 shared task on event extraction. In Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, pages 1–9.
Jin-Dong Kim, Yue Wang, Toshihisa Takagi, and Akinori Yonezawa. 2011. Overview of Genia event task in BioNLP Shared Task 2011. In Proceedings of the BioNLP Shared Task 2011 Workshop, pages 7–15. Association for Computational Linguistics.
Svetlana Kiritchenko, Berry de Bruijn, Simona Carini, Joel Martin, and Ida Sim. 2010. ExaCT: Automatic extraction of clinical trial characteristics from journal publications. BMC Medical Informatics and Decision Making, 10:56.
Iain J Marshall, Joël Kuiper, and Byron C Wallace. 2015. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. Journal of the American Medical Informatics Association, 23(1):193–201.
J.G. Mork, Antonio Jimeno-Yepes, and Alan Aronson. 2013. The NLM Medical Text Indexer system for indexing biomedical literature. CEUR Workshop Proceedings, 1094.
Bernd Müller, Christoph Poley, Jana Pössel, Alexandra Hagelstein, and Thomas Gübitz. 2017. LIVIVO: the vertical search engine for life sciences. Datenbank-Spektrum, pages 1–6.
Claire Nédellec, Robert Bossy, Jin-Dong Kim, Jung-Jae Kim, Tomoko Ohta, Sampo Pyysalo, and Pierre Zweigenbaum. 2013. Overview of BioNLP Shared Task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop, pages 1–7.
Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. ScispaCy: Fast and robust models for biomedical natural language processing.
Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics, 5:101–115.
Isabel Segura Bedmar, Paloma Martínez, and María Herrero Zazo. 2013. SemEval-2013 Task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). Association for Computational Linguistics.
Micah Shlain, Hillel Taub-Tabib, Shoval Sadde, and Yoav Goldberg. 2020. Syntactic search by example. In Proceedings of ACL 2020, System Demonstrations.
Axel J Soto, Piotr Przybyła, and Sophia Ananiadou. 2018. Thalia: semantic search engine for biomedical abstracts. Bioinformatics, 35(10):1799–1801.
Aryeh Tiktinsky, Yoav Goldberg, and Reut Tsarfaty. 2020. pyBART: Evidence-based syntactic transformations for IE. In Proceedings of ACL 2020, System Demonstrations.
Marco A. Valenzuela-Escárcega, Gus Hahn-Powell, and Dane Bell. 2020. Odinson: A fast rule-based information extraction framework. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France. European Language Resources Association (ELRA).
Patrick Verga, Emma Strubell, Ofer Shai, and Andrew McCallum. 2017. Attending to all mention pairs for full abstract biological relation extraction. arXiv preprint arXiv:1710.08312.

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Michael Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey A. Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Christopher Wilhelm, Boya Xie, Douglas M. Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 Open Research Dataset. ArXiv, abs/2004.10706.

Lucy Lu Wang, Oyvind Tafjord, Sarthak Jain, Arman Cohan, Sam Skjonsberg, Carissa Schoenick, Nick Botner, and Waleed Ammar. 2019. Extracting evidence of supplement-drug interactions from literature. arXiv preprint arXiv:1909.08135.

Improving Biomedical Analogical Retrieval with Embedding of Structural Dependencies

Amandalynne Paullada∗, Bethany Percha†, Trevor Cohen‡
∗Department of Linguistics, University of Washington, Seattle, WA, USA
†Dept. of Medicine and Dept. of Genetics & Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
‡Dept. of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
[email protected], [email protected], [email protected]

Abstract

Inferring the nature of the relationships between biomedical entities from text is an important problem due to the difficulty of maintaining human-curated knowledge bases in rapidly evolving fields. Neural word embeddings have earned attention for an apparent ability to encode relational information. However, word embedding models that disregard syntax during training are limited in their ability to encode the structural relationships fundamental to cognitive theories of analogy. In this paper, we demonstrate the utility of encoding dependency structure in word embeddings in a model we call Embedding of Structural Dependencies (ESD) as a way to represent biomedical relationships in two analogical retrieval tasks: a relationship retrieval (RR) task, and a literature-based discovery (LBD) task meant to hypothesize plausible relationships between pairs of entities unseen in training. We compare our model to skip-gram with negative sampling (SGNS), using 19 databases of biomedical relationships as our evaluation data, with improvements in performance on 17 (LBD) and 18 (RR) of these sets. These results suggest embeddings encoding dependency path information are of value for biomedical analogy retrieval.

1 Introduction

Distributed vector space models of language have been shown to be useful as representations of relatedness and can be applied to information retrieval and knowledge base augmentation, including within the biomedical domain (Cohen and Widdows, 2009). A vast amount of knowledge on biomedical relationships of interest, such as therapeutic relationships, drug-drug interactions, and adverse drug events, exists in largely human-curated knowledge bases (Zhu et al., 2019). However, the rate at which new papers are published means new relationships are being discovered faster than human curators can manually update the knowledge bases. Furthermore, it is appealing to automatically generate hypotheses about novel relationships given the information in scientific literature (Swanson, 1986), a process also known as 'literature-based discovery.' A trustworthy model should also be able to reliably represent known relationships that are validated by existing literature.

Neural word embedding techniques such as word2vec¹ and fastText² are a widely-used and effective approach to the generation of vector representations of words (Mikolov et al., 2013a) and biomedical concepts (De Vine et al., 2014). An appealing feature of these models is their capacity to solve proportional analogy problems using simple geometric operators over vectors (Mikolov et al., 2013b). In this way, it is possible to find analogical relationships between words and concepts without the need to specify the relationship type explicitly, a capacity that has recently been used to identify therapeutically-important drug/gene relationships for precision oncology (Fathiamini et al., 2019). However, neural embeddings are trained to predict co-occurrence events without consideration of syntax, limiting their ability to encode information about relational structure, which is an essential component of cognitive theories of analogical reasoning (Gentner and Markman, 1997). Additionally, recent work (Peters et al., 2018) has found that contextualized word embeddings from language models such as ELMo, when evaluated on analogy tasks, perform worse on semantic relation tasks than static embedding models.

¹https://github.com/tmikolov/word2vec
²https://fasttext.cc/


Figure 1: Overview of training and evaluation pipeline. Two embedding models, Embedding of Structural Dependencies (ESD) and Skip-gram with Negative Sampling (SGNS), are trained on data from a corpus of approximately 70 million sentences from Medline. The resulting representations are then evaluated on data collected from biomedical knowledge bases.

To this end, we build and evaluate vector space models for representing biomedical relationships, using a corpus of dependency-parsed sentences from biomedical literature as a source of grammatical representations of relationships between concepts.

We compare two methods for learning biomedical concept embeddings, the skip-gram with negative sampling (SGNS) algorithm (Mikolov et al., 2013a) and Embedding of Semantic Predications (ESP) (Cohen and Widdows, 2017), which adapts SGNS to encode concept-predicate-concept triples. In the current work, we adapt ESP to encode dependency paths, an approach we call Embedding of Structural Dependencies (ESD). We train ESD and SGNS on a corpus of approximately 70 million sentences from biomedical research paper abstracts from Medline, and evaluate each model's ability to solve analogical retrieval problems derived from various biomedical knowledge bases. We train ESD on concept-path-concept triples extracted from these sentences, and SGNS on full sentences that have been minimally preprocessed with named entities (see §3). Figure 1 shows the pipeline from training to evaluation.

From an applications perspective, we aim to evaluate the utility of these representations of relationships for two tasks. The first involves correctly identifying a concept that is related in a particular way to another concept, when this relationship has already been described explicitly in the biomedical literature. This task is related to the NLP task of relationship extraction, but rather than considering one sentence at a time, distributional models represent information from across all of the instances in which this pair have co-occurred, as well as information about relationships between similar concepts. We refer to this task as relationship retrieval (RR). The second task involves identifying concepts that are related in a particular way to one another, where this relationship has not been described in the literature previously. We refer to this task as literature-based discovery (LBD), as identifying such implicit knowledge is the main goal of this field (Swanson, 1986).

We evaluate on four kinds of biomedical relationships, characterized by the semantic types of the entity pairs involved, namely chemical-gene, chemical-disease, gene-gene, and gene-disease relationships.

The following paper is structured as follows. §2 describes vector space models of language as they are evaluated for their ability to solve proportional analogy problems, as well as prior work in encoding dependency paths for downstream applications in relation extraction. §3 presents the dependency path corpus from Percha and Altman (2018). §4 summarizes the knowledge bases from which we develop our evaluation data sets. §5 describes the training details for each vector space model. §6 and §7 describe the methods and results for the RR and LBD evaluation paradigms. §8 and §9 offer discussion and conclude the paper. Code and evaluation data will be made available at https://github.com/amandalynne/ESD.

2 Background

We look to prior work in using proportional analogies as a test of relationship representation in the general domain with existing studies on vector space models trained on generic English. While our biomedical data is largely in English, we constrain our evaluation to specific biomedical concepts and relationships as we apply and extend established methods.

Vector space models of language and analogical reasoning

Vector space models of semantics have been applied in information retrieval, cognitive science and computational linguistics for decades (Turney and Pantel, 2010), with a resurgence of interest in recent years. Mikolov et al. (2013a) and Mikolov et al. (2013b) introduce the skip-gram architecture. This work demonstrated the use of a continuous vector space model of language that could be used for analogical reasoning when vector offset methods are applied, providing the following canonical example: if x_i is the vector corresponding to word i, x_king - x_man + x_woman yields a vector that is close in proximity to x_queen. This result suggests that the model has learned something about semantic gender. They identified some other linguistic patterns recoverable from the vector space model, such as pluralization: x_apple - x_apples ≈ x_car - x_cars, and developed evaluation sets of proportional analogy problems that have since been widely used as benchmarks for distributional models (see for example (Levy et al., 2015)).

However, work soon followed that pointed out some of the shortcomings of attributing these results to the models' analogical reasoning capacity. For example, Linzen (2016) showed that the vector for 'queen' is itself one of the nearest neighbors to the vector for 'woman,' and so it can be argued that the model does not actually learn relational information that can be applied to analogical reasoning, but rather, can rely on the direct similarity between the target terms in the analogy to produce desirable results.

Furthermore, Gladkova et al. (2016) introduce the Better Analogy Test Set (BATS) to provide an evaluation set for analogical reasoning that includes a broader set of semantic and syntactic relationships between words. This set proved far more challenging for embedding-based approaches. Newman-Griffis et al. (2017) provide results of vector offset methods applied to a dataset of biomedical analogies derived from UMLS triples, showing that certain biomedical relationships are more difficult to learn with analogical reasoning than others.

Because the aim of this project is to robustly learn a handful of biomedical relationships, we are less concerned about the linguistic generalizability of these particular representations, but future work will examine the application of these vector space models to analogies in the general domain.

Dependency embeddings

Levy and Goldberg (2014a) adapt the SGNS model to encode direct dependency relationships, rather than dependency paths. In this approach, a dependency-type/relative pair is treated as a target for prediction when the head of a phrase is observed (e.g. P(scientist/nsubj | discovers)). The dependency-based skipgram embeddings were shown to better reflect the functional roles of words than those trained on narrative text, which tended to emphasize topical associations. Recent work (Zhang et al. (2018), Zhou et al. (2018), Li et al. (2019)) has also integrated dependency path representations in neural architectures for biomedical relation extraction, framing it as a classification task rather than an analogical reasoning task. The work of Washio and Kato (2018) is perhaps the most closely related to our approach, in that neural embeddings are trained on word-path-word triples. Aside from our application of domain-specific Named Entity Recognition (NER), a key methodological difference between this work and the current work is that their approach represents word pairs as a linear transformation of the concatenation of their embeddings, while we use XOR as a binding operator (following the approach of Kanerva (1996)), which was first used to model biomedical analogical retrieval with semantic predications extracted from the literature by Cohen et al. (2011) (for related work, see Widdows and Cohen (2014)). On account of the use of a binding operator, individual entities, pairs of entities and dependency paths are all represented in a common vector space.
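As a concrete illustration of the vector-offset method discussed in this section, the following minimal sketch uses the gensim library to reproduce the canonical king/man/woman example over a general-domain word2vec model. This is an illustration only, not the code used in the present work, and the model path is hypothetical.

from gensim.models import KeyedVectors

# Load a pre-trained word2vec-format model (the path "vectors.bin" is hypothetical).
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# x_king - x_man + x_woman: gensim sums the "positive" vectors, subtracts the
# "negative" one, and returns the nearest neighbours of the result by cosine similarity.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=5))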

Figure 2: Example of a path of dependencies between two entities of interest. The full parse is not shown, but rather, the minimum path of dependency relations between the two entities given the sentence.

3 Text Data

We train both the ESD and SGNS models on data released by Percha and Altman (2018). This corpus (version 7, retrieved from https://zenodo.org/record/3459420) consists of about 70 million sentences from a subset of MEDLINE (approximately 16.5 million abstracts) which have PubTator (Wei et al., 2013) annotations applied to identify phrases that denote names of chemicals (including drugs and other chemicals of interest), genes (and the proteins they code for), and diseases (including side effects and other phenotypes). Throughout this paper, we use these shorthand names for each of these categories, following the convention established in Wei et al. (2013) and followed by Percha and Altman (2018).

The following example sentence from an article processed by PubTator shows how multi-word phrases that denote biomedical entities of interest, in this case atypical depression and seasonal affective disorder, are concatenated by underscores to constitute single tokens:

  Chromium has a beneficial effect on eating-related atypical symptoms of depression, and may be a valuable agent in treating atypical_depression and seasonal_affective_disorder.

Percha and Altman (2018) also provide pruned Stanford dependency (De Marneffe and Manning, 2008) parses for the sentences in the corpus, consisting, for each sentence, of the minimal path of dependency relations connecting pairs of biomedical named entities identified by PubTator. Specifically, they extract dependency paths that connect chemicals to genes, chemicals to diseases, genes to diseases, and genes to genes. Figure 2 shows an example of a dependency path of relations between two terms, risperidone and rage. We use these dependency paths as representations for predicates that denote biomedical relationships of interest by concatenating the string representations of each path element, which are shown below the sentence in Figure 2. Following Percha and Altman (2018), we exclude paths that denote a coordinating conjunction between elements and paths that denote an appositive construction, both of which are highly common in the set. In this corpus of 70 million sentences, there are about 44 million unique dependency paths that connect concepts of interest, the vast majority (around 40 million) of which appear just once in the corpus. 540,011 of these paths appear at least 5 times in the corpus.

4 Knowledge Bases

We construct our evaluation data sets with exemplars from knowledge bases for four primary kinds of biomedical relationships, characterized by the interactions between pairs of entities of the following types: chemical-gene, chemical-disease, gene-disease, and gene-gene.

We evaluate on pairs of entities from the following knowledge bases: DrugBank (Wishart et al., 2018), Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005), PharmGKB (PGKB) (Whirl-Carrillo et al., 2012), Reactome (Fabregat et al., 2016), Side Effect Resource (SIDER) (Kuhn et al., 2016), and Therapeutic Target Database (TTD) (Wang et al., 2020).

Each knowledge base consists of pairs of entities that relate in a specific way. For example, SIDER Side Effects consists of chemical-disease-typed pairs such that the chemical is known to have the disease as a side effect, e.g. (sertraline, insomnia). Meanwhile, another chemical-disease pair from a different database, Therapeutic Target Database (TTD) indications, is such that the chemical is indicated as a treatment for the disease, e.g. (carphenazine, schizophrenia). In constructing our evaluation sets, we process all terms such that they are lower-cased, and multi-word terms are concatenated by underscores. Furthermore, we eliminate from our evaluation sets any knowledge base terms that do not appear in the training corpus described in §3 at least 5 times. It should be noted that across these sets, a single biomedical entity may appear with numerous spellings and naming conventions.

Table 2 shows the corresponding relationship type for each of the knowledge bases we use, as well as the number of pairs from each that are used in our evaluation data. The relationship retrieval data consists of knowledge base pairs that appear in our training corpus connected by a dependency path at least once, while the literature-based discovery targets are those knowledge base pairs that do not appear connected by a dependency path in the corpus.
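The following minimal sketch illustrates the evaluation-set preprocessing just described (lower-casing, joining multi-word terms with underscores, and dropping pairs whose terms occur fewer than five times in the training corpus). The helper names and the toy frequency counts are ours, chosen only for illustration; this is not the authors' released code.

def normalize_term(term):
    # Lower-case and join multi-word terms with underscores, mirroring the corpus tokens.
    return "_".join(term.lower().split())

def filter_pairs(kb_pairs, corpus_counts, min_count=5):
    # Keep knowledge-base (X, Y) pairs whose terms each occur at least min_count
    # times in the training corpus; corpus_counts maps normalized terms to frequencies.
    kept = []
    for x, y in kb_pairs:
        x, y = normalize_term(x), normalize_term(y)
        if corpus_counts.get(x, 0) >= min_count and corpus_counts.get(y, 0) >= min_count:
            kept.append((x, y))
    return kept

# Toy example: the pair comes from the SIDER example above; the counts are invented.
print(filter_pairs([("Sertraline", "Insomnia")], {"sertraline": 120, "insomnia": 950}))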

5 Training Details

SGNS  With SGNS, a shallow neural network is trained to estimate the probability of encountering a context term, t_c, within a sliding window centered on an observed term, t_o. The training objective involves maximizing this probability for true context terms, P(t_c | t_o), and minimizing it for randomly drawn counterexamples ¬t_c, P(¬t_c | t_o), with probability estimated as the sigmoid function of the scalar product between the input weight vector for the observed term and the output weight vector of the context term, σ(t_o · t_{c|¬c}). We used the Semantic Vectors implementation (https://github.com/semanticvectors/semanticvectors) of SGNS (which performs similarly to the fastText implementation across a range of analogical retrieval benchmarks (Cohen and Widdows, 2018)) to train 250-dimensional embeddings, with a sliding window radius of two, on the complete set of full sentences from the corpus described in §3 as the training corpus. As previously mentioned, multi-word phrases corresponding to named entities recognized by the PubTator system in these sentences are concatenated by underscores, and consequently receive a single vector representation.

ESD  With ESD, a shallow neural network is trained to estimate the probability of encountering the object, o, of a subject-predicate-object triple sPo. The training objective involves maximizing this probability for true objects, P(o | s, P), and minimizing it for randomly drawn counterexamples ¬o, P(¬o | s, P). We adapted the Semantic Vectors implementation of ESP to encode dependency paths, with binary vectors as the representational basis (Widdows and Cohen, 2012) and the non-negative normalized Hamming distance (NNHD) to estimate the similarity between them:

  NNHD = max(0, 1 - 2 × Hamming distance / dimensionality)

With this representational paradigm, probability can be estimated as NNHD(o, s ⊗ P), where ⊗ represents the use of pairwise exclusive OR as a binding operator, in accordance with the Binary Spatter Code (Kanerva, 1996). While ESP was originally developed to encode knowledge extracted from the literature using a small set of predefined predicates (e.g. TREATS), we adapt it here to encode a large variety (n = 546,085) of dependency paths. For training, we concatenate the dependency relations (the underscored parts in Figure 2) into a single predicate token for which a vector is learned. Some examples of path tokens (concatenated dependency relations) can be seen in Table 1. Unlike the original ESP implementation, where predicate vectors were held constant, we permit dependency path vectors to evolve during training (this capability has been used to predict drug interactions, with performance exceeding that of models with orders of magnitude more parameters (Burkhardt et al., 2019)). Further details on ESP can be found in (Cohen and Widdows, 2017). For the current work, we set the dimensionality at 8000 bits (as this is equivalent in representational capacity to 250-dimensional single precision real vectors). For ESD, Table 1 shows the nearest neighboring dependency path vectors to the bound product I(metformin) ⊗ O(diabetes), illustrating paths that indicate the relationship between these terms, and ESD's capability to learn similar representations for paths with similar meaning.

Both SGNS and ESD were trained over five epochs, with a subsampling threshold of 10^-5, a minimum term frequency threshold of 5 (which includes concatenated dependency paths for ESD), and a maximum frequency threshold of 10^6.
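To make the binary-vector machinery described above concrete, the following minimal sketch implements the NNHD similarity and the XOR binding step over toy random vectors. It is an illustration of the arithmetic only, not the trained ESD model.

import numpy as np

rng = np.random.default_rng(0)
D = 8000  # dimensionality in bits, matching the setting described above

def nnhd(x, y):
    # Non-negative normalized Hamming distance:
    # NNHD = max(0, 1 - 2 * Hamming distance / dimensionality)
    hamming = int(np.count_nonzero(x != y))
    return max(0.0, 1.0 - 2.0 * hamming / len(x))

# Toy binary vectors standing in for a subject vector s, a dependency-path
# (predicate) vector P, and an object vector o.
s = rng.integers(0, 2, D, dtype=np.uint8)
P = rng.integers(0, 2, D, dtype=np.uint8)
o = rng.integers(0, 2, D, dtype=np.uint8)

# The probability of encountering o given (s, P) is estimated as NNHD(o, s XOR P);
# an unrelated random o scores near 0, while the exact bound product scores 1.
bound = np.bitwise_xor(s, P)
print(nnhd(o, bound))      # close to 0 for random vectors
print(nnhd(bound, bound))  # 1.0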

SCORE   PATH
0.974   controlled nmod start entity end entity amod controlled
0.935   add-on nmod start entity end entity amod add-on
0.565   reduces nsubj start entity reduces dobj requirement requirement nmod end entity
0.537   associated compound start entity end entity nsubj associated
0.516   start entity conj efficacy efficacy acl treating treating dobj end entity
0.438   treatment amod start entity treatment nmod end entity

Table 1: Nearest neighboring dependency path embeddings to I(metformin) ⊗ O(diabetes), where I and O indicate input and output weight vectors respectively.

Figure 3: Overview of analogical ranked retrieval paradigm.

6 Evaluation Methods

We use a proportional analogy ranked retrieval task for both the RR and LBD tasks, following prior work as described in §2. Figure 3 visualizes this process. From a set of (X, Y) entity pairs from a knowledge base, given a term C and all terms D such that (C, D) is a pair in the set, we select n random (A, B) cue pairs from a disjoint set of pairs. We refer to (C, D) pairs as 'target pairs,' correct D completions as 'targets,' and (A, B) pairs as 'cues.' The vectors for the cue terms (A, B) and the term C are summed in the following fashion to produce the resulting vector v. Given an analogical pair A:B::C:D, where A and C, B and D are of the same semantic type, respectively, we develop cue vectors for the target D in each model as follows:

  SGNS: v = B - A + C
  ESD:  v = I(A) ⊗ O(B) ⊗ I(C)

where I and O represent the input and output weight vectors of the ESD model, respectively. The SGNS method is the same as the 3COSADD method as described in Levy and Goldberg (2014b).

A K-nearest neighbor search is performed for v (using cosine distance for SGNS, NNHD for ESD) over the search space, and we record the ranks for each correct D target. The search space is constrained such that it consists of those terms from our training corpus that have a vector in both ESD and SGNS, a total of about 300,000 terms overall. For ESD, this space consists of the output weight vectors for each concept. For the proportional analogy task using K-nearest neighbors to rank completions to the analogy, the desired outcome is for the correct targets to be highly similar to the analogy cue vector v, such that the highest ranks are assigned to the correct target terms D in a search over the entire vector space. In this fashion, we perform this KNN search for every (X, Y) pair in the knowledge base and record the ranks for correct targets. We then compare the ranks of terms D across both vector spaces; the higher the ranks, the better the model is at capturing relational similarity.
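The following minimal sketch illustrates the SGNS side of the cue construction and ranking just described: it forms the 3COSADD cue vector v = B - A + C and returns the rank of the target term under cosine similarity. The function name and the toy vectors are ours; the ESD variant would instead bind binary vectors with XOR and rank by NNHD.

import numpy as np

def rank_of_target(A, B, C, target_index, search_space):
    # SGNS cue vector v = B - A + C (the 3COSADD form described above);
    # rank every row of search_space by cosine similarity to v and return the
    # 1-based rank of the target term's row.
    v = B - A + C
    v = v / np.linalg.norm(v)
    space = search_space / np.linalg.norm(search_space, axis=1, keepdims=True)
    order = np.argsort(-(space @ v))  # most similar first
    return int(np.where(order == target_index)[0][0]) + 1

# Toy example with random 250-dimensional vectors standing in for term embeddings.
rng = np.random.default_rng(1)
space = rng.normal(size=(1000, 250))
print(rank_of_target(space[10], space[20], space[30], target_index=42, search_space=space))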

                              Relationship Retrieval                    Literature-based Discovery
                              Total X  Total Pairs  Mean Y/X  Max Y/X   Total X  Total Pairs  Mean Y/X  Max Y/X
Chem-Gene
  Gene Targets (DrugBank)     1626     6290         4         107       3569     37162        10        420
  PGKB                        535      2089         4         48        1563     28053        18        144
  Agonists (TTD)              148      172          1         3         307      462          2         7
  Antagonists (TTD)           188      200          1         2         508      620          1         5
  Gene Targets (TTD)          1179     1436         1         7         4088     6430         2         15
  Inhibitors (TTD)            522      669          1         7         1273     2082         2         15
Chem-Disease
  Side Effects (SIDER)        334      1289         4         31        892      6591         7         46
  Drug Indication (SIDER)     1077     2737         3         22        2160     8356         4         45
  Biomarker-Disease (TTD)     298      417          1         11        253      321          1         6
  Drug Indication (TTD)       1749     1958         1         6         2664     2999         1         10
  Disease Targets (TTD)       710      1502         2         22        1085     3088         3         27
Gene-Disease
  OMIM                        2197     2870         1         9         3461     5545         2         11
  PGKB                        600      1693         3         34        1609     12605        8         73
Gene-Gene
  Enzymes (DrugBank)          966      3622         4         33        1781     16242        9         71
  Carriers (DrugBank)         203      345          2         27        444      1174         3         18
  Transporters (DrugBank)     510      2357         5         44        1140     13889        12        94
  PGKB                        497      2595         5         50        940      14142        15        89
  Complex (Reactome)          1757     3061         2         9         2550     6593         3         31
  Reaction (Reactome)         579      1031         2         9         1274     4024         3         29

Table 2: Total unique X terms, total (X, Y) pairs, average number of correct Y terms per X, and maximum number of correct Y terms per X for each knowledge base.

Table 2 shows, for each knowledge base, how many total unique X terms and total (X, Y) pairs are used for each task. Additionally, we show the average number of correct Y terms per X and the maximum number of correct Y terms per X. For the relationship retrieval task, we consider those (X, Y) pairs which are connected by at least one dependency path in our corpus. Meanwhile, (X, Y) pairs for the LBD task must not be connected by a dependency path in the corpus (we treat these held-out pairs as a proxy for estimating the quality of novel hypotheses). We know from the (X, Y) pair's presence in the knowledge base that it is a gold standard pair for the given relationship type, but from the models' perspective this information is not available from the text alone. Thus, we believe it is a good test of the models' ability to generate plausible hypotheses. To reiterate, the methodology for both the relationship retrieval and literature-based discovery evaluations is the same; the only difference is in which pairs of terms from each knowledge base are used for evaluation data.

We examine the role of increasing the number of cues in improving retrieval. For example, for a given (C, D) target pair, we can combine vectors for multiple (A, B) pairs with the C term vector to produce a final cue vector that is closer to the target D. When multiple cues are used, we superpose the cue vector for each of the cues, and normalize the resulting vector, with normalization of real vectors to unit length in SGNS, and normalization of binary vectors using the majority rule with ties split at random with ESD. Cues are always selected from the subset of knowledge base pairs that co-occur in our training corpus. We ensure that none of the (A, B) cue terms overlap with each other, nor with the (C, D) target terms, to assure that self-similarity does not inflate performance. We produced results for a range of 1, 5, 10, 25, and 50 cues, finding that the best results come from using 25 cues; we only report these resulting scores in §7.
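A minimal sketch of the cue superposition and normalization described above, with unit-length normalization for real-valued (SGNS) cues and a per-bit majority vote with random tie-breaking for binary (ESD) cues; the helper names are ours and the inputs are toy values.

import numpy as np

rng = np.random.default_rng(2)

def superpose_real(cue_vectors):
    # SGNS: sum the real-valued cue vectors and normalize the result to unit length.
    v = np.sum(cue_vectors, axis=0)
    return v / np.linalg.norm(v)

def superpose_binary(cue_vectors):
    # ESD: per-bit majority vote over the binary cue vectors, ties split at random.
    cue_vectors = np.asarray(cue_vectors)
    votes = cue_vectors.sum(axis=0)
    n = cue_vectors.shape[0]
    out = (2 * votes > n).astype(np.uint8)
    ties = (2 * votes == n)
    out[ties] = rng.integers(0, 2, int(ties.sum()))
    return out

print(superpose_real([np.ones(4), np.array([1.0, 0.0, 1.0, 0.0])]))
print(superpose_binary([[1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 1]]))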

Rank  ESD (ours)                                   SGNS
1     separation anxiety                           risperidone ×
2     schizophrenia                                olanzapine ×
3     depressed state                              quetiapine ×
4     bipolar mania                                aripiprazole ×
5     tardive oromandibular dystonia               clozapine ×
6     treatment of trichotillomania ∗              psychiatric and visual disturbances
7     pervasive developmental disorder (NOS) ∗     ziprasidone ×
8     borderline personality disorder              amisulpride ×
9     psychotic disorders                          paliperidone ×
10    mania                                        tardive dyskinesia

Table 3: Top 10 results for a K-nearest neighbor search over terms for treatment targets for the drug risperidone (an antipsychotic drug), using 25 (drug, indication) pairs from SIDER as cues. Bolded terms are correct targets, i.e., they are listed as treatment targets for risperidone in SIDER. ∗: a disorder that risperidone treats or might treat, based on external literature, or a synonym for a target from SIDER; ×: a chemical, i.e., something that could not be a treatment target for a drug.

As a baseline inspired partly by Linzen (2016), we compute the similarity of vectors for B and D terms and C and D terms compared directly to each other, omitting the analogical task. The intuition here is that C and D terms are potentially close together in the vector space merely due to frequent co-occurrence in the corpus, and any analogical reasoning performance is merely relying on that fact. Meanwhile, terms B and D can be close together in the vector space simply because they are the same semantic type, and thus occur in similar contexts. In this case, relational analogy might not explain the performance, but mere distributional similarity. In the B:D comparison setting, cues B are added together to create a single cue vector with which to perform the KNN ranking over terms in which to find the target term D. These cue terms B are extracted from the same A, B cue pairs as those used for the full analogy setting to ensure a reasonable comparison across methods. In the C:D comparison setting, no cues are aggregated.

7 Results

We present qualitative and quantitative results for each vector space model's ability to represent and retrieve relational information.

Qualitative Results  Table 3 shows a side-by-side comparison of the top 10 retrieved terms given the vector for the term risperidone composed with 25 randomly selected (drug, indication) cues from SIDER. The goal is to complete the proportional analogy corresponding to the treatment relationship. Of the top 10 terms retrieved in the ESD vector space, 4 are correct completions to the analogy, while 3 more are plausible completions based on literature. 'Tardive oromandibular dystonia,' while of the correct semantic type targeted by this analogy, is actually a side effect of risperidone. A majority of the retrieved results, however, are known or plausible treatment targets. Meanwhile, most of the top 10 terms retrieved by SGNS are names of other drugs that are similar to risperidone. Additionally, 'psychiatric and visual disturbances' and 'tardive dyskinesia' are side effects of risperidone, not treatment targets. Notably, all of the results retrieved with ESD are of the correct semantic type, i.e., they are disorders, while SGNS retrieves a mix of drugs and side effects.

Quantitative Results  For each C term in each evaluation set, we record the ranks of all D target terms resulting from the K-nearest neighbor search. For ease of comparison, we normalize all raw ranks by the length of the full search space (324,363 terms in total), and then subtract this value from 1 so that lower ranks (i.e., better results) are displayed as higher numbers, for ease of interpretation. For a baseline score, we ran a simulation in which the entire search space was shuffled randomly 100 times, and recorded the median ranks of the multiple target D terms, given some C. We find that the median rank for D terms in a randomly shuffled space tended toward the middle of the ranked list. Thus, the baseline score is established as 0.5; any score lower than this means the model performed worse than a random shuffle at retrieving target terms. In Table 4, 1 is the highest possible score, and 0 is the lowest.

We report results at 25 (A, B) cues, the setting for which performance was best for both ESD and SGNS. 'Full' in Table 4 refers to evaluation with a full A:B::C:D analogy, while 'B:D' refers to the baseline that compares vectors for terms directly, rather than using relational information. We do not report C:D comparison results, as they were categorically worse than both Full and B:D results.

8 Discussion

The results in Table 4 show that ESD outperforms SGNS on the RR task for 18 of 19 databases, and for 17 of 19 databases on the LBD task. It is clear that literature-based discovery is harder than relationship retrieval, as the scores are generally lower across the board for this task. We discuss the results for each task separately.

8.1 Relationship retrieval

For a total of 12 out of 19 sets, ESD on full analogies outperforms ESD on direct B:D comparisons, suggesting that the model has learned generalizable relationship information for these types of relations rather than relying on distributional term similarity. Because gene-gene pairs consist of entities of the same semantic type, it can be argued that B:D similarity should be very high, and yet scores are higher for the full analogy over the B:D baseline for most of these sets, for both ESD and SGNS. For SIDER side effects, the B:D baseline for ESD shows higher scores than the full analogy for both LBD and RR; one reason for this could be that there is a high degree of side effect overlap between drugs, and so the side effect terms themselves are highly similar to each other.

8.2 Literature-based discovery

The best performance on a majority of the sets comes from the ESD B:D model, suggesting that the model relies on term similarity over relational information for performance. Although SGNS doesn't perform the best overall, the full analogy model tends to outperform its B:D counterpart, suggesting that SGNS has managed to extrapolate relational information to the retrieval of held-out targets. As previously mentioned, performance on this task is made difficult due to the lack of normalization of concepts across our datasets. Additionally, as Table 3 shows, several top ranked terms are plausible analogy completions, but do not appear as gold-standard targets in the databases.

                              Relationship retrieval                    LBD
                              ESD (ours)       SGNS                     ESD (ours)       SGNS
                              Full     B:D     Full     B:D             Full     B:D     Full     B:D
Chem-Gene
  Gene Targets (DrugBank)     0.912    0.897   0.839    0.212           0.715    0.806   0.496    0.250
  PGKB                        0.969    0.994   0.705    0.361           0.737    0.918   0.366    0.317
  Agonists (TTD)              0.997    0.907   0.998    0.647           0.802    0.781   0.924    0.708
  Antagonists (TTD)           1.000    0.900   0.999    0.732           0.802    0.703   0.831    0.750
  Gene Targets (TTD)          0.998    0.867   0.994    0.387           0.746    0.760   0.625    0.479
  Inhibitors (TTD)            0.998    0.874   0.993    0.415           0.773    0.759   0.682    0.392
Chem-Disease
  Side Effects (SIDER)        0.997    0.999   0.967    0.942           0.952    0.994   0.799    0.932
  Drug Indication (SIDER)     1.000    0.995   0.949    0.588           0.969    0.988   0.663    0.605
  Biomarker-Disease (TTD)     0.996    0.997   0.944    0.781           0.932    0.977   0.799    0.726
  Drug Indication (TTD)       1.000    0.994   0.981    0.675           0.977    0.992   0.722    0.661
  Disease Targets (TTD)       0.990    0.997   0.900    0.711           0.887    0.989   0.663    0.648
Gene-Disease
  OMIM                        0.997    0.911   0.950    0.599           0.668    0.792   0.578    0.578
  PGKB                        0.982    0.996   0.781    0.624           0.836    0.969   0.592    0.618
Gene-Gene
  Enzymes (DrugBank)          1.000    1.000   0.987    0.981           0.979    0.999   0.900    0.975
  Carriers (DrugBank)         0.987    1.000   0.636    0.555           0.841    0.962   0.360    0.487
  Transporters (DrugBank)     1.000    1.000   0.974    0.947           0.996    0.999   0.870    0.951
  PGKB                        0.999    0.995   0.899    0.471           0.907    0.956   0.479    0.425
  Complex (Reactome)          1.000    0.819   1.000    0.206           0.866    0.731   0.838    0.399
  Reaction (Reactome)         1.000    0.917   0.996    0.273           0.878    0.826   0.699    0.366

Table 4: Results for relationship retrieval (RR) and literature-based discovery (LBD) for full analogy (A:B::C:D) and B:D retrieval. Scores are displayed here as the median of scores (1 - normalized rank) for all D terms in a knowledge base evaluation set.
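As a small worked illustration of the score used in Table 4 (the helper name is ours): a raw rank is divided by the size of the 324,363-term search space and subtracted from 1, so near-top ranks score close to 1 and a mid-list rank scores about 0.5, the chance level established above.

def normalized_score(rank, space_size=324363):
    # Score reported in Table 4: 1 minus the rank normalized by the size of the
    # search space, so better (lower) ranks map to values near 1; chance is ~0.5.
    return 1.0 - rank / space_size

print(round(normalized_score(33), 4))       # a highly ranked target: ~0.9999
print(round(normalized_score(162182), 4))   # a mid-list rank: ~0.5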

Considering the case of SIDER, which is built from automatically extracted information (not human-curated), the plausible results here are missing from the database but are supported by evidence from published papers (e.g. Oravecz and Štuhec (2014)).

9 Conclusion

We have compared two vector space models of language, Embedding of Structural Dependencies and Skip-gram with Negative Sampling, for their ability to represent biomedical relationships from literature in an analogical retrieval task. Our results suggest that encoding structural information in the form of dependency paths connecting biomedical entities of interest can improve performance on two analogical retrieval tasks, relationship retrieval and literature-based discovery. In future work, we would like to compare our methods with knowledge base completion techniques using contextualized vectors from language models, as in Bosselut et al. (2019), as another method applicable to literature-based discovery.

Acknowledgements

This research was supported by U.S. National Library of Medicine Grant No. R01 LM011563. The authors would like to thank the anonymous reviewers for their feedback.

References

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779, Florence, Italy. Association for Computational Linguistics.

Hannah A Burkhardt, Devika Subramanian, Justin Mower, and Trevor Cohen. 2019. Predicting adverse drug-drug interactions with neural embedding of semantic predications. In AMIA Annual Symposium Proceedings, volume 2019, page 992. American Medical Informatics Association.

Trevor Cohen and Dominic Widdows. 2009. Empirical distributional semantics: methods and biomedical applications. Journal of Biomedical Informatics, 42(2):390–405.

Trevor Cohen and Dominic Widdows. 2017. Embedding of semantic predications. Journal of Biomedical Informatics, 68:150–166.

Trevor Cohen and Dominic Widdows. 2018. Bringing order to neural word embeddings with embeddings augmented by random permutations (EARP). In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 465–475.

Trevor Cohen, Dominic Widdows, Roger Schvaneveldt, and Thomas C Rindflesch. 2011. Finding schizophrenia's prozac: emergent relational similarity in predication space. In International Symposium on Quantum Interaction, pages 48–59. Springer.

Marie-Catherine De Marneffe and Christopher D Manning. 2008. The Stanford typed dependencies representation. In Coling 2008: Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1–8. Association for Computational Linguistics.

Lance De Vine, Guido Zuccon, Bevan Koopman, Laurianne Sitbon, and Peter Bruza. 2014. Medical semantic similarity with a neural language model. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pages 1819–1822.

Antonio Fabregat, Konstantinos Sidiropoulos, Phani Garapati, Marc Gillespie, Kerstin Hausmann, Robin Haw, Bijay Jassal, Steven Jupe, Florian Korninger, Sheldon McKay, et al. 2016. The Reactome pathway knowledgebase. Nucleic Acids Research, 44(D1):D481–D487.

Safa Fathiamini, Amber M Johnson, Jia Zeng, Vijaykumar Holla, Nora S Sanchez, Funda Meric-Bernstam, Elmer V Bernstam, and Trevor Cohen. 2019. Rapamycin - mTOR + BRAF = ? Using relational similarity to find therapeutically relevant drug-gene relationships in unstructured text. Journal of Biomedical Informatics, 90:103094.

Dedre Gentner and Arthur B Markman. 1997. Structure mapping in analogy and similarity. American Psychologist, 52(1):45.

Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: What works and what doesn't. In Proceedings of the NAACL-HLT SRW, pages 47–54, San Diego, California, June 12-17, 2016. ACL.

Ada Hamosh, Alan F Scott, Joanna S Amberger, Carol A Bocchini, and Victor A McKusick. 2005. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research, 33(suppl 1):D514–D517.

Pentti Kanerva. 1996. Binary spatter-coding of ordered k-tuples. In International Conference on Artificial Neural Networks, pages 869–873. Springer.

Michael Kuhn, Ivica Letunic, Lars Juhl Jensen, and Peer Bork. 2016. The SIDER database of drugs and side effects. Nucleic Acids Research, 44(D1):D1075–D1079.

Omer Levy and Yoav Goldberg. 2014a. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302–308.

Omer Levy and Yoav Goldberg. 2014b. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171–180.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Zhiheng Li, Zhihao Yang, Chen Shen, Jun Xu, Yaoyun Zhang, and Hua Xu. 2019. Integrating shortest dependency path and sentence sequence into a deep learning framework for relation extraction in clinical text. BMC Medical Informatics and Decision Making, 19(1):22.

Tal Linzen. 2016. Issues in evaluating semantic spaces using word analogies. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 13–18, Berlin, Germany. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of International Conference on Learning Representations (ICLR).

Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia. Association for Computational Linguistics.

Denis Newman-Griffis, Albert Lai, and Eric Fosler-Lussier. 2017. Insights into analogy completion from the biomedical domain. In BioNLP 2017, pages 19–28, Vancouver, Canada. Association for Computational Linguistics.

Robert Oravecz and Matej Štuhec. 2014. Trichotillomania successfully treated with risperidone and naltrexone: a geriatric case report. Journal of the American Medical Directors Association, 15(4):301–302.

Bethany Percha and Russ B Altman. 2018. A global network of biomedical relationships derived from text. Bioinformatics, 34(15):2614–2624.

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, Brussels, Belgium. Association for Computational Linguistics.

Don R Swanson. 1986. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 30(1):7–18.

Peter D Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Yunxia Wang, Song Zhang, Fengcheng Li, Ying Zhou, Ying Zhang, Zhengwen Wang, Runyuan Zhang, Jiang Zhu, Yuxiang Ren, Ying Tan, et al. 2020. Therapeutic Target Database 2020: enriched resource for facilitating research and early development of targeted therapeutics. Nucleic Acids Research, 48(D1):D1031–D1041.

Koki Washio and Tsuneaki Kato. 2018. Filling missing paths: Modeling co-occurrences of word pairs and dependency paths for recognizing lexical semantic relations. arXiv preprint arXiv:1809.03411.

Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. 2013. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research, 41(W1):W518–W522.

Michelle Whirl-Carrillo, Ellen M McDonagh, JM Hebert, Li Gong, K Sangkuhl, CF Thorn, Russ B Altman, and Teri E Klein. 2012. Pharmacogenomics knowledge for personalized medicine. Clinical Pharmacology & Therapeutics, 92(4):414–417.

Dominic Widdows and Trevor Cohen. 2012. Real, complex, and binary semantic vectors. In International Symposium on Quantum Interaction, pages 24–35. Springer.

Dominic Widdows and Trevor Cohen. 2014. Reasoning with vectors: A continuous model for fast robust inference. Logic Journal of the IGPL, 23(2):141–173.

David S Wishart, Yannick D Feunang, An C Guo, Elvis J Lo, Ana Marcu, Jason R Grant, Tanvir Sajed, Daniel Johnson, Carin Li, Zinat Sayeeda, et al. 2018. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Research, 46(D1):D1074–D1082.

Yijia Zhang, Hongfei Lin, Zhihao Yang, Jian Wang, Shaowu Zhang, Yuanyuan Sun, and Liang Yang. 2018. A hybrid model based on neural networks for biomedical relation extraction. Journal of Biomedical Informatics, 81:83–92.

Huiwei Zhou, Shixian Ning, Yunlong Yang, Zhuang Liu, Chengkun Lang, and Yingyu Lin. 2018. Chemical-induced disease relation extraction with dependency information and prior knowledge. Journal of Biomedical Informatics, 84:171–178.

Yongjun Zhu, Olivier Elemento, Jyotishman Pathak, and Fei Wang. 2019. Drug knowledge bases and their applications in biomedical informatics research. Briefings in Bioinformatics, 20(4):1308–1321.

DeSpin: a prototype system for detecting spin in biomedical publications

Anna Koroleva (Zurich University of Applied Sciences (ZHAW), Waedenswil, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland) [email protected]
Sanjay Kamath (Total, France) [email protected]

Patrick M.M. Bossuyt (Academic Medical Center, University of Amsterdam, Amsterdam, Netherlands) [email protected]
Patrick Paroubek (LIMSI, CNRS, Université Paris-Saclay, Orsay, France) [email protected]

Abstract

Improving the quality of medical research reporting is crucial to reduce avoidable waste in research and to improve the quality of health care. Despite various initiatives aiming at improving research reporting - guidelines, checklists, authoring aids, peer review procedures, etc. - overinterpretation of research results, also known as distorted reporting or spin, is still a serious issue in research reporting.

In this paper, we propose a Natural Language Processing (NLP) system for detecting several types of spin in biomedical articles reporting randomized controlled trials (RCTs). We use a combination of rule-based and machine learning approaches to extract important information on trial design and to detect potential spin. The proposed spin detection system includes algorithms for text structure analysis, sentence classification, entity and relation extraction, and semantic similarity assessment. Our algorithms achieved operational performance for these tasks, with F-measure ranging from 79.42% to 97.86% for different tasks. The most difficult task is extracting reported outcomes.

Our tool is intended to be used as a semi-automated aid tool for assisting both authors and peer reviewers to detect potential spin. The tool incorporates a simple interface that allows the user to run the algorithms and visualize their output. It can also be used for manual annotation and correction of errors in the outputs. The proposed tool is the first tool for spin detection. The tool and the annotated dataset are freely available.

(At the time of the reported work, Anna Koroleva was a PhD student at LIMSI-CNRS in Orsay, France and at the Academic Medical Center, University of Amsterdam in Amsterdam, the Netherlands. Sanjay Kamath was a PhD student at LIMSI-CNRS and LRI, Univ. Paris-Sud in Orsay, France.)

1 Background

It is widely acknowledged nowadays that the quality of reporting of research results in the clinical domain is suboptimal. As a consequence, research findings can often not be replicated, and billions of euros may be wasted yearly (Ioannidis, 2005).

Numerous initiatives aim at improving the quality of research reporting. Guidelines and checklists have been developed for every type of clinical research. Still, the quality of reporting remains low: authors fail to choose and follow a correct guideline/checklist (Samaan et al., 2013). Automated tools, such as Penelope (https://www.penelope.ai/), have been introduced to facilitate the use of guidelines and checklists, and authoring aids have been shown to improve the completeness of reporting (Barnes et al., 2015).

Enhancing the quality of peer reviewing is another step towards improving research reporting. Peer reviewing requires assessing a large number of information items. Nowadays, Natural Language Processing (NLP) is applied to facilitate laborious manual tasks such as indexing of medical literature (Huang et al., 2011) and the systematic review process (Ananiadou et al., 2009). Similarly, the peer reviewing process can be partially automated with the help of NLP.

Our project tackles a specific issue of research reporting that, to our knowledge, has not been addressed by the NLP community: spin, also referred to as overinterpretation of research results. In the context of clinical trials assessing a new (experimental) intervention, spin consists in exaggerating the beneficial effects of the studied intervention (Boutron et al., 2010).

Spin is common in articles reporting randomized controlled trials (RCTs) - clinical trials comparing health interventions, to which participants are allocated randomly to avoid biases - with a non-significant primary outcome. Abstracts are more prone to spin than full texts. Spin is found in a high percentage of abstracts of articles in surgical research (40%) (Fleming, 2016), cardiovascular diseases (57%) (Khan et al., 2019), cancer (47%) (Vera-Badillo et al., 2016), obesity (46.7%) (Austin et al., 2018), otolaryngology (70%) (Cooper et al., 2018), anaesthesiology (32.2%) (Kinder et al., 2018), and wound care (71%) (Lockyer et al., 2013). Although the problem of spin has started to attract attention in the medical community in recent years, the observed prevalence of spin shows that it often remains unnoticed by editors and peer reviewers.

Abstracts are often the only part of the article available to readers, and spin in abstracts of RCTs poses a serious threat to the quality of health care by causing overestimation of the intervention by clinicians (Boutron et al., 2014), which may lead to the use of an ineffective or unsafe intervention in clinical practice. Besides, spin in research articles is linked to spin in press releases and health news (Haneef et al., 2015; Yavchitz et al., 2012), which has the negative impact of raising false expectations regarding the intervention among the public.

The importance of the problem of spin motivated our work. We aimed at developing NLP algorithms to aid authors and readers in detecting spin. We focused on randomized controlled trials (RCTs) as they are the most important source of evidence for evidence-based medicine, and spin in RCTs has a high negative impact.

Our work lies within the scope of the Methods in Research on Research (MiRoR) project (http://miror-ejd.eu/), an international project devoted to improving the planning, conduct, reporting and peer reviewing of health care research. For the design and development of our toolkit, we benefited from advice from the MiRoR consortium members.

In this paper, we introduce a prototype of a system, called DeSpin (Detector of Spin), that automatically detects potential spin in abstracts of RCTs and relevant supporting information. This prototype comprises a set of spin-detecting algorithms and a simple interface to run the algorithms and display their output.

This paper is organized as follows: first, we provide an overview of some existing semi-automated aid systems for authors, reviewers and readers of biomedical articles. Second, we introduce in more detail the notion of spin, the types of spin that we address, and the information that is required to assess an article for spin. After that, we describe our current algorithms and the methods employed, and provide their evaluation. Finally, we discuss the potential future development of the prototype.

2 Related work

Although there has been no attempt to automate spin detection in biomedical articles, a number of works have addressed developing automated aid tools to assist authors and readers of scientific articles in performing various other tasks. Some of these tools were tested and were shown to reduce the workload and improve the performance of human experts on the corresponding task.

2.1 Authoring aid tools

Barnes et al. (2015) assessed the impact of a writing aid tool based on the CONSORT statement (Schulz et al., 2010) on the completeness of reporting of RCTs. The tool was developed for six domains of the Methods section (trial design, randomization, blinding, participants, interventions, and outcomes) and consisted of reminders of the corresponding CONSORT item(s), bullet points enumerating the key elements to report, and good reporting examples. The tool was assessed in an RCT in which the participants were asked to write a Methods section of an article based on a trial protocol, either using the aid tool ('intervention' group) or without using the tool ('control' group). The results of 41 participants showed that the mean global score for reporting completeness was higher with the use of the tool than without it.

2.2 Aid tools for readers and reviewers

Kiritchenko et al. (2010) developed a system called ExaCT to automatically extract 21 key characteristics of clinical trial design, such as treatment names, eligibility criteria, outcomes, etc. ExaCT consists of an information extraction algorithm that looks for text fragments corresponding to the target information elements, and a web-based user interface through which human experts can view and correct the suggested fragments.

The National Library of Medicine's Medical Text Indexer (MTI) is a system providing automatic recommendations based on Medical Subject Headings (MeSH) terms for indexing medical articles (Mork et al., 2013). MTI is used to assist human indexers, catalogers, and NLM's History of Medicine Division in their work. Its use by indexers was shown to grow over the years (it was used to index 15.75% of the articles in 2002 vs 62.44% in 2014) and to improve the performance (precision, recall and F-measure) of indexers (Mork et al., 2017).

Marshall et al. (2015) addressed the task of automating the assessment of risk of bias in clinical trials. Bias is a phenomenon related to spin: it is a systematic error or a deviation from the truth in the results or conclusions that can cause an under- or overestimation of the effect of the examined treatment (Higgins and Green, 2008). The authors developed a system called RobotReviewer that used machine learning to assess an article for the risk of different types of bias and to extract text fragments that support these judgements. These works showed that automated risk of bias assessment can achieve reasonable performance, and that the extraction of supporting text fragments reached similar quality to that of human experts. Marshall et al. (2017) further developed RobotReviewer, adding functionality for extracting the PICO (Population, Interventions/Comparators, Outcomes) elements from articles and detecting study design (RCT), for the purpose of automated evidence synthesis. Soboczenski et al. (2019) assessed RobotReviewer in a user study involving 41 participants, evaluating time spent for bias assessment, text fragment suggestions by machine learning, and usability of the tool. Semi-automation in this study was shown to be quicker than manual assessment; 91% of the automated risk of bias judgments and 62% of supporting text suggestions were accepted by the human reviewers.

The cited works demonstrate that semi-automated aid tools can prove useful for both authors and readers/reviewers of medical articles and have the potential to improve the quality of the articles and facilitate the analysis of the texts.

3 Spin: definition and types

We adopt the definition and classification of spin introduced by Boutron et al. (2010) and Lazarus et al. (2015), who divided instances of spin into several types and subtypes.

We addressed the following types of spin:

1. Outcome switching - unjustified change of the pre-defined trial outcomes, leading to reporting only the favourable outcomes that support the hypothesis of the researchers (Goldacre et al., 2019). Outcome switching is one of the most common types of spin. It can consist in omitting the primary outcome in the results/conclusions of the abstract, or in focusing on significant secondary outcomes, e.g.:

   The primary end point of this trial was overall survival. <...> This trial showed a significantly increased R0 resection rate although it failed to demonstrate a survival benefit.

   In this example, the primary outcome ("overall survival"), the results for which were not favourable, is mentioned in the conclusion, but it is not reported in the first place and occurs within a concessive clause (starting with "although"). This way of reporting puts the focus on the other, favourable, outcome ("R0 resection rate").

2. Interpreting a non-significant outcome as a proof of equivalence of the treatments, e.g.:

   The median PFS was 10.3 months in the XELIRI and 9.3 months in the FOLFIRI arm (p = 0.78). Conclusion: The XELIRI regimen showed similar PFS compared to the FOLFIRI regimen.

   The results for the outcome "median PFS" are not significant, which is often erroneously interpreted as a proof of similarity of the treatments. However, a non-significant result means that the null hypothesis of a difference could not be rejected, which is not equivalent to a demonstration of similarity of the treatments. This would require the rejection of the null hypothesis of a difference, or a substantial difference, in outcomes between treatments.

3. Focus on within-group comparisons, e.g.:

   Both groups showed robust improvement in both symptoms and functioning.

   The goal of randomized controlled trials is to compare two treatments with regard to some outcomes. If the superiority of the experimental treatment over the control treatment was not shown, within-group comparisons (reporting the changes within a group of patients receiving a treatment, instead of comparing patients receiving different treatments) can be used to persuade the reader of beneficial effects of the experimental treatment.

Two concepts are vital for spin detection and play a key role in our algorithms:

1. The primary outcome of a trial - the most important variable monitored during the trial to assess how the studied treatment impacts it. Primary outcomes are recorded in trial registries (open online databases storing information about registered clinical trials), and should be defined in the text of clinical articles, e.g.:

   The primary end point was a difference of > 20% in the microvascular flow index of small vessels among groups.

2. Statistical significance of the primary outcome. Statistical hypothesis testing is used to check for a significant difference in outcomes between two patient groups, one receiving the experimental treatment and the other receiving the control treatment. Statistical significance is often reported as a P-value compared to a pre-defined threshold, usually set to 0.05. Spin most often occurs when the results for the primary outcome are not significant (Boutron et al., 2010; Fleming, 2016; Khan et al., 2019; Vera-Badillo et al., 2016; Austin et al., 2018; Cooper et al., 2018; Kinder et al., 2018; Lockyer et al., 2013), although trials with a significant effect on the primary outcome may also be prone to spin (Beijers et al., 2017).

   Trial results are commonly reported as an effect on the (primary) outcome, along with the p-value. (It is important to distinguish between the notions of outcome, effect and result in this context: an outcome is a measure/variable monitored during a clinical trial; effect refers to the change in an outcome observed during a trial; trial results refer to the set of effects for all measured outcomes.) For example:

   Microcirculatory flow indices of small and medium vessels were significantly higher in the levosimendan group as compared to the control group (p < 0.05).

   Statistical significance levels of trial outcomes are vital for spin detection, as spin is commonly related to non-significant results for the primary outcome, or to selective reporting of significant outcomes only.

4 Algorithms

Spin is a complex notion and thus detecting spin cannot be seen as a binary classification problem. We believe that the most viable approach to spin detection is to assess each (sub)type of spin separately. We aimed at developing algorithms to extract and analyse pieces of information relevant to the addressed types of spin. The extracted information and its analysis, provided by our tool, can help human experts in making the conclusion on presence or absence of spin of the given (sub)type.

Detection of spin and related information is a complex task which cannot be fully automated. Our system is designed as a semi-automated tool that finds potential instances of the addressed types of spin and extracts the supporting information that can help the user to make the final decision on the presence of spin. In this section, we present the algorithms currently included in the system, according to the types of spin that they are used to detect.

As we aim at detecting spin in the Results and Conclusions sections of articles' abstracts, we first need an algorithm analyzing the given article to detect its abstract and the Results and Conclusions sections within the abstract. We will not mention this algorithm in the list of algorithms for each spin type to avoid repetition: if we talk about extracting some information from the abstract, it implies that the text structure analysis algorithm was applied.

4.1 Outcome switching

We focus on the switching (change/omission) of the primary outcome. Primary outcome switching can occur at several points:

• the primary outcome(s) recorded in the trial registry can differ from the primary outcome(s) declared in the article;

• the primary outcome(s) declared in the abstract can differ from the primary outcome(s) declared in the body of the article;

• the primary outcome(s) recorded in the trial registry can be omitted when reporting the results for the outcomes in the abstract;

• the primary outcome(s) recorded in the article can be omitted when reporting the results for the outcomes in the abstract.

52 Primary outcome switching detection involves the ”overall survival” and ”survival” is high, thus, following algorithms: we conclude that the primary outcome ”over- all survival” is reported as ”survival”. 1. Identification of primary outcomes in trial reg- istries and in the article’s text. As semantic similarity often depends on the context, the conclusions of the system are pre- 2. Identification of reported outcomes from sen- sented to the user, who can check them to tences reporting the results, e.g. (reported make the conclusions on correctness of the outcomes are in bold): analysis. The results of this study showed that symptom 4. Assessing the discourse prominence of the in massage group were improved sig- Scores reported primary outcome (detected by the nificantly compared with control group, and previous algorithms) by checking if it is re- the rate of , and in the dyspnea cough wheeze ported the first place among all the outcomes; experimental group than the control group if it is reported in a concessive clause. were reduced by approximately 45%, 56% and 52%. In the example above, the system will de- tect that the primary outcome ”survival” is 3. Assessment of semantic similarity of pairs of reported within a concessive clause (starting outcomes extracted by the above algorithms to by ”although”) and will flag the sentence as check for missing outcomes. We perform the potentially focusing on secondary outcomes. assessment for the following sets of outcomes: 4.2 Interpreting non-significant outcome as a The primary outcome extracted from the • proof of equivalence of the treatments registry is compared to the primary out- As we stated above, conclusions on the similar- come(s) declared in the article; ity/equivalence of the studies treatments are justi- The primary outcome extracted from the • fied only if the trial was of non-inferiority or equiv- abstract is compared to the primary out- alence type. Thus, we employ two algorithms to come(s) declared in the body of the arti- detect this type of spin: cle; The primary outcome extracted from the 1. Identification of statements of similarity be- • article is compared to the outcomes re- tween treatments, e.g.: ported in the abstract; Both products caused similar leukocyte The primary outcome extracted from the • counts diminution and had similar safety pro- registry is compared to the outcomes re- files. ported in the abstract. 2. Identifying the markers of non-inferiority or These assessments allow to detect switching equivalence trial design, e.g.: of the primary outcome at all the possible stages. If the primary outcome in the registry ONCEMRK is a phase 3, multicenter, double- and in the article, or in the abstract and body of blind, noninferiority trial comparing ralte- the article differ, we conclude that there is po- gravir 1200mg QD with raltegravir 400mg tential outcome switching, which is reported BID in treatment-naive HIV-1–infected adults. to the user. Similarly, if the primary outcome If there is a statement of similarity of treatments (from the article or from the registry) is miss- while no markers of non-inferiority / equivalence ing from the list of the reported outcomes, we design are found, we conclude the presence of spin suspect selective reporting of outcomes, and and report it to the user. the system reports it to the user. 
In the example on the page3, the system 4.3 Focus on within-group comparisons should extract ”overall survival” as the pri- Any statement in the results and conclusions of the mary outcome, and ”R0 resection rate” and abstract that presents a comparison of two states ”survival” as reported outcomes. The similar- of a patient group without comparing it to another ity between ”overall survival” and ”R0 resec- group is a within-group comparison. This type of tion rate” is low, while the similarity between spin is detected by a single algorithm that identifies

53 within-group comparisons that are further reported To find the abstract, we use regular expres- • to the user: sions rules that are evaluated on the set of Young Mania Rating Scale total scores improved 3938 PubMed Central (PMC)4 articles in with ritanserin. XML format with a specific tag for the ab- stract, used as the gold standard. To evaluate 4.4 Other algorithms our algorithm, we applied it to the raw texts We support extraction of some information that is extracted from the XML files and compared not directly involved in the detection of spin, but the extracted abstracts to those obtained using that can help user in spin assessment and that can the XML tag. be used in the future when new spin types are added. To extract outcomes from trial registries, we The algorithms include: • use regular expressions to extract the trial reg- istration number from the article; using it, we 1. Extraction of measures of statistical signifi- find on the web, download and parse the reg- cance, both numerical and verbal (in bold): istry entry corresponding to the trial. Study group patients had a significant lower To extract significance levels, we use rules reintubation rate than did controls; six • patients (17%) versus 19 patients (48%), based on regular expressions and token, P<0.05; respectively. lemma and pos-tag information. To assess the discourse prominence of an out- 2. Extraction of the relation between the reported • come, to detect statements of similarity be- outcomes and their statistical significance, ex- tween treatments, within-group comparisons tracted at the previous stages. For the ex- and markers of non-inferiority design, we em- ample above, we extract pairs (”reintubation ploy rules based on token, lemma and pos-tag rate”, ”significant”) and (”reintubation rate”, information. ”P<0.05”). We annotated abstracts of 180 articles (2402 These algorithms, in combination with the sentences) for similarity statements and assessment of semantic similarity of extracted within-group comparisons (Koroleva, 2020). outcomes, allows to identify the significance The proportion of these types of statements level for the primary outcome. in our corpus is low: we identified only 72 similarity statements and 127 within-group 5 Methods comparisons. The evaluation of statements In this section, we briefly outline the methods used of similarity between treatments and within- in our algorithms, the datasets used for evaluation, group comparisons was performed with two and the current performance of the algorithms. Our settings: 1) using the whole text of abstracts; approach is based on some previous works for the 2) using only the Results and Conclusions related tasks. As the details on development of the sections of the abstract, which raised the pre- algorithms, annotating the data and testing differ- cision, as expected (Table1). ent approaches are described in detail in the corre- 5.2 Machine learning methods sponding articles, we limit ourselves here to only a brief description of the best-performing method For the core tasks of our system, we either used that we selected for each task. an existing annotated corpus or annotated our own The methods we employ can be divided into two corpora. 
Our corpora were annotated by a single groups: machine learning, including deep learning, annotator (AK), consulted by consulted our med- used for the core tasks for which we have sufficient ical advisors from the MiRoR network (Isabelle training data, and rule-based methods, used for the Boutron, Patrick Bossuyt and Liz Wager). simpler tasks or for tasks where we do not have We tested several approaches for each task, enough data for machine learning. including rule-based and machine-learning ap- proaches (see details below). Overall, we found 5.1 Rule-based methods that the best performance on our tasks was shown We developed rules for the following tasks: 4https://www.ncbi.nlm.nih.gov/pmc/

54 Algorithm Method Annotated dataset Precision Recall F1 Primary outcomes Deep 2,000 sentences / 1,694 86.99 90.07 88.42 extraction learning outcomes Reported out- Deep 1,940 sentences / 2,251 81.17 78.09 79.42 comes extraction learning outcomes Outcome similar- Deep 3,043 pairs of outcomes 88.93 90.76 89.75 ity assessment learning Similarity state- Rules 180 abstracts / 2402 sen- ments extraction tences whole abstract 77.8 87.5 82.4 results and conclusions 85.1 87.5 86.3 Within-group com- Rules 180 abstracts / 2402 sen- parisons tences whole abstract 53.2 90.6 67.1 results and conclusions 71.9 90.6 80.1 Abstract extrac- Rules 3938 abstracts 94.7 94 94.3 tion Text structure Deep PubMed200k 97.82 95.81 96.8 analysis: sections learning of abstract Extraction of sig- Rules 664 sentences / 1,188 99.18 96.58 97.86 nificance levels significance level markers Outcome - signif- Deep 2,678 pairs of outcomes 94.3 94 94 icance level rela- learning and significance level tion extraction markers

Table 1: Overview of algorithms, methods, results and annotated datasets by a deep learning approach that was recently We compared a rule-based approach and BERT, proved to be highly successful in many NLP appli- SciBERT and BioBERT models, fine-tuned for the cations. It employs language representations pre- sentence classification task on the PubMed 200k trained on large unannotated data and fine-tuned dataset. The best performance was shown by the on a relatively small amount of annotated data for fine-tuned BioBERT model. a specific downstream task. The language repre- sentations that we tested include: BERT (Bidirec- 5.2.2 Outcome extraction tional Encoder Representations from Transformers) The outcome extraction task includes two subtasks: models (Devlin et al., 2018), trained on a general- extracting primary and reported outcomes. For domain corpus of 3.3B words; BioBERT model each subtask, we annotated a separate corpus. For (Lee et al., 2019), trained on the BERT corpus and primary outcome extraction, we annotated a cor- a biomedical corpus of 18B words; and SciBERT pus of 2,000 sentences, coming from 1,672 articles. models (Beltagy et al., 2019), trained on the BERT The sentences were selected randomly, from both corpus and a scientific corpus of 3.1B words. For abstracts and full texts, without restriction to a par- each task, we chose the best-performing model. ticular medical domain. A total of 1,694 primary Details about the annotated datasets that we used outcomes was annotated (Koroleva, 2019a). For re- and the tested approaches can be found below. The ported outcome extraction, we annotated reported best results for each task are summarised in Table1. outcomes in the abstracts of articles for which we annotated the primary outcomes. The corpus con- 5.2.1 Identification of sections in the abstract tains 1,940 sentences from 402 articles, with a total For identifying sections within the abstract (in par- of 2,251 reported outcomes (Koroleva, 2019a). ticular, Results and Conclusions), we used the We compared a rule-based system and several PubMed 200k dataset introduced in Dernoncourt machine learning algorithms for primary and re- and Lee(2017). This dataset contains approxi- ported outcome extraction. Details about the an- mately 200,000 abstracts of RCTs with 2.3 million notated datasets and the methods that we tested sentences. Each sentence is annotated with one of can be found in Koroleva et al.(EasyChair, 2020). the following classes, corresponding to the sections We selected the best performing approach to be of the abstract: background, objective, method, re- included in our tool sult, or conclusion. We used the train-dev-test split For primary outcomes extraction, the best perfor- provided by the developers of the dataset. mance was demonstrated by the BioBERT model

55 fine-tuned for named entity recognition task on our 6 Interface corpus of 2,000 sentences annotated for primary Our prototype system allows the user to load a outcomes. For reported outcomes extraction, the text (with or without annotations), run algorithms, best performance was achieved by the SciBERT visualize their output, correct, add or remove anno- model fine-tuned for named entity recognition task tations. The expected input is an article reporting on our corpus of 1,940 sentences annotated with an RCT in the text format, including the abstract. reported outcomes. Figure1 shows the interface with an example of a processed text. 5.2.3 Assessment of semantic similarity of The main items of the drop-down menu on the outcomes top of the page are Annotations, allowing to visu- alize and manage the annotations, and Algorithms, To annotate semantic similarity between outcomes, allowing to run the described algorithms to detect we used pairs of sentences from our corpora of potential spin and the related information. The text outcomes: the first sentence in each pair comes fragments identified by the algorithms can be high- from the corpus of primary outcomes, the second lighted in the text. When running the algorithms, sentence comes from the corpus of reported out- a report is generated that contains the extracted comes, and both sentences are from the same arti- information and its analysis by the tool (e.g. a cle. We assigned a binary label of similarity (sim- mismatch between the outcomes in the text and ilar/dissimilar) to each pair of outcomes in each in the trial registry; absence of the declared pri- sentence pair. The corpus contains 3,043 pairs of mary outcome among the reported outcomes in outcomes (Koroleva, 2019b). the abstract). The report is saved into the Meta- We tested several semantic similarity measures data section of Annotations menu, which can be (string-based, lexical, vector-based) and the BERT, accessed through the interface, and can be exported SciBERT and BioBERT models, fine-tuned for sen- to a file via the Generate report item of the Algo- tence pair classification task on the corpus of out- rithms menu. Human experts can use this report come pairs. Details on the corpus annotation and to check the extracted information and the analysis on the methods tested can be found in Koroleva performed by the tool, and to make a final decision et al.(2019). The best performance was shown by on the presence/absence of a given type of spin. the fine-tuned BioBERT model. 7 Results and conclusions 5.2.4 Extraction of the relation between The current functionality, methods in use, anno- reported outcomes and statistical tated datasets and the best achieved results are out- significance levels lined in Table1. Performance is assessed per-token for outcome and significance level extraction and To annotate the relation between reported outcomes per-unit for other tasks. and statistical significance levels, we selected sen- In this paper, we presented a first prototype tool tences containing markers of statistical significance for assisting authors and reviewers to detect spin from the corpus annotated with reported outcomes. and related information in abstracts of articles re- We annotated the pairs of outcomes and signifi- porting RCTs. 
The employed algorithms show op- cance levels with a binary label (“positive”: the erational performance in complex semantic tasks, significance level is related to the outcome; “neg- even with relatively low volume of available an- ative”: the significance level is not related to the notated data. We envisage two possible applica- outcome). The final corpus contains 663 sentences tions of our system: as an authoring aid or as peer- with 2,552 annotated relations (Koroleva, 2019c). reviewing tool. The authoring aid version can be We tested several machine learning algorithms further developed into an educational tool, explain- and the BERT, SciBERT and BioBERT model fine- ing the notion of spin and its types to the user. tuned for the relation extraction task on the anno- Possible directions for future work include: im- tated corpus. The details on the corpus and the proving the implementation and interface (adding method can be found in Koroleva and Paroubek prompts for interaction with the user; facilitating (2019). The best result for this task was achieved installation process), algorithms (improving cur- by the fine-tuned BioBERT model. rent performance, adding detection of new spin

56 Figure 1: Example of a processed text types), application (promoting the tool among the spin, prevalent in the biomedical domain, would target audience; encouraging users to submit their not be relevant in domains that do not use the no- manually annotated data, to be used to improve the tion of outcome). Hence, we suppose that spin- algorithms), and optimization (parallel processing detection algorithms are domain-specific as well of multiple input text files). Our system can be and cannot be applied to other domains. easily incorporated into other text processing tools. Another interesting yet challenging direction for 8 Availability the future work is detecting spin/distorted reporting in texts belonging to scientific domains other than The proposed prototype tool and associated models biomedicine. First of all, a qualitative study of spin are available at: is needed to define and classify spin in each scien- https://github.com/aakorolyova/DeSpin. tific domain (similar to the work of Boutron et al. (2010) and Lazarus et al.(2015) for clinical trials). To our best knowledge, there have been no attempts Acknowledgments to conduct such a study for non-biomedical texts. It is therefore difficult to hypothesise whether spin- This project has received funding from the Eu- detection algorithms developed for texts reporting ropean Union’s Horizon 2020 research and inno- clinical trials could be applicable for other domains. vation programme under the Marie Sklodowska- It appears that the definition and the types of spin Curie grant agreement No 676207. are domain-specific (e.g. outcome-related types of

57 References Ben Goldacre, Henry Drysdale, Aaron Dale, Ioan Milo- sevic, Eirion Slade, Philip Hartley, Cicely Marston, Sophia Ananiadou, Brian Rea, Naoaki Okazaki, Rob Anna Powell-Smith, Carl Heneghan, and Kamal R. Procter, and James Thomas. 2009. Supporting Mahtani. 2019. Compare: a prospective cohort systematic reviews using text mining. Social Sci- study correcting and monitoring 58 misreported tri- ence Computer Review - SOC SCI COMPUT REV, als in real time. Trials, 20(1):118. 27:509–523.

Jennifer Austin, Christopher Smith, Kavita Natarajan, Romana Haneef, Clement Lazarus, Philippe Ravaud, Mousumi Som, Cole Wayant, and Matt Vassar. 2018. Amelie Yavchitz, and Isabelle Boutron. 2015. In- Evaluation of spin within abstracts in obesity ran- terpretation of results of studies evaluating an inter- domized clinical trials: A cross-sectional review: vention highlighted in google health news: a cross- Spin in obesity clinical trials. Clinical Obesity, sectional study of news. PLoS ONE. 9:e12292. Julian P. Higgins and Sally Green, editors. 2008. Caroline Barnes, Isabelle Boutron, Bruno Giraudeau, Cochrane handbook for systematic reviews of inter- Raphael Porcher, Douglas Altman, and Philippe ventions. Wiley & Sons Ltd., West Sussex. Ravaud. 2015. Impact of an online writing aid tool for writing a randomized trial report: The cob- Minlie Huang, Aurelie´ Nev´ eol,´ and Zhiyong Lu. 2011. web (consort-based web tool) randomized controlled Recommending mesh terms for annotating biomedi- trial. BMC medicine, 13:221. cal articles. Journal of the American Medical Infor- matics Association : JAMIA, 18:660–7. Lian Beijers, Bertus F. Jeronimus, Erick H. Turner, Pe- ter de Jonge, and Annelieke M. Roest. 2017. Spin John Ioannidis. 2005. Why most published research in rcts of anxiety medication with a positive primary findings are false. PLoS medicine, 2:e124. outcome: a comparison of concerns expressed by the us fda and in the published literature. BMJ Open, Muhammad Khan, Noman Lateef, Tariq Siddiqi, 7(3). Karim Abdur Rehman, Saed Alnaimat, Safi Khan, Haris Riaz, M Hassan Murad, John Mandrola, Rami Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. Scibert: Doukky, and Richard Krasuski. 2019. Level and Pretrained contextualized embeddings for scientific prevalence of spin in published cardiovascular ran- text. arXiv preprint arXiv:1903.10676. domized clinical trial reports with statistically non- Isabelle Boutron, Douglas Altman, Sally Hopewell, significant primary outcomes: A systematic review. Francisco Vera-Badillo, Ian Tannock, and Philippe JAMA Network Open, 2:e192622. Ravaud. 2014. Impact of spin in the abstracts of arti- cles reporting results of randomized controlled trials N.C. Kinder, M.D. Weaver, Cole Wayant, and Matt Vas- in the field of cancer: the SPIIN randomized con- sar. 2018. Presence of ‘spin’ in the abstracts and ti- trolled trial. Journal of Clinical Oncology. tles of anaesthesiology randomised controlled trials. British Journal of Anaesthesia, 122. Isabelle Boutron, Susan Dutton, Philippe Ravaud, and Douglas Altman. 2010. Reporting and interpretation Svetlana Kiritchenko, Berry de Bruijn, Simona Carini, of randomized controlled trials with statistically non- Joel D. Martin, and Ida Sim. 2010. Exact: automatic significant results for primary outcomes. JAMA. extraction of clinical trial characteristics from jour- nal publications. BMC Med Inform Decis Mak. Craig M. Cooper, Harrison M. Gray, Andrew E. Ross, Tom A. Hamilton, Jaye B. Downs, Cole Wayant, and Anna Koroleva. 2019a. MiRoR11 - P2 - Annotated cor- Matt Vassar. 2018. Evaluation of spin in the ab- pus for primary and reported outcomes extraction. stracts of otolaryngology randomized controlled tri- als: Spin found in majority of clinical trials. The Anna Koroleva. 2019b. MiRoR11 - P2 - Annotated cor- Laryngoscope. pus for semantic similarity of clinical trial outcomes. Franck Dernoncourt and Ji Young Lee. 2017. PubMed Anna Koroleva. 2019c. 
MiRoR11 - P2 - Annotated cor- 200k RCT: a dataset for sequential sentence classi- pus for the relation between reported outcomes and fication in medical abstracts. In 8th IJCNLP (Vol- their significance levels. ume 2: Short Papers), pages 308–313, Taipei, Tai- wan. Asian Federation of NLP. Anna Koroleva. 2020. MiRoR11 - P2 - Annotated Jacob Devlin, Ming-Wei Chang, Kenton Lee, and dataset for spin-related types of statements (state- Kristina Toutanova. 2018. BERT: pre-training of ments of similarity and within-group comparisons). deep bidirectional transformers for language under- standing. CoRR, abs/1810.04805. Anna Koroleva, Sanjay Kamath, and Patrick Paroubek. 2019. Measuring semantic similarity of clinical Padhraig S. Fleming. 2016. Evidence of spin in clini- trial outcomes using deep pre-trained language rep- cal trials in the surgical literature. Ann Transl Med., resentations. Journal of Biomedical Informatics: X, 4,19(385). 4:100058.

58 Anna Koroleva, Sanjay Kamath, and Patrick Paroubek. Frank Soboczenski, Thomas Trikalinos, Joel¨ Kuiper, EasyChair, 2020. Extracting outcomes from arti- Randolph G. Bias, Byron Wallace, and Iain J. Mar- cles reporting randomized controlled trials using pre- shall. 2019. Machine learning to help researchers trained deep language representations. EasyChair evaluate biases in clinical trials: A prospective, ran- Preprint no. 2940. domized user study. BMC Medical Informatics and Decision Making, 19. Anna Koroleva and Patrick Paroubek. 2019. Extracting relations between outcomes and significance levels Francisco E. Vera-Badillo, Marc Napoleone, Monika K. in randomized controlled trials (RCTs) publications. Krzyzanowska, Shabbir M.H. Alibhai, An-Wen In Proceedings of the 18th BioNLP Workshop and Chan, Alberto Ocana, Bostjan Seruga, Arnoud J. Shared Task, pages 359–369, Florence, Italy. Asso- Templeton, Eitan Amir, and Ian F. Tannock. 2016. ciation for Computational Linguistics. Bias in reporting of randomised clinical trials in on- cology. European Journal of Cancer, 61:29 – 35. Clement´ Lazarus, Romana Haneef, Philippe Ravaud, and Isabelle Boutron. 2015. Classification and Amelie´ Yavchitz, Isabelle Boutron, Aida Bafeta, prevalence of spin in abstracts of non-randomized Ibrahim Marroun, Pierre Charles, Jean J. C. Mantz, studies evaluating an intervention. BMC Med Res and Philippe Ravaud. 2012. Misrepresentation of Methodol. randomized controlled trials in press releases and Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, news coverage: a cohort study. PLoS Med. Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. Biobert: a pre-trained biomed- ical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746. Suzanne Lockyer, Robert Willard Hodgson, Jo C. Dumville, and Nicky Cullum. 2013. ”Spin” in wound care research: the reporting and interpreta- tion of randomized controlled trials with statistically non-significant primary outcome results or unspeci- fied primary outcomes. In Trials. Iain Marshall, Joel¨ Kuiper, Edward Banner, and By- ron C. Wallace. 2017. Automating biomedical ev- idence synthesis: RobotReviewer. In Proceedings of ACL 2017, System Demonstrations, pages 7–12, Vancouver, Canada. Association for Computational Linguistics. Iain James Marshall, Joel¨ Kuiper, and Byron C. Wal- lace. 2015. Robotreviewer: Evaluation of a sys- tem for automatically assessing bias in clinical trials. Journal of the American Medical Informatics Asso- ciation : JAMIA, 23. James Mork, Alan Aronson, and Dina Demner- Fushman. 2017. 12 years on – is the NLM medi- cal text indexer still useful and relevant? Journal of Biomedical Semantics, 8(1). James G. Mork, Antonio Jimeno-Yepes, and Alan R. Aronson. 2013. The NLM medical text indexer sys- tem for indexing biomedical literature. CEUR Work- shop Proceedings, 1094. Zainab Samaan, Lawrence Mbuagbaw, Daisy Kosa, Victoria Borg Debono, Rejane Dillenburg, Shiyuan Zhang, Vincent Fruci, Brittany Dennis, Monica Ba- wor, and Lehana Thabane. 2013. A systematic scop- ing review of adherence to reporting guidelines in health care literature. Journal of multidisciplinary healthcare, 6:169–88. Kenneth F. Schulz, Douglas G. Altman, and David Mo- her. 2010. Consort 2010 statement: updated guide- lines for reporting parallel group randomised trials. BMJ, 340.

59 Towards Visual Dialog for Radiology

Olga Kovaleva ,∗ Chaitanya Shivade , Satyananda Kashyap?, Karina Kanjaria?, ‡ †∗ Adam Coy?, Deddeh Ballah?, Joy Wu?, Yufan Guo?, Alexandros Karargyris?, David Beymer?, Anna Rumshisky , Vandana Mukherjee? University of Massachussets,‡ Lowell Amazon ‡ ? IBM Almaden Research Center.†

Abstract over additional information gathered from previous question answers in the dialog. Current research in machine learning for ra- Although limited work exploring VQA in radiol- diology is focused mostly on images. There ogy exists, VisDial in radiology remains an unex- exists limited work in investigating intelligent interactive systems for radiology. To address plored problem. With the healthcare setting increas- this limitation, we introduce a realistic and ingly requiring efficiency, evaluation of physicians information-rich task of Visual Dialog in radi- is now based on both the quality and the timeli- ology, specific to chest X-ray images. Using ness of patient care. Clinicians often depend on MIMIC-CXR, an openly available database official reports of imaging exam findings from ra- of chest X-ray images, we construct both a diologists to determine the appropriate next step. synthetic and a real-world dataset and pro- However, radiologists generally have a long queue vide baseline scores achieved by state-of-the- art models. We show that incorporating medi- of imaging studies to interpret and report, caus- cal history of the patient leads to better perfor- ing subsequent delay in patient care (Bhargavan mance in answering questions as opposed to et al., 2009; Siewert et al., 2016). Furthermore, it is conventional visual question answering model common practice for clinicians to call radiologists which looks only at the image. While our ex- asking follow-up questions on the official reporting, periments show promising results, they indi- leading to further inefficiencies and disruptions in cate that the task is extremely challenging with the workflow (Mangano et al., 2014). significant scope for improvement. We make Visual dialog is a useful imaging adjunct that both the datasets (synthetic and gold standard) and the associated code publicly available to can help expedite patient care. It can potentially the research community. answer a physician’s questions regarding official in- terpretations without interrupting the radiologist’s 1 Introduction workflow, allowing the radiologist to concentrate their efforts on interpreting more studies in a timely Answering questions about an image is a complex manner. Additionally, visual dialog could provide multi-modal task demonstrating an important ca- clinicians with a preliminary radiology exam in- pability of artificial intelligence. A well-defined terpretation prior to receiving the formal dictation task evaluating such capabilities is Visual Question from the radiologist. Clinicians could use the infor- Answering (VQA) (Antol et al., 2015) where a sys- mation to start planning patient care and decrease tem answers free-form questions reasoning about the time from the completion of the radiology exam an image. VQA demands careful understanding to subsequent medical management (Halsted and of elements in an image along with intricacies of Froehle, 2008). the language used in framing a question about it. In this paper, we address these gaps and make Visual Dialog (VisDial) (Das et al., 2017; de Vries the following contributions: 1) we introduce con- et al., 2016) is an extension to the VQA problem, struction of RadVisDial - the first publicly available where a system is required to engage in a dialog dataset for visual dialog in radiology, derived from about the image. 
This adds significant complex- the MIMIC-CXR (Johnson et al., 2019) dataset, ity to VQA where a system should now be able 2) we compare several state-of-the-art models for to associate the question in the image, and reason VQA and VisDial applied to these images, and 3) ∗ Equal contribution, Work done at IBM Research we conduct a comprehensive set of experiments

60 Proceedings of the BioNLP 2020 workshop, pages 60–69 Online, July 9, 2020 c 2020 Association for Computational Linguistics highlighting different challenges of the problem 206,576 reports. Each report is well structured and and propose solutions to overcome them. typically consists of sections such as Medical Condition, Comparison, Findings, and 2 Related Work Impression. Each report can map to one or Most of the large publicly available datasets (Kag- more images and each patient can have one or gle, 2017; Rajpurkar et al., 2017) for radiology more reports. The images consist of both frontal consist of images associated with a limited amount and lateral views. The frontal views are either of structured information. For example, Irvin et al. anterior-posterior (AP) or posterior-anterior (PA). (2019); Johnson et al.(2019) make images avail- The initial release of data also consists of annota- able along with the output of a text extraction mod- tions for 14 labels (13 abnormalities and one No ule that produces labels for 13 abnormalities in a Findings label) for each image. These anno- chest X-ray. Of note recently, the task of generating tations are obtained by running the CheXpert la- reports from radiology images has become popular beler (Irvin et al., 2019); a rule-based NLP pipeline in the research community (Jing et al., 2018; Wang against the associated report. The labeler output et al., 2018). Two recent shared tasks at Image- assigns one of four possibilities for each of the 13 abnormalities: yes, no, maybe, not mentioned in CLEF explored the VQA problem with radiology { the report . images (Hasan et al., 2018; Abacha et al., 2019). } Lau et al.(2018) also released a small dataset VQA- RAD for the specific task. 3.2 Visual Dialog dataset construction The first VQA shared task at ImageCLEF (Hasan Every training record of the original VisDial dataset et al., 2018) used images from articles at PubMed (Das et al., 2017) consists of three elements: an im- Central. While Abacha et al.(2019) and Lau age I, a caption for the image C, and a dialog et al.(2018) use clinical images, the sizes of these history H consisting of a sequence of ten question- datasets are limited. They are a mix of several answer pairs. Given the image I, the caption C, modalities including 2D modalities such as X-rays, a possibly empty dialog history H, and a follow- and 3D modalities such as ultrasound, MRI, and CT up question q, the task is to generate an answer a scans. They also cover several anatomic locations where q, a H. Following the original formula- { } ∈ from the brain to the limbs. This makes a multi- tion, we synthetically create our dataset using the modal task with such images overly challenging, plain text reports associated with each image (this with shared task participants developing separate synthetic dataset will be considered to be silver- models (Al-Sadi et al., 2019; Abacha et al., 2018; standard data for the experiments described in sec- Kornuta et al., 2019) to first address these subtasks tion 5). The Medical Condition section of (such as modality detection) before actually solving the radiology report is a single sentence describing the problem of VQA. the medical history of the patient. We treat this sen- We address these limitations and build up on tence from the Medical Condition section as MIMIC-CXR (Johnson et al., 2019) the largest the caption of the image. 
We use NegBio (Peng publicly available dataset of chest X-rays and cor- et al., 2018) for extracting sections within a report. responding reports. We focus on the problem of We discard all images that do not have a medical visual dialog for a single modality and anatomy in condition in their report. Further, each CheXpert the form of 2D chest X-rays. We restrict the num- label is formulated as a question probing the pres- ber of questions and generate answers for them ence of a disorder, and the output from the labeler automatically which allows us to report results on is treated as the corresponding answer. Thus, ignor- a large set of images. ing the No Findings label, there are 52 possible 3 Data question-answer pairs as a result of 13 questions and 4 possible answers. 3.1 MIMIC-CXR We decided to focus on PA images for most of The MIMIC-CXR dataset1 consists of 371,920 our experiments as this is the most informative chest X-ray images in the Digital Imaging and view for chest X-rays, according to our team radi- Communications (DICOM) format along with ologists. The original VisDial dataset (Das et al., 1https://physionet.org/content/ 2017) consists of ten questions per dialog and one mimic-cxr/1.0.0/ dialog per image. Since we only have a set of 13

61 Two people are in a wheelchair and one is holding a racket 80 year old man s/p vats R lower lobectomy

Q: Airspace opacity? A: Yes Q: Fracture? A: Not in report Q: Lung lesion? A: No

Pneumonia? Yes

Figure 1: Comparison of VisDial 1.0 (left) with our synthetically constructed dataset (right). possible questions, we limit the length of the dialog to 5 randomly sampled questions. The resulting dataset has 91060 images in the PA view (with train/validation/test splits containing 77205, 7340 and 6515 images, respectively). This synthetic data will be made available through the MIMIC Derived Pneumonia? LSTM No Data Repository.2 Thus any individual with access Airspace + to MIMIC-CXR will have access to our data. Fig- opacity? No LSTM ure 1 shows an example from our dataset and how ... it compares with one from VisDial 1.0. Figure 2: The modified architecture of the SAN model 3.3 Evaluation (image taken from (Yang et al., 2016)). The proposed The questions in our dataset are limited to probing modification shown in orange incorporates the history the presence of an abnormality in a chest X-ray. of dialog turns in the same way as the question through an LSTM. In our ablation experiments the changed part Similarly, the answers are limited to one of the either reduces to encoding an image caption only or four choices. Owing to the restricted nature of the gets cut completely. problem, we deviate from the evaluation protocol outlined in (Das et al., 2017) and instead calculate the F1-score for each of the four answers. We also guided attention over image features in an itera- report a macro-averaged F1 score across the four tive manner. The attended image features are then answers to make model comparisons easier. combined with the question features for answer prediction. SAN has been successfully adapted for 4 Models medical VQA tasks such as VQA-RAD (Lau et al., For our experiments, we selected a set of models 2018) and VQA-Med task of the ImageCLEF 2018 designed for image-based question answering tasks. challenge (Ionescu et al., 2018). In our setup, we Namely, we experimented with three architectures: use a stack of two image attention layers and an Stacked Attention Network (SAN) (Yang et al., LSTM-based question representation. 2016), Late Fusion Network (LF) (Das et al., 2017), To take the dialog history into account and there- and Recursive Visual Attention Network (RVA) fore adjust the SAN model for the needs of the (Niu et al., 2019). Following the original VisDial Visual Dialog task, we modify the first image at- study (Das et al., 2017), we use an encoder-decoder tention layer of the network by adding a term for structure with a discriminative decoder for each of LSTM representation of the history. This modifica- the models. Below we give an overview of all the tion forces the image attention weights to become three algorithms. both question- and history-guided (see Figure 2).

4.1 Stacked Attention Network 4.2 Late Fusion Network The original configuration of SAN was introduced Proposed by (Das et al., 2017) as a baseline model for the general-domain VQA task. The model per- for the Visual Dialog task, Late Fusion Network forms multi-step reasoning by refining question- encodes the question and the dialog history through 2https://physionet.org/physiotools/ two separate RNNs, and the image through a CNN. mimic-code/HEADER.shtml The resulting representations are simply concate-

62 nated in a single vector, which is then used by a decoder for predicting the answer. We use this model unchanged, as released in the original Visual Dialog challenge.

4.3 Recursive Visual Attention This model is the winner of the 2019 Visual Di- alog challenge3. It recursively browses the past history of dialog turns until the current question is paired with the turn containing the most relevant information. This strategy is particularly useful for resolving co-references, naturally occurring in Figure 3: Downsampling strategies. Every bar along general-domain dialog questions. As previously, the X axis represents a single question-answer pair, we do not modify the architecture of the model. where questions (13 in total) and answers (4 in total) are obtained through CheXpert. 5 Experiments

This section presents our down-sampling strategy, in report’ and ‘Fracture’ ‘Not in report’ } { → } gives details about conducted ablation studies, and remained the first and second largest counts in both describes experiments with various representations the unbalanced and downsampled versions of the of images and texts. datasets. To reduce the disparity between dominant 5.1 Downsampling and underrepresented categories, we tuned the pa- rameters outlined in (Hudson and Manning, 2019). A closer analysis of our data showed that the major- We experimented with two different sets of param- ity of the reports processed by the CheXpert labeler eter values and obtained two datasets with more resulted in no mention of most of the 13 patholo- balanced question-answer distributions. We further gies. This presented a heavily skewed dataset that refer to them as “minor” and “major” downsam- would lead to a biased model instead of true visual pling, reflecting the total amount of data reduced understanding. This issue is not unique to radiol- (shown in blue and gray in Figure 3). ogy; it is observed even in the current benchmarks for VQA, and attempts have been made to miti- gate the resulting problems (Hudson and Manning, 5.2 Evaluating importance of context 2019; Zhang et al., 2016; Agrawal et al., 2018). In order to dissuade the answer biases, we per- To assess the importance of the dialog context for formed data balancing, specifically by downsam- question answering, we compare the performance pling major labels in our dataset. As mentioned of different variations of the Stacked Attention Net- above, the CheXpert labeler outputs four possible work, selected as the best-performing model in the answers for 13 labels. To investigate the skew in the previous experiment (see subsection 6.1). In par- data, we plotted a distribution of the 52 question- ticular, we examine three scenarios: (a) the model answer pairs (Figure 3). Further, we downsampled makes a prediction based solely on a given image the question-answer pairs to fit a smoother answer (essentially solving the VQA task rather than the distribution with the method presented in GQA Visual Dialog task), (b) the model makes its pre- based on the Earth Mover’s Distance method (Hud- diction given an image and its caption, and (c) the son and Manning, 2019; Rubner et al., 2000). We model makes its prediction given an image, a cap- iterated over the 52 pairs in decreasing frequency tion, and a history of question-answer pairs. Simi- order and downsampled the categories belonging lar to the model modifications described in subsec- to the skewed head of the distribution. The relative tion 4.1 and Figure 2, we achieve the goal through label ranks by frequency remained the same for the experimenting with the SAN model by changing its balanced sets as with the unbalanced sets. For ex- first image attention layer to accordingly take in (a) ample, the pairs ‘Other pleural findings’ ‘Not { → question and image features, (b) question, image, 3https://visualdialog.org/challenge/ and caption features, and (c) question, image, and 2019 full dialog history features.

63 5.3 Image representations corresponding lateral view are concatenated as an We test three approaches for pre-trained image rep- aggregate image representation. resentations. The first approach uses a ResNet- 5.5 Text representations 101 architecture (He et al., 2016) for multiclass classification of input X-ray images into 14 find- Finally, we investigate the best way for represent- ing labels extracted from the associated reports ing the textual data by incorporating different pre- (as described in section 3.2). Our second method trained word vectors. More specifically, we mea- aims to replicate the original CheXpert study (Irvin sure the performance of our best-performing SAN et al., 2019). Here we use a DenseNet-121 image model reached with (a) randomly initialized word classifier trained for prediction of five pre-selected embeddings trained jointly with the rest of the and clinically important labels, namely, atelectasis, models, (b) domain-independent GloVe Common cardiomegaly, consolidation, edema, and pleural Crawl embeddings (Pennington et al., 2014), and effusion. In both ResNet and DenseNet-based ap- (c) domain-specific fastText embeddings trained proaches we take the features obtained from the by (Romanov and Shivade, 2018). The latter are last pooling layer. initialized with GloVe embeddings trained on Com- Finally, we adopted a bottom-up mechanism for mon Crawl, followed by training on 12M PubMed image region proposal introduced by Anderson abstracts, and finally on 2M clinical notes from et al.(2018). More specifically, we first trained MIMIC-III database (Johnson et al., 2016). In a neural network predicting bounding boxes for all the experiments, we use 300-dimensional word the image regions, corresponding to a set of 11 vectors. We also experimented with transformer- handcrafted clinical annotations adopted from an based contextual vectors using BERT (Devlin et al., existing chest X-ray dataset4. We then represented 2019). More specifically, instead of using LSTM every region as a latent feature vector of a trained representations of the textual data, we extracted the patch-wise convolution autoencoder, and (3) con- last layer vectors from ClinicalBERT (Alsentzer catenated all the obtained vectors to represent the et al., 2019) pre-trained on MIMIC notes, and aver- entire image. aged them over input sequence tokens. Based on the results of the experiment (subsec- 5.6 Question order tion 6.3), we found that ResNet-101 image vectors yielded the best performance, so we used them in In a visual dialog setting, a model is conditioned on other experiments. the image vector, the image caption, and the dialog history to predict the answer to a new question. 5.4 Effect of incorporating a lateral view We hypothesized that a model should be able to One of the crucial aspects of X-ray radiography answer later questions in a dialog better since it exams is to capture the subject from multiple views. has more information from the previous questions Typically, in case of chest X-rays, radiologists order and their answers. As described in Section 3.2, we an additional lateral view to confirm and locate randomly sample 5 questions out of 13 possible findings that are not clearly visible from a frontal choices to construct a dialog. We re-ordered the (PA or AP) view. 
We test whether the VisDial question-answer pairs in the dialog to reflect the models are able to leverage the additional visual order in which the corresponding abnormality label information offered by a lateral (LAT) view. We mentions occurred in the report. However, results filter the data down to the patients whose chest X- for questions ordered based on their occurrence in ray exams had both a frontal and lateral views and the narrative did not vary from the setup with a re-sample the resulting data-set into train (52952 random order of questions. PA and 8086 AP images), validation (6614 PA and 964 AP images), and test (6508 PA and 1035 AP 6 Results images). We train a separate ResNet-101 model We report macro-averaged F1-scores achieved on for each of the three views on this re-sampled data the same unbalanced validation set for each of the using the method described in the previous section. experiments. When experimenting with different The vector representations of a frontal view and the configurations of the same model, we also break 4https://www.kaggle.com/c/ down the aggregate score to the F1 scores for indi- rsna-pneumonia-detection-challenge vidual answer options.

64 Model ‘Yes’ ‘No’ ‘Maybe’ ‘Not in report’ Macro F1 SAN (VQA) 0.24 0 0.09 0.84 0.29 SAN (caption only) 0.30 0.09 0.09 0.81 0.33 SAN (full history) 0.22 0.26 0.04 0.83 0.34

Table 1: Ablation experiments. Per-answer F1-scores along with the macro F1-score are shown for tested SAN configurations.

6.1 Downsampling this might be due to the fact that, by design, the net- Our results show (Table 2) consistent improvement work uses a limited set of pre-training classes not of the scores across all the models as the train- sufficient to generalize well to a full set of diseases ing data becomes more balanced. All the mod- used in the Visual Dialog task. els yielded comparable scores, with SAN being slightly better than other models (0.34 against 0.33 Model DenseNet-121 Region ResNet macro F1-score). Later in our experiments, we used Proposal the major down sampled version of the data-set. SAN 0.27 0.29 0.34 LF 0.33 0.31 0.33 Downsampled Model Unbalanced RvA 0.29 0.32 0.33 Minor Major SAN 0.25 0.28 0.34 Table 3: Comparative performance (macro-F1) of Vi- LF 0.28 0.31 0.33 sual Dialog models on the test set with different image RvA 0.24 0.33 0.33 representations.

Table 2: Data balancing experiments. Macro F1 scores are reported for every tested model. 6.4 Effect of incorporating a lateral view

6.2 Evaluating importance of context As expected, for both variations of the frontal view One of the main findings of our study revealed the (i.e. AP and PA) appending lateral image vectors importance of contextual information for answer- enhanced the performance of the tested SAN model ing questions about a given image. As shown in (see Table4). This suggests that lateral and frontal Table 1, adding the image caption and the history image vectors complement each other, and the mod- of turns results in incremental increases of macro els can benefit from using both. However, in our F1-scores. Notably, the VQA setup in which the data-set only a subset of reports has both views model relies on the image only, it fails to detect the available, which significantly reduces the amount ‘No’ answer, whereas the history-aware configura- of training data. tion leads to a significant performance gain for this particular label. As expected and due to the skewed 6.5 Word embeddings nature of the data-set, the highest and the lowest per-label scores were achieved for the most and the Another observation from our experiments is that least frequent labels (‘Not in report’ and ‘Maybe’), domain-specific pre-trained word embeddings con- respectively. tribute to better scores (see Table5). This is due to the fact that domain-specific embeddings con- 6.3 Image representation tain medical knowledge that helps the model make Out of the tested image representations, ResNet- more justified predictions. derived vectors perform consistently better than the When using BERT, we did not notice gains in other approaches (see Table3). Although in our performance, which most likely means that the last- DenseNet-121 image classification pre-training we layer averaging strategy is not optimal and more were able to replicate the performance of (Irvin sophisticated approaches such as (Xiao, 2018) are et al., 2019), the Visual Dialog scores for the corre- required . Alternatively, the final representation of sponding vectors turned out to be lower. We believe the CLS can be used to represent input text.

65 View ‘Yes’ ‘No’ ‘Maybe’ ‘Not in report’ Macro F1 AP 0.40 0.21 0.12 0.79 0.381 LAT 0.41 0.23 0.13 0.75 0.379

AP+LAT AP + LAT 0.41 0.22 0.12 0.79 0.385 PA 0.30 0.30 0.08 0.88 0.392 LAT 0.32 0.32 0.07 0.86 0.391

PA+LAT PA + LAT 0.32 0.34 0.06 0.87 0.396

Table 4: Effect of adding the lateral view to a frontal view (AP and PA).

Embedding ‘Yes’ ‘No’ ‘Maybe’ ‘Not in report’ Macro F1 Random 0.26 0.22 0.04 0.73 0.31 GloVe (common crawl) 0.27 0 0.09 0.80 0.29 fastText (MedNLI) 0.24 0.22 0.07 0.84 0.33

Table 5: Comparative performance of the SAN model with different word embeddings.

7 Comparison with the gold-standard answer was elaborated with additional information data if the radiologists found it necessary. The whole data collection procedure resulted in 100 annotated To complement our experiments with the silver dialogs. data and investigate the applicability of the trained models to real-world scenarios, we also collected a 7.2 Gold standard results set of gold standard data which consisted of two ex- pert radiologists having a dialog about a particular Following the gold standard data collection, we chest X-ray. These X-ray images were randomly performed some preliminary analyses with the best sampled PA views from the test our data. In this silver standard SAN model. Our gold standard data section, we present the data collection workflow, was split into train (70), validation (20), and test outline the associated challenges, compare the re- (10) sets. We experimented with three setups: (a) sulting data-set with the silver-standard, and report evaluating the silver-data trained networks on the the performance of trained models. gold standard data, (b) training and evaluating the models on the gold data, and (c) fine-tuning the 7.1 Gold Standard Data Collection silver-data trained networks on the gold standard We laid the foundations for our data collection in data. Table 6 shows the results of these experiments. a manner similar to that of the general visual dia- We found the best macro-F1 score of 0.47 was log challenge (Das et al., 2017). Two radiologists, achieved by the silver data-trained SAN network designated as a “questioner” and an “answerer”, fine-tuned on the gold standard data. We observed conversed with each other following a detailed an- that the model could not directly predict any of the notation guideline created to ensure consistency. classes if directly evaluated on the gold data-set, The “answerer” in each scenario was provided with suggesting that it was trained to fit the data patterns an image and a caption (medical condition). The significantly different from those present in the “questioner” was provided with only the caption, collected data-set. However, pre-training on the and tasked with asking follow-up questions about silver data serves as a good starting point for further the image, visible only to the “answerer”. In or- model fine-tuning. The obtained scores in general der to make the gold data-set comparable to the imply that there are many differences between the silver-standard one, we restricted the beginning of gold and silver data, including their vocabularies, each answer to contain a direct response of ‘Yes’, answer distributions, and level of question detail. ‘No’, ‘Maybe’, or ‘Not mentioned’. In our anno- tation guidelines ‘Not mentioned’ referred to the 7.3 Comparison of gold and silver data lack of evidence of the given medical condition To provide a meaningful analysis of the sources that was asked by the “questioner” radiologist. The of difference between the gold and silver datasets,

Train data    'Yes'   'No'   Macro F1
Silver        0.00    0.00   0.00
Gold          0.27    0.77   0.35
Silver+gold   0.60    0.82   0.47

Table 6: Comparative performance of the SAN model trained on different combinations of silver and gold data, and evaluated on the test subset of gold data. Note that the gold annotations did not contain the 'Not in report' and 'Maybe' options.

7.3 Comparison of gold and silver data

To provide a meaningful analysis of the sources of difference between the gold and silver datasets, we grouped the gold questions semantically by using the CheXpert vocabulary for the 13 labels used for the construction of the silver dataset. The gold questions that could not be grouped via CheXpert were mapped manually using expert clinical knowledge. We systematically compared the gold and silver dialogs on the same 100 chest X-rays and noted the following differences.

• Frequency of semantically equivalent questions. Just under half of the gold question types were semantically covered by the questions in the silver dataset.

• Granularity of questions. We observed that the silver dataset tends to ask highly granular questions about specific findings (e.g. "consolidation"), as expected. The radiology experts, however, asked a range of low (e.g. "Are there any bone abnormalities?"), medium (e.g. "Are the lungs clear?") and high (e.g. "Is there evidence of pneumonia?") granularity questions. The gold dialogs tend to start with broader (low granularity) questions and narrow the differential diagnosis down as the dialogs progress.

• Question sense. The radiologists also asked questions in the form of whether some structure is "normal" (e.g. "Is the soft tissue normal?"), whereas the silver questions only asked whether an abnormality is present. Since chest X-rays are screening exams where a good proportion of the images may be "normal", having more questions asking whether different anatomies are normal would, therefore, yield more 'Yes' answers.

• Answer distributions. The answer distributions of the gold and silver data differ greatly. Specifically, while the gold data was composed heavily of 'Yes' or 'No' answers, the silver data comprised mostly 'Not in report'.

8 Discussion

Our main finding is that the introduced task of visual dialog in radiology presents many challenges from the machine learning perspective, including a skewed distribution of classes and a required ability to reason over both visual and textual input data. The best of our baseline models achieved a 0.34 macro-averaged F1-score, indicating significant scope for potential improvements. Our comparison of gold- and silver-standard data shows trends in line with medical doctors' strategies in medical history taking: starting with broader, general questions and then narrowing the scope of their questions to more specific findings (Talbot et al.; Campillos-Llanos et al., 2020).

Despite the difficulty and the practical usefulness of the task, it is important to list the limitations of our study. The questions were limited to the presence of 13 abnormalities extracted by CheXpert and the answers were limited to 4 options. The studies used in this work (from MIMIC-CXR) originate from a single tertiary hospital in the United States. Moreover, they correspond to a specific group of patients, namely those admitted to the Emergency Department (ED) from 2012 to 2014. Therefore, the data, and hence the model, reflect multiple real-world biases. It should also be noted that chest X-rays are mostly used for screening rather than diagnostic purposes. A radiology image is only one of the many data points (e.g. labs, demographics, medications) used while making a diagnosis. Therefore, although predicting the presence of abnormalities (e.g. pneumonia) based on brief knowledge of the patient's medical history and the chest X-ray might be a good exercise and a promising first step in evaluating machine learning models, it is clinically limited.

There are plenty of directions for future work that we intend to pursue. To make the synthetic data more realistic and expressive, both questions and answers should be diversified with the help of clinicians' expertise and external knowledge bases such as UMLS (Bodenreider, 2004). We plan to enrich the data with more question types, addressing, for example, the location or the size of a given lung abnormality. We plan to collect more real-life dialogs between radiologists and augment the two datasets to get a richer set of more expressive dia-

67 log. We anticipate that bridging the gap between Peter Anderson, Xiaodong He, Chris Buehler, Damien the silver- and the gold-standard data in terms of Teney, Mark Johnson, Stephen Gould, and Lei natural language formulations would significantly Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In reduce the difference in model performance for the Proceedings of the IEEE Conference on Computer two setups. Vision and Pattern Recognition, pages 6077–6086. Another direction is to develop a strategy to man- Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Mar- age the uncertain labels such as ‘Maybe’ and ‘Not garet Mitchell, Dhruv Batra, C Lawrence Zitnick, in report’ to make the dataset more balanced. and Devi Parikh. 2015. Vqaa: Visual question an- swering. In Proceedings of the IEEE International 9 Conclusion Conference on Computer Vision, pages 2425–2433. We explored the task of Visual Dialog for radiol- Mythreyi Bhargavan, Adam H Kaye, Howard P For- ogy using chest X-rays and released the first pub- man, and Jonathan H Sunshine. 2009. Workload of licly available silver- and gold-standard datasets radiologists in united states in 2006–2007 and trends since 1991–1992. Radiology, 252(2):458–467. for this task. Having conducted a set of rigorous experiments with state-of-the-art machine learning Olivier Bodenreider. 2004. The unified medical lan- models used for the combination of visual and lan- guage system (umls): integrating biomedical termi- nology. Nucleic acids research, 32(suppl 1):D267– guage reasoning, we demonstrated the complexity D270. of the task and outlined the promising directions ´ for further research. Leonardo Campillos-Llanos, Catherine Thomas, Eric Bilinski, Pierre Zweigenbaum, and Sophie Rosset. Acknowledgments 2020. Designing a virtual patient dialogue system based on terminology-rich resources: Challenges We would like to thank Mousumi Roy for her help and evaluation. Natural Language Engineering, in this project. We are also thankful to Mehdi 26(2):183–220. Moradi for helpful discussions. Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose´ MF Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In References Proceedings of the IEEE Conference on Computer AB Abacha, SA Hasan, VV Datla, J Liu, D Demner- Vision and Pattern Recognition, pages 326–335. Fushman, and H Muller.¨ 2019. Vqa-med: Overview of the medical visual question answering task at Jacob Devlin, Ming-Wei Chang, Kenton Lee, and image-clef 2019. In CLEF2019 Working Notes. Kristina Toutanova. 2019. Bert: Pre-training of CEUR Workshop Proceedings (CEURWS. org), deep bidirectional transformers for language under- ISSN, pages 1613–0073. standing. In Proceedings of the 2019 Conference of the North American Chapter of the Association Asma Ben Abacha, Soumya Gayen, Jason J Lau, for Computational Linguistics: Human Language Sivaramakrishnan Rajaraman, and Dina Demner- Technologies, Volume 1 (Long and Short Papers), Fushman. 2018. Nlm at imageclef 2018 visual ques- pages 4171–4186. tion answering in the medical domain. In CLEF (Working Notes). Mark J Halsted and Craig M Froehle. 2008. De- sign, implementation, and assessment of a radiology Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and workflow management system. American Journal of Aniruddha Kembhavi. 2018. Don’t just assume; Roentgenology, 191(2):321–327. look and answer: Overcoming priors for vi- sual question answering. 
In Proceedings of the Sadid A Hasan, Yuan Ling, Oladimeji Farri, Joey IEEE Conference on Computer Vision and Pattern Liu, Henning Muller,¨ and Matthew Lungren. 2018. Recognition, pages 4971–4980. Overview of imageclef 2018 medical domain visual question answering task. In CLEF (Working Notes). Aisha Al-Sadi, Bashar Talafha, Mahmoud Al-Ayyoub, Yaser Jararweh, and Fumie Costen. 2019. Just at im- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian ageclef 2019 visual question answering in the medi- Sun. 2016. Deep residual learning for image recog- cal domain. In CLEF (Working Notes). nition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770– Emily Alsentzer, John Murphy, William Boag, Wei- 778. Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT Drew A Hudson and Christopher D Manning. 2019. embeddings. In Proceedings of the 2nd Clinical Gqa: A new dataset for real-world visual reason- Natural Language Processing Workshop, pages 72– ing and compositional question answering. In 78, Minneapolis, Minnesota, USA. Association for Proceedings of the IEEE Conference on Computer Computational Linguistics. Vision and Pattern Recognition, pages 6700–6709.

68 Bogdan Ionescu, Henning Muller,¨ Mauricio Villegas, 2018. Negbio: a high-performance tool for nega- Alba Garc´ıa Seco de Herrera, Carsten Eickhoff, Vin- tion and uncertainty detection in radiology re- cent Andrearczyk, Yashin Dicente Cid, Vitali Li- ports. AMIA Summits on Translational Science auchuk, Vassili Kovalev, Sadid A Hasan, et al. 2018. Proceedings, 2018:188. Overview of imageclef 2018: Challenges, datasets and evaluation. In International Conference of Jeffrey Pennington, Richard Socher, and Christopher the Cross-Language Evaluation Forum for European Manning. 2014. Glove: Global vectors for word rep- Languages, pages 309–334. Springer. resentation. In Proceedings of the 2014 conference on empirical methods in natural language processing Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, (EMNLP), pages 1532–1543. Silviana Ciurea-Ilcus, Chris Chute, Henrik Mark- Pranav Rajpurkar, Jeremy Irvin, Aarti Bagul, Daisy lund, Behzad Haghgoo, Robyn Ball, Katie Shpan- Ding, Tony Duan, Hershel Mehta, Brandon Yang, skaya, et al. 2019. Chexpert: A large chest radio- Kaylie Zhu, Dillon Laird, Robyn L Ball, et al. graph dataset with uncertainty labels and expert com- 2017. Mura: Large dataset for abnormality detec- parison. In Proceedings of AAAI. tion in musculoskeletal radiographs. arXiv preprint arXiv:1712.06957. Baoyu Jing, Pengtao Xie, and Eric Xing. 2018. On the automatic generation of medical imaging reports. Alexey Romanov and Chaitanya Shivade. 2018. In Proceedings of the 56th Annual Meeting of the Lessons from natural language inference in the Association for Computational Linguistics (Volume clinical domain. In Proceedings of the 2018 1: Long Papers), pages 2577–2586. Conference on Empirical Methods in Natural Language Processing, pages 1586–1596. Alistair EW Johnson, Tom J Pollard, Seth Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih- Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. ying Deng, Roger G Mark, and Steven Horng. 2000. The earth mover’s distance as a metric for 2019. Mimic-cxr: A large publicly available image retrieval. International journal of computer database of labeled chest radiographs. arXiv vision, 40(2):99–121. preprint arXiv:1901.07042. Bettina Siewert, Olga R Brook, Mary Hochman, and Ronald L Eisenberg. 2016. Impact of communica- Alistair EW Johnson, Tom J Pollard, Lu Shen, tion errors in radiology on patient care, customer H Lehman Li-wei, Mengling Feng, Moham- satisfaction, and work-flow efficiency. American mad Ghassemi, Benjamin Moody, Peter Szolovits, Journal of Roentgenology, 206(3):573–579. Leo Anthony Celi, and Roger G Mark. 2016. Mimic-iii, a freely accessible critical care database. Thomas B Talbot, Kenji Sagae, Bruce John, and Al- Scientific data, 3:160035. bert A Rizzo. Designing useful virtual standardized patient encounters. Kaggle. 2017. Data science bowl. https://www. kaggle.com/c/data-science-bowl-2017. Harm de Vries, Florian Strub, A. P. Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Tomasz Kornuta, Deepta Rajan, Chaitanya Shivade, Courville. 2016. Guesswhat?! visual object dis- Alexis Asseman, and Ahmet S Ozcan. 2019. Lever- covery through multi-modal dialogue. In 2017 aging medical visual question answering with sup- IEEE Conference on Computer Vision and Pattern porting facts. In CLEF (Working Notes). Recognition, pages 4466–4475. Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, and Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Ronald M Summers. 2018. Tienet: Text-image em- Dina Demner-Fushman. 2018. 
A dataset of clini- bedding network for common thorax disease classifi- cally generated visual questions and answers about cation and reporting in chest x-rays. In Proceedings radiology images. Scientific Data, 5:180251. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9049–9058. Mark D Mangano, Arifeen Rahman, Garry Choy, Dushyant V Sahani, Giles W Boland, and Andrew J Han Xiao. 2018. bert-as-service. https://github. Gunn. 2014. Radiologists’ role in the communi- com/hanxiao/bert-as-service. cation of imaging examination results to patients: perceptions and preferences of patients. American Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, Journal of Roentgenology, 203(5):1034–1039. and Alex Smola. 2016. Stacked attention networks for image question answering. In Proceedings of the Yulei Niu, Hanwang Zhang, Manli Zhang, Jianhong IEEE Conference on Computer Vision and Pattern Zhang, Zhiwu Lu, and Ji-Rong Wen. 2019. Recur- Recognition, pages 21–29. sive visual attention in visual dialog. In Proceedings Peng Zhang, Yash Goyal, Douglas Summers-Stay, of the IEEE Conference on Computer Vision and Dhruv Batra, and Devi Parikh. 2016. Yin and yang: Pattern Recognition, pages 6679–6688. Balancing and answering binary visual questions. In Proceedings of the IEEE Conference on Computer Yifan Peng, Xiaosong Wang, Le Lu, Mohammad- Vision and Pattern Recognition, pages 5014–5022. hadi Bagheri, Ronald Summers, and Zhiyong Lu.

A BERT-based One-Pass Multi-Task Model for Clinical Temporal Relation Extraction

Chen Lin1, Timothy Miller1*, Dmitriy Dligach2, Farig Sadeque1, Steven Bethard3 and Guergana Savova1

*Co-first author
1Boston Children's Hospital and Harvard Medical School
2Loyola University Chicago
3University of Arizona
1{first.last}@childrens.harvard.edu, [email protected], [email protected]

Abstract

Recently BERT has achieved state-of-the-art performance in temporal relation extraction from clinical Electronic Medical Records text. However, the current approach is inefficient as it requires multiple passes through each input sequence. We extend a recently-proposed one-pass model for relation classification to a one-pass model for relation extraction. We augment this framework by introducing global embeddings to help with long-distance relation inference, and by multi-task learning to increase model performance and generalizability. Our proposed model produces results on par with the state-of-the-art in temporal relation extraction on the THYME corpus and is much "greener" in computational cost.

1 Introduction

The analysis of many medical phenomena (e.g., disease progression, longitudinal effects of medications, treatment regimen and outcomes) heavily depends on temporal relation extraction from the clinical free text embedded in the Electronic Medical Records (EMRs). At a coarse level, a clinical event can be linked to the document creation time (DCT) as Document Time Relations (DocTimeRel), with possible values of BEFORE, AFTER, OVERLAP, and BEFORE OVERLAP (Styler IV et al., 2014). At a finer level, a narrative container (Pustejovsky and Stubbs, 2011) can temporally subsume an event as a CONTAINS relation. The THYME corpus (Styler IV et al., 2014) consists of EMR clinical text and is annotated with time expressions (TIMEX3), events (EVENT), and temporal relations (TLINK) using an extension of TimeML (Pustejovsky et al., 2003; Pustejovsky and Stubbs, 2011). It was used in the Clinical TempEval series (Bethard et al., 2015, 2016, 2017).

While the performance of DocTimeRel models has reached above 0.8 F1 on the THYME corpus, the CONTAINS task remains a challenge for both conventional learning approaches (Sun et al., 2013; Bethard et al., 2015, 2016, 2017) and neural models (structured perceptrons (Leeuwenberg and Moens, 2017), convolutional neural networks (CNNs) (Dligach et al., 2017; Lin et al., 2017), and Long Short-Term Memory (LSTM) networks (Tourille et al., 2017; Dligach et al., 2017; Lin et al., 2018; Galvan et al., 2018)). The difficulty is that the limited labeled data is insufficient for training deep neural models for complex linguistic phenomena. Some recent work (Lin et al., 2019) has used massive pre-trained language models (BERT; Devlin et al., 2018) and their variations (Lee et al., 2019) for this task and significantly increased the CONTAINS score by taking advantage of the rich BERT representations. However, that approach has an input representation that is highly wasteful – the same sentence must be processed multiple times, once for each candidate relation pair.

Inspired by recent work in Green AI (Schwartz et al., 2019; Strubell et al., 2019) and one-pass encodings for multiple-relation extraction (Wang et al., 2019), we propose a one-pass encoding mechanism for the CONTAINS relation extraction task, which can significantly increase efficiency and scalability. The architecture is shown in Figure 1. The three novel modifications to the original one-pass relational model of Wang et al. (2019) are: (1) Unlike Wang et al. (2019), our model operates in the relation extraction setting, meaning it must distinguish between relations and non-relations, as well as classifying by relation type. (2) We introduce a pooled embedding for relational classification across long distances. Wang et al. (2019) focused on short-distance relations, but clinical CONTAINS relations often span multiple sentences, so a sequence-level embedding is necessary for such long-distance inference. (3) We use the same BERT encoding of the input instance for both the DocTimeRel and CONTAINS tasks, i.e., adding multi-task learning (MTL) on top of the one-pass encoding.


DocTimeRel and CONTAINS are related tasks. For example, if a medical event A happens BEFORE the DCT, while event B happens AFTER the DCT, it is unlikely that there is a CONTAINS relation between A and B. MTL provides an effective way to leverage useful knowledge learned in one task to benefit other tasks. What is more, MTL can potentially employ a regularization effect that alleviates overfitting to a specific task.

Figure 1: Model Architecture. e1, e2, and t represent entity embeddings for "surgery", "scheduled", and "March 11, 2014" respectively. G is the pooled embedding for the entire input instance.

2 Methodology

2.1 Twin Tasks

Apache cTAKES (Savova et al., 2010) (http://ctakes.apache.org) is used for segmenting and tokenizing the THYME corpus in order to generate instances. Each instance is a sequence of tokens with the gold standard event and time expression annotations marked in the token sequence by logging their positional information. Using the entity-aware self-attention based on relative distance (Wang et al., 2019), we can encode every entity, E_i, by its BERT embedding, e_i. If an entity e_i consists of multiple tokens (many time expressions are multi-token), it is average-pooled (local pool in Figure 1) over the embeddings of the corresponding tokens in the last BERT layer.

For the CONTAINS task, we create relation candidates from all pairs of entities within an input sequence. Each candidate is represented by the concatenation of three embeddings, e_i, e_j, and G, as [G : e_i : e_j], where G is an average-pooled embedding over the entire sequence and is different from the embedding of the [CLS] token. The [CLS] token is the conventional token BERT inserts at the start of every input sequence, and its embedding is viewed as the representation of the entire sequence. The concatenated embedding is passed to a linear classifier to predict the CONTAINS, CONTAINED-BY, or NONE relation, \hat{r}_{ij}, as in eq. (1):

P(\hat{r}_{ij} \mid x, E_i, E_j) = \mathrm{softmax}(W^L [G : e_i : e_j] + b)    (1)

where W^L \in \mathbb{R}^{3 d_z \times l_r}, d_z is the dimension of the BERT embedding, l_r = 3 for the CONTAINS labels, b is the bias, and x is the input sequence.

Similarly, for the DocTimeRel (dtr) task, we feed each entity's embedding, e_i, together with the global pooling G, to another linear classifier to predict the entity's five "temporal statuses": TIMEX if the entity is a time expression, or the dtr type (BEFORE, AFTER, etc.) if the entity is an event:

P(\hat{dtr}_i \mid x, E_i) = \mathrm{softmax}(W^D [G : e_i] + b)    (2)

where W^D \in \mathbb{R}^{2 d_z \times l_d}, and l_d = 5.

For the combined task, we define the loss as:

L(\hat{r}_{ij}, r_{ij}) + \alpha \left( L(\hat{dtr}_i, dtr_i) + L(\hat{dtr}_j, dtr_j) \right)    (3)

where \hat{r}_{ij} is the predicted relation type, \hat{dtr}_i and \hat{dtr}_j are the predicted temporal statuses for E_i and E_j respectively, r_{ij} is the gold relation type, and dtr_i and dtr_j are the gold temporal statuses. \alpha is a weight to balance the CONTAINS loss and the dtr loss.

2.2 Window-based token sequence processing

Following Lin et al. (2019), we use a set window of tokens (Token-Window), disregarding natural sentence boundaries, for generating instances. BERT may still take punctuation tokens into account. Each token sequence is limited by a set number of entities (Entity-Window) to be processed. We apply a sliding token window (windows may overlap), thus every entity gets processed. Positional information for each entity is output along the token sequence and is propagated through different layers via the entity-aware self-attention mechanism (Wang et al., 2019).
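The following is a minimal PyTorch-style sketch of the classification heads in eqs. (1)-(3). It assumes the last-layer BERT token embeddings are already available for one instance; the entity-aware self-attention of Wang et al. (2019), the THYME preprocessing, and all names (OnePassHeads, entity_spans, alpha) are illustrative, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OnePassHeads(nn.Module):
    def __init__(self, d_z, n_rel=3, n_dtr=5):
        super().__init__()
        self.rel_classifier = nn.Linear(3 * d_z, n_rel)   # eq. (1): [G : e_i : e_j]
        self.dtr_classifier = nn.Linear(2 * d_z, n_dtr)   # eq. (2): [G : e_i]

    def forward(self, token_embeddings, entity_spans):
        # token_embeddings: (seq_len, d_z) last-layer BERT output for one instance
        # entity_spans: list of (start, end) token offsets, one per entity
        G = token_embeddings.mean(dim=0)                                   # global pool
        ents = [token_embeddings[s:e].mean(dim=0) for s, e in entity_spans]  # local pool
        dtr_logits = torch.stack(
            [self.dtr_classifier(torch.cat([G, e_i])) for e_i in ents])
        pair_logits, pairs = [], []
        for i in range(len(ents)):
            for j in range(i + 1, len(ents)):                              # all candidate pairs
                pair_logits.append(self.rel_classifier(torch.cat([G, ents[i], ents[j]])))
                pairs.append((i, j))
        return dtr_logits, torch.stack(pair_logits), pairs

def combined_loss(rel_logits, rel_gold, dtr_logits, dtr_gold, pairs, alpha=0.01):
    # eq. (3): CONTAINS loss plus alpha-weighted dtr losses of the two arguments.
    loss = F.cross_entropy(rel_logits, rel_gold)
    for i, j in pairs:
        loss = loss + alpha * (F.cross_entropy(dtr_logits[i:i + 1], dtr_gold[i:i + 1])
                               + F.cross_entropy(dtr_logits[j:j + 1], dtr_gold[j:j + 1]))
    return loss

The key design point the sketch tries to show is that the expensive BERT encoding happens once per instance; only the cheap concatenation and linear layers are repeated for the n(n-1)/2 candidate pairs.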

3 Experiments

3.1 Data and Settings

We adopt the THYME corpus (Styler IV et al., 2014) for model fine-tuning and evaluation. The one-pass multi-task model is fine-tuned on the THYME Colon Cancer training set with the uncased BERT-base model, using the code released by Wang et al. (2019)1 as a base. The batch size is set to 4, the learning rate is selected from (1e-5, 2e-5, 3e-5, 5e-5), the Token-Window size is selected from (60, 70, 100), the Entity-Window size is selected from (8, 10, 16), the training epochs are selected from (2, 3, 4, 5), the clipping distance k (the maximum relative position to consider) is selected from (3, 4, 5), and α is selected from (0.01, 0.05). A single NVIDIA GTX Titan Xp GPU is used for the computation. The best model is selected on the Colon cancer development set and tested on the Colon cancer test set, and on the THYME Brain cancer test set for portability assessment.

1 https://github.com/helloeve/mre-in-one-pass

Model                   P      R      F1
Multi-pass              0.735  0.613  0.669
Multi-pass+Silver       0.674  0.695  0.684
One-pass                0.647  0.671  0.659
One-pass+[CLS]          0.665  0.673  0.669
One-pass+Pooling        0.670  0.689  0.680
One-pass+Pooling+MTL    0.686  0.687  0.686

Table 1: Model performance for the CONTAINS relation on the colon cancer test set. Multi-pass baselines are from Lin et al. (2019)'s system without and with self-training using silver instances (system predictions on an unlabeled colon cancer set). We tested a one-pass system with just argument embeddings; with the [CLS] token as the global context vector ([CLS]); with argument embeddings plus a globally pooled context vector (Pooling); and with global pooling as well as multi-task learning (MTL) with DocTimeRel.

Model             Single  MTL
AFTER             0.86    0.83
BEFORE            0.88    0.89
BEFORE/OVERLAP    0.63    0.56
OVERLAP           0.89    0.85
TIMEX             0.98    0.98
OVERALL           0.88    0.86

Table 2: Model performance in F1-scores of temporal statuses on the colon cancer test set. Single: One-pass+Pooling for a single dtr task; MTL: One-pass+Pooling for the twin tasks CONTAINS and dtr.

Model                   P      R      F1
Lin et al. (2019)       0.473  0.700  0.565
One-pass+Pooling        0.506  0.643  0.566
One-pass+Pooling+MTL    0.545  0.624  0.582

Table 3: Model performance for the CONTAINS relation on the brain cancer test set.

3.2 Results on THYME

Table 1 shows the performance of our one-pass models for the CONTAINS task on the Clinical TempEval colon cancer test set. The one-pass (OP) model alone obtains an F1 score of 0.659. Adding the [CLS] token as the global context vector increases the F1 score to 0.669. Using a globally average-pooled context vector G instead of [CLS] improves performance to 0.680, better than the multi-pass model without silver instances (Lin et al., 2019). Applying the MTL setting, the one-pass twin-task (CONTAINS and DocTimeRel) model without any silver data reaches 0.686 F1, which is on par with the multi-pass model trained with additional silver instances on the CONTAINS task, 0.684 F1 (Lin et al., 2019).

Table 2 shows the performance of our one-pass models for the DocTimeRel task on the Clinical TempEval colon cancer test set. The single-task model achieves 0.88 weighted average F1, while the MTL model compromises the performance to 0.86 F1. Of note, this result is not directly comparable to the Bethard et al. (2016) results because the Clinical TempEval evaluation script does not take into account whether an entity is correctly recognized as a time expression (TIMEX). There are two types of entities in the THYME annotation: events and time expressions (TIMEX). The Bethard et al. (2016) evaluation on DocTimeRel was focused on all events, and classified an event into four DocTimeRel types. Our evaluation was for all entities. For a given entity, we classify it as a TIMEX or an event; if it is an event, we classify it into four DocTimeRel types, for a total of five classes.

Table 3 shows the portability of our one-pass models on the THYME brain cancer test set. Without any tuning on brain cancer data, the MTL model with global pooling performs at 0.582 F1, which is better than the multi-pass model trained with additional silver instances (0.565 F1) reported in Lin et al. (2019), trading roughly equal amounts of precision for recall to obtain a better balance. Without MTL, the one-pass CONTAINS model with global context embeddings (One-pass+Pooling) achieves 0.566 F1 on the brain cancer test set, significantly lower than the MTL model (using a Wilcoxon Signed-rank test over document-by-document comparisons, as in Cherry et al. (2013), p-value = 0.01962).
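As a rough illustration of the document-level significance test mentioned above, the sketch below compares two systems' per-document F1 scores with a Wilcoxon signed-rank test; the per-document scoring and file layout are assumptions, not the authors' evaluation scripts.

from scipy.stats import wilcoxon

def compare_systems(per_doc_f1_a, per_doc_f1_b):
    # per_doc_f1_a / per_doc_f1_b: F1 of each system on the same test documents,
    # in the same order (paired samples, one value per document).
    statistic, p_value = wilcoxon(per_doc_f1_a, per_doc_f1_b)
    return statistic, p_value

# A p-value below 0.05 would indicate the two systems differ significantly at the
# document level, as with the MTL vs. non-MTL comparison reported above.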

3.3 Computational Efficiency

Table 4 shows the computational burden for the different models in terms of floating point operations (flops). The flops are derived from TensorFlow's profiling tool on saved model graphs. The second column is the flops per training instance, and the third column lists the number of instances for the different model settings. The total computational complexity for one training epoch is thus the product of columns 2 and 3. The Ratio column is the relative ratio of total complexity using the OP total flops as the comparator.

Model               flops/inst    inst#   Ratio
OP                  218,767,889   20k     1
OP+MTL              218,783,260   20k     1
Multi-pass          218,724,880   427k    23
Multi-pass+Silver   218,724,880   497k    25

Table 4: Computational complexity in flops per instance (flops/inst) × total number of instances (inst#).

For relation extraction, all entities within a sequence must be paired. If there are n entities in a token sequence, there are n × (n − 1)/2 ways to combine those entities into relational candidates. The multi-pass model would encode the same sequence n × (n − 1)/2 times, while the one-pass model would encode it only once and add the pairing computation on top of the BERT encoding represented in Figure 1, with a very minor increase in computation per instance (about 43K flops); the MTL model adds another 15K flops; but they are of the same magnitude, 219K flops. The one-pass models save a lot of passes over the training instances, 20k vs. 497k, which results in a significant difference in computational load, 1 vs. 25, which could be several hours' to several days' difference in GPU hours. The exact number of training instances processed by the one-pass model is affected by the Token-Window and Entity-Window hyper-parameters. However, even in the worst-case scenario, when the Token-Window is set to 100 and the Entity-Window is set to 8, there are 108K training instances for the one-pass model, which is still substantially fewer training instances than what are used for the multi-pass model. In addition, since the one-pass models do not run the extra steps used for generating silver instances (Lin et al., 2019), the time savings is even greater.

4 Discussion

From Table 1, rows 3-5, we can see that a sequence-wise embedding, either the global pooling G or [CLS], is important for clinical temporal relation extraction, which involves long-distance relations that may go across multiple natural sentences. Entity embeddings are good for tasks that focus on short-distance relations (such as Gabor et al. (2018)), but may not be sufficient for picking up enough context for long-distance relations.

Combining MTL with a one-pass mechanism produces a more efficient and generalizable model. With merely an additional 15K flops (Table 4, rows 1 and 2), the model achieves high performance on both tasks. However, we found that it is hard for both tasks to reach top performance at the same time. If the weight for the dtr loss is increased, the dtr F1 increases at the cost of the CONTAINS scores. Even though the majority of entities in CONTAINS relations have aligned dtr values (e.g., in Figure 2 (#1), both entities have the matching dtr value AFTER), some relations do have conflicting dtr values. For example, in Figure 2 (#2), the dtr for "screening" is BEFORE, while "test" is BEFORE OVERLAP (the present perfect tense signifies the tests happened in the past but last through the present, hence BEFORE OVERLAP). Even though it is a gold CONTAINS annotation, the model may be confused by an event that happened in the past ("screening") containing another event ("test") that is longer than its temporal scope. Due to these conflicts, we thus pick the more challenging CONTAINS task as our priority and set α relatively low (0.01) in order to optimize the model towards the CONTAINS task, ignoring some of the dtr errors or conflicts. In the meantime, the MTL setting does help prevent the model from overfitting to one specific task, thus achieving some level of generalization. The significant 1.6% increase in F1-score on the Brain test set in Table 3 demonstrates the improved generalizability.

In conclusion, we built a "green" model for a challenging problem. Deployed on a single GPU with 25 times better efficiency, it succeeded in both temporal tasks, achieved better generalizability, and is suited to other pre-trained models (Liu et al., 2019; Alsentzer et al., 2019; Beltagy et al., 2019; Lan et al., 2019; Yang et al., 2019, etc.)
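The efficiency argument above reduces to simple arithmetic over Table 4. The sketch below reproduces it, with the per-instance flops and instance counts taken from the table and everything else (function names, the single-epoch assumption) purely illustrative.

def total_epoch_flops(flops_per_instance, n_instances):
    # Total complexity for one training epoch = column 2 x column 3 of Table 4.
    return flops_per_instance * n_instances

one_pass = total_epoch_flops(218_767_889, 20_000)
multi_pass_silver = total_epoch_flops(218_724_880, 497_000)
print(round(multi_pass_silver / one_pass))   # ~25, matching the Ratio column

def candidate_pairs(n_entities):
    # A sequence with n entities yields n * (n - 1) / 2 relation candidates;
    # the multi-pass model re-encodes the sequence once per candidate, while
    # the one-pass model encodes it once and reuses the encoding.
    return n_entities * (n_entities - 1) // 2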

73 AFTER AFTER 2012 i2b2 nlp challenge. Journal of the American #1: Subsequent staging will include MRI of the pelvis... Medical Informatics Association, 20(5):843–848. BEFORE #2: His colon cancer screening in the past has been fecal Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep BEFORE/OVERLAP bidirectional transformers for language understand- occult blood tests yearly since the age of 55... ing. arXiv preprint arXiv:1810.04805.

Figure 2: CONTAINS Relations with match- Dmitriy Dligach, Timothy Miller, Chen Lin, Steven ing(#1)/conflicting(#2) DocTimeRel values. Bethard, and Guergana Savova. 2017. Neural tem- poral relation extraction. EACL 2017, page 746. Kata Gabor,´ Davide Buscaldi, Anne-Kathrin Schu- Acknowledgments mann, Behrang QasemiZadeh, Ha¨ıfa Zargayouna, and Thierry Charnois. 2018. SemEval-2018 task The study was funded by R01LM10090, 7: Semantic relation extraction and classification in R01GM114355 and UG3CA243120 from the scientific papers. In Proceedings of The 12th Inter- Unites States National Institutes of Health. The national Workshop on Semantic Evaluation, pages content is solely the responsibility of the authors 679–688, New Orleans, Louisiana. Association for and does not necessarily represent the official Computational Linguistics. views of the National Institutes of Health. We Diana Galvan, Naoaki Okazaki, Koji Matsuda, and thank the anonymous reviewers for their valuable Kentaro Inui. 2018. Investigating the challenges of suggestions and criticism. The Titan Xp GPU temporal relation extraction from clinical text. In Proceedings of the Ninth International Workshop on used for this research was donated by the NVIDIA Health Text Mining and Information Analysis, pages Corporation. 55–64. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. References 2019. Albert: A lite bert for self-supervised learn- Emily Alsentzer, John R Murphy, Willie Boag, Wei- ing of language representations. arXiv preprint Hung Weng, Di Jin, Tristan Naumann, and Matthew arXiv:1909.11942. McDermott. 2019. Publicly available clinical bert embeddings. arXiv preprint arXiv:1904.03323. Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. Scibert: and Jaewoo Kang. 2019. BioBERT: a pre-trained A pretrained language model for scientific text. In biomedical language representation model for Proceedings of the 2019 Conference on Empirical biomedical text mining. Bioinformatics. Methods in Natural Language Processing and the Tuur Leeuwenberg and Marie-Francine Moens. 2017. 9th International Joint Conference on Natural Lan- Structured learning for temporal relation extraction guage Processing (EMNLP-IJCNLP), pages 3606– from clinical records. In Proceedings of the 15th 3611. Conference of the European Chapter of the Associa- tion for Computational Linguistics. Steven Bethard, Leon Derczynski, Guergana Savova, Guergana Savova, James Pustejovsky, and Marc Ver- Chen Lin, Timothy Miller, Dmitriy Dligach, Hadi hagen. 2015. Semeval-2015 task 6: Clinical temp- Amiri, Steven Bethard, and Guergana Savova. 2018. eval. In Proceedings of the 9th International Work- Self-training improves recurrent neural networks shop on Semantic Evaluation (SemEval 2015), pages performance for temporal relation extraction. In 806–814. Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis, pages Steven Bethard, Guergana Savova, Wei-Te Chen, Leon 165–176. Derczynski, James Pustejovsky, and Marc Verhagen. 2016. Semeval-2016 task 12: Clinical tempeval. Chen Lin, Timothy Miller, Dmitriy Dligach, Steven Proceedings of SemEval, pages 1052–1062. Bethard, and Guergana Savova. 2017. Repre- sentations of time expressions for temporal rela- Steven Bethard, Guergana Savova, Martha Palmer, tion extraction with convolutional neural networks. James Pustejovsky, and Marc Verhagen. 2017. BioNLP 2017, pages 322–327. 
Semeval-2017 task 12: Clinical tempeval. Proceed- ings of the 11th International Workshop on Semantic Chen Lin, Timothy Miller, Dmitriy Dligach, Steven Evaluation (SemEval-2017), pages 563–570. Bethard, and Guergana Savova. 2019. A bert-based universal model for both within-and cross-sentence Colin Cherry, Xiaodan Zhu, Joel Martin, and Berry clinical temporal relation extraction. In Proceedings de Bruijn. 2013. la recherche du temps perdu: ex- of the 2nd Clinical Natural Language Processing tracting temporal relations from medical text in the Workshop, pages 65–71.

74 Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car- dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, bonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Luke Zettlemoyer, and Veselin Stoyanov. 2019. Xlnet: Generalized autoregressive pretraining for Roberta: A robustly optimized bert pretraining ap- language understanding. Advances in Neural Infor- proach. arXiv preprint arXiv:1907.11692. mation Processing Systems 32, pages 5754–5764.

James Pustejovsky, Jose´ M Castano, Robert Ingria, Roser Sauri, Robert J Gaizauskas, Andrea Set- zer, Graham Katz, and Dragomir R Radev. 2003. Timeml: Robust specification of event and tempo- ral expressions in text. New directions in question answering, 3:28–34.

James Pustejovsky and Amber Stubbs. 2011. Increas- ing informativeness in temporal annotation. In Pro- ceedings of the 5th Linguistic Annotation Workshop, pages 152–160. Association for Computational Lin- guistics.

Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper- Schuler, and Christopher G Chute. 2010. Mayo clin- ical text analysis and knowledge extraction system (ctakes): architecture, component evaluation and ap- plications. Journal of the American Medical Infor- matics Association, 17(5):507–513.

Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. 2019. Green ai. arXiv preprint arXiv:1907.10597.

Emma Strubell, Ananya Ganesh, and Andrew McCal- lum. 2019. Energy and policy considerations for deep learning in nlp. Proceedings of the 57th An- nual Meeting of the Association for Computational Linguistics, pages 3645–3650.

William F Styler IV, Steven Bethard, Sean Finan, Martha Palmer, Sameer Pradhan, Piet C de Groen, Brad Erickson, Timothy Miller, Chen Lin, Guergana Savova, et al. 2014. Temporal annotation in the clin- ical domain. Transactions of the Association for Computational Linguistics, 2:143–154.

Weiyi Sun, Anna Rumshisky, and Ozlem Uzuner. 2013. Evaluating temporal relations in clinical text: 2012 i2b2 challenge. Journal of the American Medical Informatics Association, 20(5):806–813.

Julien Tourille, Olivier Ferret, Aurelie Neveol, and Xavier Tannier. 2017. Neural architecture for tem- poral relation extraction: A bi-lstm approach for de- tecting narrative containers. In Proceedings of the 55th Annual Meeting of the Association for Compu- tational Linguistics (Volume 2: Short Papers), vol- ume 2, pages 224–230.

Haoyu Wang, Ming Tan, Mo Yu, Shiyu Chang, Dakuo Wang, Kun Xu, Xiaoxiao Guo, and Saloni Potdar. 2019. Extracting multiple-relations in one-pass with pre-trained transformers. Proceedings of the 57th Annual Meeting of the Association for Computa- tional Linguistics, pages 1371–1377.

Experimental Evaluation and Development of a Silver-Standard for the MIMIC-III Clinical Coding Dataset

Thomas Searle1, Zina Ibrahim1, Richard JB Dobson1,2
1Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, U.K.
2Institute of Health Informatics, University College London, London, U.K.
{firstname.lastname}@kcl.ac.uk

Abstract

Clinical coding is currently a labour-intensive, error-prone, but critical administrative process whereby hospital patient episodes are manually assigned codes by qualified staff from large, standardised taxonomic hierarchies of codes. Automating clinical coding has a long history in NLP research and has recently seen novel developments setting new state-of-the-art results. A popular dataset used in this task is MIMIC-III, a large intensive care database that includes clinical free text notes and associated codes. We argue for the reconsideration of the validity of MIMIC-III's assigned codes, which are often treated as gold-standard, especially as MIMIC-III has not undergone secondary validation. This work presents an open-source, reproducible experimental methodology for assessing the validity of codes derived from EHR discharge summaries. We exemplify the methodology with MIMIC-III discharge summaries and show that the most frequently assigned codes in MIMIC-III are under-coded by up to 35%.

1 Introduction

Clinical coding is the process of translating statements written by clinicians in natural language to describe a patient's complaint, problem, diagnosis and treatment, into an internationally-recognised coded format (World Health Organisation, 2011). Coding is an integral component of healthcare and provides standardised means for reimbursement, care administration, and for enabling epidemiological studies using electronic health record (EHR) data (Henderson et al., 2006).

Manual clinical coding is a complex, labour-intensive, and specialised process. It is also error-prone due to the subtleties and ambiguities common in clinical text and the often strict timelines imposed on coding encounters. The annual cost of clinical coding is estimated to be $25 billion in the US alone (Farkas and Szarvas, 2008).

To alleviate the burden of the status quo of manual coding, several machine learning (ML) automated coding models have been developed (Larkey and Croft, 1996; Aronson et al., 2007; Farkas and Szarvas, 2008; Perotte et al., 2014; Ayyar et al., 2016; Baumel et al., 2018; Mullenbach et al., 2018; Falis et al., 2019). However, despite continued interest, translation of ML systems into real-world deployments has been limited. An important factor contributing to the limited translation is the fluctuating quality of the manually-coded real hospital data used to train and evaluate such systems, where large margins of error are a direct consequence of the difficulty and error-prone nature of manual coding. To our knowledge, the literature contains only two systematic evaluations of the quality of clinically-coded data, both based on UK trusts and showing accuracy ranging between 50% and 98% (Burns et al., 2012) and error rates between 1% and 45.8% (CHKS Ltd, 2014), respectively. In Burns et al. (2012), the actual accuracy is likely to be lower because the reviewed trusts used varying statistical evaluation methods, validation sources (clinical text vs clinical registries), sampling modes for accuracy estimation (random vs non-random), and quality of validators (qualified clinical coders vs lay people). CHKS Ltd (2014) highlight that 48% of the reviewed trusts used discharge summaries alone or as the primary source for coding an encounter, to minimise the amount of raw text used for code assignment. However, further portions of the documented encounter are often needed to assign codes accurately.

The Medical Information Mart for Intensive Care (MIMIC-III) database (Johnson et al., 2016) is the largest free resource of hospital data and constitutes a substantial portion of the training data of automated coding models. Nevertheless, MIMIC-III is significantly under-coded for specific conditions (Kokotailo and Hill, 2005), and has been shown to exhibit reproducibility issues in the problem of

76 Proceedings of the BioNLP 2020 workshop, pages 76–85 Online, July 9, 2020 c 2020 Association for Computational Linguistics mortality prediction (Johnson et al., 2017). There- with health services result in a set of clinical codes fore, serious consideration is needed when using that directly correlate to the care provided. MIMIC-III to train automated coding solutions. Top-level ICD codes represent the highest level In this work, we seek to understand the limita- of the hierarchy, with ICD-9/10 (ICD-10 being the tions of using MIMIC-III to train automated cod- later version) listing 19 and 21 chapters respec- ing systems. To our knowledge, no work has at- tively. Clinically meaningful hierarchical subdivi- tempted to validate the MIMIC-III clinical coding sions of each chapter provide further specialisation dataset for all admissions and codes, due to the of a given condition. time-consuming and costly nature of the endeavour. Coding clinical text results in the assignment of To illustrate the burden, having two clinical coders, a single primary diagnosis and further secondary working 38 hours a week re-coding all 52,726 ad- diagnosis codes (World Health Organisation, 2011). mission notes at a rate of 5 minutes and $3 per The complexity of coding encounters largely stems document, would amount to $316,000 and 115 ∼ ∼ from the substantial number of available codes. For weeks work for a ‘gold standard’ dataset. Even example, ICD-10-CM is the US-specific extension then, documents with a low inter-annotator agree- to the standard ICD-10 and includes 72,000 codes. ment would undergo a final coding round by a Although a significant portion of the hierarchy cor- third coder, further raising the approximate cost responds to rare conditions, ‘common’ conditions to $316,000 and stretching the 70 weeks. ∼ to code are still in the order of thousands. In this work, we present an experimental eval- Moreover, clinical text often contains special- uation of coding coverage in the MIMIC-III dis- ist terminology, spelling mistakes, implicit men- charge summaries. The evaluation uses text ex- tions, abbreviations and bespoke grammatical rules. traction rules and a validated biomedical named However, even qualified clinical coders are not per- entity recognition and linking (NER+L) tool, Med- mitted to infer codes that are not explicitly men- CAT (Kraljevic et al., 2019) to extract ICD-9 codes, tioned within the text. For example, a diagnostic reconciling them with those already assigned in test result that indicates a condition (with the con- MIMIC-III. The training and experimental setup dition not explicitly written), or a diagnosis that yield a reproducible open-source procedure for is written as ‘questioned’ or ‘possible’ cannot be building silver-standard coding datasets from clin- coded. ical notes. Using the approach, we produce a Another factor contributing to the laborious na- silver-standard dataset for ICD-9 coding based on ture of coding is the large amount of duplication MIMIC-III discharge summaries. present in EHRs, as a result of features such as This paper is structured as follows: Section2 copy & paste being made available to clinical staff. reviews essential background and related work in It has been reported that 20-78% of clinicians dupli- automated clinical coding, with a particular focus cate sections of records between notes (Bowman, on MIMIC-III. 
Section3 presents our experimental 2013), subsequently producing an average data re- setup and the semi-supervised development of a sil- dundancy of 75% (Zhang et al., 2017). ver standard dataset of clinical codes derived from unstructured EHR data. The results are presented 2.2 MIMIC-III - a Clinical Coding Database in Section4, while Section5 discusses the wider impact of the results and future work. MIMIC-III (Johnson et al., 2016) is a de-identified database containing data from the intensive care 2 Background unit of the Beth Israel Medical Deaconess Center, Boston, Massachusetts, USA, collected 2001-12. 2.1 Clinical Coding Overview MIMIC-III is the world’s largest resource of freely- The International Statistical Classification of Dis- accessible hospital data and contains demograph- eases and Health Related Problems (ICD) provides ics, laboratory test results, procedures, medications, a hierarchical taxonomic structure of clinical ter- caregiver notes, imaging reports, admission and dis- minology to classify morbidity data (World Health charge summaries, as well as mortality (both in and Organisation, 2011). The framework provides con- out of the hospital) data of 52,726 critical care pa- sistent definitions across global health care services tients. MIMIC provides an open-source platform to describe adverse health events including illness, for researchers to work on real patient data. At the injury and disability. Broadly, patient encounters time of writing, MIMIC-III has over 900 citations.
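Since the analysis that follows joins free-text discharge summaries with their manually assigned codes, a minimal sketch of that join on a local MIMIC-III replica may help orient the reader. The table and column names follow the standard MIMIC-III schema (NOTEEVENTS, DIAGNOSES_ICD), but the query and function are illustrative assumptions rather than the authors' released SQL procedures.

import sqlite3  # assuming a local SQLite (or similarly queryable) replica of MIMIC-III
import pandas as pd

QUERY = """
SELECT n.hadm_id, n.text AS discharge_summary, d.icd9_code
FROM noteevents n
JOIN diagnoses_icd d ON d.hadm_id = n.hadm_id
WHERE n.category = 'Discharge summary'
"""

def load_notes_and_codes(db_path):
    # Returns one row per (admission, assigned ICD-9 code) pair, with the
    # discharge summary text attached for downstream code extraction.
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql(QUERY, conn)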

77 2.3 Automated Clinical Coding maries. We also describe the semi-supervised cre- Early ML work on automated clinical coding con- ation of a silver-standard dataset of clinical codes sidered ensembles of simple text classifiers to pre- from unstructured EHR text based on MIMIC-III dict codes from discharge summaries (Larkey and discharge summaries. Croft, 1996). Rule-based models have also been 3.1 Data Preparation formulated, by directly replicating coding manuals. A prominent example of rule-based models is the Discharge summary reports are used to provide BioNLP 2007 shared task (Aronson et al., 2007), an overview for the given hospital episode. Au- which supplied a gold standard labelled dataset tomated coding systems often only use discharge of radiology reports. The dataset continues to be reports as they contain the salient diagnostic text used to train and validate ML coding. For example, (Perotte et al., 2014; Baumel et al., 2018; Mullen- Kavuluru et al.(2015) used the dataset in addition bach et al., 2018) without over burdening the model. to two US-based hospital EHRs. Although the two MIMIC-III discharge summaries are categorised additional datasets used by Kavuluru et al.(2015) distinctly from other clinical text. The text is of- were not validated to a gold standard, they are re- ten structured with section headings and content flective of the diversity found in clinical text. Their section delimiters such as line breaks. We identify largest dataset contained 71,463 records, 60,238 Discharge Diagnosis (DD) sections in the majority distinct code combinations and had an average doc- of discharge summary reports 92% (n=48,898) us- ument length of 5303 words. ing a simple rule based approach. These sections The majority of automated coding systems are are lists of diagnoses assigned to the patient during trained and tested Using MIMIC-III. Perotte et al. admission. Xie and Xing(2018) previously used (2014) trained hierarchical support vector machine these sections to develop a matching algorithm models on the MIMIC-II EHR (Saeed et al., 2011), from discharge diagnosis to ICD code descriptions the earlier version of MIMIC. The models were with moderate success demonstrating state-of-the- trained using the full ICD-9-CM terminology, cre- art sensitivity (0.29) and specificity (0.33) scores. ating baseline results for subsequent models of For the 8% (n=3,828) that are missing these sec- 0.395 F1-micro score. Ayyar et al.(2016) used tions we manually inspect a handful of examples a long-short-term-memory (LSTM) neural network and observe instances of patient death and admin- to predict ICD-9 codes in MIMIC-III. However, istration errors. The SQL procedures used to ex- Ayyar et al.(2016) cannot be directly compared to tract the raw data from a locally built replica of the former methods as the model only predicts the top MIMIC-III database and the extraction logic for nineteen level codes. DDs are available open-source as part of this wider 1 Methodological developments continued to use analysis . MIMIC-III with Tree-of-sequence LSTMs (Xie Table1 lists example extracted DDs. There is a and Xing, 2018), hierarchical attention gated re- large variation in structure, use of abbreviations and current unit (HA-GRU) neural networks (Baumel extensive use of clinical terms. Some DDs list the et al., 2018) and convolutional neural networks primary diagnosis alongside secondary diagnosis, with attention (CAML) (Mullenbach et al., 2018). whereas others simply list a set of conditions. 
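Section 3.1's rule-based extraction of the Discharge Diagnosis (DD) subsection can be approximated with a short regular expression, as sketched below. The heading variants and the extract_icd9_codes hook (standing in for the MedCAT NER+L step of Section 3.2) are assumptions for illustration; the authors' exact rules are in their released extraction logic.

import re

# Common heading variants for the DD subsection; treat this pattern as illustrative.
DD_HEADING = re.compile(
    r"discharge\s+diagnos(?:is|es)\s*:?\s*(?P<body>.+?)(?:\n\s*\n|\Z)",
    re.IGNORECASE | re.DOTALL,
)

def extract_dd_section(discharge_summary):
    # Returns the Discharge Diagnosis block of a summary, or None if absent
    # (about 8% of MIMIC-III discharge summaries have no such section).
    match = DD_HEADING.search(discharge_summary)
    return match.group("body").strip() if match else None

def extract_icd9_codes(dd_text):
    # Placeholder for the MedCAT NER+L step: map recognised UMLS concepts in
    # dd_text to ICD-9 codes. Hypothetical signature, not the MedCAT API.
    raise NotImplementedError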
The HA-GRU and CAML models were directly compared with (Perotte et al., 2014), achieving 3.2 Semi-Supervised Named Entity 0.405 and 0.539 F1-micro respectively. A recent Recognition and Linkage Tool empirical evaluation of ICD-9 coding methods pre- We use MedCAT (Kraljevic et al., 2019), a dicted the top fifty ICD-9 codes from MIMIC-III, pre-trained named entity recognition and linking suggesting condensed memory networks as a supe- (NER+L) model, to identify and extract the cor- rior network topology (Huang et al., 2018). responding ICD codes in a discharge summary note. MedCAT utilises a fast dictionary-based 3 Semi-Supervised Extraction of Clinical algorithm for direct text matches and a shallow Codes neural network concept to learn fixed length dis- tributed semantic vectors for ambiguous text spans. In this section, we describe the data preprocessing, The method is conceptually similar to Word2Vec methodology and experimental design for evaluat- ing the coding quality of MIMIC-III discharge sum- 1https://tinyurl.com/t7dxn3j

78 Extracted Discharge Diag- Admission 3.3 Code Prediction Datasets nosis ID We run our MedCAT model over each extracted CAD now s/p CABG 102894 DD subsection. The model assigns each token or HTN, DM, Osteoarthritis, sequence of tokens a UMLS code and therefore Dyslipidemia an associated ICD code. In our comparison of the Left convexity, tento- 161919 MedCAT produced annotations with the MIMIC- rial, parafalcine Subdural III assigned codes we have 3 distinct datasets: hematoma Primary Diagnoses: 152382 1. MedCAT does not identify a concept and the 1. Acute ST segment Eleva- code has been assigned in MIMIC-III. Denoted tion Myocardial Infarction A NP for ‘Assigned, Not Predicted’. Secondary Diagnoses: 1. Hypertension 2. MedCAT identifies a concept and this matches 2. Hyperlipidemia with an assigned code in MIMIC-III. Denoted for ‘Predicted, Assigned’. Seizures. 132065 P A 3. MedCAT identifies a concept and this does not Table 1: Example discharge diagnosis subsections ex- tracted from MIMIC-III discharge summaries match with an assigned code in MIMIC-III dataset. Denoted P NA for ‘Predicted, Not As- signed’. (Mikolov et al., 2013) in that word representa- tions are learnt by detecting correlations of context We do not consider the case where both MedCAT words, and learnt vectors exhibit the semantics of and the existing MIMIC-III assigned codes have the underlying words. The tool can be trained in missed an assignable code as this would involve a unsupervised or a supervised manner. However, manual validation of all notes, and as previously unlike Word2Vec that learns a single representa- discussed, is infeasible for a dataset of this size. tion for each word, MedCAT enables the learn- 3.3.1 Producing the Silver Standard ing of ‘concept’ representations by accommodat- Given the above initial datasets we produce our ing synonymous terms, abbreviations or alternative final silver-standard clinical coding dataset by: spellings. We use a MedCAT model pre-loaded with the 1. Sampling from the missing predictions dataset Unified Medical Language System (Bodenreider, (A NP) to manually collect annotations where 2004) (UMLS). UMLS is a meta-thesaurus of med- our out-of-the-box MedCAT model fails to ical ontologies that provides rich synonym lists recognise diagnoses. that can be used for recognition and disambigua- tion of concepts. Mappings from UMLS to the 2. Fine-tuning our MedCAT model with the col- ICD-92 taxonomy are then used to extract UMLS lected annotations and re-running on the entire concept to ICD codes. Our large pre-trained Med- DD subsection dataset producing updated A NP, CAT UMLS model contains 1.6 million concepts. P A, P NA datasets. ∼ This model cannot be made publicly available due 3. Sampling from P NA and P A and annotating to constraints on the UMLS license, but can be predicted diagnoses to validate correctness of trained in an an unsupervised method in 1 week ∼ the MedCAT predicted codes. on MIMIC-III with standard CPU only hardware3. In an effort to keep our analysis tractable we 4. Exclusion of any codes that fail manual valida- limit our MedCAT model to only extract the 400 tion step as they are not trustworthy predictions ICD-9 codes that occur most frequently in the made by MedCAT. dataset. This equates to 76% (n=48,2379) of to- We use the MedCATTrainer annotator (Searle tal assigned codes (n=634,709). We exclude the et al., 2019) to both collect annotations (stage 1) other 6,441 codes that occur less frequently. 
Future and to validate predictions from MedCAT (stage work could consider including more of these codes. 3). To collect annotations, we manually inspect 10 2https://bioportal.bioontology.org/ontologies/ICD9CM randomly sampled predictions for each of the 400 3https://tinyurl.com/yadtnz3w unique codes from A NP and add further acronyms,

79 and metabolic diseases) and 460-519 (diseases of the respiratory system) we see proportionally fewer manually collected examples despite the high num- ber of occurrence of codes assigned within MIMIC- III. We explain this by the DD subsection lacking appropriate detail to assign the specific code. For example codes under 250.* for diabetes mellitus and the various forms of complications are assigned frequently but often lack the appropriate level of detail specifying the type, control status and the manifestation of complication. Figure 1: The distributions of manually annotated ICD- Using the manual amendments made on the 864 9 codes and the assigned codes in MIMIC-III grouped new annotations, we re-run the MedCAT model by top-level ICD-9 axis. on the entire DD subsection dataset, producing updated P NA, P A and A NP datasets. We ac- knowledge A NP likely still includes cases of ab- abbreviations, synonyms etc for diagnoses if they breviations, synonyms as we only subsampled 10 are present in the DD subsection to improve the documents per code allowing for further improve- underlying MedCAT model. To validate predic- ments to the model. tions from P A and P NA, we use the MedCAT- The MedCAT fine-tuning process was run until Trainer annotator to inspect 10 randomly sampled convergence as measured by precision, recall and predictions for each of the 179 & 182 unique codes F1 achieving scores 0.90, 0.92 and 0.91 respec- respectively found. We mark each prediction made tively on a held out a test-set with train/test splits by MedCAT as correct or incorrect and report re- 80/20. The fine-tuning code is made available4. sults in Section 4.1. Annotations are available upon request given the 4 Results appropriate MIMIC-III licenses. The following section presents the distribution 4.1 P A&P NA Validation of manually collected annotations from sampling We use the MedCATTrainer interface to validate A NP, our validation of updated P A and P NA our MedCAT model predictions in the ‘Predicted, post MedCAT fine-tuning, and the final distribu- Assigned’ (P A) and ‘Predicted, Not Assigned’ tion of codes found in our produced silver standard (P NA) datasets. We sample (a maximum of) 10 dataset. unique predictions for each ICD-code resulting in Adding annotations to selected text spans di- 179 & 182 ICD-9 codes and 1588 & 1580 man- rectly adds the spans to the MedCAT dictionary, ually validated predictions from P A and P NA thereby ensuring further text spans of the same con- respectively. The validation of code assignment is tent are annotated by the model - if the text span performed by a medical informatics PhD student is unique. We collect 864 annotations after review- with no professional clinical coding experience and ing 4000 randomly sampled DD notes from the a qualified clinical coder, marking each term as A NP (Assigned, Not Predicted) dataset. 21.6% correct or incorrect. We achieve good agreement of DDs provide further annotations suggesting that with a Cohen’s Kappa of 0.85 and 0.8 resulting in the majority of missed codes lie outside the DD 95.51% and 87.91% marked correct for P A and subsection, or are incorrectly assigned. P NA respectively. We exclude from further exper- Figure1 shows the distributions of manually iments all codes that fail this validation step as they collected code annotations and the current MIMIC- are not trustworthy predictions made by MedCAT. III set of clinical codes, grouped by their top-level axis as specified by ICD-9-CM hierarchy. 
4.2 Aggregate Assigned Codes & Codes We collect proportionally consistent annotations Silver Standard for most groups, including the 390-459 chapter We proportionally predict 10% (n=42,837) of to- ∼ (Diseases Of The Circulatory System), which is tal assigned codes (n=432,770). We predict 16% ∼ the top occurring group in both scales. However, 4https://github.com/tomolopolis/MIMIC-III-Discharge- for groups such as 240-279 (endocrine, nutritional Diagnosis-Analysis/blob/master/Run MedCAT.ipynb

of total assigned codes (n=258,953) if we only consider the 182 codes that resulted in at least one matched assignment to those present in the MIMIC-III assigned codes.

We label and gather our three datasets into a single table, with an extra column called 'validated', with values: 'yes' for codes that have matched with an assigned code (P_A), 'new code' for newly discovered codes (P_NA), and 'no' for codes that we were not able to validate (A_NP). We have made this silver-standard dataset available alongside our analysis code5.

Figure 2: Left: Counts of admissions and the associated % of characters covered by MedCAT code predictions. Right: Distribution of DD token lengths.
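To make the construction of the combined table concrete, here is a small pandas sketch; the column names and the three input code sets are illustrative stand-ins for the released dataset's actual schema.

import pandas as pd

def build_silver_standard(predicted_assigned, predicted_not_assigned, assigned_not_predicted):
    # Each argument: iterable of (admission_id, icd9_code) pairs for the
    # P_A, P_NA and A_NP datasets described in Section 3.3.
    rows = []
    for hadm_id, code in predicted_assigned:
        rows.append({"hadm_id": hadm_id, "icd9_code": code, "validated": "yes"})
    for hadm_id, code in predicted_not_assigned:
        rows.append({"hadm_id": hadm_id, "icd9_code": code, "validated": "new code"})
    for hadm_id, code in assigned_not_predicted:
        rows.append({"hadm_id": hadm_id, "icd9_code": code, "validated": "no"})
    return pd.DataFrame(rows)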

4.3 Undercoding in MIMIC-III

This work aims to identify inconsistencies and variability in coding accuracy in the current MIMIC-III dataset. Ultimately, to rigorously identify undercoding of clinical text, full double-blind manual coding would be performed. However, as previously discussed, this is prohibitively expensive.

Comparing the codes predicted by MedCAT to the existing assigned codes enables the development of an understanding of specific groups of codes that exhibit possible undercoding. In this section we firstly show the effectiveness of our method in terms of DD subsection prediction coverage. We then present our predicted code distributions against the MIMIC-III assigned codes at the ICD code chapter level, highlighting the most prevalent missing codes and showing correlations between document length and prevalence.

4.3.1 Prediction Coverage

MedCAT provides predictions at the text span level, with only one concept prediction per span. We can therefore calculate the breadth of coverage of our predictions across all DD subsections. Figure 2 shows the proportion of DD subsection text that is included in code predictions. We note the 100% proportion (n=2105) is 75% larger than the next largest, indicating that we are often utilising the entire DD subsection to suggest codes, although the majority of the coverage distribution is around the 40-50% range.

We find a token length distribution of DD subsections with µ = 14.54, σ = 15.9, Med = 10 and IQR = 14, and a code extraction distribution with µ = 3.6, σ = 3.1, Med = 3 and IQR = 4, suggesting the DD subsections are complex and often list multiple conditions, of which we identify, on average, 3 to 4 conditions.

Figure 3: Proportions of matching predictions against total number of assigned codes per admission.

4.3.2 Predicted & Assigned

Figure 3 shows the distributions of the number of assigned codes and the proportion of matches grouped into buckets of 10% intervals. We see a high proportion of matches in assigned codes in the 1-40% range, indicating that although the DD subsection does contribute to the assigned ICD codes, many of the assigned codes are still missed. We exclude the admissions that had 0 matched codes and discuss this result further in Section 4.3.4.

If we order codes by the number of predicted and assigned, we find the three highest occurring codes (4019, 41404, 4280) in MIMIC-III also rank highest in our predictions. However, we note that these three common codes only yield 25-39% of their total assigned occurrence, which could be explained by these chronic conditions not being listed in the DD subsection and referred to elsewhere in the note. If we normalise predictions by their prevalence, we are most successful in matching specific conditions applicable to preterm newborns (7470, 7766), pulmonary embolism (41519) and liver cancer (1550), all of which we match between 69-55% but rank 114-305 in total prevalence. We suggest these diagnoses are either acute, or the primary cause of an ICU admission, so will be specified in the DD subsection.

[5] https://tinyurl.com/u8yae8n
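The character-level coverage statistic reported in Section 4.3.1 can be computed directly from the predicted span offsets; a small self-contained sketch (our own illustration, with toy offsets rather than real model output) follows.

```python
# Sketch: proportion of a DD subsection's characters covered by predicted
# spans, where spans are (start, end) character offsets from the NER+L model.
def span_coverage(text_len: int, spans) -> float:
    covered = set()
    for start, end in spans:
        covered.update(range(start, min(end, text_len)))
    return len(covered) / text_len if text_len else 0.0

print(span_coverage(60, [(0, 24), (30, 45)]))  # -> 0.65
```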

Figure 4: Predicted, Assigned Codes grouped by top-level code group vs total assigned codes.

Figure 5: Predicted, Not Assigned Codes grouped by top-level code group vs total assigned codes.

We also group the predicted codes into their respective top-level ICD-9 groups in Figure 4 and observe that predicted assigned codes display a similar distribution to total assigned codes. We quantify the difference in distributions via the Wasserstein metric or 'Earth Mover's Distance' (Ramdas et al., 2015). This metric provides a single measure to compare the difference in our 3 datasets' distributions when compared with the current assigned code distribution. We compute a small 2.7 × 10⁻³ distance between both distributions, suggesting our method proportionally identifies previously assigned codes from the DD subsection alone.

4.3.3 Predicted & Not Assigned

This dataset highlights codes that may have been missed from the current assigned codes. Figure 5 shows that the distribution of predicted but not assigned codes is minimally different for most codes, supporting our belief that the MIMIC-III assigned codes are not wholly untrustworthy, but are likely under-coded in specific areas.

From this dataset we calculate how many examples of each code have potentially been missed, or potentially under-coded. For the 10 most frequently assigned codes we see 0-35% missing occurrences. We also identify that the most frequent code, 4019 (Unspecified Essential Hypertension), has 16%, or 3312, potentially missing occurrences.

To understand if DD subsection length impacts the occurrence of 'missed' codes, we first calculate a Pearson correlation coefficient of 0.17 for DD subsection line length and counts of assigned codes over all admissions. This suggests a weak positive correlation between admission complexity and the number of existing assigned codes. In contrast, we find a stronger positive correlation of 0.504 between predicted but not assigned codes and DD subsection line length. This implies that where an episode has a greater number of diagnoses, or the complexity of an admission is greater, more codes are likely to be missed during routine collection.

We compute the Wasserstein metric between these two distributions at 1.6 × 10⁻². This demonstrates a degree of similarity between distributions, albeit 8x further than the Predicted and Assigned dataset distance presented in Section 4.3.2. We expect to see a larger distance here as we are detecting codes that are indicated in the text but have been missed during routine code assignment.

4.3.4 Assigned & Not Predicted

We observe that the distribution of assigned and not predicted codes largely mirrors the distribution of total codes assigned in MIMIC-III, with a Wasserstein distance of 2.7 × 10⁻³ that is similar to the distance observed in our Predicted and Assigned (Section 4.3.2) dataset. This suggests that our method is proportionally consistent at not annotating codes that have likely been assigned from elsewhere in the admission, but may also be incorrectly assigned.
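The two statistics used throughout Sections 4.3.2-4.3.4 — a Pearson correlation between DD length and code counts, and a Wasserstein distance between two code-group distributions — can be reproduced with SciPy. The sketch below uses toy values, and the paper does not specify the exact weighting used for the distance, so treat it as one plausible formulation.

```python
# Sketch with toy data: Pearson correlation (DD line length vs. missed codes)
# and the Wasserstein distance between two normalised code-group distributions.
from scipy.stats import pearsonr, wasserstein_distance

dd_line_lengths = [3, 5, 8, 12, 20, 31]
missed_code_counts = [0, 1, 2, 3, 6, 9]
r, p = pearsonr(dd_line_lengths, missed_code_counts)

assigned_dist  = [0.30, 0.25, 0.20, 0.15, 0.10]   # proportion per ICD-9 chapter
predicted_dist = [0.28, 0.27, 0.18, 0.17, 0.10]
d = wasserstein_distance(range(5), range(5), assigned_dist, predicted_dist)

print(round(r, 3), round(d, 4))
```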

5 Discussion

On aggregate, the predicted codes by our MedCAT model suggest that the discharge diagnosis sections listed in 92% of all discharge notifications are not sufficient for full coding of an episode. Unsurprisingly, this confirms that clinicians infrequently document all codeable diagnoses within the discharge summary. Although, as previously stated, coders are not permitted to make clinical inferences. Therefore, to correctly assign a code, the diagnoses must be present within the documented patient episode, within the structured or unstructured data.

However, the positive correlation between document length and number of predicted codes indicates that missed codes are more prevalent in highly complex cases with many diagnoses. From a coding workflow perspective, coders operate under strict time schedules and are required to code a minimum number of episodes each day. Therefore, it logically follows that the complexity of a case directly correlates to the number of codes missed during routine collection.

Looking at individual code groups, we find 240-279 is not predicted proportionally with assigned codes both in P_A and P_NA. We explain this as follows. Firstly, DD subsections generally convey clinically important diagnoses for follow-up care. Certain codes such as (250.*) describe diabetes mellitus with specific complications, but the DD subsection will often only describe the diagnoses 'DMI' or 'DMII'. Secondly, ICU admissions are for individuals with severe illness and therefore are likely to have a high degree of co-morbidity. This is implied by the majority of patients (74%) being assigned between 4 and 16 codes.

We also observe E000-E999 and V01-V99 codes are disproportionately not predicted. However, this is expected given that both groups are supplementary codes that describe conditions or factors that contribute to an admission but would likely not be relevant for the DD subsection.

In contrast, we observe a disproportionately large number of predictions for 001-139 (Infectious and Parasitic Diseases). This is primarily driven by 0389 (Unspecified septicemia). A proportion of these predictions may be in error, as the specific form of septicemia is likely described in more detail elsewhere in the note and therefore coded as the more specific form.

5.1 Method Reproducibility & Wider Utility

Inline with the suggestions of Johnson et al. (2017), the original authors of MIMIC-III, we have attempted to provide the research community all available materials to reproduce and build upon our experiments and method for the development of silver standard datasets. Specifically, we have made the following available as open-source: the SQL scripts to extract the raw data from a replica of the MIMIC-III database, the script required to parse DD subsections, an example script to build a pre-trained MedCAT model, the script required to run MedCAT on the DD subsections, load into the annotator and finally re-run MedCAT and perform experimental analysis alongside outputting the silver standard dataset[6].

[6] https://github.com/tomolopolis/MIMIC-III-Discharge-Diagnosis-Analysis

Given these materials it is possible for researchers to replicate and build upon our method, or directly use the silver standard dataset in future work that investigates automated clinical coding using MIMIC-III. The silver standard dataset clearly marks if each assigned code has been validated or not, or if it is a new code according to our method.

6 Conclusions & Future Work

This work highlighted specific problems with using MIMIC-III as a dataset for training and testing an automated clinical coding system that would limit model performance within a real deployment.

We identified and deterministically extracted the discharge diagnosis (DD) subsections from discharge summaries. We subsequently trained an NER+L model (MedCAT) to extract ICD-9 codes from the DD subsections, comparing the results across the full set of assigned codes. We find our method covers 47% of all tokens, considering we only take ~400 of the 7k unique codes and perform minimal data cleaning of the DD subsection. We have shown in Sections 4.3.2 and 4.3.3 that the MedCAT predicted codes are proportionally in line with assigned codes in MIMIC-III.

Interestingly, we found a 0.504 positive correlation between DD length and the number of codes predicted by MedCAT but not assigned in MIMIC-III. This result can be understood by observing that the ICU admissions in MIMIC-III can be extremely complex, with up to 30 clinical codes assigned to a single episode. The DD subsections alone can contain up to 50 line items, indicating highly complex cases where codes could easily be missed.

We found that the code group 390-459 (Diseases of the Circulatory System) is both the most assigned group and the group of codes where there are the most missing predictions from our model.
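The per-chapter comparisons above rely on mapping individual ICD-9-CM codes (as stored in MIMIC-III, e.g. "4019") to their top-level chapter ranges. A minimal sketch of such a mapping is shown below; the range table is deliberately abbreviated and not the authors' exact lookup.

```python
# Sketch: map ICD-9-CM codes to top-level chapter ranges so predicted and
# assigned codes can be compared per group. Abbreviated range table; the
# supplementary E/V codes are special-cased.
ICD9_CHAPTERS = [
    (1, 139, "001-139 Infectious and Parasitic Diseases"),
    (240, 279, "240-279 Endocrine, Nutritional and Metabolic Diseases"),
    (390, 459, "390-459 Diseases of the Circulatory System"),
    (460, 519, "460-519 Diseases of the Respiratory System"),
]

def icd9_chapter(code: str) -> str:
    if code.startswith(("E", "V")):
        return "Supplementary (E/V)"
    prefix = int(code[:3])
    for lo, hi, name in ICD9_CHAPTERS:
        if lo <= prefix <= hi:
            return name
    return "Other"

print(icd9_chapter("4019"), "|", icd9_chapter("V4581"))
```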

Furthermore, codes such as Hypertension (4019), Sepsis and Septicemia (0389, 99591), Gastrointestinal hemorrhage (95789), Chronic Kidney disease (5859), anemia (2859) and Chronic obstructive asthma (49320) are all frequently assigned but are also the highest occurring conditions that appear in the DD diagnosis subsection yet are not assigned in the MIMIC-III dataset. This suggests that MIMIC-III exhibits specific cases of undercoding, especially with codes that are frequently occurring in patients but are not likely to be the primary diagnosis for an admission to the ICU.

As we only use the DD section, there are many codes which likely appear elsewhere in the note that we cannot assign. Although 92% of discharge summaries contain DD subsections, we only match ~16% of assigned codes. We suggest this is due to: our NER+L model lacking the ability to identify more synonyms and abbreviations for conditions, the DD subsections lacking enough detail to assign codes, and on some occasions little evidence to suggest a code assignment. Our textual span coverage, presented in Section 4.3.1, demonstrates that we often cover all available discharge diagnoses, although there is still room for improvement as the majority of the coverage distribution is around the 50% mark.

For future work we foresee applying the same method to either the entire discharge summary or more specific sections such as 'previous medical history' to surface chronic codeable diagnoses that could be validated against the current assigned code set. Researchers would however likely need to address false positive code predictions, as clinical coding requires assigned codes to be from current conditions associated with an admission.

In conclusion, this work has found that frequently assigned codes in MIMIC-III display signs of undercoding up to 35% for some codes. With this finding we urge researchers to continue to develop automated clinical coding systems using MIMIC-III, but to also consider using our silver standard dataset or building on our method to further improve the dataset.

Acknowledgments

RD's work is supported by 1. National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London. 2. Health Data Research UK, which is funded by the UK Medical Research Council, Engineering and Physical Sciences Research Council, Economic and Social Research Council, Department of Health and Social Care (England), Chief Scientist Office of the Scottish Government Health and Social Care Directorates, Health and Social Care Research and Development Division (Welsh Government), Public Health Agency (Northern Ireland), British Heart Foundation and Wellcome Trust. 3. The National Institute for Health Research University College London Hospitals Biomedical Research Centre. This paper represents independent research part funded by the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London. The views expressed are those of the author(s) and not necessarily those of the NHS, MRC, NIHR or the Department of Health and Social Care.

References

Alan R Aronson, Olivier Bodenreider, Dina Demner-Fushman, Kin Wah Fung, Vivian K Lee, James G Mork, Aurélie Névéol, Lee Peters, and Willie J Rogers. 2007. From indexing the biomedical literature to coding clinical text: Experience with MTI and machine learning approaches. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, BioNLP '07, pages 105–112, Stroudsburg, PA, USA. Association for Computational Linguistics.

Sandeep Ayyar, O B Don, and W Iv. 2016. Tagging patient notes with ICD-9 codes. In Proceedings of the 29th Conference on Neural Information Processing Systems.

Tal Baumel, Jumana Nassour-Kassis, Raphael Cohen, Michael Elhadad, and Noémie Elhadad. 2018. Multi-label classification of patient notes: Case study on ICD code assignment. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence.

Olivier Bodenreider. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res., 32(Database issue):D267–70.

Sue Bowman. 2013. Impact of electronic health record systems on information integrity: quality and safety implications. Perspect. Health Inf. Manag., 10:1c.

E M Burns, E Rigby, R Mamidanna, A Bottle, P Aylin, P Ziprin, and O D Faiz. 2012. Systematic review of discharge coding accuracy.

CHKS Ltd. 2014. CHKS - insight for better healthcare: The quality of clinical coding in the NHS. https://bit.ly/34eU5g3. Accessed: 2019-5-10.

Matus Falis, Maciej Pajak, Aneta Lisowska, Patrick Schrempf, Lucas Deckers, Shadia Mikhael, Sotirios Tsaftaris, and Alison O'Neil. 2019. Ontological attention ensembles for capturing semantic concepts in ICD code prediction from clinical text.

Richárd Farkas and György Szarvas. 2008. Automatic construction of rule-based ICD-9-CM coding systems. BMC Bioinformatics, 9 Suppl 3:S10.

Toni Henderson, Jennie Shepheard, and Vijaya Sundararajan. 2006. Quality of diagnosis and procedure coding in ICD-10 administrative data. Med. Care, 44(11):1011–1019.

Jinmiao Huang, Cesar Osorio, and Luke Wicent Sy. 2018. An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes.

Alistair E W Johnson, Tom J Pollard, and Roger G Mark. 2017. Reproducibility in critical care: a mortality prediction case study. In Proceedings of the 2nd Machine Learning for Healthcare Conference, volume 68 of Proceedings of Machine Learning Research, pages 361–376, Boston, Massachusetts. PMLR.

Alistair E W Johnson, Tom J Pollard, Lu Shen, Li-Wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Sci Data, 3:160035.

Ramakanth Kavuluru, Anthony Rios, and Yuan Lu. 2015. An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records. Artif. Intell. Med., 65(2):155–166.

Rae A Kokotailo and Michael D Hill. 2005. Coding of stroke and stroke risk factors using international classification of diseases, revisions 9 and 10. Stroke, 36(8):1776–1781.

Zeljko Kraljevic, Daniel Bean, Aurelie Mascio, Lukasz Roguski, Amos Folarin, Angus Roberts, Rebecca Bendayan, and Richard Dobson. 2019. MedCAT – medical concept annotation tool.

Leah S Larkey and W Bruce Croft. 1996. Combining classifiers in text categorization. In SIGIR, volume 96, pages 289–297.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C J C Burges, L Bottou, M Welling, Z Ghahramani, and K Q Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. 2018. Explainable prediction of medical codes from clinical text.

Adler Perotte, Rimma Pivovarov, Karthik Natarajan, Nicole Weiskopf, Frank Wood, and Noémie Elhadad. 2014. Diagnosis code assignment: models and evaluation metrics. J. Am. Med. Inform. Assoc., 21(2):231–237.

Aaditya Ramdas, Nicolas Garcia, and Marco Cuturi. 2015. On Wasserstein two sample testing and related families of nonparametric tests.

Mohammed Saeed, Mauricio Villarroel, Andrew T Reisner, Gari Clifford, Li-Wei Lehman, George Moody, Thomas Heldt, Tin H Kyaw, Benjamin Moody, and Roger G Mark. 2011. Multiparameter intelligent monitoring in intensive care II: a public-access intensive care unit database. Crit. Care Med., 39(5):952–960.

Thomas Searle, Zeljko Kraljevic, Rebecca Bendayan, Daniel Bean, and Richard Dobson. 2019. MedCATTrainer: A biomedical free text annotation interface with active learning and research use case specific customisation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 139–144, Stroudsburg, PA, USA. Association for Computational Linguistics.

World Health Organisation. 2011. International Statistical Classification of Diseases and Related Health Problems.

Pengtao Xie and Eric Xing. 2018. A neural architecture for automated ICD coding. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1066–1076.

Rui Zhang, Serguei V S Pakhomov, Elliot G Arsoniadis, Janet T Lee, Yan Wang, and Genevieve B Melton. 2017. Detecting clinically relevant new information in clinical notes across specialties and settings. BMC Med. Inform. Decis. Mak., 17(Suppl 2):68.

Comparative Analysis of Text Classification Approaches in Electronic Health Records

Aurelie Mascio∗  Zeljko Kraljevic∗  Daniel Bean  Richard Dobson  Robert Stewart  Rebecca Bendayan  Angus Roberts
Department of Biostatistics & Health Informatics, King's College London, UK
[email protected]  [email protected]

Abstract

Text classification tasks which aim at harvesting and/or organizing information from electronic health records are pivotal to support clinical and translational research. However these present specific challenges compared to other classification tasks, notably due to the particular nature of the medical lexicon and language used in clinical records. Recent advances in embedding methods have shown promising results for several clinical tasks, yet there is no exhaustive comparison of such approaches with other commonly used word representations and classification models. In this work, we analyse the impact of various word representations, text pre-processing and classification algorithms on the performance of four different text classification tasks. The results show that traditional approaches, when tailored to the specific language and structure of the text inherent to the classification task, can achieve or exceed the performance of more recent ones based on contextual embeddings such as BERT.

1 Introduction

Clinical text classification is an important task in natural language processing (NLP) (Yao et al., 2019), where it is critical to harvest data from electronic health records (EHRs) and facilitate its use for decision support and translational research. Thus, it is increasingly used to retrieve and organize information from the unstructured portions of EHRs (Mujtaba et al., 2019). Examples include tasks such as: (1) detection of smoking status (Uzuner et al., 2008); (2) classification of medical concept mentions into family versus patient related (Dai, 2019); (3) obesity classification from free text (Uzuner, 2009); (4) identification of patients for clinical trials (Meystre et al., 2019).

∗ These two authors contributed equally.

Most of these tasks involve mapping mentions in narrative texts (e.g. "pneumonia") to their corresponding medical concepts (and concept ID), generally using the Unified Medical Language System (UMLS) (Bodenreider, 2004), and then training a classifier to identify these correctly (e.g. "pneumonia positive" versus "pneumonia negative") (Yao et al., 2019).

Text classification performed on medical records presents specific challenges compared to the general domain (such as newspaper texts), including dataset imbalance, misspellings, abbreviations or semantic ambiguity (Mujtaba et al., 2019). Despite recent advances in NLP, including neural-network based word representations such as BERT (Devlin et al., 2019), few approaches have been extensively tested in the medical domain and rule-based algorithms remain prevalent (Koleck et al., 2019). Furthermore, there is no consensus on which word representation is best suited to specific downstream classification tasks (Si et al., 2019; Wang et al., 2018).

The purpose of this study is to analyse the impact of numerous word representation methods (bag-of-words versus traditional and contextual word embeddings) as well as classification approaches (deep learning versus traditional machine learning methods) on the performance of four different text classification tasks. To our knowledge this is the first paper to test a comprehensive range of word representation, text pre-processing and classification method combinations on several medical text tasks.


2 Materials & Methods

2.1 Datasets and text classification tasks

In order to conduct our analysis we derived text classification tasks from MIMIC-III (Multiparameter Intelligent Monitoring in Intensive Care) (Johnson et al., 2016), and the Shared Annotated Resources (ShARe)/CLEF dataset (Mowery et al., 2014). These datasets are commonly used for challenges in medical text mining and act as benchmarks for evaluating machine learning models (Purushotham et al., 2018).

MIMIC-III dataset  MIMIC-III (Johnson et al., 2016) is an openly available dataset developed by the MIT Lab for Computational Physiology. It comprises clinical notes, demographics, vital signs, laboratory tests and other data associated with 40,000 critical care patients.

We used MedCAT (Kraljevic et al., 2019) to prepare the dataset and annotate a sample of clinical notes from MIMIC-III with UMLS concepts (Bodenreider, 2004). We selected the concepts with the UMLS semantic type Disease or Syndrome (corresponding to T047), out of which we picked the 100 most frequent Concept Unique Identifiers (CUIs, allowing us to group mentions with the same meaning). For each concept we then randomly sampled 4 documents containing a mention of each concept, resulting in 400 documents with 2367 annotations in total. The 100 most frequent concepts in these documents were manually annotated (and manually corrected in case of disagreement) for two text classification tasks:

• Status (affirmed/other, indicating if the disease is affirmed or negated/hypothetical);
• Temporality (current/other, indicating if the disease is current or past).

Such contextual properties are often critical in the medical domain in order to extract valuable information, as evidenced by the popularity of algorithms like ConText or NegEx (Harkema et al., 2009; Chapman et al., 2001).

Annotations were performed by two annotators, achieving an overall inter-annotator agreement above 90%. These annotations will be made publicly available.

ShARe/CLEF (MIMIC-II) dataset  The ShARe/CLEF annotated dataset proposed by Mowery et al. (2014) is based on 433 clinical records from the MIMIC-II database (Saeed et al., 2002). It was generated for community distribution as part of the Shared Annotated Resources (ShARe) project (Elhadad et al., 2013), and contains annotations including disorder mention spans, with several contextual attributes. For our analysis we derived two tasks from this dataset, focusing on two attributes, comprising 8075 annotations for each:

• Negation (yes/no, indicating if the disorder is negated or affirmed);
• Uncertainty (yes/no, indicating if the disorder is hypothetical or affirmed).

Text classification tasks  For both annotated datasets, we extracted from each document the portions of text containing a mention of the concepts of interest, keeping 15 words on each side of the mention (including line breaks). Each task is then made up of sequences comprising around 31 words, centered on the mention of interest, with its corresponding meta-annotation (status, temporality, negation, uncertainty), making up four text classification tasks, denoted:

• MIMIC|Status;
• MIMIC|Temporality;
• ShARe|Negation;
• ShARe|Uncertainty.

Table 1 summarizes the class distribution for each task.

Task                                        Class 1       Class 2       Total
MIMIC|Status (1: affirmed, 2: other)        1586 (67%)    781 (33%)     2367
MIMIC|Temporality (1: current, 2: other)    2026 (86%)    341 (14%)     2367
ShARe|Negation (1: yes, 2: no)              1470 (18%)    6605 (82%)    8075
ShARe|Uncertainty (1: yes, 2: no)           729 (9%)      7346 (91%)    8075

Table 1: Class distribution
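Building one classification example per mention therefore amounts to keeping a fixed window of tokens around the concept of interest. The snippet below is our own minimal illustration of that windowing step (whitespace tokenisation only), not the authors' pipeline, which tokenises with SciSpaCy.

```python
# Sketch: keep `size` tokens on each side of a concept mention, producing the
# ~31-word sequences described above.
def context_window(tokens, mention_idx, size=15):
    left = max(0, mention_idx - size)
    return tokens[left:mention_idx + size + 1]

note = "patient denies chest pain but reports pneumonia last year".split()
print(context_window(note, note.index("pneumonia"), size=3))
```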

Figure 1: Main workflow

2.2 Evaluation steps and main workflow

We used the four different text classification tasks described in Section 2.1 in order to explore various combinations of word representation models (see Section 2.3), text pre-processing and tokenizing variations (Section 2.4) and classification algorithms (Section 2.5). In order to evaluate the different approaches we followed the steps detailed in Table 2 and Figure 1 for all four classification tasks. For each step we measured the impact by evaluating the best possible combination, based on the average F1 score (weighted average score derived from 10-fold cross validation results).

Step  Description                                                                               Outcome (best F1)
A     Run all bag-of-word and traditional embeddings + classification algorithms and select     A-1
      the best combination (using baseline methods for text pre-processing and tokenization)
B     Using A-1 as the new baseline model, test different pre-processing methods                B-1
      (lowercasing, punctuation removal, lemmatization, stemming)
C     Using B-1 as the new baseline model, compare various tokenizers (word and subword level)  C-1
D     Test contextual embedding approaches: BERT (base, uncased) and BioBERT                    D-1

Table 2: Evaluation steps

2.3 Word representation models

Word embeddings, as opposed to bag-of-words (BoW), present the advantage of capturing semantic and syntactic meaning by representing words as real-valued vectors in a dimensional space (vectors that are close in that space will represent similar words). Contextual embeddings go one step further by capturing the context surrounding the word, whilst traditional embeddings assign a single representation to a given word.

For our analysis we considered four off-the-shelf embedding models, pre-trained on public domain data, and compared them to the same embedding models trained on biomedical corpora, as well as a BoW representation.

For the traditional embeddings we chose three

commonly used algorithms, namely Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and FastText (Bojanowski et al., 2017). We used publicly available models pre-trained on Wikipedia and Google News for all three (Yamada et al., 2018).

To obtain medical-specific models we trained all three on MIMIC-III clinical notes (covering 53,423 intensive care unit stays, including those used in the classification tasks) (Johnson et al., 2016). The following hyperparameters, aligned to the off-the-shelf pre-trained models, were used: dimension of 300, window size of 10, minimum word count of 5, uncased, punctuation removed.

For the contextual embeddings we used BERT base (Devlin et al., 2019) and BioBERT (Lee et al., 2019), which are pre-trained respectively on general domain corpora and biomedical literature (PubMed abstracts and PMC articles).

Finally we used a BoW representation as a baseline approach.

2.4 Text pre-processing and tokenizers

In addition to pre-training several embedding models, we tested two different text tokenization methods, using the following types of tokenizers: (1) SciSpaCy (Neumann et al., 2019), a traditional tokenizer based on word detection; and (2) byte-pair-encoding (BPE) adapted to word segmentation that works on the subword level (Gage, 1994; Sennrich et al., 2016).

For the word level tokenizer we chose SciSpaCy as it is specifically aimed at biomedical and scientific text processing. We further tested additional text pre-processing: lowercasing, punctuation removal, stopwords removal, stemming and lemmatization.

For the subword BPE tokenizer we used byte level byte-pair-encoding (BBPE) (Wang et al., 2019; Wolf et al., 2020). In this case the only pre-processing performed is lowercasing, whilst everything else including line breaks and spaces is left as is. This approach allows us to limit the vocabulary size and is especially useful in the medical domain where a large number of words are very rare. We limited the number of words to 30522, a standard vocabulary size also used in BERT (Devlin et al., 2019).

2.5 Text classification algorithms

On all four classification tasks, we tested various machine learning algorithms which are widely used for clinical data mining tasks and achieve state-of-the-art performance (Yao et al., 2019), namely artificial neural network (ANN), convolutional neural network (CNN), recurrent neural network (RNN), bi-directional long short term memory (Bi-LSTM), and BERT (Devlin et al., 2019; Wolf et al., 2020). We compared these with a statistics-based approach as a baseline, using a Support Vector Machine (SVM) classifier, a popular method used for classification tasks (Cortes and Vapnik, 1995).

For Bi-LSTM and RNN, we tested both a standard approach and one that is configured to simulate attention on the medical entity of interest. This custom approach consisted in taking the representation of the network at the position of the entity of interest, which in most cases corresponds to the center of each sequence. We refer to this latter approach as custom Bi-LSTM and custom RNN.

For ANN and statistics-based models, which are limited by the size of the dataset and embeddings (300 dimensions x 31 words x 2300 or 8000 sequences), we chose to represent sequences by averaging the embeddings of the words composing each sequence. This representation method is commonly used and has proven efficient for various NLP applications (Kenter et al., 2016).

Furthermore, each of these models was tested using different sets of parameters (e.g. varying the support function, dropout, optimizer, as reported in Table 3); the ones producing the best performance were selected for further testing and are summarized in Table 3.

Parameter                      SVM                            ANN              CNN                       RNN / Bi-LSTM
Kernel / activation function   Radial basis, Linear or Poly   ReLU + sigmoid   ReLU (with max pooling)   Sigmoid
Layers                         N/A                            2                3                         2
Filters                        N/A                            N/A              128                       N/A
Hidden units dimensions        N/A                            100              N/A                       300
Dropout                        N/A                            0.5 / 0          0.5 / 0                   0.5 / 0
Optimizer                      N/A                            Adam / Stochastic Gradient Descent         (same options for ANN, CNN, RNN/Bi-LSTM)
Learning rate                  N/A                            0.001            0.001                     0.001
Epochs                         tolerance: 0.001               5000             200                       50

Table 3: Classifiers and corresponding parameters evaluated. Parameters highlighted in bold were the ones selected based on performance.
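The averaged-embedding sequence representation used for the SVM and ANN models can be written in a few lines; the sketch below is our own illustration with a toy vector lookup standing in for the trained 300-dimensional embeddings.

```python
# Sketch: represent a ~31-word window as the mean of its word vectors, as
# done for the SVM and ANN models described above. `vectors` is a toy lookup.
import numpy as np

vectors = {"chest": np.full(300, 0.1), "pain": np.full(300, 0.3)}

def average_embedding(tokens, vectors, dim=300):
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

print(average_embedding(["chest", "pain", "unknownword"], vectors).shape)  # (300,)
```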

3 Results

3.1 Performance comparison for all embedding and algorithm combinations (steps A & D)

In this section we compare the performance of the different embeddings and classification approaches. We report the weighted average F1/precision/recall (weighted average value obtained from the 10-fold cross-validation results) for selected combinations on the four classification tasks in Tables 4 and 5 (full results in Appendix A.1).

For all word embedding methods tested (Word2Vec, GloVe, FastText), the ones trained on biomedical data show the best performance (see Table 4).

For classification algorithms, the best performance is obtained when using the custom Bi-LSTM model configured to target the biomedical concept of interest (see Table 5). Both contextual embeddings (BERT and BioBERT), whether trained on biomedical or general corpora, outperform any other combination of embedding/classification algorithm tested, and give results very close to the customized Bi-LSTM, as shown in Table 5.

This indicates that for tasks incorporating information about the position of the entity of interest in the text (e.g. ShARe, which reports disorder mention span offsets), the custom Bi-LSTM approach performs better than BioBERT, without necessitating any text pre-processing. On the other hand, when looking at pure text classification, BioBERT shows better performance than a Bi-LSTM approach, and consequently may be preferred for tasks where the sequence of interest is not easily centered on a specific entity.

Finally, whilst the performance of BERT and BioBERT is relatively similar, BioBERT converges faster across all tasks tested.

3.2 Impact of text pre-processing (step B)

In addition to exploring various embeddings, we tested the impact of text pre-processing on classification task performance. In order to do so, we selected the best performing word embedding obtained in the previous step (Word2Vec trained on MIMIC-III, using the SciSpaCy tokenizer), and compared performances between all text cleaning variations (lowercasing, punctuation removal, stemming, lemmatization).

For each variant investigated, the same pre-processing settings were applied to prepare the annotated corpus as well as the entire MIMIC-III dataset, which was then used to re-train Word2Vec. This ensured the same vocabulary was used across the embedding and the sequences to classify for each experiment.

The results, summarized in Table 6, suggest that text pre-processing has a minor impact for all classification algorithms tested. Notably, stemming and lemmatization have a slightly negative impact on performance.

3.3 Impact of tokenizers (step C)

We tested the impact of tokenization on the performance of text classification tasks, focusing on the SciSpaCy and BBPE tokenizers, as they allow us to compare whole word versus subword unit methods. The results for the MIMIC|Status task (and using Word2Vec trained on MIMIC-III) are shown in Table 7, and indicate that the performances are roughly similar when using the BBPE tokenizer compared to SciSpaCy.

Furthermore we compared both approaches in terms of speed and vocabulary size. Tokenizing text took on average 2.5 times longer with SciSpaCy (250 seconds to tokenize 100,000 medical notes for SciSpaCy versus 99 seconds for BBPE, excluding model loading time). For the models trained on the MIMIC-III corpus, SciSpaCy comprised 474,145 words, and BBPE 29,452 subword units.

3.4 Embeddings analysis: word similarities comparison

Finally, in order to analyse the differences between embeddings trained on general and medical corpora, we compared the semantic information captured by Word2Vec (using the SciSpaCy tokenizer and without any preliminary text pre-processing). Table 8 explores word similarities by showing the top ten similar words for medical ("cancer") and non-medical ("concentration" and "attention") terms.

Notably, it highlights the numerous misspellings, abbreviations and domain-specific meanings contained in the medical lexicon, suggesting that general corpora such as Wikipedia may not be appropriate when working on data from medical records (and by implication, for other specific domains).
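A byte-level BPE tokenizer with the ~30K vocabulary mentioned above can be trained with the Hugging Face `tokenizers` package; the sketch below is an assumption-laden illustration (toy corpus, current library API) rather than the paper's exact setup.

```python
# Sketch: train a byte-level BPE (BBPE) tokenizer on a tiny note corpus and
# encode a sequence. Corpus, vocab size usage and API version are assumptions.
from tokenizers import ByteLevelBPETokenizer

notes = ["Patient admitted with CAP and c/o SOB.",
         "No evidence of pneumonia on chest x-ray."]

tok = ByteLevelBPETokenizer(lowercase=True)
tok.train_from_iterator(notes, vocab_size=30522, min_frequency=1)
print(tok.encode("pneumonia negative").tokens)
```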

F1-score (average from 10-fold cross validation)

Model             Tokenizer   Embedding        MIMIC|Status   MIMIC|Temporality   ShARe|Negation   ShARe|Uncertainty
Bi-LSTM (custom)  SciSpacy    Wiki|Word2Vec    92.8%          97.3%               98.4%            96.7%
Bi-LSTM (custom)  SciSpacy    Wiki|GloVe       93.4%          97.2%               98.4%            97.2%
Bi-LSTM (custom)  SciSpacy    Wiki|FastText    93.6%          96.9%               98.6%            96.4%
Bi-LSTM (custom)  SciSpacy    MIMIC|Word2Vec   94.5%          97.9%               98.7%            97.3%
Bi-LSTM (custom)  SciSpacy    MIMIC|GloVe      93.9%          97.9%               98.7%            96.9%
Bi-LSTM (custom)  SciSpacy    MIMIC|FastText   93.7%          97.6%               98.5%            97.2%
BERT              WordPiece   BERTbase         91.5%          97.3%               98.2%            93.6%
BioBERT           WordPiece   BioBERT          93.4%          97.3%               98.5%            94.2%
SVM               SciSpacy    Wiki|Word2Vec    76.9%          94.8%               88.5%            85.9%
SVM               SciSpacy    Wiki|GloVe       78.6%          94.9%               88.8%            87.1%
SVM               SciSpacy    Wiki|FastText    78.1%          94.4%               88.7%            86.3%
SVM               SciSpacy    BoW              82.7%          96.0%               90.2%            91.7%
SVM               SciSpacy    MIMIC|Word2Vec   80.6%          95.1%               89.8%            90.2%
SVM               SciSpacy    MIMIC|GloVe      79.1%          94.1%               89.4%            87.6%
SVM               SciSpacy    MIMIC|FastText   79.6%          93.7%               88.9%            88.0%

Table 4: Comparison of embeddings (steps A & D)

F1-score (average from 10-fold cross validation)

Model             Tokenizer   Embedding        MIMIC|Status   MIMIC|Temporality   ShARe|Negation   ShARe|Uncertainty
Bi-LSTM           SciSpacy    MIMIC|Word2Vec   88.4%          97.1%               96.2%            94.1%
Bi-LSTM (custom)  SciSpacy    MIMIC|Word2Vec   94.5%          97.9%               98.7%            97.3%
BERT              WordPiece   BERTbase         91.5%          97.3%               98.2%            93.6%
BioBERT           WordPiece   BioBERT          93.4%          97.3%               98.5%            94.2%
ANN               SciSpacy    MIMIC|Word2Vec   80.9%          96.5%               88.6%            86.7%
CNN               SciSpacy    MIMIC|Word2Vec   84.6%          97.3%               92.0%            87.5%
RNN               SciSpacy    MIMIC|Word2Vec   77%            96.8%               94.0%            87.1%
RNN (custom)      SciSpacy    MIMIC|Word2Vec   89.5%          96.7%               97.9%            96.5%
SVM               SciSpacy    MIMIC|Word2Vec   80.6%          95.1%               89.8%            90.2%
ANN               SciSpacy    BoW              79.8%          94.8%               89.3%            89.3%
SVM               SciSpacy    BoW              82.7%          96%                 90.2%            91.7%

Table 5: Comparison of classification algorithms (steps A & D)
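The custom Bi-LSTM that leads Table 5 reads the recurrent state at the position of the entity of interest rather than pooling the whole sequence. The following PyTorch snippet is our own minimal sketch of that idea, with illustrative shapes, and is not the authors' implementation.

```python
# Sketch (PyTorch): Bi-LSTM classifier that uses the hidden state at the
# entity position of each ~31-token window. Shapes/sizes are illustrative.
import torch
import torch.nn as nn

class EntityBiLSTM(nn.Module):
    def __init__(self, emb_dim=300, hidden=300, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, embedded, entity_pos):
        # embedded: (batch, seq_len, emb_dim); entity_pos: (batch,)
        states, _ = self.lstm(embedded)                     # (batch, seq, 2*hidden)
        idx = entity_pos.view(-1, 1, 1).expand(-1, 1, states.size(-1))
        entity_state = states.gather(1, idx).squeeze(1)     # (batch, 2*hidden)
        return self.out(entity_state)

model = EntityBiLSTM()
logits = model(torch.randn(4, 31, 300), torch.tensor([15, 15, 10, 20]))
print(logits.shape)  # torch.Size([4, 2])
```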

F1-score (average from 10-fold cross validation)

Task          Embedding        Text pre-processing            SVM     ANN     RNN     RNN (custom)   CNN     Bi-LSTM (custom)
MIMIC|Status  MIMIC|Word2Vec   Lowercase (L)                  80.6%   80.9%   77.0%   89.5%          84.6%   94.5%
MIMIC|Status  MIMIC|Word2Vec   L + punctuation removal (LP)   80.1%   80.0%   80.2%   86.1%          84.7%   94.4%
MIMIC|Status  MIMIC|Word2Vec   LP + lemmatizing               80.6%   79.6%   78.0%   86.3%          83.8%   94.1%
MIMIC|Status  MIMIC|Word2Vec   LP + stemming                  80.4%   79.7%   79.4%   86.1%          84.1%   94.1%

Table 6: Comparison of text pre-processing methods (step B)

F1-score (average from 10-fold cross validation)

Task          Embedding        Tokenizer   SVM     ANN     RNN     RNN (custom)   CNN     Bi-LSTM (custom)
MIMIC|Status  MIMIC|Word2Vec   SciSpacy    80.6%   80.9%   77.0%   89.5%          84.6%   94.5%
MIMIC|Status  MIMIC|Word2Vec   BBPE        78.8%   80.5%   76.5%   86.0%          84.3%   94.7%

Table 7: Comparison of tokenizing methods (step C)

cancer (Medical)           cancer (General)              concentration (Medical)   concentration (General)   attention (Medical)    attention (General)
ca (0.78)                  prostate (0.85)               hmf (0.51)                concentrations (0.71)      paid (0.43)            attentions (0.65)
carcinoma (0.78)           colorectal (0.82)             concentrations (0.49)     arbeitsdorf (0.67)         approximation (0.34)   notoriety (0.63)
cancer- (0.75)             melanoma (0.8)                formula (0.47)            vulkanwerft (0.65)         followup (0.33)        attracted (0.63)
caner (0.71)               pancreatic (0.8)              mct (0.47)                sophienwalde (0.64)        proximity (0.32)       criticism (0.63)
adenocarcinoma (0.71)      leukemia (0.79)               polycose (0.47)           lagerbordell (0.64)        short-term (0.31)      publicity (0.57)
ca- (0.64)                 entity/breast cancer (0.79)   virtue (0.45)             sterntal (0.64)            mangagement (0.31)     praise (0.57)
melanoma (0.64)            leukaemia (0.78)              corn (0.45)               dürrgoy (0.62)             atetntion (0.31)       aroused (0.56)
cancer;dehydration (0.63)  tumour (0.77)                 dosage (0.44)             straflager (0.61)          attnetion (0.3)        acclaim (0.55)
cancer/sda (0.61)          cancers (0.76)                planimetry (0.44)         maidanek (0.61)            atention (0.3)         interest (0.55)
rcc (0.61)                 ovarian (0.75)                equation (0.44)           szebnie (0.61)             non-rotated (0.3)      admiration (0.55)

Table 8: Comparison of word similarities between general and domain-specific embeddings
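The kind of neighbour comparison in Table 8 can be reproduced with gensim's `most_similar`. The toy example below trains two tiny Word2Vec models only to illustrate the mechanics; the real comparison uses Wikipedia vs. MIMIC-III models trained with the Section 2.3 hyperparameters (300 dimensions, window 10, minimum count 5; gensim 4's `vector_size` argument is assumed).

```python
# Toy illustration: nearest neighbours of the same term under two corpora.
from gensim.models import Word2Vec

general = [["the", "cancer", "charity", "raised", "funds"],
           ["prostate", "cancer", "research", "centre"]] * 50
medical = [["metastatic", "cancer", "ca", "noted", "on", "ct"],
           ["adenocarcinoma", "cancer", "of", "the", "colon"]] * 50

for name, corpus in [("general", general), ("medical", medical)]:
    model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, seed=1)
    print(name, model.wv.most_similar("cancer", topn=3))
```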

4 Discussion

This study compared the impact of various embedding and classification methods on four different text classification tasks. Notably we investigated the impact of pre-training embedding models on clinical corpora versus off-the-shelf models trained on general corpora.

The results suggest that using embeddings pre-trained for the specific task (clinical corpora in our case) leads to better performance with any classification algorithm tested. However, pre-training such embeddings is not necessarily feasible due to either data or technical constraints. In this case our results highlight that using off-the-shelf embeddings trained on large general corpora such as Wikipedia still produces acceptable performance. In particular BERTbase outperformed most algorithms tested, even when these were combined with clinical embeddings.

Additionally, BioBERT was not pre-trained on medical notes but on texts from a related domain (biomedical articles and abstracts as opposed to clinical records), and therefore excludes specificities inherent to the medical domain such as misspellings or technical jargon. Despite this, BioBERT's performance is only marginally below that of the best model (custom Bi-LSTM) combined with clinical embeddings.

The various experiments conducted on text pre-processing only lead to small variations in terms of performance, and even negatively impact the performance of several algorithms, for the text classification task and embedding model tested. Given the additional constraints required to perform this step (the need to train embeddings on pre-processed texts and to clean input data) and the mixed results in performance, pre-processing does not appear to be essential.

Novel tokenization methods based on subword dictionaries, whilst not improving the performance, eliminate several shortcomings presented by SciSpaCy and similar methods, notably its speed and vocabulary size. In light of these limitations and the very small difference in performance for the task tested, BBPE appears to be a suitable alternative to traditional tokenizers and allows computational costs to be reduced significantly.

Finally, the custom Bi-LSTM outperforms BioBERT when it simulates attention on the entity of interest. However, this configuration requires information on the entity mention span, and then centering each document on this span. For some datasets, such information may either be readily available, or can be obtained by performing an additional named-entity extraction step. Unfortunately, many text classification tasks do not usually have this information, or may not rely on the specific entities/keywords required (e.g. sentiment analysis tasks). When Bi-LSTM is not customized, then both BERT models (trained on general and specific domains) produce the best performance, and consequently should be preferred for texts not easily allowing such customization.

5 Conclusion

In this article we have explored the performance of various word representation approaches (comparing bag-of-words to traditional and contextual embeddings trained on both specific and general corpora, combined with various text pre-processing and tokenizing methods) as well as classification algorithms on four different text classification tasks, all based on publicly available datasets.

A detailed performance comparison on these four tasks highlighted the efficacy of contextual embeddings when compared to traditional methods when no customization is possible, whether these embeddings are trained on specific or general corpora. When combined with appropriate entity extraction tasks and specific domain embeddings, Bi-LSTM outperforms contextual embeddings. Across all classification algorithms, text pre-processing and tokenization approaches showed limited impact for the task and embedding tested, suggesting a rule of thumb to opt for the least time and resource intensive method.

Acknowledgments

We thank the anonymous reviewers for their valuable suggestions.
This work was supported by Health Data Research UK, an initiative funded by UK Research and Innovation, Department of Health and Social Care (England) and the devolved administrations, and leading medical research charities.

AM is funded by Takeda California, Inc.
RD, RS, AR are part-funded by the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London.
RD is also supported by The National Institute for Health Research University College London Hospitals Biomedical Research Centre, and by the BigData@Heart Consortium, funded by the Innovative Medicines Initiative-2 Joint Undertaking under grant agreement No. 116074. This Joint Undertaking receives support from the European Union's Horizon 2020 research and innovation programme and EFPIA.
DB is funded by a UKRI Innovation Fellowship as part of Health Data Research UK MR/S00310X/1.
RB is funded in part by grant MR/R016372/1 for the King's College London MRC Skills Development Fellowship programme funded by the UK Medical Research Council (MRC) and by grant IS-BRC-1215-20018 for the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London.
This paper represents independent research part funded by the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146.

Wendy W. Chapman, Will Bridewell, Paul Hanbury, Gregory F. Cooper, and Bruce G. Buchanan. 2001. A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries. Journal of Biomedical Informatics, 34(5):301–310.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273–297.

Hong-Jie Dai. 2019. Family member information extraction via neural sequence labeling models with different tag schemes. BMC Medical Informatics and Decision Making, 19(10):257.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs].

N Elhadad, W. W Chapman, T O'Gorman, M Palmer, and G Savova. 2013. The ShARe schema for the syntactic and semantic annotation of clinical texts.

Philip Gage. 1994. A new algorithm for data compression.

Henk Harkema, John N. Dowling, Tyler Thornblade, and Wendy W. Chapman. 2009. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. Journal of Biomedical Informatics, 42(5):839–851.

Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9.

Tom Kenter, Alexey Borisov, and Maarten de Rijke. 2016. Siamese CBOW: Optimizing Word Embeddings for Sentence Representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 941–951, Berlin, Germany. Association for Computational Linguistics.

T.A. Koleck, C. Dreisbach, P.E. Bourne, and S. Bakken. 2019. Natural language processing of symptoms documented in free-text narratives of electronic health records: A systematic review. Journal of the American Medical Informatics Association, 26(4):364–379.

Zeljko Kraljevic, Daniel Bean, Aurelie Mascio, Lukasz Roguski, Amos Folarin, Angus Roberts, Rebecca Bendayan, and Richard Dobson. 2019. MedCAT – Medical Concept Annotation Tool. arXiv:1912.10166 [cs, stat].

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, page btz682.

Stéphane M. Meystre, Paul M. Heider, Youngjun Kim, Daniel B. Aruch, and Carolyn D. Britten. 2019. Automatic trial eligibility surveillance based on unstructured clinical data. International Journal of Medical Informatics, 129:13–19.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs].

Danielle L Mowery, Sumithra Velupillai, Brett R South, Lee Christensen, David Martinez, Liadh Kelly, Lorraine Goeuriot, Noémie Elhadad, Guergana Savova, and Wendy W Chapman. 2014. Task 2: ShARe/CLEF eHealth Evaluation Lab 2014. page 12.

Ghulam Mujtaba, Liyana Shuib, Norisma Idris, Wai Lam Hoo, Ram Gopal Raj, Kamran Khowaja, Khairunisa Shaikh, and Henry Friday Nweke. 2019. Clinical text classification research trends: Systematic literature review and open issues. Expert Systems with Applications, 116:494–520.

Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. Proceedings of the 18th BioNLP Workshop and Shared Task, pages 319–327.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Sanjay Purushotham, Chuizheng Meng, Zhengping Che, and Yan Liu. 2018. Benchmarking deep learning models on large healthcare datasets. Journal of Biomedical Informatics, 83:112–134.

M. Saeed, C. Lieu, G. Raber, and R. G. Mark. 2002. MIMIC II: a massive temporal ICU patient database to support research in intelligent patient monitoring. Computers in Cardiology, 29:641–644.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909 [cs].

Yuqi Si, Jingqi Wang, Hua Xu, and Kirk Roberts. 2019. Enhancing clinical concept extraction with contextual embeddings. Journal of the American Medical Informatics Association, 26(11):1297–1304.

Özlem Uzuner, Ira Goldstein, Yuan Luo, and Isaac Kohane. 2008. Identifying patient smoking status from medical discharge records. Journal of the American Medical Informatics Association: JAMIA, 15(1):14–24.

Özlem Uzuner. 2009. Recognizing Obesity and Comorbidities in Sparse Data. Journal of the American Medical Informatics Association: JAMIA, 16(4):561–570.

Changhan Wang, Kyunghyun Cho, and Jiatao Gu. 2019. Neural Machine Translation with Byte-Level Subwords. arXiv:1909.03341 [cs].

Y. Wang, L. Wang, M. Rastegar-Mojarad, S. Moon, F. Shen, N. Afzal, S. Liu, Y. Zeng, S. Mehrabi, S. Sohn, and H. Liu. 2018. Clinical information extraction applications: A literature review. Journal of Biomedical Informatics, 77:34–49.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2020. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 [cs].

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2018. Wikipedia2Vec: An Optimized Tool for Learning Embeddings of Words and Entities from Wikipedia. arXiv:1812.06280 [cs].

Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Clinical text classification with rule-based features and knowledge-guided convolutional neural networks. BMC Medical Informatics and Decision Making, 19(3):71.

A Appendices

A.1 Test set comparison of word representation, text pre-processing, tokenization and classification methods across tasks

Noise Pollution in Hospital Readmission Prediction: Long Document Classification with Reinforcement Learning

Liyan Xu† Julien Hogan‡ Rachel Patzer‡ Jinho D. Choi† †Department of Computer Science, ‡Department of Surgery Emory University, Atlanta, US {liyan.xu,julien.hogan,rpatzer,jinho.choi}@emory.edu

Abstract

This paper presents a reinforcement learning approach to extract noise in long clinical documents for the task of readmission prediction after kidney transplant. We face the challenges of developing robust models on a small dataset where each document may consist of over 10K tokens full of noise, including tabular text and task-irrelevant sentences. We first experiment with four types of encoders to empirically decide the best document representation, and then apply reinforcement learning to remove noisy text from the long documents, which models the noise extraction process as a sequential decision problem. Our results show that the old bag-of-words encoder outperforms deep learning-based encoders on this task, and reinforcement learning is able to improve upon the baseline while pruning out 25% of text segments. Our analysis depicts that reinforcement learning is able to identify both typical noisy tokens and task-specific noisy text.

1 Introduction

Prediction of hospital readmission has always been recognized as an important topic in surgery. Previous studies have shown that post-discharge readmission takes up tremendous social resources, while at least half of the cases are preventable (Basu Roy et al., 2015; Jones et al., 2016). Clinical notes, as part of the patients' Electronic Health Records (EHRs), contain valuable information but are often too time-consuming for medical experts to manually evaluate. Thus, it is of significance to develop prediction models utilizing various sources of unstructured clinical documents.

The task addressed in this paper is to predict 30-day hospital readmission after kidney transplant, which we treat as a long document classification problem without using specific domain knowledge. The data we use is the unstructured clinical documents of each patient up to the date of discharge.

In particular, we face three types of challenges in this task. First, the document size can be very long; documents associated with these patients can have tens of thousands of tokens. Second, the dataset is relatively small, with fewer than 2,000 patients available, as kidney transplant is a non-trivial medical surgery. Third, the documents are noisy, and there are many target-irrelevant sentences and tabular data in various text forms (Section 2).

The lengthy documents together with the small dataset impose a great challenge on representation learning. In this work, we experiment with four types of encoders: bag-of-words (BoW), averaged word embedding, and two deep learning-based encoders, namely ClinicalBERT (Huang et al., 2019) and LSTM with weight-dropped regularization (Merity et al., 2018). To overcome the long sequence issue, documents are split into multiple segments for both ClinicalBERT and LSTM (Section 4).

After we observe the best performed encoders, we further propose to combine reinforcement learning (RL) to automatically extract task-specific noisy text from the long documents, as we observe that many text segments do not contain predictive information, such that removing this noise can potentially improve performance. We model the noise extraction process as a sequential decision problem, which also aligns with the fact that clinical documents are received in time-sequential order. At each step, a policy network with strong entropy regularization (Mnih et al., 2016) decides whether to prune the current segment given the context, and the reward comes from a downstream classifier after all decisions have been made (Section 5).


Type  P      T         Description
CO    1,354  4,395.3   Report for every outpatient consultation before transplantation
DS    514    1,296.7   Summary at the time of discharge from every hospital admission happened before transplant
EC    1,110  1,073.6   Results of echocardiography
HP    1,422  3,025.1   Summary of the patient's medical history and clinical examination
OP    1,472  4,224.8   Report of surgical procedures
PG    1,415  13,723.4  Medical note during hospitalization summarizing the patient's medical status each day
SC    2,033  1,189.2   Report from the evaluation of each transplant candidate by the selection committee
SW    1,118  1,407.6   Report from encounters with social workers

Table 1: Statistics of our dataset with respect to different types of clinical notes. P: # of patients, T: avg. # of tokens, CO: Consultations, DS: Discharge Summary, EC: Echocardiography, HP: History and Physical, OP: Operative, PG: Progress, SC: Selection Conference, SW: Social Worker. The report for SC is written by the committee, which consists of surgeons, nephrologists, transplant coordinators, social workers, etc., at the end of the transplant evaluation. All 8 types follow an approximately 3:7 positive-negative class distribution.

Lab Fishbone (BMP, CBC, CMP, Diff) and critical labs - Last 24 hours 03/08/2013 12:45
142(Na) 104(Cl) 70H(BUN) - 10.7L(Hgb) < 92(Glu) 6.5(WBC) 137L(Plt) 3.6(K) 26(CO2)

Table 2: An example of tabular text in EKTD.

2 Data

This work is based on the Emory Kidney Transplant Dataset (EKTD), which contains structured chart data as well as unstructured clinical notes associated with 2,060 patients. The structured data comprises 80 features that are lab results before the discharge, as well as the binary labels of whether each patient is readmitted within 30 days after kidney transplant or not, where 30.7% of patients are labeled as positive. The unstructured data includes 8 types of notes, such that all patients have zero to many documents for each note type. It is possible to develop a more accurate prediction model by co-training the structured and unstructured data; however, this work focuses on investigating the potential of unstructured data only, which is more challenging.

2.1 Preprocessing

As the clinical notes are collected through various sources of EMRs, many noisy documents exist in EKTD, such that 515 documents are HTML pages and 303 of them are duplicates. These documents are removed during preprocessing. Moreover, most documents contain not only written text but also tabular data, because some EMR systems can only export entire documents in the table format. While there are many tabular texts in the documents (e.g., lab results and prescriptions as in Table 2), it is impractical to write rules to filter them out, as the exported formats are not consistent across EMRs. Thus, any tokens containing digits or symbols, except for one-character tokens, are removed during preprocessing. Although numbers may provide useful features, most quantitative measurements are already included in the structured data, so those features can be better extracted from the structured data if necessary. The remaining tabular text contains headers and values that do not provide much helpful information and becomes another source of noise, which we handle by training a reinforcement learning model to identify it (Section 5).

Table 1 gives the statistics of each clinical note type after preprocessing. The average number of tokens is measured by counting the tokens in all documents from the same note type of each patient. Given this preprocessed dataset, our task is to take all documents in each note type as a single input and predict whether or not the patient associated with those documents will be readmitted.

3 Related Work

Shin et al. (2019) presented ensemble models utilizing both the structured and the unstructured data in EKTD, where separate logistic regression (LR) models are trained on the structured data and on each type of notes respectively, and the final prediction for each patient is obtained by averaging the predictions from each model. Since some patients may lack documents from certain note types, predictions on these note types are simply ignored in the averaging process. For the unstructured notes, a concatenation of Term Frequency-Inverse Document Frequency (TF-IDF) and Latent Dirichlet Allocation (LDA) representations is fed into LR. However, we have found that the LDA representation only contributes marginally, while LDA takes significantly more inference time. Thus, we drop LDA and only use TF-IDF as our BoW encoder (Section 4.1).
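For readers who want a concrete picture of this averaging scheme, a minimal sketch is given below; it is an illustration of the idea rather than Shin et al.'s code, and the note-type names and fallback value are placeholders.

import numpy as np

def ensemble_readmission_probability(per_type_probs):
    """Average per-note-type probabilities, ignoring note types the
    patient has no documents for (represented here as None).

    per_type_probs: dict mapping a note type (e.g. 'CO', 'DS') to the
    probability from that note type's classifier, or None if the
    patient has no documents of that type.
    """
    available = [p for p in per_type_probs.values() if p is not None]
    if not available:   # no usable notes at all
        return 0.5      # fall back to an uninformative prior
    return float(np.mean(available))

# Example: a patient with no echocardiography (EC) reports.
print(ensemble_readmission_probability(
    {"CO": 0.41, "DS": 0.62, "EC": None, "HP": 0.55}))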

Various deep learning models for text classification have been proposed in recent years. Pre-trained language models like BERT have shown state-of-the-art performance on many NLP tasks (Devlin et al., 2019), and ClinicalBERT has been introduced for the medical domain (Huang et al., 2019). However, deep learning approaches have two drawbacks on this particular dataset. First, deep learning requires a large dataset to train, whereas most of our unstructured note types have fewer than 2,000 samples. Second, these approaches are not designed for long documents, and it is difficult for them to keep long-term dependencies over thousands of tokens.

Reinforcement learning has been explored to combat data noise by previous work (Zhang et al., 2018; Qin et al., 2018) in the short text setting. A policy network makes decisions left-to-right over tokens, and is jointly trained with another classifier. However, there is little investigation of using RL in the long text setting, as it still requires an effective encoder to give a meaningful representation of long documents. Therefore, in our experiments, the first step is to select the best encoder, and we then apply RL to the long document classification.

4 Document Representation

4.1 Bag-of-Words

For the baseline model, the bag-of-words representation with TF-IDF scores, excluding stopwords (Nothman et al., 2018), is fed into logistic regression (LR). The objective is to minimize the negative log likelihood of the gold label $y_i$:

$-\frac{1}{m}\sum_{i=1}^{m}\big[y_i \log p(g_i) + (1 - y_i)\log\big(1 - p(g_i)\big)\big] \quad (1)$

where $g_i$ is the TF-IDF representation of $D_i$. In addition, we experiment with two common techniques in the encoder to reduce the feature space: token stemming and document frequency cutoff.
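As an illustration of this baseline, the following sketch builds a TF-IDF plus class-weighted logistic regression pipeline with scikit-learn. It is a minimal reconstruction under assumed settings: the toy documents, the min_df value standing in for the DF cutoff, and the value of C are placeholders rather than the configuration used in the experiments reported here, and the stemming step is omitted.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy inputs: one concatenated string per patient for a given note type,
# and a binary 30-day readmission label per patient.
documents = [
    "patient admitted for kidney transplant evaluation",
    "lab fishbone bmp cbc cmp diff critical labs",
    "patient discharged home after kidney transplant",
    "social worker met with patient for transplant evaluation",
]
labels = [1, 0, 0, 1]

bow_baseline = Pipeline([
    # min_df plays the role of the document-frequency cutoff (here 2 only
    # because the toy corpus is tiny); stop words are excluded as above.
    ("tfidf", TfidfVectorizer(stop_words="english", min_df=2)),
    # class_weight="balanced" assigns weights inversely proportional to
    # class frequencies; C would normally be tuned on the development set.
    ("clf", LogisticRegression(class_weight="balanced", C=1.0, max_iter=1000)),
])

bow_baseline.fit(documents, labels)
print(bow_baseline.predict_proba(documents)[:, 1])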
4.2 Averaged Word Embedding

Word embeddings generated by fastText are used to establish another baseline, which utilizes subwords to better represent unseen terms (Bojanowski et al., 2017). It is suitable for this task, as unseen terms and misspellings frequently appear in these clinical notes. The averaged word embedding is used to represent the input document consisting of multiple notes, which is fed into LR with the same training objective.

4.3 ClinicalBERT

Following Huang et al. (2019), the pretrained BERT language model (Devlin et al., 2019) is first tuned on the MIMIC-III clinical note corpus (Johnson et al., 2016), which has been shown to provide better related-word similarities in medical domains. Then, a dense layer is added on the CLS token of the last BERT layer. All parameters are fine-tuned to optimize the binary cross entropy loss, which is the same objective as Equation 1.

Since BERT has a limit on the input length, the input document of each patient is split into multiple subsequences. Each subsequence is within the BERT length limit, and serves as an independent sample with the same label as the patient. The training data is therefore noisily inflated. The final probability of readmission is computed as follows:

$p(y_i = 1 \mid g_i) = \frac{p^{n_i}_{\max} + p^{n_i}_{\mathrm{mean}} \cdot n_i / c}{1 + n_i / c} \quad (2)$

where $g_i$ is the BERT representation of patient $i$, $n_i$ is the corresponding number of subsequences, and $c$ is a hyperparameter to control the influence of $n_i$. $p^{n_i}_{\max}$ and $p^{n_i}_{\mathrm{mean}}$ are the max and mean probability across the subsequences, respectively.

The motivation behind balancing between the max and mean probability is that subsequences do not contain equal information. $p^{n_i}_{\max}$ represents the best potential, while longer text should give more importance to $p^{n_i}_{\mathrm{mean}}$, because $p^{n_i}_{\max}$ is more easily affected by noise as the text length grows.

Although Equation 2 seems intuitive, the use of pseudo labels on subsequences becomes another source of noise, especially when there are thousands of tokens; thus, the performance is uncertain. Section 6.2 provides a detailed empirical analysis of this model.

4.4 Weight-dropped LSTM

We split the documents of each patient into multiple short segments, and feed the segment representation to a long short-term memory network (LSTM) at each time step:

$h_j \leftarrow \mathrm{LSTM}(s_j, h_{j-1}; \theta) \quad (3)$

Figure 1: Overview of our reinforcement learning approach. Rewards are calculated and sent back to the policy network after all actions a1:T have been sampled for the given episode.

where $h_j$ is the hidden state at time step $j$, $s_j$ is the $j$-th segment, and $\theta$ is the set of parameters.

Although segmentation of the documents is still necessary, no pseudo labels are needed. We obtain the segment representation by averaging its token embeddings from the last layer of BERT. The final hidden state at each step $j$ is the concatenation of the hidden states of a single-layer bi-directional LSTM. After we obtain the hidden state for each segment, a max-pooling operation is performed on $h_{1:n}$ over the time dimension to obtain a fixed-length vector, similar to Kim (2014) and Adhikari et al. (2019). A dense layer immediately follows.

It is particularly important to strengthen regularization on this dataset with its small sample size. Dropout (Srivastava et al., 2014) has been shown to be an effective form of regularization in deep learning models, and Merity et al. (2018) successfully applied a dropout-like technique to LSTMs: DropConnect (Wan et al., 2013) is applied to the four hidden-to-hidden matrices, preventing overfitting from occurring on the recurrent weights.

5 Reinforcement Learning

Reinforcement learning is applied to the best performing encoder from Section 4 to prune noisy text, which can lead to comparable or even better performance, as many text segments in these clinical notes are found to be irrelevant to this task. Figure 1 gives an overview of our reinforcement learning approach. The pruning process is modeled as a sequential decision problem, reflecting the fact that these notes are received in time order. It consists of two separate components: a policy network and a downstream classifier. To avoid having too many time steps, the policy operates at the segment level instead of the token level. For each patient, documents are split into short segments $g_{1:T} = \{g_1, g_2, \dots, g_T\}$, and the policy network makes a sequence of decisions $a_{1:T} = \{a_1, a_2, \dots, a_T\}$ over the segments. The downstream classifier is responsible for the reward, and the REINFORCE algorithm is used to train the policy (Williams, 1992).

State: At each time step, the state $s_t$ is the concatenation of two parts: the representation of the previously selected text, and the current segment representation $g_t$. The previously selected text serves as the context and provides a prior importance. Both parts are represented by an effective encoder, e.g., the best performing encoder from Section 4.

Action: The action space at each step is binary: {Keep, Prune}. If the action is Keep, the current segment is added to the selected text; otherwise, it is discarded. The final selected text for a patient is the concatenation of the segments selected by the policy.

Reward: The reward comes at the end, after all actions have been sampled for the entire sequence. The final selected text is fed to the downstream classifier, and the negative log-likelihood of the gold label is used as the reward $R$. In addition, we also include a reward term $R_p$ to encourage pruning:

$R_p = c \cdot \alpha \cdot \big[2\sigma(\tfrac{l}{\beta}) - 1\big] \quad (4)$

where $c$ and $\beta$ are hyperparameters to control the scale of $R_p$, $l$ is the number of segments, $\alpha = |\{a_k = \mathrm{Prune}\}| / l$ is the ratio of pruned segments, and $\sigma$ is the sigmoid function. The value of the term $2\sigma(l/\beta) - 1$ falls into the range $(0, 1)$. When $l$ is small, it downgrades the encouragement of pruning; when $l$ is large, it gives an upper bound on $R_p$. Additionally, we apply exponential decay to the reward: the final reward is $d^{l} R + R_p$, where $d$ is the discount rate.

Policy Network: The policy network maintains a stochastic policy $\pi(a_t \mid s_t; \theta)$:

$\pi(a_t \mid s_t; \theta) = \sigma(W s_t + b) \quad (5)$

where $\theta$ is the set of policy parameters $W$ and $b$, and $a_t$ and $s_t$ are the action and state at time step $t$, respectively. During training, an action is sampled at each step with the probability given by the policy.
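To make the segment-level decision process concrete, the sketch below rolls out the linear policy of Equation 5 over pre-encoded segment vectors and computes the pruning bonus of Equation 4. It is an illustrative reconstruction, not the released implementation; the running-context summary and the default hyperparameter values are assumptions.

import torch

def rollout(policy_w, policy_b, segments, c=0.1, beta=8.0):
    """Sample Keep/Prune actions over segments and compute R_p (Eq. 4).

    policy_w, policy_b: parameters of the linear policy in Eq. 5
                        (policy_w has twice the segment dimension).
    segments:           tensor of shape (T, d), one encoded vector per segment.
    """
    kept, actions, log_probs = [], [], []
    context = torch.zeros(segments.size(1))            # summary of text selected so far
    for g_t in segments:
        state = torch.cat([context, g_t])               # state = [selected-text context; current segment]
        p_keep = torch.sigmoid(policy_w @ state + policy_b)
        keep = torch.bernoulli(p_keep)                  # sample the Keep/Prune action
        log_probs.append(torch.log(p_keep if keep else 1 - p_keep))
        actions.append(int(keep))
        if keep:
            kept.append(g_t)
            context = torch.stack(kept).mean(dim=0)     # crude running summary of kept segments
    T = segments.size(0)
    alpha = 1.0 - sum(actions) / T                      # ratio of pruned segments
    r_prune = c * alpha * (2 * torch.sigmoid(torch.tensor(T / beta)) - 1)  # Eq. 4
    return kept, torch.stack(log_probs), r_prune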

Encoder                      CO    DS    EC    HP    OP    PG    SC    SW
Bag-of-Words (§4.1)          58.6  62.1  52.0  58.9  51.8  61.2  59.3  51.6
  + Cutoff                   58.6  62.3  52.8  59.0  51.9  61.3  59.3  51.9
  + Stemming                 58.9  61.8  53.4  59.4  51.9  61.5  59.3  51.6
Averaged Embedding (§4.2)    56.3  53.7  52.4  54.0  53.4  54.7  54.2  46.6
ClinicalBERT (§4.3)          51.9  53.3  -     52.7  -     -     52.3  -
Weight-dropped LSTM (§4.4)   53.7  55.8  -     54.2  -     -     54.5  -

Table 3: The Area Under the Curve (AUC) scores achieved by different encoders with 5-fold cross-validation. See the caption of Table 1 for the descriptions of CO, DS, EC, HP, OP, PG, SC, and SW. For the deep learning encoders, only four note types are used in the experiments (Section 6.2).

After the sampling has been performed over the entire sequence, the delayed reward is computed. During evaluation, the action is picked by $\arg\max_a \pi(a \mid s_t; \theta)$.

The training is guided by the REINFORCE algorithm (Williams, 1992), which optimizes the policy to maximize the expected reward:

$J(\theta) = \mathbb{E}_{a_{1:T} \sim \pi}\, R_{a_{1:T}} \quad (6)$

and the gradient has the following form:

$\nabla_\theta J(\theta) = \mathbb{E}_\tau \sum_{t=1}^{T} \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, R_\tau \quad (7)$

$\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi(a_{it} \mid s_{it}; \theta)\, R_{\tau_i} \quad (8)$

where $\tau$ represents the sampled trajectory $\{a_1, a_2, \dots, a_T\}$ and $N$ is the number of sampled trajectories. $R_{\tau_i}$ here equals the delayed reward from the downstream classifier at the last step.

To encourage exploration and avoid local optima, we add entropy regularization (Mnih et al., 2016) to the policy loss:

$J_{\mathrm{reg}}(\theta) = \frac{\lambda}{N} \sum_{i=1}^{N} \frac{1}{T_i} \nabla_\theta H\big(\pi(s_{it}; \theta)\big) \quad (9)$

where $H$ is the entropy, $\lambda$ is the regularization strength, and $T_i$ is the trajectory length.

Finally, the downstream classifier and the policy network are warm-started by separate training, and then jointly trained together.
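A minimal sketch of the resulting training signal, combining the REINFORCE objective of Equations 7-8 with the entropy term of Equation 9 for a single sampled trajectory, is shown below; the tensor shapes and the default coefficient are assumptions made for illustration.

import torch

def reinforce_loss(log_probs, probs_keep, reward, entropy_coeff=6.0):
    """Policy loss for one sampled trajectory.

    log_probs:  (T,) log-probabilities of the sampled actions a_1..a_T.
    probs_keep: (T,) probabilities of the Keep action at each step.
    reward:     scalar delayed reward from the downstream classifier
                (plus the pruning bonus R_p), treated as a constant here.
    """
    # REINFORCE: maximize E[R]  ->  minimize -sum_t log pi(a_t|s_t) * R  (Eqs. 7-8).
    pg_loss = -(log_probs.sum()) * reward
    # Bernoulli entropy of the policy at each step (Eq. 9), averaged over steps.
    entropy = -(probs_keep * torch.log(probs_keep + 1e-8)
                + (1 - probs_keep) * torch.log(1 - probs_keep + 1e-8)).mean()
    return pg_loss - entropy_coeff * entropy

# One gradient step would average this loss over N sampled trajectories and
# back-propagate through the policy parameters only.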

6 Experiments

Before the experiments, we perform the preprocessing described in Section 2.1, and then randomly split the patients of every note type into 5 folds to perform cross-validation, as suggested by Shin et al. (2019). To evaluate each fold $F_i$, 12.5% of the training set, that is, the combined data of the other 4 folds, is held out as the development set, and the best configuration on this development set is used to decode $F_i$. The same split is used across all experiments for a fair comparison. Following Shin et al. (2019), the Area Under the Curve (AUC) averaged across these 5 folds is used as the evaluation metric.

6.1 Baseline

Bag-of-Words: We first conduct experiments using the bag-of-words encoder (BoW; Section 4.1) to establish the baseline. Experiments are performed on all note types using the vanilla TF-IDF, a document frequency (DF) cutoff at 2 (removing all tokens whose DF ≤ 2), and token stemming. For every experiment, the class weight is assigned inversely proportional to the class frequencies, and the inverse of the regularization strength C is searched over {0.01, 0.1, 1, 10}, with the best results achieved at C = 1 on the development set.

Table 3 shows the cross-validation results on every note type. The top AUC is 62.3%, which is within expectation given the difficulty of this task. Some note types are not as predictive as the others, such as Operative (OP) and Social Worker (SW), with AUC under 52%. Most note types have standard deviations in the range 0.02 to 0.03. In comparison to the previous work (Shin et al., 2019), we achieve 0.671 AUC when combining both structured and unstructured data, despite not using LDA in our encoder.

Noise Observation: The DF cutoff coupled with token stemming significantly reduces the feature space of the BoW model. As shown in Table 4, the DF cutoff by itself achieves about a 50% reduction of the feature space. Furthermore, applying the DF cutoff leads to slightly higher AUCs on most of the note types, despite almost half of the tokens being removed from the vocabulary. This implies that there exists a large amount of noisy text that appears in only a few documents, causing the models to overfit more easily. These results further verify our previous observation and strengthen the necessity of extracting noise from these long documents using reinforcement learning (Section 6.3).
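For reference, the evaluation protocol described at the start of this section (5-fold cross-validation with a 12.5% development split held out from each training fold, scored by AUC) can be sketched as follows; the data loading and the model builder are placeholders.

import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import roc_auc_score

def cross_validate(documents, labels, build_model, seed=0):
    """5-fold CV with a 12.5% dev split held out from each training fold."""
    documents, labels = np.array(documents, dtype=object), np.array(labels)
    fold_aucs = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=seed).split(documents):
        X_train, X_dev, y_train, y_dev = train_test_split(
            documents[train_idx], labels[train_idx],
            test_size=0.125, random_state=seed, stratify=labels[train_idx])
        model = build_model()                 # e.g. the TF-IDF + LR pipeline above
        model.fit(X_train, y_train)           # the dev split would drive model selection
        scores = model.predict_proba(documents[test_idx])[:, 1]
        fold_aucs.append(roc_auc_score(labels[test_idx], scores))
    return float(np.mean(fold_aucs))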

Averaged Word Embedding: For the averaged word embedding encoder (AWE; Section 4.2), embeddings generated by fastText, trained on Common Crawl and the English Wikipedia with 300 dimensions, are used.¹ AWE is outperformed by BoW on every note type except Operative (OP; Table 3). This empirical result implies that AWE over thousands of tokens is not very effective at generating the document representation, so the averaged embeddings are less discriminative than the sparse vectors generated by BoW for such long documents.

Type  Vanilla  + Cutoff       + Stemming
CO    28,213   15,022 (46.8)  12,243 (56.6)
DS    11,029   6,117 (44.5)   5,228 (52.6)
HP    20,245   11,276 (44.3)  9,329 (53.9)
SC    19,050   9,873 (48.2)   8,200 (57.0)

Table 4: The dimensions of the feature spaces used by each BoW model with respect to the four note types. The numbers in parentheses indicate the percentage reduction from the vanilla model.

6.2 Deep Learning-based Encoders

For the deep learning encoders, the four note types with good baseline performance (≈ 60% AUC) and reasonable sequence length (< 5,000) are selected for the following experiments: Consultations (CO), Discharge Summary (DS), History and Physical (HP), and Selection Conference (SC) (see Tables 1 and 3).

Segmentation: For both the ClinicalBERT and LSTM models, the input document is split into segments as described in Section 4.3. For LSTM, we set the maximum segment length to 128 for CO and HP, and to 64 for DS and SC, to balance segment length against sequence length. The segment length for ClinicalBERT is set to 318 (approaching 500 after BERT tokenization) to avoid the noise brought by too many pseudo labels. More statistics about segmentation are summarized in Table 5.

For ClinicalBERT, we use the PyTorch BERT implementation with the base configuration:² 768 embedding dimensions and 12 transformer layers. We load the weights provided by Huang et al. (2019), whose language model has been finetuned on large-scale clinical notes.³ We finetune the entire ClinicalBERT with batch size 4, learning rate 2 × 10⁻⁵, and weight decay rate 0.01. For the weight-dropped LSTM, we set the batch size to 64, the learning rate to 10⁻³, and the weight-drop rate to 0.5, and we search the hidden state dimension over {128, 256, 512} on the development set. Early stopping is used for both approaches.

Type + Model  SEN  SEQ   INST
CO + BERT     318  14.8  11,376
CO + LSTM     128  36.8  948
DS + BERT     318  4.6   1,588
DS + LSTM     64   22.5  371
HP + BERT     318  10.1  8,364
HP + LSTM     128  27.3  987
SC + BERT     318  3.7   5,206
SC + LSTM     64   25.4  1,422

Table 5: SEN: maximum segment length (number of tokens) allowed by the corresponding model; SEQ: average sequence length (number of segments); INST: average number of samples in the training set.

Result Analysis: Table 3 shows the final results achieved by the ClinicalBERT and LSTM models. The AUCs of both models experience a non-trivial drop from the baseline. Upon further investigation, the issue is that both models suffer from severe overfitting under the huge feature spaces, and struggle to learn generalized decision boundaries from this data. Figure 2 shows an example of the weak correlation between the training loss and the AUC scores on the development set. As more steps are processed, the training loss gradually decreases to 0. However, the model has high variance, and it does not necessarily give better performance on the development set as the training loss drops. This issue is more apparent with ClinicalBERT on CO, because there are too many pseudo labels acting as noise, which makes it harder for the model to distinguish useful patterns from noise.

¹ https://fasttext.cc/docs/en/crawl-vectors.html
² https://github.com/huggingface/transformers
³ https://github.com/kexinhuang12345/clinicalBERT
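For completeness, the subsequence-level aggregation of Equation 2, which produces the patient-level probability from the per-subsequence scores discussed above, can be written directly; the probabilities and the value of c below are illustrative.

def aggregate_readmission_probability(subseq_probs, c=2.0):
    """Combine per-subsequence probabilities as in Equation 2."""
    n = len(subseq_probs)
    p_max = max(subseq_probs)
    p_mean = sum(subseq_probs) / n
    return (p_max + p_mean * n / c) / (1.0 + n / c)

# A long document (many subsequences) leans more on the mean probability,
# while a short one is dominated by the max.
print(aggregate_readmission_probability([0.2, 0.9, 0.4, 0.3], c=2.0))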

Figure 2: Training loss (a) and AUC scores on the development set (b) during LSTM training on the CO type. The AUC scores show high variance and only a weak correlation with the training loss.

6.3 Reinforcement Learning

According to Table 3, the BoW model achieves the best performance. Therefore, we decide to use TF-IDF to represent the long text of each patient, along with logistic regression as the classifier for reinforcement learning. Document segmentation is the same as for LSTM (Table 5). During training, the segments within each note are shuffled to reduce the risk of overfitting, and sequences with more than 36 segments are truncated.

The downstream classifier is warm-started by loading the weights from the logistic regression model of the previous experiment. The policy network is then trained for 400 episodes while freezing the downstream classifier. After the warm start, both models are jointly trained. We set the number of sampled episodes N to 10 and the learning rate to 2 × 10⁻⁴, and fix the scaling factor β in Equation 4 to 8 and the discount rate to 0.95. Moreover, we search the reward coefficient c over {0.02, 0.1, 0.4}, and the entropy coefficient λ over {2, 4, 6, 8}.

          CO    DS    HP    SC
Best      58.9  62.3  59.4  59.3
RL        59.8  62.4  60.6  60.2
Pruning   26%   5%    19%   23%

Table 6: The AUC scores and the pruning ratios of reinforcement learning (RL). Best: AUC scores from the best performing models in Table 3.

The AUC scores and the pruning ratios (the number of pruned segments divided by the sequence length) are shown in Table 6. Our reinforcement learning approach outperforms the best performing models in Table 3, achieving around 1% higher AUC scores on three note types, CO, HP, and SC, while pruning out up to 26% of the input documents.

Tuning Analysis: We find that two hyperparameters are essential to the final success of reinforcement learning (RL). The first is the reward discount rate $d$. The scale of the policy gradient $\nabla_\theta J(\theta)$ depends on the sequence length $T$, while the delayed reward $R_\tau$ is always on the same scale regardless of $T$. Therefore, different sequence lengths across episodes cause turbulence in the policy gradient, leading to unstable training. It is important to apply reward decay to stabilize the scale of $\nabla_\theta J(\theta)$. The second is the entropy regularization coefficient $\lambda$, which forces the model to add a bias towards uncertainty. Without strong entropy regularization, the training easily falls into a local optimum in the early stage, namely keeping all segments, as shown in Figure 3(a). $\lambda = 6$ gives the model a decent incentive to explore aggressively, as shown in Figure 3(b), and finally leads to a higher AUC.

Figure 3: Retaining ratios on the development set of SC while training the reinforcement learning model: (a) without entropy regularization, (b) with entropy regularization. Entropy regularization encourages more exploration.

7 Noise Analysis

To investigate the noise extracted by RL, we analyze the pruned segments on the validation sets of the Consultations (CO) type, and compare the results with other basic noise removal techniques.

Qualitative Analysis: Table 7 demonstrates the potential of the learned policy to automatically identify noisy text in the long documents. The original notes of the shown examples are tabular text with headers and values, mostly lab results and medical prescriptions. After the data cleaning step, the text becomes broken and does not make much sense for humans to evaluate. The learned policy can identify noisy segments by looking at the presence of headers such as "lab fishbone" and "lab report", and certain medical terms that frequently appear in tabular reports, such as "chloride", "creatinine", "hemoglobin", "methylprednisolone", etc. We find that many pruned segments have strong indicators

101 lab fishbone ( bmp , cbc , cmp , diff ) and critical labs - last hours ( not an official lab report . please see flowsheet ( or printed official lab reports ) for official lab results . ) ( na ) ( cl ) h ( bun )- ( hgb ) ( glu ) ( wbc ) ( plt ) ( ) h ( cr ) ( hct ) na = not applicable a = abnormal ( ftn ) = footnote . laboratory studies : sodium , potassium , chloride ,., bun , creatinine , glucose . total bilirubin 1 , phos of , calcium , ast 9 , alt , alk phos . parathyroid hormone level . white blood cell count , hemoglobin , hematocrit , platelets . inr , ptt , and pt . methylprednisolone ivpb : mg , ivpb , give in surgery , routine , / , infuse over : minute . mycophe- nolate mofetil : mg = 4 cap , po , capsule , once , now , / , stop date / , ml . documented medications documented accupril : mg , po , qday , 0 refill , substitution allowed .

Table 7: Examples of pruned segments by the learned policy. Tokens that have feature importance lower than −0.001 (towards the Prune action) are marked in bold.

the social worker met with this pleasant year old caucasian male on this date for kidney transplant evaluation . the patient was alert , oriented and easily engaged in conversation with the social worker today . he resides in atlanta with his spouse of years , who he describes as very supportive . he reports occasional alcohol drinks per month but denies any illicit drug use . he has a grade education . he has been married for years . he is working full - time while on peritoneal dialysis as a business asset manager . he has medicare and an aarp prescriptions supplement . family history : mother deceased at age with complications of obesity , high blood pressure and heart disease .

Table 8: Examples of kept segments by the learned policy. Tokens that have feature importance greater than 0.0005 (towards the Keep action) are marked in bold.

of headers and specific medical terms, which appear mostly in tabular text rather than written notes.

Table 8 shows examples that are kept by the policy. Tokens that contribute towards the Keep action are words related to human and social life, such as "social worker", "engaged", "drinks", "married",

"medicare", and terms related to health conditions, such as "obesity", "heart", and "high blood pressure". These terms indeed appear mostly in written text rather than tabular data.

In addition, we also notice that the policy is able to remove certain duplicate segments. Medical professionals sometimes repeat descriptions from previous notes in a new document, causing duplicate content. The policy learns to make use of the already selected context, and assigns negative coefficients to certain tokens. Duplicate segments are only selected once if the segment contains many tokens that have opposite feature importance in the context and segment vectors.

Quantitative Analysis: We examine the tokens that are pruned by RL and compare them with the document frequency (DF) cutoff. We select 3,000 unique tokens in the vocabulary that have the top negative feature importance (towards the Prune action) in the segment vector of CO. Figure 4 shows the DF distribution of these tokens.

Figure 4: Log-scale distribution of the document frequency of tokens with top negative feature importance.

We observe that the majority of those tokens have small DF values. This shows that the learned policy is able to identify certain tokens with small DF values as noise, which aligns with the DF cutoff. Moreover, the distribution also shows a non-trivial number of tokens with large DF values, demonstrating that RL can also identify task-specific noisy tokens that appear commonly across documents, which in this case are certain tokens in noisy tabular text.

Either RL or the DF cutoff achieves a higher AUC while reducing the input features, showing that, given the small sample size, the extracted text is more likely to cause overfitting than to be a generalizable pattern, which also verifies our initial hypothesis.

102 8 Conclusion pages 4171–4186, Minneapolis, Minnesota. Associ- ation for Computational Linguistics. In this paper, we address the task of 30-day readmis- sion prediction after kidney transplant, and propose Kexin Huang, Jaan Altosaar, and Rajesh Ran- to improve the performance by applying reinforce- ganath. 2019. Clinicalbert: Modeling clinical notes and predicting hospital readmission. CoRR, ment learning with noise extraction capability. To abs/1904.05342. overcome the challenge of long document represen- tation with a small dataset, four different encoders Alistair EW Johnson, Tom J Pollard, Lu Shen, are experimented. Empirical results show that bag- H Lehman Li-wei, Mengling Feng, Moham- of-words is the most suitable encoder, surpassing mad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. Mimic- overfitted deep learning models, and reinforcement iii, a freely accessible critical care database. Scien- learning is able to improve the performance, while tific data, 3:160035. being able to identify both traditional noisy tokens that appear in few documents, and task-specific Caroline Jones, Robert Hollis, Tyler Wahl, Brad Oriel, noisy text that commonly appear. Kamal Itani, Melanie Morris, and Mary Hawn. 2016. Transitional care interventions and hospital readmis- sions in surgical populations: A systematic review. Acknowledgments The American Journal of Surgery, 212. We gratefully acknowledge the support of the Na- Yoon Kim. 2014. Convolutional neural networks tional Institutes of Health grant R01MD011682, for sentence classification. In Proceedings of the Reducing Disparities among Kidney Transplant Re- 2014 Conference on Empirical Methods in Natural cipients. Any opinions, findings, and conclusions Language Processing (EMNLP), pages 1746–1751, or recommendations expressed in this material are Doha, Qatar. Association for Computational Lin- guistics. those of the authors and do not necessarily reflect the views of the National Institutes of Health. Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing LSTM language models. In International Conference on References Learning Representations. Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Jimmy Lin. 2019. Rethinking complex neural net- Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, work architectures for document classification. In David Silver, and Koray Kavukcuoglu. 2016. Asyn- Proceedings of the 2019 Conference of the North chronous methods for deep reinforcement learning. American Chapter of the Association for Compu- In Proceedings of The 33rd International Confer- tational Linguistics: Human Language Technolo- ence on Machine Learning, volume 48 of Proceed- gies, Volume 1 (Long and Short Papers), pages ings of Machine Learning Research, pages 1928– 4046–4051, Minneapolis, Minnesota. Association 1937, New York, New York, USA. PMLR. for Computational Linguistics. Joel Nothman, Hanmin Qin, and Roman Yurchak. Senjuti Basu Roy, Ankur Teredesai, Kiyana Zolfaghar, 2018. Stop word lists in free open-source software Rui Liu, David Hazel, Stacey Newman, and Albert packages. In Proceedings of Workshop for NLP Marinez. 2015. Dynamic hierarchical classification Open Source Software (NLP-OSS), pages 7–12, Mel- for patient risk-of-readmission. In Proceedings of bourne, Australia. Association for Computational the 21th ACM SIGKDD International Conference on Linguistics. 
Knowledge Discovery and Data Mining, KDD ’15, page 1691–1700, New York, NY, USA. Association for Computing Machinery. Pengda Qin, Weiran Xu, and William Yang Wang. 2018. Robust distant supervision relation extrac- Piotr Bojanowski, Edouard Grave, Armand Joulin, and tion via deep reinforcement learning. In Proceed- Tomas Mikolov. 2017. Enriching word vectors with ings of the 56th Annual Meeting of the Association subword information. Transactions of the Associa- for Computational Linguistics (Volume 1: Long Pa- tion for Computational Linguistics, 5:135–146. pers), pages 2137–2147, Melbourne, Australia. As- sociation for Computational Linguistics. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Bonggun Shin, Julien Hogan, Andrew B. Adams, Ray- deep bidirectional transformers for language under- mond J. Lynch, and Jinho D. Choi. 2019. Mul- standing. In Proceedings of the 2019 Conference timodal ensemble approach to incorporate various of the North American Chapter of the Association types of clinical notes for predicting readmission. for Computational Linguistics: Human Language In 2019 IEEE EMBS International Conference on Technologies, Volume 1 (Long and Short Papers), Biomedical & Health Informatics (BHI).

103 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Re- search, 15(56):1929–1958. Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. 2013. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on International Con- ference on Machine Learning - Volume 28, ICML’13, page III–1058–III–1066. JMLR.org. Ronald J. Williams. 1992. Simple statistical gradient- following algorithms for connectionist reinforce- ment learning. Mach. Learn., 8(3–4):229–256. Tianyang Zhang, Minlie Huang, and Li Zhao. 2018. Learning structured representation for text classifica- tion via reinforcement learning. In AAAI Conference on Artificial Intelligence.

Evaluating the Utility of Model Configurations and Data Augmentation on Clinical Semantic Textual Similarity

Yuxia Wang, Fei Liu, Karin Verspoor, Timothy Baldwin
School of Computing and Information Systems, The University of Melbourne, Victoria, Australia
{yuxiaw, fliu3}@student.unimelb.edu.au, [email protected], [email protected]

Abstract 2016; Reimers and Gurevych, 2019). These archi- tectures require a large amount of training data, an In this paper, we apply pre-trained language unrealistic requirement in low resource settings. models to the Semantic Textual Similarity (STS) task, with a specific focus on the clini- Recently, pre-trained language models (LMs) cal domain. In low-resource setting of clinical such as GPT-2 (Radford et al., 2018) and STS, these large models tend to be impractical BERT (Devlin et al., 2019) have been shown to and prone to overfitting. Building on BERT, benefit from pre-training over large corpora fol- we study the impact of a number of model de- lowed by fine tuning over specific tasks. How- sign choices, namely different fine-tuning and ever, for small-scale datasets, only limited fine- pooling strategies. We observe that the impact tuning can be done. For example, GPT-2 achieved of domain-specific fine-tuning on clinical STS is much less than that in the general domain, strong results across four large natural language likely due to the concept richness of the do- inference (NLI) datasets, but was less successful main. Based on this, we propose two data aug- over the small-scale RTE corpus (Bentivogli et al., mentation techniques. Experimental results on 2009), performing below a multi-task biLSTM N2C2-STS1 demonstrate substantial improve- model. Similarly, while the large-scale pre-training ments, validating the utility of the proposed of BERT has led to impressive improvements on methods. a range of tasks, only very modest improvements 1 Introduction have been achieved on STS tasks such as STS- B(Cer et al., 2017) and MRPC (Dolan and Brock- Semantic Textual Similarity (STS) is a language ett, 2005) (with 5.7k and 3.6k training instances, understanding task, involving assessing the degree resp.). Compared to general-domain STS bench- of semantic equivalence between two pieces of text marks, labeled clinical STS data is more scarce, based on a graded numerical score (Corley and which tends to cause overfitting during fine-tuning. Mihalcea, 2005). It has application in tasks such Moreover, further model scaling is a challenge due as information retrieval (Hliaoutakis et al., 2006), to GPU/TPU memory limitations and longer train- question answering (Hoogeveen et al., 2018), and ing time (Lan et al., 2019). This motivates us to summarization (AL-Khassawneh et al., 2016). In search for model configurations which strike a bal- this paper, we focus on STS in the clinical domain, ance between model flexibility and overfitting. in the context of a recent task within the framework In this paper, we study the impact of a number of of N2C2 (the National NLP Clinical Challenges)1, model design choices. First, following Reimers and which makes use of the extended MedSTS data Gurevych(2019), we study the impact of various set (Wang et al., 2018), referring to N2C2-STS, pooling methods on STS, and find that convolution with limited annotated sentences pairs (1.6K) that filters coupled with max and mean pooling out- are rich in domain terms. perform a number of alternative approaches. This Neural STS models typically consist of encoders can largely be attributed to their improved model to generate text representations, and a regression expressiveness and ability to capture local interac- layer to measure the similarity score (He et al., tions (Yu et al., 2019). 
Next, we consider differ- 2015; Mueller and Thyagarajan, 2016; He and Lin, ent parameter fine-tuning strategies, with varying 1https://portal.dbmi.hms.harvard.edu/ degrees of flexibility, ranging from keeping all pa- projects/n2c2-2019-t1/ rameters frozen during training to allowing all pa-

105 Proceedings of the BioNLP 2020 workshop, pages 105–111 Online, July 9, 2020 c 2020 Association for Computational Linguistics rameters to be updated. This allows us to identify 2.2 Data Augmentation the optimal model flexibility without over-tuning, Synonym replacement is one of the most com- thereby further improving model performance. monly used data augmentation methods to simulate Finally, inspired by recent studies, including linguistic diversity, but it introduces ambiguity if sentence ordering prediction (Lan et al., 2019) accurate context-dependent disambiguation is not and data-augmented question answering (Yu et al., performed. Moreover, random selection and re- 2019), we focus on data augmentation methods to placement of a single word used in general texts is expand the modest amount of training data. We first not plausible for term-rich clinical text, resulting consider segment reordering (SR), in permuting in too much semantic divergence (e.g patient to af- segments that are delimited by commas or semi- fected role and discharge to home to spark to home). colons. Our second method increases linguistic By contrast, replacing a complete mention of the diversity with back translation (BT). Extensive concept can increase error propagation due to the experiments on N2C2-STS reveal the effective- prerequisite concept extraction and normalization. ness of data augmentation on clinical STS, par- Random insertion, deletion, and swapping of ticularly when combined with the best parameter words have been demonstrated to be effective on fine-tuning and pooling strategies identified in Sec- five text classification tasks (Wei and Zou, 2019). tion3, achieving an absolute gain in performance. But those experiments targeted topic prediction, in contrast to semantic reasoning such as STS and MultiNLI. Intuitively, they do not change the over- 2 Related Work all topic of a text, but can skew the meaning of a sentence, undermining the STS task. Swapping an 2.1 Model Configurations entire semantic segment may mitigate the risk of In pre-training, a spectrum of design choices have introducing label noise to the STS task. been proposed to optimize models, such as the pre- Compared to semantic and syntactic distortion training objective, training corpus, and hyperpa- potentially caused by aforementioned methods, rameter selection. Specific examples of objective back translation (BT) (Sennrich et al., 2016)— functions include masked language modeling in translating to a target language then back to the BERT, permutation language modeling in XLNet original language — presents fluent augmented (Yang et al., 2019), and sentence order prediction data and reliable improvements for tasks demand- (SOP) in ALBERT (Lan et al., 2019). Addition- ing for adequate semantic understanding, such as ally, RoBERTa (Liu et al., 2019) explored benefits low-resource machine translation (Xia et al., 2019) from a larger mini-batch size, a dynamic masking and question answering (Yu et al., 2019). This strategy, and increasing the size of the training cor- motivates our application of BT on low-resource pus (16G to 160G). However, all these efforts are clinical STS, to bridge linguistic variation between targeted at improving downstream tasks indirectly two sentences. This work represents the first explo- by optimizing the capability and generalizability of ration of applying BT for STS. LMs, while adapting a single fully-connected layer to capture task features. 
3 STS Model Configurations Sentence-BERT (Reimers and Gurevych, 2019) In this section, we study the impact of a number makes use of task-specific structures to optimize of model design choices on BERT for STS, using STS, concentrating on computational and time effi- a 12-layer base model initialized with pretrained ciency, and is evaluated on relatively larger datasets weights. in the general domain. For evaluating the impact of number of layers transferred to the supervised 3.1 Hierarchical Convolution (HConv) target task from the pre-trained language model, The resource-poor and concept-rich nature of clini- GPT-2 has been analyzed on two datasets. How- cal STS makes it difficult to train a large model end- ever, they are both large: MultiNLI (Williams et al., to-end on sentence pairs. To address this, most re- 2018) with >390k instances, and RACE (Lai et al., cent studies have made use of pre-trained language 2017) with >97k instances. These tasks also both models, such as BERT. The most straightforward involve reasoning-related classification, as opposed way to use BERT is the feature-based approach, to the nuanced regression task of STS. where the output of the last transformer block is

taken as input to the task-specific classifier. Many have proposed the use of a dummy CLS token to generate the feature vector, where CLS is a special symbol added in front of every sequence during pre-training, with its final hidden state always used as the aggregate sequence representation for classification tasks; we refer to this as CLS pooling. Other types of pooling, such as mean and max pooling, are investigated by Reimers and Gurevych (2019). However, this results in inferior performance,² as shown in the first row of Table 1. As a consequence, the best strategy for extracting feature vectors to represent a sentence remains an open question.

Model           SICK-R     STS-B      N2C2-STS
Feature-based:
CLS-BERT        53.6/52.1  49.3/67.9  14.6/28.4
HConv-BERT      80.1/73.6  83.0/83.2  79.4/74.4
Fine-tuning:
CLS-BERT        88.6/82.9  90.0/89.6  86.7/81.9
HConv-BERT      88.7/83.5  90.1/89.6  87.7/80.7

Table 1: Pearson and Spearman correlation (r/ρ) between the predicted score and the gold labels for three STS datasets, using the feature-based approach (upper half) and fine-tuning (bottom half) with CLS-BERT and HConv-BERT. Performance is reported by convention as r/ρ × 100.

In this work, we first experiment with the feature-based approach, coupled with convolutional filters. This is inspired by the use of convolutional filters in QANet (Yu et al., 2019) to capture local interactions. The difference lies in where the convolutional filters are applied. In QANet, multiple conv filters are incorporated into each transformer encoder block to process the input from the previous layer. In contrast, HConv-BERT is largely based on BERT, with the addition of a single task-specific classifier placed on top of BERT, consisting of conv filters organised in a hierarchical fashion. This results in a much simplified model, making HConv-BERT less prone to overfitting.

Specifically, we run a collection of convolutional filters with kernels of size $k \in [2, 4]$, each with $J = 768$ output channels (indexed by $j \in [1, J]$), over the temporal axis (indexed by $i \in [1, T]$):

² Due to space constraints, we limit our comparison to the CLS pooling strategy, based on the observation of little improvement when using other types of pooling (mean, max) and concatenation, or sequence-processing recurrent units.

$c_{i,k_j} = w_{k_j} * x_{i:i+k-1} + b_{k_j} \quad (1)$

$c_{i,k} = [c_{i,k_1}; \dots; c_{i,k_J}] \quad (2)$

where $x_{i:i+k-1}$ is the output BERT features for the token span $i$ to $i+k-1$, $*$ is the convolution operation, $w_{k_j}$ and $b_{k_j}$ are the convolution filter and bias term for the $j$-th kernel of size $k$, and $[a; b]$ denotes the concatenation of $a$ and $b$.

To capture interactions between distant elements, we feed the output $c_{i,k}$ into another convolution layer of kernel size 2 with $M = 128$ output channels (indexed by $m \in [1, M]$):

$c^{k}_{i,m} = w_m * c_{i:i+1,k} + b_m \quad (3)$

$c^{k}_{i} = [c^{k}_{i,1}; \dots; c^{k}_{i,M}] \quad (4)$

where $c_{i:i+1,k}$ is the output of the first convolutional layer over the span $i$ to $i+1$ as defined in Equation (2), and $w_m$ and $b_m$ are the filter and bias term for the second convolutional layer, with a kernel size of 2 and an output dimension of $M = 128$.

Lastly, we extract feature vectors by max and mean pooling over the temporal axis, followed by concatenation:

$v^{k}_{\max} = \max_i c^{k}_{i} \qquad v^{k}_{\mathrm{mean}} = \operatorname{avg}_i c^{k}_{i} \quad (5)$

$v = \big[v^{2}_{\max};\, v^{3}_{\max};\, v^{4}_{\max};\, v^{2}_{\mathrm{avg}};\, v^{3}_{\mathrm{avg}};\, v^{4}_{\mathrm{avg}}\big] \quad (6)$

The upper half of Table 1 shows that the proposed hierarchical convolutional (HConv) architecture provides substantial performance gains.

3.2 Model Flexibility

We also evaluate the utility of this mechanism in the fine-tuning setting with varying modelling flexibility. Concretely, we progressively increase the number of trainable parameters by transformer blocks. That is, for the base BERT model with 12 layers, we allow errors to be back-propagated through the last $l$ layers while keeping the rest ($12 - l$) fixed.

The results on STS-B and N2C2-STS are shown in Figure 1. We observe a performance crossover of HConv and CLS-pooling on both datasets as the number of trainable transformer layers increases. While HConv reaches peak performance before the crossover, CLS-pooling often requires more blocks to be trainable to achieve comparable accuracy, rendering the model much slower. Notably, the proposed mechanism peaks with far fewer trainable blocks on N2C2-STS than on STS-B. We speculate that this is due to the size difference between the two datasets. To verify this hypothesis, we further look into the relationship between the number of trainable transformer blocks and the training data size.
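A minimal PyTorch sketch of this hierarchical convolution head is given below. It is a reconstruction under assumed tensor shapes rather than the released implementation; in particular, the ReLU between the two convolution layers is an assumption, as the activation is not specified above.

import torch
import torch.nn as nn

class HConvHead(nn.Module):
    """Hierarchical convolution pooling over BERT token features."""
    def __init__(self, hidden=768, mid=768, out=128, kernels=(2, 3, 4)):
        super().__init__()
        self.first = nn.ModuleList(
            [nn.Conv1d(hidden, mid, kernel_size=k) for k in kernels])   # Eqs. (1)-(2)
        self.second = nn.ModuleList(
            [nn.Conv1d(mid, out, kernel_size=2) for _ in kernels])      # Eqs. (3)-(4)

    def forward(self, bert_out):             # bert_out: (batch, seq_len, hidden)
        x = bert_out.transpose(1, 2)          # Conv1d expects (batch, channels, time)
        pooled = []
        for conv1, conv2 in zip(self.first, self.second):
            c = conv2(torch.relu(conv1(x)))   # (batch, out, time')
            pooled.append(c.max(dim=2).values)    # Eq. (5): max pooling over time
            pooled.append(c.mean(dim=2))          # Eq. (5): mean pooling over time
        return torch.cat(pooled, dim=1)           # Eq. (6): concatenation

A regression layer on top of the resulting 6 × 128-dimensional vector v then predicts the similarity score.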



Figure 1: Evaluation of CLS-BERT and HConv-BERT over datasets from the general (STS-B) and clinical (N2C2) domains. r refers to Pearson correlation. N2C2-STS is split into 1233 and 409 instances for training and dev.

In Figure 2, we observe performance degradation as the size of the training data shrinks, with the models trained on the full set achieving far superior Pearson correlation to those trained on the smaller subsets. Zooming into the curve for each subset, we find that peak performance is attained at different points depending on the data size: with the smallest dataset (500 instances), the number of parameter updates is also limited, and only updating the top few layers of transformer blocks is simply not enough to make the model fully adapt to the task. It is therefore beneficial to allow the model access to more trainable layers (e.g., 11) to improve performance. Based on this, we set the number of trainable blocks to 6 for SICK-R (consisting of 4,500 training instances), as presented in the bottom half of Table 1, with HConv outperforming CLS-pooling.

Figure 2: Impact of the number of trainable transformer blocks with HConv-BERT over different data sizes, randomly sampled from STS-B and ranging from 500 instances to the full set (5,749).

4 Data Augmentation


The accuracy of an STS model unsurprisingly depends on the amount of labeled data. This is reflected in Figure 2, where models trained with more data outperform those with fewer training instances. In this section, we propose two data augmentation methods, namely segment reordering (SR) and back translation (BT), to address the data sparsity issue in clinical STS.

108 Back translation. Inspired by the work of Yu Model r ρ et al.(2019), we make use of machine translation IBM-N2C2 90.1 — tools to perform back translation (BT). Here, we choose Chinese as the pivot language as it is linguis- BERTbase 86.7 81.9 tically distant to English and supported by mature + SR 87.1 80.8 commercial translation solutions. That is, we first + BT 87.2 81.7 translate from English to Chinese and then back BERTclinical 86.1 81.4 to English. We use Google Translate to translate + SR 87.4 82.7 each sentence in a sentence pair from English to + BT 88.6 82.4 3 Chinese, and Baidu Translation to translate back Conv1dBERTbase 87.7 80.7 to English. For example, for the original sentence + SR 88.0 81.4 negative for cough and stridor, the backtranslated + BT 88.1 82.2 result is bad for coughing and wheezing. We apply Conv1dBERTSTS-B 87.9 82.5 this to each sentence pair, doubling the amount of + SR 88.6 83.1 training data. + BT 89.4 83.0

5 Experiments Table 2: Pearson r and Spearman ρ on N2C2-STS for models with and without segment reordering (“SR”) 5.1 Experimental Setup and back translation (“BT”). We evaluate the effectiveness of SR and BT on N2C2-STS with four baseline mod- We additionally conduct a cross-domain exper- els: BERTbase (Devlin et al., 2019) and iment on BIOSSES (Sogancıo˘ glu˘ et al., 2017), BERTclinical (Alsentzer et al., 2019), both us- ing CLS-pooling and consisting of 12 layers; a biomedical literature STS dataset comprising 100 sentence pairs derived from the Text Analysis ConvBERTbase, based on BERTbase with hierarchi- cal convolution and fine-tuning over the last 4 lay- Conference Biomedical Summarization task with ers (consistent with our findings of the best model scores ranging from 0 (complete unrelatedness) to 4 (exact equivalence). Specifically, baseline model configuration in Section3); and ConvBERT STS-B, Pooling BERTbase and proposed ConvBERTSTS-B + where we take ConvBERTbase and fine-tune first over STS-B, before N2C2-STS. BT are both fine-tuned on N2C2-STS, and then ap- We split the training partition of N2C2-STS into plied with no further training to BIOSSES. Despite 1, 233 (train) and 409 (dev) instances, and report the increase in task difficulty, the proposed method results on the test set (412 instances). demonstrates strong generalisability, outperform- ing the baseline by an absolute gain of 2.4 and 3.9 5.2 Results to 85.42/82.83 (r/ρ). Experimental results are presented in Table2. We see clear benefits of the two proposed data aug- 6 Conclusions mentation methods, consistently boosting perfor- In this paper, we have presented an empirical study mance across all categories, with BT providing of the impact of a number of model design choices larger gains than SR. This is likely caused by the on a BERT-based approach to clinical STS. We rather na¨ıve implementation of SR, resulting in un- have demonstrated that the proposed hierarchical natural segment sequences. A possible fix to this is convolution mechanism outperforms a number of to further filter out such irregular statements with alternative conventional pooling methods. Also, we a language model pre-trained on clinical corpora. have investigated parameter fine-tuning strategies We leave this for future work. with varying degrees of flexibility, and identified It is impressive that the best-performing configu- the optimal number of trainable transformer blocks, ration ConvBERTSTS-B + BT is capable of achiev- thereby preventing over-tuning. Lastly, we have ing comparable results with the state-of-the-art verified the utility of two data augmentation meth- IBM-N2C2, an approach heavily reliant on exter- ods on clinical STS. It may be interesting to see the nal, domain-specific resources, and an ensemble of impact of leveraging target languages other than multiple pre-trained language models. Chinese in BT, which we leave for future work. 3https://fanyi.baidu.com/

References

Yazan Alaya AL-Khassawneh, Naomie Salim, and Adekunle Isiaka Obasae. 2016. Sentence similarity techniques for automatic text summarization. Journal of Soft Computing and Decision Support Systems, 3(3):35–41.

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada.

Courtney Corley and Rada Mihalcea. 2005. Measuring the semantic similarity of texts. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 13–18. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Hua He, Kevin Gimpel, and Jimmy Lin. 2015. Multi-perspective sentence similarity modeling with convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1576–1586, Lisbon, Portugal.

Hua He and Jimmy Lin. 2016. Pairwise word interaction modeling with deep neural networks for semantic similarity measurement. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 937–948, San Diego, California.

Angelos Hliaoutakis, Giannis Varelas, Epimenidis Voutsakis, Euripides GM Petrakis, and Evangelos Milios. 2006. Information retrieval by semantic similarity. International Journal on Semantic Web and Information Systems (IJSWIS), 2(3):55–73.

Doris Hoogeveen, Andrew Bennett, Yitong Li, Karin M Verspoor, and Timothy Baldwin. 2018. Detecting misflagged duplicate questions in community question-answering archives. In Twelfth International AAAI Conference on Web and Social Media.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Jonas Mueller and Aditya Thyagarajan. 2016. Siamese recurrent architectures for learning sentence similarity. In Thirtieth AAAI Conference on Artificial Intelligence.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3980–3990, Hong Kong, China.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany.

Gizem Soğancıoğlu, Hakime Öztürk, and Arzucan Özgür. 2017. BIOSSES: a semantic sentence similarity estimation system for the biomedical domain. Bioinformatics, 33(14):i49–i58.

Yanshan Wang, Naveed Afzal, Sunyang Fu, Liwei Wang, Feichen Shen, Majid Rastegar-Mojarad, and Hongfang Liu. 2018. MedSTS: a resource for clinical semantic textual similarity. Language Resources and Evaluation, pages 1–16.

Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6381–6387, Hong Kong, China.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Mengzhou Xia, Xiang Kong, Antonios Anastasopoulos, and Graham Neubig. 2019. Generalized data augmentation for low-resource translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5786–5796, Florence, Italy. Association for Computational Linguistics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5754–5764.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2019. QANet: Combining local convolution with global self-attention for reading comprehension. In The Sixth International Conference on Learning Representations (ICLR).

Entity-Enriched Neural Models for Clinical Question Answering

Bhanu Pratap Singh Rawat1,∗, Wei-Hung Weng2,3,#, So Yeon Min2,3,§, Preethi Raghavan3,4,†, Peter Szolovits2,3,‡
1UMass-Amherst, 2MIT CSAIL, 3MIT-IBM Watson AI Lab, 4IBM Research, Cambridge
∗[email protected], #[email protected], §[email protected], †[email protected], ‡[email protected]

Abstract

We explore state-of-the-art neural models for question answering on electronic medical records and improve their ability to generalize better on previously unseen (paraphrased) questions at test time. We enable this by learning to predict logical forms as an auxiliary task along with the main task of answer span detection. The predicted logical forms also serve as a rationale for the answer. Further, we also incorporate medical entity information in these models via the ERNIE (Zhang et al., 2019a) architecture. We train our models on the large-scale emrQA dataset and observe that our multi-task entity-enriched models generalize to paraphrased questions ∼5% better than the baseline BERT model.

Context: The patient had an elective termination of her pregnancy on [DATE]. The work-up for the extent of the patient's disease included mri scan of the cervical and thoracic spine which revealed multiple metastatic lesions in the vertebral bodies; a T3 lesion extending from the body to the right neural foramina with foraminal obstruction. An abdominal and pelvic ct scan with iv contrast revealed bilateral pulmonary nodules and bilateral pleural effusions, extensive liver metastases, narrowing of the intra hepatic ivc and distention of the azygous system suggestive of ivc obstruction by liver metastases.
Question: How was the patient's extensive liver metastases diagnosed?
Paraphrase: What diagnosis was used for the patient's extensive liver metastases?
Logical Form: {LabEvent (x) [date=x, result=x] OR ProcedureEvent (x) [date=x, result=x] OR VitalEvent (x) [date=x, result=x]} reveals ConditionEvent (|problem|)
Answer: An abdominal and pelvic ct scan with iv contrast

Figure 1: A synthetic example of a clinical context, question, its logical form and the expected answer.

1 Introduction

The field of question answering (QA) has seen significant progress with several resources, models and benchmark datasets. Pre-trained neural language encoders like BERT (Devlin et al., 2019) and its variants (Seo et al., 2016; Zhang et al., 2019b) have achieved near-human or even better performance on popular open-domain QA tasks such as SQuAD 2.0 (Rajpurkar et al., 2016). While there has been some progress in biomedical QA on medical literature (Šuster and Daelemans, 2018; Tsatsaronis et al., 2012), existing models have not been similarly adapted to the clinical domain of electronic medical records (EMRs).

Community-shared large-scale datasets like emrQA (Pampari et al., 2018) allow us to apply state-of-the-art models, establish benchmarks, innovate and adapt them to clinical domain-specific needs. emrQA enables question answering from electronic medical records (EMRs), where a question is asked by a physician against a patient's medical record (clinical notes). Thus, we adapt these models for EMR QA while focusing on model generalization via the following: (1) learning to predict the logical form (a structured semantic representation that captures the answering needs corresponding to a natural language question) along with the answer, and (2) incorporating medical entity embeddings into models for EMR QA. We now examine the motivation behind these.

A physician interacting with a QA system on EMRs may ask the same question in several different ways; one physician may frame a question as "Is the patient allergic to penicillin?" whereas another could frame it as "Does penicillin cause any allergic reactions to the patient?". Since paraphrasing is a common form of generalization in natural language processing (NLP) (Bhagat et al., 2009), a QA model should be able to generalize well to such paraphrased question variants that may not be seen during training (and avoid simply memorizing the questions).

However, current state-of-the-art models do not consider the use of meta-information such as the semantic parse or logical form of the question in unstructured QA. In order to give the model the ability to understand the semantic information about the answering needs of a question, we frame our problem in a multi-task learning setting where the primary task is extractive QA and the auxiliary task is the logical form prediction of the question.

∗ The author did this work while interning at MIT-IBM Watson AI Lab.

Fine-tuning on medical corpora (MIMIC-III, PubMed (Johnson et al., 2016; Lee et al., 2020)) helps models like BERT align their representations with medical vocabulary (since they are previously trained on open-domain corpora such as WikiText (Zhu et al., 2015)). However, another challenge for developing EMR QA models is that different physicians can use different medical terminology to express the same entity; e.g., "heart attack" vs. "myocardial infarction". Mapping these phrases to the same UMLS semantic type1, such as disease or syndrome (dsyn), provides common information between such medical terminologies. Incorporating such entity information about tokens in the context and question can further improve the performance of QA models for the clinical domain.

1 https://metamap.nlm.nih.gov/SemanticTypesAndGroups.shtml

Our contributions are as follows:

1. We establish state-of-the-art benchmarks for EMR QA on a large clinical question answering dataset, emrQA (Pampari et al., 2018).

2. We demonstrate that incorporating an auxiliary task of predicting the logical form of a question helps the proposed models generalize well over unseen paraphrases, improving the overall performance on emrQA by ∼5% over BERT (Devlin et al., 2019) and by ∼3.5% over clinicalBERT (Alsentzer et al., 2019). We support this hypothesis by running our proposed model over both emrQA and another clinical QA dataset, MADE (Jagannatha et al., 2019).

3. The predicted logical form for unseen paraphrases helps in understanding the model better and provides a rationale (explanation) for why the answer was predicted for the provided question. This information is critical in the clinical domain as it provides an accompanying answer justification for clinicians.

4. We incorporate medical entity information by including entity embeddings via the ERNIE architecture (Zhang et al., 2019a) and observe that the model accuracy and ability to generalize go up by ∼3% over BERTbase (Devlin et al., 2019).

2 Problem Formulation

We formulate the EMR QA problem as a reading comprehension task. Given a natural language question (asked by a physician) and a context, where the context is a set of contiguous sentences from a patient's EMR (unstructured clinical notes), the task is to predict the answer span from the given context. Along with the (question, context, answer) triplet, also available as input are clinical entities extracted from the question and context. Also available as input is the logical form (LF), a structured representation that captures the answering needs of the question through the entities, attributes and relations required to be in the answer (Pampari et al., 2018). A question may have multiple paraphrases, where all paraphrases map to the same LF (and the same answer, Fig. 1).

3 Methodology

In this section, we briefly describe BERT (Devlin et al., 2019), ERNIE (Zhang et al., 2019a) and our proposed model.

3.1 Bidirectional Encoder Representations from Transformers (BERT)

BERT (Devlin et al., 2019) uses multi-layer bidirectional Transformer (Vaswani et al., 2017) networks to encode contextualised language representations. BERT representations are learned from two tasks: masked language modeling (Taylor, 1953) and next sentence prediction. We chose BERT as pre-trained BERT models can be fine-tuned with just one additional inference layer, and BERT achieved state-of-the-art results for a wide range of tasks, such as question answering (SQuAD; Rajpurkar et al., 2016, 2018) and multiple language inference tasks such as MultiNLI (Williams et al., 2017). clinicalBERT (Alsentzer et al., 2019) yielded superior performance on clinical NLP tasks such as the i2b2 named entity recognition (NER) challenges (Uzuner et al., 2011). It was created by further fine-tuning BERTbase with biomedical and clinical corpora (MIMIC-III) (Johnson et al., 2016).
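Before turning to the entity-enriched architecture, the plain extractive (span-prediction) formulation from Sections 2 and 3.1 can be sketched as follows. This is an illustration using the Hugging Face transformers API, not the authors' released code; the clinicalBERT checkpoint name and the toy question/context are assumptions, and the span head is randomly initialised until fine-tuned on emrQA-style data.

```python
# Minimal sketch of BERT-style extractive QA: score every token as a possible
# answer-span start/end and decode the highest-scoring span.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "emilyalsentzer/Bio_ClinicalBERT"  # assumed clinicalBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "What is the current dose of the patient's lisinopril?"
context = "The patient remains on lisinopril 20 mg daily for hypertension."

inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Pick the most likely start/end positions and decode the answer span.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start:end + 1])
print(answer)
```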

[Figure 2: architecture diagram. Shown components: the question ("Ques: Has the patient ever gone into edema?") and context ("Context: Extremities no clubbing, ... cyanosis or edema.") feed a token multi-head attention model; MetaMap tags clinical entities (e.g., Finding) that feed a separate entity multi-head attention model; both feed an information fusion layer; the fused token representations feed a span selection layer (start/end tokens) and the pooled sequence representation feeds a logical form inference layer predicting, e.g., ConditionEvent(|problem|) OR SymptomEvent(|problem|).]

Figure 2: The network architecture of our multi-task learning question answering model (M-cERNIE). The question and context are provided to a multi-head attention model and are also passed through MetaMap to extract clinical entities, which are passed through a separate multi-head attention model. The token and entity representations are then passed through an information fusion layer to extract entity-enriched token representations, which are then used for answer span prediction. The pooled sequence representation from the information fusion layer is passed through the logical form inference layer to predict the logical form.

3.2 Enhanced Language Representation with Informative Entities (ERNIE)

We adopt the ERNIE framework (Zhang et al., 2019a) to integrate entity-level clinical concept information into the BERT architecture, which has not yet been explored in previous works. ERNIE has shown significant improvement in different entity typing and relation classification tasks, as it utilises the extra entity information provided by knowledge graphs. ERNIE uses BERT for extracting contextualized token embeddings and a multi-head attention model to generate entity embeddings. These two sets of embeddings are aligned and provided as input to an information fusion layer, which produces entity-enriched token embeddings. For a token (w_j) and its aligned entity (e_k = f(w_j)), the information fusion process is as follows:

h_j^{(i)} = \sigma\big(W_t^{(i)} w_j^{(i)} + W_e^{(i)} e_k^{(i)} + b^{(i)}\big)    (1)

Here h_j represents the entity-enriched token embedding, \sigma is the non-linear activation function, W_t refers to an affine layer for token embeddings and W_e refers to an affine layer for entity embeddings. For tokens without corresponding entities, the information fusion process becomes:

h_j^{(i)} = \sigma\big(W_t^{(i)} w_j^{(i)} + b^{(i)}\big)    (2)

Initially, each entity embedding is assigned randomly and is fine-tuned along with the token embeddings throughout the training procedure. The ERNIE architecture would be applicable to the model even if the logical forms are not available.

3.3 Multi-task Learning for Extractive QA

In order to improve the ability of a QA model to generalize better over paraphrases, it helps to provide the model information about the logical form that links these paraphrases. Since the answer to all the paraphrased questions is the same (and hence the logical form is the same), we constructed a multi-task learning framework to incorporate the logical form information into the model. Thus, along with predicting the answer span, we added an auxiliary task to also predict the corresponding logical form of the question. Multi-task learning provides an inductive bias to enhance the primary task's performance via auxiliary tasks (Weng et al., 2019). In our setting, the primary task is span detection of the answer and the auxiliary task is logical form prediction, for both emrQA and MADE (both datasets are explained in detail in §4). The final loss for our model is defined as:

\mathcal{L}_{model} = \omega \mathcal{L}_{lf} + (1 - \omega) \mathcal{L}_{span}    (3)

where \omega is the weight given to the loss of the auxiliary task (\mathcal{L}_{lf}), logical form prediction, \mathcal{L}_{span} is the loss for answer span prediction and \mathcal{L}_{model} is the final loss for our proposed model. The multi-task learning model can work with both BERT and ERNIE as the base model. Figure 2 depicts the proposed multi-task model used to predict both the answer and logical form given a question, and the ERNIE architecture that is used to learn entity-enriched token embeddings.
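The following PyTorch sketch paraphrases Equations (1)-(3): an information fusion layer that mixes token and entity embeddings, and the weighted multi-task loss. Dimensions, the GELU activation and module names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the information fusion layer (Eqs. 1-2) and multi-task loss (Eq. 3).
import torch
import torch.nn as nn

class InformationFusion(nn.Module):
    def __init__(self, token_dim=768, entity_dim=100, hidden_dim=768):
        super().__init__()
        self.w_t = nn.Linear(token_dim, hidden_dim)                 # affine layer for tokens
        self.w_e = nn.Linear(entity_dim, hidden_dim, bias=False)    # affine layer for entities
        self.act = nn.GELU()                                        # assumed non-linearity

    def forward(self, token_emb, entity_emb=None):
        h = self.w_t(token_emb)
        if entity_emb is not None:      # Eq. (1): token aligned with a clinical entity
            h = h + self.w_e(entity_emb)
        return self.act(h)              # Eq. (2) when no entity is aligned

def multitask_loss(lf_logits, lf_gold, start_logits, end_logits, starts, ends, omega=0.3):
    """Eq. (3): weighted sum of logical-form and answer-span losses."""
    ce = nn.CrossEntropyLoss()
    span_loss = (ce(start_logits, starts) + ce(end_logits, ends)) / 2
    lf_loss = ce(lf_logits, lf_gold)
    return omega * lf_loss + (1 - omega) * span_loss
```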

4 Datasets

We used the emrQA2 and MADE3 datasets for our experiments. We provide a brief summary of each dataset and the methodology followed to split these datasets into train and test sets.

emrQA: The emrQA corpus (Pampari et al., 2018) is the only community-shared clinical QA dataset that consists of questions, posed by physicians against the electronic medical records (EMRs) of a patient, along with their answers. The dataset was developed by leveraging existing annotations available for other clinical natural language processing (NLP) tasks (i2b2 challenge datasets (Uzuner et al., 2011)). It is a credible resource for clinical QA, as logical forms generated by a physician help slot fill question templates and extract corresponding answers from annotated notes. Multiple question templates can be mapped to the same logical form (LF), as shown in Table 1, and are referred to as paraphrases of each other.

LF: MedicationEvent (|medication|) [dosage=x]
  How much |medication| does the patient take per day?
  What is her current dose of |medication|?
  What is the current dose of the patient's |medication|?
  What is the current dose of |medication|?
  What is the dosage of |medication|?
  What was the dosage prescribed of |medication|?

Table 1: A logical form (LF) and its respective question templates (paraphrases).

The emrQA corpus has over 1M+ question, logical form, and answer/evidence triplets; an example of a context, question, its logical form and a paraphrase is shown in Fig. 1. The evidences are the sentences from the clinical note that are relevant to a particular question. There are a total of 30 logical forms in the emrQA dataset.4

MADE: MADE 1.0 (Jagannatha et al., 2019) was hosted as an adverse drug reactions (ADRs) and medication extraction challenge from EMRs. This dataset was converted into a QA dataset by following the same procedure as enumerated in the emrQA literature (Pampari et al., 2018). The MADE QA dataset is smaller than emrQA, as emrQA consists of multiple datasets taken from i2b2 (Uzuner et al., 2011) whereas MADE only has relations and entity mentions specific to ADRs and medications. This resulted in a clinical QA dataset with different properties from emrQA. MADE also has fewer logical forms (8 LFs) than emrQA because of its fewer entities and relations. The 8 LFs for MADE are provided in Appendix B.

2 https://github.com/panushri25/emrQA
3 https://bio-nlp.org/index.php/projects/39-nlp-challenges
4 https://github.com/panushri25/emrQA/blob/master/templates/templates-all.csv

4.1 Train/test splits

The emrQA dataset is generated using a semi-automated process that normalizes real physician questions to create question templates, associates expert-annotated logical forms with each template and slot fills them using annotations for various NLP tasks from i2b2 challenge datasets (e.g., Fig. 1). emrQA is rich in paraphrases, as physicians often tend to express the same information need in different ways. As shown in Table 1, all paraphrases of a question map to the same logical form. Thus, if a model has observed some of the paraphrases it should be able to generalize to the others effectively with the help of their shared logical form "MedicationEvent (|medication|) [dosage=x]". In order to simulate this, and test the true capability of the model to generalize to unseen paraphrased questions, we create a splitting scheme and refer to it as paraphrase-level split.

Paraphrase-level split: The basic idea is that some of the question templates would be observed by the model during training and the remaining would be used during validation and testing. The steps taken for creating this split are enumerated below (a small illustrative sketch follows Table 2):

1. First, the clinical notes are separated into train, val and test sets. Then the question, logical form and context triplets are generated for each set, resulting in the full dataset. Here the context is the set of contiguous sentences from the EMR.

2. Then, for each logical form (LF), 70% of its corresponding question templates are chosen for the train dataset and the rest are kept for the validation and test datasets. Considering the LF shown in Table 1, four of the question templates (QTtr) would be assigned for training and two of them (QTv/t) would be assigned for validation/testing. So any sample in the training dataset whose question is generated from the question template set QTv/t would be discarded. Similarly, any validation/test sample with a question generated from the question template set QTtr would be discarded.

3. To compare the generalizability performance of our model, we keep the training dataset with both sets of question templates (QTtr + QTv/t) as well. Essentially, a baseline model which has observed all the question templates (QTtr + QTv/t) should be able to perform better on the QTv/t set as compared to a model which has only observed the QTtr set. This comparison helps us in measuring the improvement in performance obtained with the help of logical forms even when a set of question templates is not observed by the model.

The dataset statistics for both emrQA and MADE are shown in Table 2. The training set with both question template sets (QTtr + QTv/t) is shown with '(r)' appended as a suffix, as it is essentially a random split, whereas the training set with only the question templates QTtr is appended with '(pl)' for paraphrase-level split.

Datasets   Split            Train     Val.     Test
emrQA      # Notes            433       44       47
           # Samples (pl) 133,589   21,666   19,401
           # Samples (r)  198,118   21,666   19,401
MADE       # Notes            788       88      213
           # Samples (pl)  73,224    4,806    9,235
           # Samples (r)  113,975    4,806    9,235

Table 2: Train, validation and test data splits.
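The splitting procedure enumerated in Section 4.1 can be sketched as follows. Data structures and the random seed are hypothetical; the released emrQA code may differ.

```python
# Sketch of the paraphrase-level split: for each logical form, 70% of its
# question templates go to training, the rest are held out for validation/test,
# so test-time paraphrases are unseen during training.
import random

def paraphrase_level_split(templates_by_lf, train_frac=0.7, seed=13):
    train_templates, heldout_templates = set(), set()
    rng = random.Random(seed)
    for lf, templates in templates_by_lf.items():
        templates = list(templates)
        rng.shuffle(templates)
        cut = int(round(train_frac * len(templates)))
        train_templates.update(templates[:cut])
        heldout_templates.update(templates[cut:])
    return train_templates, heldout_templates

# Any generated (question, context, answer) sample whose source template falls
# in the held-out set is dropped from training, and vice versa for evaluation.
```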

5 Experiments

In this section, we briefly discuss the experimental settings, the clinical entity extraction method, implementation details of our proposed model and the evaluation metrics for our experiments.

5.1 Experimental Setting

As a reading comprehension style task, the model has to identify the span of the answer given the question-context pair. For both the emrQA and MADE datasets, the span is marked as the answer to the question and the sentence is marked as the evidence. Hence, we perform extractive question answering at two levels: sentence and paragraph.

Sentence setting: For this setting, the evidence sentence which contains the answer span is provided as the context to the question, and the model has to predict the span of the answer given the question.

Paragraph setting: Clinical notes are noisy and often contain incomplete sentences, lists and embedded tables, making it difficult to segment paragraphs in notes. Hence, we decided to define the context as the evidence sentence and 15−20 sentences around it. We randomly chose the length of the paragraph (l_para) and another number less than the length of the paragraph (l_pre < l_para). We chose l_pre contiguous sentences which exist prior to the evidence sentence in the EMR and (l_para − l_pre) sentences after the evidence sentence. We adopted this strategy because the model could otherwise have benefited from the information that the evidence sentence is exactly in the middle of a fixed-length paragraph. The model has to predict the span of the answer from the l_para-sentence-long paragraph (context) given the question.

The datasets are appended with '-p' and '-s' for the paragraph and sentence settings respectively. The sentence setting is a relatively easier setting for the model, compared to the paragraph setting, because the scope of the answer is narrowed down to a smaller number of tokens and there is less noise. For both settings, as also mentioned in §4, we kept the train set where all the question templates (paraphrases) are observed by the model during training; that set is referred to with the '(r)' suffix, suggesting 'random' selection and no filtering based on question templates (paraphrases). All these dataset abbreviations are shown in the first column of Table 3.

5.2 Extracting Entity Information

MetaMap (Aronson, 2001) uses a knowledge-intensive approach to discover the different clinical concepts referred to in text according to the unified medical language system (UMLS) (Bodenreider, 2004). The clinical ontologies, such as SNOMED (Spackman et al., 1997) and RxNorm (Liu et al., 2005), embedded in MetaMap are quite useful in extracting ∼127 entities across diagnosis, medication, procedure and sign/symptoms. We shortlisted these entities (semantic types) by mapping them to the entities which were used for creating the logical forms of the questions, as these are the main entities for which the questions have been posed. The selected entities are: acab, aggp, anab, anst, bpoc, cgab, clnd, diap, emod, evnt, fndg, inpo, lbpr, lbtr, phob, qnco, sbst, sosy and topp. Their descriptions are provided in Appendix C.

These filtered entities (Table 7), extracted from MetaMap, are provided to ERNIE. A separate embedding space is defined for the entity embeddings, which are passed through a multi-head attention layer (Vaswani et al., 2017) before interacting with the token embeddings in the information fusion layer. The entity-enriched token embeddings are then used to predict the span of the answer from the context. We fine-tuned these entity embeddings along with the token embeddings, as opposed to using learned entities and not fine-tuning during downstream tasks (Zhang et al., 2019a). The architecture is illustrated in Fig. 2.

5.3 Implementation Details

The BERT model was released with pre-trained weights as BERTbase and BERTlarge. BERTbase has a smaller number of parameters but achieved state-of-the-art results on a number of open-domain NLP tasks. We performed our experiments with BERTbase and hence, from here onwards, we refer to BERTbase as BERT. A version of BERTbase fine-tuned on clinical notes was released as clinicalBERT (cBERT) (Alsentzer et al., 2019). We use cBERT as the multi-head attention model for obtaining the token representations in ERNIE. We refer to this version of ERNIE, with entities from MetaMap, as cERNIE for clinical ERNIE. Our final multi-task learning model, incorporating the auxiliary task of predicting logical forms, is referred to as M-cERNIE for multi-task clinical ERNIE. The code for all the models is provided at https://github.com/emrQA/bionlp_acl20.

Evaluation Metrics: For our extractive question answering task, we utilised exact match and F1-score for evaluation, as per earlier literature (Rajpurkar et al., 2016).

Dataset        Model      F1-score        Exact Match
emrQA-s (pl)   BERT       72.13           65.81
               cBERT      74.75 (+2.62)   67.25 (+1.44)
               cERNIE     77.39 (+5.26)   70.17 (+4.36)
               M-cERNIE   79.87 (+7.74)   71.86 (+6.05)
emrQA-s (r)    cBERT      82.34           74.58
emrQA-p (pl)   BERT       64.19           56.30
               cBERT      65.45 (+1.26)   57.58 (+1.28)
               cERNIE     66.15 (+1.96)   59.80 (+3.5)
               M-cERNIE   67.21 (+3.02)   61.22 (+4.92)
emrQA-p (r)    cBERT      72.51           65.14
MADE-s (pl)    BERT       68.45           60.73
               cBERT      70.19 (+1.74)   62.00 (+1.27)
               cERNIE     71.51 (+3.06)   65.31 (+4.58)
               M-cERNIE   73.83 (+5.38)   67.53 (+6.8)
MADE-s (r)     cBERT      73.70           65.54
MADE-p (pl)    BERT       63.39           57.49
               cBERT      64.97 (+1.58)   58.94 (+1.45)
               cERNIE     65.71 (+2.32)   60.55 (+3.06)
               M-cERNIE   64.58 (+1.19)   59.39 (+1.9)
MADE-p (r)     cBERT      66.89           61.27

Table 3: F1-score and exact match values for models on emrQA and MADE. The '-s' suffix refers to the sentence setting and '-p' to the paragraph setting for the context provided in our reading comprehension style QA task. '(pl)' refers to the paraphrase-level split and '(r)' to the random split, as explained in §4. BERT refers to BERTbase, cBERT to clinicalBERT, cERNIE to clinical ERNIE and M-cERNIE to the multi-task learning clinical ERNIE model.

6 Results and Discussion

In this section, we compare the results of all the models introduced in §3. With the help of different experiments, we analyse whether the induced entity and logical form information help the model achieve better performance or not. We also analyse the logical form predictions to understand whether they provide a rationale for the answer predicted by our proposed model. The compiled results for all the models are shown in Table 3. The hyper-parameter values for the best performing models are provided in Appendix A.

Does clinical entity information improve models' performance? Across all settings, the F1-score of cERNIE improves by ∼2−5% over BERT and ∼0.75−3% over cBERT. The exact match performance improved by ∼3−4.5% over BERT and ∼1.5−3.25% over cBERT. Also, as expected, the performance in the sentence setting (-s) improved relatively more than in the paragraph setting. The entity-enriched tokens help in identifying the tokens which are required by the question. For example, in Fig. 3, the token 'infiltrative' in the question as well as the context gets highlighted with the help of the identified entity 'topp' (therapeutic or preventive procedure), and then relevant tokens in the context, chest x ray, get highlighted with the relevant entity 'diap' (diagnostic procedure). This information aids the model in narrowing down its focus to highlighted diagnostic procedures in the context for answer extraction.
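The exact match and F1-score figures in Table 3 follow the SQuAD-style span-level metrics (Rajpurkar et al., 2016). The simplified sketch below illustrates them; the official evaluation script additionally strips articles and punctuation, which is omitted here.

```python
# Simplified span-level exact match and token-level F1 for extractive QA.
from collections import Counter

def exact_match(pred, gold):
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```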

Question: How was diffuse infiltrative[topp] diagnosed[fndg]?
Context: Earlier that day, pt had a chest x ray[diap] which showed diffuse infiltrative[topp] process concerning for ARDS.
Answer: chest x ray

Figure 3: An example of a question, a context, their extracted entities and the expected answer.

Does logical form information help the model generalize better? In order to answer this question, we compared the performance of our M-cERNIE model to the cERNIE model and observed that the logical form information helps the model in understanding the information need expressed in the question and helps in narrowing down its focus to certain tokens as the candidate answer. As seen in example 3, the logical form helps in understanding that the 'dose' of the 'medication' needs to be extracted from the context, where 'dose' was already highlighted with the help of the entity embedding of 'qnco'.

Overall, the performance of our proposed model improves the F1-score by ∼1.2−7.7% and exact match by ∼3.1−6.8% over the BERT model. Thus, embedding clinical entity information with the help of further fine-tuning, entity enriching and logical form prediction helps the model perform better over the unseen paraphrases by a significant margin. For emrQA, the performance of M-cERNIE is still below the upper-bound performance of the cBERT model, which is achieved when all the question templates are observed (emrQA-s/p (r)) by the model; but for MADE, in the sentence setting (-s), the performance of M-cERNIE is even better than the upper-bound model performance. For the MADE-p dataset the performance dropped a little when the LF prediction information was added to the model, which might be because MADE only has 8 logical forms (Appendix B) in total, resulting in low variety between the questions. Thus, the auxiliary task did not add much value to the learning of the base model (cERNIE) at the paragraph level.

Does the model provide a supporting rationale via logical form (LF) prediction? We analyzed the performance of M-cERNIE on the MADE-s and emrQA-s datasets for logical form prediction, as we saw the most improvement in the sentence setting (-s). We calculated macro-weighted precision, recall and F1-score for logical form classification. The model achieved an F1-score of ∼0.45−0.59 for both datasets, as shown in Table 4 (exact setting). We analysed the confusion matrix of predicted LFs and observed that the model mainly gets confused between logical forms which convey similar semantic information, as shown in Fig. 4.

[Figure 4: Two similar questions with different logical forms (LFs) but overlapping answer conditions; e.g., Q1 "What were the results of the abnormal BMI on 2094-12-02?" maps to a logical form over LabEvent/ProcedureEvent/VitalEvent with abnormal-result conditions. The full figure content is not recoverable from the extraction.]

As we can see in Fig. 4, both logical forms refer to quite similar information; hence, we decided to also obtain the performance metrics (precision, recall and F1-score) in a relaxed setting. We designed this relaxed setting to create a more realistic evaluation, where the tokens of the predicted and actual logical forms are matched rather than the whole logical form. An example of logical form tokenization is shown in Fig. 5.

LF: MedicationEvent (x) given {ConditionEvent (|problem|) OR SymptomEvent (|problem|)}
Tokenized: ['MedicationEvent (x)', 'given', 'ConditionEvent (|problem|)', 'OR', 'SymptomEvent (|problem|)']

Figure 5: Tokenized logical form (LF).

The model achieves an F1-score of 0.92 for emrQA-s and 0.84 for MADE-s in the relaxed setting (Table 4). This suggests that the model can efficiently identify important semantic information from the question, which is critical for efficient QA. During inference, the M-cERNIE model yields a rationale regarding a new test question (unseen paraphrase) by predicting the logical form of the question as an auxiliary task. For example, the LF in Fig. 1 provides a rationale that any lab or procedure event related to the condition event needs to be extracted from the EMR for diagnosis.
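The relaxed evaluation can be sketched as follows. The tokenizer is a simplified assumption based on the example in Figure 5 (splitting on connectives such as 'given' and 'OR' while keeping event units intact); the authors' exact tokenization may differ.

```python
# Sketch of "relaxed" logical-form scoring: token-level overlap between the
# predicted and gold logical forms instead of exact string match.
import re

def tokenize_lf(lf):
    # Split on connectives while keeping event units like "MedicationEvent (x)"
    # or "ConditionEvent (|problem|)" intact; the capture group keeps the
    # connective tokens, matching the example in Figure 5.
    units = re.split(r"\s+(OR|AND|given|reveals)\s+",
                     lf.replace("{", "").replace("}", ""))
    return [u.strip() for u in units if u.strip()]

def relaxed_overlap(pred_lf, gold_lf):
    pred, gold = set(tokenize_lf(pred_lf)), set(tokenize_lf(gold_lf))
    if not pred or not gold:
        return 0.0, 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    return precision, recall
```

For the Figure 5 example, tokenize_lf returns exactly the five units shown there.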

Setting    Dataset   Precision   Recall   F1-score
Exact      emrQA        0.65      0.61      0.59
           MADE         0.47      0.52      0.45
Relaxed    emrQA        0.93      0.91      0.92
           MADE         0.83      0.85      0.84

Table 4: Precision, recall and F1-score for logical form prediction.

Can logical form information be induced in multi-class QA tasks as well? To answer this question, we performed another experiment where the model has to classify the evidence sentences from the non-evidence sentences, making it a two-class classification task. The model is provided a tuple of a question and a sentence, and it has to predict whether the sentence is evidence or not. The final loss of the model (\mathcal{L}_{model}) changes to:

\mathcal{L}_{model} = \omega \mathcal{L}_{lf} + (1 - \omega) \mathcal{L}_{evidence}    (4)

where \omega is the weight given to the loss of the auxiliary task (\mathcal{L}_{lf}), logical form prediction, \mathcal{L}_{evidence} is the loss for evidence classification and \mathcal{L}_{model} is the final loss for our proposed model. We conducted our experiments on the emrQA dataset as evidence sentences were provided in it. In the multi-class setting, the [CLS] token representation is used for evidence classification as well as logical form prediction.

Dataset   Model      Precision   Recall   F1-score
emrQA     cBERT         0.67      0.99      0.76
          cERNIE        0.69      0.98      0.78 (+0.02)
          M-cERNIE      0.73      0.99      0.82 (+0.06)

Table 5: Macro-weighted precision, recall and F1-score of the proposed models on the test dataset (multi-choice QA). For the model names, c: clinical; M: multitask.

The multi-task entity-enriched model (M-cERNIE) achieved an absolute improvement of 6% over cBERT and 4% over cERNIE. This suggests that the inductive bias introduced via LF prediction does help in improving the overall performance of the model for multi-class QA as well.
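The macro-weighted precision, recall and F1-score reported in Tables 4 and 5 can be computed with scikit-learn, as in the small sketch below; the label names and toy predictions are illustrative only.

```python
# Illustrative macro-averaged precision/recall/F1 for the evidence classifier.
from sklearn.metrics import precision_recall_fscore_support

gold = ["evidence", "not_evidence", "evidence", "not_evidence"]
pred = ["evidence", "evidence", "evidence", "not_evidence"]
p, r, f1, _ = precision_recall_fscore_support(
    gold, pred, average="macro", zero_division=0
)
print(p, r, f1)
```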

7 Related Work

In the general domain, BERT-based models are at the top of different leader boards across various tasks, including QA tasks (Rajpurkar et al., 2018, 2016). The authors of (Nogueira and Cho, 2019) applied BERT to the MS-MARCO passage retrieval QA task and observed improvement over state-of-the-art results. (Nogueira et al., 2019) further extended the work by combining BERT with re-ranking of predictions for queries that will be issued for each document. However, BERT-based models have not been adapted to answering physician questions on EMRs.

In the case of domain-specific QA, logical forms or semantic parses are typically used to integrate the domain knowledge associated with KB-based (knowledge base) structured QA datasets, where a model is learnt for mapping a natural language question to a LF. GeoQuery (Zelle and Mooney, 1996) and ATIS (Dahl et al., 1994) are the oldest known manually generated question-LF annotations on closed-domain databases. QALD (Lopez et al., 2013), FREE 917 (Cai and Yates, 2013) and SIMPLEQuestions (Bordes et al., 2015) contain hundreds of hand-crafted questions and their corresponding database queries. Prior work has also used LFs as a way to generate questions via crowdsourcing (Wang et al., 2015). WEBQuestions (Berant et al., 2013) contains thousands of questions from Google search where the LFs are learned as latent representations that help answer questions from Freebase. Prior work has not investigated the utility of logical forms in unstructured QA, especially as a means to generalize the QA model across different paraphrases of a question.

There have been efforts on using multi-task learning for efficient question answering: the authors of (McCann et al., 2018) learned multiple tasks together, resulting in an overall boost in the performance of the model on SQuAD (Rajpurkar et al., 2016). Similarly, the authors of (Lu et al., 2019) utilised information across different tasks lying at the intersection of vision and natural language processing to improve the performance of their model across all tasks. The authors of (Rawat et al., 2019) provided weak supervision to the model while predicting the answer, but not much work has been done to incorporate the logical form of the question for unstructured question answering in a multi-task setting. Hence, we decided to explore this direction and incorporate the structured semantic information of the questions for extractive question answering.

8 Conclusion

The proposed entity-enriched QA models trained with an auxiliary task improve over the state-of-the-art models by about 3−6% across the large-scale clinical QA dataset, emrQA (Pampari et al., 2018) (as well as MADE (Jagannatha et al., 2019)). We also show that multitask learning for logical forms along with the answer results in better generalization over unseen paraphrases for EMR QA. The predicted logical forms also serve as an accompanying justification to the answer and help in adding credibility to the predicted answer for the physician.

Acknowledgement

This work is supported by MIT-IBM Watson AI Lab, Cambridge, MA USA.

References

Emily Alsentzer, John R Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. NAACL-HLT Clinical NLP Workshop.

Alan R Aronson. 2001. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In AMIA, page 17.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544.

Rahul Bhagat, Eduard Hovy, and Siddharth Patwardhan. 2009. Acquiring paraphrases from text corpora. In Proceedings of the Fifth International Conference on Knowledge Capture, pages 161–168.

Olivier Bodenreider. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.

Qingqing Cai and Alexander Yates. 2013. Large-scale semantic parsing via schema matching and lexicon extension. In ACL, volume 1, pages 423–433.

Deborah A Dahl, Madeleine Bates, Michael Brown, William Fisher, Kate Hunicke-Smith, David Pallett, Christine Pao, Alexander Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In HLT, pages 43–48.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT.

Abhyuday Jagannatha, Feifan Liu, Weisong Liu, and Hong Yu. 2019. Overview of the first natural language processing challenge for extracting medication, indication, and adverse drug events from electronic health record notes (MADE 1.0). Drug Safety, 42(1):99–111.

Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Simon Liu, Wei Ma, Robin Moore, Vikraman Ganesan, and Stuart Nelson. 2005. RxNorm: prescription for electronic drug information exchange. IT Professional, 7(5):17–23.

Vanessa Lopez, Christina Unger, Philipp Cimiano, and Enrico Motta. 2013. Evaluating question answering over linked data. Web Semantics: Science, Services and Agents on the World Wide Web, 21:3–13.

Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 2019. 12-in-1: Multi-task vision and language representation learning. arXiv preprint arXiv:1912.02315.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.

Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085.

Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document expansion by query prediction. arXiv preprint arXiv:1904.08375.

Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrQA: A large corpus for question answering on electronic medical records. EMNLP.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Bhanu Pratap Singh Rawat, Fei Li, and Hong Yu. 2019. Naranjo question answering using end-to-end multi-task learning model. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2547–2555. ACM.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.

Kent A Spackman, Keith E Campbell, and Roger A Côté. 1997. SNOMED RT: a reference terminology for health care. In Proceedings of the AMIA Annual Fall Symposium, page 640. American Medical Informatics Association.

Simon Šuster and Walter Daelemans. 2018. CliCR: a dataset of clinical case reports for machine reading comprehension. arXiv preprint arXiv:1803.09720.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.

Wilson L Taylor. 1953. “cloze procedure”: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.

George Tsatsaronis, Michael Schroeder, Georgios Paliouras, Yannis Almirantis, Ion Androutsopoulos, Eric Gaussier, Patrick Gallinari, Thierry Artieres, Michael R Alvers, Matthias Zschunke, et al. 2012. Bioasq: A challenge on large-scale biomedical se- mantic indexing and question answering. In 2012 AAAI Fall Symposium Series.

Ozlem¨ Uzuner, Brett R South, Shuying Shen, and Scott L DuVall. 2011. 2010 i2b2/va challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Asso- ciation, 18(5):552–556.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information pro- cessing systems, pages 5998–6008.

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In ACL- IJCNLP, volume 1, pages 1332–1342.

Wei-Hung Weng, Yuannan Cai, Angela Lin, Fraser Tan, and Po-Hsuan Cameron Chen. 2019. Multi- modal multitask representation learning for pathol- ogy biobank metadata prediction. arXiv preprint arXiv:1909.07846.

Adina Williams, Nikita Nangia, and Samuel R Bow- man. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

John M Zelle and Raymond J Mooney. 1996. Learn- ing to parse database queries using inductive logic programming. In Proceedings of the national con- ference on artificial intelligence, pages 1050–1055.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019a. Ernie: En- hanced language representation with informative en- tities. arXiv preprint arXiv:1905.07129.

Zhuosheng Zhang, Yuwei Wu, Junru Zhou, Sufeng Duan, and Hai Zhao. 2019b. Sg-net: Syntax-guided machine reading comprehension. arXiv preprint arXiv:1908.05147.

A Model Hyper-parameters

Most of the hyper-parameters remained the same across our models: learning rate 2e−5, weight decay 1e−5, warm-up proportion 10% and hidden dropout probability 0.1. The parameters that varied across models for the different datasets are enumerated in Table 6. The hyper-parameters provided in Table 6 apply to all models for a particular dataset. This also suggests that, even after adding an auxiliary task, the proposed model does not need a lot of hyper-parameter tuning.

Dataset     Entity Embedding Dim   Auxiliary Task Wt.
emrQA-rel          100                   0.3
BoolQ               90                   0.3
emrQA              100                   0.3
MADE                80                   0.2

Table 6: Hyper-parameter values across different datasets.

Semantic Type   Description
acab            Acquired Abnormality
aggp            Age Group
anab            Anatomical Abnormality
anst            Anatomical Structure
bpoc            Body Part, Organ, or Organ Component
cgab            Congenital Abnormality
clnd            Clinical Drug
diap            Diagnostic Procedure
emod            Experimental Model of Disease
evnt            Event
fndg            Finding
inpo            Injury or Poisoning
lbpr            Laboratory Procedure
lbtr            Laboratory or Test Result
phob            Physical Object
qnco            Quantitative Concept
sbst            Substance
sosy            Sign or Symptom
topp            Therapeutic or Preventive Procedure

Table 7: Selected semantic types as per MetaMap and their brief descriptions.
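The shared optimisation setup above (learning rate 2e−5, weight decay 1e−5, 10% warm-up) could be configured as in the following sketch using PyTorch and the Hugging Face scheduler helper; the model object and total step count are placeholders, not the authors' training script.

```python
# Sketch of the optimiser/scheduler configuration described in Appendix A.
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, total_steps, lr=2e-5, weight_decay=1e-5, warmup_frac=0.1):
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_frac * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```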

B Logical forms (LFs) for the MADE dataset

1. MedicationEvent (|medication|) [sig=x]
2. MedicationEvent (|medication|) causes ConditionEvent (x) OR SymptomEvent (x)
3. MedicationEvent (|medication|) given ConditionEvent (x) OR SymptomEvent (x)
4. [ProcedureEvent (|treatment|) given/conducted ConditionEvent (x) OR SymptomEvent (x)] OR [MedicationEvent (|treatment|) given ConditionEvent (x) OR SymptomEvent (x)]
5. MedicationEvent (x) CheckIfNull ([enddate]) OR MedicationEvent (x) [enddate>currentDate] OR ProcedureEvent (x) [date=x] given ConditionEvent (|problem|) OR SymptomEvent (|problem|)
6. MedicationEvent (x) CheckIfNull ([enddate]) OR MedicationEvent (x) [enddate>currentDate] given ConditionEvent (|problem|) OR SymptomEvent (|problem|)
7. MedicationEvent (|treatment|) OR ProcedureEvent (|treatment|) given ConditionEvent (x) OR SymptomEvent (x)
8. MedicationEvent (|treatment|) OR ProcedureEvent (|treatment|) improves/worsens/causes ConditionEvent (x) OR SymptomEvent (x)

C Selected entities from MetaMap

The list of selected semantic types, in the form of entities and their brief descriptors, is provided in Table 7.

Evidence Inference 2.0: More Data, Better Models

Jay DeYoung⋆Ψ, Eric Lehman⋆Ψ, Ben NyeΨ, Iain J. MarshallΦ, and Byron C. WallaceΨ

⋆Equal contribution. ΨKhoury College of Computer Sciences, Northeastern University. ΦKing's College London.
{deyoung.j, lehman.e, nye.b, b.wallace}@northeastern.edu, [email protected]

Abstract

How do we most effectively treat a disease or condition? Ideally, we could consult a database of evidence gleaned from clinical trials to answer such questions. Unfortunately, no such database exists; clinical trial results are instead disseminated primarily via lengthy natural language articles. Perusing all such articles would be prohibitively time-consuming for healthcare practitioners; they instead tend to depend on manually compiled systematic reviews of medical literature to inform care. NLP may speed this process up, and eventually facilitate immediate consult of published evidence. The Evidence Inference dataset (Lehman et al., 2019) was recently released to facilitate research toward this end. This task entails inferring the comparative performance of two treatments, with respect to a given outcome, from a particular article (describing a clinical trial) and identifying supporting evidence. For instance: Does this article report that chemotherapy performed better than surgery for five-year survival rates of operable cancers? In this paper, we collect additional annotations to expand the Evidence Inference dataset by 25%, provide stronger baseline models, systematically inspect the errors that these make, and probe dataset quality. We also release an abstract-only (as opposed to full-texts) version of the task for rapid model prototyping. The updated corpus, documentation, and code for new baselines and evaluations are available at http://evidence-inference.ebm-nlp.com/.

1 Introduction

As reports of clinical trials continue to amass at rapid pace, staying on top of all current literature to inform evidence-based practice is next to impossible. As of 2010, about seventy clinical trial reports were published daily, on average (Bastian et al., 2010). This has risen to over one hundred thirty trials per day.1 Motivated by the rapid growth in clinical trial publications, there now exist a plethora of tools to partially automate the systematic review task (Marshall and Wallace, 2019). However, efforts at fully integrating the PICO framework into this process have been limited (Eriksen and Frandsen, 2018). What if we could build a database of Participants,2 Interventions, Comparisons, and Outcomes studied in these trials, and the findings reported concerning these? If done accurately, this would provide direct access to which treatments the evidence supports. In the near-term, such technologies may mitigate the tedious work necessary for manual synthesis.

Recent efforts in this direction include the EBM-NLP project (Nye et al., 2018) and Evidence Inference (Lehman et al., 2019), both of which comprise annotations collected on reports of Randomized Control Trials (RCTs) from PubMed.3 Here we build upon the latter, which tasks systems with inferring findings in full-text reports of RCTs with respect to particular interventions and outcomes, and extracting evidence snippets supporting these. We expand the Evidence Inference dataset and evaluate transformer-based models (Vaswani et al., 2017; Devlin et al., 2018) on the task. Concretely, our contributions are:

• We describe the collection of an additional 2,503 unique 'prompts' (see Section 2) with matched full-text articles; this is a 25% expansion of the original evidence inference dataset that we will release. We additionally have collected an abstract-only subset of data intended to facilitate rapid iterative design of models, as working over full texts can be prohibitively time-consuming.

1 See https://ijmarshall.github.io/sote/.
2 We omit Participants in this work as we focus on the document-level task of inferring study result directionality, and the Participants are inherent to the study, i.e., studies do not typically consider multiple patient populations.
3 https://pubmed.ncbi.nlm.nih.gov/


• We introduce and evaluate new models, achieving SOTA performance for this task.

• We ablate components of these models and characterize the types of errors that they tend to still make, pointing to potential directions for further improving models.

2 Annotation

In the Evidence Inference task (Lehman et al., 2019), a model is provided with a full-text article describing a randomized controlled trial (RCT) and a 'prompt' that specifies an Intervention (e.g., aspirin), a Comparator (e.g., placebo), and an Outcome (e.g., duration of headache). We refer to these as ICO prompts. The task then is to infer whether a given article reports that the Intervention resulted in a significant increase, significant decrease, or produced no significant difference in the Outcome, as compared to the Comparator.

Our annotation process largely follows that outlined in Lehman et al. (2019); we summarize it briefly here. Data collection comprises three steps: (1) prompt generation; (2) prompt and article annotation; and (3) verification. All steps are performed by Medical Doctors (MDs) hired through Upwork.4 Annotators were divided into mutually exclusive groups performing these tasks, described below.

Combining this new data with the dataset introduced in Lehman et al. (2019) yields in total 12,616 unique prompts stemming from 3,346 unique articles, increasing the original dataset by 25%.5 To acquire the new annotations, we hired 11 doctors: 1 for prompt generation, 6 for prompt annotation, and 4 for verification.

2.1 Prompt Generation

In this collection phase, a single doctor is asked to read an article and identify triplets of interventions, comparators, and outcomes; we refer to these as ICO prompts. Each doctor is assigned a unique article, so as to not overlap with one another. Doctors were asked to find a maximum of 5 prompts per article as a practical trade-off between the expense of exhaustive annotation and acquiring annotations over a variety of articles. This resulted in our collecting 3.77 prompts per article, on average. We asked doctors to derive at least 1 prompt from the body (rather than the abstract) of the article. A large difficulty of the task stems from the wide variety of treatments and outcomes used in the trials: 35.8% of interventions, 24.0% of comparators, and 81.6% of outcomes are unique to one another.

In addition to these ICO prompts, doctors were asked to report the relationship between the intervention and comparator with respect to the outcome, and cite what span from the article supports their reasoning. We find that 48.4% of the collected prompts can be answered using only the abstract. However, 63.0% of the evidence spans supporting judgments (provided by both the prompt generator and prompt annotator) are from outside of the abstract. Additionally, 13.6% of evidence spans cover more than one sentence in length.

2.2 Prompt Annotation

Following the guidelines presented in Lehman et al. (2019), each prompt was assigned to a single doctor. They were asked to report the difference between the specified intervention and comparator, with respect to the given outcome. In particular, the options for this relationship were: "increase", "decrease", "no difference" or "invalid prompt." Annotators were also asked to mark a span of text supporting their answers: a rationale. However, unlike Lehman et al. (2019), here annotators were not restricted via the annotation platform to only look at the abstract at first. They were free to search the article as necessary.

Because trials tend to investigate multiple interventions and measure more than one outcome, articles will usually correspond to multiple, potentially many, valid ICO prompts (with correspondingly different findings). In the data we collected, 62.9% of articles comprise at least two ICO prompts with different associated labels (for the same article).

2.3 Verification

Given both the answers and rationales of the prompt generator and prompt annotator, a third doctor, the verifier, was asked to determine the validity of both of the previous stages.6

4 http://upwork.com.
5 We use the first release of the data by Lehman et al., which included 10,137 prompts. A subsequent release contained 10,113 prompts, as the authors removed prompts where the answer and rationale were produced by different doctors.
6 The verifier can also discard low-quality or incorrect prompts.
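A hypothetical record layout for an annotated ICO prompt, reflecting the fields described in this section (intervention, comparator, outcome, label, and supporting evidence span), is sketched below; the actual released data schema may differ.

```python
# Illustrative data structure for one annotated ICO prompt.
from dataclasses import dataclass

@dataclass
class ICOPrompt:
    article_id: str
    intervention: str      # e.g., "aspirin"
    comparator: str        # e.g., "placebo"
    outcome: str           # e.g., "duration of headache"
    label: str             # "increase" | "decrease" | "no difference" | "invalid prompt"
    evidence_start: int    # character offset of the supporting span
    evidence_end: int
    evidence_text: str
```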

[Figure 1 diagram: two stages, Evidence Identification (BERT + MLP) and ICO + Evidence Inference (BERT + MLP). Training is fully supervised, pairing positive evidence sentences with sampled negatives; testing ranks sentences by p(evidence) and classifies the top-ranked sentence (example ICO: progesterone vs. placebo for mortality) as Increased / Decreased / Same.]
Figure 1: BERT to BERT pipeline. Evidence identification and classification stages are trained separately. The identifier is trained via negative samples against the positive instances, the classifier via only those same positive evidence spans. Decoding assigns a score to every sentence in the document, and the sentence with the highest evidence score is passed to the classifier. were 94.0% accurate, and rationales were 96.1% scientific corpora,8 followed by a Softmax. accurate. For prompt annotation, the answers were Specifically, we first perform sentence segmenta- 90.0% accurate, and accuracy of the rationales was tion over full-text articles using ScispaCy (Neu- 88.8%. The drop in accuracy between prompt gen- mann et al., 2019). We use this segmentation to eration answers and prompt annotation answers is recover evidence bearing sentences. We train an likely due to confusion with respect to the scope of evidence identifier by learning to discriminate be- the intervention, comparator, and outcome. tween evidence bearing sentences and randomly We additionally calculated agreement statistics sampled non-evidence sentences.9 We then train an amongst the doctors across all stages, yielding a evidence classifier over the evidence bearing sen- Krippendorf’s α of α = 0.854. In contrast, the tences to characterize the trial’s finding as report- agreement between prompt generator and annotator ing that the Intervention significantly decreased, (excluding verifier) had a α = 0.784. did not significantly change, or significantly in- creased the Outcome compared to the Compara- 2.4 Abstract Only Subset tor in an ICO. When making a prediction for an We subset the articles and their content, yielding (ICO, document) pair we use the highest scoring 9,680 of 24,686 annotations, or approximately 40%. evidence sentence from the identifier, feeding this This leaves 6375 prompts, 50.5% of the total. to the evidence classifier for a final result. Note 3 Models that the evidence classifier is conditioned on the ICO frame; we prepend the ICO embedding (from We consider a simple BERT-based (Devlin et al., Biomed RoBERTa) to the embedding of the identi- 2018) pipeline comprising two independent mod- fied evidence snippet. Reassuringly, removing this els, as depicted in Figure1. The first identifies signal degrades performance (Table1). evidence bearing sentences within an article for a For all models we fine-tuned the underlying given ICO. The second model then classifies the BERT parameters. We trained all models using reported findings for an ICO prompt using the ev- the Adam optimizer (Kingma and Ba, 2014) with idence extracted by this first model. These mod- a BERT learning rate 2e-5. We train these mod- els place a dense layer on top of representations els for 10 epochs, keeping the best performing 7 yielded from (Gururangan et al., 2020), a vari- version on a nested held-out set with respect to ant of RoBERTa (Liu et al., 2019) pre-trained over 8We use the [CLS] representations. 7An earlier version of this work used SciBERT (Beltagy 9We train this via negative sampling because the vast ma- et al., 2019); we preserve these results in AppendixC. jority of sentences are not evidence-bearing.
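To make the flow of this pipeline concrete, the following is a minimal sketch (our own illustration, not the authors' released code) of the decoding step: score every sentence with the evidence identifier, pick the top-scoring sentence, and classify the finding from the ICO and evidence representations. The checkpoint name is assumed to be the public Biomed RoBERTa release of Gururangan et al.; the two linear heads are untrained stand-ins that only show shapes and data flow, and the way the ICO is combined with each sentence is a simplification of the conditioning described above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "allenai/biomed_roberta_base"  # assumed public Biomed RoBERTa checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
hidden = encoder.config.hidden_size

identifier_head = torch.nn.Linear(hidden, 1)      # scores p(evidence) per sentence
classifier_head = torch.nn.Linear(2 * hidden, 3)  # sig. decreased / no difference / sig. increased

def cls_embedding(text):
    """[CLS] representation of a piece of text from the (frozen) encoder."""
    batch = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return encoder(**batch).last_hidden_state[:, 0, :]

def predict(ico, sentences):
    # Stage 1: rank every sentence of the article by its evidence score
    # (here each sentence is simply paired with the ICO text).
    scores = torch.cat([identifier_head(cls_embedding(ico + " " + s)) for s in sentences])
    best_sentence = sentences[int(scores.argmax())]
    # Stage 2: classify the top-ranked sentence, conditioned on the ICO
    # by concatenating the ICO and evidence embeddings.
    features = torch.cat([cls_embedding(ico), cls_embedding(best_sentence)], dim=-1)
    label = int(classifier_head(features).argmax())
    return best_sentence, label   # 0: decreased, 1: no difference, 2: increased
```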

125 macro-averaged f-scores. When training the evi- Model Cond? P R F dence identifier, we experiment with different num- BR Pipeline X .784 .777 .780  bers of random samples per positive instance. We BR Pipeline .513 .510 .510 BR Pipeline abs. .776 .777 .776 used Scikit-Learn (Pedregosa et al., 2011) for X Baseline X .526 .516 .514 evaluation and diagnostics, and implemented all Diagnostics: PyTorch models in (Paszke et al., 2019). We ad- BR Pipeline 1.0 X .762 .764 .763 ditionally reproduce the end-to-end system from Baseline 1.0 X .531 .519 .520 Lehman et al.(2019): a gated recurrent unit (Cho BR ICO Only .522 .515 .511 et al., 2014) to encode the document, attention BR Oracle Spans X .851 .853 .851 (Bahdanau et al., 2015) conditioned on the ICO, BR Oracle Sentence X .845 .843 .843  with the resultant vector (plus the ICO) fed into an BR Oracle Spans .806 .812 .808 BR Oracle Sentence  .802 .795 .797 MLP for a final significance decision. BR Oracle Spans abs. X .830 .823 .824 4 Experiments and Results Baseline Oracle 1.0 X .740 .739 .739 Baseline Oracle X .760 .761 .759 Our main results are reported in Table1. We make a few key observations. First, the gains over the prior Table 1: Classification Scores. BR Pipeline: Biomed state-of-the-art model — which was not BERT RoBERTa BERT Pipeline. abs: Abstracts only. Base- based — are substantial: 20+ absolute points in line: model from Lehman et al.(2019). Diagnostic models: Baseline scores Lehman et al.(2019), BR F-score, even beyond what one might expect to see 10 Pipeline when trained using the Evidence Inference 1.0 shifting to large pre-trained models. Second, con- data, BR classifier when presented with only the ICO ditioning on the ICO prompt is key; failing to do so element, an entire human selected evidence span, or results in substantial performance drops. Finally, a human selected evidence sentence. Full document we seem to have reached a plateau in terms of the BR models are trained with four negative samples; ab- performance of the BERT pipeline model; adding stracts are trained with sixteen; Baseline oracle span re- the newly collected training data does not budge sults from Lehman et al.(2019). In all cases: ‘Cond?’ indicates whether or not the model had access to the performance (evaluated on the augmented test set). ICO elements; P/R/F scores are macro-averaged. This suggests that to realize stronger performance here, we perhaps need a less naive architecture that to be well-aligned with the existing release. This better models the domain. We next probe specific also suggests that the performance of the current aspects of our design and training decisions. simple pipeline model may have plateaued; bet- Impact of Negative Sampling As negative sam- ter performance perhaps requires inductive biases pling is a crucial part of the pipeline, we vary the via domain knowledge or improved strategies for number of samples and evaluate performance. We evidence identification. provide detailed results in AppendixA, but to sum- Oracle Evidence We report two types of Ora- marize briefly: we find that two to four negative cle evidence experiments - one using ground truth samples (per positive) performs the best for the evidence spans “Oracle spans”, the other using sen- end-to-end task, with little change in both AUROC tences for classification. In the former experiment, and accuracy of the best fit evidence sentence. This we choose an arbitrary evidence span11 for each is likely because the model needs only to maximize prompt for decoding. 
For the latter, we arbitrarily discriminative capability, rather than calibration. choose a sentence contained within a span. Both Distribution Shift In addition to comparable experiments are trained to use a matching classifier. Krippendorf-α values computed above, we mea- We find that using a span versus a sentence causes sure the impact of the new data on pipeline perfor- a marginal change in score. Both diagnostics pro- mance. We compare performance of the pipeline vide an upper bound on this model type, improve with all data “Biomed RoBERTa (BR) Pipeline” over the original Oracle baseline by approximately vs. just the old data “Biomed RoBERTA (BR) 10 points. Using Oracle evidence as opposed to BERT Pipeline 1.0” in Table1. As performance a trained evidence identifier leaves an end-to-end stays relatively constant, we believe the new data performance gap of approximately 0.08 F1 score.
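The negative sampling scheme can be illustrated with a small helper. This is a sketch under the assumption that each training article is available as a list of sentences together with the indices of its annotated evidence sentences; the function and argument names are our own.

```python
import random

def identifier_training_pairs(sentences, evidence_indices, negatives_per_positive=4, seed=13):
    """Pair each evidence-bearing sentence (label 1) with randomly sampled
    non-evidence sentences (label 0), as used to train the evidence identifier."""
    rng = random.Random(seed)
    evidence = set(evidence_indices)
    pool = [i for i in range(len(sentences)) if i not in evidence]
    pairs = []
    for pos in evidence_indices:
        pairs.append((sentences[pos], 1))
        for neg in rng.sample(pool, min(negatives_per_positive, len(pool))):
            pairs.append((sentences[neg], 0))
    return pairs

# Four negatives per positive corresponds to the best full-document setting reported above.
pairs = identifier_training_pairs(
    ["Patients were randomized.", "Mortality was significantly lower.", "Funding was provided."],
    evidence_indices=[1],
    negatives_per_positive=4,
)
```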

10To verify the impact of architecture changes, we exper- 11Evidence classification operates on a single sentence, iment with randomly initialized and fine-tuned BERTs. We but an annotator’s selection is span based. Furthermore, the find that these perform worse than the original models in all prompt annotation stage may produce different evidence spans instances and elide more detailed results. than prompt generation.

126 Predicted Class on the basis of these), achieving state of the art Ev. Cls ID Acc. Sig Sig Sig ∼ ⊕ results on this task. Sig .667 .684 .153 .163 With this expanded dataset, we hope to support Sig .674 .060 .840 .099 ∼ Sig .652 .085 .107 .808 further development of NLP for assisting Evidence ⊕ Based Medicine. Our results demonstrate promise Table 2: Breakdown of the conditioned Biomed for the task of automatically inferring results from RoBERTa pipeline model mistakes and performance by Randomized Control Trials, but still leave room evidence class. ID Acc. is the ”identification accuracy”, for improvement. In our future work, we intend to or percentage of . To the right is a confusion matrix for jointly automate the identification of ICO triplets end-to-end predictions. ‘Sig ’ indicates significantly and inference concerning these. We are also keen decreased, ‘Sig ’ indicates no significant difference, ∼ ‘Sig ’ indicates significantly increased. to investigate whether pre-training on related scien- ⊕ tific ‘fact verification’ tasks might improve perfor- Conditioning As the pipeline can optionally mance (Wadden et al., 2020). condition on the ICO, we ablate over both the ICO Acknowledgments and the actual document text. We find that using the ICO alone performs about as effectively as an We thank the anonymous BioNLP reviewers. unconditioned end-to-end pipeline, 0.51 F1 score This work was supported by the National Sci- (Table1). However, when fed Oracle sentences, the ence Foundation, CAREER award 1750978. unconditioned pipeline performance jumps to 0.80 F1. As shown in Table3 (AppendixA), this large decrease in score can be attributed to the model losing the ability to identify the correct evidence sentence. Mistake Breakdown We further perform an analysis of model mistakes in Table2. We find that the BERT-to-BERT model is somewhat better at identifying significantly decreased spans than it is at identifying spans for the significantly increased or no significant difference evidence classes. Spans for the no significant difference tend to be classified correctly, and spans for the significantly increased category tend to be confused in a similar pattern to the significantly decreased class. End-to-end mis- takes are relatively balanced between all possible confusion classes. Abstract Only Results We report a full suite of experiments over the abstracts-only subset in Ap- pendixB. We find that the pipeline models perform similarly on the abstract-only subset; differing in score by less than .01F1. Somewhat surprisingly, we find that the abstracts oracle model falls behind the full document oracle model, perhaps due to a difference in language reporting general results vs. more detailed conclusions. 5 Conclusions and Future Work We have introduced an expanded version of the Evidence Inference dataset. We have proposed and evaluated BERT-based models for the evidence inference task (which entails identifying snippets of evidence for particular ICO prompts in long documents and then classifying the reported finding

127 References using machine learning tools in research synthesis. Systematic Reviews, 8(1):163. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- gio. 2015. Neural machine translation by jointly Mark Neumann, Daniel King, Iz Beltagy, and Waleed learning to align and translate. In 3rd Inter- Ammar. 2019. Scispacy: Fast and robust models for national Conference on Learning Representations, biomedical natural language processing. ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Benjamin Nye, Junyi Jessy Li, Roma Patel, Yinfei Yang, Iain Marshall, Ani Nenkova, and Byron Wal- Hilda Bastian, Paul Glasziou, and Iain Chalmers. 2010. lace. 2018. A corpus with multi-level annotations Seventy-five trials and eleven systematic reviews a of patients, interventions and outcomes to support day: how will we ever keep up? PLoS Med, language processing for medical literature. In Pro- 7(9):e1000326. ceedings of the 56th Annual Meeting of the Associa- tion for Computational Linguistics (Volume 1: Long Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. Scibert: Papers), pages 197–207, Melbourne, Australia. As- Pretrained contextualized embeddings for scientific sociation for Computational Linguistics. text. arXiv preprint arXiv:1903.10676. Adam Paszke, Sam Gross, Francisco Massa, Adam Kyunghyun Cho, Bart van Merrienboer,¨ Caglar Gul- Lerer, James Bradbury, Gregory Chanan, Trevor cehre, Dzmitry Bahdanau, Fethi Bougares, Holger Killeen, Zeming Lin, Natalia Gimelshein, Luca Schwenk, and Yoshua Bengio. 2014. Learning Antiga, et al. 2019. Pytorch: An imperative style, phrase representations using RNN encoder–decoder high-performance deep learning library. In Ad- for statistical machine translation. In Proceedings of vances in Neural Information Processing Systems, the 2014 Conference on Empirical Methods in Nat- pages 8024–8035. ural Language Processing (EMNLP), pages 1724– 1734, Doha, Qatar. Association for Computational F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, Linguistics. B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, Jacob Devlin, Ming-Wei Chang, Kenton Lee, and D. Cournapeau, M. Brucher, M. Perrot, and E. Duch- Kristina Toutanova. 2018. Bert: Pre-training of deep esnay. 2011. Scikit-learn: Machine learning in bidirectional transformers for language understand- Python. Journal of Machine Learning Research, ing. arXiv preprint arXiv:1810.04805. 12:2825–2830. Mette Brandt Eriksen and Tove Faber Frandsen. 2018. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob The impact of patient, intervention, comparison, out- Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz come (PICO) as a search strategy tool on literature Kaiser, and Illia Polosukhin. 2017. Attention is all search quality: a systematic review. Journal of the you need. In Advances in neural information pro- Medical Library Association, 106(4). cessing systems, pages 5998–6008. Suchin Gururangan, Ana Marasovi, Swabha David Wadden, Kyle Lo, Lucy Lu Wang, Shanchuan Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Lin, Madeleine van Zuylen, Arman Cohan, and Han- and Noah A. Smith. 2020. Don’t stop pretraining: naneh Hajishirzi. 2020. Fact or fiction: Verifying Adapt language models to domains and tasks. scientific claim. In Association for Computational Linguistics (ACL). Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Eric Lehman, Jay DeYoung, Regina Barzilay, and By- ron C. Wallace. 2019. Inferring which medical treat- ments work from reports of clinical trials. In Pro- ceedings of the 2019 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol- ume 1 (Long and Short Papers), pages 3705–3717, Minneapolis, Minnesota. Association for Computa- tional Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining ap- proach.

Iain J. Marshall and Byron C. Wallace. 2019. Toward systematic review automation: A practical guide to

128 Appendix Model Cond? P R F BR Pipeline X .776 .777 .776 A Negative Sampling Results BR Pipeline  .513 .510 .510 We report negative sampling results for Biomed Diagnostics: ICO Only .545 .543 .537 RoBERTa pipelines in Table3 and Figure2. Oracle Spans X .830 .823 .824 Oracle Sentence X .845 .843 .843 Oracle Spans  .814 .809 .809 Oracle Sentence  .802 .795 .797

Table 4: Classification Scores. Biomed RoBERTa Ab- stract only version of Table1. All evidence identifica- tion models trained with sixteen negative samples.

Neg. Samples   Cond?   AUROC   Top1 Acc
 1             X       0.983   0.647
 2             X       0.982   0.664
 4             X       0.981   0.680
 8             X       0.978   0.656
16             X       0.980   0.673
 1             -       0.944   0.351
 2             -       0.953   0.373
 4             -       0.947   0.334
 8             -       0.938   0.273
16             -       0.947   0.308

Figure 2: End to end pipeline scores for different negative sampling strategies with Biomed RoBERTa.

Neg, samples Cond? AUROC Top1 Acc Table 5: Abstract only (v2.0) evidence identification 1 X 0.973 0.682 validation scores varying across negative sampling 2 X 0.972 0.700 strategies using Biomed RoBERTa. 4 X 0.972 0.671 8 X 0.961 0.492 at the time experiment configurations were deter- 16 X 0.590 0.027 1  0.915 0.236 mined. Biomed RoBERTa experiments use the v2.0 2  0.921 0.226 set for calibration. We find that Biomed RoBERTa 4  0.925 0.251 generally performs better, with a notable excep- 8  0.899 0.165 tion in performance on abstracts-only Oracle span 16  0.508 0.015 classification. Table 3: Evidence Inference v2.0 evidence identifica- C.1 Negative Sampling Results tion validation scores varying across negative sampling We report SciBERT negative sampling results in strategies using Biomed RoBERTa in the pipeline. Table9 and Figure4. C.2 Abstract Only Results B Abstract Only Results We repeat the experiments described in Section4 We repeat the experiments described in Section and report results in Tables 10, 11, 12 and Figure5. 4. Our primary findings are that the abstract-only Our primary findings are that the abstract-only task task is easier and sixteen negative samples perform is easier and eight negative samples perform better better than four. Otherwise results follow a similar than four. Otherwise results follow a similar trend trend to the full-document task. We document these to the full-document task. in Table4,5,6 and Figure3. C SciBERT Results We report original SciBERT results in Tables7,8, 9 and Figures4,5. Table7 contains the Biomed RoBERTa numbers for comparison. Note that orig- inal SciBERT experiments use the evidence infer- ence v1.0 dataset as v2.0 collection was incomplete

129 Model Cond? P R F BR Pipeline X .784 .777 .780 SB Pipeline X .750 .750 .749 BR Pipeline  .513 .510 .510 SB Pipeline  .489 .486 .486 BR Pipeline abs. X .776 .777 .776 SB Pipeline abs. X .803 .798 .799 Baseline X .526 .516 .514 Diagnostics: BR Pipeline 1.0 X .762 .764 .763 SB Pipeline 1.0 X .749 .761 .753 Baseline 1.0 X .531 .519 .520 BR ICO Only .522 .515 .511 SB ICO Only .494 .501 .494 BR Oracle Spans X .851 .853 .851 SB Oracle Spans X .840 .840 .838 BR Oracle Sentence X .845 .843 .843 Figure 3: End to end pipeline scores on the abstract- SB Oracle Sentence X .829 .830 .829 only subset for different negative sampling strategies BR Oracle Spans  .806 .812 .808 with Biomed RoBERTa. SB Oracle Spans  .786 .789 .787 BR Oracle Sentence  .802 .795 .797 SB Oracle Sentence  .780 .770 .773 BR Oracle Spans abs. X .830 .823 .824 SB Oracle Spans abs. .866 .862 .863 Conf. Cls X Baseline Oracle 1.0 .740 .739 .739 Ev. Cls ID Acc. Sig Sig Sig X ∼ ⊕ Baseline Oracle X .760 .761 .759 Sig .728 .761 .067 .172

Sig .691 .130 .802 .068 ∼ Table 7: Replica of Table1 with both SciBERT Sig .573 .123 .109 .768 ⊕ and Biomed RoBERTa results. Classification Scores. BR Pipeline: Biomed RoBERTa BERT Pipeline, SB Table 6: Breakdown of the abstract-only conditioned Pipeline: SciBERT Pipeline. abs: Abstracts only. Base- Biomed RoBERTa pipeline model mistakes and perfor- line: model from Lehman et al.(2019). Diagnostic mance by evidence class. ID Acc. is breakdown by models: Baseline scores Lehman et al.(2019), BR final evidence truth. To the right is a confusion matrix Pipeline when trained using the Evidence Inference 1.0 for end-to-end predictions. data, BR classifier when presented with only the ICO element, an entire human selected evidence span, or a human selected evidence sentence. Full document BR models are trained with four negative samples; ab- stracts are trained with sixteen; Baseline oracle span re- sults from Lehman et al.(2019). In all cases: ‘Cond?’ indicates whether or not the model had access to the ICO elements; P/R/F scores are macro-averaged over classes.

Predicted Class Ev. Cls ID Acc. Sig Sig Sig ∼ ⊕ Sig .711 .697 .143 .160

Sig .643 .076 .838 .086 ∼ Sig .635 .146 .141 .713 ⊕ Table 8: Replica of Table2 for SciBERT. Breakdown of the conditioned BERT pipeline model mistakes and performance by evidence class. ID Acc. is the ”iden- tification accuracy”, or percentage of . To the right is a confusion matrix for end-to-end predictions. ‘Sig ’ Figure 4: End to end pipeline scores for different nega- indicates significantly decreased, ‘Sig ’ indicates no tive sampling strategies for SciBERT. ∼ significant difference, ‘Sig ’ indicates significantly in- ⊕ creased.

130 Neg. Samples Cond? AUROC Top1 Acc 1 X .969 .663 2 X .959 .673 4 X .968 .659 8 X .961 .627 16 X .967 .593 1  .894 .094 2  .890 .181 4  .843 .083 8  .862 .170 16  .403 .014

Table 9: Evidence Inference v1.0 evidence identifica- tion validation scores varying across negative sampling strategies for SciBERT.

Model Cond? P R F BERT Pipeline X .803 .798 .799 BERT Pipeline  .528 .513 .510 Diagnostics: ICO Only .480 .480 .479 Oracle Spans X .866 .862 .863 Oracle Sentence X .848 .842 .844 Oracle Spans  .804 .802 .801 Oracle Sentence  .817 .776 .783

Table 10: Classification Scores. SciBERT/Abstract only version of Table1. All evidence identification models trained with eight negative samples.

Neg. Samples   Cond?   AUROC   Top1 Acc
 1             X       0.980   0.573
 2             X       0.978   0.596
 4             X       0.977   0.623
 8             X       0.950   0.609
16             X       0.975   0.615
 1             -       0.946   0.340
 2             -       0.939   0.342
 4             -       0.912   0.286
 8             -       0.938   0.313
16             -       0.940   0.282

Figure 5: End to end pipeline scores on the abstract-only subset for different negative sampling strategies for SciBERT.

Table 11: Abstract only (v1.0) evidence identifica- tion validation scores varying across negative sampling strategies for SciBERT.

Conf. Cls Ev. Cls ID Acc. Sig Sig Sig ∼ ⊕ Sig .767 .750 .044 .206

Sig .686 .092 .816 .092 ∼ Sig .591 .109 .064 .827 ⊕ Table 12: Breakdown of the abstract-only conditioned SciBERT pipeline model mistakes and performance by evidence class. ID Acc. is breakdown by final evidence truth. To the right is a confusion matrix for end-to-end predictions.

                          Train              Dev             Test            Total
Number of prompts         10150              1238            1228            12616
Number of articles        2672               340             334             3346
Label counts (-1 / 0 / 1) 2465 / 4563 / 3122 299 / 544 / 395 295 / 516 / 417 3059 / 5623 / 3934

Table 13: Corpus statistics. Labels -1, 0, 1 indicate significantly decreased, no significant difference and signifi- cantly increased, respectively.

Personalized Early Stage Alzheimer’s Disease Detection: A Case Study of President Reagan’s Speeches

Ning Wang, Fan Luo, Vishal Peddagangireddy∗, K.P. Subbalakshmi, R. Chandramouli
Stevens Institute of Technology, Hoboken, NJ 07030
{nwang7, fluo4, vpeddaga, ksubbala, mouli}@stevens.edu

Paper Number 59 due to a variety of reasons including social, eco- Abstract nomic, and cultural factors. Therefore, Internet based technologies that unobtrusively and continu- Alzheimer’s disease (AD)-related global ally collect, store, and analyze mental health data healthcare cost is estimated to be $1 trillion by 2050. Currently, there is no cure for this are critical. For example, a home smart speaker de- disease; however, clinical studies show that vice can record a subject’s speech periodically, au- early diagnosis and intervention helps to tomatically extract AD related speech or linguistic extend the quality of life and inform tech- features, and present easy to understand machine nologies for personalized mental healthcare. learning based analysis and visualization. Such a Clinical research indicates that the onset and technology will be highly valuable for personal- progression of Alzheimer’s disease lead to ized medicine and early intervention. This may dementia and other mental health issues. As a result, the language capabilities of patient also encourage people to sign-up for such a low start to decline. cost and home-based AD diagnostic technology. In this paper, we show that machine learning- Data and results of such a technology will also in- based unsupervised clustering of and anomaly stantly provide invaluable information to mental detection with linguistic biomarkers are health professionals. promising approaches for intuitive visualiza- Several studies show that subtle linguistic tion and personalized early stage detection of Alzheimer’s disease. We demonstrate changes are observed even at the early stages of this approach on 10 year’s (1980 to 1989) AD. In (Forbes-McKay and Venneri, 2005), more of President Ronald Reagan’s speech data than 70% of AD patients scored low in a picture set. Key linguistic biomarkers that indicate description task. Therefore, a critical research ques- early-stage AD are identified. Experimental tion is: can spontaneous temporal language impair- results show that Reagan had early onset of ments caused by AD be detected at an early stage Alzheimer’s sometime between 1983 and of the disease? Relation between AD, language 1987. This finding is corroborated by prior functions and language domain are summarized in work that analyzed his interviews using a statistical technique. The proposed technique (Szatloczki et al., 2015). In (Venneri et al., 2008), a also identifies the exact speeches that reflect significant correlation between the lexical attributes linguistic biomarkers for early stage AD. characterising residual linguistic production and the integrity of regions of the medial temporal 1 Introduction lobes in early AD patients is observed. Specific Alzheimer’s disease is a serious mental health issue lexical and semantic deficiencies of AD patients at faced by the global population. About 44 million early to moderate stages are also detected in verbal people worldwide are diagnosed with AD. The U.S. communication task in (Boye´ et al., 2014). There- alone has 5.5 million AD patients. According to fore, in this paper, we explore a machine-learning the Alzheimer’s association the total cost of care based clustering and data visualization technique for AD is estimated to be $1 trillion by 2050. There to identify linguistic changes in a subject over a is no cure for AD yet; however, studies have shown time period. 
The proposed methodology is also that early diagnosis and intervention can delay the highly personalized since it observes and analyzes onset. the linguistic patterns of each individual separately Regular mental health assessment is a key chal- using only his/her own linguistic biomarkers over lenge faced by the medical community. This is a period of time.


First, we explore a machine learning algorithm the picture ignoring expected facts and inferences called t-distributed stochastic neighbor embedding (Giles et al., 1996). AD patients have difficulty (t-SNE) (Maaten and Hinton, 2008). t-SNE is in naming things and replace target words with useful in dimensionality reduction suitable for vi- simpler semantically neighboring words. Under- sualization of high-dimensional datasets. It cal- standing metaphors and sarcasm also deteriorate in culates the probability that two points in a high- people diagnosed with AD (Rapp and Wild, 2011). dimensional space are similar, computes the corre- Machine learning based classifier design to dif- sponding probability in a low-dimensional space, ferentiate between AD and non-AD subjects is an and minimizes the difference between these two active area of research. A significant correlation probabilities for mapping or visualization. Dur- between dementia severity and linguistic measures ing this process, the sum of Kullback-Leibler di- such as confrontation naming, articulation, word- vergences (Liu and Shum, 2003) over all the data finding ability, and semantic fluency exists. Some points is minimized. Our hypothesis is that high- studies have reported a 84.8 percent accuracy in dimensional AD-related linguistic features when distinguishing AD patients from healthy controls visualized in a low-dimensional space may quickly using temporal and acoustic features ((Rentoumi and intuitively reveal useful information for early et al., 2014); (Fraser et al., 2016)). diagnosis. Such a visualization will also help in- Our study differs from the prior work in sev- dividuals and general medical practitioners (who eral ways. Prior work attempt to identify linguistic are first points of contact) to assess the situation features and machine learning classifiers for differ- for further tests. Second, we investigate two un- entiating between AD and non-AD subjects from a supervised machine learning techniques, one class corpus (e.g., Dementia Bank (MacWhinney, 2007)) support vector machine (SVM) and isolation forest, containing text transcripts of interviews with pa- for temporal detection of linguistic abnormalities tients and healthy control. In this paper, we first indicative of early-stage AD. These proposed ap- analyze the speech transcripts (over several years) proaches are tested on President Reagan’s speech of a single person (President Ronald Reagan) and dataset and corroborated with results from other visualize time-dependent linguistic information us- research in the literature. ing t-SNE. The goal is to identify visual clues for This paper is organized as follows. Back- linguistic changes that may indicate the onset of ground research is discussed in Section2, Sec- AD. Note that such an easy to understand visual tion3 presents the Reagan speech data set used representation depicting differences in linguistic in this paper, data pre-processing techniques that patterns will be useful to both a common person were applied and AD-related linguistic feature se- and a general practitioner (who is the first point lection rationale and methodology, Section4 con- of contact for majority of patients). 
Since most tains the clustering and visualization of the hand- general practitioners are not trained mental health crafted linguistic features and anomaly detection to professionals the impact of such a visualization infer the onset of AD and to detect the time period tool will be high, especially at the early stages of of changes in the linguistic statistical characteris- AD. Sigificant AD-related linguistic biomarkers tics, and experimental results to demonstrate the derived from t-SNE analysis are then used in two proposed method. In Section5, we describe the ma- unsupervised clustering algorithms for detecting chine learning algorithms for detecting anomalies AD-related temporal linguistic anomalies. This from personalized linguistic biomarkers collected provides an estimate for the time when early-stage over a period of time to identify early-stage AD. AD symptoms are beginning to be observed. Concluding remarks are given in Section6. 3 Reagan Speeches: Data Collection, 2 Background Pre-processing, and Feature Selection

The “picture description” task has been widely stud- We describe the data collection, pre-processing, ied to differentiate between AD and non-AD or and feature engineering methodologies in this sec- control subjects. In this task, a picture (the “cookie tion. Ronald Reagan was the 40th president (served theft picture”) is shown and the subject is asked from 1981 to 1989) of the United States of Amer- to describe it. It has been observed that subjects ica. He was an extraordinary orator, a radio an- with AD usually convey sparse information about nouncer, actor, and the host for a show called “Gen-

eral Electric Theatre.” Clearly, for being successful in these professional domains one needs to have good memory, consciousness, intuition, command over the language, and ease of communicating with a large audience. Reagan officially announced that he had been diagnosed with AD on November 5, 1994. But it was speculated that his cognitive abilities were on the decline even while in office (Gottschalk et al., 1988). Therefore, analyzing his speech transcripts for early signs of AD may reveal interesting patterns, if any.

The Reagan Library is the repository of presidential records for President Reagan’s administration. We downloaded his 98 speeches from 1980 to 1989, as shown in Table 1. We removed special characters, tags, and numbers and kept only the words from each speech transcript. The resulting data was then lemmatized and tokenized.

Table 1: President Reagan’s speech dataset

Year   No. of speeches
1980   6
1981   8
1982   11
1983   12
1984   14
1985   13
1986   14
1987   10
1988   9
1989   1

Part-of-speech (POS) features: People diagnosed with AD use more pronouns than nouns. Therefore, POS features such as the number of pronouns and the pronoun-to-noun ratio are important. We identified adverbs, nouns, verbs, and pronouns for each speech transcript with natural language processing (NLP) tools. POS tags having at least 10 occurrences were selected to compute their percentage ratios. Similarly, words that have at least a frequency of 10 were selected and their occurrence percentages were computed.

The full set of POS features we used were: (1) number of pronouns, (2) pronoun-noun ratio, (3) number of adverbs, (4) number of nouns, (5) number of verbs, (6) pronoun frequency rate, (7) noun frequency rate, (8) verb frequency rate, (9) adverb frequency rate, (10) word frequency rate, and (11) word frequency rate without excluding stop words.

Vocabulary Richness: AD patients show a decline in their vocabulary range. Therefore, vocabulary richness metrics, namely Honore’s Statistic (HS), the Sichel Measure (SICH), and Brunet’s Measure (BM), were calculated for each speech. Higher values of Honore’s and Sichel measures indicate greater vocabulary richness, but a higher value of Brunet’s measure corresponds to low vocabulary richness.

Readability Measures: We computed two readability measures, namely the Automated Readability Index (ARI) and the Flesch-Kincaid readability (FKR) score. A higher ARI indicates complex speech with rich vocabulary, whereas a lower Flesch-Kincaid score indicates rich vocabulary.

Figure 1: Correlation matrix of the linguistic features.

Figure 1 shows the correlation matrix of the chosen linguistic features computed for the Reagan speech dataset. Note that some of the features are highly correlated; therefore, we pruned the feature set to the following 9 features:

1. pronoun-noun ratio
2. word frequency rate
3. verb frequency rate
4. pronoun frequency rate
5. adverb frequency rate
6. Honore’s measure
7. Brunet’s measure
8. Sichel measure
9. Automated Readability Index

Any further analysis in this paper uses the above 9 selected linguistic features.
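For illustration, here is a rough sketch of how the vocabulary-richness and readability features could be computed. The exact formulas are not given in the paper, so the commonly cited definitions of Honore's statistic, Brunet's measure, Sichel's measure, and ARI are assumed; the POS-based features would additionally require a part-of-speech tagger (e.g., spaCy), which is omitted here.

```python
import math
import re
from collections import Counter

def vocabulary_richness(text):
    """Commonly cited formulas (assumed; the paper does not spell them out)."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    n = len(tokens)                                   # number of word tokens
    counts = Counter(tokens)
    v = len(counts)                                   # number of distinct words (types)
    v1 = sum(1 for c in counts.values() if c == 1)    # words occurring exactly once
    v2 = sum(1 for c in counts.values() if c == 2)    # words occurring exactly twice
    honore = float("inf") if v1 == v else 100 * math.log(n) / (1 - v1 / v)  # higher = richer
    brunet = n ** (v ** -0.165)                       # higher = poorer vocabulary
    sichel = v2 / v                                   # higher = richer vocabulary
    return {"HS": honore, "BM": brunet, "SICH": sichel}

def automated_readability_index(text):
    """ARI = 4.71*(chars/words) + 0.5*(words/sentences) - 21.43."""
    words = re.findall(r"\S+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / sentences - 21.43
```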

4 Clustering and Visualization of Linguistic Features

We selected the t-SNE machine learning technique for clustering and visualization of linguistic features extracted from Reagan’s speeches. t-SNE is better at creating a single map for revealing structures at many different scales, which is important for high-dimensional data that lie on several different, but related, low-dimensional manifolds. This implies that it can capture much of the local structure of the high-dimensional data very well, while also revealing global structure such as the presence of clusters at several scales (van der Maaten and Hinton, 2008). If there are AD-related linguistic patterns, then t-SNE may reveal them as clusters.

In t-SNE, the high-dimensional Euclidean distances between datapoints are converted into conditional probabilities representing similarities between them. For example, the similarity of data point xj with xi is the conditional probability p_{j|i} that xi would pick xj as its neighbor. Neighbors are picked in proportion to their probability density under a Student t-distribution centered at xi in the low-dimensional space. The Kullback-Leibler divergence between a joint probability distribution P in the high-dimensional space and a joint probability distribution Q in the low-dimensional space is then minimized:

min KL(P || Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}    (1)

t-SNE has two tunable hyperparameters: perplexity and learning rate. Perplexity allows us to balance the weighting between local and global relationships of the data. It gives a sense of the number of close neighbors for each point. A perplexity value between 5 and 50 is recommended. The learning rate for t-SNE is usually in the interval [10, 1000]. For a high learning rate, each point will approximately be equidistant from its nearest neighbours. For a low rate, there will be few outliers and therefore the points may look compressed into a dense cloud. Since t-SNE’s cost function is not convex, different initializations can produce different results. After tuning t-SNE’s hyperparameters for the dataset, we chose a perplexity value equal to 4 and a learning rate equal to 100.

Pronoun-to-noun ratio: Figure 2 shows t-SNE based clustering of the speech transcripts; speeches sharing similar patterns are clustered together. Besides, the radius of each circle is proportional to the pronoun-to-noun ratio. Notice that the cluster on the left side of the graph contains speeches (from 1983 to 1987) that have higher values of the ratio. Recall that a higher pronoun-to-noun ratio is an indicator of early stage AD.

Figure 2: 2-dimensional visualization of speech transcripts where the size of each circle in the map is proportional to the pronoun-to-noun ratio.

Pronoun frequency: Fig. 3 indicates clustering and low-dimensional visualization results when the size of each circle is proportional to the pronoun frequency. We again see that the cluster on the left side of the map contains speeches with higher pronoun frequency, another indicator of early stage AD.

Figure 3: 2-dimensional visualization of speech transcripts where the size of each circle in the map is proportional to the pronoun frequency.

Readability score: From Fig. 4 we observe that the speeches from 1983-1987 have lower readability scores.

Word repetition frequency: The word repetition frequency map in Fig. 5 shows that three speeches stand out. These three speeches have a higher repetition of high-frequency words. Interestingly, two of these speeches have word lengths smaller than the mean length of all the speeches.
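A minimal scikit-learn sketch of the t-SNE mapping described in this section, using the hyperparameter values reported above (perplexity 4, learning rate 100); the feature matrix is assumed to contain the nine selected linguistic features, one row per speech, with random placeholder values standing in for the real data, and PCA initialization and feature standardization are our own choices.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(98, 9))   # placeholder for the 98 speeches x 9 linguistic features

embedding = TSNE(
    n_components=2,      # 2-D map for visualization
    perplexity=4,        # value chosen by the authors after tuning
    learning_rate=100,   # value chosen by the authors after tuning
    init="pca",
    random_state=42,     # the cost function is non-convex, so fix the seed
).fit_transform(StandardScaler().fit_transform(X))

print(embedding.shape)   # (98, 2): coordinates of each speech in the low-dimensional map
```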

136 08-12-1987: Address to the Nation on the • Iran-ContraAffair 08-12-1987: Address to the Nation on the • Iran-ContraAffair 09-21-1987: Address to the General Assem- • bly of the United Nations, (INF Agreement and Iran) Therefore, t-SNE based clustering and low- dimensional visualization of President Reagan’s Figure 4: 2-dimensional visualization of speech tran- speeches from 1964 to 1989 reveals the following: scripts where size of each circle in the map is propor- he started showing signs of early AD well tional to the readability score. • before the official announcement in 1994 it is highly likely that he developed AD some- • time between 1983 and 1987 over time, President Reagan’s showed a signif- • icant reduction the number of unique words but a significant increase in conversational fillers and non-specific nouns the proposed method identifies specific • speeches that exhibit linguistic markers for AD

Figure 5: 2-dimensional visualization of speech tran- Some of these findings are corroborated by prior scripts where size of each circle in the map is propor- research that analyzed his interviews and com- tional to the word repetition frequency. pared them with President Bursh’s public speeches, (Berisha et al., 2015). From these clustering results the speeches iden- 5 Linguistic Anomaly Detector for AD tified as showing early signs of AD are: t-SNE-clustering-based approach provides visual- 03-23-1983: Address to the Nation on Na- • ization of speeches that are statistically different tional Security (“Star Wars” SDI Speech) (“anomalies”). But we need an automated method 05-28-1984: Remarks honoring the Vietnam to identify these anomalies for early signs of AD. • War’s Unknown Therefore, we investigate a one-class support vec- tor machine (SVM) anomaly detector (Erfani et al., 02-06-1985: State of the Union Address, 2016). This method is useful in practice when ma- • (American Revolution II) jority of a subject’s speeches over several years would be (statistically) typical of a healthy control 04-24-1985: Address to the Nation on the • (“normal”) until he/she begins to exhibit early signs Federal Budget and Deficit Reduction of AD. Our hypothesis is that early stage AD will 05-05-1985: Speech at Bergen-Belsen Con- begin to reveal itself as statistical anomalies in lin- • centration Camp Memorial guistic feature space. In this section we investigate this hypothesis. 05-05-1985: Speech at Bitburg Air Base • We designed a one class SVM with the following 04-14-1986: Address to the Nation on the Air hyperparameter values (the choices of these values • Strike Against Libya are not discussed for the sake of clarity and focus), ν = 0.5 (an upper bound on the fraction of training 06-24-1986: Address to the Nation on Aid to errors and a lower bound of the fraction of sup- • the Contras port vectors), kernel=rbf (radial basis function) and

γ = 1 / (number of features × variance of features) (kernel coefficient). Figure 6 shows that the one class SVM

Figure 7: Isolation forest algorithm for AD detection.
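For illustration, here is a minimal scikit-learn sketch of the two anomaly detectors used in this section; ν and the kernel follow the values quoted above, gamma="scale" corresponds to 1/(number of features × variance), and the isolation-forest hyperparameters, which the paper does not list, are left at library defaults.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(98, 9))   # placeholder for the 98 speeches x 9 linguistic features

# One-class SVM: learns the profile of "normal" speeches and flags outliers as -1.
one_class_svm = OneClassSVM(nu=0.5, kernel="rbf", gamma="scale")
svm_flags = one_class_svm.fit_predict(X)

# Isolation forest: isolates anomalous speeches directly, without a model of "normal".
isolation_forest = IsolationForest(random_state=42)
forest_flags = isolation_forest.fit_predict(X)

# Speeches flagged as anomalous (-1) by both detectors.
print(np.where((svm_flags == -1) & (forest_flags == -1))[0])
```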

Figure 6: Speeches from 1984 to 1986 are detected as abnormal or anomalies. detector identified several speeches from 1984 to 1986 as anomalous. That is, President Reagan’s speeches in these years are different from the previ- ous years in the linguistic feature space. Therefore, it is likely that: Figure 8: Year-wise speeches identified as anomalous. he started showing signs of early AD well • before the official announcement in 1994 6 Conclusions

the onset of AD start from 1984 and became A set of nine linguistic biomarkers for AD was • more pronounced in 1985 and 1986, which is identified. Two complementary unsupervised ma- corroborated by prior research (e.g., (Berisha chine learning methods were applied on the linguis- et al., 2015)) that analyzed his interviews. tic features extracted from President Reagan’s 98 speeches given between 1980 to 1989. The first One class SVM learns the profile of non-AD method, t-SNE, identified and visualized speeches speeches as “normal” over a period of time and indicating early onset of AD. A higher pronoun detects anomalies to signal the onset of AD. But usage frequency, lower readability scores, higher in many practical instances long historical data repetition of high frequency words were revealed may not be available for a subject. In this case, to be the key characteristics of potential AD-related we must identify anomalies explicitly instead of speeches. A subset speeches from 1983 to 1987 learning what is normal. Isolation forest (Ding and were detected to possess these characteristics. Fei, 2013) is an unsupervised machine learning The second machine learning method, one class algorithm that is applicable for this purpose. SVM, learned what is “normal” (i.e., non-AD We applied the isolation forest algorithm on the speech) to detect anomalies in speeches over a speech dataset and tuned its hyperparameters. Fig- period of time. This approach detected several ure7 and Figure8 show the results. Figure7 shows speeches between 1984 and 1986 as potential AD- the ten speeches that were isolated as anomalies. related. Since normal speech may not be avail- The corresponding dates of these speeches are seen able historically we applied the isolation forest al- in Figure8. We observe that a few speeches be- gorithm that explicitly detects anomalies without tween 1984 and 1988 as linguistic anomalies. This learning what is normal. This detected 10 speeches time period overlaps significantly with the results from 1983 to 1987 as AD-related. of once class SVM and t-SNE clustering. From the experimental analysis our conclusion

138 is that President Reagan had signs of AD sometime Alexander M Rapp and Barbara Wild. 2011. Nonlit- between 1983 and 1988. This conclusion corrob- eral language in alzheimer dementia: a review. Jour- orates results from other studies in the literature. nal of the International Neuropsychological Society, 17(2):207–218. Note that that President Reagan had AD was pub- licly disclosed only in November 1994. Vassiliki Rentoumi, Ladan Raoufian, Samrah Ahmed, Celeste A de Jager, and Peter Garrard. 2014. Fea- tures and machine learning classification of con- References nected speech samples from patients with autopsy proven alzheimer’s disease with and without addi- Visar Berisha, Shuai Wang, Amy LaCross, and Julie tional vascular pathology. Journal of Alzheimer’s Liss. 2015. Tracking discourse complexity preced- Disease, 42(s3):S3–S17. ing alzheimer’s disease diagnosis: a case study com- paring the press conferences of presidents ronald rea- Greta Szatloczki, Ildiko Hoffmann, Veronika Vincze, gan and george herbert walker bush. Journal of Janos Kalman, and Magdolna Pakaski. 2015. Speak- Alzheimer’s Disease, 45(3):959–963. ing in alzheimer’s disease, is that an early sign? importance of changes in language abilities in Ma¨ıte´ Boye,´ Natalia Grabar, and Thi Mai Tran. 2014. alzheimer’s disease. Frontiers in aging neuro- Contrastive conversational analysis of language pro- science, 7:195. duction by alzheimer’s and control people. In MIE, pages 682–686. Annalena Venneri, William J McGeown, Heidi M Hietanen, Chiara Guerrini, Andrew W Ellis, and Zhiguo Ding and Minrui Fei. 2013. An anomaly de- Michael F Shanks. 2008. The anatomical bases of tection approach based on isolation forest algorithm semantic retrieval deficits in early alzheimer’s dis- for streaming data using sliding window. IFAC Pro- ease. Neuropsychologia, 46(2):497–510. ceedings Volumes, 46(20):12–17. Sarah M Erfani, Sutharshan Rajasegarar, Shanika Karunasekera, and Christopher Leckie. 2016. High- dimensional and large-scale anomaly detection us- ing a linear one-class svm with deep learning. Pat- tern Recognition, 58:121–134. Katrina E Forbes-McKay and Annalena Venneri. 2005. Detecting subtle spontaneous language decline in early alzheimer’s disease with a picture description task. Neurological sciences, 26(4):243–254. Kathleen C Fraser, Jed A Meltzer, and Frank Rudzicz. 2016. Linguistic features identify alzheimer’s dis- ease in narrative speech. Journal of Alzheimer’s Dis- ease, 49(2):407–422. Elaine Giles, Karalyn Patterson, and John R Hodges. 1996. Performance on the boston cookie theft pic- ture description task in patients with early dementia of the alzheimer’s type: missing information. Apha- siology, 10(4):395–408. Louis A Gottschalk, Regina Uliana, and Ronda Gilbert. 1988. Presidential candidates and cognitive impair- ment measured from behavior in campaign debates. Public Administration Review, pages 613–619. Ce Liu and Hueng-Yeung Shum. 2003. Kullback- leibler boosting. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recog- nition, 2003. Proceedings., volume 1, pages I–I. IEEE. Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605. Brian MacWhinney. 2007. The talkbank project. In Creating and digitizing language corpora, pages 163–180. Springer.

BIOMRC: A Dataset for Biomedical Machine Reading Comprehension

Petros Stavropoulos1,2, Dimitris Pappas1,2, Ion Androutsopoulos1, Ryan McDonald3,1
1Department of Informatics, Athens University of Economics and Business, Greece
2Institute for Language and Speech Processing, Research Center ‘Athena’, Greece
3Google Research
{pstav1993, pappasd, ion}@aueb.gr, [email protected]

Abstract been created following this approach either using books (Hill et al., 2016; Bajgar et al., 2016) or We introduce BIOMRC, a large-scale cloze- news articles (Hermann et al., 2015). Datasets of style biomedical MRC dataset. Care was taken this kind are noisier than MRC datasets containing to reduce noise, compared to the previous human-authored questions and manually annotated BIOREAD dataset of Pappas et al.(2018). Ex- periments show that simple heuristics do not passage spans that answer them (Rajpurkar et al., perform well on the new dataset, and that two 2016, 2018; Nguyen et al., 2016). They require neural MRC models that had been tested on no human annotations, however, which is particu- BIOREAD perform much better on BIOMRC, larly important in biomedical question answering, indicating that the new dataset is indeed less where employing annotators with appropriate ex- noisy or at least that its task is more feasible. pertise is costly. For example, the BIOASQQA Non-expert human performance is also higher dataset (Tsatsaronis et al., 2015) currently contains on the new dataset compared to BIOREAD, and biomedical experts perform even better. We approximately 3k questions, much fewer than the also introduce a new BERT-based MRC model, 100k questions of a SQUAD (Rajpurkar et al., 2016), the best version of which substantially outper- exactly because it relies on expert annotators. forms all other methods tested, reaching or sur- To bypass the need for expert annotators and passing the accuracy of biomedical experts in produce a biomedical MRC dataset large enough some experiments. We make the new dataset to train (or pre-train) deep learning models, Pap- available in three different sizes, also releasing pas et al.(2018) adopted the cloze-style questions our code, and providing a leaderboard. approach. They used the full text of unlabeled 1 1 Introduction biomedical articles from PUBMEDCENTRAL, and METAMAP (Aronson and Lang, 2010) to annotate Creating large corpora with human annotations is a the biomedical entities of the articles. They ex- demanding process in both time and resources. Re- tracted sequences of 21 sentences from the arti- search teams often turn to distantly supervised or cles. The first 20 sentences were used as a passage unsupervised methods to extract training examples and the last sentence as a cloze-style question. A from textual data. In machine reading compre- biomedical entity of the ‘question’ was replaced hension (MRC)(Hermann et al., 2015), a training by a placeholder, and systems have to guess which instance can be automatically constructed by taking biomedical entity of the passage can best fill the an unlabeled passage of multiple sentences, along placeholder. This allowed Pappas et al. to produce with another smaller part of text, also unlabeled, a dataset, called BIOREAD, of approximately 16.4 usually the next sentence. Then a named entity of million questions. As the same authors reported, the smaller text is replaced by a placeholder. In this however, the mean accuracy of three humans on a setting, MRC systems are trained (and evaluated for sample of 30 questions from BIOREAD was only their ability) to read the passage and the smaller 68%. Although this low score may be due to the text, and guess the named entity that was replaced fact that the three subjects were not biomedical ex- by the placeholder, which is typically one of the perts, it is easy to see, by examining samples of named entities of the passage. 
This kind of ques- BIOREAD, that many examples of the dataset do tion answering (QA) is also known as cloze-type questions (Taylor, 1953). Several datasets have 1https://www.ncbi.nlm.nih.gov/pmc/


‘question’ originating from caption: for researchers with more or fewer resources, along “figure 4 htert @entity6 and @entity4 XXXX cell invasion.” ‘question’ originating from reference: with the 60 instances (TINY) humans answered. “2004 , 17 , 250 257 .14967013 c samuni y. ; samuni u. ; Random samples from BIOMRCLARGE where se- goldstein s. the use of cyclic XXXX as hno scavengers .” lected to create LITE and TINY. BIOMRCTINY ‘passage’ containing captions: “figure 2: distal UNK showing high insertion of rectum is used only as a test set; it has no training and into common channel. figure 3: illustration of the cloacal validation subsets. malformation. figure 4: @entity5 showing UNK” We tested on BIOMRCLITE the two deep learn- Table 1: Examples of noisy BIOREAD data. XXXX is ing MRC models that Pappas et al.(2018) had tested the placeholder, and UNK is the ‘unknown’ token. on BIOREADLITE, namely Attention Sum Reader not make sense. Many instances contain passages (AS-READER)(Kadlec et al., 2016) and Attention or questions crossing article sections, or originat- Over Attention Reader (AOA-READER)(Cui et al., ing from the references sections of articles, or they 2017). Experimental results show that AS-READER include captions and footnotes (Table1). Another and AOA-READER perform better on BIOMRC, with the accuracy of AOA-READER reaching 70% com- source of noise is METAMAP, which often misses or mistakenly identifies biomedical entities (e.g., it pared to the corresponding 52% accuracy of Pappas often annotates ‘to’ as the country Togo). et al.(2018), which is a further indication that the new dataset is less noisy or that at least its task In this paper, we introduce BIOMRC, a new is more feasible. We also developed a new BERT- dataset for biomedical MRC that can be viewed based (Devlin et al., 2019) MRC model, the best ver- as an improved version of BIOREAD. To avoid sion of which (SCIBERT-MAX-READER) performs crossing sections, extracting text from references, even better, with its accuracy reaching 80%. We captions, tables etc., we use abstracts and titles of encourage further research on biomedical MRC by biomedical articles as passages and questions, re- making our code and data publicly available, and spectively, which are clearly marked up in PUBMED by creating an on-line leaderboard for BIOMRC.3 data, instead of using the full text of the articles. Using titles and abstracts is a decision that favors 2 Dataset Construction precision over recall. Titles are likely to be re- lated to their abstracts, which reduces the noise-to- Using PUBTATOR, we gathered approx. 25 million signal ratio significantly and makes it less likely to abstracts and their titles. We discarded articles generate irrelevant questions for a passage. We with titles shorter than 15 characters or longer than replace a biomedical entity in each title with a 60 tokens, articles without abstracts, or with ab- placeholder, and we require systems to guess the stracts shorter than 100 characters, or fewer than hidden entity by considering the entities of the 10 sentences. We also removed articles with ab- abstract as candidate answers. Unlike BIOREAD, stracts containing fewer than 5 entity annotations, we use PUBTATOR (Wei et al., 2012), a repository or fewer than 2 or more than 20 distinct biomedi- that provides approximately 25 million abstracts cal entity identifiers. (PUBTATOR assigns the same and their corresponding titles from PUBMED, with identifier to all the synonyms of a biomedical en- 2 multiple annotations. 
We use DNORM’s biomed- tity; e.g., ‘hemorrhagic stroke’ and ‘stroke’ have ical entity annotations, which are more accurate the same identifier ‘MESH:D020521’.) We also than METAMAP’s (Leaman et al., 2013). We also discarded articles containing entities not linked to perform several checks, discussed below, to dis- any of the ontologies used by PUBTATOR,4 or en- card passage-question instances that are too easy, tities linked to multiple ontologies (entities with and we show that the accuracy of experts and non- multiple ids), or entities whose spans overlapped expert humans reaches 85% and 82%, respectively, with those of other entities. We also removed ar- on a sample of 30 instances for each annotator type, ticles with no entities in their titles, and articles which is an indication that the new dataset is indeed with no entities shared by the title and abstract.5 less noisy, or at least that the task is more feasible for humans. Following Pappas et al.(2018), we 3Our code, data, and information about the leader- board will be available at http://nlp.cs.aueb.gr/ release two versions of BIOMRC, LARGE and LITE, publications.html. containing 812k and 100k instances respectively, 4PUBTATOR uses the Open Biological and Biomedical On- tology (OBO) Foundry, which comprises over 60 ontologies. 2Like PUBMED, PUBTATOR is supported by NCBI. Consult: 5A further reason for using the title as the question is that www.ncbi.nlm.nih.gov/research/pubtator/ the entities of the titles are typically mentioned in the abstract.
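To make the filtering criteria above concrete, here is a rough sketch of the article-level checks involved. This is our own paraphrase in code rather than the authors' released script; the Article structure and helper name are hypothetical, and the entity annotations are assumed to come pre-extracted from PUBTATOR.

```python
from dataclasses import dataclass, field

@dataclass
class Article:
    title: str
    abstract: str
    abstract_sentences: list
    abstract_entity_mentions: int        # total entity annotations in the abstract
    title_entity_ids: set = field(default_factory=set)     # distinct PubTator ids in the title
    abstract_entity_ids: set = field(default_factory=set)  # distinct PubTator ids in the abstract

def keep_article(a: Article) -> bool:
    """Approximate reconstruction of the filters described above."""
    if len(a.title) < 15 or len(a.title.split()) > 60:
        return False
    if len(a.abstract) < 100 or len(a.abstract_sentences) < 10:
        return False
    if a.abstract_entity_mentions < 5:
        return False
    if not (2 <= len(a.abstract_entity_ids) <= 20):
        return False
    if not a.title_entity_ids:
        return False
    # The entity hidden in the title must also appear in the abstract,
    # so that the cloze question is answerable from the passage.
    return bool(a.title_entity_ids & a.abstract_entity_ids)
```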

141 BACKGROUND: Most brain metastases arise from @entity0 . Few studies compare the brain regions they involve, their numbers and intrinsic attributes. METHODS: Records of all @entity1 referred to Radiation Oncology for treatment of symptomatic brain metastases were obtained. Computed tomography (n = 56) or magnetic resonance imaging (n = 72) brain scans were reviewed. RESULTS: Data from 68 breast and 62 @entity2 @entity1 were compared. Brain metastases presented earlier in the course of the lung than of the @entity0 @entity1 (p = 0.001). There were more metastases in the Passage cerebral hemispheres of the breast than of the @entity2 @entity1 (p = 0.014). More @entity0 @entity1 had cerebellar metastases (p = 0.001). The number of cerebral hemisphere metastases and presence of cerebellar metastases were positively correlated (p = 0.001). The prevalence of at least one @entity3 surrounded with > 2 cm of @entity4 was greater for the lung than for the breast @entity1 (p = 0.019). The @entity5 type, rather than the scanning method, correlated with differences between these variables. CONCLUSIONS: Brain metastases from lung occur earlier, are more @entity4 , but fewer in number than those from @entity0 . Cerebellar brain metastases are more frequent in @entity0 . @entity0 : [‘breast and lung cancer’] ; @entity1 : [‘patients’] ; @entity2 : [‘lung cancer’] ; Candidates @entity3 : [‘metastasis’] ; @entity4 : [‘edematous’, ‘edema’] ; @entity5 : [‘primary tumor’] Question Attributes of brain metastases from XXXX. Answer @entity0 : [‘breast and lung cancer’]

Figure 1: Example passage-question instance of BIOMRC. The passage is the abstract of an article, with biomedical entities replaced by @entityN pseudo-identifiers. The original entity names are shown in square brackets. Both ‘edematous’ and ‘edema’ are replaced by ‘@entity4’, because PUBTATOR considers them synonyms. The question is the title of the article, with a biomedical entity replaced by XXXX. @entity0 is the correct answer.
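For illustration, an instance like the one in Figure 1 can be held in a small record. The field names below are assumptions used for the examples in this section, not necessarily the schema of the released BIOMRC files.

# Illustrative representation of the Figure 1 instance; field names are
# assumptions, not necessarily the format of the released dataset.
instance = {
    "passage": "BACKGROUND: Most brain metastases arise from @entity0 . ...",
    "candidates": {
        "@entity0": ["breast and lung cancer"],
        "@entity1": ["patients"],
        "@entity2": ["lung cancer"],
        "@entity3": ["metastasis"],
        "@entity4": ["edematous", "edema"],
        "@entity5": ["primary tumor"],
    },
    "question": "Attributes of brain metastases from XXXX.",
    "answer": "@entity0",
}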

BIOMRCLARGE          Training   Development  Test     Total
Total Instances      700,000    50,000       62,707   812,707
Avg candidates       6.73       6.68         6.68     6.72
Max candidates       20         20           20       20
Min candidates       2          2            2        2
Avg abstract len.    253.79     257.41       253.70   254.01
Max abstract len.    543        516          511      543
Min abstract len.    57         89           77       57
Avg title len.       13.93      14.28        13.99    13.96
Max title len.       51         46           43       51
Min title len.       3          3            3        3

BIOMRCLITE           Training   Development  Test     Total
Total Instances      87,500     6,250        6,250    100,000
Avg candidates       6.72       6.68         6.65     6.71
Max candidates       20         20           20       20
Min candidates       2          2            2        2
Avg abstract len.    253.78     257.32       255.56   254.11
Max abstract len.    519        500          510      519
Min abstract len.    60         109          103      60
Avg title len.       13.89      14.22        14.09    13.92
Max title len.       49         40           42       49
Min title len.       3          3            3        3

BIOMRCTINY           Setting A  Setting B    Total
Total Instances      30         30           60
Avg candidates       6.60       6.57         6.58
Max candidates       13         11           13
Min candidates       2          3            2
Avg abstract len.    248.13     264.37       256.25
Max abstract len.    371        386          386
Min abstract len.    147        154          147
Avg title len.       14.17      14.70        14.43
Max title len.       21         35           35
Min title len.       6          4            4

Table 2: Statistics of BIOMRCLARGE, LITE, TINY. The questions of the TINY version were answered by humans. All lengths are measured in tokens using a whitespace tokenizer.
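The per-subset statistics of Table 2 follow directly from whitespace tokenization. The helper below is a sketch, assuming instances shaped like the illustrative record shown after Figure 1.

def subset_stats(instances):
    """Compute the Table 2 style statistics for one BIOMRC subset, using a
    whitespace tokenizer; assumes the illustrative instance fields above."""
    cand_counts = [len(x["candidates"]) for x in instances]
    abstract_lens = [len(x["passage"].split()) for x in instances]
    title_lens = [len(x["question"].split()) for x in instances]

    def summarize(values):
        return (sum(values) / len(values), max(values), min(values))

    return {
        "instances": len(instances),
        "candidates (avg, max, min)": summarize(cand_counts),
        "abstract len (avg, max, min)": summarize(abstract_lens),
        "title len (avg, max, min)": summarize(title_lens),
    }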

Finally, to avoid making the dataset too easy for a system that would always select the entity with the most occurrences in the abstract, we removed a passage-question instance if the most frequent entity of its passage (abstract) was also the answer to the cloze-style question (title with placeholder); if multiple entities had the same top frequency in the passage, the instance was retained. We ended up with approx. 812k passage-question instances, which form BIOMRCLARGE, split into training, development, and test subsets (Table 2). The LITE and TINY versions of BIOMRC are subsets of LARGE.

In all versions of BIOMRC (LARGE, LITE, TINY), the entity identifiers of PUBTATOR are replaced by pseudo-identifiers of the form @entityN (Fig. 1), as in the CNN and Daily Mail datasets (Hermann et al., 2015). We provide all BIOMRC versions in two forms, corresponding to what Pappas et al. (2018) call Settings A and B in BIOREAD.6 In Setting A, each pseudo-identifier has a global scope, meaning that each biomedical entity has a unique pseudo-identifier in the whole dataset. This allows a system to learn information about the entity represented by a pseudo-identifier from all the occurrences of the pseudo-identifier in the training set. For example, after seeing the same pseudo-identifier multiple times a model may learn that it stands for a drug, or that a particular pseudo-identifier tends to neighbor with specific words. Then, much like a language model, a system may guess the pseudo-identifier that should fill in the placeholder even without the passage, or at least it may infer a prior probability for each candidate answer. In contrast, Setting B uses a local scope, i.e., it restarts the numbering of the pseudo-identifiers (from @entity0) anew in each passage-question instance. This forces the models to rely only on information about the entities that can be inferred from the particular passage and question. This corresponds to a non-expert answering the question, who does not have any prior knowledge of the biomedical entities.

Table 2 provides statistics on BIOMRC. In TINY, we use 30 different passage-question instances in Settings A and B, because in both settings we asked the same humans to answer the questions, and we did not want them to remember instances from one setting to the other. In LARGE and LITE, the instances are the same across the two settings, apart from the numbering of the entity identifiers.

6 Pappas et al. (2018) actually call 'option a' and 'option b' our Setting B and A, respectively.
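The only difference between the two settings is therefore how PUBTATOR identifiers are mapped to @entityN pseudo-identifiers. The sketch below is one way to realize this, assuming each instance lists its PUBTATOR identifiers under a hypothetical entity_ids field; it is an illustration, not the authors' code.

def anonymize(instances, setting="A"):
    """Map PubTator identifiers (e.g. 'MESH:D020521') to @entityN pseudo-ids.
    Setting A shares one global mapping across the whole dataset; Setting B
    restarts the numbering from @entity0 in every passage-question instance."""
    global_map = {}
    for inst in instances:
        mapping = {} if setting == "B" else global_map
        for pubtator_id in inst["entity_ids"]:
            if pubtator_id not in mapping:
                mapping[pubtator_id] = f"@entity{len(mapping)}"
        inst["pseudo_ids"] = {pid: mapping[pid] for pid in inst["entity_ids"]}
    return instances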

3 Experiments and Results

We experimented only on BIOMRCLITE and TINY, since we did not have the computational resources to train the neural models we considered on the LARGE version of BIOMRC. Pappas et al. (2018) also reported experimental results only on a LITE version of their BIOREAD dataset. We hope that others may be able to experiment on BIOMRCLARGE, and we make our code available, as already noted.

3.1 Methods

We experimented with the four basic baselines (BASE1–4) that Pappas et al. (2018) used in BIOREAD, the two neural MRC models used by the same authors, AS-READER (Kadlec et al., 2016) and AOA-READER (Cui et al., 2017), and a BERT-based (Devlin et al., 2019) model we developed.

Basic baselines: BASE1, 2, 3 return the first, last, and the entity that occurs most frequently in the passage (or randomly one of the entities with the same highest frequency, if multiple exist), respectively. Since in BIOMRC the correct answer is never (by construction) the most frequent entity of the passage, unless there are multiple entities with the same highest frequency, BASE3 performs poorly. Hence, we also include a variant, BASE3+, which randomly selects one of the entities of the passage with the same highest frequency, if multiple exist, otherwise it selects the entity with the second highest frequency. BASE4 extracts all the token n-grams from the passage that include an entity identifier (@entityN), and all the n-grams from the question that include the placeholder (XXXX).7 Then for each candidate answer (entity identifier), it counts the tokens shared between the n-grams that include the candidate and the n-grams that include the placeholder. The candidate with the most shared tokens is selected. These baselines are used to check that the questions cannot be answered by simplistic heuristics (Chen et al., 2016).

Neural baselines: We use the same implementations of AS-READER (Kadlec et al., 2016) and AOA-READER (Cui et al., 2017) as Pappas et al. (2018), who also provide short descriptions of these neural models, not provided here to save space. The hyper-parameters of both methods were tuned on the development set of BIOMRCLITE.

BERT-based model: We use SCIBERT (Beltagy et al., 2019), a pre-trained BERT (Devlin et al., 2019) model for scientific text. SCIBERT is pre-trained on 1.14 million articles from Semantic Scholar,8 of which 82% (935k) are biomedical and the rest come from computer science. For each passage-question instance, we split the passage into sentences using NLTK (Bird et al., 2009). For each sentence, we concatenate it (using BERT's [SEP] token) with the question, after replacing the XXXX with BERT's [MASK] token, and we feed the concatenation to SCIBERT (Fig. 2). We collect SCIBERT's top-level vector representations of the entity identifiers (@entityN) of the sentence and of [MASK].9 For each entity of the sentence, we concatenate its top-level representation with that of [MASK], and we feed them to a Multi-Layer Perceptron (MLP) to obtain a score for the particular entity (candidate answer). We thus obtain a score for all the entities of the passage. If an entity occurs multiple times in the passage, we take the sum or the maximum of the scores of its occurrences. In both cases, a softmax is then applied to the scores of all the entities, and the entity with the maximum score is selected as the answer. We call this model SCIBERT-SUM-READER or SCIBERT-MAX-READER, depending on how it aggregates the scores of multiple occurrences of the same entity. SCIBERT-SUM-READER is closer to AS-READER and AOA-READER, which also sum the scores of multiple occurrences of the same entity. This summing aggregation, however, favors entities with several occurrences in the passage, even if the scores of all the occurrences are low. Our experiments indicate that SCIBERT-MAX-READER performs better. In all cases, we only update the parameters of the MLP during training, keeping the parameters of SCIBERT frozen to their pre-trained values to speed up training. With more computing resources, it may be possible to improve the scores of SCIBERT-MAX-READER (and SCIBERT-SUM-READER) further by fine-tuning SCIBERT on BIOMRC training data.

Figure 2: Illustration of our SCIBERT-based models. Each sentence of the passage is concatenated with the question and fed to SCIBERT. The top-level embedding produced by SCIBERT for the first sub-token of each candidate answer is concatenated with the top-level embedding of [MASK] (which replaces the placeholder XXXX) of the question, and they are fed to an MLP, which produces the score of the candidate answer. In SCIBERT-SUM-READER, the scores of multiple occurrences of the same candidate are summed, whereas SCIBERT-MAX-READER takes their maximum.

7 We tried n = 2, ..., 6 and use n = 3, which gave the best accuracy on the development set of BIOMRCLARGE.
8 https://www.semanticscholar.org/
9 BERT's tokenizer splits the entity identifiers into sub-tokens (Devlin et al., 2019). We use the first one. The top-level token representations of BERT are context-aware, and it is common to use the first or last sub-token of each named entity.
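The scoring and aggregation just described (see also the caption of Figure 2) can be sketched as a small PyTorch module. This is our reading of the description, not the authors' released code: the frozen SCIBERT encoder is assumed to run outside the module and to supply, for each sentence, the top-level vectors of its @entityN tokens and of [MASK].

import torch
import torch.nn as nn

class ScibertReaderHead(nn.Module):
    """Sketch of the trainable MLP scorer of SCIBERT-SUM/MAX-READER; the
    SciBERT encoder itself stays frozen and is not part of this module."""

    def __init__(self, hidden_size, aggregation="max"):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size, 100), nn.ReLU(), nn.Linear(100, 1))
        self.aggregation = aggregation  # "max" -> MAX-READER, "sum" -> SUM-READER

    def forward(self, sentences):
        # sentences: list of (entity_vectors, mask_vector, entity_names), where
        # entity_vectors is [num_entities, hidden] and mask_vector is [hidden],
        # both taken from the frozen SciBERT encoder for one sentence+question
        per_candidate = {}
        for entity_vectors, mask_vector, names in sentences:
            pairs = torch.cat(
                [entity_vectors, mask_vector.expand_as(entity_vectors)], dim=-1)
            scores = self.mlp(pairs).squeeze(-1)
            for name, score in zip(names, scores):
                per_candidate.setdefault(name, []).append(score)
        # aggregate scores of multiple occurrences of the same candidate
        agg = torch.stack([
            torch.stack(v).max() if self.aggregation == "max" else torch.stack(v).sum()
            for v in per_candidate.values()])
        probs = torch.softmax(agg, dim=0)  # answer = candidate with highest probability
        return dict(zip(per_candidate.keys(), probs))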

BIOMRC Lite – Setting A
Method                Train Acc  Dev Acc  Test Acc  Train Time     All Params  Word Embeds  Entity Embeds
BASE1                 37.58      36.38    37.63     0              0           0            0
BASE2                 22.50      23.10    21.73     0              0           0            0
BASE3                 10.03      10.02    10.53     0              0           0            0
BASE3+                44.05      43.28    44.29     0              0           0            0
BASE4                 56.48      57.36    56.50     0              0           0            0
AS-READER             84.63      62.29    62.38     18 x 0.92 hr   12.87M      12.69M       1.59M
AOA-READER            82.51      70.00    69.87     29 x 2.10 hr   12.87M      12.69M       1.59M
SCIBERT-SUM-READER    71.74      71.73    71.28     11 x 4.38 hr   154k        0            0
SCIBERT-MAX-READER    81.38      80.06    79.97     19 x 4.38 hr   154k        0            0

BIOMRC Lite – Setting B
Method                Train Acc  Dev Acc  Test Acc  Train Time     All Params  Word Embeds  Entity Embeds
BASE1                 37.58      36.38    37.63     0              0           0            0
BASE2                 22.50      23.10    21.73     0              0           0            0
BASE3                 10.03      10.02    10.53     0              0           0            0
BASE3+                44.05      43.28    44.29     0              0           0            0
BASE4                 56.48      57.36    56.50     0              0           0            0
AS-READER             79.64      66.19    66.19     18 x 0.65 hr   6.82M       6.66M        0.60k
AOA-READER            84.62      71.63    71.57     36 x 1.82 hr   6.82M       6.66M        0.60k
SCIBERT-SUM-READER    68.92      68.64    68.24     6 x 4.38 hr    154k        0            0
SCIBERT-MAX-READER    81.43      80.21    79.10     15 x 4.38 hr   154k        0            0

Table 3: Training, development, test accuracy (%) on BIOMRCLITE in Settings A (global scope of entity identifiers) and B (local scope), training times (epochs x time per epoch), and number of trainable parameters (total, word embedding parameters, entity identifier embedding parameters). In the lower zone (neural methods), the difference from each accuracy score to the next best is statistically significant (p < 0.02). We used single-tailed Approximate Randomization (Dror et al., 2018), randomly swapping the answers to 50% of the questions for 10k iterations.

3.2 Results on BIOMRCLITE

Table 3 reports the accuracy of all methods on BIOMRCLITE for Settings A and B. In both settings, all the neural models clearly outperform all the basic baselines, with BASE3 (most frequent entity of the passage) performing worst and BASE3+ performing much better, as expected. In both settings, SCIBERT-MAX-READER clearly outperforms all the other methods on both the development and test sets. The performance of SCIBERT-SUM-READER is approximately ten percentage points worse than SCIBERT-MAX-READER's on the development and test sets of both settings, indicating that the superior results of SCIBERT-MAX-READER are to a large extent due to the different aggregation function (max instead of sum) it uses to combine the scores of multiple occurrences of a candidate answer, not to the extensive pre-training of SCIBERT. AOA-READER, which does not employ any pre-training, is competitive to SCIBERT-SUM-READER in Setting A, and performs better than SCIBERT-SUM-READER in Setting B, which again casts doubts on the value of SCIBERT's extensive pre-training. We expect, however, that the performance of the SCIBERT-based models could be improved further by fine-tuning SCIBERT's parameters.

The performance of SCIBERT-SUM-READER is slightly better in Setting A than in Setting B, which might suggest that the model manages to capture global properties of the entity pseudo-identifiers from the entire training set. However, the performance of SCIBERT-MAX-READER is almost the same across the two settings, which contradicts the previous hypothesis. Furthermore, the development and test performance of AS-READER and AOA-READER is higher in Setting B than A, indicating that these two models do not capture global properties of entities well, performing better when forced to consider only the information of the particular passage-question instance. Overall, we see no strong evidence that the models we considered are able to learn global properties of the entities.

In both Settings A and B, AOA-READER performs better than AS-READER, which was expected since it uses a more elaborate attention mechanism, at the expense of taking longer to train (Table 3).10 The two SCIBERT-based models are also competitive in terms of training time, because we only train the MLP (154k parameters) on top of SCIBERT, keeping the parameters of SCIBERT frozen.

10 We trained all models for a maximum of 40 epochs, using early stopping on the dev. set, with patience of 3 epochs.
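The significance test mentioned in the caption of Table 3 is single-tailed Approximate Randomization over per-question correctness. The sketch below follows the standard recipe described by Dror et al. (2018); it is an illustration, not the authors' script.

import random

def approximate_randomization(correct_a, correct_b, iters=10_000, seed=0):
    """One-sided approximate randomization test on two systems' per-question
    correctness vectors (lists of 0/1 of equal length); returns a p-value."""
    rng = random.Random(seed)
    observed = (sum(correct_a) - sum(correct_b)) / len(correct_a)
    count = 0
    for _ in range(iters):
        swapped_a, swapped_b = [], []
        for x, y in zip(correct_a, correct_b):
            if rng.random() < 0.5:  # swap the two systems' answers on ~50% of questions
                x, y = y, x
            swapped_a.append(x)
            swapped_b.append(y)
        if (sum(swapped_a) - sum(swapped_b)) / len(swapped_a) >= observed:
            count += 1
    return (count + 1) / (iters + 1)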

Figure 3: More detailed statistics and results on the development subset of BIOMRCLITE. Number of passage-question instances with 2, 3, ..., 20 candidate answers (top left). Accuracy (%) of the basic baselines (top right). Accuracy (%) of the neural models in Settings A (bottom left) and B (bottom right).

The trainable parameters of AS-READER and AOA-READER are almost double in Setting A compared to Setting B. To some extent, this difference is due to the fact that for both models we learn a word embedding for each @entityN pseudo-identifier, and in Setting A the numbering of the identifiers is not reset for each passage-question instance, leading to many more pseudo-identifiers (31.77k pseudo-identifiers in the vocabulary of Setting A vs. only 20 in Setting B); this accounts for a difference of 1.59M parameters.11 The rest of the difference in total parameters (from Setting A to B) is due to the fact that we tuned the hyper-parameters of each model separately for each setting (A, B), on the corresponding development set. Hyper-parameter tuning was performed separately for each model in each setting, but led to the same numbers of trainable parameters for AS-READER and AOA-READER, because the trainable parameters are dominated by the parameters of the word embeddings. Note that the hyper-parameters of the two SCIBERT-based models (of their MLPs) were very minimally tuned, hence these models may perform even better with more extensive tuning.

AOA-READER was also better than AS-READER in the experiments of Pappas et al. (2018) on a LITE version of their BIOREAD dataset, but the development and test accuracy of AOA-READER in Setting A of BIOREAD was reported to be only 52.41% and 51.19%, respectively (cf. Table 3); in Setting B, it was 50.44% and 49.94%, respectively. The much higher scores of AOA-READER (and AS-READER) on BIOMRCLITE are an indication that the new dataset is less noisy, or that the task is at least more feasible for machines. The results of Pappas et al. (2018) were slightly higher in Setting A than in Setting B, suggesting that AOA-READER was able to benefit from the global scope of entity identifiers, unlike our findings in BIOMRC.12

Figure 3 shows how many passage-question instances of the development subset of BIOMRCLITE have 2, 3, ..., 20 candidate answers (top left), and the corresponding accuracy of the basic baselines (top right), and the neural models (bottom). BASE3+ is the best basic baseline for 2 and 3 candidates, and for 2 candidates it is competitive to the neural models. Overall, however, BASE4 is clearly the best basic baseline, but it is outperformed by all neural models in almost all cases, as in Table 3. SCIBERT-MAX-READER is again the best system in both settings, almost always outperforming the other systems. AS-READER is the worst neural model in almost all cases. AOA-READER is competitive to SCIBERT-SUM-READER in Setting A, and slightly better overall than SCIBERT-SUM-READER in Setting B, as can be seen in Table 3.

11 Hyper-parameter tuning led to 50- and 30-dimensional word embeddings in Settings A, B, respectively. AS-READER and AOA-READER learn word embeddings from the training set, without using pre-trained embeddings.
12 For AS-READER, Pappas et al. (2018) report results only for Setting B: 37.90% development and 42.01% test accuracy on BIOREADLITE. They did not consider BERT-based models.
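The per-candidate-count breakdown of Figure 3 only requires bucketing development instances by their number of candidates. A sketch, assuming per-instance predicted @entityN identifiers are available:

from collections import defaultdict

def accuracy_by_num_candidates(instances, predictions):
    """Group dev instances by candidate count and report accuracy (%) per group.
    `predictions[i]` is the predicted @entityN identifier for instance i."""
    hits, totals = defaultdict(int), defaultdict(int)
    for i, inst in enumerate(instances):
        k = len(inst["candidates"])
        totals[k] += 1
        hits[k] += int(predictions[i] == inst["answer"])
    return {k: 100.0 * hits[k] / totals[k] for k in sorted(totals)}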

Passage: The study enrolled 53 @entity1 (29 males, 24 females) with @entity1576 aged 15-88 years. Most of them were 59 years of age and younger. In 1/3 of the @entity1 the diseases started with symptoms of @entity1729, in 2/3 of them–with pulmonary affection. @entity55 was diagnosed in 50 @entity1 (94.3%), acute @entity3617 –in 3 @entity1. ECG changes were registered in about half of the examinees who had no cardiac complaints. 25 of them had alterations in the end part of the ventricular ECG complex; rhythm and conduction disturbances occurred rarely. Mycoplasmosis @entity1 suffering from @entity741 ( @entity741 ) had stable ECG changes while in those free of @entity741 the changes were short. @entity296 foci were absent. @entity299 comparison in @entity1 with @entity1576 and in other @entity1729 has found that cardiovascular system suffers less in acute mycoplasmosis. These data are useful in differential diagnosis of @entity296.
Candidates: @entity1 : ['patients'] ; @entity1576 : ['respiratory mycoplasmosis'] ; @entity1729 : ['acute respiratory infections', 'acute respiratory viral infection'] ; @entity55 : ['Pneumonia'] ; @entity3617 : ['bronchitis'] ; @entity741 : ['IHD', 'ischemic heart disease'] ; @entity296 : ['myocardial infections', 'Myocardial necrosis'] ; @entity299 : ['Cardiac damage'] .
Question: Cardio-vascular system condition in XXXX.
Expert Human Answers: annotator1: @entity1576; annotator2: @entity1576.
Non-expert Human Answers: annotator1: @entity296; annotator2: @entity296; annotator3: @entity1576.
Systems' Answers: AS-READER: @entity1729; AOA-READER: @entity296; SCIBERT-SUM-READER: @entity1576.

Figure 4: Example from BIOMRCTINY. In Setting A, humans see both the pseudo-identifiers (@entityN) and the original names of the biomedical entities (shown in square brackets). Systems see only the pseudo-identifiers, but the pseudo-identifiers have global scope over all instances, which allows the systems, at least in principle, to learn entity properties from the entire training set. In Setting B, humans no longer see the original names of the entities, and systems see only the pseudo-identifiers with local scope (numbering reset per passage-question instance).

3.3 Results on BIOMRCTINY

Pappas et al. (2018) asked humans (non-experts) to answer 30 questions from BIOREAD in Setting A, and 30 other questions in Setting B. We mirrored their experiment by providing 30 questions (from BIOMRCLITE) to three non-experts (graduate CS students) in Setting A, and 30 other questions in Setting B. We also showed the same questions of each setting to two biomedical experts. As in the experiment of Pappas et al. (2018), in Setting A both the experts and non-experts were also provided with the original names of the biomedical entities (entity names before replacing them with @entityN pseudo-identifiers) to allow them to use prior knowledge; see the top three zones of Fig. 4 for an example. By contrast, in Setting B the original names of the entities were hidden.

Table 4 reports the human and system accuracy scores on BIOMRCTINY. Both experts and non-experts perform better in Setting A, where they can use prior knowledge about the biomedical entities. The gap between experts and non-experts is three points larger in Setting B than in Setting A, presumably because experts can better deduce properties of the entities from the local context. Turning to the system scores, SCIBERT-MAX-READER is again the best system, but again much of its performance is due to the max-aggregation of the scores of multiple occurrences of entities. With sum-aggregation, SCIBERT-SUM-READER obtains exactly the same scores as AOA-READER, which again performs better than AS-READER. (AOA-READER and SCIBERT-SUM-READER make different mistakes, but their scores just happen to be identical because of the small size of TINY.) Unlike our results on BIOMRCLITE, we now see all systems performing better in Setting A compared to Setting B, which suggests they do benefit from the global scope of entity identifiers. Also, SCIBERT-MAX-READER performs better than both experts and non-experts in Setting A, and better than non-experts in Setting B. However, BIOMRCTINY contains only 30 instances in each setting, and hence the results of Table 4 are less reliable than those from BIOMRCLITE (Table 3).

In the corresponding experiments of Pappas et al. (2018), which were conducted in Setting B only, the average accuracy of the (non-expert) humans was 68.01%, but the humans were also allowed not to answer (when clueless), and unanswered questions were excluded from accuracy. On average, they did not answer 21.11% of the questions, hence their accuracy drops to 46.90% if unanswered questions are counted as errors. In our experiment, the humans were also allowed not to answer (when clueless), but we counted unanswered questions as errors, which we believe better reflects human performance. Non-experts answered all questions in Setting A, and did not answer 13.33% (4/30) of the questions on average in Setting B. The decrease in the questions non-experts did not answer (from 21.11% to 13.33%) in Setting B (the only one considered in BIOREAD) again suggests that the new dataset is less noisy, or at least that the task is more feasible for humans, even when the names of the entities are hidden. Experts did not answer 2.5% (0.75/30) and 1.67% (0.5/30) of the questions on average in Settings A and B, respectively.

Method                Setting A  Setting B
Experts (Avg)         85.00      61.67
Non-Experts (Avg)     81.67      55.56
AS-READER             66.67      46.67
AOA-READER            70.00      56.67
SCIBERT-SUM-READER    70.00      56.67
SCIBERT-MAX-READER    90.00      60.00

Table 4: Accuracy (%) on BIOMRCTINY. Best human and system scores shown in bold.

Annotators (Setting)  Kappa
Experts (A)           70.23
Non-Experts (A)       65.61
Experts (B)           72.30
Non-Experts (B)       47.22

Table 5: Human agreement (Cohen's Kappa, %) on BIOMRCTINY. Avg. pairwise scores for non-experts.

Inter-annotator agreement was also higher for experts than non-experts in our experiment, in both Settings A and B (Table 5). In Setting B, the agreement of non-experts was particularly low (47.22%), possibly because without entity names they had to rely more on the text of the passage and question, which they had trouble understanding. By contrast, the agreement of experts was slightly higher in Setting B than Setting A, possibly because without prior knowledge about the entities, which may differ across experts, they had to rely to a larger extent on the particular text of the passage and question.

4 Related work

Several biomedical MRC datasets exist, but have orders of magnitude fewer questions than BIOMRC (Ben Abacha and Demner-Fushman, 2019) or are not suitable for a cloze-style MRC task (Pampari et al., 2018; Ben Abacha et al., 2019; Zhang et al., 2018). The closest dataset to ours is CLICR (Suster and Daelemans, 2018), a biomedical MRC dataset with cloze-type questions created using full-text articles from BMJ case reports.13 CLICR contains 100k passage-question instances, the same number as BIOMRCLITE, but much fewer than the 812.7k instances of BIOMRCLARGE. Suster et al. used CLAMP (Soysal et al., 2017) to detect biomedical entities and link them to concepts of the UMLS Metathesaurus (Lindberg et al., 1993). Cloze-style questions were created from the 'learning points' (summaries of important information) of the reports, by replacing biomedical entities with placeholders. Suster et al. experimented with the Stanford Reader (Chen et al., 2017) and the Gated-Attention Reader (Dhingra et al., 2017), which perform worse than AOA-READER (Cui et al., 2017).

The QA dataset of BIOASQ (Tsatsaronis et al., 2015) contains questions written by biomedical experts. The gold answers comprise multiple relevant documents per question, relevant snippets from the documents, exact answers in the form of entities, as well as reference summaries, written by the experts. Creating data of this kind, however, requires significant expertise and time. In the eight years of BIOASQ, only 3,243 questions and gold answers have been created. It would be particularly interesting to explore if larger automatically generated datasets like BIOMRC and CLICR could be used to pre-train models, which could then be fine-tuned for human-generated QA or MRC datasets.

Outside the biomedical domain, several cloze-style open-domain MRC datasets have been created automatically (Hill et al., 2016; Hermann et al., 2015; Dunn et al., 2017; Bajgar et al., 2016), but have been criticized for containing questions that can be answered by simple heuristics like our basic baselines (Chen et al., 2016). There are also several large open-domain MRC datasets annotated by humans (Kwiatkowski et al., 2019; Rajpurkar et al., 2016, 2018; Trischler et al., 2017; Nguyen et al., 2016; Lai et al., 2017). To our knowledge the biggest human annotated corpus is Google's Natural Questions dataset (Kwiatkowski et al., 2019), with approximately 300k human annotated examples. Datasets of this kind require extensive annotation effort, which for open-domain datasets is usually crowd-sourced. Crowd-sourcing, however, is much more difficult for biomedical datasets, because of the required expertise of the annotators.

5 Conclusions and Future Work

We introduced BIOMRC, a large-scale cloze-style biomedical MRC dataset. Care was taken to reduce noise, compared to the previous BIOREAD dataset of Pappas et al. (2018). Experiments showed that BIOMRC's questions cannot be answered well by simple heuristics, and that two neural MRC models that had been tested on BIOREAD perform much better on BIOMRC, indicating that the new dataset is indeed less noisy or at least that its task is more feasible. Human performance was also higher on a sample of BIOMRC compared to BIOREAD, and biomedical experts performed even better. We also developed a new BERT-based model, the best version of which outperformed all other methods tested, reaching or surpassing the accuracy of biomedical experts in some experiments. We make BIOMRC available in three different sizes, also releasing our code, and providing a leaderboard.

We plan to tune more extensively the BERT-based model to further improve its efficiency, and to investigate if some of its techniques (mostly its max-aggregation, but also using sub-tokens) can also benefit the other neural models we considered. We also plan to experiment with other MRC models that recently performed particularly well on open-domain MRC datasets (Zhang et al., 2020). Finally, we aim to explore if pre-training neural models on BIOMRC is beneficial in human-generated biomedical datasets (Tsatsaronis et al., 2015).

13 https://casereports.bmj.com/

Acknowledgments

We are most grateful to I. Almirantis, S. Kotitsas, V. Kougia, A. Nentidis, S. Xenouleas, who participated in the human evaluation with BIOMRCTINY.

References

Alan R. Aronson and François-Michel Lang. 2010. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3):229–236.

Ondrej Bajgar, Rudolf Kadlec, and Jan Kleindienst. 2016. Embracing data abundance: BookTest Dataset for Reading Comprehension. CoRR.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: Pretrained Language Model for Scientific Text. In EMNLP.

Asma Ben Abacha and Dina Demner-Fushman. 2019. A question-entailment approach to question answering. BMC Bioinformatics, 20:511.

Asma Ben Abacha, Chaitanya Shivade, and Dina Demner-Fushman. 2019. Overview of the MEDIQA 2019 Shared Task on Textual Inference, Question Entailment and Question Answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 370–379, Florence, Italy.

Steven Bird, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python. O'Reilly Media Inc.

Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2358–2367, Berlin, Germany.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada.

Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2017. Attention-over-Attention Neural Networks for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 593–602, Vancouver, Canada.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.

Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2017. Gated-Attention Readers for Text Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1832–1846, Vancouver, Canada.

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The Hitchhiker's Guide to Testing Statistical Significance in Natural Language Processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383–1392, Melbourne, Australia.

Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Güney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. CoRR.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching Machines to Read and Comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, pages 1693–1701, Cambridge, MA, USA.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations. CoRR.

Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. 2016. Text Understanding with the Attention Sum Reader Network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 908–918, Berlin, Germany.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics, 7:453–466.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding Comprehension Dataset From Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark.

Robert Leaman, Rezarta Islamaj Doğan, and Zhiyong Lu. 2013. DNorm: disease name normalization with pairwise learning to rank. Bioinformatics, 29(22):2909–2917.

Donald A. B. Lindberg, Betsy L. Humphreys, and Alexa T. McCray. 1993. The Unified Medical Language System. Yearbook of Medical Informatics, 1:41–51.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. ArXiv.

Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrQA: A Large Corpus for Question Answering on Electronic Medical Records. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2357–2368, Brussels, Belgium.

Dimitris Pappas, Ion Androutsopoulos, and Haris Papageorgiou. 2018. BioRead: A New Dataset for Biomedical Reading Comprehension. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas.

Ergin Soysal, Jingqi Wang, Min Jiang, Yonghui Wu, Serguei Pakhomov, Hongfang Liu, and Hua Xu. 2017. CLAMP – a toolkit for efficiently building customized clinical natural language processing pipelines. Journal of the American Medical Informatics Association, 25(3):331–336.

Simon Šuster and Walter Daelemans. 2018. CliCR: a Dataset of Clinical Case Reports for Machine Reading Comprehension. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1551–1563, New Orleans, Louisiana.

Wilson L. Taylor. 1953. "Cloze Procedure": A New Tool for Measuring Readability. Journalism Quarterly, 30(4):415–433.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A Machine Comprehension Dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200, Vancouver, Canada.

G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M.R. Alvers, D. Weissenborn, A. Krithara, S. Petridis, D. Polychronopoulos, Y. Almirantis, J. Pavlopoulos, N. Baskiotis, P. Gallinari, T. Artieres, A. Ngonga, N. Heino, E. Gaussier, L. Barrio-Alvers, M. Schroeder, I. Androutsopoulos, and G. Paliouras. 2015. An Overview of the BioASQ Large-Scale Biomedical Semantic Indexing and Question Answering Competition. BMC Bioinformatics, 16(138).

Chih-Hsuan Wei, Bethany R. Harris, Donghui Li, Tanya Z. Berardini, Eva Huala, Hung-Yu Kao, and Zhiyong Lu. 2012. Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts. Database, 2012.

Xiao Zhang, Ji Wu, Zhiyang He, Xien Liu, and Ying Su. 2018. Medical Exam Question Answering with Large-scale Reading Comprehension. ArXiv.

Zhuosheng Zhang, Junjie Yang, and Hai Zhao. 2020. Retrospective Reader for Machine Reading Comprehension. ArXiv.

Neural Transduction of Letter Position Dyslexia using an Anagram Matrix Representation

Avi Bleiweiss
BShalem Research
Sunnyvale, CA, USA
[email protected]

Abstract

Research on analyzing reading patterns of dyslectic children has mainly been driven by classifying dyslexia types offline. We contend that a framework to remedy reading errors inline is more far-reaching and will help to further advance our understanding of this impairment. In this paper, we propose a simple and intuitive neural model to reinstate migrating words that transpire in letter position dyslexia, a visual analysis deficit to the encoding of character order within a word. Introduced by the anagram matrix representation of an input verse, the novelty of our work lies in the expansion from one to a two dimensional context window for training. This warrants words that only differ in the disposition of letters to remain interpreted semantically similar in the embedding space. Subject to the apparent constraints of the self-attention transformer architecture, our model achieved a unigram BLEU score of 40.6 on our reconstructed dataset of the Shakespeare sonnets.

1 Introduction

Dyslexia is a reading disorder that is perhaps the most studied of learning disabilities, with an estimated prevalence rate of 5 to 17 percentage points of school-age children in the US (Shaywitz and Shaywitz, 2005; Made by Dyslexia, 2019). Counter to popular belief, dyslexia is not only tied to the visual analysis system of the brain, but also presents a linguistic problem and hence its relevance to natural language processing (NLP). Dyslexia manifests itself in several forms as this work centers on Letter Position Dyslexia (LPD), a selective deficit to encoding the position of a letter within a word while sustaining both letter identification and character binding to words (Friedmann and Gvion, 2001).

A growing body of research advocates heterogeneity of dyslexia causes to poor non-word and irregular-word reading (McArthur et al., 2013). Along the same lines Kezilas et al. (2014) suggest that character transposition effects in LPD are most likely caused by a deficit specific to coding the letter position and is evidenced by an interaction between the orthographic and visual analysis stages of reading. To this end, more recently Marcet et al. (2019) managed to significantly reduce migration errors by either altering letter contrast or presenting letters to the young adult sequentially.

To dyslectic children not all letter positions are equally impaired as medial letters in a word are by far more vulnerable to reading errors compared to the first and last characters of the word (Friedmann and Gvion, 2001). Children with LPD have high migration errors where the transposition of letters in the middle of the word leads to another word, for example, slime–smile or cloud–could. On the other hand, not all reading errors in cases of selective LPD are migratable and are evidenced by words read without a lexical sense e.g., slime–silme. Intriguingly, increasing the word length does not elevate the error rate, and moreover, shorter words that have lexical anagrams are prone to a larger proportion of migration errors compared to longer words that possess no-anagram words. A key observation for LPD is that although words read may share all letters in most of the positions, they still remain semantically unrelated.

Machine learning tools to classify dyslexia use a large corpus of reading errors for training and mainly aim to automate and substitute diagnostic procedures expensively managed by human experts. Lakretz et al. (2015) used both LDA and Naive Bayes models and showed an area under curve (AUC) performance of about 0.8 that exceeded the quality of clinician-rendered labels. In their study, Rello and Ballesteros (2015) proposed a statistical model that predicts dyslectic readers using eye tracking measures. Employing an SVM-based binary classifier, they achieved about 80% accuracy.


Instead, our approach applies deep learning to the task of restoring LPD inline that we further formulate as a sequence transduction problem. Thus, given an input verse that contains shuffled-letter words identified as transpositional errors, the objective of our neural model is to predict the originating unshuffled words. We use language similarity between predicted verses and ground-truth target text-sequences to quantitatively evaluate our model. Our main contribution is a concise representation of the input verse that scales up to moderate an exhaustive set of LPD permutable data.

2 Anagram Matrix

Using a colon notation, we denote an input verse to our model as a text sequence $w_{1:n} = (w_1, \ldots, w_n)$ of $n$ words interchangeably with $n$ collections of letters $l_{1:n} = (l^{(1)}_{1:|w_1|}, \ldots, l^{(n)}_{1:|w_n|})$. We generate migrated word patterns synthetically by anchoring the first and last character of each word and randomly permuting the position of the inner letters $(l^{(1)}_{2:|w_1|-1}, \ldots, l^{(n)}_{2:|w_n|-1})$. Thus given a word with a character length $|l^{(i)}|$, the number of possible unique transpositions for each word follows $t_{1:n} = (|l^{(1)}|!, \ldots, |l^{(n)}|!)$. Next, we extract a migration amplification factor $k = \arg\max_{i=1}^{n} t_i$ that we apply to each word in an input verse independently and form the sequence $m_{1:k} = (m_1, \ldots, m_k)$. Word length commonly used in experiments of previous LPD studies averages five letters and ranges from four to seven letters long, hence migrating to feasible 2, 6, 24, and 120 letter substitutions, respectively. We note that words with 1, 2, or 3 letters are held intact and are not migratable.

To address the inherent semantic unrelatedness between transpositioned words, we define a two-dimensional migration-verse array in the form of an anagram matrix $A = [m^{(1)}_{1:k}; \ldots; m^{(n)}_{1:k}] \in \mathbb{R}^{k \times n}$, where $m^{(i)}$ are column vectors, $[\cdot\,;\cdot]$ is column-bound matrix concatenation, and $k$ and $n$ are the transposition and input verse dimensions, respectively. In Table 1, we render a subset of an anagram matrix drawn from a target verse with a maximal word length of seven letters. The anagram matrix founds an effective context structure for a two-pass embedding training, and our training dataset thus reconstructs on the basis of a collection of anagram matrices with varying dimensions.

when forty winters shall besiege thy brow
wehn fotry wenitrs sahll bseeige thy borw
when froty winrtes slhal begseie thy borw
wehn fotry wrenits slahl begisee thy borw
when forty wtenirs shall begeise thy brow
when froty wtneirs shall bigeese thy brow
when ftory weinrts sahll bgiesee thy borw
wehn frtoy wirtens slhal bisgeee thy borw
wehn froty wterins slahl beeisge thy brow
when froty wtiners shlal beesgie thy borw
wehn frtoy wnetris shall beisege thy borw

Table 1: A snapshot of letter-position migration patterns in the form of an anagram matrix. The unedited version of the text sequence is highlighted on top.

3 LPD Embeddings

Models for learning word vectors train locally on a one-dimensional context window by scanning the entire corpus (Mikolov et al., 2013). Through evaluation on a word analogy task, these models capture linguistic regularities as linear relationships between word embeddings. Mikolov et al. (2013) proposed the skip-gram and continuous-bag-of-words (CBOW) neural architectures with the objective to predict the context of the target word and the target word given its context, respectively. Notably LPD migrating words tend mostly outside the English vocabulary and thus pretrained word embeddings on large corpora are of limited use in our system.1

While the essence of our task is formalized as verse simplification, mending LPD relies on robust discovery of word similarities along both the migration and verse axes of the anagram matrix. To this extent, we reshape the context window to train word embeddings from one to a two-dimensional array. In Figure 1, we show a bi-dimensional context window of size two that is a visible subset drawn from outside context cells of an anagram matrix. Learning word vectors for LPD is a two-pass process in our model. First, the context window $W$ feeds our neural network row-by-row for each transpositioned verse, and then follows by iterating migration vectors $m^{(i)}$ in $W^T$ as inputs.

Figure 1: A two-dimensional context window of size two drawn from outside context cells of an anagram matrix. The center words are shown in gray for both the normal by-row $\{w_{t,t-2}, \ldots, w_{t,t+2}\}$ and transposed column-wise $\{w_{t-2,t}, \ldots, w_{t+2,t}\}$ forms of feeding our neural network.

1 https://nlp.stanford.edu/projects/glove/

4 Model

Our task is inspired by recent advances in neural machine translation (NMT). NMT architectures have shown state-of-the-art results in both the form of a powerful sequence model (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015) and more recently, the cross-attention ConvS2S (Elbayad et al., 2018) and the self-attention based transformer (Vaswani et al., 2017) networks. Given an unintelligible diction of shuffled-letter words, our model aims to output a verse that preserves the semantics of the input, and uses the transformer that outperforms both recurrent and convolutional configurations on many language translation tasks.

Figure 2: Transformer architecture overview (encoder path shown in blue, decoder in brown).

Stacked with several network layers, the transformer architecture only relies on attention mechanisms and entirely dispenses with recurrence (Hochreiter and Schmidhuber, 1997; Chung et al., 2014). In Figure 2, we show a synoptic rendition of the transformer. Its inputs consist of a source verse with potentially letter-transpositioned words $x_i$, and a ground-truth target verse of words with unshuffled letters $y_i$. The transformer encoder and decoder modules largely operate in parallel and provide for a source-to-target attention communication, and a softmax layer operates on the decoder hidden-state outputs to produce predicted words $\hat{y}_i$. In LPD, source and target verses are consistently of the same word count $n$; however, copying tokens from the source over to predictions is inconsequential to the quality of repairing reading errors due to extensive out-of-vocabulary non-migrating words.

5 Setup

To quantitatively evaluate our LPD transduction approach, we chose to mainly report n-gram BLEU precision (Papineni et al., 2002) that defines the language similarity between a predicted text sequence and the ground-truth reference verse. In the BLEU metric, higher scores indicate better performance.

5.1 Corpus

Rather than clinical reading tests, we used the Sonnets by William Shakespeare (Shakespeare, 1997). This is motivated by the apostrophe-rich data that forces left-out letters. The raw dataset comprises 2,154 verses that range from four to fifteen word sequences. In Figure 3, we show the distribution of word length across the dataset, as 18,858 unique tokens are of up to seven-letter long inclusive and take about 62 percentage points of the entire corpus words. To conform to preceding LPD research, we conducted a cleanup step that removes all words of eight letters or more from the dataset. We hypothesize that evaluating LPD on a single word basis lets us perform this step without loss of generality.

Figure 3: Distribution of word letter count across unique tokens of the Shakespeare Sonnets dataset.

We then transform each verse of the Sonnets to an anagram matrix representation A. The verse word with the maximal letters has a set of distinct transpositions while words of lesser letters are shuffled with repetition (Table 1). In Figure 4, we show the distribution of anagram matrices across the entire Shakespeare Sonnets dataset, with a migration amplification factor $k \in \{1, 2, 6, 24, 120\}$ and a cleaned up verse that spans two to thirteen words. Evidently most prominent tiles are of words with seven letters and consist of verse sizes between seven to nine words.
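The synthetic migration process of Section 2 can be sketched as follows. This is an illustration under our own naming, not the author's code, and unlike Table 1 (which enumerates distinct permutations for the longest word) every word here is shuffled with repetition.

import math
import random

def migrate(word, rng):
    """Shuffle the inner letters of a word, anchoring its first and last
    characters; words of up to three letters are not migratable (Section 2)."""
    if len(word) <= 3:
        return word
    inner = list(word[1:-1])
    rng.shuffle(inner)
    return word[0] + "".join(inner) + word[-1]

def anagram_matrix(verse, seed=0):
    """Build a k x n anagram matrix for one verse, where n is the verse length
    and k is the migration amplification factor, i.e. the largest number of
    inner-letter permutations of any word (2, 6, 24 or 120 for 4-7 letter words)."""
    rng = random.Random(seed)
    words = verse.split()
    k = max(math.factorial(max(len(w) - 2, 1)) for w in words)
    return [[migrate(w, rng) for w in words] for _ in range(k)]

# e.g. anagram_matrix("when forty winters shall besiege thy brow") yields 120 rows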

Concatenating the rows of all the anagram matrices presents a sixtyfold extended shape of our LPD training dataset that has 130,021 text sequences, along with source and target vocabularies of 173,575 and 3,147 tokens, respectively.

Figure 4: Distribution of anagram matrices across the verse collection of the Shakespeare Sonnets dataset.

5.2 Training

We used PyTorch (Paszke et al., 2017) version 1.0 as our deep learning platform for training and inference. PyTorch supports the building of effective neural architectures for NLP task development. We incorporated in our framework the annotated PyTorch implementation of the transformer (Rush, 2018) and modified it to accommodate our LPD dataset. Multi-head attention was configured with $h = 8$ layers and a model size $d_{model} = 512$, and the query, key, and value vectors were set uniformly to $d_{model}/h = 64$. The inner layer of the encoder and the decoder had dimensionality $d_{ff} = 2{,}048$. In Figure 5, we show permuted embeddings retaining input semantics by using our anagram matrix concept. The presence of replicated words in vector space owes to the transformer built-in learned positions of input embeddings. We chose the Adam optimizer (Kingma and Ba, 2014) with a varied learning rate and a fixed model dropout of 0.1, using cross-entropy loss and label smoothing for regularization. Figure 6 reviews epoch-loss progression in training and validating our model.

Figure 5: Aided by using our anagram matrix approach, migrated and non-migrated embeddings shown to preserve unedited input similarity. Presented in seven clusters produced by k-means (R Core Team, 2013).

Figure 6: Epoch-loss progression in training and validation. Loss descent subsides near the seventh epoch.

6 Results

We ran our model inference on a split test set that comprises randomly selected rows sampled from the entire collection of anagram matrices and further excluded from the train set. We postulate that the use of matrix columns along the migration axis are only beneficial for embedding training.

Context Window     BLEU-1  BLEU-2  BLEU-3  BLEU-4
one-dimensional    36.8    20.9    13.0    8.3
two-dimensional    40.6    23.7    14.7    8.9

Table 2: Model performance using n-gram BLEU measures at a corpus level on our augmented Sonnets test-set for repairing letter transpositions. Scores shown are contrasted between the use of one and two dimensional context window for training word embeddings.

In Table 2, we report corpus-level n-gram BLEU scores of our transformer-based model for inline transduction of LPD reading patterns. Uniformly a two-dimensional context window for training embeddings boosts our performance by about ten percentage points on average compared to the one-dimensional window. As expected, BLEU scores decline exponentially when we increase n-gram, from 40.6 for BLEU-1 down to 8.9 for BLEU-4.
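The paper does not state which BLEU implementation was used; NLTK's corpus_bleu is one way to reproduce corpus-level BLEU-1 to BLEU-4 over predicted and reference verses, as sketched below.

from nltk.translate.bleu_score import corpus_bleu

def bleu_n(references, hypotheses, n):
    """Corpus-level BLEU-n (n = 1..4) between predicted and reference verses,
    both given as lists of token lists; all weight is placed on n-grams up to n."""
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    # corpus_bleu expects a list of reference lists per hypothesis
    return 100 * corpus_bleu([[ref] for ref in references], hypotheses, weights=weights)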

While BLEU scores the output by counting n-gram matches with the reference, we also evaluated our model using SARI (Xu et al., 2016), a novel metric that correlates with human judgments and is designed to specifically analyze text simplification models. SARI principally compares system output against both the reference and input verse and returns an arithmetic average of n-gram precisions and recalls of addition, copy, and delete rewrite operations.2 Table 3 summarizes SARI and average BLEU measures of our model. Scores appear fairly correlated with a slight edge in favor of SARI that correctly rewards models like ours which make changes that simplify input verses.

Context Window     SARI   BLEU
one-dimensional    21.2   19.8
two-dimensional    23.7   22.0

Table 3: Model performance using automatic evaluation measures of SARI and BLEU at a corpus level on our augmented Sonnets test-set. Scores are contrasted between the use of one and two dimensional context window for training word embeddings.

2 https://github.com/cocoxu/simplification

The transformer is known to be bound by a fixed-length context and thus tends to split a long context to segments that often ignore semantic boundaries. This led to the conjecture that context fragmentation may impact our model performance adversely. The novel transformer-xl network (Dai et al., 2019), which learns dependencies across subsequences using recurrence, might be the more effective architecture to perform our task.

7 Discussion

To conduct a baseline evaluation of our model, we hand curated a corpus made of LPD screening tests. Targeted screeners are brief performance measures intended to classify at-risk individuals. To the extent of our knowledge, Lakretz et al. (2015) used for their experiments the largest known screener dataset to date that consisted of 196 loose target words in Hebrew. Correspondingly, we assembled a screening corpus of 196 English words that are prone to erroneous reading. In our system, these words are recast into a set of anagram matrices, each however reduced to a vector $\in \mathbb{R}^{k \times 1}$. Further downstream, we represented context-less words as one-hot vectors. As expected, on the task of reinstating screener data our sequence model achieved a fairly low 1-gram BLEU score of 9.2, counter to a nearly 4.4X improvement on the Sonnets dataset, when trained using a 2D context window. Compared to the almost two orders of magnitude larger Sonnets dataset, the screening corpus was too small and thus overfitting our transformer-based neural model. In addition, to effectively exploit our proposed anagram matrix representation, rather than disjoint words we require to train our sequence model on a dataset comprised of verses or sentences that provides essential context for learning embeddings.

In a practical application framework, our proposed model is rated on successful recovery from LPD reading errors that transpire in a text sequence. We envision our model already pretrained on multiple corpora, each extended to a collection of anagram matrices. Every editing instance follows with a dyslectic individual who reads and utters a verse at a time from a text document. Fed to the network, the verse is then inferred by our model that returns an amended text sequence the user can compare side-by-side on their display. It is key for the system we presented to perform responsively.

8 Conclusions

In this paper, we presented word-level neural sentence simplification to aid letter-position dyslectic children. We modeled the task after a monolingual machine translation and showed the representation effectiveness of a two-dimensional context window to boost our model performance. Future avenues of research include using our model in real-world restoration scenarios of LPD, and exploring the efficacy of the transformer-xl architecture to a non language modeling task like ours. We look forward to leverage the exceptional ability of transformer-xl to perform character-level language modeling and improve mending LPD.

Acknowledgments

We would like to thank the anonymous reviewers for their insightful suggestions and feedback.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), San Diego, California.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar.

Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555. http://arxiv.org/abs/1412.3555.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 2978–2988, Florence, Italy.

Maha Elbayad, Laurent Besacier, and Jakob Verbeek. 2018. Pervasive attention: 2D convolutional neural networks for sequence-to-sequence prediction. In Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL), pages 97–107, Brussels, Belgium.

Naama Friedmann and Aviah Gvion. 2001. Letter position dyslexia. Journal of Cognitive Neuropsychology, 18(8):673–696.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yvette Kezilas, Saskia Kohnen, Meredith McKague, and Anne Castles. 2014. The locus of impairment in English developmental letter position dyslexia. Frontiers in Human Neuroscience, 8:1–14.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980. http://arxiv.org/abs/1412.6980.

Yair Lakretz, Gal Chechik, Naama Friedmann, and Michal Rosen-Zvi. 2015. Probabilistic graphical models of dyslexia. In Knowledge Discovery and Data Mining (KDD), pages 1919–1928, Sydney, Australia.

Made by Dyslexia. 2019. Dyslexia in schools: A survey. http://madebydyslexia.org/assets/downloads/Dyslexia In Schools 2019.pdf (2019/9/8).

Ana Marcet, Manuel Perea, Ana Baciero, and Pablo Gómez. 2019. Can letter position encoding be modified by visual perceptual elements? Quarterly Journal of Experimental Psychology, 72(6):1344–1353.

Genevieve McArthur, Saskia Kohnen, Linda Larsen, Kristy Jones, Thushara Anandakumar, Erin Banales, and Anne Castles. 2013. Getting to grips with the heterogeneity of developmental dyslexia. Cognitive Neuropsychology, 30:1–24.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems (NIPS), pages 3111–3119. Curran Associates, Inc.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318, Philadelphia, Pennsylvania.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In Workshop on Autodiff, Advances in Neural Information Processing Systems (NIPS), Long Beach, California.

R Core Team. 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/.

Luz Rello and Miguel Ballesteros. 2015. Detecting readers with dyslexia using machine learning with eye tracking measures. In Web for All Conference, pages 16:1–16:8, Florence, Italy.

Alexander Rush. 2018. The annotated transformer. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 52–60, Melbourne, Australia.

William Shakespeare. 1997. Shakespeare's sonnets. http://www.gutenberg.org/ebooks/1041 (2019/10/24).

Sally E. Shaywitz and Bennett A. Shaywitz. 2005. Dyslexia (specific reading disability). Biological Psychiatry, 57(11):1301–1309.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112. Curran Associates, Inc., Red Hook, NY.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 5998–6008. Curran Associates, Inc., Red Hook, NY.

Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415.

Domain Adaptation and Instance Selection for Disease Syndrome Classification over Veterinary Clinical Notes

Brian Hur1,2  Timothy Baldwin2  Karin Verspoor2,3  Laura Hardefeldt1  James Gilkerson1
1Asia-Pacific Centre for Animal Health  2School of Computing and Information Systems  3Centre for the Digital Transformation of Health
The University of Melbourne, Australia
{b.hur,tbaldwin,karin.verspoor,laura.hardefeldt,jrgilk}@unimelb.edu.au

Abstract History: Examination: Still extremely pruritic. There is no frank blood Identifying the reasons for antibiotic adminis- tration in veterinary records is a critical compo- visible. And does not appear to be nent of understanding antimicrobial usage pat- overt inflammation of skin inside EAC. terns. This informs antimicrobial stewardship Laboratory: Assessment: Much im- programs designed to fight antimicrobial resis- proved but still concnered there might tance, a major health crisis affecting both hu- be some residual pain/infection. This mans and animals in which veterinarians have may be exac by persistent oilinesss an important role to play. We propose a docu- from PMP over the last week. Treat- ment classification approach to determine the reason for administration of a given drug, with ment: Cefovecin 1mg/kg sc Owner particular focus on domain adaptation from will also use advocate; Advised needs one drug to another, and instance selection to to lose weight. To be 7kg Plan: Owner minimize annotation effort. may return to recheck in ten days at completion of cefo duration. 1 Introduction

Microorganisms — such as bacteria, fungi, and Figure 1: Sample clinical note, in which the indication viruses — were a major cause of death until the of use for cefovecin would be EARDISORDER discovery of antibiotics (Demain and Sanchez, 2009). However, antimicrobial resistance (“AMR”) to these drugs has been detected since their intro- is given and the reason — or “indication” — for duction to clinical practice (Rollo et al., 1952), and its use. This data is generally captured within free risen dramatically over the last decade to be con- text clinical records created at the time of consult. sidered an emergent global phenomenon and major The primary objective of this paper is to develop public health problem (Roca et al., 2015). Com- text categorization methods to automatically label panion animals are capable of acquiring and ex- clinical records with the indication for antibiotic changing multidrug-resistant pathogens with hu- use. mans, and may serve as a reservoir of AMR (Lloyd, We perform this research over the VetCompass 2007; Guardabassi et al., 2004; Allen et al., 2010; Australia corpus, a large dataset of veterinary clini- Graveland et al., 2010). In addition, AMR is asso- cal records from over 180 of the 3,222 clinical prac- ciated with worse animal health and welfare out- tices in Australia which contains over 15 million comes in veterinary medicine (Duff et al.; Johnston clinical records and 1.3 billion tokens (McGreevy and Lumsden). “Antimicrobial Stewardship” is et al., 2017). An example of a typical clinical note broadly used to refer to the implementation of a is shown in Figure1. We aim to map the indication program for responsible antimicrobial usage, and for an antimicrobial into a standardized format such has been demonstrated to be an effective means as Veterinary Nomenclature (VeNom) (Brodbelt, of reducing AMR in hospital settings (Arda et al., 2019), and in doing so, facilitate population-scale 2007; Pulcini et al., 2014; Baur et al., 2017; Cis- quantification of antimicrobial usage patterns. neros et al., 2014). A key part of antimicrobial As illustrated in Figure1, the data is domain stewardship is having the ability to monitor antimi- specific, and expert annotators are required to la- crobial usage patterns, including which antibiotic bel the training data. This motivates the use of

156 Proceedings of the BioNLP 2020 workshop, pages 156–166 Online, July 9, 2020 c 2020 Association for Computational Linguistics approaches to minimize the amount of annotation 2 Related Work effort required, with specific interest in adapting models developed for one drug to a novel drug. Clinical coding of medical documents has been previously done with a variety of methods (Kir- Previous analysis of this dataset has focused on itchenko and Cherry, 2011; Goldstein et al., 2007; labeling the antibiotic associated with each clini- Li et al., 2018a). Additionally, classifying dis- cal note (Hur et al., 2020). In that study, it was eases and medications in clinical text has been ad- found that cefovecin along with amoxycillin clavu- dressed in shared tasks for human texts (Uzuner lanate and cephalexin were the top 3 antibiotics et al., 2010). Previous methods have also been ex- used. As cefovecin was the most commonly used plored for extracting the antimicrobials used, out antimicrobial with the most critical significance for of veterinary prescription labels, associated with the development of AMR, it was targeted for addi- the clinical records (Hur et al., 2019), and labeling tional studies to understand the specific indications of diseases in veterinary clinical records (Zhang of use. The indication of use was manually labeled et al., 2019; Nie et al., 2018) as well exploring in 5,008 records. However, there were still over methods for negation of diseases for addressing 79,000 clinical records with instances of cefovecin false positives (Cheng et al., 2017; Kennedy et al., administration that did not have labels, in addition 2019). Our work expands on this work by link- to over 1.1 million other clinical records involving ing the indication of use to an antimicrobial being other antimicrobial drug administrations missing administered for that diagnosis. labels. Contextualized language models have recently gained much popularity due to their ability to Having only a single type of antimicrobial agent greatly improve the representation of texts with labeled causes challenges for training a model to fewer training instances, thereby transferring more classify the indication of use for other antimicro- efficiently between domains (Devlin et al., 2018; bials, as antimicrobials vary in how and why they Howard and Ruder, 2018). Pre-training these lan- are used, with the form of drug administration guage models on large amounts of text data spe- (oral, injected, etc.) and different indications of cific to a given domain, such as clinical records use creating distinct contexts that can be seen as or biomedical literature, has also been shown to sub-domains. Therefore, models that allow for the further improve the performance in biomedical do- transfer of knowledge between the sub-domains mains with unique vocabularies (Alsentzer et al., of the various antimicrobials are required to effec- 2019; Lee et al., 2019). These models can also tively label the indication of use. accomplish many tasks in an unsupervised man- ner. For example, Radford et al.(2019) showed To explore the interaction between learning that free text questions could be fed through a lan- methods and the resource constraints on labeling, guage model and generate the correct answer in we develop models using the complete set of labels many cases. 
In our experiments, we demonstrate we had available, but also models derived using the usefulness of contextualized language models only labels that can be created within two hours, by pre-training BERT on a large set of veterinary following the paradigm of Garrette and Baldridge clinical records, and further explore its usefulness (2013). for domain adaptation through instance selection. Domain adaptation is a task which has a long Specifically, our work explores methods to im- history in NLP (Blitzer et al., 2006; Jiang and Zhai, prove the performance of classifying the indica- 2007; Agirre and De Lacalle, 2008; Daume´ III, tion for an antibiotic administration in veterinary 2007). There has been further work demonstrat- records of dogs and cats. In addition to classi- ing the usefulness of reducing the covariance be- fying the indication of use, we explore how data tween domains through adversarial learning (Li selection can be used to improve the transfer of et al., 2018b). More recently, it has been shown knowledge derived from labeled data of a single that domain adversarial training can be effectively antimicrobial agent to the context of other agents. done using contextualized models, such as BERT, We also release our code, and select pre-trained through using a two-step domain-discriminative models used in this study at: https://github. data selection (Ma et al., 2019). We adapt these com/havocy28/VetBERT. methods to our task to create a more generalizable

[Figure 2: three bar charts — SOURCE (Cefovecin), TARGET Y (Cephalexin), TARGET Z (Amoxycillin Clavulanate) — plotting the % of total number of records per indication label on a log scale (0.1%–100%).]
Top-3 labels per domain: SOURCE — traumatic injury 15.6%, abscess 14.8%, skin disorder 14.1%; TARGET Y — skin disorder 56.2%, prophylaxis 7.6%, ear disorder 6.7%; TARGET Z — traumatic injury 23.1%, prophylaxis 14.4%, abscess 8.7%.

Figure 2: Distribution of labels from the SOURCE and TARGET domains (log scale). The Top-3 labels are noted below each chart. model that can adapt between domains more effec- of which 38 actually occur in our target dataset. tively. Previous experiments have used active learning 3.2 Data sub-domains to improve clinical concept extraction with weak We consider the individual antibiotic agents in our supervision (Kholghi et al., 2016). Our work ex- dataset to be sub-domains, as they are adminis- pands on this work through combining approaches tered differently (e.g. orally vs. injectable), and to domain adaptation and the effective use of a in response to different indications. In our experi- small number of labels through the development of ments, we target cefovecin (injectable), amoxycillin additional instance selection methods. clavulanate (oral or injectable), and cephalexin (oral). In addition, cefovecin and amoxycillin clavu- 3 Dataset lanate are used broadly for many indications, while cephalexin is primarily used for skin infections. 3.1 Creating a set of terms Standardized terminologies such as VeNom and 3.3 Extracting and labeling the data SNOMED (NIH, 2019) have been created for med- A corpus of 5,008 clinical records, where patients ical diagnosis codes. While SNOMED has a vet- had been given cefovecin, were sourced from Vet- erinary extension, VeNom was created specifically Compass Australia using methods previously de- for veterinary clinical text and can be mapped back scribed in Hur et al.(2019). The indication of use to SNOMED, and is also part of the Unified Medi- for cefovecin was then labeled by a veterinarian. cal Language System (UMLS) (Bodenreider, 2004) A subset of 100 of these annotations were la- used widely within human medicine. Therefore, beled by another veterinarian and used to calculate VeNom codes are used here to create labels for the agreement, which was measured as Cohen’s Kappa indication of drug administration (Brodbelt, 2019). = 0.78, with raw agreement of 0.80. An additional The VeNom codes we adopt are not fully com- 105 and 104 records were randomly selected for prehensive; they are a subset of the codes used by each of cephalexin and amoxycillin clavulanate, (O’Neill et al., 2019) which map specific VeNom respectively, and annotated by the same two veteri- codes to more generalized codes. These codes narians. were provided by the Royal College of Veterinary The variance between the distribution of indi- Medicine for this study. In this subset of terms, cations for cefovecin, cephalexin, and amoxycillin specific labels such as EXTRACTIONOFUPPER clavulanate is presented in Figure2. LEFTPREMOLAR 4 are simply mapped to DENTAL An additional set of 3000 unannotated clinical DISORDER. There were a total of 52 of these terms, notes was sampled, comprising 1000 clinical notes

158 for each of the three antibiotics of interest. We Instance Selection use these to train a domain classification filter (to Clinical Notes identify which antimicrobial is administered), and Chosen Based on Antibiotics X, Y, Z for data selection. Any notes with fewer than 5 Z Y tokens were removed from the corpus. X Domain Classifier 3.4 Training and development sets X, Y, Z The training of the indication-of-use classifier was performed using the dataset pertaining to cefovecin, Domain X’ Classifier X’yz based on a 90:10 split of train and development Filter X’ data. In evaluation, we will refer to the develop- ment set as “SOURCE”. Clustering Random Rank The labeled datasets for amoxycillin clavulanate X’ X’ X’yz and cephalexin are used to test cross-domain ac- curacy, and are referred to as “TARGET Y” for cephalexin and “TARGET Z” for amoxycillin clavu- lanate. The test data used for “TARGET Y” and “TARGET Z” were fixed in all tests and strictly dis- joint from any training. Mention The estimated number of records that could be Boundaries Augmentation annotated within two hours was 250, based on the annotation of the three datasets. To assess the set- ting of having only two hours of annotation time, a subset of 250 records was sampled and annotated Figure 3: Outline of the proposed approach. for for training and taken only from the “SOURCE” data according to one of the various instance selec- tion methods described in the Approach section. trained for 3 epochs, while models based on lim- ited training data (see Section 4.3) were trained for 4 Approach 60 epochs. All models were tested with 5 different random seeds, and results averaged across them. In this section we detail our approach, as illustrated Table1 shows the performance of VetBERT in Figure3. and the two baseline methods both in-domain Pretraining (“SOURCE”), and for the two out-of-domain an- In order to fine-tune our model to veterinary clini- timicrobials using the training data from SOURCE. cal notes, we took ClinicalBERT (Alsentzer et al., While the performance of VetBERT exceeded 2019) and repeated the pretraining steps as de- the interannotator agreement of 78% in-domain, scribed by Devlin et al.(2018) using the entire cor- the out-of-domain performance over TARGET Z in pus of 15 million clinical notes from VetCompass particular was substantially less, at 65.4% accuracy. Australia. We refer to this model as “VetBERT”. To improve cross-domain performance, we add in- stance selection and dataset manipulation methods, Training classifiers as described below. A baseline classifier for indication of antibiotic ad- ministration was trained using an LSTM (“LSTM”: 4.1 Instance selection Gers et al.(1999)) with a 100 dimension embed- We hypothesize that filtering out training data that ding layer with 0.3 dropout, implemented in keras is dissimilar to the target domain will improve per- (Chollet et al., 2015). We also use a baseline BERT formance, despite the lower volume of training data. model using BERT-Base (“BERT”), in addition to To this end, we experiment with domain-based in- a model based on VetBERT. Both the BERT and stance selection. VetBERT classifiers were trained using an Adam We model domain similarity using a domain clas- optimizer, maximum of 512 word pieces, batch sification model, trained on the domain (i.e. ad- size of 32, softmax loss, and Learning Rate of 2e- ministered antimicrobial) associated with a given 5. Models trained on the full training set were medical record. Note that this is directly avail-

                 SOURCE       TARGET Y     TARGET Z
LSTM             51.5 ± 3.5   47.4 ± 3.4   29.2 ± 5.0
BERT             73.4 ± 1.1   71.0 ± 1.3   58.1 ± 1.4
VetBERT          80.1 ± 0.7   81.5 ± 2.1   65.4 ± 1.5
VetBERT+A        80.9 ± 0.6   83.4 ± 1.5   68.1 ± 2.1
VetBERT+M        80.5 ± 0.6   80.2 ± 1.3   65.8 ± 2.6
VetBERT+M+A      81.2 ± 0.6   82.1 ± 1.8   66.5 ± 1.7
VetBERT+D        78.3 ± 0.7   79.1 ± 2.6   66.7 ± 1.5
VetBERT+D+A      80.5 ± 0.5   83.1 ± 1.4   68.5 ± 2.2
VetBERT+D+M      78.5 ± 0.8   78.3 ± 3.5   66.7 ± 0.9
VetBERT+D+M+A    80.3 ± 0.4   82.7 ± 2.1   67.5 ± 2.2

Table 1: Predictive accuracy (%) of reason for antimicrobial administration in the SOURCE and TARGET domains, trained on all available source-domain training data. Notation: +D = domain-based instance selection, +M = mention boundary tagging, +A = data augmentation able as an artefact of the dataset construction, and 4.2 Automated dataset manipulation doesn’t require any manual annotation. Specifi- We also explore the use of dataset manipulation, in cally, we identify instances of source domain X two forms: (1) mention boundary tagging; and (2) (cefovecin) for which we have labeled data, which data augmentation. are most similar to instances from target domains Y and Z, i.e., records in which cephalexin or amoxy- 4.2.1 Mention Boundary Tagging cillin clavulanate, respectively, were administered. To sensitize the model to the specific drug of inter- Determination of similarity is based on the proba- est, we add special learnable embedding vectors to bilistic output of a domain classifier over the three the start and end of each antibiotic mention, based domains. In Figure3, we label this subset of the on the findings of Logeswaran et al.(2019) and training data “XYZ0 ”, reflecting the fact that it is a Wu et al.(2019). Similar to Wu et al.(2019), we subset of X similar to Y and Z. This subset of X is used special tokens to mark the boundaries of the then used to train a second classifier focused on the tokens that contained a partial string match for the primary task, namely the reason for administering antibiotic of interest. This allows for the model an antibiotic. to attend to these tokens at every layer of the net- To build the domain classification model, we work while training the classifier, and ideally bet- follow the procedure of Ma et al.(2019), first train- ter generalize across antimicrobials. The partial ing a domain classifier for 1 epoch, based on the string matches were created by identifying strings datasets of 1000 instances each of the three do- that contained the prefixes clav or amoxyclav for mains. We used the same model architecture as amoxycillin clavulanate, ceph, rilex or kflex for the VetBERT model, with a softmax classification cephalexin, and conv or cefov for cefovecin. These layer. This model was then applied to the 5,008 prefixes were sourced from a previous study explor- training instances for cefovecin, which were sorted ing mention detection of antimicrobials (Hur et al., in increasing score over domain X (i.e. in decreas- 2019). We signal the use of mention boundary ing order of similarity to the target domains), the embeddings with “+M” in the results tables. Top-1000, 2000, 3000, or 4000 records were se- 4.2.2 Data augmentation VetBERT lected, and the model was trained over Synonym-based data augmentation has been suc- that subset of the training data. The best results cessfully applied to contexts including word sense were found to occur for 3000 samples. Models disambiguation (Leacock and Chodorow, 1998), with domain-based instance selection are indicated sentiment analysis (Li et al., 2017), text classifica- +D with “ ” in Table1. tion (Wei and Zou, 2019), and argument analysis The domain classifier filtering method results in (Joshi et al., 2018). an improvement for TARGET Z (66.7%), but drop We perform data augmentation on clinical notes in accuracy for TARGET Y (79.1%). by replacing synonyms using WordNet (Fellbaum,

                           SOURCE       TARGET Y     TARGET Z
VetBERT+rank[linear]       74.3 ± 0.2   76.6 ± 3.0   66.9 ± 2.2
VetBERT+rank[linear]+A     75.8 ± 1.3   81.0 ± 2.6   63.7 ± 1.4
VetBERT+rank[linear]+M     73.4 ± 0.9   77.1 ± 1.9   65.9 ± 2.4
VetBERT+rank[linear]+M+A   75.7 ± 0.8   81.0 ± 2.8   63.8 ± 3.5
VetBERT+rank[exp]          68.3 ± 2.1   66.5 ± 2.1   58.1 ± 1.5
VetBERT+rank[exp]+A        76.6 ± 0.3   76.7 ± 2.4   65.4 ± 1.0
VetBERT+rank[exp]+M        68.9 ± 2.0   66.7 ± 1.5   57.9 ± 2.1
VetBERT+rank[exp]+M+A      76.9 ± 0.2   77.3 ± 2.3   64.4 ± 1.5
VetBERT+rank[rand]         73.5 ± 1.8   75.4 ± 2.3   61.9 ± 2.8
VetBERT+rank[rand]+A       74.8 ± 1.3   78.9 ± 3.1   64.2 ± 2.5
VetBERT+rank[rand]+M       73.9 ± 1.2   76.2 ± 2.8   62.1 ± 1.1
VetBERT+rank[rand]+M+A     74.9 ± 0.4   80.6 ± 1.3   63.1 ± 2.6

Table 2: Predictive accuracy (%) of reason for antimicrobial administration over the SOURCE and TARGET domains, trained on 2-hours' worth of labeled data with the three domain similarity selection methods over the top-3000 from X'YZ: random sampling ("+rank[rand]"), modified exponential sampling ("+rank[exp]"), and linear step-wise sampling ("+rank[linear]").
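The "+M" rows above use the mention boundary tagging of Section 4.2.1, in which any token containing one of the antibiotic prefixes (clav, amoxyclav, ceph, rilex, kflex, conv, cefov) is wrapped in special boundary tokens before classification. A minimal sketch of this preprocessing step is given below; the marker strings and the whitespace tokenization are illustrative assumptions rather than the authors' exact implementation, and in practice the markers would also be registered as learnable embeddings in the model vocabulary.

    # Prefixes used to detect antibiotic mentions (Section 4.2.1).
    # The [E1]/[/E1] marker strings are illustrative placeholders.
    ANTIBIOTIC_PREFIXES = ("clav", "amoxyclav", "ceph", "rilex", "kflex", "conv", "cefov")
    START_MARKER, END_MARKER = "[E1]", "[/E1]"

    def tag_mention_boundaries(note: str) -> str:
        """Wrap tokens that partially match an antibiotic prefix in boundary tokens."""
        tagged = []
        for token in note.split():
            if any(p in token.lower() for p in ANTIBIOTIC_PREFIXES):
                tagged.extend([START_MARKER, token, END_MARKER])
            else:
                tagged.append(token)
        return " ".join(tagged)

    print(tag_mention_boundaries("Treatment: Cefovecin 1mg/kg sc, recheck in ten days"))
    # -> Treatment: [E1] Cefovecin [/E1] 1mg/kg sc, recheck in ten days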

                           SOURCE       TARGET Y     TARGET Z
VetBERT+rand               70.9 ± 1.5   76.2 ± 1.6   58.0 ± 2.0
VetBERT+rand+A             69.7 ± 0.4   75.8 ± 1.1   59.6 ± 0.0
VetBERT+rand+M             70.5 ± 0.1   77.4 ± 0.6   57.4 ± 2.4
VetBERT+rand+M+A           69.9 ± 0.9   77.4 ± 0.6   59.6 ± 1.7
VetBERT+rank[linear]       74.3 ± 0.2   76.6 ± 3.0   66.9 ± 2.2
VetBERT+rank[linear]+A     75.8 ± 1.3   81.0 ± 2.6   63.7 ± 1.4
VetBERT+rank[linear]+M     73.4 ± 0.9   77.1 ± 1.9   65.9 ± 2.4
VetBERT+rank[linear]+M+A   75.7 ± 0.8   81.0 ± 2.8   63.8 ± 3.5
VetBERT+cluster            73.4 ± 1.1   68.6 ± 1.3   63.0 ± 2.1
VetBERT+cluster+A          73.9 ± 0.1   75.2 ± 2.7   67.8 ± 0.7
VetBERT+cluster+M          73.3 ± 0.5   66.2 ± 0.7   62.5 ± 1.4
VetBERT+cluster+M+A        72.8 ± 0.6   75.2 ± 0.0   63.5 ± 5.4

Table 3: Predictive accuracy (%) of reason for antimicrobial administration in the SOURCE and TARGET domains, trained on 2-hours' worth of labeled data, with random selection ("+rand"), linear step-wise sampling ("+rank[linear]"; results duplicated from Table 2), and clustering ("+cluster").
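The "+A" rows in Tables 1–3 use the synonym-based data augmentation of Section 4.2.2, which creates up to two additional copies of each labeled clinical note by substituting WordNet synonyms. A rough sketch using NLTK's WordNet interface follows; the number of words replaced per copy is an assumption, as the paper does not specify it.

    import random
    from nltk.corpus import wordnet  # requires nltk.download("wordnet")

    def synonym_augment(note, n_copies=2, n_swaps=3):
        """Return up to n_copies augmented notes with random WordNet synonym swaps."""
        tokens = note.split()
        candidates = [i for i, t in enumerate(tokens) if wordnet.synsets(t)]
        if not candidates:
            return []  # no synonym substitutes: no additional instances generated
        augmented = []
        for _ in range(n_copies):
            new_tokens = list(tokens)
            for i in random.sample(candidates, min(n_swaps, len(candidates))):
                lemmas = {l.name().replace("_", " ")
                          for s in wordnet.synsets(tokens[i]) for l in s.lemmas()}
                lemmas.discard(tokens[i])
                if lemmas:
                    new_tokens[i] = random.choice(sorted(lemmas))
            augmented.append(" ".join(new_tokens))
        return augmented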

2012), based on random sampling. In this way, we and out-of-domain, as seen in Table1. The highest create up to two additional training instances1 in accuracy over the source domain 81.2% was ob- addition to the original instance, potentially tripling tained with both mention boundary tagging and the amount of training data. We signal the use of data augmentation (without instance selection), data augmentation with “+A” in the results tables. while the best out-of-domain results were obtained with data augmentation (with or without instance 4.2.3 Results for dataset augmentation selection). methods Mention boundary tagging and data augmentation 4.3 Instance selection under two generally led to improvements in results both in- annotation-hour constraint 1In the instance of there being no synonym substitutes for any words in the original clinical note, no additional training All results to date have been based on the gener- instances are generated. ous supervision setting of 3000 instances, or ap-

161 proximately 24 hours’ annotation time. One natu- domain instances, and selecting prototypical in- ral question, inspired by the work of Garrette and stances from each cluster. Baldridge(2013) in the context of part-of-speech First, we generate a representation of each tagging in low-resource languages, is whether sim- source-domain clinical note using the pretrained ilar results can be achieved with a more realistic VetBERT model, based on the [CLS] token in budget of expert annotation time. Specifically, we the second-last layer of the model. Next, we cluster assume access to only 2 hours of expert annota- the instances into 250 clusters using k-means++ tor time, which translates to the annotation of 250 (Arthur and Vassilvitskii, 2006), and select the in- clinical notes. We propose three approaches to in- stance closest to the centroid for each cluster. This stance selection under this constraint: (1) domain method is labeled “+cluster” in Table3. similarity selection; and (2) clustering. We contrast Clustering results in the highest accuracy for these with a random selection baseline (“+rand” in TARGET Z of 67.8%, but weaker results for TAR- our results tables). GET Y.
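A sketch of the clustering-based selection just described: each source-domain note is encoded with the pretrained model, the [CLS] vector is taken from the second-last layer, the vectors are clustered into 250 groups with k-means++, and the note nearest each centroid is kept for annotation. The Hugging Face transformers API and the checkpoint path are assumptions for illustration; the released VetBERT code covers the authors' actual setup.

    import numpy as np
    import torch
    from sklearn.cluster import KMeans
    from transformers import AutoModel, AutoTokenizer

    def select_by_clustering(notes, model_name="path/to/vetbert", k=250):
        """Pick one prototypical note per k-means++ cluster of [CLS] embeddings."""
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()

        embeddings = []
        with torch.no_grad():
            for note in notes:
                enc = tokenizer(note, truncation=True, max_length=512, return_tensors="pt")
                hidden = model(**enc).hidden_states[-2]   # second-last layer
                embeddings.append(hidden[0, 0].numpy())   # [CLS] token vector
        X = np.stack(embeddings)

        km = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(X)
        selected = []
        for c in range(k):
            members = np.where(km.labels_ == c)[0]
            dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
            selected.append(int(members[np.argmin(dists)]))
        return selected  # indices of the notes to annotate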

4.3.1 Domain similarity selection 5 Discussion Our first approach is based on the instance selection 5.1 Pretraining Improvements method from Section 4.1, except that we now select only 250 instances from SOURCE for annotation, Pretraining BERT to the veterinary domain us- based on their similarity with the target domain (as ing the VetCompass Australia corpus showed the distinct from the top-3000 instances in Table1). most dramatic improvement in our experiments. That is, we take the top-3000 instances from XYZ0 This was demonstrated by marked improvement and perform additional sub-sampling, in the form over other baselines, without any additional steps of: (a) random sampling (“+rank[rand]”);2 (b) mod- (Table1: VetBERT vs. BERT and LSTM). How- ified exponential sampling (“+rank[exp]”); or (c) ever, even with the pretraining used to create linear step-wise sampling (“+rank[linear]”). VetBERT, there was significant degradation in per- Modified exponential sampling is implemented formance across the domains where there were by mapping 3000 onto an exponential scale of 250 fewer training instances (VetBERT in Table1 vs. steps over the 3000 results, rounding to the near- VetBERT+rand in Table3). est integer, and additionally rounding up in the case that there is a collision with a value earlier 5.2 Sub-domain transfer performance in the series. That is, instead of the (rounded) se- The relative performance over TARGET Z as ries being 0, 0, 0, ..., 2879, 2938, 2999 it becomes compared to TARGET Y when transferring from 0, 1, 2, ..., 2879, 2938, 2999. SOURCE was generally poor (Tables1 and3). This Linear step-wise sampling involves separating could be due to TARGET Y sharing more similar- the domain space evenly, and taking every nth sam- ities with SOURCE, along with the more skewed ple where n = len(N)/x where x is the number class distribution in TARGET Y (Figure2), poten- b c of labeled instances (= 250) and N is the total num- tially making it an easier classification task. More ber of samples (= 3000). analysis is needed to understand this effect. Results for the different instance selection meth- 5.3 Optimizing for two hours of annotation ods are presented in Table2. The best-performing time method is step-wise sampling, achieving out-of- domain accuracy which is competitive with the When optimizing for two hours of annotation time, results from Table1 over 12 times the amount of there were consistent improvements with the in- training data. stance selection methods, compared to random se- lectin (Table3: VetBERT+rand vs. others). 4.3.2 Clustering-based instance selection Our second approach is based on the intuition that 5.4 Dataset manipulation methods the diversity in the training data will optimize per- The results for data augmentation and the ad- formance. We achieve this by clustering the source dition of mention boundary embeddings were not as clear, in that they sometimes resulted in 2Note that this differs from +rand in that it is over the top-3000 instances, whereas +rand is over all 5008 annotated improvements and sometimes did not (Table2 instances. and3: +A and +M vs. others). The clustering
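The three sub-sampling schemes of Section 4.3.1 reduce the 3,000 domain-ranked instances to the 250-instance annotation budget. A compact sketch is shown below; the floor-based step size and the collision round-up follow the description above, but the exact exponential index curve is an assumption, since only its rounding behaviour is specified.

    import random

    def linear_stepwise(ranked, budget=250):
        """Take every nth instance from the ranked list, n = floor(N / budget)."""
        n = len(ranked) // budget
        return ranked[::n][:budget]

    def modified_exponential(ranked, budget=250):
        """Spread the budget over an exponential index scale, rounding to the
        nearest integer and bumping collisions upward (0, 0, 0, ... -> 0, 1, 2, ...)."""
        N = len(ranked)
        indices, used = [], set()
        for i in range(budget):
            idx = round(N ** (i / (budget - 1))) - 1   # assumed mapping onto 0..N-1
            while idx in used:                         # collision with an earlier value
                idx += 1
            used.add(idx)
            indices.append(idx)
        return [ranked[i] for i in indices]

    def random_subsample(ranked, budget=250):
        """Uniform random sample from the (already rank-filtered) instances."""
        return random.sample(ranked, budget)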

162 method generally performed better with data proved through utilization of available resources augmentation and worst with mention boundary such as UMLS or Drugbank to identify clinical use embeddings (Table3: VetBERT+cluster+A guidelines for antibiotics, to allow for training or vs. VetBERT+cluster+M+A and adapting a model with few or no annotations. Ad- VetBERT+cluster+M). ditionally, further work is required to apply these models into a data pipeline to create labels for Vet- 5.5 Limitations Compass data to enable analysis of the key reasons The primary limiting factor was also the motivation for antimicrobial administration in veterinary hos- of this study, namely the difficulty in obtaining suf- pitals across Australia. ficient high-quality annotations to perform accurate analysis of the model performance. We were also Acknowledgments limited in that the instance selection was performed We thank Simon Suster,˘ Afshin Rahimi, and the retrospectively over the 5008 annotated instances, anonymous reviewers for their insightful comments and we were limited to the instances provided for and valuable suggestions. the SOURCE domain, rather than a larger sample This research was undertaken with the assis- that could be obtained from VetCompass. There are tance of information and other resources from also additional domains of data within this corpus the VetCompass Australia consortium under the that should be evaluated, such as records from spe- project “VetCompass Australia: Big Data and Real- cialty practices vs. records from general practices. time Surveillance for Veterinary Science”, which is This was shown to result in significant degrada- supported by the Australian Government through tion of performance by Nie et al.(2018), and is a the Australian Research Council LIEF scheme potential area for future research. (LE160100026).

6 Conclusions and future work References In conclusion, we proposed a range of methods to transfer knowledge derived from labeled data for Eneko Agirre and Oier Lopez De Lacalle. 2008. On ro- bustness and domain adaptation using SVD for word one antimicrobial agent to other agents, consider- sense disambiguation. In Proceedings of the 22nd ing the additional constraint of a limited annotation International Conference on Computational Linguis- resource time of two hours. While the in-domain tics (Coling 2008), pages 17–24. accuracy of 83% exceeds the raw inter-annotator Heather K. Allen, Justin Donato, Helena Huimi Wang, agreement of 80% (Cohen’s Kappa = 0.78) on the Karen A. Cloud-Hansen, Julian Davies, and Jo Han- source domain, transfer to other classes is still sub- delsman. 2010. Call of the wild: antibiotic resis- stantially lower with an average of 76% between tance genes in natural environments. Nature Re- the two classes. This shows that while the accuracy views Microbiology, 8(4):251–259. on classifying diseases is on par with human clas- Emily Alsentzer, John R. Murphy, Willie Boag, Wei- sifications for a single disease, there is still room Hung Weng, Di Jin, Tristan Naumann, and Matthew for improvement on transferability to new data sub- B. A. McDermott. 2019. Publicly Available Clinical domains. BERT Embeddings. The primary question is whether the labels cre- Bilgin Arda, Oguz Resat Sipahi, Tansu Yamazhan, ated are good enough to report the reason for an- Meltem Tasbakan, Husnu Pullukcu, Alper Tunger, tibiotic administration in epidemiological reporting Cagri Buke, and Sercan Ulusoy. 2007. Short-term effect of antibiotic control policy on the usage pat- and antimicrobial stewardship guidelines. While terns and cost of antimicrobials, mortality, noso- the labels for why cefovecin was administered were comial infection rates and antibacterial resistance. better than the current standard of using expert an- Journal of Infection, 55(1):41–48. notations, our results indicate that accuracy varies David Arthur and Sergei Vassilvitskii. 2006. k- substantially depending on the antibiotic being ad- means++: The advantages of careful seeding. Tech- ministered, and testing of the accuracy for each nical report, Stanford. individual antibiotic should be evaluated prior to reporting the results based on labels generated by David Baur, Beryl Primrose Gladstone, Francesco Burkert, Elena Carrara, Federico Foschi, Stefanie any model. Dobele,¨ and Evelina Tacconelli. 2017. Effect of an- In future research, these methods could be im- tibiotic stewardship on the incidence of infection and

163 colonisation with antibiotic-resistant bacteria and North American Chapter of the Association for Com- Clostridium difficile infection: a systematic review putational Linguistics: Human Language Technolo- and meta-analysis. The Lancet Infectious Diseases, gies, pages 138–147, Atlanta, Georgia. 17(9):990–1001. Felix A Gers, Jurgen¨ Schmidhuber, and Fred Cummins. John Blitzer, Ryan McDonald, and Fernando Pereira. 1999. Learning to forget: Continual prediction with 2006. Domain adaptation with structural correspon- LSTM. In Proceedings of the 9th International Con- dence learning. In Proceedings of the 2006 Con- ference on Artificial Neural Networks: ICANN ’99, ference on Empirical Methods in Natural Language pages 850–855. Processing, pages 120–128. Ira Goldstein, Anna Arzumtsyan, and Ozlem¨ Uzuner. Olivier Bodenreider. 2004. The unified medical 2007. Three approaches to automatic assignment language system (UMLS): integrating biomed- of ICD-9-CM codes to radiology reports. In AMIA ical terminology. Nucleic acids research, Annual Symposium Proceedings, volume 2007, page 32(suppl 1):D267–D270. 279. American Medical Informatics Association.

David Brodbelt. 2019. VeNom Coding – VeNom Cod- Haitske Graveland, Jaap A. Wagenaar, Hans Heester- ing Group. http://venomcoding.org/. beek, Dik Mevius, Engeline van Duijkeren, and Dick Heederik. 2010. Methicillin Resistant Staphylococ- Katherine Cheng, Timothy Baldwin, and Karin Ver- cus aureus ST398 in Veal Calf Farming: Human spoor. 2017. Automatic Negation and Speculation MRSA Carriage Related with Animal Antimicrobial Detection in Veterinary Clinical Text. In Proceed- Usage and Farm Hygiene. PLOS ONE, 5(6):e10990. ings of the Australasian Language Technology Asso- Luca Guardabassi, Stefan Schwarz, and David H. ciation Workshop 2017, pages 70–78, Brisbane, Aus- Lloyd. 2004. Pet animals as reservoirs of tralia. antimicrobial-resistant bacteria. Journal of Antimi- Franc¸ois Chollet et al. 2015. Keras. https://keras. crobial Chemotherapy, 54(2):321–332. io. Jeremy Howard and Sebastian Ruder. 2018. Univer- sal language model fine-tuning for text classification. J. M. Cisneros, O. Neth, M. V. Gil-Navarro, J. A. Lepe, In Proceedings of the 56th Annual Meeting of the F. Jimenez-Parrilla,´ E. Cordero, M. J. Rodr´ıguez- Association for Computational Linguistics (Volume Hernandez,´ R. Amaya-Villar, J. Cano, A. Gutierrez-´ 1: Long Papers), pages 328–339, Melbourne, Aus- Pizarraya, E. Garc´ıa-Cabrera, and J. Molina. 2014. tralia. Global impact of an educational antimicrobial stew- ardship programme on prescribing practice in a ter- B. Hur, L. Y. Hardefeldt, K. Verspoor, T. Baldwin, and tiary hospital centre. Clinical Microbiology and In- J. R. Gilkerson. 2019. Using natural language pro- fection, 20(1):82–88. cessing and VetCompass to understand antimicro- bial usage patterns in Australia. Australian Veteri- Hal Daume´ III. 2007. Frustratingly easy domain adap- nary Journal, 97(8):298–300. tation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages Brian A. Hur, Laura Y. Hardefeldt, Karin M. Verspoor, 256–263, Prague, Czech Republic. Timothy Baldwin, and James R. Gilkerson. 2020. Describing the antimicrobial usage patterns of com- Arnold L. Demain and Sergio Sanchez. 2009. Micro- panion animal veterinary practices; free text analysis bial drug discovery: 80 years of progress. The Jour- of more than 4.4 million consultation records. PLOS nal of Antibiotics, 62(1):5–16. ONE, 15(3):1–15. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Jing Jiang and ChengXiang Zhai. 2007. Instance Kristina Toutanova. 2018. BERT: Pre-training weighting for domain adaptation in NLP. In Pro- of Deep Bidirectional Transformers for Language ceedings of the 45th Annual Meeting of the Associa- Understanding. arXiv:1810.04805 [cs]. ArXiv: tion of Computational Linguistics, pages 264–271. 1810.04805. GCA Johnston and JM Lumsden. Antimicrobial sus- A Duff, SK Keane, and Laura Y. Hardefeldt. De- ceptiblity of bacterial isolates from 27 thorough- scriptive study of antimicrobial susceptibilty pat- breds with arytenoid chondropathy. In Proceedings terns from equine septic synovial structures. In Pro- of the 39th Bain Fallon Memorial Lectures, volume ceedings of the 39th Bain Fallon Memorial Lectures, 2017:11. Equine Veterinarians Australia. volume 2017:11. Equine Veterinarians Australia. Anirudh Joshi, Timothy Baldwin, Richard O Sinnott, Christiane Fellbaum. 2012. WordNet. The Encyclope- and Cecile Paris. 2018. UniMelb at SemEval- dia of Applied Linguistics. 
2018 task 12: Generative implication using LSTMs, Siamese networks and semantic representations with Dan Garrette and Jason Baldridge. 2013. Learning a synonym fuzzing. In Proceedings of The 12th Inter- Part-of-Speech Tagger from Two Hours of Annota- national Workshop on Semantic Evaluation, pages tion. In Proceedings of the 2013 Conference of the 1124–1128.

164 Noel Kennedy, Dave C Brodbelt, David B Church, and Deep Learning Approaches for Low-Resource NLP Dan G O’Neill. 2019. Detecting false-positive dis- (DeepLo 2019), pages 76–83, Hong Kong, China. ease references in veterinary clinical notes without manual annotations. NPJ Digital Medicine, 2(1):1– Paul McGreevy, Peter Thomson, Navneet K Dhand, 7. David Raubenheimer, Sophie Masters, Caroline S Mansfield, Timothy Baldwin, Ricardo J Soares Ma- Mahnoosh Kholghi, Laurianne Sitbon, Guido Zuccon, galhaes, Jacquie Rand, Peter Hill, Anne Peaston, and Anthony Nguyen. 2016. Active learning: a James Gilkerson, Martin Combs, Shane Raidal, Pe- step towards automating medical concept extraction. ter Irwin, Peter Irons, Richard Squires, David Brod- Journal of the American Medical Informatics Asso- belt, and Jeremy Hammond. 2017. VetCompass ciation, 23(2):289–296. Australia: A National Big Data Collection System for Veterinary Science. page 15. Svetlana Kiritchenko and Colin Cherry. 2011. Lexically-triggered hidden markov models for Allen Nie, Ashley Zehnder, Rodney L. Page, Arturo L. clinical document coding. In Proceedings of Pineda, Manuel A. Rivas, Carlos D. Bustamante, the 49th Annual Meeting of the Association for and James Zou. 2018. DeepTag: inferring all-cause Computational Linguistics: Human Language diagnoses from clinical notes in under-resourced Technologies-Volume 1, pages 742–751. medical domain. arXiv:1806.10722 [cs]. ArXiv: 1806.10722. Claudia Leacock and Martin Chodorow. 1998. Com- bining local context and WordNet similarity for NIH. 2019. SNOMED CT. word sense identification. WordNet: An electronic lexical database, 49(2):265–283. Dan G. O’Neill, Alison M. Skipper, Jade Kadhim, David B. Church, Dave C. Brodbelt, and Rowena Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, M. A. Packer. 2019. Disorders of Bulldogs under Donghyeon Kim, Sunkyu Kim, Chan Ho So, primary veterinary care in the UK in 2013. PLOS and Jaewoo Kang. 2019. BioBERT: a pre-trained ONE, 14(6):e0217928. biomedical language representation model for biomedical text mining. arXiv:1901.08746 [cs]. C. Pulcini, E. Botelho-Nevers, O. J. Dyar, and S. Har- ArXiv: 1901.08746. barth. 2014. The impact of infectious disease spe- cialists on antibiotic prescribing in hospitals. Clini- Min Li, Zhihui Fei, Min Zeng, Fang-Xiang Wu, Yao- cal Microbiology and Infection, 20(10):963–972. hang Li, Yi Pan, and Jianxin Wang. 2018a. Auto- mated icd-9 coding via a deep learning approach. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, IEEE/ACM transactions on computational biology Dario Amodei, and Ilya Sutskever. 2019. Language and bioinformatics, 16(4):1193–1202. Models are Unsupervised Multitask Learners. Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018b. I. Roca, M. Akova, F. Baquero, J. Carlet, M. Cava- What’s in a domain? learning domain-robust text leri, S. Coenen, J. Cohen, D. Findlay, I. Gyssens, representations using adversarial training. In Pro- O. E. Heure, G. Kahlmeter, H. Kruse, R. Laxmi- ceedings of the 2018 Conference of the North Amer- narayan, E. Liebana,´ L. Lopez-Cerero,´ A. Mac- ican Chapter of the Association for Computational Gowan, M. Martins, J. Rodr´ıguez-Bano,˜ J. M. Linguistics: Human Language Technologies, Vol- Rolain, C. Segovia, B. Sigauque, E. Tacconelli, ume 2 (Short Papers), pages 474–479, New Orleans, E. Wellington, and J. Vila. 2015. The global threat Louisiana. of antimicrobial resistance: science for intervention. New Microbes and New Infections, 6:22–29. Yitong Li, Trevor Cohn, and Timothy Baldwin. 2017. 
Robust training under linguistic adversity. In Pro- I. M. Rollo, J. Williamson, and R. L. Plackett. ceedings of the 15th Conference of the EACL (EACL 1952. Acquired Resistance To Penicillin And 2017), pages 21–27, Valencia, Spain. To Neoarsphenamine In Spirochaeta Recurrentis. British Journal of Pharmacology and Chemother- David H. Lloyd. 2007. Reservoirs of antimicrobial re- apy, 7(1):33–41. sistance in pet animals. Clinical Infectious Diseases: An Official Publication of the Infectious Diseases So- Ozlem¨ Uzuner, Imre Solti, and Eithon Cadag. 2010. ciety of America, 45 Suppl 2:S148–152. Extracting medication information from clinical text. Journal of the American Medical Informatics Asso- Lajanugen Logeswaran, Ming-Wei Chang, Kenton Lee, ciation, 17(5):514–518. Kristina Toutanova, Jacob Devlin, and Honglak Lee. 2019. Zero-Shot Entity Linking by Reading En- Jason Wei and Kai Zou. 2019. EDA: Easy data aug- tity Descriptions. arXiv:1906.07348 [cs]. ArXiv: mentation techniques for boosting performance on 1906.07348. text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natu- Xiaofei Ma, Peng Xu, Zhiguo Wang, Ramesh Nalla- ral Language Processing and the 9th International pati, and Bing Xiang. 2019. Domain Adaptation Joint Conference on Natural Language Processing with BERT-based Domain Classification and Data (EMNLP-IJCNLP), pages 6382–6388, Hong Kong, Selection. In Proceedings of the 2nd Workshop on China.

165 Ledell Wu, Fabio Petroni, Martin Josifoski, Sebas- tian Riedel, and Luke Zettlemoyer. 2019. Zero- shot Entity Linking with Dense Entity Retrieval. arXiv:1911.03814 [cs]. ArXiv: 1911.03814. Yuhui Zhang, Allen Nie, Ashley Zehnder, Rodney L. Page, and James Zou. 2019. VetTag: improving au- tomated veterinary diagnosis coding via large-scale language modeling. NPJ Digital Medicine, 2(1).

Benchmark and Best Practices for Biomedical Knowledge Graph Embeddings

David Chang1, Ivana Balažević2, Carl Allen2, Daniel Chawla1, Cynthia Brandt1, Richard Andrew Taylor1
1Yale Center for Medical Informatics, Yale University
2School of Informatics, University of Edinburgh, UK
{david.chang, richard.taylor}@yale.edu, {ivana.balazevic, carl.allen}@ed.ac.uk

Abstract hundred terminologies under the Unified Medical Language System (UMLS) (Bodenreider, 2004), Much of biomedical and healthcare data is which provides a metathesaurus that combines mil- encoded in discrete, symbolic form such as lions of biomedical concepts and relations under text and medical codes. There is a wealth of expert-curated biomedical domain knowl- a common ontological framework. The unique edge stored in knowledge bases and ontolo- identifiers assigned to the concepts as well as the gies, but the lack of reliable methods for learn- Resource Release Format (RRF) standard enable ing knowledge representation has limited their interoperability and reliable access to information. usefulness in machine learning applications. The UMLS and the terminologies it encompasses While text-based representation learning has are a crucial resource for biomedical and healthcare significantly improved in recent years through research. advances in natural language processing, at- One of the main obstacles in clinical and biomed- tempts to learn biomedical concept embed- dings so far have been lacking. A recent fam- ical natural language processing (NLP) is the abil- ily of models called knowledge graph embed- ity to effectively represent and incorporate domain dings have shown promising results on general knowledge. A wide range of downstream applica- domain knowledge graphs, and we explore tions such as entity linking, summarization, patient- their capabilities in the biomedical domain. level modeling, and knowledge-grounded language We train several state-of-the-art knowledge models could all benefit from improvements in our graph embedding models on the SNOMED- ability to represent domain knowledge. While re- CT knowledge graph, provide a benchmark with comparison to existing methods and in- cent advances in NLP have dramatically improved depth discussion on best practices, and make textual representation (Alsentzer et al., 2019), at- a case for the importance of leveraging the tempts to learn analogous dense vector represen- multi-relational nature of knowledge graphs tations for biomedical concepts in a terminology for learning biomedical knowledge represen- or knowledge graph (concept embeddings) so far tation. The embeddings, code, and materials have several drawbacks that limit their usability 1 will be made available to the community . and wide-spread adoption. Further, there is cur- rently no established best practice or benchmark 1 Introduction for training and comparing such embeddings. In A vast amount of biomedical domain knowledge is this paper, we explore knowledge graph embedding stored in knowledge bases and ontologies. For ex- (KGE) models as alternatives to existing methods ample, SNOMED Clinical Terms (SNOMED-CT)2 and make the following contributions: is the most widely used clinical terminology in the We train five recent KGE models on world for documentation and reporting in health- • care, containing hundreds of thousands of medical SNOMED-CT and demonstrate their advan- terms and their relations, organized in a polyhierar- tages over previous methods, making a case chical structure. SNOMED-CT can be thought of for the importance of leveraging the multi- as a knowledge graph: a collection of triples con- relational nature of knowledge graphs for sisting of a head entity, a relation, and a tail entity, biomedical knowledge representation. denoted (h, r, t). 
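For concreteness, a knowledge graph in this sense is simply a set of such (h, r, t) triples. The toy sketch below uses made-up SNOMED-style concept names and relation labels purely for illustration; real SNOMED-CT facts use concept identifiers.

    from typing import NamedTuple

    class Triple(NamedTuple):
        head: str
        relation: str
        tail: str

    # Illustrative facts only, not actual SNOMED-CT identifiers or relations.
    graph = {
        Triple("Viral pneumonia", "isa", "Pneumonia"),
        Triple("Pneumonia", "has_finding_site", "Lung structure"),
    }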
SNOMED-CT is one of over a We establish a suite of benchmark tasks to • 1https://github.com/dchang56/snomed kge enable fair comparison across methods and 2https://www.nlm.nih.gov/healthit/snomedct include much-needed discussion on best prac-


tices for working with biomedical knowledge comparison to existing methods and the disregard graphs. for the heterogeneous structure of the knowledge graph substantially limit its significance. We also serve the general KGE community by • Notably, Snomed2Vec (Agarwal et al., 2019) providing benchmarks on a new dataset with applied NE methods on a clinically relevant multi- real-world relevance. relational subset of the SNOMED-CT graph and We make the embeddings, code, and other ma- provided comparisons to previous methods to • terials publicly available and outline several demonstrate that applying NE methods directly on avenues of future work to facilitate progress the graph is more data efficient, yields better em- in the field. beddings, and gives explicit control over the subset of concepts to train on. However, one major limita- 2 Related Work and Background tion of NE approaches is that they relegate relation- ships to mere indicators of connectivity, discarding 2.1 Biomedical concept embeddings the semantically rich information encoded in multi- Early attempts to learn biomedical concept embed- relational, heterogeneous knowledge graphs. dings have applied variants of the skip-gram model We posit that applying KGE methods on a knowl- (Mikolov et al., 2013) on large biomedical or clini- edge graph is more principled and should there- cal corpora. Med2Vec (Choi et al., 2016) learned fore yield better results. We now provide a brief embeddings for 27k ICD-9 codes by incorporating overview of the KGE literature and describe our temporal and co-occurrence information from pa- experiments in Section 3. tient visits. Cui2Vec (Beam et al., 2019) used an extremely large collection of multimodal medical 2.2 Knowledge Graph Embeddings data to train embeddings for nearly 109k concepts under the UMLS. Knowledge graphs are collections of facts in the These corpus-based methods have several draw- form of ordered triples (h, r, t), where entity h backs. First, the corpora are inaccessible due to is related to entity t by relation r. Because knowl- data use agreements, rendering them irreproducible. edge graphs are often incomplete, an ability to infer Second, these methods tend to be data-hungry and unknown facts is a fundamental task (link predic- extremely data inefficient for capturing domain tion). A series of recent KGE models approach link knowledge. In fact, one of the main limitations prediction by learning embeddings of entities and of language models in general is their reliance on relations based on a scoring function that predicts the distributional hypothesis, essentially making a probability that a given triple is a fact. use of mostly co-occurrence level information in RESCAL (Nickel et al., 2011) represents rela- the training corpus (Peters et al., 2019). Third, tions as a bilinear product between subject and they do a poor job of achieving sufficient concept object entity vectors. Although a very expressive coverage: Cui2Vec, despite its enormous training model, RESCAL is prone to overfitting due to the data, was only able to capture 109k concepts out of large number of parameters in the full rank relation over 3 million concepts in the UMLS, drastically matrix, increasing quadratically with the number limiting its downstream usability. of relations in the graph. 
A more recent trend has been to apply network DistMult (Yang et al., 2015) is a special case embedding (NE) methods directly on a knowledge of RESCAL with a diagonal matrix per relation, graph that represents structured domain knowl- reducing overfitting. However, by limiting linear edge. NE methods such as Node2Vec (Grover and transformations on entity embeddings to a stretch, Leskovec, 2016) learn embeddings for nodes in a DistMult cannot model asymmetric relations. network (graph) by applying a variant of the skip- ComplEx (Trouillon et al., 2016) extends Dist- gram model on samples generated using random Mult to the complex domain, enabling it to model walks, and they have shown impressive results on asymmetric relations by introducing complex con- node classification and link prediction tasks on a jugate operations into the scoring function. wide range of network datasets. In the biomedi- SimplE (Kazemi and Poole, 2018) modifies cal domain, CANode2Vec (Kotitsas et al., 2019) Canonical Polyadic (CP) decomposition (Hitch- applied several NE methods on single-relation sub- cock, 1927) to allow two embeddings for each en- sets of the SNOMED-CT graph, but the lack of tity (head and tail) to be learned dependently.
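To make the differences between these models concrete, the sketch below writes out their scoring functions over single triples as plain NumPy operations — DistMult, ComplEx, and SimplE as just described, together with the translational TransE and RotatE introduced in the next paragraphs. This is only an illustration of the formulas from the cited papers; the experiments use GraphVite's GPU implementations.

    import numpy as np

    def distmult(h, r, t):
        """DistMult: <h, r, t>, i.e. RESCAL with a diagonal relation matrix."""
        return np.sum(h * r * t)

    def complex_score(h, r, t):
        """ComplEx: Re(<h, r, conj(t)>) over complex-valued embeddings."""
        return np.real(np.sum(h * r * np.conj(t)))

    def simple_score(h_head, r, t_tail, t_head, r_inv, h_tail):
        """SimplE: average of the CP scores of (h, r, t) and (t, r^-1, h)."""
        return 0.5 * (np.sum(h_head * r * t_tail) + np.sum(t_head * r_inv * h_tail))

    def transe(h, r, t, p=1):
        """TransE: -||h + r - t||_p, treating the relation as a translation."""
        return -np.linalg.norm(h + r - t, ord=p)

    def rotate(h, r_phase, t):
        """RotatE: -||h o r - t|| with r a unit-modulus rotation in the complex plane."""
        return -np.linalg.norm(h * np.exp(1j * r_phase) - t)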

168 A recent model TuckER (Balazeviˇ c´ et al., 2019) there are no unseen entities or relations in the vali- is shown to be a fully expressive, linear model dation and test sets by simply moving them to the that subsumes several tensor factorization based train set. More details and the code used for data approaches including all models described above. preparation are included in the Supplements. TransE (Bordes et al., 2013) is an example of an alternative translational family of KGE models, Descriptions Statistics which regard a relation as a translation (vector off- Entities 293,884 set) from the subject to the object entity vectors. Relation types 170 Translational models have an additive component Facts 2,073,848 in the scoring function, in contrast to the multiplica- - Train 1,965,032 tive scoring functions of bilinear models. - Valid / Test 48,936 / 49,788 RotatE (Sun et al., 2019) extends the notion of Table 1: Statistics of the final SNOMED dataset. translation to rotation in the complex plane, en- abling the modeling of symmetry/antisymmetry, inversion, and composition patterns in knowledge 3.2 Implementation graph relations. Considering the non-trivial size of SNOMED-CT We restrict our experiments to five models due and the importance of scalability and consistent to their available implementation under a common, implementation for running experiments, we use scalable platform (Zhu et al., 2019): TransE, Com- GraphVite (Zhu et al., 2019) for the KGE mod- plEx, DistMult, SimplE, and RotatE. els. GraphVite is a graph embedding framework 3 Experimental Setup that emphasizes scalability, and its speedup relative to existing implementations is well-documented4. 3.1 Data While the backend is written largely in C++, a Given the complexity of the UMLS, we detail our Python interface allows customization. We make preprocessing steps to generate the final dataset. our customized Python code available. We use the We subset the 2019AB version of the UMLS to five models available in GraphVite in our experi- SNOMED_CT_US terminology, taking all active ments: TransE, ComplEx, DistMult, SimplE, and concepts and relations in the MRCONSO.RRF and RotatE. While we restrict our current work to these MRREL.RRF files. We extract semantic type in- models, future work should also consider other formation from MRSTY.RRF and semantic group state-of-the-art models such as TuckER (Balazeviˇ c´ information from the Semantic Network website3 et al., 2019) and MuRP (Balazeviˇ c´ et al., 2019), to filter concepts and relations to 8 broad semantic especially since MuRP is shown to be particularly groups of interest: Anatomy (ANAT), Chemicals effective for graphs with hierarchical structure. Pre- & Drugs (CHEM), Concepts & Ideas (CONC), trained embeddings for Cui2Vec and Snomed2Vec Devices (DEVI), Disorders (DISO), Phenomena were used as provided by the authors, with dimen- (PHEN), Physiology (PHYS), and Procedures sionality 500 and 200, respectively. (PROC). We also exclude specific semantic types All experiments were run on 3 GTX-1080ti GPUs, and final runs took 6 hours on a sin- deemed unnecessary. A full list of the semantic ∼ types included in the dataset and their broader se- gle GPU. Hyperparameters were either tuned on mantic groups can be found in the Supplements. the validation set for each model: margin (4, The resulting list of triples comprises our fi- 6, 8, 10) and learning_rate (5e-4, 1e-4, 5e- nal knowledge graph dataset. 
Note that the 5, 1e-5); set: num_negative (60), dim (512), UMLS includes reciprocal relations (ISA and num_epoch (2000); or took default values from INVERSE_ISA), making the graph bidirectional. GraphVite. The final hyperparameter configuration A random split results in train-to-test leakage, can be found in the Appendix. which can inflate the performance of weaker mod- 3.3 Evaluation and Benchmark els (Dettmers et al., 2018). We fix this by ensuring reciprocal relations are in the same split, not across 3.3.1 KGE Link Prediction splits. Descriptive statistics of the final dataset are A standard evaluation task in the KGE literature shown in Table1. After splitting, we also ensure is link prediction. However, NE methods also use 3https://semanticnetwork.nlm.nih.gov 4https://github.com/DeepGraphLearning/graphVite
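A sketch of the leakage-free split just described: each triple is grouped with its reciprocal (e.g. ISA vs. INVERSE_ISA) and the whole group is assigned to a single split, so the test set never contains the inverse of a training fact. The reciprocal mapping shown is illustrative, and the additional step of moving triples with unseen entities or relations into the train set is omitted; the released data-preparation code covers the full procedure.

    import random

    # Illustrative mapping only; the UMLS defines many more reciprocal pairs.
    RECIPROCAL = {"ISA": "INVERSE_ISA", "INVERSE_ISA": "ISA"}

    def split_without_leakage(triples, ratios=(0.95, 0.025, 0.025), seed=0):
        """Assign whole reciprocal groups to train/valid/test so that a fact and
        its inverse never end up in different splits."""
        def group_key(h, r, t):
            if r in RECIPROCAL:
                return tuple(sorted([(h, r, t), (t, RECIPROCAL[r], h)]))
            return ((h, r, t),)

        groups = {}
        for h, r, t in triples:
            groups.setdefault(group_key(h, r, t), []).append((h, r, t))

        keys = list(groups)
        random.Random(seed).shuffle(keys)
        cut1 = int(ratios[0] * len(keys))
        cut2 = int((ratios[0] + ratios[1]) * len(keys))
        splits = {"train": [], "valid": [], "test": []}
        for i, key in enumerate(keys):
            name = "train" if i < cut1 else "valid" if i < cut2 else "test"
            splits[name].extend(groups[key])
        return splits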

link prediction as a standard evaluation task. While both predict whether two nodes are connected, NE link prediction performs binary classification on a balanced set of positive and negative edges based on the assumption that the graph is complete. In contrast, knowledge graphs are typically assumed incomplete, making link prediction for KGE a ranking-based task in which the model's scoring function is used to rank candidate samples without relying on ground truth negatives. In this paper, link prediction refers to the latter ranking-based KGE method.

Candidate samples are generated for each triple in the test set using all possible entities as the target entity, where the target can be set to head, tail, or both. For example, if the target is tail, the model predicts scores for all possible candidates for the tail entity in (h, r, ?). For a test set with 50k triples and 300k possible unique entities, the model calculates scores for fifteen billion candidate triples. The candidates are filtered to exclude triples seen in the train, validation, and test sets, so that known triples do not affect the ranking and cause false negatives. Several ranking-based metrics are computed based on the sorted scores. Note that SNOMED-CT contains a transitive closure file, which lists explicit transitive closures for the hierarchical relations ISA and INVERSE_ISA (if A ISA B, and B ISA C, then the transitive closure includes A ISA C). This file should be included in the file list used to filter candidates to best enable the model to learn hierarchical structure.

Typical link prediction metrics include Mean Rank (MR), Mean Reciprocal Rank (MRR), and Hits@k (H@k). MR is considered to be sensitive to outliers and unreliable as a metric. Guu et al. (2015) proposed using Mean Quantile (MQ) as a more robust alternative to MR and MRR. We use MQ100 as a more challenging version of MQ that introduces a cut-off at the top 100th ranking, appropriate for the large numbers of possible entities. Link prediction results are reported in Table 2.
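The filtered, ranking-based protocol just described can be sketched as follows. Here score_fn is a hypothetical stand-in for a trained model's scoring function, and the MQ100 handling (candidates ranked outside the top 100 contribute zero) is one possible reading of the cut-off rather than the authors' exact definition.

```python
import numpy as np

def filtered_tail_rank(h, r, t, all_entities, known_triples, score_fn):
    """Rank the gold tail t among all candidates for (h, r, ?), filtering
    out tails already known from train/valid/test (and, for SNOMED-CT,
    the transitive closure file). Higher score means more plausible."""
    candidates = [e for e in all_entities
                  if e == t or (h, r, e) not in known_triples]
    scores = {e: score_fn(h, r, e) for e in candidates}
    rank = 1 + sum(1 for e, s in scores.items() if e != t and s > scores[t])
    return rank, len(candidates)

def ranking_metrics(ranks_and_sizes, k=10, cutoff=100):
    ranks = np.array([r for r, _ in ranks_and_sizes], dtype=float)
    sizes = np.array([n for _, n in ranks_and_sizes], dtype=float)
    denom = np.maximum(sizes - 1.0, 1.0)
    quantile = (sizes - ranks) / denom   # fraction of candidates ranked below gold
    return {
        "MRR": float(np.mean(1.0 / ranks)),
        f"H@{k}": float(np.mean(ranks <= k)),
        "MQ": float(np.mean(quantile)),
        # assumption: ranks outside the top `cutoff` contribute zero
        f"MQ{cutoff}": float(np.mean(np.where(ranks <= cutoff, quantile, 0.0))),
    }
```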
3.3.2 Embedding Evaluation

For fair comparison with existing methods, we perform some of the benchmark tasks for assessing medical concept embeddings proposed by Beam et al. (2019). However, we discuss their methodological flaws in Section 5 and suggest more appropriate evaluation methods.

Since non-KGE methods are not directly comparable on tasks that require both relation and concept embeddings, to compare embeddings across methods we perform entity semantic classification, which requires only concept embeddings.

We generate a dataset for entity classification by taking the intersection of the concepts covered in all (7) models, comprising 39k concepts with 32 unique semantic types and 4 semantic groups. We split the data into train and test sets with a 9:1 ratio, and train a simple linear layer with 0.1 dropout and no further hyperparameter tuning. The single linear layer for classification assesses the linear separability of semantic information in the entity embedding space for each model. Results for semantic type and group classification are reported in Table 3.

4 Visualization

We first discuss the embedding visualizations obtained through LargeVis (Tang et al., 2016), an efficient large-scale dimensionality reduction technique available as an application in GraphVite. Figure 1 shows concept embeddings for RotatE, ComplEx, Snomed2Vec, and Cui2Vec, with colors corresponding to broad semantic groups. Cui2Vec embeddings show structure but not coherent semantic clusters. Snomed2Vec shows tighter groupings of entities, though the clusters are patchy and scattered across the embedding space. ComplEx produces globular clusters centered around the origin, with clearer boundaries between groups. RotatE gives visibly distinct clusters with clear group separation that appear intuitive: entities of the Physiology semantic group (black) overlap heavily with those of Disorders (magenta); also, entities under the Concepts semantic group (red) are relatively scattered, perhaps due to their abstract nature, compared to more concrete entities like Devices (cyan), Anatomy (blue), and Chemicals (green), which form tighter clusters.

Interestingly, the embedding visualizations for the 5 KGE models fall into 2 types: RotatE and TransE produce well-separated clusters while ComplEx, DistMult and SimplE produce globular clusters around the origin. Since the plots for each type appear almost indistinguishable, we show one from each (RotatE and ComplEx). We attribute the characteristic difference between the two model types to the nature of their scoring functions: RotatE and TransE have an additive component while ComplEx, DistMult and SimplE are multiplicative.

Figure 2 shows more fine-grained semantic structure by coloring 5 selected semantic types under

the Procedures semantic group and greying out the rest. We see that RotatE produces subclusters that are also intuitive. Laboratory procedures are well-separated on their own, health care activity and educational activity overlap significantly, and diagnostic procedures and therapeutic or preventative procedures overlap significantly. ComplEx also reveals subclusters with globular shape, and Snomed2Vec captures laboratory procedures well but leaves other types scattered. These observations are consistent across other semantic groups. We include similar visualizations for the Chemicals & Drugs semantic group in the Supplements.

Figure 1: Concept embedding visualization (RotatE, ComplEx, Snomed2Vec, Cui2Vec) by semantic group.

While semantic class information is not the only significant aspect of SNOMED-CT, since the SNOMED-CT graph is largely organized around semantic group and type information, it is promising that embeddings learned (without supervision) preserve it.

5 Results

5.1 Link Prediction

Table 2 shows results for the link prediction task for the 5 KGE models on SNOMED-CT. Having no previous results to compare to, we include performance of TransE and RotatE on two standard KGE benchmark datasets for reference: FB15k-237 (14,541 entities, 237 relations, and 310,116 triples) and WN18RR (40,943 entities, 11 relations, and 93,003 triples).

Model                 MRR    MQ100  H@1    H@10
TransE                .346   .739   .212   .597
ComplEx               .461   .761   .360   .652
DistMult              .420   .752   .309   .626
SimplE                .432   .735   .337   .615
RotatE                .317   .742   .162   .599
TransE (FB15k-237)    .294   -      -      .465
TransE (WN18RR)       .226   -      -      .501
RotatE (FB15k-237)    .338   -      .241   .533
RotatE (WN18RR)       .476   -      .428   .571

Table 2: Link prediction results: for the 5 KGE models on SNOMED-CT (top); and for TransE and RotatE on two standard KGE datasets (Sun et al., 2019) (bottom).

Given that SNOMED-CT is larger and arguably a more complex knowledge graph than the two datasets, the link prediction results suggest that the KGE models learn a reasonable representation of SNOMED-CT. We include sample model outputs for the top 10 entity scores for link prediction in the Supplements.
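For reference, a minimal sketch of the linear probe from Section 3.3.2 (a single linear layer with 0.1 dropout over frozen, pretrained concept embeddings), whose results are discussed next, is given below in PyTorch; the optimizer, learning rate, and epoch count are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Single linear layer with dropout over frozen concept embeddings,
    used to test linear separability of semantic type / group labels."""
    def __init__(self, dim, num_classes, p=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, x):
        return self.fc(self.dropout(x))

def train_probe(X_train, y_train, num_classes, epochs=20, lr=1e-3):
    # X_train: (N, dim) tensor of entity embeddings (kept fixed),
    # y_train: (N,) tensor of semantic type or semantic group labels.
    probe = LinearProbe(X_train.shape[1], num_classes)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):          # simple full-batch training loop
        optimizer.zero_grad()
        loss = loss_fn(probe(X_train), y_train)
        loss.backward()
        optimizer.step()
    return probe
```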

Figure 2: Visualization of selected semantic types under the Procedures semantic group for RotatE, ComplEx, and Snomed2Vec. Semantic types with more than 2,000 entities were subsampled to 1,200 for visibility. Cui2Vec (not shown) was similar to Snomed2Vec but more dispersed.

5.2 Embedding Evaluation and Relation Prediction

Test set accuracy for entity semantic type (STY) and semantic group (SG) classification are reported in Table 3. In accordance with the visualizations of semantic clusters (Figures 1 and 2), the KGE and NE methods perform significantly better than the corpus-based method (Cui2Vec). Notably, TransE and RotatE attain near-perfect accuracy for the broader semantic group classification (4 classes). ComplEx, DistMult, and SimplE perform slightly worse, Snomed2Vec slightly below them, and Cui2Vec falls behind by a significant margin. We see a greater discrepancy in relative performance by model type in semantic type classification (32 classes), in which more fine-grained semantic information is required.

Two advantages of the semantic type and group entity classification tasks are: (i) information is provided by the UMLS, making the task non-proprietary and standardized; (ii) it readily shows whether a model preserves the semantic structure of the ontology, an important aspect of the data. The tasks can also easily be modified for custom data and specific domains, e.g. class labels for genes and proteins relevant to a particular biomedical application can be used in classification to assess how well the model captures relevant domain-specific information.

For comparison to related work, we also examine the benchmark tasks to assess medical concept embeddings based on statistical power and cosine similarity bootstrapping, proposed by Beam et al. (2019). For a given known relationship pair (e.g. x cause_of y), a null distribution of pairwise cosine similarity scores is computed by bootstrapping 10,000 samples of the same semantic category as x and y respectively. The cosine similarity of the observed sample is compared to the 95th percentile of the bootstrap distribution (statistical significance at the 0.05 level). The authors claim that, when applied to a collection of known relationships (causative, comorbidity, etc.), the procedure estimates the fraction of true relationships discovered given a tolerance for some false positive rate.
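A simplified sketch of this bootstrap procedure is given below; emb maps concepts to their embedding vectors and by_category maps a semantic category to its concepts, and the per-pair handling reflects our reading of the protocol rather than Beam et al.'s exact implementation.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def statistical_power(known_pairs, emb, by_category, n_boot=10000, seed=0):
    """Fraction of known related pairs whose cosine similarity exceeds the
    95th percentile of a null distribution bootstrapped from concepts of
    the same semantic categories (0.05 significance level)."""
    rng = np.random.default_rng(seed)
    discovered = 0
    for x, y, cat_x, cat_y in known_pairs:
        xs = rng.choice(by_category[cat_x], size=n_boot)
        ys = rng.choice(by_category[cat_y], size=n_boot)
        null = np.array([cosine(emb[a], emb[b]) for a, b in zip(xs, ys)])
        threshold = np.percentile(null, 95)
        if cosine(emb[x], emb[y]) > threshold:
            discovered += 1
    return discovered / len(known_pairs)
```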
Following this, we report the statistical power of all 7 models for two of the tasks: semantic type and causative relationships. The former (ST) aims to assess a model's ability to determine if two concepts share the same semantic type. The latter consists of two relation types: cause_of (Co) and causative_agent_of (CA). Results are reported in Table 3.

The cosine similarity bootstrap results, particularly for the causative relationship tasks, illustrate a major flaw in the protocol. While Snomed2Vec and Cui2Vec attain similar statistical powers for CA and Co, we see large discrepancies between the two tasks for the KGE models, especially for ComplEx, DistMult, and SimplE, which produce globular embedding clusters. Examining the dataset, we observe that the cause_of relations occur mostly between concepts within the same semantic group/cluster (e.g. Disorders), whereas the causative_agent_of relations occur between concepts in different semantic groups/clusters (e.g. Chemicals to Disorders). The large discrepancy in CA task results

              Entity Classification    Cosine-Sim Bootstrap    Relation Prediction
Model         SG (4)   STY (32)        ST     CA     Co        MRR    H@1    H@10
Snomed2Vec    .944     .769            .387   .903   .894      -      -      -
Cui2Vec       .891     .673            .416   .584   .559      -      -      -
TransE        .993     .827            .579   .765   .978      .800   .727   .965
ComplEx       .956     .786            .249   .001   .921      .731   .606   .914
DistMult      .971     .794            .275   .014   .971      .734   .569   .946
SimplE        .953     .768            .242   .011   .791      .854   .803   .946
RotatE        .995     .829            .544   .242   .943      .849   .799   .957

Table 3: Results for (i) entity classification of semantic type and group (test accuracy); (ii) selected tasks from (Beam et al., 2019); and (iii) relation prediction. Best results in bold.

for the KGE models is because using cosine similarity embeds the assumption that all related entities are close, regardless of the relation type. The assumption that cosine similarity in the concept embedding space is an appropriate measure of a diverse range of relatedness (a much broader abstraction that subsumes semantic similarity and causality) renders this evaluation protocol unsuitable for assessing a model's ability to capture specific types of relational information in the embeddings. Essentially, all that can be said about the cosine similarity-based procedure is that it assesses how close entities are in that space as measured by cosine distance. It does not reveal the nature of their relationship or what kind of relational information is encoded in the space to begin with.

In contrast, KGE methods explicitly model relations and are better equipped to make inferences about the relational structure of the knowledge graph embeddings. Thus, we propose relation prediction as a standard evaluation task for assessing a model's ability to capture information about relations in the knowledge graph. We simply modify the link prediction task described above to accommodate relation as a target (formulated as (h, ?, t)), generating ranking-based metrics for the model's ability to prioritize the correct relation type given a pair of concepts. This provides a more principled and interpretable way to evaluate the models' relation representations directly based on the model prediction. The last 3 columns of Table 3 report relation prediction metrics for the 5 KGE models. In particular, RotatE and SimplE perform well, attaining around 0.8 Hits@1 and around 0.85 MRR.

We conduct error analysis to gain further insight by categorizing relation types into 6 groups based on the cardinality and homogeneity of their source and target semantic groups. If the set of unique head or tail entities for a relation type in the dataset belongs to only one semantic group, then it has a cardinality of 1, and a cardinality of many otherwise. If the mapping of the source semantic groups to the target semantic groups is one-to-one (e.g. DISO to DISO and CHEM to CHEM), then it is considered homogeneous. We report relation prediction metrics for each of the 6 groups of relation types for RotatE and ComplEx in Table 4.
We see that RotatE gives impressive relation prediction performance for all groups except for many-to-many-homogeneous, a seemingly challenging group of relations containing ambiguous and synonymous relation types, e.g. possibly_equivalent_to, same_as, refers_to, isa. The full list of M-M-hom relations is shown in the Appendix. In contrast, ComplEx struggles with a wider array of relation types, suggesting that it is generally less able to model different types than RotatE. The last two rows under each model show per-relation results for the causative relationships mentioned previously: cause_of and causative_agent_of. RotatE again shows significantly better results compared to ComplEx, in line with its theoretically superior representation capacity (Sun et al., 2019).

6 Discussion

Based on our findings, we recommend the use of KGE models to leverage the multi-relational nature of knowledge graphs for learning biomedical concept and relation embeddings, and of appropriate evaluation tasks such as link prediction, entity classification and relation prediction for fair comparison across models. We also encourage analysis beyond standard validation metrics, e.g. visualization, examining model predictions, reporting metrics for different relation groupings and devis-

ing problem or domain-specific validation tasks. A further promising evaluation task is the triple prediction proposed in (Allen et al., 2019), which we leave for future work. A more ideal way to assess concept embeddings in biomedical NLP applications and patient-level modeling would be to design a suite of benchmark downstream tasks that incorporate the embeddings, but that warrants a rigorous paper of its own and is left for future work.

We believe this paper serves the biomedical NLP community as an introduction to KGEs and their evaluation and analyses, and also the KGE community by providing a potential standard benchmark dataset with real-world relevance.

Relation     MRR    H@1    H@10   Count
ComplEx
1-1-hom      .600   .319   .944   72
M-M-hom      .605   .417   .877   29,028
M-1          .683   .557   .884   2,509
1-M          .738   .640   .916   2,497
1-1          .889   .817   .995   420
M-M          .867   .819   .941   15,044
Co           .706   .662   .779   145
CA           .857   .822   .908   303
RotatE
M-M-hom      .784   .718   .934   29,028
M-M          .973   .944   .992   15,044
M-1          .971   .945   .998   2,509
1-M          .975   .953   .998   2,497
1-1          .985   .959   1.     420
1-1-hom      .972   .976   1.     72
Co           .803   .738   .890   145
CA           .996   .993   1.     303

Table 4: Relation prediction results for RotatE and ComplEx by category of relation type (last two rows relate to causative relation types).
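The per-group numbers in Table 4 can be produced with a sketch along the following lines; score_fn is a hypothetical stand-in for a trained model's scoring function and rel_group maps each relation type to one of the six groups (or to Co / CA).

```python
from collections import defaultdict
import numpy as np

def relation_prediction_by_group(test_triples, relation_types, rel_group, score_fn):
    """For each (h, ?, t) query, rank the gold relation among all relation
    types by the model's score and report MRR / H@1 / H@10 per group."""
    ranks = defaultdict(list)
    for h, r, t in test_triples:
        scores = {rel: score_fn(h, rel, t) for rel in relation_types}
        rank = 1 + sum(1 for rel, s in scores.items()
                       if rel != r and s > scores[r])
        ranks[rel_group[r]].append(rank)
    results = {}
    for group, rs in ranks.items():
        rs = np.array(rs, dtype=float)
        results[group] = {"MRR": float(np.mean(1.0 / rs)),
                          "H@1": float(np.mean(rs == 1)),
                          "H@10": float(np.mean(rs <= 10)),
                          "Count": int(rs.size)}
    return results
```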
7 Conclusion and Future Work

We present results from applying 5 leading KGE models to the SNOMED-CT knowledge graph and compare them to related work through visualizations and evaluation tasks, making a case for the importance of using models that leverage the multi-relational nature of knowledge graphs for learning biomedical knowledge representation. We discuss best practices for working with biomedical knowledge graphs and evaluating the embeddings learned from them, proposing link prediction, entity classification, and relation prediction as standard evaluation tasks. We encourage researchers to engage in further validation through visualizations, error analyses based on model predictions, examining stratified metrics, and devising domain-specific tasks that can assess the usefulness of the embeddings for a given application domain.

There are several immediate avenues of future work. While we focus on the SNOMED-CT dataset and the KGE models implemented in GraphVite, other biomedical terminologies such as the Gene Ontology (The Gene Ontology Consortium, 2018) and RxNorm (Nelson et al., 2011) could be explored, and more recent KGE models, e.g. TuckER (Balažević et al., 2019) and MuRP (Balažević et al., 2019), applied. Additional sources of information could also potentially be incorporated, such as textual descriptions of entities and relations. In preliminary experiments, we initialized entity and relation embeddings with the embeddings of their textual descriptors extracted using Clinical BERT (Alsentzer et al., 2019), but it did not yield gains. This may suggest that the concept and language spaces are substantially different and that strategies to jointly train with linguistic and knowledge graph information require further study. Other sources of information include entity types (e.g. UMLS semantic types) and paths, or multi-hop generalizations of the 1-hop relations (triples) typically used in KGE models (Guu et al., 2015). Notably, CoKE trains contextual knowledge graph embeddings using path-level information under an adapted version of the BERT training paradigm (Wang et al., 2019).

Lastly, the usefulness of biomedical knowledge graph embeddings should be investigated in downstream applications in biomedical NLP such as information extraction, concept normalization and entity linking, computational fact checking, question answering, summarization, and patient trajectory modeling. In particular, entity linkers act as a bottleneck between text and concept spaces, and leveraging KGEs could help develop sophisticated tools to parse existing biomedical and clinical text datasets for concept-level annotations and additional insights. Well-performing entity linkers may then enable training knowledge-grounded large-scale language models like KnowBert (Peters et al., 2019). Overall, methods for learning and incorporating domain-specific knowledge representation are still at an early stage and further discussions are needed.

174 References Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceed- Khushbu Agarwal, Tome Eftimov, Raghavendra Ad- ings of the 22nd ACM SIGKDD International Con- danki, Sutanay Choudhury, Suzanne Tamang, and ference on Knowledge Discovery and Data Mining. Robert Rallo. 2019. Snomed2Vec: Random Walk and Poincare Embeddings of a Clinical Knowledge Kelvin Guu, John Miller, and Percy Liang. 2015. Base for Healthcare Analytics. arXiv:1907.08650 Traversing knowledge graphs in vector space. In [cs, stat]. ArXiv: 1907.08650. Proceedings of the 2015 Conference on Empirical Carl Allen, Ivana Balazevic, and Timothy M. Methods in Natural Language Processing, pages Hospedales. 2019. On Understanding Knowledge 318–327, Lisbon, Portugal. Association for Compu- Graph Representation. arXiv:1909.11611 [cs, stat]. tational Linguistics. Emily Alsentzer, John Murphy, William Boag, Wei- Frank L. Hitchcock. 1927. The expression of a tensor Hung Weng, Di Jin, Tristan Naumann, and Matthew or a polyadic as a sum of products. Journal of Math- McDermott. 2019. Publicly available clinical BERT ematics and Physics, 6(1-4):164–189. embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72– Seyed Kazemi and David Poole. 2018. Simple embed- 78, Minneapolis, Minnesota, USA. Association for ding for link prediction in knowledge graphs. In Computational Linguistics. Advances in Neural Information Processing Systems 32. Ivana Balazeviˇ c,´ Carl Allen, and Timothy M Hospedales. 2019. Tucker: Tensor factorization for Sotiris Kotitsas, Dimitris Pappas, Ion Androutsopoulos, knowledge graph completion. In Empirical Methods Ryan McDonald, and Marianna Apidianaki. 2019. in Natural Language Processing. Embedding Biomedical Ontologies by Jointly En- coding Network Structure and Textual Node De- Ivana Balazeviˇ c,´ Carl Allen, and Timothy Hospedales. scriptors. In Proceedings of the 18th BioNLP Work- 2019. Multi-relational poincar ’e graph embed- shop and Shared Task, pages 298–308, Florence, \ dings. In Advances in Neural Information Process- Italy. Association for Computational Linguistics. ing Systems. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- Andrew L. Beam, Benjamin Kompa, Allen Schmaltz, rado, and Jeff Dean. 2013. Distributed representa- Inbar Fried, Griffin Weber, Nathan P. Palmer, Xu Shi, tions of words and phrases and their composition- Tianxi Cai, and Isaac S. Kohane. 2019. Clinical Con- ality. In C. J. C. Burges, L. Bottou, M. Welling, cept Embeddings Learned from Massive Sources of Z. Ghahramani, and K. Q. Weinberger, editors, Ad- Multimodal Medical Data. arXiv:1804.01486 [cs, vances in Neural Information Processing Systems stat]. ArXiv: 1804.01486. 26, pages 3111–3119. Curran Associates, Inc. Olivier Bodenreider. 2004. The Unified Med- Stuart Nelson, Kelly Zeng, John Kilbourne, Tammy ical Language System (UMLS): integrating Powell, and Robin Moore. 2011. Normalized names biomedical terminology. Nucleic Acids Research, for clinical drugs: RxNorm at 6 years. JAMIA, 32(90001):267D–270. 18(4):441–448. Antoine Bordes, Nicolas Usunier, Alberto Garcia- Duran, Jason Weston, and Oksana Yakhnenko. Maximilian Nickel, Volker Tresp, and Hans-Peter Krie- 2013. Translating embeddings for modeling multi- gal. 2011. A three-way model for collective learn- relational data. In C. J. C. Burges, L. Bottou, ing on multi-relational data. In Proceedings of the M. Welling, Z. Ghahramani, and K. Q. 
Weinberger, 28th International Conference on Machine Learn- editors, Advances in Neural Information Processing ing, pages 809–816. Systems 26, pages 2787–2795. Curran Associates, Inc. Matthew E. Peters, Mark Neumann, Robert L Lo- gan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Edward Choi, Mohammad Taha Bahadori, Eliza- Noah A. Smith. 2019. Knowledge enhanced contex- beth Searles, Catherine Coffey, Michael Thompson, tual word representations. In EMNLP. James Bost, Javier Tejedor-Sojo, and Jimeng Sun. 2016. Multi-layer Representation Learning for Med- Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian ical Concepts. In Proceedings of the 22nd ACM Tang. 2019. Rotate: Knowledge graph embedding SIGKDD International Conference on Knowledge by relational rotation in complex space. In Interna- Discovery and Data Mining - KDD ’16, pages 1495– tional Conference on Learning Representations. 1504, San Francisco, California, USA. ACM Press. Jian Tang, Jingzhou Liu, Ming Zhang, and Qiaozhu Tim Dettmers, Minervini Pasquale, Stenetorp Pon- Mei. 2016. Visualizing large-scale and high- tus, and Sebastian Riedel. 2018. Convolutional 2d dimensional data. In Proceedings of the 25th Inter- knowledge graph embeddings. In Proceedings of national Conference on World Wide Web, pages 287– the 32th AAAI Conference on Artificial Intelligence, 297. International World Wide Web Conferences pages 1811–1818. Steering Committee.

175 The Gene Ontology Consortium. 2018. The Gene On- tology Resource: 20 years and still GOing strong. Nucleic Acids Research, 47(D1):D330–D338. Theo´ Trouillon, Johannes Welbl, Sebastian Riedel, Eric´ Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In Inter- national Conference on Machine Learning (ICML), volume 48, pages 2071–2080. Quan Wang, Pingping Huang, Haifeng Wang, Songtai Dai, Wenbin Jiang, Jing Liu, Yajuan Lyu, and Hua Wu. 2019. Coke: Contextualized knowledge graph embedding. arXiv:1911.02168. Bishan Yang, Scott Wen-tau Yih, Xiaodong He, Jian- feng Gao, and Li Deng. 2015. Embedding Enti- ties and Relations for Learning and Inference in Knowledge Bases. In Proceedings of the Inter- national Conference on Learning Representations (ICLR) 2015. Zhaocheng Zhu, Shizhen Xu, Meng Qu, and Jian Tang. 2019. Graphvite: A high-performance cpu-gpu hy- brid system for node embedding. In The World Wide Web Conference, pages 2494–2504. ACM.

Extensive Error Analysis and a Learning-Based Evaluation of Medical Entity Recognition Systems to Approximate User Experience

Isar Nejadgholi, Kathleen C. Fraser and Berry De Bruijn
National Research Council Canada
{isar.nejadgholi, kathleen.fraser, berry.debruijn}@nrc-cnrc.gc.ca

Abstract word embeddings (Zhu et al., 2018; Si et al., 2019), and domain-specific word embeddings (Alsentzer When comparing entities extracted by a med- et al., 2019; Lee et al., 2019; Peng et al., 2019). ical entity recognition system with gold stan- Typically, research groups report their results us- dard annotations over a test set, two types of mismatches might occur, label mismatch or ing common evaluation metrics (most often preci- span mismatch. Here we focus on span mis- sion, recall, and F-score) on standardized data sets. match and show that its severity can vary from While this facilitates exact comparison, it is diffi- a serious error to a fully acceptable entity ex- cult to know whether modest gains in F-score are traction due to the subjectivity of span anno- associated with significant qualitative differences tations. For a domain-specific BERT-based in the system performance, and how the benefits NER system, we showed that 25% of the er- and drawbacks of different embedding types are rors have the same labels and overlapping span with gold standard entities. We collected ex- reflected in the output of the NER system. pert judgement which shows more than 90% of these mismatches are accepted or partially ac- This work aims to investigate the types of errors cepted by the user. Using the training set of the and their proportion in the output of modern deep NER system, we built a fast and lightweight learning models for medical NER. We suggest that entity classifier to approximate the user expe- an evaluation metric should be a close reflection of rience of such mismatches through accepting what users experience when using the model. We or rejecting them. The decisions made by this classifier are used to calculate a learning-based investigate different types of errors that are penal- F-score which is shown to be a better approx- ized by exact F-score and identify a specific error imation of a forgiving user’s experience than type where there is high degrees of disagreement the relaxed F-score. We demonstrated the re- between the human user experience and what ex- sults of applying the proposed evaluation met- act F-score measures: namely, errors where the ric for a variety of deep learning medical entity extracted entity is correctly labeled, but the span recognition models trained with two datasets. only overlaps with the annotated entity rather than matching perfectly. We obtain expert human judge- 1 Introduction ment for 5296 such errors, ranking the severity Named entity recognition (NER) in medical texts of the error in terms of end user experience. We involves the automated recognition and classifica- then compare the commonly used F-score metrics tion of relevant medical/clinical entities, and has nu- with human perception, and investigate if there is merous applications including information extrac- a way to automatically analyze such errors as part tion from clinical narratives (Meystre et al., 2008), of the system evaluation. The code that calculates identifying potential drug interactions and adverse the number of different types of errors given the affects (Harpaz et al., 2014; Liu et al., 2016), and predictions of an NER model and the correspond- de-identification of personal health data (Dernon- ing annotations is available upon request and will court et al., 2017). be released at https://github.com/nrc-cnrc/ In recent years, medical NER systems have im- NRC-MedNER-Eval after publication. 
We will also proved over previous baseline performance by in- release the collected expert judgements so that corporating developments such as deep learning other researchers can use it as a benchmark for models (Yadav and Bethard, 2018), contextual further investigation about this type of errors.


2 What do NER Evaluation Metrics ize across applications and data sets. We use both Measure? human judgement and a learning-based approach to evaluate span mismatch errors and the result- An output entity from an NER system can be incor- ing gap between what F-score measures and what rect for two reasons: either the span is wrong, or a human user experiences. We only consider the the label is wrong (or both). Although entity-level information extraction task and not any specific exact F-score (also called strict F-score) is estab- downstream task. lished as the most common metric for comparing NER models, exact F-score is the least forgiving 3 Types of Errors in NER systems metric in that it only credits a prediction when both the span and the label exactly match the annotation. While the SemEval 2013 Task 9.1 categorized dif- ferent types of matches for the purpose of evalua- Other evaluation metrics have been proposed. tion, we further categorize mismatches for the sake The Message Understanding Conference (MUC) of error analysis. We consider five types of mis- used an evaluation which took into account differ- matches between annotation and prediction of the ent types of errors made by the system (Chinchor NER system. Reporting and comparing the number and Sundheim, 1993). Building on that work, the of these mismatches alongside an averaged score SemEval 2013 Task 9.1 (recognizing and labelling such as F-score can shed light on the differences of pharmacological substances in biomedical text) em- NER systems. ployed four different evaluations: strict match, in which label and span match the gold standard ex- Mismatch Type-1, Complete False Positive: • actly, exact boundary match, in which the span An entity is predicted by the NER model, but boundaries match exactly regardless of label, par- is not annotated in the hand-labelled text. tial boundary match, in which the span boundaries partially match regardless of label, and type match, Mismatch Type-2, Complete False Nega- • in which the label is correct and the span overlaps tive: A hand labelled entity is not predicted with the gold standard (Segura Bedmar et al., 2013). by the model. The latter metric, also commonly known as inexact Mismatch Type-3, Wrong label, Right match, has been used to compute inexact or relaxed • F-score in the i2b2 2010 clinical NER challenge span: A hand-labelled entity and a predicted (Uzuner et al., 2011). Relaxed F-score and exact one have the same spans but different tags. F-score are the most frequently used evaluation Mismatch Type-4, Wrong label, Overlap- metrics for measuring the performance of medical • ping span: A hand-labelled entity and a pre- NER systems (Yadav and Bethard, 2018). Other dicted one have overlapping spans but differ- biomedical NER evaluations have accepted a span ent tags. as a match as long as either the right or left bound- ary is correct (Tsai et al., 2006). In BioNLP shared Mismatch Type-5, Right label, Overlap- • task 2013, the accuracy of the boundaries is relaxed ping span: A hand-labelled entity and a pre- or measured based on similarity of entities (Bossy dicted one have overlapping spans and the et al., 2013). Another strategy is to annotate all same tags. possible spans for an entity and accept any matches as correct (Yeh et al., 2005), although this detailed We focus on Type-5 errors and show that treat- level of annotation is rare. ing these mismatches is not a trivial task. 
Pre- Here, we focus on the differences between what vious works have shown that some Type-5 mis- F-score measures and the user experience. In the matches are completely wrong predictions while case of a correct label with a span mismatch, it is others are fully acceptable predictions resulting not always obvious that the user is experiencing an from the subjectivity and inconsistency of span error, due to the subjectivity of span annotations annotations (Tsai et al., 2006). (Tsai et al., 2006; Kipper-Schuler et al., 2008). Ex- Figure1 shows several examples of error Type-5. isting evaluation metrics treat all such span mis- In the first example, an adenosine - thallium stress matches equally, either penalizing them all (exact test is annotated as a test, while the NER system F-score), rewarding them all (relaxed F-score), or extracts thallium stress test as a test. Here, what based on oversimplified rules that do not general- NER extracted is partially correct but misses an

178 Figure 1: Examples of Type-5 error. We used the visualisation tool developed in (Zhu et al., 2018) important part of the entity. Whether the extracted the second version, which has become an impor- entity is acceptable may depend on the downstream tant benchmark in the literature on clinical NER task. In the next sentence, patchy consolidation in (Bhatia et al., 2019; Zhu et al., 2018). There are all lobes is annotated as a problem, but the NER 170 documents (16520 entities) in the i2b2 train set system extracted patchy consolidation in all lobes and 256 documents (31161 entities) in its test set. of both longs as the problem. Here, the system’s The i2b2 dataset was annotated by community prediction is more complete than the annotated annotators with carefully crafted guidelines. The entity, and so it appears to be a fully acceptable ground truth generated by the community obtained prediction. In the last example, according to human F-measures above 0.90 against the ground truth of annotation, 1cm cyst in the right lobe of the liver is the experts (Uzuner et al., 2011). a problem, but the NER system extracts two entities from the same phrase, 1) 1cm cyst in the right MedMentions dataset: The MedMentions lobe as a problem and 2) liver as another problem. dataset was released in 2019 and contains 4,392 While the first extracted entity is correct and may abstracts from biomedical articles on PubMed be acceptable the second one is completely wrong. (Mohan and Li, 2019). The abstracts are annotated for UMLS concepts and semantic types. The fully 4 Datasets and Models annotated dataset contains 127 semantic types and these classes are highly-imbalanced. The creators We consider two medical text datasets, one clinical of the dataset also provide a version which has and the other biomedical. We analyse the errors of been annotated with only a subset of the most three models for each dataset to cover a variety of relevant concepts, called ‘st21pv’ (21 semantic deep learning models. types from preferred vocabularies); we consider i2b2 dataset: The i2b2 dataset of annotated clin- this version in the current work. While fewer ical notes was introduced by (Uzuner et al., 2011) papers have been published on MedMentions to in a shared task on entity recognition and relation date, it represents an interesting challenge to NLP extraction. The texts, consisting of de-identified systems due to its imbalanced and high number of discharge summaries, have been annotated for three classes, and some observed inconsistencies in the entity types: problems, tests, and treatments. There annotations (Fraser et al., 2019). There are 3513 are two versions of this dataset, as the version that documents (162,908 entities) in the st21pv train set was released to the wider NLP community contains and 879 documents (40,101 entities) in the test set. fewer texts than in the original shared task. We use MedMentions was annotated by a team of profes-

179 sional annotators with rich experience in biomedi- important for comparison of modern strong NER cal content curation. The precision of the annota- systems. tion in MedMention is estimated as 97.3% (Mohan and Li, 2019). 6 Expert Judgement on Type-5 Errors We explore a variety of NER Model Structures: We considered an information extraction task and deep learning models. For all the models we fol- asked a medical doctor to assess the Type-5 errors low the commonly used deep learning structure made by the BioBERT NER model on the st21pv consisting of a pretrained embedding model and su- dataset and either confirm or reject the extracted pervised prediction layers. For embedding, we ex- entity with granular scores. Our goal is to: 1) inves- plore three different models: a non-contextualized tigate the proportion of Type-5 extracted entities embedding model (Glove), general domain contex- that are acceptable, 2) set a benchmark of human tualized embedding model (BERT pretrained on experience from Type-5 errors. general domain text) and a domain-specific contex- tualized embedding model (BERT pretrained on Human Judgement Scheme: The following domain-specific text corpora). For the i2b2 dataset, scoring scheme is used by the expert for scoring we consider Glove+bi-LSTM+CRF (Pennington the acceptability of Type-5 mismatches for the et al., 2014), BERT+linear (Devlin et al., 2018) BioBERT-based model trained with the st21vp and ClinicalBERT+linear (Alsentzer et al., 2019) dataset. The Type-5 mismatches are identified and models. For the st21pv MedMentions dataset, we the expert is given the original sentence in the test consider Glove+bi-LSTM+CRF, BERT+linear and set, the annotated (gold-standard) entity, and the BioBERT+linear models (Lee et al., 2019). Clin- entity predicted by the NER model for all 5296 ical BERT is pretrained on clinical notes (similar Type-5 mismatches. to i2b2) and BioBERT is pretrained on biomedical articles from PubMed (similar to st21pv). SCORE = 1: The predicted entity is wrong and gets rejected. For example, while gene transfer 5 Analysis of Error Types Across Models is annotated as a research activity in the test set, and Datasets the NER extracted gene as research activity. Further investigation of Type-5 errors is only worth- SCORE = 2: The predicted entity is correct but an while if a significant proportion of the errors belong important piece of information is missing when to this group. We looked at the distribution of error seen in the full sentence. The prediction is partially types across datasets and NER models, described accepted by the expert. For example, injury of in Section4, and visualized the results in Figure2. lung is labeled as injury or poisoning in the test By calculating the distribution of error types, we set, but the NER extracts only the word injury as observed that for all assessed models at least 20% injury or poisoning. of the errors are recognized as Type-5 mismatches. SCORE = 3: The predicted entity is correct Moreover, for both datasets, we observed that but could be more complete. The prediction is better NER models generate more Type-5 errors. accepted by the expert. The entity normal HaCaT Models based on general BERT outperform glove- lines is annotated as anatomical structure in the based models in terms of both exact and relaxed test set but the NER extracts only HaCaT lines f-score and they also generate relatively more Type- with the same label. 5 errors. 
Same pattern is observed when comparing SCORE = 4: The predicted entity is equally domain-specific BERT models with general BERT correct and is accepted by the expert. As an models. This observation may be explained with example the annotated entity in test set is 196b-5p, the fact that contextualized embeddings combine as an anatomical structure but the NER extracts the meaning of words through attention mechanism -196b-5p, as an entity with the same tag. and the span information might be more vague in SCORE = 5: The predicted entity is more the resulting representation. Figure3 shows exact complete than the annotated entity and is accepted F-score, relaxed F-score and the proportion of Type- by the expert. The annotated entity in the test set is 5 mismatches to the total number of errors, for all drugs with the tag chemical and the NER extracts the models and datasets. This analysis implies that Alzheimer’s drugs with the same tag. proper handling of Type-5 errors becomes more

Figure 2: Types of errors made on the i2b2 and MedMentions-st21pv datasets
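As an illustration (not the released implementation), the five mismatch types defined in Section 3 can be assigned from (start, end, label) annotations roughly as follows:

```python
def spans_overlap(a, b):
    """True if half-open spans a = (start, end) and b overlap but differ."""
    return a != b and a[0] < b[1] and b[0] < a[1]

def mismatch_type(gold, pred):
    """Assign one of the five mismatch types to a (gold, predicted) entity
    pair; entities are (start, end, label) tuples and None marks a missing
    counterpart. Exact matches return 0."""
    if gold is None:
        return 1                                 # Type-1: complete false positive
    if pred is None:
        return 2                                 # Type-2: complete false negative
    g_span, g_label = gold[:2], gold[2]
    p_span, p_label = pred[:2], pred[2]
    if g_span == p_span:
        return 0 if g_label == p_label else 3    # same span, wrong label
    if spans_overlap(g_span, p_span):
        return 5 if g_label == p_label else 4    # overlapping span
    # non-overlapping pairs are treated as separate Type-1 / Type-2 errors
    return None
```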

Figure 3: The change of relative proportion of Type-5 errors across dataset and models as the f-scores change

Results of Human Judgement Analysis: The results of the expert judgement are summarized in Figure4. Almost 40% of the Type-5 errors are scored • as 5. This means that in 40% of the cases the prediction of the NER is more complete than the entity labeled in its test set. 70% of the extracted entities scored 3 or above • and are fully accepted by the expert. 21% of the Type-5 mismatches are scored as • Figure 4: Results of expert judgement for Type-5 mis- 2. These are accepted as a correct entity ex- matches of the BioBERT-based NER model trained traction when seen out of the context, but in with MedMentions-st21pv dataset. the context of a given sentence they lack an important piece of information. Depending on the downstream tasks, they might be an 7 Entity Classifier for Automatic acceptable prediction or not. Refining of Type-5 Mismatches Only 9% of the extracted entities are totally We propose that an entity classifier can be trained • rejected by the expert. to predict the tag of entities extracted by the NER model and the predicted tag can be used to distin-

181 (a) Training

(b) Using for evaluation of the NER model

Figure 5: Workflow of the proposed entity classifier guish between acceptable and unacceptable Type-5 size of the other class to the average size of classes errors. Figure5 shows the workflow of the pro- related to the existing tags. posed method. Using the training dataset of the NER model, we train an entity classifier with gold 7.2 Classifier Structure standard entities as inputs and their assigned tags For the classifier structure, we chose to use a Dis- as outputs. For this classifier, the span is given tilBERT model (Sanh et al., 2019) with a linear and the tag is the only information that has to be prediction layer. DistilBERT is a distilled version learned. Although the full context of the sentence of BERT that is an optimum choice when fast in- helps the NER model to learn a better representa- ference is required. Since this classifier is going tion of the entity, many entities can be classified to be used for evaluation and error analysis and without seeing the full sentence and this is what the is not the main focus of building an NER model, entity classifier learns. the lightweight and fast inference is an important For Type-5 entities, the human annotators and practical criterion. We train the classifier only one the NER already agree on the tag and it is only the epoch for both datasets. When trained on the train span that is in disagreement. So, the intuition here set and tested on the test set, we achieved 89% F- is that the entity classifier can confirm or reject the score for i2b2 and 77% F-score for st21pv dataset. tag predicted by NER, given the identified span. This classifier is meant to play a third party role that has seen the variety of span annotations in the training dataset and performs the task that the human expert did in Section6. This classifier is trained once for each dataset and is not dependent on the type of the NER model.
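A minimal sketch of this entity classifier using the Hugging Face transformers API is shown below; the batch size and learning rate are illustrative assumptions, while the single training epoch follows the setting described above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def build_entity_classifier(num_labels, model_name="distilbert-base-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # DistilBERT encoder with a linear classification head on top
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels)
    return tokenizer, model

def train_one_epoch(tokenizer, model, entities, labels, batch_size=32, lr=2e-5):
    # entities: entity strings taken out of sentence context;
    # labels: tag ids, including the extra `other` class.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for i in range(0, len(entities), batch_size):
        batch = tokenizer(entities[i:i + batch_size], padding=True,
                          truncation=True, return_tensors="pt")
        targets = torch.tensor(labels[i:i + batch_size])
        loss = model(**batch, labels=targets)[0]
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model
```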

7.1 Building the Training Data for the Entity Classifier In order to build a training dataset for the entity clas- Figure 6: Comparison of decisions made by the hu- sifier, we extracted pairs of (entity, tag) from the man expert and the entity classifier for the Type-5 mis- IOB annotated dataset. The entity classifier should matches of BioBERT NER and st21pv dataset. also be able to identify cases where the extracted entity does not belong to any of the pre-defined tags. For this reason we add the label other to the 7.3 Using the Entity Classifier for Refining list of tags of the classifier. To find examples of the Type-5 Mismatches other class, we used the spaCy library (Honnibal By building the entity classifier, our goal is to re- and Montani, 2017) to extract all the noun chunks fine the Type-5 errors and separate the acceptable that are out of the boundaries of tagged entities and predictions of the NER from the unacceptable. For randomly chose a number of them. We limited the instance, in the last example shown in Figure1

Annotated in test set   Tag in test set         Extracted by NER      Tag from Entity classifier   Decision
Central pathology       biomedical discipline   Central               Spatial concept              Reject
Therapies               healthcare activity     Agonist Therapies     healthcare activity          Accept

Table 1: Examples of accepted and rejected Type-5 mismatches using the entity classifier (st21pv dataset).

there are two Type-5 errors. We feed the two ex- 8 Refining Type-5 Mismatches Across tracted entities ‘1 cm cyst in the right lobe’ and Datasets and Models ‘liver’ to the entity classifier trained for i2b2 dataset. The classifier predicts the tag ‘problem’ for the Figure7 shows how the entity classifier refines extracted entity ‘1 cm cyst in the right lobe’ and Type-5 errors across models and datasets. Con- ‘Other’ for the extracted entity ‘liver’. Using these sistently, a significant proportion of Type-5 errors predictions we decide that the first entity is ac- are accepted by the entity classifier. For exam- ceptable, since although the span of the extracted ple, for the Glove-based model trained on i2b2 entity does not match the annotation, the classifier dataset, the entity classifier accepts 90% of Type- still recognizes it as a member of the correct class. 5 errors which is 26.6% of the total number of We reject the extracted entity ‘liver’ as a ‘prob- the errors penalized by the exact f-score. The pro- lem’ since the classifier recognizes it as not being portion of accepted Type-5 mismatches to the to- a ‘problem’. Table1 shows examples of rejected tal number of errors is 31.11% for i2b2+BERT, and accepted Type-5 mismatches from the st21pv 33.23% for i2b2+ClinicalBERT, 17.95% for dataset. st21pv+Glove, 19.55% for st21pv+BERT and 19.36% for st21pv+BioBERT. To sum up, about 20% to 30% of the mismatches penalized by exact f-score are accepted by the entity classifier. 7.4 Comparing the Classifier and the Expert 9 Learning-Based F-score Figure6 shows the comparison between the ex- pert’s judgment and the classifier’s judgement The trained entity classifier can be leveraged for about Type-5 mismatches for the BioBERT NER F-score calculation. Here, instead of penalizing model on st21 pv dataset. all the type-5 mismatches as in exact F-score or rewarding all of them in relaxed F-score, we penal- Our analysis shows that 96% of the entities ac- ize the type-5 mismatches that are rejected by the cepted by the classifier are also accepted or par- classifier and reward the rest of them. In other tially accepted by the expert, and 86% of the en- words, this F-score penalizes errors of Type-1, tities accepted or partially accepted by the expert Type-2, Type-3, Type-4 and the Rejected Type- are accepted by the classifier as well. The clas- 5 mismatches. Accepted Type-5 mismatches and sifier and the expert disagree about 17% of the exact matches are rewarded. entities. In 24% of the disagreements, the proba- bilities assigned to the tags generated by the en- 9.1 Evaluation of the Learning-Based F-score tity classifier are low (less than 0.5) and our man- We use the expert judgement collected in Section6 ual investigation shows that the classifier’s predic- to quantify human experience for the BioBERT- tion is mostly wrong in these cases. These mis- based NER model on st21pv dataset and then takes mostly occurs in 5 classes namely anatomi- use that as a benchmark to evaluate the proposed cal structure, biologic function, chemical, finding learning-based F-score. We consider two scenar- and health care activity. ios based on the scores described in Section6, 1) We also observed that this classifier is not able a strict user that only accepts scores equal to or to distinguish between accepted and partially ac- above 3, 2) a forgiving user that accepts scores cepted entities extracted by the NER model, which equal to or above 2. We calculated the F-score for is one of the limitations of this method. 
The prob- each scenario and investigated the error of exact abilities assigned to the tags is 0.89 0.17 for F-score, relaxed F-score and the proposed F-score ± accepted entities, 0.88 0.17 for partially accepted to each of these scenarios. Table2 shows that in ± entities, and 0.78 0.23 for rejected entities. applications where strict evaluation of the NER ± 183 Figure 7: Number of accepted/rejected Type-5 mismatches by the entity classifier

F-score err. wtr strict user err. wtr forgiving user bel is recognized correctly and the span has overlap Exact -4.3% -5.5% with the annotated entity. We referred to this type Proposed 5.9% 4.7% of error as Type-5 mismatch and for six NER mod- Relaxed 8.2% 7.1% els (3 model structures and 2 datasets) showed that at least 20% of the errors belong to this category. Table 2: Comparing F-scores with human experience. The previous literature has raised the issue that in the case of medical NER, many such predictions is needed, exact F-score is better than both pro- are valid and useful entity extractions and penaliz- posed and relaxed f-score and results in the least ing them is a flaw of evaluation metrics. However, error with respect to the human experience. How- distinguishing between acceptable and unaccept- ever, in cases that partially accepted entities can able predictions when the label is correct and the be considered as useful predictions, the proposed span overlaps is not trivial. method results in the least disagreement with hu- We argue that the best evaluation metric is the man experience. A better classifier would be able one that reflects the human experience of the sys- to model human preferences better, and thus make tem best. We collected human judgement about a the learning-based F-score a stronger alternative all Type-5 errors made by a NER model based on to exact or relaxed F-scores. Another important BioBERT embeddings, trained with st21pv dataset finding from Table2 that when choosing between and showed that almost 70% of such errors are exact and relaxed F-score, exact is the better metric completely acceptable and only 10% of them are to choose. rejected by the user. The rest of the predictions are Figure8 shows how the proposed F-score can be acceptable entities for the associated tags but lack compared with exact and relaxed F-score. We only important information when seen in the context. have annotations for the BioBERT+stpv dataset Setting human experience as a benchmark, we and for the rest of the models we cannot evaluate suggested that expert judgement can be approxi- the F-score with respect to human experience. As mated by a decision made by an entity classifier. expected, from this figure we observe that for all The entity classifier can be trained using the train- the models, the proposed F-score is a forgiving one ing set of an NER. While the NER model looks at and is much closer to the relaxed F-score than the the context and identifies the type of the entity and a exact F-score. partially correct span, this classifier looks at the ex- tracted entities out of context and decides whether 10 Discussion with the partially correct span, the extracted en- We highlighted the fact that when we evaluate NER tity can still belong to the predicted class or not. systems by comparing extracted and annotated en- The entity classifier trained on st21pv dataset ac- tities across a test set, for a significant part of the cepts more than 80% of Type-5 errors made by errors that are penalized by the exact F-score, the la- BioBERT-based NER model trained with the same

184 Figure 8: Comparison of f-scores. dataset, 96% of which is also accepted by the ex- ence of a strict user. Using extra sources of training pert user. The proposed entity classifier is trained data other than the NER training dataset may be a for each NER training set once and can be used to way to improve the judgements of the entity classi- evaluate any NER model trained on that dataset, fier. We used this classifier for error analysis and regardless of the structure of the NER model being refining of Type-5 errors. In future work, we can evaluated. We used a computationally inexpensive look at the possibility of using this classifier as a model structure and encourage researchers to use refining tool for all types of mismatches or a post- this model in order to automatically evaluate Type- processing tool without the need for annotation to 5 mismatches. Reporting the distribution of errors identify the types of mismatches. across all error types and also accepted and rejected Type-5 errors, will allow us to compare our models 11 Conclusion in a variety of dimensions and sheds light on how these models behave differently for detecting labels Medical NER systems that are based on most recent and spans. deep learning structures generate a high amount of outputs that match with the hand-labelled entities Accepting some Type-5 errors as useful predic- in terms of tag but only overlap in the span. While tions can be translated to F-score calculation by the exact f-score penalizes all of these predictions not penalizing the accepted entity extractions. We and relaxed f-score credits all of them, a human did this calculation separately for the cases that user accepts a significant proportion of them as were accepted by human expert or the classifier, valid entities and rejects the rest. and showed that the F-score resulting from the clas- A reformatted version of the NER training dataset sifier is closer to the judgement of a forgiving user can be used to train an entity classifier for eval- than both the exact and the relaxed F-score. In uation of extracted entities with right label and cases where a strict evaluation of the system is de- overlapping span. We showed that there is a high sired, exact F-score is a better approximation of degree of agreement between human expert and human experience, due to the fact that the entity this entity classifier in accepting or rejecting span classifier is a forgiving one and accepts most of the mismatches. This classifier is used to calculate a cases that are partially accepted by the expert. learning-based evaluation metric that outperforms We only collected human judgement on the de- relaxed F-score in approximating the experience of cisions made by NER model for one model and a forgiving user. one dataset. Further investigation is needed to con- firm or reject our observations and to investigate the limitations and potential capabilities of training References an entity classifier alongside a NER model and us- ing that for error analysis. Also, further research Emily Alsentzer, John Murphy, William Boag, Wei- Hung Weng, Di Jin, Tristan Naumann, and Matthew is needed to find a way of distinguishing between McDermott. 2019. Publicly available clinical BERT partially accepted and accepted entity extractions, embeddings. In Proceedings of the 2nd Clinical which is a necessary tool for measuring the experi- Natural Language Processing Workshop, pages 72–


A Data-driven Approach for Noise Reduction in Distantly Supervised Biomedical Relation Extraction

Saadullah Amin∗ Katherine Ann Dunfield∗ Anna Vechkaeva Günter Neumann

German Research Center for Artificial Intelligence (DFKI)
Multilinguality and Language Technology Lab
{saadullah.amin, katherine.dunfield, anna.vechkaeva, guenter.neumann}@dfki.de

Abstract

Fact triples are a common form of structured knowledge used within the biomedical domain. As the amount of unstructured scientific texts continues to grow, manual annotation of these texts for the task of relation extraction becomes increasingly expensive. Distant supervision offers a viable approach to combat this by quickly producing large amounts of labeled, but considerably noisy, data. We aim to reduce such noise by extending an entity-enriched relation classification BERT model to the problem of multiple instance learning, and by defining a simple data encoding scheme that significantly reduces noise, reaching state-of-the-art performance for distantly-supervised biomedical relation extraction. Our approach further encodes knowledge about the direction of relation triples, allowing for increased focus on relation learning by reducing noise and alleviating the need for joint learning with knowledge graph completion.

Figure 1: Example of a distantly supervised bag of sentences for a knowledge base tuple (neurofibromatosis 1, breast cancer) with special order-sensitive entity markers to capture the position and the latent relation direction with BERT for predicting the missing relation.

1 Introduction

Relation extraction (RE) remains an important natural language processing task for understanding the interaction between entities that appear in texts. In supervised settings (GuoDong et al., 2005; Zeng et al., 2014; Wang et al., 2016), obtaining fine-grained relations for the biomedical domain is challenging due not only to the annotation costs, but also to the added requirement of domain expertise. Distant supervision (DS), however, provides a meaningful way to obtain large-scale data for RE (Mintz et al., 2009; Hoffmann et al., 2011), but this form of data collection also tends to result in an increased amount of noise, as the target relation may not always be expressed (Takamatsu et al., 2012; Ritter et al., 2013). As exemplified in Figure 1, the last two sentences can be seen as potentially noisy evidence, as they do not explicitly express the given relation. Since individual instance labels may be unknown (Wang et al., 2018), we instead build on the recent findings of Wu and He (2019) and Soares et al. (2019) in using positional markings and latent relation direction (Figure 1) as a signal to mitigate noise in bag-level multiple instance learning (MIL) for distantly supervised biomedical RE. Our approach greatly simplifies previous work by Dai et al. (2019), with the following contributions:

• We extend sentence-level relation enriched BERT (Wu and He, 2019) to bag-level MIL.
• We demonstrate that simple applications of this model under-perform and require knowledge base order-sensitive markings, k-tag, to achieve state-of-the-art performance. This data encoding scheme captures the latent relation direction and provides a simple way to reduce noise in distant supervision.
• We make our code and data creation pipeline publicly available: https://github.com/suamin/umls-medline-distant-re

∗ Equal contribution
On behalf of the PRECISE4Q consortium


2 Related Work

In MIL-based distant supervision for corpus-level RE, earlier works rely on the assumption that at least one of the evidence samples represents the target relation in a triple (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012). Recently, piecewise convolutional neural networks (PCNN) (Zeng et al., 2014) have been applied to DS (Zeng et al., 2015), with notable extensions in selective attention (Lin et al., 2016) and the modelling of noise dynamics (Luo et al., 2017). Han et al. (2018a) proposed a joint learning framework for knowledge graph completion (KGC) and RE with mutual attention, showing that DS improves downstream KGC performance, while KGC acts as an indirect signal to filter textual noise. Dai et al. (2019) extended this framework to biomedical RE, using improved KGC models, ComplEx (Trouillon et al., 2017) and SimplE (Kazemi and Poole, 2018), as well as additional auxiliary tasks of entity-type classification and named entity recognition to mitigate noise.

Pre-trained language models, such as BERT (Devlin et al., 2019), have been shown to improve the downstream performance of many NLP tasks. Relevant to distant RE, Alt et al. (2019) extended the OpenAI Generative Pre-trained Transformer (GPT) model (Radford et al., 2019) for bag-level MIL with selective attention (Lin et al., 2016). Sun et al. (2019) enriched the pre-training stage with KB entity information, resulting in improved performance. For sentence-level RE, Wu and He (2019) proposed an entity marking strategy for BERT (referred to here as R-BERT) to perform relation classification. Specifically, they mark the entity boundaries with special tokens following the order they appear in the sentence. Likewise, Soares et al. (2019) studied several data encoding schemes and found marking entity boundaries important for sentence-level RE. With such encoding, they further proposed a novel pre-training scheme for distributed relational learning, suited to few-shot relation classification (Han et al., 2018b).

Our work builds on these findings; in particular, we extend the BERT model (Devlin et al., 2019) to bag-level MIL, similar to Alt et al. (2019). More importantly, noting the significance of sentence-ordered entity marking in sentence-level RE (Wu and He, 2019; Soares et al., 2019), we introduce a knowledge-based entity marking strategy suited to bag-level DS. This naturally encodes the information stored in the KB, reducing the inherent noise.

3 Bag-level MIL for Distant RE

3.1 Problem Definition

Let E and R represent the set of entities and relations from a knowledge base KB, respectively. For h, t ∈ E and r ∈ R, let (h, r, t) ∈ KB be a fact triple for an ordered tuple (h, t). We denote all such (h, t) tuples by a set G+, i.e., there exists some r ∈ R for which the triple (h, r, t) belongs to the KB, called positive groups. Similarly, we denote by G− the set of negative groups, i.e., for all r ∈ R, the triple (h, r, t) does not belong to the KB. The union of these groups is represented by G = G+ ∪ G− (the sets are disjoint, G+ ∩ G− = ∅). We denote by B_g = [s_g^(1), ..., s_g^(m)] an unordered sequence of sentences, called a bag, for g ∈ G such that the sentences contain the group g = (h, t), where the bag size m can vary. Let f be a function that maps each element in the bag to a low-dimensional relation representation [r_g^(1), ..., r_g^(m)]. With o, we represent the bag aggregation function that maps the instance-level relation representations to a final bag representation b_g = o(f(B_g)). The goal of distantly supervised bag-level MIL for corpus-level RE is then to predict the missing relation r given the bag.

3.2 Entity Markers

Wu and He (2019) and Soares et al. (2019) showed that using special markers for entities with BERT, in the order they appear in a sentence, encodes positional information that improves the performance of sentence-level RE. It allows the model to focus on the target entities when, possibly, other entities are also present in the sentence, implicitly doing entity disambiguation and reducing noise. In contrast, for bag-level distant supervision, the noisy channel can be attributed to several factors for a given triple (h, r, t) and bag B_g:

1. Evidence sentences may not express the relation.
2. Multiple entities appearing in the sentence, requiring the model to disambiguate the target entities among others.
3. The direction of the missing relation.
4. Discrepancy between the order of the target entities in the sentence and in the knowledge base.

To address (1), common approaches are to learn a negative relation class NA and to use better bag aggregation strategies (Lin et al., 2016; Luo et al., 2017; Alt et al., 2019). For (2), encoding positional information is important, such as in PCNN (Zeng et al., 2014), which takes into account the relative positions of head and tail entities (Zeng et al., 2015), and in (Wu and He, 2019; Soares et al., 2019) for sentence-level RE. To account for (3) and (4), multi-task learning with KGC and mutual attention has proved effective (Han et al., 2018a; Dai et al., 2019). Simply extending sentence-sensitive marking to the bag level can be adverse, as it enhances (4) and, even if the composition is uniform, it distributes the evidence sentences across several bags. On the other hand, expanding relations into multiple sub-classes based on direction (Wu and He, 2019) enhances class imbalance and also distributes supporting sentences. To jointly address (2), (3) and (4), we introduce a KB-sensitive encoding suitable for bag-level distant RE.

Formally, for a group g = (h, t) and a matching sentence s_g^(i) with tokens (x_0, ..., x_L), where x_0 = [CLS] and x_L = [SEP], we add special tokens $ and ^ to mark the entity spans as follows.

Sentence ordered: Called s-tag, entities are marked in the order they appear in the sentence. Following Soares et al. (2019), let s_1 = (i, j) and s_2 = (k, l) be the index pairs with 0 < i < j − 1, j < k, k ≤ l − 1 and l ≤ L, delimiting the entity mentions e_1 = (x_i, ..., x_j) and e_2 = (x_k, ..., x_l) respectively. We mark the boundary of s_1 with $ and s_2 with ^. Note that e_1 and e_2 can be either h or t.

KB ordered: Called k-tag, entities are marked in the order they appear in the KB. Let s_h = (i, j) and s_t = (k, l) be the index pairs delimiting the head (h) and tail (t) entities, irrespective of the order in which they appear in the sentence. We mark the boundary of s_h with $ and s_t with ^.

The s-tag annotation scheme is followed by Soares et al. (2019) and Wu and He (2019) for span identification. In Wu and He (2019), each relation type r ∈ R is further expanded into two sub-classes, r(e_1, e_2) and r(e_2, e_1), to capture direction, while holding the s-tag annotation fixed. For DS-based RE, since the ordered tuple (h, t) is given, the task is reduced to relation classification without direction. This side information is encoded in the data with k-tag, covering (2) but also (3) and (4). To account for (1), we also experiment with selective attention (Lin et al., 2016), which has been widely used in other works (Luo et al., 2017; Han et al., 2018a; Alt et al., 2019).

Figure 2: Multiple instance learning (MIL) based bag-level relation classification BERT with KB-ordered entity marking (Section 3.2). Special markers $ and ^ always delimit the span of the head (h_s, h_e) and tail (t_s, t_e) entities regardless of their order in the sentence. The markers capture the positions of the entities and the latent relation direction.

3.3 Model Architecture

BERT (Devlin et al., 2019) is used as our base sentence encoder, specifically BioBERT (Lee et al., 2020), and we extend R-BERT (Wu and He, 2019) to bag-level MIL. Figure 2 shows the model's architecture with k-tag. Consider a bag B_g of size m for a group g ∈ G representing the ordered tuple (h, t), with corresponding spans S_g = [(s_h^(1), s_t^(1)), ..., (s_h^(m), s_t^(m))] obtained with k-tag. Then, for a sentence in the bag and its spans, (s^(i), (s_h^(i), s_t^(i))), we can represent the model in three steps, such that the first two steps represent the map f and the final step o (each linear layer is implicitly assumed to have a bias vector):

1. SENTENCE ENCODING: BERT is applied to the sentence and the final hidden state H_0^(i) ∈ R^d, corresponding to the [CLS] token, is passed through a linear layer W^(1) ∈ R^(d×d) with tanh(·) activation to obtain the global sentence information h_0^(i).

2. RELATION REPRESENTATION: For the head entity, represented by the span s_h^(i) = (j, k) for k > j, we apply average pooling, 1/(k−j+1) · Σ_{n=j..k} H_n^(i); similarly, for the tail entity with span s_t^(i) = (l, m) for m > l, we get 1/(m−l+1) · Σ_{n=l..m} H_n^(i). The pooled representations are then passed through a shared linear layer W^(2) ∈ R^(d×d) with tanh(·) activation to get h_h^(i) and h_t^(i). To get the final latent relation representation, we concatenate the pooled entity representations with the [CLS] representation as r_g^(i) = [h_0^(i); h_h^(i); h_t^(i)] ∈ R^(3d).
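Before the final aggregation step, the per-sentence map f (steps 1 and 2) can be sketched as follows. This is a minimal, illustrative PyTorch sketch assuming a HuggingFace-style BioBERT checkpoint and a single (head, tail) span pair per batch; the class name and model identifier are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RelationRepresentation(nn.Module):
    """Per-sentence map f: a k-tag marked sentence -> latent relation vector in R^{3d}."""

    def __init__(self, encoder_name="dmis-lab/biobert-base-cased-v1.1"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # BioBERT sentence encoder
        d = self.encoder.config.hidden_size
        self.w_cls = nn.Linear(d, d)  # W^(1): global sentence information from [CLS]
        self.w_ent = nn.Linear(d, d)  # W^(2): shared projection for head and tail

    @staticmethod
    def span_mean(hidden, span):
        # Average pooling over an inclusive (start, end) token span of one entity.
        start, end = span
        return hidden[:, start:end + 1].mean(dim=1)

    def forward(self, input_ids, attention_mask, head_span, tail_span):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        h_cls = torch.tanh(self.w_cls(hidden[:, 0]))                        # step 1
        h_head = torch.tanh(self.w_ent(self.span_mean(hidden, head_span)))  # step 2
        h_tail = torch.tanh(self.w_ent(self.span_mean(hidden, tail_span)))
        return torch.cat([h_cls, h_head, h_tail], dim=-1)                   # r_g^(i) in R^{3d}
```

Applying this module to every sentence in a bag yields the instance representations that the aggregation function o (step 3, below) combines into the bag representation.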

189 3. BAG AGGREGATION: After applying the first Table 1: Overall statistics of the data. two steps to each sentence in the bag, we obtain (1) (m) Triples Entities Relations Pos. Groups Neg. Groups [rg , ..., rg ]. With a final linear layer consisting 3d 169,438 27,403 355 92,070 64,448 of a relation matrix Mr R|R|× and a bias vec- ∈ tor br R|R|, we aggregate the bag information ∈ with o in two ways: 4.2 Models and Evaluation Average: The bag elements are averaged as: We compare each tagging scheme, s-tag and k-tag, m with average (avg) and selective attention (attn) bag 1 (i) bg = r aggregation functions. To test the setup of Wu and m g Xi=1 He(2019), which follows s-tag, we expand each relation type (exprels) r to two sub-classes Selective attention (Lin et al., 2016): For a row r ∈ R in M representing the relation r , we get the r(e1, e2) and r(e2, e1) indicating relation direction r ∈ R attention weights as: from first entity to second and vice versa. For all experiments, we used batch size 2, bag size 16 with T (i) exp(r rg ) sampling (see A.4 for details on bag composition), αi = learning rate 2e 5 with linear decay, and 3 epochs. m exp(rT r(j)) − j=1 g As the standard practice (Weston et al., 2013), eval- m P uation is performed through constructing candidate b = α r(i) g i g triples by combining the entity pairs in the test set i=1 X with all relations (except NA) and ranking the re- Following bg, a softmax classifier is applied to pre- sulting triples. The extracted triples are matched dict the probability p(r bg; θ) of relation r being a against the test triples and the precision-recall (PR) | true relation with θ representing the model param- curve, area under the PR curve (AUC), F1 measure, eters, where we minimize the cross-entropy loss and Precision@k, for k in 100, 200, 300, 2000, { during training. 4000, 6000 are reported. } 4 Experiments 4.3 Results 4.1 Data Performance metrics are shown in Table2 and plots Similar to (Dai et al., 2019), UMLS4 (Bodenreider, of the resulting PR curves in Figure3. Since our 2004) is used as our KB and MEDLINE abstracts5 data differs from Dai et al.(2019), the AUC cannot as our text source. A data summary is shown in be directly compared. However, Precision@k indi- Table1 (see AppendixA for details on the data cre- cates the general performance of extracting the true ation pipeline). We approximate the same statistics triples, and can therefore be compared. Generally, as reported in Dai et al.(2019) for relations and models annotated with k-tag perform significantly entities, but it is important to note that the data does better than other models, with k-tag+avg achieving not contain the same samples. We divided triples state-of-the-art Precision@ 2k,4k,6k compared { } into train, validation and test sets, and following to the previous best (Dai et al., 2019). The best (Weston et al., 2013; Dai et al., 2019), we make model of Dai et al.(2019) uses PCNN sentence sure that there is no overlapping facts across the encoder, with additional tasks of SimplE (Kazemi splits. Additionally, we add another constraint, i.e., and Poole, 2018) based KGC and KG-attention, there is no sentence-level overlap between the train- entity-type classification and named entity recog- ing and held-out sets. To perform groups negative nition. 
In contrast our data-driven method, k-tag, sampling, for the collection of evidence sentences greatly simplifies this by directly encoding the KB supporting NA relation type bags, we extend KGC information, i.e., order of the head and tail en- open-world assumption to bag-level MIL (see A.3). tities and therefore, the latent relation direction. 20% of the data is reserved for testing, and of the Consider again the example in Figure1 where our remaining 80%, we use 10% for validation and the source triple (h, r, t) is (neurofibromatosis 1, asso- rest for training. ciated genetic condition, breast cancer), and only last sentence has the same order of entities as KB. 4We use 2019 release: umls-2019AB-full 5https://www.nlm.nih.gov/bsd/medline. This discrepancy is conveniently resolved (note in html Figure2, for last sentence the extracted entities

190 Table 2: Relation extraction results for different model configurations and data splits.

Model Bag Agg. AUC F1 P@100 P@200 P@300 P@2k P@4k P@6k Dai et al.(2019) ------.913 .829 .753 avg .359 .468 .791 .704 .649 .504 .487 .481 s-tag attn .122 .225 .587 .563 .547 .476 .441 .418 avg .383 .494 .508 .519 .521 .507 .508 .511 s-tag+exprels attn .114 .216 .459 .476 .482 .504 .496 .484 avg .684 .649 .974 .983 .986 .983 .977 .969 k-tag attn .314 .376 .967 .941 .925 .857 .814 .772 sentence order is flipped to KG order when con- catenating, unlike s-tag) with k-tag. We remark that such knowledge can be seen as learned, when jointly modeling with KGC, however, considering the task of bag-level distant RE only, the KG triples are known information and we utilize this informa- tion explicitly with k-tag encoding. As PCNN (Zeng et al., 2015) can account for the relative positions of head and tail entities, it also performs better than the models tagged with s-tag using sentence order. Similar to Alt et al.(2019) 6, we also note that the pre-trained contextualized models result in sustained long tail performance. Figure 3: Precision-Recall (PR) curve for different s-tag+exprels reflects the direct application of Wu models. We see that the models with k-tag perform and He(2019) to bag-level MIL for distant RE. In better than the s-tag with average aggregation showing this case, the relations are explicitly extended to consistent performance for long-tail relations. model entity direction appearing first to second in the sentence, and vice versa. This implicitly intro- duces independence between the two sub-classes 5 Conclusion of the same relation, limiting the gain from shared This work extends BERT to bag-level MIL and in- knowledge. Likewise, with such expanded rela- troduces a simple data-driven strategy to reduce the tions, class imbalance is further enhanced to more noise in distantly supervised biomedical RE. We fine-grained classes. note that the position of entities in sentence and the Though selective attention (Lin et al., 2016) has order in KB encodes the latent direction of relation, been shown to improve the performance of distant which plays an important role for learning under RE (Luo et al., 2017; Han et al., 2018a; Alt et al., such noise. With a relatively simple methodology, 2019), models in our experiments with such an we show that this can sufficiently be achieved by attention mechanism significantly underperformed, reducing the need for additional tasks and high- in each case bumping the area under the PR curve lighting the importance of data quality. and making it flatter. We note that more than 50% of bags are under-sized, in many cases, with only Acknowledgements 1-2 sentences, requiring repeated over-sampling to match fixed bag size, therefore, making it difficult The authors would like to thank the anonymous for attention to learn a distribution over the bag reviewers for helpful feedback. The work was par- with repetitions, and further adding noise. For such tially funded by the European Union’s Horizon cases, the distribution should ideally be close to 2020 research and innovation programme under uniform, as is the case with averaging, resulting in grant agreement No. 777107 through the project better performance. Precise4Q and by the German Federal Ministry of Education and Research (BMBF) through the 6Their model does not use any entity marking strategy. project DEEPLEE (01IW17001).

191 References Seyed Mehran Kazemi and David Poole. 2018. Simple embedding for link prediction in knowledge graphs. Christoph Alt, Marc Hubner,¨ and Leonhard Hennig. In Advances in neural information processing sys- 2019. Fine-tuning pre-trained transformer language tems, pages 4284–4295. models to distantly supervised relation extraction. In Proceedings of the 57th Annual Meeting of the Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Association for Computational Linguistics, pages Donghyeon Kim, Sunkyu Kim, Chan Ho So, 1388–1398. and Jaewoo Kang. 2020. BioBERT: a pre- trained biomedical language representation model Olivier Bodenreider. 2004. The unified medical for biomedical text mining. Bioinformatics, language system (UMLS): integrating biomed- 36(4):1234–1240. ical terminology. Nucleic acids research, 32(suppl 1):D267–D270. Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Qin Dai, Naoya Inoue, Paul Reisert, Ryo Takahashi, Peysakhovich. 2019. PyTorch-BigGraph: A large- and Kentaro Inui. 2019. Distantly supervised scale graph embedding system. In Proceedings of biomedical knowledge acquisition via knowledge the 2nd SysML Conference, Palo Alto, CA, USA. graph based attention. In Proceedings of the Work- shop on Extracting Structured Knowledge from Sci- Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, entific Publications, pages 1–10. and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceed- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and ings of the 54th Annual Meeting of the Association Kristina Toutanova. 2019. BERT: Pre-training of for Computational Linguistics (Volume 1: Long Pa- deep bidirectional transformers for language under- pers), pages 2124–2133. standing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- Bingfeng Luo, Yansong Feng, Zheng Wang, Zhanxing nologies, Volume 1 (Long and Short Papers), pages Zhu, Songfang Huang, Rui Yan, and Dongyan Zhao. 4171–4186. 2017. Learning with noise: Enhance distantly su- pervised relation extraction with dynamic transition Zhou GuoDong, Su Jian, Zhang Jie, and Zhang Min. matrix. arXiv preprint arXiv:1705.03995. 2005. Exploring various knowledge in relation ex- traction. In Proceedings of the 43rd annual meeting Mike Mintz, Steven Bills, Rion Snow, and Dan Juraf- on association for computational linguistics, pages sky. 2009. Distant supervision for relation extrac- 427–434. Association for Computational Linguis- tion without labeled data. In Proceedings of the tics. Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Xu Han, Tianyu Gao, Yuan Yao, Deming Ye, Zhiyuan Natural Language Processing of the AFNLP: Vol- Liu, and Maosong Sun. 2019. OpenNRE: An open ume 2-Volume 2, pages 1003–1011. Association for and extensible toolkit for neural relation extraction. Computational Linguistics. In Proceedings of EMNLP-IJCNLP: System Demon- strations, pages 169–174. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Xu Han, Zhiyuan Liu, and Maosong Sun. 2018a. Neu- models are unsupervised multitask learners. OpenAI ral knowledge acquisition via mutual attention be- Blog, 1(8):9. tween knowledge graph and text. In Thirty-Second AAAI Conference on Artificial Intelligence. Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. 
Modeling relations and their mentions with- Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, out labeled text. In Joint European Conference Zhiyuan Liu, and Maosong Sun. 2018b. FewRel: on Machine Learning and Knowledge Discovery in A large-scale supervised few-shot relation classifica- Databases, pages 148–163. Springer. tion dataset with state-of-the-art evaluation. In Pro- ceedings of the 2018 Conference on Empirical Meth- Alan Ritter, Luke Zettlemoyer, Mausam, and Oren Et- ods in Natural Language Processing, pages 4803– zioni. 2013. Modeling missing data in distant su- 4809. pervision for information extraction. Transactions of the Association for Computational Linguistics, Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke 1:367–378. Zettlemoyer, and Daniel S Weld. 2011. Knowledge- based weak supervision for information extraction Livio Baldini Soares, Nicholas FitzGerald, Jeffrey of overlapping relations. In Proceedings of the 49th Ling, and Tom Kwiatkowski. 2019. Matching the Annual Meeting of the Association for Computa- blanks: Distributional similarity for relation learn- tional Linguistics: Human Language Technologies- ing. In Proceedings of the 57th Annual Meeting Volume 1, pages 541–550. Association for Computa- of the Association for Computational Linguistics, tional Linguistics. pages 2895–2905.

192 Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi A Data Pipeline Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced rep- In this section, we explain the steps taken to create resentation through knowledge integration. arXiv the data for distantly-supervised (DS) biomedical preprint arXiv:1904.09223. relation extraction (RE). We highlight the impor- Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, tance of a data creation pipeline as the quality of and Christopher D Manning. 2012. Multi-instance data plays a key role in the downstream perfor- multi-label learning for relation extraction. In Pro- mance of our model. We note that a pipeline is like- ceedings of the 2012 joint conference on empirical wise important for generating reproducible results, methods in natural language processing and compu- tational natural language learning, pages 455–465. and contributes toward the possibility of having Association for Computational Linguistics. either a benchmark dataset or a repeatable set of rules. Shingo Takamatsu, Issei Sato, and Hiroshi Nakagawa. 2012. Reducing wrong labels in distant supervi- A.1 UMLS processing sion for relation extraction. In Proceedings of the 50th Annual Meeting of the Association for Compu- The fact triples were obtained for English con- tational Linguistics: Long Papers-Volume 1, pages cepts, filtering for RO relation types only (Dai et al., 721–729. Association for Computational Linguis- 2019). We collected 9.9M (CUI head, relation text, tics. CUI tail) triples, where CUI represents the concept Theo´ Trouillon, Christopher R Dance, Eric´ Gaussier, unique identifier in UMLS. Johannes Welbl, Sebastian Riedel, and Guillaume Bouchard. 2017. Knowledge graph completion via A.2 MEDLINE processing complex tensor factorization. The Journal of Ma- chine Learning Research, 18(1):4735–4772. From 34.4M abstracts, we extracted 160.4M unique sentences. To perform fast and scalable search, we Linlin Wang, Zhu Cao, Gerard De Melo, and Zhiyuan use the Trie data structure7 to index all the tex- Liu. 2016. Relation classification via multi-level tual descriptions of UMLS entities. In obtaining attention cnns. In Proceedings of the 54th An- nual Meeting of the Association for Computational a clean set of sentences, we set the minimum and Linguistics (Volume 1: Long Papers), pages 1298– maximum sentence character length to 32 and 256 1307. respectively, and further considered only those sen- Xinggang Wang, Yongluan Yan, Peng Tang, Xiang Bai, tences where matching entities are mentioned only and Wenyu Liu. 2018. Revisiting multiple instance once. The latter decision is to lower the noise that neural networks. Pattern Recognition, 74:15–24. may come when only one instance of multiple oc- currences is marked for a matched entity. With Jason Weston, Antoine Bordes, Oksana Yakhnenko, and Nicolas Usunier. 2013. Connecting language these constraints, the data was reduced to 118.7M and knowledge bases with embedding models for re- matching sentences. lation extraction. In Conference on Empirical Meth- ods in Natural Language Processing, pages 1366– A.3 Groups linking and negative sampling 1371. Recall the entity groups = + (Section 3.1). G G ∪G− Shanchan Wu and Yifan He. 2019. Enriching pre- For training with NA relation class, we generate trained language model with entity information for hard negative samples with an open-world assump- relation classification. 
In Proceedings of the 28th tion (Soares et al., 2019; Lerer et al., 2019) suited to ACM International Conference on Information and Knowledge Management, pages 2361–2364. bag-level multiple instance learning (MIL). From 9.9M triples, we removed the relation type and col- Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. lected 9M CUI groups in the form of (h, t). Since 2015. Distant supervision for relation extraction via each CUI is linked to more than one textual form, piecewise convolutional neural networks. In Pro- ceedings of the 2015 conference on empirical meth- all of the text combinations for two entities must ods in natural language processing, pages 1753– be considered for a given pair, resulting in 531M 1762. textual groups for the 586 relation types. T Next, for each matched sentence, let 2 denote Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, Ps and Jun Zhao. 2014. Relation classification via con- the size 2 permutations of entities present in the volutional deep neural network. In Proceedings of sentence, then 2 return groups which are T ∩ Ps COLING 2014, the 25th International Conference present in KB and have matching evidence (positive on Computational Linguistics: Technical Papers, pages 2335–2344. 7https://github.com/vi3k6i5/flashtext

groups, G+). Simultaneously, with a probability of 1/2, we remove the h or t entity from this group and replace it with a novel entity e in the sentence, such that the resulting group (e, t) or (h, e) belongs to G−. This method results in sentences that are seen both for the true triple, as well as for the invalid ones. Further using the constraint that the relation group sizes must be between 10 and 1500, we find 354 relation types^8 (approximately the same as Dai et al. (2019)) with 92K positive groups and 2.1M negative groups, which were reduced to 64K by considering a random subset of 70% of the positive groups. Table 1 provides these summary statistics.
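As an illustration of the corruption step described above, the sketch below derives a negative group from a matched sentence by replacing the head or tail entity, each with probability 1/2, with another entity mentioned in the same sentence, and keeps the corruption only if the resulting pair is not a known KB group (the open-world assumption). Function and variable names are illustrative, not taken from the released pipeline.

```python
import random

def corrupt_group(group, entities_in_sentence, positive_groups, rng=random):
    """Turn a positive (head, tail) group into a hard negative for the NA class.

    group                : (head, tail) pair matched in the sentence
    entities_in_sentence : all entity mentions found in the same sentence
    positive_groups      : set of (head, tail) pairs known to exist in the KB
    Returns a corrupted (head, tail) pair not present in the KB, or None if
    no suitable replacement entity occurs in the sentence.
    """
    head, tail = group
    candidates = [e for e in entities_in_sentence if e not in (head, tail)]
    rng.shuffle(candidates)
    replace_head = rng.random() < 0.5  # corrupt head or tail with probability 1/2
    for e in candidates:
        negative = (e, tail) if replace_head else (head, e)
        if negative not in positive_groups:  # open-world assumption
            return negative
    return None
```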

A.4 Bag composition and data splits For bag composition, we created bags of constant size by randomly under- or over-sampling the sen- tences in the bag (Han et al., 2019) to avoid larger bias towards common entities (Soares et al., 2019). The true distribution had a long tail, with more than 50% of the bags having 1 or 2 sentences. We de- fined a bag to be uniform, if the special markers represent the same entity in each sentence, either h or t. If the special markers can take on both h or t, we consider that bag to have a mix composi- tion. The k-tag scheme, on the other hand, naturally generates uniform bags. Further, to support the set- ting of Wu and He(2019), we followed the s-tag scheme and expanded the relations by adding a suf- fix to denote the directions as r(e1, e2) or r(e2, e1), with the exception of the NA class, resulting in 709 classes. For fair comparisons with k-tag, we gener- ated uniform bags with s-tag as well, by keeping e1 and e2 the same per bag. Due to these bag compo- sition and class expansion (in one setting, exprels) differences, we generated three different splits, sup- porting each scheme, with the same test sets in cases where the classes are not expanded and a dif- ferent test set when the classes are expanded. Table A.1 shows the statistics for these splits.
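The bag composition just described can be sketched as below: bags larger than the target size are randomly under-sampled, while smaller bags are padded by over-sampling with repetition. The bag size of 16 matches the setting reported in Section 4.2; the function itself is an illustrative reconstruction rather than the released code. Table A.1 then summarizes the resulting splits.

```python
import random

def compose_bag(sentences, bag_size=16, rng=random):
    """Return a bag with exactly `bag_size` sentences for one entity group."""
    if len(sentences) >= bag_size:
        return rng.sample(sentences, bag_size)  # randomly under-sample large bags
    bag = list(sentences)                       # keep every available sentence
    while len(bag) < bag_size:
        bag.append(rng.choice(sentences))       # over-sample (repeat) small bags
    rng.shuffle(bag)
    return bag
```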

Table A.1: Different data splits.

Model           Set Type   Triples    Triples (w/o NA)   Groups     Sentences (Sampled)
k-tag           train      92,972     48,563             92,972     1,487,552
                valid      13,555     8,399              15,963     255,408
                test       33,888     20,988             38,860     621,760
s-tag           train      91,555     47,588             125,852    2,013,632
                valid      13,555     8,399              22,497     359,952
                test       33,888     20,988             55,080     881,280
s-tag+exprels   train      125,155    71,402             125,439    2,007,024
                valid      22,604     16,298             22,607     361,712
                test       55,083     39,282             55,094     881,504

8 355 including the NA relation

Global Locality in Biomedical Relation and Event Extraction

Elaheh ShafieiBavani, Antonio Jimeno Yepes, Xu Zhong, David Martinez Iraola
IBM Research Australia
{elaheh.shafieibavani, david.martinez.iraola1}@ibm.com
{antonio.jimeno, peter.zhong}@au1.ibm.com

Abstract verb), and could be an argument of other events to connect more than two entities. Event extraction Due to the exponential growth of biomedical is a more complicated task compared to relation literature, event and relation extraction are im- portant tasks in biomedical text mining. Most extraction due to the tendency of events to capture work only focus on relation extraction, and de- the semantics of texts. For clarity, Figure1 shows tect a single entity pair mention on a short an example from the GE11 shared task corpus that span of text, which is not ideal due to long includes two nested events. sentences that appear in biomedical contexts. We propose an approach to both relation and event extraction, for simultaneously predicting relationships between all mention pairs in a text. We also perform an empirical study to discuss different network setups for this pur- pose. The best performing model includes a set of multi-head attentions and convolu- Figure 1: Example of nested events from GE11 shared tions, an adaptation of the transformer archi- task tecture, which offers self-attention the ability to strengthen dependencies among related el- Recently, deep neural network models obtain ements, and models the interaction between state-of-the-art performance for event and relation features extracted by multiple attention heads. Experiment results demonstrate that our ap- extraction. Two major neural network architectures proach outperforms the state of the art on a for this purpose include Convolutional Neural Net- set of benchmark biomedical corpora includ- works (CNNs) (Santos et al., 2015; Zeng et al., ing BioNLP 2009, 2011, 2013 and BioCre- 2015) and Recurrent Neural Networks (RNNs) ative 2017 shared tasks. (Mallory et al., 2015; Verga et al., 2015; Zhou et al., 2016). While CNNs can capture the local features 1 Introduction based on the convolution operations and are more Event and relation extraction has become a key re- suitable for addressing short sentence sequences, search topic in natural language processing with RNNs are good at learning long-term dependency a variety of practical applications especially in features, which are considered more suitable for the biomedical domain, where these methods are dealing with long sentences. Therefore, combining widely used to extract information from massive the advantages of both models is the key point for document sets, such as scientific literature and pa- improving biomedical event and relation extraction tient records. This information contains the inter- performance (Zhang et al., 2018). actions between named entities such as protein- However, encoding long sequences to incorpo- protein, drug-drug, chemical-disease, and more rate long-distance context is very expensive in complex events. RNNs (Verga et al., 2018) due to their computa- Relations are usually described as typed, some- tional dependence on the length of the sequence. times directed, pairwise links between defined In addition, computations could not be parallelized named entities (Bjorne¨ et al., 2009). Event ex- since each token’s representation requires as input traction differs from relation extraction in the sense the representation of its previous token. In con- that an event has an annotated trigger word (e.g., a trast, CNNs can be executed entirely in parallel

195 Proceedings of the BioNLP 2020 workshop, pages 195–204 Online, July 9, 2020 c 2020 Association for Computational Linguistics across the sequence, and have shown good perfor- Section6 summarizes the findings of the paper and mance in event and relation extraction (Bjorne¨ and presents future work. Salakoski, 2018). However, the amount of context incorporated into a single token’s representation 2 Background is limited by the depth of the network, and very deep networks can be difficult to learn (Hochreiter, Biomedical event and relation extraction have been 1998). developed thanks to the contribution of corpora generated for community shared tasks (Kim et al., To address these problems, self-attention net- 2009, 2011;N edellec´ et al., 2013; Segura Bedmar works (Parikh et al., 2016; Lin et al., 2017) come et al., 2011, 2013; Krallinger et al., 2017). In these into play. They have shown promising empirical tasks, relevant biomedical entities such as genes, results in various natural language processing tasks, proteins and chemicals are given and the informa- such as information extraction (Verga et al., 2018), tion extraction methods aim to identify relations machine translation (Vaswani et al., 2017) and nat- alone or relations and events together within a sen- ural language inference (Shen et al., 2018). One of tence span. their strengths lies in their high parallelization in A variety of methods have been evaluated on computation and flexibility in modeling dependen- these tasks, which range from rule based meth- cies regardless of distance by explicitly attending ods to more complex machine learning methods, to all the elements. In addition, their performance either supported by shallow or deep learning ap- can be improved by multi-head attention (Vaswani proaches. Some of the deep learning based meth- et al., 2017), which projects the input sequence ods include CNNs (Bjorne¨ and Salakoski, 2018; into multiple subspaces and applies attention to the Santos et al., 2015; Zeng et al., 2015) and RNNs representation in each subspace. (Li et al., 2019; Mallory et al., 2015; Verga et al., In this paper, we propose a new neural network 2015; Zhou et al., 2016). CNNs will identify local model that combines multi-head attention mecha- context relations while their performance may suf- nisms with a set of convolutions to provide global fer when entities need to be identified in a broader locality in biomedical event and relation extraction. context. On the other hand, RNNs are difficult to Convolutions capture the local structure of text, parallelize while they do not fully solve the long de- while self-attention learns the global interaction pendency problem (Verga et al., 2018). Moreover, between each pair of words. Hence, our approach such approaches are proposed for relation extrac- models locality for self-attention while the interac- tion, but not to extract nested events. In this work, tions between features are learned by multi-head we intend to improve over existing methods. We attentions. The experiment results over the biomed- combine a set of parallel multi-head attentions with ical benchmark corpora show that providing global a set of 1D convolutions to provide global locality locality outperforms the existing state of the art in biomedical event and relation extraction. Our for biomedical event and relation extraction. The approach models locality for self-attention while proposed architecture is shown in Figure2. 
the interactions between features are learned by Conducting a set of experiments over the cor- multi-head attentions. We evaluate our model on pora of the shared tasks for BioNLP 2009, 2011 data from the shared tasks for BioNLP 2009, 2011 and 2013, and BioCreative 2017, we compare the and 2013, and BioCreative 2017. performance of our model with the best-performing The BioNLP Event Extraction tasks provide the system (TEES) (Bjorne¨ and Salakoski, 2018) in the most complex corpora with often large sets of event shared tasks. The results we achieve via precision, types and at times relatively small corpus sizes. Our recall, and F-score demonstrate that our model ob- proposed approach achieves higher performance tains state-of-the-art performance. We also em- on the GE09, GE11, EPI11, ID11, REL11, GE13, pirically assess three variants of our model and CG13 and PC13 BioNLP Shared Task corpora, elaborate on the results further in the experiments. compared to the top performing system (TEES) The rest of the paper is organized as follows. Sec- (Bjorne¨ and Salakoski, 2018) for both relation and tion2 summarizes the background. The data, and event extraction in these tasks. Since the annota- the proposed approach are explained in Sections3 tions for the test sets of the BioNLP Shared Task and4 respectively. Section5 explains the experi- corpora are not provided, we uploaded our predic- ments and discusses the achieved results. Finally, tions to the task organizers’ servers for evaluation.


[Figure 2 diagram: inputs (1 Word Embedding, 2 Relative Position Embedding, 3 Distance Embedding) are merged into a Vector Representation, then passed through parallel Multi-head Attentions and 1D-Convolutions, followed by Max Pooling, a Merge Layer, and the Output Layer.]

Figure 2: Our model architecture for biomedical event and relation extraction: The embedding vectors are merged together before the multi-head attention and convolution layers. The global max pooling is then applied to the results of these operations. Finally, the output layer shows the predicted labels.
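To make the data flow of Figure 2 concrete, the following is a schematic PyTorch sketch of the 4MHA-4CNN variant: merged input embeddings pass through four parallel branches, each a multi-head attention followed by a 1D convolution (window sizes 1, 3, 5 and 7), global max pooling, a merge (concatenation) layer, and a sigmoid output layer for multi-label classification. The head count (8) and filter count (64) follow the values reported in Section 5; everything else, including the use of PyTorch's standard nn.MultiheadAttention in place of the paper's ReLU-activated projections and the assumed 216-dimensional merged embedding (200-dim word vectors plus two 8-dim feature embeddings), is an illustrative assumption rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MHAConvBranch(nn.Module):
    """One parallel branch: multi-head self-attention followed by a 1D convolution."""

    def __init__(self, d_model, kernel_size, n_heads=8, n_filters=64):
        super().__init__()
        # d_model must be divisible by n_heads (216 / 8 = 27 here).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, n_filters, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        attended, _ = self.attn(x, x, x)           # global token-to-token interactions
        local = torch.relu(self.conv(attended.transpose(1, 2)))  # local n-gram features
        return local.max(dim=-1).values            # global max pooling -> (batch, n_filters)

class MHA4CNN4(nn.Module):
    """Schematic 4MHA-4CNN: four attention+convolution branches, merged, sigmoid output."""

    def __init__(self, d_model, n_labels, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(MHAConvBranch(d_model, k) for k in kernel_sizes)
        self.out = nn.Linear(64 * len(kernel_sizes), n_labels)

    def forward(self, x):                          # x: merged input embeddings
        merged = torch.cat([branch(x) for branch in self.branches], dim=-1)
        return torch.sigmoid(self.out(merged))     # multi-label probabilities

# Example: a batch of 2 sentences, 50 tokens, 216-dim merged embeddings, 10 labels.
model = MHA4CNN4(d_model=216, n_labels=10)
probs = model(torch.randn(2, 50, 216))             # -> shape (2, 10)
```

Training such a model with the Adam optimizer and binary cross-entropy, as described in Section 4.4, treats each label independently, so an example can receive zero, one, or several positive labels.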

The CHEMPROT corpus in the BioCreative VI Corpus Domain E I S Chemical–Protein relation extraction task (CP17) GE09 Molecular Biology 10 6 11380 also provides a standard comparison with current GE11 Molecular Biology 10 6 14958 methods in relation extraction. The CHEMPROT EPI11 Epigenetics and PTM:s 16 6 11772 corpus is relatively large compared to its low num- ID11 Infection Diseases 11 7 5118 ber of five relation types. Our model outperforms REL11 Entity Relations 1 2 11351 the best-performing system (TEES) (Bjorne¨ and GE13 Molecular Biology 15 6 8369 CG13 Cancer Genetics 42 9 5938 Salakoski, 2018) for relation extraction in this task. PC13 Pathway Curation 24 9 5040 CP17 Chemical-Protein Int. - 5 24594 3 Data

We develop and evaluate our approach on a Table 1: Information about the domain, number of event and entity types (E), number of event argument number of event and relation extraction corpora. and relation types (I), and number of sentences (S), re- These corpora originate from three BioNLP Shared lated to the corpora of the biomedical shared tasks Tasks (Kim et al., 2009; Bjorne¨ and Salakoski, 2011;N edellec´ et al., 2013) and the BioCre- tions between genes and mutations. We extracted ative VI Chemical–Protein relation extraction task about 30% of the training set as the validation set. (Krallinger et al., 2017). The BioNLP corpora cover various domains of molecular biology and 4 Model provide the most complex event annotations. The BioCreative corpora use pairwise relation anno- We propose a new biomedical event extraction tations. Table1 shows information about these model that is mainly built upon multi-head atten- corpora. tions to learn the global interactions between each For further analysis and experiments, we also pair of tokens, and convolutions to provide locality. used the AMIA gene-mutation corpus available in The proposed neural network architecture consists (Jimeno Yepes et al., 2018). The training/testing of 4 parallel multi-head attentions followed by a set sets contain 2656/385 mentions of mutations, and of 1D convolutions with window sizes 1, 3, 5 and 7. 2799/280 of genes or proteins, and 1617/130 rela- Our model attends to the most important tokens in

197 the input features1, and enhances the feature extrac- tors. We also consider Relative Position features tion of dependent elements across multiple heads, to identify the locations and roles (i.e., entities, irrespective of their distance. Moreover, we model event triggers, and arguments) of tokens in the clas- locality for multi-head attentions by restricting the sified structure. Finally, these embeddings with attended tokens to local regions via convolutions. their learned weights3 are concatenated together to The relation and event extraction task is mod- shape an n-dimensional vector ei for each word to- elled as a graph representation of events and rela- ken. This merged input sequence is then processed tions (Bjorne¨ and Salakoski, 2018). Entities and by a set of parallel multi-head attentions followed event triggers are nodes, and relations and event ar- by convolutional layers. guments are the edges that connect them. An event is modelled as a trigger node and its set of outgoing 4.2 Multi-head Attention edges. Relation and event extraction are performed Self-attention networks produce representations by through the following classification tasks: (i) En- applying attention to each pair of tokens from the tity and Trigger Detection, which is a named-entity input sequence, regardless of their distance. Ac- recognition task where entities and event triggers in cording to the previous work (Vaswani et al., 2017), a sentence span are detected to generate the graph multi-head attention applies self-attention multiple nodes; (ii) Relation and Event Detection, where times over the same inputs using separately normal- relations and event arguments are predicted for all ized parameters (attention heads) and combines the valid pairs of entity and trigger nodes to create the results, as an alternative to applying one pass of at- graph edges; (iii) Event Duplication, where each tention with more parameters. The intuition behind event is classified as an event or a negative which this modeling decision is that dividing the attention causes unmerging in the graph2; (iv) Modifier De- into multiple heads makes it easier for the model to tection, in which event modality (speculation or learn to attend to different types of relevant infor- negation) is detected. In relation extraction tasks mation with each head. The self-attention updates where entities are given, only the second classifica- input embeddings ei by performing a weighted sum tion task is partially used. over all tokens in the sequence, weighted by their The same network architecture is used for all importance for modeling token i. Given an input I d four classification tasks, with the number of pre- sequence E = e1, ..., eI R × , the model first { } ∈ dicted labels changing between tasks. projects each input to a key k, value v, and query q, using separate affine transformations with ReLU 4.1 Inputs activations (Glorot et al., 2011). Here, k, v, and d The input is modelled in the context of a sen- q are each in R H , where d indicates the hidden tence window, centered around the target entity, size, and H is the number of heads. The attention h relation or event. The sentence is modelled as weights aij for head h between tokens i and j are a linear sequence of word tokens. 
Following computed using scaled dot-product attention: the work in (Bjorne¨ and Salakoski, 2018), we hT h use a set of embedding vectors as the input fea- h qi kj aij = σ( ) (1) tures, where each unique word token is mapped √d to the relevant vector space embeddings. We use oh = vh sh the pre-trained 200-dimensional word2vec vectors i j ij (Mikolov et al., 2013) induced on a combination of Xj the English Wikipedia and the millions of biomed- where oh is the output of the attention head h. de- i ical research articles from PubMed and PubMed notes element-wise multiplication and σ indicates Central (Moen and Ananiadou, 2013), along with a softmax along the jth dimension. The scaled at- the 8-dimensional embeddings of relative positions, tention is meant to aid optimization by flattening and distances learned from the input corpus. Fol- the softmax and better distributing the gradients lowing the work in (Zeng et al., 2014), we use (Vaswani et al., 2017). The outputs of the indi- Distance features, where the relative distances to vidual attention heads are concatenated into oi as: 1 H tokens of interest are mapped to their own vec- oi = [oi ; ...; oi ]. Herein, all layers use residual 1We choose different embeddings for each task/dataset to 3The only exception is for the word vectors, where the be in line with TEES. original weights are used to provide generalization to words 2Since events are n-ary relations, event nodes may overlap. outside the task’s training corpus.

198 connections between the output of the multi-headed is composed of four 1D convolutions with window attention and its input. Layer normalization (Lei Ba sizes 1, 3, 5 and 7. In our models and TEES, we et al., 2016), LN(.), is then applied to the output: set the number of filters for the convolutions to 64. mi = LN(ei + oi). The multi-head attention layer The number of heads for multi-head attentions is uses a softmax activation function. also set to 8. The reported results of TEES are achieved by running their out-of-the-box system 4.3 Convolutions for different tasks. The multi-head attentions are then followed by a set Since training a single model can be prone to of parallel 1D convolutions with window sizes 1, 3, overfitting if the validation set is too small (Bjorne¨ 5 and 7. Adding these explicit n-gram modelings and Salakoski, 2018), we use mixed 5 model en- helps the model to learn to attend to local features. semble, which takes 5-best models (out of 20), Our convolutions use the ReLU activation function. ranked with micro-averaged F-score on random- We use C(.) to denote a convolutional operator. ized train/validation set split, and considers their The convolutional portion of the model is given by: averaged predictions. These ensemble predictions are calculated for each label as the average of all ci = ReLU(C(mi)) (2) the models’ predicted confidence scores. Precision, recall, and F-score of the proposed approach and Global max pooling is then applied to each 1D its variants are compared to TEES in Table2. Our convolution and the resulting features are merged model (4MHA-4CNN) obtains the state-of-the-art together into an output vector. results compared to those of the top performing system (TEES) in different shared tasks: BioNLP 4.4 Classification (GE09, GE11, EPI11, ID11, REL11, GE13, CG13, Finally, the output layer performs the classifica- PC13), BioCreative (CP17), and the AMIA dataset. tion, where each label is represented by one neuron. Analyzing the results, we observe that the pro- The classification layer uses the sigmoid activation posed 4MHA-4CNN model has the best F-score function. Classification is performed as multilabel in the majority of datasets except for EPI11, ID11 classification where each example may have zero, and CG13, where the proposed MHA models (i.e., one or multiple positive labels. 1MHA and 4MHA) have the best F-score and re- We use the adam optimizer with binary crossen- call. These tasks are related to epigenetics and tropy and a learning rate of 0.001. Dropout of 0.1 is post-translational modifications (EPI11), infection also applied at two steps of merging input features diseases (ID11) and cancer genetics (CG13), where and global max pooling to provide generalization. events typically require long dependencies in most of the cases. It explains why the MHA-alone mod- 5 Experiments and Results els are better than when combined with convolu- We have conducted a set of experiments to eval- tions. The F-scores achieved by 4MHA-4CNN uate our proposed approach over the benchmark and 4MHA models on GE09 dataset are also very biomedical corpora. In addition to evaluating our close. In many cases, when using the configura- main model (4MHA-4CNN), we have evaluated tions in which MHA is applied to the input features, the performance of three variants of our proposed both precision and recall are better compared to approach: (i) 4MHA: 4 parallel multi-head atten- other configurations. 
Moreover, having four paral- tions apply self-attention multiple times over the lel MHAs applied to the input features outperforms input features; (ii) 1MHA: only 1 multi-head atten- 1MHA and the other potential variants6. tion applies self-attention to the input features; (iii) In terms of precision, the advantage of applying 4CNN-4MHA: multiple self-attentions are applied 4CNN versus 4MHA to the merged input features to the input features via a set of 1D convolutions4. depends on the dataset. On PC13, the precision The 4CNN architecture matches the best perform- when using 4CNN on the merged input features is ing configuration (4CNN - mixed 5 X ensemble)5 much higher compared to other configurations, but used by TEES (Bjorne¨ and Salakoski, 2018), which the recall is significantly lower. The proposed 4MHA-4CNN model has also 4We also conducted experiments with 1CNN-1MHA and 1MHA-1CNN, which are excluded due to the poor perfor- 6The experiment with 8MHA, and multiple MHAs one mance. after the other on the whole sequence are excluded from the 5We use 4CNN to represent this configuration. paper due to the poor perfromance.

199 Task Precision Recall F-score Approach 65.73 44.72 53.23 TEES 4CNN 65.01 46.83 54.44 Proposed 4MHA GE09 64.37 45.19 53.10 Proposed 1MHA 61.99 45.51 52.48 Proposed 4CNN-4MHA 65.98 45.60 53.93 Proposed 4MHA-4CNN 66.09 46.62 54.68 TEES 4CNN 66.19 48.67 56.09 Proposed 4MHA GE11 66.26 48.60 56.07 Proposed 1MHA 67.07 47.61 55.69 Proposed 4CNN-4MHA 66.12 49.34 56.51 Proposed 4MHA-4CNN 63.31 46.73 53.78 TEES 4CNN 63.71 50.73 56.48 Proposed 4MHA EPI11 66.38 49.85 56.94 Proposed 1MHA 63.60 45.72 53.20 Proposed 4CNN-4MHA 65.43 48.55 55.74 Proposed 4MHA-4CNN 70.14 44.36 54.35 TEES 4CNN 66.63 48.65 56.24 Proposed 4MHA ID11 71.64 46.99 56.75 Proposed 1MHA 68.92 41.04 51.44 Proposed 4CNN-4MHA 69.05 44.91 54.43 Proposed 4MHA-4CNN 71.26 62.37 66.52 TEES 4CNN 71.56 63.78 67.45 Proposed 4MHA REL11 68.55 64.39 66.40 Proposed 1MHA 71.02 55.53 62.33 Proposed 4CNN-4MHA 71.91 65.39 68.50 Proposed 4MHA-4CNN 62.22 39.96 48.66 TEES 4CNN 60.68 40.35 48.47 Proposed 4MHA GE13 60.21 40.75 48.60 Proposed 1MHA 58.14 37.66 45.71 Proposed 4CNN-4MHA 59.76 41.65 49.09 Proposed 4MHA-4CNN 66.08 49.05 56.30 TEES 4CNN 65.92 53.50 59.06 Proposed 4MHA CG13 67.02 52.49 58.87 Proposed 1MHA 61.91 48.02 54.09 Proposed 4CNN-4MHA 65.47 51.71 57.78 Proposed 4MHA-4CNN 63.49 43.37 51.54 TEES 4CNN 59.45 49.90 54.26 Proposed 4MHA PC13 60.64 47.25 53.11 Proposed 1MHA 57.61 43.23 49.39 Proposed 4CNN-4MHA 60.51 49.43 54.41 Proposed 4MHA-4CNN 73.00 45.00 56.00 TEES 4CNN 70.00 58.00 63.00 Proposed 4MHA CP17 77.00 48.00 58.00 Proposed 1MHA 77.00 44.00 56.00 Proposed 4CNN-4MHA 75.00 50.00 60.00 Proposed 4MHA-4CNN 84.41 87.52 85.90 TEES 4CNN 83.73 88.51 86.01 Proposed 4MHA AMIA 85.12 89.50 87.31 Proposed 1MHA 85.02 89.01 87.00 Proposed 4CNN-4MHA 85.21 90.11 87.53 Proposed 4MHA-4CNN

Table 2: Precision, Recall and F-score, measured on the corpora of various shared tasks for our models, and the state of the art. The best scores (the first and the second highest scores) for each task are bolded and high- lighted, respectively. All the results (except those of CP17 and AMIA) are evaluated using the official evaluation program/server of each task. good recall, except for EPI11, ID11, and CG13, tions might be less useful in these three sets, since where 4MHA is better. As mentioned before, the sentences in these topics describe interactions for addition of convolutions after the multi-head atten- which long context dependencies are present.

(Attention heat maps over the sentence "The presence of activating TSH-R mutations has also been demonstrated in differentiated thyroid carcinomas." for the 4MHA-4CNN, 1MHA, and 4MHA architectures.)

Figure 3: Visualization of multi-head attention in different architectures

Overall, our observations support the hypothesis that higher recall/F-score is obtained in configurations in which 4MHA is applied first to the merged input features, where CNNs are not as convenient as MHAs for dealing with long dependencies.

5.1 Discussion

Besides improving the previous state of the art, the results indicate that combining multi-head attention with convolution provides effective performance compared to the individual components. Among the variants of our model, 4MHA also outperforms TEES over all the shared tasks reported in Table 2. Even though convolutions are quite effective on their own (Björne and Salakoski, 2018), multi-head attentions improve their performance by being able to deal with longer dependencies.

Figure 3 shows the multi-head attention (sum of the attention of all heads) of the "relation and event detection" classification task for the different proposed network architectures (4MHA-4CNN, 1MHA, and 4MHA) on the sample sentence "The presence of activating TSH-R mutations has also been demonstrated in differentiated thyroid carcinomas.". In the 4MHA and 4MHA-4CNN models, the four multi-head attention layers contribute distinctively different attentions from each other. This allows the 4MHA and 4MHA-4CNN models to independently exploit more relationships between the tokens than the 1MHA model. In addition, the convolutions make the 4MHA-4CNN model have more focused attentions on certain important tokens than the 4MHA model.

Considering the computational complexity, according to the work in (Vaswani et al., 2017), self-attention has a cost that is quadratic with the length of the sequence, while the convolution cost is quadratic with respect to the representation dimension of the data. The representation dimension of the data is typically higher than the length of individual sentences. Outperforming convolutions in terms of both computational complexity and F-score, multi-head attention mechanisms seem to be better suited. Although the addition of convolutions after the multi-head attention makes the model more expensive, the lower representation dimension of the filters reduces the cost.

5.2 Error Analysis

We have performed error analysis on the baseline system (TEES) and our approach [7] over the gene-mutation AMIA and CP17 datasets [8], and observed the following sources of error.

[7] We consider the same configuration for the convolutions in both TEES and our approach.
[8] We only use these datasets for error analysis due to the limited access to the gold sets of the other datasets. Hence, this error analysis only covers relation extraction.

(a) Has_Mutation relations between the DNA_Mutation "mutations" and the three Gene_protein entities MLH1, MSH2, and MSH6 in "... 25 pathogenic mutations ( 8 in MLH1, 9 in MSH2 and 8 in MSH6 ) were found."

(b) A Has_Mutation relation between a DNA_Mutation and the Gene_protein TGF-β RII in "additions or deletions of the mononucleotide repeat sequence present in TGF-β RII ..."

(c) A Has_Mutation relation between the Gene_protein SMAD2 and a DNA_Mutation in "... by a SMAD2 inactivating mutation in CIN tumors, or by the existence of a ..."

Figure 4: Error analysis of TEES and our approach over the gene-mutation AMIA dataset

(Three panels plotting Precision, Recall, and F-score (y-axis from 0 to 0.4) of the 4MHA-4CNN, 4MHA, and 4CNN architectures against the distance buckets 0..4, 5..9, 10..14, and 15 or more.)

Table 3: Empirical evaluation of long-distance dependencies on CP17

Relations involving multiple entities: This is a major source of false negatives for TEES, while our approach exhibits a more robust behavior and achieves full recall. The reason would be the ability of multi-head attention to jointly attend to information from different representation subspaces at different positions (Vaswani et al., 2017). In an example from the AMIA dataset (Figure 4 (a)), there is a "has mutation" relationship between the term "mutations" and the three gene-protein entities "MLH1", "MSH2", and "MSH6". While the state-of-the-art approach only finds the relation between the mutation and the first gene-protein (MLH1) and ignores the other two relations, our approach captures the relations between the mutation and all three entities (MLH1, MSH2, and MSH6).

Long-distance dependencies: TEES also seems to have difficulty in annotating long-distance relations, as in the missed relation between "deletions" and "TGF-β" in an example from the AMIA dataset (Figure 4 (b)), which is captured by our approach. We explored this issue further by plotting the performance of the different proposed architectures and that of TEES over different distances. We relied on the CP17 dataset, since its test set is considerably larger than AMIA's. We performed this analysis for the best-performing proposed network architecture (4MHA-4CNN) along with the 4MHA and 4CNN architectures separately as the individual components, to study how these architectures behave in capturing distant relations. We measure the distance as the number of tokens between the farthest entities involved in a relation, employing the tokenization carried out by the TEES pre-processing tool. The results are provided in Table 3. Regardless of the evaluation metric used, we observe that the scores decrease at longer distances, and 4MHA outperforms the other two architectures, which lies in the ability of multi-head attention to capture long-distance dependencies. This experiment shows how 4MHA provides globality in 4MHA-4CNN, which slightly outperforms 4CNN at longer distances.
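A small sketch of the distance measure used in this analysis (number of tokens between the farthest entities of a relation, grouped into the buckets reported in Table 3). Whitespace tokenization is used here purely for illustration; the experiments rely on the TEES pre-processing tool.

    def entity_distance(tokens, entity_spans):
        """Number of tokens between the farthest entities of a relation.

        tokens: the tokenised sentence.
        entity_spans: (start, end) token indices of each entity in the relation.
        """
        first = min(start for start, _ in entity_spans)
        last = max(end for _, end in entity_spans)
        return max(0, last - first - 1)

    def distance_bucket(distance):
        """Group a distance into the buckets 0..4, 5..9, 10..14, 15 or more."""
        if distance <= 4:
            return "0..4"
        if distance <= 9:
            return "5..9"
        if distance <= 14:
            return "10..14"
        return "15.."

    # Toy example with whitespace tokenisation.
    sentence = ("25 pathogenic mutations ( 8 in MLH1 , 9 in MSH2 "
                "and 8 in MSH6 ) were found").split()
    spans = [(2, 2), (14, 14)]  # "mutations" and "MSH6"
    print(distance_bucket(entity_distance(sentence, spans)))  # 10..14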

Negative or speculative contexts: Regarding the false positives for TEES that are generally well handled by our system, the annotation of speculative or negative language seems to be problematic. For instance, as depicted in Figure 4 (c), TEES incorrectly captures the relation between "mutation" and "SMAD2", despite the negative cue "inactivating". Even though our approach correctly ignores this false positive at short distance, it still captures speculative long dependencies, which motivates a natural extension of our work in the future.

6 Conclusion

We have proposed a novel architecture based on multi-head attention and convolutions, which deals with the long dependencies typical of biomedical literature. The results show that this architecture outperforms the state of the art on existing biomedical information extraction corpora. While multi-head attention identifies long dependencies in extracting relations and events, convolutions provide the additional benefit of capturing more local relations, which improves the performance of existing approaches. The finding that CNN-before-MHA is outperformed by MHA-before-CNN is very interesting and could be used as a competitive baseline for future work.

Our ongoing work includes generalizing our findings to other non-biomedical information extraction tasks. Current work is focused on event and relation extraction from a single short/long sentence; we would like to experiment with additional contents to study the behaviour of these models across sentence boundaries (Verga et al., 2018). Finally, we intend to extend our approach to deal with speculative contexts by considering more semantic linguistic features, e.g., sense embeddings (Rothe and Schütze, 2015) on biomedical literature.

References

Jari Björne, Juho Heimonen, Filip Ginter, Antti Airola, Tapio Pahikkala, and Tapio Salakoski. 2009. Extracting complex biological events with rich graph-based feature sets. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pages 10–18. Association for Computational Linguistics.

Jari Björne and Tapio Salakoski. 2011. Generalizing biomedical event extraction. In Proceedings of the BioNLP Shared Task 2011 Workshop, pages 183–191. Association for Computational Linguistics.

Jari Björne and Tapio Salakoski. 2018. Biomedical event extraction using convolutional neural networks and dependency parsing. In Proceedings of the BioNLP 2018 workshop, pages 98–108.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315–323.

Sepp Hochreiter. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116.

Antonio Jimeno Yepes, Andrew MacKinlay, Natalie Gunn, Christine Schieber, Noel Faux, Matthew Downton, Benjamin Goudey, and Richard L Martin. 2018. A hybrid approach for automated mutation annotation of the extended human mutation landscape in scientific literature. In AMIA Annual Symposium Proceedings, volume 2018, page 616. American Medical Informatics Association.

Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun'ichi Tsujii. 2009. Overview of BioNLP'09 shared task on event extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pages 1–9. Association for Computational Linguistics.

Jin-Dong Kim, Yue Wang, Toshihisa Takagi, and Akinori Yonezawa. 2011. Overview of Genia event task in BioNLP shared task 2011. In Proceedings of the BioNLP Shared Task 2011 Workshop, pages 7–15. Association for Computational Linguistics.

Martin Krallinger, Obdulia Rabal, Saber A Akhondi, et al. 2017. Overview of the BioCreative VI chemical-protein interaction track. In Proceedings of the sixth BioCreative challenge evaluation workshop, volume 1, pages 141–146.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Diya Li, Lifu Huang, Heng Ji, and Jiawei Han. 2019. Biomedical event extraction based on knowledge-driven tree-LSTM. In NAACL-HLT.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.

Emily K Mallory, Ce Zhang, Christopher Re, and Russ B Altman. 2015. Large-scale extraction of gene interactions from full-text literature using DeepDive. Bioinformatics, 32(1):106–113.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

SPFGH Moen and Tapio Salakoski2 Sophia Ananiadou. 2013. Distributional semantics resources for biomedical text processing. Proceedings of LBM, pages 39–44.

Claire Nédellec, Robert Bossy, Jin-Dong Kim, Jung-Jae Kim, Tomoko Ohta, Sampo Pyysalo, and Pierre Zweigenbaum. 2013. Overview of BioNLP shared task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop, pages 1–7.

Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.

Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. arXiv preprint arXiv:1507.01127.

Cicero Nogueira dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. arXiv preprint arXiv:1504.06580.

Isabel Segura Bedmar, Paloma Martínez, and María Herrero Zazo. 2013. SemEval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). Association for Computational Linguistics.

Isabel Segura Bedmar, Paloma Martínez, and Daniel Sánchez Cisneros. 2011. The 1st DDIExtraction-2011 challenge task: Extraction of drug-drug interactions from biomedical texts.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2018. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. In Thirty-Second AAAI Conference on Artificial Intelligence.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.

Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, and Andrew McCallum. 2015. Multilingual relation extraction using compositional universal schema. arXiv preprint arXiv:1511.06396.

Patrick Verga, Emma Strubell, and Andrew McCallum. 2018. Simultaneously self-attending to all mentions for full-abstract biological relation extraction. arXiv preprint arXiv:1802.10569.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1753–1762.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, Jun Zhao, et al. 2014. Relation classification via convolutional deep neural network.

Yijia Zhang, Hongfei Lin, Zhihao Yang, Jian Wang, Shaowu Zhang, Yuanyuan Sun, and Liang Yang. 2018. A hybrid model based on neural networks for biomedical relation extraction. Journal of biomedical informatics, 81:83–92.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 207–212.

An Empirical Study of Multi-Task Learning on BERT for Biomedical Text Mining

Yifan Peng, Qingyu Chen, Zhiyong Lu
National Center for Biotechnology Information
National Library of Medicine, National Institutes of Health
Bethesda, MD, USA
{yifan.peng, qingyu.chen, zhiyong.lu}@nih.gov

Abstract

Multi-task learning (MTL) has achieved remarkable success in natural language processing applications. In this work, we study a multi-task learning model with multiple decoders on a variety of biomedical and clinical natural language processing tasks such as text similarity, relation extraction, named entity recognition, and text inference. Our empirical results demonstrate that the MTL fine-tuned models outperform state-of-the-art transformer models (e.g., BERT and its variants) by 2.0% and 1.3% in the biomedical and clinical domains, respectively. Pairwise MTL further demonstrates in more detail which tasks can improve or decrease others. This is particularly helpful when researchers are faced with the hassle of choosing a suitable model for new problems. The code and models are publicly available at https://github.com/ncbi-nlp/bluebert.

1 Introduction

Multi-task learning (MTL) is a field of machine learning where multiple tasks are learned in parallel while using a shared representation (Caruana, 1997). Compared with learning multiple tasks individually, this joint learning effectively increases the sample size for training the model and thus leads to performance improvement by increasing the generalization of the model (Zhang and Yang, 2017). This is particularly helpful in applications such as medical informatics where (labeled) datasets are hard to collect to fulfill the data-hungry needs of deep learning.

MTL has long been studied in machine learning (Ruder, 2017) and has been used successfully across different applications, from natural language processing (Collobert and Weston, 2008; Luong et al., 2016; Liu et al., 2019c) and computer vision (Wang et al., 2009; Liu et al., 2019a; Chen et al., 2019) to health informatics (Zhou et al., 2011; He et al., 2016; Harutyunyan et al., 2019). MTL has also been studied in biomedical and clinical natural language processing (NLP), for tasks such as named entity recognition and normalization and relation extraction. However, most of these studies focus on either one task with multiple corpora (Khan et al., 2020; Wang et al., 2019b) or multiple tasks on a single corpus (Xue et al., 2019; Li et al., 2017; Zhao et al., 2019).

To bridge this gap, we investigate the use of MTL with transformer-based models (BERT) on multiple biomedical and clinical NLP tasks. We hypothesize that the performance of the models on individual tasks (especially in the same domain) can be improved via joint learning. Specifically, we compare three models: the independent single-task model (BERT), the model refined via MTL (called MT-BERT-Refinement), and the model fine-tuned for each task using MT-BERT-Refinement (called MT-BERT-Fine-Tune). We conduct extensive empirical studies on the Biomedical Language Understanding Evaluation (BLUE) benchmark (Peng et al., 2019), which offers a diverse range of text genres (biomedical and clinical text) and NLP tasks (such as text similarity, relation extraction, and named entity recognition). When learned and fine-tuned on the biomedical and clinical domains separately, we find that MTL achieved over 2% improvement on average and created new state-of-the-art results on four BLUE benchmark tasks. We also demonstrate the use of multi-task learning to obtain a single model that still produces state-of-the-art performance on all tasks. This positive answer will be very helpful when researchers are faced with the hassle of choosing a suitable model for new problems where training resources are limited.

Our contribution in this work is three-fold: (1) We conduct extensive empirical studies on 8 tasks from a diverse range of text genres. (2) We demonstrate that the MTL fine-tuned model (MT-BERT-Fine-Tune) achieved state-of-the-art performance on average and that there is still a benefit to utilizing the MTL refinement model (MT-BERT-Refinement). Pairwise MTL, where two tasks were trained jointly, further demonstrates which tasks can improve or decrease other tasks. (3) We make the code and pre-trained MT models publicly available.

The rest of the paper is organized as follows. We first present related work in Section 2. Then, we describe the multi-task learning in Section 3, followed by our experimental setup, results, and discussion in Section 4. We conclude with future work in the last section.

2 Related work

Multi-task learning (MTL) aims to improve the learning of a model for task t by using the knowledge contained in the tasks where all or a subset of tasks are related (Zhang and Yang, 2017). It has long been studied and has applications on neural networks in the natural language processing domain (Caruana, 1997). Collobert and Weston (2008) proposed to jointly learn six tasks such as part-of-speech tagging and language modeling in a time-decay neural network. Changpinyo et al. (2018) summarized recent studies on applying MTL in sequence tagging tasks. Bingel and Søgaard (2017) and Martínez Alonso and Plank (2017) focused on the conditions under which MTL leads to gains in NLP, and suggested that certain data features such as the learning curve and entropy distribution are probably better predictors of MTL gains.

In the biomedical and clinical domains, MTL has been studied mostly in two directions. One is to apply MTL to a single task with multiple corpora. For example, many studies focused on named entity recognition (NER) tasks (Crichton et al., 2017; Wang et al., 2019a,b). Zhang et al. (2018), Khan et al. (2020), and Mehmood et al. (2019) integrated MTL into transformer-based networks (BERT), the state-of-the-art language representation model, and demonstrated promising results for extracting biomedical entities from the literature. Yang et al. (2019) extracted clinical named entities from Electronic Medical Records using an LSTM-CRF based model. Besides NER, Li et al. (2018) and Li and Ji (2019) proposed to use MTL on the relation classification task and Du et al. (2017) on biomedical semantic indexing. Xing et al. (2018) exploited domain-invariant knowledge to segment Chinese words in medical text.

The other direction is to apply MTL to different tasks, but with the annotations coming from a single corpus. Li et al. (2017) proposed a joint model to extract biomedical entities as well as their relations simultaneously and carried out experiments on either the adverse drug event corpus (Gurulingappa et al., 2012) or the bacteria biotope corpus (Deléger et al., 2016). Shi et al. (2019) also jointly extract entities and relations but focused on the BioCreative/OHNLP 2018 challenge regarding family history extraction (Liu et al., 2018). Xue et al. (2019) integrated the BERT language model into joint learning through a dynamic range attention mechanism and fine-tuned NER and relation extraction tasks jointly on one in-house dataset of coronary arteriography reports.

Different from these works, we study jointly learning 8 different corpora from 4 different types of tasks. While MTL has brought significant improvements in medical tasks, no (or mixed) results have been reported when pre-training MTL models on different tasks and different corpora. To this end, we deem that our model can provide more insights about the conditions under which MTL leads to gains in BioNLP and clinical NLP, and shed light on the specific task relations that can lead to gains from MTL models over single-task setups.

3 Multi-task model

The architecture of the MT-BERT model is shown in Figure 1. The shared layers are based on BERT (Devlin et al., 2018). The input X can be either a sentence or a pair of sentences packed together by a special token [SEP]. If X is longer than the allowed maximum length (e.g., 128 tokens in the BERT base configuration), we truncate X to the maximum length. When X is packed as a sequence pair, we truncate the longer sequence one token at a time (a sketch of this packing appears below). Similar to (Devlin et al., 2018), two additional tokens are added at the start ([CLS]) and end ([SEP]) of X, respectively. Similar to (Lee et al., 2020; Peng et al., 2019), in the sequence tagging tasks we split a sentence into several sub-sentences if it is longer than 30 words.

In the shared layers, the BERT model first converts the input sequence to a sequence of embedding vectors. Then, it applies attention mechanisms to gather contextual information.
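The packing and truncation just described can be made concrete with a short sketch. This is only an illustration under simplifying assumptions (plain token lists rather than the actual BERT wordpiece tokenizer); the 128-token limit follows the text, everything else is hypothetical.

    def pack_pair(tokens_a, tokens_b, max_len=128):
        """Pack a sentence pair as [CLS] A [SEP] B [SEP], trimming the
        longer sequence one token at a time until the pair fits max_len."""
        # Three special tokens are added, so the pair must fit in max_len - 3.
        while len(tokens_a) + len(tokens_b) > max_len - 3:
            if len(tokens_a) > len(tokens_b):
                tokens_a = tokens_a[:-1]
            else:
                tokens_b = tokens_b[:-1]
        return ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]

    # Example with two toy word-piece sequences.
    a = ["the", "patient", "denies", "chest", "pain"] * 20   # 100 tokens
    b = ["no", "short", "##ness", "of", "breath"] * 10       # 50 tokens
    packed = pack_pair(a, b)
    print(len(packed))  # 128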

(Diagram: a single sentence X or a sentence pair (X1, X2) is fed into the shared BERT layers, which produce the [CLS] representation and token representations h1 ... hm; task-specific layers on top output sim(X1, X2) for text similarity, P(c|X) for relation extraction, P(y_i|X) for named entity recognition, and P(r|P⊕H) for inference.)

Figure 1: The architecture of the MT-BERT model.

This semantic representation is shared across all tasks and is trained by our multi-task objectives. Finally, the BERT model encodes that information in a vector for each token (h0, . . . , hn).

On top of the shared BERT layers, the task-specific layer uses a fully-connected layer for each task. We fine-tune the BERT model and the task-specific layers using multi-task objectives during the training phase. More details of the multi-task objectives in the BLUE benchmark are described below.

3.1 Sentence similarity

Suppose that h0 is the BERT output of the token [CLS] in the input sentence pair (X1, X2). We use a fully connected layer to compute the similarity score sim(X1, X2) = a·h0 + b, where sim(X1, X2) is a real value. This task is trained using the Mean Squared Error (MSE) loss: (y − sim(X1, X2))², where y is the real-valued similarity score of the sentence pair.

3.2 Relation extraction

This task extracts binary relations (two arguments) from sentences. After replacing the two arguments of interest in the sentence with pre-defined tags (e.g., GENE or DRUG), this task can be treated as a classification problem over a single sentence X. Suppose that h0 is the output embedding of the token [CLS]; the probability that a relation is labeled as class c is predicted by a fully connected layer and a logistic regression with softmax: P(c|X) = softmax(a·h0 + b). This approach is widely used in transformer-based models (Devlin et al., 2018; Peng et al., 2019; Liu et al., 2019c). This task is trained using the categorical cross-entropy loss: −Σ_c δ(y_c = ŷ) log P(c|X), where δ(y_c = ŷ) = 1 if the classification ŷ of X is the correct ground truth for the class c ∈ C; otherwise δ(y_c = ŷ) = 0.

3.3 Inference

After packing the premise and hypothesis sentences into one sequence, this task can also be treated as a single-sentence classification problem. The aim is to find the logical relation R between premise P and hypothesis H. Suppose that h0 is the output embedding of the token [CLS] in X = P ⊕ H; then P(R|P ⊕ H) = softmax(a·h0 + b). This task is trained using the categorical cross-entropy loss as above.

3.4 Named entity recognition

The BERT model produces a feature vector sequence {h_i}_{i=0}^{n} with the same length as the input sequence X. The MTL model predicts the label sequence by using a softmax output layer, which scales the output for a label l ∈ {1, 2, . . . , L} as follows: P(ŷ_i = j | x) = exp(h_i·W_j) / Σ_{l=1}^{L} exp(h_i·W_l), where L is the total number of tags. This task is trained using the categorical cross-entropy loss: −Σ_i Σ_{y_i} δ(y_i = ŷ_i) log P(y_i|X).
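To make Sections 3.1–3.4 concrete, here is a minimal sketch of task-specific layers on top of a shared encoder, assuming PyTorch and the Hugging Face transformers package (neither required by the paper); the hidden size and label counts are placeholders, and the heads are simplified relative to the released mt-dnn-based implementation.

    import torch.nn as nn
    from transformers import AutoModel

    class MTBERTSketch(nn.Module):
        """Shared BERT encoder with one small decoder per task type."""

        def __init__(self, encoder_name="bert-base-uncased",
                     num_relations=6, num_nli_labels=3, num_ner_tags=5):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(encoder_name)  # shared layers
            hidden = self.encoder.config.hidden_size
            self.similarity_head = nn.Linear(hidden, 1)               # 3.1: sim(X1, X2)
            self.relation_head = nn.Linear(hidden, num_relations)     # 3.2: P(c|X)
            self.inference_head = nn.Linear(hidden, num_nli_labels)   # 3.3: P(R|P⊕H)
            self.ner_head = nn.Linear(hidden, num_ner_tags)           # 3.4: P(y_i|X)

        def forward(self, task, input_ids, attention_mask):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            h = out.last_hidden_state            # (batch, seq_len, hidden)
            h0 = h[:, 0]                         # [CLS] embedding
            if task == "similarity":
                return self.similarity_head(h0).squeeze(-1)   # real-valued score
            if task == "relation":
                return self.relation_head(h0)                 # class logits
            if task == "inference":
                return self.inference_head(h0)                # logical-relation logits
            if task == "ner":
                return self.ner_head(h)                       # per-token tag logits
            raise ValueError(task)

    # Losses as in Sections 3.1-3.4: MSE for similarity, cross-entropy otherwise.
    mse_loss = nn.MSELoss()
    ce_loss = nn.CrossEntropyLoss()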

3.5 The training procedure

The training procedure for MT-BERT consists of three stages: (1) pretraining the BERT model, (2) refining it via multi-task learning (MT-BERT-Refinement), and (3) fine-tuning the model using the task-specific data (MT-BERT-Fine-Tune).

3.5.1 Pretraining

The pretraining stage follows that of BERT, using the masked language modeling technique (Devlin et al., 2018). Here we used the base version. The maximum length of the input sequences is thus 128.

3.5.2 Refining via Multi-task learning

In this step, we refine all layers in the model. Algorithm 1 demonstrates the process of multi-task learning (Liu et al., 2019c). We first initialize the shared layers with the pre-trained BERT model and randomly initialize the task-specific layer parameters. Then we create the dataset D by merging mini-batches of all the datasets. In each epoch, we randomly select a mini-batch b_t of task t from all datasets D. Then we update the model according to the task-specific objective of task t. As in (Liu et al., 2019c), we use mini-batch based stochastic gradient descent to learn the parameters.

Algorithm 1: Multi-task learning.
  Initialize model parameters θ:
    shared layer parameters by BERT;
    task-specific layer parameters randomly;
  Create D by merging mini-batches for each dataset;
  for epoch in 1, 2, ..., epoch_max do
    Shuffle D;
    for b_t in D do
      Compute loss: L(θ) based on task t;
      Compute gradient: ∇L(θ);
      Update model: θ = θ − η∇L(θ);
    end
  end

3.5.3 Fine-tuning MT-BERT

We fine-tune the existing MT-BERT trained in the previous stage by continuing to train all layers on each specific task. Provided that the dataset is not drastically different in context from the other datasets, the MT-BERT model will already have learned general features that are relevant to a specific problem. Specifically, we truncate the last layer (softmax and linear layers) of the MT-BERT and replace it with a new one, and then we use a smaller learning rate to train the network.

4 Experiments

We evaluate the proposed MT-BERT on 8 tasks in the BLUE benchmark. We compare three types of models: (1) existing state-of-the-art BERT models fine-tuned directly on each task, respectively; (2) MT-BERT refined with multi-task training (MT-BERT-Refinement); and (3) MT-BERT with fine-tuning (MT-BERT-Fine-Tune).

4.1 Datasets

We evaluate the performance of the models on 8 datasets in the BLUE benchmark used by (Peng et al., 2019). Table 1 gives a summary of these datasets. Briefly, ClinicalSTS is a corpus of sentence pairs selected from Mayo Clinic's clinical data warehouse (Wang et al., 2018). The i2b2 2010 dataset was collected from three different hospitals and was annotated by medical practitioners for eight types of relations between problems and treatments (Uzuner et al., 2011). MedNLI is a collection of sentence pairs selected from MIMIC-III (Shivade, 2017). For a fair comparison, we use the same training, development, and test sets to train and evaluate the models. ShARe/CLEF is a collection of 299 de-identified clinical free-text notes from the MIMIC-II database (Suominen et al., 2013). This corpus is for disease entity recognition.

In the biomedical domain, ChemProt consists of 1,820 PubMed abstracts with chemical-protein interactions (Krallinger et al., 2017). The DDI corpus is a collection of 792 texts selected from the DrugBank database and 233 other Medline abstracts (Herrero-Zazo et al., 2013). These two datasets were used in the relation extraction task for various types of relations. BC5CDR is a collection of 1,500 PubMed titles and abstracts selected from the CTD-Pfizer corpus and was used in the named entity recognition task for chemical and disease entities (Li et al., 2016).

4.2 Training

Our implementation of MT-BERT is based on the work of (Liu et al., 2019c) [1]. We trained the model on one NVIDIA V100 GPU using the PyTorch framework.

[1] https://github.com/namisan/mt-dnn
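To make Algorithm 1 concrete, here is a minimal sketch of the refinement loop, assuming the hypothetical MTBERTSketch module from the earlier sketch and pre-built (task, batch) pairs; the Adamax learning rate, weight decay, and gradient clipping echo the settings reported in Section 4.2, but data loading and evaluation are omitted and this is not the released mt-dnn implementation.

    import random
    import torch

    def refine_via_mtl(model, task_batches, epochs=3, lr=5e-5):
        """Sketch of Algorithm 1: shuffle the merged mini-batches of all tasks
        and update the shared + task-specific parameters one batch at a time.

        task_batches: list of (task_name, batch) pairs, one entry per mini-batch,
        where each batch supplies input_ids, attention_mask, and labels.
        """
        optimizer = torch.optim.Adamax(model.parameters(), lr=lr, weight_decay=0.01)
        losses = {"similarity": torch.nn.MSELoss(),
                  "default": torch.nn.CrossEntropyLoss()}

        for _ in range(epochs):
            random.shuffle(task_batches)          # merged dataset D, shuffled each epoch
            for task, batch in task_batches:      # pick a mini-batch b_t of task t
                logits = model(task, batch["input_ids"], batch["attention_mask"])
                loss_fn = losses.get(task, losses["default"])
                if task == "ner":
                    # flatten (batch, seq_len, tags) for token-level cross-entropy
                    loss = loss_fn(logits.view(-1, logits.size(-1)),
                                   batch["labels"].view(-1))
                else:
                    loss = loss_fn(logits, batch["labels"])
                optimizer.zero_grad()
                loss.backward()                   # compute gradient of the task objective
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                optimizer.step()                  # update theta
        return model

The fine-tuning stage of Section 3.5.3 can reuse the same loop on a single task's batches after replacing the corresponding head and lowering the learning rate.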

Corpus            Task                  Metrics   Domain       Train    Dev    Test
ClinicalSTS       Sentence similarity   Pearson   Clinical       675     75     318
ShARe/CLEFE       NER                   F1        Clinical     4,628  1,075   5,195
i2b2 2010         Relation extraction   F1        Clinical     3,110     11   6,293
MedNLI            Inference             Accuracy  Clinical    11,232  1,395   1,422
BC5CDR disease    NER                   F1        Biomedical   4,182  4,244   4,424
BC5CDR chemical   NER                   F1        Biomedical   5,203  5,347   5,385
DDI               Relation extraction   F1        Biomedical   2,937  1,004     979
ChemProt          Relation extraction   F1        Biomedical   4,154  2,416   3,458

Table 1: Summary of eight tasks in the BLUE benchmark. More details can be found in (Peng et al., 2019).

Model                              ClinicalSTS   i2b2 2010 re   MedNLI   ShARe/CLEFE     Avg
BlueBERT_clinical                        0.848          0.764    0.840         0.771   0.806
MT-BlueBERT-Refinement_clinical          0.822          0.745    0.835         0.826   0.807
MT-BlueBERT-Fine-Tune_clinical           0.840          0.760    0.846         0.831   0.819

Table 2: Test results on clinical tasks.

Model                                ChemProt    DDI   BC5CDR disease   BC5CDR chemical     Avg
BlueBERT_biomedical                     0.725  0.739            0.866             0.935   0.816
MT-BlueBERT-Refinement_biomedical       0.714  0.792            0.824             0.930   0.815
MT-BlueBERT-Fine-Tune_biomedical        0.729  0.820            0.865             0.931   0.836

Table 3: Test results on biomedical tasks.

We used the Adamax optimizer (Kingma and Ba, 2015) with a learning rate of 5e-5, a batch size of 32, a linear learning rate decay schedule with warm-up over 0.1, and a weight decay of 0.01 applied to every epoch of training, following (Liu et al., 2019c). We use BioBERT (Lee et al., 2020), the BlueBERT base model (Peng et al., 2019), and ClinicalBERT (Alsentzer et al., 2019) as the domain-specific language models [2]. As a result, all texts tokenized using wordpieces were chopped to spans no longer than 128 tokens. We set the maximum number of epochs to 100. We also set the dropout rate of all the task-specific layers to 0.1. To avoid the exploding gradient problem, we clipped the gradient norm within 1. To fine-tune the MT-BERT on specific tasks, we set the maximum number of epochs to 10 and the learning rate to 5e-5.

4.3 Results

One of the most important criteria for building practical systems is fast adaptation to new domains. To evaluate the models on different domains, we multi-task learned various MT-BERT models on the BLUE biomedical tasks and clinical tasks, respectively. BlueBERT_clinical is the base BlueBERT model pretrained on PubMed abstracts and MIMIC-III clinical notes, and fine-tuned for each BLUE task on task-specific data. The MT- models are the proposed models described in Section 3. We used the pre-trained BlueBERT_clinical to initialize its shared layers and refined the model via MTL on the BLUE tasks (MT-BlueBERT-Refinement_clinical). We kept fine-tuning the model for each BLUE task using task-specific data, obtaining MT-BlueBERT-Fine-Tune_clinical.

Table 2 shows the results on clinical tasks. MT-BlueBERT-Fine-Tune_clinical created new state-of-the-art results on 2 tasks and pushed the benchmark to 81.9%, which amounts to a 1.3% absolute improvement over BlueBERT_clinical and a 1.2% absolute improvement over MT-BlueBERT-Refinement_clinical. On the ShARe/CLEFE task, the model gained the largest improvement, by 6%.

[2] https://github.com/ncbi-nlp/bluebert

(Two directed graphs: one over the clinical tasks ClinicalSTS, i2b2 2010 re, MedNLI, and ShARe/CLEFE, and one over the biomedical tasks ChemProt, DDI, BC5CDR disease, and BC5CDR chemical; edges indicate whether training one task jointly with another improves, decreases, or does not affect it.)

Figure 2: Pairwise MTL relationships in clinical (left) and biomedical (right) domains.

Model                              ClinicalSTS   i2b2 2010 re   MedNLI   ShARe/CLEFE     Avg
MT-ClinicalBERT-Fine-Tune                0.816          0.746    0.834         0.817   0.803
MT-BioBERT-Fine-Tune                     0.837          0.741    0.832         0.818   0.807
MT-BlueBERT-Fine-Tune_biomedical         0.824          0.738    0.824         0.825   0.803
MT-BlueBERT-Fine-Tune_clinical           0.840          0.760    0.846         0.831   0.819

Table 4: Test results of MT-BERT-Fine-Tune models on clinical tasks.

Model                                ChemProt    DDI   BC5CDR disease   BC5CDR chemical     Avg
MT-BioBERT-Fine-Tune                    0.729  0.812            0.851             0.928   0.830
MT-BlueBERT-Fine-Tune_biomedical        0.729  0.820            0.865             0.931   0.836
MT-BlueBERT-Fine-Tune_clinical          0.714  0.792            0.824             0.930   0.815

Table 5: Test results of MT-BERT-Fine-Tune models on biomedical tasks.

On the MedNLI task, the MT model gained an improvement of 2.4%. On the remaining tasks, the MT model also performed well, reaching state-of-the-art performance with less than 1% difference. When comparing the models with and without fine-tuning on single datasets, Table 2 shows that the multi-task refinement model is similar to the single baselines on average. Considering that MT-BlueBERT-Refinement_clinical is one model while BlueBERT_clinical corresponds to 4 individual models, we believe the MT refinement model would bring a benefit when researchers are faced with the hassle of choosing a suitable model for new problems or problems with limited training data.

In the biomedical tasks, we used BlueBERT_biomedical as the baseline because it achieved the best performance on the BLUE benchmark. Table 3 shows similar results to the clinical tasks. MT-BlueBERT-Fine-Tune_biomedical created new state-of-the-art results on 2 tasks and pushed the benchmark to 83.6%, which amounts to a 2.0% absolute improvement over BlueBERT_biomedical and a 2.1% absolute improvement over MT-BlueBERT-Refinement_biomedical. On the DDI task, the model gained the largest improvement, by 8.1%.

4.4 Discussion

4.4.1 Pairwise MTL

To investigate which tasks are beneficial or harmful to others, we train on two tasks jointly using MT-BlueBERT-Refinement_biomedical and MT-BlueBERT-Refinement_clinical. Figure 2 gives the pairwise relationships. A directed green (or red, or grey) edge from s to t means that s improves (or decreases, or has no effect on) t.

In the clinical tasks, ShARe/CLEFE always benefits from multi-task learning with the remaining 3 tasks, as all its incoming edges are green.

Task              BlueBERT_biomedical   BlueBERT_clinical   MT-BioBERT-Fine-Tune   MT-BlueBERT-Fine-Tune_biomedical   MT-BlueBERT-Fine-Tune_clinical
ClinicalSTS                     0.845               0.848                  0.807                              0.820                            0.807
i2b2 2010 re                    0.744               0.764                  0.740                              0.738                            0.748
MedNLI                          0.822               0.840                  0.831                              0.814                            0.842
ChemProt                        0.725               0.692                  0.735                              0.724                            0.686
DDI                             0.739               0.760                  0.810                              0.808                            0.779
BC5CDR disease                  0.866               0.854                  0.849                              0.853                            0.848
BC5CDR chemical                 0.935               0.924                  0.928                              0.928                            0.914
ShARe/CLEFE                     0.754               0.771                  0.812                              0.814                            0.830
Avg                             0.804               0.807                  0.814                              0.812                            0.807

Table 6: Test results on eight BLUE tasks.

One factor might be that ShARe/CLEFE is an NER task that generally requires more training data to fulfill the data-hungry needs of the BERT model. ClinicalSTS helps MedNLI because the two are related in nature and their inputs are both pairs of sentences. MedNLI can help the other tasks except ClinicalSTS, partially because the test set of ClinicalSTS is too small to reflect the changes. We also note that i2b2 2010 re can be both beneficial and harmful, depending on which other tasks it is trained with. One potential cause is that i2b2 2010 re was collected from three different hospitals and has the largest label size of 8.

In the biomedical tasks, both the DDI and ChemProt tasks can be improved by MTL on other tasks, potentially because they are harder, with the largest label sizes, and thus require more training data. Meanwhile, BC5CDR chemical and disease can barely be improved, potentially because they already have large datasets to fit the model.

4.4.2 MTL on BERT variants

First, we would like to compare multi-task learning on BERT variants: BioBERT, ClinicalBERT, and BlueBERT. In the clinical tasks (Table 4), MT-BlueBERT-Fine-Tune_clinical outperforms the other models on all tasks. Comparing the MTL models using a BERT model pretrained on PubMed only (rows 2 and 3) and on the combination of PubMed and clinical notes (row 4) shows the impact of using clinical notes during the pretraining process. This observation is consistent with (Peng et al., 2019). On the other hand, MT-ClinicalBERT-Fine-Tune, which used ClinicalBERT during the pretraining, drops ∼1.6% across the tasks. The differences between ClinicalBERT and BlueBERT are at least two-fold: (1) ClinicalBERT used "cased" text while BlueBERT used "uncased" text; and (2) the number of epochs used to continuously pretrain the model. Given that there are limited details of the pretraining of ClinicalBERT, further investigation may be necessary.

In the biomedical tasks, Table 5 shows that MT-BioBERT-Fine-Tune and MT-BlueBERT-Fine-Tune_biomedical reached comparable results, and pretraining on clinical notes has a negligible impact.

4.4.3 Results on all BLUE tasks

Next, we also compare MT-BERT with its variants on all BLUE tasks. Table 6 shows that MT-BioBERT-Fine-Tune reached the best performance on average, with MT-BlueBERT-Fine-Tune_biomedical staying close. While confusing results were obtained when combining a variety of tasks in both the biomedical and clinical domains, we observed again that MTL models pretrained on biomedical literature perform better on biomedical tasks, and MTL models pretrained on both biomedical literature and clinical notes perform better on clinical tasks. These observations may suggest that it might be helpful to train separate deep neural networks on different types of text genres in BioNLP.

5 Conclusions and future work

In this work, we conduct an empirical study on MTL for biomedical and clinical tasks, which so far has been mostly studied with one or two tasks. Our results provide insights regarding domain adaptation and show the benefits of the MTL refinement and fine-tuning. We recommend a combination of the MTL refinement and task-specific fine-tuning approach based on the evaluation results.

When learned and fine-tuned on a different domain, MT-BERT achieved improvements of 2.0% and 1.3% in the biomedical and clinical domains, respectively. Specifically, it has brought significant improvements in 4 tasks.

There are two limitations to this work. First, our results on MTL training across all BLUE benchmark tasks show that MTL is not always effective. We are interested in further exploring the characterization of task relationships. For example, it is not clear whether there are data characteristics that help to determine its success (Martínez Alonso and Plank, 2017; Changpinyo et al., 2018). In addition, our results suggest that the model could benefit more from some specific examples of some of the tasks in Table 1. For example, it might be of interest not to use the BC5CDR corpus in the relation extraction task in the future. Second, we studied one approach to MTL by sharing the encoder between all tasks while keeping several task-specific decoders. Other approaches, such as fine-tuning only the task-specific layers, soft parameter sharing (Ruder, 2017), and knowledge distillation (Liu et al., 2019b), need to be investigated in the future.

While our work only scratches the surface of MTL in the medical domain, we hope it will shed light on the development of generalizable NLP models and task relations that can lead to gains from MTL models over single-task setups.

Acknowledgments

This work was supported by the Intramural Research Programs of the NIH National Library of Medicine. This work was also supported by the National Library of Medicine of the National Institutes of Health under award number K99LM013001. We are also grateful to the authors of mt-dnn (https://github.com/namisan/mt-dnn) for making the code publicly available.

References

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78.

Joachim Bingel and Anders Søgaard. 2017. Identifying beneficial task relations for multi-task learning in deep neural networks. In EACL, pages 164–169.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Soravit Changpinyo, Hexiang Hu, and Fei Sha. 2018. Multi-task learning for sequence tagging: an empirical study. In COLING, pages 2965–2977.

Qingyu Chen, Yifan Peng, Tiarnan Keenan, Shazia Dharssi, Elvira Agro N, Wai T. Wong, Emily Y. Chew, and Zhiyong Lu. 2019. A multi-task deep learning model for the classification of age-related macular degeneration. AMIA 2019 Informatics Summit, 2019:505–514.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing. In ICML, pages 160–167.

Gamal Crichton, Sampo Pyysalo, Billy Chiu, and Anna Korhonen. 2017. A neural network multi-task learning approach to biomedical named entity recognition. BMC Bioinformatics, 18:368.

Louise Deléger, Robert Bossy, Estelle Chaix, Mouhamadou Ba, Arnaud Ferré, Philippe Bessières, and Claire Nédellec. 2016. Overview of the bacteria biotope task at BioNLP shared task 2016. In Proceedings of the BioNLP Shared Task Workshop, pages 12–22.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint: 1810.04805.

Yongping Du, Yunpeng Pan, and Junzhong Ji. 2017. A novel serial deep multi-task learning model for large scale biomedical semantic indexing. In IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 533–537.

Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. 2012. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of Biomedical Informatics, 45:885–892.

Hrayr Harutyunyan, Hrant Khachatrian, David C. Kale, Greg Ver Steeg, and Aram Galstyan. 2019. Multitask learning and benchmarking with clinical time series data. Scientific data, 6:96.

Dan He, David Kuhn, and Laxmi Parida. 2016. Novel applications of multitask learning and multiple output regression to multiple genetic trait prediction. Bioinformatics (Oxford, England), 32:i37–i43.

María Herrero-Zazo, Isabel Segura-Bedmar, Paloma Martínez, and Thierry Declerck. 2013. The DDI corpus: an annotated corpus with pharmacological substances and drug-drug interactions. Journal of Biomedical Informatics, 46:914–920.

Muhammad Raza Khan, Morteza Ziyadi, and Mohamed AbdelHady. 2020. MT-BioNER: multi-task learning for biomedical named entity recognition using deep bidirectional transformers. arXiv preprint: 2001.08904.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), pages 1–15.

Martin Krallinger, Obdulia Rabal, Saber A. Akhondi, Martín Pérez Pérez, Jesús Santamaría, Gael Pérez Rodríguez, Georgios Tsatsaronis, Ander Intxaurrondo, José Antonio López, Umesh Nandal, Erin Van Buel, Akileshwari Chandrasekhar, Marleen Rodenburg, Astrid Laegreid, Marius Doornenbal, Julen Oyarzabal, Analia Lourenço, and Alfonso Valencia. 2017. Overview of the BioCreative VI chemical-protein interaction track. In Proceedings of the BioCreative workshop, pages 141–146.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (Oxford, England), 36:1234–1240.

Diya Li and Heng Ji. 2019. Syntax-aware multi-task graph convolutional networks for biomedical relation extraction. In Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019), pages 28–33.

Fei Li, Meishan Zhang, Guohong Fu, and Donghong Ji. 2017. A neural joint model for entity and relation extraction from biomedical text. BMC Bioinformatics, 18:198.

Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. 2016. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database (Oxford), 2016.

Qingqing Li, Zhihao Yang, Ling Luo, Lei Wang, Yin Zhang, Hongfei Lin, Jian Wang, Liang Yang, Kan Xu, and Yijia Zhang. 2018. A multi-task learning based approach to biomedical entity relation extraction. In IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 680–682.

Shikun Liu, Edward Johns, and Andrew J. Davison. 2019a. End-to-end multi-task learning with attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1871–1880.

Sijia Liu, Majid Rastegar Mojarad, Yanshan Wang, Liwei Wang, Feichen Shen, Sunyang Fu, and Hongfang Liu. 2018. Overview of the BioCreative/OHNLP 2018 family history extraction task. In Proceedings of the BioCreative Workshop, pages 1–5.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019b. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint: 1904.09482.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019c. Multi-task deep neural networks for natural language understanding. In ACL, pages 4487–4496.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In ICLR.

Héctor Martínez Alonso and Barbara Plank. 2017. When is multitask learning effective? Semantic sequence prediction under varying data conditions. In EACL, pages 44–53.

Tahir Mehmood, Alfonso E Gerevini, Alberto Lavelli, and Ivan Serina. 2019. Multi-task learning applied to biomedical named entity recognition task. In Italian Conference on Computational Linguistics.

Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. In Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP), pages 58–65.

Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint: 1706.05098.

Xue Shi, Dehuan Jiang, Yuanhang Huang, Xiaolong Wang, Qingcai Chen, Jun Yan, and Buzhou Tang. 2019. Family history information extraction via deep joint learning. BMC medical informatics and decision making, 19:277.

Chaitanya Shivade. 2017. MedNLI – a natural language inference dataset for the clinical domain.

Hanna Suominen, Sanna Salanterä, Sumithra Velupillai, Wendy W. Chapman, Guergana Savova, Noemie Elhadad, Sameer Pradhan, Brett R. South, Danielle L. Mowery, Gareth J. F. Jones, et al. 2013. Overview of the ShARe/CLEF eHealth evaluation lab 2013. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 212–231.

Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association: JAMIA, 18:552–556.

Xi Wang, Jiagao Lyu, Li Dong, and Ke Xu. 2019a. Multitask learning for biomedical named entity recognition with cross-sharing structure. BMC Bioinformatics, 20:427.

Xiaogang Wang, Cha Zhang, and Zhengyou Zhang. 2009. Boosted multi-task learning for face verification with applications to web image and video search. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 142–149.

Xuan Wang, Yu Zhang, Xiang Ren, Yuhao Zhang, Marinka Zitnik, Jingbo Shang, Curtis Langlotz, and Jiawei Han. 2019b. Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics (Oxford, England), 35:1745–1752.

Yanshan Wang, Naveed Afzal, Sunyang Fu, Liwei Wang, Feichen Shen, Majid Rastegar-Mojarad, and Hongfang Liu. 2018. MedSTS: a resource for clinical semantic textual similarity. Language Resources and Evaluation, pages 1–16.

Junjie Xing, Kenny Zhu, and Shaodian Zhang. 2018. Adaptive multi-task transfer learning for Chinese word segmentation in medical text. In COLING, pages 3619–3630.

Kui Xue, Yangming Zhou, Zhiyuan Ma, Tong Ruan, Huanhuan Zhang, and Ping He. 2019. Fine-tuning BERT for joint entity and relation extraction in Chinese medical text. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 892–897.

Jianliang Yang, Yuenan Liu, Minghui Qian, Chenghua Guan, and Xiangfei Yuan. 2019. Information extraction from electronic medical records using multitask recurrent neural network with contextual word embedding. Applied Sciences, 9(18):3658.

Qun Zhang, Zhenzhen Li, Dawei Feng, Dongsheng Li, Zhen Huang, and Yuxing Peng. 2018. Multitask learning for Chinese named entity recognition. In Advances in Multimedia Information Processing – PCM, pages 653–662.

Yu Zhang and Qiang Yang. 2017. A survey on multi-task learning. arXiv preprint: 1707.08114.

Sendong Zhao, Ting Liu, Sicheng Zhao, and Fei Wang. 2019. A neural multi-task learning framework to jointly model medical named entity recognition and normalization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 817–824.

Jiayu Zhou, Lei Yuan, Jun Liu, and Jieping Ye. 2011. A multi-task learning formulation for predicting disease progression. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 814–822.

Author Index

Allen, Carl, 167 Jimeno Yepes, Antonio, 195 Amin, Saadullah, 187 Joshi, Reenam,1 Androutsopoulos, Ion, 140 Kamath, Sanjay, 49 Balaževic,´ Ivana, 167 Kanjaria, Karina, 60 Baldwin, Timothy, 105, 156 Karargyris, Alexandros, 60 Ballah, Deddeh, 60 Kashyap, Satyananda, 60 Bean, Daniel, 86 Koroleva, Anna, 49 Bendayan, Rebecca, 86 Kovaleva, Olga, 60 Bethard, Steven, 70 Kraljevic, Zeljko, 86 Beymer, David Beymer, 60 Bleiweiss, Avi, 150 Lahav, Dan, 28 Bossuyt, Patrick, 49 Lehman, Eric, 123 Brandt, Cynthia, 167 Li, Juanxi, 14 Lin, Chen, 70 Chandramouli, Rajarathnam, 133 Lin, Simon, 14 Chang, David, 167 Liu, Fei, 105 Chawla, Daniel, 167 Lu, Zhiyong, 205 Chen, Qingyu, 205 Luo, Fan, 133 Choi, Jinho D., 95 Cohen, Trevor, 38 Marshall, Iain, 123 Cohen, Yaara, 28 Martinez Iraola, David, 195 Coy, Adam, 60 Mascio, Aurelie, 86 McDonald, Ryan, 140 Das, Manirupa, 14 Miller, Timothy, 70 de Bruijn, Berry, 177 Min, So Yeon, 112 DeYoung, Jay, 123 Mukherjee, Vandana Mukherjee, 60 Dligach, Dmitriy, 70 Dobson, Richard, 76, 86 Nejadgholi, Isar, 177 Dunfield, Katherine Ann, 187 Neumann, Guenter, 187 Nye, Benjamin, 123 Eyal, Matan, 28 Pappas, Dimitris, 140 Fosler-Lussier, Eric, 14 Paroubek, Patrick, 49 Fraser, Kathleen C., 177 Patzer, Rachel E., 95 Paullada, Amandalynne, 38 Gilkerson, James, 156 Peddagangireddy, Vishal, 133 Goldberg, Yoav, 28 Peng, Yifan, 205 Guo, Yufan, 60 Percha, Bethany, 38

Hardefeldt, Laura, 156 Raghavan, Preethi, 112 Hogan, Julien, 95 Ramnath, Rajiv, 14 Huang, Yungui, 14 Rawat, Bhanu Pratap Singh, 112 Hur, Brian, 156 Rios, Anthony,1 Roberts, Angus, 86 Ibrahim, Zina, 76

215 Rumshisky, Anna, 60 Rust, Steve, 14

Sadde, Shoval, 28 Sadeque, Farig, 70 Savova, Guergana, 70 Searle, Thomas, 76 ShafieiBavani, Elaheh, 195 Shin, Hejin,1 Shivade, Chaitanya, 60 Shlain, Micah, 28 Stavropoulos, Petros, 140 Stewart, Robert, 86 Subbalakshmi, Koduvayur, 133 Szolovits, Peter, 112

Taub Tabib, Hillel, 28 Taylor, Andrew, 167

Vechkaeva, Anna, 187 Verspoor, Karin, 105, 156

Wallace, Byron C., 123 Wang, Ning, 133 Wang, Yuxia, 105 Weng, Wei-Hung, 112 Wu, Joy, 60

Xu, Liyan, 95

Zhong, Xu, 195