2nd Workshop on Building and Evaluating Resources for Biomedical Text Mining
Tuesday, 18th March 2010
Valletta, Malta

Organisers:
Sophia Ananiadou
Kevin Cohen
Dina Demner-Fushman

Workshop Programme

9:15 – 9:30    Welcome

9:30 – 10:30   Invited Talk (chair: Sophia Ananiadou)
               Pierre Zweigenbaum, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI-CNRS), France

10:30 – 11:00  Coffee break

11:00 – 12:30  Session 1 (chair: Kevin Cohen)

    11:00  Spelling Correction in Clinical Notes with Emphasis on First Suggestion Accuracy
           Jon Patrick, Mojtaba Sabbagh, Suvir Jain and Haifeng Zheng

    11:25  Automatically Building a Repository to Support Evidence Based Practice
           Dina Demner-Fushman, Joanna Karpinski and George Thoma

    11:50  An Empirical Evaluation of Resources for the Identification of Diseases and Adverse Effects in Biomedical Literature
           Harsha Gurulingappa, Roman Klinger, Martin Hofmann-Apitius and Juliane Fluck

    12:15  A Task-Oriented Extension of Chinese MeSH Concepts Hierarchy
           Xinkai Wang and Sophia Ananiadou

12:40 – 14:10  Lunch break

14:10 – 15:10  Invited talk (chair: Sophia Ananiadou)
               Simonetta Montemagni, Istituto di Linguistica Computazionale (ILC-CNR), Italy

15:10 – 16:55  Session 2 (chair: Dina Demner-Fushman)

    15:10  Structuring of Status Descriptions in Hospital Patient Records
           Svetla Boytcheva, Ivelina Nikolova, Elena Paskaleva, Galia Angelova, Dimitar Tcharaktchiev and Nadya Dimitrova

    15:35  Annotation of All Coreference in Biomedical Text: Guideline Selection and Adaptation
           K. Bretonnel Cohen, Arrick Lanfranchi, William Corvey, William A. Baumgartner Jr., Christophe Roeder, Philip V. Ogren, Martha Palmer and Lawrence Hunter

    16:00 – 16:30  Coffee break

    16:30  Contribution of Syntactic Structures to Single Word Term Extraction
           Xing Zhang and Alex Chengyu Fang

16:55 – 17:10  Concluding remarks (chair: Sophia Ananiadou)

Workshop Organisers

Sophia Ananiadou, NaCTeM, University of Manchester, UK
Kevin Cohen, Center for Computational Pharmacology and The MITRE Corporation, USA
Dina Demner-Fushman, National Library of Medicine, USA

Workshop Programme Committee

Olivier Bodenreider, National Library of Medicine, USA
Wendy Chapman, University of Pittsburgh, USA
Aaron Cohen, Oregon Health and Science University
Liu Hong Fang, Georgetown University Medical Center, USA
Martin Krallinger, National Biotechnology Center, Spain
John McNaught, NaCTeM, University of Manchester, UK
John Pestian, Computational Medicine Center, University of Cincinnati, USA
Andrey Rzhetsky, University of Chicago, USA
Jian Su, Institute for Infocomm Research, Singapore
Junichi Tsujii, University of Tokyo, Japan and NaCTeM, University of Manchester, UK
Yoshimasa Tsuruoka, JAIST, Japan
Karin Verspoor, Center for Computational Pharmacology, University of Colorado, USA
Xinglong Wang, University of Manchester, UK
Bonnie Webber, University of Edinburgh, UK
John Wilbur, NCBI, NLM, NIH, USA
Pierre Zweigenbaum, LIMSI-CNRS, France

Table of Contents

Foreword (page 1)
    Sophia Ananiadou, Kevin Cohen and Dina Demner-Fushman

Spelling Correction in Clinical Notes with Emphasis on First Suggestion Accuracy (page 2)
    Jon Patrick, Mojtaba Sabbagh, Suvir Jain and Haifeng Zheng

Automatically Building a Repository to Support Evidence Based Practice (page 9)
    Dina Demner-Fushman, Joanna Karpinski and George Thoma

An Empirical Evaluation of Resources for the Identification of Diseases and Adverse Effects in Biomedical Literature (page 15)
    Harsha Gurulingappa, Roman Klinger, Martin Hofmann-Apitius and Juliane Fluck

A Task-Oriented Extension of the Chinese MeSH Concepts Hierarchy (page 23)
    Xinkai Wang and Sophia Ananiadou

Structuring of Status Descriptions in Hospital Patient Records (page 31)
    Svetla Boytcheva, Ivelina Nikolova, Elena Paskaleva, Galia Angelova, Dimitar Tcharaktchiev and Nadya Dimitrova

Annotation of All Coreference in Biomedical Text: Guideline Selection and Adaptation (page 37)
    K. Bretonnel Cohen, Arrick Lanfranchi, William Corvey, William A. Baumgartner Jr., Christophe Roeder, Philip V. Ogren, Martha Palmer and Lawrence Hunter

Contribution of Syntactic Structures to Single Word Term Extraction (page 42)
    Xing Zhang and Alex Chengyu Fang

Author Index

Sophia Ananiadou, 23
Galia Angelova, 31
William A. Baumgartner Jr., 37
Svetla Boytcheva, 31
K. Bretonnel Cohen, 37
William Corvey, 37
Dina Demner-Fushman, 9
Nadya Dimitrova, 31
Alex Chengyu Fang, 42
Juliane Fluck, 15
Harsha Gurulingappa, 15
Martin Hofmann-Apitius, 15
Lawrence Hunter, 37
Suvir Jain, 2
Joanna Karpinski, 9
Roman Klinger, 15
Arrick Lanfranchi, 37
Ivelina Nikolova, 31
Philip V. Ogren, 37
Martha Palmer, 37
Elena Paskaleva, 31
Jon Patrick, 2
Christophe Roeder, 37
Mojtaba Sabbagh, 2
Dimitar Tcharaktchiev, 31
George Thoma, 9
Xinkai Wang, 23
Xing Zhang, 42
Haifeng Zheng, 2

FOREWORD

This volume contains the papers accepted at the 2nd workshop on Building and Evaluating Resources for Biomedical Text Mining held at LREC 2010, Malta. Biomedical text mining over the last decade has become one of the driving application areas for the NLP community, resulting in a series of very successful yearly specialist workshops at ACL since 2002, BioNLP, as well as the launch of the BioMed special interest group in 2008. In the past, most of the work has focused on solving specific problems, often using task-tailored and private data sets. This data was rarely reused, in particular outside the efforts of the providers.
This has changed in recent years, as many research groups have made available resources that were built either purposely or as by-products of research or evaluation efforts. A number of projects, initiatives and organisations have been dedicated to building and providing biomedical text mining resources (e.g., the GENIA suite of corpora, PennBioIE, TREC Genomics track, BioCreative, Yapex, LLL05, BOOTStrep, JNLPBA, KDD data, Medstract, BioText, etc.).

There is an increasing need for community-wide discussion of the design, availability and interoperability of resources for bio-text mining, moving on from specific applications (such as gene name identification and protein-protein interactions in BioCreative I/II, the BioNLP'09 shared task on event extraction, and recognition of clinically relevant entities and relations in the i2b2 challenges) to the use of common resources and tools in real-life applications.

The papers in this volume reflect the current transitional state of the biomedical text mining field and range from named entity recognition in MEDLINE abstracts (Gurulingappa et al.; Zhang and Fang) to reuse of annotation tools and guidelines to fully annotate 97 journal articles (Cohen et al.). The corpus of 97 fully annotated articles will be made publicly available upon completion of syntactic, semantic, and discourse structure annotation, with the emphasis on coreference annotation.

The biomedical text mining field is expanding in two directions: enrichment and use of cross-lingual resources and work in resource-poor languages on the one hand, and significant inroads in processing clinical narrative on the other hand. The enrichment and evaluation of cross-lingual resources are exemplified in the work on extension of the Chinese Medical Subject Headings vocabulary (Wang and Ananiadou). Spanning a resource-poor language and the clinical domain, Boytcheva et al. focus on the mining of Bulgarian clinical notes for descriptions of patients’ status and complications. Patrick et al. focus on spelling errors in clinical notes and present a detailed description of a spelling suggestion generator. Clinical notes are linked to MEDLINE citations in the collection under construction at NIH, as described by Demner-Fushman et al.

We wish to thank the authors for submitting papers for consideration, and the members of the program committee for their time and effort during the review process. We would also like to thank our invited speakers, Simonetta Montemagni and Pierre Zweigenbaum, for their contribution.

Sophia Ananiadou, Kevin Cohen and Dina Demner-Fushman

Spelling Correction in Clinical Notes with Emphasis on First Suggestion Accuracy

Jon Patrick, Mojtaba Sabbagh, Suvir Jain, Haifeng Zheng
Health Information Technology Research Laboratory
School of IT, University of Sydney

Abstract

Spelling correction in a large collection of clinical notes is a relatively little-explored research area and a serious problem, as typically 30% of tokens are unknown. In this paper we explore techniques to devise a spelling suggestion generator for clinical data with an emphasis on the accuracy of the first suggestion. We use a combination of a rule-based suggestion generation system and a context-sensitive ranking algorithm to obtain the best results. The ranking is achieved using a language model created by bootstrapping from the corpus. The described method is embedded in a practical workflow that we use for Knowledge Discovery and Reuse in clinical records.

1. Introduction

Our work specializes in processing corpora of medical texts (Wang and Patrick, 2008; Ryan et al., 2008; Zhang and Patrick, 2007; Asgari and Patrick, 2008). These corpora usually contain many years of patient records from hospital

cludes the following resources:
1. Systematised Nomenclature of Medicine-Clinical Terms, i.e., SNOMED-CT: 99860 words
2. The Moby Lexicon: 354992 words