SEMANTIC SPACES OF CLINICAL TEXT

Leveraging Distributional Semantics for Natural Language Processing of Electronic Health Records

Aron Henriksson

Licentiate Thesis
Department of Computer and Systems Sciences
Stockholm University
October 2013

Stockholm University DSV Report Series No. 13-009
ISSN 1101-8526

© 2013 Aron Henriksson

Typeset by the author using LaTeX
Printed in Stockholm, Sweden by US-AB

ABSTRACT

The large amounts of clinical data generated by electronic health record systems are an underutilized resource, which, if tapped, has enormous potential to improve health care. Since the majority of this data is in the form of unstructured text, which is challenging to analyze computationally, there is a need for sophisticated clinical language processing methods. Unsupervised methods that exploit statistical properties of the data are particularly valuable due to the limited availability of annotated corpora in the clinical domain.

Information extraction and natural language processing systems need to incorporate some knowledge of semantics. One approach exploits the distributional properties of language – more specifically, term co-occurrence information – to model the relative meaning of terms in high-dimensional vector space. Such methods have been used with success in a number of general language processing tasks; however, their application in the clinical domain has previously only been explored to a limited extent. By applying models of distributional semantics to clinical text, semantic spaces can be constructed in a completely unsupervised fashion. Semantic spaces of clinical text can then be utilized in a number of medically relevant applications.

The application of distributional semantics in the clinical domain is here demonstrated in three use cases: (1) synonym extraction of medical terms, (2) assignment of diagnosis codes and (3) identification of adverse drug reactions. To apply distributional semantics effectively to a wide range of both general and, in particular, clinical language processing tasks, certain limitations or challenges need to be addressed, such as how to model the meaning of multiword terms and account for the function of negation: a simple means of incorporating paraphrasing and negation in a distributional semantic framework is here proposed and evaluated. The notion of ensembles of semantic spaces is also introduced; these are shown to outperform the use of a single semantic space on the synonym extraction task. This idea allows different models of distributional semantics, with different parameter configurations and induced from different corpora, to be combined. This is not least important in the clinical domain, as it allows potentially limited amounts of clinical data to be supplemented with data from other, more readily available sources. The importance of configuring the dimensionality of semantic spaces, particularly when – as is typically the case in the clinical domain – the vocabulary grows large, is also demonstrated.

SAMMANFATTNING

The large amounts of clinical data generated in electronic health record systems are an underutilized resource with enormous potential to improve health care. Since the majority of clinical data is in the form of unstructured text, which is challenging for computers to analyze, there is a need for sophisticated methods that can process clinical language. Methods that do not require labeled examples but instead exploit statistical properties of the data are particularly valuable, given the limited availability of annotated corpora in the clinical domain.

Systems for information extraction and language processing need to incorporate some knowledge of semantics. One approach exploits the distributional properties of language – more specifically, statistics on how terms co-occur – to model the relative meaning of terms in a high-dimensional vector space. The approach has been used successfully in a range of general language processing tasks; its application in the clinical domain has, however, only been explored to a limited extent. By applying models of distributional semantics to clinical text, semantic spaces can be constructed without any access to labeled examples. Semantic spaces of clinical text can then be used in a range of medically relevant applications.

The application of distributional semantics in the clinical domain is illustrated here in three use cases: (1) synonym extraction of medical terms, (2) assignment of diagnosis codes and (3) identification of adverse drug reactions. However, certain limitations or challenges need to be addressed to enable an effective application of distributional semantics to a wide range of language processing tasks – both general and, in particular, clinical – such as how to model the meaning of multiword terms and account for the function of negation: a simple way of modeling paraphrasing and negation in a distributional semantic framework is presented and evaluated. The notion of ensembles of semantic spaces is also introduced; these outperform the use of a single semantic space for synonym extraction. This approach enables a combination of different models of distributional semantics, with different parameter configurations and induced from different corpora. This is not least important in the clinical domain, as it makes it possible to supplement potentially limited amounts of clinical data with data from other, more readily available sources. The work also demonstrates the importance of configuring the dimensionality of semantic spaces, particularly when the vocabulary is large, which is common in the clinical domain.

LIST OF PAPERS

This thesis is based on the following papers:

I Aron Henriksson, Mike Conway, Martin Duneld and Wendy W. Chapman (2013). Identifying Synonymy between SNOMED Clinical Terms of Varying Length Using Distributional Analysis of Electronic Health Records. To appear in Proceedings of the Annual Symposium of the American Medical Informatics Association (AMIA), Washington DC, USA.

II Aron Henriksson, Hans Moen, Maria Skeppstedt, Vidas Daudaravičius and Martin Duneld (2013). Synonym Extraction and Abbreviation Expansion with Ensembles of Semantic Spaces. Submitted.

III Aron Henriksson and Martin Hassel (2011). Exploiting Structured Data, Negation Detection and SNOMED CT Terms in a Random Indexing Approach to Clinical Coding. In Proceedings of the RANLP Workshop on Biomedical Natural Language Processing, Association for Computational Linguistics (ACL), pages 3–10, Hissar, Bulgaria.

IV Aron Henriksson and Martin Hassel (2013). Optimizing the Dimensionality of Clinical Term Spaces for Improved Diagnosis Coding Support. In Proceedings of the 4th International Louhi Workshop on Health Document Text Mining and Information Analysis (Louhi 2013), Sydney, Australia.

V Aron Henriksson, Maria Kvist, Martin Hassel and Hercules Dalianis (2012). Exploration of Adverse Drug Reactions in Semantic Vector Space Models of Clinical Text. In Proceedings of the ICML Workshop on Machine Learning for Clinical Data Analysis, Edinburgh, UK.

ACKNOWLEDGEMENTS

Research may seem a solitary activity, conjuring up images of the lone researcher cooped up in a gloomy study, lit up by the mere flickering of candle light – or computer screen – faced with the daunting endeavor of making a contribution, however minuscule, to the body of scientific knowledge. Occasionally, this can be the case; for the most part, it is a fundamentally joint effort that is more likely to bear fruit in collaboration with others, some of whom I would like to acknowledge here.

First and foremost, I am indebted to my supervisors: Professor Hercules Dalianis and Dr. Martin Duneld (formerly Hassel) – you make up a great, complementary supervision duo. You have introduced me to the rewarding world of research and, in particular, the exciting area of natural language processing. Hercules, you were the one who gave me the opportunity to join this research group. Your inborn social nature and optimistic outlook are important counterweights to the sometimes stressful and constrained, navel-gazing world of a PhD student. Martin, thank you for introducing me to, and sharing my interest in, distributional semantics. I appreciate and have learned much from our discussions on issues concerning research, both big and small.

I also have the great fortune of belonging to a wonderful research group and I would like to thank both former and current members for providing a pleasant environment to work in and great opportunities for collaboration. I would like to name a few persons in particular: Dr. Maria Kvist, who, as cheerful officemate and sensible co-author, has provided me with ready access to important medical domain expertise and good advice; Dr. Sumithra Velupillai, who, as a recent graduate in our group, has set an impressive precedent and allowed those of us who follow to walk down a path that is perhaps a little less thorny; and Maria Skeppstedt, whom, as co-author and fellow PhD candidate, I always insist on collaborating with and whose humility cloaks a wealth of strengths. Thank you all for the fun times – I look forward to continued collaborations.

It is a privilege to have the opportunity of conducting research in a stimulating environment: the DADEL project (High-Performance Data Mining for Drug Effect Detection), funded by the Swedish Foundation for Strategic Research (SSF), has certainly contributed to that, providing an exciting application area and allowing me to receive quality feedback on my work. I would like to thank all the members of the project, especially Professor Henrik Boström and fellow PhD candidate Jing Zhao, who generously agreed to act as opponents at my pre-licentiate seminar and provided me with valuable comments that have allowed me to improve this thesis and will, hopefully, strengthen my research in the future. Thank you also to everyone at the Unit of Information Systems and my fellow PhD students at DSV for contributing to a pleasant work environment.

The network extends beyond the borders of Sweden: thank you to all the participants in HEXAnord (HEalth TeXt Analysis network in the Nordic and Baltic countries), funded by Nordforsk, especially Hans Moen at the Norwegian University of Science and Technology (NTNU) and Dr. Vidas Daudaravičius at Vytautas Magnus University, as co-authors. I have also had the opportunity of visiting the Division of Biomedical Informatics at the University of California, San Diego (UCSD) for a month, providing a much-needed respite from the Swedish winter, but also – and more importantly – a great learning experience. Thank you, Dr. Mike Conway and Dr. Wendy Chapman, for an inspiring visit and for co-authoring one of the papers in this thesis. Thank you also, Danielle Mowery, for being such an excellent host – we thoroughly enjoyed our stay.

The opportunities to travel and attend conferences and courses around the world are a wonderful benefit of being a PhD student, especially for an avid traveler and TCK like myself; it would not be possible to the same extent without generous scholarships from the Helge Ax:son Johnson Foundation, the John Söderberg Scholarship Foundation and Google. It is also great to be affiliated with a network of NLP researchers closer to home through the Swedish National Graduate School of Language Technology (GSLT) – thank you for your support. Ultimately, all of this is made possible thanks to SSF, who are funding my doctoral studies through the DADEL project.

Last, but certainly not least, thank you to all my family and friends, both near and far: I am immensely grateful for your support and companionship. Sorry for sometimes underperforming in the work-life balancing act – hopefully it is not an incorrigible flaw. A special mention is due to my parents, who have always been supportive and have generously hosted me many times in their current home in Limassol, Cyprus – in recent years a home away from home and where I first began to outline this thesis.

Aron Henriksson
Stockholm, October 2013

TABLE OF CONTENTS

1 INTRODUCTION
  1.1 PROBLEM SPACE
  1.2 RESEARCH QUESTIONS
  1.3 THESIS DISPOSITION

2 BACKGROUND
  2.1 CLINICAL LANGUAGE PROCESSING
  2.2 DISTRIBUTIONAL SEMANTICS
    2.2.1 THEORETICAL FOUNDATION
    2.2.2 MODELS OF DISTRIBUTIONAL SEMANTICS
    2.2.3 CONSTRUCTING SEMANTIC SPACES
    2.2.4 RANDOM INDEXING WITH(/OUT) PERMUTATIONS
    2.2.5 RANDOM INDEXING MODEL PARAMETERS
    2.2.6 APPLICATIONS OF DISTRIBUTIONAL SEMANTICS
  2.3 LANGUAGE USE VARIABILITY
    2.3.1 SYNONYMS
    2.3.2 ABBREVIATIONS
  2.4 DIAGNOSIS CODING
    2.4.1 COMPUTER-AIDED MEDICAL ENCODING
  2.5 ADVERSE DRUG REACTIONS
    2.5.1 AUTOMATICALLY DETECTING ADVERSE DRUG EVENTS

3 METHOD
  3.1 RESEARCH STRATEGY
  3.2 EVALUATION FRAMEWORK
    3.2.1 PERFORMANCE METRICS
    3.2.2 RELIABILITY AND GENERALIZATION
  3.3 METHODS AND MATERIALS
    3.3.1 CORPORA
    3.3.2 TERMINOLOGIES
    3.3.3 TOOLS AND TECHNIQUES
  3.4 EXPERIMENTAL SETUPS
    3.4.1 PAPER I
    3.4.2 PAPER II
    3.4.3 PAPER III
    3.4.4 PAPER IV
    3.4.5 PAPER V
  3.5 ETHICAL ISSUES

4 RESULTS
  4.1 SYNONYM EXTRACTION OF MEDICAL TERMS
    4.1.1 PAPER I
    4.1.2 PAPER II
  4.2 ASSIGNMENT OF DIAGNOSIS CODES
    4.2.1 PAPER III
    4.2.2 PAPER IV
  4.3 IDENTIFICATION OF ADVERSE DRUG REACTIONS
    4.3.1 PAPER V

5 DISCUSSION
  5.1 SEMANTIC SPACES OF CLINICAL TEXT
  5.2 MAIN CONTRIBUTIONS
  5.3 FUTURE DIRECTIONS

REFERENCES

INDEX

PAPERS
  PAPER I: IDENTIFYING SYNONYMY BETWEEN SNOMED CLINICAL TERMS OF VARYING LENGTH USING DISTRIBUTIONAL ANALYSIS OF ELECTRONIC HEALTH RECORDS
  PAPER II: SYNONYM EXTRACTION AND ABBREVIATION EXPANSION WITH ENSEMBLES OF SEMANTIC SPACES
  PAPER III: EXPLOITING STRUCTURED DATA, NEGATION DETECTION AND SNOMED CT TERMS IN A RANDOM INDEXING APPROACH TO CLINICAL CODING
  PAPER IV: OPTIMIZING THE DIMENSIONALITY OF CLINICAL TERM SPACES FOR IMPROVED DIAGNOSIS CODING SUPPORT
  PAPER V: EXPLORATION OF ADVERSE DRUG REACTIONS IN SEMANTIC VECTOR SPACE MODELS OF CLINICAL TEXT

CHAPTER 1

INTRODUCTION

The increasingly widespread adoption of electronic health record (EHR) systems has allowed the enormous amounts of clinical data that are generated each day to be stored in a more readily accessible form. The digitization of clinical documentation comes with huge potential in that it renders clinical data processable by computers. This, in turn, means that longitudinal, patient-specific data can be integrated with clinical decision support (CDS) systems at the point of care; secondary usage of clinical data can also be exploited to support epidemiological studies and medical research more broadly. It may, moreover, be possible to unlock even greater potential by combining and linking clinical data with other sources, such as biobanks and genetic data, with the aim of uncovering interesting genotype-phenotype correlations. However, despite the numerous possible uses of clinical data, and their importance in the effort to reduce health care costs and improve patient safety, EHR data remains a largely untapped resource (Jensen et al., 2012).

There are many possible explanations for this, including the difficulty of obtaining clinical data for research purposes due to a wide range of ethical and legal reasons. There are also technical reasons – some of which will be addressed in this thesis – that make the full exploitation of clinical data challenging. One such reason is the heterogeneity of the data that EHRs contain, which can be characterized as structured, such as gender and drug prescriptions, or unstructured, typically in the form of clinical narratives. Unstructured clinical text [1] – or, more commonly, semi-structured under various headings – constitutes the bulk of EHR data. These narratives – in the form of, for instance, admission notes or treatment plans – are written (or dictated) by clinicians and play a key role in allowing clinicians to communicate observations and reasoning in a flexible manner. On the other hand, they are the most difficult to process and analyze computationally (Jensen et al., 2012), also in comparison to biomedical text [2] (Meystre et al., 2008). Clinical text is frequently composed of ungrammatical, incomplete and telegraphic sentences that are littered with shorthand, misspellings and typing errors. Many of the shorthand abbreviations and acronyms are created more or less ad hoc as a result of author-, domain- and institution-specific idiosyncrasies that, in turn, lead to the semantic overloading of lexical units (Jensen et al., 2012; Meystre et al., 2008). In other words, the level of ambiguity in clinical text is extremely high. Sophisticated natural language processing (NLP) methods are thus needed to mine the valuable information that is contained in clinical text.

[1] In this thesis, the terms clinical narrative and clinical text are used interchangeably and denote any textual content written in the process of documenting care in EHRs.
[2] Biomedical text typically refers to biomedical research articles, books, etc.; see Cohen and Hersh (2005) for an overview of biomedical text mining.

It is widely recognized that general language processing solutions are typically not directly applicable to the clinical domain without suffering a drop in performance (Meystre et al., 2008; Hassel et al., 2011). The tools and techniques that have been developed specifically for the clinical domain have traditionally been rule-based, as opposed to statistical or machine learning-based. Rule-based systems, with rules formulated by linguists and/or medical experts, often perform well on many tasks and are sometimes the only feasible approach when one does not have access to large amounts of data for training machine learning models. They are, however, expensive to create and can be difficult to generalize across domains and systems, and over time. With access to clinical data, machine learning approaches have become increasingly popular. The problem with supervised machine learning approaches is that one is dependent on annotated resources, which are difficult and expensive to create, particularly in the clinical domain where expert annotators – typically physicians – are needed. Developing and tailoring semi-supervised and unsupervised methods for the clinical domain is thus potentially very valuable.

In order to create high-quality information extraction and other NLP systems, it is important to incorporate some knowledge of semantics. One attractive approach to dealing with semantics, when one has access to a large corpus of text, is distributional semantics (see Section 2.2). Models of distributional semantics exploit large corpora to capture the relative meaning of terms based on their distribution in different contexts. This approach is attractive in that it, in some sense, renders semantics computable: an estimate of the semantic relatedness between two terms can be quantified. Observations of language use are needed; however, no manual annotations are required, making it an unsupervised learning method. Models of distributional semantics have been used with great success on many NLP tasks – such as information retrieval, various semantic knowledge tests, text categorisation and word sense disambiguation – that involve general language. There is also a growing interest in the application of distributional semantics in the biomedical domain (see Section 2.2.6 or Cohen and Widdows 2009 for an overview). Partly due to the difficulty of obtaining large amounts of clinical data, however, this particular application (sub-)domain has been less explored.

1.1 PROBLEM SPACE

The problem space wherein this thesis operates is multidimensional. This is largely due to the interdisciplinary nature of the research (Figure 1.1) – at the intersection of computer science, linguistics and medicine – where natural language processing (computer science and linguistics) is leveraged to improve health care (medicine). The core of the problem space concerns the application – and adaptation – of natural language processing methods to the domain of clinical text. In particular, this thesis addresses the possible application of distributional semantics to facilitate natural language processing of electronic health records. In doing so, however, it also tackles a number of medically motivated problems that the use of this methodology may help to solve.

Transferring methods across domains, in this case from general to clinical natural language processing [3] (Figure 1.2), is not always trivial. The use of distributional semantics in the clinical domain has, in fact, not been explored to a great extent; it has even been suggested that the constrained nature of clinical narrative renders it less amenable to distributional semantics (Cohen and Widdows, 2009). However, with the increasing availability of clinical data for secondary use, it is important to find ways of exploiting such data without having to create (manually) annotated resources for supervised machine learning or, alternatively, having to build hand-crafted rule-based systems. The possibility of leveraging distributional semantics – with its unsupervised methods – in the clinical domain would thus yield great value.

[3] See Section 2.1 for a discussion on the differences between general and clinical language processing.

Figure 1.1: RESEARCH AREA(S). Medical (or clinical) language processing is an interdisciplinary research area, at the intersection of computer science, linguistics and medicine.

Figure 1.2: TRANSFERRING AND ADAPTING METHODS. NLP methods and techniques are often transferred from general NLP and adapted for clinical NLP (or other specialized domains). Here, distributional semantics are transferred and, to some extent, adapted to the clinical domain.

There are certain properties – in some cases, limitations – of distributional semantics that may also affect its applicability in the clinical domain (Figure 1.3): (1) current models of distributional semantics typically focus on modeling the meaning of individual words, whereas the majority of medical terms are multiword units (Paper I); (2) current models of distributional semantics typically ignore negation, which is particularly prevalent in clinical text (Chapman et al., 2001; Skeppstedt et al., 2011); and (3) current models of distributional semantics cannot distinguish between different types of distributionally similar semantic relations, which is problematic when trying to extract synonyms, for instance.

Figure 1.3: PROBLEM SPACE. This thesis operates in a multidimensional problem space, with problems originating from various disciplines and research areas. Here, the main problems addressed in this thesis are enumerated and categorized according to source area.

1.2 RESEARCH QUESTIONS

The overarching research question that motivates the inclusion of the papers in this thesis – and the common driving force behind each individual study – is whether, and how, distributional semantics can be leveraged to facilitate natural language processing of electronic health records. This question is explored through three separate applications, or use cases: (1) synonym extraction of medical terms, (2) assignment of diagnosis codes, and (3) identification of adverse drug reactions. These use cases demonstrate how distributional semantics can be used for clinical language processing and to support the development of tools that, ultimately, aim to improve health care. For each use case, there is a set of more specific research questions, each subsumed under the single overarching research question common to all.

SYNONYM EXTRACTION OF MEDICAL TERMS (PAPERS I AND II)

− To what extent can distributional semantics be employed to extract candidate synonyms of medical terms from a large corpus of clinical text? (Papers I and II)

− How can synonymous terms of varying (word) length be captured in a distributional semantic framework? (Paper I)

− Can different models of distributional semantics and types of corpora be combined to obtain enhanced performance on the synonym extraction task, compared to using a single model and a single corpus? (Paper II)

− To what extent can abbreviation-definition pairs be extracted with distributional semantics? (Paper II)

ASSIGNMENT OF DIAGNOSIS CODES (PAPERS III AND IV)

− Can the distributional semantics approach to automatic diagnosis coding be improved by also exploiting structured data? (Paper III)

− Can external terminological resources be leveraged to obtain enhanced performance on the diagnosis coding task? (Paper III)

− How can negation be handled in a distributional semantic framework? (Paper III)

− How does dimensionality affect the application of distributional semantics to the assignment of diagnosis codes, particularly as the size of the vocabulary grows large? (Paper IV)

IDENTIFICATION OF ADVERSE DRUG REACTIONS (PAPER V)

− To what extent can distributional semantics be used to extract potential adverse drug reactions to a given medication among a list of distributionally similar drug-symptom pairs? (Paper V)

1.3 THESIS DISPOSITION

The thesis is organized into the following five chapters:

CHAPTER I: INTRODUCTION provides a preamble to the rest of the thesis. A general background is provided that is intended to be only sufficiently extensive to motivate the research, outline the work and position the research in its broader context. The problem space wherein the thesis operates, as well as the research questions that it addresses, are described.

CHAPTER II: BACKGROUND presents an extensive background to the subjects covered by the included papers. First, an introduction is given to the specific research and application area to which this thesis mainly contributes: clinical language processing. It then covers the methodology that is the focal point here, namely distributional semantics: a description of the underlying theory and different models is followed by details on the model of choice – Random Indexing – and its model parameters; finally, some applications of distributional semantics are related. A background and a description of related work are then provided for each of the application areas: (1) synonym extraction of medical terms, (2) assignment of diagnosis codes and (3) identification of adverse drug reactions.

CHAPTER III: METHOD describes both the general research strategy – as well as the philosophical assumptions that underpin it – and the specific experimental setups of the included papers. The evaluation framework is laid out in detail, followed by a description of the methods and materials used in the included studies. Ethical issues are discussed briefly.

CHAPTER IV: RESULTS gives a brief presentation of the most important results, which are more fully reported in the included papers.

CHAPTER V: DISCUSSION reviews the utility of applying models of distributional semantics in the clinical domain. The research questions posed in the introductory chapter are revisited: answers are provided along with discussions and comparisons with prior work. Finally, the main contributions of the thesis are recapitulated and possible ways forward discussed.

CHAPTER 2

BACKGROUND

In this chapter, the thesis is positioned in a wider scientific and societal context: (1) what has been done prior to this and which challenges remain? and (2) what are the underlying, societal problems that motivate this effort? First, the particular area to which the work here hopes primarily to contribute – clinical language processing – is briefly introduced, including what distinguishes it from other, related areas, and what makes it especially challenging. Then, the methodology under investigation, namely distributional semantics, is described at some length, including a discussion of theoretical underpinnings, various implementations and applications. Finally, a background is provided to each of the three application areas in which the overarching research question – whether, and how, distributional semantics can be leveraged to facilitate natural language processing of electronic health records – is explored: (1) synonym extraction of medical terms, (2) assignment of diagnosis codes and (3) identification of adverse drug reactions.

2.1 CLINICAL LANGUAGE PROCESSING

Clinical language processing concerns the automatic processing of clinical text, which constitutes the unstructured, or semi-structured, part of EHRs. Clinical text is written by clinicians in the clinical setting; in fact, this term is used to denote any narrative that appears in EHRs, ranging from short texts describing a patient's chief complaint to long descriptions of a patient's medical history (Meystre et al., 2008). The language used in clinical text can also be characterized as a sublanguage with its own particular structure, distinct from both general language and other sublanguages (Friedman et al., 2002).

Clinical language processing is a subdomain of medical language processing, which, in turn, is a subdomain of natural language processing. Although (bio)medical language processing subsumes clinical language processing, biomedical text often refers to non-clinical text, particularly biomedical literature (Meystre et al., 2008). The noisy character of clinical text makes it much more challenging to analyze computationally than biomedical and general language text, such as newspaper articles. Compared to more formal genres, clinical language often does not comply with formal grammar. Moreover, misspellings abound, and shorthand – abbreviations and acronyms – is often composed in an idiosyncratic and ad hoc manner, with some being difficult to disambiguate even in context (Liu et al., 2001). This may accidentally lead to an artificially high degree of synonymy and homonymy [1] – problematic phenomena from an information extraction perspective. In one study, it was found that the pharmaceutical product noradrenaline had approximately sixty spelling variants in a set of intensive care unit nursing narratives written in Swedish (Allvin et al., 2011). This example may serve to illustrate some of the challenges involved in processing clinical language.

There are a number of approaches to performing information extraction from clinical text (Meystre et al., 2008). These can usefully be divided into two broad categories: (1) rule-based approaches (ranging from simple pattern matching to more complex symbolic approaches) and (2) statistical and machine learning approaches. A disadvantage of the former is their lack of generalizability and transferability to new domains; they are also expensive and time-consuming to build. Statistical methods and machine learning have become increasingly popular – also in the clinical domain – as access to large amounts of data becomes more readily available. Supervised machine learning, where labeled examples are provided and the goal is to learn a mapping from a known input to a desired output (Alpaydin, 2010), requires annotated data, which are also time-consuming and expensive to create, especially in the medical domain. In unsupervised machine learning, there is only input data and the goal is to find and exploit statistical regularities in the data (Alpaydin, 2010). With access to large amounts of unannotated data, unsupervised learning is obviously preferable, provided that the performance is sufficiently good.

[1] Homonyms are two orthographically identical words (same spelling) with different meanings. An example of this phenomenon in Swedish clinical text is pat, which is short for both patient and pathological.

2.2 DISTRIBUTIONAL SEMANTICS

Knowledge of semantics – that is, aspects that concern the meaning of linguistic units of various granularity: words, phrases and sentences (Yule, 2010) – even if only tacit, is a prerequisite for claims of understanding (natural) language. For computers to acquire – or mimic – any capacity for human language understanding, such semantic knowledge needs to be made explicit in some way. There are numerous approaches to the computational handling of semantics, many of which rely on mapping text to formal representations, such as ontologies or models based on first-order logic (Jurafsky et al., 2000), which can then be used to reason over.

There is another, complementary, approach that instead exploits the inherent distributional structure of language to acquire the meaning of linguistic entities, most typically words. This family of models falls under the umbrella of distributional semantics. Early models of distributional semantics were introduced to overcome the inability of the vector space model (Salton et al., 1975) – as it was originally conceived, based on keyword matching – to account for the variability of language use and word choice. To overcome the negative impact this had on recall (coverage) in information retrieval systems, various models of distributional semantics were proposed (Deerwester et al., 1990; Schütze, 1993; Lund and Burgess, 1996).

2.2.1 THEORETICAL FOUNDATION

Distributional approaches to meaning rely in essence on a set of assumptions about the nature of language in general and semantics in particular (Sahlgren, 2008). These assumptions are embedded – explicitly and implicitly – in the distributional hypothesis, which has its foundation in the work of Zellig Harris (1954) and states that words with similar distributions in language tend to have similar meanings [2]. The assumption is thus that there is a correlation between distributional similarity and semantic similarity: by exploiting the observables to which we have direct access in a corpus of texts – the linguistic entities and their relative frequency distributions over contexts (distributional similarity) – we can infer that which is hidden, or latent, namely the relative meaning of those entities (semantic similarity). For instance, if two terms, t1 and t2, share distributional properties – in the sense that they appear in similar contexts – we can infer that they are semantically similar along one or more dimensions of meaning.

This rather sweeping approach to meaning presents a point of criticism that is sometimes raised against distributional approaches: the distributional hypothesis does not make any claims about the nature of the semantic (similarity) relation, only about the extent to which two entities are semantically similar. Since a large number of distributionally similar entities have potentially very different semantic relations, distributional methods cannot readily distinguish between, for instance, synonymy, hypernymy, hyponymy, co-hyponymy and, in fact, even antonymy [3]. As we will discover, the types of semantic relations that are modeled depend on the employed context definition and thereby also on what type of distributional information is collected (Sahlgren, 2006). The broad notion of semantic similarity is, however, still meaningful and represents a psychologically intuitive concept (Sahlgren, 2008), even if the ability to distinguish between different types of semantic relations is often desirable, as in the synonym extraction task.

If we, for a moment, consider the underlying philosophy of language and linguistic theory, the distributional hypothesis – and the models that build upon it – can be characterized as subscribing to a structuralist, functionalist and descriptive view of language. As Sahlgren (2006, 2008) points out, Harris' differential view of meaning can be traced back to the structuralism of Saussure, in which the primary interest lies in the structure of language – seen as a system (la langue) – rather than its possible uses (la parole). In this language system, signs [4] are defined by their functional differences, which resonates with Harris' notion of meaning differences: the meaning of a given linguistic entity can only be determined in relation to other linguistic entities. Along somewhat similar lines, Wittgenstein (1958) propounded the notion that has been aphoristically captured in the famous dictum: meaning is use. The key to uncovering the meaning of a word is to look at its communicative function in the language. The Firthian (Firth, 1957) emphasis on the context-dependent nature of meaning is also perfectly in line with the distributional view of language: you shall know a word by the company it keeps. This view of semantics is wholly descriptive – as opposed to prescriptive – in that very few assumptions are made a priori: meaning is simply determined through observations of actual language use. Approaching semantics in this way comes with many other advantages for natural language processing, as it has spawned a family of data-driven, unsupervised methods that do not require hand-annotated data in the learning process.

[2] Harris, however, speaks of differences in distributions and meaning rather than similarities: two terms whose distributions differ more than those of another pair of terms are also more likely to differ more in meaning. The distributional hypothesis flips this around by emphasizing similarities rather than differences (Sahlgren, 2008).
[3] Since antonyms are perhaps not intuitively seen as semantically similar, some prefer to speak of semantic relatedness rather than semantic similarity (Turney and Pantel, 2010).

2.2.2 MODELS OF DISTRIBUTIONAL SEMANTICS

The numerous approaches to distributional semantics can be broadly divided into spatial models and probabilistic models [5] (Cohen and Widdows, 2009). Spatial models of distributional semantics represent (term) meaning as locations in high-dimensional vector space, where proximity [6] indicates semantic similarity: terms that are close to each other in semantic space are close in meaning; terms that are distant are semantically unrelated. This notion is sometimes referred to as the geometric metaphor of meaning and intuitively captures the idea that terms can (metaphorically speaking) be near in meaning, as well as the fact that meaning – when equated with use – shifts over time and across domains. Spatial models include Wordspace (Schütze, 1993), Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996), Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997) and Random Indexing (RI) (Kanerva et al., 2000). Probabilistic models view documents as a mixture of topics and represent terms according to the probability of their occurrence during the discussion of each topic: two terms that share similar topic distributions are assumed to be semantically related. Probabilistic models include Probabilistic LSA (pLSA) (Hofmann, 1999), Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and the Probabilistic Topic Model (Griffiths and Steyvers, 2002).

[4] In structuralism, a linguistic sign is composed of two parts: a signifier (a word) and a signified (the meaning of that word).
[5] The distinction between spatial and probabilistic models is not always clear-cut. There are, for instance, probabilistic interpretations of vector space models; unified approaches are a distinct possibility (van Rijsbergen, 2004).
[6] Proximity is typically measured by calculating the cosine of the angle between two vectors. There are other measures, too: see Section 3.3.3.

There are pros and cons of each approach: while spatial models are more realistic and intuitive from a psychological perspective, probabilistic models are more statistically sound and generative, which means that they can be used to estimate the likelihood of unseen documents. From an empirical, experimental point of view, one approach does not consistently outperform the other, independent of the task at hand. Scalable versions of spatial models, such as RI, have, however, proved to work well for very large corpora (Cohen and Widdows, 2009) and are the focus in this thesis.
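To make the geometric metaphor and the cosine measure mentioned in footnote [6] concrete, here is a minimal Python sketch; the three-dimensional toy vectors and the example terms are invented for illustration and bear no relation to the semantic spaces used in the included papers.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two context vectors; 1.0 means
    identical direction (maximal similarity), 0.0 means orthogonal
    (no distributional relatedness)."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0

# Toy context vectors (counts over three hypothetical contexts).
pain = np.array([4.0, 1.0, 0.0])
ache = np.array([3.0, 2.0, 0.0])
insulin = np.array([0.0, 1.0, 5.0])

print(cosine_similarity(pain, ache))     # high: similar contexts
print(cosine_similarity(pain, insulin))  # low: different contexts
```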

2.2.3 CONSTRUCTING SEMANTIC SPACES

Spatial representations of meaning (see Turney and Pantel 2010 for an overview) differ mainly in the way context vectors, representing term meaning, are constructed. The context vectors are typically derived from a term-context matrix [7] that contains the (weighted, normalized) frequency with which terms occur within a set of defined contexts. The context definition may vary both across and within models; the numerous context definitions can essentially be divided into two groups: a non-overlapping passage of text (a document) or a sliding window of tokens/characters surrounding the target term [8]. There are also linguistically motivated context definitions, achieved with the aid of, for instance, a syntactic parser, where the context is defined in terms of syntactic dependency relations (Padó and Lapata, 2007). An advantage of employing a more naïve context definition is, however, that the language-agnostic property of distributional semantics is thereby retained: all that is needed is a corpus of (term-)segmented text (a minimal counting sketch is given below).

[7] There are also pair-pattern matrices, where rows correspond to pairs of words and columns correspond to the patterns in which the pairs co-occur. These have been used for identifying semantically similar patterns (Lin and Pantel, 2001) – by comparing column vectors – and analogy relations (Turney and Littman, 2005) – by comparing row vectors. Pattern similarity is a useful notion in paraphrase identification (Lin and Pantel, 2001). It is also possible to employ higher-order tensors (vectors are first-order tensors and matrices are second-order tensors); see for instance Turney (2007).
[8] The size of the sliding window is typically static and predefined.
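As an illustration of the sliding window context definition discussed above, the sketch below counts direct co-occurrences within a symmetric window; the tokenized toy sentence is invented, and a real pipeline would add the weighting, normalization and dimensionality reduction discussed next.

```python
from collections import defaultdict

def term_term_matrix(tokens, window=2):
    """Count co-occurrences within a symmetric sliding window.
    Returns {term: {context_term: count}}, i.e. a sparse
    term-context matrix with terms as contexts."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts

corpus = "pat complains of chest pain no chest pain at rest".split()
matrix = term_term_matrix(corpus, window=2)
print(dict(matrix["pain"]))  # contexts observed around "pain"
```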

Irrespective of which context definition is employed, the number of contexts is typically huge: with a document-level context definition, it equates to the number of documents; with a (term-based) sliding window context definition, it amounts to the size of the vocabulary. Working directly with such high-dimensional (and inherently sparse) data – where the dimensionality is equal to the number of contexts – would involve unnecessary computational complexity and come with significant memory overhead, in particular since most terms only occur in a limited number of contexts, which means that most cells in the matrix would be zero. Data sparsity as a result of high dimensionality is commonly referred to as the curse of dimensionality and means that the statistical evidence is scarce. The solution is to project the high-dimensional data into a lower-dimensional space, while approximately preserving the relative distances between data points. The benefit of dimensionality reduction is two-fold: on the one hand, it reduces complexity and data sparseness; on the other hand, it has also been shown to improve the coverage and accuracy of term-term associations (Landauer and Dumais, 1997), as, in this reduced (semantic) space, terms that do not necessarily co-occur directly in the same contexts will nevertheless be clustered about the same subspace, as long as they appear in similar contexts, i.e. have neighbors in common (co-occur with the same terms). In this way, the reduced space can be said to capture higher-order co-occurrence relations and latent semantic features.
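The claim that relative distances survive projection can be checked empirically. The sketch below uses a dense Gaussian random projection on synthetic count data as a simple stand-in for the sparse ternary projections of RI described in Section 2.2.4; all sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy high-dimensional, sparse term-context count vectors.
n_terms, n_contexts, d = 30, 5000, 300
high = rng.poisson(0.02, size=(n_terms, n_contexts)).astype(float)

# Johnson-Lindenstrauss-style random projection into d dimensions.
projection = rng.standard_normal((n_contexts, d)) / np.sqrt(d)
low = high @ projection

def pairwise_distances(m: np.ndarray) -> np.ndarray:
    return np.linalg.norm(m[:, None, :] - m[None, :, :], axis=-1)

# Relative distances between terms are approximately preserved.
r = np.corrcoef(pairwise_distances(high).ravel(),
                pairwise_distances(low).ravel())[0, 1]
print(f"distance correlation after projection: {r:.3f}")  # near 1
```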

In one of the earliest spatial models, Latent Semantic Indexing/Analysis (LSI/LSA) (Deerwester et al., 1990; Landauer and Dumais, 1997), dimensionality reduction is performed with a computationally expensive matrix factorization technique known as Singular Value Decomposition (SVD). SVD decomposes the term-document matrix into a reduced-dimensional matrix – of a predefined dimensionality – that best captures the variance between points in the original matrix (illustrated in code below). Due to its reliance on SVD, LSA – in its original form – received some criticism, partly for its poor scalability properties, as well as for the fact that a semantic space needs to be completely rebuilt each time new data is added [9]. As a result, alternative methods for constructing semantic spaces have been proposed.

[9] There are now versions of LSA in which an existing SVD can be updated.
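For illustration, the LSA-style truncated SVD described above can be written in a few lines of NumPy; the matrix sizes and the choice of k are arbitrary toy values, not those of any study in this thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy term-document count matrix (rows: terms, columns: documents).
X = rng.poisson(0.5, size=(100, 40)).astype(float)

# LSA-style reduction: keep only the top-k singular components.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 10
terms_k = U[:, :k] * s[:k]  # k-dimensional term representations
print(terms_k.shape)        # (100, 10)
```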

2.2.4 RANDOM INDEXING WITH(/OUT) PERMUTATIONS

Random Indexing (Kanerva et al., 2000; Karlgren and Sahlgren, 2001; Sahlgren, 2005) is an incremental, scalable and computationally efficient alternative to LSA in which explicit dimensionality reduction is circumvented. Instead of constructing an initial and huge term-context matrix, which then needs to be reduced, a lower dimensionality d – where d is much smaller than the number of contexts – is chosen a priori as a model parameter. The pre-reduced d-dimensional context vectors are then incrementally populated with context information. RI can be viewed as a two-step operation (a code sketch follows the two steps below):

1. Each context is first given a static, unique representation in the vector space that is approximately uncorrelated to all other contexts. This is achieved by assigning a sparse, ternary [10] and randomly generated d-dimensional index vector: a small number (usually ∼1–2%) of +1's and −1's are randomly distributed, with the rest of the elements set to zero. By generating sparse vectors of a sufficiently high dimensionality in this way, the index vectors will, with a high probability, be nearly orthogonal [11]. With a sliding window context definition, each unique term is assigned an index vector; with a document-level context definition, each document is assigned an index vector.

2. Each unique term – or whichever linguistic entity we wish to model – is assigned an initially empty context vector of the same dimensionality d. The context vectors are then incrementally populated with context information by adding the index vectors of the contexts in which the target term appears. With a sliding window context definition, this means that the index vectors of the surrounding terms are added to the target term's context vector; with a document-level context definition, the index vector of the document is added to the target term's context vector. The meaning of a term, represented by its context vector, is effectively the sum [12] of all the contexts in which it occurs.

[10] Ternary vectors allow three possible values: +1, 0 and −1. Allowing negative vector elements ensures that the entire vector space is utilized (Karlgren et al., 2008).
[11] Orthogonal index vectors would yield completely uncorrelated context representations; in the RI approximation, near-orthogonal index vectors result in almost uncorrelated context representations.
[12] Sometimes weighting is applied to the index vectors before they are added to a context vector, for instance according to their distance and/or direction from the target term.
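A minimal sketch of the two steps in Python/NumPy, assuming a sliding window context definition and no weighting; the dimensionality, sparsity, window size and toy sentence are illustrative choices, not the configurations evaluated in the included papers.

```python
import numpy as np

rng = np.random.default_rng(42)

def index_vector(d=2000, nonzero=20):
    """Step 1: sparse ternary index vector (~1% +1's and -1's)."""
    v = np.zeros(d)
    pos = rng.choice(d, size=nonzero, replace=False)
    v[pos[: nonzero // 2]] = 1.0
    v[pos[nonzero // 2:]] = -1.0
    return v

def random_indexing(tokens, d=2000, window=2):
    """Step 2: accumulate neighbors' index vectors into each
    target term's context vector (bag-of-words, unweighted)."""
    index = {t: index_vector(d) for t in set(tokens)}
    context = {t: np.zeros(d) for t in set(tokens)}
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                context[target] += index[tokens[j]]
    return context

vectors = random_indexing("no sign of chest pain or chest discomfort".split())
```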

Random Indexing belongs to a group of dimensionality-reduction methods, which includes Random Projection (Papadimitriou et al., 1998) and Random Mapping (Kaski, 1998), that are motivated by the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984). This asserts that the relative distances between points in a high-dimensional Euclidean (vector) space will be approximately preserved if they are randomly projected into a lower-dimensional subspace of sufficiently high dimensionality. RI exploits this property by randomly projecting near-orthogonal index vectors into a reduced-dimensional space: since there are more nearly orthogonal than truly orthogonal vectors in a high-dimensional space (Kaski, 1998), randomly generating sparse (index) vectors is a good approximation of orthogonality.
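The near-orthogonality argument is easy to verify numerically: pairwise cosines between randomly generated sparse ternary vectors concentrate around zero. The dimensionality and sparsity below are illustrative values only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_ternary(d=2000, nonzero=20):
    v = np.zeros(d)
    pos = rng.choice(d, size=nonzero, replace=False)
    v[pos[: nonzero // 2]] = 1.0
    v[pos[nonzero // 2:]] = -1.0
    return v

# Pairwise cosines between random sparse ternary vectors cluster
# tightly around zero, i.e., the vectors are nearly orthogonal.
vectors = np.array([sparse_ternary() for _ in range(200)])
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
cosines = vectors @ vectors.T
off_diagonal = cosines[~np.eye(200, dtype=bool)]
print(f"mean |cosine|: {np.abs(off_diagonal).mean():.4f}")  # ~0.01
```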

Making the dimensionality a static model parameter gives the method attractive scalability properties, since the dimensionality will not grow with the size of the data, or, more specifically, the number of contexts [13]. Moreover, RI's incremental approach allows new data to be added at any given time without having to rebuild the semantic space: if a hitherto unseen term is encountered, it is simply assigned an index vector (i.e., in the case of a sliding window context definition) and an empty context vector. This property – allowing data to be added to an existing semantic space in an efficient manner – makes RI unique among the aforementioned models of distributional semantics. Its ability to handle streaming data [14] makes it possible to study real-time acquisition of semantic knowledge (Cohen and Widdows, 2009).

[13] The dimensionality may, however, not remain appropriate irrespective of data size. This issue is discussed in more detail in Paper IV.
[14] This property of RI makes it analogous to an online learning method.

Models of distributional semantics, including RI, generally treat each context (sliding window or document) as a bag of words [15]. Such models are often criticized for failing to account for term order. By weighting the index vectors according to their position relative to the target term in the context window, term order information is, in some sense, incorporated in the model. There are, however, approaches that model term order in more explicit ways. Random Permutation (RP) (Sahlgren et al., 2008) is a modification of RI that encodes term order information by simply permuting (i.e., shifting) the elements in the index vectors according to their direction and distance [16] from the target term before they are added to the context vector (sketched in code below). For instance, before adding the index vector of a term two positions to the left of the target term, the elements are shifted two positions to the left; similarly, before adding the index vector of a term one position to the right of the target term, the elements are shifted one position to the right. In effect, each term has multiple unique representations: one index vector for each possible position relative to the target term in the context window. This naturally affects the near-orthogonality property of RI-based approaches; in theory, the dimensionality thus needs to be higher with RP than with RI. RP is inspired by the work of Jones and Mewhort (2007) and their BEAGLE model, which uses vector convolution to incorporate term order information in semantic space. Incorporating term order information not only enables order-based retrieval; it also constrains the types of semantic relations that are captured. RP has been shown to outperform RI on the TOEFL [17] synonym test (Sahlgren et al., 2008).

[15] The bag-of-words model is a simplified representation of a text as an unordered collection of words, where grammar and word order are ignored.
[16] An alternative is to shift the index vectors according to direction only, effectively producing direction vectors. In fact, these are shown to outperform permutations based on both direction and distance on the synonym identification task (Sahlgren et al., 2008).
[17] Test Of English as a Foreign Language.
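A sketch of the permutation step, reusing the index and context dictionaries of the RI sketch above; np.roll stands in for the permutation, and the signed-offset convention is one possible implementation choice, not necessarily the one used by Sahlgren et al. (2008).

```python
import numpy as np

def rp_update(context, index, tokens, i, window=2):
    """One Random Permutation update for the target at position i:
    each neighbor's index vector is rotated by its signed offset
    from the target (np.roll) before being added, so that the same
    neighbor term gets a distinct encoding for each position."""
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            # j - i < 0: left context, shift left; > 0: shift right.
            context[tokens[i]] += np.roll(index[tokens[j]], j - i)
```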

2.2.5 RANDOM INDEXING MODEL PARAMETERS

Whether to employ RI with or without permuting index vectors is not the only decision that needs to be made before inducing a semantic space. There are, in fact, a number of model parameters that need to be configured according to the task that the induced semantic space will be used for. One such parameter is the context definition, which affects the types of semantic relations that are captured (Sahlgren, 2006). By employing a document-level context definition, relying on direct co-occurrences, one models syntagmatic relations: two terms that frequently co-occur in the same documents are likely to be about the same general topic. By employing a sliding window context definition, one models paradigmatic relations: two terms that frequently co-occur with similar sets of words – i.e., share neighbors – but do not necessarily co-occur themselves, are semantically similar. The size of the context window also affects the types of relations that are modeled and typically needs to be tuned for the task at hand.

Other important model parameters concern the dimensionality of the semantic space. The dimensionality (vis-à-vis the number of contexts) is a parameter that needs to be configured in order to maintain the important near-orthogonality property of RI. Each context – with a sliding window context definition, this means each unique term – needs to be assigned a near-orthogonal index vector; this is not possible unless the dimensionality is sufficiently large. This is also affected by the proportion of non-zero elements in the index vectors, which is another parameter of the model. The index vectors need to be sparse, while ensuring that there are enough possible permutations to satisfy the near-orthogonality condition of RI – a common rule-of-thumb is to randomly distribute 1–2% non-zero elements.

Yet another model parameter concerns the weighting schema that is applied to the index vectors before they are added to the target term's context vector. Possibilities include constant weighting (the terms in the context are treated as a bag of words), linear distance weighting (weight decreases by some constant as the distance from the target term increases) and aggressive distance weighting (increasingly less weight as the distance from the target term increases) (Sahlgren, 2006).
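The three weighting schemas can be expressed as simple functions of a neighbor's distance (in tokens) from the target term; the exact decay constants below are illustrative guesses, not the values used by Sahlgren (2006).

```python
def constant_weight(distance: int) -> float:
    """Bag-of-words: every position in the window counts equally."""
    return 1.0

def linear_weight(distance: int, window: int = 4) -> float:
    """Weight decreases by a constant as distance grows."""
    return (window - distance + 1) / window

def aggressive_weight(distance: int) -> float:
    """Increasingly less weight with distance, e.g. 2^(1 - d)."""
    return 2.0 ** (1 - distance)
```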

Finally, it seems apt to underscore the fact that an induced semantic space can be manipulated by preprocessing the data in various ways. Common preprocessing steps – or, method choices – in distributional semantics research include stop-word removal and lemmatization (see Section 3.3).

2.2.6 APPLICATIONS OF DISTRIBUTIONAL SEMANTICS

Models of distributional semantics have been applied with considerable success to a range of (general) NLP tasks. One of the original applications – and a motivating force behind the development of distributional semantics – appeared in the context of information retrieval, in particular document retrieval, where the limitations of string-matching, keyword-based approaches were recognized; attempts were made to exploit distributional similarities between query terms and document terms to improve recall (Deerwester et al., 1990).

A widespread application of distributional semantics is to support the development of terminologies and other lexical resources by, for instance, identifying synonymy between terms. Synonym tests, where a synonym of a given term is to be selected among a number of candidate terms, are, in fact, a common means to evaluate semantic spaces and models of distributional semantics. Landauer and Dumais (1997) used the synonym part of the TOEFL test to evaluate LSA, and many have since used the same test for evaluation. With RP, an accuracy of 80% was achieved on this test (Sahlgren et al., 2008), compared to 25% with random guessing, 64.4% with LSA [18] (36% without SVD) and an average of 64.5% by foreign applicants to American colleges (Landauer and Dumais, 1997). Other applications of distributional semantics include document classification, essay grading, (bilingual) information extraction, word sense disambiguation/discrimination, sentiment analysis and visualization (see Cohen and Widdows 2009 and Turney and Pantel 2010 for an overview of distributional semantics applications).
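A TOEFL-style synonym item reduces to a nearest-neighbor query in the semantic space. The sketch below assumes a vectors mapping from terms to context vectors, such as the one produced by the RI sketch in Section 2.2.4.

```python
import numpy as np

def choose_synonym(target, candidates, vectors):
    """TOEFL-style synonym item: pick the candidate whose context
    vector lies closest (by cosine) to the target's."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(candidates, key=lambda c: cos(vectors[target], vectors[c]))
```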

In recent years there has been an increasing interest in the application of distri- butional semantics in the biomedical domain19, most of which involves exploiting the ever-growing amounts of biomedical literature that is now readily available in digital form (Cohen and Widdows, 2009). Homayouni et al. (2005), for instance, apply LSI on a corpus comprising titles and abstracts in MEDLINE20 citations to identify both explicit and implicit gene relationships. Others have exploited the analogy between linguistic sequences and biological sequences, e.g., sequences of genes or proteins, and used distributional semantics to derive a measure of simi- larity between biological sequences, see for instance (Stuart and Berry, 2003) or (Ganapathiraju et al., 2004). Another application in this context is to use distribu- tional semantics for what is known as literature-based knowledge discovery, i.e., the identification of knowledge that is not explicitly stated in the research litera- ture. Gordon and Dumais(1998) demonstrate how two disparate literatures can be bridged by a third by using LSI to identify common concepts and in this way support the knowledge discovery process. Cohen et al. (2010) develop and apply an iterative version of RI – Reflective Random Indexing – to the same problem and it is shown to perform better than traditional RI implementations on the task

18It should be noted that Landauer and Dumais (1997) and Sahlgren et al. (2008) used different corpora. 19This section is largely based on the extensive review of biomedical applications of distributional semantics by Cohen and Widdows (2009). 20MEDLINE is a database comprising over 19 million references to biomedical research articles: http://www.nlm.nih.gov/pubs/factsheets/medline.html.

of finding implicit connections between terms. Other biomedical applications of distributional semantics involve mapping the biomedical literature to ontologies and controlled vocabularies, as well as the automatic generation of synonyms.

Cohen and Widdows (2009) argue that distributional semantics is perhaps less suited to the clinical domain due to the constrained nature of clinical narrative. The application of distributional semantics to this particular genre of text has been less explored21, largely due to difficulties involved in obtaining clinical data for privacy and other ethical reasons. Some work has, however, been done and is worth relating. Pedersen et al. (2007) construct context vectors from a corpus of clinical notes and use them to derive a measure of the semantic relatedness between SNOMED CT concepts; it was found that the distributional semantics approach outperformed ontology-based methods. Cohen et al. (2008) use LSA to identify meaningful associations between clinical concepts in psychiatric narratives to support the formulation of intermediate diagnostic hypotheses. Jonnalagadda et al. (2012) demonstrate that the incorporation of distributional semantic features in a machine learning classifier (Conditional Random Fields) can lead to enhanced performance on the task of identifying named entities in clinical text.

2.3 LANGUAGE USE VARIABILITY

Models of distributional semantics are, in some sense, able to capture the semantics underlying word choice. This quality of natural language is essential and endows it with the flexibility we, as humans, need in order to express and communicate our thoughts. For computers, however, language use variability is problematic and far removed from the unambiguous, artificial languages to which they are accustomed. There is thus a need for computers to be able to handle and account for language use variability, not least by being aware of the relationship between surface tokens and their underlying meaning, as well as the relationship between the meanings of tokens. Indeed, morphological variants, abbreviations, acronyms, misspellings and synonyms – although different in form – may share semantic content to different

21One could make the claim that distributional methods have been applied rather extensively in the clinical domain; however, in those cases, the object of analysis has been terminological resources such as UMLS (Chute et al., 1991) and ICD (Chute and Yang, 1992). Distributional methods have also been applied to non-clinical corpora for the enrichment of terminological resources in the clinical domain (Fan and Friedman, 2007). The application of distributional semantics to clinical text is, however, much more limited.

degrees. The various lexical instantiations of a concept thus need to be mapped to some standard representation of the concept, either by converting the different expressions to a canonical form or by generating lexical variants of a concept’s preferred term. These mappings are typically encoded in semantic resources, such as thesauri or ontologies, which enable the recall of information extraction systems to be improved. Although their value is undisputed, manual construction of such resources is often prohibitively expensive and may also result in limited coverage, particularly in the biomedical and clinical domains where language use variability is exceptionally high (Meystre et al., 2008; Allvin et al., 2011).

2.3.1 SYNONYMS

Language use variability is very often realized through word choice, which is made possible by the existence of a linguistic phenomenon known as synonymy. Synonymy is a semantic relation between two phonologically distinct words with very similar meaning22. It is, however, extremely rare that two words have the exact same meaning – perfect synonyms or true synonyms – as there is often at least one parameter that distinguishes the use of one word from another. Hence, we typically speak of near synonyms instead; that is, two words that are interchangeable in some, but not all, contexts23. Two near synonyms may also have different connotations and formality levels, or represent dialectal or idiolectal differences (Saeed, 1997). Another way of viewing this particular semantic relation is to state that two expressions are synonymous if and only if they can be substituted without affecting the truth conditions of the constructions in which they are used24. In the medical domain, there are differences in the vocabulary used by medical professionals and patients, where the latter use layman variants of medical terms (Leroy and Chen, 2001). When performing medical language processing, it is important to have ready access to terminological resources that cover this variation in the use of vocabulary. Some examples of applications where this is particularly important

22As synonymous expressions do not necessarily comprise a pair of single words, we sometimes speak of paraphrasing rather than synonymy. Paraphrases are alternative ways of conveying the same information, e.g., pain in joints and joint pain (Friedman et al., 2002). See (Androutsopoulos and Malakasiotis, 2010) for an overview of paraphrasing methods. 23The words big and large are, for instance, synonymous when describing a house, but not when describing a sibling. 24This condition of synonymy is known as substitution salva veritate (Cruse, 1986).

include query expansion (Leroy and Chen, 2001), text simplification (Leroy et al., 2012) and information extraction (Eriksson et al., 2013).

The importance of synonym learning is well recognized in the NLP research community, especially in the biomedical (Cohen and Hersh, 2005) and clinical (Meystre et al., 2008) domains. A whole host of techniques have been proposed for the identification of synonyms and other semantic relations, including the use of lexico-syntactic patterns, graph-based models and, indeed, distributional semantics.

Hearst (1992) proposes the use of lexico-syntactic patterns for the automatic acquisition of hyponyms from unstructured text. These patterns are hand-crafted according to observations in a corpus and can also be constructed for other types of lexical relations, including synonymy. However, a requirement is that these syntactic patterns are sufficiently prevalent to match a wide array of instances of the targeted semantic relation. Blondel et al. (2004) present a graph-based method that draws on the calculation of hub, authority and centrality scores when ranking hyperlinked web pages. They demonstrate how the central similarity score can be applied to the task of automatically extracting synonyms from a monolingual dictionary, where the assumption is that synonyms have a large overlap in the words used in their definitions; synonyms also co-occur in the definitions of many words. Another possible source for extracting synonyms is linked data, such as Wikipedia. Nakayama et al. (2007) also utilize a graph-based method, but instead of relying on word co-occurrence information, they exploit the links between Wikipedia articles and treat articles as concepts. This allows both the strength (the number of paths from one article to another) and the distance (the length of each path) between concepts to be measured: concepts close to each other in the graph and with common hyperlinks are deemed to have a stronger semantic relation than those farther away.

Efforts have been made to obtain better performance on the synonym extraction task by not relying on only a single source and a single method. Many of these approaches have drawn inspiration from ensemble learning, a machine learning technique that combines the output of several different classifiers with the aim of improving classification performance (see Dietterich 2000 for an overview). Curran (2002) demonstrates that ensemble methods outperform individual classifiers even for very large corpora. Wu and Zhou (2003) use multiple resources – a monolingual dictionary, a bilingual corpus and a large monolingual corpus – in

a weighted ensemble method that combines the individual extractors, effectively improving performance on the synonym learning task. van der Plas and Tiedemann (2006) use parallel corpora to calculate distributional similarity based on (automatic) word alignment, where a translational context definition is employed; this method outperforms a monolingual approach. Peirsman and Geeraerts (2009) combine predictors based on collocation measures and distributional semantics with a so-called compounding approach, where cues are combined with strongly associated words into compounds and verified against a corpus; this is shown to outperform the individual predictors of strong term associations.

In the biomedical domain, most efforts have focused on extracting synonyms of gene and protein names from the biomedical literature (Yu and Agichtein, 2003; Cohen et al., 2005; McCrae and Collier, 2008). In the clinical domain, significantly less work has been done on the synonym extraction task. Conway and Chapman (2012) propose a rule-based approach to generate potential synonyms of terms from the BioPortal ontology25 – using, for instance, term permutations and abbreviation generation – after which candidate synonyms are verified against a large clinical corpus. This method, however, fails to capture lexical instantiations of concepts that do not share morphemes or characters with the seed term, which, in fact, is the typical case for synonyms according to the aforementioned definition. From an information extraction perspective, however, all types of lexical instantiations of a concept need to be accounted for. Zeng et al. (2012) evaluate three query expansion methods for retrieval of clinical documents and conclude that a distributional semantics approach, in the form of an LDA-based topic model, generates the best synonyms. For the broader estimation of medical concept similarity, corpus-driven approaches have proved more effective than path-based measures (Pedersen et al., 2007; Koopman et al., 2012).

2.3.2 ABBREVIATIONS

Abbreviations and their corresponding long forms (expansions), although not typically considered synonyms, fulfil many of the conditions of the synonym semantic relation: they are often interchangeable in some contexts without affecting the truth condition of the larger construction in which they are used. Even if the extraction of synonyms and abbreviation-expansion pairs are not identical tasks, they

25BioPortal is a Web portal to a virtual library of over fifty biological and medical ontologies: http://bioportal.bioontology.org/.

can reasonably be viewed as being closely related, particularly when approached from a distributional perspective.

Abbreviations and acronyms are prevalent in both the biomedical literature (Liu et al., 2002) and clinical narrative (Meystre et al., 2008). Not only does this result in decreased readability (Keselman et al., 2007), but it also poses challenges for information extraction (Uzuner et al., 2011). Semantic resources that link abbreviations to their corresponding concept, or dictionaries of abbreviations and their corresponding long form, are therefore as important as synonym resources for many medical language processing tasks. In contrast to synonyms, however, abbreviations are semantically overloaded to a much larger extent. This means that an abbreviation may have several possible long forms or have the same surface form as another word. In fact, 81% of UMLS26 abbreviations were found to have ambiguous occurrences in biomedical text (Liu et al., 2002), many of which are difficult to disambiguate even in context (Liu et al., 2001).

Many of the existing studies on the automatic creation of biomedical abbreviation dictionaries exploit the fact that abbreviations are sometimes defined in the text on their first mention. These studies extract candidate abbreviation-expansion pairs by assuming that either the long form or the abbreviation is written in parentheses (Schwartz and Hearst, 2003). Candidate abbreviation-expansion pairs are extracted and the process of determining which of these are likely to be correct is then performed either by rule-based (Ao and Takagi, 2005) or machine learning (Chang et al., 2002; Movshovitz-Attias and Cohen, 2012) methods. Most of these studies have been conducted on English corpora; however, there is one study on Swedish medical text (Dannélls, 2006). Despite its popularity, there are problems with this approach: Yu et al. (2002) found that around 75% of all abbreviations in the biomedical literature are never defined.

The application of this method to clinical text is almost certainly inappropriate, as it seems highly unlikely that abbreviations would be defined in this way. The telegraphic style of clinical narrative, with its many ad hoc abbreviations, is reasonably explained by time and efficiency considerations in the clinical setting. There has, however, been some work on identifying such undefined abbreviations (Isenius et al., 2012), as well as on finding the intended abbreviation expansion among several candidates available in an abbreviation dictionary (Gaudan et al., 2005).

26Unified Medical Language System: http://www.nlm.nih.gov/research/umls/

2.4 DIAGNOSIS CODING

Diagnosis coding – that is, the process of assigning codes that correspond to a given disease or health condition – is necessary in order to estimate the prevalence and incidence of diseases and health conditions, as well as differences therein between various geographical areas and over time (Socialstyrelsen, 2013). For such statistics to be comparable, there needs to be a standardized classification system, ideally one that is adopted globally. This is the motivation behind the development of the International Classification of Diseases (ICD), which has been regularly revised and released since 1948 by the World Health Organization (WHO). The main purpose of ICD is to enable classification and statistical description of diseases and other health problems that cause death or contact with health care. In addition to traditional diagnoses, the classification must therefore encompass a broad spectrum of symptoms, abnormal findings and social circumstances (World Health Organization, 2013).

The International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10) (World Health Organization, 2013) is a hierarchical classification system and contains 21 chapters (indicated by roman numerals), which are largely divided according to organ system or etiology27 (infectious diseases, tumors, congenital malformations and injuries), in addition to a number of chapters based on other principles. The chapters are divided into sections that contain groups of similar diseases. Each section, in turn, contains a number of categories (indicated by alphanumeric codes: a letter and two digits), which often represent individual diseases. Categories are typically divided into subcategories (a letter, two digits and a decimal), which can refer to different types or stages of a disease.
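The surface form of categories and subcategories is regular enough to be matched mechanically. The following is a minimal, hedged sketch: the regular expression checks only the letter-plus-digits shape described above, not whether a code actually exists in the classification.

import re

# Matches a category (a letter and two digits), optionally followed by a
# subcategory decimal, e.g. "J18" or "J18.9"; it does not validate the
# code against the actual ICD-10 code lists.
ICD10_SHAPE = re.compile(r"^[A-Z][0-9]{2}(\.[0-9]{1,2})?$")

for code in ["J18", "J18.9", "18J", "J189"]:
    print(code, bool(ICD10_SHAPE.match(code)))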

The current Swedish version is ICD-10-SE28 (Socialstyrelsen, 2010). This classification system has been widely used in Sweden for a variety of purposes, including to generate statistics on inpatient and outpatient care, as well as for regional and national registries. It has also served as the foundation for other classification systems that are used for hospital management and quality control in the health care

27Etiology is the study of the causation or origination of a disease. 28The Swedish version of ICD-10 used to be called KSH97 (Klassifikation av sjukdomar och hälsoproblem 1997); this second edition is, however, called ICD-10-SE (Internationell statistisk klassifikation av sjukdomar och relaterade hälsoproblem – Systematisk förteckning) and replaced the first edition as of January 1, 2011.

sector. The DRG system, which works with diagnosis-related groups based on ICD, is used for the allocation of resources in health care and for financial reimbursement to hospitals.

The process of assigning diagnosis codes is generally carried out in one of two ways29: (1) by expert coders, who, without additional insight into the patient history, produce codes that are considered to be exhaustive; or (2) by physicians, who, having some knowledge of the patient history, code only essential aspects of the care episode and therefore generally produce fewer codes than expert coders due to practical restrictions, such as limited awareness of coding guidelines and lack of time. In either case, the process of medical encoding is expensive and time-consuming, yet essential30. In the case where diagnosis codes are assigned by physicians, the latter are kept from their primary responsibility of tending to patients. Employing expert coders is particularly expensive, as their job is complicated by having to examine both the health records and, typically, hundreds of possible codes in encoding references.

Furthermore, the quality of medical encoding varies substantially according to the experience and expertise of coders, often resulting in under- or over-encoding (Puentes et al., 2013). In a review of 4,200 health records conducted by the Swedish National Board of Health and Welfare (Socialstyrelsen, 2006), major errors in the primary diagnosis were found in 20% of the cases and minor errors in 5% of the cases. Another review of the Swedish patient registry in 2008, however, found that 98.6% of the care episodes were correctly encoded, but that 1.09% of the care episodes had not been assigned a primary diagnosis code at all (Socialstyrelsen, 2010). Potential sources of erroneous ICD coding are related to the patient trajectory during the care episode – the amount and quality of information at admission, patient-clinician communication and the clinician's knowledge, experience and attention to detail – and the documentation thereof – coder training and experience, language use variability in the EHR, as well as unintentional and intentional coder errors, including upcoding, i.e., assigning codes of higher reimbursement value (O'Malley et al., 2005).

29See (O'Malley et al., 2005) for a good description of how the process of assigning diagnosis codes is typically carried out. 30In fact, encoding diagnoses and medical procedures is obligatory in many countries, including Sweden (Socialstyrelsen, 2013).

2.4.1 COMPUTER-AIDED MEDICAL ENCODING

Attempts to develop methods and tools that can be used to provide automated support for the coding of diagnoses are primarily motivated by the massive resources – in terms of both time and money – the task consumes. According to one estimate, the cost of medical encoding and associated errors is approximately $25 billion per year in the US (Lang 2007 as cited in Pestian et al. 2007). Another reason to support this rather non-intuitive task is to improve the quality of coding, as the statistics that are generated from it play such a fundamental role in many dimensions of health care, ranging from quality control and financial reimbursement to epidemiology and public health. As a result, computer-aided diagnostic coding has long been an active research endeavor31 (see Stanfill et al. 2010 for a review).

A popular approach is to exploit prelabeled corpora – where codes have been assigned to health records – to infer codes for new, unseen records32. Most methods primarily target clinical narratives, with the assumption that this information source will be particularly informative about the diagnosis or diagnoses that should be assigned. An early, and well-cited, effort in that vein was made by Larkey and Croft (1996), who use three different types of classifiers – k-nearest-neighbor, relevance feedback and Bayesian independence – in isolation and in combination, to assign ICD-9 codes to discharge summaries. The classifiers produce a ranked list of codes, for which recall, among other metrics, is calculated at various pre-specified cutoffs (15 and 20). In the conducted experiments, a combination of all three classifiers yielded better results than any individual classifier or two-way combination. Their test set contains 187 discharge summaries and 1,068 codes, where only codes that occurred at least six times in the training set are included; their best results are 0.73 recall (top 15) and 0.78 recall (top 20).

Another attempt to assign ICD-9 codes to discharge summaries is made by de Lima et al. (1998), who interpret the task as an information retrieval problem, treating diagnosis codes as documents to be retrieved and discharge summaries as queries. They exploit the hierarchical structure of ICD to create a rule-based classifier that is shown to outperform the classic vector space model. Their test set contains 82 discharge summaries and 2,424 distinct ICD-9 codes; they only report precision at

31To illustrate this, the review by Stanfill et al. (2010) encompasses 113 studies, even if these include coding of other medical entities besides diagnoses. 32This approach is, of course, affected by potentially inaccurate coding, as discussed in the previous section: the model will reflect the mistakes in the data.

various recall levels and obtain a precision of 0.24 at 100% recall and 0.63 at 10% recall.

One of the relatively few systems that have actually been operationalized in the clinical setting is the Automatic Categorization System (Autocoder) at the Mayo Clinic, which led to the subsequent reduction in staff from 34 coders to 7 verifiers (Pakhomov et al., 2006). Autocoder is based on a two-step classifier, where the notion of certainty is used to determine whether subsequent manual review is necessary. The system tries to assign institution-specific diagnosis codes (based on ICD-8) to short diagnostic statements in patients' problem lists – arguably an easier task than the ones mentioned above, however, with considerably better performance. The authors do not report the number of unique diagnosis codes in the test sets, but the classification system contains 35,676 leaf nodes. Approximately 48% of the entries are encoded with almost 0.97 precision and recall – these are not manually reviewed – by using an example-based approach and leveraging the highly repetitive nature of diagnostic statements. If the classifier is insufficiently certain about a classification, it is passed on to a machine learning component. Here, a simple naïve Bayes classifier is trained to associate words with categories. The 34% of the entries that are classified by this component are encoded with approximately 0.87 precision and 0.94 recall. The remaining 18% of the entries are classified with a precision of 0.59 and a recall of 0.45. The authors note that their bag-of-words representation in the machine learning component is probably suboptimal.

An important milestone in the research efforts to provide computer-aided medical encoding is marked by the 2007 Computational Medicine Challenge33 (Pestian et al., 2007), a shared task in which 44 different research teams participated. The task was limited to a set of 45 ICD-9-CM codes (in 94 distinct combinations, with no unseen combinations in the test set), which were to be assigned to radiology reports. The systems were evaluated against a gold standard created using the majority annotation from three independent annotation sets. The winning contribution obtained a micro-averaged F1-score of 0.89. This score is comparable to the inter-annotator agreement, indicating that automated systems are close to human-level performance on this limited task. Many of the best-performing contributions rely heavily on hand-crafted rules, sometimes in combination with a machine learning component. Attempts have been made to eschew – entirely or partly – the need for hand-crafted rules, sacrificing some performance in return for decreased development time (Suominen et al., 2008; Farkas and Szarvas, 2008; Chen et al., 2010).

33http://computationalmedicine.org/challenge/previous

The approaches that have been proposed are varied (Goldstein et al., 2007; Crammer et al., 2007; Aronson et al., 2007); however, the best-performing systems handle negation and language use variability (particularly synonyms and hypernyms) (Pestian et al., 2007).

More recent attempts have utilized structured data in EHRs (Ferrao et al., 2012) and information fusion34 approaches (Lecornu et al., 2009; Alsun et al., 2010; Lecornu et al., 2011) for the automatic assignment of diagnosis codes. The numerous approaches to computer-aided medical encoding that have been proposed are, however, difficult to compare due to the heterogeneity of classification systems and evaluation methods (Stanfill et al., 2010).

2.5 ADVERSE DRUG REACTIONS

The presence of adverse events (AEs) in health care is inevitable; their prevalence and severity, however, are a massive concern, both in terms of (reduced) patient safety and the economic strain they put on the health care system. Reviewing health records manually – in a process commonly known as manual chart review – has traditionally been the sole means of detecting AEs. This approach is costly and imperfect, for obvious reasons. With the increasing adoption of EHRs, automatic methods have been developed to detect AE signals (see Murff et al. 2003 and Bates et al. 2003 for an overview).

One of the most common types of AE are adverse drug events (ADEs), which are caused by the use of a drug at normal dosage or due to an overdose. In the first case, the ADE is said to be caused by an adverse drug reaction (ADR). The prevalence of ADRs constitutes a major public health issue; in Sweden it has been identified as the seventh most common cause of death (Wester et al., 2008). To mitigate the detrimental effects caused by ADEs, it is important to be aware of the prevalence and incidence of ADRs. As clinical trials are generally not sufficiently extensive to identify all possible ADRs – especially less common ones – many are unknown when a new drug enters the market. Pharmacovigilance is therefore carried out throughout the life cycle of a pharmaceutical product and is typically supported by spontaneous reports and medical chart reviews (Chazard, 2011).

34Information fusion involves transforming information from different sources and different points in time into a unified representation (Boström et al., 2007).

Relying on spontaneous reports, however, results in substantial underreporting of ADEs (Bates et al., 2003). The prospect of being able to employ automatic methods to estimate the prevalence and incidence of ADRs, as well as to identify potentially unknown or undocumented ADRs, thus comes with great economic and health-related benefits.

2.5.1 AUTOMATICALLY DETECTING ADVERSE DRUG EVENTS

Adverse drug reactions can be identified from a number of potential sources. There has, for instance, been some previous work on identifying ADRs from online message boards: in one approach, ADRs are mined by calculating co-occurrences with drug names in a twenty-token window (Benton et al., 2011). With the increasing adoption of EHRs, several methods have been proposed for the automatic detection of ADEs and ADRs from clinical sources. Chazard (2011), for instance, extracts ADE detection rules from French and Danish EHRs. Decision trees and association rules are employed to mine the structured part of the EHRs, such as diagnosis codes and lab results; clinical narratives, in the form of discharge summaries, are only targeted when ATC35 codes are missing.
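The window-based co-occurrence idea attributed to Benton et al. (2011) can be sketched as follows; the tokenized input and the drug lexicon are illustrative assumptions, and the original work naturally involves considerably more filtering and statistics.

from collections import Counter

def window_cooccurrences(tokens, drug_names, window=20):
    # Count tokens occurring within +/- `window` positions of any drug mention.
    counts = Counter()
    for i, token in enumerate(tokens):
        if token in drug_names:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] not in drug_names:
                    counts[tokens[j]] += 1
    return counts

tokens = "started warfarin yesterday and now reports recurring nosebleeds".split()
print(window_cooccurrences(tokens, {"warfarin"}).most_common(3))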

Many other previous attempts also focus primarily on structured patient data, thereby missing out on potentially relevant information only available in the narrative parts of the EHR. There have, however, been text mining approaches to the detection of ADRs, too (see Warrer et al. 2012 for a literature review). Wang et al. (2010) first transform unstructured clinical narratives into a structured form (identified UMLS concepts) and then calculate co-occurrences with drugs. Aramaki et al. (2010) assume a similar two-step approach – identifying terms first and then relations – and estimate that approximately eight percent of their Japanese clinical narratives contain ADE information. LePendu et al. (2012, 2013) propose methods for carrying out pharmacovigilance using clinical notes and demonstrate their ability to flag adverse events, in some cases before an official alert is made. Eriksson et al. (2013) use a dictionary of Danish ADE terms and propose a rule-based pipeline system that identifies possible ADEs in clinical text; a manual evaluation showed that their system is able to identify possible ADEs with a precision of 89% and a recall of 75%.

35Anatomical Therapeutic Chemical codes are used for the classification of drugs.

CHAPTER 3

METHOD

It is not uncommon in research to employ research methods in a rather implicit manner, without due deliberation. The prevailing research methods in a given research field are often taken for granted and tend to conceal the fact that they are based on certain philosophical assumptions (Myers, 2009). Here, light will be shed on these aspects to make them more explicit. This will also help to explain and motivate the general research strategy assumed in the thesis, as well as the methodological choices made in the included studies.

After a description of the research strategy, including a characterization of the research and a discussion around the underlying assumptions, the evaluation framework – based on, and an integral part of, the former – is laid out in more detail. This is followed by a description of the methods and materials used in the included studies, as well as motivations (and alternatives) for their use. The experimental setup of each included study is then summarized individually. Finally, ethical issues that arise when conducting research on clinical data, including how those impact on the scientific aspects of the work, are discussed briefly.

3.1 RESEARCH STRATEGY

The research strategy (see Figure 3.1) assumed in this thesis is predominantly quantitative in nature, both in that the employed NLP methods require substantial amounts of data and – more importantly from a scientific perspective – in that the evaluation is performed with statistical measures. There are, however, also qualitative components, particularly in the analysis phase, which allow deeper insights into the studied phenomena to be gained. This triangulation of methods is particularly important when the data sets used for evaluation are not wholly reliable, as they seldom are1.

While the semantic spaces are inductively constructed from the data, the research can generally be characterized as deductive: a hypothesis is formulated and verified empirically and quantitatively. Even if hypothesis testing2 is not formally carried out – and hypotheses are not always stated explicitly – introduced methods or techniques are invariably evaluated empirically.

The research can be characterized as exploratory on one level and causal on another. There is not a single problem that motivates the specific use of distributional semantics; instead, the potential leveraging of distributional semantics to solve multiple problems in the clinical domain is investigated in an exploratory fashion. The three targeted application areas are not in themselves critical in answering the overarching research question. The research is also, to a large extent, causal: in many of the included studies we are investigating the effect of one variable – the introduction of some technique or method – on another.

The techniques or methods are evaluated in an experimental setting. In fact, the experimental approach underlies much of the general research strategy. This is a well-established means of evaluating NLP systems and adheres to the Cranfield paradigm, which has its roots in experiments that were carried out in the 1960s to evaluate indexing systems (Voorhees, 2002). In this evaluation paradigm, two or more systems are evaluated on identical data sets using statistical measures of performance. There are two broad categories of evaluation: system evaluation and

1This is discussed in the context of synonym extraction in Section 5.1. 2In hypothesis testing, conclusions are drawn using statistical (frequentist) inference. The hypothesis we wish to test is called the null hypothesis and is rejected or accepted according to some level of significance. When comparing two classifiers, statistical significance tests essentially determine, with a given degree of confidence, whether the observed difference in performance (according to some metric) is significant (Alpaydin, 2010).

user-based evaluation. In contrast to user-based evaluations, which involve users and more directly measure a pre-specified target group’s satisfaction with a system, system evaluations circumvent the need for direct access to large user groups by creating reference standards or gold standards. These are pre-created data sets that are treated as the ground truth, to which the systems are then compared. It could be argued that NLP systems created for the clinical domain should be evaluated in a real, clinical setting – and not in a laboratory – since they are likely to have more specific applications than general NLP systems (Friedman and Hripcsak, 1999). However, when the performance of a system still needs to be improved substantially, it is much more practical to employ a system evaluation, which is the approach assumed here. When the performance approaches a clinically acceptable level, user-based evaluations in the clinical setting would be more meaningful.

In quantitative research, a positivist philosophical standpoint is often assumed, to which an interpretivist3 position is then juxtaposed. While these positions are usefully opposed4, many do not subscribe wholly and unreservedly to either of the views. In the Cranfield paradigm, with its experimental approach, there is certainly a predominance of positivist elements. This is naturally also the case when performing predictive modeling, which all three applications in this thesis can, in some sense, be viewed as instances of. As predictive models are necessarily based on a logic of hypothetico-deduction, research that involves predictive modeling is also based on positivist assumptions. NLP research also entails working with natural language data, which are recognized to be socially constructed and highly dependent on domain, context, culture and time. This represents a vantage point more in line with that of interpretivism; in practice, however, language data are many times treated as the objective starting point for NLP research. That said, NLP methods need to be adapted – or must be able to adapt – according to the aforementioned characteristics of language.

3Interpretivism is sometimes referred to as antipositivism. 4Some have, however, argued that these can be combined in a single study (Myers, 1997).

Figure 3.1: RESEARCH STRATEGY. A characterization of the research strategy.

3.2 EVALUATION FRAMEWORK

As mentioned earlier, the evaluation is mainly conducted quantitatively with statistical measures of performance in an experimental setting, where one system is compared to another, which is sometimes referred to as the baseline (system). The hypothesis is often in the simple form of:

System A will perform better than System B on Task T given Performance Metric M, even if it is not always explicitly stated as such. The performance measures are commonly calculated by comparing the output of a system with a reference standard, which represents the desired output as defined by human experts. The automatic evaluations conducted in the included studies are extrinsic – as opposed to intrinsic – which means that the semantic spaces, along with their various ways of representing meaning, are evaluated according to their utility, more specifically their ability to perform a predefined task (Resnik and Lin, 2013).

3.2.1 PERFORMANCE METRICS

Widely used statistical measures of performance in information retrieval and NLP are precision and recall (Alpaydin, 2010). To calculate these, the number of true positives (tp), false positives (fp) and false negatives (fn) are needed. For some measures, the number of true negatives (tn) is also needed. True positives are correctly labeled instances, false positives are instances that have been incorrectly labeled as positive and false negatives are instances that have been incorrectly labeled as negative. Precision (P), or positive predictive value (PPV), is the proportion of positively labeled instances that are correct (Equation 3.1).

Precision: P = \frac{tp}{tp + fp} \quad (3.1)

Recall (R), or sensitivity, is the proportion of all actually positive instances that are correctly labeled as such (Equation 3.2).

Recall: R = \frac{tp}{tp + fn} \quad (3.2)

There is generally a tradeoff between precision and recall: by assigning more instances to a given class (i.e., more positives), recall may, as a consequence, increase – some of these may be true (tp) – while precision may decrease since it is likely that the number of false positives will, simultaneously, increase. Depending on the application and one's priorities, one can choose to optimize a system for either precision or recall. F-measure incorporates both notions and can be weighted to prioritize either of the measures. When one does not have a particular preference, the F1-score is often employed; it is defined as the harmonic mean of precision and recall.
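A minimal sketch of these metrics, computed from raw counts (the example counts below are made up):

def precision(tp: int, fp: int) -> float:
    # Equation 3.1: proportion of positively labeled instances that are correct.
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    # Equation 3.2: proportion of actually positive instances that are found.
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p: float, r: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if p + r else 0.0

p, r = precision(tp=8, fp=12), recall(tp=8, fn=2)
print(round(p, 2), round(r, 2), round(f1_score(p, r), 2))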

In the included studies, recall is generally prioritized, as many of the tasks are concerned with extracting terms from a huge sample space – equal to the size of the vocabulary. In the case of diagnosis code assignment, the sample space is also large – equal to the number of ICD-10 codes in the data (over 12,000). If seen as classification tasks, they can be characterized as multi-class, multi-label problems. In such problems, the goal is to assign one or more labels to each instance in an instance space, where each label associates an instance with one of k possible

classes. In the cases of synonyms, diagnosis codes and adverse drug reactions, one does not know a priori how many labels should be assigned. This makes the problem all the more challenging, which is the motivation behind initially focusing on recall. Since the number of labels is unknown – and varies with each instance – the approach assumed in this thesis is to pre-specify the number of labels, usually ten or twenty. In the typical case, the number of actually positive instances is much smaller than this pre-specified number, which means that precision will be low. However, whenever precision is reported, the ranking of the true positives should reasonably be taken into account. Here, this is implemented as a form of weighted precision (Equation 3.3).

Weighted Precision: P_w = \frac{\sum_{i=0}^{j-1} (j - i) \cdot f(i)}{\sum_{i=0}^{j-1} (j - i)} \quad (3.3)

where

f(i) = \begin{cases} 1 & \text{if } i \in \{tp\} \\ 0 & \text{otherwise} \end{cases}

and j is the pre-specified number of labels – here, ten or twenty – and {tp} is the set of true positives. In words, this assigns a score to each true positive according to its (reverse) ranking in the list, sums the scores and divides the total score by the maximum possible score (where all j labels are true positives).

Recall, on the other hand, is not affected by ranking: as long as a true positive makes it above the j cut-off, its position in the list is ignored. In all of the included studies where automatic evaluation is conducted – i.e., all save Paper V – recall top j (where j = 10 or 20) is reported (Equation 3.4).

Recall (top j): R_{top\,j} = \frac{tp}{tp + fn} \quad \text{in a list of } j \text{ labeled instances} \quad (3.4)

Although j is somewhat arbitrarily chosen, it is motivated from the perspective of utility: in a decision support system, how many items is it reasonable for a human to inspect? In the targeted applications, ten or twenty seems reasonable.
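A minimal sketch of Equations 3.3 and 3.4 over a ranked candidate list; the variable names and the toy example are illustrative assumptions.

def weighted_precision(ranked_labels, gold, j=10):
    # Equation 3.3: score each true positive by its reverse rank and
    # normalize by the maximum possible score.
    ranked = ranked_labels[:j]
    score = sum(j - i for i, label in enumerate(ranked) if label in gold)
    max_score = sum(j - i for i in range(j))
    return score / max_score

def recall_top_j(ranked_labels, gold, j=10):
    # Equation 3.4: proportion of gold labels found anywhere in the top j.
    tp = len(set(ranked_labels[:j]) & set(gold))
    return tp / len(gold) if gold else 0.0

ranked = ["hjärtinfarkt", "infarkt", "stroke", "angina", "e", "f", "g", "h", "i", "k"]
gold = {"infarkt", "angina"}
print(weighted_precision(ranked, gold), recall_top_j(ranked, gold))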

3.2.2 RELIABILITY AND GENERALIZATION

Metrics like the ones described above are useful in estimating the expected performance – measured according to one or more criteria – of a system. There are, however, two closely related aspects of system evaluations that are important to consider: (1) reliability, i.e., the extent to which we can rely on the observed performance of a system, and (2) generalization, i.e., the extent to which we can expect the observed performance to be maintained when we apply the system to unseen data.

The reliability of a performance estimation depends fundamentally on the size of the data used in the evaluation: the more instances we have in our (presumably representative) evaluation data, the greater our statistical evidence is and thus the more confident we can be about our results. This can be estimated by calculating a confidence interval or, when comparing two systems, by conducting a statistical significance test.

However, in order to estimate the performance of a system on unseen data – as we wish to do whenever prediction is the aim – we need to ensure that our evaluation data is distinct from the data we use for training our model. The goal, then, is to fit our model to the training data while eschewing both underfitting – where the model is less complex than the function underlying the data – and overfitting, where the model is too complex and fitted to peculiarities of the training data rather than the underlying function. In order to estimate generalization ability and, ultimately, expected performance on unseen data, it is good practice to partition the data into three sets: a training set, a development set and an evaluation set. The training set, as the name suggests, is used for training the model; the generalization ability of a model typically improves with the number of training instances (Alpaydin, 2010). The development set is used for model selection and for estimating generalization ability: the model with the highest performance (and the one with the best inductive bias) on the development set is the best one. This is done in a process known as cross-validation. In order to estimate performance on unseen data, however, the best model needs to be re-evaluated on the evaluation set, as the development set has effectively become part of the training set5 once it has been used for model selection. When conducting comparative evaluations, all modifications to the (use of the) learning algorithm, including parameter tuning, must be made

5Sometimes the training and development sets are jointly referred to as the training set, and the evaluation set as the test set.

prior to seeing the evaluation data. Failing to do so may lead to the reporting of results that are overly optimistic (Salzberg, 1997).
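A minimal sketch of the three-way partition; the 80/10/10 proportions and the fixed random seed are illustrative assumptions, not choices taken from the included studies.

import random

def split_data(instances, train=0.8, dev=0.1, seed=42):
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = int(len(shuffled) * train)
    n_dev = int(len(shuffled) * dev)
    return (shuffled[:n_train],                  # used to fit the model
            shuffled[n_train:n_train + n_dev],   # used for model selection
            shuffled[n_train + n_dev:])          # untouched until final evaluation

train_set, dev_set, eval_set = split_data(list(range(100)))
print(len(train_set), len(dev_set), len(eval_set))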

It is also worth underscoring the fact that empirical evaluations of this kind are in no way domain independent: any conclusions we draw from our analyses – based on our empirical results – are necessarily conditioned on the data set we use and the task we apply our models to. The No Free Lunch Theorem (Wolpert, 1995) states that there is no such thing as a universally optimal learning algorithm – for some data sets it will be accurate and for others it will be poor. By the same token, it seems unreasonable to believe that there is a single optimal way to model semantics; the results reported here are dependent on a specific domain (or data set) and particular applications.

Quantitative results are supplemented with qualitative analyses, adding to the understanding of quantitative performance measurements. In fact, in all of the included studies, a qualitative evaluation is performed in what is commonly known as an error analysis. This allows deeper insights into the studied phenomena to be gained.

3.3 METHODS AND MATERIALS

In this section, the methods and materials – corpora, terminologies and knowledge bases, NLP tools and software packages – that are used in the covered studies are enumerated and described. Underlying techniques, unless they have been described earlier, are also introduced briefly. Precisely how these methods and materials are used in the included studies is discussed in more detail in the subsequent section, Section 3.4, which describes the experimental setup of each study.

3.3.1 CORPORA

The two main resources for lexical knowledge acquisition are corpora and machine-readable dictionaries (Matsumoto, 2003). As models of distributional semantics rely on large-scale observations of language use, they generally induce semantic spaces from a corpus – a computer-readable collection of text or speech

(Jurafsky et al., 2000). The corpora used in the included studies are extracted from three sources:

1. Stockholm EPR Corpus: The Stockholm Electronic Patient Record Corpus 2. MIMIC-II: The Multiparameter Intelligent Monitoring in Intensive Care database 3. Läkartidningen: The Journal of the Swedish Medical Association

It is important to note that the Stockholm EPR Corpus and Läkartidningen are written in Swedish, while the MIMIC-II database contains clinical notes written in English. Two of the corpora (Stockholm EPR Corpus and MIMIC-II) are clinical – which is the focus of this thesis – while the third (Läkartidningen) is a medical, non-clinical corpus. The data sets used to induce the semantic spaces are extracted from these three sources (Table 3.1).

Corpus           (Non-)Clinical   Language   Total Size   Used Size
Stockholm EPR    Clinical         Swedish    110M         42.5M
MIMIC-II         Clinical         English    250M         250M
Läkartidningen   Non-clinical     Swedish    20M          20M

Table 3.1: CORPORA. An overview of the corpora used in the included studies. Corpus size is given as the number of tokens.

STOCKHOLM EPR CORPUS

The Stockholm EPR Corpus (Dalianis et al., 2009, 2012) comprises Swedish health records extracted from TakeCare, an EHR system used in the Stockholm City Council, and encompasses approximately one million records from around 900 clinical units. Papers II–V utilize a subset of the Stockholm EPR Corpus comprising all health records from the first half of 2008, a statistical description of which is provided by Dalianis et al. (2009).

MIMIC-II

The MIMIC-II database is a publicly available database encompassing clinical data for over 40,000 hospital stays of more than 32,000 patients, collected over a

seven-year period (2001-2007) from intensive care units (medical, surgical, coronary and cardiac surgery recovery) at Boston's Beth Israel Deaconess Medical Center. In addition to various structured data, such as laboratory results and ICD-9 diagnosis codes, the database contains text-based records written in English, including discharge summaries, nursing progress notes and radiology interpretations (Saeed et al., 2011). In Paper I, a corpus comprising all text-based records from the MIMIC-II database is utilized.

LÄKARTIDNINGEN

The Läkartidningen corpus comprises the freely available section of the Journal of the Swedish Medical Association (1996-2005) (Kokkinakis, 2012). Läkartidningen is a weekly journal written in Swedish and contains articles discussing new scientific findings in medicine, pharmaceutical studies, health economic evaluations, etc. Although these issues have been made available for research, the original order of the sentences has not been retained; the sentences are given in a random order. The entire corpus is used in Paper II.

3.3.2 TERMINOLOGIES

Medical terminologies and ontologies are important tools for natural language processing of health record narratives. The following resources are used in the included studies:

1. SNOMED CT: Systematized Nomenclature of Medicine – Clinical Terms 2. ICD-10: International Classification of Diseases, Tenth Revision 3. MeSH: Medical Subject Headings 4. FASS: Pharmaceutical Specialties in Sweden 5. Abbreviations and Acronyms: Medical Abbreviations and Acronyms

In some instances, these resources are used as reference standards for automatic (Papers I–IV) or manual evaluation (Paper V) and, in others, they are leveraged by the developed NLP methods (Papers III and V).

SNOMED CT

Systematized Nomenclature of Medicine – Clinical Terms, SNOMED CT (IHTSDO, 2013a), has emerged as the de facto terminological standard for representing clinical concepts in EHRs and is today used in more than fifty countries. Currently, it is only available in a handful of languages – US English, UK English, Spanish, Danish and Swedish – but translations – into French, Lithuanian and several other languages – are underway (IHTSDO, 2013b). SNOMED CT encompasses a wide range of topics and its scope includes, for instance, clinical findings, procedures, body structures and social contexts. The concepts are organized in hierarchies with multiple levels of granularity and are linked by relations such as is a. There are well over 300,000 active concepts in SNOMED CT and over a million relations. Each concept consists of a Concept ID, a Fully Specified Name, a Preferred Term and Synonyms; a preferred term can have zero or more synonyms. Both the English and the Swedish versions are used in the included studies. The English version of SNOMED CT contains synonyms, whereas the current Swedish version does not.

ICD-10

International Statistical Classification of Diseases and Related Health Problems, 10th Revision, ICD-10 (World Health Organization, 2013), is a standardized statistical classification system for encoding diagnoses. The main purpose of ICD-10 is to enable classification and statistical description of diseases and other health problems that cause death or contact with health care. See Section 2.4 for more details.

MESH

Medical Subject Headings, MeSH (U.S. National Library of Medicine, 2013), is a controlled vocabulary thesaurus that serves the purpose of indexing medical literature. It comprises terms naming descriptors in a hierarchical structure that allows searching at various levels of specificity. There are 26,853 descriptors in the current (2013) version of MeSH. In the current Swedish version of MeSH, 26,542 of the descriptors have been translated into Swedish (Karolinska Institutet

Universitetsbiblioteket, 2013). This resource is one of very few standard terminologies that contain synonyms for medical terms in Swedish.

FASS

Farmaceutiska Specialiteter i Sverige6, FASS (Läkemedelsindustriföreningens Service AB, 2013), contains descriptions, including adverse reactions, of all drugs that have been approved and marketed in Sweden. There are three versions of FASS: one for health care personnel, one for the public and one for veterinarians. The descriptions are provided by the pharmaceutical companies that produce the drugs, but they must be approved by the authorities.

ABBREVIATIONS AND ACRONYMS

A list of medical abbreviations and their corresponding expansions/definitions is extracted from a book called Medicinska förkortningar och akronymer (Cederblom, 2005). These have been collected manually from Swedish health records, newspapers and scientific articles.

3.3.3 TOOLS AND TECHNIQUES

The following existing NLP tools are used – and, in some cases, modified – in the included studies:

1. Granska Tagger: Swedish Part-of-Speech Tagger and Lemmatizer 2. TextNSP: Ngram Statistics Package 3. CoSegment: Collocation Segmentation Tool 4. Swedish NegEx: Swedish Negation Detection Tool 5. JavaSDM: Java-based Random Indexing Package

6In English: Pharmaceutical Specialties in Sweden

These tools allow, or at least facilitate, certain NLP tasks to be carried out. Here, these tasks will be discussed, including why they are performed and possible alternative means of carrying them out.

BASIC PREPROCESSING

Some amount of basic preprocessing of the data is almost invariably carried out in NLP, such as tokenization, which is the process whereby words are separated out from running text (Jurafsky et al., 2000). The type of preprocessing, however, also depends on the application or the method. When applying models of distributional semantics, it is common to remove tokens which carry little meaning in themselves, or which contribute very little to determining the meaning of nearby tokens. In all of the included studies, punctuation marks and digits are removed.

STOP-WORD REMOVAL

A category of words that carry very little meaning or content is usually referred to as stop words (Jurafsky et al., 2000). Precisely how to determine which words are stop words is debated; however, a common property of stop words is that they are high-frequency words. As such, they often appear in a large number of contexts and thus do not contribute to capturing distinctions in meaning. Function words and words that belong to closed classes, such as the definite article the, are prime examples of stop words. When applying models of distributional semantics, it is not uncommon first to remove stop words. However, for some models – for instance, Random Permutation, where word order is, to a larger degree, taken into account – it has been shown that employing extensive stop lists may degrade performance (Sahlgren et al., 2008).

TERM NORMALIZATION

When applying models of distributional semantics – although it depends on the task – some form of term normalization, such as stemming or lemmatization, is typically performed. Stemming and lemmatization are two alternative means of handling morphology: stemming reduces inflected words to a common stem, while

lemmatization transforms words into their lemma form (Jurafsky et al., 2000). One reason for conducting term normalization when applying models of distributional semantics is that, when modeling semantics at a more coarse-grained level, inflectional information is of less importance. It, moreover, helps to deal with potential data sparsity issues since, by collapsing different word forms to a single, canonical form, the statistical foundation of that term will be strengthened. The Granska Tagger (Knutsson et al., 2003), which has been developed for general Swedish7 with Hidden Markov Models, is used to perform lemmatization in Papers II–V.

TERM RECOGNITION

In both Papers I and IV, multiword terms need to be recognized in the text. There are many ways of doing this, including simple string matching against an existing terminology. There are also statistical approaches that are fundamentally based on identifying collocations, i.e., words that, with a high probability, occur with other words (Mitkov, 2003). In this thesis, two statistical approaches are employed to extract and recognize terms in clinical text. In Paper I, term extraction is performed by first extracting n-grams. Word n-grams are sequences of consecutive words, where n is the number of words in the sequence (Jurafsky et al., 2000). Unigrams, bigrams and trigrams are extracted (i.e., where n is 1, 2 and 3, respectively). For extracting n-grams, TextNSP (Banerjee and Pedersen, 2003) is used. Once n-grams have been extracted, their C-value (Frantzi and Ananiadou, 1996; Frantzi et al., 2000) is calculated. This statistic has been used successfully for term recognition in the biomedical domain, largely due to its ability to handle nested terms (Zhang et al., 2008). It is based on term frequency and term length (number of words); if a candidate term is part of a longer candidate term, it also takes into account how many other terms it is part of and how frequent those longer terms are (Equation 3.5).

7A small-scale evaluation on clinical text showed that performance on this particular domain was still very high: 92.4% accuracy, which is more or less in line with state-of-the-art performance of Swedish part-of-speech taggers (Hassel et al., 2011).

C\text{-value}(a) = \begin{cases} \log_2 |a| \cdot f(a) & \text{if } a \text{ is not nested} \\ \log_2 |a| \cdot \left( f(a) - \frac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right) & \text{otherwise} \end{cases} \quad (3.5)

where a is the candidate term, f(a) is the term frequency of a, b is a longer candidate term, f(b) is the term frequency of b, T_a is the set of terms that contain a, P(T_a) is the size of T_a, and |a| is the length of a (number of words).
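A minimal sketch of Equation 3.5; `freqs` is an assumed mapping from candidate terms (tuples of words) to their corpus frequencies, covering all extracted n-grams.

from math import log2

def c_value(a, freqs):
    # T_a: longer candidate terms that contain a as a contiguous subsequence.
    longer = [b for b in freqs if len(b) > len(a) and
              any(b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1))]
    if not longer:  # a is not nested in any longer candidate term
        return log2(len(a)) * freqs[a]
    nested_freq = sum(freqs[b] for b in longer) / len(longer)  # (1/P(T_a)) * sum of f(b)
    return log2(len(a)) * (freqs[a] - nested_freq)

freqs = {("blood", "pressure"): 40, ("high", "blood", "pressure"): 25}
print(c_value(("blood", "pressure"), freqs))  # 1 * (40 - 25) = 15.0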

In Paper IV, an existing tool called CoSegment (Daudaravicius, 2010) is used instead. The tool uses the Dice score (Equation 3.6) to measure the association strength between two words; this score is sensitive to low-frequency word pairs. The Dice scores between adjacent words are used to identify the boundaries of collocation segments. A threshold can then be set for the collocation segmentation process: a higher threshold means that stronger statistical associations between constituents are required and fewer collocations will be identified.

$$\text{Dice:}\quad D_{x_{i-1},\,x_i} = \frac{2 \cdot f(x_{i-1}, x_i)}{f(x_{i-1}) + f(x_i)} \qquad (3.6)$$

where $f(x_{i-1}, x_i)$ is the frequency with which the adjacent words $x_{i-1}$ and $x_i$ co-occur; $f(x_{i-1})$ and $f(x_i)$ are the respective (independent) frequencies of the two words.
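As a rough illustration of how the two statistics could be computed, the following sketch implements Equations 3.5 and 3.6. The data structures (frequency dictionaries keyed by word tuples) are assumptions made for the sketch and do not reflect the actual interfaces of TextNSP or CoSegment:

```python
# Illustrative implementations of the two term-recognition statistics.
import math

def c_value(term, freq, nesting):
    """C-value of a candidate term following Equation 3.5.

    term    : tuple of words, e.g. ("myocardial", "infarction")
    freq    : dict mapping candidate terms (word tuples) to corpus frequency
    nesting : dict mapping a term to the set of longer candidates containing it
    """
    # Note: log2|a| zeroes out unigrams as written; practical C-value
    # variants adjust this length factor.
    length_factor = math.log2(len(term))
    longer = nesting.get(term, set())
    if not longer:                        # term is not nested
        return length_factor * freq[term]
    avg_longer_freq = sum(freq[b] for b in longer) / len(longer)
    return length_factor * (freq[term] - avg_longer_freq)

def dice(f_pair, f_left, f_right):
    """Dice association between two adjacent words (Equation 3.6)."""
    return 2.0 * f_pair / (f_left + f_right)
```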

NEGATION DETECTION

Recognizing terms in text is not sufficient; knowing whether they are affirmed or negated can be of critical importance, especially given the prevalence of negations in the clinical domain (Chapman et al., 2001). Here, a Swedish version of NegEx (Skeppstedt, 2011) is used. This system attempts to determine if a given clinical entity has been negated or not by looking for negation cues in the vicinity of the clinical entity.

SEMANTIC SPACES

For generating the semantic spaces, a Java implementation of Random Indexing – with and without permutations – is used and extended: JavaSDM (Hassel, 2004). See Section 2.2.3 for a detailed description of Random Indexing and how semantic spaces are constructed. In all of the conducted experiments, the number of non-zero elements in the index vectors is set to eight: four +1s and four -1s. Moreover, weighting is applied to the index vectors in the accumulation of context vectors (Equation 3.7).

$$w_{iv} = 2^{\,1 - d_{it,tt}} \qquad (3.7)$$

where $w_{iv}$ is the weight applied to the index vector and $d_{it,tt}$ is the distance from the index term to the target term.
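The following minimal sketch illustrates this accumulation under the settings described above: sparse ternary index vectors with eight non-zero elements and the distance weighting of Equation 3.7. It is a simplified illustration, not the JavaSDM implementation actually used:

```python
# A minimal Random Indexing sketch (illustrative; not the JavaSDM code).
import numpy as np

rng = np.random.default_rng(0)

def index_vector(dim=1000, nonzero=8):
    """Sparse ternary index vector: four +1s and four -1s."""
    v = np.zeros(dim)
    positions = rng.choice(dim, size=nonzero, replace=False)
    v[positions[: nonzero // 2]] = 1.0
    v[positions[nonzero // 2 :]] = -1.0
    return v

def build_context_vectors(tokens, dim=1000, window=2):
    """Accumulate distance-weighted index vectors of neighbouring terms."""
    index = {w: index_vector(dim) for w in set(tokens)}
    context = {w: np.zeros(dim) for w in set(tokens)}
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                weight = 2.0 ** (1 - abs(i - j))   # Equation 3.7
                context[target] += weight * index[tokens[j]]
    return context
```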

There are many means of calculating the distance between two context vectors and thereby estimating the semantic similarity between two entities (Jurafsky et al., 2000). Metrics such as Euclidean or Manhattan distance are rarely used to measure term similarity, since both are particularly sensitive to extreme values, which high-frequency terms will have. The (raw) dot product is likewise problematic as a similarity metric due to its favouring of long vectors, i.e., vectors with more non-zero elements and/or elements with high values; again, this results in high-frequency terms obtaining disproportionately high similarity scores with other terms. A solution is to normalize the dot product by, for instance, dividing it by the lengths of the two vectors. This is in effect the same as taking the cosine of the angle between the two vectors. The cosine similarity metric (Equation 3.8) is widespread in information retrieval and other fields8; it is also the similarity metric employed in all of the experiments conducted here.

$$\mathrm{sim}_{\mathrm{cosine}}(\vec{v}, \vec{w}) = \frac{\sum_{i=1}^{N} v_i \cdot w_i}{\sqrt{\sum_{i=1}^{N} v_i^2} \, \sqrt{\sum_{i=1}^{N} w_i^2}} \qquad (3.8)$$

where $\vec{v}$ and $\vec{w}$ are the two vectors and N is the dimensionality of the vectors.
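As a minimal sketch, Equation 3.8 translates directly into code:

```python
import numpy as np

def cosine_similarity(v, w):
    """Cosine similarity (Equation 3.8) between two context vectors."""
    denom = np.linalg.norm(v) * np.linalg.norm(w)
    return float(v @ w) / denom if denom else 0.0
```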

8Other common similarity metrics include the Jaccard similarity coefficient and the Dice coefficient (Jurafsky et al., 2000).

3.4 EXPERIMENTAL SETUPS

The experimental setup of each study is described below. Consult Table 3.2 for an overview of the methods and materials used in the various studies.

Paper   Corpus                                 NLP Tools                           Terminologies            Evaluation
I       MIMIC-II                               JavaSDM, TextNSP                    SNOMED CT                Recall Top 20
II      Stockholm EPR Corpus, Läkartidningen   JavaSDM, Granska Tagger             MeSH, W. Abbreviations   Precision, Recall Top 10
III     Stockholm EPR Corpus                   JavaSDM, Granska Tagger, NegEx      ICD-10, SNOMED CT        Recall Top 10
IV      Stockholm EPR Corpus                   JavaSDM, Granska Tagger, CoSegment  ICD-10                   Recall Top 10
V       Stockholm EPR Corpus                   Granska Tagger                      FASS, SNOMED CT          Manual (counting)

Table 3.2: METHODS AND MATERIALS. An overview of the methods and materials used in the various studies.

PAPER I: IDENTIFYING SYNONYMY BETWEEN SNOMED CLINICAL TERMS OF VARYING LENGTH USING DISTRIBUTIONAL ANALYSIS OF ELECTRONIC HEALTH RECORDS

In this paper, an attempt is made to identify synonymy between terms of varying length in a distributional semantic framework. A method is proposed wherein terms are identified using statistical techniques and – through simple concatenation – treated as single tokens, after which a semantic (term) space is induced from the preprocessed corpus. The research question is to what extent such a method can be used to identify synonymy between terms of varying length. For the sake of comparison, a traditional word space is also induced. The semantic spaces are evaluated on the task of identifying synonyms of SNOMED CT preferred terms. The motivation behind this study is that distributional semantics could potentially be used to support terminology development and the translation of standard terminologies, like SNOMED CT, into additional languages for which there are preferred terms but no synonyms.

The experimental setup can be summarized in the following steps: (1) data preparation, (2) term extraction and identification, (3) model building and parameter tuning, and (4) evaluation (Figure 3.2). Semantic spaces are constructed with various parameter settings on two data set variants: one with unigram terms and one with multiword terms. The models – and, in effect, the method – are evaluated for their ability to identify synonyms of SNOMED CT preferred terms. After optimizing the parameter settings for each group of models on a development set, the best models are evaluated on unseen data.

Figure 3.2: EXPERIMENTAL SETUP. An overview of the process and the experimental setup.

In Step 1, the clinical data from which the term spaces will be induced is extracted and preprocessed. We create a corpus comprising all text-based records from the MIMIC-II database. The documents in the corpus are then preprocessed to remove metadata, incomplete sentence fragments, digits and punctuation marks.

In Step 2, we extract and identify multiword terms in the corpus by ranking n-grams according to their C-value (see Section 3.3.3 for more details). We then apply a number of filtering rules that remove terms beginning and/or ending with certain words, e.g. prepositions (in, from, for) and articles (a, the). Another alteration to the term list – now ranked according to C-value – is to give precedence to SNOMED CT preferred terms by adding or moving them to the top of the list, regardless of their C-value (or failed identification). The term list is then used to perform exact string matching on the entire corpus: multiword terms with a higher C-value than their constituents are concatenated. We thus treat multiword terms as separate tokens with their own particular distributions in the data, to a greater or lesser extent different from those of their constituents.
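A minimal sketch of this concatenation step is given below. The greedy longest-match strategy and the underscore joining convention are simplifications for illustration; the actual procedure additionally compares the C-value of a term with those of its constituents:

```python
def concatenate_terms(tokens, term_set, max_n=3):
    """Greedy longest-match of multiword terms (word tuples) in `term_set`."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_n, 1, -1):            # prefer trigrams over bigrams
            candidate = tuple(tokens[i:i + n])
            if len(candidate) == n and candidate in term_set:
                out.append("_".join(candidate))  # treat the term as one token
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

# concatenate_terms(["acute", "renal", "failure", "noted"],
#                   {("acute", "renal", "failure")})
# -> ["acute_renal_failure", "noted"]
```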

In Step 3, semantic spaces are induced from the two data set variants: one containing only unigram terms (UNIGRAM TERMS) and one that also contains longer terms: unigram, bigram and trigram terms (MULTIWORD TERMS). The following model parameters are experimented with:

- Model Type: Random Indexing (RI), Random Permutation (RP)
- Sliding Window: 1+1, 2+2, 3+3, 4+4, 5+5, 6+6 surrounding terms
- Dimensionality: 500, 1000, 1500 dimensions

Evaluation takes place in both Step 3 and Step 4. The semantic spaces are evaluated for their ability to identify synonyms of SNOMED CT preferred terms that appear at least fifty times in the corpus. A preferred term is provided as input to a semantic space and the twenty most semantically similar terms are output, provided that they also appear at least fifty times in the data. Recall top 20 is calculated for each preferred term, i.e. the proportion of its synonyms that are identified in a list of twenty suggestions. The SNOMED CT data is divided into a development set and an evaluation set in a 50/50 split. The development set is used in Step 3 to find the optimal parameter settings for the respective data sets and the task at hand. The best parameter configuration for each type of model (UNIGRAM TERMS and MULTIWORD TERMS) is then used in the final evaluation in Step 4.
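A sketch of this evaluation procedure is given below; the nearest-neighbour lookup is an assumed interface standing in for a query against a semantic space:

```python
def recall_top_k(nearest_neighbours, reference, k=20):
    """Average recall top-k over all preferred terms.

    nearest_neighbours : callable(term, k) -> list of the k most similar terms
                         (assumed interface to a semantic space)
    reference          : dict mapping preferred terms to sets of known synonyms
    """
    recalls = []
    for preferred, synonyms in reference.items():
        candidates = set(nearest_neighbours(preferred, k))
        recalls.append(len(candidates & synonyms) / len(synonyms))
    return sum(recalls) / len(recalls)
```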

PAPER II: SYNONYM EXTRACTION AND ABBREVIATION EXPANSION WITH ENSEMBLES OF SEMANTIC SPACES

This is a significant extension of the work published in (Henriksson et al., 2012). In this paper, our hypothesis is that combinations of semantic spaces can perform better than a single semantic space on the tasks of extracting synonyms and abbreviation-expansion pairs. Ensembles of semantic spaces allow different distributional semantic models, with different parameter configurations and induced from different types of corpora, to be combined.

The experimental setup can be divided into the following steps: (1) corpora preprocessing, (2) construction of semantic spaces from the two corpora, (3) identification of the most profitable single-corpus combinations, (4) identification of the most profitable multiple-corpora combinations, (5) evaluation of the single-corpus and multiple-corpora combinations, and (6) post-processing of candidate terms.

Figure 3.3: ENSEMBLES OF SEMANTIC SPACES. An overview of the strategy of combining semantic spaces constructed with different models of distributional semantics – with potentially different parameter configurations – and induced from different corpora.

Once the corpora have been preprocessed, ten semantic spaces are induced from each corpus with different context window sizes (Random Permutation spaces are also induced with and without stop words). Ten pairs of semantic spaces are then combined using three different combination strategies. These are evaluated on three tasks – (1) abbreviations → expansions, (2) expansions → abbreviations and (3) synonyms – using the development subsets of the reference standards (a list of medical abbreviation-expansion pairs and MeSH synonyms). Performance is mainly measured as recall top 10, i.e. the proportion of expected candidate terms that are among a list of ten suggestions.

The pair of semantic spaces involved in the most profitable combination for each corpus is then used to identify the most profitable multiple-corpora combinations, where eight different combination strategies are evaluated. The best single-corpus combinations are evaluated on the evaluation subsets of the reference standards, where Random Indexing and Random Permutation used in isolation constitute the two baselines. The best multiple-corpora combination is likewise evaluated on the evaluation subsets of the reference standards; here, the clinical and medical ensembles constitute the two baselines.
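As an illustration, the simplest combination strategy – summing the cosine similarity scores that each semantic space assigns to a candidate term – might be sketched as follows; the input format is an assumption made for the sketch:

```python
from collections import defaultdict

def combine_sum(candidate_scores, top_n=10):
    """Sum cosine scores across semantic spaces and rank the candidates.

    candidate_scores : one {candidate term: cosine score} dict per space
    """
    totals = defaultdict(float)
    for scores in candidate_scores:
        for term, cos in scores.items():
            totals[term] += cos
    return sorted(totals, key=totals.get, reverse=True)[:top_n]
```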

Post-processing rules are then constructed using the development subsets of the reference standards and the outputs of the various semantic space combinations. These are evaluated on the evaluation subsets of the reference standards using the most profitable single-corpus and multiple-corpora ensembles.

PAPER III: EXPLOITING STRUCTURED DATA, NEGATION DETECTION AND SNOMED CT TERMS IN A RANDOM INDEXING APPROACH TO CLINICAL CODING

In this paper, we investigate three different strategies to improve the performance of a Random Indexing approach to clinical coding by: (1) giving extra weight to clinically significant words, (2) exploiting structured data in patient records to calculate the likelihood of candidate diagnosis codes, and (3) incorporating the use of negation detection. We hypothesize that these techniques – in isolation and in combination – will perform better on this task than our baseline, which is based on the same approach but without these additional techniques. Important research questions that are explored in the paper – and covered in this thesis – are how to combine structured patient data with semantic spaces of clinical text and how to capture the function of negation in a distributional semantic framework.

In the distributional semantics approach to clinical coding (Henriksson et al., 2011; Henriksson and Hassel, 2011), assigned codes in the training set are treated as terms in the notes to which they have been assigned. When applying Random Indexing, a document-level context definition is employed, as there is no order dependency between individual terms in a note and assigned diagnosis codes. The semantic spaces are then used to produce a ranked list of recommended diagnosis codes for a to-be-coded document. This list is created by allowing each of the words in a document to 'vote' for a number of distributionally similar diagnosis codes, thus necessitating the subsequent merging of the individual lists. This ranking procedure can be carried out in a number of ways, some of which are explored in this paper. The starting point, however, is to use the semantic similarity of a word and a diagnosis code – as defined by the cosine similarity score – and the inverse document frequency (idf) value of the word, which denotes its discriminatory value. This is regarded as our baseline model (Henriksson and Hassel, 2011), to which negation handling and additional weighting schemes are added.
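A sketch of this voting and merging procedure is given below; the lookup interfaces are assumptions for illustration, not the actual implementation:

```python
from collections import defaultdict

def rank_codes(note_words, similar_codes, idf, top_n=10):
    """Merge per-word 'votes' into one ranked list of diagnosis codes.

    similar_codes : callable(word) -> {code: cosine similarity}
                    (assumed lookup against the semantic space)
    idf           : dict mapping words to inverse document frequency values
    """
    votes = defaultdict(float)
    for word in note_words:
        for code, cos in similar_codes(word).items():
            # Each vote is weighted by semantic similarity times the
            # word's discriminatory value (cosine * idf), as in the baseline.
            votes[code] += cos * idf.get(word, 0.0)
    return sorted(votes, key=votes.get, reverse=True)[:top_n]
```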

The semantic spaces are induced from and evaluated on a subset of the Stockholm EPR Corpus. A document contains all free-text entries concerning a single patient made on consecutive days at a single clinical unit. The documents in the partitions of the data sets on which the models are trained (90%) also include one or more associated ICD-10 codes (on average 1.7 and at most 47). In the testing partitions (10%), the associated codes are retained separately for evaluation. In addition to the complete data set, two subsets are created, in which there are documents exclusively from a particular type of clinic: one for ear-nose-throat clinics and one for rheumatology clinics.

Variants of the three data sets are created, in which negated clinical entities are automatically annotated using the Swedish version of NegEx. The clinical entities are detected through exact string matching against a list of 112,847 SNOMED CT terms belonging to the semantic categories 'finding' and 'disorder'. A negated term is marked in such a way that it will be treated as a single word, although with its proper negated denotation. Multiword terms are concatenated into unigrams. The data is finally preprocessed: lemmatization is performed using the Granska Tagger, while punctuation, digits and stop words are removed.
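A minimal sketch of this marking step, assuming the negated entities have already been located (e.g. by NegEx) and that a suffix convention is used to create the distinct types:

```python
def mark_negations(tokens, negated_indices):
    """Suffix tokens flagged as negated so they become distinct types
    with their own distributional profiles (suffix is an assumption)."""
    return [tok + "_NEG" if i in negated_indices else tok
            for i, tok in enumerate(tokens)]

# mark_negations(["feber", "utesluten"], {0}) -> ["feber_NEG", "utesluten"]
```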

For each of the semantic spaces, we apply two distinct weighting techniques. First, we assume a technocratic approach to the election of diagnosis codes. We do so by giving added weight to words which are ‘clinically significant’. This is here achieved by utilizing the same list of SNOMED CT findings and disorders that was used by the negation detection system.

We also perform weighting of the correlated ICD-10 codes by exploiting statistics generated from structured data in the EHRs, namely gender, age and clinical unit. The idea is to use known information about a to-be-coded document in order to assign weights to candidate diagnosis codes according to plausibility, which in turn is based on past combinations of a particular code and each of the structured data entries. In order for an unseen combination not to be ruled out entirely, additive smoothing is performed. Age is discretized into age groups: one for each year up to the age of ten, after which ten-year intervals are used.
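The following sketch illustrates how such plausibility weights might be computed with additive smoothing; the counts, the smoothing constant and the number of bins are illustrative assumptions, not the exact scheme used in the paper:

```python
def age_group(age):
    """One group per year up to ten, then ten-year intervals."""
    return str(age) if age < 10 else f"{(age // 10) * 10}-{(age // 10) * 10 + 9}"

def code_weight(code, field_values, counts, code_totals, alpha=1.0, n_bins=100):
    """Plausibility weight of a candidate code given the document's
    structured fields (age group, gender, clinical unit).

    counts      : {(code, field_value): co-occurrence count in training data}
    code_totals : {code: total count of the code in training data}
    alpha, n_bins implement additive smoothing so that unseen
    code/field combinations keep a weight greater than zero.
    """
    weight = 1.0
    for value in field_values:               # e.g. [age_group(83), "F", "ENT"]
        numer = counts.get((code, value), 0) + alpha
        weight *= numer / (code_totals[code] + alpha * n_bins)
    return weight
```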

The evaluation is carried out by comparing the model-generated recommendations with the clinically assigned codes in the data. This matching is done on all four possible levels of ICD-10 according to specificity (see Figure 3.4).

Figure 3.4: ICD-10 STRUCTURE. The structure of ICD-10 allows division into four levels.

PAPER IV: OPTIMIZING THE DIMENSIONALITY OF CLINICAL TERM SPACES FOR IMPROVED DIAGNOSIS CODING SUPPORT

In this paper, we investigate the effect of the dimensionality of semantic spaces on the task of assigning diagnosis codes, particularly as the vocabulary grows large, which is generally the case in the clinical domain, but especially so when modeling multiword terms as single tokens. Our hypothesis is that performance will improve with a higher dimensionality than our baseline of one thousand dimensions.

The data is extracted from the Stockholm EPR Corpus and contains lemmatized clinical notes in Swedish. Variants of the data set are created with different thresholds used for the collocation segmentation: a higher threshold means that stronger statistical associations between constituents are required and fewer collocations will be identified. The collocations are concatenated and treated as compounds, which results in an increased number of types and a decreased number of tokens per type. Identifying collocations is done to see if modeling multiword terms, rather than only unigrams, may boost results; it will also help to provide clues about the correlation between features of the vocabulary and the optimal dimensionality.

Random Indexing is used to construct the semantic spaces with eleven different dimensionalities between 1,000 and 10,000. The semantic spaces are evaluated for their ability to assign diagnosis codes to clinical notes; results, measured as recall top 10, are only based on exact matches.

PAPER V: EXPLORATION OF ADVERSE DRUG REACTIONS IN SEMANTIC VECTOR SPACE MODELS OF CLINICAL TEXT

In this paper, we explore the potential of distributional semantics to extract interesting semantic relations between drug-symptom pairs. The models are evaluated for their ability to detect known and potentially unknown adverse drug reactions. The study is primarily exploratory in nature; however, the underlying hypothesis is that drugs and adverse drug reactions co-occur in similar contexts in clinical narratives and that distributional semantics is able to capture this.

The data from which the models are induced is extracted from the Stockholm EPR Corpus. A document comprises clinical notes from a single patient visit. A patient visit is difficult to delineate; here it is simply defined according to the continuity of the documentation process: all free-text entries written on consecutive days are concatenated into one document. Two data sets are created: one in which patients from all age groups are included and another where records for patients over fifty years are discarded. This is done in order to investigate whether it is easier to identify adverse drug reactions in patients who are less likely to have a multitude of health issues. Preprocessing is done on the data sets in the form of lemmatization and stop-word removal.

Random Indexing is applied on the two data sets with two sets of parameters, yielding a total of four models. The context is the only model parameter we experiment with, using a sliding window of eight (4+4) or 24 (12+12) surrounding words. The dimensionality is set to 1,000.

The models are then used to generate lists of twenty distributionally similar terms for twenty common medications that have multiple known side-effects. To enable a comparison of the models induced from the two data sets, two groups of medications are selected: (1) drugs for cardiovascular disorders and (2) drugs that are not disproportionally prescribed to patients in older age groups, e.g. drugs for epilepsy, diabetes mellitus, infections and allergies. Moreover, as side-effects are manifested as either symptoms or disorders, a vocabulary of unigram SNOMED CT terms belonging to the semantic categories finding and disorder is compiled and used to filter the lists generated by the models. The drug-symptom/disorder pairs are then manually evaluated by a physician using lists of indications and known side-effects.

3.5 ETHICAL ISSUES

Working with clinical data inevitably raises certain ethical issues. This is due to the inherently sensitive nature of this type of data, which is protected by law. In Sweden, there are primarily three laws that, in one way or another, govern and stipulate the lawful handling of patient information. The Act Concerning the Ethical Review of Research Involving Humans9, SFS 2003:460 (Sveriges riksdag, 2013a), is applicable also to research that involves sensitive personal data (3 §). It states that such research is only permissible after approval from an ethical review board (6 §) and that processing of sensitive personal data may only be approved if it is deemed necessary to carry out the research (10 §). More specific to clinical data is the Patient Data Act10, SFS 2008:355 (Sveriges riksdag, 2013b). It stipulates that patient care must be documented (Chapter 3, 1 §) and that health records are an information source also for research (Chapter 3, 2 §). A related, but more generally applicable, law is the Personal Data Act11, SFS 1998:204 (Sveriges riksdag, 2013c), which protects individuals against the violation of their personal integrity by processing of personal data (1 §). This law stipulates that personal data must not be processed for any purpose that is incompatible with that for which the information was collected; however, it moreover states that processing of personal data for historical, statistical or scientific purposes shall not be regarded as incompatible (9 §). If the processing of personal data has been approved by a research ethics committee, the stipulated preconditions shall be considered to have been met (19 §).

The Stockholm EPR Corpus does not contain any structured information that may be used directly to identify a patient. The health records were in this sense de-identified by the hospital by replacing social security numbers (in Swedish: personnummer) with a random key (one per unique patient) and by removing all names in the structured part of the records. The data is encrypted and stored securely on machines that are not connected to any networks and in a location to which only a limited number of people, having first signed a confidentiality agreement, have access. Published results contain no confidential or other information that can be used to identify individuals. The research conducted in this thesis has been approved by the Regional Ethical Review Board in Stockholm12 (Etikprövningsnämnden i Stockholm), permission numbers 2009/1742-31/5 and 2012/834-31/5. When applying for these permissions, the aforementioned measures to ensure that personal integrity would be respected were carefully described, along with clearly stated research goals and descriptions of how, and for what purposes, the data would be used.

9In Swedish: Lag om etikprövning av forskning som avser människor
10In Swedish: Patientdatalagen
11In Swedish: Personuppgiftslagen

Some of these ethical issues also impact on scientific aspects of the work. The sensitive nature of the data makes it difficult to make data sets publicly available, which, in turn, makes the studies less readily repeatable for external researchers. Developed techniques and methods can, however, be applied and compared on clinical data sets to which a researcher does have access. Unfortunately, access to large repositories of clinical data for research purposes is still rather limited, which also means that many problems have not yet been tackled in this particular domain and – by extension – that there are fewer methods to which a proposed method can be compared. This is partly reflected in this work, where an external baseline system is sometimes lacking. Without a reference (baseline) system, statistical inference – in the form of testing the statistical significance of the difference between the performances of two systems – cannot be used to draw conclusions. Employing a more naïve baseline (such as a system based on randomized or majority-class classifications) for the mere sake of conducting significance testing is arguably not very meaningful.

12http://www.epn.se/en/

CHAPTER 4

RESULTS

In this chapter, the main results of the included studies are summarized task- and paper-wise. To facilitate the reading of the results, the experimental setup of each study is described in Section 3.4. Some of the results reported in the papers have been omitted in this chapter; consult the appended papers for complete results, as well as detailed analyses thereof.

4.1 SYNONYM EXTRACTION OF MEDICAL TERMS

The synonym extraction task was addressed in two papers: one using English data and the other using Swedish data. In the first paper, the paraphrasing issue was addressed, while in the second paper, the notion of ensembles of semantic spaces was introduced.

PAPER I: IDENTIFYING SYNONYMY BETWEEN SNOMED CLINICAL TERMS OF VARYING LENGTH USING DISTRIBUTIONAL ANALYSIS OF ELECTRONIC HEALTH RECORDS

In this study, semantic spaces are induced from two data set variants: one that only contains single words (UNIGRAM TERMS) and one in which multiword terms have been identified and are treated as compounds (MULTIWORD TERMS). The semantic spaces are evaluated in two steps. First, the parameters are tuned on a development set; then, the optimal parameter configurations for each type of semantic space are applied on an evaluation set.

In the parameter tuning step, the pattern is fairly clear: for both data set variants, the best model parameter settings are based on Random Permutation and a dimensionality of 1,500 (Table 4.1). A sliding window of 5+5 and 4+4 yields the best results for UNIGRAM TERMS and MULTIWORD TERMS, respectively. The general tendency is that results improve as the dimensionality and the size of the sliding window increase. Increasing the size of the context window beyond 5+5 surrounding terms does not, however, boost results further. Incorporating term order information (Random Permutation) is profitable.

Once the optimal parameter settings for each data set variant had been configured, they were evaluated for their ability to identify synonyms on the unseen evaluation set. Overall, the best UNIGRAM TERMS model (Random Permutation, 5+5 context window, 1,500 dimensions) achieved a recall top 20 of 0.24 (Table 4.2), i.e. 24% of all unigram synonym pairs that occur at least fifty times in the corpus were successfully identified in a list of twenty suggestions per preferred term. With the best MULTIWORD TERMS model (Random Permutation, 4+4 context window, 1,500 dimensions), the average recall top 20 is 0.16. For 22 of the correctly identified synonym pairs, at least one of the terms in the synonymous relation was a multiword term.

                 RANDOM INDEXING                      RANDOM PERMUTATION
        1+1   2+2   3+3   4+4   5+5   6+6    1+1   2+2   3+3   4+4   5+5   6+6
UNIGRAM TERMS
500     0.17  0.18  0.20  0.20  0.20  0.21   0.17  0.20  0.21  0.21  0.22  0.22
1000    0.16  0.19  0.20  0.21  0.21  0.21   0.19  0.22  0.21  0.21  0.22  0.23
1500    0.17  0.21  0.22  0.22  0.22  0.22   0.18  0.22  0.22  0.23  0.24  0.23
MULTIWORD TERMS
500     0.08  0.12  0.13  0.13  0.13  0.12   0.08  0.13  0.14  0.14  0.13  0.12
1000    0.09  0.12  0.13  0.13  0.13  0.13   0.10  0.13  0.14  0.13  0.12  0.11
1500    0.09  0.13  0.13  0.14  0.14  0.14   0.10  0.14  0.14  0.16  0.14  0.14

Table 4.1: MODEL PARAMETER TUNING. Results, measured as recall top 20, for UNIGRAM TERMS and MULTIWORD TERMS on the development sets. Results are reported with different parameter configurations: different models of distributional semantics (Random Indexing, Random Permutation), different sliding window sizes (1+1, 2+2, 3+3, 4+4, 5+5, 6+6 surrounding words to the left + right of the target word) and different dimensionalities (500, 1000, 1500). NB: results separated by a double horizontal line should not be compared directly; they are reported in the same table for the sake of efficiency.

                   # Synonym Pairs   Recall (Top 20)
UNIGRAM TERMS      215               0.24
MULTIWORD TERMS    292               0.16

Table 4.2: FINAL EVALUATION. Results, measured as recall top 20, for UNIGRAM TERMS and MULTIWORD TERMS on the evaluation set. NB: results separated by a double horizontal line should not be compared directly; they are reported in the same table for the sake of efficiency.

PAPER II: SYNONYM EXTRACTION AND ABBREVIATION EXPANSION WITH ENSEMBLES OF SEMANTIC SPACES

In this paper, semantic spaces are combined in two steps: first, two semantic spaces – one constructed with Random Indexing and the other with Random Permutation – induced from a single corpus (clinical or medical) are combined; then, the most profitable single-corpus combinations are combined in multiple-corpora combinations. The most profitable parameter configurations and strategies for performing the single-corpus and multiple-corpora combinations are determined using a development set; these are then applied on an evaluation set. Simple post-processing rules are applied in an attempt to improve results.

In both single-corpus combinations, the RI + RP combination strategy, where the cosine similarity scores are simply summed, is the most successful on all three tasks (Table 4.3). With the clinical corpus, the following results are obtained: 0.42 recall on the abbr→exp task, 0.32 recall on the exp→abbr task, and 0.40 recall on the syn task. With the medical corpus, the results are significantly lower: 0.10 recall on the abbr→exp task, 0.08 recall on the exp→abbr task, and 0.30 recall on the syn task.

           Strategy     Abbr→Exp   Exp→Abbr   Syn
CLINICAL   RI ⊂ RP30    0.38       0.30       0.39
           RP ⊂ RI30    0.35       0.30       0.38
           RI + RP      0.42       0.32       0.40
MEDICAL    RI ⊂ RP30    0.08       0.03       0.26
           RP ⊂ RI30    0.08       0.03       0.24
           RI + RP      0.10       0.08       0.30

Table 4.3: SINGLE-CORPUSRESULTSONDEVELOPMENTSET. Results, measured as recall top ten, of the best configurations for each model and model combination on the three tasks. NB: results separated by a double horizontal line should not be compared directly; they are reported in the same table for the sake of efficiency.

In the multiple-corpora combinations, the single-step approaches generally performed better than the two-step approaches, with some exceptions (Table 4.4). The most successful ensemble was a simple single-step approach, where the cosine similarity scores produced by each semantic space were simply summed (SUM), yielding 0.32 recall for abbr→exp, 0.17 recall for exp→abbr, and 0.52 recall for syn. Normalization, whereby ranking was used instead of cosine similarity, invariably affected performance negatively.

When applying the single-corpus combinations from the clinical corpus, the following results were obtained: 0.31 recall on abbr→exp, 0.20 recall on exp→abbr, and 0.44 recall on syn (Table 4.5). The RI baseline was outperformed on all three tasks; the RP baseline was outperformed on two out of three tasks, with the exception of the exp→abbr task.

Strategy    Normalize   Abbr→Exp   Exp→Abbr   Syn
AVG         True        0.13       0.09       0.39
AVG         False       0.24       0.11       0.39
SUM         True        0.13       0.09       0.34
SUM         False       0.32       0.17       0.52
AVG→AVG     –           0.15       0.09       0.41
SUM→SUM     –           0.13       0.07       0.40
AVG→SUM     –           0.15       0.09       0.41
SUM→AVG     –           0.13       0.07       0.40

Table 4.4: MULTIPLE-CORPORA RESULTS ON DEVELOPMENT SET. Results, measured as recall top ten, of the best configurations for each model and model combination on the three tasks.

Finally, it might be interesting to point out that the RP baseline performed better than the RI baseline on the two abbreviation-expansion tasks, but that the RI baseline did somewhat better on the synonym extraction task. With the medical corpus, the following results were obtained: 0.17 recall on abbr→exp, 0.11 recall on exp→abbr, and 0.34 recall on syn (Table 4.5). Both the RI and RP baselines were outperformed, by a considerable margin, by their combination. In complete contrast to the clinical corpus, the RI baseline here beat the RP baseline on the two abbreviation-expansion tasks, but was outperformed by the RP baseline on the synonym extraction task.

When applying the multiple-corpora ensembles, the following results were obtained on the evaluation sets: 0.30 recall on abbr→exp, 0.19 recall on exp→abbr, and 0.47 recall on syn (Table 4.6). The two ensemble baselines were clearly outperformed by the larger ensemble of semantic spaces from two types of corpora on two of the tasks; the clinical ensemble baseline performed equally well on the exp→abbr task. It might be of some interest to compare the two ensemble baselines here since, in this setting, they can be compared more fairly: they are evaluated on identical test sets. The clinical ensemble baseline quite clearly outperformed the medical ensemble baseline on the two abbreviation-expansion tasks; however, on the synonym extraction task, their performance was comparable.

The post-processing filtering was, unsurprisingly, more effective on the two abbreviation-expansion tasks. With the clinical corpus, recall improved substantially with the post-processing filtering: from 0.31 to 0.42 on abbr→exp and from 0.20 to 0.33 on exp→abbr (Table 4.5). With the medical corpus, however, almost no improvements were observed for these tasks.

                              Abbr→Exp      Exp→Abbr      Syn
Evaluation Configuration      P     R       P     R       P     R
CLINICAL
RI Baseline                   0.04  0.22    0.03  0.19    0.07  0.39
RP Baseline                   0.04  0.23    0.04  0.24    0.06  0.36
Standard (Top 10)             0.05  0.31    0.03  0.20    0.07  0.44
Post-Processing (Top 10)      0.08  0.42    0.05  0.33    0.08  0.43
Dynamic Cut-Off (≤ Top 10)    0.11  0.41    0.12  0.33    0.08  0.42
MEDICAL
RI Baseline                   0.02  0.09    0.01  0.08    0.03  0.18
RP Baseline                   0.01  0.06    0.01  0.05    0.05  0.26
Standard (Top 10)             0.03  0.17    0.01  0.11    0.06  0.34
Post-Processing (Top 10)      0.03  0.17    0.02  0.11    0.06  0.34
Dynamic Cut-Off (≤ Top 10)    0.17  0.17    0.10  0.11    0.06  0.34

Table 4.5: SINGLE-CORPUS RESULTS ON EVALUATION SET. Results (P = precision, R = recall, top ten) of the best models with and without post-processing on the three tasks. The dynamic cut-off allows the model to suggest fewer than ten terms in order to improve precision. The results are based on the application of the model combinations to the evaluation data. NB: results separated by a double horizontal line should not be compared directly; they are reported in the same table for the sake of efficiency.

                              Abbr→Exp      Exp→Abbr      Syn
Evaluation Configuration      P     R       P     R       P     R
Clinical Ensemble Baseline    0.04  0.24    0.03  0.19    0.06  0.34
Medical Ensemble Baseline     0.02  0.11    0.01  0.11    0.05  0.33
Standard (Top 10)             0.05  0.30    0.03  0.19    0.08  0.47
Post-Processing (Top 10)      0.07  0.39    0.06  0.33    0.08  0.47
Dynamic Cut-Off (≤ Top 10)    0.28  0.39    0.31  0.33    0.08  0.45

Table 4.6: MULTIPLE-CORPORA RESULTS ON EVALUATION SET. Results (P = precision, R = recall, top ten) of the best models with and without post-processing on the three tasks. The dynamic cut-off allows the model to suggest fewer than ten terms in order to improve precision. The results are based on the application of the model combinations to the evaluation data.

With the combination of semantic spaces from the two corpora, significant gains were once more made by filtering the candidate terms on the two abbreviation-expansion tasks: from 0.30 to 0.39 recall and from 0.05 to 0.07 precision on abbr→exp; from 0.19 to 0.33 recall and from 0.03 to 0.06 precision on exp→abbr (Table 4.6).

With a dynamic cut-off, only precision could be improved, although at the risk of negatively affecting recall. With the clinical corpus, recall was largely unaffected on the two abbreviation-expansion tasks, while precision improved by 3–7 percentage points (Table 4.5). With the medical corpus, the gains were even more substantial: from 0.03 to 0.17 precision on abbr→exp and from 0.02 to 0.10 precision on exp→abbr – without any impact on recall. The greatest improvements on these tasks were, however, observed with the combination of semantic spaces from multiple corpora: precision increased from 0.07 to 0.28 on abbr→exp and from 0.06 to 0.31 on exp→abbr – again, without affecting recall (Table 4.6). For synonyms, the post-processing filtering was clearly unsuccessful for both corpora and their combination, with almost no impact on either precision or recall.

4.2 ASSIGNMENT OF DIAGNOSIS CODES

The diagnosis coding task was also addressed in two papers. In the first of the two, the focus was on improving the Random Indexing approach to clinical coding by: (1) exploiting terminological resources, (2) exploiting structured data, and (3) attempting to handle negation. In the second paper, the focus was instead on investigating the effect of the dimensionality as the vocabulary and number of documents grow large. This was studied using diagnosis coding as a use case, where an attempt was made to improve results.

PAPER III: EXPLOITING STRUCTURED DATA, NEGATION DETECTION AND SNOMED CT TERMS IN A RANDOM INDEXING APPROACH TO CLINICAL CODING

In this paper, three techniques are tried – both in isolation and in combination – in an attempt to improve results on the diagnosis coding task. Two data sets are created, of which one has been preprocessed to detect negated entities and mark them as such. Interestingly, approximately 14–17 percent of the detected entities are negated. Moreover, two types of models are induced: general models – trained on all types of clinics – and domain-specific models: an ENT (Ear-Nose-Throat) model and a rheumatology model.

The baseline for the general models finds 23% of the clinically assigned codes (exact matches) (Table 4.7). The single application of one of the weighting techniques to the baseline model boosts performance somewhat, the fixed fields-based code filtering (26% exact matches) slightly more so than the technocratic word weighting (24% exact matches). The negation variant of the general model, General Negation Model, performs somewhat better – up two percentage points (25% exact matches) – than the baseline model. The technocratic approach applied to this model does not yield any observable added value. The fixed fields filtering does, however, result in a further improvement on the three most specific levels (27% exact matches). A combination of the two weighting schemes does not appear to bring much benefit to either of the general models, compared to solely performing fixed fields filtering.

                     General Model            General Negation Model
Weighting            E     L3    L2    L1     E     L3    L2    L1
Baseline             0.23  0.25  0.33  0.60   0.25  0.27  0.35  0.62
Technocratic (T)     0.24  0.26  0.34  0.61   0.25  0.27  0.35  0.62
Fixed Fields (FF)    0.26  0.28  0.36  0.61   0.27  0.29  0.37  0.63
T + FF               0.26  0.28  0.36  0.62   0.27  0.29  0.37  0.63

Table 4.7: GENERAL MODELS. With and without negation handling. Recall (top 10), measured as the presence of the clinically assigned codes in a list of ten model-generated recommendations. E = exact match, L3→L1 = matches on the other levels, from specific to general. The baseline is for the model without negation handling only.

The baseline for the ENT models finds 33% of the clinically assigned codes (exact matches) (Table 4.8). Technocratic word weighting yields a modest improvement over the baseline model: one percentage point on each of the levels. Filtering code candidates based on fixed fields statistics, however, leads to a remarkable boost in results, from 33% to 43% exact matches. The ENT Negation Model performs slightly better than the baseline model, although by as little as a single percentage point (34% exact matches). Performance drops when the technocratic approach is applied to this model. The fixed fields filtering, on the other hand, similarly improves results for the negation variant of the ENT model; however, there is no apparent additional benefit of negation handling in this case. In fact, it somewhat hampers the improvement yielded by this weighting technique. As with the general models, a combination of the two weighting techniques does not affect the results much for either of the ENT models.

                     ENT Model                ENT Negation Model
Weighting            E     L3    L2    L1     E     L3    L2    L1
Baseline             0.33  0.34  0.41  0.62   0.34  0.35  0.42  0.62
Technocratic (T)     0.34  0.35  0.42  0.63   0.33  0.33  0.41  0.61
Fixed Fields (FF)    0.43  0.43  0.48  0.64   0.42  0.43  0.48  0.63
T + FF               0.42  0.42  0.47  0.64   0.42  0.42  0.47  0.62

Table 4.8: EAR-NOSE-THROAT MODELS. With and without negation handling. Recall (top 10), measured as the presence of the clinically assigned codes in a list of ten model-generated recommendations. E = exact match, L3→L1 = matches on the other levels, from specific to general. The baseline is for the model without negation handling only.

                     Rheuma Model             Rheuma Negation Model
Weighting            E     L3    L2    L1     E     L3    L2    L1
Baseline             0.61  0.61  0.68  0.92   0.61  0.61  0.70  0.92
Technocratic (T)     0.72  0.72  0.77  0.94   0.60  0.60  0.70  0.91
Fixed Fields (FF)    0.82  0.82  0.85  0.95   0.67  0.67  0.75  0.91
T + FF               0.82  0.83  0.86  0.95   0.68  0.68  0.76  0.92

Table 4.9: RHEUMATOLOGY MODELS. With and without negation handling. Recall (top 10), measured as the presence of the clinically assigned codes in a list of ten model-generated recommendations. E = exact match, L3→L1 = matches on the other levels, from specific to general. The baseline is for the model without negation handling only.

The baseline for the rheumatology models finds 61% of the clinically assigned codes (exact matches) (Table 4.9). Compared to the above models, the technocratic approach is here much more successful, resulting in 72% exact matches. Filtering the code candidates based on fixed fields statistics leads to a further improvement of ten percentage points for exact matches (82%). The Rheuma Negation Model achieves only a modest improvement on L2. Moreover, this model does not benefit at all from the technocratic approach; neither is the fixed fields filtering quite as successful in this model (67% exact matches). A combination of the two weighting schemes adds only a little to the two variants of the rheumatology model. It is interesting to note that the negation variant performs the same as, or even much worse than, the variant without any negation handling.

PAPER IV: OPTIMIZING THE DIMENSIONALITY OF CLINICAL TERM SPACES FOR IMPROVED DIAGNOSIS CODING SUPPORT

In this paper, a number of data set variants are created: a UNIGRAMS data set and three collocation data sets, created with different thresholds – a lower threshold leads to the creation of more collocation segments. The purpose of creating these data set variants is to study the effect of the dimensionality on the task of assigning diagnosis codes, given certain properties of the data set (Table 4.10).

DATA SET    DOCUMENTS   TYPES       TOKENS / TYPE
UNIGRAMS    219k        371,778     51.54
COLL 100    219k        612,422     33.19
COLL 50     219k        699,913     28.83
COLL 0      219k        1,413,735   13.53

Table 4.10: DATA DESCRIPTION. The COLL X (collocation) data sets were created with different segmentation thresholds X.

Increasing the dimensionality yields major improvements (Table 4.11): up to 18 percentage points. The biggest improvements are seen when increasing the dimensionality from the 1,000-dimensionality baseline. When increasing the dimensionality beyond 2,000–2,500, the boosts in results begin to level off, although further improvements are achieved with a higher dimensionality: the best results are achieved with a dimensionality of 10,000. A larger improvement is seen with all of the COLL models compared to the UNIGRAMS model, even if the UNIGRAMS model outperforms all three COLL models; with a higher dimensionality, however, the COLL models appear to close in on the UNIGRAMS model.

DIM        UNIGRAMS   COLL 100   COLL 50   COLL 0
1000       0.25       0.18       0.19      0.15
1500       0.31       0.26       0.25      0.20
2000       0.34       0.29       0.28      0.24
2500       0.35       0.31       0.30      0.26
3000       0.36       0.33       0.31      0.26
3500       0.37       0.33       0.32      0.28
4000       0.37       0.34       0.34      0.29
4500       0.37       0.33       0.33      0.30
5000       0.38       0.34       0.33      0.30
7500       0.39       0.34       0.33      0.32
10000      0.39       0.36       0.35      0.32
+/- (pp)   +14        +18        +16       +17

Table 4.11: AUTOMATIC DIAGNOSIS CODING RESULTS. Results, measured as recall top 10 for exact matches, with clinical term spaces constructed from differently preprocessed data sets (unigrams and three collocation variants) and with different dimensionalities (DIM).

4.3 IDENTIFICATION OF ADVERSE DRUG REACTIONS

The task of identifying adverse drug reactions for a given medication was only preliminarily explored in one paper. The idea was to investigate whether useful semantic relations between a drug and symptoms/disorders – in particular the ADR relation – could be identified in a large corpus of clinical text.

PAPER V: EXPLORATION OF ADVERSE DRUG REACTIONS IN SEMANTIC VECTOR SPACE MODELS OF CLINICAL TEXT

In this paper, two data sets are created: one in which patients from all age groups are included and another where records for patients over fifty years are discarded. Semantic spaces are induced from these data sets with different sliding window sizes (8 and 24 surrounding words). These are then used to generate lists of twenty distributionally similar terms for twenty common medications. The drug-symptom/disorder pairs are manually evaluated by a physician.

Approximately half of the model-generated suggestions are disorders/symptoms that are conceivable side-effects; around 10% are known and documented. A fair share of the terms are indications for prescribing a certain drug (∼10–15%), while differential diagnoses (i.e. alternative indications) also appear with some regularity (∼5–6%). Interestingly, terms that explicitly indicate mention of possible adverse drug events, such as side-effect, show up fairly often. However, over 20% of the terms proposed by all of the models are obviously incorrect: in some cases they are not symptoms/disorders, and sometimes they are simply not conceivable side-effects.

Using the model induced from the larger data set – comprising all age groups – slightly more potential side-effects are identified, whereas the proportion of known side-effects remains constant across models. The number of incorrect suggestions increases when using the smaller data set. The difference in the scope of the context, however, does not seem to have a significant impact on these results. Similarly, when analyzing the results for the two groups of medications separately and vis-à-vis the two data sets, no major differences are found.

                    ALL              ≤ 50
TYPE                SW 8    SW 24    SW 8    SW 24
KNOWN S-E           11      11       10      11
POTENTIAL S-E       37      39       35      35
INDICATION          14      14       10      10
ALT. INDICATION     6       6        5       6
S-E TERM            10      9        7       7
D/S, NOT S-E        15      14       23      21
NOT D/S             7       7        10      10
SUM %               100     100      100     100

Table 4.12: TERM CATEGORIZATION RESULTS. Results for models built on the two data sets (all age groups vs. only ≤ 50 years) with a sliding window context of 8 (SW 8) and 24 (SW 24) words, respectively. S-E = side-effect; D/S = disorder/symptom.

CHAPTER 5

DISCUSSION

In this chapter, the focus of this thesis, namely the application of distributional semantics to the clinical domain, will be revisited. The three use cases, or application areas, that have been explored will then be reviewed individually. The research questions that were formulated in Chapter 1 will also be answered individually and discussed, including comparisons with previous, related research. Finally, the main contributions of the thesis, along with future directions, will be indicated.

5.1 SEMANTIC SPACES OF CLINICAL TEXT

Semantic spaces of clinical text have many practical applications, some of which have been demonstrated in this thesis. It is thus clear that distributional semantics has a key role to play in clinical language processing, despite claims that state otherwise (Cohen and Widdows, 2009). That is not to say that there are no challenges involved in applying distributional semantics to this very particular – and peculiar – genre of text. One such issue is the size of the vocabulary, which tends to grow very large in the clinical domain, partly as a result of frequent misspellings and ad-hoc abbreviations. In one study, it was found that the pharmaceutical product noradrenaline had approximately sixty different spellings in a set of intensive care unit nursing narratives written in Swedish (Allvin et al., 2011). This is problematic for many reasons. First of all, since these lexical variants are semantically identical, we ideally want to have one single representation of the meaning of this concept. This type of phenomenon also results in data sparsity and a weaker statistical foundation for the meaning of terms; if the spelling variants had first been canonized into a single form, this issue would have been mitigated. Moreover, the prevalence of ad-hoc abbreviations and misspellings increases the likelihood of semantically overloaded types and results in (coincidental) homonymy – a single form that has multiple, wholly uncorrelated meanings. Ideally, one would like to have sense-specific vector representations of meaning. The large vocabulary size also affects the appropriate dimensionality of a semantic space. In order for the crucial near-orthogonality property of Random Indexing to be fulfilled, the dimensionality has to be sufficiently large in relation to the number of contexts. Generating a large number of index vectors in a relatively low-dimensional space will likely result in cases of identical or largely overlapping index vectors, which, in turn, means that each context will not have a unique representation that is (approximately) uncorrelated with other contexts. In Paper IV, the importance of setting a sufficiently large dimensionality was demonstrated. This is true in general, but particularly so in the clinical domain and other noisy domains in which large vocabularies are the norm.
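A small numerical experiment illustrates the point: with a dimensionality that is large relative to the number of index vectors, randomly generated sparse ternary vectors are almost mutually orthogonal, whereas a low dimensionality produces spurious overlaps. The parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def index_vectors(n, dim, nonzero=8):
    """Generate n sparse ternary index vectors (four +1s, four -1s each)."""
    vecs = np.zeros((n, dim))
    for v in vecs:
        pos = rng.choice(dim, size=nonzero, replace=False)
        v[pos[: nonzero // 2]], v[pos[nonzero // 2 :]] = 1.0, -1.0
    return vecs

for dim in (100, 1000, 10000):
    vecs = index_vectors(500, dim)
    cosines = vecs @ vecs.T / 8.0     # each vector has squared norm 8
    np.fill_diagonal(cosines, 0.0)    # ignore self-similarity
    print(dim, round(float(np.abs(cosines).max()), 2))
# The maximum pairwise overlap shrinks as the dimensionality grows.
```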

Another important issue that concerns the useful application of distributional semantics to the clinical domain is the ability to model the meaning of multiword terms – the overwhelming majority of clinical, as well as biomedical, concepts are instantiated by multiword terms – and to compare the distributional profiles of terms of varying (word) length. In other words, a means of handling paraphrasing in a distributional semantic framework is needed. This is an important research area, which has gained increasing attention and typically concerns the modeling of semantic composition in a distributional semantic framework. Here, this was approached in a fairly straightforward manner: identify multiword terms using a statistical measure (C-value), concatenate them and treat them as distinct semantic units, with their own distributional profiles, different from those of their constituents. This approach is simple and intuitive. An important caveat is that it requires large amounts of data in order to avoid data sparsity issues: multiword terms are almost invariably less frequent than their constituents.

Considering the prevalence of negation in the clinical domain, it also seems important to be able to handle this phenomenon in a distributional semantic framework. In general, the meaning of a term is not affected by the fact that it is sometimes negated; however, given a specific mention of a term, knowing whether it has been negated can make all the difference. To our knowledge, the issue of incorporating negation in a distributional semantic framework has not been explored to any great extent. An exception is the attempt by Widdows (2003) to handle negation in the context of information retrieval by making query vectors orthogonal to negated terms. Here, this was – again – approached in a simple manner: use an existing negation detection system and treat negated entities as distinct semantic units with their own distributional profiles, different from those of their affirmed counterparts. This approach is shown to lead to enhanced performance on the task of automatically assigning diagnosis codes. Similarly to the aforementioned multiword solution, this approach can lead to data sparsity issues, since the statistical evidence for the meaning of a term is now divided between its affirmed and negated uses.

Given the difficulty often involved in obtaining large amounts of clinical data, it would be valuable if other data sources could be used in a complementary fashion. This was one of the ideas investigated in Paper II, where (small-scale) ensembles of semantic spaces were induced from different types of corpora, including a non-clinical corpus. Even if one has access to sufficiently large amounts of clinical data, however, this notion is nevertheless interesting, as it allows different models of distributional semantics to be combined. This may allow more aspects of semantics to be captured than is possible with a single model.

SYNONYM EXTRACTION OF MEDICAL TERMS

Using distributional semantics to extract synonyms from large corpora is the most obvious application and certainly does not constitute a novelty in itself. In the domain of clinical text, however, this has not been explored to a large degree. Traditionally, models of distributional semantics have focused on single words; however, many clinical terms – and other biomedical terms, for that matter – are multiword terms. In fact, in SNOMED CT, fewer than ten percent of the terms are unigram terms (Paper I). For that reason, in Paper I, we attempted to incorporate the notion of paraphrasing in a distributional semantic framework, allowing synonymy between terms of varying length to be identified. In that paper, the method was evaluated on the English version of SNOMED CT, which already contains synonyms, with the motivation that it can be used to facilitate ongoing translation efforts into other languages for which there are no synonyms. Swedish is one such example and, in fact, the method has also been evaluated on the Swedish version of SNOMED CT (Henriksson et al., 2013). That study yielded promising results.

A problem with applying distributional semantics to extract synonyms is that terms with other semantic relations also have similar distributional profiles and will therefore be extracted too. It is as yet not possible to isolate synonyms from these other semantic relations, including antonymy. An attempt to obtain better performance on this task was, however, made in Paper II, where the notion of ensembles of semantic spaces was explored. In this paper, we demonstrated that by combining models of distributional semantics – in this case, Random Indexing and Random Permutation – with different model parameter configurations and induced from different data sources – in this case, a clinical corpus and a non-clinical, medical corpus comprising medical journal articles – enhanced performance on the synonym extraction task could be obtained. We also applied this methodology to the similar yet slightly different problem of identifying abbreviation-expansion pairs. Since abbreviation-expansion pairs are distributionally similar, distributional semantics seems to be an appropriate solution; the results indicate that this is the case.

RESEARCH QUESTIONS

To what extent can distributional semantics be employed to extract can- − didate synonyms of medical terms from a large corpus of clinical text? A number of relevant performance estimates are presented in this thesis. In Paper I, results are reported separately for unigram terms and multiword terms (up to a length of three tokens): 0.24 recall (top 20) for unigrams and 0.16 recall (top 20) for multiword terms. These results are not directly comparable since they are evaluated on different reference standards (the one for unigram terms comprises only unigram-unigram synonyms, whereas the one for multiword terms comprises unigrams, bigrams and trigrams, and synonyms can be of varying word length). The experiments are conducted with a corpus of English clinical text and a reference standard that is based on the English version of SNOMED CT. In Paper II, only unigram terms are considered: with a single Swedish clinical corpus, a recall (top 10) of 0.44 is obtained; with a clinical corpus and a non-clinical, medical corpus, a recall (top 10) of 0.47 is obtained. In these experiments, the reference standard is based on the Swedish version of MeSH. This means that the results reported in Paper II cannot be compared directly with the results reported for unigram terms in Paper I. The results provide an early indication of the extent to which distributional semantics can be employed to extract candidate synonyms of medical terms from clinical text. The performance of this approach to this task can reason- ably be improved and these results are in no way definitive; the results are ar- guably promising enough to merit further exploration and must be improved to have undisputed value in providing effective terminology development support. It is, however, important to note that the employed reference stan- dards do not really contain the ground truth: manual inspection in the error analyses showed that there are many candidate synonyms that are reason- able, yet not present in the reference standards. Many candidate synonyms are misspellings, which are frequent in clinical text and thus important to ac- count for in order not to suffer a drop a recall when performing information extraction. Synonym extraction is a fairly obvious application of distributional seman- tics – despite the open research question of how to distinguish between dif- ferent types of semantic relations – and this approach has clear advantages over lexical pattern-based methods, such as the one proposed by Conway 76 CHAPTER 5.

Synonym extraction is a fairly obvious application of distributional semantics – despite the open research question of how to distinguish between different types of semantic relations – and this approach has clear advantages over lexical pattern-based methods, such as the one proposed by Conway and Chapman (2012), since it does not require synonymous terms to share orthographic features. Moreover, the fact that this approach was investigated on two different corpora written in two different languages (Swedish and English) is a strength of the method, which is data-driven and language agnostic. There have been few previous attempts to extract synonyms from clinical text – most synonym extraction tasks in the medical domain have targeted biomedical text – and this makes it hard to compare the results reported in this thesis to other approaches. In future research, it is important to compare a wide range of methods for synonym extraction from clinical text.

− How can synonymous terms of varying (word) length be captured in a distributional semantic framework?

There are of course many alternative means of capturing synonymous terms of varying word length in a distributional semantic framework. One general approach is here proposed and investigated: to identify multiword terms statistically from the corpus, concatenate them and treat them as distinct semantic units. These will then have their own distributional profiles, potentially very different from those of their constituents. A key component of this solution is the ability to recognize multiword terms in the corpus, and two approaches are investigated: (1) using an existing collocation identification tool and (2) calculating the C-value of n-grams (where n is 1, 2 or 3). In the context of synonym extraction, however, only the latter method is employed; a simplified sketch follows below. In Paper I, a recall (top 20) of 0.16 is obtained on the task of identifying synonymy between SNOMED clinical terms of varying length in a corpus of English clinical text. This result demonstrates that it is indeed possible to account for the fact that synonymous relations can exist between terms of varying length in a distributional semantic framework. A problem with treating multiword terms as single tokens is that it may lead to data sparsity issues: the vocabulary increases substantially, resulting in a lower tokens-per-type ratio. In Paper I, the results are not as good as those obtained for unigram terms (even if these are essentially distinct tasks and cannot readily be compared); however, the mere fact that the multiword term spaces are at all able to capture synonymy that involves a multiword term is an advantage over the unigram term spaces.
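The sketch below illustrates one plausible reading of this pipeline: a simplified C-value over 1- to 3-grams, followed by concatenation of recognized multiword terms into single tokens before the semantic space is induced. The thresholds, the treatment of unigrams (the +1 inside the logarithm) and the tokenization are assumptions, not the exact choices made in Paper I.

```python
import math
from collections import Counter
from itertools import islice

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

def c_values(tokens, max_n=3):
    """Simplified C-value over 1- to max_n-grams: frequency, adjusted for
    nesting inside longer candidate terms, scaled by term length."""
    freq = Counter(g for n in range(1, max_n + 1) for g in ngrams(tokens, n))
    scores = {}
    for term, f in freq.items():
        longer = [freq[t] for t in freq if len(t) > len(term) and
                  any(t[i:i + len(term)] == term
                      for i in range(len(t) - len(term) + 1))]
        adjusted = f - sum(longer) / len(longer) if longer else f
        scores[term] = math.log2(len(term) + 1) * adjusted  # +1 keeps unigrams > 0
    return scores

def concatenate(tokens, multiword_terms, max_n=3):
    """Rewrite a token stream so that recognized multiword terms become
    single tokens (e.g., 'myocardial infarction' -> 'myocardial_infarction'),
    giving them their own distributional profiles."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_n, 1, -1):            # prefer the longest match
            if tuple(tokens[i:i + n]) in multiword_terms:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out
```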

Since there have not been any similar studies on clinical text or, indeed, a general solution to this limitation of distributional methods, it is not possible to compare these results with any others. At the very least, this method constitutes a useful baseline, which more sophisticated methods – dealing more elegantly with semantic composition, for instance – can be compared to.

− Can different models of distributional semantics and types of corpora be combined to obtain enhanced performance on the synonym extraction task, compared to using a single model and a single corpus?

Yes, in Paper II, this was shown to yield enhanced performance on the synonym extraction task. Combining different models of distributional semantics – in this case, Random Indexing and Random Permutation – outperformed using a single model in isolation. With the clinical corpus, the model combination improved from a recall (top 10) of 0.39 with Random Indexing to 0.44; with the non-clinical, medical corpus, the performance improved from 0.28 with Random Permutation to 0.34. Furthermore, combining semantic spaces induced from different corpora – in this case, a clinical corpus and a non-clinical, medical corpus – further improved results, outperforming a combination of semantic spaces induced from a single source. This multi-model, multi-corpora ensemble of four semantic spaces obtained a recall (top 10) of 0.47, in comparison to 0.33 and 0.34 with the two single-corpus combinations. It is worth pointing out that the latter experiment was conducted with a reference standard in which included synonym pairs had to occur in both corpora a minimum number of times. Perhaps a better test of the multiple-corpora combination would be to base the reference standard on a single corpus (e.g., the clinical one, so that included synonym pairs occur a minimum number of times in that corpus) and then to verify whether the additional use of semantic spaces induced from another corpus would improve results. The multiple-corpora combination should also be compared with the simpler approach of using a single semantic space induced from multiple corpora, which would essentially be treated as a single, combined corpus. In the general language domain, previous work has shown that combining methods and/or sources can lead to improved performance on the synonym extraction task (Curran, 2002; Wu and Zhou, 2003; van der Plas and Tiedemann, 2006; Peirsman and Geeraerts, 2009). Some ensemble approaches do indeed include the use of distributional semantics; however, to our knowledge, this approach of combining semantic spaces induced with different distributional models and from different corpora has not been explored before.

In the clinical domain, as mentioned earlier, the synonym extraction task has not been targeted to a great extent, which makes it difficult to compare these results with other approaches. In this case, however, the baseline was using a single model of distributional semantics and a single corpus; in comparison to this, the small-scale ensembles of semantic spaces perform better, yielding a fairly clear answer to the research question.
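As a rough illustration of what such an ensemble can look like, the sketch below combines the neighbour lists returned by several semantic spaces by summing each candidate's cosine similarity to the query across spaces. Summing cosines is just one simple combination strategy, and the `spaces` interface is hypothetical; Paper II's exact combination scheme is not restated here. With four entries in `spaces` (RI and RP, each induced from a clinical and a non-clinical corpus), this mirrors the multi-model, multi-corpora setup discussed above.

```python
from collections import defaultdict

def ensemble_rank(query, spaces, top_k=10):
    """Rank synonym candidates by summing their cosine similarity to the
    query across several semantic spaces. Each entry in `spaces` is a
    function mapping a term to a list of (candidate, cosine) pairs."""
    combined = defaultdict(float)
    for space in spaces:
        for candidate, cosine in space(query):
            combined[candidate] += cosine
    return sorted(combined, key=combined.get, reverse=True)[:top_k]
```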

− To what extent can abbreviation-definition pairs be extracted with distributional semantics?

In Paper II, this is separated into two distinct tasks: one in which an abbreviation is provided as input and the task is to extract the corresponding definition or expanded long form (abbr → exp), and one where the definition or expanded long form is provided as input and the task is to extract the corresponding abbreviation (exp → abbr). The best results for these two tasks are: 0.28 weighted precision and 0.39 recall (top 10) for abbr → exp; 0.31 precision and 0.33 recall (top 10) for exp → abbr.

As with the synonym extraction task, these results are in no way definitive and it is reasonable to assume that performance can be improved on these tasks with distributional methods. The results do, however, provide a rough indication of the extent to which current models of distributional semantics can be employed to extract medical abbreviation-definition pairs from clinical text. The results are sufficiently promising to merit further exploration. In comparison to other approaches, most of which have been developed for biomedical text, this approach does not require that abbreviations are accompanied by a parenthetical definition – a much easier task. Approaches that rely on such an assumption – which is questionable also for biomedical text, as Yu et al. (2002) found that 75% of abbreviations are never defined – are unlikely to be applicable to clinical text, as abbreviations are generally not defined in the text due to time constraints. Relying instead on distributional similarity is an interesting notion that, to our knowledge, has not been explored in the biomedical domain either. In contrast to the synonym extraction task, this task allows orthographic features to be readily exploited in post-processing of candidate terms, greatly improving both recall (top 10) and weighted precision.
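The sketch below shows one simple orthographic filter of the kind that could be applied in such post-processing: keep a candidate long form only if the abbreviation's letters occur in it in order. This subsequence test is a hypothetical stand-in; the actual post-processing used in Paper II is not reproduced here.

```python
def letters_in_order(abbrev, candidate):
    """True if the abbreviation's letters occur, in order, in the candidate
    long form (a simple subsequence test)."""
    chars = iter(candidate.lower())
    return all(ch in chars for ch in abbrev.lower().strip("."))

# Filter distributionally similar candidates for the abbreviation 'hf'.
candidates = ["heart failure", "hip fracture", "headache"]
print([c for c in candidates if letters_in_order("hf", c)])
# ['heart failure', 'hip fracture'] -- 'headache' has no f after the h
```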

ASSIGNMENT OF DIAGNOSIS CODES

The idea of leveraging distributional semantics to assign diagnosis codes to free-text clinical notes is less obvious than synonym extraction and constitutes a novel approach. Here, diagnosis codes are treated as terms in the notes to which they were assigned. The approach was shown to hold some promise when first evaluated on a smaller scale (Henriksson et al., 2011) and later confirmed in a large-scale quantitative evaluation (Henriksson and Hassel, 2011). In Paper III, the focus was on developing this method further by: (1) exploiting terminological resources, (2) exploiting structured data and (3) incorporating negation handling. In Paper IV, the effect of the dimensionality on performance is investigated. Here, the main focus is on the (appropriate) dimensionality of semantic spaces in domains, like the clinical, where the vocabulary tends to grow large; the assignment of diagnosis codes is primarily the task used to enable extrinsic evaluation. A parallel, albeit secondary, focus is on obtaining enhanced performance on this task by properly configuring the dimensionality.
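A minimal sketch of the underlying idea, assuming JavaSDM-style Random Indexing with a document-level context definition: the codes assigned to a note are simply appended to its token stream, so that codes and words accumulate the same index vectors, and codes can later be ranked by similarity to an uncoded note. All data structures and parameters (dimensionality, number of non-zero elements) are illustrative, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 10_000   # the dimensionality matters; see the discussion of Paper IV below

def index_vector(nnz=8):
    """Sparse ternary index vector, as in Random Indexing."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=nnz, replace=False)
    v[pos[:nnz // 2]], v[pos[nnz // 2:]] = 1.0, -1.0
    return v

def induce(notes):
    """Document-level RI where assigned codes are treated as terms: every
    word and every code in a note accumulates that note's index vector."""
    context = {}
    for tokens in notes:                    # each note = its words + its codes
        doc = index_vector()
        for t in tokens:
            context.setdefault(t, np.zeros(DIM))
            context[t] += doc
    return context

def rank_codes(words, context, codes):
    """Rank candidate codes by cosine similarity between each code's context
    vector and the sum of the (uncoded) note's word vectors."""
    q = sum((context.get(w, np.zeros(DIM)) for w in words), np.zeros(DIM))
    cos = lambda a, b: float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return sorted(codes, key=lambda c: cos(q, context[c]), reverse=True)
```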

As mentioned earlier, it is difficult to compare the results obtained by the many approaches to automatic diagnosis coding that have been reported in the literature, mainly due to the use of different classification systems (e.g., ICD-9 or ICD-10) and performance metrics (e.g., precision, recall, recall top j or F1-score). Even more problematic in terms of evaluation is the fact that the problem is often framed in very different ways, affecting the scope and difficulty of the task. In the 2007 Computational Medicine Challenge (Pestian et al., 2007), for instance, the task was limited to the assignment of a set of 45 ICD-9 codes. Most of the results reported in this thesis are based on all ICD-10 codes that have been assigned in the data (over 12,000) – a far more daunting and difficult task. In Paper III, results are also reported for domain-specific models (i.e., specific to a particular type of clinic). Although the results are higher for these models – and limiting the classification problem in this way is perhaps the way to go – they are also more unreliable due to more limited amounts of evaluation data. The conclusions drawn in that study – and the answers to the related research questions – are therefore mainly based on the results obtained for the general models (trained and evaluated on all types of clinics), for which there is a substantial amount of evaluation data.

The following research questions are, however, not directly about the usefulness of this distributional semantics approach to diagnosis coding, even if they all involve efforts to obtain better performance on the task.

RESEARCH QUESTIONS

− Can the distributional semantics approach to automatic diagnosis coding be improved by also exploiting structured data?

This research question can be answered in the affirmative without hedging or caveats: all the relevant results reported in Paper III demonstrate that performance improves by exploiting structured patient data. For the general model, there is an increase in performance from 0.23 recall (top 10) to 0.26; for the two domain-specific models, the performance increases from 0.33 to 0.43 and from 0.61 to 0.82. Using known information about a to-be-coded document was thus shown to be effective in weighting the likelihood of candidate diagnosis codes. A combination of structured and unstructured data appears to be the way to go for the (semi-)automatic assignment of diagnosis codes.

Although most previous approaches rely on the clinical text, it is not a novelty in itself to exploit structured data to improve performance. Pakhomov et al. (2006), for instance, use gender information in combination with frequency data to filter out unlikely classifications, with the motivation that gender has a high predictive value, particularly as some diagnosis codes are gender-specific. Here, only three structured data fields were used: age, gender and clinic. It is reasonable to believe that there is scope for exploiting other structured EHR data, too, such as various measurement values and laboratory test results. This will, however, likely involve having to deal with missing values – not all patients will have taken the same measurements or laboratory tests – and the curse of dimensionality: the number of times a laboratory test is taken can vary substantially from patient to patient.
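A sketch of how such weighting might look, with hypothetical data structures: distributional scores for candidate codes are blended with a prior estimated from the structured fields of coded training notes. The linear blend, the `alpha` weight and the decade-sized age bands are assumptions rather than the scheme actually used in Paper III.

```python
from collections import Counter, defaultdict

def fit_priors(training):
    """Estimate P(code | clinic, gender, age band) from coded training notes.
    `training` yields (code, clinic, gender, age) tuples; purely illustrative."""
    counts, totals = defaultdict(Counter), Counter()
    for code, clinic, gender, age in training:
        key = (clinic, gender, age // 10)   # decade-sized age bands (assumption)
        counts[key][code] += 1
        totals[key] += 1
    return counts, totals

def reweight(scores, key, counts, totals, alpha=0.5):
    """Blend each candidate code's distributional score with its prior
    probability given the note's structured fields (linear blend assumed)."""
    prior, n = counts.get(key, Counter()), totals.get(key, 0)
    return {code: (1 - alpha) * s + alpha * (prior[code] / n if n else 0.0)
            for code, s in scores.items()}
```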

− Can external terminological resources be leveraged to obtain enhanced performance on the diagnosis coding task?

This research question cannot be answered quite as clearly as the previous one based on the results obtained in Paper III. The approach of giving additional weight to clinically significant terms – terms that appear in a medical vocabulary, here in the form of SNOMED CT findings and disorders – did yield slightly better results; in combination with negation handling and/or structured data, however, performance did not improve, and in some cases – for the domain-specific models – performance actually decreased.

Despite these results, it seems likely that terminological resources can be of use when assigning diagnosis codes since medical terms have been found to have a high predictive value for the classification of clinical notes (Jarman and Berndt, 2010). In fact, Larkey and Croft (1996) showed that performance improved when additional weight was assigned to words, phrases and structures that provided the most diagnostic evidence. Perhaps the simple weighting approach assumed in Paper III is not the best way to go, however. Alternative approaches that leverage external terminological resources should be explored in future work.

− How can negation be handled in a distributional semantic framework?

There are naturally many alternative means of accounting for negation in a distributional semantic framework. Here, the approach is similar to that of accounting for paraphrasing: treat negated entities as distinct semantic units, with different distributional profiles from those of their affirmed counterparts. An existing negation detection system (Swedish NegEx) is used. It is, however, not sufficient to attempt to handle negation; it must also be shown to be useful in practice. The proposed means of incorporating negation in a distributional semantic framework is here evaluated on the task of assigning diagnosis codes. In this context – and with a bag-of-words document representation – it is important since negated terms would otherwise mistakenly be positively correlated with assigned diagnosis codes. In this approach, where negated terms are retained but with their proper negated denotation, negated terms are correctly correlated with diagnosis codes. In previous approaches to diagnosis coding – not based on distributional semantics – negation handling has been shown to be important (Pestian et al., 2007). In Paper III, performance on this task with the general model improves with negation handling, from 0.23 to 0.25 recall (top 10). With the smaller, domain-specific models, performance increases by one percentage point in one case and remains the same in the other. As mentioned earlier, this approach relies on sufficiently large amounts of data, which is perhaps an explanation for the reduced impact on the domain-specific models.
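A minimal sketch of the preprocessing step, with a toy trigger list and a fixed-width scope standing in for the Swedish NegEx system (Skeppstedt, 2011) that was actually used:

```python
import re

# Toy trigger list standing in for Swedish NegEx, which uses curated
# Swedish triggers and scope rules (Skeppstedt, 2011).
NEG_TRIGGER = re.compile(r"(ingen|inga|inte|ej|utan)", re.IGNORECASE)

def mark_negations(tokens, scope=3):
    """Prefix tokens inside a detected negation scope so that negated terms
    become distinct semantic units ('feber' vs 'NEG_feber'). The fixed-width
    scope is a simplification of real negation detection."""
    out, remaining = [], 0
    for tok in tokens:
        if NEG_TRIGGER.fullmatch(tok):
            remaining = scope
            out.append(tok)
        elif remaining > 0:
            out.append("NEG_" + tok)
            remaining -= 1
        else:
            out.append(tok)
    return out

print(mark_negations("pat har inte feber eller hosta".split()))
# ['pat', 'har', 'inte', 'NEG_feber', 'NEG_eller', 'NEG_hosta']
```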

− How does dimensionality affect the application of distributional semantics to the assignment of diagnosis codes, particularly as the size of the vocabulary grows large?

There is a correlation between the appropriate dimensionality of a semantic space and properties of the data set from which it is induced, both in theory and in practice, as was demonstrated in Paper IV. In domains where the vocabulary grows large and where there is a large number of contexts – as in the clinical domain – it is particularly important to configure the dimensionality properly. In the distributional semantics approach to diagnosis coding, a document-level context definition is employed; with a large number of documents, it is crucial that the dimensionality is sufficiently large to allow near-orthogonal index vectors to be generated for all contexts. Moreover, with a large number of concepts – which a large vocabulary may be a crude indicator of, although not necessarily so due to the prevalence of misspellings and ad-hoc abbreviations in the clinical domain – the dimensionality needs to be sufficiently large to allow concepts to be readily distinguishable in the semantic space.

The effect of the dimensionality was evaluated on the task of assigning diagnosis codes and the difference in performance, measured as recall (top 10), was as large as eighteen percentage points when modeling multiword terms: from 0.18 with 1,000 dimensions to 0.36 with 10,000 dimensions. With unigram terms, the difference in performance was fourteen percentage points. These results indicate that it may be particularly important to configure the dimensionality when modeling multiword terms as single tokens. In contrast to the study conducted by Sahlgren and Cöster (2004), where the performance – albeit on a different task – did not change greatly with a higher dimensionality, leading the authors to conclude that configuring the dimensionality for RI spaces is not critical (at least not in comparison to SVD), we found that careful dimensionality configuration may yield significant boosts in performance. Major differences in the employed data sets – in terms of the number of documents and the size of the vocabulary – could explain the different conclusions.
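The near-orthogonality argument can be checked empirically: the sketch below measures the average interference (absolute cosine) between randomly generated sparse ternary index vectors at two dimensionalities. The vector count and the number of non-zero elements are arbitrary choices for illustration.

```python
import numpy as np

def mean_abs_cosine(dim, n_vectors=1000, nnz=8, seed=0):
    """Average |cosine| between random sparse ternary index vectors; the
    closer to zero, the more nearly orthogonal the vectors are."""
    rng = np.random.default_rng(seed)
    vecs = np.zeros((n_vectors, dim))
    for v in vecs:
        pos = rng.choice(dim, size=nnz, replace=False)
        v[pos[:nnz // 2]], v[pos[nnz // 2:]] = 1.0, -1.0
    cosines = vecs @ vecs.T / nnz            # every vector has norm sqrt(nnz)
    pairs = n_vectors * (n_vectors - 1) / 2
    return np.abs(np.triu(cosines, k=1)).sum() / pairs

for dim in (1_000, 10_000):
    print(dim, round(mean_abs_cosine(dim), 4))
# Interference shrinks as the dimensionality grows, which is consistent with
# the gap between 1,000 and 10,000 dimensions observed in Paper IV.
```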

IDENTIFICATION OF ADVERSE DRUG REACTIONS

Using distributional semantics and a large clinical corpus to identify adverse drug reactions for a given drug was only preliminarily explored. In Paper V, it was, however, demonstrated that this method can be used to extract distributionally similar drug-symptom pairs, some of which are adverse drug reactions. It is difficult, however, to isolate these instances from those where the symptom is, for instance, an indication for taking a certain drug. Combining distributional semantics with other techniques could be profitable in the task of identifying potential adverse drug reactions from health record narratives.

RESEARCH QUESTIONS

− To what extent can distributional semantics be used to extract potential adverse drug reactions to a given medication among a list of distributionally similar drug-symptom pairs?

In Paper V, almost 50% of the extracted terms were deemed conceivable – in the liberal sense that they could not readily be ruled out as such – by a physician. Around 10% of these were known and documented in FASS. The extent to which the conceivable but undocumented ADRs were actual ADRs cannot currently be gauged. The fact that some of the distributionally similar drug-symptom pairs constitute an ADR relation is, however, sufficiently interesting to merit further exploration of ADR identification in a distributional semantic framework. It should be noted – as was expected – that many other terms were also extracted, including indications and alternative indications. As in the case of synonyms, it is difficult to isolate ADRs from other semantic relations that hold between a drug and a symptom. To our knowledge, distributional semantics has not been exploited in the context of ADR identification. Even if distributional semantics is unlikely to be sufficient in isolation to identify ADRs from clinical text, it seems a promising enough notion to leverage in more sophisticated methods.
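In outline, the extraction step can be as simple as the following sketch: take the nearest neighbours of a drug term in the semantic space, keep those that are symptom terms, and flag the ones not already documented for that drug. The `space` interface and the lexicons are hypothetical stand-ins (in Paper V, documented ADRs were checked against FASS and candidates were reviewed by a physician).

```python
def adr_candidates(drug, space, symptom_lexicon, documented, top_k=30):
    """Flag symptom terms that are distributionally similar to a drug but not
    documented for it: these are the candidates handed over for clinical
    review. `space(term)` returns (neighbour, cosine) pairs, highest first."""
    return [(term, cosine)
            for term, cosine in space(drug)[:top_k]
            if term in symptom_lexicon
            and term not in documented.get(drug, set())]
```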

5.2 MAIN CONTRIBUTIONS

The single most important contribution has been to demonstrate the general utility of distributional semantics in the clinical domain. This is important as it provides an unsupervised method that, by definition, does not rely on annotated data. In the clinical domain – in particular for a relatively under-resourced language like Swedish – this is crucial. Creating annotated resources in this domain very often requires medical experts; needless to say, this is expensive. Distributional semantics can perform fairly well on a range of medical and clinical language processing tasks; it can also be used in conjunction with other, possibly supervised, methods to enhance performance.

Concerning the application of distributional semantics in the clinical domain, there are four main contributions. First, we have addressed the important issue of being able to handle paraphrasing in a distributional semantic framework. By recognizing multiword terms with the C-value statistic and treating them as distinct semantic units, synonymous terms of varying length can be identified. Second, the notion of ensembles of semantic spaces has been introduced. This idea allows different models of distributional semantics, with different model configurations and induced from different data sources, to be combined to obtain enhanced performance. Here, this was demonstrated on the synonym extraction task; however, this idea may be profitable on a range of natural language processing tasks. As such, it is not only a contribution to the clinical domain, but also to the general NLP community. Third, the ability to handle negation in a distributional semantic framework was addressed: a straightforward solution whereby negated entities are pre-identified and treated distinctly from their affirmed counterparts was proposed. Fourth, the importance of configuring the dimensionality of semantic spaces, particularly when the vocabulary grows large, as it often does in the clinical domain, has been demonstrated.

5.3 FUTURE DIRECTIONS

In the studies covered in this thesis, distributional semantics was applied directly to perform a given task. For some tasks, this approach may be profitable; however, it should be noted that models of distributional semantics are not primarily intended to be used as classifiers. They are rather a means of representing lexical semantics and quantifying semantic similarity between linguistic units. Perhaps a more natural application of distributional semantics for tasks that are not directly concerned with term-term relations is to exploit this semantic representation in an existing machine learning-based classifier. As properties of words are often used as features in machine learning-based approaches to NLP tasks, exploiting latent features such as the semantics of a term may be even more profitable. This is a line of research that has begun to be explored in the medical domain, for instance in the context of named entity recognition and domain adaptation (Jonnalagadda et al., 2012). For the task of assigning diagnosis codes, for instance, this appears to be a reasonable approach: using (distributional) semantic features of the unstructured part of the EHR, in conjunction with a range of structured EHR data.

Some key issues in distributional semantics were addressed here: (1) paraphrasing and (2) negation. These are arguably especially important issues in the clinical domain, where multiword terms and negations are prevalent, and need to be further explored. Modeling multiword terms as distinct semantic units ignores the fact that they are composed of constituent words. Semantic composition is an important research area (Mitchell and Lapata, 2008; Baroni and Zamparelli, 2010; Guevara, 2010; Grefenstette and Sadrzadeh, 2011); working on this problem in the context of clinical text would be very interesting. This approach to handling negation suffers from similar deficiencies: the fact that affirmed and negated terms are otherwise semantically identical is ignored, which also entails that the statistical evidence is shared between the two polarities. Investigating more sophisticated vector operations that can capture the function of negation would perhaps be more profitable.
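One such operation from the literature is the orthogonal negation of Widdows (2003), which removes from a term's vector its component along the direction of the negated term; a minimal sketch:

```python
import numpy as np

def orthogonal_negation(a, b):
    """Vector for 'a NOT b' (Widdows, 2003): remove from a its component
    along b, leaving a vector orthogonal to the negated meaning."""
    b = b / np.linalg.norm(b)
    return a - (a @ b) * b

a, b = np.array([2.0, 1.0, 0.0]), np.array([1.0, 0.0, 0.0])
print(orthogonal_negation(a, b))   # [0. 1. 0.] -- no component left along b
```

Whether an operation of this kind captures clinical negation better than token-level marking remains an open question.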

Finally, the notion of ensembles of semantic spaces seems promising and should be explored further. Here, only four semantic spaces were combined; however, in many ensemble methods, it is not uncommon for hundreds or even thousands of models to be combined. There is definitely scope for exploiting the ensemble methodology in a distributional semantic framework. This could prove profitable on a range of general and clinical language processing tasks.

REFERENCES

Helen Allvin, Elin Carlsson, Hercules Dalianis, Riitta Danielsson-Ojala, Vidas Daudaravičius, Martin Hassel, Dimitrios Kokkinakis, Heljä Lundgrén-Laine, Gunnar H Nilsson, Øystein Nytrø, et al. 2011. Characteristics of Finnish and Swedish intensive care nursing narratives: a comparative analysis to support the development of clinical language technologies. Journal of Biomedical Semantics, 2(Suppl 3):S1.
Ethem Alpaydin. 2010. Introduction to Machine Learning. MIT Press, Cambridge, MA. ISBN 978-0-262-01243-0.
Mohammad Homam Alsun, Laurent Lecornu, Basel Solaiman, Clara Le Guillou, and Jean Michel Cauvin. 2010. Medical diagnosis by possibilistic classification reasoning. In Proceedings of the 13th Conference on Information Fusion (FUSION), pages 1–7. IEEE.
Ion Androutsopoulos and Prodromos Malakasiotis. 2010. A Survey of Paraphrasing and Textual Entailment Methods. Journal of Artificial Intelligence Research, 38:135–187.
Hiroko Ao and Toshihisa Takagi. 2005. ALICE: An Algorithm to Extract Abbreviations from MEDLINE. Journal of the American Medical Informatics Association, 12(5):576–586. ISSN 1067-5027.
Eiji Aramaki, Yasuhide Miura, Masatsugu Tonoike, Tomoko Ohkuma, Hiroshi Masuichi, Kayo Waki, and Kazuhiko Ohe. 2010. Extraction of Adverse Drug Effects from Clinical Records. Stud Health Technol Inform, 160(1):739–43.
Alan R Aronson, Olivier Bodenreider, Dina Demner-Fushman, Kin Wah Fung, Vivian K Lee, James G Mork, Aurélie Névéol, Lee Peters, and Willie J Rogers. 2007. From Indexing the Biomedical Literature to Coding Clinical Text: Experience with MTI and Machine Learning Approaches. In Proceedings of BioNLP: Biological, translational, and clinical language processing, pages 105–112. Association for Computational Linguistics.
Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation, and Use of the Ngram Statistic Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), pages 370–381.

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1183–1193. Association for Computational Linguistics.

David W Bates, R Scott Evans, Harvey Murff, Peter D Stetson, Lisa Pizziferri, and George Hripcsak. 2003. Detecting Adverse Events Using Information Technology. Journal of the American Medical Informatics Association, 10(2):115–128.
Adrian Benton, Lyle Ungar, Shawndra Hill, Sean Hennessy, Jun Mao, Annie Chung, Charles E Leonard, and John H Holmes. 2011. Identifying potential adverse effects using the web: A new approach to medical hypothesis generation. Journal of Biomedical Informatics, 44(6):989–996.
David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.
Vincent D Blondel, Anahí Gajardo, Maureen Heymans, Pierre Senellart, and Paul Van Dooren. 2004. A measure of similarity between graph vertices: Applications to synonym extraction and web searching. SIAM Review, 46(4):647–666. ISSN 00361445. URL http://www.jstor.org/stable/20453570.
Henrik Boström, Sten F Andler, Marcus Brohede, Ronnie Johansson, Alexander Karlsson, Joeri van Laere, Lars Niklasson, Marie Nilsson, Anne Persson, and Tom Ziemke. 2007. On the Definition of Information Fusion as a Field of Research. Institutionen för kommunikation och information: IKI Technical Reports (HS-IKI-TR-07-006).
Staffan Cederblom. 2005. Medicinska förkortningar och akronymer (In Swedish). Studentlitteratur, Lund. ISBN 91-44-03393-1.

Jeffrey T Chang, Hinrich Schütze, and Russ B Altman. 2002. Creating an online dictionary of abbreviations from MEDLINE. Journal of the American Medical Informatics Association, 9:612–620.
Wendy W Chapman, Will Bridewell, Paul Hanbury, Gregory F Cooper, and Bruce G Buchanan. 2001. Evaluation of negation phrases in narrative clinical reports. In Proceedings of the AMIA Symposium, page 105. American Medical Informatics Association.

Emmanuel Chazard. 2011. Automated detection of adverse drug events by data mining of electronic health records. PhD thesis, Université du Droit et de la Santé-Lille II.
Ping Chen, Araly Barrera, and Chris Rhodes. 2010. Semantic analysis of free text and its application on automatically assigning ICD-9-CM codes to patient records. In Cognitive Informatics (ICCI), 2010 9th IEEE International Conference on, pages 68–74. IEEE.
Christopher G Chute and Yiming Yang. 1992. An evaluation of concept based latent semantic indexing for clinical information retrieval. In Proceedings of the Annual Symposium on Computer Application in Medical Care, page 639. American Medical Informatics Association.

Christopher G Chute, Yiming Yang, and DA Evans. 1991. Latent semantic indexing of medical diagnoses using UMLS semantic structures. In Proceedings of the Annual Symposium on Computer Application in Medical Care, page 185. American Medical Informatics Association.

Aaron M Cohen and William R Hersh. 2005. A survey of current work in biomedical text mining. Briefings in Bioinformatics, 6(1):57–71.
Aaron M Cohen, William R Hersh, C Dubay, and K Spackman. 2005. Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts. BMC Bioinformatics, 6(1):103. ISSN 1471-2105. URL http://www.biomedcentral.com/1471-2105/6/103.
Trevor Cohen, Brett Blatter, and Vimla Patel. 2008. Simulating expert clinical comprehension: Adapting latent semantic analysis to accurately extract clinical concepts from psychiatric narrative. Journal of Biomedical Informatics, 41(6):1070–1087.

Trevor Cohen, Roger Schvaneveldt, and Dominic Widdows. 2010. Reflective random indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2):240–256.
Trevor Cohen and Dominic Widdows. 2009. Empirical distributional semantics: Methods and biomedical applications. Journal of Biomedical Informatics, 42(2):390–405. ISSN 1532-0464. URL http://www.sciencedirect.com/science/article/pii/S1532046409000227.

Mike Conway and Wendy Chapman. 2012. Discovering lexical instantiations of clinical concepts using web services, WordNet and corpus resources. In AMIA Annual Symposium Proceedings, page 1604.
Koby Crammer, Mark Dredze, Kuzman Ganchev, Partha Pratim Talukdar, and Steven Carroll. 2007. Automatic code assignment to medical text. In Proceedings of BioNLP: Biological, translational, and clinical language processing, pages 129–136. Association for Computational Linguistics.
D Alan Cruse. 1986. Lexical semantics. Cambridge University Press.
James R Curran. 2002. Ensemble methods for automatic thesaurus extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 222–229. Association for Computational Linguistics.
Hercules Dalianis, Martin Hassel, Aron Henriksson, and Maria Skeppstedt. 2012. Stockholm EPR Corpus: A Clinical Database Used to Improve Health Care. In Proceedings of the Swedish Language Technology Conference (SLTC).

Hercules Dalianis, Martin Hassel, and Sumithra Velupillai. 2009. The Stockholm EPR Corpus – Characteristics and Some Initial Findings. In Proceedings of the 14th International Symposium on Health Information Management Research (ISHIMR), pages 1–7.
Dana Dannélls. 2006. Automatic acronym recognition. In Proceedings of the 11th conference on European chapter of the Association for Computational Linguistics (EACL).
Vidas Daudaravičius. 2010. The influence of collocation segmentation and top 10 items to keyword assignment performance. In Computational Linguistics and Intelligent Text Processing, pages 648–660. Springer.

Luciano RS de Lima, Alberto HF Laender, and Berthier A Ribeiro-Neto. 1998. A hierarchical approach to the automatic categorization of medical documents. In Proceedings of the Seventh International Conference on Information and Knowledge Management, pages 132–139. ACM.
Scott C Deerwester, Susan T Dumais, Thomas K Landauer, George W Furnas, and Richard A Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Thomas G Dietterich. 2000. Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, pages 1–15. Springer-Verlag.
Robert Eriksson, Peter B Jensen, Sune Frankild, Lars J Jensen, and Søren Brunak. 2013. Dictionary construction and identification of possible adverse drug events in Danish clinical narrative text. Journal of the American Medical Informatics Association. URL http://jamia.bmj.com/content/early/2013/05/22/amiajnl-2013-001708.abstract.
Jung-Wei Fan and Carol Friedman. 2007. Semantic classification of biomedical concepts using distributional similarity. Journal of the American Medical Informatics Association, 14(4):467–477.

Richárd Farkas and György Szarvas. 2008. Automatic construction of rule-based ICD-9-CM coding systems. BMC Bioinformatics, 9(Suppl 3):S10.
Jose C Ferrao, Monica D Oliveira, Filipe Janela, and Henrique MG Martins. 2012. Clinical coding support based on structured data stored in electronic health records. In Proceedings of Bioinformatics and Biomedicine Workshops (BIBMW), pages 790–797. IEEE.
John R Firth. 1957. A synopsis of linguistic theory, 1930-1955. In F. Palmer, editor, Selected Papers of J. R. Firth 1952–59, pages 168–205. Longman, London, UK.

Katerina Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the Conference on Computational Linguistics (COLING), pages 41–46.
Katerina Frantzi, Sophia Ananiadou, and Hideki Mima. 2000. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3(2):115–130.
Carol Friedman and George Hripcsak. 1999. Evaluating natural language processors in the clinical domain. Methods Inf. Med., 37:334–344.
Carol Friedman, Pauline Kra, and Andrey Rzhetsky. 2002. Two biomedical sublanguages: a description based on the theories of Zellig Harris. Journal of Biomedical Informatics, 35(4):222–235.

Madhavi K Ganapathiraju, Judith Klein-Seetharaman, N Balakrishnan, and Raj Reddy. 2004. Characterization of protein secondary structure. Signal Processing Magazine, IEEE, 21(3):78–87.
Sylvain Gaudan, Harald Kirsch, and Dietrich Rebholz-Schuhmann. 2005. Resolving abbreviations to their senses in MEDLINE. Bioinformatics, 21(18):3658–3664. ISSN 1367-4803.
Ira Goldstein, Anna Arzumtsyan, and Özlem Uzuner. 2007. Three approaches to automatic assignment of ICD-9-CM codes to radiology reports. In AMIA Annual Symposium Proceedings, volume 2007, page 279. American Medical Informatics Association.

Michael D Gordon and Susan Dumais. 1998. Using latent semantic indexing for literature based discovery. Journal of the American Society for Information Science, 49:674–685.
Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011. Experimental support for a categorical compositional distributional model of meaning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1394–1404. Association for Computational Linguistics.
Thomas L Griffiths and Mark Steyvers. 2002. A probabilistic approach to semantic representation. In Proceedings of the 24th annual conference of the cognitive science society, pages 381–386. Citeseer.

Emiliano Guevara. 2010. A regression model of adjective-noun compositionality in distributional semantics. In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, pages 33–37. Association for Computational Linguistics.
Zellig S Harris. 1954. Distributional structure. Word, 10:146–162.

Martin Hassel. 2004. JavaSDM package. http://www.nada.kth.se/~xmartin/java/. KTH School of Computer Science and Communication, Stockholm, Sweden. Accessed: September 30, 2013.
Martin Hassel, Aron Henriksson, and Sumithra Velupillai. 2011. Something Old, Something New: Applying a Pre-trained Parsing Model to Clinical Swedish. In 18th Nordic Conference of Computational Linguistics NODALIDA 2011. Northern European Association for Language Technology (NEALT).

Marti Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING 1992, pages 539–545.
Aron Henriksson and Martin Hassel. 2011. Election of Diagnosis Codes: Words as Responsible Citizens. In Proceedings of the 3rd International Louhi Workshop on Health Document Text Mining and Information Analysis, pages 67–74. CEUR-WS.org.
Aron Henriksson, Martin Hassel, and Maria Kvist. 2011. Diagnosis Code Assignment Support Using Random Indexing of Patient Records – A Qualitative Feasibility Study. In Artificial Intelligence in Medicine, pages 348–352. Springer.
Aron Henriksson, Hans Moen, Maria Skeppstedt, Ann-Marie Eklund, Vidas Daudaravičius, and Martin Hassel. 2012. Synonym Extraction of Medical Terms from Clinical Text Using Combinations of Word Space Models. In Proceedings of the 5th International Symposium on Semantic Mining in Biomedicine (SMBM).
Aron Henriksson, Maria Skeppstedt, Maria Kvist, Martin Duneld, and Mike Conway. 2013. Corpus-Driven Terminology Development: Populating Swedish SNOMED CT with Synonyms Extracted from Electronic Health Records. In Proceedings of BioNLP. Association for Computational Linguistics.
Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57. ACM.
Ramin Homayouni, Kevin Heinrich, Lai Wei, and Michael W Berry. 2005. Gene clustering by latent semantic indexing of MEDLINE abstracts. Bioinformatics, 21(1):104–115.
IHTSDO. 2013a. International Health Terminology Standards Development Organisation: SNOMED CT. URL http://www.ihtsdo.org/snomed-ct/. Accessed: September 30, 2013.
IHTSDO. 2013b. International Health Terminology Standards Development Organisation: Supporting Different Languages. URL http://www.ihtsdo.org/snomed-ct/snomed-ct0/different-languages/. Accessed: September 30, 2013.

Niklas Isenius, Sumithra Velupillai, and Maria Kvist. 2012. Initial Results in the Development of SCAN: a Swedish Clinical Abbreviation Normalizer. In Proceedings of the CLEF 2012 Workshop on Cross-Language Evaluation of Methods, Applications, and Resources for eHealth Document Analysis (CLEFeHealth2012). CLEF.

Jay Jarman and Donald J Berndt. 2010. Throw the Bath Water Out, Keep the Baby: Keeping Medically-Relevant Terms for Text Mining. In AMIA Annual Symposium Proceedings, volume 2010, page 336. American Medical Informatics Association.
Peter B Jensen, Lars J Jensen, and Søren Brunak. 2012. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6):395–405.
William B Johnson and Joram Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary mathematics, 26(189-206):1.
Michael N Jones and Douglas JK Mewhort. 2007. Representing word meaning and order information in a composite holographic lexicon. Psychological review, 114(1):1.
Siddhartha Jonnalagadda, Trevor Cohen, Stephen Wu, and Graciela Gonzalez. 2012. Enhancing clinical concept extraction with distributional semantics. Journal of Biomedical Informatics, 45(1):129–140.

Dan Jurafsky, James H Martin, Andrew Kehler, Keith Vander Linden, and Nigel Ward. 2000. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition, volume 2. MIT Press.
Pentti Kanerva, Jan Kristofersson, and Anders Holst. 2000. Random indexing of text samples for latent semantic analysis. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society, page 1036.
Jussi Karlgren, Anders Holst, and Magnus Sahlgren. 2008. Filaments of meaning in word space. In Advances in Information Retrieval, pages 531–538. Springer.

Jussi Karlgren and Magnus Sahlgren. 2001. From words to understanding. Foundations of Real-World Intelligence, pages 294–308.

Karolinska Institutet Universitetsbiblioteket. 2013. Karolinska Institutet: About the Karolinska Institute MeSH resource (In Swedish). URL http://mesh.kib.ki.se/swemesh/about_en.cfm. Accessed: September 30, 2013.
Samuel Kaski. 1998. Dimensionality reduction by random mapping: Fast similarity computation for clustering. In Neural Networks Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference on, volume 1, pages 413–418. IEEE.
Alla Keselman, Laura Slaughter, Catherine Arnott-Smith, Hyeoneui Kim, Guy Divita, Allen Browne, Christopher Tsai, and Qing Zeng-Treitler. 2007. Towards Consumer-Friendly PHRs: Patients' Experience with Reviewing Their Health Records. In Proc. AMIA Annual Symposium, pages 399–403.
Ola Knutsson, Johnny Bigert, and Viggo Kann. 2003. A Robust Shallow Parser for Swedish. In Proceedings of the Nordic Conference of Computational Linguistics (NODALIDA).
Dimitrios Kokkinakis. 2012. The Journal of the Swedish Medical Association - a Corpus Resource for Biomedical Text Mining in Swedish. In Proceedings of the Third Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM).
Bevan Koopman, Guido Zuccon, Peter Bruza, Laurianne Sitbon, and Michael Lawley. 2012. An evaluation of corpus-driven measures of medical concept similarity for information retrieval. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 2439–2442. ACM.
Läkemedelsindustriföreningens Service AB. 2013. FASS.se (In Swedish). http://www.fass.se. Accessed: September 30, 2013.
Thomas K Landauer and Susan T Dumais. 1997. A solution to Plato's problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104(2):211.
D Lang. 2007. Consultant report – natural language processing in the health care industry. Cincinnati Children's Hospital Medical Center.
Leah S Larkey and W Bruce Croft. 1996. Combining classifiers in text categorization. In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 289–297. ACM.

Laurent Lecornu, Clara Le Guillou, Frédéric Le Saux, Matthieu Hubert, John Puentes, Julien Montagner, and Jean Michel Cauvin. 2011. Information fusion for diagnosis coding support. In Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE, pages 3176–3179. IEEE.

Laurent Lecornu, G Thillay, Clara Le Guillou, PJ Garreau, P Saliou, H Jantzem, John Puentes, and Jean Michel Cauvin. 2009. Referocod: a probabilistic method to medical coding support. In Engineering in Medicine and Biology Society, 2009. EMBC 2009. Annual International Conference of the IEEE, pages 3421–3424. IEEE.
Paea LePendu, Srinivasan V Iyer, Anna Bauer-Mehren, Rave Harpaz, Jonathan M Mortensen, Tanya Podchiyska, Todd A Ferris, and Nigam H Shah. 2013. Pharmacovigilance using clinical notes. Clinical Pharmacology & Therapeutics.
Paea LePendu, Srinivasan V Iyer, Cédrick Fairon, Nigam H Shah, et al. 2012. Annotation analysis for testing drug safety signals using unstructured clinical notes. J Biomed Semantics, 3(Suppl 1):S5.

Gondy Leroy and Hsinchun Chen. 2001. Meeting medical terminology needs - the ontology-enhanced medical concept mapper. IEEE Transactions on Information Technology in Biomedicine, 5(4):261–270.
Gondy Leroy, James E Endicott, Obay Mouradi, David Kauchak, and Melissa L Just. 2012. Improving perceived and actual text difficulty for health information consumers using semi-automated methods. AMIA Annu Symp Proc, 2012:522–31.
Dekang Lin and Patrick Pantel. 2001. DIRT - discovery of inference rules from text. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 323–328. ACM.
Hongfang Liu, Alan R Aronson, and Carol Friedman. 2002. A study of abbreviations in MEDLINE abstracts. Proc AMIA Symp, pages 464–8.
Hongfang Liu, Yves A Lussier, and Carol Friedman. 2001. Disambiguating ambiguous biomedical terms in biomedical narrative text: an unsupervised method. Journal of Biomedical Informatics, 34(4):249–261.

Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2):203–208.
Yuji Matsumoto. 2003. Lexical Knowledge Acquisition. In Ruslan Mitkov, editor, The Oxford Handbook of Computational Linguistics. Oxford University Press, New York.
John McCrae and Nigel Collier. 2008. Synonym set extraction from the biomedical literature by lexical pattern discovery. BMC Bioinformatics, 9:159.
Stéphane M Meystre, Guergana K Savova, Karin C Kipper-Schuler, and John F Hurdle. 2008. Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inform, 35:128–44.
Jeff Mitchell and Mirella Lapata. 2008. Vector-based Models of Semantic Composition. In Association for Computational Linguistics, pages 236–244.
Ruslan Mitkov. 2003. The Oxford Handbook of Computational Linguistics. Oxford University Press, New York.
Dana Movshovitz-Attias and William W Cohen. 2012. Alignment-HMM-based Extraction of Abbreviations from Biomedical Text. In Proceedings of the 2012 Workshop on Biomedical Natural Language Processing (BioNLP 2012), pages 47–55, Montréal, Canada, June 2012. Association for Computational Linguistics.

Harvey J Murff, Vimla L Patel, George Hripcsak, and David W Bates. 2003. Detecting adverse events for patient safety research: a review of current methodologies. Journal of Biomedical Informatics, 36(1):131–143.
Michael D Myers. 1997. Qualitative research in information systems. Management Information Systems Quarterly, 21:241–242.
Michael D Myers. 2009. Qualitative Research in Business & Management. SAGE Publications Ltd, London, U.K. ISBN 9781412921664.
Kotaro Nakayama, Takahiro Hara, and Shojiro Nishio. 2007. Wikipedia mining for an association web thesaurus construction. In Web Information Systems Engineering–WISE 2007, pages 322–334. Springer.

Kimberly J O'Malley, Karon F Cook, Matt D Price, Kimberly Raiford Wildes, John F Hurdle, and Carol M Ashton. 2005. Measuring diagnoses: ICD code accuracy. Health Services Research, 40(5p2):1620–1639.
Sebastian Padó and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199.
Serguei VS Pakhomov, James D Buntrock, and Christopher G Chute. 2006. Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques. Journal of the American Medical Informatics Association, 13(5):516–525.
Christos H Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, and Santosh Vempala. 1998. Latent semantic indexing: A probabilistic analysis. In Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 159–168. ACM.
Ted Pedersen, Serguei VS Pakhomov, Siddharth Patwardhan, and Christopher G Chute. 2007. Measures of semantic similarity and relatedness in the biomedical domain. Journal of Biomedical Informatics, 40(3):288–299.
Yves Peirsman and Dirk Geeraerts. 2009. Predicting strong associations on the basis of corpus data. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL '09, pages 648–656, Stroudsburg, PA, USA. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1609067.1609139.
John P Pestian, Christopher Brew, Paweł Matykiewicz, DJ Hovermale, Neil Johnson, K Bretonnel Cohen, and Włodzisław Duch. 2007. A shared task involving multi-label classification of clinical free text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, pages 97–104. Association for Computational Linguistics.
John Puentes, Julien Montagner, Laurent Lecornu, and Jean-Michel Cauvin. 2013. Information quality measurement of medical encoding support based on usability. Computer Methods and Programs in Biomedicine, in print. ISSN 0169-2607. URL http://www.sciencedirect.com/science/article/pii/S0169260713002538.
Philip Resnik and Jimmy Lin. 2013. Evaluation of NLP Systems. In Alexander Clark, Chris Fox, and Shalom Lappin, editors, The Handbook of Computational Linguistics and Natural Language Processing. Wiley-Blackwell, West Sussex.

John I Saeed. 1997. Semantics. Blackwell Publishers, Oxford. ISBN 0-631-20034-7.
Mohammed Saeed, Mauricio Villarroel, Andrew T Reisner, Gari Clifford, Li-Wei Lehman, George Moody, Thomas Heldt, Tin H Kyaw, Benjamin Moody, and Roger G Mark. 2011. Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): A public-access intensive care unit database. Critical Care Medicine, 39(5):952.
Magnus Sahlgren. 2005. An introduction to random indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE, volume 5.

Magnus Sahlgren. 2006. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD thesis, Stockholm University.
Magnus Sahlgren. 2008. The distributional hypothesis. Italian Journal of Linguistics, 20(1):33–54.

Magnus Sahlgren and Rickard Cöster. 2004. Using bag-of-concepts to improve the performance of support vector machines in text categorization. In Proceedings of the 20th international conference on Computational Linguistics, page 487. Association for Computational Linguistics.
Magnus Sahlgren, Anders Holst, and Pentti Kanerva. 2008. Permutations as a means to encode order in word space. In Proceedings of the 30th Annual Meeting of the Cognitive Science Society, pages 1300–1305.
Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620.

Steven L Salzberg. 1997. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1(3):317–328.
Hinrich Schütze. 1993. Word space. In Advances in Neural Information Processing Systems 5.
Ariel S Schwartz and Marti A Hearst. 2003. A simple algorithm for identifying abbreviation definitions in biomedical text. Pacific Symposium on Biocomputing, pages 451–462. ISSN 1793-5091.

Maria Skeppstedt. 2011. Negation detection in Swedish clinical text: An adaption of NegEx to Swedish. Journal of Biomedical Semantics, 2(Suppl 3):S3.
Maria Skeppstedt, Hercules Dalianis, and Gunnar H Nilsson. 2011. Retrieving disorders and findings: Results using SNOMED CT and NegEx adapted for Swedish. In Proceedings of the 3rd International Louhi Workshop on Health Document Text Mining and Information Analysis.
Socialstyrelsen. 2006. Diagnosgranskningar utförda i Sverige 1997–2005 samt råd inför granskning (In Swedish). URL http://www.socialstyrelsen.se/Lists/Artikelkatalog/Attachments/9740/2006-131-30_200613131.pdf. Accessed: September 30, 2013.
Socialstyrelsen. 2010. Internationell statistisk klassifikation av sjukdomar och hälsoproblem – Systematisk förteckning, Svensk version 2011 (ICD-10-SE) (In Swedish), volume 2. Edita Västra Aros.
Socialstyrelsen. 2010. Kodningskvalitet i patientregistret, Slutenvård 2008 (In Swedish). URL http://www.socialstyrelsen.se/Lists/Artikelkatalog/Attachments/18082/2010-6-27.pdf. Accessed: September 30, 2013.
Socialstyrelsen. 2013. Diagnoskoder (ICD-10) (In Swedish). URL http://www.socialstyrelsen.se/klassificeringochkoder/diagnoskoder. Accessed: September 30, 2013.
Mary H Stanfill, Margaret Williams, Susan H Fenton, Robert A Jenders, and William R Hersh. 2010. A systematic literature review of automated clinical coding and classification systems. Journal of the American Medical Informatics Association, 17(6):646–651.
Gary W Stuart and Michael W Berry. 2003. A comprehensive whole genome bacterial phylogeny using correlated peptide motifs defined in a high dimensional vector space. Journal of Bioinformatics and Computational Biology, 1(03):475–493.
Hanna Suominen, Filip Ginter, Sampo Pyysalo, Antti Airola, Tapio Pahikkala, S Salanter, and Tapio Salakoski. 2008. Machine learning to automate the assignment of diagnosis codes to free-text radiology reports: a method description. In Proceedings of the ICML/UAI/COLT Workshop on Machine Learning for Health-Care Applications.

Sveriges riksdag. 2013a. Sveriges riksdag: Lag (2003:460) om etikprövning av forskning som avser människor (In Swedish). URL http://www.riksdagen.se/sv/Dokument-Lagar/Ovriga-dokument/Ovrigt-dokument/_sfs-2003-460/. Accessed: September 30, 2013.
Sveriges riksdag. 2013b. Sveriges riksdag: Patientdatalag (2008:355) (In Swedish). URL http://www.riksdagen.se/sv/Dokument-Lagar/Lagar/Svenskforfattningssamling/Patientdatalag-2008355_sfs-2008-355/. Accessed: September 30, 2013.
Sveriges riksdag. 2013c. Sveriges riksdag: Personuppgiftslag (1998:204) (In Swedish). URL http://www.riksdagen.se/sv/Dokument-Lagar/Lagar/Svenskforfattningssamling/Personuppgiftslag-1998204_sfs-1998-204/. Accessed: September 30, 2013.
Peter D Turney. 2007. Empirical evaluation of four tensor decomposition algorithms. Technical report, Institute for Information Technology, National Research Council of Canada. Technical Report ERB-1152.

Peter D Turney and Michael L Littman. 2005. Corpus-based learning of analogies and semantic relations. Machine Learning, 60(1-3):251–278.
Peter D Turney and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

U.S. National Library of Medicine. 2013. MeSH (Medical Subject Headings). http://www.ncbi.nlm.nih.gov/mesh.
Özlem Uzuner, Brett R South, Shuying Shen, and Scott L DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc, 18(5):552–556.

Lonneke van der Plas and Jörg Tiedemann. 2006. Finding synonyms using automatic word alignment and measures of distributional similarity. In Proceedings of the COLING/ACL on Main Conference Poster Sessions, COLING-ACL '06, pages 866–873, Stroudsburg, PA, USA. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1273073.1273184.

Cornelis Joost van Rijsbergen. 2004. The geometry of information retrieval. Cambridge University Press.

Ellen M Voorhees. 2002. The philosophy of information retrieval evaluation. In Evaluation of cross-language information retrieval systems, pages 355–370. Springer.
Xiaoyan Wang, Herbert Chase, Marianthi Markatou, George Hripcsak, and Carol Friedman. 2010. Selecting information in electronic health records for knowledge acquisition. Journal of Biomedical Informatics, 43(4):595–601.
Pernille Warrer, Ebba Holme Hansen, Lars J Jensen, and Lise Aagaard. 2012. Using text-mining techniques in electronic patient records to identify ADRs from medicine use. British Journal of Clinical Pharmacology, 73(5):674–684.
Karin Wester, Anna K Jönsson, Olav Spigset, Henrik Druid, and Staffan Hägg. 2008. Incidence of fatal adverse drug reactions: a population based study. British Journal of Clinical Pharmacology, 65(4):573–579.
Dominic Widdows. 2003. Orthogonal negation in vector spaces for modelling word-meanings and document retrieval. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 136–143. Association for Computational Linguistics.
Ludwig Wittgenstein. 1958. Philosophical investigations. Blackwell Oxford.
David H Wolpert. 1995. The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework. In The Mathematics of Generalization.

World Health Organization. 2013. International Classification of Diseases. URL http://www.who.int/classifications/icd/en/. Accessed: September 30, 2013.
Hua Wu and Ming Zhou. 2003. Optimizing synonym extraction using monolingual and bilingual resources. In Proceedings of the Second International Workshop on Paraphrasing - Volume 16, PARAPHRASE '03, pages 72–79, Stroudsburg, PA, USA. Association for Computational Linguistics. URL http://dx.doi.org/10.3115/1118984.1118994.
H Yu, G Hripcsak, and C Friedman. 2002. Mapping abbreviations to full forms in biomedical articles. Journal of the American Medical Informatics Association, 9(3):262–272. ISSN 1067-5027.

Hong Yu and Eugene Agichtein. 2003. Extracting synonymous gene and protein terms from biological literature. Bioinformatics, 1(19):340–349.

George Yule. 2010. The Study of Language. Cambridge University Press, New York, USA. ISBN 978-0-521-74922-0.

Qing T Zeng, Doug Redd, Thomas Rindflesch, and Jonathan Nebeker. 2012. Synonym, topic model and predicate-based query expansion for retrieving clinical documents. In AMIA Annual Symposium Proceedings, pages 1050–1059.

Ziqi Zhang, José Iria, Christopher Brewster, and Fabio Ciravegna. 2008. A comparative evaluation of term recognition algorithms. In Proceedings of Language Resources and Evaluation (LREC).

INDEX

abbreviation, 24, 25, 44, 78
acronym, 25
adverse drug event (ADE), 30, 31
adverse drug reaction (ADR), 30, 31, 56, 82
adverse event (AE), 30, 31
bag of words, 17, 19
baseline, 36, 58
biomedical text, 2, 10
c-value, 46, 50, 76, 84
clinical language processing, 9, 10, 72, 83, 85
clinical narrative, 2, 21, 25, 28, 31
clinical text, 10, 21, 25, 31
collocation, 46, 55
confidence interval, 39
context definition, 12, 14–19, 24
context vector, 14, 16–19, 21, 48
corpus, 40
CoSegment, 44, 47
cosine similarity, 48
Cranfield paradigm, 34, 35
cross-validation, 39
curse of dimensionality, 15, 80
data sparsity, 15, 46, 72, 73
development set, 39
diagnosis coding, 26, 79–81
Dice score, 47
dimensionality, 15–19, 55, 72, 81, 84
dimensionality reduction, 15–17
distributional hypothesis, 12
distributional semantics, 9, 11–14, 17, 19–21, 23, 24, 40, 71–75, 77–79, 81–85
ensemble learning, 23
ensembles of semantic spaces, 52, 73, 78, 84
error analysis, 40
evaluation set, 39
F1-score, 37
F-measure, 37
false negative, 37
false positive, 37
FASS, 42, 44, 83
Firth, John, 13
generalization, 39
gold standard, 35
Granska Tagger, 44, 46
Harris, Zellig, 12, 13
hyperspace analogue to language (HAL), 13
hypothesis testing, 34
ICD, 21, 26–29, 42, 43


index vector, 16–19, 48
inductive bias, 39
JavaSDM, 44, 48
Johnson-Lindenstrauss lemma, 17
Läkartidningen, 42
latent dirichlet allocation (LDA), 14, 24
latent semantic analysis (LSA), 13, 15, 16, 20, 21
latent semantic indexing (LSI), 15, 20
lemmatization, 19, 45
machine learning, 2, 10, 11, 21, 23, 25, 29, 84
MeSH, 42, 43, 75
MIMIC-II, 41
multi-class, multi-label problem, 37
multiword term, 72–76, 82, 85
n-gram, 46
near-orthogonality, 16–19, 72, 82
negation, 30, 47, 73, 81, 84, 85
NegEx, 44, 47, 81
no free lunch theorem, 40
overfitting, 39
parameter tuning, 39
paraphrasing, 22, 73, 84
pharmacovigilance, 30, 31
positive predictive value (PPV), 37
precision, 37, 38
probabilistic LSA (pLSA), 14
probabilistic models, 13, 14
probabilistic topic model, 14
random indexing (RI), 13, 16–20, 48
random mapping, 17
random permutation (RP), 18, 20
random projection, 17
recall, 37, 38
recall top j, 38
reference standard, 35, 36, 42, 75
reflective random indexing, 20
reliability, 39
repeatability, 58
rule-based method, 2, 10, 23–25, 28, 31
Saussure, Ferdinand de, 12
semantic composition, 73, 77, 85
semantics, 2, 11, 13, 21
sensitivity, 37
significance testing, 34, 39, 58
similarity metric, 48
singular value decomposition (SVD), 15, 20, 82
sliding window, 14–19
SNOMED CT, 21, 42, 43, 50, 51, 57, 74–76, 80
spatial models, 13–15
stemming, 45
Stockholm EPR Corpus, 41, 57
stop word, 19, 45
structuralism, 12
synonymy, 9, 10, 12, 18, 20–25, 30, 74, 75, 77
system evaluation, 34, 35, 39
term normalization, 45
TextNSP, 44, 46
tokenization, 45
training set, 39
true negative, 37
true positive, 37, 38
underfitting, 39

user-based evaluation, 35
weighted precision, 38
Wittgenstein, Ludwig, 13
wordspace, 13

PAPERS


PAPER I

IDENTIFYING SYNONYMY BETWEEN SNOMED CLINICAL TERMS OF VARYING LENGTH USING DISTRIBUTIONAL ANALYSIS OF ELECTRONIC HEALTH RECORDS

Aron Henriksson, Mike Conway, Martin Duneld and Wendy W. Chapman (2013). Identifying Synonymy between SNOMED Clinical Terms of Varying Length Using Distributional Analysis of Electronic Health Records. To appear in Proceedings of the Annual Symposium of the American Medical Informatics Association (AMIA), Washington DC, USA.

Author Contributions

Aron Henriksson was responsible for coordinating the study, its overall design and for carrying out the experiments. The proposed method was designed and implemented by Aron Henriksson and Mike Conway, with input from Martin Duneld and Wendy Chapman. The manuscript was written by Aron Henriksson and Mike Conway. All authors were involved in reviewing and improving the manuscript.

Identifying Synonymy between SNOMED Clinical Terms of Varying Length Using Distributional Analysis of Electronic Health Records

Aron Henriksson, MS (1), Mike Conway, PhD (2), Martin Duneld, PhD (1), Wendy W. Chapman, PhD (3)

(1) Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden
(2) Division of Behavioral Medicine, Department of Family & Preventive Medicine, University of California, San Diego, USA
(3) Division of Biomedical Informatics, Department of Medicine, University of California, San Diego, USA

Abstract

Medical terminologies and ontologies are important tools for natural language processing of health record narratives. To account for the variability of language use, synonyms need to be stored in a semantic resource as textual instantiations of a concept. Developing such resources manually is, however, prohibitively expensive and likely to result in low coverage. To facilitate and expedite the process of lexical resource development, distributional analysis of large corpora provides a powerful data-driven means of (semi-)automatically identifying semantic relations, including synonymy, between terms. In this paper, we demonstrate how distributional analysis of a large corpus of electronic health records – the MIMIC-II database – can be employed to extract synonyms of SNOMED CT preferred terms. A distinctive feature of our method is its ability to identify synonymous relations between terms of varying length.

Introduction and Motivation

Terminological and ontological standards are an important and integral part of workflows in clinical care. In particular, SNOMED CT 1 has become the de facto standard for the representation of clinical concepts in Electronic Health Records. However, SNOMED CT is currently only available in British and American English, Spanish, Danish, and Swedish (with translations into French and Lithuanian in process) 2. In order to accelerate the adoption of SNOMED CT (and by extension, Electronic Health Records) internationally, it is clear that the development of new methods and tools to expedite the language porting process is of vital importance. This paper presents and evaluates a semi-automatic – and language-agnostic – method for the extraction of synonyms of SNOMED CT preferred terms using distributional similarity techniques in conjunction with a large corpus of clinical text (the MIMIC-II database 3). Key applications of the technique include:

1. Expediting SNOMED CT language porting efforts using semi-automatic identification of synonyms for preferred terms
2. Augmenting the current English versions of SNOMED CT with additional synonyms

In comparison to current rule-based synonym extraction techniques, our proposed method has two major advantages:

1. As the method uses statistical techniques (i.e. distributional similarity methods), it is agnostic with respect to language. That is, in order to identify new synonyms, all that is required is a clinical corpus of sufficient size in the target language
2. Unlike most approaches that use distributional similarity, our proposed method addresses the problem of identifying synonymy between terms of varying length – a key limitation in traditional distributional similarity approaches

In this paper, we begin by presenting some relevant background literature (including describing related work in distributional similarity and synonym extraction); then we describe the materials and methods used in this research (in

particular corpus resources and software tools), before going on to set out the results of our analysis. We conclude the paper with a discussion of our results and a short conclusion.

Background

In this section, we will first describe the structure of SNOMED CT, before going on to discuss relevant work related to synonym extraction. Finally, we will present some opportunities and challenges associated with using distributional similarity methods for synonym extraction.

SNOMED CT

In recent years, SNOMED CT has become the de facto terminological standard for representing clinical concepts in Electronic Health Records 4. SNOMED CT's scope includes clinical findings, procedures, body structures, and social contexts linked together through relationships (the most important of which is the hierarchical IS A relationship). There are more than 300,000 active concepts in SNOMED CT and over a million relations 1. Each concept consists of a:

1. Concept ID: A unique numerical identifier
2. Fully Specified Name: An unambiguous string used to name a concept
3. Preferred Term: A common phrase or word used by clinicians to name a concept. Each concept has precisely one preferred term in a given language. In contrast to the fully specified name, the preferred term is not necessarily unique and can be a synonym or preferred name for a different concept
4. Synonym: A term that can be used as an acceptable alternative to the preferred term. A concept can have zero or more synonyms (a minimal record sketch follows this list)
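To make the concept structure concrete, the depression example from Figure 1 could be stored as a simple record. This is a hypothetical sketch in Python: the Concept class and its field names are ours, not an official SNOMED CT format.

```python
# A hypothetical record type for a SNOMED CT concept; field names are ours.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Concept:
    snomed_id: str
    fully_specified_name: str
    preferred_term: str
    synonyms: List[str] = field(default_factory=list)

# The depression concept as shown in Figure 1
depression = Concept(
    snomed_id="D3-15000",
    fully_specified_name="Depressive disorder (disorder)",
    preferred_term="Depressive disorder",
    synonyms=["Depressed", "Depression", "Melancholia",
              "Depressive illness", "Depressive episode"],
)
```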

[Figure 1: two concept diagrams – depression (SNOMED_ID D3-15000) and myocardial infarction (SNOMED_ID D9-52000) – each showing the fully specified name, the preferred term and the synonyms of the concept.]

Figure 1: Example SNOMED CT concepts: depression and myocardial infarction.

Figure 1 shows two example SNOMED CT concepts: depression and myocardial infarction. Note that in the depression example, the preferred term "depressive disorder" maps to single-word terms like "depressed" and "depression". Furthermore, it can be noted that the synonym "melancholia" does not contain the term "depression" or one of its morphological variants.

Synonymy

Previous research on synonym extraction in the biomedical informatics literature has utilized diverse methodologies. In the context of information retrieval from clinical documents, Zeng et al. 5 used three query expansion methods – reading synonyms and lexical variants directly from the UMLS 6, generating topic models from clinical documents, and mining the SemRep 7 predication database – and found that an entirely corpus-based statistical method (i.e. topic


modeling) generated the best synonyms. Conway & Chapman 8 used a rule-based approach to generate potential synonyms from the BioPortal ontology web service, verifying the acceptability of candidate synonyms by checking for their presence in a very large corpus of clinical text, with the goal of populating a lexical-oriented knowledge organization system. In the Natural Language Processing community, there is a rich tradition of using lexico-syntactic patterns to extract synonym (and other) relations 9.

Distributional Semantics

Models of distributional semantics exploit large corpora to capture the meaning of terms based on their distribution in different contexts. The theoretical foundation underlying such models is the distributional hypothesis 10, which states that words with similar meanings tend to appear in similar contexts. Distributional methods have become popular with the increasing availability of large corpora and are attractive due to their ability, in some sense, to render semantics computable: an estimate of the semantic relatedness between two terms can be quantified. These methods have been applied successfully to a range of natural language processing tasks, including document retrieval, synonym tests and word sense disambiguation 11.

An obvious use case of distributional methods is for the extraction of semantic relations, such as synonyms, hypernyms and co-hyponyms (terms with a common hypernym) 12. Ideally, one would want to differentiate between such semantic relations; however, with these methods, the semantic relation between two distributionally similar terms is unlabeled. As synonyms are interchangeable in most contexts – meaning that they will have similar distributional profiles – synonymy is certainly a semantic relation that will be captured. However, since hypernyms and hyponyms – in fact, even antonyms – are also likely to occur in similar contexts, such semantic relations will likewise be extracted.

Distributional methods can be usefully divided into spatial models and probabilistic models. Spatial models represent terms as vectors in a high-dimensional space, based on the frequency with which they appear in different contexts, and where proximity between vectors is assumed to indicate semantic relatedness. Probabilistic models view documents as a mixture of topics and represent terms according to the probability of their occurrence during the discussion of each topic: two terms that share similar topic distributions are assumed to be semantically related. There are pros and cons of each approach; however, scalable versions of spatial models have proved to work well for very large corpora 11.

Spatial models differ mainly in the way context vectors, representing term meaning, are constructed. In many methods, they are derived from an initial term-context matrix that contains the (weighted, normalized) frequency with which the terms occur in different contexts. The main problem with using these term-by-context vectors is their dimensionality, equal to the number of contexts (e.g. the number of documents or the vocabulary size). The solution is to project the high-dimensional data into a lower-dimensional space, while approximately preserving the relative distances between data points. In latent semantic analysis (LSA) 13, the term-context matrix is reduced by an expensive matrix factorization technique known as singular value decomposition.
Random indexing (RI) 14 is a scalable and computationally efficient alternative to LSA, in which explicit dimensionality reduction is circumvented: a lower dimensionality d is instead chosen a priori as a model parameter and the d-dimensional context vectors are then constructed incrementally. RI can be viewed as a two-step operation:

1. Each context (e.g. each document or unique term) is assigned a sparse, ternary and randomly generated index vector: a small number (1-2%) of +1s and -1s are randomly distributed; the rest of the elements are set to zero. By generating sparse vectors of a sufficiently high dimensionality in this way, the context representations will, with a high probability, be nearly orthogonal.

2. Each unique term is also assigned an initially empty context vector of the same dimensionality. The context vectors are then incrementally populated with context information by adding the index vectors of the contexts in which the target term appears. (A code sketch of these two steps follows.)
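As an illustration, the two steps above can be written out in a few lines of Python. This is a minimal sketch assuming a sliding-window context; the function names and parameter values are our own choices, not from the paper.

```python
# A minimal sketch of sliding-window Random Indexing (illustrative only).
import numpy as np

def make_index_vector(dim=1500, num_nonzero=20, rng=None):
    """Sparse, ternary, random index vector: a few +1s and -1s, rest zeros."""
    rng = rng or np.random.default_rng()
    vec = np.zeros(dim)
    positions = rng.choice(dim, size=num_nonzero, replace=False)
    vec[positions[:num_nonzero // 2]] = 1.0
    vec[positions[num_nonzero // 2:]] = -1.0
    return vec

def random_indexing(tokens, dim=1500, window=2, seed=42):
    """Step 1: fixed random index vectors; step 2: accumulate neighbours."""
    rng = np.random.default_rng(seed)
    index = {t: make_index_vector(dim, rng=rng) for t in set(tokens)}
    context = {t: np.zeros(dim) for t in set(tokens)}
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # add the index vectors of surrounding terms
                context[target] += index[tokens[j]]
    return context

# toy usage
tokens = "patient denies chest pain and denies shortness of breath".split()
space = random_indexing(tokens, dim=1500, window=2)
```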

There are a number of model parameters that need to be configured according to the task that the induced term space will be used for. For instance, the types of semantic relations captured by an RI-based model depend on the context definition 15. By employing a document-level context definition, relying on direct co-occurrences, one models syntagmatic relations. That is, two terms that frequently co-occur in the same documents are likely to be about the same topic. By employing a sliding window context definition, where the index vectors of


the surrounding terms within a, usually small, window are added to the context vector of the target term, one models paradigmatic relations. That is, two terms that frequently occur with the same set of words – i.e. share neighbors – but do not necessarily co-occur themselves, are semantically similar. Synonymy is an instance of a paradigmatic relation.

RI, in its original conception, does not take into full account term order information, except by giving increasingly less weight to index vectors of terms as the distance from the target term increases. Random permutation (RP) 16 is an elegant modification of RI that attempts to remedy this by simply permuting (i.e. shifting) the index vectors according to their direction and distance from the target term before they are added to the context vector. RI has performed well on tasks such as taking the TOEFL (Test of English as a Foreign Language) test. However, by incorporating term order information, RP was shown to outperform RI on this particular task 16. Combining RI and RP models has been demonstrated to yield improved results on the synonym extraction task 17.

The predefined dimensionality is yet another model parameter that has been shown to be potentially very important, especially when the number of contexts (the size of the vocabulary) is large, as it often is in the clinical domain. Since the traditional way of using distributional semantics is to model only unigram-unigram relations – a limitation when wishing to model the semantics of phrases and longer textual sequences – a possible solution is to identify and model multiword terms as single tokens. This will, however, lead to an explosion in the size of the vocabulary, necessitating a larger dimensionality. In short, dimensionality and other model parameters need to be tuned for the dataset and task at hand 18.

Materials and Methods

The method and experimental setup can be summarized in the following steps: (1) data preparation, (2) term extraction and identification, (3) model building and parameter tuning, and (4) evaluation (Figure 2). Term spaces are constructed with various parameter settings on two dataset variants: one with unigram terms and one with multiword terms. The models – and, in effect, the method – are evaluated for their ability to identify synonyms of SNOMED CT preferred terms. After optimizing the parameter settings for each group of models on a development set, the best models are evaluated on unseen data.

[Figure 2: flow diagram – data preparation from the MIMIC-II database into two preprocessed variants (unigrams and multiword terms), term extraction and identification against SNOMED CT, model building and parameter tuning over multiple term spaces, and evaluation against SNOMED CT development and evaluation sets.]
Figure 2: An overview of the process and the experimental setup.

In Step 1, the clinical data from which the term spaces will be induced is extracted from the MIMIC-II database and preprocessed. MIMIC-II 3 is a publicly available database encompassing clinical data for over 40,000 hospital


stays of more than 32,000 patients, collected over a seven-year period (2001-2007) from intensive care units (medical, surgical, coronary and cardiac surgery recovery) at Boston's Beth Israel Deaconess Medical Center. In addition to various structured data, such as laboratory results and ICD-9 diagnosis codes, the database contains text-based records, including nursing progress notes, discharge summaries and radiology interpretations. We create a corpus comprising all text-based records (~250 million tokens) from the MIMIC-II database. The documents in the corpus are then preprocessed to remove metadata, such as headings (e.g. FINAL REPORT), incomplete sentence fragments, such as enumerations (of e.g. medications), as well as digits and punctuation marks.

In Step 2, we extract and identify multiword terms in the corpus. This will allow us to extract synonymous relations between terms of varying length. This is done by first extracting all unigrams, bigrams and trigrams from the corpus with TEXT-NSP 19 and treating them as candidate terms for which the C-value is then calculated. The C-value statistic 20 21 has been used successfully for term recognition in the biomedical domain, largely due to its ability to handle nested terms 22. It is based on term frequency and term length (number of words); if a candidate term is part of a longer candidate term, it also takes into account how many other terms it is part of and how frequent those longer terms are (Figure 3). By extracting n-grams and then ranking the n-grams according to their C-value, we are incorporating the notions of both unithood – indicating collocation strength – and termhood – indicating the association strength of a term to domain concepts 22. In this still rather simple approach to term extraction, however, we do not take any other linguistic knowledge into account. As a simple remedy for this, we create a number of filtering rules that remove terms beginning and/or ending with certain words, e.g. prepositions (in, from, for) and articles (a, the). Another alteration to the term list – now ranked according to C-value – is to give precedence to SNOMED CT preferred terms by adding/moving them to the top of the list, regardless of their C-value (or failed identification). The reason for this is that we are aiming to identify SNOMED CT synonyms of preferred terms and, by giving precedence to preferred terms – but not synonyms, as that would constitute cheating – we are effectively strengthening the statistical foundation on which the distributional method bases its semantic representation. The term list is then used to perform exact string matching on the entire corpus: multiword terms with a higher C-value than their constituents are concatenated. We thus treat multiword terms as separate tokens with their own particular distributions in the data, to a greater or lesser extent different from those of their constituents.

$$
\text{C-value}(a) =
\begin{cases}
\log_2 |a| \cdot f(a) & \text{if } a \text{ is not nested} \\[6pt]
\log_2 |a| \cdot \left( f(a) - \dfrac{1}{P(T_a)} \displaystyle\sum_{b \in T_a} f(b) \right) & \text{otherwise}
\end{cases}
$$

where a is a candidate term, T_a is the set of extracted candidate terms that contain a, P(T_a) is the number of candidate terms in T_a, b ranges over the longer candidate terms in T_a, |a| is the length of candidate term a (number of words), f(a) is the term frequency of a, and f(b) is the term frequency of longer candidate term b.

Figure 3: The formula for calculating the C-value of candidate terms.
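For a concrete reading of the formula, here is a small Python sketch of the C-value computation; the function names and toy frequencies are ours, for illustration only.

```python
# A direct reading of the C-value formula above (illustrative sketch).
import math

def contains(b, a):
    """True if candidate b (a tuple of words) contains a as a contiguous subsequence."""
    return len(b) > len(a) and any(
        b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1))

def c_values(freq):
    """freq maps each candidate term (tuple of words) to its corpus frequency f(a)."""
    scores = {}
    for a, f_a in freq.items():
        T_a = [b for b in freq if contains(b, a)]
        # log2|a| is 0 for unigrams; some implementations therefore use
        # log2(|a| + 1), but here we follow the formula as given.
        weight = math.log2(len(a))
        if not T_a:  # a is not nested in any longer candidate term
            scores[a] = weight * f_a
        else:
            scores[a] = weight * (f_a - sum(freq[b] for b in T_a) / len(T_a))
    return scores

# toy example with made-up frequencies
freq = {
    ("myocardial",): 120,
    ("infarction",): 150,
    ("myocardial", "infarction"): 90,
    ("acute", "myocardial", "infarction"): 40,
}
print(sorted(c_values(freq).items(), key=lambda kv: -kv[1]))
```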

In Step 3, term spaces are induced from the dataset variants: one containing only unigram terms (UNIGRAM TERMS) and one also containing longer terms: unigram, bigram and trigram terms (MULTIWORD TERMS). The following model parameters are experimented with (the resulting parameter sweep is sketched after the list):

• Model Type: random indexing (RI), random permutation (RP). Does the method for synonym identification benefit from incorporating word order information?
• Sliding Window: 1+1, 2+2, 3+3, 4+4, 5+5, 6+6 surrounding terms. Which paradigmatic context definition is most beneficial for synonym identification?
• Dimensionality: 500, 1000, 1500 dimensions. How does the dimensionality affect the method's ability to identify synonyms, and is the impact greater when the vocabulary grows drastically, as it does when treating multiword terms as single tokens?
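Written out as a simple sweep, the grid above amounts to 36 configurations per dataset variant. The following is merely an illustrative enumeration; in the actual experiments a term space would be induced and evaluated at each step.

```python
# The parameter grid from the list above: 2 model types x 6 window sizes
# x 3 dimensionalities = 36 semantic spaces per dataset variant.
from itertools import product

MODEL_TYPES = ["RI", "RP"]
WINDOW_SIZES = [1, 2, 3, 4, 5, 6]       # symmetric windows: w+w surrounding terms
DIMENSIONALITIES = [500, 1000, 1500]

for model_type, w, dim in product(MODEL_TYPES, WINDOW_SIZES, DIMENSIONALITIES):
    print(f"{model_type}: window {w}+{w}, {dim} dimensions")
```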

Evaluation takes place in both Step 3 and Step 4. The term spaces are evaluated for their ability to identify synonyms of SNOMED CT preferred terms that each appear at least fifty times in the corpus (Table 1). A vast number of


SNOMED CT terms do not appear in the corpus; requiring that they appear a certain number of times arguably makes the evaluation more realistic. Although fifty is a somewhat arbitrarily chosen number, it is likely to ensure that the statistical foundation is solid. A preferred term is provided as input to a term space and the twenty most semantically similar terms are output, provided that they also appear at least fifty times in the data. For each preferred term, recall is calculated using the twenty most semantically similar terms generated by the model (i.e. for each SNOMED CT concept, recall is the proportion of SNOMED CT synonyms returned for that concept when the model is queried using that concept's preferred term).

The SNOMED CT data is divided into a development set and an evaluation set in a 50/50 split. The development set is used in Step 3 to find the optimal parameter settings for the respective datasets and the task at hand. The best parameter configuration for each type of model (UNIGRAM TERMS and MULTIWORD TERMS) is then used in the final evaluation in Step 4. Note that the requirement of each synonym pair appearing at least fifty times in the data means that the development and evaluation sets for the two types of models are not identical, e.g. the test sets for UNIGRAM TERMS will not contain any multiword terms. This, in turn, means that the results of the two types of models are not directly comparable.
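The evaluation protocol can be condensed into a short sketch. This assumes cosine similarity as the similarity metric and a dictionary-of-vectors layout; both are our own illustrative choices, not details specified by the paper.

```python
# Sketch of the recall-top-20 evaluation described above.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def top_k(space, query, k=20):
    """space maps terms to context vectors; returns the k most similar terms."""
    q = space[query]
    ranked = sorted((t for t in space if t != query),
                    key=lambda t: cosine(q, space[t]), reverse=True)
    return ranked[:k]

def recall_top_k(space, gold, k=20):
    """gold maps each preferred term to the set of its SNOMED CT synonyms."""
    found = total = 0
    for preferred, synonyms in gold.items():
        candidates = set(top_k(space, preferred, k))
        found += len(synonyms & candidates)
        total += len(synonyms)
    return found / total
```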

                           UNIGRAM TERMS               MULTIWORD TERMS
Semantic Type              Preferred Terms  Synonyms   Preferred Terms  Synonyms
attribute                  4                6          6                8
body structure             2                2          12               12
cell                       0                0          1                1
cell structure             1                1          1                1
disorder                   26               32         68               81
environment                3                3          3                3
event                      1                1          1                1
finding                    39               54         69               86
morphologic abnormality    35               45         42               54
observable entity          17               22         22               26
organism                   2                2          3                3
person                     2                2          2                2
physical force             1                1          1                1
physical object            9                10         12               15
procedure                  23               28         49               64
product                    6                6          3                3
qualifier value            133              173        153              190
regime/therapy             0                0          3                3
situation                  0                0          1                1
specimen                   0                0          2                0
substance                  24               24         24               25
Total                      328              412        478              580

Table 1: The frequency of SNOMED CT preferred terms and synonyms that are identified at least fifty times in the MIMIC-II corpus. The UNIGRAM TERMS set contains only unigram terms, while the MULTIWORD TERMS set contains unigram, bigram and trigram terms.

Results

One interesting result to report, although not the focus of this paper, concerns the coverage of SNOMED CT in a large clinical corpus like MIMIC-II. First of all, it is interesting to note that only 9,267 out of the 105,437 preferred terms with one or more synonyms are unigram terms (24,866 bigram terms, 21,045 trigram terms and 50,259 terms that consist of more than three words/tokens). Out of the 158,919 synonyms, 12,407 are unigram terms (43,513 bigram terms, 32,367 trigram terms and 70,632 terms that consist of more than three words/tokens). 7,265 SNOMED CT terms (preferred terms and synonyms) are identified in the MIMIC-II corpus (2,632 unigram terms, 3,217 bigram


terms and 1,416 trigram terms); the occurrence of longer terms in the corpus has not been verified in the current work. For the number of preferred terms and synonyms that appear more than fifty times in the corpus, consult Table 1.

When tuning the parameters of the UNIGRAM TERMS models and the MULTIWORD TERMS models, the pattern is fairly clear: for both dataset variants, the best model parameter settings are based on random permutation and a dimensionality of 1,500. For UNIGRAM TERMS, a sliding window of 5+5 yields the best results (Table 2). The general tendency is that results improve as the dimensionality and the size of the sliding window increase. However, increasing the size of the context window beyond 5+5 surrounding terms does not boost results further.

                     RANDOM INDEXING                       RANDOM PERMUTATION
Sliding Window →     1+1   2+2   3+3   4+4   5+5   6+6     1+1   2+2   3+3   4+4   5+5   6+6
500 dimensions       0.17  0.18  0.20  0.20  0.20  0.21    0.17  0.20  0.21  0.21  0.22  0.22
1,000 dimensions     0.16  0.19  0.20  0.21  0.21  0.21    0.19  0.22  0.21  0.21  0.22  0.23
1,500 dimensions     0.17  0.21  0.22  0.22  0.22  0.22    0.18  0.22  0.22  0.23  0.24  0.23

Table 2: Model parameter tuning: results, recall top 20, for UNIGRAM TERMS on the development set.

For MULTIWORD TERMS, a sliding window of 4+4 yields the best results (Table 3). The tendency is similar to that of the UNIGRAM TERMS models: incorporating term order (RP) and employing a larger dimensionality leads to the best performance. In contrast to the UNIGRAM TERMS models, the optimal context window size is in this case slightly smaller.

                     RANDOM INDEXING                       RANDOM PERMUTATION
Sliding Window →     1+1   2+2   3+3   4+4   5+5   6+6     1+1   2+2   3+3   4+4   5+5   6+6
500 dimensions       0.08  0.12  0.13  0.13  0.13  0.12    0.08  0.13  0.14  0.14  0.13  0.12
1,000 dimensions     0.09  0.12  0.13  0.13  0.13  0.13    0.10  0.13  0.14  0.13  0.12  0.11
1,500 dimensions     0.09  0.13  0.13  0.14  0.14  0.14    0.10  0.14  0.14  0.16  0.14  0.14

Table 3: Model parameter tuning: results, recall top 20, for MULTIWORD TERMS on the development set.

Once the optimal parameter settings for each dataset variant had been configured, they were evaluated for their ability to identify synonyms on the unseen evaluation set. Overall, the best UNIGRAM TERMS model (RP, 5+5 context window, 1,500 dimensions) achieved a recall top 20 of 0.24 (Table 4), i.e. 24% of all unigram synonym pairs that occur at least fifty times in the corpus were successfully identified in a list of twenty suggestions per preferred term. For certain semantic types, such as morphologic abnormality and procedure, the results are notably higher: almost 50%. For finding, the results are somewhat lower: 15%. With the best MULTIWORD TERMS model (RP, 4+4 context window, 1,500 dimensions), the average recall top 20 is 0.16. Again, the results vary depending on the semantic type: higher results are achieved for entities such as disorder (0.22), morphologic abnormality (0.29) and physical object (0.42), while lower results are obtained for finding (0.12), observable entity (0.09), qualifier value (0.08) and substance (0.08). For 22 of the correctly identified synonym pairs, at least one of the terms in the synonymous relation was a multiword term.

Discussion

When modeling unigram terms, as is the traditional approach when employing models of distributional semantics, the results are fairly good: almost 25% of all synonym pairs that appear with some regularity in the corpus are successfully identified. However, the problem with the traditional unigram-based approach is that the vast majority of SNOMED CT terms – and other biomedical terms for that matter – are multiword terms: fewer than 10% are unigram terms. This highlights the importance of developing methods and techniques that are able to model the meaning of multiword expressions. In this paper, we attempted to do this in a fairly straightforward manner: identify multiword terms and


Semantic Type              # Synonym Pairs   Recall (Top 20)
attribute                  2                 0.50
body structure             1                 0.00
cell structure             1                 1.00
disorder                   17                0.23
environment                2                 0.00
event                      1                 0.00
finding                    27                0.15
morphologic abnormality    26                0.43
observable entity          11                0.22
organism                   1                 0.00
person                     1                 1.00
physical force             1                 0.00
physical object            6                 0.30
procedure                  15                0.46
product                    2                 0.50
qualifier value            89                0.15
substance                  12                0.33
All                        215               0.24

Table 4: Final evaluation: results, recall top 20, for UNIGRAM TERMS on the evaluation set.

treat each one as a distinct semantic unit. This approach allowed us to identify more synonym pairs in the corpus (292 vs. 215). The results were slightly lower compared to the UNIGRAM TERMS model, although the results are not directly comparable since they were evaluated on different datasets. The UNIGRAM TERMS model was unable to identify any synonymous relations involving a multiword term, whereas the MULTIWORD TERMS model successfully identified 22 such relations. This demonstrates that multiword terms can be handled with some amount of success in distributional semantic models. However, our approach relies to a large degree on the ability to identify high-quality multiword terms, which was not the focus of this paper. The term extraction could be improved substantially by using a linguistic filter that produces better candidate terms than n-grams. Using a shallow parser to extract phrases is one such obvious improvement.

Another issue concerns the evaluation of the method. Relying heavily on a purely quantitative evaluation, as we have done, can provide only a limited view of the usefulness of the models. Only counting the number of synonyms that are currently in SNOMED CT – and treating this as our gold standard – does not say anything about the quality of the "incorrect" suggestions. There may be valid synonyms that are currently not in SNOMED CT. One example of this is the preferred term itching, which, in SNOMED CT, has two synonyms: itch and itchy. The model was able to identify the former but not the latter; however, it also identified itchiness. Another phenomenon, which is perhaps of less interest in the case of SNOMED CT but of huge significance for developing terminologies that are to be used for information extraction purposes, is the identification of misspellings. When looking up anxiety, for instance, the synonym anxiousness was successfully identified; other related terms were agitation and aggitation [sic]. Many of the suggested terms are variants of a limited number of concepts. Future work should thus involve review of candidate synonyms by human evaluators.


Semantic Type              # Synonym Pairs   Recall (Top 20)
attribute                  3                 0.33
body structure             6                 0.17
cell                       1                 0.00
cell structure             1                 1.00
disorder                   40                0.22
environment                2                 0.00
event                      1                 0.00
finding                    43                0.12
morphologic abnormality    29                0.29
observable entity          15                0.09
organism                   2                 0.00
person                     1                 1.00
physical force             1                 0.00
physical object            7                 0.42
procedure                  34                0.17
product                    2                 0.50
qualifier value            88                0.08
regime/therapy             2                 0.50
situation                  1                 0.00
specimen                   1                 0.00
substance                  12                0.08
All                        292               0.16

Table 5: Final evaluation: results, recall top 20, for MULTIWORD TERMS on the evaluation set.

Conclusions

We have demonstrated how distributional analysis of a large corpus of clinical narratives can be used to identify synonymy between SNOMED CT terms. In addition to capturing synonymous relations between pairs of unigram terms, we have shown that we are also able to extract such relations between terms of varying length. This language-independent method can be used to port SNOMED CT – and other terminologies and ontologies – to other languages.

Acknowledgements

This work was partially supported by NIH grant R01LM009427 (authors WC & MC), the Swedish Foundation for Strategic Research through the project High-Performance Data Mining for Drug Effect Detection, ref. no. IIS11-0053 (author AH) and the Stockholm University Academic Initiative through the Interlock project.

References

1. International Health Terminology Standards Development Organisation: SNOMED CT. Available from: http://www.ihtsdo.org/snomed-ct/ [cited 10th March 2013].

2. International Health Terminology Standards Development Organisation: Supporting Different Languages. Available from: http://www.ihtsdo.org/snomed-ct/snomed-ct0/different-languages/ [cited 10th March 2013].

3. Saeed M, Villarroel M, Reisner AT, Clifford G, Lehman LW, Moody G, et al. Multiparameter intelligent monitoring in intensive care II (MIMIC-II): A public-access intensive care unit database. Crit Care Med. 2011;39(5):952–960.


4. SNOMED CT User Guide 2013 International Release. Available from: http://www.webcitation.org/6IlJeb5Uj [cited 10th March 2013].

5. Zeng QT, Redd D, Rindflesch T, Nebeker J. Synonym, topic model and predicate-based query expansion for retrieving clinical documents. AMIA Annu Symp Proc. 2012;2012:1050–9.

6. Unified Medical Language System. Available from: http://www.nlm.nih.gov/research/umls/ [cited 10th March 2013].

7. Cohen T, Widdows D, Schvaneveldt RW, Davies P, Rindflesch TC. Discovering discovery patterns with predication-based Semantic Indexing. J Biomed Inform. 2012 Dec;45(6):1049–65.

8. Conway M, Chapman W. Discovering lexical instantiations of clinical concepts using web services, WordNet and corpus resources. In: AMIA Fall Symposium; 2012. p. 1604.

9. Hearst M. Automatic acquisition of hyponyms from large text corpora. In: Proceedings of COLING 1992; 1992. p. 539–545.

10. Firth JR, Palmer FR. Selected papers of J. R. Firth, 1952-59. Indiana University studies in the history and theory of linguistics. Bloomington: Indiana University Press; 1968.

11. Cohen T, Widdows D. Empirical distributional semantics: methods and biomedical applications. J Biomed Inform. 2009 April;42(2):390–405.

12. Panchenko A. Similarity measures for semantic relation extraction. PhD thesis, Université catholique de Louvain & Bauman Moscow State Technical University; 2013.

13. Deerwester S, Dumais S, Furnas G. Indexing by latent semantic analysis. Journal of the American Society for Information Science. 1990;41(6):391–407.

14. Kanerva P, Kristofersson J, Holst A. Random indexing of text samples for latent semantic analysis. In: Proceedings of the 22nd Annual Conference of the Cognitive Science Society; 2000. p. 1036.

15. Sahlgren M. The word-space model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD thesis, Stockholm University; 2006.

16. Sahlgren M, Holst A, Kanerva P. Permutations as a means to encode order in word space. In: Proceedings of the 30th Annual Meeting of the Cognitive Science Society; 2008. p. 1300–1305.

17. Henriksson A, Moen H, Skeppstedt M, Eklund AM, Daudaravicius V. Synonym extraction of medical terms from clinical text using combinations of word space models. In: Proceedings of Semantic Mining in Biomedicine (SMBM); 2012. p. 10–17.

18. Henriksson A, Hassel M. Optimizing the dimensionality of clinical term spaces for improved diagnosis coding support. In: Proceedings of the LOUHI Workshop on Health Document Text Mining and Information Analysis; 2013. p. 1–6.

19. Banerjee S, Pedersen T. The design, implementation, and use of the Ngram Statistic Package. In: Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing); 2003. p. 370–381.

20. Frantzi K, Ananiadou S. Extracting nested collocations. In: Proceedings of the Conference on Computational Linguistics (COLING); 1996. p. 41–46.

21. Frantzi K, Ananiadou S, Mima H. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries. 2000;3(2):115–130.

22. Zhang Z, Iria J, Brewster C, Ciravegna F. A comparative evaluation of term recognition algorithms. In: Proceedings of Language Resources and Evaluation (LREC); 2008.



PAPER II

SYNONYM EXTRACTION AND ABBREVIATION EXPANSION WITH ENSEMBLES OF SEMANTIC SPACES

Aron Henriksson, Hans Moen, Maria Skeppstedt, Vidas Daudaravičius and Martin Duneld (2013). Synonym Extraction and Abbreviation Expansion with Ensembles of Semantic Spaces. Submitted.

Author Contributions

Aron Henriksson was responsible for coordinating the study, its overall design and for carrying out the experiments. Hans Moen and Maria Skeppstedt contributed equally: Hans Moen was responsible for designing strategies for combining the output of multiple semantic spaces; Maria Skeppstedt was responsible for designing the evaluation part of the study. Vidas Daudaravičius and Maria Skeppstedt were jointly responsible for designing the post-processing filtering of candidate terms, while Martin Duneld provided feedback on the design of the study. The manuscript was written by Aron Henriksson, Hans Moen, Maria Skeppstedt and Martin Duneld.

Synonym Extraction and Abbreviation Expansion with Ensembles of Semantic Spaces

Aron Henriksson (1,*), Hans Moen (2), Maria Skeppstedt (1), Vidas Daudaravičius (3) and Martin Duneld (1)

1 Department of Computer and Systems Sciences (DSV), Stockholm University, Forum 100, SE-164 40 Kista, Sweden
2 Department of Computer and Information Science, Norwegian University of Science and Technology, NO-7491 Trondheim, Norway
3 Faculty of Informatics, Vytautas Magnus University, Vileikos g. 8 - 409, Kaunas, LT-44404, Lithuania

Email: Aron Henriksson* - [email protected]; Hans Moen - [email protected]; Maria Skeppstedt - [email protected]; Vidas Daudaravičius - [email protected]; Martin Duneld - [email protected]

*Corresponding author

Abstract

Background: Terminologies that account for language use variability by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in this domain, manual construction of semantic resources that accurately reflect language use is both challenging and costly, and often results in low coverage. Although models of distributional semantics applied to large corpora provide a potential means of supporting development of such resources, their ability to identify synonymy in particular is limited, and their application to the clinical domain has only recently been explored. Combining distributional models and applying them to different types of corpora may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs.

Results: A combination of two distributional models – Random Indexing and Random Permutation – employed in conjunction with a single corpus outperforms using either of the models in isolation. Furthermore, combining semantic spaces induced from different types of corpora – a clinical corpus and a corpus comprising medical journal articles – further improves results, outperforming a combination of semantic spaces induced from a single source. A combination strategy that simply sums the cosine similarity scores of candidate terms is the most profitable. Finally, applying simple post-processing filtering rules yields significant performance gains on the tasks of extracting abbreviation-expansion pairs, but not synonyms. The best results, measured as recall in a

list of ten candidate terms, for the three tasks are: 0.39 for abbreviations to long forms, 0.33 for long forms to abbreviations, and 0.47 for synonyms.

Conclusions: This study demonstrates that ensembles of semantic spaces yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits further exploration, allows different distributional models – with different model parameters – and different types of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural language processing tasks.

Background

In order to create high-quality information extraction systems, it is important to incorporate some knowledge of semantics, such as the fact that a concept can be signified by multiple signifiers. Morphological variants, abbreviations, acronyms, misspellings and synonyms – although different in form – may share semantic content to different degrees. The various lexical instantiations of a concept thus need to be mapped to some standard representation of the concept, either by converting the different expressions to a canonical form or by generating lexical variants of a concept's preferred term. These mappings are typically encoded in semantic resources, such as thesauri or ontologies, which enable the recall (sensitivity) of information extraction systems to be improved. Although their value is undisputed, manual construction of such resources is often prohibitively expensive and may also result in limited coverage, particularly in the biomedical and clinical domains where language use variability is exceptionally high [1].

There is thus a need for (semi-)automatic methods that can aid and accelerate the process of lexical resource development, especially ones that are able to reflect real language use in a particular domain and adapt to different genres of text, as well as to changes over time. In the clinical domain, for instance, language use in general and (ad-hoc) abbreviations in particular can vary significantly across specialities. Statistical, corpus-driven and language-independent methods are attractive due to their inherent portability: given a corpus of sufficient size in the target domain, the methods can be applied with no or little adaptation needed. Models of distributional semantics fulfill these requirements and have been used to extract semantically similar terms from large corpora; with increasing access to data from electronic health records, their application to the clinical domain has lately begun to be explored. In this paper, we present a method that employs distributional semantics for the extraction of synonyms and abbreviation-expansion pairs from two large corpora: a clinical corpus (comprising health record narratives) and a medical corpus (comprising journal articles). We also demonstrate that performance can be enhanced by creating ensembles of (distributional)

semantic spaces – both with different model parameter configurations and induced from different genres of text.

The structure of this paper is as follows. First, we present some relevant background literature on synonyms, abbreviations and their extraction/expansion. We also introduce the ideas underlying distributional semantics in general and, in particular, the models employed in this study: Random Indexing and Random Permutation. Then, we describe our method of combining semantic spaces induced from single and multiple corpora, including the details of the experimental setup and the materials used. A presentation of the experimental results is followed by an analysis and discussion of their implications. Finally, we conclude the paper with a brief recapitulation and conclusion.

Language Use Variability: Synonyms and Abbreviations

Synonymy is a semantic relation between two phonologically distinct words with very similar meaning. It is, however, extremely rare that two words have the exact same meaning – perfect synonyms – as there is often at least one parameter that distinguishes the use of one word from another [2]. For that reason, we typically speak of near-synonyms instead; that is, two words that are interchangeable in some, but not all, contexts [2]. The words big and large are, for instance, synonymous when describing a house, but certainly not when describing a sibling. Two near-synonyms may also have different connotations, such as conveying a positive or a negative attitude. To complicate matters, the same concept can sometimes be referred to with different words in different dialects; for a speaker who is familiar with both dialects, these can be viewed as synonyms. A similar phenomenon concerns different formality levels, where one word in a synonym pair is used only in slang and the other only in a more formal context [2]. In the medical domain, there is one vocabulary that is more frequently used by medical professionals, whereas patients often use alternative, layman terms [3].

When developing many natural language processing (NLP) applications, it is important to have ready access to terminological resources that cover this variation in the use of vocabulary by storing synonyms. Examples of such applications are query expansion [3], text simplification [4] and, as mentioned previously, information extraction [5].

The use of abbreviations and acronyms is prevalent in both medical journal text [6] and clinical text [1]. This leads to decreased readability [7] and poses challenges for information extraction [8]. Semantic resources that also link abbreviations to their corresponding concept, or, alternatively, simple term lists that store abbreviations and their corresponding long form, are therefore as important as synonym resources for many biomedical NLP applications. Like synonyms, abbreviations are often interchangeable with their

corresponding long form in some, if not all, contexts. An important difference between abbreviations and synonyms is, however, that abbreviations are semantically overloaded to a much larger extent; that is, one abbreviation often has several possible long forms. In fact, 81% of UMLS abbreviations were found to have ambiguous occurrences in biomedical text [6].

Identifying Synonymous Relations between Terms

The importance of synonym learning is well recognized in the NLP research community, especially in the biomedical [9] and clinical [1] domains. A wide range of techniques has been proposed for the identification of synonyms and other semantic relations, including the use of lexico-syntactic patterns, graph-based models and, indeed, distributional semantics [10] – the approach assumed in this study.

For instance, Hearst [11] proposes the use of lexico-syntactic patterns for the automatic acquisition of hyponyms from unstructured text. These patterns are hand-crafted according to observations in a corpus. Patterns can similarly be constructed for other types of lexical relations. However, a requirement is that these syntactic patterns are common enough to match a wide array of hyponym pairs. Blondel et al. [12] present a graph-based method that takes its inspiration from the calculation of hub, authority and centrality scores when ranking hyperlinked web pages. They illustrate that the central similarity score can be applied to the task of automatically extracting synonyms from a monolingual dictionary, in this case the Webster dictionary, where the assumption is that synonyms have a large overlap in the words used in their definitions; they also co-occur in the definition of many words. Another possible source for extracting synonyms is the use of linked data, such as Wikipedia. Nakayama et al. [13] also utilize a graph-based method, but instead of relying on word co-occurrence information, they exploit the links between Wikipedia articles (treated as concepts). This way they can measure both the strength (the number of paths from one article to another) and the distance (the length of each path) between concepts: relations between articles that are close to each other in the graph and share common hyperlinks are deemed stronger than those between articles farther apart.

There have also been some previous efforts to obtain better performance on the synonym extraction task by not only using a single source and a single method. Inspiration for some of these approaches has been drawn from ensemble learning, a machine learning technique that combines the output of several different classifiers with the aim of improving classification performance (see [14] for an overview). Curran [15] exploits this notion for synonym extraction and demonstrates that ensemble methods outperform individual classifiers even for very large corpora. Wu and Zhou [16] use multiple resources – a monolingual dictionary, a bilingual corpus and a large monolingual corpus – in a weighted ensemble method that combines the individual

extractors, thereby improving both precision and recall on the synonym acquisition task. Along somewhat similar lines, van der Plas and Tiedemann [17] use parallel corpora to calculate distributional similarity based on (automatic) word alignment, where a translational context definition is employed; synonyms are extracted with both greater precision and recall compared to a monolingual approach. This approach is, however, hardly applicable in the medical domain due to the unavailability of parallel corpora. Peirsman and Geeraerts [18] combine predictors based on collocation measures and distributional semantics with a so-called compounding approach, wherein cues are combined with strongly associated words into compounds and verified against a corpus. This ensemble approach is shown substantially to outperform the individual predictors of strong term associations in a Dutch newspaper corpus.

In the biomedical domain, most efforts have focused on extracting synonyms of gene and protein names from the biomedical literature [19–21]. In the clinical domain, Conway and Chapman [22] propose a rule-based approach to generate potential synonyms from the BioPortal ontology – using permutations, abbreviation generation, etc. – after which candidate synonyms are verified against a large clinical corpus. Zeng et al. [23] evaluate three query expansion methods for retrieval of clinical documents and conclude that an LDA-based topic model generates the best synonyms. Henriksson et al. [24] use models of distributional semantics to induce unigram word spaces and multiword term spaces from a large corpus of clinical text in an attempt to extract synonyms of varying length for Swedish SNOMED CT concepts.

Creating Abbreviation Dictionaries Automatically

There are a number of studies on the automatic creation of biomedical abbreviation dictionaries that exploit the fact that abbreviations are sometimes defined in the text on their first mention. These studies extract candidates for abbreviation-expansion pairs by assuming that either the long form or the abbreviation is written in parentheses [25]; other methods that use rule-based pattern matching have also been proposed [26]. The process of determining which of the extracted candidates are likely to be correct abbreviation-expansion pairs is then performed either by rule-based [26] or machine learning [27, 28] methods. Most of these studies have been conducted for English; however, there is also a study on Swedish medical text [29], for instance.

Yu et al. [30] have, however, found that around 75% of all abbreviations found in biomedical articles are never defined in the text. The application of these methods to clinical text is most likely inappropriate, as clinical text is often written in a telegraphic style, mainly for documentation purposes [1]; it seems unlikely that effort would be spent on defining the abbreviations used in this type of text. There has, however, been

some work on identifying such undefined abbreviations [31], as well as on finding the intended abbreviation expansion among several possible expansions available in an abbreviation dictionary [32]. In summary, automatic creation of biomedical abbreviation dictionaries from texts where abbreviations are defined is well studied. This is also the case for abbreviation disambiguation given several possible long forms in an abbreviation dictionary. The abbreviation part of this study, however, focuses on a task which has not yet been adequately explored: finding abbreviation-expansion pairs without requiring the abbreviations to be defined in the text.

Distributional Semantics: Inducing Semantic Spaces from Corpora

Distributional semantics (see [33] for an overview of methods and their application in the biomedical domain) were initially motivated by the inability of the vector space model [34] – as it was originally conceived – to account for the variability of language use and word choice stemming from natural language phenomena such as synonymy. To overcome the negative impact this had on recall in information retrieval systems, models of distributional semantics were proposed [35–37]. The theoretical foundation underpinning such semantic models is the distributional hypothesis [38], which states that words with similar distributions in language – in the sense that they co-occur with overlapping sets of words – tend to have similar meanings. Distributional methods have become popular with the increasing availability of large corpora and are attractive due to their computational approach to semantics, allowing an estimate of the semantic relatedness between two terms to be quantified.

An obvious application of distributional semantics is the extraction of semantically related terms. As near-synonyms are interchangeable in at least some contexts, their distributional profiles are likely to be similar, which in turn means that synonymy is a semantic relation that should to a certain degree be captured by these methods. This seems intuitive, as, next to identity, the highest degree of semantic relatedness between terms is realized by synonymy. It is, however, well recognized that other semantic relations between terms that share similar contexts will likewise be captured by these models [39]; synonymy cannot readily be isolated from such relations.

Spatial models of distributional semantics generally differ in how vectors representing term meaning are constructed. These vectors, often referred to as context vectors, are typically derived from a term-context matrix that contains the (weighted, normalized) frequency with which terms occur in different contexts. Working directly with such high-dimensional (and inherently sparse) data — where the dimensionality is equal to the number of contexts (e.g. the number of documents or the size of the vocabulary, depending on

which context definition is employed) — would entail unnecessary computational complexity, in particular since most terms only occur in a limited number of contexts, which means that most cells in the matrix will be zero. The solution is to project the high-dimensional data into a lower-dimensional space, while approximately preserving the relative distances between data points. The benefit of dimensionality reduction is two-fold: on the one hand, it reduces complexity and data sparseness; on the other hand, it has also been shown to improve the coverage and accuracy of term-term associations, as, in this reduced (semantic) space, terms that do not necessarily co-occur directly in the same contexts – this is indeed the typical case for synonyms and abbreviation-expansion pairs – will nevertheless be clustered about the same subspace, as long as they appear in similar contexts, i.e. have neighbors in common (co-occur with the same terms). In this way, the reduced space can be said to capture higher order co-occurrence relations.

In latent semantic analysis (LSA) [35], dimensionality reduction is performed with a computationally expensive matrix factorization technique known as singular value decomposition. Despite its popularity, LSA has consequently received some criticism for its poor scalability properties, as well as for the fact that a semantic space needs to be completely rebuilt each time new data is added. As a result, alternative methods for constructing semantic spaces based on term co-occurrence information have been proposed.
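The claim that a random projection approximately preserves relative distances can be illustrated with a toy experiment; all sizes in this sketch are arbitrary choices of ours, used only to convey the Johnson-Lindenstrauss intuition.

```python
# Toy illustration: a random projection approximately preserves relative
# distances between vectors (sizes are arbitrary, for illustration only).
import numpy as np

rng = np.random.default_rng(0)
high = rng.normal(size=(100, 10000))   # 100 "term vectors" over 10,000 contexts
proj = rng.normal(size=(10000, 1500)) / np.sqrt(1500)
low = high @ proj                      # the same vectors in 1,500 dimensions

# compare a few pairwise distances before and after projection
for i, j in [(0, 1), (2, 3), (4, 5)]:
    d_high = np.linalg.norm(high[i] - high[j])
    d_low = np.linalg.norm(low[i] - low[j])
    print(f"pair ({i},{j}): {d_high:.1f} vs {d_low:.1f}")
```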

Random Indexing

Random indexing (RI) [40] is an incremental, scalable and computationally efficient alternative to LSA in which explicit dimensionality reduction is avoided: a lower dimensionality d is instead chosen a priori as a model parameter and the d-dimensional context vectors are then constructed incrementally. This approach allows new data to be added at any given time without having to rebuild the semantic space. RI can be viewed as a two-step operation (a minimal sketch follows the two steps below).

1. Each context (e.g. each document or unique term) is assigned a sparse, ternary and randomly generated index vector: a small number (1-2%) of +1s and -1s are randomly distributed; the rest of the elements are set to zero. By generating sparse vectors of a sufficiently high dimensionality in this way, the context representations will, with a high probability, be nearly orthogonal.

2. Each unique term is also assigned an initially empty context vector of the same dimensionality. The context vectors are then incrementally populated with context information by adding the index vectors of the contexts in which the target term appears.
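To make the two-step procedure concrete, below is a minimal Python sketch of window-based random indexing. The function names, defaults and toy corpus format are illustrative assumptions, not taken from JavaSDM:

```python
import numpy as np

def make_index_vector(dim=1000, nonzero=8, rng=None):
    """Step 1: a sparse, ternary, random index vector (half +1s, half -1s)."""
    rng = rng or np.random.default_rng()
    vec = np.zeros(dim)
    positions = rng.choice(dim, size=nonzero, replace=False)
    vec[positions[:nonzero // 2]] = 1.0
    vec[positions[nonzero // 2:]] = -1.0
    return vec

def random_indexing(docs, dim=1000, window=2, seed=42):
    """Step 2: incrementally populate each term's context vector by adding
    the index vectors of its neighbors within a sliding window."""
    rng = np.random.default_rng(seed)
    index_vectors, context_vectors = {}, {}
    for doc in docs:
        for i, term in enumerate(doc):
            context_vectors.setdefault(term, np.zeros(dim))
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if j == i:
                    continue
                if doc[j] not in index_vectors:
                    index_vectors[doc[j]] = make_index_vector(dim, rng=rng)
                context_vectors[term] += index_vectors[doc[j]]
    return context_vectors

# Toy usage with a list of tokenized documents:
# spaces = random_indexing([["chest", "pain", "radiating", "to", "arm"]])
```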

Random Permutation

Random permutation (RP) [41] is a modification of RI that attempts to take term order information into account by simply permuting (i.e. shifting) the index vectors according to their direction and distance from the target term before they are added to the context vector. In effect, each term has multiple unique representations: one index vector for each possible position relative to the target term in the context window.
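A hedged companion to the RI sketch above (reusing make_index_vector) shows how the shift can be realized; np.roll is used here as a stand-in for the permutation operation:

```python
import numpy as np

def random_permutation(docs, dim=1000, window=2, seed=42):
    """Like random_indexing above, but each neighbor's index vector is
    rotated by its signed offset from the target term before being added,
    so that word order is encoded."""
    rng = np.random.default_rng(seed)
    index_vectors, context_vectors = {}, {}
    for doc in docs:
        for i, term in enumerate(doc):
            context_vectors.setdefault(term, np.zeros(dim))
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if j == i:
                    continue
                if doc[j] not in index_vectors:
                    index_vectors[doc[j]] = make_index_vector(dim, rng=rng)
                # offset j - i: negative for left neighbors, positive for right
                context_vectors[term] += np.roll(index_vectors[doc[j]], j - i)
    return context_vectors
```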

Model Parameters

There are a number of model parameters that need to be configured according to the task that the induced semantic spaces will be used for. For instance, the types of semantic relations captured by an RI space depend on the context definition [39]. By employing a document-level context definition, relying on direct co-occurrences, one models syntagmatic relations. That is, two terms that frequently co-occur in the same documents are likely to be about the same general topic. By employing a sliding window context definition, one models paradigmatic relations. That is, two terms that frequently co-occur with similar sets of words – i.e., share neighbors – but do not necessarily co-occur themselves, are semantically similar. Synonymy is a prime example of a paradigmatic relation. The size of the context window also affects the types of relations that are modeled and needs to be tuned for the task at hand. This is also true for semantic spaces produced by RP; however, the precise impact of window size on RP spaces and the internal relations of their context vectors is yet to be studied in depth.

Method

The main idea behind this study is to enhance the performance on the task of extracting synonyms and abbreviation-expansion pairs by combining multiple and different semantic spaces – different in terms of (1) type of model and model parameters used, and (2) type of corpus from which the semantic space is induced. In addition to combining semantic spaces induced from a single corpus, we also combine semantic spaces induced from two different types of corpora: in this case, a clinical corpus (comprising health record notes) and a medical corpus (comprising journal articles). The notion of combining multiple semantic spaces to improve performance on some task is generalizable and can loosely be described as creating ensembles of semantic spaces. By combining semantic spaces, it becomes possible to benefit from model types that capture slightly different aspects of semantics, to exploit various model parameter configurations (which influence the types of semantic relations that are modeled), as well as to observe language use in potentially very different contexts (by employing more than one corpus type). We set out exploring this approach by querying each semantic space separately and then combining their output using a number of combination strategies (Figure 1). The experimental setup can be divided into the following steps: (1) corpora preprocessing, (2) construction of semantic spaces from the two corpora, (3) identification of the most profitable single-corpus combinations, (4) identification of the most profitable multiple-corpora combinations, (5) evaluation of the single-corpus and multiple-corpora combinations, (6) post-processing of candidate terms, and (7) frequency threshold experiments. Once the corpora have been preprocessed, ten semantic spaces from each corpus are induced with different context window sizes (RP spaces are also induced with and without stop words). Ten pairs of semantic spaces are then combined using three different combination strategies. These are evaluated on the three tasks – (1) abbreviations → expansions, (2) expansions → abbreviations and (3) synonyms – using the development subsets of the reference standards (a list of medical abbreviation-expansion pairs and MeSH synonyms). Performance is mainly measured as recall top 10, i.e. the proportion of expected candidate terms that are among a list of ten suggestions. The pair of semantic spaces involved in the most profitable combination for each corpus is then used to identify the most profitable multiple-corpora combinations, where eight different combination strategies are evaluated. The best single-corpus combinations are evaluated on the evaluation subsets of the reference standards, where RI and RP used in isolation constitute the two baselines. The best multiple-corpora combination is likewise evaluated on the evaluation subsets of the reference standards; here, the clinical and medical ensembles constitute the two baselines. Post-processing rules are then constructed using the development subsets of the reference standards and the outputs of the various semantic space combinations. These are evaluated on the evaluation subsets of the reference standards using the most profitable single-corpus and multiple-corpora ensembles. Finally, we explore the impact of frequency thresholds (i.e., how many times each pair of terms in the reference standards needs to occur to be included) on performance.

Inducing Semantic Spaces from Clinical and Medical Corpora

Each individual semantic space is constructed with one model type, using a predefined context window size, and induced from a single corpus type. The semantic spaces are constructed with random indexing (RI) and random permutation (RP) using JavaSDM [42]. For all semantic spaces, a dimensionality of 1,000 is used (with 8 non-zero, randomly distributed elements in the index vectors: four 1s and four -1s). When the RI model is employed, the index vectors are weighted according to their distance from the target term, see Equation 1, where dist_it is the distance to the target term. When the RP model is employed, the elements of the index vectors are instead shifted according to their direction and distance from the target term; no weighting is performed. For all models, window sizes of two (1+1), four (2+2) and eight (4+4) surrounding terms are used. In addition, RI spaces with a window size of twenty (10+10) are induced in order to investigate whether a significantly wider context definition may be profitable. Incorporating order information (RP) with such a large context window makes little sense; such an approach would also suffer from data sparseness. Different context definitions are experimented with in order to find the one that is best suited to each task. The RI spaces are induced only from corpora that have been stop-word filtered, as co-occurrence information involving high-frequency and widely distributed words contributes very little to the meaning of terms. The RP spaces are, however, also induced from corpora in which stop words have been retained. The motivation behind this is that all words, including function words – these make up the majority of the items in the stop-word lists – are important to the syntactic structure of language and may thus be of value when modeling order information [43]. A stop-word list is created for each corpus by manually inspecting the most frequent types and removing those that may be of interest, e.g. domain-specific terms. Each list consists of approximately 150 terms.

weight_i = 2^(1 - dist_it)    (1)
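A direct reading of Equation 1 as a function, with a comment showing where it would plug into the RI sketch above:

```python
def distance_weight(dist_to_target: int) -> float:
    """Equation (1): weight_i = 2 ** (1 - dist_it); the nearest neighbor
    (distance 1) gets weight 1.0, distance 2 gets 0.5, distance 4 gets 0.125."""
    return 2.0 ** (1 - dist_to_target)

# In the RI sketch above, the update would then become:
# context_vectors[term] += distance_weight(abs(j - i)) * index_vectors[doc[j]]
```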

The semantic spaces are induced from two types of corpora – essentially belonging to different genres, but both within the wider domain of medicine: (1) a clinical corpus, comprising notes from health records, and (2) a medical corpus, comprising medical journal articles. The clinical corpus contains a subset of the Stockholm EPR Corpus [44], which encompasses health records from the Karolinska University Hospital in Stockholm, Sweden over a five-year period. The clinical corpus used in this study is created by extracting the free-text, narrative parts of the health records from a wide range of clinical practices. The clinical notes are written in Swedish by physicians, nurses and other health care professionals over a six-month period in 2008. In summary, the corpus comprises documents that each contain the clinical notes documenting a single patient visit at a particular clinical unit. The medical corpus contains the freely available subset of Läkartidningen (1996-2005), the Journal of the Swedish Medical Association [45]. It is a weekly journal written in Swedish and contains articles discussing new scientific findings in medicine, pharmaceutical studies, health economic evaluations, etc. Although these issues have been made available for research, the original order of the sentences has not been retained; the sentences are given in a random order. Both corpora are lemmatized using the Granska Tagger [46] and thereafter further preprocessed by removing punctuation marks and digits. Two versions of each corpus are created: one in which the stop words are retained and one in which they are removed. As the sentences in Läkartidningen are given in a random order, a document break is indicated between each sentence for this corpus. It is thereby ensured that context information from surrounding sentences will not be incorporated in the induced semantic spaces. Statistics for the two corpora are shown in Table 1. In summary, a total of twenty semantic spaces are induced – ten from each corpus type. Four RI spaces are induced from each corpus type (8 in total), the difference being the context definition employed (1+1, 2+2, 4+4, 10+10). Six RP spaces are induced from each corpus type (12 in total), the difference being the context definition employed (1+1, 2+2, 4+4) and whether stop words have been removed or retained (sw).
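As a hedged illustration of this preprocessing pipeline, the sketch below mimics the steps described above; the identity lemmatizer is merely a stand-in for the Granska Tagger, and the variable names are assumptions:

```python
import re

def preprocess(text, stop_words=None, lemmatize=lambda t: t):
    """Illustrative preprocessing: lowercase, keep only letter sequences
    (dropping punctuation and digits), lemmatize and optionally filter
    stop words."""
    tokens = [lemmatize(t.lower()) for t in re.findall(r"[^\W\d_]+", text)]
    if stop_words is not None:
        tokens = [t for t in tokens if t not in stop_words]
    return tokens

# Since the sentences in Läkartidningen come in random order, each sentence
# is treated as its own document, so that no co-occurrence information
# leaks across sentence boundaries:
# docs = [preprocess(s) for s in lakartidningen_sentences]
```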

Combinations of Semantic Spaces from a Single Corpus

Since RI and RP model semantic relations between terms in slightly different ways, it may prove profitable to combine them in order to increase the likelihood of capturing synonymy and identifying abbreviation-expansion pairs. In one study it was estimated that the overlap in the output produced by RI and RP spaces is, on average, only around 33% [41]: by combining them, we hope to capture different semantic properties of terms and, ultimately, boost results. The combinations from a single corpus type involve only two semantic spaces: one constructed with RI and one constructed with RP. In this study, the combinations involve semantic spaces with identical window sizes, with the following exception: RI spaces with a wide context definition (10+10) are combined with RP spaces with a narrow context definition (1+1, 2+2). The RI spaces are combined with RP spaces both with and without stop words. Three different strategies of combining an RI-based semantic space with an RP space are designed and evaluated. Thirty combinations are evaluated for each corpus, i.e. sixty in total (Table 2). The three combination strategies, sketched in code after the list, are:

• RI ⊂ RP30: Finds the top ten terms in the RI space that are among the top thirty terms in the RP space.

• RP ⊂ RI30: Finds the top ten terms in the RP space that are among the top thirty terms in the RI space.

• RI + RP: Sums the cosine similarity scores from the two spaces for each candidate term.
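A hedged sketch of the three strategies operating on ranked output lists; the strategy labels and list format are illustrative assumptions:

```python
def combine_single_corpus(ri_ranked, rp_ranked, strategy="RI+RP", top=10, pool=30):
    """Combine the ranked output of one RI space and one RP space.
    ri_ranked / rp_ranked: lists of (term, cosine) sorted by descending cosine."""
    if strategy == "RI in RP30":   # top ten RI terms found in RP's top thirty
        rp_pool = {t for t, _ in rp_ranked[:pool]}
        return [t for t, _ in ri_ranked if t in rp_pool][:top]
    if strategy == "RP in RI30":   # top ten RP terms found in RI's top thirty
        ri_pool = {t for t, _ in ri_ranked[:pool]}
        return [t for t, _ in rp_ranked if t in ri_pool][:top]
    if strategy == "RI+RP":        # sum the cosine scores per candidate term
        scores = {}
        for t, c in ri_ranked + rp_ranked:
            scores[t] = scores.get(t, 0.0) + c
        return sorted(scores, key=scores.get, reverse=True)[:top]
    raise ValueError(f"unknown strategy: {strategy}")
```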

Combinations of Semantic Spaces from Multiple Corpora

In addition to combining semantic spaces induced from one and the same corpus, a combination of semantic spaces induced from different corpora could potentially yield even better performance on the task of extracting synonyms and abbreviation-expansion pairs, especially if the terms of interest occur with some minimum frequency in both corpora. Such ensembles of semantic spaces – in this study consisting of four semantic spaces – allow not only different model types and model parameter configurations to be employed, but also language use in different genres or domains, in which terms may be used in slightly different contexts, to be captured. The pair of semantic spaces from each corpus that is best able to perform each of the aforementioned tasks – consisting of two semantic spaces – is subsequently combined using various combination strategies. The combination strategies can usefully be divided into two sets of approaches: in the first, the four semantic spaces are treated equally – irrespective of source – and combined in a single step; in the other, a two-step approach is assumed, wherein each pair of semantic spaces – induced from the same source – is combined separately before the combination of combinations is performed. In both sets of approaches, the outputs of the semantic spaces are combined in one of two ways: SUM, where the cosine similarity scores are merely summed, and AVG, where the average cosine similarity score is calculated based on the number of semantic spaces in which the term under consideration exists. The latter is an attempt to mitigate the effect of differences in vocabulary between the two corpora. In the two-step approaches, the SUM/AVG option is configurable for each step. In the single-step approaches, the combinations can be performed either with or without normalization, which in this case means replacing the exact cosine similarity scores of the candidate terms in the output of each queried semantic space with their ranking in the list of candidate terms. This means that the candidate terms are then sorted in ascending order, with zero being the highest score. When combining two or more lists of candidate terms, the combined list is thus also sorted in ascending order. The rationale behind this option is that the cosine similarity scores are relative and thus only valid within a given semantic space: combining similarity scores from semantic spaces constructed with different model types and parameter configurations, and induced from different corpora, might have adverse effects. In the two-step approaches, normalization is always performed after combining each pair of semantic spaces. In total, eight combination strategies are evaluated (a code sketch of the underlying combination logic follows the list):

Single-Step Approaches

• SUM: RI_clinical + RP_clinical + RI_medical + RP_medical
Each candidate term's cosine similarity score in each semantic space is summed. The top ten terms from this list are returned.

• SUM, normalized: norm(RI_clinical) + norm(RP_clinical) + norm(RI_medical) + norm(RP_medical)
The output of each semantic space is first normalized by using the ranking instead of the cosine similarity; each candidate term's (reverse) ranking in each semantic space is then summed. The top ten terms from this list are returned.

• AVG: (RI_clinical + RP_clinical + RI_medical + RP_medical) / count_term
Each candidate term's cosine similarity score in each semantic space is summed; this value is then averaged over the number of semantic spaces in which the term exists. The top ten terms from this list are returned.

• AVG, normalized: (norm(RI_clinical) + norm(RP_clinical) + norm(RI_medical) + norm(RP_medical)) / count_term
The output of each semantic space is first normalized by using the ranking instead of the cosine similarity; each candidate term's normalized score in each semantic space is then summed; this value is finally averaged over the number of semantic spaces in which the term exists. The top ten terms from this list are returned.

Two-Step Approaches

• SUM → SUM: norm(RI_clinical + RP_clinical) + norm(RI_medical + RP_medical)
Each candidate term's cosine similarity score in each pair of semantic spaces is first summed; these sums are then normalized by using the ranking instead of the cosine similarity; finally, each candidate term's normalized score is summed. The top ten terms from this list are returned.

• AVG → AVG: (norm((RI_clinical + RP_clinical) / count_term,source_a) + norm((RI_medical + RP_medical) / count_term,source_b)) / (count_term,source_a + count_term,source_b)
Each candidate term's cosine similarity score for each pair of semantic spaces is first summed; for each pair, this value is then averaged over the number of semantic spaces in that pair in which the term exists; these averages are subsequently normalized by using the ranking instead of the cosine similarity; each candidate term's normalized score in each combined list is then summed and averaged over the number of semantic spaces in which the term exists (in both pairs of semantic spaces). The top ten terms from this list are returned.

• SUM → AVG: (norm(RI_clinical + RP_clinical) + norm(RI_medical + RP_medical)) / count_term
Each candidate term's cosine similarity score for each pair of semantic spaces is first summed; these sums are then normalized by using the ranking instead of the cosine similarity; each candidate term's normalized score in each combined list is then summed and averaged over the number of semantic spaces in which the term exists. The top ten terms from this list are returned.

• AVG → SUM: norm((RI_clinical + RP_clinical) / count_term) + norm((RI_medical + RP_medical) / count_term)
Each candidate term's cosine similarity score for each pair of semantic spaces is first summed and averaged over the number of semantic spaces in that pair in which the term exists; these averages are then normalized by using the ranking instead of the cosine similarity; each candidate term's normalized score in each combined list is finally summed. The top ten terms from this list are returned.
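The shared machinery behind all eight strategies can be sketched as follows. Reverse ranks are used so that higher remains better, which is order-equivalent to the ascending zero-is-best ranking described above; the function names are assumptions:

```python
def normalize_by_rank(scored):
    """Replace cosine scores with reverse rank positions (the best term gets
    the largest value), preserving the rank-based normalization described
    above while keeping 'higher is better' for the summation."""
    ordered = sorted(scored, key=scored.get, reverse=True)
    return {t: len(ordered) - r for r, t in enumerate(ordered)}

def combine_multi(space_outputs, avg=False, normalized=False, top=10):
    """Single-step ensemble over several semantic spaces.
    space_outputs: list of {term: cosine} dicts, one per semantic space."""
    if normalized:
        space_outputs = [normalize_by_rank(s) for s in space_outputs]
    totals, counts = {}, {}
    for out in space_outputs:
        for term, score in out.items():
            totals[term] = totals.get(term, 0.0) + score
            counts[term] = counts.get(term, 0) + 1
    if avg:  # average over the number of spaces in which the term exists
        totals = {t: totals[t] / counts[t] for t in totals}
    return sorted(totals, key=totals.get, reverse=True)[:top]
```

The two-step strategies can then be expressed by composing this function: combine each corpus pair first, normalize the two combined lists, and combine again in a second call.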

Post-Processing of Candidate Terms

In addition to creating ensembles of semantic spaces, simple filtering rules are designed and evaluated for their ability to further enhance performance on the task of extracting synonyms and abbreviation-expansion pairs. For obvious reasons, this is easier for abbreviation-expansion pairs than for synonyms. With regard to abbreviation-expansion pairs, the focus is on increasing precision by discarding poor suggestions in favor of potentially better ones. This is attempted by exploiting properties of the abbreviations and their corresponding expansions. The development subset of the reference standard (see Evaluation Framework) is used to construct rules that determine the validity of candidate terms. For an abbreviation-expansion pair to be considered valid, each letter in the abbreviation has to be present in the expansion and the letters also have to appear in the same order. Additionally, the length of abbreviations and expansions is restricted, requiring an expansion to contain more than four letters, whereas an abbreviation is allowed to contain a maximum of four letters. These rules are shown in (2) and (3), and a code sketch of them follows the definitions below.

For synonym extraction, cut-off values for rank and cosine similarity are instead employed. These cut-off values are tuned to maximize precision for the best semantic space combinations in the development subset of the reference standard, without negatively affecting recall (see Figures 2-4). The cut-off values used are shown in (4) for the clinical corpus, in (5) for the medical corpus and in (6) for the combination of the two corpora. In (6), Cos denotes the combination of the cosine values, which means that it has a maximum value of four rather than one.

Exp → Abbr = True, if (Len < 5) ∧ (Sub_out = True); False, otherwise    (2)

Abbr → Exp = True, if (Len > 4) ∧ (Sub_in = True); False, otherwise    (3)

Syn_clinical = True, if (Cos ≥ 0.60) ∨ (Cos ≥ 0.40 ∧ Rank < 9); False, otherwise    (4)

Syn_medical = True, if (Cos ≥ 0.50); False, otherwise    (5)

Syn_clinical+medical = True, if (Cos ≥ 1.9) ∨ (Cos ≥ 1.8 ∧ Rank < 6) ∨ (Cos ≥ 1.75 ∧ Rank < 3); False, otherwise    (6)

Cos: Cosine similarity between the candidate term and the query term.
Rank: The ranking of the candidate term, ordered by cosine similarity.
Sub_out: Whether each letter in the candidate term is present in the query term, in the same order and with identical initial letters.
Sub_in: Whether each letter in the query term is present in the candidate term, in the same order and with identical initial letters.
Len: The length of the candidate term.
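A hedged Python sketch of rules (2) and (3); the function names and the ordered-subsequence idiom are illustrative assumptions, not the study's actual implementation:

```python
def is_ordered_subsequence(short, long):
    """Sub_out / Sub_in check: every letter of `short` occurs in `long` in
    the same order, and the initial letters must be identical."""
    if not short or not long or short[0] != long[0]:
        return False
    it = iter(long)
    return all(ch in it for ch in short)

def valid_expansion(abbreviation, candidate):
    """Rule (3), abbr -> exp: the candidate expansion must contain more than
    four letters and contain the abbreviation as an ordered subsequence."""
    return len(candidate) > 4 and is_ordered_subsequence(abbreviation, candidate)

def valid_abbreviation(expansion, candidate):
    """Rule (2), exp -> abbr: the candidate abbreviation may contain at most
    four letters and must be an ordered subsequence of the expansion."""
    return len(candidate) < 5 and is_ordered_subsequence(candidate, expansion)
```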

The post-processing filtering rules are employed in two different ways. In the first approach, the semantic spaces are forced to suggest a predefined number of candidate terms (ten), irrespective of how good they are deemed to be by the semantic space. Candidate terms are retrieved from the semantic space until ten have been classified as correct according to the post-processing rules, or until one hundred candidate terms have been classified. If fewer than ten are classified as correct, the highest ranked discarded terms are used to populate the remaining slots in the final list of candidate terms. In the second approach, the semantic spaces are allowed to suggest a dynamic number of candidate terms, with a minimum of one and a maximum of ten. If none of the highest ranked terms are classified as correct, the highest ranked term is suggested.
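The two filtering modes can be sketched as follows, assuming a plain ranked candidate list is available; is_valid stands for the validity rules above:

```python
def filter_candidates(ranked, is_valid, fixed_ten=True, max_checked=100, top=10):
    """Post-processing of a ranked candidate list. With fixed_ten=True,
    always return ten terms, back-filling with the highest ranked discarded
    ones; with fixed_ten=False (dynamic cut-off), return 1..10 valid terms."""
    kept, discarded = [], []
    for term in ranked[:max_checked]:
        (kept if is_valid(term) else discarded).append(term)
        if len(kept) == top:
            break
    if fixed_ten:
        return (kept + discarded)[:top]
    return kept[:top] if kept else ranked[:1]  # fall back to the top-ranked term
```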

Evaluation Framework

Evaluation of the numerous experiments is carried out with the use of reference standards: one contains known abbreviation-expansion pairs and the other contains known synonyms. The semantic spaces and their various combinations are evaluated for their ability to extract known abbreviations/expansions (abbr→exp and exp→abbr) and synonyms (syn) – according to the employed reference standard – for a given query term in a list of ten candidate terms (recall top 10). Recall is prioritized in this study and any decisions, such as deciding which model parameters or which combination strategies are the most profitable, are based solely on this measure. When precision is reported, it is calculated as weighted precision, where the weights are assigned according to the ranking of a correctly identified term.

The reference standard for abbreviations is taken from Cederblom [47], a book that contains lists of medical abbreviations and their corresponding expansions; these abbreviations have been manually collected from Swedish health records, newspapers, scientific articles, etc. For the synonym extraction task, the reference standard is derived from the freely available part of the Swedish version of MeSH [48] – a part of UMLS – as well as a Swedish extension that is not included in UMLS [49]. As the semantic spaces are constructed only to model unigrams, all multiword expressions are removed from the reference standards. Moreover, hypernym/hyponym and other non-synonym pairs found in the UMLS version of MeSH are manually removed from the reference standard for the synonym extraction task.

Models of distributional semantics sometimes struggle to model the meaning of rare terms accurately, as the statistical basis for their representation is insufficiently solid. As a result, we only include term pairs that occur at least fifty times in each respective corpus. This, together with the fact that term frequencies differ from corpus to corpus, means that there is one set of reference standards for each corpus. For evaluating combinations of semantic spaces induced from different corpora, a third set of reference standards is created, in which only term pairs that occur at least fifty times in both corpora are included. Included terms are not restricted to form pairs; in the reference standard for the synonym extraction task, some form larger groups of terms with synonymous relations. There are also abbreviations with several possible expansions, as well as expansions with several possible abbreviations.

The term pairs (or n-tuples) in each reference standard are randomly split into a development set and an evaluation set of roughly equal size. The development sets are used for identifying the most profitable ensembles of semantic spaces (with optimized parameter settings, such as window size and whether to include stop words in the RP spaces) for each of the three tasks, as well as for creating the post-processing filtering rules. The evaluation sets are used for the final evaluation to assess the expected performance of the ensembles in a deployment setting. Baselines for the single-corpus ensembles are created by employing RI and RP in isolation; baselines for the multiple-corpora ensembles are created by using the most profitable clinical and medical ensembles from the single-corpus experiments. Statistics for the reference standards are shown in Table 3.
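As a hedged illustration of the two measures, the sketch below computes recall top 10 and a rank-weighted precision; the exact weighting scheme used in the study is not spelled out above, so the 1/rank weights are an assumption:

```python
def recall_top10(suggestions, expected):
    """Recall top 10: the proportion of expected terms found among the ten
    highest ranked suggestions."""
    return len(set(suggestions[:10]) & set(expected)) / len(expected)

def weighted_precision(suggestions, expected):
    """Precision weighted by the rank of each correct hit; 1/rank weights
    are an illustrative assumption, not the study's exact scheme."""
    weights = [1.0 / (rank + 1) for rank in range(len(suggestions))]
    hit_mass = sum(w for w, t in zip(weights, suggestions) if t in expected)
    return hit_mass / sum(weights) if suggestions else 0.0
```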

Results

The experimental setup was designed in such a manner that the semantic spaces that performed best in combination for a single corpus would also be used in the subsequent combinations from multiple corpora. Identifying the most profitable combination strategy for each of the three tasks was achieved using the development subsets of the reference standards. These combinations were then evaluated on separate evaluation sets containing unseen data. All further experiments, including the post-processing of candidate terms, were carried out with these combinations on the evaluation sets. This is therefore also the order in which the results will be presented.

Combination Strategies: A Single Corpus

The first step involved identifying the most appropriate window sizes for each task, in conjunction with evaluating the combination strategies. The reason for this is that the optimal window sizes for RI and RP in isolation are not necessarily identical to the optimal window sizes when RI and RP are combined. In fact, when RI is used in isolation, a window size of 2+2 performs best on the two abbreviation-expansion tasks, and a window size of 10+10 performs best on the synonym task. For RP, a semantic space with a window size of 2+2 yields the best results on two of the tasks – abbr→exp and syn – while a window size of 4+4 is more successful on the exp→abbr task. These are the model configurations used in the RI and RP baselines, to which the single-corpus combination strategies are compared in the final evaluation. Using the semantic spaces induced from the clinical corpus, the RI + RP combination strategy, whereby the cosine similarity scores are merely summed, is the most successful on all three tasks: 0.42 recall on the abbr→exp task, 0.32 recall on the exp→abbr task, and 0.40 recall on the syn task (Table 4). For the abbreviation expansion task, a window size of 2+2 appears to work well for both models, with the RP space retaining stop words. On the task of identifying the abbreviated form of an expansion, semantic spaces with window sizes of 2+2 and 4+4 perform equally well; the RP spaces should include stop words. Finally, on the synonym extraction task, an RI space with a large context window (10+10) in conjunction with an RP space with stop words and a window size of 2+2 is the most profitable. Using the semantic spaces induced from the medical corpus, the RI + RP combination strategy again outperforms the RI ⊂ RP30 and RP ⊂ RI30 strategies: 0.10 recall on the abbr→exp task, 0.08 recall on the exp→abbr task, and 0.30 recall on the syn task (Table 5). This combination beats the other two by a large margin on the exp→abbr task: 0.08 recall compared to 0.03 recall. The most appropriate window sizes for capturing these phenomena in the medical corpus are fairly similar to those that worked best with the clinical corpus. On the abbr→exp task, the optimal window sizes are indeed identical across the two corpora: a 2+2 context window with an RP space that incorporates stop words yields the highest performance. For the exp→abbr task, a slightly larger context window of 4+4 seems to work well – again, with stop words retained in the RP space. Alternatively, combining a large RI space (10+10) with a smaller RP space (2+2, with stop words) performs comparably on this task and with this test data. Finally, for synonyms, a large RI space (10+10) with a very small RP space (1+1) that retains all words best captures this phenomenon with this type of corpus.

The best-performing combinations from each corpus and for each task were then treated as (ensemble) baselines in the final evaluation, where combinations of semantic spaces from multiple corpora are evaluated.

Combination Strategies: Multiple Corpora

The pair of semantic spaces from each corpus that performed best on the three tasks were subsequently employed in combinations that involved four semantic spaces – two from each corpus: one RI space and one RP space. The single-step approaches generally performed better than the two-step approaches, with some exceptions (Table 6). The most successful ensemble was a simple single-step approach, where the cosine similarity scores produced by each semantic space were simply summed (SUM), yielding 0.32 recall for abbr→exp, 0.17 recall for exp→abbr, and 0.52 recall for syn. The AVG option, although the second-highest performer on the abbreviation-expansion tasks, yielded significantly poorer results. Normalization, whereby ranking was used instead of cosine similarity, invariably affected performance negatively, especially when employed in conjunction with SUM. The two-step approaches performed significantly worse than all non-normalized single-step approaches, with the sole exception occurring on the synonym extraction task. It should be noted that normalization was always performed in the two-step approaches – this was done after each pair of semantic spaces from a single corpus had been combined. Of the four two-step combination strategies, AVG→AVG and AVG→SUM performed best, with identical recall scores on the three tasks.

Final Evaluations

The combination strategies that performed best on the development sets were finally evaluated on completely unseen data in order to assess their generalizability to new data and their expected performance in a deployment setting. Each evaluation phase involves comparing the results to one or more baselines: in the case of single-corpus combinations, the comparisons are made to RI and RP in isolation; in the case of multiple-corpora combinations, the comparisons are made to a clinical ensemble baseline and a medical ensemble baseline. When applying the single-corpus combinations from the clinical corpus, the following results were obtained: 0.31 recall on abbr→exp, 0.20 recall on exp→abbr, and 0.44 recall on syn (Table 7). Compared to the results on the development sets, the results on the two abbreviation-expansion tasks decreased by approximately ten percentage points; on the synonym extraction task, the performance increased by a couple of percentage points. The RI baseline was outperformed on all three tasks; the RP baseline was outperformed on two out of three tasks, with the exception of the exp→abbr task. Finally, it might be interesting to point out that the RP baseline performed better than the RI baseline on the two abbreviation-expansion tasks, but that the RI baseline did somewhat better on the synonym extraction task. With the medical corpus, the following results were obtained: 0.17 recall on abbr→exp, 0.11 recall on exp→abbr, and 0.34 recall on syn (Table 8). Compared to the results on the development sets, the results were higher for all three tasks. Both the RI and RP baselines were outperformed, by a considerable margin, by their combination. In complete contrast to the clinical corpus, the RI baseline here beat the RP baseline on the two abbreviation-expansion tasks, but was outperformed by the RP baseline on the synonym extraction task. When applying the multiple-corpora ensembles, the following results were obtained on the evaluation sets: 0.30 recall on abbr→exp, 0.19 recall on exp→abbr, and 0.47 recall on syn (Table 9). Compared to the results on the development sets, the results decreased somewhat on two of the tasks, with exp→abbr the exception. The two ensemble baselines were clearly outperformed by the larger ensemble of semantic spaces from two types of corpora on two of the tasks; the clinical ensemble baseline performed equally well on the exp→abbr task. It might be of some interest to compare the two ensemble baselines here since, in this setting, they can more fairly be compared: they are evaluated on identical test sets. The clinical ensemble baseline quite clearly outperformed the medical ensemble baseline on the two abbreviation-expansion tasks; however, on the synonym extraction task, their performance was comparable.

Post-Processing

In an attempt to further improve results, simple post-processing of the candidate terms was performed. In one setting, the system was forced to suggest ten candidate terms regardless of their cosine similarity score or other properties of the terms, such as their length. In another setting, the system had the option of suggesting a dynamic number – ten or fewer – of candidate terms.

This was unsurprisingly more effective on the two abbreviation-expansion tasks. With the clinical corpus, recall improved substantially with the post-processing filtering: from 0.31 to 0.42 on abbr→exp and from 0.20 to 0.33 on exp→abbr (Table 7). With the medical corpus, however, almost no improvements were observed for these tasks (Table 8). With the combination of semantic spaces from the two corpora, significant gains were once more made by filtering the candidate terms on the two abbreviation-expansion tasks: from 0.30 to 0.39 recall and from 0.05 to 0.07 precision on abbr→exp; from 0.19 to 0.33 recall and from 0.03 to 0.06 precision on exp→abbr (Table 9). With a dynamic cut-off, only precision could be improved, although at the risk of negatively affecting recall. With the clinical corpus, recall was largely unaffected for the two abbreviation-expansion tasks, while precision improved by 3-7 percentage points (Table 7). With the medical corpus, the gains were even more substantial: from 0.03 to 0.17 precision on abbr→exp and from 0.02 to 0.10 precision on exp→abbr – without having any impact on recall (Table 8). The greatest improvements on these tasks were, however, observed with the combination of semantic spaces from multiple corpora: precision increased from 0.07 to 0.28 on abbr→exp and from 0.06 to 0.31 on exp→abbr – again, without affecting recall (Table 9).

In the case of synonyms, this form of post-processing is more challenging, as there are no simple properties of the terms, such as their length, that can serve as indications of their quality as candidate synonyms. Instead, one has to rely on their use in different contexts and on grammatical properties; as a result, cosine similarity and ranking of the candidate terms were exploited in an attempt to improve the candidate synonyms.

This approach was, however, clearly unsuccessful for both corpora and their combination, with almost no impact on either precision or recall. In a single instance – with the clinical corpus – precision increased by one percentage point, albeit at the expense of recall, which suffered a comparable decrease (Table 7). With the combination of semantic spaces from two corpora, the dynamic cut-off option resulted in a lower recall score, without improving precision (Table 9).

Frequency Thresholds

In order to study the impact of different frequency thresholds – i.e., how often each pair of terms had to occur in the corpora to be included in the reference standard – on the task of extracting synonyms, the best ensemble system was applied to a range of evaluation sets with thresholds from 1 to 100 (Figure 5). With a low frequency threshold, clearly lower results are obtained: for instance, if each synonym pair only needs to occur at least once in both corpora, a recall of 0.17 is obtained. As the threshold is increased, recall increases too, up to a frequency threshold of around 50, after which no performance boosts are observed. Already with a frequency threshold of around 30, the results seem to level off. With frequency thresholds over 100, there is not enough data in this case to produce any reliable results.

Discussion

The results clearly demonstrate that combinations of semantic spaces lead to improved results on all three tasks. Combining random indexing and random permutation allows slightly different aspects of lexical semantics to be captured; by combining them, stronger semantic relations between terms are extracted, thereby increasing the performance on these tasks. Combining semantic spaces induced from different corpora further improves performance. This demonstrates the potential of distributional ensemble methods, of which this – to the best of our knowledge – is the first implementation of its kind, and it only scratches the surface. In this initial study, only four semantic spaces were used; however, with increasing computational capabilities, there is nothing stopping a much larger number of semantic spaces from being combined. These can capture various aspects of semantics – aspects which may be difficult, if not impossible, to incorporate into a single model – from a large variety of observational data on language use, where the contexts may be very different.

Clinical vs. Medical Corpora

When employing corpus-driven methods to support lexical resource development, one naturally needs to have access to a corpus in the target domain that reflects the language use one wishes to model. Hence, one cannot, without due qualification, state that one corpus type is better than another for the extraction of synonyms or abbreviation-expansion pairs. This is something that needs to be duly considered when comparing the results for the semantic spaces on the clinical and medical corpora, respectively. Another issue concerns the size of each corpus: in fact, the medical corpus is only half the size of the clinical corpus (Table 1). The reference standards used in the respective experiments are, however, not identical: each term pair had to occur at least fifty times to be included – this will differ across corpora. To some extent this mitigates the effect of the total corpus size and makes the comparison between the two corpora fairer; however, differences in reference standards also entail that the results are not directly comparable. Another difference between the two corpora is that the clinical corpus contains more unique terms (types) than the medical corpus, which might indicate that it covers a larger number of concepts. It has previously been shown that it can be beneficial, indeed important, to employ a larger dimensionality when using corpora with a large vocabulary, as is typically the case in the clinical domain [50]; in this study a dimensionality of 1,000 was used to induce all semantic spaces. The results nevertheless seem to indicate that better performance is generally obtained with the semantic spaces induced from the clinical corpus. A fairer comparison can be made between the two ensemble baselines (Table 9): on the synonym extraction task, the results are comparable, while the clinical ensemble performs much better on the abbreviation-expansion tasks. A possible explanation for the latter is that abbreviations are defined in the medical corpus and are thus not used interchangeably with their expansions to the same extent as in clinical text. This implies that such abbreviation-expansion pairs do not have a paradigmatic relation; models of syntagmatic relations are perhaps better suited to capturing abbreviation-expansion pairs in this type of text.

An advantage of using non-sensitive corpora, like the medical corpus employed in this study, is that they are generally more readily obtainable than sensitive clinical data. Such sources could perhaps complement smaller clinical corpora, yielding similar or potentially even better results.

Combining Semantic Spaces

Creating ensembles of semantic spaces has been shown to be profitable, at least on the task of extracting synonyms and abbreviation-expansion pairs. In this study, the focus has been on combining the output of the semantic spaces. This is probably the most straightforward approach and it has several advantages. For one, the manner in which the semantic representations are created can largely be ignored, which would potentially allow one to combine models that are very different in nature, as long as one can retrieve a ranked list of semantically related terms with a measure of the strength of the relation. It also means that one can readily combine semantic spaces that have been induced with different parameter settings, for instance with different context definitions and of different dimensionality. An alternative approach would be to combine semantic spaces on the vector level. Such an approach would be interesting to explore; however, it would pose numerous challenges, not least in combining context vectors that have been constructed differently and potentially represent meaning in disparate ways.

Several combination strategies were designed and evaluated. In both the single-corpus and multiple-corpora ensembles, the simplest strategy performed best: the one whereby the cosine similarity scores are summed. There are potential problems with such a strategy, since the similarity scores are not absolute measures of semantic relatedness, but merely relative and only valid within a single semantic space. The cosine similarity scores will, for instance, differ depending on the distributional model used and the size of the context window. An attempt was made to deal with this by replacing the cosine similarity scores with ranking information, as a means to normalize the output of each semantic space before combining them. This approach, however, yielded much poorer results. A possible explanation for this is that a measure of the semantic relatedness between terms is of much more importance than their ranking. After all, a list of the highest ranked terms does not necessarily imply that they are semantically similar to the query term; only that they are the most semantically similar in this space.

To gain deeper insights into the process of combining the output of multiple semantic spaces, an error analysis was conducted on the synonym extraction task. This was achieved by comparing the outputs of the most profitable combination of semantic spaces from each corpus, as well as the combination of semantic spaces from the two corpora. The error analysis was conducted on the development sets. Of the 68 synonyms that were correctly identified as such by the corpora combination, five were not extracted by either of the single-corpus combinations; nine were extracted by the medical ensemble but not by the clinical ensemble; as many as 51 were extracted by the clinical ensemble but not by its medical counterpart; in the end, this means that only three terms were extracted by both the clinical and medical ensembles. These results augment the case for multiple-corpora ensembles. There appears to be little overlap in the top-10 outputs of the corpora-specific ensembles; by combining them, 17 additional true synonyms are extracted compared to the clinical ensemble alone. Moreover, the fact that so many synonyms are extracted by the clinical ensemble demonstrates the importance of exploiting clinical corpora and the applicability of distributional semantics to this genre of text. In Table 10, the first two examples, sjukhem (nursing home) and depression, show cases for which the multiple-corpora ensemble was successful but the single-corpus ensembles were not. In the third example, both the multiple-corpora ensemble and the clinical ensemble extract the expected synonym candidate. There was one query term – the drug name omeprazol – for which both single-corpus ensembles were able to identify the synonym, but where the multiple-corpora ensemble failed. There were also three query terms for which synonyms were identified by the clinical ensemble, but not by the multiple-corpora ensemble, and five for which synonyms were identified by the medical ensemble, but not by the multiple-corpora ensemble. This shows that combining semantic spaces can also, in some cases, introduce noise.

Since synonym pairs were queried both ways, i.e. each term in the pair was queried to see if the other could be identified, we wanted to see if there were cases where the choice of query term mattered. Indeed, among the sixty query terms for which the expected synonym was not extracted, this was the case in fourteen instances. For example, given the query term blindtarmsinflammation ("appendix inflammation"), the expected synonym appendicit (appendicitis) was given as a candidate, whereas with the query term appendicit, the expected synonym was not successfully identified. Models of distributional semantics also face the problem of modeling terms with several ambiguous meanings. This is, for instance, the case with the polysemous term arv (referring to inheritance as well as to heredity). Distant synonyms also seem to be problematic, e.g. the pair rehabilitation/habilitation. For approximately a third of the synonym pairs that are not correctly identified, however, it is not evident that they belong to either of these two categories.

Post-Processing

In an attempt to improve results further, an additional step in the proposed method was introduced: filtering of the candidate terms, with the possibility of extracting new, potentially better ones. For the extraction of abbreviation-expansion pairs, this was fairly straightforward, as there are certain patterns that generally apply to this phenomenon, such as the fact that the letters in an abbreviation are contained – in the same order – in its expansion. Moreover, expansions are longer than abbreviations. This allowed us to construct simple yet effective rules for filtering out unlikely candidate terms for these two tasks. As a result, both precision and recall increased; with a dynamic cut-off, precision improved significantly. Although our focus in this study was primarily on maximizing recall, there is a clear incentive to improve precision as well: if this method were to be used for terminological development support, with humans inspecting the candidate terms, minimizing the number of poor candidate terms has a clear value. However, given the seemingly easy task of filtering out unlikely candidates, it is perhaps surprising that the results were not even better. Part of the reason for this may stem from the problem of semantically overloaded types, which affects abbreviations to a large degree, particularly in the clinical domain with its telegraphic style, where ad-hoc abbreviations abound. This was also reflected in the reference standard, as in some cases the most common expansion of an abbreviation was not included.

The post-processing filtering of synonyms clearly failed. Although ranking information and, especially, cosine similarity provide some indication of the quality of synonym candidates, employing cut-off values with these features cannot possibly improve recall: new candidates will always have a lower ranking and a lower cosine similarity score than discarded candidate terms. It can, however – at least in theory – improve precision when used in conjunction with a dynamic cut-off, i.e. allowing fewer than ten candidate terms to be suggested. In this case, however, the rules did not have this effect.

Thresholds

Increasing the frequency threshold further did not improve results. In fact, a threshold of 30 occurrences in both corpora seems to be sufficient. A high frequency threshold is a limitation of distributional methods; thus, the ability to use a lower threshold is important, especially in the clinical domain, where access to data is difficult to obtain.

The choice of evaluating recall among ten candidates was based on an estimation of the number of candidate terms that would be reasonable to present to a lexicographer for manual inspection. Recall might improve if more candidates were presented, but it would likely come at the expense of decreased usability. It might instead be more relevant to further limit the number of candidates presented. As is shown in Figure 4, there are only a few correct synonyms among the candidates ranked 6-10. By using more advanced post-processing techniques and/or being prepared to sacrifice recall slightly, it is possible to present fewer candidates for manual inspection, thereby potentially increasing usability.

Reflections on Evaluation

To make it feasible to compare a large number of semantic spaces and their various combinations, fixed reference standards derived from terminological resources were used for evaluation, instead of manual classification of candidate terms. One of the motivations for the current study, however, is that terminological resources are seldom complete; they may also reflect a desired use of language rather than actual use. The results in this study thus reflect to what extent different semantic spaces – and their combinations – are able to extract synonymous relations that have been considered relevant according to specific terminologies, rather than to what extent the semantic spaces – and their combinations – capture the phenomenon of synonymy. This is, for instance, illustrated by the query term depression in Table 10, for which one potential synonym is extracted by the clinical ensemble – depressivitet ("depressivity") – and another potential synonym by the medical ensemble: depressionsjukdom (depressive illness). Although these terms might not be formal or frequent enough to include in all types of terminologies, they are highly relevant candidates for inclusion in terminologies intended for text mining. Neither of these two terms is, however, counted as a correct synonym, and only the multiple-corpora ensemble is able to find the synonym included in the terminology. Future studies would therefore benefit from a manual classification of retrieved candidates, with the aim of also finding synonyms that are not already included in current terminologies.

The choice of terminological resources to use as reference standards was originally based on their appropriateness for evaluating semantic spaces induced from the clinical corpus. However, for evaluating the extraction of abbreviation-expansion pairs with semantic spaces induced from the medical corpus, the chosen resources – in conjunction with the requirement that terms should occur at least fifty times in the corpus – were less appropriate, as this resulted in a very small reference standard. It should, therefore, be noted that the results for this part of the study are less reliable, especially for the experiments with the multiple-corpora ensembles on the two abbreviation-expansion tasks. For synonyms, the number of instances in the reference standard is, of course, smaller for the experiments with multiple-corpora ensembles than for the single-corpus experiments. However, when evaluating the final results with different frequency thresholds, similar results are obtained when lowering the threshold and, as a result, including more evaluation instances. With a threshold of twenty occurrences, 306 input terms are evaluated, which results in a recall of 0.42; with a threshold of thirty occurrences and 222 query terms, a recall of 0.46 is obtained. For experiments that involve the medical corpus, more emphasis should therefore be put on the synonym part of the study; when assessing the potential of the ensemble method on the abbreviation-expansion tasks, the results for the clinical ensemble are more reliable.

As the focus of this study was on synonyms and abbreviation-expansion pairs, the reference standards used naturally fail to reflect the fact that many of the given candidates were otherwise reasonable, e.g. spelling variants, which would be desirable to identify from an information extraction perspective. Many of the suggestions were, however, non-synonymous terms belonging to the same semantic class as the query term, such as drugs, family relations, occupations and diseases. Hyponym/hypernym relations were also frequent, as exemplified by the query term allergy in Table 10, which gives a number of hyponyms referring to specific allergies as candidates.

Future Work

Now that this first step has been taken towards creating ensembles of semantic spaces, this notion should be explored in greater depth and taken further. It would, for instance, be interesting to combine a larger number of semantic spaces, possibly including those that have been more explicitly modeled with syntactic information. To verify the superiority of this approach, it should be compared to the performance of a single semantic space that has been induced from multiple corpora.

Further experiments should likewise be conducted with combinations involving a larger number of corpora (types). One could, for instance, combine a professional corpus with a layman corpus – e.g. a corpus of extracts from health-related fora – in order to identify layman expressions for medical terms. This could provide a useful resource for automatic text simplification. Another technique that could potentially be used to identify term pairs with a higher degree of semantic similarity is to ensure that both terms have each other as their closest neighbors in the semantic subspace. This is not always the case, as we pointed out in our error analysis. This could perhaps improve performance on the task of extracting synonyms and abbreviation-expansion pairs. A limitation of the current study – in the endeavor to create a method that can help account for the problem of language use variability – is that the semantic spaces were constructed to model only unigrams. Textual instantiations of the same concept can, however, vary in term length. This needs to be accounted for in a distributional framework and concerns paraphrasing more generally than synonymy in particular. Combining unigram spaces with multiword spaces is a possibility that could be explored. This would also make the method applicable to acronym expansion.

Conclusions

This study demonstrates that combinations of semantic spaces yield improved performance on the task of automatically extracting synonyms and abbreviation-expansion pairs. First, combining two distributional models – random indexing and random permutation – on a single corpus enables the capturing of different aspects of lexical semantics and effectively increases the quality of the extracted candidate terms, outperforming the use of one model in isolation. Second, combining distributional models and types of corpora – a clinical corpus, comprising health record narratives, and a medical corpus, comprising medical journal articles – improves results further, outperforming ensembles of semantic spaces induced from a single source. We hope that this study opens up avenues of exploration for applying the ensemble methodology to distributional semantics.

Semantic spaces can be combined in numerous ways. In this study, the approach was to combine the outputs of the semantic spaces, i.e. ranked lists of terms semantically related to a given query term. How this should be done is not wholly intuitive. By exploring a variety of combination strategies, we found that the best results were achieved by simply summing the cosine similarity scores provided by the distributional models.

On the task of extracting abbreviation-expansion pairs, significant performance gains were obtained by applying a number of simple post-processing rules to the list of candidate terms. By filtering out unlikely candidates based on simple patterns and retrieving new ones, both recall and precision were improved by a large margin.

Authors’ Contributions

AH was responsible for coordinating the study and was thus involved in all parts of it. AH was responsible for the overall design of the study and for carrying out the experiments. AH initiated the idea of combining semantic spaces induced from di↵erent corpora and implemented the evaluation and post-processing modules. AH also had the main responsibility for the manuscript and drafted parts of the background and results description.

HM and MS contributed equally to the study. HM initiated the idea of combining Random Indexing and Random Permutation and was responsible for designing and implementing strategies for combining the output of multiple semantic spaces. HM also drafted parts of the method description in the manuscript. MS initiated the idea of applying the method to abbreviation-expansion extraction and to different types of corpora. MS was responsible for designing the evaluation part of the study, as well as for preparing the reference standards. MS also drafted parts of the background and method description in the manuscript. VD, together with MS, was responsible for designing the post-processing filtering of candidate terms.

MD provided feedback on the design of the study and drafted parts of the background section in the manuscript. AH, HM and MS analyzed the results and drafted the discussion in the manuscript. All authors reviewed the manuscript.

Acknowledgements

This work was partly (AH) supported by the Swedish Foundation for Strategic Research through the project High-Performance Data Mining for Drug Effect Detection (ref. no. IIS11-0053) at Stockholm University, Sweden. It was also partly (HM) supported by the Research Council of Norway through the project EviCare – Evidence-based care processes: Integrating knowledge in clinical information systems (NFR project no. 193022). We would like to thank the members of our former research network HEXAnord, within which this study was initiated. We would especially like to thank Ann-Marie Eklund for her contributions to the initial stages of this work. We are also grateful to Staffan Cederblom and Studentlitteratur for giving us access to their database of medical abbreviations.


Figures

Figure 1 - Ensembles of Semantic Spaces for Synonym Extraction and Abbreviation Expansion

Semantic spaces built with different model parameters are induced from different corpora. The outputs of the semantic spaces are combined in order to obtain better results compared to using a single semantic space in isolation.

Figure 2 - Distribution of Candidate Terms for the Clinical Corpus

The distribution (cosine similarity and rank) of candidates for synonyms for the best combination of semantic spaces induced from the clinical corpus. The results show the distribution for query terms in the development reference standard.

Figure 3 - Distribution of Candidate Terms for the Medical Corpus

The distribution (cosine similarity and rank) of candidates for synonyms for the best combination of semantic spaces induced from the medical corpus. The results show the distribution for query terms in the development reference standard.

Figure 4 - Distribution of Candidate Terms for Clinical + Medical Corpora

The distribution (combined cosine similarity and rank) of candidates for synonyms for the ensemble of semantic spaces induced from medical and clinical corpora. The results show the distribution for query terms in the development reference standard.

Figure 5 - Frequency Thresholds

The relation between recall and the required minimum frequency of occurrence for the reference standard terms in both corpora. The number of query terms for each threshold value is also shown.

Tables

Table 1 - Corpora Statistics

The number of tokens and unique terms (types) in the medical and clinical corpus, with and without stop words.

Corpus     With Stop Words               Without Stop Words
Clinical   ~42.5M tokens (~0.4M types)   ~22.5M tokens (~0.4M types)
Medical    ~20.3M tokens (~0.3M types)   ~12.1M tokens (~0.3M types)

Table 2 - Overview of Experiments Conducted with a Single Corpus

For each of the two corpora, 30 different combinations were evaluated. The configurations are described according to the following pattern: model windowSize. For RP, sw means that stop words are retained in the semantic space. For instance, a window size of 20 means a 10+10 window (ten words on either side of the target) was used.

For each of the 2 corpora, 10 semantic spaces were induced:

  RI spaces:  RI 20, RI 2, RI 4, RI 8
  RP spaces:  RP 2, RP 2 sw, RP 4, RP 4 sw, RP 8, RP 8 sw

The induced semantic spaces were combined in 10 different combinations:

  Identical window size:              RI 2 + RP 2, RI 4 + RP 4, RI 8 + RP 8
  Identical window size, stop words:  RI 2 + RP 2 sw, RI 4 + RP 4 sw, RI 8 + RP 8 sw
  Large window size:                  RI 20 + RP 2, RI 20 + RP 4
  Large window size, stop words:      RI 20 + RP 2 sw, RI 20 + RP 4 sw

For each combination, 3 combination strategies were evaluated: RI → RP30, RP → RI30 and RI + RP.

Table 3 - Reference Standards Statistics

Size shows the number of queries, 2 Cor shows the proportion of queries with two correct answers and 3 Cor the proportion of queries with three (or more) correct answers. The remaining queries have one correct answer.

                    Clinical Corpus        Medical Corpus         Clinical + Medical
Reference Standard  Size   2 Cor  3 Cor    Size  2 Cor  3 Cor     Size  2 Cor  3 Cor
Abbr→Exp (Devel)    117    9.4%   0.0%     55    13%    1.8%      42    14%    0%
Abbr→Exp (Eval)     98     3.1%   0.0%     55    11%    0%        35    2.9%   0%
Exp→Abbr (Devel)    110    8.2%   1.8%     63    4.7%   0%        45    6.7%   0%
Exp→Abbr (Eval)     98     7.1%   0.0%     61    0%     0%        36    0%     0%
Syn (Devel)         334    9.0%   1.2%     266   11%    3.0%      122   4.9%   0%
Syn (Eval)          340    14%    2.4%     263   13%    3.8%      135   11%    0%

Table 4 - Clinical Corpus Results on Development Set

Results (recall, top ten) of the best configurations for each model and model combination on the three tasks. The configurations are described according to the following pattern: model windowSize. For RP, sw means that stop words are retained in the model.

Strategy    Abbr→Exp                   Exp→Abbr                   Syn
RI → RP30   RI 8 + RP 8 sw     0.38    RI 8 + RP 8        0.30    RI 8 + RP 8        0.39
RP → RI30   RI 20 + RP 4 sw    0.35    RI 20 + RP 4 sw    0.30    RI 8 + RP 8 sw     0.38
RI + RP     RI 4 + RP 4 sw     0.42    RI 8 + RP 8 sw     0.32    RI 20 + RP 4 sw    0.40

Table 5 - Medical Corpus Results on Development Set

Results (recall, top ten) of the best configurations for each model and model combination on the three tasks.

The configurations are described according to the following pattern: model windowSize. For RP, sw means that stop words are retained in the model.

Strategy    Abbr→Exp                   Exp→Abbr                   Syn
RI → RP30   (several tied)     0.08    (several tied)     0.03    RI 20 + RP 4 sw    0.26
RP → RI30   (several tied)     0.08    (several tied)     0.03    RI 8 + RP 8 sw     0.24
RI + RP     RI 4 + RP 4 sw     0.10    (several tied)     0.08    RI 20 + RP 2 sw    0.30

Table 6 - Clinical + Medical Corpora Results on Development Set

Results (recall, top ten) of the different combination strategies (AVG, SUM and their sequenced variants), with and without normalization, on the three tasks. The results are based on the application of the best model combinations to the development data.

                       Abbr→Exp                    Exp→Abbr                    Syn
                       Clinical: RI 4 + RP 4 sw    Clinical: RI 4 + RP 4 sw    Clinical: RI 20 + RP 4 sw
                       Medical:  RI 4 + RP 4 sw    Medical:  RI 8 + RP 8 sw    Medical:  RI 20 + RP 2 sw
Strategy   Normalize
AVG        True        0.13                        0.09                        0.39
AVG        False       0.24                        0.11                        0.39
SUM        True        0.13                        0.09                        0.34
SUM        False       0.32                        0.17                        0.52
AVG → AVG              0.15                        0.09                        0.41
SUM → SUM              0.13                        0.07                        0.40
AVG → SUM              0.15                        0.09                        0.41
SUM → AVG              0.13                        0.07                        0.40

Table 7 - Clinical Corpus Results on Evaluation Set

Results (P = precision, R = recall, top ten) of the best models with and without post-processing on the three tasks. Dynamic # of suggestions allows the model to suggest less than ten terms in order to improve precision. The results are based on the application of the model combinations to the evaluation data.

                           Abbr→Exp            Exp→Abbr            Syn
Evaluation Configuration   RI 4 + RP 4 sw      RI 4 + RP 4 sw      RI 20 + RP 4 sw
                           P      R            P      R            P      R
RI Baseline                0.04   0.22         0.03   0.19         0.07   0.39
RP Baseline                0.04   0.23         0.04   0.24         0.06   0.36
Standard (Top 10)          0.05   0.31         0.03   0.20         0.07   0.44
Post-Processing (Top 10)   0.08   0.42         0.05   0.33         0.08   0.43
Dynamic Cut-Off (Top 10)   0.11   0.41         0.12   0.33         0.08   0.42

Table 8 - Medical Corpus Results on Evaluation Set

Results (P = precision, R = recall, top ten) of the best models with and without post-processing on the three tasks. Dynamic # of suggestions allows the model to suggest less than ten terms in order to improve precision. The results are based on the application of the model combinations to the evaluation data.

                           Abbr→Exp            Exp→Abbr            Syn
Evaluation Configuration   RI 4 + RP 4 sw      RI 8 + RP 8 sw      RI 20 + RP 2 sw
                           P      R            P      R            P      R
RI Baseline                0.02   0.09         0.01   0.08         0.03   0.18
RP Baseline                0.01   0.06         0.01   0.05         0.05   0.26
Standard (Top 10)          0.03   0.17         0.01   0.11         0.06   0.34
Post-Processing (Top 10)   0.03   0.17         0.02   0.11         0.06   0.34
Dynamic Cut-Off (Top 10)   0.17   0.17         0.10   0.11         0.06   0.34

Table 9 - Clinical + Medical Corpora Results on Evaluation Set

Results (P = precision, R = recall, top ten) of the best models with and without post-processing on the three tasks. Dynamic # of suggestions allows the model to suggest less than ten terms in order to improve precision. The results are based on the application of the model combinations to the evaluation data.

                             Abbr→Exp                   Exp→Abbr                   Syn
                             Clinical: RI 4 + RP 4 sw   Clinical: RI 4 + RP 4 sw   Clinical: RI 20 + RP 4 sw
                             Medical:  RI 4 + RP 4 sw   Medical:  RI 8 + RP 8 sw   Medical:  RI 20 + RP 2 sw
Evaluation Configuration     SUM, False                 SUM, False                 SUM, False
                             P      R                   P      R                   P      R
Clinical Ensemble Baseline   0.04   0.24                0.03   0.19                0.06   0.34
Medical Ensemble Baseline    0.02   0.11                0.01   0.11                0.05   0.33
Standard (Top 10)            0.05   0.30                0.03   0.19                0.08   0.47
Post-Processing (Top 10)     0.07   0.39                0.06   0.33                0.08   0.47
Dynamic Cut-Off (Top 10)     0.28   0.39                0.31   0.33                0.08   0.45

Table 10 - Examples of Extracted Candidate Synonyms

The top ten candidate terms for three different query terms, for the clinical ensemble, the medical ensemble and for the multiple-corpora ensemble. Expected synonym according to reference standard in boldface.

Query term: sjukhem (nursing-home)

  Clinical                            Medical                            Clinical + Medical
  heartcenter (heart-center)          vårdcentral (health-center)        vårdcentral (health-center)
  bröstklinik (breast-clinic)         akutmottagning (emergency room)    mottagning (reception)
  hälsomottagningen (health-clinic)   akuten (ER)                        vårdhem (nursing-home)
  hjärtcenter (heart-center)          mottagning (reception)             gotland (a Swedish county)
  län (county)                        intensivvårdsavdelning (ICU)       sjukhus (hospital)
  eyecenter (eye-center)              arbetsplats (work-place)           gård (yard)
  bröstklin (breast-clin.)            vårdavdelning (ward)               vårdavdelning (ward)
  sjukhems (nursing-home's)           gotland (a Swedish county)         arbetsplats (work-place)
  hartcenter ("hart-center")          kväll (evening)                    akutmottagning (emergency room)
  biobankscentrum (biobank-center)    ks (Karolinska hospital)           akuten (ER)

Query term: depression (depression)

  Clinical                            Medical                                   Clinical + Medical
  sömnstörning (insomnia)             depressioner (depressions)                sömnstörning (insomnia)
  sömnsvårigheter (insomnia)          osteoporos (osteoporosis)                 osteoporos (osteoporosis)
  panikångest (panic disorder)        astma (asthma)                            tvångssyndrom (OCD)
  tvångssyndrom (OCD)                 fetma (obesity)                           epilepsi (epilepsy)
  fibromyalgi (fibromyalgia)          smärta (pain)                             hjärtsvikt (heart failure)
  ryggvärk (back-pain)                depressionssjukdom (depressive-illness)   nedstämdhet (sadness)
  självskadebeteende (self-harm)      bensodiazepiner (benzodiazepines)         fibromyalgi (fibromyalgia)
  osteoporos (osteoporosis)           hjärtsvikt (heart-failure)                astma (asthma)
  depressivitet ("depressitivity")    hypertoni (hypertension)                  alkoholberoende (alcoholism)
  pneumoni (pneumonia)                utbrändhet (burnout)                      migrän (migraine)

Query term: allergi (allergy)

  Clinical                              Medical                           Clinical + Medical
  pollenallergi (pollen-allergy)        allergier (allergies)             allergier (allergies)
  födoämnesallergi (food-allergy)       sensibilisering (sensitization)   hösnuva (hay-fever)
  hösnuva (hay-fever)                   hösnuva (hay-fever)               födoämnesallergi (food-allergy)
  överkänslighet (hypersensitivity)     rehabilitering (rehabilitation)   pollenallergi (pollen-allergy)
  kattallergi (cat-allergy)             fetma (obesity)                   överkänslighet (hypersensitivity)
  jordnötsallergi (peanut-allergy)      kol (COPD)                        astma (asthma)
  pälsdjursallergi (animal-allergy)     osteoporos (osteoporosis)         kol (COPD)
  negeras (negated)                     födoämnesallergi (food-allergy)   osteoporos (osteoporosis)
  pollen (pollen)                       astma (asthma)                    jordnötsallergi (peanut-allergy)
  pollenallergiker ("pollen-allergic")  utbrändhet (burnout)              pälsdjursallergi (animal-allergy)

Endnotes

1. Unified Medical Language System: http://www.nlm.nih.gov/research/umls/
2. There are also probabilistic models, which view documents as a mixture of topics and represent terms according to the probability of their occurrence during the discussion of each topic: two terms that share similar topic distributions are assumed to be semantically related.
3. This research has been approved by the Regional Ethical Review Board in Stockholm (Etikprövningsnämnden i Stockholm), permission number 2012/834-31/5.



PAPER III

EXPLOITING STRUCTURED DATA, NEGATION DETECTION AND SNOMED CT TERMS IN A RANDOM INDEXING APPROACH TO CLINICAL CODING

Aron Henriksson and Martin Hassel (2011). Exploiting Structured Data, Negation Detection and SNOMED CT Terms in a Random Indexing Approach to Clinical Coding. In Proceedings of the RANLP Workshop on Biomedical Natural Language Processing, Association for Computational Linguistics (ACL), pages 3–10, Hissar, Bulgaria.

Author Contributions

The study was designed by Aron Henriksson and Martin Hassel. Aron Henriksson was responsible for carrying out the experiments, while Martin Hassel generated the co-occurrence statistics from the structured data. The analysis was conducted by Aron Henriksson and Martin Hassel. The manuscript was written by Aron Henriksson and reviewed by Martin Hassel.

Exploiting Structured Data, Negation Detection and SNOMED CT Terms in a Random Indexing Approach to Clinical Coding

Aron Henriksson, DSV, Stockholm University, [email protected]
Martin Hassel, DSV, Stockholm University, [email protected]

Abstract

The problem of providing effective computer support for clinical coding has been the target of many research efforts. A recently introduced approach, based on statistical data on co-occurrences of words in clinical notes and assigned diagnosis codes, is here developed further and improved upon. The ability of the word space model to detect and appropriately handle the function of negations is demonstrated to be important in accurately correlating words with diagnosis codes, although the data on which the model is trained needs to be sufficiently large. Moreover, weighting can be performed in various ways, for instance by giving additional weight to 'clinically significant' words or by filtering code candidates based on structured patient records data. The results demonstrate the usefulness of both weighting techniques, particularly the latter, yielding 27% exact matches for a general model (across clinic types); 43% and 82% for two domain-specific models (ear-nose-throat and rheumatology clinics).

1 Introduction

Clinicians spend much valuable time and effort in front of a computer, assigning diagnosis codes during or after a patient encounter. Tools that facilitate this task would allow costs to be reduced or clinicians to spend more of their time tending to patients, effectively improving the quality of healthcare. The idea, then, is that clinicians should be able simply to verify automatically assigned codes or to select appropriate codes from a list of recommendations.

1.1 Previous Work

There have been numerous attempts to provide clinical coding support, even if such tools are yet to be widely used in clinical practice (Stanfill et al., 2010). The most common approach has been to view it essentially as a text classification problem. The assumption is that there is some overlap between clinical notes and the content of assigned diagnosis codes, making it possible to predict possible diagnosis codes for 'uncoded' documents. For instance, in the 2007 Computational Challenge (Pestian et al., 2007), free-text radiology reports were to be assigned one or two labels from a set of 45 ICD-9-CM codes. Most of the best-performing systems were rule-based, achieving micro-averaged F1-scores of up to 89.1%.

Some have tried to enhance their NLP-based systems by exploiting the structured data available in patient records. Pakhomov et al. (2006) use gender information—as well as frequency data—to filter out improbable classifications. The motivation is that gender has a high predictive value, particularly as some categories make explicit gender distinctions.

Medical terms also have a high predictive value when it comes to classification of clinical notes (see e.g. Jarman and Berndt, 2010). In an attempt to assign ICD-9 codes to discharge summaries, the results improved when extra weight was given to words, phrases and structures that provided the most diagnostic evidence (Larkey and Croft, 1995).

Given the inherent practice of ruling out possible diseases, symptoms and findings, it seems important to handle negations in clinical text. In one study, it was shown that around 9% of automatically detected SNOMED CT findings and disorders were negated (Skeppstedt et al., 2011). In the attempt of Larkey and Croft (1995), negated medical terms are annotated and handled in various ways; however, none yielded improved results.

1.2 Random Indexing of Patient Records

In more recent studies, the word space model, in its Random Indexing mold (Sahlgren, 2001; Sahlgren, 2006), has been investigated as a possible alternative solution to clinical coding support (Henriksson et al., 2011; Henriksson and Hassel, 2011). Statistical data on co-occurrences of words and ICD-10[1] codes is used to build predictive models that can generate recommendations for uncoded documents. In a list of ten recommended codes, general models—trained and evaluated on all clinic types—achieve up to 23% exact matches and 60% partial matches, while domain-specific models—trained and evaluated on a particular type of clinic—achieve up to 59% exact matches and 93% partial matches.

[1] The 10th revision of the International Classification of Diseases and Related Health Problems (World Health Organization, 2011).

A potential limitation of the above models is that they fail to capture the function of negations, which means that negated terms in the clinical notes will be positively correlated with the assigned diagnosis codes. In the context of information retrieval, Widdows (2003) describes a way to remove unwanted meanings from queries in vector models, using a vector negation operator that not only removes unwanted strings but also synonyms and neighbors of the negated terms. To our knowledge, however, the ability of the word space model to handle negations has not been studied extensively.

1.3 Aim

The aim of this paper, then, is to develop the Random Indexing approach to clinical coding support by exploring three potential improvements:

1. Giving extra weight to words used in a list of SNOMED CT terms.

2. Exploiting structured data in patient records to calculate the likelihood of code candidates.

3. Incorporating the use of negation detection.

2 Method

Random Indexing is applied on patient records to calculate co-occurrences of tokens (words and ICD-10 codes) on a document level. The resulting models contain information about the 'semantic similarity' of individual words and diagnosis codes[2], which is subsequently used to classify uncoded documents.

[2] According to the distributional hypothesis, words that appear in similar contexts tend to have similar properties. If two words repeatedly co-occur, we can assume that they in some way refer to similar concepts (Harris, 1954). Diagnosis codes are here treated as words.

2.1 Stockholm EPR Corpus

The models are trained and evaluated on a Swedish corpus of approximately 270,000 clinically coded patient records, comprising 5.5 million notes from 838 clinical units. This is a subset of the Stockholm EPR corpus (Dalianis et al., 2009). A document contains all free-text entries concerning a single patient made on consecutive days at a single clinical unit. The documents in the partitions of the data sets on which the models are trained (90%) also include one or more associated ICD-10 codes (on average 1.7 and at most 47). In the testing partitions (10%), the associated codes are retained separately for evaluation. In addition to the complete data set, two subsets are created, in which there are documents exclusively from a particular type of clinic: one for ear-nose-throat clinics and one for rheumatology clinics.

Variants of the three data sets are created, in which negated clinical entities are automatically annotated using the Swedish version of NegEx (Skeppstedt, 2011). The clinical entities are detected through exact string matching against a list of 112,847 SNOMED CT terms belonging to the semantic categories 'finding' and 'disorder'. It is important to handle ambiguous terms in order to reduce the number of false positives; therefore, the list does not include findings which are equivalent to a common non-clinical unigram or bigram (see Skeppstedt et al., 2011). A negated term is marked in such a way that it will be treated as a single word, although with its proper negated denotation. Multi-word terms are concatenated into unigrams. The data is finally pre-processed: lemmatization is performed using the Granska Tagger (Knutsson et al., 2003), while punctuation, digits and stop words are removed.
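One way to realize this markup is sketched below; the `NEG_` prefix and underscore concatenation are assumptions chosen for illustration, since the paper does not specify the surface form of the marking:

```python
def rewrite_entities(tokens, entities):
    """Concatenate multi-word clinical entities into unigrams and give
    negated ones a distinct token. The NEG_ prefix and the underscore
    joining are illustrative assumptions; the paper only states that a
    negated term keeps a separate, negated denotation.

    `entities` holds (start, end, negated) spans over `tokens`.
    """
    spans = {start: (end, negated) for start, end, negated in entities}
    out, i = [], 0
    while i < len(tokens):
        if i in spans:
            end, negated = spans[i]
            term = "_".join(tokens[i:end])  # multi-word term -> unigram
            out.append("NEG_" + term if negated else term)
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return out

# e.g. rewrite_entities("no signs of heart failure".split(), [(3, 5, True)])
# -> ['no', 'signs', 'of', 'NEG_heart_failure']
```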

2.2 Word Space Models

Random Indexing is performed on the training partitions of the described data sets, resulting in a total of six models (Table 1): two variants of the general model and two variants of the two domain-specific models[3].

[3] ENT = Ear-Nose-Throat, Rheuma = Rheumatology.

Table 1: The six models.

  w/o negations   w/ negations
  General Model   General NegEx Model
  ENT Model       ENT NegEx Model
  Rheuma Model    Rheuma NegEx Model

2.3 Election of Diagnosis Codes

The models are then used to produce a ranked list of recommended diagnosis codes for each of the documents in the testing partitions of the corresponding data sets. This list is created by letting each of the words in a document 'vote' for a number of semantically similar codes, thus necessitating the subsequent merging of the individual lists. This ranking procedure can be carried out in a number of ways, some of which are explored in this paper. The starting point, however, is to use the semantic similarity of a word and a diagnosis code—as defined by the cosine similarity score—and the idf value of the word[4]. This is regarded as our baseline model (Henriksson and Hassel, 2011), to which negation handling and additional weighting schemes are added.

[4] Inverse document frequency, denoting a word's discriminatory value.

2.4 Weighting Techniques

For each of the models, we apply two distinct weighting techniques. First, we assume a technocratic approach to the election of diagnosis codes. We do so by giving added weight to words which are 'clinically significant'. That is here achieved by utilizing the same list of SNOMED CT findings and disorders that was used by the negation detection system. However, rather than trying to match the entire term—which would likely result in a fairly limited number of hits—we opted simply to give weight to the individual (non-stop) words used in those terms. These words are first lemmatized, as the data on which the matching is performed has also been lemmatized. It will also allow hits independent of morphological variations.

We also perform weighting of the correlated ICD-10 codes by exploiting statistics generated from the fixed fields of the patient records, namely gender, age and clinical unit. The idea is to use known information about a to-be-coded document in order to assign weights to code candidates according to plausibility, which in turn is based on past combinations of a particular code and each of the structured data entries. For instance, if the model generates a code that has very rarely been assigned to a patient of a particular sex or age group—and the document is from the record of such a patient—it seems sensible to give it less weight, effectively reducing the chances of that code being recommended. In order for an unseen combination not to be ruled out entirely, additive smoothing is performed. Gender and clinical unit can be used as defined, while age groups are created for each and every year up to the age of 10, after which ten-year intervals are used. This seems reasonable since age distinctions are more sensitive in younger years.

In order to make it possible for code candidates that are not present in any of the top-ten lists of the individual words to make it into the final top-ten list of a document, all codes associated with a word in the document are included in the final re-ranking phase. This way, codes that are more likely for a given patient are able to take the place of more improbable code candidates. For the general models, however, the initial word-based code lists are restricted to twenty, due to technical efficiency constraints.

2.5 Evaluation

The evaluation is carried out by comparing the model-generated recommendations with the clinically assigned codes in the data. This matching is done on all four possible levels of ICD-10 according to specificity (see Figure 1).

Figure 1: The structure of ICD-10 allows division into four levels.
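A compact sketch of Sections 2.3 and 2.4 in Python. The `term_space` hook, the multiplicative combination of the three fixed fields and the smoothing constant are all assumptions for illustration rather than the system's actual implementation:

```python
from collections import defaultdict

def age_group(age):
    """Bin ages as in Section 2.4: one group per year up to 10,
    then ten-year intervals."""
    return age if age <= 10 else 10 + age // 10

def elect_codes(words, term_space, idf, top_k=10):
    """Let each word 'vote' for semantically similar diagnosis codes
    (Section 2.3). A vote is the word-code cosine similarity weighted
    by the word's idf; merging the votes by summation is one of
    several conceivable schemes. `term_space(word)` is an assumed
    hook returning the word's nearest codes as (code, cosine) pairs."""
    votes = defaultdict(float)
    for word in words:
        for code, cosine in term_space(word):
            votes[code] += cosine * idf.get(word, 0.0)
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

def fixed_fields_weight(code, gender, age, unit, counts, totals, alpha=1.0):
    """Re-weight a code candidate by how plausible it is for a patient
    with the given gender, age group and clinical unit, based on past
    code/field co-occurrence counts. Additive smoothing (alpha) keeps
    unseen combinations from being ruled out entirely."""
    weight = 1.0
    for name, value in (("gender", gender), ("age", age_group(age)), ("unit", unit)):
        seen = counts.get((code, name, value), 0)
        weight *= (seen + alpha) / (totals.get(code, 0) + alpha)
    return weight
```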

3 Results

The general data set, on which General Model and General NegEx Model are trained and evaluated, comprises approximately 274,000 documents and 12,396 unique labels. The ear-nose-throat data set, on which ENT Model and ENT NegEx Model are trained and evaluated, contains around 23,000 documents and 1,713 unique labels. The rheumatology data set, on which Rheuma Model and Rheuma NegEx Model are trained and evaluated, contains around 9,000 documents and 630 unique labels (Table 2).

Table 2: Data set statistics.

  data set       documents   codes
  General        ~274 k      12,396
  ENT            ~23 k       1,713
  Rheumatology   ~9 k        630

The proportion of the detected clinical entities that are negated is 13.98% in the complete, general data set and slightly higher in the ENT (14.32%) and rheumatology (16.98%) data sets (Table 3).

3.1 General Models

The baseline for the general models finds 23% of the clinically assigned codes (exact matches), when the number of model-generated recommendations is confined to ten (Table 4). Meanwhile, matches on the less specific levels of ICD-10, i.e. partial matches, amount to 25%, 33% and 60% respectively (from specific to general).

The single application of one of the weighting techniques to the baseline model boosts performance somewhat, the fixed fields-based code filtering (26% exact matches) slightly more so than the technocratic word weighting (24% exact matches). The negation variant of the general model, General NegEx Model, performs somewhat better—up two percentage points (25% exact matches)—than the baseline model. The technocratic approach applied to this model does not yield any observable added value. The fixed fields filtering does, however, result in a further improvement on the three most specific levels (27% exact matches).

A combination of the two weighting schemes does not appear to bring much benefit to either of the general models, compared to solely performing fixed fields filtering.

3.2 Ear-Nose-Throat Models

The baseline for the ENT models finds 33% of the clinically assigned codes (exact matches) and 34% (L3), 41% (L2) and 62% (L1) at the less specific levels (Table 5).

Technocratic word weighting yields a modest improvement over the baseline model: one percentage point on each of the levels. Filtering code candidates based on fixed fields statistics, however, leads to a remarkable boost in results, from 33% to 43% exact matches. ENT NegEx Model performs slightly better than the baseline model, although only as little as a single percentage point (34% exact matches). Performance drops when the technocratic approach is applied to this model. The fixed fields filtering, on the other hand, similarly improves results for the negation variant of the ENT model; however, there is no apparent additional benefit in this case of negation handling. In fact, it somewhat hampers the improvement yielded by this weighting technique.

As with the general models, a combination of the two weighting techniques does not affect the results much for either of the ENT models.

3.3 Rheumatology Models

The baseline for the rheumatology models finds 61% of the clinically assigned codes (exact matches) and 61% (L3), 68% (L2) and 92% (L1) at the less specific levels (Table 6).

Compared to the above models, the technocratic approach is here much more successful, resulting in 72% exact matches. Filtering the code candidates based on fixed fields statistics leads to a further improvement of ten percentage points for exact matches (82%). Rheuma NegEx Model achieves only a modest improvement on L2. Moreover, this model does not benefit at all from the technocratic approach; neither is the fixed fields filtering quite as successful in this model (67% exact matches).

A combination of the two weighting schemes adds only a little to the two variants of the rheumatology model. Interesting to note is that the negation variant performs the same or even much worse than the one without any negation handling.

4 Discussion

The two weighting techniques and the incorporation of negation handling provide varying degrees of benefit—from small to important boosts in performance—depending to some extent on the model to which they are applied.

Model                 Clinical Entities   Negations   Negations/Clinical Entities
General NegEx Model   634,371             88,679      13.98%
ENT NegEx Model       40,362              5,780       14.32%
Rheuma NegEx Model    20,649              3,506       16.98%

Table 3: Negation Statistics. The number of detected clinical entities, the number of negated clinical entities and the percentage of the detected clinical entities that are negated.

                              General Model           General NegEx Model
Weighting                     E     L3    L2    L1    E     L3    L2    L1
Baseline                      0.23  0.25  0.33  0.60  0.25  0.27  0.35  0.62
Technocratic                  0.24  0.26  0.34  0.61  0.25  0.27  0.35  0.62
Fixed Fields                  0.26  0.28  0.36  0.61  0.27  0.29  0.37  0.63
Technocratic + Fixed Fields   0.26  0.28  0.36  0.62  0.27  0.29  0.37  0.63

Table 4: General Models, with and without negation handling. Recall (top 10), measured as the presence of the clinically assigned codes in a list of ten model-generated recommendations. E = exact match, L3→L1 = matches on the other levels, from specific to general. The baseline is for the model without negation handling only.

                              ENT Model               ENT NegEx Model
Weighting                     E     L3    L2    L1    E     L3    L2    L1
Baseline                      0.33  0.34  0.41  0.62  0.34  0.35  0.42  0.62
Technocratic                  0.34  0.35  0.42  0.63  0.33  0.33  0.41  0.61
Fixed Fields                  0.43  0.43  0.48  0.64  0.42  0.43  0.48  0.63
Technocratic + Fixed Fields   0.42  0.42  0.47  0.64  0.42  0.42  0.47  0.62

Table 5: Ear-Nose-Throat Models, with and without negation handling. Recall (top 10), measured as the presence of the clinically assigned codes in a list of ten model-generated recommendations. E = exact match, L3→L1 = matches on the other levels, from specific to general. The baseline is for the model without negation handling only.

                              Rheuma Model            Rheuma NegEx Model
Weighting                     E     L3    L2    L1    E     L3    L2    L1
Baseline                      0.61  0.61  0.68  0.92  0.61  0.61  0.70  0.92
Technocratic                  0.72  0.72  0.77  0.94  0.60  0.60  0.70  0.91
Fixed Fields                  0.82  0.82  0.85  0.95  0.67  0.67  0.75  0.91
Technocratic + Fixed Fields   0.82  0.83  0.86  0.95  0.68  0.68  0.76  0.92

Table 6: Rheumatology Models, with and without negation handling. Recall (top 10), measured as the presence of the clinically assigned codes in a list of ten model-generated recommendations. E = exact match, L3→L1 = matches on the other levels, from specific to general. The baseline is for the model without negation handling only.
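The four-level matching reported in Tables 4–6 (E, L3, L2, L1) can be pictured as prefix comparison of ICD-10 codes. The level boundaries in the sketch below are an assumption about the structure shown in Figure 1, of which only the caption survives here:

```python
def icd10_levels(code):
    """One reading of the four ICD-10 levels, from most general (L1)
    to the exact code (E), as progressively longer prefixes of e.g.
    'J06.9' with the dot removed. The exact boundaries are an
    assumption for illustration."""
    base = code.replace(".", "")
    return [base[:1], base[:2], base[:3], base]

def best_match_level(predicted, assigned):
    """Return 'E', 'L3', 'L2', 'L1' or None for the most specific
    level at which two codes agree."""
    labels = ["L1", "L2", "L3", "E"]
    p, a = icd10_levels(predicted), icd10_levels(assigned)
    for i in range(3, -1, -1):
        if p[i] == a[i]:
            return labels[i]
    return None
```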

4.1 Technocratic Approach

The technocratic approach, whereby clinically significant words are given extra weight, does result in some improvement when applied to all models that do not incorporate negation handling. The effect this weighting technique has on Rheuma Model is, however, markedly different from when it is applied to the other two corresponding models. It could potentially be the result of a more precise, technical language used in rheumatology documentation, where certain words are highly predictive of the diagnosis. However, the results produced by this model need to be examined with some caution, due to the relatively small size of the data set on which the model is based and evaluated.

Since this approach appears to have a positive impact on all of the models where negation handling is not performed, assigning even more weight to clinical terminology may yield additional benefits. This would, of course, have to be tested empirically and may differ from domain to domain.

4.2 Structured Data Filtering

The technique whereby code candidates are given weight according to their likelihood of being accurately assigned to a particular patient record—based on historical co-occurrence statistics of diagnosis codes and, respectively, age, gender and clinical unit—is successful across the board. To a large extent, this is probably due to a set of ICD-10 codes being frequently assigned in any particular clinical unit. In effect, it can partly be seen as a weighting scheme according to code frequency. There are also codes, however, that make gender and age distinctions. It is likewise well known that some diagnoses are more prevalent in certain age groups, while others are exclusive to a particular gender.

It is interesting to note the remarkable improvement observed for the two domain-specific models. Perhaps the aforementioned factor of frequently recurring code assignments is even stronger in these particular types of clinics. By contrast, there are no obvious gender-specific diagnoses in either of the two domains; however, in the rheumatology data, there are in fact 23 codes that have frequently been assigned to men but never to women. In such cases it is especially beneficial to exploit the structured data in patient records. It could also be that the restriction to twenty code candidates for each of the individual words in the general models was not sufficiently large a number to allow more likely code candidates to make it into the final list of recommendations. That said, it seems somewhat unlikely that a code that is not closely associated with any of the words in a document should make it into the final list.

Even if the larger improvements observed for the domain-specific models may, again, in part be due to the smaller amounts of data compared with the general model, the results clearly indicate the general applicability and benefit of such a weighting scheme.

4.3 Negation Detection

The incorporation of automatic detection of negated clinical entities improves results for all models, although more so for the general model than the domain-specific models. This could possibly be ascribed to the problem of data sparsity. That is, in the smaller domain-specific models, there are fewer instances of each type of negated clinical entity (11.7 on average in ENT and 9.4 on average in rheumatology) than in the general model (31.6 on average). This is problematic since infrequent words, just as very frequent words, are commonly assumed to hold little or no information about semantics (Jurafsky and Martin, 2009). There simply is little statistical evidence for the rare words, which potentially makes the estimation of their similarity with other words uncertain. For instance, Karlgren and Sahlgren (2001) report that, in their TOEFL test experiments, they achieved the best results when they removed words that appeared in only one or two documents. While we cannot just remove infrequent codes, the precision of these suggestions is likely to be lower.

The prevalence of negated clinical entities—almost 14% in the entire data set—indicates the importance of treating them as such in an NLP-based approach to clinical coding. Due to the extremely low recall (0.13) of the simple method of detecting clinical entities through exact string matching (Skeppstedt et al., 2011), negation handling could potentially have a more marked impact on the models if more clinical entities were to be detected, as that would likely also entail more negated terms.

There are, of course, various ways in which one may choose to handle negations. An alternative could have been simply to ignore negated terms in the construction of the word space models, thereby not correlating negated terms with affirmed diagnosis codes. Even if doing so may make sense, the approach assumed here is arguably better, since a negated clinical entity could have a positive correlation with a diagnosis code. That is, ruling out or disconfirming a particular diagnosis may be indicative of another diagnosis.

4.4 Combinations of Techniques

When the technocratic weighting technique is applied to the variants of the models which include annotations of negated clinical entities, there is no positive effect. In fact, results drop somewhat when applied to the two domain-specific models. A possible explanation could perhaps be that clinically significant words that are constituents of negated clinical entities are not detected in the technocratic approach. The reason for this is that the application of the Swedish NegEx system, which is done prior to the construction and evaluation of the models, marks the negated clinical entities in such a way that those words will no longer be recognized by the technocratic word detector. Such words may, of course, be of importance even if they are negated. This could be worked around in various ways; one would be simply to give weight to all negated clinical entities.

Fixed fields filtering applied to the NegEx models has an impact that is more or less comparable to the same technique applied to the models without negation handling. This weighting technique is thus not obviously impeded by the annotations of negated clinical entities, with the exception of the rheumatology models, where an improvement is observed, yet not as substantial as when applied to Rheuma Model.

A combination of the technocratic word weighting and the fixed fields code filtering does not appear to provide any added value over the sole application of the latter weighting technique. Likewise, the same combination applied to the NegEx version does not improve on the results of the fixed fields filtering.

In this study, fine-tuning of weights has not been performed, neither internally nor externally to each of the weighting techniques. It may, of course, be that, for instance, gender distinctions are more informative than age distinctions—or vice versa—and thus need to be weighted accordingly. By the same token, the more successful weighting schemes should probably take precedence over the less successful variants.

4.5 Classification Problem

It should be pointed out that the model-generated recommendations are restricted to a set of properly formatted ICD-10 codes. Given the conditions under which real, clinically generated data is produced, there is bound to be some noise, not least in the form of inaccurately assigned and ill-formatted diagnosis codes. In fact, only 67.9% of the codes in the general data set are in this sense 'valid' (86.5% in the ENT data set and 66.9% in the rheumatology data set). As a result, a large portion of the assigned codes in the testing partition cannot be recommended by the models, possibly having a substantial negative influence on the evaluation scores. For instance, in the ear-nose-throat data, the five most frequent diagnosis codes are not present in the restricted result set. Not all of these are actually 'invalid' codes but rather action codes etc. that were not included in the list of acceptable code recommendations. A fairer evaluation of the models would be either to include such codes in the restricted result set or to base the restricted result set entirely on the codes in the data.

Furthermore, there is a large number of unseen codes in the testing partitions, which also cannot be recommended by the models (358 in the general data set, 79 in the ENT data set and 39 in the rheumatology data set). This, on the other hand, reflects the real-life conditions of a classification system and so should not be eschewed; however, it is interesting to highlight when evaluating the successfulness of the models and the method at large.

5 Conclusion

The Random Indexing approach to clinical coding benefits from the incorporation of negation handling and various weighting schemes. While assigning additional weight to clinically significant words yields a fairly modest improvement, filtering code candidates based on structured patient records data leads to important boosts in performance for general and domain-specific models alike. Negation handling is also important, although the way in which it is here performed seems to require a large amount of training data for marked benefits. Even if combining a number of weighting techniques does not necessarily give rise to additional improvements, tuning of the weighting factors may help to do so.

Acknowledgments

We would like to thank all members of our research group, IT for Health, for their support and input. We would especially like to express our gratitude to Maria Skeppstedt for her important contribution to the negation handling aspects of this work, particularly in adapting her Swedish NegEx system to our specific needs. Thanks also to the reviewers for their helpful comments.

References

Hercules Dalianis, Martin Hassel and Sumithra Velupillai. 2009. The Stockholm EPR Corpus: Characteristics and Some Initial Findings. In Proceedings of ISHIMR 2009, pp. 243–249.

Zellig S. Harris. 1954. Distributional structure. Word, 10, pp. 146–162.

Aron Henriksson, Martin Hassel and Maria Kvist. 2011. Diagnosis Code Assignment Support Using Random Indexing of Patient Records – A Qualitative Feasibility Study. In Proceedings of AIME, 13th Conference on Artificial Intelligence in Medicine, pp. 348–352.

Aron Henriksson and Martin Hassel. 2011. Election of Diagnosis Codes: Words as Responsible Citizens. In Proceedings of Louhi, 3rd International Workshop on Health Document Text Mining and Information Analysis.

Jay Jarman and Donald J. Berndt. 2010. Throw the Bath Water Out, Keep the Baby: Keeping Medically-Relevant Terms for Text Mining. In Proceedings of AMIA, pp. 336–340.

Daniel Jurafsky and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Education International, NJ, USA, p. 806.

Jussi Karlgren and Magnus Sahlgren. 2001. From Words to Understanding. Foundations of Real-World Intelligence, pp. 294–308.

Ola Knutsson, Johnny Bigert and Viggo Kann. 2003. A Robust Shallow Parser for Swedish. In Proceedings of Nodalida.

Leah S. Larkey and W. Bruce Croft. 1995. Automatic Assignment of ICD9 Codes to Discharge Summaries. PhD thesis, University of Massachusetts at Amherst, Amherst, MA, USA.

Serguei V.S. Pakhomov, James D. Buntrock and Christopher G. Chute. 2006. Automating the Assignment of Diagnosis Codes to Patient Encounters Using Example-based and Machine Learning Techniques. J Am Med Inform Assoc, 13, pp. 516–525.

John P. Pestian, Christopher Brew, Pawel Matykiewicz, DJ Hovermale, Neil Johnson, K. Bretonnel Cohen and Wlodzislaw Duch. 2007. A Shared Task Involving Multi-label Classification of Clinical Free Text. In Proceedings of BioNLP 2007: Biological, translational, and clinical language processing, pp. 97–104.

Magnus Sahlgren. 2001. Vector-Based Semantic Analysis: Representing Word Meanings Based on Random Labels. In Proceedings of the Semantic Knowledge Acquisition and Categorization Workshop at ESSLLI'01.

Magnus Sahlgren. 2006. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD thesis, Stockholm University, Stockholm, Sweden.

Maria Skeppstedt. 2011. Negation detection in Swedish clinical text: An adaption of NegEx to Swedish. Journal of Biomedical Semantics 2, S3.

Maria Skeppstedt, Hercules Dalianis and Gunnar H. Nilsson. 2011. Retrieving disorders and findings: Results using SNOMED CT and NegEx adapted for Swedish. In Proceedings of Louhi, 3rd International Workshop on Health Document Text Mining and Information Analysis.

Mary H. Stanfill, Margaret Williams, Susan H. Fenton, Robert A. Jenders and William R. Hersh. 2010. A systematic literature review of automated clinical coding and classification systems. J Am Med Inform Assoc, 17, pp. 646–651.

Dominic Widdows. 2003. Orthogonal Negation in Vector Spaces for Modelling Word-Meanings and Document Retrieval. In Proceedings of ACL, pp. 136–143.

World Health Organization. 2011. International Classification of Diseases (ICD). Retrieved June 19, 2011, from http://www.who.int/classifications/icd/en/.



PAPER IV

OPTIMIZING THE DIMENSIONALITY OF CLINICAL TERM SPACES FOR IMPROVED DIAGNOSIS CODING SUPPORT

Aron Henriksson and Martin Hassel (2013). Optimizing the Dimensionality of Clinical Term Spaces for Improved Diagnosis Coding Support. In Proceedings of the 4th International Louhi Workshop on Health Document Text Mining and Information Analysis (Louhi 2013), Sydney, Australia.

Author Contributions

Aron Henriksson was responsible for designing the study and for carrying out the experiments. Martin Hassel provided feedback on the design of the study and contributed to the analysis of the results. The manuscript was written by Aron Henriksson and reviewed by Martin Hassel.

Optimizing the Dimensionality of Clinical Term Spaces for Improved Diagnosis Coding Support

Aron Henriksson and Martin Hassel

Department of Computer and Systems Sciences (DSV) Stockholm University, Sweden {aronhen,xmartin}@dsv.su.se

Abstract. In natural language processing, dimensionality reduction is a common technique to reduce complexity that simultaneously addresses the sparseness property of language. It is also used as a means to capture some latent structure in text, such as the underlying semantics. Dimensionality reduction is an important property of the word space model, not least in random indexing, where the dimensionality is a predefined model parameter. In this paper, we demonstrate the importance of dimensionality optimization and discuss correlations between dimensionality and the size of the vocabulary. This is of particular importance in the clinical domain, where the level of noise in the text leads to a large vocabulary; it may also mitigate the effect of exploding vocabulary sizes when modeling multiword terms as single tokens. A system that automatically assigns diagnosis codes to patient record entries is shown to improve by up to 18 percentage points by manually optimizing the dimensionality.

1 Introduction

Dimensionality reduction is important in limiting complexity when modeling rare events, such as co-occurrences of all words in a vocabulary. In the word space model, reducing the number of dimensions yields the additional benefit of capturing second-order co-occurrences. There is, however, a trade-off between the degree of dimensionality reduction and the ability to model semantics usefully. This trade-off is specific to the dataset—the number of contexts and the size of the vocabulary—and, to some extent, the task that the induced term space will be applied to.

When working with noisy clinical text, which typically entails a large vocabulary, it may be especially prohibitive to pursue a dimensionality reduction that is too aggressive. In random indexing, the dimensionality is a predefined parameter of the model; however, there is precious little guidance in the literature on how to optimize—and reason around—the dimensionality in an informed manner. In the current work, we attempt to optimize the dimensionality toward the task of assigning diagnosis codes to free-text patient record entries and reason around the correlation between an optimal dimensionality and dataset-specific features, such as the number of training documents and the size of the vocabulary.


2 Distributional Semantics

With the increasing availability of large collections of electronic text, empirical distributional semantics has gained in popularity. Such models rely on the observation that words with similar meanings tend to appear in similar contexts [1]. Representing terms as vectors in a high-dimensional vector space that encode contextual co-occurrence information makes semantics computable: spatial proximity between vectors is assumed to indicate the degree of semantic relatedness between terms2. There are numerous approaches to producing these context vectors. In many methods they are derived from an initial term-context matrix that contains the (weighted, normalized) frequency with which the terms occur in different contexts3. The main problem with using these term-by-context vectors is their dimensionality—equal to the number of contexts (e.g. documents/vocabulary size)—which involves unnecessary computational complexity, in particular since most term-context occurrences are non-events, i.e. most of the cells in the matrix will be zero. The solution is to project the high-dimensional data into a low-dimensional space, while approximately preserving the relative distances between data points. This not only reduces complexity and data sparseness; it has been shown also to improve the accuracy of term-term associations: in this lower-dimensional space, terms no longer have to co-occur directly in the same contexts for their vectors to gravitate towards each other; it is sufficient for them to appear in similar contexts, i.e. co-occur with the same terms. This was in fact one of the main motivations behind the development of latent semantic analysis (LSA) [2], which provided an effective solution to the problem of synonymy negatively affecting recall in information retrieval. In LSA, the dimensionality of the initial term-document matrix is reduced by an expensive matrix factorization technique called singular value decomposition. Random indexing (RI) [3] is a scalable and efficient alternative in which there is no explicit dimensionality reduction step: it is not needed since there is no initial, high-dimensional term-context matrix to reduce. Instead, pre-reduced d-dimensional4 context vectors (where d is much smaller than the number of contexts) are constructed incrementally. First, each context (e.g. each document or unique term) is assigned a randomly generated index vector, which is high-dimensional, ternary5 and sparse: a small number (1-2%) of +1s and -1s are randomly distributed; the rest of the elements are set to zero. Ideally, index vectors should be orthogonal; however, in the RI approximation they are—or should be—nearly orthogonal6.

2 This can be estimated by, e.g., taking the cosine similarity between two term vectors.
3 A context can be defined as a non-overlapping passage of text (a document) or a sliding window of tokens/characters surrounding the target term.
4 In RI, the dimensionality is a model parameter. A benefit of employing a static dimensionality is scalability. Whether the dimensionality remains appropriate regardless of data size is, we argue, debatable and is preliminarily investigated here.
5 Allowing negative vector elements ensures that the entire vector space is utilized [4].
6 There are more nearly orthogonal than truly orthogonal directions in a high-dimensional vector space. Randomly generating sparse vectors within this space will, with a high probability, get us close enough to orthogonality [5].

Each unique term is also assigned an initially empty context vector of the same dimensionality. The context vectors are then incrementally populated with context information by adding the index vectors of the contexts in which a target term appears.

Although it is generally acknowledged that the dimensionality is an important design decision when constructing semantic spaces [4, 6], there is little guidance on how to choose one appropriately. Often a single, seemingly magic, number is chosen with little motivation. Generally, the following—rather vague—guidelines are given: O(100) for LSA and O(1,000) for RI [4, 6]. Is this because the exact dimensionality is of little empirical significance, or simply because it is dependent on the dataset and the task? If the latter is the case—and choosing an appropriate dimensionality is significant—it seems important to optimize the dimensionality when carrying out experiments that utilize semantic term spaces.

In a few studies the impact of the dimensionality has been investigated empirically. In one study, the effect of the dimensionality of LSA and PLSA (Probabilistic LSA) was studied on a knowledge acquisition task. Dimensionalities ranging from 50 to 300 were tested; however, no regular tendency could be found [7]. In another study using LSA on a term comparison task, it was found that a dimensionality in the 300-500 range was something of an "island of stability", with the best results achieved with 400 [8]. Using RI, there was a study where nine different dimensionalities (500-6,000) were used in a text categorization task. Here the performance hardly changed when the dimensionality exceeded 2,500 and the impact was generally low. The authors were led to conclude that the choice of dimensionality is less important in RI than in, for instance, LSA [9].
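To make the incremental procedure described above concrete, here is a minimal sketch of document-level random indexing in Python. The function names, the ten non-zero elements (1% of a 1,000-dimensional vector) and the per-occurrence accumulation are illustrative assumptions, not the exact implementation evaluated in this paper.

```python
import numpy as np
from collections import defaultdict

def make_index_vector(dim, nonzero, rng):
    # Sparse ternary index vector: a few randomly placed +1s and -1s, rest zeros.
    v = np.zeros(dim)
    positions = rng.choice(dim, size=nonzero, replace=False)
    v[positions] = rng.choice([1.0, -1.0], size=nonzero)
    return v

def random_indexing(documents, dim=1000, nonzero=10, seed=42):
    # Build a context vector per term by adding the index vectors of the
    # documents (contexts) in which the term occurs, once per occurrence.
    rng = np.random.default_rng(seed)
    context = defaultdict(lambda: np.zeros(dim))
    for doc in documents:
        index_vector = make_index_vector(dim, nonzero, rng)
        for term in doc:
            context[term] += index_vector
    return context

def cosine(u, v):
    # Spatial proximity as a proxy for semantic relatedness (cf. footnote 2).
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
```

With document-level contexts like these, two terms gravitate towards each other whenever they tend to occur in the same documents, which is the second-order effect that makes the dimensionality trade-off discussed above matter.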

3 Applying Semantic Spaces in the Clinical Domain

There is a growing interest in the application of distributional semantics to the biomedical domain (see [10] for an overview). Due to the difficulty of obtaining large amounts of clinical data, however, this particular application (sub)domain has been less explored. There are several complicating factors that need to be considered when working with this type of data, some of which have potential effects on the application of semantic spaces. The noisy nature of clinical text, with frequent misspellings and ad-hoc abbreviations, leads to a large vocabulary, often with concepts having a great number of lexical instantiations. For instance, the pharmaceutical product noradrenaline was shown to have approximately sixty different spellings in a set of ICU nursing narratives written in Swedish [11].

One application of semantic spaces in the clinical domain is for the purpose of (semi-)automatic diagnosis coding. A term space is constructed from a corpus of clinical notes and diagnosis codes (ICD-10). Document-level contexts are used, as there is no order dependency between codes and the words in an associated note. The term space is then used to suggest ten codes for each document by allowing the words to vote for distributionally similar codes in an ensemble fashion. In these experiments, a dimensionality of 1,000 is employed [12].
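The following is a hedged sketch of what such word-level voting could look like. It assumes code vectors are available in the same space as the word vectors; the function names, the number of votes per word and the similarity-weighted ballot are hypothetical details rather than the exact scheme of [12].

```python
from collections import Counter
import numpy as np

def suggest_codes(note_terms, term_vectors, code_vectors, top_k=10, votes_per_word=5):
    # Each word in the note votes for the diagnosis codes whose vectors lie
    # closest to its own; votes are weighted by cosine similarity.
    codes = list(code_vectors)
    code_matrix = np.vstack([code_vectors[c] for c in codes])
    code_norms = np.linalg.norm(code_matrix, axis=1)
    ballot = Counter()
    for term in note_terms:
        if term not in term_vectors:
            continue
        v = term_vectors[term]
        sims = code_matrix @ v / (code_norms * np.linalg.norm(v) + 1e-12)
        for i in np.argsort(sims)[-votes_per_word:]:
            ballot[codes[i]] += float(sims[i])
    return [code for code, _ in ballot.most_common(top_k)]
```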


4 Experimental Results

Random indexing is used to construct the semantic spaces with eleven different dimensionalities between 1,000 and 10,000. The data7 is from the Stockholm EPR Corpus [13] and contains lemmatized notes in Swedish. Variants of the dataset are created with different thresholds used for the collocation segmentation [14]: a higher threshold means that stronger statistical associations between constituents are required and fewer collocations will be identified. The collocations are concatenated and treated as single tokens (a toy sketch of this concatenation step is given after Table 1), increasing the number of word types and decreasing the number of tokens per type. Identifying collocations is done to see if modeling multiword terms, rather than only unigrams, may boost results; it will also help to provide clues about the correlation between features of the vocabulary and the optimal dimensionality (Table 1).

Table 1. Data description; the COLL X sets were created with different thresholds.

DATASET    DOCUMENTS  WORD TYPES  TOKENS / TYPE
UNIGRAMS   219k       371,778     51.54
COLL 100   219k       612,422     33.19
COLL 50    219k       699,913     28.83
COLL 0     219k       1,413,735   13.53
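The collocation segmentation itself follows [14]; the toy sketch below, referenced above, only illustrates the subsequent step of concatenating an identified collocation into a single token. The greedy longest-match strategy and the underscore joining are assumptions made for illustration.

```python
def concatenate_collocations(tokens, collocations, max_len=3):
    # Greedily replace known multiword collocations in a token stream
    # with single underscore-joined tokens, preferring longer matches.
    out, i = [], 0
    while i < len(tokens):
        match = None
        for n in range(max_len, 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in collocations:
                match = candidate
                break
        if match:
            out.append("_".join(match))  # one new word type, fewer tokens per type
            i += len(match)
        else:
            out.append(tokens[i])
            i += 1
    return out

# concatenate_collocations(["diabetes", "mellitus", "typ", "2"],
#                          {("diabetes", "mellitus")})
# -> ["diabetes_mellitus", "typ", "2"]
```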

Increasing the dimensionality yields major improvements (Table 2), up to 18 percentage points. The biggest improvements are seen when increasing the dimensionality from the 1,000-dimensionality baseline. When increasing the dimensionality beyond 2,000-2,500, the boosts in results begin to level off, although further improvements are achieved with a higher dimensionality: the best results are achieved with a dimensionality of 10,000. A larger improvement is seen with all of the COLL models compared to the UNIGRAMS model, even if the UNIGRAMS model outperforms all three COLL models; however, with a higher dimensionality, the COLL models appear to close in on the UNIGRAMS model.

5 Discussion

There are two dataset-specific features that affect the appropriateness of a given dimensionality: the number of contexts and the size of the vocabulary. In this case, each document is assigned an index vector. The RI approximation assumes the near orthogonality of index vectors, which is dependent on the dimensionality: the lower the dimensionality, the higher the risk of two contexts being assigned similar or identical index vectors8. When working with a large number of contexts, it is therefore important to use a sufficiently high dimensionality.

7 This research has been approved by the Regional Ethical Review Board in Stockholm (Etikprövningsnämnden i Stockholm), permission number 2012/834-31.
8 The proportion of non-zero elements is another aspect of this, which is affected by changing the dimensionality while keeping the number of non-zero elements constant.
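This dependence on the dimensionality is easy to verify empirically. The snippet below is a small illustrative experiment, not taken from the paper: it samples sparse ternary index vectors with a fixed number of non-zero elements and reports the largest pairwise cosine similarity, which shrinks as the dimensionality grows.

```python
import numpy as np

def sparse_ternary(dim, nonzero, rng):
    v = np.zeros(dim)
    positions = rng.choice(dim, size=nonzero, replace=False)
    v[positions] = rng.choice([1.0, -1.0], size=nonzero)
    return v

rng = np.random.default_rng(0)
for dim in (100, 1000, 10000):
    vectors = np.array([sparse_ternary(dim, 10, rng) for _ in range(500)])
    # Every vector has norm sqrt(10), so cosine similarity = dot product / 10.
    sims = vectors @ vectors.T / 10.0
    np.fill_diagonal(sims, 0.0)
    print(dim, np.abs(sims).max())  # worst-case deviation from orthogonality
```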


Table 2. Automatic diagnosis coding results, measured as recall top 10 for exact matches, with clinical term spaces constructed from differently preprocessed datasets (unigrams and three collocation variants) and with different dimensionalities (DIM).

DIM       UNIGRAMS  COLL 100  COLL 50  COLL 0
1000      0.25      0.18      0.19     0.15
1500      0.31      0.26      0.25     0.20
2000      0.34      0.29      0.28     0.24
2500      0.35      0.31      0.30     0.26
3000      0.36      0.33      0.31     0.26
3500      0.37      0.33      0.32     0.28
4000      0.37      0.34      0.34     0.29
4500      0.37      0.33      0.33     0.30
5000      0.38      0.34      0.33     0.30
7500      0.39      0.34      0.33     0.32
10000     0.39      0.36      0.35     0.32
+/- (pp)  +14       +18       +16      +17
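For clarity, one plausible reading of the evaluation measure in Table 2 is sketched below; the per-document averaging is an assumption, as the aggregation is not spelled out in the text.

```python
def recall_top_10(suggested, gold):
    # Fraction of a document's assigned diagnosis codes that appear,
    # as exact matches, among the ten suggested codes.
    return sum(1 for code in gold if code in suggested[:10]) / len(gold)

def mean_recall_top_10(all_suggested, all_gold):
    # Average the per-document recall over the evaluation set.
    scores = [recall_top_10(s, g) for s, g in zip(all_suggested, all_gold)]
    return sum(scores) / len(scores)
```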

With 200k+ documents in the current experiments, a dimensionality of 1,000 is possibly too low. This is a plausible explanation for the significant boosts in performance observed when increasing the dimensionality. The size of the vocabulary is another important factor. With a low dimensionality, there is less room for context vectors to be far apart [6]. A large vocabulary may not necessarily indicate a large number of concepts—as we saw with the noradrenaline example—but it can arguably serve as a crude indicator of that. For instance, the collocation segmentation of the data represents an attempt to identify multiword terms and thus meaningful concepts to model in addition to the (constituent) unigrams. For these to be modeled as clearly distinct concepts, it is critical that the dimensionality is sufficiently large. This is of particular concern when using a wide context definition, as there will be more co-occurrence events, resulting in more similar context vectors. The COLL models are also affected by the fewer tokens per type, which means that their semantic representation will be less statistically well-grounded. The fact that all COLL models are outperformed by the UNIGRAMS model could, however, also be due to poor collocations. Moreover, when working with clinical text, the vocabulary size is typically larger compared to many other domains. This may help to explain why increasing the dimensionality yielded such huge boosts in results also for the UNIGRAMS model.

Compared to prior work [9], where optimizing the dimensionality of RI-based models yielded only minor changes, it is now evident that dimensionality optimization can be of the utmost importance, particularly when working with large vocabularies and large document sets. A possible explanation for reaching different conclusions is their much smaller document set (21,578 vs 219k) and significantly smaller vocabulary (8,887 vs 300k+). It should be noted, however, that their results were already much higher, making it more difficult to increase performance.

This can also be viewed as demonstrating the particular importance of dimensionality optimization for more difficult tasks (90 classes vs 12,396).

6 Conclusion

Optimizing the dimensionality of semantic term spaces is important and may yield significant boosts in performance, which was demonstrated on the task of automatically assigning diagnosis codes to clinical notes. It is of particular importance when applying such models to the clinical domain, where the size of the vocabulary tends to be large, and when working with large document sets.

References

1. Harris, Z.S.: Distributional structure. Word, 10, pp. 146–162 (1954).
2. Deerwester, S., Dumais, S., Furnas, G., Landauer, T., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), pp. 391–407 (1990).
3. Kanerva, P., Kristofersson, J., Holst, A.: Random indexing of text samples for latent semantic analysis. In Proceedings of CogSci, p. 1036 (2000).
4. Karlgren, J., Holst, A., Sahlgren, M.: Filaments of Meaning in Word Space. In Proceedings of ECIR, LNCS 4956, pp. 531–538 (2008).
5. Kaski, S.: Dimensionality Reduction by Random Mapping: Fast Similarity Computation for Clustering. In Proceedings of IJCNN, pp. 413–418 (1998).
6. Sahlgren, M.: The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD thesis, Stockholm University, Stockholm, Sweden (2006).
7. Kim, Y-S., Chang, J-H., Zhang, B-T.: An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition. In Proceedings of PAKDD, LNAI 2637, pp. 111–116 (2003).
8. Bradford, R.B.: An Empirical Study of Required Dimensionality for Large-scale Latent Semantic Indexing Applications. In Proceedings of CIKM, pp. 153–162 (2008).
9. Sahlgren, M., Cöster, R.: Using Bag-of-Concepts to Improve the Performance of Support Vector Machines in Text Categorization. In Proceedings of COLING (2004).
10. Cohen, T., Widdows, D.: Empirical distributional semantics: Methods and biomedical applications. Journal of Biomedical Informatics, 42, pp. 390–405 (2009).
11. Allvin, H., Carlsson, E., Dalianis, H., Danielsson-Ojala, R., Daudaravicius, V., Hassel, M., Kokkinakis, D., Lundgren-Laine, H., Nilsson, G. H., Nytrø, Ø., Salanterä, S., Skeppstedt, M., Suominen, H., Velupillai, S.: Characteristics and Analysis of Finnish and Swedish Clinical Intensive Care Nursing Narratives. In Proceedings of Louhi, pp. 53–60 (2010).
12. Henriksson, A., Hassel, M.: Exploiting Structured Data, Negation Detection and SNOMED CT Terms in a Random Indexing Approach to Clinical Coding. In Proceedings of RANLP Workshop on Biomedical NLP, pp. 11–18 (2011).
13. Dalianis, H., Hassel, M., Velupillai, S.: The Stockholm EPR Corpus: Characteristics and Some Initial Findings. In Proceedings of ISHIMR 2009, pp. 243–249 (2009).
14. Daudaravicius, V.: The Influence of Collocation Segmentation and Top 10 Items to Keyword Assignment Performance. In Proceedings of CICLing, pp. 648–660 (2010).



PAPER V

EXPLORATION OF ADVERSE DRUG REACTIONS IN SEMANTIC VECTOR SPACE MODELS OF CLINICAL TEXT

Aron Henriksson, Maria Kvist, Martin Hassel and Hercules Dalianis (2012). Exploration of Adverse Drug Reactions in Semantic Vector Space Models of Clinical Text. In Proceedings of the ICML Workshop on Machine Learning for Clinical Data Analysis, Edinburgh, UK.

Author Contributions Aron Henriksson was responsible for coordinating the study, its overall design and for carrying out the experiments. Maria Kvist was involved in designing the study, evaluating the output and analyzing the results. Martin Hassel and Hercules Dalianis provided feedback on the design of the study and contributed to the analysis of the results. The manuscript was written by Aron Henriksson and reviewed by all authors.

Exploration of Adverse Drug Reactions in Semantic Vector Space Models of Clinical Text

Aron Henriksson [email protected]
Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden

Maria Kvist [email protected]
Department of Clinical Immunology and Transfusion Medicine, Karolinska University Hospital, Sweden
Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden

Martin Hassel [email protected]
Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden

Hercules Dalianis [email protected]
Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden

Abstract

A novel method for identifying potential side-effects to medications through large-scale analysis of clinical data is here introduced and evaluated. By calculating distributional similarities for medication-symptom pairs based on co-occurrence information in a large clinical corpus, many known adverse drug reactions are successfully identified. These preliminary results suggest that semantic vector space models of clinical text could also be used to generate hypotheses about potentially unknown adverse drug reactions. In the best model, 50% of the terms in a list of twenty are considered to be conceivable side-effects. Among the medication-symptom pairs, however, diagnostic indications and terms related to the medication in other ways also appear. These relations need to be distinguished in a more refined method for detecting adverse drug reactions.

1. Introduction

The prevalence of adverse drug reactions (ADRs) constitutes a major public health issue. In Sweden it has been identified as the seventh most common cause of death (Wester et al., 2008). The prospect of being able to detect potentially unknown or undocumented side-effects of medicinal substances by automatic means thus comes with great economic and health-related benefits.

As clinical trials are generally not sufficiently extensive to identify all possible side-effects – especially less common ones – many ADRs are unknown when a new drug enters the market. Pharmacovigilance is therefore carried out throughout the life cycle of a pharmaceutical product and is typically supported by reporting systems and medical chart reviews (Chazard, 2011). Recently, with the increasing adoption of Electronic Health Records (EHRs), several attempts have been made to facilitate this process by detecting ADRs from large amounts of clinical data. Many of the previous attempts have focused primarily on structured patient data, thereby missing out on potentially relevant information only available in the narrative parts of the EHRs.

The aim of this preliminary study is to investigate the application of distributional semantics, in the form of Random Indexing, to large amounts of clinical text in order to extract drug-symptom pairs. The models are evaluated for their ability to detect known and potentially unknown ADRs.

2. Background

Extracting ADRs automatically from large amounts of data essentially entails discovering a relationship – ideally a cause-and-effect relationship – between biomedical concepts, typically between a medication and a symptom/disorder. A potentially valuable source for this is EHRs, in which symptoms – sometimes as possible adverse reactions to a drug – are documented at medical consultations of various kinds. In many cases, however, the prescription of a named medication precedes the description of symptoms (diagnostic indications): this obviously does not correspond to the required temporal order of a cause-and-effect relationship for side-effects. In other cases, the intake of a medication precedes the appearance of new symptoms, which could thus be side-effects or merely due to a second disease appearing independently from earlier medical conditions. The different ways in which drugs and symptoms can be documented in EHRs present a serious challenge for automatic detection of ADRs.

2.1. Related Research

Attempts to discover a relationship between drugs and potential side-effects are usually based on co-occurrence statistics, semantic interpretation and machine learning (Doğan et al., 2011). For instance, Benton et al. (2011) try to mine potential ADRs from online message boards by calculating co-occurrences with drugs in a twenty-token window.

The increasing digitalization of health records has enabled the application of data mining – sometimes with the incorporation of natural language processing – to clinical data in order to detect Adverse Drug Events (ADEs) automatically. Chazard (2011) mines French and Danish EHRs in order to extract ADE detection rules. Decision trees and association rules are employed to mine structured patient data, such as diagnosis codes and lab results; free-text discharge letters are only mined when ATC1 codes are missing. Wang et al. (2010) first map clinical entities from discharge summaries to UMLS2 codes and then calculate co-occurrences with drugs. Arakami et al. (2010) assume a similar two-step approach, but use Conditional Random Fields to identify terms; to identify relations, both machine learning and a rule-based method are used. They estimate that approximately eight percent of their Japanese EHRs contain ADE information. In order to facilitate machine learning efforts for automatic ADE detection, MEDLINE3 case reports are annotated for mentions of drugs, adverse effects, dosages, as well as the relationships between them. The ADE corpus is now freely available (Gurulingappa et al., 2012).

2.2. Distributional Semantics

Another way of discovering relationships between concepts is to apply models of distributional semantics to a large text collection. Common to the various models that exist is their attempt to capture the meaning of words based on their distribution in a corpus of unannotated text (Cohen & Widdows, 2009). These models are based on the distributional hypothesis (Harris, 1954), which states that words with similar distribution in language have similar meanings. Traditionally the semantic representations have been spatial or geometric in nature by modeling terms as vectors in high-dimensional space, according to which contexts – and with what frequency – they occur in. The semantic similarity of two terms can then be quantified by comparing their distributional profiles.

Random Indexing (see e.g. Sahlgren, 2006) is a computationally efficient and scalable method that utilizes word co-occurrence information. The meanings of words are represented as vectors, which are points in a high-dimensional vector space that, for each observation, move around so that words that appear in similar contexts (i.e. co-occur with the same words) gravitate towards each other. The distributional similarity between two terms can be calculated by e.g. taking the cosine of the angles between their vectorial representations.

3. Method

The data from which the models are induced is extracted from the Stockholm EPR Corpus (Dalianis et al., 2009), which contains EHRs written in Swedish4. A document comprises clinical notes from a single patient visit. A patient visit is difficult to delineate; here it is simply defined according to the continuity of the documentation process: all free-text entries written on consecutive days are concatenated into one document. Two data sets are created: one in which patients from all age groups are included (20M tokens) and another where records for patients over fifty years are discarded (10M tokens). This is done in order to investigate whether it is easier to identify ADRs caused in patients less likely to have a multitude of health issues. The web of cause and effect may thereby be disentangled to some extent.

1 Anatomical Therapeutic Chemical, used for the classification of drugs.
2 Unified Medical Language System: http://www.nlm.nih.gov/research/umls/.
3 Contains journal citations and abstracts for biomedical literature: http://www.nlm.nih.gov/pubs/factsheets/medline.html.
4 This research has been approved by the Regional Ethical Review Board in Stockholm (Etikprövningsnämnden i Stockholm), permission number 2009/1742-31/5.
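A minimal sketch of the document delineation rule described above, assuming a list of (date, text) entries for a single patient sorted by date; the function name and the explicit one-day gap check are the only assumptions.

```python
from datetime import timedelta

def group_into_documents(entries):
    # Concatenate free-text entries written on consecutive days into one
    # document; a gap of more than one day starts a new document.
    documents, current, prev_date = [], [], None
    for date, text in entries:
        if prev_date is not None and (date - prev_date) > timedelta(days=1):
            documents.append(" ".join(current))
            current = []
        current.append(text)
        prev_date = date
    if current:
        documents.append(" ".join(current))
    return documents
```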

Preprocessing is done on the data sets in the form of lemmatization and stop-word removal.

Random Indexing is applied on the two data sets with two sets of parameters, yielding a total of four models. A model is induced from the data in the following way. Each unique term in the corpus is assigned a context vector and a word vector with the same predefined dimensionality, which in this case is set to 1,000. The context vectors are static, consisting of zeros and a small number of randomly placed 1s and -1s, which ensures that they are nearly orthogonal. The word vectors – initially empty – are incrementally built up by adding the context vectors of the surrounding words within a sliding window. The context is the only model parameter we experiment with, using a sliding window of eight (4+4) or 24 (12+12) surrounding words.
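A minimal sketch of this sliding-window variant, together with the subsequent ranking of candidate terms for a drug, using the paper's terminology (static random context vectors accumulated into word vectors). The function names, the vector density and the similarity-ranking details are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def ternary_vector(dim, nonzero, rng):
    # Static context vector: zeros with a few randomly placed +1s and -1s.
    v = np.zeros(dim)
    positions = rng.choice(dim, size=nonzero, replace=False)
    v[positions] = rng.choice([1.0, -1.0], size=nonzero)
    return v

def train_sliding_window_ri(documents, dim=1000, nonzero=10, window=4, seed=1):
    # Each word vector accumulates the context vectors of the words within
    # +/- `window` positions (window=4 gives the paper's 4+4 setting).
    rng = np.random.default_rng(seed)
    context = defaultdict(lambda: ternary_vector(dim, nonzero, rng))
    word = defaultdict(lambda: np.zeros(dim))
    for tokens in documents:
        for i, t in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    word[t] += context[tokens[j]]
    return word

def nearest_symptoms(drug, word_vectors, snomed_terms, top_n=20):
    # Rank SNOMED CT finding/disorder terms by cosine similarity to a drug.
    d = word_vectors[drug]
    scored = []
    for term in snomed_terms:
        if term in word_vectors:
            v = word_vectors[term]
            sim = d @ v / (np.linalg.norm(d) * np.linalg.norm(v) + 1e-12)
            scored.append((term, sim))
    return sorted(scored, key=lambda x: -x[1])[:top_n]
```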
The models are then used to generate lists of twenty distributionally similar terms for twenty common medications that have multiple known side-effects. To enable a comparison of the models induced from the two data sets, two groups of medications are selected: (1) drugs for cardiovascular disorders and (2) drugs that are not disproportionally prescribed to patients in older age groups, e.g. drugs for epilepsy, diabetes mellitus, infections and allergies. Moreover, as side-effects are manifested as either symptoms or disorders, a vocabulary of unigram SNOMED CT5 terms belonging to the semantic categories finding and disorder is compiled and used to filter the lists generated by the models. The drug-symptom/disorder pairs are here manually evaluated by a physician using lists of indications and known side-effects6.

4. Results

Approximately half of the model-generated suggestions are disorders/symptoms that are conceivable side-effects; around 10% are known and documented (Table 1). A fair share of the terms are indications for prescribing a certain drug (~10-15%), while differential diagnoses (i.e. alternative indications) also appear with some regularity (~5-6%). Interestingly, terms that explicitly indicate mention of possible adverse drug events, such as side-effect, show up fairly often. However, over 20% of the terms proposed by all of the models are obviously incorrect; in some cases they are not symptoms/disorders, and sometimes they are simply not conceivable side-effects.

Table 1. Results for models built on the two data sets (all age groups vs. only ≤50 years) with a sliding window context of 8 (SW 8) and 24 words (SW 24) respectively. S-E = side-effect; D/S = disorder/symptom.

Type             All SW 8  All SW 24  ≤50 SW 8  ≤50 SW 24
Known S-E        11        11         10        11
Potential S-E    37        39         35        35
Indication       14        14         10        10
Alt. indication  6         6          5         6
S-E term         10        9          7         7
D/S, not S-E     15        14         23        21
Not D/S          7         7          10        10
SUM %            100       100        100       100

Using the model induced from the larger data set – comprising all age groups – slightly more potential side-effects are identified, whereas the proportion of known side-effects remains constant across models. The number of incorrect suggestions increases when using the smaller data set. The difference in the scope of the context, however, does not seem to have a significant impact on these results. Similarly, when analyzing the results for the two groups of medications separately and vis-à-vis the two data sets, no major differences are found.

5. Discussion

The models produce some interesting results, being able to identify indications for a given drug, as well as known and potentially unknown ADRs. In a more refined method for identifying unknown side-effects, indications and known side-effects could easily be filtered out automatically. A large portion of the suggested terms were, however, clearly neither indications nor drug reactions. One reason is that the list of SNOMED CT findings and disorders contains several terms that are neither symptoms nor disorders. Refining this list is an obvious remedy, which could, for instance, be done by using a list of known side-effects. We also found that many of the unexpected words were very rare in the data, which means that their semantic representation is not statistically well-grounded: such words should probably be removed. Many common and expected side-effects did, on the other hand, not show up, e.g. headache. This could be due to the prevalence of this term in many different contexts, which diffuses its meaning in such a way that it is not distributionally similar to any particular other term. Moreover, since the vocabulary list was not lemmatized, but the health records were, some terms that were not already in their base form could not be proposed by the models.

5 Systematized Nomenclature of Medicine – Clinical Terms: http://www.ihtsdo.org/snomed-ct/.
6 http://www.fass.se, a medicinal database.

Some known side-effects were not present in the vocabulary list.

A limitation of these models is that they are presently restricted to unigrams; however, many side-effects are multiword expressions. Moreover, the models in these experiments were induced from data comprising all types of clinical notes; however, Wang et al. (2010) showed that their results improved by using specific sections of the clinical narratives. It would also be interesting to generate drug-symptom pairs for multiple drugs that belong to the same ATC code and extract potential side-effects common to them both. This could be a means to find stronger suspicions of ADRs related to a particular chemical substance.

6. Conclusion

By calculating distributional similarities for medication-symptom pairs based on co-occurrence information in a large clinical corpus, many known ADRs are successfully detected. These preliminary results suggest that semantic vector space models of clinical text could also be used to generate hypotheses about potentially unknown ADRs. Although several limitations must be addressed for this method to demonstrate its true potential, distributional similarity is a useful notion to incorporate in a more sophisticated method for detecting ADRs from clinical data.

Acknowledgments

This work was supported by the Swedish Foundation for Strategic Research through the project High-Performance Data Mining for Drug Effect Detection (ref. no. IIS11-0053) at Stockholm University, Sweden.

References

Arakami, E., Miura, Y., Tonoike, M., Ohkhuma, T., Masuichi, H., Waki, K., and Ohe, K. Extraction of adverse drug effects from clinical records. In MEDINFO 2010, pp. 739–743. IOS Press, 2010.

Benton, A., Ungar, L., Hill, S., Hennessy, S., Mao, J., Chung, A., Leonard, C.E., and Holmes, J.H. Identifying potential adverse effects using the web: A new approach to medical hypothesis generation. Journal of Biomedical Informatics, 44:989–996, 2011.

Chazard, Emmanuel. Automated Detection of Adverse Drug Events by Data Mining of Electronic Health Records. PhD thesis, Université Lille Nord de France, 2011.

Cohen, T. and Widdows, D. Empirical distributional semantics: Methods and biomedical applications. Journal of Biomedical Informatics, 42:390–405, 2009.

Dalianis, H., Hassel, M., and Velupillai, S. The Stockholm EPR Corpus: Characteristics and Some Initial Findings. In Proc. of ISHIMR, pp. 243–249, 2009.

Doğan, R. I., Névéol, A., and Lu, Z. A context-blocks model for identifying clinical relationships in patient records. BMC Bioinformatics, 12(3):1–11, 2011.

Gurulingappa, H., Rajput, A. M., Roberts, A., Fluck, J., Hofmann-Apitius, M., and Toldo, L. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of Biomedical Informatics, http://dx.doi.org/10.1016/j.jbi.2012.04.008, 2012.

Harris, Z. S. Distributional structure. Word, 10:146–162, 1954.

Sahlgren, M. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD thesis, Stockholm University, 2006.

Wang, X., Chase, H., Markatou, M., Hripcsak, G., and Friedman, C. Selecting information in electronic health records for knowledge acquisition. Journal of Biomedical Informatics, 43(4):595–601, 2010.

Wester, K., Jönsson, A. K., Spigset, O., Druid, H., and Hägg, S. Incidence of fatal adverse drug reactions: a population based study. British Journal of Clinical Pharmacology, 65(4):573–579, 2008.

