
Injection of Automatically Selected DBpedia Subjects in Electronic Medical Records to boost Hospitalization Prediction

Raphaël Gazzotti, Université Côte d'Azur, Inria, CNRS, I3S, Sophia-Antipolis, France; SynchroNext, Nice, France
Catherine Faron-Zucker, Université Côte d'Azur, Inria, CNRS, I3S, Sophia-Antipolis, France
Fabien Gandon, Inria, Université Côte d'Azur, CNRS, I3S, Sophia-Antipolis, France
Virginie Lacroix-Hugues, Université Côte d'Azur, RETINES, Département d'Enseignement et de Recherche en Médecine Générale, Faculté de médecine, Nice, France
David Darmon, Université Côte d'Azur, RETINES, Département d'Enseignement et de Recherche en Médecine Générale, Faculté de médecine, Nice, France

ABSTRACT
Although there are many medical standard vocabularies available, it remains challenging to properly identify domain concepts in electronic medical records. Variations in the annotations of these texts in terms of coverage and abstraction may be due to the chosen annotation methods and knowledge graphs, and may lead to very different performances in the automated processing of these annotations. We propose a semi-supervised approach based on DBpedia to extract medical subjects from EMRs and evaluate the impact of augmenting the features used to represent EMRs with these subjects in the task of predicting hospitalization. We compare the impact of subjects selected by experts vs. by automatic methods through feature selection. Our approach was experimented on data from the PRIMEGE PACA database that contains more than 600,000 consultations carried out by 17 general practitioners (GPs).

CCS CONCEPTS
• Applied computing → Health informatics; • Theory of computation → Semantics and reasoning; • Computing methodologies → Machine learning; Feature selection.

KEYWORDS
Information extraction, Predictive model, Electronic medical record.

ACM Reference Format:
Raphaël Gazzotti, Catherine Faron-Zucker, Fabien Gandon, Virginie Lacroix-Hugues, and David Darmon. 2020. Injection of Automatically Selected DBpedia Subjects in Electronic Medical Records to boost Hospitalization Prediction. In The 35th ACM/SIGAPP Symposium on Applied Computing (SAC '20), March 30-April 3, 2020, Brno, Czech Republic. ACM, New York, NY, USA, Article 4, 8 pages. https://doi.org/10.1145/3341105.3373932

1 INTRODUCTION
Electronic medical records (EMRs) contain vital information about a patient's state of health, and their analysis should enable preventing pathologies that may affect a patient in the future. Their exploitation through automated approaches makes it possible to discover patterns that, once addressed, are likely to improve the living conditions of the population. However, linguistic variability and tacit knowledge hinder automated processing, as they can lead to erroneous conclusions. In this paper we extract entities that help predict the hospitalization of patients from their electronic medical records and linked DBpedia entities (DBpedia is a crowd-sourced extraction of structured data from Wikimedia projects, http://dbpedia.org). DBpedia employs Semantic Web standards and structures Wikimedia project content with the Resource Description Framework (RDF). However, given the amount of general information available on DBpedia, it is challenging to filter knowledge specific to the healthcare domain. This is especially the case when it comes to identifying knowledge relevant to the prediction of hospitalized patients. To answer this problem, we estimate the relevance of concepts and select the most promising ones to construct the vector representation of EMRs used to predict hospitalization.

As a field of experimentation, we used a dataset extracted from the PRIMEGE PACA relational database [15], which contains more than 600,000 consultations in French by 17 general practitioners (Table 1). In this database, text descriptions written by general practitioners are available with international classification codes of prescribed drugs, pathologies and reasons for consultations, as well as the numerical values of the different medical examination results obtained by a patient.

In that context, our main research question is: How to extract knowledge relevant for the prediction of the occurrence of an event? In our case study, we extract subjects related to hospitalization using knowledge from DBpedia and EMRs. In this paper, we focus on the following sub-questions:
• How to filter relevant domain knowledge from a general knowledge source?
• How to deal with subjectivity in the annotation process?

• Is automatic extraction and selection of knowledge efficient in that context?

To answer these questions, we survey the related work (Section 2) and position our contribution. We then introduce the proposed method for knowledge extraction from texts and specify the filters used to retrieve medical knowledge (Section 3). Subsequently, we present the experimental protocol to compare the impact of knowledge selected by experts and by automatic selection, and we discuss the results obtained (Section 4). Finally, we conclude and provide our perspectives for this study (Section 5).

Table 1: Data collected in the PRIMEGE PACA database.

Category        Data collected
GPs             Sex, birth year, city, postcode
Patients        Sex, birth year, city, postcode
                Socio-professional category, occupation
                Number of children, family status
                Long term condition (Y/N)
                Personal history
                Family history
                Risk factors
                Allergies
Consultations   Date
                Reasons of consultation
                Symptoms related by the patient and medical observation
                Further investigations
                Diagnoses
                Drugs prescribed (dose, number of boxes, reasons of the prescription)
                Paramedical prescriptions (biology/imaging)
                Medical procedures

2 RELATED WORK
In [6], to address data insufficiency and the interpretation of deep learning models for the prediction of rarely observed diseases, the authors build a neural network with a graph-based attention model that exploits ancestors extracted from the OWL-SKOS representations of the ICD, the Clinical Classifications Software (CCS) and the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT). In order to exploit the hierarchical resources of these knowledge graphs in their attention mechanism, the graphs are transformed using GloVe embeddings [19]. The results show that the proposed model outperforms a standard recurrent neural network when identifying pathologies that are rarely observed in the training data, while also generalizing better when only few training instances are available.

In [20], to improve accuracy in the recognition of activities of daily living, the authors extract knowledge from the dataset of [17] and structure it with a knowledge graph developed for this purpose. Then, they automatically deduce new class expressions, with the objective of extracting their attributes to recognize activities of daily living using machine learning algorithms. The authors report better accuracy and results than with traditional approaches, regardless of the machine learning algorithm with which this task is addressed (up to 1.9% on average). However, they exploit solely the knowledge graph developed specifically for the purpose of discovering new rules, without trying to exploit other knowledge sources to which a mapping could have been made. Their study shows the value of structured knowledge in classification tasks.

The SIFR Bioportal project [21] provides a web service based on the NCBO BioPortal [23] to annotate clinical texts in French with biomedical knowledge graphs. This service is able to handle clinical notes involving negations, experiencers (the patient or members of his family) and temporal aspects in the context of the entity references. However, the adopted approach involves domain specific knowledge graphs, while general resources like EMRs also require general repositories such as, for instance, DBpedia.

In [10], the authors show that combining bag-of-words (BOW), biomedical entities and UMLS (the Unified Medical Language System, a metathesaurus developed at the US National Library of Medicine, http://www.nlm.nih.gov/pubs/factsheets/umls.html) improves classification results in several tasks such as information retrieval, information extraction and text summarization, regardless of the classifier. We intend here to study the same kind of impact but from a more general repository like DBpedia and on a domain-specific prediction task: we also propose a method to select relevant domain knowledge in order to boost hospitalization prediction.

In [11], we studied the contributions of different knowledge graphs (ATC, ICPC-2, NDF-RT, and DBpedia) for hospitalization prediction. Compared to [11], this paper explores in more depth the impact of knowledge enrichment using DBpedia while relying on the same prediction method. Our goal is to provide a method to solve the problem of retrieving relevant knowledge in the medical domain from a general knowledge source. The intuition behind the use of DBpedia is that general knowledge is only available in general repositories, and the way knowledge is structured there differs from specialized repositories. To achieve this purpose, we propose a method that relies on semi-supervised learning to extract subject candidates. Selecting concepts relevant for a specific domain problem is both an expert and subjective task [12] for which an automated solution could help develop new applications.

3 KNOWLEDGE EXTRACTION AND REPRESENTATION OF EMR

3.1 Extraction of candidate subjects from DBpedia to predict hospitalization
The first step of our approach consists in recognizing named entities from the medical domain part of DBpedia within the French texts contained in EMRs. This is performed by an instance of the semantic annotator DBpedia Spotlight [8] that was deployed locally and pretrained with a French model (https://sourceforge.net/projects/dbpedia-spotlight/files/2016-10/fr/). To ensure that the retrieved entities belong to the medical domain, we enforce two constraints on the resources identified by DBpedia Spotlight. The first constraint requires that the identified resources belong to the medical domain of the French chapter of DBpedia. The second one does the same with the English chapter, in order to filter and select health domain-related subjects and to overcome the defects of the French version, in which the property rdf:type is poorly used. This involves calling two SERVICE clauses in a SPARQL query (https://www.w3.org/TR/sparql11-query/), each one implementing a constraint according to the structure of the French or English chapter it remotely queries. The workflow is represented in Figure 1 and the query in Listing 1.
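As an illustration of this first entity recognition step, and before the filtering query of Listing 1 below, here is a minimal sketch (ours, not part of the original paper) of how a locally deployed DBpedia Spotlight instance can be queried through its REST interface; the port and confidence threshold are assumptions.

import requests

SPOTLIGHT_URL = "http://localhost:2222/rest/annotate"  # hypothetical local deployment

def annotate(text, confidence=0.5):
    """Return the (surface form, DBpedia URI) pairs identified by DBpedia Spotlight."""
    response = requests.post(
        SPOTLIGHT_URL,
        data={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    response.raise_for_status()
    resources = response.json().get("Resources", [])
    return [(r["@surfaceForm"], r["@URI"]) for r in resources]

# Example with a short French observation; with the French model,
# the URIs point to the French chapter of DBpedia.
print(annotate("pancréatite aiguë avec insuffisance cardiaque"))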

Listing 1: SPARQL query to extract subjects related to the medical domain from DBpedia. The namespace IRIs and SERVICE endpoints were lost in the extraction and are reconstructed here with the standard values; the subject of the first triple pattern is a placeholder for the URI of a resource identified by DBpedia Spotlight.

PREFIX dbo:          <http://dbpedia.org/ontology/>
PREFIX skos:         <http://www.w3.org/2004/02/skos/core#>
PREFIX dbpedia-owl:  <http://dbpedia.org/ontology/>
PREFIX dcterms:      <http://purl.org/dc/terms/>
PREFIX owl:          <http://www.w3.org/2002/07/owl#>
PREFIX yago:         <http://dbpedia.org/class/yago/>
PREFIX cat:          <http://fr.dbpedia.org/resource/Catégorie:>

SELECT ?skos_subject WHERE {
  SERVICE <http://fr.dbpedia.org/sparql> {   # French chapter of DBpedia
    # Constraint on the medical domain
    VALUES ?concept_constraint {
      cat:Maladie              # disease
      cat:Santé                # health
      cat:Génétique_médicale   # medical genetics
      cat:Médecine             # medicine
      cat:Urgence              # urgency
      cat:Traitement           # treatment
      cat:Anatomie             # anatomy
      cat:Addiction            # addiction
      cat:Bactérie             # bacteria
    }
    # ?entity stands for the URI of a resource identified by DBpedia Spotlight
    ?entity dbpedia-owl:wikiPageRedirects{0,1} ?page .
    ?page dcterms:subject ?page_subject .
    ?page_subject skos:broader{0,10} ?concept_constraint .
    ?page_subject skos:prefLabel ?skos_subject .
    ?page owl:sameAs ?page_en .
    # Filter used to select the corresponding resource in the English chapter of DBpedia
    FILTER (STRSTARTS(STR(?page_en), "http://dbpedia.org/resource/"))
  }
  SERVICE <http://dbpedia.org/sparql> {      # English chapter of DBpedia
    VALUES ?type_constraint {
      dbo:Disease
      dbo:Bacteria
      yago:WikicatViruses
      yago:WikicatRetroviruses
      yago:WikicatSurgicalProcedures
      yago:WikicatSurgicalRemovalProcedures
    }
    ?page_en a ?type_constraint .
  }
}

Figure 1: Workflow used to extract candidate subjects from EMR.

From the URIs of the identified resources, the first part of the query (the first SERVICE clause) accesses the French chapter of DBpedia to check that the value of their property dcterms:subject (namespace http://purl.org/dc/terms/) belongs to one of the hierarchies of SKOS concepts (skos:broader, skos:narrower) having for roots the French terms for disease, health, medical genetics, medicine, urgency, treatment, anatomy, addiction and bacteria. The second part of the query (the second SERVICE clause) checks that the resources identified in the French DBpedia have as English equivalent (owl:sameAs) at least one of the following types (rdf:type, in the namespaces http://dbpedia.org/ontology/ and http://dbpedia.org/class/yago/): dbo:Disease, dbo:Bacteria, yago:WikicatViruses, yago:WikicatRetroviruses, yago:WikicatSurgicalProcedures, yago:WikicatSurgicalRemovalProcedures.

We do not consider some other types like dbo:Drug, dbo:ChemicalCompound, dbo:ChemicalSubstance, dbo:Protein, or yago:WikicatMedicalTreatments, as they generate answers related to chemical compounds: the retrieved resources can thus range from drugs to plants to fruits. We do not consider either types referring to other living beings, like umbel-rc:BiologicalLivingObject or dbo:Species, which are too general to return relevant results. Nor do we consider many biomedical types in the yago namespace whose URI ends with an integer (e.g., http://dbpedia.org/class/yago/Retrovirus101336282), which are too numerous and too close to each other. The type dbo:AnatomicalStructure is also not relevant with this second constraint since it retrieves subjects related to different anatomical parts which are not human specific. The list of labels of concepts thus extracted allows us to construct a vector representation of EMRs used to identify hospitalized patients.
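As a complement, the following sketch (ours, not from the paper) shows how the first, French-chapter constraint could be evaluated for a single entity URI with the SPARQLWrapper library. The real query of Listing 1 additionally follows wikiPageRedirects, bounds the skos:broader path to 10 steps, and checks the rdf:type of the English counterpart; here the path is left unbounded for simplicity.

from SPARQLWrapper import SPARQLWrapper, JSON

FR_ENDPOINT = "http://fr.dbpedia.org/sparql"  # public endpoint of the French chapter (assumption)

def medical_subjects(entity_uri):
    """Return the SKOS subject labels of a DBpedia entity that fall under the
    medical category roots used in Listing 1."""
    query = """
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
    PREFIX cat:     <http://fr.dbpedia.org/resource/Catégorie:>
    SELECT DISTINCT ?label WHERE {
      VALUES ?root { cat:Maladie cat:Santé cat:Génétique_médicale cat:Médecine
                     cat:Urgence cat:Traitement cat:Anatomie cat:Addiction cat:Bactérie }
      <%s> dcterms:subject ?subject .
      ?subject skos:broader* ?root .
      ?subject skos:prefLabel ?label .
    }""" % entity_uri
    sparql = SPARQLWrapper(FR_ENDPOINT)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return [b["label"]["value"] for b in bindings]

print(medical_subjects("http://fr.dbpedia.org/resource/Pancréatite_aiguë"))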

In order to improve DBpedia Spotlight's detection capabilities, words or abbreviated expressions within medical reports are expanded and added to the text fields using a symbolic approach, with rules and dictionaries. For instance, the abbreviation "ic", which means "heart failure" (insuffisance cardiaque), is not recognized by DBpedia Spotlight, but is correctly identified by our rule-based approach. This method to retrieve classification labels was applied to all the textual fields of our dataset.
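A minimal sketch (ours) of the kind of rule- and dictionary-based expansion described above; the dictionary entries and function name are illustrative, not the actual resources used in the paper.

import re

# Illustrative dictionary; the actual rules and entries used in the paper are not published here.
ABBREVIATIONS = {
    "ic": "insuffisance cardiaque",    # heart failure
    "hta": "hypertension artérielle",  # high blood pressure
    "dt2": "diabète de type 2",        # type 2 diabetes
}

def expand_abbreviations(text, lexicon=ABBREVIATIONS):
    """Append the expanded form after each known abbreviation so that
    DBpedia Spotlight can link the corresponding entity."""
    def repl(match):
        token = match.group(0)
        expansion = lexicon.get(token.lower())
        return f"{token} ({expansion})" if expansion else token
    return re.sub(r"\b\w+\b", repl, text)

print(expand_abbreviations("ic décompensée, HTA connue"))
# -> "ic (insuffisance cardiaque) décompensée, HTA (hypertension artérielle) connue"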

3.2 Injection of concepts in the vector representation of EMRs
A domain specific corpus like PRIMEGE contains very specialized jargon and expressions whose meaning is adapted to its context. This is why we chose a bag-of-words (BOW) representation, to avoid out-of-vocabulary issues. It allows us to generate our own textual representation of EMRs since it does not require a large amount of data. This model also enables us to identify the contribution of each term in distinguishing patients to hospitalize or not, thus addressing algorithm explanation issues. Additionally, the integration of heterogeneous data is facilitated since it is sufficient to concatenate other attributes to this model without removing the meaning of the terms previously represented in this way.

Moreover, loss of information is intrinsic to more advanced data representation models. We opted for a BOW in order to remain able to provide general practitioners with the closest information available in their files.

In order to mimic the structure of the PRIMEGE database and to prevent wrong conclusions, we introduced provenance prefixes during the creation of the bag-of-words to trace the contribution of the different fields. This allows us to distinguish some textual data from each other in the vector representation of EMRs, e.g., a patient's personal history and his family history.

Subject labels from DBpedia are treated as tokens in a textual message. When an entity is identified in a patient's medical record, the label of its corresponding subject is added to a concept vector. This attribute takes as value the number of occurrences of this subject within the patient's health record (e.g., the subjects 'Organ failure' and 'Medical emergencies', among other concepts, are identified for 'pancréatite aiguë', acute pancreatitis, and the value for these attributes in our concept vector will be equal to 1).

Let V^i = {w^i_1, w^i_2, ..., w^i_n} be the bag-of-words obtained from the textual data in the EMR of the i-th patient. Let C^i = {c^i_1, c^i_2, ..., c^i_m} be the bag of concepts for the i-th patient, resulting from the extraction of labels of concepts belonging to DBpedia after the analysis of his consultations, from semi-structured data such as text fields listing drugs and pathologies, and from free texts such as medical observations. The vector representation of the i-th patient is the sum of V^i and C^i, or of a sub-vector of it, as detailed in the next section. Figure 2 represents the general workflow used to generate vector representations.

Figure 2: Workflow used to generate vector representations integrating ontological knowledge alongside textual information.
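To make the construction of V^i and C^i concrete, here is a minimal sketch (ours, with assumed field names) of a provenance-prefixed bag-of-words combined with a concept-count vector.

from collections import Counter

def bow_with_provenance(record):
    """record: dict mapping an EMR field name to its text.
    Each token is prefixed by its field of origin so that, e.g., a term in the
    personal history stays distinct from the same term in the family history."""
    counts = Counter()
    for field, text in record.items():
        for token in text.lower().split():
            counts[f"{field}::{token}"] += 1
    return counts

def concept_vector(subject_labels):
    """subject_labels: list of DBpedia subject labels identified in the record;
    the value of each attribute is its number of occurrences."""
    return Counter(subject_labels)

# Hypothetical record with two fields.
record = {
    "observations": "pancréatite aiguë",
    "family_history": "diabète",
}
vector = bow_with_provenance(record) + concept_vector(["Organ failure", "Medical emergencies"])
print(vector)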
3.3 Alternative vector representations: manual vs. automatic selection of relevant subjects
To decide on the optimal vector representation of a patient's EMR, we considered further filtering the list of the labels of concepts extracted from DBpedia, depending on their relevancy for the targeted prediction task. We first submitted the list of the 285 extracted labels of concepts to human medical experts who were asked to assess their relevance for studying patients' hospitalization risks from their EMRs. Alternatively, we considered automatically selecting the concepts relevant for studying hospitalization by using a feature selection algorithm applied on a training set of vector representations of patients in the C^i form.

As a result, we generated the following alternative vector representations, to be compared when used to predict hospitalization:
• baseline: our basis of comparison, where no enrichment with DBpedia concepts is made on EMR data, i.e., only text data in the form of a bag-of-words: V^i.
• α: an enrichment of V^i with the labels of concepts automatically extracted from DBpedia: V^i + C^i.
• β: an enrichment of V^i with the subset of the labels of concepts in C^i acknowledged as relevant by at least one expert human annotator.
• γ: an enrichment of V^i with the subset of the labels of concepts in C^i acknowledged as relevant by all the expert human annotators.
• ϵ: an enrichment of V^i with the subset of the labels of concepts in C^i output by the automatic feature selection algorithm. We chose the Lasso algorithm [22] and we executed it within the internal loop of the nested cross-validation of the global machine learning algorithm chosen to predict hospitalization. For the Lasso algorithm, we chose the default parameters (and the number of folds used for cross-validating in that context, fixed at F = 3). A sketch of this selection step is given below.
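A minimal sketch (ours, with assumed variable names) of the ϵ selection step: fitting a Lasso on the concept features of a training fold and keeping the concepts with non-zero coefficients. In the actual pipeline this runs inside the inner loop of the nested cross-validation.

import numpy as np
from sklearn.linear_model import Lasso

def select_concepts(X_concepts_train, y_train, concept_names):
    """Keep the DBpedia concept features to which the Lasso assigns a non-zero weight.
    X_concepts_train: (n_patients, n_concepts) occurrence counts, y_train: 0/1 hospitalization."""
    lasso = Lasso()  # default parameters
    lasso.fit(X_concepts_train, y_train)
    mask = lasso.coef_ != 0
    return [name for name, keep in zip(concept_names, mask) if keep]

# Hypothetical toy data: 6 patients, 3 candidate concepts.
X = np.array([[1, 0, 2], [0, 0, 1], [3, 1, 0], [0, 2, 0], [2, 0, 1], [0, 1, 0]])
y = np.array([1, 0, 1, 0, 1, 0])
print(select_concepts(X, y, ["Organ failure", "Medical emergencies", "Diabète"]))
# Note: on such tiny toy data the default regularization may select nothing.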

4 EXPERIMENTS AND RESULTS

4.1 Dataset and protocol
The extraction of labels of concepts described in Section 3.1 was performed on DS_B, a balanced dataset of 1446 patients. This dataset contains data on 714 hospitalized patients and 732 patients who were not hospitalized. We then introduce these concepts in the vector representation of the EMRs of the same dataset and classify them to predict the future hospitalization of patients or not.

To construct V^i we consider the following EMR fields: sex, birth year, long term condition, risk factors, allergies, reasons of consultation with their associated codes, medical observations, diagnoses with their associated codes, care procedures, drugs prescribed with their associated codes and reasons of the prescription, the patient's history, the family history, past problems and symptoms of the patient. Most of the concepts in C^i are extracted from the field "reasons of consultation", which is very short, and from the field "medical observations", whose length generally goes from 50 to 300 characters. By default, C^i does not use the following fields: the patient's history, the family history, past problems and symptoms of the patient.

An alternative vector representation of EMRs, ζ, also processes for C^i the following additional fields: the patient's history, the family history, past problems and symptoms of the patient. Note that the symptom field, like the observation field, is used by physicians for various purposes. ζ is constructed similarly to ϵ when considering these additional data.

Since we evaluate our vector representations with non-sequential machine learning algorithms, we aggregated all of a patient's consultations to overcome the temporal dimension specific to EMRs. All consultations occurring before hospitalization are aggregated into a vector representation of the patient's medical file. For patients who have not been hospitalized, all their consultations are aggregated. Thus, the text fields contained in patients' records are transformed into vectors.

We evaluated the vector representations by nested cross-validation [4], with an outer loop of K = 10 folds and an inner loop of L = 3 folds. The exploration of hyperparameters was performed by random search [2] over 150 iterations. The different experiments were conducted on an HP EliteBook 840 G2, 2.6 GHz, 16 GB RAM, with a virtual environment under Python 3.6.3. The creation of the vector representations was done on the HP EliteBook, and on this same machine were deployed DBpedia Spotlight and the Corese Semantic Web Factory [7] (http://corese.inria.fr), a software platform for the Semantic Web that implements RDF, RDFS, SPARQL 1.1 Query & Update, and OWL RL.

4.2 Inter-rater reliability of subject annotation
Two general practitioners and one biologist independently annotated the 285 subjects extracted from DBpedia. The annotations were transformed into vectors of size 285. Then, we compared the vectors resulting from human annotation with Krippendorff's α metric: up to 3 vectors, i.e., the biologist and the two physicians, and up to 2 vectors to compare only the physicians' annotations. The correlation metric was used to compare different pairs of vectors resulting either from human or from machine annotation. Figure 3 shows the workflow used to assess inter-rater reliability.

Figure 3: Workflow used to compute inter-rater reliability for both human and machine annotations.

Annotations were evaluated with Krippendorff's α metric [14] and obtained a score of 0.51; the annotation score between the two general practitioners is 0.27. Even when excluding some subjects involving a terminological conflict in their naming (if someone annotates the beginning of a concept label as relevant towards the hospitalization of a patient, or as irrelevant, all the labels of concepts starting with the same expression will be annotated in the same way), the three annotators obtained a score of only 0.66, and 0.52 for the inter-rater reliability between the two general practitioners. The excluded subjects started with 'Biology', 'Screening and diagnosis', 'Physiopathology', 'Psychopathology', 'Clinical sign', 'Symptom' and 'Syndrome', which brings us back to a new total of 243 concepts.

On average, 198 subjects were annotated by the experts as relevant to the study of patients' hospitalization risks (respectively 217 and 181 for the two general practitioners, and 196 for the biologist) among the 285 subjects proposed by the extraction based on the SPARQL query presented in Section 3.1.

As discussed by [1], a score within this range of values is insufficient to draw conclusions, and it shows the difficulty of this task, both because identifying entities involved in patient hospitalization is subject to interpretation and because it is complex to find a consensus in a task that could be seen at first sight as simplistic by an expert in the field.

The union of the labels of concepts identified with the ζ approach counts 51 different subjects (63 if the provenance prefix is considered as making a different subject) and their intersection counts 14 different subjects (19 if the provenance prefix is considered as making a different subject). Table 2 displays the correlation metric values between expert and machine annotators (its value ranges from 0 to 2: 0 is a perfect correlation, 1 no correlation and 2 a perfect negative correlation). This metric was computed by comparing, among the 285 subjects, whether they are deemed relevant, irrelevant or not annotated (in the case of human annotation) for studying the patient's hospitalization risks from their EMRs; vectors are thus compared in pairs in this table.

Table 2 shows a wide variation between human annotators and machine annotators (maximum of 1.1399 between A1 and M4), whereas between annotators of a specific group this margin is not significant (maximum of 0.6814 for humans and maximum of 0.4185 for machines). The union of subjects U1 retrieved by the machine annotators is very similar to M5, since they have a correlation score of 0.12.

Table 2: Correlation metric $1 - \frac{(u - \bar{u}) \cdot (v - \bar{v})}{\|u - \bar{u}\|_2 \, \|v - \bar{v}\|_2}$ (with $\bar{u}$ the mean of the elements of u and, respectively, $\bar{v}$ the mean of the elements of v) computed on the 285 subjects. A1 to A3 refer to human annotators and M1 to M10 refer to machine annotation through feature selection on the ζ approach (one per fold of the 10-fold outer loop). U1 is the union of subjects from the sets M1 to M10.

      A1      A2      A3      M1      M2      M3      M4      M5      M6      M7      M8      M9      M10     U1
A1    \       0.6814  0.4180  1.1085  1.0688  1.1138  1.1399  1.0692  1.1166  1.1085  1.0688  1.1257  1.1363  1.1405
A2    0.6814  \       0.2895  1.0618  1.1066  1.0072  1.0745  1.0534  1.1127  1.0618  1.0611  1.0904  1.0749  1.0737
A3    0.4180  0.2895  \       1.0232  1.0807  1.0242  1.0721  1.0616  1.0708  1.0232  1.0320  1.0708  1.0520  1.0933
M1    1.1085  1.0618  1.0232  \       0.2105  0.2635  0.2249  0.3410  0.3389  0.2116  0.2105  0.2031  0.2760  0.3293
M2    1.0688  1.1066  1.0807  0.2105  \       0.2319  0.1605  0.1597  0.2037  0.1714  0.0724  0.2358  0.3019  0.2605
M3    1.1138  1.0072  1.0241  0.2635  0.2319  \       0.1408  0.2700  0.2865  0.2249  0.1605  0.3346  0.2710  0.2472
M4    1.1399  1.0745  1.0721  0.2249  0.1605  0.1408  \       0.2700  0.2527  0.1863  0.1248  0.2495  0.2710  0.2472
M5    1.0692  1.0534  1.0616  0.3410  0.1597  0.2700  0.2700  \       0.2508  0.2379  0.1597  0.3595  0.4167  0.1200
M6    1.1166  1.1127  1.0708  0.3389  0.2037  0.2865  0.2527  0.2508  \       0.2275  0.2037  0.3690  0.3495  0.2080
M7    1.1085  1.0618  1.0232  0.2116  0.1714  0.2249  0.1863  0.2379  0.2275  \       0.1322  0.1565  0.3238  0.3293
M8    1.0688  1.0611  1.0320  0.2105  0.0724  0.1605  0.1248  0.1597  0.2037  0.1322  \       0.2358  0.3019  0.2605
M9    1.1257  1.0904  1.0708  0.2031  0.2358  0.3346  0.2495  0.3595  0.3690  0.1565  0.2358  \       0.2888  0.4030
M10   1.1363  1.0749  1.0520  0.2760  0.3019  0.2710  0.2710  0.4167  0.3495  0.3238  0.3019  0.2888  \       0.4185
U1    1.1405  1.0737  1.0933  0.3293  0.2605  0.2472  0.2472  0.1200  0.2080  0.3293  0.2605  0.4030  0.4185  \
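For reference, a small sketch (ours) of how the pairwise values of Table 2 can be computed: the metric in the caption is exactly SciPy's correlation distance (1 minus the Pearson correlation of the two annotation vectors). The numeric encoding of the annotations below is an assumption for illustration only.

import numpy as np
from scipy.spatial.distance import correlation

# Hypothetical encoding of an annotation vector over the subjects:
# 1 = relevant, 0 = irrelevant or not annotated (the paper's exact encoding is not detailed).
a1 = np.array([1, 0, 1, 1, 0, 1])   # toy human annotator
m1 = np.array([1, 0, 0, 1, 0, 1])   # toy machine annotator (feature selection on one fold)

# Correlation distance as in the caption of Table 2: 0 = identical, 1 = uncorrelated, 2 = opposite.
print(correlation(a1, m1))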

4.3 Selected machine learning algorithms
We performed the hospitalization prediction task with different state-of-the-art algorithms available in the Scikit-Learn library [18]. The hyperparameters optimized by nested cross-validation are as follows:
• SVC, C-Support Vector Classifier, whose implementation is based on libsvm [5]: the regularization coefficient C, the kernel used by the algorithm and the gamma coefficient of the kernel.
• RF, Random Forest classifier [3]: the number of trees in the forest, the maximum depth of the trees, the minimum number of samples required to split an internal node, the minimum number of samples required to be at a leaf node and the maximum number of leaf nodes.
• Log, Logistic Regression classifier [16]: the regularization coefficient C and the penalty used by the algorithm.

One of the motivations for using these algorithms is that logistic regression and random forest are widely used to predict risk factors in EMRs [13]. These machine learning algorithms are able to provide a native interpretation of their decisions. The reasons leading to a patient's hospitalization can thus be reported to the physician, as well as the factors on which the physician can intervene to prevent this event from occurring. Moreover, the limited size of our dataset excluded neural network approaches.
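A minimal sketch (ours) of the nested cross-validation protocol of Section 4.1 combined with the random search over hyperparameters for one of the classifiers; the hyperparameter ranges are illustrative, not the ones used in the paper.

import numpy as np
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the EMR feature matrix (bag-of-words + concepts) and labels.
rng = np.random.RandomState(0)
X = rng.rand(100, 20)
y = rng.randint(0, 2, 100)

# Inner loop (L = 3 folds): random search over 150 hyperparameter draws.
param_distributions = {"C": np.logspace(-3, 2, 1000), "penalty": ["l2"]}  # illustrative ranges
inner_search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear"),
    param_distributions,
    n_iter=150,
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)

# Outer loop (K = 10 folds): each outer fold evaluates the model selected by the inner search.
outer_scores = cross_val_score(inner_search, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(outer_scores.mean())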

4.4 Results
We used the F_{tp,fp} metric [9] to evaluate the performance of the machine learning algorithms. Let TN be the number of negative instances correctly classified (True Negatives), FP the number of negative instances incorrectly classified (False Positives), FN the number of positive instances incorrectly classified (False Negatives) and TP the number of positive instances correctly classified (True Positives). K is the number of folds used to cross-validate (fixed at 10 in our context) and the subscript f denotes a fold-related aggregate, such as the sum of the true positives across all folds:

$TP_f = \sum_{i=1}^{K} TP^{(i)} \qquad FP_f = \sum_{i=1}^{K} FP^{(i)} \qquad FN_f = \sum_{i=1}^{K} FN^{(i)}$

$F_{tp,fp} = \frac{2 \cdot TP_f}{2 \cdot TP_f + FP_f + FN_f}$

The comparison of the different feature sets is presented in Table 3. Results with only the bag of concepts (no feature from the baseline) were not included, since with the 285 subjects C^i and the logistic regression algorithm we obtained an F_{tp,fp} of 0.6778.

Table 3: F_{tp,fp} for the different vector sets considered on the balanced dataset DS_B.

Features set   SVC      RF       Log      Average
baseline       0.8270   0.8533   0.8491   0.8431
α              0.8214   0.8492   0.8388   0.8365
β              0.8262   0.8521   0.8432   0.8405
γ              0.8270   0.8467   0.8445   0.8394
ϵ              0.8363   0.8547   0.8642   0.8517
ζ              0.8384   0.8541   0.8689   0.8538

Table 4: Confusion matrices of the random forest algorithm and of the logistic regression on the baseline ('H' stands for Hospitalized and 'Not H' for Not Hospitalized).

                       Random forest        Logistic regression
                       H       Not H        H       Not H
Predicted as 'H'       599     91           588     83
Predicted as 'Not H'   115     641          126     649

Table 5: Confusion matrices of the ζ approach and of the union of subjects under ζ conditions, both with the logistic regression algorithm ('H' stands for Hospitalized and 'Not H' for Not Hospitalized).

                       ζ                    Union of subjects
                       H       Not H        H       Not H
Predicted as 'H'       600     67           603     67
Predicted as 'Not H'   114     665          111     665
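A small sketch (ours) of the pooled F_{tp,fp} computation defined above, summing the confusion counts over the outer folds before computing the F-score.

def f_tp_fp(fold_confusions):
    """fold_confusions: list of (TP, FP, FN) tuples, one per outer fold.
    Pools the counts over the K folds before computing the F-score, as in [9]."""
    tp_f = sum(tp for tp, _, _ in fold_confusions)
    fp_f = sum(fp for _, fp, _ in fold_confusions)
    fn_f = sum(fn for _, _, fn in fold_confusions)
    return 2 * tp_f / (2 * tp_f + fp_f + fn_f)

# Toy check with two folds of made-up counts.
print(f_tp_fp([(60, 9, 11), (59, 8, 12)]))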
4.5 Generalization of the concepts vector
Following the list of labels of concepts extracted for each fold with the ζ approach, we evaluate the effect of a global vector of concepts, since with our experimental setup the selected features can differ from one fold to another; that is, we evaluate a same vector of concepts across all folds. Thus, we generate different stable vectors of concepts based on the intersection and on the union of the subjects selected per fold, with the hyperparameters identified for the ζ approach. The intersection of all the subjects gets a score of 0.8662 and the union of all the subjects encountered across the folds obtains a score of 0.8714, which is better than the baseline (by more than 2%) and even better than the ζ approach. For this generalization of the concepts vector, the main gain is shown in the union-of-subjects columns of Table 5, with the increase of true positives and therefore the reduction of false negatives.
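A minimal sketch (ours) of how the stable concept vectors can be derived from the per-fold selections (M1 to M10 in Table 2); the per-fold subject sets below are toy examples built from labels mentioned in this paper.

# Subjects selected by the feature selection on each outer fold (toy example).
per_fold_subjects = [
    {"Terme médical", "Diabète", "Allergologie"},
    {"Terme médical", "Diabète", "Infection urinaire"},
    {"Terme médical", "Diabète"},
]

union_subjects = set().union(*per_fold_subjects)               # stable vector: all subjects ever selected
intersection_subjects = set.intersection(*per_fold_subjects)   # stable vector: subjects selected in every fold

print(sorted(union_subjects))
print(sorted(intersection_subjects))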
4.6 Discussion
The SPARQL query in Listing 1 allowed us to extract a list of medical subjects from DBpedia considered relatively relevant to the issue of hospitalization, since approximately 198 subjects out of 285 were annotated as relevant by the experts.

The best performing approach, ζ, selected a much smaller number of subjects with a feature selection process; this implies that the selected subjects are more precise for distinguishing hospitalized patients from the other ones (Tables 4 and 5), improving both the detection of true positives and of true negatives. The union of subjects also improves the number of true positives in comparison to the ζ approach. That means that a step involving a feature selection algorithm allows retrieving the most relevant labels of concepts in a context where the training dataset is small, and may help with annotation procedures. Although this requires a more specific selection, comparing the results obtained with the ϵ and ζ approaches shows that subjects not directly related to the patient's own case help to predict his hospitalization.

Among the 51 labels of concepts selected with the union of subjects, more generic knowledge was selected, like 'Terme médical' ('Medical Terminology'); one possibility could be that the general practitioner uses a technical terminology in a situation involving a complex medical case. Numerous concepts related to the patient's mental state (like 'Antidépresseur', 'Dépression (psychiatrie)', 'Psychopathologie', 'Sémiologie psychiatrique', 'Trouble de l'humeur') appear to be a cause of hospitalization. Different concepts related to allergy ('Allergologie', 'Maladie pulmonaire d'origine allergique') and infectious diseases ('Infection ORL', 'Infection urinaire', 'Infection virale', 'Virologie médicale') were selected. Concepts related to the cardiovascular system are widely represented within this set ('Dépistage et diagnostic du système cardio-vasculaire', 'Maladie cardio-vasculaire', 'Physiologie du système cardio-vasculaire', 'Signe clinique du système cardio-vasculaire', 'Trouble du rythme cardiaque'). The only concept retrieved in the family history of the patient, with the exception of 'Medical Terminology', is 'Diabète' ('Diabetes'). Among the labels of concepts selected by machine learning through feature selection, rare concepts considered irrelevant at first sight toward the problem of hospitalization, such as 'Medical Terminology', could thus find an explanation. Also, a feature selection step helps to improve the prediction of hospitalization by adding knowledge indirectly related to the patient's condition, such as family history (approach ζ).

Although the number of subjects considered as relevant by the experts is quite high, their integration into a vector representation reduced the performance obtained in comparison to the baseline; one possible explanation for this result is the limited size of our annotated corpus. One of the weaknesses of this approach is that a knowledge base like DBpedia may be incomplete (incompleteness of the properties dcterms:subject, owl:sameAs and rdf:type), which would justify, in order to obtain better results, curating the content of such a knowledge base.

The incompleteness of medical records implies a huge variety between patients and, for the same patient, from one consultation to another, according to the level of information provided by the general practitioner. Also, joint medical care by a fellow specialist, with sometimes little information available about these cares, is another negative factor. Moreover, the patient may not have been detected as being particularly at risk, or may not be very observant and may not come often to consultations; this shows the interest of being able to work on patient trajectories and to set up a health data warehouse combining several sources.

Reports of the consultations contain experts' abbreviations, and thus being able to distinguish abbreviations and their meanings in a given medical context would lead to significant improvements in the knowledge extraction task. We plan to detect negation and experiencer in future work, since a pathology affecting a patient's relative or the negation of a pathology does not carry the same meaning when it comes to predicting a patient's hospitalization.

5 CONCLUSION
In this paper, we have presented a method to extract from the DBpedia knowledge base subjects related to the medical domain. Then, we evaluated their performance with different machine learning algorithms to predict hospitalization when they are injected in the vector representation of EMRs. Deciding the relevancy of given subjects for a specific prediction task appeared to be quite difficult and subjective for human experts, with a high variability in their annotations. To overcome this problem, we integrated an automatic step allowing annotators to confirm their intuitions. We generated different vector representations coupling concept vectors and bags-of-words, evaluated their performance for prediction with different machine learning algorithms, and computed inter-rater reliability metrics for different sets of concepts, whether selected by the human or by the machine. Our contributions are the automatic extraction of DBpedia subjects and their injection into the representation of EMRs, the coupling with a feature selection method to select relevant resources towards hospitalization risks, and the selection and evaluation of subjects by both human and machine annotators.

As future work, we plan to train our own model of DBpedia Spotlight in order to further avoid noise with named entities from other domains. We also intend to investigate different depth levels of subjects, since so far we only integrated the knowledge on the direct subject, and to deal with the recognition of complex expressions, experiencer and entity negation.

ACKNOWLEDGMENTS
This work is partly funded by the French government labelled PIA program under its IDEX UCAJEDI project (ANR-15-IDEX-0001).

REFERENCES
[1] Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics 34, 4 (2008), 555–596.
[2] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, Feb (2012), 281–305.
[3] Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
[4] Gavin C Cawley and Nicola LC Talbot. 2010. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research 11, Jul (2010), 2079–2107.
[5] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2, 3 (2011), 27.
[6] Choi et al. 2017. GRAM: graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 787–795.
[7] Olivier Corby and Catherine Faron Zucker. 2010. The KGRAM abstract machine for knowledge graph querying. In Web Intelligence and Intelligent Agent Technology (WI-IAT), Vol. 1. IEEE, 338–341.
[8] Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes. 2013. Improving Efficiency and Accuracy in Multilingual Entity Extraction. In Proceedings of the 9th International Conference on Semantic Systems (I-Semantics).
[9] George Forman and Martin Scholz. 2010. Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter 12, 1 (2010), 49–57.
[10] Oana Frunza, Diana Inkpen, and Thomas Tran. 2011. A machine learning approach for identifying disease-treatment relations in short texts. IEEE Transactions on Knowledge and Data Engineering 23, 6 (2011), 801–814.
[11] Raphaël Gazzotti, Catherine Faron Zucker, Fabien Gandon, Virginie Lacroix-Hugues, and David Darmon. 2019. Injecting Domain Knowledge in Electronic Medical Records to Improve Hospitalization Prediction. In ESWC 2019 - The 16th Extended Semantic Web Conference (Lecture Notes in Computer Science), Vol. 11503. Portorož, Slovenia, 116–130. https://doi.org/10.1007/978-3-030-21348-0_8
[12] Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. arXiv preprint arXiv:1908.07898 (2019).
[13] Benjamin A Goldstein, Ann Marie Navar, Michael J Pencina, and John Ioannidis. 2017. Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review. Journal of the American Medical Informatics Association 24, 1 (2017), 198–208.
[14] Klaus Krippendorff. 1970. Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement 30, 1 (1970), 61–70.
[15] V Lacroix-Hugues, David Darmon, C Pradier, and Pascal Staccini. 2017. Creation of the First French Database in Primary Care Using the ICPC2: Feasibility Study. Studies in Health Technology and Informatics 245 (2017), 462–466.
[16] Peter McCullagh and John A Nelder. 1989. Generalized Linear Models. Vol. 37. CRC Press.
[17] Fco Javier Ordónez, Paula de Toledo, and Araceli Sanchis. 2013. Activity recognition using hybrid generative/discriminative models on home environments using binary sensors. Sensors 13, 5 (2013), 5460–5477.
[18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[19] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[20] Alberto G Salguero, Macarena Espinilla, Pablo Delatorre, and Javier Medina. 2018. Using Ontologies for the Online Recognition of Activities of Daily Living. Sensors 18, 4 (2018), 1202.
[21] Andon Tchechmedjiev, Amine Abdaoui, Vincent Emonet, Stella Zevio, and Clement Jonquet. 2018. SIFR annotator: ontology-based semantic annotation of French biomedical text and clinical notes. BMC Bioinformatics 19, 1 (2018), 405.
[22] Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 1 (1996), 267–288.
[23] Patricia L Whetzel, Natalya F Noy, Nigam H Shah, Paul R Alexander, Csongor Nyulas, Tania Tudorache, and Mark A Musen. 2011. BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Research 39, suppl_2 (2011), W541–W545.

6 APPENDIX
ζ with the logistic regression algorithm (LR) uses the following parameters:
• Fold 1: 'C': 0.056049240151690681, 'penalty': 'l2'.
• Fold 2: 'C': 0.83617364781543058, 'penalty': 'l2'.
• Fold 3: 'C': 0.078134513655501683, 'penalty': 'l2'.
• Fold 4: 'C': 0.070037689307546724, 'penalty': 'l2'.
• Fold 5: 'C': 0.030094071461144355, 'penalty': 'l2'.
• Fold 6: 'C': 0.19901721018094651, 'penalty': 'l2'.
• Fold 7: 'C': 0.16012788113832127, 'penalty': 'l2'.
• Fold 8: 'C': 0.067362109991791305, 'penalty': 'l2'.
• Fold 9: 'C': 0.034161307706627134, 'penalty': 'l2'.
• Fold 10: 'C': 0.055643396004174048, 'penalty': 'l2'.