Natural Language Processing for Biomedical Text Mining
Thierry Hamon
LIMSI, CNRS, Université Paris-Saclay, Orsay, France
Université Paris 13, Sorbonne Paris Cité, Villetaneuse, France
May 2019

Who am I?
  Associate professor in Computer Science at the University Paris 13
  Head of the major in Computer Science at the Engineering School Sup Galilée
  International coordinator of the Engineering School Sup Galilée
  Research:
    Lab: LIMSI
    Topics: Natural Language Processing, Terminological Acquisition, Text mining

Sup Galilée
  The Engineering School of the University Paris 13
  Institut Galilée (Science Faculty)

French Higher Education Organisation: Sup Galilée in the Bologna process
  Level    Bachelor/Master    Engineer diploma
  Bac+5    M2                 Engineer 3
  Bac+4    M1                 Engineer 2
  Bac+3    B3                 Engineer 1
  Bac+2    B2                 CP2I-2
  Bac+1    B1                 CP2I-1
  Baccalauréat (secondary school)

Engineering School Sup Galilée
  The Engineering School of the University Paris 13
  Component of the Institut Galilée (Science Faculty)
  Public school
  Recognized by the engineering accreditation board (CTI)
  Label EUR-ACE®
  Selective program
  Double diploma with ESPRIT (Tunisia), ENSMR (Morocco)
  ERASMUS+ program with Tunisia and Morocco

Engineering School Sup Galilée
  Engineering Master Degree (confers a master's degree), with majors in:
    Energy
    Applied Mathematics
    Instrumentation
    Telecommunications and Networks
    Computer Science
  About 400 students
  + Engineering School Training Program (CP2I)
  About 180 bachelor students

University Paris 13 (President: Jean-Pierre Astruc)
  A multidisciplinary university in the north of Paris
  Humanities, Law, Economics, Communication, Management, Medicine and Biology
  3 University Institutes of Technology (IUT, the first two years of the Bachelor level)
  Sciences: Institut Galilée
  More than 24,000 students on 5 campuses

Engineer Master Degree: Learning organization
  180 credits over 6 semesters
  Total number of hours over 3 years: 1800 to 2000 h
    700 to 800 h in 1st year, 700 to 800 h in 2nd year, 400 h in 3rd year
  Lectures, tutorials and lab work in:
    Non-scientific courses (languages, humanities, corporate)
    Scientific courses (scientific culture and major)
  Balance between theoretical and practical approaches

Engineer Master Degree: Learning organization
  Projects
    10% of the hourly volume
    Different types of project every year
  3 internships
    Discovery of the company environment: 1 month, 1st year
    Technical internship: 2 to 4 months, 2nd year
    Engineer internship (end of studies): 4 to 6 months, 3rd year
    Minimum number of weeks over 3 years: 28
    Internship in industry or in an academic laboratory (minimum 14 weeks in industry)

Requirements for the diploma
  180 ECTS over 6 semesters
  English: B2+ (TOEIC) at the end of the 3 years
    This level can still be obtained up to 2 years after the end of the 3rd year
  28 weeks of internship
  1 or 2 months abroad (for studying or working)
  If one of the requirements is not met, the student cannot be awarded the Engineer diploma

Common courses
  Common scientific courses (1st semester, 8.5% of the teaching time)
    Knowledge and understanding of the common scientific background for engineers
    Mathematics for engineers, Probability and Statistics, Analysis and Data processing
    Basic programming (C)
  General culture courses (1 day/week, 24% of the teaching time)
    English
    Corporate culture related to an annual topic:
      Employment of engineers, team work
      Company and entrepreneurship
      Labor/employment law, IT legislation
      ...

Computer Science
  Engineers in Computer Science with a general background:
    Software: life cycle, components, modeling
    Computing: paradigms, algorithmics, modeling
  Program oriented toward:
    Knowledge representation and modeling
    Software engineering and programming
    Software architecture (databases, operating systems, networks)
  During the 3rd year, elective courses associated with an innovative domain (10 credits):
    Decision support, Operations Research, Machine Learning
    Natural Language Processing, Text Mining, Information retrieval, Network analysis, cloud computing
  In collaboration with other specialities and masters: Internet of Things, etc.

Introduction: Context
  Most data are unstructured
    about 90% of the data produced in 2011 (1.8 trillion gigabytes) [Oracle, 2011]
    85% of the data produced in companies
  Unstructured data: textual data
    Important source of information
    Accessing and reading it is costly, time-consuming and sometimes impossible
  Need for methods for information retrieval and information extraction

Introduction: Context
  In the biomedical domain, constant increase in the amount of:
    Scientific and medical literature
      Scientific papers in digital libraries or portals
      Medical, pharmacological, epidemiological reports
    Electronic Health Records in hospitals
      Discharge summaries
      Radiological reports
    Patient-related textual data
      documents explaining diseases to patients, health behaviors
      social media (online discussion forums, Twitter messages)

Introduction: Context
  Example: scientific article publications
  Medline (U.S. National Library of Medicine bibliographic database) - https://www.ncbi.nlm.nih.gov/pubmed/
  Evolution of the number of references to articles in life sciences
  [Figure: Citations Added to MEDLINE® per Year]
  Currently: more than 27 million references

Introduction: What is text mining?
  Objective: extraction of useful and non-trivial knowledge from texts
    Extraction of information useful for a given application from textual data, i.e. written in natural language
    Collecting and linking this information
  Feed databases or knowledge bases with information extracted from texts
  Indirectly: allow data mining on unstructured/textual data

Introduction: Data mining vs. Text mining
  Data mining
    Methods and algorithms to explore structured data, drawn from databases, data warehouses or knowledge bases
    Objectives: highlight rules, identify trends or behaviours which are invisible to humans
  Text mining
    Methods and algorithms to explore unstructured data, i.e. texts written in natural language
    Objectives: extraction and categorisation of the information available in the texts

Introduction: What are text mining applications?
  EHR:
    Search and find relevant information, hospital information system
    Provide synthetic views of patient-related information
  EHR / Scientific literature:
    Information storage in databases for statistics, epidemiologic surveys, hospital information systems, etc.
    Formalize information or knowledge
  Social media:
    Epidemiologic analysis, therapeutic patient education, identification of potential adverse drug effects

Introduction: What information to identify?
  Semantic entities: terms with semantic types
  Semantic relations between entities
  Temporal information related to events
  Numerical information
  Modifiers for identifying polarity, modality, presence/absence, uncertainty

Introduction: Needs for the analysis of biomedical texts
  Various resources:
    Terminologies, ontologies, Linked Open Data
    Lexica, Consumer Health Vocabularies
    Semantic description of entities
  NLP approaches and methods:
    Rule-based approaches (more or less sophisticated regular expressions)
    Machine Learning approaches (supervised, semi-supervised, unsupervised)
  Evaluation against independent reference data

Introduction: Difficulties
  Textual data may be noisy, sparse, multilingual
  Text processing is time-consuming and may require contextual information
  Terminological and semantic variation, semantic ambiguity, unknown or new words and terms, etc.
    → High and unpredictable number of dimensions
  Complex and embedded semantic relations

Introduction: Difficulties
  Ambiguities of natural language at each level:
    lexicon:
      spell[N] vs. spell[V], Apple[company] vs. apple[fruit]
      гори[V] (a form of burn) vs. гори (inflectional form of mountain)
    syntax:
      the doctor examines the patient with a stethoscope
      Joe experienced severe shortness of breath and chest pain at home while having sex, which became more unpleasant at the emergency room.

Introduction: Difficulties
  Ambiguities of the natural