Natural Language Processing for biomedical text mining

Thierry Hamon

LIMSI, CNRS, Université Paris-Saclay, Orsay, France Université Paris 13, Sorbonne Paris Cité, Villetaneuse, France [email protected]

May 2019

Who am I?

Associate professor in Computer Science at the University Paris 13
Head of the major in Computer Science at the Engineering School Sup Galilée
International coordinator of the Engineering School Sup Galilée
Research:
  Lab: LIMSI
  Topics: Natural Language Processing, Terminological Acquisition, Text mining

Sup Galilée
The Engineering School of the University Paris 13
Institut Galilée (Science Faculty)

French Higher Education Organisation
Sup Galilée in the Bologna process

Level    Bachelor/Master    Engineer diploma
Bac+5    M2                 Engineer 3
Bac+4    M1                 Engineer 2
Bac+3    B3                 Engineer 1
Bac+2    B2                 CP2I-2
Bac+1    B1                 CP2I-1
Baccalauréat (secondary school)

Engineering School Sup Galilée

The Engineering School of the University Paris 13
Component of the Institut Galilée (Science Faculty)
Public school
Recognized by the engineering accreditation board (CTI)
EUR-ACE® label
Selective program

Double diploma with ESPRIT (Tunisia) and ENSMR (Morocco)
ERASMUS+ program with Tunisia and Morocco

Engineering School Sup Galilée

Engineering Master Degree (awards a master's degree), with a major in:
  Energy
  Applied Mathematics
  Instrumentation
  Telecommunications and Networks
  Computer Science
About 400 students
+ Engineering School Training Program (CP2I)
About 180 bachelor students

University Paris 13 (President: Jean-Pierre Astruc)

A multidisciplinary university in the north of Paris
Humanities, Law, Economics, Communication, Management, Medicine and Biology
3 University Institutes of Technology (IUT, covering the first two years of Bachelor level)
Sciences: Institut Galilée
More than 24,000 students on 5 campuses

Engineer Master Degree: Learning organization

180 credits over 6 semesters

Total number of hours over 3 years: 1800-2000 h
  800/700 h in 1st year, 800/700 h in 2nd year, 400 h in 3rd year
Lectures, tutorials and lab work in:
  Non-scientific courses (languages, humanities, corporate)
  Scientific courses (scientific culture and major)

Balance between theoretical and practical approaches

Engineer Master Degree: Learning organization

Projects
  10% of the hourly volume
  Different types of project every year
3 internships
  Discovery of the company environment: 1 month / 1st year
  Technical internship: 2 to 4 months / 2nd year
  Engineer internship (end of studies): 4 to 6 months / 3rd year
  Minimum number of weeks over 3 years: 28
  Internship in industry or academic laboratory (minimum 14 weeks in industry)

Requirements for the diploma

180 ECTS over 6 semesters
English: B2+ (TOEIC) at the end of the 3 years
  This level can still be obtained up to 2 years after the end of the 3rd year
28 weeks of internship
1 or 2 months abroad (for studying or working)
If one of the requirements is not met, the student cannot be awarded the Engineer diploma

Common courses

Common scientific courses (1st semester, 8.5% of the teaching time)
  Knowledge and understanding of the common scientific background for engineers
  Mathematics for engineers, Probability and Statistics, Analysis and Data processing
  Basic programming (C)
General culture courses (1 day/week, 24% of the teaching time)
  English
  Corporate culture related to an annual topic:
    Employment of engineers, team work
    Company and entrepreneurship
    Labor/employment laws, IT legislation
    ...

Computer Science

Engineers in Computer Science with a general background:
  Software: life cycle, components, modeling
  Computing: paradigms, algorithmics, modeling
Program oriented toward:
  Knowledge representation and modeling
  Software engineering and programming
  Software architecture (databases, operating systems, networks)
During the 3rd year, elective courses associated with an innovative domain (10 credits):
  Decision support, Operations Research, Natural Language Processing, Text Mining, Information Retrieval, Network analysis, Cloud computing
  In collaboration with other specialities and masters: Internet of Things, etc.

Natural Language Processing for biomedical text mining

Introduction: Context

Most data are unstructured
  about 90% of the data produced in 2011 (1.8 trillion gigabytes) [Oracle, 2011]
  85% of the data produced in companies are textual data
Important source of information
Accessing and reading are costly, time-consuming and sometimes impossible
Need for methods for information retrieval and

Introduction: Context

In the biomedical domain, constant increase in the amount of
  Scientific medical literature
    Scientific papers in digital libraries or portals
    Medical, pharmacological, epidemiological reports
  Electronic Health Records in hospitals
    Discharge summaries
    Radiological reports
  Patient-related textual data
    documents explaining diseases to patients, health behaviors
    social media (online discussion forums, Twitter messages)

Introduction: Context

Example: Scientific article publications
Medline (U.S. National Library of Medicine bibliographic database) - https://www.ncbi.nlm.nih.gov/pubmed/
Evolution of the number of references to articles in life sciences

[Figure: Citations Added to MEDLINE® per Year]

Currently: more than 27 million references

Introduction: What is text mining?

Objective: extraction of useful and non-trivial knowledge from texts
  Extraction of information useful for a given application from textual data, i.e. written in natural language
  Collecting and linking this information
Feed databases or knowledge bases with information extracted from texts
Indirectly: allow data mining on unstructured/textual data

Introduction: Data mining vs. Text mining

Data mining
  Methods and algorithms to explore structured data, coming from databases, data warehouses or knowledge bases
  Objectives: highlight rules, identify trends or behaviours which are invisible to humans
Text mining
  Methods and algorithms to explore unstructured data, i.e. texts written in natural language
  Objectives: extraction and categorisation of information available in the texts

Introduction: What are text mining applications?

EHR:
  Search and find relevant information, hospital information system
  Provide synthetic views of patient-related information
EHR / Scientific literature:
  Information storage in databases for statistics, epidemiological surveys, hospital information systems, etc.
  Formalize information or knowledge
Social media:
  Epidemiological analysis, therapeutic patient education, identification of potential adverse drug effects

Introduction: What information to identify?

Semantic entities: terms with semantic types
Semantic relations between entities
Temporal information related to events
Numerical information
Modifiers for identifying polarity, modality, presence/absence, uncertainty

Introduction: Needs for the analysis of biomedical texts

Various resources:
  Terminologies, Ontologies, Linked Open Data
  Lexica, Consumer Health Vocabularies
  Semantic description of entities
NLP approaches and methods:
  Rule-based approaches (more or less sophisticated regular expressions)
  Machine Learning approaches (supervised, semi-supervised, unsupervised)
Evaluation against independent reference data

Introduction: Difficulties

Textual data may be noisy, sparse, multilingual
Text processing is time-consuming and may require contextual information
Terminological and semantic variation, semantic ambiguity, unknown or new words and terms, etc.
  → High and unpredictable number of dimensions
Complex and embedded semantic relations

Introduction: Difficulties

Ambiguities of natural language at each level:
  lexicon:
    spell[N] vs. spell[V], Apple[company] vs. apple[fruit]
    гори[V] (a form of burn) vs. гори (inflectional form of mountain)
  syntax:
    the doctor examines the patient with a stethoscope
    Joe experienced severe shortness of breath and chest pain at home while having sex, which became more unpleasant at the emergency room.

Introduction: Difficulties

Ambiguities of natural language at each level:
  semantics:
    a red pencil, He reached the bank.
    поділися (form of disappear) vs. поділися (lemma of share)
  pragmatics:
    The chicken is ready to eat.
    Margaret invited Susan for a visit, and she gave her a good lunch.
    a very pleasant patient

Introduction: Difficulties

Variation in semantically similar wording:
  Bayer is buying Monsanto
  Bayer clinches Monsanto
  Bayer and Monsanto [...] will merge
  Bayer's announced acquisition of Monsanto
  Monsanto-Bayer merger
Metonymy: the latest Apple/Samsung
Metaphor: Web giants, or noir (black gold in French)
Spelling errors: Appel (call in French) / Apple
Mix of Latin and Ukrainian (Cyrillic) characters, which look alike but have different Unicode code points: i vs. і, o vs. о, p vs. р, y vs. у...
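The mixed-script problem can be checked mechanically. The following is a minimal sketch, not part of the systems presented here: it relies only on Python's standard `unicodedata` module, and the helper names `scripts_in` and `is_mixed_script` are hypothetical. It assumes the first word of a character's Unicode name identifies its script.

```python
import unicodedata


def scripts_in(word):
    """Return the set of scripts (e.g. LATIN, CYRILLIC) used by the letters of a word."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            # Unicode names start with the script name,
            # e.g. "CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(word):
    """True when a single word mixes letters from several scripts."""
    return len(scripts_in(word)) > 1


latin_i = "i"           # U+0069
cyrillic_i = "\u0456"   # U+0456, visually identical in most fonts
print(latin_i == cyrillic_i)                  # False
print(is_mixed_script("v\u0456tamin"))        # True
print(is_mixed_script("vitamin"))             # False
```

Flagging such words before lexicon lookup avoids silent mismatches between a text form and a visually identical dictionary entry.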

Introduction: Three experiments in biomedical text mining

1 Recognition of medication, assertion and temporal information in EHR [?, ?, ?, ?, ?]
  Work with Natalia Grabar (CNRS STL - Lille 3), Amandine Périnet (LIM&BIO - Paris 13), Cyril Grouin, Sophie Rosset, Xavier Tannier, Pierre Zweigenbaum (LIMSI, CNRS)
2 Mining literature for identifying risk factors [?]
  Work with Martin Graña, Víctor Raggio and Hugo Naya (Institut Pasteur de Montevideo), and Natalia Grabar (CNRS STL - Lille 3)
3 Cross-Lingual Transfer Methods for Terminology Acquisition [?]
  Work with Natalia Grabar (CNRS STL - Lille 3)

Mining EHR: Mining Patients' Electronic Health Records [?, ?, ?, ?, ?]

Description of the hospitalization
A lot of (personal) information about patients
  Problems
  Therapies (treatments, drugs, etc.)
  Tests and analyses (lab data, etc.)
  Assertions regarding facts (certainty, hypothesis, etc.)
  Temporal information (useful for the clinical timeline)
The best way to record information (databases are difficult to maintain)
BUT the texts are written by practitioners: in a hurry, with mistakes, with little or incorrect syntactic structure, etc.

Mining EHR: Objectives

Identification of
  Medication names given to patients
  Related information (dosage, duration, frequency, mode of administration, reason for prescription)
  Assertion: certainty and uncertainty of information in medical texts
    focus on the relation {patient / medical problem}
  Temporal expressions: date, time and duration of medical events
Participation in several I2B2 Challenges

Mining EHR: Drug-related information

[Figure: map of the information related to the drug solumedrol. Labels recoverable from the figure: composition, is a, INN, FDI, DDI, prescribed for, adverse effects, prescription features (dosage, mode, frequency, reason, duration); example values: methylprednisolone, phosphate disodique anhydre, phosphate monosodique anhydre, lactosis, sodium, salt, steroidal anti-inflammatory, cortisone, brain oedema, allergic shock, Quincke oedema, suffocation by larynx oedema, swelling of face, arterial hypertension, acne, osteoporosis, ulcer of stomach, depression, digitaline, insulin]

Mining EHR: Assertion task

Assertion
  Certainty
    Positive certainty: The patient suffers from abdominal pain
    Negative certainty: The patient denies suffering from abdominal pain
  Degree of certainty
    Hypothesis: The patient is to call the hospital if he suffers from abdominal pain
    Condition: With shrimps, the patient suffers from abdominal pain
    Possibility: It was thought that the patient might suffer from abdominal pain

Mining EHR: Example
Medication name, associated information, assertions and time expressions

The patient is currently off diuretics at this time. Daily weights should be checked and if her weight increases by more than 3 pounds Dr. Bockoven should be notified. The patient was also started on calcitriol given elevation of parathyroid hormone. Cardiovascular: Rate and rhythm: The patient has a history of atrial fibrillation with a slow ventricular response. Two weeks ago, the patient was started on metoprolol 12.5 mg p.o. q.6 h. for rate control , however , this dose was decreased to 12.5 mg p.o. twice a day, given some bradycardia on her telemetry. The patient was also started on Flecainide 75 mg p.o. q.12 h. She will continue on these two medications upon discharge.

Mining EHR: Example
Medication name, associated information, assertions and time expressions

RRR , lots of BS's , neuro nonfocal , ext with 1+ edema. On atenolol , zestril , norvasc , premarin , detrol , lasix 60 qd , nebs prn at home. Labs sig for Cr 0.7 , CK 48 , TnI .05 , QBC 9.5 , Hct 41.3. From CV point of view , thought to be CHF exac. ROMI'd without events on monitor and diuresed 2L/day. IV Lasix 80 bid to start transitioned to 60 po bid.BNP>assay.6/17 dobut MIBI with mod sized ant septal wall defect c/w diagonal lesion , 3/22 Echo with EF 55-60% , mild LAE/RAE , no WMA , mod large RV. No further CV studies. Cont previously meds on d/c.From FEN point of view ,2 L fluid restriction , 2 g Na restriction. Nutrition consult , but pt very resistant to diet changes. From GI point of view , GERD; nexium started. From pulm point of view ,CXR c/w sl fluid overload ,no focal findings ,no pulm edema. Given NC O2 and BiPAP at night.

Mining EHR: Material
Documents

Discharge summaries: 1,249 documents (provided by the I2B2 challenges)
  2009: 649 docs in the training set, 553 docs in the test set, 17 manually annotated documents (for illustrating the annotation guidelines)
  2010: 349 annotated documents + 827 raw documents in the training set, 477 in the test set
    Assertions: 11,968 in the training set, 18,550 in the test set
  2012: 190 docs in the training set, 120 docs in the test set

Mining EHR: Material
Terminologies and lexica

Medication names:
  RxNorm (243,869 entries) and therapeutic classes and groups of medication from the FDA website
  Ambiguous medication (red blood cells, magnesium, iron): specific status during the annotation process
Medical problems: 45,898 terms (Diagnosis and Morphology axes of Snomed International), 476 terms from the training set documents
Medication-related information
  Regular expressions for frequency, dosage, duration and mode of administration
  52 identification rules for reasons: characterization of Snomed Int terms and/or extracted terms as reasons
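To illustrate what regular expressions for dosage, mode of administration and frequency might look like, here is a minimal sketch. The three patterns and the `annotate` helper are hypothetical and far smaller than the actual resources; they are only tuned to the metoprolol example quoted in these slides.

```python
import re

# Illustrative patterns only (assumptions, not the rules used in the experiments).
PATTERNS = {
    "DOSAGE": re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|g|mcg|ml|units?)\b", re.IGNORECASE),
    "MODE": re.compile(r"\b(?:p\.o\.|po|i\.v\.|iv|sc|im)(?!\w)", re.IGNORECASE),
    "FREQUENCY": re.compile(
        r"\b(?:q\.?\s?\d+\s?h\.?|bid|tid|qd|prn|twice a day)(?!\w)", re.IGNORECASE
    ),
}


def annotate(text):
    """Return (label, start, end, matched text) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.start(), match.end(), match.group()))
    return sorted(found, key=lambda a: a[1])


sentence = "the patient was started on metoprolol 12.5 mg p.o. q.6 h. for rate control"
for label, start, end, span in annotate(sentence):
    print(label, repr(span))
```

Each pattern is applied independently, so overlapping or nested matches remain possible; they have to be resolved in a later disambiguation step.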

Mining EHR: Material
Terminologies and lexica

Assertions:
  Negation: 284 markers from the NegEx resource [?]
  Lexical clues: on exertion (condition)
  Morphological clues: afebrile (negative certainty)
  Contextual information (342 markers)
    Clues in the sentence: ... could represent a multifocal pneumonic process (possible)
    Section headings: ALLERGIES, SOCIAL HISTORY, lists
  Lexico-syntactic patterns (137 patterns)
    be to (address | request | notify) DT (office | clinic | hospital) if PB (Hypothesis)
    TE to (evaluate | check | eval | consult) (from | if | with | against) PB (Possibility)

Mining EHR: Document processing

Annotation of the documents
  Use of terminological and linguistic resources, and of selection and disambiguation rules
  CRF-based models [?, ?]
  Tuning of the HeidelTime system [?, ?]
Design of post-processing modules for
  Disambiguation and negative contexts of medication names
  Computing of dependency relations between patient, medication names and related information, or assertion
Improving the CRF-based system with extracted terms [?]

Mining EHR: Enriching documents with linguistic information

Input: XML document with structural annotations
Symbolic approach: use of NLP methods, terminological resources and disambiguation rules
Processing steps (resource used in parentheses):
  Tokenisation
  Named entity tagging (dictionary of named entities)
  Word and sentence segmentation
  Part-of-speech tagging (specialised lexicon)
  Lemmatisation
  Extraction of terms
  Tagging of the terms (terminology)
  Semantic tagging (ontology)
Concurrent annotations and annotation selection
Design of post-processing modules for
  Annotation disambiguation
  Establishment of dependency relations between patient, medication names and related information, or assertion
Output: XML document with linguistic and structural annotations
Annotation based on the Ogmios NLP platform (developed during the EU Project Alvis)

Mining EHR: Enriching documents with linguistic information

Identification of the sentences, words, lemma and part-of-speech, named entities and terms with semantic types

Text:        Two weeks ago , the patient was started on metoprolol 12.5 mg p.o.
POS tags:    CD NNS RB DT NN VBD VBN IN FW CD NN SYM
Annotations: [TIMEX3] [DRUG] [DOSAGE] [MODADM]

Text:        q.6 h. for rate control ...
POS tags:    FW NP IN NN NN
Annotations: [FREQ] [DISORDER]

Text:        The patient has a history of atrial fibrillation with a slow ventricular response .
POS tags:    DT NN VBZ DT NN IN JJ NN IN DT JJ JJ NN
Annotations: [DISORDER] [DISORDER]

Mining EHR: Concurrent annotation of documents

Preparing material for document annotation
  Named Entity Recognition (frequency, duration, dosage, mode of administration)
    + internal disambiguation (avoid nested annotations of different types and merge annotations of the same type)
  Term and semantic tagging (medication and reasons, negation and reason markers, assertion)
    based on linguistic information (word and sentence segmentation, lemmatization)
    + internal disambiguation (nested terms, parenthesized medication names, etc.)
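The internal disambiguation of named entities can be sketched as follows. This is an illustration under stated assumptions, not the actual module: annotations are `(start, end, label)` character spans, overlapping annotations of the same type are merged, and an annotation nested inside a longer annotation of a different type is dropped (the "longest wins" policy is an assumption). All function names are hypothetical.

```python
def merge_same_type(annotations):
    """Merge overlapping annotations that carry the same label."""
    merged = []
    for start, end, label in sorted(annotations, key=lambda a: (a[2], a[0], a[1])):
        if merged and merged[-1][2] == label and start <= merged[-1][1]:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), label)
        else:
            merged.append((start, end, label))
    return merged


def drop_nested(annotations):
    """Drop an annotation nested inside a longer annotation of another type."""
    kept = []
    for a in annotations:
        nested = any(
            b[2] != a[2]
            and b[0] <= a[0]
            and a[1] <= b[1]
            and (b[1] - b[0]) > (a[1] - a[0])
            for b in annotations
        )
        if not nested:
            kept.append(a)
    return sorted(kept)


def disambiguate(annotations):
    return drop_nested(merge_same_type(annotations))


# "12.5 mg  twice a day": two overlapping DOSAGE spans, and a DURATION ("a day")
# nested inside a FREQ ("twice a day").
spans = [(0, 4, "DOSAGE"), (3, 7, "DOSAGE"), (10, 21, "FREQ"), (16, 21, "DURATION")]
print(disambiguate(spans))  # [(0, 7, 'DOSAGE'), (10, 21, 'FREQ')]
```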

Mining EHR: Time expression identification [?]

Tuning of the HeidelTime system [?] for English and French EHR
  Enrichment and encoding of linguistic temporal expressions specific to the medical and clinical domain: post-operative day #, b.i.d. (meaning twice a day), day of life, etc.
  Admission date as the reference or starting point for computing relative dates and their normalised values
    if the admission date is 14 June 2017, the normalised value of 2 days later is 16 June 2017
  Additional normalizations of the temporal expressions:
    normalization of durations into approximate numerical values to avoid undefined values
    external computation for some durations and frequencies, due to limitations in HeidelTime's internal arithmetic processor
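The admission-date example can be reproduced with a few lines of Python. This is only a sketch of the idea of anchoring relative expressions on a reference date: it does not use HeidelTime or its rule language, and the function `normalise_relative` and the tiny pattern it accepts are assumptions made for illustration.

```python
import re
from datetime import date, timedelta


def normalise_relative(expression, admission_date):
    """Normalise expressions such as '2 days later' relative to the admission date.

    Returns a date, or None when the expression is not covered by this sketch.
    """
    match = re.fullmatch(r"(\d+) days? (later|earlier|ago)", expression.strip())
    if match is None:
        return None
    offset = timedelta(days=int(match.group(1)))
    if match.group(2) == "later":
        return admission_date + offset
    return admission_date - offset


admission = date(2017, 6, 14)
print(normalise_relative("2 days later", admission))  # 2017-06-16
```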

Mining EHR: Annotation selection

Processing of ambiguous medication names: laboratory data or medication
  if in a list section: status changed to medication
    HOME MEDS: methadone 20 bid, imdur 120 bid, hydral taking 25 bid, lasix 20 bid, coumadin, colace, iron, nexium 40 bid
Rejection of medication names: if in allergy sections
  ALLERGY: prednisone, penicillins, tamsulosin, simvastatin
Removal of drug names in negative contexts

Mining EHR: Annotation selection

Guessing new drug names with the semantic pattern m do mo? f [?]
  1 Noun phrases recognized by the term extractor YATEA
  2 Stopwords rejected
  3 Filtering with typical suffixes of medication names

Diovan 160mg PO BID, HCTZ 25mg PO QD, Imdur ER 60mg PO QD, NTG .4mg PRN CP, Norvasc 10mg PO QD, Pavachol 80mg PO QD.
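A rough sketch of the pattern-based guessing follows. It is a simplification and not the method described above: capitalised tokens stand in for the noun phrases that YATEA would provide, the regular expression and the lexicon `KNOWN_MEDICATIONS` are invented for the example, and the stopword and suffix filters are omitted.

```python
import re

# m do mo? f: candidate name, dosage, optional mode, frequency (illustrative regex).
PATTERN = re.compile(
    r"\b(?P<m>[A-Z][\w-]*(?:\s[A-Z][\w-]*)*)\s"
    r"(?P<do>\d*\.?\d+\s?mg)\s"
    r"(?:(?P<mo>PO|IV|SC)\s)?"
    r"(?P<f>BID|QD|TID|PRN)\b"
)

# Toy lexicon, assumed for the example only.
KNOWN_MEDICATIONS = {"diovan", "norvasc", "imdur er"}


def candidate_names(text):
    """Names found in the m slot of the pattern, in order of appearance."""
    return [match.group("m") for match in PATTERN.finditer(text)]


def guess_new_names(text):
    """Candidates that are absent from the known lexicon."""
    return [m for m in candidate_names(text) if m.lower() not in KNOWN_MEDICATIONS]


line = ("Diovan 160mg PO BID, HCTZ 25mg PO QD, Imdur ER 60mg PO QD, "
        "NTG .4mg PRN CP, Norvasc 10mg PO QD, Pavachol 80mg PO QD.")
print(candidate_names(line))
print(guess_new_names(line))
```

The strong context provided by dosage and frequency is what makes an unknown token in the m slot a plausible medication name.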

Mining EHR: Results
Medication task
Focus on various parameters for reason identification and guessing medication names

F-measure (differences with respect to RUN2 in parentheses)

          RUN2      RUN1                RUN3
System    0.7801    0.7681 (-0.0120)    0.7719 (-0.0082)
m         0.8142    0.8093 (-0.0049)    0.808  (-0.0062)
do        0.8234    0.8172 (-0.0062)    0.821  (-0.0024)
f         0.837     0.8304 (-0.0066)    0.8345 (-0.0025)
mo        0.8655    0.8577 (-0.0078)    0.8624 (-0.0031)
du        0.3575    0.3516 (-0.0059)    0.3505 (-0.0070)
r         0.2867    0.2759 (-0.0108)    0.2666 (-0.0201)

(m: medication, do: dosage, f: frequency, mo: mode of administration, du: duration, r: reason)

RUN1: All reasons
RUN2: All reasons without semantic tagging and reason markers
RUN3: All reasons without semantic tagging and use of reason markers
Guessing medication names

Mining EHR: Results
Medication task

          exact                       inexact
          F       P       R           F       P       R
System    0.7801  0.7997  0.7614      0.7792  0.8111  0.7497
m         0.8142  0.8448  0.7858      0.8304  0.8666  0.7971
do        0.8234  0.8728  0.7793      0.8503  0.8799  0.8226
f         0.837   0.8306  0.8435      0.8411  0.8436  0.8386
mo        0.8655  0.8543  0.877       0.863   0.844   0.8828
du        0.3575  0.3483  0.3673      0.3607  0.3669  0.3546
r         0.2867  0.3047  0.2708      0.3386  0.4386  0.2757

Reason: difficult to identify the exact noun phrases (-13% between inexact and exact precision)
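As a reminder of how the F column relates to P and R, here is a minimal sketch assuming the balanced F-measure (harmonic mean of precision and recall). The function name is hypothetical; the input values are the exact-match System figures reported above.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Exact-match System scores: P = 0.7997, R = 0.7614
print(round(f_measure(0.7997, 0.7614), 4))  # 0.7801
```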

Mining EHR: Results
Assertion task and time expression identification

List of markers + section headings

                               Training            Test
Categories                     P     R     F       P     R     F
Associated to somebody else    0.96  0.80  0.88    0.84  0.74  0.79
Hypothesis                     0.71  0.31  0.43    0.63  0.24  0.35
Condition                      0.08  0.40  0.14    0.08  0.33  0.12
Possibility                    0.46  0.57  0.51    0.51  0.47  0.49
Absent                         0.92  0.75  0.82    0.87  0.75  0.81
Present                        0.86  0.90  0.88    0.84  0.87  0.86
Assertions                     0.82  0.82  0.82    0.80  0.80  0.80

                        Precision  Recall  F-measure
Temporal expressions    0.8611     0.8170  0.8385

Mining EHR: Conclusion

F-measure of the system: 0.800 (avg)
Analysis of the resource contribution:
  Importance of the markers
  Need to include syntactic structures
  Difficulty in identifying certainty degrees
  Few examples for condition and hypothesis

Mining EHR: Further improvements

Medication tasks:
  Duration extraction: identification of specific prepositional phrases based on parsing
  Medical problem identification: development of a specific reasoning module
Assertion task:
  Enrich resources with synonyms (WordNet)
  Improving the patterns:
    using syntactic dependencies
    integrating semantic classes (verbs of evidence, verbs to get in touch with somebody, etc.)

Mining Literature: Mining literature to identify risk factors and the associated pathologies [?]

Objective: massive exploitation of the Medline bibliographical database for extracting risk factors and their associations with health conditions
Risk factors: increase people's chance of developing a given disease
Information on risk factors is spread widely over the web: websites, bibliographical databases, ...
Previous work:
  Genomic scientific literature (BioCreative, TREC Genomics), clinical records (I2B2 NLP Challenge 2014), processing of narratives [?]
  Data mining (KDD challenge 2004) [?, ?, ?]

Mining Literature: Material

Bibliographical database Medline (titles, abstracts)
  Selection of potential citations/PMIDs, i.e. containing the sequences risk factors, factor of risk
  187,544 citations selected: over 42 million word occurrences
MeSH (thesaurus for information storage and retrieval)
  Disease-related MeSH term recognition in citations

Mining Literature: Document processing

1 Annotation of Medline citations with linguistic information
  Ogmios NLP platform [?]
  Segmentation, POS-tagging & lemmatization: Genia Tagger [?]
  Term recognition but also term extraction: YATEA [?]
2 Risk factors identification

Mining Literature: Term recognition vs. Term extraction

Term recognition: tagging of texts with terms coming from a terminology
  Use of more or less complex methods (string matching, terminological variant computing, semantic distances, ML methods...)
Term extraction: discovery of terms in texts
  Identification of noun phrases which are potential terms (term candidates)
  Computing of
    the strength of the term components (unithood)
    the strength of the relation to the domain (termhood) [?]
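The simplest form of term recognition, string matching against a terminology, can be sketched as a greedy longest-match lookup. This is an illustration only: the function `recognise_terms`, the `max_len` window and the three-entry terminology are assumptions, and real systems add variant handling and normalisation.

```python
def recognise_terms(tokens, terminology, max_len=5):
    """Greedy longest-match lookup of terminology entries in a token list.

    Returns (start index, end index, term) triples, with end exclusive.
    """
    lowered = [token.lower() for token in tokens]
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so that nested shorter terms are skipped.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(lowered[i:i + n])
            if candidate in terminology:
                found.append((i, i + n, candidate))
                i += n
                break
        else:
            i += 1
    return found


terminology = {"atrial fibrillation", "fibrillation", "slow ventricular response"}
tokens = "The patient has a history of atrial fibrillation with a slow ventricular response".split()
print(recognise_terms(tokens, terminology))
# [(6, 8, 'atrial fibrillation'), (10, 13, 'slow ventricular response')]
```

Term extraction, by contrast, has no terminology to look up and must propose candidates from the text itself.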

Mining Literature: Term extraction with YATEA

Yet Another Term ExtrActor (Aubin & Hamon, 2006)
Term extraction from French and English texts
Shallow parsing of texts
  Parsing focusing on the parts of the sentence which may contain terms (usually the noun phrases)
  With recursively applied minimal parsing patterns
  Endogenous learning
Term candidate decomposition into Head and Modifier components (syntactic role of the component in the noun phrase)
Each component of a term candidate is also considered as a term candidate
Unparseable noun phrases are rejected

Mining Literature: YATEA
Yet Another Term ExtrActor (Aubin and Hamon, 2006)

Several statistical measures are associated with each term candidate (number of occurrences, C-Value1, C-Value*, etc.) [?]
CPAN module: http://search.cpan.org/~thhamon/Lingua-YaTeA/
Developed during the European project ALVIS
Description of the shallow parsing with configuration files
Possibility of tuning for a domain (BioYATEA) [?]
For other languages: on-going work for Ukrainian and Arabic

Mining Literature: Term extraction with YATEA

Texts → lemmatisation + POS tagging → Term extraction (rule-based approaches)
Identification of chunks thanks to morpho-syntactic information (frontiers: verbs, adverbs, etc.)

POS-tagged example (token_TAG):
22_CD yo_JJ male_NN ,_, h_NN /_SYM o_NN primitive_JJ neuroectodermal_JJ tumor_NN with_IN mets_NNS to_TO brain_NN and_CC spine_NN ,_, transferred_VBN from_IN Hospital1_NNP ,_, initially_RB in_IN Dept1_NNP and_CC then_RB transferred_VBN to_TO the_DT floor_NN ._. He_PRP was_VBD initially_RB diagnosed_VBN with_IN a_DT thoracic_JJ gangliogliom_NN /_/ resected_VBN in_IN 2012_CD ._. He_PRP had_VBD back_JJ pain_NN in_in 2_CD /_SYM 04_CD ,_, seen_VBN at_IN Dept2_NNP ,_, and_CC was_be found_VBN to_TO have_VB mets_NNS to_TO brain_NN and_CC spine_NN ._.

Mining Literature: Term extraction with YATEA

Parsing of the noun phrases to detect term candidates
1. Identification of term candidates described by parsing patterns
   (<H>: head of the noun phrase, <M>: modifier of the head)
   Pattern: JJ NN → M H
   neuroectodermal tumor → (neuroectodermal<M> tumor<H>)
   shortness of breath → (shortness<H> of breath<M>)

Mining Literature

Term extraction with YATEA

2. Use of the previously parsed term candidates (islands of reliability) to parse the remaining noun phrases

Example: primitive neuroectodermal tumor
- Use of the already parsed term: neuroectodermal (M, modifier) tumor (H, head)
- Temporary simplification (folding): primitive_JJ tumor_NN
- Use of the parsing pattern JJ NN → M H: primitive (M) tumor (H)
- Unfolding: primitive (M) [neuroectodermal (M) tumor (H)] (H)
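The folding and unfolding steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not YATEA's implementation: the `Node` class, the function names, and the restriction to the single pattern JJ NN → M H are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Node:
    """Binary term structure: a modifier (M) attached to a head (H)."""
    modifier: Union["Node", str]
    head: Union["Node", str]


def head_word(tree):
    """Return the lexical head of a single word or of a parsed term."""
    return head_word(tree.head) if isinstance(tree, Node) else tree


def words(tree):
    """Return the words covered by a tree (modifier first, an assumption)."""
    if isinstance(tree, Node):
        return words(tree.modifier) + words(tree.head)
    return [tree]


def parse_with_island(tagged, island):
    """Parse a phrase by folding an already parsed term (the island)."""
    tokens = [word for word, _ in tagged]
    span = words(island)
    starts = [i for i in range(len(tokens) - len(span) + 1)
              if tokens[i:i + len(span)] == span]
    if not starts:
        return None  # the island does not occur in this phrase
    start, end = starts[0], starts[0] + len(span)

    # Folding: replace the island by its head word, keeping the head's tag.
    head = head_word(island)
    head_tag = dict(tagged[start:end])[head]
    folded = tagged[:start] + [(head, head_tag)] + tagged[end:]

    # Parsing pattern (the only one handled here): JJ NN -> M H.
    if [tag for _, tag in folded] != ["JJ", "NN"]:
        return None

    # Unfolding: put the full island back in place of its head word.
    parts = [word for word, _ in folded]
    parts[start] = island
    return Node(modifier=parts[0], head=parts[1])


phrase = [("primitive", "JJ"), ("neuroectodermal", "JJ"), ("tumor", "NN")]
known_term = Node("neuroectodermal", "tumor")
print(parse_with_island(phrase, known_term))
```

Folding reduces the three-word phrase to a two-word sequence that the simple pattern can handle, which is why an already parsed term serves as an anchor for the longer one.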

Mining Literature

Term extraction with YATEA

Pipeline: texts → lemmatisation + POS tagging → term extraction (rule-based approaches) → candidate terms → term ranking (frequency, term length, C-Value) → ranked term candidates

Example of input after lemmatisation + POS tagging (word_TAG):

22_CD yo_JJ male_NN ,_, h_NN /_SYM o_NN primitive_JJ neuroectodermal_JJ tumor_NN with_IN mets_NNS to_TO brain_NN and_CC spine_NN ,_, transferred_VBN from_IN Hospital1_NNP ,_, initially_RB in_IN Dept1_NNP and_CC then_RB transferred_VBN to_TO the_DT floor_NN ._. He_PRP was_VBD initially_RB diagnosed_VBN with_IN a_DT thoracic_JJ gangliogliom_NN //resected_VBN in_IN 2012_CD ._. He_PRP had_VBD back_JJ pain_NN in_IN 2_CD /_SYM 04_CD ,_, seen_VBN at_IN Dept2_NNP ,_, and_CC was_VBD found_VBN to_TO have_VB mets_NNS to_TO brain_NN and_CC spine_NN ._.

Mining Literature

Term extraction with YATEA

Candidate terms produced by the term extraction step (rule-based approaches), in text order:

yo male, h, o, primitive neuroectodermal tumor, mets, brain, spine, floor, thoracic gangliogliom, back pain, mets, brain, spine, ...
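To make the rule-based step concrete, here is a small sketch that collects candidates from POS-tagged tokens. It assumes a single, deliberately simplified pattern (adjectives followed by nouns, i.e. JJ* NN+) and a `word_TAG` input notation; both are assumptions for this example, and YATEA's actual rules are richer.

```python
def parse_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("_", 1)) for token in text.split()]


def candidate_terms(tagged):
    """Collect maximal token runs matching JJ* NN+ (adjectives, then nouns)."""
    candidates, current, seen_noun = [], [], False
    # A sentinel token flushes the last run at the end of the input.
    for word, tag in tagged + [("", "END")]:
        is_noun = tag in ("NN", "NNS")
        is_adj = tag == "JJ"
        if is_noun:
            current.append(word)
            seen_noun = True
        elif is_adj and not seen_noun:
            current.append(word)
        else:
            if seen_noun:
                candidates.append(" ".join(current))
            # An adjective after a noun starts a new candidate.
            current = [word] if is_adj else []
            seen_noun = False
    return candidates


sample = ("22_CD yo_JJ male_NN ,_, h_NN /_SYM o_NN primitive_JJ "
          "neuroectodermal_JJ tumor_NN with_IN mets_NNS to_TO brain_NN "
          "and_CC spine_NN")
print(candidate_terms(parse_tagged(sample)))
```

On this first clause of the example text, the sketch yields the same candidates as the slide (yo male, h, o, primitive neuroectodermal tumor, mets, brain, spine). Noisy candidates such as "h" and "o" are expected at this stage, since ranking is a separate step.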

Mining Literature

Term extraction with YATEA

Ranked term candidates (f: frequency, l: term length, Cv1: C-Value score):

Term candidate                     f   l   Cv1
yo male                            1   1   1.58
h                                  1   1   1
o                                  1   1   0
mets                               2   1   2
brain                              2   1   2
spine                              2   1   2
floor                              1   1   1
thoracic gangliogliom              1   2   1.58
back pain                          1   2   1.58
primitive neuroectodermal tumor    1   3   2.32
...
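As an illustration of ranking by frequency and term length, the sketch below computes a C-Value-style score. It assumes the widely used formulation in which the frequency of a nested candidate is discounted by the mean frequency of the longer candidates containing it, with log2(length + 1) so that single-word candidates keep a non-zero weight. This is an assumption for the example: it is not claimed to reproduce the Cv1 column of the slide, which comes from the author's own variant.

```python
import math


def c_values(frequencies):
    """Score candidates; `frequencies` maps each term to its frequency."""
    scores = {}
    for term, freq in frequencies.items():
        length = len(term.split())
        # Frequencies of longer candidates containing `term` as a word sequence.
        containers = [f for other, f in frequencies.items()
                      if other != term and f" {term} " in f" {other} "]
        if containers:
            freq = freq - sum(containers) / len(containers)
        scores[term] = math.log2(length + 1) * freq
    return scores


# Frequencies of three non-nested candidates taken from the slide.
print(c_values({"spine": 2, "floor": 1, "back pain": 1}))
```

For these three candidates the scores are 2, 1 and about 1.58, which happens to match the slide. Other rows, for example "o" and "primitive neuroectodermal tumor", would come out differently under this assumed formula.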

Mining Literature
Document processing

1. Annotation of Medline citations with linguistic information
   - Ogmios NLP platform [?]
   - Segmentation, POS tagging and lemmatization: Genia Tagger [?]
   - Term recognition and extraction: YATEA [?]

2. Risk factor identification

Mining Literature
Risk factor identification

Semantico-syntactic patterns
- 5 patterns for risk factors and pathologies
- 12 patterns for handling enumerations
- 3 patterns for pathologies
Example pattern: [risk factor] as a risk factor for [pathology], where
- "as a risk factor for" is the trigger sequence
- [risk factor] stands for noun phrases corresponding to risk factors
- [pathology] stands for pathologies
- ? and * mark optional and recurrent elements
MeSH descriptors of citations
- Descriptors belonging to the C heading (diseases)

Mining Literature
Risk factor identification

Examples

Pattern: [risk factor] is a risk factor for [pathology]
...a high intake of calcium and phosphorus is a risk factor for the development of metabolic acidosis. (PMID 1435825)

Pattern: risk factors for [pathology] ,? include [risk factors]
...had more than one of the common risk factors for cerebrovascular accidents, including hypertension, advanced age, hyperfibrinogenemia, diabetes mellitus, and past history of cerebrovascular accident. (PMID 1560589)
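The two example patterns can be approximated with regular expressions. This is only a sketch under stated assumptions: the original patterns operate on noun phrases identified by the term extractor, whereas the hypothetical `extract` function below works on raw strings and captures whatever text surrounds the trigger sequence.

```python
import re

# Trigger sequences from the two example patterns; slots are plain captures.
SINGLE = re.compile(r"(.+?) is a risk factor for (.+?)\s*\.?$")
ENUM = re.compile(
    r"risk factors for (.+?)\s*,? includ(?:e|es|ing) (.+?)\s*\.?$")
# Enumeration separator: a comma, optionally followed by "and".
SEPARATOR = re.compile(r"\s*,\s*(?:and\s+)?")


def extract(sentence):
    """Return (risk factor, pathology) pairs found in one sentence."""
    text = sentence.strip()
    match = ENUM.search(text)
    if match:
        pathology, enumeration = match.group(1), match.group(2)
        items = SEPARATOR.split(enumeration)
        return [(item.strip(), pathology) for item in items if item.strip()]
    match = SINGLE.search(text)
    if match:
        return [(match.group(1).strip(), match.group(2))]
    return []


print(extract("a high intake of calcium and phosphorus is a risk factor "
              "for the development of metabolic acidosis ."))
```

Splitting the enumeration on commas only, and not on a bare "and", keeps coordinated risk factors such as "calcium and phosphorus" in one piece.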

Mining Literature
Results

Application of three kinds of patterns: (1) {risk factor, pathology}, (2) risk factors, (3) pathologies
Definition of relations:
- direct relations with the {risk factor, pathology} patterns
- combination of the information provided by (2) and (3)
10,445 PMIDs provide information:
- 313 pairs {risk factor, pathology}
- 15,398 pairs by combination of (2) and (3)
- 5,873 risk factors (2) not associated with any pathology
MeSH indexing: 5,106 pathologies and health conditions

21,584 triplets {risk factor, pathology_text?, pathology_MeSH?}
- 17,620 (14,895) pairs only provided by the patterns
- 5,717 (4,412) pairs contain MeSH descriptors as pathology

Mining Literature
Evaluation

Evaluation of precision: ratio of correct extractions among the overall results
Manual evaluation: no dedicated and comprehensive gold standard is available
Comparison with three relationships provided by Snomed CT (nomenclature for organizing and exchanging clinical data):
- has causative agent: direct cause of the disorder or finding (92,807 relations)
  bacterial endocarditis has causative agent bacterium
- due to: relates a clinical finding directly to its cause (25,309 relations)
  acute pancreatitis due to infection
- associated with: clinically relevant association between terms, without either asserting or excluding a causal or sequential relationship between the two (36,134 relations)
  fentanyl allergy has causative agent fentanyl

Mining Literature
Evaluation

1. Quality and exhaustiveness of the risk factors for a given pathology
   - Evaluation by a medical doctor of 1,102 risk factors for coronary heart disease: 88.38% precision
   - hypertension: {smoking; cigarette smoking; smoking history; importance of total life consumption of cigarettes}
2. Comparison between the text mining results for 20 pathologies (3,100 extractions, about 25%) and the Snomed CT causal and associative relations (154,130 pairs)
   - 19 extractions (0.6%) considered as already in Snomed CT
   - Snomed CT is not dedicated to risk factors, but they may occur
   - acquired immunodeficiency syndrome: {bisexuality, blood transfusion, intravenous drug abuse}

Mining Literature
Conclusion

Extraction of information related to risk factors
Relation with the associated pathologies
Text mining approach based on semantico-syntactic patterns
Evaluation by a medical doctor and a computer scientist:
- 88.38% of the risk factors related to coronary heart disease are correct
- about 70% of the extracted pathologies are equivalent to the MeSH indexing
- Snomed CT is not dedicated to the recording of risk factors, although they may occur
⇒ Creating a dedicated resource for risk factors is warranted

Mining Literature
Future work

Use of other patterns, e.g. predictor, precursor ...
Machine learning methods
Knowledge representation: homogeneous groups of risk factors (environmental, social, clinical, behavioral ...)
Characterization of this information: modal and negative contexts
Geographical and demographic variation

Terminology building by Transfer
Cross-Lingual Transfer Method for Building Ukrainian Medical Terminology [?]

Methods and automatic tools now exist for several European languages and Japanese [?, ?, ?]
For many languages:
- few NLP tools are available and suitable for automatic terminology extraction
- while textual data exist and terminological resources are required

Terminology building by Transfer
Our objective

Design of specific methods for the acquisition of such terminological resources in Ukrainian

Approaches:
- Compilation of terminological resources
- Automatic building of terminologies
Observation: increasing availability of parallel bilingual corpora

Methodology: use of specialized parallel corpora including a low-resourced language (Ukrainian) to build bilingual and trilingual terminologies by means of the cross-lingual transfer principle

Terminology building by Transfer
Cross-lingual transfer principle [?, ?]

Hypothesis:
- parallel and aligned corpora with two languages L1 and L2
- syntactic or semantic annotations and information from L1
Method:
- transpose these annotations or information from L1 to L2
- obtain the corresponding annotations and information in L2
Efficient way for [?, ?]:
- processing multilingual texts from low-resourced languages
- creating various types of annotations: part-of-speech, semantic categories or even acoustic and prosodic features

Terminology building by Transfer
Drawbacks of the transfer principle

The transfer methodology depends on:
- the quality of the information and annotations extracted from the L1 texts
- the quality of the alignment
  - usually a statistical alignment method
  - depending on the size of the corpora: the bigger, the better
→ Define an approach to bypass these drawbacks

Terminology building by Transfer
Material

Medical data in three languages (Ukrainian, French, and English):
- Ukrainian Wikipedia:
  - source of relevant terms
  - help for the word-level alignment of the MedlinePlus corpus
- MedlinePlus corpus: a collection of specialized texts providing the basis for building the terminology

Terminology building by Transfer
Medicine-related articles from Ukrainian Wikipedia

Selection of articles from the Ukrainian part of Wikipedia using medicine-related categories, such as Медицина (medicine) or Захворювання (disorders)
Potentially covers a wide range of medical notions
Use of the information in the infoboxes

Terminology building by Transfer
Parallel medical corpus [?]

Patient-oriented brochures in three languages (Ukrainian, French, and English) from MedlinePlus
- on several medical topics (body systems, disorders and conditions, diagnosis and therapy, health and wellness)
- created in English and then translated into several other languages (including French and Ukrainian)
- http://natalia.grabar.free.fr/resources.php
About 43,000 words in each language

Terminology building by Transfer
Extraction of bilingual terminology from the Wikipedia

Objective: complete and help the alignment method applied to the MedlinePlus corpus
Use of the content of the infoboxes

Processing: Ukrainian Wikipedia (medical part) → processing of the infoboxes → medical terms with MeSH codes → querying UMLS → pairs of medical terms (UK/FR and UK/EN)

Example: Цукровий діабет тип 2
- English terms from UMLS: NIDDM, Type 2 Diabetes Mellitus
- French terms from UMLS: DID2, Diabète avec insulinorésistance
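The infobox step amounts to a lookup keyed by the MeSH code. The sketch below illustrates it with a tiny in-memory dictionary standing in for a UMLS query. The function name, the dictionary layout, and the key `"MESH-CODE-1"` are hypothetical placeholders for this example, not real identifiers or a real UMLS API.

```python
def build_pairs(infobox_terms, umls_lookup):
    """Pair each Ukrainian term with the EN and FR terms found for its code.

    infobox_terms: list of (ukrainian_term, mesh_code) from the infoboxes.
    umls_lookup: stand-in for a UMLS query, mapping a code to
                 {"EN": [...], "FR": [...]}.
    """
    pairs = {"UK/EN": [], "UK/FR": []}
    for uk_term, mesh_code in infobox_terms:
        entry = umls_lookup.get(mesh_code, {})
        for lang in ("EN", "FR"):
            for term in entry.get(lang, []):
                pairs[f"UK/{lang}"].append((uk_term, term))
    return pairs


# Placeholder code; the terms are those shown in the slide example.
umls = {"MESH-CODE-1": {"EN": ["NIDDM", "Type 2 Diabetes Mellitus"],
                        "FR": ["DID2", "Diabète avec insulinorésistance"]}}
infobox = [("Цукровий діабет тип 2", "MESH-CODE-1")]
for direction, found in build_pairs(infobox, umls).items():
    print(direction, found)
```

Because one code can return several synonyms per language, a single Ukrainian term may yield several pairs.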

Terminology building by Transfer
Transfer-based term extraction

Example sentence pair:
- Ukrainian: Ракові клітини ростуть і діляться швидше, ніж здорові клітини
- English: cancer cells grow and divide more quickly than healthy cells

Steps (illustrated on the slide with a word alignment matrix between the two sentences):
1. FR & EN: POS tagging and term extraction with YATEA (terms: cancer cells, healthy cells)
2. Corpora alignment at the word level
3. Merging of the identified terms and the alignment
4. Transfer of the term annotations to the Ukrainian texts
5. Extraction of the corresponding terms in Ukrainian (Ракові клітини, здорові клітини)
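The transfer of term annotations through a word alignment can be sketched as follows. The alignment is written by hand for the example sentence pair and is an assumption, not the output of an actual aligner. The function name and the choice to take the span between the first and last aligned target words are likewise illustrative.

```python
def transfer_terms(src_tokens, tgt_tokens, alignment, term_spans):
    """Project source term spans onto the target sentence.

    alignment: set of (source_index, target_index) word links.
    term_spans: list of (start, end) token offsets, end exclusive.
    """
    transferred = []
    for start, end in term_spans:
        targets = sorted({t for s, t in alignment if s in range(start, end)})
        if not targets:
            continue  # unaligned term: nothing to transfer (silence)
        first, last = targets[0], targets[-1]
        transferred.append((" ".join(src_tokens[start:end]),
                            " ".join(tgt_tokens[first:last + 1])))
    return transferred


english = "cancer cells grow and divide more quickly than healthy cells".split()
ukrainian = ["Ракові", "клітини", "ростуть", "і", "діляться", "швидше", ",",
             "ніж", "здорові", "клітини"]
# Hand-written word links (hypothetical aligner output).
links = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 5), (7, 7),
         (8, 8), (9, 9)}
# Spans of the terms found in English: "cancer cells" and "healthy cells".
print(transfer_terms(english, ukrainian, links, [(0, 2), (8, 10)]))
```

The sketch makes the dependence on alignment quality visible: a wrong or missing link directly yields a wrong or missing Ukrainian term.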

Terminology building by Transfer
Transfer-based term extraction

Illustration of the transfer method

English: Cancer cells grow and divide more quickly than healthy cells. Cancer treatments are made to work on these fast growing cells.
Ukrainian: Ракові клітини ростуть і діляться швидше, ніж здорові клітини. При лікуванні раку здійснюється вплив на ці клітини, що швидко ростуть.

Aligned list items (English / Ukrainian):
- Tiredness / Втома
- Nausea or vomiting / Нудота або блювота
- Pain / Біль
- Hair loss called alopecia / Втрата волосся, що називається алопецією

Terminology building by Transfer
Extraction of bilingual terminology from the MedlinePlus corpus

Common preprocessing of the MedlinePlus corpora (UK/FR & UK/EN):
1. Cleaning and manual paragraph alignment
2. POS tagging with TreeTagger and Flemm
3. FR & EN term extraction with YATEA

Transfer 1:
- Extraction of the UK terms corresponding to lines
- Cross-fertilization with single-word terms
- Result: pairs of candidate terms (UK/FR and UK/EN)

Transfer 2:
- Giza++ suite (including MkCls): MedlinePlus corpora aligned at the word level
- UK term extraction by transfer
- Cross-fertilization with single-word terms
- Result: pairs of candidate terms (UK/FR and UK/EN)

Additional resource: Wikipedia pairs of medical terms
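One plausible reading of the Transfer 1 step "extraction of the UK terms corresponding to lines" is that, when a whole French or English line is itself an extracted term, the aligned Ukrainian line is taken as the corresponding term. The sketch below illustrates that reading only; the function name, the normalisation, and the list of extracted terms are assumptions for this example.

```python
def transfer_by_lines(aligned_lines, extracted_terms):
    """Pair a source term with the Ukrainian line aligned to it.

    aligned_lines: list of (source_line, ukrainian_line) pairs.
    extracted_terms: terms found in the source language (hypothetical
                     output of the term extractor).
    """
    known = {term.lower() for term in extracted_terms}
    pairs = []
    for source_line, uk_line in aligned_lines:
        # Drop list markers and final punctuation before comparing.
        key = source_line.strip(" -.").lower()
        if key in known:
            pairs.append((key, uk_line.strip(" -.")))
    return pairs


lines = [("- Tiredness", "- Втома"),
         ("- Nausea or vomiting", "- Нудота або блювота"),
         ("- Pain", "- Біль")]
print(transfer_by_lines(lines, ["tiredness", "pain", "cancer cells"]))
```

Under this reading no word alignment is needed, but a line that is only partly covered by a term is skipped.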

Terminology building by Transfer
Evaluation

Performed by a Ukrainian native speaker with knowledge of medical informatics
Manual checking of the extracted candidates: correct / not correct
Validation:
- Terms: independently in each language
- Bilingual and trilingual relations
Computation of the precision of the results:

$\text{precision} = \frac{\text{correct answers}}{\text{all the answers}}$

with exact and inexact match (the correct term is included in or includes the candidate)
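The two precision figures can be computed as in the sketch below. In the evaluation described here the judgements were made manually. The `judge` function is only a hypothetical string-based stand-in that labels a candidate as exact, inexact (one term includes the other as a word sequence), or wrong.

```python
def judge(candidate, reference):
    """Label a candidate against a reference term."""
    if candidate == reference:
        return "exact"
    padded_candidate, padded_reference = f" {candidate} ", f" {reference} "
    # Inexact: one term is contained in the other at word boundaries.
    if padded_candidate in padded_reference or padded_reference in padded_candidate:
        return "inexact"
    return "wrong"


def precision(labels, allow_inexact=False):
    """Correct answers divided by all the answers."""
    if not labels:
        return 0.0
    accepted = {"exact", "inexact"} if allow_inexact else {"exact"}
    return sum(label in accepted for label in labels) / len(labels)


# English glosses are used as illustrative data.
answers = [("sleep", "sleep"), ("you can sleep", "sleep"),
           ("syringe", "cholesterol"), ("mouth sores", "mouth sores")]
labels = [judge(candidate, reference) for candidate, reference in answers]
print(precision(labels), precision(labels, allow_inexact=True))
```

Precision with inexact match can never be lower than precision with exact match, because every exact match also counts as correct under the looser criterion.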

Terminology building by Transfer
Results

Bilingual terminology from Wikipedia
- 357 Ukrainian medical terms (among them, 177 single-word terms)
- Use of the MeSH codes and UMLS:
  - 1,428 French terms (among them, 339 single-word terms)
  - 3,625 English terms (among them, 448 single-word terms)
  - The difference with the number of Ukrainian terms is due to the MeSH synonyms
- Bilingual pairs:
  - 1,515 Ukrainian/French term pairs (270 pairs between single-word terms)
  - 3,789 Ukrainian/English term pairs (405 pairs between single-word terms)
- Precision: 1, because of the collecting method

Terminology building by Transfer
Results

Bilingual terminology from MedlinePlus: Transfer 1
- 436 Ukrainian terms with 0.966 precision, associated with 316 French terms and 354 English terms
- 282 triples between Ukrainian/French/English terms (prec.: 0.954)
- 63 pairs only between Ukrainian/French terms (prec.: 0.937)
- 115 pairs only between Ukrainian/English terms (prec.: 0.965)
Relations:
- involving synonyms: {втома, fatigue/tiredness}, {фаллопієва труба, trompes de fallope/trompe utérine} (fallopian tube), {втрата слуху/втрачається слух, hearing loss}
- associating several case forms with the same English or French form: {вагітність, pregnancy} and {вагітності, pregnancy}
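Relations that attach several Ukrainian forms to the same French or English term can be surfaced by grouping the pairs on their target side, as in this small sketch. The function name and the sample pairs are illustrative. Deciding whether the grouped forms are case variants or synonyms would still require linguistic knowledge.

```python
from collections import defaultdict


def group_by_target(pairs):
    """Group Ukrainian forms sharing a target term; keep groups of 2 or more."""
    groups = defaultdict(set)
    for uk_form, target in pairs:
        groups[target].add(uk_form)
    return {target: forms for target, forms in groups.items()
            if len(forms) > 1}


pairs = [("вагітність", "pregnancy"), ("вагітності", "pregnancy"),
         ("втома", "tiredness")]
print(group_by_target(pairs))
```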

Terminology building by Transfer
Analysis

Bilingual terminology from MedlinePlus: Transfer 1
Few errors, mainly partial matches between two languages:
- {ви можете спати, dormir/sleep}: lit. you can sleep
- {появу виразок у роті, mouth sores}: lit. (appearance of) mouth sores
Causes of silence:
- variation due to the translation, which prevents the Transfer 1 method from extracting the term in French or English
  - Догляд: match with the French title Soins but not with the English title Your care
  - Problem solved by the Transfer 2 method
- errors in the POS tagging or in the term extraction strategy
  - Inability of the term extractor to identify French or English terms

Terminology building by Transfer
Results

Bilingual terminology from MedlinePlus: Transfer 2

9,040 extracted Ukrainian terms (prec.: 0.454)
Exact match:
- Higher precision of the French (0.674) and English terms (0.761)
- But low number of terms: 3,671 for French, 3,597 for English
- Due to the rich morphology of the Ukrainian language: {напад, нападу} (attack), {припадків, припадки} (seizure), {костей, кістки} (bones)
- Extraction of synonymous terms: {биття, удару} (beats), {приступам, припадків} (attacks/seizures)

Terminology building by Transfer
Results

Bilingual terminology from MedlinePlus: Transfer 2

Relations:
- 3,724 pairs of Ukrainian/French terms (prec.: 0.309)
- 4,745 pairs of Ukrainian/English terms (prec.: 0.401)
- 4,724 triples of Ukrainian/French/English terms (prec.: 0.419)
Inexact match:
- Higher precision: +0.40 points for the Ukrainian terms, +0.05 for the French and English terms
- Due to the alignment quality?

Terminology building by Transfer
Analysis

Bilingual terminology from MedlinePlus: Transfer 2
Error analysis:
- Most of the errors are due to alignment problems
- When the alignment is correct, the Ukrainian terms are correctly extracted by the transfer
Term analysis:
- Most of the extracted terms are specific to the medical domain: {шприца, syringe}, {холестерину, cholesterol}, {фактори ризику, risk factors}, {трахеотомією, tracheostomy}
- Other terms correspond to close and approximate notions: {діти, children}, {здорову їжу, healthy diet}, {серцевий напад, heart attack}, {склянок рідини, glasses of liquid}
Interesting observation: some French and English terms correspond to phrases in Ukrainian:
- undercooked foods: не до кінця приготовлену їжу (lit. food which is not fully cooked)
- indolore (painless): При цьому обстеженні Ви не відчуєте жодного болю (lit. With this exam you will feel no pain)

Terminology building by Transfer
Conclusion

Proposal of transfer-based methods to:
- extract the term candidates in Ukrainian
- create Ukrainian/French and Ukrainian/English term pairs
The methods work on freely available multilingual corpora in French, English and Ukrainian
Resulting terminological resource: 4,588 Ukrainian medical terms and 34,267 relations with French and English terms
→ Method suitable for building terminology in low-resourced languages

Terminology building by Transfer
Future Work

Bilingual word alignment with Fast-Align [?]
Use of statistical and morphological cues
Use of the transfer method for keyphrase extraction from scientific papers
⇒ Ongoing work with the Kyiv Institute of Cybernetics
Proposal of a similar term extraction method working with comparable corpora

Conclusion
Overall conclusion

Biomedical text mining: a complex task which involves
- several types of information ... to link together
- many strategies for identifying the information
- a lot of terminological and linguistic resources ... more or less available or difficult to build, depending on the languages and areas
Current challenges:
- concept recognition (disambiguation, normalization)
- multilingual approaches
- approaches for low-resourced languages
- use of information coming from social media

Conclusion

Thank you! (Дякую!)

References

Ahmad (Rabiah) and Bath (Peter A). -- Identification of risk factors for 15-year mortality among community-dwelling older people using Cox regression and a genetic algorithm. Journal of Gerontology, vol. 60 (8), 2005, pp. 1052--8.

Aubin (Sophie) and Hamon (Thierry). -- Improving Term Extraction with Terminological Resources. In: Advances in Natural Language Processing (5th International Conference on NLP, FinTAL 2006), ed. by Salakoski (Tapio), Ginter (Filip), Pyysalo (Sampo) and Pahikkala (Tapio). pp. 380--387. -- Springer.

Blake (Catherine). -- A text mining approach to enable detection of candidate risk factors. In: Medinfo, pp. 1528--1528.

Cabré (MT), Estopà (R) and Vivaldi (J). -- Automatic term detection: a review of current systems, pp. 53--88. -- John Benjamins, 2001.

Cerrito (Patricia). -- Inside text Mining. Health management technology, vol. 25 (3), 2004, pp. 28--31.

Chapman (Wendy), Bridewell (Will), Hanbury (Paul), Cooper (Gregory) and Buchanan (Bruce). -- Evaluation of negation phrases in narrative clinical reports. In: Annual Symposium of the American Medical Informatics Association (AMIA). -- Washington, 2001.

Dyer (Chris), Chahuneau (Victor) and Smith (Noah A.). -- A Simple, Fast, and Effective Reparameterization of IBM Model 2. In: NAACL/HLT, pp. 644--648.

Golik (Wiktoria), Bossy (Robert), Ratkovic (Zorana) and Nédellec (Claire). -- Improving term extraction with linguistic analysis in the biomedical domain. In: Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing'13). -- Samos, Greece, March 2013.

Grouin (Cyril), Abacha (Asma Ben), Bernhard (Delphine), Cartoni (Bruno), Deléger (Louise), Grau (Brigitte), Ligozat (Anne-Laure), Minard (Anne-Lyse), Rosset (Sophie) and Zweigenbaum (Pierre). -- CARAMBA: Concept, Assertion, and Relation Annotation using Machine-learning Based Approaches. In: Proceedings of the workshop I2B2 2010.

Grouin (Cyril), Grabar (Natalia), Hamon (Thierry), Rosset (Sophie), Tannier (Xavier) and Zweigenbaum (Pierre). -- Eventual situations for timeline extraction from clinical reports. Journal of American Medical Informatics Association, vol. 20 (5), September 2013, pp. 820--827. -- (IF: 3.609).

Hamon (Thierry) and Grabar (Natalia). -- Linguistic approach for identification of medication names and related information in clinical narratives. Journal of American Medical Informatics Association, vol. 17 (5), Sep-Oct 2010, pp. 549--554. -- PMID: 20819862.

Hamon (Thierry) and Grabar (Natalia). -- Tuning HeidelTime for identifying time expressions in clinical texts in English and French. In: Proceedings of The Fifth International Workshop on Health Text Mining and Information Analysis (LOUHI2014) -- Short paper/Poster, pp. 101--105. -- Gothenburg, Sweden, April 2014.

Hamon (Thierry) and Grabar (Natalia). -- Adaptation of Cross-Lingual Transfer Methods for the Building of Medical Terminology in Ukrainian. In: Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING2016). -- Springer.

Hamon (Thierry) and Grabar (Natalia). -- Creation of a multilingual aligned corpus with Ukrainian as the target language and its exploitation. In: Proceedings of Computational Linguistics and Intelligent Systems (COLINS 2017), pp. 10--19.

Hamon (Thierry), Nazarenko (Adeline), Poibeau (Thierry), Aubin (Sophie) and Derivière (Julien). -- A Robust Linguistic Platform for Efficient and Domain specific Web Content Analysis. In: Proceedings of RIAO 2007. -- Pittsburgh, USA, 2007. 15 pages.

Hamon (Thierry), Graña (Martin), Raggio (Víctor), Grabar (Natalia) and Naya (Hugo). -- Identification of relations between risk factors and their pathologies or health conditions by mining scientific literature. In: Proceedings of MEDINFO 2010, pp. 964--968. -- PMID: 20841827.

Hamon (Thierry), Grabar (Natalia) and Kokkinakis (Dimitrios). -- Medication Extraction and Guessing in Swedish, French and English. In: Proceedings of MedInfo 2013. -- Copenhagen, Denmark, August 2013.

Hamon (Thierry), Engström (Christopher) and Silvestrov (Sergei). -- Term ranking adaptation to the domain: genetic algorithm based optimisation of the C-Value. In: Proceedings of PolTAL 2014 -- Advances in Natural Language Processing, ed. by Springer, pp. 71--83.

Kageura (K) and Umino (B). -- Methods of Automatic Term Recognition. In: National Center for Science Information Systems, pp. 1--22.

Kolyshkina (I) and van Rooyen (M). -- Text mining for insurance claim cost prediction, pp. 192--202. -- Springer-Verlag, 2006.

Lopez (Adam), Nossal (Mike), Hwa (Rebecca) and Resnik (Philip). -- Word-Level Alignment for Multilingual Resource Acquisition. In: LREC Workshop on Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Data. -- Las Palmas, Spain, 2002.

McDonald (Ryan), Petrov (Slav) and Hall (Keith). -- Multi-source transfer of delexicalized dependency parsers. In: EMNLP.

Minard (AL), Ligozat (AL), Ben Abacha (A), Bernhard (D), Cartoni (B), Deléger (L), Grau (B), Rosset (S), Zweigenbaum (P) and Grouin (C). -- Hybrid methods for improving information access in clinical documents: concept, assertion, and relation identification. J Am Med Inform Assoc, vol. 18 (5), 2011, pp. 588--93.

Pazienza (Maria Teresa), Pennacchiotti (Marco) and Zanzotto (Fabio Massimo). -- Terminology Extraction: An Analysis of Linguistic and Statistical Approaches. In: Knowledge Mining, ed. by Sirmakessis (Spiros), pp. 255--279. -- Springer Berlin Heidelberg, 2005.

Périnet (Amandine), Grabar (Natalia) and Hamon (Thierry). -- Identification des assertions dans les textes médicaux : application à la relation {patient, problème médical}. Traitement Automatique des Langues (TAL), vol. 52 (1), 2011, pp. 97--132.

Strötgen (Jannik) and Gertz (Michael). -- Temporal Tagging on Different Domains: Challenges, Strategies, and Gold Standards. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). pp. 3746--3753. -- ELRA.

Tsuruoka (Yoshimasa), Tateishi (Yuka), Kim (Jin-Dong), Ohta (Tomoko), McNaught (John), Ananiadou (Sophia) and Tsujii (Jun'ichi). -- Developing a Robust Part-of-Speech Tagger for Biomedical Text. In: Proceedings of Advances in Informatics - 10th Panhellenic Conference on Informatics, pp. 382--392.

Yarowsky (David), Ngai (Grace) and Wicentowski (Richard). -- Inducing multilingual text analysis tools via robust projection across aligned corpora. In: HLT.

Zeman (D) and Resnik (P). -- Cross-language parser adaptation between related languages. In: NLP for Less Privileged Languages.

Zweigenbaum (Pierre), Lavergne (Thomas), Grabar (Natalia), Hamon (Thierry), Rosset (Sophie) and Grouin (Cyril). -- Combining an expert-based medical entity recognizer to a machine-learning system: methods and a case study. Biomedical Informatics Insights, vol. 6 (Suppl. 1), 2013, pp. 51--62.