Towards the French Biomedical Ontology Enrichment
Awarded by University of Montpellier
Prepared at I2S∗ Graduate School, UMR 5506 Research Unit, ADVANSE Team, and Laboratory of Informatics, Robotics and Microelectronics of Montpellier
Speciality: Computer Science
Defended by Mr Juan Antonio LOSSIO-VENTURA
Defended on 09/11/2015 in front of a jury composed of:

Sophia Ananiadou, Professor, Univ. of Manchester, President
Fabio Crestani, Professor, Univ. of Lugano, Reviewer
Pierre Zweigenbaum, Professor, LIMSI-CNRS, Reviewer
Natalia Grabar, Researcher, CNRS, Examiner
Mathieu Roche, Research Professor, Cirad, TETIS, LIRMM, Director
Clement Jonquet, Associate Professor, Univ. of Montpellier, Advisor
Maguelonne Teisseire, Research Professor, TETIS, LIRMM, Advisor

"Dreams have only one owner at a time. For that, dreamers tend to be alone."

∗ I2S: Information, Structures and Systems.

Dedication

This thesis is lovingly dedicated to my mother, Laly Ventura. She made me stronger, and her support, encouragement, and constant love have positively influenced my whole life.

Acknowledgments

During the past three years, I met many wonderful people who helped me without asking for anything in return. These people contributed to this thesis as well as to my personal development.

First, I would like to thank my thesis committee, Prof. Sophia Ananiadou, Prof. Fabio Crestani, Prof. Pierre Zweigenbaum, and Dr. Natalia Grabar, for the time they spent reading my thesis closely, for their insightful comments, and for the hard questions that allowed me to improve my research from various perspectives.

I would also like to express my sincere gratitude to my advisors, Clement Jonquet, Mathieu Roche, and Maguelonne Teisseire, for their continuous support of my PhD and related research. I thank them for their patience, motivation, encouragement, knowledge, and inspirational guidance during the three years of my thesis. Their enthusiasm and encouragement have been a great motivation to me.
I would not change them; I could not have found better advisors and mentors for my PhD.

My sincere thanks also go to Prof. Pascal Poncelet, who gave me the opportunity to join the ADVANSE team and access to the laboratory and research facilities. His precious support made it possible to conduct other research related to my PhD.

I am extremely grateful to my colleagues Lilia, Jessica, Sarah, Amine, Mike, Vijay, etc. for their help and useful discussions during my research. We had stimulating discussions as well as several sleepless nights, which were also a source of fun. In particular, I am grateful to Lilia, Jessica, and Mike.

I sincerely thank the whole ADVANSE team for their moral support and encouragement during my stay in Montpellier.

Finally, I want to express my very profound gratitude to my mother and to my sisters for providing me with unfailing support and continuous encouragement throughout my three years of research. This accomplishment would not have been possible without them. Thank you.

This thesis was supported in part by the French National Research Agency under the JCJC program, grant ANR-12-JS02-01001, as well as by the University of Montpellier, CNRS, the IBC of Montpellier project, and the FINCyT program, Peru.

Abstract

Big Data for the biomedical domain involves a major issue: the analysis of large volumes of heterogeneous data (e.g. video, audio, text, image). Ontologies, i.e. conceptual models of reality, can play a crucial role in biomedical fields for automating data processing, querying, and matching heterogeneous data. Various English resources exist, but considerably fewer are available in French, and there is a substantial lack of related tools and services to exploit them. Ontologies were initially built manually. A few semi-automatic methodologies have been proposed in recent years. Semi-automatic construction/enrichment of ontologies is mostly achieved using natural language processing (NLP) techniques to assess texts.
NLP methods have to take the lexical and semantic complexity of biomedical data into account: (1) lexical complexity refers to the complex phrases to be handled, and (2) semantic complexity refers to sense and context induction of the terminology.

In this thesis, we address the above-mentioned challenges by proposing methodologies for construction/enrichment of biomedical ontologies based on two main contributions. The first contribution concerns the automatic extraction of specialized biomedical terms (lexical complexity) from corpora. New ranking measures for single- and multi-word term extraction methods are proposed and evaluated. In addition, we present BioTex, a web and desktop application that implements the proposed measures. The second contribution concerns concept extraction and semantic linkage of extracted terminology (semantic complexity). This work seeks to induce semantic concepts of new candidate terms, and to find semantic links, i.e. relevant locations of new candidate terms, in an existing biomedical ontology. We propose a methodology that extracts new terms and integrates them into the MeSH ontology. Quantitative and qualitative assessments conducted by experts and non-experts on real data highlight the relevance of the contributions.

Résumé

In biomedicine, the field of "Big Data" (information overload) raises the problem of analyzing large volumes of heterogeneous data (e.g. video, audio, text, image). Biomedical ontologies, conceptual models of reality, can play an important role in automating data processing, querying, and the matching of heterogeneous data. Several resources exist in English, but those for French are less rich. The lack of related tools and services to exploit them makes these gaps worse. Ontologies were initially built manually. In recent years, a few semi-automatic methods have been proposed. These semi-automatic techniques for ontology construction/enrichment mostly work from texts, using natural language processing (NLP) techniques. NLP methods make it possible to take the lexical and semantic complexity of biomedical data into account: (1) lexical refers to the complex biomedical phrases to be considered, and (2) semantic refers to inducing the concept and context of the terminology.

In this thesis, to address the challenges mentioned above, we propose methodologies for the enrichment/construction of biomedical ontologies based on two main contributions. The first contribution relates to the automatic extraction of specialized biomedical terms (lexical complexity) from corpora. New measures for extracting and ranking single- and multi-word terms have been proposed and evaluated. The BioTex application implements these measures. The second contribution concerns concept extraction and the semantic linkage of the extracted terminology (semantic complexity). This work aims to induce concepts for new candidate terms and to determine their semantic links, i.e. the most relevant positions within an existing biomedical ontology. We have thus proposed a concept extraction approach that integrates new terms into the MeSH ontology. Quantitative and qualitative evaluations, conducted by experts and non-experts on real data, highlight the value of these contributions.

Research Publications

Edition of International Conference Proceedings

• Lossio-Ventura, J. A., and Alatrista-Salas, H., Editors. Proceedings of the 2nd Annual International Symposium on Information Management and Big Data - (SIMBig 2015), Cusco, Peru, September 2-4, 2015, volume 1478 of CEUR Workshop Proceedings. CEUR-WS.org, 2015.
• Lossio-Ventura, J. A., and Alatrista-Salas, H., Editors. Proceedings of the 1st Annual International Symposium on Information Management and Big Data - (SIMBig 2014), Cusco, Peru, October 8-10, 2014, volume 1318 of CEUR Workshop Proceedings. CEUR-WS.org, 2014.

International Journals

• Lossio-Ventura, J. A., Jonquet, C., Roche, M., and Teisseire, M. Biomedical term extraction: overview and a new methodology. Information Retrieval Journal, Springer Netherlands, vol. 19, 1, August 2015, pp. 55-99.
• Lossio-Ventura, J. A., Jonquet, C., Roche, M., and Teisseire, M. Towards a mixed approach to extract biomedical terms from documents. International Journal of Knowledge Discovery in Bioinformatics - (IJKDB), vol. 4, 1, January-March 2014, pp. 1-15.

French Journal

• Roche, M., Fortuno, S., Lossio-Ventura, J. A., Akli, A., Belkebir, S., Lounis, T., and Toure, S. Extraction automatique des mots-clés à partir de publications scientifiques pour l’indexation et l’ouverture des données en agronomie. Cahiers Agricultures, vol. 24, 5, Septembre-Octobre 2015, pp. 313-320.

International Conferences

• Lossio-Ventura, J. A., Jonquet, C., Roche, M., and Teisseire, M. A Way to Automatically Enrich Biomedical Ontologies. In Proceedings of the 19th International Conference on Extending Database Technology - Posters - (EDBT’2016). Bordeaux, France, 2016 (to appear).
• Lossio-Ventura, J. A., Jonquet, C., Roche, M., and Teisseire, M. Communication Overload Management Through Social Interactions Clustering. In Proceedings of the 31st ACM/SIGAPP Symposium on Applied Computing - (SAC’2016). Pisa, Italy, 2016 (to appear).
• Lossio-Ventura, J. A., Jonquet, C., Roche, M., and Teisseire, M.