Nanoinformatics Approaches for Information Extraction and Text Mining in Nanomedical Research Texts
Universidad Politécnica de Madrid
Escuela Técnica Superior de Ingenieros Informáticos

NANOINFORMATICS APPROACHES FOR INFORMATION EXTRACTION AND TEXT MINING IN NANOMEDICAL RESEARCH TEXTS

Doctoral Dissertation
In partial fulfillment of the requirements for the Doctoral Degree in Artificial Intelligence

Author: Diana de la Iglesia Jiménez, MSc Computer Science
Advisors: Víctor Maojo García, PhD Computer Science
Miguel García Remesal, PhD Computer Science

Madrid, 2014

Committee appointed by the Rector of the Universidad Politécnica de Madrid on the . of . .. 2014

President:
Member 1:
Member 2:
Member 3:
Secretary:
Substitute 1:
Substitute 2:

The reading and defense of the thesis took place on the . of . 2014 in Madrid.
Grade: .

THE PRESIDENT          THE MEMBERS          THE SECRETARY

To my parents

When I heard the learn’d astronomer;
When the proofs, the figures, were ranged in columns before me;
When I was shown the charts and the diagrams, to add, divide, and measure them;
When I, sitting, heard the astronomer, where he lectured with much applause in the lecture-room,
How soon, unaccountable, I became tired and sick;
Till rising and gliding out, I wander’d off by myself,
In the mystical moist night-air, and from time to time,
Look’d up in perfect silence at the stars.

Walt Whitman - Leaves of Grass, 1900

ACKNOWLEDGEMENTS

It would not have been possible to write this doctoral dissertation without the help and support of the kind people around me, to only some of whom it is possible to give particular mention here.
First, I would like to express my sincere gratitude to my advisors, Dr. Víctor Maojo and Dr. Miguel García-Remesal, for making this research possible. I appreciate their vast knowledge and skills in many areas. I would also like to thank the members of the committee, as well as Dr. Casimir Kulikowski and Dr. José Luis Oliveira, for taking time out of their busy schedules to serve as external experts. Special thanks go to Dr. Raúl Cachau, without whose motivation and encouragement I would not have become so interested in nanotechnology. He provided me with an insightful view of the field and technical support, and became a friend. Thanks also to other nanotechnology scientists and experts, such as Dr. Martin Fritts and Dr. Nathan Baker, for their valuable and constructive work.

I would like to thank my parents, María del Carmen and Juan, for the support they have given me throughout my life, for which a mere expression of thanks does not suffice. I must also acknowledge Jorge, who has been a constant source of strength all these years and has never stopped supporting me. I am also grateful to my friends at the Universidad Politécnica de Madrid, Andrés, Alex, Dani and Lili, and particularly Alberto Anguita, for our conversations and the good moments that enriched the experience. I would also like to express my gratitude to the great people I have met over the last few years, with a special mention to Sergio Paraíso and Ana Freire.

Finally, I appreciate the financial support from the European Commission, the Spanish Ministry of Economy and Competitiveness and the Consejo Social of the Universidad Politécnica de Madrid, which funded parts of the research discussed in this dissertation.

Madrid, June 2014
Diana de la Iglesia Jiménez

ABSTRACT

Nanotechnology is a research area of recent development that deals with the manipulation and control of matter with dimensions ranging from 1 to 100 nanometers.
At the nanoscale, materials exhibit singular physical, chemical and biological phenomena, very different from those manifested at the conventional scale. In medicine, nanosized compounds and nanostructured materials offer improved drug targeting and efficacy with respect to traditional formulations, and reveal novel diagnostic and therapeutic properties. Nevertheless, the complexity of information at the nano level is much higher than the complexity at the conventional biological levels (from populations to the cell). Thus, any nanomedical research workflow inherently demands advanced information management. Unfortunately, Biomedical Informatics (BMI) has not yet provided the necessary framework to deal with such information challenges, nor adapted its methods and tools to the new research field. In this context, the novel area of nanoinformatics aims to build new bridges between medicine, nanotechnology and informatics, allowing the application of computational methods to solve informational issues at the wide intersection between biomedicine and nanotechnology. These observations set the context of this doctoral dissertation, which focuses on analyzing the nanomedical domain in depth and on developing nanoinformatics strategies and tools that map across disciplines, data sources, computational resources, and information extraction and text mining techniques in order to leverage available nanomedical data. The author analyzes, through real-life case studies, some research tasks in nanomedicine that would require or could benefit from the use of nanoinformatics methods and tools, illustrating the present drawbacks and limitations of BMI approaches when dealing with data from the nanomedical domain.
Three different scenarios, comparing both the biomedical and nanomedical contexts, are discussed as examples of activities that researchers would perform while conducting their research: i) searching the Web for data sources and computational resources supporting their research; ii) searching the literature for experimental results and publications related to their research; and iii) searching clinical trial registries for clinical results related to their research. Carrying out these activities depends on the use of informatics tools and services, such as web browsers, databases of citations and abstracts indexing the biomedical literature, and web-based clinical trial registries, respectively. For each scenario, this document provides a detailed analysis of the potential information barriers that could hamper the successful completion of the different research tasks in both fields (biomedicine and nanomedicine), emphasizing the existing challenges for nanomedical research, where the major barriers have been found. The author illustrates how the application of BMI methodologies to these scenarios can prove successful in the biomedical domain, whilst these methodologies present severe limitations when applied to the nanomedical context. To address such limitations, the author proposes an original nanoinformatics approach specifically designed to deal with the special characteristics of information at the nano level. This approach consists of an in-depth analysis of the scientific literature and available clinical trial registries to extract relevant information about experiments and results in nanomedicine (textual patterns, common vocabulary, experiment descriptors, characterization parameters, etc.), followed by the development of mechanisms to automatically structure and analyze this information.
This analysis resulted in the generation of a gold standard (a manually annotated training or reference set), which was applied to the automatic classification of clinical trial summaries, distinguishing studies focused on nanodrugs and nanodevices from those aimed at testing traditional pharmaceuticals. The present work aims to provide the necessary methods for organizing, curating and validating existing nanomedical data on a scale suitable for decision-making. Similar analyses for different nanomedical research tasks would help to detect which nanoinformatics resources are required to meet current goals in the field, as well as to generate densely populated and machine-interpretable reference datasets from the literature and other unstructured sources for further testing of novel algorithms and for inferring new valuable information for nanomedicine.

SUMMARY

Nanotechnology is a recently created research area that deals with the manipulation and control of matter with dimensions between 1 and 100 nanometers. At the nanometric scale, materials exhibit singular physical, chemical and biological phenomena, very different from those they manifest at the conventional scale. In medicine, compounds miniaturized to the nanoscale and nanostructured materials offer greater efficacy than traditional chemical formulations, as well as improved targeting of the drug to the therapeutic target, thus revealing new diagnostic and therapeutic properties. At the same time, the complexity of information at the nano level is much greater than at the conventional biological levels (from the population level to the cell level) and, therefore, any workflow in nanomedicine inherently requires advanced information management strategies.
Unfortunately, biomedical informatics has not yet provided a framework for dealing with these information challenges at the nano level, nor has it adapted its methods and tools to this new field of research. In this context, the new area of nanoinformatics aims to detect and establish