Towards Effective Biomedical Knowledge Discovery Through Subject-Centric Semantic Integration of the Life-Science Information Space
Total Page:16
File Type:pdf, Size:1020Kb
TECHNISCHE UNIVERSITÄT MÜNCHEN Lehrstuhl für Genomorientierte Bioinformatik Towards Effective Biomedical Knowledge Discovery through Subject-Centric Semantic Integration of the Life-Science Information Space Karamfilka Krasimirova Nenova Vollständiger Abdruck der von der Fakultät Wissenschaftszentrum Weihenstephan für Ernährung, Landnutzung und Umwelt der Technischen Universität München zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigten Dissertation. Vorsitzender: Prof. Dr. M. Hrabé de Angelis Prüfer der Dissertation: 1. Univ.-Prof. Dr. H.-W. Mewes 2. Univ.-Prof. Dr. R. Zimmer (Ludwig-Maximilians-Universität München) Die Dissertation wurde am 20.10.2008 bei der Technischen Universität München eingereicht und durch die Fakultät Wissenschaftszentrum Weihenstephan für Ernährung, Landnutzung und Umwelt am 19.03.2009 angenommen. Acknowledgements This is a great opportunity to express my deep gratitude to many people who supported me during my Ph.D. research and in the writing process of this thesis. Without their help, guidance and encouragement, this study would not have been completed. Above all, I would like to thank Dr. Volker Stümpflen, who inspired me for the undertaken research. His professional competence, valuable advices and commitment were indispensable for my work. As a mentor and leader of the Biological Information Systems group at MIPS, he has been always open-minded for constructive discussions and core contributor to the very friendly atmosphere in the group. I am also very grateful to Prof. Dr. Hans-Werner Mewes for giving me the opportunity to acquire doctor’s degree at the Institute for Bioinformatics and Systems Biology. I highly appreciate his trust in me and my work and I am thankful for his guidance in the key moments of my research, the unreserved support and the patience during the writing process. I would like to extend these thanks to Prof. Dr. Inge Schestag, a former study advisor at the University for Applied Sciences Darmstadt and a good friend, for encouraging me to dare my Ph.D. study. At this place, I would like to thank also past and present colleagues at MIPS for listening, giving advices and helping me in the diverse topics regarding my work, in particular Dr. Matthias Oesterheld, Richard Gregory, Thorsten Barnickel, Roland Arnold, Mara Hartsperger, Florian Büttner, Octave Noubibou, Christoph Oberhauser, Thorsten Schmidt, Sebastian Toepel, Dr. Thomas Rattei and Dr. Louise Gregory. Moreover, I have to thank Dr. Ulrich Güldener, Dr. Martin Münsterkötter, Dr. Corinna Montrone, Dr. Irmtraud Dunger, and Dr. Andreas Ruepp for sharing their biological expertise with me. Christian Chors and Dr. Matthias Klaften from the Institute for Experimental Genetics I thank for the interest in my work and the constructive feedbacks. Additionally, I want to thank Cornelia Canady, Elisabeth Noheimer, Petra Fuhrmann, Gabi Kastenmueller, Wanseon Lee, Yu Wang and all other institute members for the nice working atmosphere. Last but not least, I would like to thank my family in Bulgaria for their concerns and continuous encouragements, Annett Würfel, Markus and Julius Döbereiner for cheering me up in the writing process, Tommaso Nuccio for helping me with the thesis revision, and a very special thank goes out to Kuo-Sek Lee for his trust in me. Karamfilka K. Nenova Abstract Key premises for successful life-science research are the access, combination, and interpretation of already acquired knowledge. Within the last two decades, considerable data regarding various biological research aspects has been generated and collected in a huge number of diverse life-science information resources. Although most of the resources provide public access on the Web to share the already gained knowledge, a vast knowledge gap has emerged between the generated volume of information and discovered novel knowledge in life-science. Not only because related biological information is commonly spread over several distributed resources, but also because essential problems exist regarding context dependency and differentiation of biological concepts and entities. Accordingly to bridge this crucial gap, biological information has to be integrated and provided not only in a homogenous way, but also in the right context for the more effective exploration and interpretation, which represents still a demanding knowledge management task. The objective of this thesis was the development of an integrative approach also applicable for the life-science information space that allows a more effective knowledge discovery. The generated solution follows a new paradigm for subject-centric knowledge representation, which reflects the human way of associative thinking in terms of subjects and associations between them. The novel integrative approach is realized by applying both state-of-the-art technologies for dynamic information request and retrieval and also the semantic technology Topic Maps. The designed approach was implemented within the software framework GeKnowME (Generic Knowledge Modeling Environment), which supports scientists with powerful tools for exploration and navigation through correlated biological entities to accelerate the discovery process in a specific knowledge domain. The framework is generic enough to be applicable for a broad range of use cases. To illustrate the potential of the GeKnowME system, a sample use case called “Human Genetic Diseases” is introduced by integrating distributed resources containing relevant information. The emerged coherent information space is explored for novel insights. Zusammenfassung Die Forschung in den unterschiedlichen Biowissenschaften führte in den letzten zwei Jahrzehnten zur Erhebung großer Datenmengen, die in einer Vielzahl von biologischen Informationsressourcen gesammelt und gepflegt werden. Obwohl die meisten Ressourcen öffentlich zugänglich sind und somit den Zugriff auf bereits erworbene Erkenntnisse ermöglichen, nimmt die Kluft zwischen diesen und dem daraus neu gewonnenen Wissen in den Biowissenschaften stetig zu. Ursache hierfür ist einerseits, dass zusammenhängende biologische Entitäten häufig über mehrere Ressourcen verteilt sind, und andererseits, das Fehlen einer einheitlichen Repräsentation kontextabhängiger biologischer Konzepte und Entitäten. Neben einer homogenen Integration der biologischen Information ist die Einordnung dieser in den relevanten Kontext erforderlich um eine effektive Exploration und Interpretation zu ermöglichen. Gerade in den Biowissenschaften stellt sich dies als eine besondere Herausforderung für das Wissensmanagement dar. Ziel dieser Arbeit war die Entwicklung und Umsetzung eines Konzepts für ein Verfahren, welches für die Integration biologischer Wissensdomänen anwendbar ist und auf diese Weise eine effektivere Erkenntnisgewinnung ermöglicht. Das entwickelte Konzept basiert auf einem neuen subject-centric Ansatz zur Wissensrepräsentation, welcher das menschliche assoziative Denken im Bezug auf Entitäten und deren Verbindungen abbildet. Sowohl aktuelle Technologien zur dynamischen Informationsgewinnung als auch die semantische Technologie Topic Maps wurden eingesetzt um diesen innovativen Integrationsansatz zu realisieren. Das Software-Framework GeKnowME (Generic Knowledge Modeling Environment), welches Wissenschaftlern leistungsfähige Werkzeuge für die Erforschung und Navigation durch zusammenhängende biologische Entitäten in spezifischen Wissensdomänen bereithält, stellt die Implementierung dieses Ansatzes dar. Das generische System eignet sich für ein breites Spektrum an Anwendungsfällen. Die Leistungsfähigkeit von GeKnowME wird exemplarisch am Anwendungsfall „Genetische Erkrankungen des Menschen“ aufgezeigt. Hierzu wurden erforderliche verteilte Ressourcen integriert und das daraus entstandene Informationsnetzwerk umfassend analysiert. Die Resultate erlauben neue Einblicke basierend auf bereits bekannter Information. Contents 1 Introduction .................................................................................................. 1 2 Knowledge Management in Life-Science ................................................... 3 2.1 Knowledge Hierarchy and Life Complexity ............................................................ 4 2.2 Life-Science Information Resources ....................................................................... 9 2.3 Integration of Life-Science Information Resources .............................................. 12 2.3.1 Challenges .................................................................................................. 13 Technical Challenges ................................................................................. 13 Conceptual Challenges ............................................................................... 15 2.3.2 Integrative Approaches .............................................................................. 16 Hypertext Linking ...................................................................................... 16 Full-Text Indexing ..................................................................................... 17 Data Warehousing ...................................................................................... 17 Federated Database Management Systems ................................................ 19 Peer Data Management Systems ................................................................ 21 2.4 Knowledge Representation .................................................................................... 23 2.4.1 Semantics