“Gene Ontology”-Based Semantic Similarity in the Context of Functional Genomics
Investigating “Gene Ontology”-based semantic similarity in the context of functional genomics

Danielle Welter

This thesis is submitted in partial fulfilment of the requirement for the degree of Doctor of Philosophy

School of Computer Science and Informatics
Cardiff University
May 2011

DECLARATION

This work has not previously been accepted in substance for any degree and is not concurrently submitted in candidature for any degree.

Signed........................ Date...............

STATEMENT 1

This thesis is being submitted in partial fulfilment of the requirements for the degree of PhD.

Signed........................ Date...............

STATEMENT 2

This thesis is the result of my own independent work/investigation, except where otherwise stated. Other sources are acknowledged by explicit references.

Signed........................ Date...............

STATEMENT 3

I hereby give consent for my thesis, if accepted, to be available for photocopying and for inter-library loan, and for the title and summary to be made available to outside organisations.

Signed........................ Date...............

Abstract

Gene functional annotations are an essential part of knowledge discovery in the analysis of large datasets, with the Gene Ontology [Ashburner et al., 2000] as the de facto standard for such annotations. A considerable number of approaches for quantifying functional similarity between gene products based on the semantic similarity between their annotations have been developed, but little guidance exists as to which of these measures are the most appropriate for different purposes. This was addressed here by comparing the performances of a number of similarity measures and associated parameters. This comparison provided some interesting new insights as well as confirming emerging trends from the literature. There is also a pressing need for novel ways of applying these measures to facilitate the functional analysis of lists of gene products.
We developed a novel algorithm, FuSiGroups, to group GO terms based on their semantic similarity and genes based on their functional similarity. This two-fold grouping produces not only groups of functionally similar genes but also, for each group, an associated set of related GO terms that characterise a single functional aspect relating the genes in the group, which facilitates analysis by creating more coherent groups. Each gene can belong to multiple groups, so the groups more accurately reflect the complexity of biological reality than clusters generated using traditional approaches. FuSiGroups was tested on a number of scenarios and, in each case, successfully generated biologically relevant groups, identifying the key functional aspects of the dataset. The algorithm also managed to eliminate genes that were functionally unrelated to the bulk of the dataset and to distinguish between different biological pathways. Although dataset size is currently a limiting factor, with smaller datasets performing the best, FuSiGroups has been demonstrated as a promising approach for the functional analysis of gene products.

Acknowledgements

“No man is an island”, wrote John Donne in 1624¹. While some describe doing a PhD as “the loneliest task in the world”, nothing could be further from the truth. I could not have finished this PhD without the help and support of a great many people, too many in fact to name them all here without making this thesis an even heftier tome. The fact that I only thank a few by name shall however be no indication that my gratitude to those who remain unnamed is any less great.

First of all, I would like to thank my supervisors, Prof W.A. Gray and Dr P. Kille, who, with patience and experience, guided me throughout my project. Through countless discussions, they helped me further my own knowledge and understanding of my work by sharing theirs with me.
They taught, guided and challenged, allowing me to make my own mistakes but never stray too far from the right path. I am forever in their debt.

My thanks also go to the “Fonds National de la Recherche” (FNR) in Luxembourg, who funded my PhD.

I owe a great deal of gratitude to my fellow PhD students at the Cardiff School of Computer Science and Informatics, in particular Alysia. They made my years here a very enjoyable experience, both academically and socially.

No words can describe how infinitely grateful I am to my parents and my sister, for their love and never-ending support. My family’s unwavering faith in my ability to complete this project always encouraged me to keep going. For this and many, many other things, I thank them!

Last but by no means least, my thanks go to Ian, the most important person in my life. With love and patience, he stood by me through all the ups and downs of my project, putting up with my craziness, supporting me and encouraging me. I could not have done this without him by my side.

¹ Meditation XVII in Devotions upon Emergent Occasions, John Donne, 1624

Contents

List of Figures
List of Tables
Glossary

1 Introduction
  1.1 Problem definition
    1.1.1 Research question
    1.1.2 Aims and objectives
  1.2 Contributions to knowledge
  1.3 Thesis disposition

2 Semantic and functional similarity
  2.1 The Gene Ontology
    2.1.1 Structure of the Gene Ontology
    2.1.2 Gene Ontology annotation
    2.1.3 What the GO is not
  2.2 Similarity between GO terms
    2.2.1 Node-based approaches
    2.2.2 Edge-based approaches
    2.2.3 Hybrid approaches
  2.3 Similarity between gene products
    2.3.1 Group-wise approaches
    2.3.2 Pair-wise approaches
    2.3.3 FunSim
  2.4 Evaluating semantic and functional similarity
  2.5 Applications of semantic and functional similarity
    2.5.1 Existing tools
  2.6 Summary

3 The study domain
  3.1 Study design
    3.1.1 Semantic similarity approaches
    3.1.2 Functional similarity approaches
    3.1.3 Ontological aspects
    3.1.4 Evidence codes
    3.1.5 The dataset
    3.1.6 The grouping algorithm
  3.2 Evaluation strategy
    3.2.1 Semantic and functional similarity approaches
    3.2.2 Threshold determination
    3.2.3 FuSiGroups grouping results
  3.3 Implementation considerations
    3.3.1 FuSiGroups
    3.3.2 Dataset
    3.3.3 Experiments
  3.4 Summary

4 Semantic and functional similarity approaches
  4.1 ROC curves
  4.2 AUC results
    4.2.1 Statistical analysis
  4.3 Semantic similarity approaches
    4.3.1 Ancestor
    4.3.2 Annotations
    4.3.3 Functional similarity approaches
    4.3.4 Summary
  4.4 Ancestors
  4.5 Annotations
  4.6 Functional similarity approaches
  4.7 Summary

5 Threshold determination
  5.1 Semantic thresholds
    5.1.1 Resnik
    5.1.2 Schlicker
  5.2 Functional thresholds
    5.2.1 Resnik-BMA & MAX
    5.2.2 Schlicker-BMA & MAX
  5.3 Summary

6 Grouping trends
  6.1 Number of groups
  6.2 Group content
    6.2.1 Group sizes
    6.2.2 Groups by ontology
    6.2.3 Number of genes
  6.3 Group definitions
    6.3.1 Definition size
    6.3.2 Group size vs. definition size
    6.3.3 Definitions by ontology
    6.3.4 Number of GO terms
    6.3.5 Definition size vs. term depth
  6.4 Summary

7 The complete Eisen dataset
  7.1 Largest groups and most common aspects
  7.2 Supergroups
    7.2.1 Pseudocode
    7.2.2 Merging results
  7.3 Grouping vs. clustering
    7.3.1 Expression clustering
    7.3.2 Comparing expression and semantic clustering
  7.4 Summary

8 Biological evaluation
  8.1 Proteasome
    8.1.1 Gene selection
    8.1.2 Grouping
  8.2 Ribosomal genes
    8.2.1 Gene selection
    8.2.2 Grouping
  8.3 Pathway identification
    8.3.1 Gene selection
    8.3.2 Grouping
  8.4 Summary

9 Discussion & Conclusion
  9.1 Semantic and functional similarity approaches
    9.1.1 Semantic similarity
    9.1.2 Functional similarity
    9.1.3 Other parameters
    9.1.4 Recommendations
  9.2 FuSiGroups
    9.2.1 Grouping trends
    9.2.2 Grouping vs. clustering
    9.2.3 Grouping results
    9.2.4 FuSiGroups compared to other approaches
    9.2.5 Analysis pathway
  9.3 Future work