Supporting Disease Candidate Gene Discovery Based on Phenotype Mining
SUPPORTING DISEASE CANDIDATE GENE DISCOVERY BASED ON PHENOTYPE MINING

Anika Oellrich
Wolfson College

A dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy

European Molecular Biology Laboratory
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge, CB10 1SD
United Kingdom
[email protected]

29th August 2012

To my grandparents and parents.

DECLARATION

This dissertation is my own work and includes nothing which is the outcome of work done in collaboration except as specified in the text. It is not substantially the same as any I have submitted for a degree, diploma or other qualification at any other university; and no part has already been, or is currently being submitted for any degree, diploma or other qualification. This dissertation does not exceed the specified length of 60000 words as defined by The Biology Degree Committee.

Cambridge, 29th August 2012
Anika Oellrich

SUPPORTING DISEASE CANDIDATE GENE DISCOVERY BASED ON PHENOTYPE MINING
by Anika Oellrich

Even though numerous biological and computational experiments have been devoted to understanding the molecular mechanisms underlying human genetic disorders, a large number of those disorders are still without identified genetic mechanisms. Genetic pleiotropy as well as the polygenic nature of some human genetic disorders pose challenges which still need to be overcome before an understanding of the molecular mechanisms underlying disease may be achieved. Mouse models are used extensively to study human disease through mutagenesis experiments, and the findings are reported in publicly accessible databases as well as scientific publications. In my thesis, I focus on supporting the identification of disease gene candidates by mining phenotype information from three different resources: the Mouse Genome Informatics (MGI) database, the Online Mendelian Inheritance in Man (OMIM) database and the scientific literature.
To enable the integration of the resources, I developed a pipeline which ranks mouse models for human genetic disorders and with that enables the identification of promising disease gene candidates. Mouse models are ranked according to their phenotype similarity, and hence the ranking pipeline can be used as long as a phenotype description of the human disorder is at hand. No prior information about the genetic causes is required, which makes this approach especially valuable in the case of orphan diseases, where it is hard to identify the molecular mechanisms due to their rare occurrence.

Furthermore, I generated mouse-specific disease profiles and demonstrated their validity by applying them to the mouse model ranking pipeline and evaluating the obtained results against disease gene reporting databases. Those mouse-specific profiles may further broaden our knowledge about genetic diseases when used for annotation enrichment. To illustrate their potential, I applied the mouse-specific profiles to a disease classification task. Manual investigation of the obtained classification results revealed phenotypes that could enrich existing OMIM disease annotations.

Due to the incompleteness of the existing phenotype resources and the intense labour and time consumption of manual curation, my work also focussed on the extraction of phenotype information from scientific literature. Not only the abstract but also every other part of a scientific publication is analysed for its potential to provide phenotype information. Textual features as well as several ontologies are used to identify and extract phenotype mentions into a formal representation. The extracted phenotypes can then be used to provide database curators with a selection of phenotypes contained in a paper and by doing so speed up the curation process. Thus, literature-extracted phenotypes enrich existing phenotype databases and consequently support the data mining efforts based upon phenotype information.
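
The ranking idea described in the abstract can be illustrated with a minimal sketch. It is not the pipeline developed in this thesis: the similarity measure is assumed here to be a plain Jaccard index over sets of phenotype annotations, and the function names (`jaccard`, `rank_models`), model names and phenotype identifiers are hypothetical placeholders chosen for illustration only.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two phenotype sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_models(disease: Set[str],
                models: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Score every mouse model against the disease profile, best match first.

    Ties are broken alphabetically by model name so the order is deterministic.
    """
    scored = [(name, jaccard(disease, phenos)) for name, phenos in models.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical phenotype identifiers; real profiles would use ontology terms.
disease_profile = {"PHENO:A", "PHENO:B", "PHENO:C"}
mouse_models = {
    "model_1": {"PHENO:A", "PHENO:B", "PHENO:C", "PHENO:D"},
    "model_2": {"PHENO:A", "PHENO:E"},
    "model_3": {"PHENO:F"},
}

ranking = rank_models(disease_profile, mouse_models)
for name, score in ranking:
    print(f"{name}\t{score:.2f}")
```

Only a phenotype description of the disorder is needed as input, which mirrors the point made above that no prior knowledge of the genetic cause is required. A set-overlap measure such as this one ignores the ontology hierarchy, so related but non-identical phenotype terms contribute nothing to the score.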
ACKNOWLEDGEMENTS

First of all, I would like to thank my PhD supervisor Dietrich Rebholz-Schuhmann for his continuous support, his guidance, his encouragement to follow my intuitions and consequently learn to guide my own research. Dietrich was always very supportive with last minute paper submissions and travel funds to several workshops and conferences.

Furthermore, I would like to thank Robert Hoehndorf – LLAP n= – and George Gkoutos. I collaborated closely with Robert and George, especially in the last three years of my PhD. Both provided me with insights into biological and computational aspects of phenotype mining and spent numerous hours discussing with me (I know GeorgioUs, you are always right apart from 1986!).

In addition, I would like to thank Robert Busch, Peer Bork and Anton Enright who, as part of my Thesis Advisory Committee, have challenged my ideas and therefore let me gain valuable insights through discussion.

I am very grateful to everyone who has proof-read parts of this thesis, especially those reading substantial parts of it and having had to put up with lots of questions - refunds and compensations will be provided!

My special thanks also go to Christoph Grabmueller and Maria Liakata. Christoph supported me in technical questions while Maria provided insights into the theory of research work. I further would like to thank Adam Bernard who has not only become a friend over the last couple of years but also provided a lot of feedback to my work.

My research group was also very supportive with various discussions, prep-talks and journal clubs. I would like to thank Samuel, Yumi, Senay, Chen, Irina, Silvestras, Jee-Hyub, Antonio, Ian, and Shyama for their continuous help!

The vivid predoc community at EBI was also a great support, not only with morning coffees and lunch time seminars.
I would like to especially thank the predocs from my year – Tim, Inigo, Pablo, Joe, Nenad, the pink unicorn, and Charlie; those from previous years – Joern, Dominic and Dace; and those who were adopted – Dagmar, Claire and Mikhail.

Heidi and Elin, thank you very much for helping me when I needed it the most! Heidi – many thanks for all the meetings and sharing your knowledge with me.

I am also very thankful for having found a role model during those years at EBI – Mela, thank you very, very much for the way you are and the inspiration and guidance you gave me for almost six years now!!!

Last, but by no means least, I would like to thank the individuals closest to me. First, I would like to thank my family who provided me with the stepping-stones to become what I am and were supportive all along the way: my parents – Christina & Ulrich; my brother and his family – Daniel, Doreen, Lara & Lena; my grandparents – Brunhilde & Heinz. And finally many, many thanks to Steviiie & Bobo: dankeschon, dankeschon, ich bin ganz comfortable, Kartoffelkopf.

CONTRIBUTIONS

Contributions to the individual parts of this thesis are listed here according to chapter.

chapter 2
I analysed and recorded the co-existing ways of describing phenotypes, derived the categorisation and drafted the manuscript for the workshop. Dietrich Rebholz-Schuhmann supervised the work and contributed to the workshop manuscript.

chapter 3
I designed the study with the help of Robert Hoehndorf. I implemented the required R and Groovy scripts. Robert Hoehndorf contributed two Groovy scripts and generated the combined mappings. Robert Hoehndorf and Dietrich Rebholz-Schuhmann supervised the research. Georgios Gkoutos and Robert Hoehndorf helped with the validation of the initial pairs and the biological validity of the results. All of us contributed to the submitted manuscripts.

chapter 4
I designed the study and implemented the required Groovy scripts as well as the web interface.
Robert Hoehndorf and Dietrich Rebholz-Schuhmann supervised the work. Robert Hoehndorf provided the classification of the diseases based on the mouse profiles. George Gkoutos helped with the manual evaluation of the classification results. All contributed to a manuscript which is to be submitted soon.

chapter 5
Christoph Grabmueller implemented a general set-up of the annotation servers, while I derived the domain-specific dictionaries underlying those servers. Irina Colgiu provided a speed-improved and corrected version of the Gene Ontology (GO) server implementation described in (Gaudan et al., 2008). I designed the study and implemented all the other required software using Groovy. I also manually evaluated a subset of the error cases in the entity–quality (EQ) statements and provided feedback to Georgios Gkoutos. Georgios Gkoutos updated the logical definitions accordingly and provided additional information in unclear cases. Dietrich Rebholz-Schuhmann supervised the study.

chapter 6
I designed the study and implemented all required testing software as Groovy scripts. I also carried out the manual curation of the generated EQ statements to find flaws and allow for the definition of generalised patterns in the decomposition process. Doubtful cases were clarified with Georgios Gkoutos, as were cases in which an error in the EQ statements was suspected. Georgios Gkoutos corrected the EQ statements accordingly and provided support. Dietrich Rebholz-Schuhmann supervised the studies. The server set-up was the same as in chapter 5, partially contributed by Christoph Grabmueller and Irina Colgiu. Christoph Grabmueller, Dietrich Rebholz-Schuhmann and I all contributed to the submitted manuscript.

GLOSSARY

The following terms are used throughout my dissertation.

Genotype
A genotype is the