Visualization and Exploration of Transcriptomics Data Nils Gehlenborg
Sidney Sussex College

A dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy

European Molecular Biology Laboratory,
European Bioinformatics Institute,
Wellcome Trust Genome Campus,
Hinxton, Cambridge, CB10 1SD,
United Kingdom.

Email: [email protected]

October 12, 2010

To Maureen.

This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the text and acknowledgements.

This dissertation is not substantially the same as any I have submitted for a degree, diploma or other qualification at any other university, and no part has already been, or is currently being submitted for any degree, diploma or other qualification.
This dissertation does not exceed the specified word limit of 60,000 words as defined by the Biology Degree Committee.

This dissertation has been typeset in 12 pt Palatino using LaTeX 2ε according to the specifications defined by the Board of Graduate Studies and the Biology Degree Committee.

October 12, 2010
Nils Gehlenborg

Summary

During the last decade, high-throughput analysis of transcriptomes with microarrays and other technologies has become increasingly mature and affordable, which has led to rapid growth of the number and size of available transcriptomics data sets. This in turn has created new challenges for bioinformatics to provide adequate methods for data exploration and visualization, which are the topic of this dissertation.

The first research question that I address is the efficient retrieval of transcriptomics data sets from large databases, which is motivated by the rapid growth of public repositories for transcriptomics data. This makes the exploration of such repositories increasingly challenging, but also provides opportunities for biological discovery. I describe a knowledge-driven approach for exploration of transcriptomics data repositories based on ontology visualization and data-driven approaches based on gene set enrichment analysis and generative probabilistic models.

The second research question of this dissertation deals with the visualization of large transcriptomics data sets. This work has been driven by the observation that there is a growing number of data sets with hundreds or thousands of samples but a lack of suitable methods to visualize such data sets. To address this problem, I first present an analysis of the visualization tasks in this context, and then describe the design of an interactive visualization method based on pixel-oriented visualizations and tree maps.
This design study is complemented with a description of the implementation of the method and a discussion of practical aspects that need to be taken into account when visualizing data sets of such scale.

This dissertation includes several case studies in which I describe the application of the proposed methods to a range of real-world data sets and discuss my findings.

Preface

This dissertation marks the end of my student life, which has been a beautiful ten-year journey that took me halfway around the globe, and during which I have met many inspiring people who have shaped my view of the world. My decision to pursue a PhD grew out of the wonderful experience I had as an undergraduate researcher in the group of Kay Nieselt in Tübingen and I’m very thankful that she gave me the opportunity to take some first steps in the big, great world of science. I’m also very indebted to my former mentors at the Institute for Systems Biology, Inyoul Lee and Daehee Hwang, who inspired me with their perseverance and work ethic and for whose support over the years I am ever so grateful.

For the last four years, I have had Alvis Brazma as my advisor to whom I am most thankful for his support and advice as well as for allowing me the freedom to develop my scientific interests and define my own area of research. I also thank the members of my thesis advisory committee, Nick Luscombe, Gos Micklem and Lars Steinmetz for their guidance and encouragement.

I’m indebted to the members of the Functional Genomics team at the EBI, who supported so many aspects of my research during the last four years.
I want to acknowledge in particular the help of Nikolay Kolesnikov, who worked with me on the ArrayExpress Explorer interface, Margus Lukk, who provided me with the data set that ultimately led to the development of the Space Maps visualization, Gabriella Rustici, who was so helpful with everything related to actual wet lab biology, James Malone and Tomasz Adamusiak, who answered many, many questions about the Experimental Factor Ontology, Ele Holloway, who solved a few curation mysteries for me, Misha Kapushesky, who provided me with access to the ArrayExpress Atlas data, and Helen Parkinson, who helped out whenever I had a question about ArrayExpress, annotation or ontologies. I also want to thank my fellow PhD students and office mates Katherine Lawler and Ângela Gonçalves for sharing their knowledge with me. In particular, I want to thank Garth Ilsley for the many conversations we had over the last four years and for joining me for some time in A3-118. Richard Coulson also deserves special mention for many insightful discussions about biology and, even more so, for fun evenings at the pub followed by mandatory late night Indian dinners.

Much of the research I undertook during the four years of my PhD was in collaboration with the group of Samuel Kaski in Helsinki. I thank Sami for making me feel so welcome during my visits, and for the great discussions we had during our meetings. Very special thanks go to José Caldas, from whom I have learned a lot, and whom I greatly respect as a person, a scientist and a colleague. Furthermore, I’m indebted to Ali Faisal, Jaakko Peltonen and Helena Aidos for being such good collaborators and co-authors. I also have fond memories of good times spent outside the lab with Leo Lathi and, in particular, with Gayle Leen.
If I had to name one single reason why it took me four instead of three and a half years to finish my PhD, I would blame it on the time spent on the organization of the “Visualization of Biological Data” workshop. Nonetheless, I’m enormously grateful for having been part of this, as well as for the opportunities that arose from it, and I’m deeply grateful to Seán O’Donoghue and Jim Procter for going through this experience with me. In particular, I want to thank Seán, and also Anne-Claude Gavin, for their help with the review on visualization of systems biology data. Of course, this would not have been possible without the support of our co-authors Nitin Baliga, Alexander Goesmann, Hiroaki Kitano, Oliver Kohlbacher, Heiko Neuweger, Reinhard Schneider, Dan Tenenbaum, and in particular Matt Hibbs, whose comments and feedback were especially helpful. Cydney Nielsen and Miriah Meyer have been such wonderful friends and colleagues, and I appreciate our many insightful discussions about visualization in biology. I also thank Joe Parry, my local connection to the visual analytics community, for teaching me a lot about visualization in the intelligence analysis field, and about good real ale pubs in Cambridge. I’m grateful also to Bang Wong, Alan Blackwell and Roy Ruddle for conversations about visualization that shaped my views and influenced my work, as well as for coming to the EBI to speak at the “EBI Interfaces” seminars. In this context, I also want to thank Francis Rowland, Eamonn Maguire and Jenny Cham for helping me to organize these seminars and for growing a sizable visualization community on the Genome Campus.

Finally, I want to thank all of my friends, and especially my family, in both Germany and the United States, for being so encouraging and helpful during my life as a student.
And despite all the support I have received from the many people mentioned above, I could not have made it to this point without the encouragement and love of my wife, Maureen. I’m immensely grateful to her for being such an understanding and supportive partner.

Contents

Summary
Preface
Contents
List of Figures
List of Tables
List of Acronyms
1 Introduction
  1.1 Transcriptomics Data
    1.1.1 Gene Expression
    1.1.2 Measurement Technologies
    1.1.3 Experimental Design
    1.1.4 Data Representation
    1.1.5 Repositories