Comprehensive Analysis of High-Throughput Experiments for Investigating Transcription and Transcriptional Regulation
Total Page:16
File Type:pdf, Size:1020Kb
Comprehensive analysis of high-throughput experiments for investigating transcription and transcriptional regulation Joern Michael Toedling Jesus College A dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom. Email: [email protected] 12 January 2009 Meinen Eltern This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the text and acknowledgements. This dissertation is not substantially the same as any I have submit- ted for a degree, diploma or other qualification at any other university, and no part has already been, or is currently being submitted for any degree, diploma or other qualification. This dissertation does not exceed the specified length limit of 300 pages as defined by the Biology Degree Committee. This dissertation has been typeset in 12 pt Palatino using LATEX2# ac- cording to the specifications defined by the Board of Graduate Studies and the Biology Degree Committee. 12 January 2009 Joern Michael Toedling Comprehensive analysis of high-throughput experiments for investigating transcription and transcriptional regulation Summary Joern Michael Toedling 12 January 2009 Jesus College As the number of fully sequenced genomes grows, efforts are shifted to- wards investigation of functional aspects. One research focus is the tran- scriptome, the set of all transcribed genomic features. We aspire to under- stand what features constitute the transcriptome, in which context these are transcribed and how their transcription is regulated. Studies that aim to answer these questions frequently make use of high-throughput tech- nologies that allow for investigation of multiple genomic regions, or tran- scribed copies of genomic regions, in parallel. In this dissertation, I present three high-throughput studies I have been involved in, in which data gained from oligo-nucleotide tiling microar- rays or large-scale cDNA sequencing provided insights into the transcript- ome and transcriptional regulation in the model organisms Saccharomyces cerevisiae and Mus musculus. Interpretation of such high-throughput data poses two major computational tasks. The primary statistical analysis in- cludes quality assessment, data normalisation and identification of signif- icantly affected targets, i.e. regions of the genome deemed transcribed or involved in transcriptional regulation. Second, in an integrative bioinfor- matic analysis, the identified targets need to be interpreted in context of the current genome annotation and related experimental results. I pro- vide details of these individual steps as they were conducted in the three studies. For both primary and integrative analysis, functional, extensible and well- documented software is required, which implements individual analysis steps, allows for concise visualisation of intermittent and final results and facilitates the construction of automated, programmed workflows. Ide- ally such software is optimised with respect to scalability, reproducibil- ity and methodical scope of the analyses. This dissertation contains de- tails of two such software packages in the Bioconductor project, which I (co-)developed. Preface This dissertation describes work carried out at the European Bioinformat- ics Institute (EBI) in Cambridge, UK, between March 2005 and June 2008. The EBI is an outstation of the European Molecular Biology Laboratory (EMBL). I was the recipient of an EMBL predoctoral fellowship to work together with Wolfgang Huber. In the course of collaborations, I also worked together with Lars Stein- metz, Zhenyu Xu, Marina Granovskaia, Antje Purmann and Tammo Krueger, and I would like to thank them for many constructive discus- sions, opinions and ideas. Credit is also due to Silke Sperling and Jenny Fischer for the successful collaborations. The following people helped me to make my time at the EBI interesting, instructional, exciting, and enjoyable: Wolfgang, vielen Dank dafur,¨ dass Du mir die Moglichkeit¨ gegeben hast, von Dir zu lernen und in Deiner Gruppe zu arbeiten. Es war nicht immer einfach, aber immer spannend, und ich kann sagen, dass ich am Ende viel aus der Zusammenarbeit mit Dir mitnehmen kann. Ich habe viel gelernt von Dir, uber¨ die Analyse von high-throughput Daten im Allgemeinen, uber¨ statistische Herangehensweisen und naturlich¨ uber¨ R. Fur¨ Deinen Beistand in der anstrengenden Zeit des HeartRepair-Dilemmas moche¨ ich Dir besonders danken. Und vielleicht bin ich mittlerweile sogar bereit einzusehen, dass auch Diskussionen uber¨ Aspect Ratios eine geringe Daseinsberechtigung haben konnten.¨ Many thanks to Sarah Teichmann, Alvis Brazma, and Lars Steinmetz for their expert advice in and after my thesis advisory committee meetings. Ich mochte¨ an dieser Stelle auch Rainer Spang und Martin Vingron danken dafur,¨ dass sie mein Interesse an der Bioinformatik geweckt und gefordert¨ haben. ii Katherine, thank you so much for carefully proofreading this manuscript and for your wonderful suggestions for improving it. I owe you, and I promise that I won’t carry about with putting the preposition “on” everywhere. Also, many thanks for your friendship over the last four years. Special thanks to all members of the Huber group, many as they were in such a dynamic group. Elin, thanks for countless coffees and for being such a good friend. I hope that you will find happiness and that someday you can also look back and think that it was not too bad a time at EBI. Oleg, thanks for always being very helpful. I knew I could always talk to you about all sorts of things, work-wise and otherwise. It was great to have you there and a sad day when you decided to move on. Tine, many thanks for the shared time, short as it was. Your excitement about all aspects of your work, and be it only the poplar genome sequence, was very contagious, and you were even more fun when you could let go. Greg, your remarks were always very entertaining, first because I could not understand them, later on because they were genuinely true and funny. Richard, thanks for patiently putting up with lots of questions about statistics and ChIP-chip. Ola´ L´ıgia, muito obrigado for the Portuguese classes. Kristen, thank you very much for proofreading the two chapters and for providing elegant English phrases. Audrey, Raeka, Ramona, Alessandro, Bernd, David, Florian, Matt, Mike, Paul, Remy, Simon, Thomas, and Tony, thanks for the nice and interesting chats and the good time. I am also very grateful to all members of the EBI predoc community and associates for providing entertainment and distraction in the frequent times of need, especially Michael, Daniela, Jacky, John, Daniel, and Melanie. Without you guys, it would have been four very hard and boring years indeed. Vielen Dank auch meinen Freunden in Berlin, Oxford und Bremen, auf die ich trotz der Entfernung doch immer zahlen¨ konnte, speziell meiner Schwester Hanna, Peer, Andreas, und Janine. Der großte¨ Dank geht an meine Eltern. Danke fur¨ alles, und speziell dafur,¨ dass Ihr mich in all den Jahren des Studierens in Berlin und Cambridge im- mer bedingungslos unterstutzt¨ habt! Ich habe es Euch bestimmt nicht im- mer leicht gemacht, meinen Weg gutzuheißen, aber ich hoffe und denke, daß Ihr mit dem Ergebnis zufrieden seid. iii Contents Summaryi Preface ii Contents iv List of Figures vii List of Tables ix List of Acronymsx Glossary xii Concerning gene and protein names................. xiii 1 Introduction1 1.1 Transcription............................2 1.2 Regulation of transcription....................3 1.2.1 Histone modifications..................4 1.2.2 Transcription factors...................5 1.3 Transcript orientation.......................6 1.4 DNA microarrays.........................6 1.4.1 ChIP-chip......................... 13 1.5 Microarray-based investigation of transcription throughout the cell cycle............................ 14 1.5.1 Budding yeast cell cycle................. 14 1.5.2 Microarrays and the cell cycle.............. 16 1.6 Gene annotation.......................... 18 iv 2 Coexpression of adjacent genes 20 2.1 Introduction............................ 20 2.1.1 Gene clusters....................... 20 2.1.2 Measures of gene coexpression among clusters.... 21 2.2 Material and Methods...................... 22 2.3 Results............................... 25 2.3.1 Permutation scheme for estimating the null distribu- tion of coexpression scores............... 25 2.3.2 Chromosomal clustering of transcriptomes...... 25 2.3.3 Evaluating coexpression across tissues......... 26 2.3.4 Highly coexpressed clusters............... 33 2.3.5 HCPs and housekeeping functionality......... 33 2.3.6 Location, orientation, and dimension of HCCs.... 33 2.3.7 Functionality, paralogy and transcriptional regula- tion of highly coexpressed gene clusters........ 34 2.3.8 H. sapiens microarray data................ 36 2.4 Discussion............................. 38 3 Transcriptional regulation in developing cardiomyocytes 42 3.1 Histone modifications...................... 43 3.1.1 Introduction........................ 43 3.1.2 Material and methods.................. 46 3.1.3 Results........................... 55 3.1.4 Discussion......................... 62 3.2 Transcription factor binding events............... 67 3.2.1 Introduction........................ 67 3.2.2 Material and methods.................. 68 3.2.3