Exploiting high throughput DNA sequencing data for genomic analysis

Markus Hsi-Yang Fritz
Darwin College

A dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy

14th October 2011

Markus Hsi-Yang Fritz
EMBL-European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge, CB10 1SD
United Kingdom
email:
[email protected]

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. No part of this work has been submitted or is currently being submitted for any other qualification. This document does not exceed the word limit of 60,000 words¹ as defined by the Biology Degree Committee.

Markus Hsi-Yang Fritz
14th October 2011

¹ excluding bibliography, figures, appendices etc.

To an exceptional scientist: my Dad.

Exploiting high throughput DNA sequencing data for genomic analysis

Summary

Markus Hsi-Yang Fritz
Darwin College

The last few years have witnessed a drastic increase in genomic data. This has been facilitated by the shift away from the Sanger sequencing technique to an array of high-throughput methods, the so-called next-generation sequencing technologies. This enormous growth of available DNA data has been a tremendous boon to large-scale genomics studies and has rapidly advanced fields such as environmental genomics, ancient DNA research, population genomics and disease association. On the other hand, researchers and sequence archives are now facing an enormous data deluge. Critically, the rate of sequencing data accumulation is now outstripping advances in hard drive capacity, network bandwidth and processing power.